If you want to annotate a corpus, here are the main steps you should follow:
- Check that your corpus is in an editable, open, non-proprietary format such as .txt, .xml or .json. Documents in .doc, .docx, .pdf or similar formats need to be converted to such a format before annotation
- Establish an annotation scheme: define the objects to be annotated (units, relations, complex structures), the types of linguistic units involved (characters, words, statements, paragraphs, undefined units), and the characteristics to be associated with the annotated objects
- Choose an annotation tool (if possible, after testing several)
- Write the annotation guide
- Test the guide with several people on the same text
- Compare annotations to stabilize the final version of the guide
- Select and train annotators (it is a good idea to have each of them produce a first annotation that can be compared with a reference version, for example on the text used to stabilize the guide)
- Annotate
- Check annotation quality, in particular by calculating inter-annotator agreement
- If possible, provide an adjudicated version (a reference version in which disagreements have been resolved)
- Describe the collected annotations
- If possible, add new examples (including examples of uncertainties and disagreements) and annotators’ feedback to the annotation guide
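
The quality check in the list above mentions inter-annotator agreement without naming a measure. As a minimal sketch, assuming two annotators have each assigned exactly one category to the same sequence of units, Cohen's kappa is one common choice: it corrects the raw agreement rate for the agreement expected by chance. The function name `cohen_kappa`, the labels and the sample data below are hypothetical and only serve as an illustration; other measures (for example Fleiss' kappa or Krippendorff's alpha) may suit other setups better.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same units."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of units")

    n = len(labels_a)

    # Observed agreement: share of units with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' proportions of use, summed over categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of eight units with three made-up categories.
annotator_1 = ["PER", "LOC", "PER", "ORG", "LOC", "PER", "ORG", "LOC"]
annotator_2 = ["PER", "LOC", "ORG", "ORG", "LOC", "PER", "PER", "LOC"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this made-up example the annotators agree on 6 of 8 units (0.75), while chance agreement is 22/64, so kappa is about 0.619. Units on which the two labels differ are natural candidates for the adjudication step and for new examples in the guide.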