FAQ

Annotating a corpus

What tools are available for oral or multimodal corpus annotation?

Various tools dedicated to oral or multimodal corpus annotation are listed in the software inventory section; to get the complete list, you can filter the tools by type (Type=Analysis) and by type of data (Data=Audio/Video).

Some have been demonstrated during training sessions organized by CORLI, including:

  • ELAN, a tool for creating complex annotations on video and audio resources

What tools are available for corpus annotation?

Many tools dedicated to corpus annotation are listed on the software inventory page; to get the complete list, you can filter tools by type (Type=Annotation).

Some have been presented during training sessions offered by CORLI:

  • ELAN, a tool for creating complex annotations on video and audio resources
  • Glozz, an annotation and exploration environment for textual corpora
  • INCEpTION, a platform for multi-level collaborative annotation
What is corpus annotation?

Annotating a corpus means adding one or more layers of linguistic interpretation to raw data. These annotations can be of very diverse natures: morpho-syntactic categories, semantic or discursive annotations, but also, in the case of oral or multimodal corpora, information on prosody, gestures, etc.

Annotations are produced during annotation campaigns by human annotators, with varying levels of expertise, who rely on an annotation guide.

More resources on the CORLI website:

  • The CORLI Annotation Network-Group is dedicated to issues related to corpus annotation – You can subscribe to its mailing list here.
  • Several training sessions organized by CORLI members have been dedicated to corpus annotation – You will find the list of these trainings as well as the course materials here.
What are the main steps of an annotation campaign?

If you want to annotate a corpus, here are the main steps you should follow:

  • Check that your corpus is available in an editable, open, non-proprietary format such as .txt, .xml or .json; documents in .doc, .docx, .pdf, etc. must first be converted for annotation (see the sketch after this list)
  • Establish an annotation scheme: define objects to be annotated (units, relations, complex structures), types of linguistic units involved (characters, words, statements, paragraphs, undefined units), characteristics to be associated with the annotated objects
  • Choose an annotation tool (if possible, after testing several)
  • Write the annotation guide
  • Test the guide with several people on the same text
  • Compare annotations to stabilize the final version of the guide
  • Select and train annotators (it is a good idea to propose a first annotation that can be compared with a reference version, for example the text used to stabilize the guide)
  • Annotate
  • Check annotation quality, in particular by calculating inter-annotator agreement
  • If possible, provide an adjudicated version (reference version in which disagreements will have been resolved)
  • Describe the collected annotations
  • If possible, add new examples (including examples of uncertainties and disagreements) and annotators’ testimonies to the annotation guide
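
The first step above often means converting word-processor files to plain text. As a minimal sketch (not CORLI tooling), assuming the third-party python-docx package and hypothetical file names, the conversion could look like this:

```python
# Minimal sketch: extract the paragraph text of a .docx file and save it as
# plain UTF-8 text before annotation. Assumes: pip install python-docx.
from docx import Document

def docx_to_txt(docx_path: str, txt_path: str) -> None:
    doc = Document(docx_path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write("\n".join(paragraphs))

docx_to_txt("interview_01.docx", "interview_01.txt")  # hypothetical file names
```

PDF files generally need a dedicated extraction step and manual checking, since their text layer is often noisy.
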
How to write an annotation guide?

Any annotation project should be accompanied by the writing of an annotation guide detailing decisions made regarding corpus annotation, linguistic objects to be identified by annotators, categories that can be assigned to them, etc.

To write an annotation guide, it may be useful to consult guides produced by other projects. This is why CORLI lists annotation guides produced during various annotation campaigns, as well as scientific articles dealing with corpus annotation. They can be consulted on this page.

How to use the INCEpTION platform?

As part of the Annotation project (CORLI 2022-2025) and a student project within the LITL Masters degree in Linguistics (Toulouse, France), we have created a set of files to get started and annotate with the INCEpTION platform.

The objective of the student project was to take part in the design of a high-level annotation platform with active learning functionalities, in collaboration with the INCEpTION team (TU Darmstadt), by testing the current prototype. This work also contributed to an objective set by the CORLI consortium, the constitution of a collaborative annotation resource, and served as a use case for writing documentation in French and for evaluating and improving the platform.

The sheets provided here are in French and are a first draft that may be modified.

Any comment from you is welcome to improve them!

To do so, please use the contact form by selecting the theme “Comments on the website content” and specifying in the subject line: “Comments on the INCEpTION sheet [Number]”.

How to assess annotation quality?

To check the quality of annotations, it is essential to evaluate the inter-annotator agreement. To do this, we compare the annotations of multiple annotators to whom we have submitted the same data. The most common measure used to evaluate inter-annotator agreement is Cohen’s Kappa.
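
As a minimal sketch, assuming scikit-learn is installed and using invented labels from two hypothetical annotators, Cohen's kappa can be computed as follows:

```python
# Minimal sketch: inter-annotator agreement with Cohen's kappa (scikit-learn).
# The labels below are invented: POS tags assigned by two annotators to the
# same ten units.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN", "ADV", "NOUN", "VERB", "ADJ"]
annotator_b = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "NOUN", "ADV", "NOUN", "VERB", "VERB"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```

For more than two annotators, measures such as Fleiss' kappa or Krippendorff's alpha are commonly used instead.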

Legal aspects

What guidelines and texts regulate the creation and use of corpora?

Guidelines:

Before creating a corpus, it is recommended to establish a data management plan and to follow the FAIR principles (to produce Findable, Accessible, Interoperable, Reusable data).

Before using any corpus, it is recommended to find out about associated licenses and contact, if possible, the producers or managers of corpora to find out about any restrictions.

In both cases, you can ask for help from research support staff (university library, House of Human Sciences (MSH), research departments, etc.) or follow a training or self-training course using online resources such as Doranum or the Inist webinars.

What are the legal and ethical issues involved in collecting data and making it available in a corpus?

Sharing resources is essential in an open science approach as promoted by CORLI. When data collected to build a corpus comes from speakers, thus from individuals, personal information and intellectual property should be protected. In some cases, relevant data for linguistic analysis are directly identifying (information on the speaker, voice, image…) or even sensitive (opinions, origins, health, etc.). There is therefore a balance to be found to allow the dissemination of corpora in compliance with legislation and ethics. The objective of the QuECJ network-group is to inform and accompany the community on these issues.

On this page, you will find various documents concerning best legal practices.

Should I anonymize my corpus?

If the corpus includes personal data (i.e. data that make a person directly or indirectly identifiable), publishing the corpus (as extracts or in its entirety) requires prior anonymization (of the textual, oral, or audiovisual data). Otherwise, a usage limitation will be necessary (to be defined with the competent data protection officer).
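
As a toy illustration only (the speaker names and codes are invented), pseudonymizing a textual transcript can start from a simple name-to-code mapping; real anonymization is broader (indirect identifiers, voice, image) and should be designed with the data protection officer:

```python
# Toy sketch: replace known speaker names in a transcript with neutral codes.
# This does NOT handle indirect identifiers, audio or video.
import re

pseudonyms = {"Marie Dupont": "SPK1", "Jean Martin": "SPK2"}  # invented names

def pseudonymize(text: str) -> str:
    for name, code in pseudonyms.items():
        text = re.sub(re.escape(name), code, text)
    return text

print(pseudonymize("Marie Dupont a rencontré Jean Martin hier."))
# -> SPK1 a rencontré SPK2 hier.
```
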

Do I need consent from the speakers to collect data for my corpus? How do I get it?

Consent is mandatory, with some exceptions.

It is up to the researcher to prove that consent has been obtained, for data collection as well as for the automated processing or dissemination of the data in all the media considered. The purpose of data collection should be clearly stated to ensure the quality of the consent obtained. If the personal data collected include images of the person (video recordings), image rights apply and a specific authorization must be obtained. In addition, any disclosure of private information that is not expressly provided for is not permitted.

Informed consent is consent obtained after the person has been informed. The preferred method of obtaining consent is in writing – using a consent form.

Building a corpus

Why should I deposit my corpus and how?

There are several reasons to deposit your corpus. First, building a corpus is a very costly process; it is therefore important to share this effort so that other researchers can benefit from it and produce new analyses. Second, the data which constitute a corpus sometimes have a patrimonial value (for example for the documentation of rare languages) which makes them precious and is in itself sufficient to make their archiving desirable. Finally, depositing data addresses the need to verify and evaluate research: any experimental work must be reproducible, and the availability of corpora (as well as their documentation and, where relevant, the tools used to analyze them) is a sine qua non condition for ensuring reproducibility.

When depositing a corpus, it should be formatted according to international standards (TEI and other suitable XML formats, etc.) and described with metadata that are also standardized. The deposited corpus should respect the FAIR principles: Findable, Accessible, Interoperable, Reusable. This is why CORLI is leading an action aimed at funding the finalization of corpora so that they respect the FAIR principles and can be deposited and promoted.

A corpus can be deposited in specialized repositories; in France, for instance, COCOON and ORTOLANG.

Why and how to assess the quality of a corpus?

Developing corpora that can be shared and that respect best practices is very costly. Corpora must therefore be considered as scientific productions in their own right, and being able to assess their quality is an essential issue.

Corpus evaluation is a very important question within CORLI.

Which formats for oral or multimodal data?

Not all formats are appropriate for storing data from a corpus. Indeed, it is essential that data are stored in a structured and standardized format, so that they can be exploited automatically.

  • An overview of formats can be found here
  • The TEICORPO tool allows you to convert files in ELAN, CLAN, Transcriber and Praat formats to TEI and vice versa
What are metadata and what are they used for?

Metadata are the set of information kept in addition to the linguistic data themselves, in order to document them and to facilitate the use of the corpus by other researchers. Such information can be of very different kinds: data sources, the software (and its exact version) used for data collection or processing, information about the speakers (age, gender, mother tongue…) or about the acquisition situation for oral or multimodal data, etc.
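
As a purely illustrative sketch (the field names are hypothetical, not a metadata standard), such information can be kept as a structured record before being mapped to a recognized standard:

```python
# Illustrative sketch: hypothetical metadata for one recording of a corpus,
# to be mapped onto a recognized metadata standard before deposit.
recording_metadata = {
    "identifier": "corpus-demo-001",
    "source": "semi-directed interview",
    "collection_date": "2023-05-12",
    "collection_software": "Audacity 3.3.3",   # software and its exact version
    "speakers": [
        {"code": "SPK1", "age": 34, "gender": "F", "mother_tongue": "French"},
        {"code": "SPK2", "age": 41, "gender": "M", "mother_tongue": "Italian"},
    ],
    "recording_situation": "quiet room, two lapel microphones",
    "license": "CC BY-NC-SA 4.0",
}
```
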

A very important point: metadata should be standardized, i.e. expressed according to an international standard recognized by the scientific community. As practices are still very heterogeneous today, CORLI is leading an action of corpus enhancement aiming at finalizing the formatting of existing corpora, following the FAIR (Findable, Accessible, Interoperable, Reusable) principles.

More resources are available on the CORLI website.

How do I collect data to build a corpus?

The data used in corpus linguistics can be of different natures: written or oral data, but also video, motion-capture and eye-tracking data, etc. The acquisition of data to build a corpus must be carefully prepared beforehand, and the method used must be well defined and documented to ensure traceability. In particular, the questions of required equipment (in the case of recordings), necessary tools, and metadata to be associated with the collected data must be addressed.

More information is available on the CORLI website.

Bilingual and multilingual corpora

Which tools can I use to annotate my bi-/multilingual corpus?

There are many tools available on the market to annotate corpora. If corpora are to be parallel or comparable, their annotation schemes, or at least some annotation layers, need to be identical in order to allow comparisons between data items. For instance, one might want to extract all nouns in a bilingual corpus, which entails using the same POS tag in both languages. Therefore, to support multilinguality, it is important to provide common annotation schemes for data of different languages. 

In the case of automatic annotation, there are a number of tools that apply identical annotation schemes to multilingual data. For grammatical annotation, the Universal Dependencies (UD) project (de Marneffe et al. 2021) develops a framework covering parts of speech, morphological features, and syntactic dependencies across different languages. The scheme can be applied automatically with tools such as UDPipe and spaCy, two libraries available in Python and R. The existence of automated tools depends on the level of analysis that is required.
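
As an illustrative sketch, the snippet below applies the same UD part-of-speech and dependency labels to an English and a French sentence with spaCy; the two model names are standard spaCy models that must be downloaded beforehand (python -m spacy download <model>), and the sentences are invented:

```python
# Sketch: identical Universal Dependencies labels applied to two languages.
import spacy

models = {"en": "en_core_web_sm", "fr": "fr_core_news_sm"}
sentences = {"en": "The cat sleeps on the mat.", "fr": "Le chat dort sur le tapis."}

for lang, model_name in models.items():
    nlp = spacy.load(model_name)
    for token in nlp(sentences[lang]):
        # token.pos_ is the UD part of speech, token.dep_ the dependency relation
        print(lang, token.text, token.pos_, token.dep_, sep="\t")
```

Because both models output the same tagset, a query such as "extract all NOUN tokens" can be run identically on both languages.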

In the case of manual annotation, a number of schemes have been designed for multilingual corpora. User guides describe which languages and which levels of analysis are taken into account. For instance, discourse analysis may require the annotation of discourse relations used to express causality or contrast. Provided there is agreement on the scheme and its underlying theoretical underpinnings, the codes may be applied to texts in different languages. The ANNODIS project (Péry-Woodley et al., 2011) offers useful insights in this respect: it describes a number of rhetorical relations between entities, and these relations also exist in other languages, making the encoding system transferable. In a similar fashion, the sense tagset devised for the Penn Discourse Treebank project (Prasad et al., 2008) can be applied to languages other than English. Many projects have developed their own encoding system depending on the level of analysis they target. For a non-comprehensive list, please refer to the Annotation Guides section of the CORLI website.

Depending on the nature of the corpus and the objectives in terms of annotation, researchers may need to choose a tool in relation to a format. Below you can find some examples for comparable and parallel bi/multilingual corpus tools and the file types they output. The following table shows a number of tools that allow several layers of annotation. It is important to note that many of these tools are interoperable in terms of their output formats (see Section 4). The output files can be converted automatically.

Tools        File types
ELAN         .eaf
EXMARaLDA    .exb
Praat        .TextGrid
CLAN         .cha
TXM          .txm
UDPipe       .conllu

Table 1: Annotation tools and their output file types

 

References:

  • de Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 47(2), 255‑308. https://doi.org/10.1162/coli_a_00402
  • Péry-Woodley, M.-P., Afantenos, S., Ho-Dac, L.-M., & Asher, N. (2011). La ressource ANNODIS, un corpus enrichi d’annotations discursives. Revue TAL, 52(3), 71‑101.
  • Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., & Webber, B. (2008, May). The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. http://www.lrec-conf.org/proceedings/lrec2008/pdf/754_paper.pdf

 

Manuals for transcription and coding of bilingual/multilingual data:

  • Barnett, R., Codó, E., Eppler, E., Forcadell, M., Gardner-Chloros, P., van Hout, R., Moyer, M., Torras, M. C., Turell, M. T., Sebba, M., Starren, M., & Wensing, S. (2000). The LIDES Coding Manual: A document for preparing and analyzing language interaction data Version 1.1—July, 1999. International Journal of Bilingualism, 4(2), 131–132. https://doi.org/10.1177/13670069000040020101
  • Soroli, E. & Tsikulina, A. (2020). Bilingual Discourse Analysis Manual (BILDA2-v2): a manual for transcription, coding and analysis of bilingual and second language learning data. [University report] University of Lille; CORLI Huma-Num consortium. ⟨hal-02567511⟩

 

For a practical guide about coding second language data validly and reliably see here.

Tools for the analysis of language data offered by CLARIN-ERIC: https://switchboard.clarin.eu/tools

What is an alignment in bilingual/multilingual corpora?

Alignment in parallel corpora: an operation that makes explicit the correspondences between language segments in terms of translation equivalence. A parallel corpus consists of a text and its translation into one or more languages. In order to align parallel corpora, the text needs to be divided into segments; a segment usually corresponds to a sentence. Alignment refers to the information that tells the machine which segment (sentence) in one language is the translation of which segment (sentence) in another, as in the sketch below. Corpus management systems such as concordancers (e.g., Sketch Engine, NoSketch Engine) can extract target words or constructions from aligned parallel corpora; see for example Rychlý (2007) and Kilgarriff et al. (2014).
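
As a minimal sketch (not a real alignment tool; the sentences are invented), alignment links can be represented simply as pairs of segment indices connecting a source text to its translation:

```python
# Sketch: sentence-level alignment links between a source text and its translation.
source = ["The cat sleeps.", "It dreams of mice."]   # language A
target = ["Le chat dort.", "Il rêve de souris."]      # language B

# Each link states: source segment i is translated by target segment j.
alignment_links = [(0, 0), (1, 1)]

for i, j in alignment_links:
    print(f"{source[i]}  <->  {target[j]}")
```
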

For downloadable parallel aligned corpora (bilingual and multilingual), see here.

For access to a concordancer of parallel aligned corpora, see here.

Alignment in comparable corpora: an operation that makes explicit the correspondences between a recording (generally an audio or video recording) and its transcription, so that the phones, words, sentences or discourse segments selected as targets are timestamped against the signal (audio/video). This procedure is straightforward when researchers work with well-ordered talk containing little or no overlap.

For an example of audio/video-transcript linking/timestamping with CLAN software, see here.

For an example of alignment with PRAAT software, see here (English) and here (French).

References

  • Rychlý, P. (2007). Manatee/Bonito: A Modular Corpus Manager. In Proceedings of RASLAN 2007, 65-70.
  • Kilgarriff, A., et al. (2014). The Sketch Engine: Ten years on. Lexicography, 1(1), 7-36.
What is a bilingual or a multilingual corpus?

Bilingual and Multilingual corpora are very common in language studies and are relevant to researchers working, among other domains, in historical linguistics, language acquisition, variation, dialectal and typological studies.

Typically, we distinguish between two types of bilingual/multilingual corpora: comparable corpora and parallel corpora. Often of modest size compared with general-domain corpora, comparable and parallel corpora are specialized and constructed to respond to specific needs or to answer specific research questions.

Bilingual/Multilingual comparable corpora: In this kind of corpus, the target languages are put together on the basis of “comparability”. Such corpora consist of texts, or oral or multimodal productions of speakers, in the languages under investigation, which share similar criteria of composition, genre, and topic but are not direct translations of each other.

Bilingual/Multilingual parallel corpora: In this kind of corpus, the target languages are put together on the basis of “parallelism”. Such corpora consist of texts, or oral or multimodal productions, in language A and their translations into languages B, C, D, etc., and/or their combinations. The relationship between texts in the target languages is direct and directional, i.e. it goes from one text (the source text) to the other(s) (the translated text(s)), and requires some minimum alignment.

For more information about bilingual/multilingual corpora, see Barrière (2016): here.

Further information about the main issues related to bi-/multilingual corpora, and examples/demonstrations of data collection, annotation, exploration, analysis and storage of such corpora according to the FAIR data principles, can be found here and here.

References

Barrière, C. (2016). Bilingual Corpora. In Natural Language Understanding in a Semantic Web Context. Springer, Cham. https://doi.org/10.1007/978-3-319-41337-2_7

What file format should I use for my bi-/multilingual data?

Multilingual data from the same corpus must be represented in specific formats. Their representation depends on a schema that keeps the languages distinct, and this distinction depends on the internal organization of the data, i.e. its format. Formats are closely linked to the tools used to represent the data and to the files they produce.

The Text Encoding Initiative (TEI) is developing a framework for the digital representation of oral and written data in XML format. The goal is to provide corpus transcribers with a set of guidelines for encoding characters from different languages, identifying the language of the data, and describing the data regardless of the language, so that the data are machine-readable.

To achieve this, different levels of data processing have been defined by the consortium. At the character set level, the Unicode standard has been adopted and allows the universal encoding of (almost) all glyphs used in human languages. For the transcription of contents, the structure of a TEI document allows the identification of the language of the document and also the glyphs of other languages when they exist in the text. A TEI document consists of two parts:

  • the “teiHeader” part, in which the language identification can be found as an attribute. Other metadata, such as the title and publication information, can also be found in this part.
  • the “text” part, which includes all the information describing the text, including excerpts in languages other than the one initially declared for the document. It can be divided with the division element “div” to describe what the text strings “do”; for example, a division can indicate a chapter. TEI also allows the description of textual elements to be refined with “structural” components such as paragraphs, sentences, lines of verse or speech turns in a dialogue. All this information characterizes the texts of a corpus with an identical scheme, whatever the language, so that identical queries can process all the texts of a multilingual corpus in TEI format.
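
As a simplified, hypothetical sketch of the two parts described above (a real corpus should follow the full TEI P5 guidelines rather than this reduced skeleton), a skeletal TEI document with a language declaration and a foreign-language segment can be built with Python's standard library:

```python
# Sketch: a skeletal TEI document with a teiHeader and a text part declaring
# its main language (xml:lang="fr") and containing an English segment.
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI)

root = ET.Element(f"{{{TEI}}}TEI")

# Part 1: the teiHeader carries the metadata (here only a title).
header = ET.SubElement(root, f"{{{TEI}}}teiHeader")
file_desc = ET.SubElement(header, f"{{{TEI}}}fileDesc")
title_stmt = ET.SubElement(file_desc, f"{{{TEI}}}titleStmt")
ET.SubElement(title_stmt, f"{{{TEI}}}title").text = "Sample bilingual document"

# Part 2: the text declares its main language and may contain divisions ("div")
# and segments in other languages ("foreign").
text = ET.SubElement(root, f"{{{TEI}}}text", {f"{{{XML}}}lang": "fr"})
body = ET.SubElement(text, f"{{{TEI}}}body")
div = ET.SubElement(body, f"{{{TEI}}}div", {"type": "chapter"})
p = ET.SubElement(div, f"{{{TEI}}}p")
p.text = "Voici un passage en français suivi d'un segment "
foreign = ET.SubElement(p, f"{{{TEI}}}foreign", {f"{{{XML}}}lang": "en"})
foreign.text = "in English"
foreign.tail = "."

print(ET.tostring(root, encoding="unicode"))
```
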

Many transcription tools offer user-friendly interfaces (see Section 5). Most of these tools format data in XML and allow alignments between the signal, the transcription/translation layers and the annotation layers. XML and TEI structuring, Unicode encoding, and multilingual segments with identical boundaries make the files interoperable. File-to-file conversion utilities such as TEICORPO (Parisse et al. 2020) or Pepper (Zipser & Romary, 2010) ensure TEI compatibility. These common formats allow the processing of multilingual data.

The CoNLL-U format is another way of shaping textual data and its Universal Dependencies (UD) annotations (de Marneffe et al. 2021). It consists of a text file in which each word occupies its own line, each line containing the word form and a number of annotation fields; some lines are reserved for comments. This format is designed to be machine-readable and can be applied to comparable corpora in different languages.
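
As a hedged illustration of this layout (the sentence and its annotations are invented and simplified), a few lines of Python are enough to read the ten tab-separated CoNLL-U fields:

```python
# Sketch: reading the CoNLL-U layout: "#" comment lines, one token per line,
# ten tab-separated fields, blank line between sentences.
sample = """# sent_id = demo-1
# text = Le chat dort.
1\tLe\tle\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing\t2\tdet\t_\t_
2\tchat\tchat\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_
3\tdort\tdormir\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""

FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

for line in sample.splitlines():
    if not line or line.startswith("#"):
        continue  # skip comments and sentence boundaries
    token = dict(zip(FIELDS, line.split("\t")))
    print(token["FORM"], token["UPOS"], token["DEPREL"])
```
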

 

References: 

  • Burnard, L. (2014). What Is the Text Encoding Initiative? : How to Add Intelligent Markup to Digital Resources. Marseille: OpenEdition Press. https://books.openedition.org/oep/426
  • TEI Consortium (Eds.). TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium. Retrieved 11 October 2022, from http://www.tei-c.org/Guidelines/P5/
  • Zipser, F. & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/
  • de Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 47(2), 255‑308. https://doi.org/10.1162/coli_a_00402

  • Parisse, C., Etienne, C., & Liégeois, L. (2020). TEICORPO: A conversion tool for spoken language transcription with a pivot file in TEI. Journal of the Text Encoding Initiative. https://halshs.archives-ouvertes.fr/halshs-03043572. Web interface: https://ct3.ortolang.fr/teiconvert/index-fr.html

What are “FAIR data principles”?

FAIR data principles refer to a set of guiding principles that aim to make data Findable, Accessible, Interoperable, and Reusable. The term FAIR was proposed by Wilkinson et al. (2016) in a paper accessible here.

One of the most important challenges in data-driven science is the way researchers share knowledge. Knowledge builds on the exploitation of data that need to be collected, analyzed and stored. Sharing knowledge in a FAIR way means supporting the discovery, access, integration and analysis of task-appropriate scientific data together with their associated algorithms and workflows.

Making data findable means that they are assigned a globally unique and persistent identifier, described with rich metadata that specify that identifier, and registered or indexed in a searchable resource.

Making data accessible means that data and their metadata are retrievable by their identifier using a standardized communication protocol, and that this protocol is open, free, and universally implementable, allowing for an authentication and authorization procedure where necessary.

Making data interoperable means that a formal, accessible, shared, and broadly applicable language for knowledge representation is used, including qualified references to other (meta)data.

Making data reusable means that (meta)data have a plurality of accurate and relevant attributes, are released with a clear and accessible data usage license, are associated with their provenance, and meet domain-relevant community standards.

For more information about the FAIR principles and guidelines to make your data FAIR, see here.

References

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18

Oral and multimodal corpora

Which formats for oral or multimodal data?

Not all formats are appropriate for storing data from a corpus. Indeed, it is essential that data are stored in a structured and standardized format, so that they can be exploited automatically.

  • An overview of formats can be found here
  • The TEICORPO tool allows you to convert files in ELAN, CLAN, Transcriber and Praat formats to TEI and vice versa
What tools are available to explore or analyze my oral or multimodal corpus?

Various tools dedicated to the exploration and analysis of oral or multimodal corpora are listed in the software inventory section; to get the complete list, you can filter the tools by type (Type=Analysis) and by type of data (Data=Audio/Video).

Some have been demonstrated during training sessions organized by CORLI, including:

  • CLAN, a tool for analyzing data transcribed in CHILDES format
What tools are available for oral or multimodal corpus annotation?

Various tools dedicated to oral or multimodal corpus annotation are listed in the software inventory section; to get the complete list, you can filter the tools by type (Type=Analysis) and by type of data (Data=Audio/Video).

Some have been demonstrated during training sessions organized by CORLI, including:

  • ELAN, a tool for creating complex annotations on video and audio resources

Exploring / analyzing a corpus

What tools are available to explore or analyze my oral or multimodal corpus?

Various tools dedicated to the exploration and analysis of oral or multimodal corpora are listed in the software inventory section; to get the complete list, you can filter the tools by type (Type=Analysis) and by type of data (Data=Audio/Video).

Some have been demonstrated during training sessions organized by CORLI, including:

  • CLAN, a tool for analyzing data transcribed in CHILDES format
What tools are available to explore or analyze my corpus?

Various tools dedicated to the exploration and analysis of corpora are listed in the software inventory section; to get the complete list, you can filter the tools by type (Type=Analysis) and, where relevant, by type of data.

Some have been demonstrated during training sessions organized by CORLI, including:

  • CLAN, a tool for analyzing data transcribed in CHILDES format