Which tools can I use to annotate my bi-/multilingual corpus? – Consortium HN CORpus, Langues et Interactions

There are many tools available on the market to annotate corpora. If corpora are to be parallel or comparable, their annotation schemes, or at least some annotation layers, need to be identical in order to allow comparisons between data items. For instance, one might want to extract all nouns in a bilingual corpus, which entails using the same POS tag in both languages. Therefore, to support multilinguality, it is important to provide common annotation schemes for data of different languages.

For automatic annotation, several tools apply identical annotation schemes to multilingual data. For grammatical annotation, the Universal Dependencies (UD) project (de Marneffe et al., 2021) aims to develop a consistent framework covering parts of speech, morphological features, and syntactic dependencies across languages. The scheme can be applied automatically with tools such as UDPipe and spaCy: spaCy is a Python library, and UDPipe offers bindings for Python and R, among other languages. Whether automated tools exist depends on the level of analysis required.
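To illustrate why a shared tagset matters, here is a minimal sketch that extracts all nouns from two small samples in CoNLL-U, the tab-separated format used by Universal Dependencies. The two sample sentences and the function name `extract_by_upos` are invented for this illustration; in practice the input would come from a tagger such as UDPipe. Because both languages use the same universal POS tag (`NOUN`), a single query works for both.

```python
# Hypothetical one-sentence samples in CoNLL-U format (10 tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
EN = """\
# text = The cat sleeps.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""

FR = """\
# text = Le chat dort.
1\tLe\tle\tDET\t_\t_\t2\tdet\t_\t_
2\tchat\tchat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tdort\tdormir\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def extract_by_upos(conllu_text, upos="NOUN"):
    """Return the word forms whose universal POS tag equals `upos`."""
    forms = []
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, form, tag = cols[0], cols[1], cols[3]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        if tag == upos:
            forms.append(form)
    return forms


# The same tag retrieves nouns in both languages.
nouns = {"en": extract_by_upos(EN), "fr": extract_by_upos(FR)}
print(nouns)  # {'en': ['cat'], 'fr': ['chat']}
```

If each language used its own tagset, the query would need a separate tag, and a mapping between tags, for every language in the corpus.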

For manual annotation, a number of schemes have been designed for multilingual corpora. Their user guides describe which languages and which levels of analysis are covered. For instance, discourse analysis may require the annotation of discourse relations used to express causality or contrast. Provided there is agreement on the scheme and its theoretical underpinnings, the codes may be applied to texts in different languages. The ANNODIS project (Péry-Woodley et al., 2011) offers useful insights in this respect. It describes a number of rhetorical relations between entities; since these relations also exist in other languages, the encoding system is transferable. Similarly, the sense tagset devised for the Penn Discourse Treebank project (Prasad et al., 2008) can be applied to languages other than English. Many projects have developed their own encoding system depending on the level of analysis they target. For a non-exhaustive list, please refer to the Annotation Guides section of the CORLI website.

Depending on the nature of the corpus and the annotation objectives, researchers may need to choose a tool according to the file format it produces. The following table lists examples of tools for comparable and parallel bi-/multilingual corpora that allow several layers of annotation, together with the file types they output. Many of these tools are interoperable in terms of their output formats (see Section 4), so output files can be converted automatically from one format to another.

Tools	File types
ELAN	.eaf
EXMARaLDA	.exb
Praat	.TextGrid
CLAN	.cha
TXM	.txm
UDPipe	.conllu

Table 1: Annotation tools and their file types
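When a project gathers files from several of these tools, the table above can be encoded as a simple lookup. The following is a minimal sketch; the function name `guess_tool` and the example file names are hypothetical, and a file extension is only a hint about which tool produced a file, not a guarantee.

```python
from pathlib import Path

# File extensions from Table 1, lowercased for case-insensitive matching.
TOOL_BY_EXTENSION = {
    ".eaf": "ELAN",
    ".exb": "EXMARaLDA",
    ".textgrid": "Praat",
    ".cha": "CLAN",
    ".txm": "TXM",
    ".conll": "UDPipe",
    ".conllu": "UDPipe",  # CoNLL-U files are also commonly saved as .conllu
}


def guess_tool(filename):
    """Return the tool associated with the file's extension, or None if unknown."""
    return TOOL_BY_EXTENSION.get(Path(filename).suffix.lower())


for name in ["interview_fr.eaf", "recording.TextGrid", "notes.txt"]:
    print(name, "->", guess_tool(name))
```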

References:

de Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 47(2), 255‑308. https://doi.org/10.1162/coli_a_00402
Péry-Woodley, M.-P., Afantenos, S., Ho-Dac, L.-M., & Asher, N. (2011). La ressource ANNODIS, un corpus enrichi d’annotations discursives. Revue TAL, 52(3), 71‑101.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., & Webber, B. (2008, May). The Penn Discourse TreeBank 2.0. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). LREC 2008, Marrakech, Morocco. http://www.lrec-conf.org/proceedings/lrec2008/pdf/754_paper.pdf

Manuals for transcription and coding of bilingual/multilingual data:

Barnett, R., Codó, E., Eppler, E., Forcadell, M., Gardner-Chloros, P., van Hout, R., Moyer, M., Torras, M. C., Turell, M. T., Sebba, M., Starren, M., & Wensing, S. (2000). The LIDES Coding Manual: A document for preparing and analyzing language interaction data Version 1.1—July, 1999. International Journal of Bilingualism, 4(2), 131–132. https://doi.org/10.1177/13670069000040020101
Soroli, E. & Tsikulina, A. (2020). Bilingual Discourse Analysis Manual (BILDA2-v2): a manual for transcription, coding and analysis of bilingual and second language learning data. [University report] University of Lille; CORLI Huma-Num consortium. ⟨hal-02567511⟩

For a practical guide about coding second language data validly and reliably see here.

Tools for the analysis of language data offered by CLARIN-ERIC: https://switchboard.clarin.eu/tools