What file format should I use for my bi-/multilingual data?

Multilingual data belonging to the same corpus should be stored in a format whose schema keeps the languages distinct. How that distinction is made depends on the internal organization of the data, i.e. its format. Formats are in turn closely tied to the tools used to represent the data and to the files those tools produce.

The Text Encoding Initiative (TEI) develops a framework for the digital representation of oral and written data in XML format. It provides corpus transcribers with a set of guidelines for encoding characters from different languages, identifying the language of the data, and describing the data regardless of the language, with the overall aim of making the data machine-readable.

To achieve this, the TEI Consortium has defined several levels of data processing. At the character level, the Unicode standard has been adopted, which allows the encoding of (almost) all characters used in human writing systems. At the transcription level, the structure of a TEI document makes it possible to identify the language of the document as well as passages in other languages when they occur in the text. A TEI document consists of two parts:

  • the “teiHeader” part, in which the language of the document is identified by means of an attribute. Other metadata, such as the title and publication information, are also found in this part.
  • the “text” part, which contains all the information describing the text. This includes excerpts in languages other than the one initially declared for the document. It also includes a characterization of the data, which can be split up with the division element “div” to describe what the text strings “do”; a division can, for example, mark a chapter. TEI makes it possible to refine the description of textual elements by providing “structural” components such as paragraphs, sentences, lines of verse or turns of speech in a dialogue. All this information makes it possible to characterize the texts of a corpus with the same scheme whatever the language, so that identical queries can process all the texts of a multilingual corpus in TEI format.
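
The two-part structure described above can be illustrated with a small sketch. It is not taken from the TEI Guidelines: the sample document, its sentences and the helper names (declared_languages, marked_segments) are invented for illustration, and only the Python standard library is assumed. The sketch assumes that languages are declared in the header with “language” elements carrying an “ident” attribute, and that passages in another language carry an “xml:lang” attribute.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
NS = {"tei": TEI_NS}
LANG_ATTR = f"{{{XML_NS}}}lang"  # how ElementTree exposes xml:lang

# Invented sample: an English text containing French passages.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample bilingual text</title></titleStmt>
      <publicationStmt><p>Unpublished illustration</p></publicationStmt>
      <sourceDesc><p>Invented for this example</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language ident="en">English</language>
        <language ident="fr">French</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text xml:lang="en">
    <body>
      <div type="chapter" n="1">
        <p>She greeted us with a cheerful <foreign xml:lang="fr">bonjour</foreign>.</p>
        <p xml:lang="fr">Nous avons répondu en français.</p>
      </div>
    </body>
  </text>
</TEI>"""


def declared_languages(root):
    """Return the language codes declared in the header (hypothetical helper)."""
    return [el.get("ident") for el in root.iterfind(".//tei:langUsage/tei:language", NS)]


def marked_segments(root):
    """Return (element name, language, content) for passages inside the text
    part that carry their own xml:lang attribute (hypothetical helper)."""
    text = root.find("tei:text", NS)
    segments = []
    for el in text.iter():
        if el is not text and LANG_ATTR in el.attrib:
            local_name = el.tag.split("}", 1)[1]  # strip the namespace prefix
            segments.append((local_name, el.get(LANG_ATTR), "".join(el.itertext()).strip()))
    return segments


root = ET.fromstring(SAMPLE)
print("Declared in header:", declared_languages(root))
print("Main language of text:", root.find("tei:text", NS).get(LANG_ATTR))
for name, lang, content in marked_segments(root):
    print(f"<{name}> [{lang}] {content}")
```

Because the same element and attribute names are used whatever the languages involved, the same two functions could in principle be run unchanged over every document of a multilingual corpus encoded this way.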

Many transcription tools offer user-friendly interfaces (see Section 5). Most of them store data in XML and allow the signal, the transcription and translation layers, and the annotation layers to be aligned. The combination of XML, the TEI structure, Unicode encoding and, where present, multilingual segments with identical boundaries makes the files interoperable. File-to-file conversion utilities such as TEICORPO (Parisse et al., 2020) or Pepper (Zipser & Romary, 2010) ensure TEI compatibility. These common formats make it possible to process multilingual data.

The CoNLL-U format is another way of structuring textual data together with its Universal Dependencies (UD) annotations (de Marneffe et al., 2021). It is a plain-text file with one word per line: each line contains the word followed by a number of annotations. Some lines can be reserved for comments. The format is designed to be machine-readable and can be applied to comparable corpora in different languages.
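
As a rough illustration of this layout, the sketch below reads a small CoNLL-U sample with plain Python. The two sentences and their annotations are invented for the example, and parse_conllu is a hypothetical helper rather than part of any UD tool. The sketch assumes the usual CoNLL-U conventions: ten tab-separated fields per word line, comment lines starting with “#”, and a blank line between sentences.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Invented sample: one English and one French sentence.
SAMPLE = (
    "# sent_id = en-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = fr-1\n"
    "# text = Les chiens aboient.\n"
    "1\tLes\tle\tDET\t_\tNumber=Plur\t2\tdet\t_\t_\n"
    "2\tchiens\tchien\tNOUN\t_\tNumber=Plur\t3\tnsubj\t_\t_\n"
    "3\taboient\taboyer\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each with its comments and tokens."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():            # blank line: end of the current sentence
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):      # comment line
            comments.append(line[1:].strip())
        else:                           # word line: map the columns to field names
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:                          # last sentence without a trailing blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


sentences = parse_conllu(SAMPLE)
for sentence in sentences:
    print(sentence["comments"][0])
    for token in sentence["tokens"]:
        print(f'  {token["form"]}\t{token["upos"]}\t{token["deprel"]}')
```

Since the column layout and the UD tag inventory are shared across languages, the same loop handles the English and the French sentence without any language-specific code.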

 

References: 

  • Burnard, L. (2014). What Is the Text Encoding Initiative? How to Add Intelligent Markup to Digital Resources. Marseille: OpenEdition Press. https://books.openedition.org/oep/426
  • TEI Consortium (Eds.). TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium. Retrieved October 11, 2022, from http://www.tei-c.org/Guidelines/P5/
  • Zipser, F. & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/
  • de Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 47(2), 255‑308. https://doi.org/10.1162/coli_a_00402
  • Parisse, C., Etienne, C., & Liégeois, L. (2020). TEICORPO: A conversion tool for spoken language transcription with a pivot file in TEI. Journal of the Text Encoding Initiative. https://halshs.archives-ouvertes.fr/halshs-03043572. URL for web interface: https://ct3.ortolang.fr/teiconvert/index-fr.html