Which tools can I use to annotate my bi-/multilingual corpus?

There are many tools available on the market to annotate corpora. If corpora are to be parallel or comparable, their annotation schemes, or at least some annotation layers, need to be identical in order to allow comparisons between data items. For instance, one might want to extract all nouns in a bilingual corpus, which entails …
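To illustrate why a shared annotation layer matters, here is a minimal sketch of extracting nouns from both sides of a bilingual corpus. It assumes that both languages were tagged with the same part-of-speech label set (the `NOUN`, `DET` and `VERB` labels below follow the Universal POS convention). The `Token` type, the `extract_by_pos` helper and the toy sentences are hypothetical and serve only as illustration; they do not come from any particular annotation tool.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str  # the word as it appears in the text
    pos: str   # part-of-speech label (assumed identical scheme for all languages)


# Hypothetical, hand-annotated toy corpus keyed by language code.
corpus = {
    "en": [Token("The", "DET"), Token("cat", "NOUN"), Token("sleeps", "VERB")],
    "fr": [Token("Le", "DET"), Token("chat", "NOUN"), Token("dort", "VERB")],
}


def extract_by_pos(corpus, pos):
    """Return, for each language, the word forms carrying the given POS label."""
    return {
        lang: [token.form for token in tokens if token.pos == pos]
        for lang, tokens in corpus.items()
    }


nouns = extract_by_pos(corpus, "NOUN")
print(nouns)
```

A single query works for both languages only because the label `NOUN` means the same thing in each annotation layer. With two different tagsets, the query would first need a mapping between them.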

What file format should I use for my bi-/multilingual data?

Multilingual data from the same corpus must be represented in specific formats. Their representation depends on a schema that ensures the distinction between languages. This distinction depends on the internal organization of the data, i.e. its format. Formats are intimately linked to the tools that allow the data to be represented and the files produced. …

What is an alignment in bilingual/multilingual corpora?

Alignment in parallel corpora is an operation that makes explicit the correspondences between language segments in terms of translation equivalence. A parallel corpus consists of a text and its translation into one or more languages. In order to align parallel corpora, the text needs to be divided into segments. A segment usually corresponds to a sentence. …
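The two steps described above, segmenting the text and then pairing the segments, can be sketched as follows. This is a deliberately naive illustration, not a real alignment method: it assumes that sentences end with `.`, `!` or `?` and that the translation keeps the same number of sentences in the same order (a strict one-to-one alignment). The function names `split_sentences` and `align_one_to_one` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive segmenter: split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def align_one_to_one(source_text, target_text):
    """Pair the n-th source sentence with the n-th target sentence.

    Assumption: both texts contain the same number of sentences, in the
    same order. Real parallel texts often break this assumption.
    """
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    if len(source) != len(target):
        raise ValueError(
            f"Cannot align one-to-one: {len(source)} source vs "
            f"{len(target)} target segments"
        )
    return list(zip(source, target))


english = "The cat sleeps. The dog barks."
french = "Le chat dort. Le chien aboie."

pairs = align_one_to_one(english, french)
for src, tgt in pairs:
    print(f"{src}\t{tgt}")
```

In practice, translators merge, split or omit sentences, so positional pairing breaks down quickly. Dedicated aligners use cues such as sentence length or bilingual lexicons to handle one-to-many correspondences.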

What are “FAIR data principles”?

FAIR data principles refer to a set of guiding principles that aim to make data Findable, Accessible, Interoperable, and Reusable. The term FAIR was proposed by Wilkinson et al. (2016). One of the most important challenges in data-driven science is the way researchers share knowledge. Knowledge passes through the exploitation …

What is a bilingual or a multilingual corpus?

Bilingual and multilingual corpora are very common in language studies and are relevant to researchers working in, among other domains, historical linguistics, language acquisition, variation, and dialectal and typological studies. Typically, we distinguish between two types of bilingual/multilingual corpora: comparable corpora and parallel corpora. Often of modest size, as compared to corpora from the general domain, …