What is a bilingual or a multilingual corpus?

Bilingual and multilingual corpora are very common in language studies. They are relevant to researchers working in, among other domains, historical linguistics, language acquisition, variation, and dialectal and typological studies.

Typically, we distinguish between two types of bilingual/multilingual corpora: comparable corpora and parallel corpora. Often modest in size compared with general-domain corpora, comparable and parallel corpora are specialized and built to meet specific needs or answer specific research questions.

Bilingual/Multilingual comparable corpora: In this kind of corpus, the target languages are brought together on the basis of “comparability”. Such corpora consist of texts or oral or multimodal productions by speakers in the languages under investigation. These materials share similar criteria of composition, genre, and topic but are not direct translations of each other.

Bilingual/Multilingual parallel corpora: In this kind of corpus, the target languages are brought together on the basis of “parallelism”. Such corpora consist of texts or oral or multimodal productions in language A and their translations into languages B, C, D, etc., and/or combinations of these. The relationship between texts in the target languages is direct and directional, i.e., it goes from one text (the source text) to the other(s) (the translated text(s)), and it requires at least some minimal alignment.
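
To make the idea of alignment concrete, the following is a minimal sketch of how a small sentence-aligned parallel corpus could be represented in Python. The names AlignedSegment and align_by_position are hypothetical and chosen only for illustration. The sketch assumes the simplest possible case, in which each source segment corresponds to exactly one segment in every translation; real corpora often need more flexible alignment because translators split or merge sentences.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedSegment:
    """One source segment and its translations, keyed by language code."""
    source_lang: str
    source_text: str
    translations: dict = field(default_factory=dict)


def align_by_position(source_lang, source_segments, translated):
    """Pair segments one-to-one by their position in each text.

    `translated` maps a language code to a list of segments.
    Assumption: every translation has exactly as many segments as the
    source. A ValueError is raised when that does not hold.
    """
    for lang, segments in translated.items():
        if len(segments) != len(source_segments):
            raise ValueError(f"{lang}: segment count differs from source")
    return [
        AlignedSegment(
            source_lang,
            text,
            {lang: segments[i] for lang, segments in translated.items()},
        )
        for i, text in enumerate(source_segments)
    ]


# Illustrative data only: an English source with French and German translations.
corpus = align_by_position(
    "en",
    ["Good morning.", "How are you?"],
    {
        "fr": ["Bonjour.", "Comment allez-vous ?"],
        "de": ["Guten Morgen.", "Wie geht es Ihnen?"],
    },
)

for segment in corpus:
    print(segment.source_text, segment.translations)
```

Keeping the source language explicit in each record reflects the directional relationship described above: every translation is attached to the source segment it was produced from, not the other way around.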

For more information about bilingual/multilingual corpora, see Barrière (2016).

Further information about the main issues related to bilingual/multilingual corpora, along with examples and demonstrations of data collection, annotation, exploration, analysis, and storage of such corpora according to the FAIR data principles, can be found here and here.

References

Barrière, C. (2016). Bilingual Corpora. In: Natural Language Understanding in a Semantic Web Context. Springer, Cham. https://doi.org/10.1007/978-3-319-41337-2_7