What is alignment in bilingual/multilingual corpora?

Alignment in parallel corpora is an operation that makes explicit the correspondences between language segments in terms of translation equivalence. A parallel corpus consists of a text and its translation into one or more languages. To align a parallel corpus, each text is first divided into segments; a segment usually corresponds to a sentence. The alignment is the information that tells the machine which segment (sentence) in one language is the translation of which segment (sentence) in another. Corpus management systems with a concordancer (e.g., Sketch Engine, NoSketch Engine) can extract target words or constructions from parallel aligned corpora; see, for example, Rychlý (2007) and Kilgarriff et al. (2014).
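To make the idea concrete, here is a minimal Python sketch of how a sentence alignment can be stored and queried. The toy sentences, the link format (tuples of segment indices) and the `concordance` helper are assumptions made for illustration only; they do not reproduce the internal format of Sketch Engine or of any other tool.

```python
# Hypothetical toy corpus, already segmented into sentences.
english = ["Good morning.", "I am late.", "Sorry about that."]
french = ["Bonjour.", "Je suis en retard, désolé."]

# The alignment: each link pairs source segment indices with target
# segment indices. Here: one 1-to-1 link and one 2-to-1 link.
alignment = [
    ((0,), (0,)),
    ((1, 2), (1,)),
]

def concordance(word, source, target, links):
    """Return (source text, target text) for every link whose source
    side contains `word` (naive, case-insensitive substring match)."""
    hits = []
    for src_ids, tgt_ids in links:
        src_text = " ".join(source[i] for i in src_ids)
        if word.lower() in src_text.lower():
            tgt_text = " ".join(target[j] for j in tgt_ids)
            hits.append((src_text, tgt_text))
    return hits

for src, tgt in concordance("late", english, french, alignment):
    print(src, "<->", tgt)
```

Storing links as groups of indices rather than as simple pairs allows for cases where one sentence is translated by two (or two by one), which a strict sentence-by-sentence pairing cannot represent.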

For downloadable parallel aligned corpora (bilingual and multilingual), see here.

For access to a concordancer of parallel aligned corpora, see here.

Alignment in comparable corpora is an operation that makes explicit the correspondences between a recording (generally audio or video) and its text transcription, so that the units selected as targets (phones, words, sentences or discourse segments) are timestamped against the signal. This procedure is straightforward when researchers work with well-ordered talk with little or no overlapping speech.
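As an illustration, a timestamped transcript can be sketched in Python as a list of segments, each carrying a start time and an end time in seconds. The segment layout, the sample utterances and the two helper functions below are hypothetical; they are not the file formats used by CLAN or PRAAT. The second helper shows why overlapping talk complicates matters: overlapping segments share a stretch of the signal.

```python
# Hypothetical time-aligned transcript:
# (start in seconds, end in seconds, speaker, transcribed text)
segments = [
    (0.00, 1.20, "A", "hello"),
    (1.20, 2.75, "B", "hi there"),
    (2.50, 3.40, "A", "how are you"),
]

def segments_at(t, segs):
    """Return every segment being spoken at time t (in seconds)."""
    return [s for s in segs if s[0] <= t < s[1]]

def overlaps(segs):
    """Return pairs of segments whose time spans intersect."""
    ordered = sorted(segs)  # sort by start time
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # later segments start even later: no overlap with a
            pairs.append((a, b))
    return pairs

print(segments_at(0.5, segments))  # one speaker
print(segments_at(2.6, segments))  # two speakers talking at once
print(overlaps(segments))
```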

For an example of audio/video-transcript linking/timestamping with CLAN software, see here.

For an example of alignment with PRAAT software, see here (English) and here (French).

References

  • RYCHLÝ, Pavel. Manatee/Bonito - A Modular Corpus Manager. In: RASLAN. 2007. p. 65-70.
  • KILGARRIFF, Adam, et al. The Sketch Engine: Ten Years On. Lexicography, 2014, 1.1: 7-36.