What formats can be used to create an oral or multimodal corpus?
To be usable in large corpora today, oral or multimodal data must be available in formats that can be processed automatically. For this reason, word-processing formats (e.g. Microsoft Word or OpenOffice) and plain-text formats cannot be used for storage and dissemination: they carry no explicit, machine-readable structure indicating what each part of the file represents (for example the speaker, the transcription, or the time alignment).
Formats worth considering are either those produced by dedicated transcription and annotation software (CLAN, ELAN, Praat, Transcriber) or standardized formats such as the TEI. This list is only provisional: other formats may be added, and it may change as tools and techniques develop.
Preferred formats:
Tool-related formats:
- CLAN (http://dali.talkbank.org/clan/)
- ELAN (https://archive.mpi.nl/tla/elan/download)
- PRAAT (https://www.fon.hum.uva.nl/praat/)
- TRANSCRIBER (http://trans.sourceforge.net or http://perso.ens-lyon.fr/matthieu.quignard/Transcriber/)
- EXMARALDA (https://exmaralda.org/en/) – allows TEI file creation
- TRJS (http://ct3.ortolang.fr/trjs/) – editing in TEI format
Generic format for sharing and storage:
TEI: the format of the Text Encoding Initiative. Unfortunately, this format has many variants. We suggest using the TEI standard for speech. TEI files following this standard can be created with EXMARALDA or the TEICORPO conversion tool.
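To illustrate what "automatically processed" means in practice, here is a minimal sketch in Python that reads speaker, timing and text from a small TEI-style fragment. The fragment is a simplified, hand-written assumption (it has no header and is not the output of any particular tool), and the function name read_utterances is hypothetical. It relies only on the TEI elements timeline, when and u, and on Python's standard library.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Simplified, hand-written fragment (assumption): a real TEI file would
# also contain a teiHeader and usually richer annotation.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <timeline unit="s">
      <when xml:id="T0"/>
      <when xml:id="T1" interval="1.5" since="#T0"/>
      <when xml:id="T2" interval="3.2" since="#T0"/>
    </timeline>
    <body>
      <u who="#SPK1" start="#T0" end="#T1">hello there</u>
      <u who="#SPK2" start="#T1" end="#T2">hi how are you</u>
    </body>
  </text>
</TEI>"""


def read_utterances(xml_text):
    """Return a list of (speaker, start, end, text) tuples.

    Assumes every <when> other than the origin gives its offset
    relative to the origin through the 'interval' attribute.
    """
    root = ET.fromstring(xml_text)

    # Map each timeline anchor id to its offset in seconds.
    offsets = {}
    for when in root.iter(f"{{{TEI_NS}}}when"):
        anchor_id = when.get(f"{{{XML_NS}}}id")
        offsets[anchor_id] = float(when.get("interval", "0"))

    # Resolve the start/end pointers of each utterance against the timeline.
    utterances = []
    for u in root.iter(f"{{{TEI_NS}}}u"):
        speaker = u.get("who").lstrip("#")
        start = offsets[u.get("start").lstrip("#")]
        end = offsets[u.get("end").lstrip("#")]
        text = "".join(u.itertext()).strip()
        utterances.append((speaker, start, end, text))
    return utterances


for speaker, start, end, text in read_utterances(SAMPLE):
    print(f"{speaker} [{start}-{end} s]: {text}")
```

Because the speaker and the time anchors are explicit attributes rather than typographic conventions, the same few lines work on any file that follows this structure, which is not possible with a word-processing or plain-text transcript.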