Which formats for oral or multimodal data?

Not all formats are appropriate for storing corpus data. It is essential that data are stored in a structured, standardized format so that they can be processed automatically. An overview of formats can be found here. The TEI-CORPO tool allows you to convert files in Elan, Clan, Transcriber and Praat formats …

Why and how to assess the quality of a corpus?

Sharing corpora and respecting best practices can make the development of corpora very costly. Corpora must be considered as scientific productions in their own right. Being able to assess their quality is therefore essential. Corpus evaluation is an important question within CORLI: the network-group “Corpus evaluation” is dedicated to this topic. Two workshops …

Why should I deposit my corpus and how?

There are several reasons to deposit your corpus. On the one hand, building up a corpus is a very costly process; it is therefore important to share this effort so other researchers can benefit from it; indeed, it could lead to new analyses. On the other hand, the data which constitute a corpus sometimes have …

What are metadata and what are they for?

Metadata are the information that one decides to keep in addition to the linguistic data themselves, in order to document the data and to make the corpus easier for other researchers to use. Such information can vary widely: data sources, software (and its exact version) used for data collection or processing, information about …

How do I collect data to build a corpus?

The data used in corpus linguistics can be of different kinds: written or oral data, but also videos, movement and eye-tracking captures, etc. Data acquisition for a corpus must be carefully prepared beforehand, and the method used must be well defined and documented to ensure traceability. In particular, the question of required …