There are several reasons to deposit your corpus. First, building a corpus is very costly, so it is important to share the result so that other researchers can benefit from it and, potentially, produce new analyses. Second, the data that make up a corpus sometimes have heritage value (for example, in the documentation of rare languages), which makes them precious and is reason enough to archive them. Finally, depositing data addresses the need to verify and evaluate research: any experimental work must be reproducible, and the availability of corpora (along with their documentation and, where possible, the tools used to analyze them) is a sine qua non for reproducibility.
When a corpus is deposited, it should be formatted according to international standards (TEI and other suitable XML formats, etc.) and described by metadata that are likewise standardized. The deposited corpus should comply with the FAIR principles: Findable, Accessible, Interoperable, Reusable. This is why CORLI runs an initiative to fund the finalization of corpora so that they comply with the FAIR principles and can be deposited and promoted.
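As an illustration of what standardized metadata can look like, here is a minimal sketch that builds a TEI header with Python's standard library. The element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc) come from the TEI guidelines; the function name `build_tei_header` and all values passed to it are hypothetical placeholders. This is not the deposit format required by any particular repository, which may ask for more fields or a different metadata schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei_header(title: str, publisher: str, licence: str, source: str) -> ET.Element:
    """Build a minimal TEI header (hypothetical helper, for illustration only)."""
    # Serialize TEI elements without a prefix (default namespace).
    ET.register_namespace("", TEI_NS)

    def tag(name: str) -> str:
        return f"{{{TEI_NS}}}{name}"

    header = ET.Element(tag("teiHeader"))
    file_desc = ET.SubElement(header, tag("fileDesc"))

    # titleStmt: what the corpus is called.
    title_stmt = ET.SubElement(file_desc, tag("titleStmt"))
    ET.SubElement(title_stmt, tag("title")).text = title

    # publicationStmt: who distributes it and under which terms.
    pub_stmt = ET.SubElement(file_desc, tag("publicationStmt"))
    ET.SubElement(pub_stmt, tag("publisher")).text = publisher
    availability = ET.SubElement(pub_stmt, tag("availability"))
    ET.SubElement(availability, tag("licence")).text = licence

    # sourceDesc: where the data come from.
    source_desc = ET.SubElement(file_desc, tag("sourceDesc"))
    ET.SubElement(source_desc, tag("p")).text = source

    return header


# Placeholder values, assumed for the example.
header = build_tei_header(
    title="Example corpus",
    publisher="Example laboratory",
    licence="CC BY 4.0",
    source="Recordings collected during fieldwork.",
)
xml_text = ET.tostring(header, encoding="unicode")
print(xml_text)
```

Stating the title, distributor, licence and source explicitly in a standard header is one way to support the Findable and Reusable parts of FAIR, since a repository or harvester can read these fields without knowing anything else about the corpus.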
A corpus can be deposited on specialized platforms; in France, for instance, COCOON and ORTOLANG.
- The ‘Training’ page dedicated to the deposit and dissemination of corpora
- Corpora finalization
- Repository, storage, evaluation and sharing of corpora
- Corpus deposit, storage, dissemination