CORPUCIT: Citation of extracts from language corpora

Language corpora have become an essential part of research in many disciplines, such as linguistics, literature, history, psychology and anthropology. In these areas, researchers use corpora to support their arguments, present extracts from corpora as examples or as material for scientific discussion, and base their descriptions and models on corpora. The link between a scientific publication and the corpora it uses is therefore crucial: the publication and its corpus often form an indivisible unit of scientific research.
However, the language data that make up these corpora and serve as the basis for analysis are still rarely shared. When they are shared, they are rarely linked to scientific publications or treated as an integral part of them.

The CORPUCIT project aims to make it possible to edit scientific texts containing citations or extracts that point directly (through a hyperlink) to language corpora or to extracts from them. The project will also make it possible to edit corpora so as to structure them into examples or citations, giving them a clear scientific status and fully integrating them into both the scientific process and the open science field.

Goals:

1. Starting from corpora, enable corpus editing that generates persistent identifiers (PIDs) for parts of these corpora and builds examples or citations extracted from them. The PIDs will be based on existing standards for open data dissemination.

The tools will be open source web services, so that they can be integrated into other websites and made available to other services. For corpora in a known format (TEI), it will be possible to create PIDs for subparts of documents. For other formats, PIDs will point to complete documents.
Beyond the creation of a PID, the tool will allow researchers to edit extracts or citations and to combine all the available metadata and additional information according to their needs. For known corpus formats, the corresponding part of the corpus can be displayed.
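
As an illustration only, the distinction between a PID for a whole document and a PID for a subpart of a TEI document could be sketched as follows. This is not the project's implementation: the document PID (an example.org URL), the use of a URL fragment to address a subpart, and the function names are all assumptions made for this sketch. It relies only on the fact that TEI elements can carry an xml:id attribute.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:id attribute as ElementTree exposes it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Tiny made-up TEI sample; a real corpus document would be far richer.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <u xml:id="u1" who="#spk1">first utterance</u>
      <u xml:id="u2" who="#spk2">second utterance</u>
    </body>
  </text>
</TEI>"""


def find_by_xml_id(root, xml_id):
    """Return the first element whose xml:id equals xml_id, or None."""
    for elem in root.iter():
        if elem.get(XML_ID) == xml_id:
            return elem
    return None


def extract_with_pid(tei_source, document_pid, xml_id=None):
    """Return (pid, text) for a whole document or for one subpart.

    Hypothetical scheme: the subpart PID is the document PID followed
    by '#' and the xml:id of the element. Without an xml_id, the PID
    points to the complete document and no extract text is returned.
    """
    if xml_id is None:
        return document_pid, None
    root = ET.fromstring(tei_source)
    elem = find_by_xml_id(root, xml_id)
    if elem is None:
        raise KeyError(f"no element with xml:id={xml_id!r}")
    text = "".join(elem.itertext()).strip()
    return f"{document_pid}#{xml_id}", text


if __name__ == "__main__":
    # The document PID below is a placeholder, not a real identifier.
    pid, text = extract_with_pid(SAMPLE_TEI, "https://example.org/pid/doc-0001", "u2")
    print(pid, "->", text)
```

For a format the tool cannot parse, the same function would simply be called without an xml_id, which mirrors the fallback to a PID for the complete document.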

2. Use PIDs as citations in scientific writing, or in any document presented on the Internet, to point to corpora and extracts. The citation mechanism will follow the standard format of scientific citations and can therefore be used with citation management tools such as Zotero. This will give corpora a much clearer status as scientific deliverables and will allow researchers to gain recognition for the design, collection and sharing of corpora as a scientific activity in its own right.
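
To make the second goal more concrete, here is a minimal sketch of how a corpus extract identified by a PID might be serialized as a bibliographic record. The choice of the RIS format (which citation managers such as Zotero can import), the selection of fields, and every value shown are assumptions for illustration; the project description does not specify which citation format will be used.

```python
def to_ris(title, authors, year, pid_url):
    """Serialize a corpus citation as a minimal RIS record.

    RIS lines have the form 'TAG  - value'. The record type DATA
    marks the entry as a dataset. All field choices here are
    illustrative, not the project's actual export format.
    """
    lines = ["TY  - DATA", f"TI  - {title}"]
    lines += [f"AU  - {author}" for author in authors]
    lines += [f"PY  - {year}", f"UR  - {pid_url}", "ER  - "]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values: the title, author and PID are invented.
    print(to_ris(
        title="Example corpus, utterance u2",
        authors=["Doe, Jane"],
        year=2020,
        pid_url="https://example.org/pid/doc-0001#u2",
    ))
```

Because the PID is carried in the URL field, a reader following the reference from a bibliography would land on the cited extract rather than on the publication alone.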

Project launch presentation (PDF version; in French).