Team Leader: Carole Etienne
PART 1 – FACILITATING THE REUSE OF CORPORA BY OTHER RESEARCHERS
- Why would a researcher reuse a corpus?
- Having a larger quantity of data
- Exploring the same data from a different perspective: syntactic, prosodic, phonological, or interactional analyses of the same data
- Taking advantage of different sets of annotations which are not available in one’s own domain
- Contrasting the same object of study with other data sets:
- written, oral, spontaneous writing
- in other languages for diachronic studies
- for written corpora: the nature of texts, authors, …
- for oral corpora: children/adults, professional/private, number of speakers, native/non-native speakers, face to face or by telephone or by video-conference,…
Research projects are evolving in how they use extant corpora: they draw on more sources of data, more often, and can also link oral corpora with written corpora (spontaneous writing).
At the beginning of a project, at least one “Work Package” has typically been dedicated to making already described and annotated data available to the community. This step, which recurs in every project, could be simplified to free up time and resources for the analyses themselves.
At the end of a project, new annotations have been produced and are thus available, often in different formats and made with different automatic or semi-automatic tools. Such additions should be described in the original corpus, which serves as the base for all the available annotations, so that they are easier to reuse.
The role of metadata
- Precisely identifying the data in order to use them in the study
- Making a corpus study that draws on several sources homogeneous without redoing work that is already done in each source
- Making information available for each analysis
- Adding new annotations produced manually or (semi-)automatically (NLP): documenting the annotations (which software or tool, which version, which format?)
-> Not forgetting that metadata evolves with the data
-> Metadata should be formatted in an international standard in order to be reused by the larger community
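As an illustration of the point about documenting annotations, here is a minimal Python sketch of a record that keeps, for each annotation layer added to a base corpus, how it was produced, with which tool and version, and in which format. All field names, the corpus identifier, and the tool names are hypothetical; this is not the TEI encoding itself, only a way to see which information needs to travel with the data.

```python
from dataclasses import dataclass, asdict
import json

# Assumed vocabulary for how a layer was produced (illustrative only).
ALLOWED_METHODS = {"manual", "semi-automatic", "automatic"}


@dataclass
class AnnotationLayer:
    """One annotation layer added on top of a base corpus (hypothetical fields)."""
    name: str          # what is annotated, e.g. "part-of-speech"
    method: str        # one of ALLOWED_METHODS
    tool: str          # software or tool that produced the layer
    tool_version: str  # version of that tool
    file_format: str   # format in which the layer is distributed

    def __post_init__(self):
        # Reject values outside the shared vocabulary so that records stay comparable.
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"unknown annotation method: {self.method!r}")


def describe_corpus(corpus_id, layers):
    """Attach the documented layers to the corpus they were built on."""
    return {"corpus": corpus_id, "annotations": [asdict(layer) for layer in layers]}


record = describe_corpus(
    "example-corpus",
    [
        AnnotationLayer("part-of-speech", "automatic", "example-tagger", "1.0", "TEI XML"),
        AnnotationLayer("prosody", "manual", "example-editor", "2.3", "TEI XML"),
    ],
)
print(json.dumps(record, indent=2))
```

Keeping the method in a small closed vocabulary is one possible way to keep such records comparable across projects.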
Constraints
- The heterogeneity of the community’s practices
- Accommodating both the reuse of extant data and new projects, without repeating theoretical discussions
- Permitting occasional use without having to consult overly-long documentation
- Not hesitating too long between several possible choices
- Having examples close to one’s own project
- Not having to master XML, OLAC, or the TEI
Solutions
- A set of metadata common across resources, to make individual or shared use of the data easier
- A set of common levels which differ in terms of granularity, for example regarding age: adult > age group > precise age
- The choice of the TEI as a standard:
- grouping together metadata and data in a single file
- definition of a customized ODD that allows:
- delimiting a set of elements and properties
- defining and exemplifying their structure
- mostly used in written corpora
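The idea of common levels of differing granularity (adult > age group > precise age) can be sketched in Python as follows. The age-group boundaries, labels, and function names are assumptions made for the example, not the vocabulary actually defined by the project. The sketch shows how two sources that record age at different levels can still be compared at the finest level they share.

```python
# Levels of description, from coarsest to finest.
LEVELS = ["category", "age_group", "precise_age"]

# Hypothetical age groups: (lower bound, upper bound, group label, coarse category).
AGE_GROUPS = [
    (0, 17, "0-17", "child"),
    (18, 39, "18-39", "adult"),
    (40, 59, "40-59", "adult"),
    (60, 150, "60+", "adult"),
]


def from_precise_age(age):
    """Describe a speaker whose exact age is known, at all three levels."""
    for low, high, group, category in AGE_GROUPS:
        if low <= age <= high:
            return {"category": category, "age_group": group, "precise_age": age}
    raise ValueError(f"age out of range: {age}")


def from_age_group(group):
    """Describe a speaker for whom only the age group is known."""
    for _, _, label, category in AGE_GROUPS:
        if label == group:
            return {"category": category, "age_group": group, "precise_age": None}
    raise ValueError(f"unknown age group: {group!r}")


def finest_common_level(records):
    """Return the most precise level that every record can provide."""
    for level in reversed(LEVELS):
        if all(record[level] is not None for record in records):
            return level
    raise ValueError("records share no level")


source_a = from_precise_age(34)      # a corpus that records the exact age
source_b = from_age_group("40-59")   # a corpus that records only the age group
level = finest_common_level([source_a, source_b])
print(level, [record[level] for record in (source_a, source_b)])
```

A coarser level can always be derived from a finer one, which is why a study combining both sources falls back on the age group here.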
A customizable application, Teimeta, to use metadata:
- online
- from a TEI/ODD file, currently defined for oral corpora and widely distributed
- with a defined vocabulary
- multilingual application
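To make concrete what keeping metadata and data in one TEI file allows, here is a minimal Python sketch that reads a title and the speakers' ages from the header of a small TEI document using the standard library. The sample document is invented for the example and does not follow the project's ODD, and the code is not how Teimeta itself works; it only illustrates that metadata stored in a standard structure can be read by any generic XML tool.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id

# Invented sample: header (metadata) and transcription (data) in a single file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example recording</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented sample</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <particDesc>
        <listPerson>
          <person xml:id="SPK1"><age value="34"/></person>
          <person xml:id="SPK2"><age value="7"/></person>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <text><body><u who="#SPK1">bonjour</u></body></text>
</TEI>"""


def read_metadata(tei_xml):
    """Return the title and a speaker-id -> age mapping found in the TEI header."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS
    )
    speakers = {}
    for person in root.iterfind(".//tei:listPerson/tei:person", TEI_NS):
        age = person.find("tei:age", TEI_NS)
        # The age is optional: keep None when a speaker has no <age> element.
        speakers[person.get(XML_ID)] = int(age.get("value")) if age is not None else None
    return {"title": title, "speakers": speakers}


metadata = read_metadata(SAMPLE)
print(metadata)
```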
F | Findability | Different audiences, thus different metadata (cf. the objects to be described) |
A | Accessibility | A tradition of linguistic archiving platforms, in use since 2000. Verification of scientific processes: the need to preserve a version of a corpus in order to reproduce an analysis already performed, with the aim of improving it |
I | Interoperability | Different practices across linguistic disciplines as well as across communities. a. Interoperable metadata and forms. b. Solutions for metadata: a common set of metadata across resources, to make the metadata easier to use and the data easier to make available; a customizable application to extract such metadata from a TEI/ODD file defined for known oral corpora; a common, verified vocabulary with different levels of granularity; an online multilingual application, Teimeta; metadata in an international standard, TEI. c. Annotation solutions: a pivot format for transcriptions, independent of conventions, software, or tools; an international standard, TEI, for that pivot format |
R | Reusability | User licenses for Teimeta data: a common set of metadata for research. Teiconvert: conversion tools to avoid losing data when going from one annotation program to another |