Metadata Coordination: Carole Etienne
PART 1 - FACILITATING THE REUSE OF CORPORA BY OTHER SCHOLARS
Why would a scholar reuse a corpus?
- To have more data
- To explore the same data from different perspectives: syntactic, prosodic, phonological, or interactional analyses of a single data point
- To take advantage of different annotation conventions which might not be available in an established data set
- To contrast a single object of study with other data sets:
- written, oral, spontaneous writing
- in different languages
- for diachronic studies
- for written corpora: the nature of the texts, authors, ...
- for oral corpora: children/adults, professional/private, number of speakers, native or nonnative speakers, face to face or by telephone or by teleconference, ...
Research projects based on existing corpora are clearly evolving: more and more of them rely on multiple data sources and draw on both oral and written corpora (including spontaneous writing).
At the earliest stage of a project, at least one “Work Package” is dedicated to establishing a common base for corpora that have already been described and annotated. This stage is part of every project and could be streamlined to free up time and resources for the analyses themselves.
At the end of the project, the new annotations can be integrated into the database, often in different formats and produced with different automated or semi-automated tools. These additions should be described in the original corpus, which brings together all of the available annotations, to make their future use easier.
The role of metadata
- To precisely identify the data to be used in a study
- To make a corpus composed of data from several sources homogeneous without having to redescribe it (since this is already done in each source)
- To make information available at the time of analysis
- To add new annotations, made either manually or (semi-)automatically (NLP): document the annotations (which program or tool, which version, which format?)
-> Don’t forget that metadata evolves with the data
-> Metadata should be recorded in an international standard to be reusable by a large community.
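The point about documenting added annotations (which tool, which version, which format) can be illustrated with a minimal sketch. The record structure, its field names, and the tool name "example-tagger" are assumptions made for illustration only; they are not taken from any standard or from the project described here.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical set of allowed values for how an annotation was produced.
ALLOWED_METHODS = {"manual", "semi-automatic", "automatic"}


@dataclass
class AnnotationLayer:
    """One annotation layer added to an existing corpus (illustrative schema)."""
    name: str          # what is annotated, e.g. "pos" for part-of-speech tags
    method: str        # "manual", "semi-automatic", or "automatic"
    tool: str          # program used to produce the annotation
    tool_version: str  # version of that program
    file_format: str   # format in which the annotation is stored

    def __post_init__(self):
        # Reject undocumented production methods early.
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"unknown method: {self.method!r}")


def describe_layers(layers):
    """Serialize the layer descriptions so they can travel with the corpus."""
    return json.dumps([asdict(layer) for layer in layers], indent=2, sort_keys=True)


if __name__ == "__main__":
    layer = AnnotationLayer("pos", "automatic", "example-tagger", "0.1", "TEI XML")
    print(describe_layers([layer]))
```

In practice such a description would be expressed in the chosen international standard rather than in JSON; the sketch only shows which pieces of information need to be kept together and updated as the data evolve.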
- The heterogeneity of the community’s practices
- Adapting to both the reuse of existing projects as well as new projects without rehashing theoretical discussions
- Allowing for occasional use without having to consult long documentation
- Not hesitating for too long between several possible choices
- Making available examples similar to the current project
- Not needing to master XML, OLA, or TEI
- A set of metadata common to all available resources for facilitating the common use of data
- A common set that has different levels of granularity, for example for age: adult >> age group >> precise age
- The choice of TEI as a standard:
- Collecting metadata and data in a single file
- Definition of a custom ODD, which allows for:
- delimiting a set of elements and properties
- defining and exemplifying their structure
- mostly used in written corpora
- A customizable application, Teimata, for working with metadata
- from a TEI/ODD file, currently defined for oral corpora and widely distributed
- with a defined vocabulary
- multilingual application
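The idea of a common metadata set with several levels of granularity (the age example: adult >> age group >> precise age) can be sketched as follows. The threshold of 18 for "adult", the ten-year age groups, and the level names "broad", "group", and "exact" are assumptions chosen for illustration; each project would define its own vocabulary.

```python
# Hypothetical threshold separating "child" from "adult".
ADULT_THRESHOLD = 18


def broad_category(age: int) -> str:
    """Coarsest level: only distinguishes adults from children."""
    return "adult" if age >= ADULT_THRESHOLD else "child"


def age_group(age: int, width: int = 10) -> str:
    """Intermediate level: a fixed-width age band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def describe_age(age: int, level: str) -> str:
    """Express a precise age at the requested level of granularity."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if level == "broad":
        return broad_category(age)
    if level == "group":
        return age_group(age)
    if level == "exact":
        return str(age)
    raise ValueError(f"unknown level: {level!r}")


if __name__ == "__main__":
    for level in ("broad", "group", "exact"):
        print(level, describe_age(34, level))
```

A corpus that records the precise age can always be brought down to a coarser level, so corpora described at different levels can be compared at the coarsest level they share.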
Findability: Different audiences, hence different metadata (cf. the objects to be described)
Interoperability: Different disciplines within linguistics, but also different communities with different practices
a. Interoperable metadata and data formats
c. Solutions for annotations