Metadata Coordination : Carole Etienne


Why would a scholar reuse a corpus?

  1. To have more data
  2. To explore the same data from different perspectives: syntactic, prosodic, phonological, or interactional analyses of a single data point
  3. To take advantage of different annotation conventions which might not be available in an established data set
  4. To contrast a single object of study with other data sets:
    • written, oral, spontaneous writing
    • in different languages
    • for diachronic studies
    • for written corpora: the nature of the texts, authors, ...
    • for oral corpora: children/adults, professional/private, number of speakers, native or nonnative speakers, face to face or by telephone or by teleconference, ...

Research projects based on existing corpora are clearly evolving: more and more, they rely on multiple data sources and can draw on both oral and written corpora (spontaneous writing).

At the earliest stage of a project, at least one “Work Package” is dedicated to establishing a common base for corpora that have already been described and annotated. This stage is part of every project and could be streamlined to free up time and resources for the analyses themselves.

At the end of the project, the new annotations can be integrated into the database, often in different formats and produced with different automated or semi-automated tools. These additions should be described in the original corpus, which brings together all of the available annotations, to make their future use easier.


The role of metadata

  • To precisely identify data to use them in a study.
  • To make a corpus composed of data from several sources homogeneous without having to redescribe it (since this is already done in each source)
  • To make information available at the time of analysis
  • To add new annotations made either manually or (semi-)automatically (NLP), and to document them: which program or tool, which version, which format?

-> Don’t forget that metadata evolves with the data
-> Metadata should be recorded in an international standard to be reusable by a large community.
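
The point about documenting new annotations (tool, version, format) can be made concrete with a small record type. This is only a minimal sketch: the class `AnnotationLayer`, its field names, and the JSON output are assumptions made for illustration and do not come from any existing standard or tool.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass

# Assumed vocabulary for how a layer was produced (illustrative only).
ALLOWED_METHODS = {"manual", "semi-automatic", "automatic"}


@dataclass(frozen=True)
class AnnotationLayer:
    """Provenance of one annotation layer (hypothetical field names)."""
    name: str          # what was annotated, e.g. "pos-tagging"
    method: str        # one of ALLOWED_METHODS
    tool: str          # program that produced the layer
    tool_version: str  # exact version, needed to reproduce the layer
    file_format: str   # format in which the layer is stored


def describe_layers(corpus_id: str, layers: list[AnnotationLayer]) -> str:
    """Serialize the annotation layers of a corpus as JSON.

    Raises ValueError if a layer uses a method outside ALLOWED_METHODS.
    """
    for layer in layers:
        if layer.method not in ALLOWED_METHODS:
            raise ValueError(f"unknown annotation method: {layer.method!r}")
    record = {
        "corpus": corpus_id,
        "annotations": [asdict(layer) for layer in layers],
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

Because the record grows by one entry per layer, it can evolve with the data: each new annotation pass appends its own tool, version, and format instead of overwriting the earlier description.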


  • The heterogeneity of the community’s practices
  • Adapting to both the reuse of existing projects and new projects, without rehashing theoretical discussions
  • Allowing for occasional use without having to consult lengthy documentation
  • Not hesitating for too long between several possible choices
  • Making available examples similar to the current project
  • Not needing to master XML, OLA, or TEI

The solutions

  • A set of metadata common to all available resources, to facilitate the shared use of data
  • A common set that has different levels of granularity, for example for age: adult >> age group >> precise age
  • The choice of TEI as a standard:
    • collecting metadata and data in a single file
    • defining a custom ODD, which allows for:
      • delimiting a set of elements and properties
      • defining and exemplifying their structure
      • mostly used in written corpora
  • A customizable application, Teimeta, for entering metadata
    • online
    • from a TEI/ODD file, currently defined for oral corpora and widely distributed
    • with a defined vocabulary
    • multilingual application
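
The levels of granularity mentioned for age (adult >> age group >> precise age) can be sketched as follows. The age-group boundaries, the adult threshold, and all function names are assumptions chosen for illustration; an actual project would take them from its own ODD.

```python
# Hypothetical boundaries; None means "no upper bound".
ADULT_FROM = 18
AGE_GROUPS = [(0, 17, "0-17"), (18, 39, "18-39"), (40, 64, "40-64"), (65, None, "65+")]

# Levels ordered from coarsest to finest.
LEVELS = ("broad", "group", "precise")


def age_group(age: int) -> str:
    """Map a precise age to its (assumed) age-group label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high, label in AGE_GROUPS:
        if age >= low and (high is None or age <= high):
            return label
    raise ValueError(f"no age group covers {age}")


def age_at_level(age: int, level: str) -> str:
    """Express a precise age at the requested level of granularity."""
    if level == "precise":
        return str(age)
    if level == "group":
        return age_group(age)
    if level == "broad":
        return "adult" if age >= ADULT_FROM else "minor"
    raise ValueError(f"unknown level: {level!r}")


def coarsest_common_level(level_a: str, level_b: str) -> str:
    """Level at which two corpora described at different levels can be compared."""
    return LEVELS[min(LEVELS.index(level_a), LEVELS.index(level_b))]
```

When one corpus records precise ages and another only age groups, `coarsest_common_level("precise", "group")` gives `"group"`, and the finer description can be coarsened with `age_at_level` without redescribing the data.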


The FAIR principles


Findability: different audiences, hence different metadata (cf. the objects to be described)



  • A tradition of archiving platforms in linguistics since the 2000s
  • Verification of the scientific process: the need to keep a version of a corpus in order to reproduce an earlier analysis with a view to improving it


Interoperability: different subfields of linguistics, but also different communities with different practices

a. Interoperable metadata and data formats
b. Solutions for metadata

  • A set of metadata common to all resources, to make the metadata easier to get started with and the data easier to pool
  • A customized application, Teimeta, for entering this metadata from a TEI/ODD file defined for oral corpora and widely distributed
  • A common set, but with different levels of granularity
  • A controlled vocabulary
  • A multilingual online application, Teimeta
  • Metadata in an international standard, TEI
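
To illustrate what metadata in TEI with a controlled vocabulary might look like, here is a minimal sketch that builds a fragment of a TEI header with Python's standard library. It does not reproduce what Teimeta produces: the function `build_header` and the vocabulary `SETTING_VOCAB` are hypothetical, the fragment is not validated against any ODD, and a complete TEI header requires further elements (for example `publicationStmt` and `sourceDesc`).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# Hypothetical controlled vocabulary for the recording situation.
SETTING_VOCAB = {"face-to-face", "telephone", "teleconference"}


def _tei(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_header(title: str, speaker_id: str, age: int, activity: str) -> ET.Element:
    """Build a partial teiHeader describing one speaker and the setting."""
    if activity not in SETTING_VOCAB:
        raise ValueError(f"{activity!r} is not in the controlled vocabulary")

    header = ET.Element(_tei("teiHeader"))
    file_desc = ET.SubElement(header, _tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, _tei("titleStmt"))
    ET.SubElement(title_stmt, _tei("title")).text = title

    profile = ET.SubElement(header, _tei("profileDesc"))
    partic = ET.SubElement(profile, _tei("particDesc"))
    people = ET.SubElement(partic, _tei("listPerson"))
    person = ET.SubElement(people, _tei("person"), {f"{{{XML_NS}}}id": speaker_id})
    ET.SubElement(person, _tei("age"), {"value": str(age)}).text = str(age)

    setting_desc = ET.SubElement(profile, _tei("settingDesc"))
    setting = ET.SubElement(setting_desc, _tei("setting"))
    ET.SubElement(setting, _tei("activity")).text = activity
    return header


if __name__ == "__main__":
    demo = build_header("Demo recording", "SPK1", 34, "telephone")
    print(ET.tostring(demo, encoding="unicode"))
```

Rejecting values outside the vocabulary at entry time is what keeps descriptions comparable across corpora; the same check could equally be expressed as a closed value list in an ODD.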

c. Solutions for annotations

  • A pivot transcription format, independent of conventions, software, or tools
  • An international standard, TEI, for this pivot format
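
The idea of a pivot format can be sketched with two invented toy formats and a shared in-memory representation. This is not how Teiconvert is implemented, and the pivot here is a plain Python class rather than TEI; the formats "A" (tab-separated) and "B" (bracketed time codes), the class `Utterance`, and all function names are assumptions for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Pivot representation of one transcribed utterance."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


# Toy format A: "speaker<TAB>start<TAB>end<TAB>text"
def from_format_a(lines: list[str]) -> list[Utterance]:
    result = []
    for line in lines:
        speaker, start, end, text = line.split("\t", 3)
        result.append(Utterance(speaker, float(start), float(end), text))
    return result


def to_format_a(utterances: list[Utterance]) -> list[str]:
    return [f"{u.speaker}\t{u.start}\t{u.end}\t{u.text}" for u in utterances]


# Toy format B: "[start-end] speaker: text"
_FORMAT_B = re.compile(r"^\[([\d.]+)-([\d.]+)\] ([^:]+): (.*)$")


def from_format_b(lines: list[str]) -> list[Utterance]:
    result = []
    for line in lines:
        match = _FORMAT_B.match(line)
        if match is None:
            raise ValueError(f"not a format B line: {line!r}")
        start, end, speaker, text = match.groups()
        result.append(Utterance(speaker, float(start), float(end), text))
    return result


def to_format_b(utterances: list[Utterance]) -> list[str]:
    return [f"[{u.start}-{u.end}] {u.speaker}: {u.text}" for u in utterances]
```

With a pivot, each tool needs only one reader and one writer, so N tools require 2N converters instead of one per pair of tools. A round trip such as `to_format_a(from_format_b(to_format_b(from_format_a(lines))))` returning the original lines is a simple check that nothing was lost along the way.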



  • Data usage licenses
  • Teimeta: a common set of research-oriented metadata
  • Teiconvert: conversion tools for moving from one annotation program to another without loss of information