Best practice in corpus building

Good and bad practices in corpus building

Problems can arise at any stage of building a corpus: technical problems, practical problems, and many more!

This section gathers good and bad practices in corpus building, along with experiences that linguists have had while building their own corpora.

The linguist’s checklist

Good practice: Try out various types of equipment.
Bad practice: Not knowing which equipment is right for the task.

Good practice: Test the equipment you have selected.
Bad practice: Going to the recording location unprepared and using a device for the first time there.

Good practice: Remember to bring fresh batteries, or to charge your device if it runs on a rechargeable battery.
Bad practice: Forgetting to bring a fresh set of batteries or to charge the recording device. (See anecdote)

Good practice: Bring your data collection materials.
Bad practice: Forgetting important documents for the task.

Before gathering data, the linguist… 

Must: Try out different types of equipment to find THE right equipment for the recording session!
Mustn’t: Remain unaware of the right equipment for the task and risk lower recording quality. Would you choose a pair of shoes that are not your size? It’s the same thing!

Must: Try out the equipment and master it like a pro, or almost! You even have the right to take notes on how it works and carry them with you everywhere! Master your equipment, not the opposite!
Mustn’t: Go unprepared to the recording location and use the device for the first time there. Playing a soccer match without training means a risk of injury! The same goes for the linguist!

Must: Remember to bring fresh batteries or charge your device before the task!
Mustn’t: Forget to bring new batteries for the recording device, or forget to charge it… Don’t forget new batteries, or have a charged device at your disposal, ready for hours and hours of recording! Forgetting can lead to unfortunate events… For example, recordings with electrical noise so loud that you can see it on a spectrogram! (See anecdote)

Must: Bring your data collection documents, copies of the information sheets (metadata collection) and the informed consent forms. That is what we call going out prepared.
Mustn’t: Forget important documents for the task or the consent forms. No consent equals no recording!

Anecdote 1: Forgetting new batteries before a recording session

Study topic: The audio recording of read texts, to produce prosodic and phonetic analyses.

To do so, the steps are:

  • Data collection
  • Automatic segmentation, based on the distinction between silence and speech and on transcriptions of what was said.
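
To illustrate the silence/speech distinction mentioned in the second step, here is a minimal sketch in Python. It is not the tool used for this corpus: it assumes the audio is already loaded as a list of float samples between -1 and 1, and it simply marks a frame as speech when its RMS energy exceeds a threshold. The function names, the 20 ms frame length and the 0.02 threshold are all hypothetical choices for the example.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def speech_segments(samples, sample_rate, frame_ms=20, threshold=0.02):
    """Return (start_s, end_s) pairs for runs of frames louder than threshold.

    frame_ms and threshold are illustrative defaults, not recommended values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_rms(samples, frame_len)
    segments = []
    seg_start = None  # index of the first frame of the current speech run
    for i, energy in enumerate(energies):
        is_speech = energy > threshold
        if is_speech and seg_start is None:
            seg_start = i
        elif not is_speech and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    if seg_start is not None:  # recording ends while still in speech
        segments.append((seg_start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments


# Synthetic check: 0.5 s of silence, 0.5 s of a 200 Hz tone, 0.5 s of silence.
RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(RATE // 2)]
signal = [0.0] * (RATE // 2) + tone + [0.0] * (RATE // 2)
segments = speech_segments(signal, RATE)
print(segments)
```

On real recordings a fixed threshold is fragile, since background noise and microphone gain vary from session to session, so the boundaries would normally be checked by hand or against the transcription.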

Aim: upload the corpus to Ortolang for archiving and sharing. The corpus is freely accessible: OpenProDat – Open Speech Database

Equipment used:

  • Mobile recorder (Zoom H4N)
  • Headset microphone (AKG C520) with better audio quality than a tie-clip or a shotgun microphone


For the first 3 recordings, everything went well. However, I got anxious: “I didn’t put in new batteries… what if they run out of juice in the middle of the recording?”
Without a second thought, I connected the recorder to the power supply and carried on with my 20+ recording sessions. At the end of the day, I had three perfectly good recordings; all the others had an “electric noise” (50 Hz) that you can see clearly on the spectrogram, and hear just as clearly. For the subsequent sessions… I used new batteries each time!

Standard spectrogram
Electrical noise spectrogram
Audio recording
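
A hum like this can also be spotted automatically before a whole day of recordings is lost. The sketch below is only an illustration, not part of the original study: it assumes the audio is available as a list of float samples, and it uses the Goertzel algorithm to estimate how much of the signal's energy sits at 50 Hz. The function names are hypothetical, and any alert level you compare the result against would have to be chosen for your own equipment.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Squared magnitude of the DFT bin nearest target_hz (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def hum_ratio(samples, sample_rate, hum_hz=50.0):
    """Approximate fraction of the signal's energy located at hum_hz.

    hum_hz defaults to 50 Hz, the frequency reported in the anecdote;
    mains electricity runs at 60 Hz in some countries.
    """
    total = sum(x * x for x in samples)
    if total == 0:
        return 0.0
    # By Parseval's theorem, a real sinusoid contributes 2 * |X[k]|^2 / n
    # to the total energy (the bin k and its mirror bin n - k).
    return 2.0 * goertzel_power(samples, sample_rate, hum_hz) / (len(samples) * total)


# Synthetic check: a 200 Hz tone, with and without an added 50 Hz hum.
RATE = 8000
clean = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(RATE)]
noisy = [c + 0.2 * math.sin(2 * math.pi * 50 * n / RATE)
         for n, c in enumerate(clean)]
clean_ratio = hum_ratio(clean, RATE)
noisy_ratio = hum_ratio(noisy, RATE)
print(clean_ratio, noisy_ratio)
```

Running such a check on a short test recording at the start of a session gives a quick warning sign, though listening to the test recording with headphones remains the simplest safeguard.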

Vincent, C. (2015). L’acquisition et le traitement de données multimodales en linguistique : Pratiques et perspectives. Colloque des doctorants et jeunes chercheurs associés du laboratoire MoDyCo (COLDOC) sur “Dimensions multimodales des pratiques discursives : une perspective actuelle pour les linguistes”. Presented in Nanterre, France.

Share your experiences!