Alignment is the temporal correspondence between two resources, for example between audio and its transcription, where the alignment can be made at the level of the speaker turn, intonational group, word, or phoneme, or between video and its annotation (gestures or sign language). Alignment can also refer to the temporal correspondence between two sources, as when two videos shot from different angles are used simultaneously.
Some semi-automatic tools allow fine-grained alignment, for example at the phonetic level starting from an orthographic transcription: see EasyAlign, SailAlign.
Semi-automatic tools allowing for the automatic segmentation of gestural events in videos are beginning to appear.
The process generally involves several successive steps and manual fine-tuning. These tools are implemented as software extensions of existing annotation suites or as stand-alone software with export functions in the usual formats.
More generally, data alignment is a matter of specifying a relation between the units of different kinds of data. Alignments can be references to a time signal (phonemes are aligned with an audio signal) or to other data. For example, syllables are aligned with phonemes, syntactic units with tokens, etc. Alignments can be strict (the boundaries of the units must be the same) or flexible (the boundaries should be close). Alignment can also be partial (only a portion of the units are aligned).
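The distinction between strict, flexible, and partial alignment can be illustrated with a minimal Python sketch. It is not taken from any of the tools mentioned above: the `Unit` class, the `is_aligned` and `align` functions, the `tolerance` parameter, and the sample words and gestures are all hypothetical, and boundaries are assumed to be stored as integer milliseconds.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Unit:
    """A hypothetical annotation unit with boundaries in milliseconds."""
    label: str
    start: int
    end: int


def is_aligned(a: Unit, b: Unit, tolerance: int = 0) -> bool:
    """Strict alignment when tolerance is 0; flexible alignment otherwise."""
    return abs(a.start - b.start) <= tolerance and abs(a.end - b.end) <= tolerance


def align(source: List[Unit], target: List[Unit],
          tolerance: int = 0) -> List[Tuple[Unit, Optional[Unit]]]:
    """Pair each source unit with the first target unit whose boundaries match.

    Source units without a match are paired with None, so the result
    may be a partial alignment.
    """
    pairs = []
    for s in source:
        match = next((t for t in target if is_aligned(s, t, tolerance)), None)
        pairs.append((s, match))
    return pairs


# Illustrative data only: two words and two gestures.
words = [Unit("hello", 0, 500), Unit("world", 500, 1000)]
gestures = [Unit("wave", 0, 500), Unit("point", 540, 1030)]

strict = align(words, gestures)                  # boundaries must be identical
flexible = align(words, gestures, tolerance=50)  # boundaries within 50 ms

for word, gesture in strict:
    print("strict  ", word.label, "->", gesture.label if gesture else None)
for word, gesture in flexible:
    print("flexible", word.label, "->", gesture.label if gesture else None)
```

With this sample data, the strict pass leaves "world" unaligned, which makes the result a partial alignment, while the flexible pass with a 50 ms tolerance pairs it with "point".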
Content validated by Groupe de Travail 4 (multimodality and visual-gestural modality).