ELAN module¶
speach supports reading and manipulating multi-tier transcriptions from ELAN directly.
Note
For better security, speach will use the package defusedxml automatically, if available, to parse XML streams (instead of Python’s default parser). When defusedxml is available, the flag speach.elan.SAFE_MODE will be set to True.
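The optional-dependency behaviour described in this note can be sketched with the standard library alone. This is a minimal illustration of the pattern, not speach's actual source: the helper name parse_xml_string is hypothetical, and the local SAFE_MODE variable only mimics the flag named above.

```python
import xml.etree.ElementTree as StdET

try:
    # defusedxml is a third-party package; it may or may not be installed
    import defusedxml.ElementTree as SafeET
    SAFE_MODE = True
except ImportError:
    SafeET = None
    SAFE_MODE = False


def parse_xml_string(text):
    """Parse XML text with defusedxml when available, else the stdlib parser."""
    parser = SafeET if SAFE_MODE else StdET
    return parser.fromstring(text)


root = parse_xml_string('<ANNOTATION_DOCUMENT AUTHOR="demo"/>')
print(SAFE_MODE, root.tag, root.get("AUTHOR"))
```

Checking the flag after import is a cheap way to confirm which parser is in use before handling untrusted files.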
For common code samples for processing ELAN files, see the ELAN Recipes page.
ELAN module functions¶
ELAN module - manipulating ELAN transcript files (*.eaf, *.pfsx)
- speach.elan.read_eaf(eaf_path, encoding='utf-8', *args, **kwargs)¶
Read an EAF file and return an elan.Doc object
>>> from speach import elan
>>> eaf = elan.read_eaf("myfile.eaf")
- Parameters
eaf_path (str or Path-like object) – Path to existing EAF file
encoding (str) – Encoding of the eaf stream, defaulted to UTF-8
- Return type
- speach.elan.parse_eaf_stream(eaf_stream, *args, **kwargs)¶
Parse an EAF input stream and return an elan.Doc object
>>> from speach import elan
>>> with open('test/data/test.eaf', encoding='utf-8') as eaf_stream:
...     eaf = elan.parse_eaf_stream(eaf_stream)
- Parameters
eaf_stream – EAF text input stream
- Return type
- speach.elan.parse_string(eaf_string, *args, **kwargs)¶
Parse EAF content in a string and return an elan.Doc object
>>> from speach import elan
>>> with open('test/data/test.eaf', encoding='utf-8') as eaf_stream:
...     eaf_content = eaf_stream.read()
>>> eaf = elan.parse_string(eaf_content)
- Parameters
eaf_string (str) – EAF content stored in a string
- Return type
ELAN Document model¶
- class speach.elan.Doc(**kwargs)[source]¶
This class represents an ELAN file (*.eaf)
- classmethod create(media_file='audio.wav', media_url=None, relative_media_url=None, author='', *args, **kwargs)[source]¶
Create a new blank ELAN doc
>>> from speach import elan
>>> eaf = elan.create()
- Parameters
encoding (str) – Encoding of the eaf stream, defaulted to UTF-8
- Return type
- cut(section, outfile, media_file=None, use_concat=False, *args, **kwargs)[source]¶
Cut the source media with timestamps defined in section object
For example, the following code cuts every annotation in the tier “Tier 1” into its own audio file:
>>> for idx, ann in enumerate(eaf["Tier 1"], start=1):
...     eaf.cut(ann, f"tier1_ann{idx}.wav")
- Parameters
section – Any object with from_ts and to_ts attributes which return TimeSlot objects
outfile – Path to the output media file; it must not exist or a FileExistsError will be raised
media_file – Used to specify the source media file. This overrides the value specified in the source EAF file
- Raises
FileExistsError, ValueError
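The checks implied by the Raises entry can be sketched as follows. This is an illustration under stated assumptions, not speach's implementation: build_cut_command is a hypothetical helper, the use of the ffmpeg command line (with its -i, -ss and -to options) is an assumption about how a cut could be performed, and the conditions that trigger ValueError in speach are not documented here.

```python
import os
import tempfile


def build_cut_command(media_file, from_sec, to_sec, outfile):
    """Build (but do not run) an ffmpeg command that cuts one time span.

    Hypothetical helper: mirrors the documented FileExistsError behaviour and
    assumes ValueError is appropriate for an empty or inverted time span.
    """
    if os.path.exists(outfile):
        raise FileExistsError(f"Output file already exists: {outfile}")
    if to_sec <= from_sec:
        raise ValueError("to_sec must be greater than from_sec")
    return ["ffmpeg", "-i", media_file,
            "-ss", f"{from_sec:.3f}", "-to", f"{to_sec:.3f}", outfile]


# Use a fresh temporary directory so the output path is known not to exist
out_path = os.path.join(tempfile.mkdtemp(), "tier1_ann1.wav")
command = build_cut_command("audio.wav", 1.0, 2.0, out_path)
print(command)
```

Refusing to overwrite an existing output file keeps a batch cut over many annotations from silently clobbering earlier results.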
- get_participant_map()[source]¶
Map participants to tiers. Returns a map from participant name to a list of corresponding tiers.
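The shape of such a map can be illustrated directly on EAF XML with the standard library. This is only a sketch: participant_map is a hypothetical function, the sample document is trimmed to the TIER attributes needed, and it collects tier IDs where the real method returns tiers.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A trimmed-down EAF document: only the TIER attributes needed here
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="Person1 (Utterance)" PARTICIPANT="Person1" LINGUISTIC_TYPE_REF="default-lt"/>
  <TIER TIER_ID="Person1 (Translate)" PARTICIPANT="Person1" LINGUISTIC_TYPE_REF="Translate"/>
  <TIER TIER_ID="Person2 (Utterance)" PARTICIPANT="Person2" LINGUISTIC_TYPE_REF="default-lt"/>
</ANNOTATION_DOCUMENT>"""


def participant_map(eaf_xml):
    """Group tier IDs by the PARTICIPANT attribute of each TIER element."""
    mapping = defaultdict(list)
    for tier in ET.fromstring(eaf_xml).iter("TIER"):
        # Tiers without a PARTICIPANT attribute are grouped under ""
        mapping[tier.get("PARTICIPANT", "")].append(tier.get("TIER_ID"))
    return dict(mapping)


pmap = participant_map(EAF_SAMPLE)
print(pmap)
```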
- new_timeslot(value)[source]¶
Create a new timeslot object
- Parameters
value (int or str) – Timeslot value (in milliseconds)
- classmethod parse_string(eaf_string, *args, **kwargs)[source]¶
Parse EAF content in a string and return an elan.Doc object
>>> with open('test/data/test.eaf').read() as eaf_stream: >>> eaf_content = eaf_stream.read() >>> eaf = elan.parse_string(eaf_content)
- Parameters
eaf_string (str) – EAF content stored in a string
- Return type
- save(path, encoding='utf-8', xml_declaration=None, default_namespace=None, short_empty_elements=True, *args, **kwargs)[source]¶
Write ELAN Doc to an EAF file
- tiers() → Tuple[speach.elan.Tier][source]¶
Collect all existing Tier in this ELAN file
- to_csv_rows() → List[List[str]][source]¶
Convert this ELAN Doc into a CSV-friendly structure (i.e. list of list of strings)
- Returns
A list of list of strings
- Return type
List[List[str]]
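A list of lists of strings can be passed straight to Python's csv module. The sketch below uses hypothetical rows, since the actual column layout produced by to_csv_rows() is not described here; only the writing step is shown.

```python
import csv
import io

# Hypothetical rows; the real column layout of to_csv_rows() is not documented here
rows = [
    ["Person1 (Utterance)", "1.0", "2.0", "Xin chào"],
    ["Person1 (Translate)", "", "", "Hello, world"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)  # fields containing commas are quoted automatically
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open the target file with newline="" and pass it to csv.writer in place of the in-memory buffer.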
- to_xml_bin(encoding='utf-8', default_namespace=None, short_empty_elements=True, *args, **kwargs)[source]¶
Generate EAF content (bytes) in XML format
- Returns
EAF content
- Return type
bytes
- property constraints: Tuple[speach.elan.Constraint]¶
A tuple of all existing constraints in this ELAN file
- property external_refs: Tuple[speach.elan.ExternalRef]¶
Get all external references
- property languages: Tuple[speach.elan.Language]¶
Get all languages
- property licenses: Tuple[speach.elan.License]¶
Get all licenses
- property linguistic_types: Tuple[speach.elan.LinguisticType]¶
A tuple of all existing linguistic types in this ELAN file
- property roots: Tuple[speach.elan.Tier]¶
All root-level tiers in this ELAN doc
- property vocabs: Tuple[speach.elan.ControlledVocab]¶
A tuple of all existing controlled vocabulary objects in this ELAN file
ELAN Tier model¶
- class speach.elan.Tier(doc=None, xml_node=None, **kwargs)[source]¶
Represents an ELAN annotation tier
- filter(from_ts=None, to_ts=None)[source]¶
Filter utterances by from_ts, to_ts, or both. If this tier is not a time-based tier, everything will be returned.
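One plausible reading of this filter is sketched below. The exact boundary semantics are an assumption (an annotation is kept when it starts at or after from_ts and ends at or before to_ts); Span and filter_spans are hypothetical names, and times are plain integers in milliseconds rather than TimeSlot objects.

```python
from typing import List, NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical stand-in for a time-aligned annotation (times in ms)."""
    from_ts: int
    to_ts: int
    value: str


def filter_spans(spans: List[Span], from_ts: Optional[int] = None,
                 to_ts: Optional[int] = None) -> List[Span]:
    """Keep spans that start at or after from_ts and end at or before to_ts."""
    kept = []
    for span in spans:
        if from_ts is not None and span.from_ts < from_ts:
            continue
        if to_ts is not None and span.to_ts > to_ts:
            continue
        kept.append(span)
    return kept


spans = [Span(0, 900, "a"), Span(1000, 2000, "b"), Span(2500, 3000, "c")]
print([s.value for s in filter_spans(spans, from_ts=1000)])
print([s.value for s in filter_spans(spans, to_ts=2000)])
print([s.value for s in filter_spans(spans, from_ts=1000, to_ts=2000)])
```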
- new_annotation(value, from_ts=None, to_ts=None, ann_ref_id=None, values=None, timeslots=None, check_cv=True)[source]¶
Create new annotation(s) in this tier. ELAN provides 5 different tier stereotypes.
To create a new standard annotation (in a tier with no constraints), a text value and a pair of from-to timestamps must be provided.
>>> from speach import elan
>>> eaf = elan.create()  # create a new ELAN transcript
>>> # create a new utterance tier
>>> tier = eaf.new_tier('Person1 (Utterance)')
>>> # create a new annotation between 00:00:01.000 and 00:00:02.000
>>> a1 = tier.new_annotation('Xin chào', 1000, 2000)
Included-In tiers:
>>> eaf.new_linguistic_type('Phoneme', 'Included_In')
>>> tp = eaf.new_tier('Person1 (Phoneme)', 'Phoneme', 'Person1 (Utterance)')
>>> # string-based timestamps can also be used with the helper function elan.ts2msec()
>>> tp.new_annotation('ch', elan.ts2msec("00:00:01.500"), elan.ts2msec("00:00:01.600"), ann_ref_id=a1.ID)
Symbolic-Association tiers:
>>> eaf.new_linguistic_type('Translate', 'Symbolic_Association')
>>> tt = eaf.new_tier('Person1 (Translate)', 'Translate', 'Person1 (Utterance)')
>>> tt.new_annotation('Hello', ann_ref_id=a1.ID)
Symbolic-Subdivision tiers:
>>> eaf.new_linguistic_type('Tokens', 'Symbolic_Subdivision')
>>> tto = eaf.new_tier('Person1 (Tokens)', 'Tokens', 'Person1 (Utterance)')
>>> # extra annotations can be provided with the argument values
>>> tto.new_annotation('Xin', values=['chào'], ann_ref_id=a1.ID)
>>> # alternative method (set value to None and provide everything with values)
>>> tto.new_annotation(None, values=['Xin', 'chào'], ann_ref_id=a1.ID)
- property linguistic_type: speach.elan.LinguisticType¶
Linguistic type object of this Tier (alias of type_ref)
- property name¶
An alias to tier’s ID
- property parent_ref¶
ID of the parent tier. Return None if this is a root tier
- property time_alignable¶
Check if this tier contains time alignable annotations
- property type_ref: speach.elan.LinguisticType¶
Tier type object
- property type_ref_id¶
ID of the tier type ref
ELAN Annotation model¶
There are two different annotation types in ELAN: TimeAnnotation and RefAnnotation.
TimeAnnotation objects are time-alignable annotations and contain timestamp pairs from_ts, to_ts to refer back to specific chunks in the source media.
On the other hand, RefAnnotation objects are annotations that link to something else, such as another annotation or an annotation sequence in the case of symbolic subdivision tiers.
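The distinction is visible in the EAF XML itself, where time-alignable annotations are stored as ALIGNABLE_ANNOTATION elements pointing at time slots and reference annotations as REF_ANNOTATION elements pointing at another annotation. The sketch below reads a trimmed sample with the standard library only; it does not use speach, and the tuples it builds are an ad hoc representation chosen for illustration.

```python
import xml.etree.ElementTree as ET

EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="1000"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="2000"/>
  </TIME_ORDER>
  <TIER TIER_ID="Person1 (Utterance)" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Xin chào</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="Person1 (Translate)" LINGUISTIC_TYPE_REF="Translate" PARENT_REF="Person1 (Utterance)">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>Hello</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(EAF_SAMPLE)

# Map each time slot ID to its value in milliseconds
slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE")) for ts in root.iter("TIME_SLOT")}

# Time-alignable annotations: (ID, start ms, end ms, text)
time_anns = [
    (a.get("ANNOTATION_ID"), slots[a.get("TIME_SLOT_REF1")],
     slots[a.get("TIME_SLOT_REF2")], a.findtext("ANNOTATION_VALUE"))
    for a in root.iter("ALIGNABLE_ANNOTATION")
]

# Reference annotations: (ID, referenced annotation ID, text)
ref_anns = [
    (a.get("ANNOTATION_ID"), a.get("ANNOTATION_REF"), a.findtext("ANNOTATION_VALUE"))
    for a in root.iter("REF_ANNOTATION")
]

print(time_anns)
print(ref_anns)
```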
- class speach.elan.TimeAnnotation(ID, from_ts, to_ts, value, xml_node=None, **kwargs)[source]¶
An ELAN time-alignable annotation
- overlap(other)[source]¶
Calculate overlap score between two time annotations. Score = 0 means adjacent, score > 0 means overlapped, score < 0 means no overlap (the distance between the two).
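A score with exactly these sign properties is the earlier end time minus the later start time. The sketch below assumes that formula and uses integer milliseconds; overlap_score is a hypothetical function, and the units and formula speach actually uses may differ.

```python
def overlap_score(a_from, a_to, b_from, b_to):
    """Length of the shared span; negative when there is a gap between the two."""
    return min(a_to, b_to) - max(a_from, b_from)


overlapping = overlap_score(1000, 2000, 1500, 2500)  # spans share 1500..2000
adjacent = overlap_score(1000, 2000, 2000, 3000)     # spans touch at 2000
separate = overlap_score(1000, 2000, 2500, 3000)     # gap between 2000 and 2500
print(overlapping, adjacent, separate)
```

The function is symmetric, so swapping the two annotations gives the same score.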
- property duration: float¶
Duration of this annotation (in seconds)
- property from_ts: speach.elan.TimeSlot¶
Start timestamp of this annotation
- property to_ts: speach.elan.TimeSlot¶
End timestamp of this annotation
- class speach.elan.RefAnnotation(ID, ref_id, previous, value, xml_node=None, **kwargs)[source]¶
An ELAN ref annotation (not time alignable)
- property ref_id¶
ID of the referenced annotation
- class speach.elan.Annotation(ID, value, cve_ref=None, xml_node=None, **kwargs)[source]¶
An ELAN abstract annotation (for both alignable and non-alignable annotations)
- property text¶
An alias to ELANAnnotation.value
- property value: str¶
Annotated text value.
It is possible to change value of an annotation
>>> ann.value
'Old value'
>>> ann.value = "New value"
>>> ann.value
'New value'
- class speach.elan.TimeSlot(xml_node=None, ID=None, value=None, *args, **kwargs)[source]¶
- property sec¶
Get TimeSlot value in seconds
- property ts: str¶
Return timestamp of this annotation in vtt format (00:01:02.345)
- Returns
An empty string will be returned if TimeSlot value is None
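The relationship between a millisecond value and the vtt-style timestamp can be sketched as below. Both function names are hypothetical stand-ins written for illustration; they are not speach's own helpers, and only the empty-string behaviour for None is taken from the description above.

```python
def msec_to_ts(msec):
    """Format milliseconds as HH:MM:SS.mmm; return '' when the value is None."""
    if msec is None:
        return ""
    seconds, millis = divmod(int(msec), 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def ts_to_msec(ts):
    """Parse an HH:MM:SS.mmm timestamp back into milliseconds."""
    hours, minutes, seconds = ts.split(":")
    secs, _, millis = seconds.partition(".")
    # Pad or trim the fractional part to exactly three digits
    millis = int(millis.ljust(3, "0")[:3])
    return ((int(hours) * 60 + int(minutes)) * 60 + int(secs)) * 1000 + millis


print(msec_to_ts(62345))
print(ts_to_msec("00:00:01.500"))
```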
- property value¶
TimeSlot value (in milliseconds)