medkit.audio.transcription.transcribed_text_document

Classes:

TranscribedTextDocument(text, ...[, anns, ...])

Subclass for TextDocument instances generated by audio transcription.

class TranscribedTextDocument(text, text_spans_to_audio_spans, audio_doc_id, anns=None, attrs=None, metadata=None, uid=None)

Subclass for TextDocument instances generated by audio transcription.

Variables

Methods:

from_dict(doc_dict)

Create a TranscribedTextDocument from a dict

from_dir(path[, pattern, encoding])

Create documents from text files in a directory

from_file(path[, encoding])

Create a document from a text file

get_containing_audio_spans(text_ann_spans)

Return the audio spans used to transcribe the text referenced by a text annotation.

get_snippet(segment, max_extend_length)

Return a portion of the original text containing the annotation

get_subclass_for_data_dict(data_dict)

Return the subclass that corresponds to the class name found in a data dict

get_containing_audio_spans(text_ann_spans)

Return the audio spans used to transcribe the text referenced by a text annotation.

For instance, if the audio ranging from 1.0 to 20.0 seconds is transcribed to some text ranging from character 10 to 56 in the transcribed document, and then a text annotation is created referencing the span 15 to 25, then the containing audio span will be the one ranging from 1.0 to 20.0 seconds.

Note that some text annotations may be contained in more than one audio span.

Parameters

text_ann_spans (List[AnySpan]) – Text spans of a text annotation referencing some characters in the transcribed document.

Return type

List[Span]

Returns

List[AudioSpan] – Audio spans used to transcribe the text referenced by text_ann_spans.
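
The lookup described above can be sketched without medkit. This is a minimal illustration, not the library's implementation. It assumes the text-to-audio mapping is a plain dict keyed by (start, end) character offsets, and that an annotation is "contained" in an audio span when its character range overlaps the text transcribed from that audio. The function name containing_audio_spans and the sample mapping are hypothetical.

```python
from typing import Dict, List, Tuple

TextRange = Tuple[int, int]       # (start_char, end_char), end exclusive
AudioRange = Tuple[float, float]  # (start_sec, end_sec)


def containing_audio_spans(
    text_to_audio: Dict[TextRange, AudioRange],
    ann_ranges: List[TextRange],
) -> List[AudioRange]:
    """Return audio ranges whose transcribed text overlaps any annotation range."""
    found: List[AudioRange] = []
    # Visit text ranges in document order so the result is ordered too.
    for (text_start, text_end), audio in sorted(text_to_audio.items()):
        overlaps = any(
            ann_start < text_end and text_start < ann_end
            for ann_start, ann_end in ann_ranges
        )
        if overlaps and audio not in found:
            found.append(audio)
    return found


# Hypothetical mapping: characters 10-56 come from audio 1.0-20.0 s,
# characters 57-90 from audio 20.0-35.0 s.
mapping = {(10, 56): (1.0, 20.0), (57, 90): (20.0, 35.0)}

single = containing_audio_spans(mapping, [(15, 25)])
straddling = containing_audio_spans(mapping, [(50, 60)])
```

Here single holds one audio range, while straddling holds both, because that annotation crosses the boundary between two transcribed segments.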

classmethod from_dir(path, pattern='*.txt', encoding='utf-8')

Create documents from text files in a directory

Parameters
  • path (PathLike) – Path of the directory containing text files

  • pattern (str) – Glob pattern to match text files in path

  • encoding (Optional[str]) – Text encoding to use

Return type

List[Self]

Returns

List[TextDocument] – Text documents with contents of each file as text
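
How the path, pattern and encoding parameters interact can be sketched with the standard library alone. This is an assumption-laden illustration of glob-based loading, not medkit's code: read_texts is a hypothetical helper, it returns plain strings rather than documents, and sorting the matched files is a choice made here for reproducibility.

```python
import tempfile
from pathlib import Path
from typing import Dict, Union


def read_texts(
    path: Union[str, Path],
    pattern: str = "*.txt",
    encoding: str = "utf-8",
) -> Dict[str, str]:
    """Map each file name in `path` matching `pattern` to its decoded contents."""
    directory = Path(path)
    return {
        file.name: file.read_text(encoding=encoding)
        for file in sorted(directory.glob(pattern))
    }


# Build a throwaway directory with two text files and one non-matching file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first transcript", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second transcript", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    texts = read_texts(tmp)
```

Only the files matching the glob pattern are read, so notes.md is skipped.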

classmethod from_file(path, encoding='utf-8')

Create a document from a text file

Parameters
  • path (PathLike) – Path of the text file

  • encoding (Optional[str]) – Text encoding to use

Return type

Self

Returns

TextDocument – Text document with contents of path as text. The file path is included in the document metadata.

get_snippet(segment, max_extend_length)

Return a portion of the original text containing the annotation

Parameters
  • segment (Segment) – The annotation

  • max_extend_length (int) – Maximum number of characters to use around the annotation

Return type

str

Returns

str – A portion of the text around the annotation
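
The effect of max_extend_length can be illustrated with plain string slicing. This sketch works on character offsets instead of a Segment, and it assumes the limit applies to each side of the annotation separately. The documentation above does not specify that detail, so treat it as an assumption. The helper snippet_around is hypothetical.

```python
def snippet_around(text: str, start: int, end: int, max_extend_length: int) -> str:
    """Return text[start:end] widened by up to max_extend_length characters per side."""
    # Clamp both bounds so the slice never runs past the text.
    left = max(0, start - max_extend_length)
    right = min(len(text), end + max_extend_length)
    return text[left:right]


text = "The patient reports mild chest pain since Monday."
start = text.index("chest pain")
end = start + len("chest pain")
snippet = snippet_around(text, start, end, max_extend_length=5)
```

With these values, snippet contains the annotated words plus five characters of context on either side.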

classmethod get_subclass_for_data_dict(data_dict)

Return the subclass that corresponds to the class name found in a data dict

Parameters

data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)

Return type

Optional[Type[Self]]

Returns

subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
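
One way such a lookup could work is sketched below. It is not medkit's implementation: BaseDoc, TranscribedDoc and the "class_name" key are hypothetical stand-ins, and the sketch also returns None for an unknown class name, which is a simplification made here.

```python
from typing import Any, Dict, Optional, Type


class BaseDoc:
    """Hypothetical base class whose to_dict() records the concrete class name."""

    def to_dict(self) -> Dict[str, Any]:
        return {"class_name": type(self).__name__}

    @classmethod
    def get_subclass_for_data_dict(
        cls, data_dict: Dict[str, Any]
    ) -> Optional[Type["BaseDoc"]]:
        name = data_dict["class_name"]
        if name == cls.__name__:
            return None  # the dict was produced by the base class itself
        # Walk all direct and indirect subclasses looking for a name match.
        pending = list(cls.__subclasses__())
        while pending:
            sub = pending.pop()
            if sub.__name__ == name:
                return sub
            pending.extend(sub.__subclasses__())
        return None  # unknown class name (simplification for this sketch)


class TranscribedDoc(BaseDoc):
    """Hypothetical subclass standing in for a transcribed document."""


from_sub = BaseDoc.get_subclass_for_data_dict(TranscribedDoc().to_dict())
from_base = BaseDoc.get_subclass_for_data_dict(BaseDoc().to_dict())
```

A caller can then dispatch deserialization to the returned subclass, falling back to the base class when the result is None.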

classmethod from_dict(doc_dict)

Create a TranscribedTextDocument from a dict

Parameters

doc_dict (Dict[str, Any]) – A dictionary from a serialized TranscribedTextDocument as generated by to_dict()

Return type

Self