medkit.io#

APIs#

To access these APIs, import them as follows:

from medkit.io import <api_to_import>

Classes:

`BratInputConverter`([detect_cuis_in_notes, ...])	Class in charge of converting brat annotations
`BratOutputConverter`([anns_labels, attrs, ...])	Class in charge of converting a list of TextDocuments into a brat collection file.
`DoccanoClientConfig`([column_text, column_label])	A class representing the configuration in the doccano client.
`DoccanoInputConverter`(task[, client_config, ...])	Convert doccano files (.JSONL) containing annotations for a given task.
`DoccanoOutputConverter`(task[, anns_labels, ...])	Convert medkit files to doccano files (.JSONL) for a given task.
`DoccanoTask`(value)	Supported doccano tasks.
`RTTMInputConverter`([turn_label, ...])	Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.
`RTTMOutputConverter`([turn_label, speaker_label])	Build Rich Transcription Time Marked (.rttm) files containing diarization information from `Segment` objects.
`SRTInputConverter`([turn_segment_label, ...])	Convert .srt files containing transcription information into turn segments with transcription attributes.
`SRTOutputConverter`([segment_turn_label, ...])	Build .srt files containing transcription information from `Segment` objects.

class BratInputConverter(detect_cuis_in_notes=True, notes_label='brat_note', uid=None)[source]#

Class in charge of converting brat annotations

Parameters

notes_label (str) – Label to use for attributes created from annotator notes.
detect_cuis_in_notes (bool) – If True, strings looking like CUIs in annotator notes of entities will be converted to UMLS normalization attributes rather than creating an Attribute with the whole note text as value.
uid (Optional[str]) – Identifier of the converter.
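
As an illustration of what detect_cuis_in_notes implies, here is a minimal sketch. It assumes that a CUI is the letter C followed by seven digits; the regular expression and the helper name split_note are hypothetical and are not taken from medkit's implementation.

```python
import re

# Assumption: a UMLS CUI looks like "C" followed by exactly 7 digits.
CUI_PATTERN = re.compile(r"\bC\d{7}\b")


def split_note(note, detect_cuis=True):
    """Return (cuis, note_text) for one annotator note.

    With detect_cuis=True, CUI-like strings are extracted and the note text
    is dropped; otherwise the whole note is kept as text.
    """
    if detect_cuis:
        cuis = CUI_PATTERN.findall(note)
        if cuis:
            return cuis, None
    return [], note


print(split_note("C0011849 C0004096"))        # (['C0011849', 'C0004096'], None)
print(split_note("check with cardiologist"))  # ([], 'check with cardiologist')
```

In the first case the note would become normalization attributes; in the second it would stay a plain attribute labelled with notes_label.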

Methods:

`load`(dir_path[, ann_ext, text_ext])	Create a list of TextDocuments from a folder containing text files and associated brat annotations files.
`load_annotations`(ann_file)	Load a .ann file and return a list of `Annotation` objects.
`load_doc`(ann_path, text_path)	Create a TextDocument from a .ann file and its associated .txt file

load(dir_path, ann_ext='.ann', text_ext='.txt')[source]#

Create a list of TextDocuments from a folder containing text files and associated brat annotations files.

Parameters

dir_path (Union[str, Path]) – The path to the directory containing the text files and the annotation files (.ann)
ann_ext (str) – The extension of the brat annotation file (e.g. .ann)
text_ext (str) – The extension of the text file (e.g. .txt)

Return type

List[TextDocument]

Returns

List[TextDocument] – The list of TextDocuments
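
The expected folder layout can be sketched as follows. The helper list_brat_pairs is hypothetical: it only pairs files by basename using the standard library, and keeping only text files that have a matching annotation file is an assumption about the layout, not a description of medkit's behaviour. The medkit call itself is shown as a comment so the sketch runs without medkit installed.

```python
import tempfile
from pathlib import Path


def list_brat_pairs(dir_path, ann_ext=".ann", text_ext=".txt"):
    """Pair each text file with the annotation file sharing its basename."""
    pairs = []
    for text_path in sorted(Path(dir_path).glob(f"*{text_ext}")):
        ann_path = text_path.with_suffix(ann_ext)
        if ann_path.exists():
            pairs.append((text_path.name, ann_path.name))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "doc1.txt").write_text("Patient has diabetes.", encoding="utf-8")
    # One brat entity line: id, TAB, "label start end", TAB, covered text.
    (root / "doc1.ann").write_text("T1\tDisease 12 20\tdiabetes\n", encoding="utf-8")
    pairs = list_brat_pairs(root)
    print(pairs)  # [('doc1.txt', 'doc1.ann')]
    # With medkit installed, the same folder could then be loaded with:
    #   from medkit.io import BratInputConverter
    #   docs = BratInputConverter().load(dir_path=root)
```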

load_doc(ann_path, text_path)[source]#

Create a TextDocument from a .ann file and its associated .txt file

Parameters

text_path (Union[str, Path]) – The path to the text document file.
ann_path (Union[str, Path]) – The path to the brat annotation file.

Return type

TextDocument

Returns

TextDocument – The document containing the text and the annotations

load_annotations(ann_file)[source]#

Load a .ann file and return a list of Annotation objects.

Parameters: ann_file (Union[str, Path]) – Path to the .ann file.
Return type: List[TextAnnotation]

class BratOutputConverter(anns_labels=None, attrs=None, notes_label='brat_note', ignore_segments=True, convert_cuis_to_notes=True, create_config=True, top_values_by_attr=50, uid=None)[source]#

Class in charge of converting a list of TextDocuments into a brat collection file.

Hint

BRAT checks the coherence between span and text for each annotation. This converter adjusts the text and spans to get the right visualization and ensure compatibility.

Initialize the Brat output converter

Parameters

anns_labels (Optional[List[str]]) – Labels of medkit annotations to convert into Brat annotations. If None (default) all the annotations will be converted
attrs (Optional[List[str]]) – Labels of medkit attributes to add in the annotations that will be included. If None (default) all medkit attributes found in the segments or relations will be converted to Brat attributes
notes_label (str) – Label of attributes that will be converted to annotator notes.
ignore_segments (bool) – If True medkit segments will be ignored. Only entities, attributes and relations will be converted to Brat annotations. If False the medkit segments will be converted to Brat annotations as well.
convert_cuis_to_notes (bool) – If True, UMLS normalization attributes will be converted to annotator notes rather than attributes. For entities with multiple UMLS attributes, CUIs will be separated by spaces (e.g. “C0011849 C0004096”).
create_config (bool) – Whether to create a configuration file for the generated collection. This file defines the types of annotations generated, it is necessary for the correct visualization on Brat.
top_values_by_attr (int) – Number of most common values per attribute to show in the configuration. This is useful when an attribute has a large number of values: only the most common ones are written to the config. By default, the 50 most common values of each attribute are included.
uid (Optional[str]) – Identifier of the converter
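
The effect of top_values_by_attr and convert_cuis_to_notes can be pictured with a small sketch. The helper names top_values and cuis_to_note are hypothetical, and the code only mirrors the behaviour described above; it is not medkit's implementation.

```python
from collections import Counter


def top_values(values, top_values_by_attr=50):
    """Keep the most common values of an attribute, most frequent first."""
    return [value for value, _ in Counter(values).most_common(top_values_by_attr)]


def cuis_to_note(cuis):
    """Join the CUIs of one entity into a single note, separated by spaces."""
    return " ".join(cuis)


observed = ["absent", "present", "present", "uncertain", "present", "absent"]
print(top_values(observed, top_values_by_attr=2))  # ['present', 'absent']
print(cuis_to_note(["C0011849", "C0004096"]))      # C0011849 C0004096
```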

Methods:

`save`(docs, dir_path[, doc_names])	Convert and save a collection or list of TextDocuments into a Brat collection.

save(docs, dir_path, doc_names=None)[source]#

Convert and save a collection or list of TextDocuments into a Brat collection. For each collection or list of documents, a folder is created with ‘.txt’ and ‘.ann’ files; an ‘annotation.conf’ is saved if required.

Parameters

docs (List[TextDocument]) – List of medkit doc objects to convert
dir_path (Union[str, Path]) – String or path object to save the generated files
doc_names (Optional[List[str]]) – Optional list with the names for the generated files. If None, the document uid is used as the name, so that ‘uid.txt’ holds the raw text of the document and ‘uid.ann’ the Brat annotation file.
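
The naming rule for the generated files can be sketched like this. The helper brat_output_paths is hypothetical, and the length check on doc_names is an assumption added for the example.

```python
from pathlib import Path


def brat_output_paths(dir_path, doc_uids, doc_names=None):
    """Return one (text_path, ann_path) pair per document."""
    names = doc_names if doc_names is not None else doc_uids
    if len(names) != len(doc_uids):
        # Assumption: one name is expected per document.
        raise ValueError("doc_names must have one name per document")
    dir_path = Path(dir_path)
    return [(dir_path / f"{name}.txt", dir_path / f"{name}.ann") for name in names]


# Without doc_names, the document uids are used as basenames.
print(brat_output_paths("out", ["a1", "b2"]))
# With doc_names, the given names replace the uids.
print(brat_output_paths("out", ["a1"], doc_names=["report"]))
```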

class DoccanoInputConverter(task, client_config=None, attr_label='doccano_category', uid=None)[source]#

Convert doccano files (.JSONL) containing annotations for a given task.

For each line, a TextDocument will be created. The doccano files can be loaded from a directory with zip files or from a jsonl file.

The converter supports custom configuration to define the parameters used by doccano when importing the data (cf. DoccanoClientConfig).

Warning

If the option Count grapheme clusters as one character was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
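
The following sketch shows why offsets counted in grapheme clusters do not line up with Python string indices, which count code points. The sample text and the spans are made up for the illustration.

```python
# "é" written as "e" + combining acute accent: 2 code points, 1 grapheme cluster.
text = "cafe\u0301 noir"

# Python indexing counts code points, so "noir" spans [6, 10).
code_point_span = (6, 10)
# Offsets counted in grapheme clusters would start one position earlier.
grapheme_span = (5, 9)

code_point_slice = text[code_point_span[0]:code_point_span[1]]
grapheme_slice = text[grapheme_span[0]:grapheme_span[1]]

print(repr(code_point_slice))  # 'noir'
print(repr(grapheme_slice))    # ' noi'
```

Every combining character before an annotation shifts its span by one position, which is the kind of alignment problem the warning refers to.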

Parameters

task (DoccanoTask) – The doccano task for the input converter
client_config (Optional[DoccanoClientConfig]) – Optional client configuration to define default values in doccano interface. This config can change, for example, the name of the text field or labels.
attr_label (str) – The label to use for the medkit attribute that represents the doccano category. This is related to TEXT_CLASSIFICATION projects.
uid (Optional[str]) – Identifier of the converter.

Methods:

`load_from_directory_zip`(dir_path)	Create a list of TextDocuments from zip files in a directory.
`load_from_file`(input_file)	Create a list of TextDocuments from a doccano JSONL file.
`load_from_zip`(input_file)	Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

Attributes:

`description`	Contains all the input converter init parameters.

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the input converter init parameters.

Return type: OperationDescription

load_from_directory_zip(dir_path)[source]#

Create a list of TextDocuments from zip files in a directory. The zip files should contain a JSONL file coming from doccano.

Parameters: dir_path (Union[str, Path]) – The path to the directory containing zip files.
Return type: List[TextDocument]
Returns: List[TextDocument] – A list of TextDocuments

load_from_zip(input_file)[source]#

Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.

Parameters: input_file (Union[str, Path]) – The path to the zip file containing a doccano JSONL file
Return type: List[TextDocument]
Returns: List[TextDocument] – A list of TextDocuments

load_from_file(input_file)[source]#

Create a list of TextDocuments from a doccano JSONL file.

Parameters: input_file (Union[str, Path]) – The path to the JSONL file containing doccano annotations
Return type: List[TextDocument]
Returns: List[TextDocument] – A list of TextDocuments

class DoccanoClientConfig(column_text='text', column_label='label')[source]#

A class representing the configuration in the doccano client. The default values are the default values used by doccano.

Variables

column_text (str) – Name or key representing the text
column_label (str) – Name or key representing the label
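
To make the role of the two column names concrete, here is a minimal standard-library sketch that reads a JSONL export whose keys differ from the defaults. The function read_doccano_lines and the keys content and tags are hypothetical; with medkit, the equivalent names would be passed through DoccanoClientConfig.

```python
import io
import json


def read_doccano_lines(stream, column_text="text", column_label="label"):
    """Yield (text, label) pairs, one per non-empty JSONL line."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record[column_text], record[column_label]


# Hypothetical export from a project that used custom column names.
jsonl = io.StringIO(
    '{"content": "Fever and cough", "tags": ["symptoms"]}\n'
    '{"content": "No complaint", "tags": ["normal"]}\n'
)
records = list(read_doccano_lines(jsonl, column_text="content", column_label="tags"))
print(records)
```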

class DoccanoOutputConverter(task, anns_labels=None, attr_label=None, ignore_segments=True, include_metadata=True, uid=None)[source]#

Convert medkit files to doccano files (.JSONL) for a given task.

For each TextDocument, one JSON line will be created.

Parameters

task (DoccanoTask) – The doccano task for the output converter
anns_labels (Optional[List[str]]) – Labels of medkit annotations to convert into doccano annotations. If None (default) all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
attr_label (Optional[str]) – The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.
ignore_segments (bool) – If True medkit segments will be ignored. Only entities will be converted to Doccano entities. If False the medkit segments will be converted to Doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
include_metadata (Optional[bool]) – Whether to include medkit metadata in the converted documents
uid (Optional[str]) – Identifier of the converter.

Methods:

`save`(docs, output_file)	Convert and save a list of TextDocuments into a doccano file (.JSONL)

save(docs, output_file)[source]#

Convert and save a list of TextDocuments into a doccano file (.JSONL)

Parameters

docs (List[TextDocument]) – List of medkit doc objects to convert
output_file (Union[str, Path]) – Path or string of the JSONL file where to save the converted documents

class DoccanoTask(value)[source]#

Supported doccano tasks. The task defines the type of document to convert.

Variables

TEXT_CLASSIFICATION – Documents with a category
RELATION_EXTRACTION – Documents with entities and relations (including IDs)
SEQUENCE_LABELING – Documents with entities in tuples
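
The three tasks correspond to differently shaped JSONL records. The sketch below follows the layout commonly produced by doccano exports, which is an assumption here: field names such as entities, relations, start_offset and from_id may differ between doccano versions, and the sample text and labels are made up.

```python
import json

text = "Aspirin relieves headache"

# TEXT_CLASSIFICATION: one or more categories for the whole document.
text_classification = {"text": text, "label": ["treatment"]}

# SEQUENCE_LABELING: entities as (start, end, label) tuples.
sequence_labeling = {"text": text, "label": [[0, 7, "DRUG"], [17, 25, "SYMPTOM"]]}

# RELATION_EXTRACTION: entities and relations carry explicit ids.
relation_extraction = {
    "text": text,
    "entities": [
        {"id": 1, "label": "DRUG", "start_offset": 0, "end_offset": 7},
        {"id": 2, "label": "SYMPTOM", "start_offset": 17, "end_offset": 25},
    ],
    "relations": [{"id": 1, "from_id": 1, "to_id": 2, "type": "treats"}],
}

# Each record occupies one line of the .JSONL file.
jsonl_lines = [
    json.dumps(record)
    for record in (text_classification, sequence_labeling, relation_extraction)
]
print("\n".join(jsonl_lines))
```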

class RTTMInputConverter(turn_label='turn', speaker_label='speaker', converter_id=None)[source]#

Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.

For each turn in a .rttm file, a Segment will be created, with an associated Attribute holding the name of the turn speaker as value. The segments can be retrieved directly or as part of an AudioDocument instance.

If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).

Parameters

turn_label (str) – Label of segments representing turns in the .rttm file.
speaker_label (str) – Label of speaker attributes to add to each segment.
converter_id (Optional[str]) – Identifier of the converter.
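
For readers unfamiliar with the format, here is a minimal sketch of how one turn can be read from a .rttm line. It assumes the usual ten space-separated RTTM fields (type, file id, channel, onset, duration, then speaker name in the eighth position); the function parse_rttm_line and the dictionary keys are hypothetical and do not reflect medkit's Segment objects.

```python
def parse_rttm_line(line):
    """Extract file id, time span and speaker from one SPEAKER line."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"Not a SPEAKER line: {line!r}")
    start = float(fields[3])
    duration = float(fields[4])
    return {
        "file_id": fields[1],      # 2nd column
        "start": start,            # turn onset, in seconds
        "end": start + duration,   # onset + duration
        "speaker": fields[7],      # value of the speaker attribute
    }


turn = parse_rttm_line("SPEAKER meeting1 1 0.50 2.25 <NA> <NA> alice <NA> <NA>")
print(turn)
```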

Attributes:

`description`	Contains all the input converter init parameters.

Methods:

`load`(rttm_dir[, audio_dir, audio_ext])	Load all .rttm files in a directory into a list of `AudioDocument` objects.
`load_doc`(rttm_file, audio_file)	Load a single .rttm file into an `AudioDocument`.
`load_turns`(rttm_file, audio_file)	Load a .rttm file and return a list of `Segment` objects.
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the input converter init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

load(rttm_dir, audio_dir=None, audio_ext='.wav')[source]#

Load all .rttm files in a directory into a list of AudioDocument objects.

For each .rttm file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.

Parameters

rttm_dir (Union[str, Path]) – Directory containing the .rttm files.
audio_dir (Union[str, Path, None]) – Directory containing the audio files corresponding to the .rttm files, if they are not in rttm_dir.
audio_ext (str) – File extension to use for audio files.

Return type

List[AudioDocument]

Returns

List[AudioDocument] – List of generated documents.
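
The basename rule can be sketched with pathlib. The helper expected_audio_path is hypothetical and only shows where the audio file for a given .rttm file would be looked up under the rule stated above.

```python
from pathlib import Path


def expected_audio_path(rttm_file, audio_dir=None, audio_ext=".wav"):
    """Return the audio path sharing the basename of the .rttm file."""
    rttm_file = Path(rttm_file)
    # Fall back to the directory of the .rttm file when no audio_dir is given.
    directory = Path(audio_dir) if audio_dir is not None else rttm_file.parent
    return directory / (rttm_file.stem + audio_ext)


print(expected_audio_path("data/rttm/call1.rttm"))
print(expected_audio_path("data/rttm/call1.rttm", audio_dir="data/audio", audio_ext=".mp3"))
```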

load_doc(rttm_file, audio_file)[source]#

Load a single .rttm file into an AudioDocument.

Parameters

rttm_file (Union[str, Path]) – Path to the .rttm file.
audio_file (Union[str, Path]) – Path to the corresponding audio file.

Return type

AudioDocument

Returns

AudioDocument – Generated document.

load_turns(rttm_file, audio_file)[source]#

Load a .rttm file and return a list of Segment objects.

Parameters

rttm_file (Union[str, Path]) – Path to the .rttm file.
audio_file (Union[str, Path]) – Path to the corresponding audio file.

Return type

List[Segment]

Returns

List[Segment] – Turn segments as found in the .rttm file.

class RTTMOutputConverter(turn_label='turn', speaker_label='speaker')[source]#

Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.

There must be a segment for each turn, with an associated Attribute holding the name of the turn speaker as value. The segments can be passed directly or as part of AudioDocument instances.

Parameters

turn_label (str) – Label of segments representing turns in the audio documents.
speaker_label (str) – Label of speaker attributes attached to each turn segment.

Methods:

`save`(docs, rttm_dir[, doc_names])	Save `AudioDocument` instances as .rttm files in a directory.
`save_doc`(doc, rttm_file[, rttm_doc_id])	Save a single `AudioDocument` as a .rttm file.
`save_turn_segments`(turn_segments, rttm_file, ...)	Save `Segment` objects into a .rttm file.

save(docs, rttm_dir, doc_names=None)[source]#

Save AudioDocument instances as .rttm files in a directory.

Parameters

docs (List[AudioDocument]) – List of audio documents to save.
rttm_dir (Union[str, Path]) – Directory into which the generated .rttm files will be stored.
doc_names (Optional[List[str]]) – Optional list of names to use as basenames and file ids for the generated .rttm files (2nd column). If none is provided, the document ids will be used.

save_doc(doc, rttm_file, rttm_doc_id=None)[source]#

Save a single AudioDocument as a .rttm file.

Parameters

doc (AudioDocument) – Audio document to save.
rttm_file (Union[str, Path]) – Path of the generated .rttm file.
rttm_doc_id (Optional[str]) – File uid to use for the generated .rttm file (2nd column). If none is provided, the document uid will be used.

save_turn_segments(turn_segments, rttm_file, rttm_doc_id)[source]#

Save Segment objects into a .rttm file.

Parameters

turn_segments (List[Segment]) – Turn segments to save.
rttm_file (Union[str, Path]) – Path of the generated .rttm file.
rttm_doc_id (Optional[str]) – File uid to use for the generated .rttm file (2nd column).
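
To show where rttm_doc_id ends up, here is a sketch that formats one turn as a .rttm line. The helper format_rttm_line is hypothetical; the channel value 1, the three-decimal precision and the <NA> placeholders are assumptions based on the usual RTTM layout, not on medkit's output.

```python
def format_rttm_line(rttm_doc_id, start, end, speaker):
    """Build one SPEAKER line; the file id goes in the 2nd column."""
    duration = end - start
    return (
        f"SPEAKER {rttm_doc_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


line = format_rttm_line("meeting1", 0.5, 2.75, "alice")
print(line)  # SPEAKER meeting1 1 0.500 2.250 <NA> <NA> alice <NA> <NA>
```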

class SRTInputConverter(turn_segment_label='turn', transcription_attr_label='transcribed_text', converter_id=None)[source]#

Convert .srt files containing transcription information into turn segments with transcription attributes.

For each turn in a .srt file, a Segment will be created, with an associated Attribute holding the transcribed text as value. The segments can be retrieved directly or as part of an AudioDocument instance.

If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).

Parameters

turn_segment_label (str) – Label to use for segments representing turns in the .srt file.
transcription_attr_label (str) – Label to use for segment attributes containing the transcribed text.
converter_id (Optional[str]) – Identifier of the converter.
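
As a rough picture of the input, the sketch below parses .srt content into turns with a start time, an end time and the transcribed text. It assumes the common SubRip layout (index line, timestamp line, text lines, blank line); parse_srt and to_seconds are hypothetical helpers and the dictionaries stand in for medkit's Segment objects.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(timestamp):
    """Convert 'HH:MM:SS,mmm' into seconds."""
    hours, minutes, seconds, millis = map(int, TIMESTAMP.fullmatch(timestamp).groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content):
    """Return one dict per turn found in the .srt content."""
    turns = []
    for block in content.strip().split("\n\n"):
        lines = block.splitlines()
        # lines[0] is the index, lines[1] the time span, the rest is the text.
        start, end = (to_seconds(part.strip()) for part in lines[1].split("-->"))
        turns.append({"start": start, "end": end, "text": " ".join(lines[2:])})
    return turns


srt = """1
00:00:00,500 --> 00:00:02,750
Hello, how are you?

2
00:00:03,000 --> 00:00:04,200
Fine, thank you.
"""
turns = parse_srt(srt)
print(turns[0])
```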

Attributes:

`description`	Contains all the input converter init parameters.

Methods:

`load`(srt_dir[, audio_dir, audio_ext])	Load all .srt files in a directory into a list of `AudioDocument` objects.
`load_doc`(srt_file, audio_file)	Load a single .srt file into an `AudioDocument` containing turn segments with transcription attributes.
`load_segments`(srt_file, audio_file)	Load a .srt file and return a list of `Segment` objects corresponding to turns, with transcription attributes.
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the input converter init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

load(srt_dir, audio_dir=None, audio_ext='.wav')[source]#

Load all .srt files in a directory into a list of AudioDocument objects.

For each .srt file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.

Parameters

srt_dir (Union[str, Path]) – Directory containing the .srt files.
audio_dir (Union[str, Path, None]) – Directory containing the audio files corresponding to the .srt files, if they are not in srt_dir.
audio_ext (str) – File extension to use for audio files.

Return type

List[AudioDocument]

Returns

List[AudioDocument] – List of generated documents.

load_doc(srt_file, audio_file)[source]#

Load a single .srt file into an AudioDocument containing turn segments with transcription attributes.

Parameters

srt_file (Union[str, Path]) – Path to the .srt file.
audio_file (Union[str, Path]) – Path to the corresponding audio file.

Return type

AudioDocument

Returns

AudioDocument – Generated document.

load_segments(srt_file, audio_file)[source]#

Load a .srt file and return a list of Segment objects corresponding to turns, with transcription attributes.

Parameters

srt_file (Union[str, Path]) – Path to the .srt file.
audio_file (Union[str, Path]) – Path to the corresponding audio file.

Return type

List[Segment]

Returns

List[Segment] – Turn segments as found in the .srt file, with transcription attributes attached.

class SRTOutputConverter(segment_turn_label='turn', transcription_attr_label='transcribed_text')[source]#

Build .srt files containing transcription information from Segment objects.

There must be a segment for each turn, with an associated Attribute holding the transcribed text as value. The segments can be passed directly or as part of AudioDocument instances.

Parameters

segment_turn_label (str) – Label of segments representing turns in the audio documents.
transcription_attr_label (str) – Label of segment attributes containing the transcribed text.

Methods:

`save`(docs, srt_dir[, doc_names])	Save `AudioDocument` instances as .srt files in a directory.
`save_doc`(doc, srt_file)	Save a single `AudioDocument` as a .srt file.
`save_segments`(segments, srt_file)	Save `Segment` objects representing turns into a .srt file.

save(docs, srt_dir, doc_names=None)[source]#

Save AudioDocument instances as .srt files in a directory.

Parameters

docs (List[AudioDocument]) – List of audio documents to save.
srt_dir – Directory into which the generated .srt files will be stored.
doc_names (Optional[List[str]]) – Optional list of names to use as basenames for the generated .srt files.

save_doc(doc, srt_file)[source]#

Save a single AudioDocument as a .srt file.

Parameters

doc (AudioDocument) – Audio document to save.
srt_file (Union[str, Path]) – Path of the generated .srt file.

save_segments(segments, srt_file)[source]#

Save Segment objects representing turns into a .srt file.

Parameters

segments (List[Segment]) – Turn segments to save.
srt_file (Union[str, Path]) – Path of the generated .srt file.
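
The reverse direction can be sketched in the same spirit: turning (start, end, text) turns into .srt content. The helpers to_timestamp and format_srt are hypothetical and assume the common SubRip layout; they take plain tuples rather than medkit Segment objects.

```python
def to_timestamp(seconds):
    """Convert seconds into 'HH:MM:SS,mmm'."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_srt(turns):
    """Build .srt content from (start, end, text) tuples, numbered from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(turns, start=1):
        blocks.append(f"{index}\n{to_timestamp(start)} --> {to_timestamp(end)}\n{text}\n")
    # Blocks are separated by a blank line.
    return "\n".join(blocks)


content = format_srt([(0.5, 2.75, "Hello, how are you?"), (3.0, 4.0, "Fine, thank you.")])
print(content)
```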

Subpackages / Submodules#

`medkit.io.brat`
`medkit.io.doccano`
`medkit.io.medkit_json`
`medkit.io.rttm`
`medkit.io.spacy`	This module needs extra-dependencies not installed as core dependencies of medkit.
`medkit.io.srt`
