I/O components#

This page lists all components for converting and loading/saving data.

Note

For more details about public APIs, refer to medkit.io.

medkit-json#

medkit provides utilities to export medkit documents to JSON format and to import them back.

You can use medkit.io.medkit_json.save_text_documents to save a list of documents, and then medkit.io.medkit_json.load_text_documents to load them back into medkit.

Warning

load_text_documents is a generator function: it returns a generator iterator that yields documents one at a time instead of holding them all in memory.

Note that the generator is exhausted after the first full iteration; iterating over it a second time yields nothing.

If you need to keep all the documents in memory, convert the generator to a list:

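from medkit.io.medkit_json import load_text_documents

# Path to a JSONL file previously written by save_text_documents
MEDKIT_JSONL_PATH = "path_to_medkit_jsonl_file"

# Consume the generator once and keep every document in a list
docs = list(load_text_documents(MEDKIT_JSONL_PATH))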
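The single-pass behaviour described in the warning can be reproduced with a plain Python generator. The sketch below does not use medkit: iter_jsonl is a hypothetical stand-in for any lazy JSONL loader, and the file content is made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl(path):
    """Hypothetical lazy loader: yield one parsed object per non-empty line."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            if line.strip():
                yield json.loads(line)


# Write a tiny two-line JSONL file to iterate over.
tmp_dir = tempfile.mkdtemp()
jsonl_path = Path(tmp_dir) / "docs.jsonl"
jsonl_path.write_text('{"text": "first"}\n{"text": "second"}\n', encoding="utf-8")

records = iter_jsonl(jsonl_path)
first_pass = [r["text"] for r in records]   # consumes the generator
second_pass = [r["text"] for r in records]  # nothing left to yield

# Materializing the generator keeps every item available for reuse.
records_list = list(iter_jsonl(jsonl_path))
```

Here first_pass holds both texts while second_pass is empty, which is why a list is the safer choice whenever the documents are needed more than once.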

For more details, refer to medkit.io.medkit_json.

Brat#

Brat is a web-based tool for text annotation. Medkit can convert text documents from and to the Brat format.

See also

For more details, refer to medkit.io.brat. You may refer to this example for more information.

Doccano#

Doccano is a text annotation tool for multiple tasks. Medkit supports the input and output conversion of doccano files (.jsonl format).

You can load annotations from a .jsonl file or a zip directory.

Supported tasks#

| Doccano Project | Task for io converter | Record format |
|---|---|---|
| Sequence labeling | medkit.io.doccano.DoccanoTask.SEQUENCE_LABELING | {'text':...,'label':[(int,int,label)]} |
| Sequence labeling with relations | medkit.io.doccano.DoccanoTask.RELATION_EXTRACTION | {'text':...,'entities':[{...}],'relations':[{...}]} |
| Text Classification | medkit.io.doccano.DoccanoTask.TEXT_CLASSIFICATION | {'text':...,'label':[str]} |
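To make the record formats above concrete, here is a minimal sketch that reads one JSONL line per task with the standard json module and infers the task from the shape of the record. It does not use medkit: guess_task is a hypothetical helper, the sample texts and labels are invented, and the span offsets are assumed to be character positions with an exclusive end.

```python
import json

# Illustrative lines, one per task; texts and labels are made up.
seq_line = '{"text": "Patient has asthma", "label": [[12, 18, "disease"]]}'
rel_line = '{"text": "Aspirin treats headache", "entities": [], "relations": []}'
cls_line = '{"text": "Patient has asthma", "label": ["respiratory"]}'


def guess_task(record):
    """Hypothetical helper: infer the task name from the record's shape."""
    if "entities" in record and "relations" in record:
        return "RELATION_EXTRACTION"
    labels = record.get("label", [])
    if labels and all(isinstance(item, str) for item in labels):
        return "TEXT_CLASSIFICATION"
    if labels and all(
        isinstance(item, (list, tuple)) and len(item) == 3 for item in labels
    ):
        return "SEQUENCE_LABELING"
    return "UNKNOWN"


tasks = [guess_task(json.loads(line)) for line in (seq_line, rel_line, cls_line)]

# For sequence labeling, each label is a (start, end, label) triple.
seq_record = json.loads(seq_line)
start, end, label = seq_record["label"][0]
span_text = seq_record["text"][start:end]
```

In practice the task is chosen explicitly through DoccanoTask rather than guessed; the helper only shows how the three shapes differ.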

Client configuration#

The doccano user interface lets you customize certain annotation parameters. The medkit.io.doccano.DoccanoClientConfig class holds the configuration used by the input converter.

You can modify the settings depending on the configuration of your project. If you don’t provide a config, the converter will use the default doccano configuration.

Note

Metadata

  • Doccano to medkit: All the extra fields are imported as a dictionary in TextDocument.metadata

  • Medkit to Doccano: The TextDocument.metadata is exported as extra fields in the output data. You can set include_metadata to False to remove the extra fields.
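The metadata round trip described in this note can be sketched with plain dictionaries. This is not medkit's implementation: both functions are hypothetical, the record content is invented, and the sketch assumes that "text" and "label" are the only non-extra fields.

```python
import json

# Illustrative doccano record with two extra fields (made-up values).
line = '{"text": "Patient has asthma", "label": ["respiratory"], "source": "ward-3", "year": 2021}'


def to_text_and_metadata(record, core_keys=("text", "label")):
    """Hypothetical import step: every non-core field goes into a metadata dict."""
    metadata = {key: value for key, value in record.items() if key not in core_keys}
    return record["text"], metadata


def to_record(text, labels, metadata, include_metadata=True):
    """Hypothetical export step: metadata comes back as extra top-level fields."""
    record = {"text": text, "label": labels}
    if include_metadata:
        record.update(metadata)
    return record


text, metadata = to_text_and_metadata(json.loads(line))
with_extras = to_record(text, ["respiratory"], metadata)
without_extras = to_record(text, ["respiratory"], metadata, include_metadata=False)
```

With include_metadata left at True the exported record matches the imported one; setting it to False keeps only the text and label fields.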

For more details, refer to medkit.io.doccano.

Spacy#

Medkit supports the input and output conversion of spacy documents.

Important

To use the spacy converters, you need to install spacy. It can be installed together with its dependencies using pip install medkit-lib[spacy]

See also

You may refer to this example for more information.

For more details, refer to medkit.io.spacy.

RTTM#

Rich Transcription Time Marked (.rttm) files contain diarization information. Medkit supports input and output conversion of audio documents.
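As an illustration of what diarization information in an .rttm file looks like, the following sketch parses SPEAKER lines with the standard library only. It does not use medkit: Turn and parse_rttm_line are hypothetical names, the sample content is invented, and the sketch assumes the common space-separated layout in which the fourth, fifth and eighth fields are the turn onset, the turn duration and the speaker name.

```python
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    file_id: str
    onset: float      # seconds from the start of the recording
    duration: float   # seconds
    speaker: str


def parse_rttm_line(line: str) -> Optional[Turn]:
    """Hypothetical parser: return a Turn for a SPEAKER line, else None."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        return None
    return Turn(fields[1], float(fields[3]), float(fields[4]), fields[7])


# Made-up sample: two speech turns in the same recording.
sample = """SPEAKER rec1 1 0.00 2.50 <NA> <NA> spk_A <NA> <NA>
SPEAKER rec1 1 2.50 1.25 <NA> <NA> spk_B <NA> <NA>"""

parsed = (parse_rttm_line(line) for line in sample.splitlines())
turns = [turn for turn in parsed if turn is not None]
last_turn_end = turns[-1].onset + turns[-1].duration
```

Each turn tells who speaks, from when, and for how long, which is the information the converter maps to and from audio documents.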

For more details, refer to medkit.io.rttm.

SRT#

For more details, refer to medkit.io.srt.