medkit.io.doccano

Classes:

DoccanoClientConfig([column_text, column_label])

A class representing the configuration in the doccano client.

DoccanoInputConverter(task[, client_config, ...])

Convert doccano files (.JSONL) containing annotations for a given task.

DoccanoOutputConverter(task[, anns_labels, ...])

Convert medkit files to doccano files (.JSONL) for a given task.

DoccanoTask(value)

Supported doccano tasks.

class DoccanoTask(value)

Supported doccano tasks. The task defines the type of document to convert.

Variables
  • TEXT_CLASSIFICATION – Documents with a category

  • RELATION_EXTRACTION – Documents with entities and relations (including IDs)

  • SEQUENCE_LABELING – Documents with entities in tuples
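To make the three task types concrete, the sketch below builds one JSONL line per task using only the standard library. The field names (`text`, `label`, `entities`, `relations`, `start_offset`, `end_offset`, `from_id`, `to_id`, `type`) are assumptions based on doccano's usual export layout and are not stated on this page, so check an actual export from your project before relying on them.

```python
import json

# Hypothetical sample lines, one per task. Field names are assumptions
# based on doccano's usual export layout, not taken from this page.
text = "Aspirin relieves headache"

# TEXT_CLASSIFICATION: the document carries a category
text_classification_line = {"text": text, "label": ["treatment"]}

# SEQUENCE_LABELING: entities are [start, end, label] tuples
sequence_labeling_line = {
    "text": text,
    "label": [[0, 7, "DRUG"], [17, 25, "SYMPTOM"]],
}

# RELATION_EXTRACTION: entities and relations carry explicit IDs
relation_extraction_line = {
    "text": text,
    "entities": [
        {"id": 0, "start_offset": 0, "end_offset": 7, "label": "DRUG"},
        {"id": 1, "start_offset": 17, "end_offset": 25, "label": "SYMPTOM"},
    ],
    "relations": [{"id": 0, "from_id": 0, "to_id": 1, "type": "treats"}],
}

# A JSONL file holds one JSON object per line
jsonl = "\n".join(
    json.dumps(line)
    for line in (
        text_classification_line,
        sequence_labeling_line,
        relation_extraction_line,
    )
)
parsed = [json.loads(line) for line in jsonl.splitlines()]
```

A real export contains lines for a single task only; the three shapes are combined here purely for comparison.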

class DoccanoClientConfig(column_text='text', column_label='label')

A class representing the configuration in the doccano client. The default values are the default values used by doccano.

Variables
  • column_text (str) – Name or key representing the text

  • column_label (str) – Name or key representing the label
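The role of the two column names can be illustrated without medkit. The helper below is hypothetical and is not medkit's implementation; it only shows how a configurable text key and label key select fields from one JSONL line. The keys `content` and `category` are made-up examples of a non-default doccano setup.

```python
import json


def read_text_and_labels(line, column_text="text", column_label="label"):
    """Hypothetical helper: pick the text and labels out of one JSONL line."""
    record = json.loads(line)
    return record[column_text], record[column_label]


# Line using the default key names
default_line = '{"text": "Patient is stable", "label": ["follow-up"]}'
# Line from a project configured with custom key names (made-up example)
custom_line = '{"content": "Patient is stable", "category": ["follow-up"]}'

default_result = read_text_and_labels(default_line)
custom_result = read_text_and_labels(
    custom_line, column_text="content", column_label="category"
)
```

With medkit, the matching configuration would presumably be `DoccanoClientConfig(column_text="content", column_label="category")`, passed to a converter as `client_config`.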

class DoccanoInputConverter(task, client_config=None, attr_label='doccano_category', uid=None)

Convert doccano files (.JSONL) containing annotations for a given task.

For each line, a TextDocument is created. The doccano files can be loaded from a directory of zip files, from a single zip file, or from a JSONL file.

The converter supports a custom configuration that defines the parameters used by doccano when importing the data (cf. DoccanoClientConfig).

Warning

If the option "Count grapheme clusters as one character" was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
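The alignment problem can be sketched with the standard library alone, assuming that offsets exported with this option count grapheme clusters while Python string indices count code points. The grapheme count below is a rough approximation (it only skips combining marks); a full implementation would follow the Unicode text segmentation rules.

```python
import unicodedata

# "é" written as "e" + combining acute accent:
# two code points, but a single grapheme cluster
text = "cafe\u0301 noir"


def approx_grapheme_offset(s, codepoint_offset):
    """Rough approximation: count non-combining characters before the offset."""
    return sum(
        1 for ch in s[:codepoint_offset] if unicodedata.combining(ch) == 0
    )


# Offset of "noir" counted in code points (what Python slicing expects)
codepoint_start = text.index("noir")
# Offset of "noir" counted in (approximate) grapheme clusters
grapheme_start = approx_grapheme_offset(text, codepoint_start)

# Using the grapheme-based offset as a code point index selects the wrong span
misaligned = text[grapheme_start:grapheme_start + len("noir")]
```

Here the two offsets differ by one, so the slice starts one character too early. Every combining sequence before an annotation shifts its offsets further.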

Parameters
  • task (DoccanoTask) – The doccano task for the input converter

  • client_config (Optional[DoccanoClientConfig]) – Optional client configuration defining the default values used in the doccano interface. It can change, for example, the name of the text field or of the label field.

  • attr_label (str) – The label to use for the medkit attribute that represents the doccano category. Used for TEXT_CLASSIFICATION projects.

  • uid (Optional[str]) – Identifier of the converter.

Methods:

load_from_directory_zip(dir_path)

Create a list of TextDocuments from zip files in a directory.

load_from_file(input_file)

Create a list of TextDocuments from a doccano JSONL file.

load_from_zip(input_file)

Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the input converter init parameters.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

property description: medkit.core.operation_desc.OperationDescription

Contains all the input converter init parameters.

Return type

OperationDescription

load_from_directory_zip(dir_path)

Create a list of TextDocuments from zip files in a directory. The zip files should contain a JSONL file coming from doccano.

Parameters

dir_path (Union[str, Path]) – The path to the directory containing zip files.

Return type

List[TextDocument]

Returns

List[TextDocument] – A list of TextDocuments

load_from_zip(input_file)

Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.

Parameters

input_file (Union[str, Path]) – The path to the zip file containing a doccano JSONL file

Return type

List[TextDocument]

Returns

List[TextDocument] – A list of TextDocuments
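A minimal sketch of the zip workflow, under stated assumptions: the archive name `export.zip`, the member name `all.jsonl`, and the sample record are all hypothetical, and the record layout is assumed from doccano's usual sequence labeling export. The medkit call is wrapped in a function that is not executed here, since it requires medkit to be installed.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def make_sample_zip(directory):
    """Write a zip archive containing one JSONL file (hypothetical names)."""
    zip_path = Path(directory) / "export.zip"
    line = json.dumps(
        {"text": "Aspirin relieves headache", "label": [[0, 7, "DRUG"]]}
    )
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("all.jsonl", line + "\n")
    return zip_path


def load_zip_with_medkit(zip_path):
    """Load the archive with medkit. Requires medkit; not called here."""
    from medkit.io.doccano import DoccanoInputConverter, DoccanoTask

    converter = DoccanoInputConverter(task=DoccanoTask.SEQUENCE_LABELING)
    return converter.load_from_zip(zip_path)


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_zip = make_sample_zip(tmp_dir)
    # Inspect the archive to confirm it holds the JSONL member
    with zipfile.ZipFile(sample_zip) as archive:
        archive_members = archive.namelist()
```

For a directory holding several such archives, `load_from_directory_zip(dir_path)` takes the directory path instead of a single file.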

load_from_file(input_file)

Create a list of TextDocuments from a doccano JSONL file.

Parameters

input_file (Union[str, Path]) – The path to the JSONL file containing doccano annotations

Return type

List[TextDocument]

Returns

List[TextDocument] – A list of TextDocuments
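Loading a plain JSONL file could look like the sketch below. The sample records and the file name `sample.jsonl` are hypothetical, and the `text`/`label` layout is assumed from the default column names documented for DoccanoClientConfig. The medkit part sits in a function that is not executed here, since it requires medkit to be installed.

```python
import json
import tempfile
from pathlib import Path


def write_sample_jsonl(path):
    """Write two hypothetical text classification records, one per line."""
    records = [
        {"text": "No fever today", "label": ["normal"]},
        {"text": "Persistent cough", "label": ["abnormal"]},
    ]
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record) + "\n")
    return len(records)


def load_file_with_medkit(path):
    """Return the text of each loaded document. Requires medkit; not called here."""
    from medkit.io.doccano import DoccanoInputConverter, DoccanoTask

    converter = DoccanoInputConverter(
        task=DoccanoTask.TEXT_CLASSIFICATION,
        attr_label="doccano_category",
    )
    docs = converter.load_from_file(path)
    return [doc.text for doc in docs]


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl"
    n_written = write_sample_jsonl(sample_path)
    # One document is expected per line of the file
    n_lines = len(sample_path.read_text(encoding="utf-8").splitlines())
```

Since one TextDocument is created per line, the number of returned documents should match the number of lines in the file.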

class DoccanoOutputConverter(task, anns_labels=None, attr_label=None, ignore_segments=True, include_metadata=True, uid=None)

Convert medkit files to doccano files (.JSONL) for a given task.

For each TextDocument, one JSON line will be created.

Parameters
  • task (DoccanoTask) – The doccano task for the output converter

  • anns_labels (Optional[List[str]]) – Labels of medkit annotations to convert into doccano annotations. If None (default), all entities and relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

  • attr_label (Optional[str]) – The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.

  • ignore_segments (bool) – If True, medkit segments are ignored and only entities are converted to doccano entities. If False, medkit segments are converted to doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

  • include_metadata (Optional[bool]) – Whether to include medkit metadata in the converted documents

  • uid (Optional[str]) – Identifier of the converter.

Methods:

save(docs, output_file)

Convert and save a list of TextDocuments into a doccano file (.JSONL)

save(docs, output_file)

Convert and save a list of TextDocuments into a doccano file (.JSONL)

Parameters
  • docs (List[TextDocument]) – List of medkit doc objects to convert

  • output_file (Union[str, Path]) – Path to the JSONL file in which to save the converted documents
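A sketch of the export direction, under stated assumptions. The import path `medkit.core.text` and the `Entity`, `Span`, and `TextDocument` constructors are not documented on this page and are assumed from medkit's core API; the function using them is not executed here, since it requires medkit to be installed. The `read_jsonl` helper is hypothetical and is exercised on a stand-in file whose layout is assumed from doccano's usual sequence labeling format.

```python
import json
import tempfile
from pathlib import Path


def save_with_medkit(output_file):
    """Export one annotated document. Requires medkit; not called here."""
    # Assumed import paths and constructors, not documented on this page
    from medkit.core.text import Entity, Span, TextDocument
    from medkit.io.doccano import DoccanoOutputConverter, DoccanoTask

    doc = TextDocument(text="Aspirin relieves headache")
    doc.anns.add(Entity(label="DRUG", spans=[Span(0, 7)], text="Aspirin"))

    converter = DoccanoOutputConverter(
        task=DoccanoTask.SEQUENCE_LABELING,
        anns_labels=["DRUG"],  # only convert annotations with this label
    )
    converter.save([doc], output_file)


def read_jsonl(path):
    """Hypothetical helper: read a JSONL file as a list of dicts."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Stand-in for a converter output file, used to exercise read_jsonl
with tempfile.TemporaryDirectory() as tmp_dir:
    stand_in = Path(tmp_dir) / "output.jsonl"
    stand_in.write_text(
        '{"text": "Aspirin relieves headache", "label": [[0, 7, "DRUG"]]}\n',
        encoding="utf-8",
    )
    records = read_jsonl(stand_in)
```

Reading the saved file back this way is a quick check that one line was written per document before importing it into doccano.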