medkit.io.doccano
Classes:
DoccanoClientConfig – A class representing the configuration in the doccano client.
DoccanoInputConverter – Convert doccano files (.JSONL) containing annotations for a given task.
DoccanoOutputConverter – Convert medkit files to doccano files (.JSONL) for a given task.
DoccanoTask – Supported doccano tasks.
- class DoccanoTask(value)
Supported doccano tasks. The task defines the type of document to convert.
- Variables
TEXT_CLASSIFICATION – Documents with a category
RELATION_EXTRACTION – Documents with entities and relations (including IDs)
SEQUENCE_LABELING – Documents with entities in tuples
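To make the three tasks concrete, the following sketch builds one JSONL record per task with the standard library only. The layouts assume doccano's default export formats, which this page does not spell out, and the sample text and labels are invented.

```python
import json

text = "Aspirin relieves headache"

# One hypothetical record per task, assuming doccano's default JSONL layouts.
records = {
    # TEXT_CLASSIFICATION: a text and its list of categories
    "TEXT_CLASSIFICATION": {"text": text, "label": ["treatment"]},
    # SEQUENCE_LABELING: entities as [start, end, label] tuples
    "SEQUENCE_LABELING": {
        "text": text,
        "label": [[0, 7, "DRUG"], [17, 25, "SYMPTOM"]],
    },
    # RELATION_EXTRACTION: entities and relations carry explicit IDs
    "RELATION_EXTRACTION": {
        "text": text,
        "entities": [
            {"id": 0, "label": "DRUG", "start_offset": 0, "end_offset": 7},
            {"id": 1, "label": "SYMPTOM", "start_offset": 17, "end_offset": 25},
        ],
        "relations": [{"id": 0, "from_id": 0, "to_id": 1, "type": "treats"}],
    },
}

# Each record is serialized as one line of a .jsonl file.
jsonl_lines = {task: json.dumps(record) for task, record in records.items()}

# Offsets index into the text; the end offset is exclusive.
start, end, label = records["SEQUENCE_LABELING"]["label"][1]
entity_text = text[start:end]
```

The task passed to a converter should match the layout of the file being read or written.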
- class DoccanoClientConfig(column_text='text', column_label='label')
A class representing the configuration in the doccano client. The defaults match the default values used by doccano.
- Variables
column_text (str) – Name or key representing the text
column_label (str) – Name or key representing the label
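As a minimal sketch of when this configuration matters, the example below writes a record that uses custom column names and, if medkit is installed, builds a matching configuration. The column names "content" and "tags" and the sample text are hypothetical; the medkit import is guarded so the snippet still runs without the library.

```python
import json

# Hypothetical export line from a project configured with custom column
# names ("content" and "tags") instead of the defaults "text" and "label".
custom_record = {"content": "Patient reports mild fever", "tags": ["symptom"]}
custom_line = json.dumps(custom_record)

try:
    from medkit.io.doccano import DoccanoClientConfig
except ImportError:  # medkit is not installed; skip the medkit part
    DoccanoClientConfig = None

if DoccanoClientConfig is not None:
    # Defaults mirror doccano's own defaults: "text" and "label".
    default_config = DoccanoClientConfig()
    # Name the keys that hold the text and the labels in each JSONL record.
    custom_config = DoccanoClientConfig(column_text="content", column_label="tags")
    print(default_config.column_text, default_config.column_label)
    print(custom_config.column_text, custom_config.column_label)
```

Such a configuration would then be passed as client_config when creating a DoccanoInputConverter.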
- class DoccanoInputConverter(task, client_config=None, attr_label='doccano_category', uid=None)
Convert doccano files (.JSONL) containing annotations for a given task.
For each line, a TextDocument will be created. The doccano files can be loaded from a directory with zip files or from a JSONL file.
The converter supports custom configuration to define the parameters used by doccano when importing the data (cf. DoccanoClientConfig).
Warning
If the option "Count grapheme clusters as one character" was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
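The alignment problem can be illustrated with the standard library alone. The sketch below assumes that, with this option enabled, doccano reports offsets in grapheme clusters, whereas Python strings are indexed by code point. The grapheme count used here is a rough approximation that only skips combining marks; it is not full Unicode text segmentation.

```python
import unicodedata

# "é" written as "e" + combining acute accent: two code points, one grapheme.
text = "cafe\u0301 noir"


def approx_grapheme_count(s: str) -> int:
    """Rough grapheme count: combining marks are not counted on their own."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# Offset of "noir" when counting code points (how Python indexes str).
code_point_start = text.index("noir")
# Offset of "noir" when each grapheme cluster counts as one character.
grapheme_start = approx_grapheme_count(text[:code_point_start])

# Applying the grapheme-based offset to a Python string selects the wrong span.
misaligned = text[grapheme_start:grapheme_start + 4]
aligned = text[code_point_start:code_point_start + 4]
```

Every combining character before an entity shifts its offsets by one, so the drift grows along the text.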
- Parameters
task (DoccanoTask) – The doccano task for the input converter
client_config (Optional[DoccanoClientConfig]) – Optional client configuration to define default values in the doccano interface. This config can change, for example, the name of the text field or labels.
attr_label (str) – The label to use for the medkit attribute that represents the doccano category. This is related to TEXT_CLASSIFICATION projects.
uid (Optional[str]) – Identifier of the converter.
Methods:
load_from_directory_zip(dir_path) – Create a list of TextDocuments from zip files in a directory.
load_from_file(input_file) – Create a list of TextDocuments from a doccano JSONL file.
load_from_zip(input_file) – Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.
set_prov_tracer(prov_tracer) – Enable provenance tracing.
Attributes:
description – Contains all the input converter init parameters.
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the input converter init parameters.
- Return type
OperationDescription
- load_from_directory_zip(dir_path)
Create a list of TextDocuments from zip files in a directory. The zip files should contain a JSONL file coming from doccano.
- Parameters
dir_path (Union[str, Path]) – The path to the directory containing zip files.
- Return type
List[TextDocument]
- Returns
List[TextDocument] – A list of TextDocuments
- load_from_zip(input_file)
Create a list of TextDocuments from a zip file containing a JSONL file coming from doccano.
- Parameters
input_file (Union[str, Path]) – The path to the zip file containing a doccano JSONL file
- Return type
List[TextDocument]
- Returns
List[TextDocument] – A list of TextDocuments
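The two zip-based loaders can be sketched as follows. The archive name, the member name "all.jsonl" and the sample record are invented for illustration, and the record layout assumes doccano's default text classification export. The medkit import is guarded so the snippet still runs (and only creates the archive) when the library is missing.

```python
import json
import tempfile
import zipfile
from pathlib import Path

# Build a directory holding one zip archive, mimicking a doccano export.
export_dir = Path(tempfile.mkdtemp())
zip_path = export_dir / "project_export.zip"
record = {"text": "Aspirin relieves headache", "label": ["treatment"]}
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("all.jsonl", json.dumps(record) + "\n")

try:
    from medkit.io.doccano import DoccanoInputConverter, DoccanoTask
except ImportError:  # medkit is not installed; only the archive is created
    docs_from_zip = docs_from_dir = None
else:
    converter = DoccanoInputConverter(task=DoccanoTask.TEXT_CLASSIFICATION)
    # Load a single archive...
    docs_from_zip = converter.load_from_zip(zip_path)
    # ...or every archive found in a directory.
    docs_from_dir = converter.load_from_directory_zip(export_dir)
```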
- load_from_file(input_file)
Create a list of TextDocuments from a doccano JSONL file.
- Parameters
input_file (Union[str, Path]) – The path to the JSONL file containing doccano annotations
- Return type
List[TextDocument]
- Returns
List[TextDocument] – A list of TextDocuments
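A minimal sketch of loading a JSONL file directly is shown below. The sample records are made up and assume doccano's default sequence labeling layout ([start, end, label] tuples under "label"). The medkit import is guarded so the snippet runs without the library, in which case it only writes the file.

```python
import json
import tempfile
from pathlib import Path

# Write a small, made-up SEQUENCE_LABELING export to a temporary JSONL file.
samples = [
    {"text": "Aspirin relieves headache", "label": [[0, 7, "DRUG"]]},
    {"text": "No fever today", "label": [[3, 8, "SYMPTOM"]]},
]
input_file = Path(tempfile.mkdtemp()) / "all.jsonl"
input_file.write_text(
    "\n".join(json.dumps(sample) for sample in samples) + "\n",
    encoding="utf-8",
)

try:
    from medkit.io.doccano import DoccanoInputConverter, DoccanoTask
except ImportError:  # medkit is not installed; only the file is created
    docs = None
else:
    converter = DoccanoInputConverter(task=DoccanoTask.SEQUENCE_LABELING)
    # One TextDocument is expected per line of the JSONL file.
    docs = converter.load_from_file(input_file)
    for doc in docs:
        print(doc.text)
```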
- class DoccanoOutputConverter(task, anns_labels=None, attr_label=None, ignore_segments=True, include_metadata=True, uid=None)
Convert medkit files to doccano files (.JSONL) for a given task.
For each TextDocument, a JSON line will be created.
- Parameters
task (DoccanoTask) – The doccano task for the output converter
anns_labels (Optional[List[str]]) – Labels of medkit annotations to convert into doccano annotations. If None (default), all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
attr_label (Optional[str]) – The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.
ignore_segments (bool) – If True, medkit segments will be ignored and only entities will be converted to doccano entities. If False, the medkit segments will be converted to doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
include_metadata (Optional[bool]) – Whether to include medkit metadata in the converted documents
uid (Optional[str]) – Identifier of the converter.
Methods:
save(docs, output_file) – Convert and save a list of TextDocuments into a doccano file (.JSONL)
- save(docs, output_file)
Convert and save a list of TextDocuments into a doccano file (.JSONL)
- Parameters
docs (List[TextDocument]) – List of medkit doc objects to convert
output_file (Union[str, Path]) – Path of the JSONL file in which to save the converted documents
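A possible export round trip is sketched below. It assumes that TextDocument, Entity and Span can be imported from medkit.core.text and that annotations are attached with doc.anns.add, none of which is documented on this page. The document text, label and output file name are invented. The medkit imports are guarded so the snippet runs without the library, in which case nothing is written.

```python
import json
import tempfile
from pathlib import Path

output_file = Path(tempfile.mkdtemp()) / "medkit_export.jsonl"

try:
    # Assumed import locations for the medkit data classes.
    from medkit.core.text import Entity, Span, TextDocument
    from medkit.io.doccano import DoccanoOutputConverter, DoccanoTask
except ImportError:  # medkit is not installed; nothing is exported
    exported = None
else:
    doc = TextDocument(text="Aspirin relieves headache")
    doc.anns.add(Entity(label="DRUG", spans=[Span(0, 7)], text="Aspirin"))

    converter = DoccanoOutputConverter(
        task=DoccanoTask.SEQUENCE_LABELING,
        anns_labels=["DRUG"],  # only convert annotations with this label
        include_metadata=False,  # leave medkit metadata out of the export
    )
    converter.save([doc], output_file)

    # Read the file back: one JSON object per converted document.
    lines = output_file.read_text(encoding="utf-8").splitlines()
    exported = [json.loads(line) for line in lines if line.strip()]
```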