medkit.text.postprocessing.document_splitter
Classes:
DocumentSplitter – Split text documents using their segments as a reference.
- class DocumentSplitter(segment_label, entity_labels=None, attr_labels=None, relation_labels=None, name=None, uid=None)
Split text documents using their segments as a reference.
The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.
This operation can be used to create datasets from medkit text documents.
Instantiate the document splitter
- Parameters
segment_label (str) – Label of the segments to use as references for the splitter.
entity_labels (Optional[List[str]]) – Labels of entities to be included in the mini-documents. If None, all entities from the document will be included.
attr_labels (Optional[List[str]]) – Labels of the attributes to be included in the new annotations. If None, all attributes will be included.
relation_labels (Optional[List[str]]) – Labels of relations to be included in the mini-documents. If None, all relations will be included.
name (Optional[str]) – Name describing the splitter (defaults to the class name).
uid (Optional[str]) – Identifier of the operation.
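The `None` defaults of `entity_labels`, `attr_labels` and `relation_labels` all follow the same "keep everything" convention. The snippet below is a minimal pure-Python sketch of that convention, not medkit's implementation: `select_by_label` and the dict-based annotations are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Optional


def select_by_label(items: List[Dict], labels: Optional[List[str]] = None) -> List[Dict]:
    """Keep items whose "label" is in `labels`; keep every item when `labels` is None."""
    if labels is None:
        return list(items)
    allowed = set(labels)
    return [item for item in items if item["label"] in allowed]


# Hypothetical annotations, represented as plain dicts for this sketch
entities = [
    {"label": "disease", "text": "asthma"},
    {"label": "drug", "text": "salbutamol"},
]

all_entities = select_by_label(entities)          # None: nothing is filtered out
only_drugs = select_by_label(entities, ["drug"])  # explicit list: matching labels only
```

Passing an explicit list is the way to restrict the mini-documents to the annotations a downstream dataset actually needs.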
Methods:
run(docs) – Split docs into mini documents
set_prov_tracer(prov_tracer) – Enable provenance tracing.
Attributes:
description – Contains all the operation init parameters.
- run(docs)
Split docs into mini documents
- Parameters
docs – List of text documents to split
- Return type
List[TextDocument]
- Returns
List of documents created from the selected segments
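To make the idea of segment-based splitting concrete, here is a rough, self-contained sketch in plain Python. It does not use medkit and is not how `run` is implemented: `Ann`, `MiniDoc` and `split_by_segments` are hypothetical names. The choices to keep only entities fully contained in a segment, and to re-base their offsets on the segment start, are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ann:
    """Hypothetical annotation: a label plus character offsets into a text."""
    label: str
    start: int
    end: int


@dataclass
class MiniDoc:
    """Hypothetical stand-in for a resulting mini-document."""
    text: str
    entities: List[Ann] = field(default_factory=list)


def split_by_segments(text: str, segments: List[Ann], entities: List[Ann],
                      segment_label: str) -> List[MiniDoc]:
    """Build one mini-document per segment carrying the requested label."""
    mini_docs = []
    for seg in segments:
        if seg.label != segment_label:
            continue
        # Assumption: keep entities fully inside the segment and shift
        # their offsets so they are relative to the segment start.
        inside = [
            Ann(ent.label, ent.start - seg.start, ent.end - seg.start)
            for ent in entities
            if seg.start <= ent.start and ent.end <= seg.end
        ]
        mini_docs.append(MiniDoc(text=text[seg.start:seg.end], entities=inside))
    return mini_docs


text = "Patient has asthma. Treated with salbutamol."
segments = [Ann("sentence", 0, 19), Ann("sentence", 20, 44)]
entities = [Ann("disease", 12, 18), Ann("drug", 33, 43)]

mini_docs = split_by_segments(text, segments, entities, segment_label="sentence")
```

Each element of `mini_docs` holds the text of one segment and only the entities that fall inside it, which is the shape typically wanted when building a sentence-level dataset.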
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.