medkit.text.postprocessing.document_splitter

Classes:

DocumentSplitter(segment_label[, ...])

Split text documents using their segments as a reference.

class DocumentSplitter(segment_label, entity_labels=None, attr_labels=None, relation_labels=None, name=None, uid=None)

Split text documents using their segments as a reference.

The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.

This operation can be used to create datasets from medkit text documents.
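The idea can be illustrated without medkit. The following is a minimal pure-Python sketch, not medkit's implementation: the `Ann` and `MiniDoc` classes and the `split_by_segments` function are hypothetical stand-ins, and it assumes that an entity "belongs" to a segment when its span lies fully inside the segment and that entity offsets are shifted to be relative to the mini-document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ann:
    """Toy annotation (hypothetical): a label plus character offsets."""
    label: str
    start: int
    end: int


@dataclass
class MiniDoc:
    """Toy mini-document (hypothetical): segment text plus its entities."""
    text: str
    entities: List[Ann] = field(default_factory=list)


def split_by_segments(
    text: str,
    segments: List[Ann],
    entities: List[Ann],
    segment_label: str,
    entity_labels: Optional[List[str]] = None,
) -> List[MiniDoc]:
    """Create one MiniDoc per segment whose label equals `segment_label`."""
    mini_docs = []
    for seg in segments:
        if seg.label != segment_label:
            continue
        mini = MiniDoc(text=text[seg.start:seg.end])
        for ent in entities:
            # entity_labels=None means "keep entities of every label"
            if entity_labels is not None and ent.label not in entity_labels:
                continue
            # Assumption: an entity belongs to a segment if fully contained in it
            if seg.start <= ent.start and ent.end <= seg.end:
                # Assumption: offsets become relative to the mini-document
                mini.entities.append(
                    Ann(ent.label, ent.start - seg.start, ent.end - seg.start)
                )
        mini_docs.append(mini)
    return mini_docs


text = "History: asthma. Exam: mild fever."
segments = [Ann("section", 0, 16), Ann("section", 17, 34)]
entities = [Ann("disease", 9, 15), Ann("symptom", 28, 33)]
mini_docs = split_by_segments(text, segments, entities, segment_label="section")
```

Each resulting mini-document carries only the text of its segment and the entities found inside it, which is the shape typically wanted when building a dataset of per-section examples.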

Instantiate the document splitter.

Parameters

segment_label (str) – Label of the segments to use as references for the splitter.
entity_labels (Optional[List[str]]) – Labels of the entities to include in the mini-documents. If None, all entities from the document are included.
attr_labels (Optional[List[str]]) – Labels of the attributes to include in the new annotations. If None, all attributes are included.
relation_labels (Optional[List[str]]) – Labels of the relations to include in the mini-documents. If None, all relations are included.
name (Optional[str]) – Name describing the splitter (defaults to the class name).
uid (Optional[str]) – Identifier of the operation.

Methods:

run(docs)	Split docs into mini-documents.
set_prov_tracer(prov_tracer)	Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(docs)

Split docs into mini-documents.

description

Contains all the operation init parameters.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.