Core text components
This page describes the core text concepts of medkit.
Note
For more details about public APIs, refer to medkit.core.text.
Document, Annotations & Attributes
The TextDocument class implements the Document protocol. It stores instances of the subclasses of TextAnnotation, which implements the Annotation protocol.
Document
TextDocument relies on TextAnnotationContainer, a subclass of AnnotationContainer, to manage its annotations.
Given a text document named doc:
You can browse segments, entities, and relations:
    for entity in doc.anns.entities:
        ...
    for segment in doc.anns.segments:
        ...
    for relation in doc.anns.relations:
        ...
You can filter segments, entities, and relations:
    sentences_segments = doc.get_segments(label="sentences")
    disorder_entities = doc.get_entities(label="disorder")
    # Pick any entity of interest, for example the first disorder entity
    entity = disorder_entities[0]
    relations = doc.get_relations(label="before", source_id=entity.uid)
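To make the filtering behaviour concrete, here is a minimal standard-library sketch of label-based and source-based filtering. The classes ToyAnnotation, ToyRelation and ToyContainer are hypothetical stand-ins written for this illustration; they are not medkit classes and only mimic the idea of the calls shown above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyAnnotation:
    """Hypothetical stand-in for a text annotation (not a medkit class)."""
    uid: str
    label: str


@dataclass
class ToyRelation(ToyAnnotation):
    """Hypothetical relation linking a source annotation to a target one."""
    source_id: str = ""
    target_id: str = ""


@dataclass
class ToyContainer:
    """Hypothetical container holding entities and relations."""
    entities: List[ToyAnnotation] = field(default_factory=list)
    relations: List[ToyRelation] = field(default_factory=list)

    def get_entities(self, label: Optional[str] = None) -> List[ToyAnnotation]:
        # Keep every entity when no label is given, else only matching ones
        return [e for e in self.entities if label is None or e.label == label]

    def get_relations(
        self, label: Optional[str] = None, source_id: Optional[str] = None
    ) -> List[ToyRelation]:
        # Both criteria are optional and combined with a logical AND
        return [
            r
            for r in self.relations
            if (label is None or r.label == label)
            and (source_id is None or r.source_id == source_id)
        ]


anns = ToyContainer(
    entities=[ToyAnnotation("e1", "disorder"), ToyAnnotation("e2", "drug")],
    relations=[
        ToyRelation("r1", "before", source_id="e1", target_id="e2"),
        ToyRelation("r2", "before", source_id="e2", target_id="e1"),
    ],
)
disorders = anns.get_entities(label="disorder")
relations = anns.get_relations(label="before", source_id=disorders[0].uid)
```

Here the relation filter keeps only the "before" relation whose source is the selected disorder entity.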
Note
For common interfaces provided by core components, you can refer to Document.
Annotations
For the text modality, a TextDocument can only contain TextAnnotation instances.
Note
For more details about public APIs, refer to medkit.core.text.annotation.
Three subclasses are defined: Segment, Entity, and Relation.
Note
Each text annotation class inherits the common interface provided by the core component (cf. Annotation).
Attributes
Text annotations can receive attributes, which are instances of the core Attribute class.
Among the available attribute classes, medkit.core.text provides EntityNormAttribute, intended for normalization attributes, so that normalization information has a common structure regardless of the operation that created it.
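The benefit of a common structure can be sketched with the standard library alone. In the example below, ToyNormAttribute and its field names (kb_name, kb_id, term, score) are hypothetical and chosen for illustration, as are the two toy normalizers and the made-up identifier; none of this is the actual EntityNormAttribute API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyNormAttribute:
    """Hypothetical normalization record; field names are illustrative."""
    kb_name: str                   # knowledge base the identifier comes from
    kb_id: str                     # identifier of the concept in that base
    term: Optional[str] = None     # preferred term, if the producer knows it
    score: Optional[float] = None  # confidence, if the producer provides one


def normalize_by_lookup(text: str) -> ToyNormAttribute:
    """Toy dictionary-based normalizer using a made-up lookup table."""
    table = {"asthma": "TOY:0001"}
    return ToyNormAttribute(kb_name="toy_kb", kb_id=table[text.lower()])


def normalize_by_scoring(text: str) -> ToyNormAttribute:
    """Toy normalizer that also reports a term and a fixed confidence."""
    return ToyNormAttribute(
        kb_name="toy_kb", kb_id="TOY:0001", term="asthma", score=0.9
    )


def describe(attr: ToyNormAttribute) -> str:
    """Downstream code depends only on the shared structure."""
    return f"{attr.kb_name}/{attr.kb_id}"


from_lookup = normalize_by_lookup("Asthma")
from_scoring = normalize_by_scoring("asthma attack")
same_concept = describe(from_lookup) == describe(from_scoring)
```

Because both producers fill the same fields, the consumer does not need to know which operation created the attribute.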
Spans
medkit relies on the concept of spans to track all text modifications made by the different operations.
Note
For more details about public APIs, refer to medkit.core.text.span.
medkit also provides a set of utilities for manipulating these spans, which can be useful when implementing a new medkit operation.
Note
For more details about public APIs, refer to medkit.core.text.span_utils.
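The idea of tracking modifications with spans can be illustrated with a small standard-library sketch. It assumes a simplified model in which a piece of text is described either by a range copied from the original text or by new characters that replaced an original range. ToySpan, ToyModifiedSpan, replace_once and to_original are hypothetical names introduced here; they are not the medkit.core.text.span or span_utils API.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ToySpan:
    """Characters [start, end) copied unchanged from the original text."""
    start: int
    end: int


@dataclass(frozen=True)
class ToyModifiedSpan:
    """`length` new characters that replaced the original range `replaced`."""
    length: int
    replaced: ToySpan


ToyAnySpan = Union[ToySpan, ToyModifiedSpan]


def replace_once(
    text: str, start: int, end: int, new: str
) -> Tuple[str, List[ToyAnySpan]]:
    """Replace text[start:end] with `new` and describe the result with spans."""
    new_text = text[:start] + new + text[end:]
    spans: List[ToyAnySpan] = [
        ToySpan(0, start),
        ToyModifiedSpan(len(new), ToySpan(start, end)),
        ToySpan(end, len(text)),
    ]
    return new_text, spans


def to_original(spans: List[ToyAnySpan], pos: int) -> ToySpan:
    """Map a character position of the modified text back to the original."""
    offset = 0
    for span in spans:
        if isinstance(span, ToyModifiedSpan):
            length = span.length
        else:
            length = span.end - span.start
        if pos < offset + length:
            if isinstance(span, ToyModifiedSpan):
                # New characters map to the whole range they replaced
                return span.replaced
            start = span.start + (pos - offset)
            return ToySpan(start, start + 1)
        offset += length
    raise IndexError(pos)


original = "Patient has HTN."
cleaned, spans = replace_once(original, 12, 15, "hypertension")
in_replacement = to_original(spans, 14)  # a character inside "hypertension"
final_period = to_original(spans, 24)    # the final "." of the cleaned text
```

Any position in the cleaned text can be traced back to a range of the original text, which is what makes the modification non-destructive.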
See also
You may also take a look at the spans notebook example.
Text utilities
These utilities provide preconfigured patterns for preprocessing text documents non-destructively. They are not meant to be used directly, but rather inside a cleaning operation.
Note
For more details about public APIs, refer to medkit.core.text.utils.
See also
medkit provides the EDSCleaner class, which combines all these utilities to clean French documents (related to EDS documents coming from PDF).
Operations
Abstract subclasses of Operation have been defined for text, to ease the development of text operations according to the kind of run method they implement.
Note
For more details about public APIs, refer to medkit.core.text.operation.
The internal class _CustomTextOperation has been implemented so that users can call create_text_operation() to easily instantiate a custom text operation.
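The general pattern of turning a plain function into an operation can be sketched as follows. ToyOperation and create_toy_operation are hypothetical names used only for this illustration, and the example works on plain strings; the real create_text_operation() signature and behaviour are documented in medkit.core.text.operation and may differ.

```python
from typing import Callable, List, Optional


class ToyOperation:
    """Hypothetical minimal operation: an object exposing a run() method."""

    def __init__(self, function: Callable[[str], List[str]], name: str):
        self._function = function
        self.name = name

    def run(self, texts: List[str]) -> List[str]:
        """Apply the wrapped function to each input and flatten the results."""
        outputs: List[str] = []
        for text in texts:
            outputs.extend(self._function(text))
        return outputs


def create_toy_operation(
    function: Callable[[str], List[str]], name: Optional[str] = None
) -> ToyOperation:
    """Hypothetical factory that wraps a function into a ToyOperation."""
    return ToyOperation(function, name or function.__name__)


def split_on_semicolon(text: str) -> List[str]:
    """User-defined function: one input text, zero or more output texts."""
    return [part.strip() for part in text.split(";") if part.strip()]


op = create_toy_operation(split_on_semicolon)
result = op.run(["fever; cough", "headache"])
```

The user only writes the function; the wrapper supplies the run() interface that a pipeline would call.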
See also
You may refer to this tutorial for an example of how to define a custom operation.