Core text components#

This page contains all core text concepts of medkit.

Note

For more details about public APIs, refer to medkit.core.text.

Document, Annotations & Attributes#

The TextDocument class implements the Document protocol. It stores instances of TextAnnotation subclasses; TextAnnotation itself implements the Annotation protocol.

    classDiagram
        direction TB
        class Document~Annotation~{
            <<protocol>>
        }
        class Annotation{
            <<protocol>>
        }
        class TextDocument{
            uid: str
            anns: TextAnnotationContainer
        }
        class TextAnnotation{
            <<abstract>>
            uid: str
            label: str
            attrs: AttributeContainer
        }
        Document <|.. TextDocument: implements
        Annotation <|.. TextAnnotation: implements
        TextDocument *-- TextAnnotation: contains \n(TextAnnotationContainer)

Fig. 2 Text document and text annotation#

Document#

TextDocument relies on TextAnnotationContainer, a subclass of AnnotationContainer, to manage its annotations.

Given a text document named doc:

  • Users can browse segments, entities, and relations:

    for entity in doc.anns.entities:
      ...
    
    for segment in doc.anns.segments:
      ...
    
    for relation in doc.anns.relations:
      ...
    
  • Users can filter segments, entities, and relations:

    sentences_segments = doc.get_segments(label="sentences")
    disorder_entities = doc.get_entities(label="disorder")

    # Pick any entity of interest, for instance the first disorder found
    entity = disorder_entities[0]
    relations = doc.get_relations(label="before", source_id=entity.uid)
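
To make the filtering semantics concrete, here is a minimal, self-contained sketch of a container that filters annotations by label and, for relations, by source identifier. It does not use medkit: MiniEntity, MiniRelation and MiniContainer are hypothetical stand-ins written for this illustration, and the real TextAnnotationContainer may behave differently.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


def _new_uid() -> str:
    # Hypothetical uid generator; medkit's own scheme may differ.
    return str(uuid.uuid4())


@dataclass
class MiniEntity:
    label: str
    text: str
    uid: str = field(default_factory=_new_uid)


@dataclass
class MiniRelation:
    label: str
    source_id: str
    target_id: str
    uid: str = field(default_factory=_new_uid)


class MiniContainer:
    """Illustrative stand-in for an annotation container (not medkit's class)."""

    def __init__(self) -> None:
        self.entities: List[MiniEntity] = []
        self.relations: List[MiniRelation] = []

    def get_entities(self, label: Optional[str] = None) -> List[MiniEntity]:
        # No label means "no filtering": every entity is returned.
        return [e for e in self.entities if label is None or e.label == label]

    def get_relations(
        self, label: Optional[str] = None, source_id: Optional[str] = None
    ) -> List[MiniRelation]:
        # Each criterion is applied only when it is provided.
        return [
            r
            for r in self.relations
            if (label is None or r.label == label)
            and (source_id is None or r.source_id == source_id)
        ]


anns = MiniContainer()
fever = MiniEntity(label="disorder", text="fever")
aspirin = MiniEntity(label="drug", text="aspirin")
anns.entities += [fever, aspirin]
anns.relations.append(MiniRelation("before", fever.uid, aspirin.uid))

disorders = anns.get_entities(label="disorder")
before_fever = anns.get_relations(label="before", source_id=fever.uid)
```

In this sketch, omitting a criterion disables it, so calling get_entities() with no argument returns every entity.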
    

Note

For common interfaces provided by core components, you can refer to Document.

Annotations#

For the text modality, a TextDocument can only contain TextAnnotation instances.

Note

For more details about public APIs, refer to medkit.core.text.annotation.

Three subclasses are defined: Segment, Entity, and Relation. Entity is itself a subclass of Segment.

    classDiagram
        direction TB
        class Annotation{
            <<protocol>>
        }
        class TextAnnotation{
            <<abstract>>
        }
        Annotation <|.. TextAnnotation: implements
        TextAnnotation <|-- Segment
        TextAnnotation <|-- Relation
        Segment <|-- Entity

Fig. 3 Text annotation hierarchy#

Note

Each text annotation class inherits the common interface provided by the core component (cf. Annotation).

Attributes#

Text annotations can receive attributes, which will be instances of the core Attribute class.

Among these attributes, medkit.core.text provides EntityNormAttribute for normalization attributes. It gives normalization information a common structure, independent of the operation that created it.
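
The benefit of a shared structure can be sketched without medkit. In the example below, NormInfo is a hypothetical stand-in for a normalization attribute; its field names (kb_name, kb_id, score), the toy knowledge base, and the two matcher functions are assumptions made for illustration. Refer to the API documentation for the actual fields of EntityNormAttribute.

```python
from dataclasses import dataclass
from typing import Optional

# Toy knowledge base with made-up identifiers.
TOY_KB = {"asthma": "X001", "fever": "X002"}


@dataclass(frozen=True)
class NormInfo:
    """Hypothetical common structure for normalization results."""

    kb_name: str                   # knowledge base the identifier comes from
    kb_id: str                     # identifier of the concept in that base
    score: Optional[float] = None  # confidence, when the producer has one


def normalize_by_dictionary(term: str) -> Optional[NormInfo]:
    # Exact, case-insensitive lookup; no score is produced.
    kb_id = TOY_KB.get(term.lower())
    return NormInfo("toy-kb", kb_id) if kb_id else None


def normalize_by_prefix(term: str) -> Optional[NormInfo]:
    # Crude fuzzy matcher: accepts any term sharing a 4-letter prefix.
    for name, kb_id in TOY_KB.items():
        if term.lower()[:4] == name[:4]:
            score = len(name) / max(len(term), len(name))
            return NormInfo("toy-kb", kb_id, score=score)
    return None


# Downstream code reads the same fields whatever produced the result.
results = [normalize_by_dictionary("Asthma"), normalize_by_prefix("asthmatic")]
kb_ids = [r.kb_id for r in results if r is not None]
```

Because both producers return the same structure, the final list comprehension does not need to know which one was used.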

Spans#

medkit relies on the concept of spans to keep track of all text modifications made by the different operations.

Note

For more details about public APIs, refer to medkit.core.text.span.

medkit also provides a set of utilities for manipulating these spans, which is useful when implementing a new medkit operation.

Note

For more details about public APIs, refer to medkit.core.text.span_utils.
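
As a rough illustration of why spans matter, the sketch below replaces part of a text while recording where each piece of the new text came from, so that a position in the modified text can be mapped back to the original. It is a simplified model written for this page, not medkit's implementation: OrigSpan, ReplacedSpan, replace_range and to_original are hypothetical names, and the actual classes and helpers are described in medkit.core.text.span and medkit.core.text.span_utils.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class OrigSpan:
    """A stretch of text copied unchanged from the original document."""

    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass(frozen=True)
class ReplacedSpan:
    """New text of a given length that stands in for an original range."""

    length: int
    replaced: OrigSpan


AnySpan = Union[OrigSpan, ReplacedSpan]


def replace_range(text: str, start: int, end: int, new: str) -> Tuple[str, List[AnySpan]]:
    """Replace text[start:end] with `new`, recording the origin of each part."""
    spans: List[AnySpan] = []
    if start > 0:
        spans.append(OrigSpan(0, start))
    spans.append(ReplacedSpan(len(new), OrigSpan(start, end)))
    if end < len(text):
        spans.append(OrigSpan(end, len(text)))
    return text[:start] + new + text[end:], spans


def to_original(spans: List[AnySpan], pos: int) -> OrigSpan:
    """Map a character position in the modified text back to the original."""
    offset = 0
    for span in spans:
        if pos < offset + span.length:
            if isinstance(span, OrigSpan):
                orig_pos = span.start + (pos - offset)
                return OrigSpan(orig_pos, orig_pos + 1)
            # Inside replaced text: only the whole replaced range is known.
            return span.replaced
        offset += span.length
    raise IndexError(pos)


original = "Pt has HTN since 2010"
clean, spans = replace_range(original, 7, 10, "hypertension")
hit = to_original(spans, 9)  # a position inside "hypertension"
```

Here hit covers the original abbreviation "HTN", so an entity detected on the expanded text can still be located in the raw document.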

See also

You may also take a look at the spans notebook example.

Text utilities#

These utilities provide preconfigured patterns for preprocessing text documents non-destructively. They are not meant to be used directly, but rather inside a cleaning operation.

Note

For more details about public APIs, refer to medkit.core.text.utils.

See also

medkit provides the EDSCleaner class, which combines all these utilities to clean French documents (related to EDS documents coming from PDF).

Operations#

Abstract subclasses of Operation have been defined for text. They ease the development of text operations and are organized according to their run methods.

    classDiagram
        Operation <|-- ContextOperation
        Operation <|-- DocOperation
        Operation <|-- NEROperation
        Operation <|-- SegmentationOperation
        Operation <|-- _CustomTextOperation

Fig. 4 Operation hierarchy#

Note

For more details about public APIs, refer to medkit.core.text.operation.

The internal class _CustomTextOperation has been implemented so that users can call create_text_operation() to easily instantiate a custom text operation.
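
The general pattern, a factory that wraps a plain function in an operation object so that no subclass has to be written, can be sketched as follows. This is only an illustration of the idea: FunctionOperation and make_text_operation are hypothetical names, inputs are plain strings rather than medkit segments, and the actual signature of create_text_operation() is documented in medkit.core.text.operation.

```python
from typing import Callable, List


class FunctionOperation:
    """Hypothetical operation whose run() delegates to a plain function."""

    def __init__(self, function: Callable[[str], List[str]], name: str) -> None:
        self.function = function
        self.name = name

    def run(self, texts: List[str]) -> List[str]:
        # Apply the wrapped function to each input and flatten the results.
        outputs: List[str] = []
        for text in texts:
            outputs.extend(self.function(text))
        return outputs


def make_text_operation(
    function: Callable[[str], List[str]], name: str = ""
) -> FunctionOperation:
    """Hypothetical factory: build an operation without writing a subclass."""
    return FunctionOperation(function, name or function.__name__)


def split_on_semicolons(text: str) -> List[str]:
    # Example user function: one input text yields several output texts.
    return [part.strip() for part in text.split(";") if part.strip()]


splitter = make_text_operation(split_on_semicolons)
parts = splitter.run(["fever; cough", "no known allergy"])
```

The user only writes split_on_semicolons; the factory supplies the run method and a default name taken from the function.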

See also

You may refer to this tutorial for an example of how to define a custom operation.