medkit.core.text
APIs
To access these APIs, import them like this:
from medkit.core.text import <api_to_import>
Classes:
AnySpan |
ContextOperation | Abstract operation for context detection.
CustomTextOpType | Supported function types for creating custom text operations.
Entity | Text entity referencing part of a TextDocument.
EntityAttributeContainer | Manage a list of attributes attached to a text entity.
EntityNormAttribute | Normalization attribute linking an entity to an ID in a knowledge base
ModifiedSpan | Slice of text not present in the original text
NEROperation | Abstract operation for detecting entities.
Relation | Relation between two text entities.
Segment | Text segment referencing part of a TextDocument.
SegmentationOperation | Abstract operation for segmenting text.
Span | Slice of text extracted from the original text
TextAnnotation | Base abstract class for all text annotations
TextAnnotationContainer | Manage a list of text annotations belonging to a text document.
TextDocument | Document holding text annotations
UMLSNormAttribute | Normalization attribute linking an entity to a CUI in the UMLS knowledge base
Functions:
create_text_operation | Function for instantiating a custom text operation from a user-defined function
- class TextAnnotation(label, attrs=None, metadata=None, uid=None, attr_container_class=<class 'AttributeContainer'>)
Base abstract class for all text annotations
- Variables
uid (str) – Unique identifier of the annotation.
label (str) – The label for this annotation (e.g., SENTENCE)
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the annotation. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the annotation
keys (Set[str]) – Pipeline output keys to which the annotation belongs.
Methods:
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
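The lookup described above (find the subclass whose name was stored in a serialized dict, or None for the base class itself) can be sketched as follows. This is a minimal illustration that does not use medkit: the `Base` classes, the "class_name" key, and the choice to raise on an unknown name are all assumptions of the sketch, not the library's actual serialization format or behavior.

```python
from typing import Any, Dict, Iterator, Optional


class Base:
    """Hypothetical base class standing in for an annotation base class."""

    def to_dict(self) -> Dict[str, Any]:
        # "class_name" is a hypothetical key chosen for this sketch
        return {"class_name": type(self).__name__}

    @classmethod
    def _all_subclasses(cls) -> Iterator[type]:
        # Walk direct and indirect subclasses
        for sub in cls.__subclasses__():
            yield sub
            yield from sub._all_subclasses()

    @classmethod
    def get_subclass_for_data_dict(cls, data_dict: Dict[str, Any]) -> Optional[type]:
        class_name = data_dict["class_name"]
        if class_name == cls.__name__:
            # The dict was generated by the base class itself
            return None
        for sub in cls._all_subclasses():
            if sub.__name__ == class_name:
                return sub
        # Raising here is an assumption of this sketch
        raise ValueError(f"No subclass named {class_name!r}")


class Child(Base):
    pass


class GrandChild(Child):
    pass


found = Base.get_subclass_for_data_dict(GrandChild().to_dict())
same = Base.get_subclass_for_data_dict(Base().to_dict())
```

A caller would typically use the returned class to dispatch to the right from_dict() implementation when deserializing.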
- class Segment(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)
Text segment referencing part of a TextDocument.
- Variables
uid (str) – The segment identifier.
label (str) – The label for this segment (e.g., SENTENCE)
text (str) – Text of the segment.
spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the segment text correspond to which part of the document’s full text.
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the segment. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the segment
keys (Set[str]) – Pipeline output keys to which the segment belongs.
Methods:
from_dict(segment_dict): Creates a Segment from a dict
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
- classmethod from_dict(segment_dict)
Creates a Segment from a dict
- Parameters
segment_dict (dict) – A dictionary from a serialized segment as generated by to_dict()
- Return type
Self
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
- class Entity(label, text, spans, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'EntityAttributeContainer'>)
Text entity referencing part of a TextDocument.
- Variables
uid (str) – The entity identifier.
label (str) – The label for this entity (e.g., DISEASE)
text (str) – Text of the entity.
spans (List[medkit.core.text.span.AnySpan]) – List of spans indicating which parts of the entity text correspond to which part of the document’s full text.
attrs (medkit.core.text.entity_attribute_container.EntityAttributeContainer) – Attributes of the entity. Stored in an EntityAttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – The metadata of the entity
keys (Set[str]) – Pipeline output keys to which the entity belongs.
Methods:
from_dict(segment_dict): Creates a Segment from a dict
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
- classmethod from_dict(segment_dict)
Creates a Segment from a dict
- Parameters
segment_dict (dict) – A dictionary from a serialized segment as generated by to_dict()
- Return type
Self
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
- class Relation(label, source_id, target_id, attrs=None, metadata=None, uid=None, store=None, attr_container_class=<class 'AttributeContainer'>)
Relation between two text entities.
- Variables
uid (str) – The identifier of the relation
label (str) – The relation label
source_id (str) – The identifier of the entity from which the relation is defined
target_id (str) – The identifier of the entity to which the relation is defined
attrs (medkit.core.attribute_container.AttributeContainer) – The attributes of the relation
metadata (Dict[str, Any]) – The metadata of the relation
keys (Set[str]) – Pipeline output keys to which the relation belongs
Methods:
from_dict(relation_dict): Creates a Relation from a dict
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
- class TextAnnotationContainer(doc_id, raw_segment)
Manage a list of text annotations belonging to a text document.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
Also provides retrieval of entities, segments, and relations, and handling of the raw segment.
Instantiate the annotation container
- Parameters
doc_id (str) – Identifier of the document to which the annotations belong.
Attributes:
entities: Return the list of entities
relations: Return the list of relations
segments: Return the list of segments
Methods:
add(ann): Attach an annotation to the document.
get(*[, label, key]): Return a list of the annotations of the document, optionally filtering by label or key.
get_by_id(uid): Return the annotation corresponding to a specific identifier.
get_entities(*[, label, key]): Return a list of the entities of the document, optionally filtering by label or key.
get_ids(*[, label, key]): Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.
get_relations(*[, label, key, source_id]): Return a list of the relations of the document, optionally filtering by label, key or source entity.
get_segments(*[, label, key]): Return a list of the segments of the document (not including entities), optionally filtering by label or key.
- property segments: List[medkit.core.text.annotation.Segment]
Return the list of segments
- Return type
List[Segment]
- property entities: List[medkit.core.text.annotation.Entity]
Return the list of entities
- Return type
List[Entity]
- property relations: List[medkit.core.text.annotation.Relation]
Return the list of relations
- Return type
List[Relation]
- add(ann)
Attach an annotation to the document.
- Parameters
ann (TextAnnotation) – Annotation to add.
- Raises
ValueError – If the annotation is already attached to the document (based on annotation.uid)
- get(*, label=None, key=None)
Return a list of the annotations of the document, optionally filtering by label or key.
- Parameters
label (Optional[str]) – Label to use to filter annotations.
key (Optional[str]) – Key to use to filter annotations.
- Return type
List[TextAnnotation]
- get_by_id(uid)
Return the annotation corresponding to a specific identifier.
- Parameters
uid – Identifier of the annotation to return.
- get_segments(*, label=None, key=None)
Return a list of the segments of the document (not including entities), optionally filtering by label or key.
- Parameters
label (Optional[str]) – Label to use to filter segments.
key (Optional[str]) – Key to use to filter segments.
- Return type
List[Segment]
- get_entities(*, label=None, key=None)
Return a list of the entities of the document, optionally filtering by label or key.
- Parameters
label (Optional[str]) – Label to use to filter entities.
key (Optional[str]) – Key to use to filter entities.
- Return type
List[Entity]
- get_relations(*, label=None, key=None, source_id=None)
Return a list of the relations of the document, optionally filtering by label, key or source entity.
- Parameters
label (Optional[str]) – Label to use to filter relations.
key (Optional[str]) – Key to use to filter relations.
source_id (Optional[str]) – Identifier of the source entity to use to filter relations.
- Return type
List[Relation]
- get_ids(*, label=None, key=None)
Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.
This method is provided to make it easier to implement additional filtering in subclasses.
- Parameters
label (Optional[str]) – Label to use to filter annotations.
key (Optional[str]) – Key to use to filter annotations.
- Return type
Iterator[str]
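To make the list-like behavior and the label/key filtering described for TextAnnotationContainer more concrete, here is a minimal stand-in written without medkit. `Ann` and `SketchContainer` are hypothetical names, and the internals (a dict keyed by identifier, with get() built on top of get_ids()) are an assumption of this sketch rather than the library's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class Ann:
    """Hypothetical minimal annotation: identifier, label and pipeline output keys."""

    uid: str
    label: str
    keys: Set[str] = field(default_factory=set)


class SketchContainer:
    """Hypothetical list-like container with label/key filtering."""

    def __init__(self) -> None:
        self._anns: Dict[str, Ann] = {}

    def add(self, ann: Ann) -> None:
        # Reject annotations that are already attached, based on their identifier
        if ann.uid in self._anns:
            raise ValueError(f"Annotation {ann.uid} is already attached")
        self._anns[ann.uid] = ann

    def get_ids(self, *, label: Optional[str] = None, key: Optional[str] = None) -> Iterator[str]:
        # Yield identifiers lazily so subclasses can layer extra filtering on top
        for uid, ann in self._anns.items():
            if label is not None and ann.label != label:
                continue
            if key is not None and key not in ann.keys:
                continue
            yield uid

    def get(self, *, label: Optional[str] = None, key: Optional[str] = None) -> List[Ann]:
        return [self._anns[uid] for uid in self.get_ids(label=label, key=key)]

    def __len__(self) -> int:
        return len(self._anns)

    def __iter__(self) -> Iterator[Ann]:
        return iter(self._anns.values())


container = SketchContainer()
container.add(Ann("a1", "SENTENCE", {"sentences"}))
container.add(Ann("a2", "DISEASE", {"entities"}))
container.add(Ann("a3", "DISEASE"))

diseases = container.get(label="DISEASE")
keyed = list(container.get_ids(key="entities"))
```

Routing get() through get_ids() mirrors the remark above that get_ids() exists to ease additional filtering in subclasses.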
- class TextDocument(text, anns=None, attrs=None, metadata=None, uid=None)
Document holding text annotations
Annotations must be subclasses of TextAnnotation.
- Variables
uid (str) – Unique identifier of the document.
text – Full document text.
anns (medkit.core.text.annotation_container.TextAnnotationContainer) – Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the document. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – Document metadata.
raw_segment (medkit.core.text.annotation.Segment) – Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
Methods:
from_dict(doc_dict): Creates a TextDocument from a dict
from_dir(path[, pattern, encoding]): Create documents from text files in a directory
from_file(path[, encoding]): Create a document from a text file
get_snippet(segment, max_extend_length): Return a portion of the original text containing the annotation
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
- classmethod from_dict(doc_dict)
Creates a TextDocument from a dict
- Parameters
doc_dict (dict) – A dictionary from a serialized TextDocument as generated by to_dict()
- Return type
Self
- classmethod from_file(path, encoding='utf-8')
Create a document from a text file
- Parameters
path (PathLike) – Path of the text file
encoding (Optional[str]) – Text encoding to use
- Return type
Self
- Returns
TextDocument – Text document with contents of path as text. The file path is included in the document metadata.
- classmethod from_dir(path, pattern='*.txt', encoding='utf-8')
Create documents from text files in a directory
- Parameters
path (PathLike) – Path of the directory containing text files
pattern (str) – Glob pattern to match text files in path
encoding (Optional[str]) – Text encoding to use
- Return type
List[Self]
- Returns
List[TextDocument] – Text documents with contents of each file as text
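The effect of the pattern and encoding parameters of from_dir can be illustrated with the standard library alone. The helper below is a hypothetical sketch, not medkit code: it returns plain (file name, contents) pairs instead of documents, and sorting the matched files is an assumption made here only to get a deterministic order.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_text_files(path, pattern: str = "*.txt", encoding: str = "utf-8") -> List[Tuple[str, str]]:
    """Return (file name, contents) pairs for the files in `path` matching `pattern`."""
    # Sorting is an assumption of this sketch, to make the order deterministic
    files = sorted(Path(path).glob(pattern))
    return [(f.name, f.read_text(encoding=encoding)) for f in files]


with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "a.txt").write_text("first note", encoding="utf-8")
    Path(tmp_dir, "b.txt").write_text("second note", encoding="utf-8")
    # Does not match the default "*.txt" pattern, so it is skipped
    Path(tmp_dir, "ignored.csv").write_text("x,y", encoding="utf-8")
    docs = read_text_files(tmp_dir)
```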
- get_snippet(segment, max_extend_length)
Return a portion of the original text containing the annotation
- Parameters
segment (Segment) – The annotation
max_extend_length (int) – Maximum number of characters to use around the annotation
- Return type
str
- Returns
str – A portion of the text around the annotation
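One plausible reading of max_extend_length is that the snippet extends the annotation by at most that many characters on each side, clamped to the bounds of the text. The function below sketches that reading for a single contiguous span. It is a hypothetical helper working on raw character offsets, not the library's implementation, which takes a Segment and may treat multi-span annotations differently.

```python
def snippet_around(text: str, start: int, end: int, max_extend_length: int) -> str:
    """Return text[start:end] extended by up to max_extend_length characters per side."""
    # Clamp to the bounds of the text
    snippet_start = max(0, start - max_extend_length)
    snippet_end = min(len(text), end + max_extend_length)
    return text[snippet_start:snippet_end]


text = "The patient has diabetes and asthma."
# Characters 16 to 24 cover the word "diabetes"
snippet = snippet_around(text, 16, 24, max_extend_length=4)
whole = snippet_around(text, 16, 24, max_extend_length=100)
```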
- class EntityAttributeContainer(owner_id)
Manage a list of attributes attached to a text entity.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
Also provides retrieval of normalization attributes.
Attributes:
norms: Return the list of normalization attributes
Methods:
add(attr): Attach an attribute to the annotation.
get(*[, label]): Return a list of the attributes of the annotation, optionally filtering by label.
get_by_id(uid): Return the attribute corresponding to a specific identifier.
get_norms(): Return a list of the normalization attributes of the annotation
- property norms: List[medkit.core.text.entity_norm_attribute.EntityNormAttribute]
Return the list of normalization attributes
- Return type
List[EntityNormAttribute]
- add(attr)
Attach an attribute to the annotation.
- Parameters
attr (Attribute) – Attribute to add.
- Raises
ValueError – If the attribute is already attached to the annotation (based on attr.uid).
- get_norms()
Return a list of the normalization attributes of the annotation
- Return type
List[EntityNormAttribute]
- class EntityNormAttribute(kb_name, kb_id, kb_version=None, term=None, score=None, metadata=None, uid=None)
Normalization attribute linking an entity to an ID in a knowledge base
- Variables
uid (str) – Identifier of the attribute
label (str) – The attribute label, always set to EntityNormAttribute.LABEL
value (Optional[Any]) – String representation of the normalization, containing kb_id, along with kb_name if available (ex: “umls:C0011849”). For special cases where only term is available, it is used as value.
kb_name (Optional[str]) – Name of the knowledge base (ex: “icd”). Should always be provided except in special cases when we just want to store a normalized term.
kb_id (Optional[Any]) – ID in the knowledge base to which the annotation should be linked. Should always be provided except in special cases when we just want to store a normalized term.
kb_version (Optional[str]) – Optional version of the knowledge base.
term (Optional[str]) – Optional normalized version of the entity text.
score (Optional[float]) – Optional score reflecting confidence of this link.
metadata (Dict[str, Any]) – Metadata of the attribute
Attributes:
LABEL: Label used for all normalization attributes
Methods:
copy(): Create a new attribute that is a copy of the current instance, but with a new identifier
from_dict(data_dict): Creates an Attribute from a dict
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
to_brat(): Return a value compatible with the brat format
to_spacy(): Return a value compatible with spaCy
- LABEL: ClassVar[str] = 'NORMALIZATION'
Label used for all normalization attributes
- copy()
Create a new attribute that is a copy of the current instance, but with a new identifier
This is used when we want to duplicate an existing attribute onto a different annotation.
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
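The rules given for the value variable of EntityNormAttribute (kb_id, prefixed with kb_name when available, falling back to term when only a term is known) can be written out as a small function. This is a sketch of the documented rules only. `norm_value` is a hypothetical helper, and raising ValueError when neither kb_id nor term is given is an assumption, not confirmed library behavior.

```python
from typing import Any, Optional


def norm_value(kb_name: Optional[str], kb_id: Optional[Any], term: Optional[str] = None) -> str:
    """Build a string value following the rules described for `value` (a sketch)."""
    if kb_id is not None:
        # Prefix the ID with the knowledge base name when it is known
        return f"{kb_name}:{kb_id}" if kb_name is not None else str(kb_id)
    if term is not None:
        # Special case: only a normalized term is available
        return term
    # Assumption of this sketch: at least one of kb_id or term is required
    raise ValueError("Either kb_id or term must be provided")


umls_value = norm_value("umls", "C0011849")
term_only = norm_value(None, None, term="diabetes")
```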
- class ContextOperation(uid=None, name=None, **kwargs)
Abstract operation for context detection. It takes a list of segments as input and creates attributes that are appended directly to these segments.
- Common initialization for all annotators:
assigning an identifier to the operation
storing the class name, name and config in the description
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
Attributes:
description: Contains all the operation init parameters.
Methods:
set_prov_tracer(prov_tracer): Enable provenance tracing.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
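The locals()-based pattern from the Examples above, which also applies to the other operation base classes on this page, is easier to follow with a complete toy class. The sketch below does not use medkit: `SketchOperation` and `NegationDetector` are hypothetical, and the description is stored as a plain dict rather than an OperationDescription. It also copies locals() and discards the `__class__` entry that Python adds to the local namespace of methods using zero-argument super(); whether the real base class needs that step is not stated here.

```python
import uuid
from typing import Any, Dict, Optional


class SketchOperation:
    """Hypothetical stand-in for an operation base class."""

    def __init__(self, uid: Optional[str] = None, name: Optional[str] = None, **kwargs: Any):
        # Assign an identifier to the operation
        self.uid = uid if uid is not None else str(uuid.uuid4())
        # Store the class name, name and config in a description
        self.description: Dict[str, Any] = {
            "class_name": type(self).__name__,
            "name": name if name is not None else type(self).__name__,
            "config": kwargs,
        }


class NegationDetector(SketchOperation):
    def __init__(self, output_label: str = "negation", window: int = 5, uid: Optional[str] = None):
        # Capture every init argument so the base class can record it
        init_args = dict(locals())
        init_args.pop("self")
        # Present because this method uses zero-argument super()
        init_args.pop("__class__", None)
        super().__init__(**init_args)
        self.output_label = output_label
        self.window = window


op = NegationDetector(window=3)
```

Capturing the arguments this way means a new init parameter is recorded in the description without any further bookkeeping in the subclass.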
- class NEROperation(uid=None, name=None, **kwargs)
Abstract operation for detecting entities. It uses a list of segments as input and produces a list of detected entities.
- Common initialization for all annotators:
assigning an identifier to the operation
storing the class name, name and config in the description
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
Attributes:
description: Contains all the operation init parameters.
Methods:
set_prov_tracer(prov_tracer): Enable provenance tracing.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- class SegmentationOperation(uid=None, name=None, **kwargs)
Abstract operation for segmenting text. It uses a list of segments as input and produces a list of new segments.
- Common initialization for all annotators:
assigning an identifier to the operation
storing the class name, name and config in the description
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
Attributes:
description: Contains all the operation init parameters.
Methods:
set_prov_tracer(prov_tracer): Enable provenance tracing.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- class CustomTextOpType(value)
Supported function types for creating custom text operations.
Attributes:
CREATE_ONE_TO_N: Take 1 data item, return N new data items.
EXTRACT_ONE_TO_N: Take 1 data item, return N existing data items.
FILTER: Take 1 data item, return True or False.
- CREATE_ONE_TO_N = 1
Take 1 data item, return N new data items.
- EXTRACT_ONE_TO_N = 2
Take 1 data item, return N existing data items.
- FILTER = 3
Take 1 data item, return True or False.
- create_text_operation(function, function_type, name=None, args=None)
Function for instantiating a custom text operation from a user-defined function
- Parameters
function (Callable) – User-defined function
function_type (CustomTextOpType) – Type of function. Supported values are defined in CustomTextOpType
name (Optional[str]) – Name of the operation used for provenance info (default: function name)
args (Optional[Dict]) – Dictionary containing the arguments of the function if any.
- Return type
_CustomTextOperation
- Returns
operation – An instance of a custom text operation
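The three function types accepted by create_text_operation differ only in the shape of the user-defined function. The sketch below shows one function of each shape on a toy data type. It does not call medkit: `Item` is a hypothetical stand-in for a segment or entity, and how the library wraps such functions or records provenance is not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Hypothetical data item standing in for a segment or entity."""

    text: str
    children: List["Item"] = field(default_factory=list)


def split_words(item: Item) -> List[Item]:
    # CREATE_ONE_TO_N shape: 1 item in, N newly created items out
    return [Item(word) for word in item.text.split()]


def get_children(item: Item) -> List[Item]:
    # EXTRACT_ONE_TO_N shape: 1 item in, N already existing items out
    return item.children


def is_long(item: Item) -> bool:
    # FILTER shape: 1 item in, True or False out
    return len(item.text) > 10


child = Item("fever")
sentence = Item("Patient reports fever", children=[child])

created = split_words(sentence)
extracted = get_children(sentence)
kept = [item for item in [sentence, child] if is_long(item)]
```

The distinction between the first two shapes is identity rather than content: `created` holds fresh objects, while `extracted` returns objects that already existed before the call.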
- class Span(start, end)
Slice of text extracted from the original text
- Parameters
start (int) – Index of the first character in the original text
end (int) – Index of the last character in the original text, plus one
Methods:
from_dict(span_dict): Creates a Span from a dict
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
overlaps(other): Test if 2 spans reference at least one character in common
- classmethod from_dict(span_dict)
Creates a Span from a dict
- Parameters
span_dict (dict) – A dictionary from a serialized span as generated by to_dict()
- Return type
Self
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
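Because end is the index of the last character plus one, a Span behaves like a half-open range, the same convention as Python slicing. The stand-in below illustrates that convention and one way to test whether two spans share at least one character. `SimpleSpan` is a hypothetical class and does not use medkit; the real overlaps() may be implemented differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleSpan:
    """Hypothetical stand-in for a span with half-open bounds [start, end)."""

    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start

    def overlaps(self, other: "SimpleSpan") -> bool:
        # Two half-open ranges share a character when each starts before the other ends
        return self.start < other.end and other.start < self.end


text = "The patient has diabetes."
span = SimpleSpan(16, 24)
covered = text[span.start:span.end]

touching = SimpleSpan(0, 5).overlaps(SimpleSpan(5, 8))
crossing = SimpleSpan(0, 5).overlaps(SimpleSpan(4, 8))
```

Two spans that merely touch, such as (0, 5) and (5, 8), have no character in common under this convention.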
- class ModifiedSpan(length, replaced_spans)
Slice of text not present in the original text
- Parameters
length (int) – Number of characters
replaced_spans (List[medkit.core.text.span.Span]) – Slices of the original text that this span is replacing
Methods:
from_dict(modified_span_dict): Creates a ModifiedSpan from a dict
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
- classmethod from_dict(modified_span_dict)
Creates a ModifiedSpan from a dict
- Parameters
modified_span_dict (dict) – A dictionary from a serialized ModifiedSpan as generated by to_dict()
- Return type
Self
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
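A ModifiedSpan is useful when a segment's text no longer matches the document character for character, for example after an abbreviation has been expanded. The sketch below shows how a list mixing plain and modified spans could describe such a segment. `SketchSpan` and `SketchModifiedSpan` are hypothetical stand-ins that mirror only the documented parameters, and the example text is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class SketchSpan:
    """Hypothetical slice of the original text, half-open [start, end)."""

    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass(frozen=True)
class SketchModifiedSpan:
    """Hypothetical slice of text that is not present in the original text."""

    length: int
    replaced_spans: List[SketchSpan]


original = "pt has DM"
modified = "pt has diabetes mellitus"

spans: List[Union[SketchSpan, SketchModifiedSpan]] = [
    # "pt has " is taken unchanged from the original text
    SketchSpan(0, 7),
    # "diabetes mellitus" (17 characters) replaces "DM" at characters 7 to 9
    SketchModifiedSpan(length=17, replaced_spans=[SketchSpan(7, 9)]),
]

total_length = sum(span.length for span in spans)
replaced_text = original[7:9]
```

The span lengths add up to the length of the modified text, while replaced_spans keeps the link back to the characters of the original document.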
- class AnySpan
Methods:
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
- class UMLSNormAttribute(cui, umls_version, term=None, score=None, sem_types=None, metadata=None, uid=None)
Normalization attribute linking an entity to a CUI in the UMLS knowledge base
- Variables
uid – Identifier of the attribute
label – The attribute label, always set to EntityNormAttribute.LABEL
value – CUI prefixed with “umls:” (ex: “umls:C0011849”)
kb_name – Name of the knowledge base. Always “umls”
kb_id – CUI (Concept Unique Identifier) to which the annotation should be linked
cui – Convenience alias of kb_id
kb_version – Version of the UMLS database (ex: “202AB”)
umls_version – Convenience alias of kb_version
term – Optional normalized version of the entity text
score – Optional score reflecting confidence of this link
sem_types (Optional[List[str]]) – Optional IDs of semantic types of the CUI (ex: [“T047”])
metadata – Metadata of the attribute
Methods:
copy(): Create a new attribute that is a copy of the current instance, but with a new identifier
from_dict(data): Creates an Attribute from a dict
get_subclass_for_data_dict(data_dict): Return the subclass that corresponds to the class name found in a data dict
to_brat(): Return a value compatible with the brat format
to_spacy(): Return a value compatible with spaCy
- copy()
Create a new attribute that is a copy of the current instance, but with a new identifier
This is used when we want to duplicate an existing attribute onto a different annotation.
- classmethod from_dict(data)
Creates an Attribute from a dict
- Parameters
data (dict) – A dictionary from a serialized Attribute as generated by to_dict()
- Return type
Self
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
- to_brat()
Return a value compatible with the brat format
- Return type
str
- to_spacy()
Return a value compatible with spaCy
- Return type
str