medkit.core.text.document
Classes:
TextDocument(text[, anns, attrs, metadata, uid]) – Document holding text annotations
- class TextDocument(text, anns=None, attrs=None, metadata=None, uid=None)
Document holding text annotations
Annotations must be subclasses of TextAnnotation.
- Variables
uid (str) – Unique identifier of the document.
text – Full document text.
anns (medkit.core.text.annotation_container.TextAnnotationContainer) – Annotations of the document. Stored in a TextAnnotationContainer but can be passed as a list at init.
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the document. Stored in an AttributeContainer but can be passed as a list at init.
metadata (Dict[str, Any]) – Document metadata.
raw_segment (medkit.core.text.annotation.Segment) – Auto-generated segment containing the full unprocessed document text. To get the raw text as an annotation to pass to processing operations:
>>> doc = TextDocument(text="hello")
>>> raw_text = doc.anns.get(label=TextDocument.RAW_LABEL)[0]
Methods:
from_dict(doc_dict) – Creates a TextDocument from a dict
from_dir(path[, pattern, encoding]) – Create documents from text files in a directory
from_file(path[, encoding]) – Create a document from a text file
get_snippet(segment, max_extend_length) – Return a portion of the original text containing the annotation
get_subclass_for_data_dict(data_dict) – Return the subclass that corresponds to the class name found in a data dict
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
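The lookup described above can be illustrated with a minimal, standalone sketch. It does not use medkit and is not its implementation: `BaseDoc`, `NoteDoc`, the `"_class_name"` key, and the `ValueError` raised for unknown names are all assumptions made for illustration only.

```python
from typing import Any, Dict, Optional, Type


class BaseDoc:
    """Hypothetical base class standing in for a class such as TextDocument."""

    # Assumed key under which to_dict() records the class name.
    CLASS_NAME_KEY = "_class_name"

    def to_dict(self) -> Dict[str, Any]:
        # Record the concrete class name so the right class can be found later.
        return {self.CLASS_NAME_KEY: type(self).__name__}

    @classmethod
    def get_subclass_for_data_dict(
        cls, data_dict: Dict[str, Any]
    ) -> Optional[Type["BaseDoc"]]:
        class_name = data_dict[cls.CLASS_NAME_KEY]
        # The dict was produced by the base class itself: no subclass to return.
        if class_name == cls.__name__:
            return None
        # Search direct and indirect subclasses for a matching name.
        pending = list(cls.__subclasses__())
        while pending:
            subclass = pending.pop()
            if subclass.__name__ == class_name:
                return subclass
            pending.extend(subclass.__subclasses__())
        # Assumption: unknown names are treated as an error in this sketch.
        raise ValueError(f"No subclass named {class_name!r}")


class NoteDoc(BaseDoc):
    """Hypothetical subclass used to exercise the lookup."""


base_result = BaseDoc.get_subclass_for_data_dict(BaseDoc().to_dict())
note_result = BaseDoc.get_subclass_for_data_dict(NoteDoc().to_dict())
```

Here `base_result` is `None` and `note_result` is `NoteDoc`, which matches the documented return values. A caller can then delegate deserialization to the returned subclass when one is found.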
- classmethod from_dict(doc_dict)
Creates a TextDocument from a dict
- Parameters
doc_dict (dict) – A dictionary from a serialized TextDocument as generated by to_dict()
- Return type
Self
- classmethod from_file(path, encoding='utf-8')
Create a document from a text file
- Parameters
path (PathLike) – Path of the text file
encoding (Optional[str]) – Text encoding to use
- Return type
Self
- Returns
TextDocument – Text document with contents of path as text. The file path is included in the document metadata.
- classmethod from_dir(path, pattern='*.txt', encoding='utf-8')
Create documents from text files in a directory
- Parameters
path (PathLike) – Path of the directory containing text files
pattern (str) – Glob pattern to match text files in path
encoding (Optional[str]) – Text encoding to use
- Return type
List[Self]
- Returns
List[TextDocument] – Text documents with contents of each file as text
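To make the roles of path, pattern and encoding concrete, here is a rough standard-library sketch of glob-based loading. It is not medkit code: `read_text_files` is a hypothetical helper, and sorting the matches for a stable order is an assumption rather than documented behavior of from_dir.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_text_files(
    path, pattern: str = "*.txt", encoding: str = "utf-8"
) -> List[Tuple[Path, str]]:
    """Hypothetical helper: return (file path, text) pairs for matching files."""
    directory = Path(path)
    # Assumption: sort matches so the result order is deterministic.
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    return [(p, p.read_text(encoding=encoding)) for p in matches]


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first note", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second note", encoding="utf-8")
    Path(tmp, "c.md").write_text("not matched by *.txt", encoding="utf-8")
    loaded = read_text_files(tmp)
    loaded_names = [p.name for p, _ in loaded]
    loaded_texts = [text for _, text in loaded]
```

Only the two files matching `*.txt` are read. With the real class, each text would presumably become one TextDocument in the returned list.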
- get_snippet(segment, max_extend_length)
Return a portion of the original text containing the annotation
- Parameters
segment (Segment) – The annotation
max_extend_length (int) – Maximum number of characters to use around the annotation
- Return type
str
- Returns
str – A portion of the text around the annotation
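The idea of a snippet bounded by max_extend_length can be sketched with plain string slicing. This is an illustration only: `snippet_around` is a hypothetical function that takes character offsets instead of a Segment, and splitting the extra characters evenly before and after the annotation is an assumption, not necessarily how get_snippet distributes them.

```python
def snippet_around(text: str, start: int, end: int, max_extend_length: int) -> str:
    """Hypothetical helper: return text[start:end] plus surrounding context.

    At most max_extend_length extra characters are added in total.
    """
    # Assumption: split the context budget evenly between both sides.
    half = max_extend_length // 2
    # Clamp to the bounds of the text so slicing never wraps around.
    snippet_start = max(start - half, 0)
    snippet_end = min(end + half, len(text))
    return text[snippet_start:snippet_end]


text = "The patient reports mild asthma since childhood."
start = text.index("asthma")
end = start + len("asthma")
snippet = snippet_around(text, start, end, max_extend_length=10)
```

The resulting snippet always contains the annotated text and is never longer than the annotation plus max_extend_length characters. Near the start or end of the document it is shorter, because the window is clamped to the text bounds.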