medkit.core.text.document

Classes:

TextDocument(text[, anns, attrs, metadata, uid])

Document holding text annotations

class TextDocument(text, anns=None, attrs=None, metadata=None, uid=None)

Document holding text annotations

Annotations must be subclasses of TextAnnotation.

Variables

Methods:

from_dict(doc_dict)

Create a TextDocument from a dict

from_dir(path[, pattern, encoding])

Create documents from text files in a directory

from_file(path[, encoding])

Create a document from a text file

get_snippet(segment, max_extend_length)

Return a portion of the original text containing the annotation

get_subclass_for_data_dict(data_dict)

Return the subclass that corresponds to the class name found in a data dict

classmethod get_subclass_for_data_dict(data_dict)

Return the subclass that corresponds to the class name found in a data dict

Parameters

data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)

Return type

Optional[Type[Self]]

Returns

subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
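The lookup described above can be illustrated with a small standalone sketch. This is not medkit's implementation: the `Document` and `NoteDocument` classes, the `"class_name"` key, and the choice to raise `ValueError` for an unknown name are all assumptions made for the example. It only shows the general idea of resolving a subclass from the class name stored in a data dict, and of returning None when the dict comes from the base class itself.

```python
from typing import Any, Dict, Optional, Type


class Document:
    """Hypothetical base class standing in for a serializable document type."""

    def to_dict(self) -> Dict[str, Any]:
        # "class_name" is a made-up key for this sketch, not medkit's format
        return {"class_name": type(self).__name__}

    @classmethod
    def get_subclass_for_data_dict(
        cls, data_dict: Dict[str, Any]
    ) -> Optional[Type["Document"]]:
        class_name = data_dict["class_name"]
        if class_name == cls.__name__:
            # The dict was produced by the base class itself
            return None
        # Walk direct and indirect subclasses looking for a matching name
        pending = list(cls.__subclasses__())
        while pending:
            subclass = pending.pop()
            if subclass.__name__ == class_name:
                return subclass
            pending.extend(subclass.__subclasses__())
        # Design choice of this sketch: unknown names are an error
        raise ValueError(f"No subclass of {cls.__name__} named {class_name!r}")


class NoteDocument(Document):
    """Hypothetical subclass used to exercise the lookup."""


resolved = Document.get_subclass_for_data_dict(NoteDocument().to_dict())
base_case = Document.get_subclass_for_data_dict(Document().to_dict())
```

Here `resolved` is the `NoteDocument` class and `base_case` is None, matching the two cases listed under Returns.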

classmethod from_dict(doc_dict)

Create a TextDocument from a dict

Parameters

doc_dict (dict) – A dictionary from a serialized TextDocument as generated by to_dict()

Return type

Self

classmethod from_file(path, encoding='utf-8')

Create a document from a text file

Parameters
  • path (PathLike) – Path of the text file

  • encoding (Optional[str]) – Text encoding to use

Return type

Self

Returns

TextDocument – Text document with contents of path as text. The file path is included in the document metadata.
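As a rough illustration of the behaviour described above, the following standalone sketch reads a text file with an explicit encoding and keeps the file path next to the text. It uses only the standard library and does not call medkit. The `load_text` helper and the `"source_path"` key are hypothetical; the metadata key that `from_file` actually uses is not given on this page.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Read a text file and return (text, metadata), keeping the source path."""
    path = Path(path)
    text = path.read_text(encoding=encoding)
    # "source_path" is a made-up key for this sketch, not medkit's metadata key
    metadata = {"source_path": str(path.absolute())}
    return text, metadata


with tempfile.TemporaryDirectory() as tmp_dir:
    note_path = Path(tmp_dir) / "note.txt"
    note_path.write_text("Patient présente une toux sèche.", encoding="utf-8")
    text, metadata = load_text(note_path, encoding="utf-8")
```

With medkit installed, the corresponding call would be `TextDocument.from_file(note_path, encoding="utf-8")`, which returns a document rather than a tuple.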

classmethod from_dir(path, pattern='*.txt', encoding='utf-8')

Create documents from text files in a directory

Parameters
  • path (PathLike) – Path of the directory containing text files

  • pattern (str) – Glob pattern to match text files in path

  • encoding (Optional[str]) – Text encoding to use

Return type

List[Self]

Returns

List[TextDocument] – Text documents with contents of each file as text
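The role of the pattern parameter can be shown with a standalone sketch that globs a directory and reads each matching file. It uses only the standard library and is not medkit's code. The `read_matching_files` helper is hypothetical, and the sorted order is an assumption of the sketch: this page does not say in which order `from_dir` returns documents.

```python
import tempfile
from pathlib import Path


def read_matching_files(dir_path, pattern="*.txt", encoding="utf-8"):
    """Return the contents of every file in dir_path matching the glob pattern."""
    dir_path = Path(dir_path)
    # sorted() makes the order deterministic; medkit's ordering is not documented here
    return [p.read_text(encoding=encoding) for p in sorted(dir_path.glob(pattern))]


with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "a.txt").write_text("first note", encoding="utf-8")
    (root / "b.txt").write_text("second note", encoding="utf-8")
    # This file does not match "*.txt" and is skipped
    (root / "ignored.csv").write_text("x,y", encoding="utf-8")
    texts = read_matching_files(root, pattern="*.txt")
```

With medkit installed, the corresponding call would be `TextDocument.from_dir(root, pattern="*.txt")`, which returns a list of documents instead of raw strings.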

get_snippet(segment, max_extend_length)

Return a portion of the original text containing the annotation

Parameters
  • segment (Segment) – The annotation

  • max_extend_length (int) – Maximum number of characters to use around the annotation

Return type

str

Returns

str – A portion of the text around the annotation
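The idea of extracting context around an annotation can be sketched on plain character offsets. This is a simplified stand-in, not medkit's implementation. The `snippet_around` helper is hypothetical, and it assumes that max_extend_length applies to each side of the annotation; this page does not say whether the limit is per side or shared between both sides.

```python
def snippet_around(text, start, end, max_extend_length):
    """Return text[start:end] plus up to max_extend_length characters on each side."""
    # Clamp the window to the bounds of the text
    snippet_start = max(start - max_extend_length, 0)
    snippet_end = min(end + max_extend_length, len(text))
    return text[snippet_start:snippet_end]


text = "The patient reports a persistent dry cough since Monday."
start = text.index("dry cough")
end = start + len("dry cough")

snippet = snippet_around(text, start, end, max_extend_length=11)
# A very large limit simply returns the whole text
whole = snippet_around(text, start, end, max_extend_length=1000)
```

With medkit installed, the corresponding call would be `doc.get_snippet(segment, max_extend_length)`, where the offsets come from the spans of segment rather than being passed explicitly.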