medkit.text.preprocessing#

APIs#

To access these APIs, use an import of the form:

from medkit.text.preprocessing import <api_to_import>
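
For example, to use the character replacer and the EDS cleaner documented below:

from medkit.text.preprocessing import CharReplacer, EDSCleaner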

Classes:

CharReplacer(output_label[, rules, name, uid])

Generic character replacer to be used as a pre-processing module

DuplicateFinder(output_label[, ...])

Detect duplicated chunks of text across a collection of text documents, relying on the duptextfinder library.

DuplicationAttribute(value[, source_doc_id, ...])

Attribute indicating whether some text is a duplicate of some other text in another document

EDSCleaner([output_label, keep_endlines, ...])

EDS pre-processing annotation module

RegexpReplacer(output_label[, rules, name, uid])

Generic pattern replacer to be used as a pre-processing module

class CharReplacer(output_label, rules=None, name=None, uid=None)[source]#

Generic character replacer to be used as a pre-processing module

This non-destructive module replaces selected single characters with the desired replacement strings. It preserves span information by creating, for each input segment, a new text-bound annotation that carries the span modifications relative to the input text.

Parameters
  • output_label (str) – The output label of the created annotations

  • rules (Optional[List[Tuple[str, str]]]) – The list of replacement rules, each a (character, replacement) tuple. Default: ALL_CHAR_RULES

  • name (Optional[str]) – Name describing the pre-processing module (defaults to the class name)

  • uid (str) – Identifier of the pre-processing module
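
A minimal usage sketch (the sample text and the replacement rule are illustrative, not part of the API):

from medkit.core.text import TextDocument
from medkit.text.preprocessing import CharReplacer

doc = TextDocument(text="Weight gain: +2 kg since last visit")
# illustrative rule: replace each "+" character with the word "plus"
replacer = CharReplacer(output_label="clean_text", rules=[("+", " plus ")])
clean_segments = replacer.run([doc.raw_segment])
print(clean_segments[0].text)  # text with "+" replaced, span information preserved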

Methods:

run(segments)

Run the module on a list of input segments and return a new list of segments

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)[source]#

Run the module on a list of input segments and return a new list of segments

Parameters

segments (List[Segment]) – List of segments to process

Return type

List[Segment]

Returns

List[Segment] – List of new segments

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class DuplicateFinder(output_label, segments_to_output='dup', min_duplicate_length=5, fingerprint_type='word', fingerprint_length=2, date_metadata_key=None, case_sensitive=True, allow_multiline=True, orf=1)[source]#

Detect duplicated chunks of text across a collection of text documents, relying on the duptextfinder library.

When a duplicated chunk of text is found, a segment is created on the newest document covering the span that is duplicated. A DuplicationAttribute having “is_duplicate” as label and True as value is attached to the segment. It can later be propagated to the entities created from the duplicate segments.

The attribute also holds the id of the source document from which the text was copied, the spans of the text in the source document, and optionally the date of the source document if provided.

Optionally, segments can also be created for non-duplicate zones to make it easier to process only those parts of the documents. For these segments, the attribute value is False and the source, spans and date fields are None.

NB: better performance may be achieved by installing the ncls Python package, which will then be used by the duptextfinder library.

Parameters
  • output_label (str) – Label of created segments

  • segments_to_output (Literal['dup', 'nondup', 'both']) – Type of segments to create: only duplicate segments (“dup”), only non-duplicate segments (“nondup”), or both (“both”)

  • min_duplicate_length (int) – Minimum length of duplicated segments, in characters (shorter segments will be discarded)

  • fingerprint_type (Literal['char', 'word']) – Base unit to use for fingerprinting (either “char” or “word”)

  • fingerprint_length (int) – Number of chars or words in each fingerprint. If fingerprint_type is set to “char”, this should be the same value as min_duplicate_length. If fingerprint_type is set to “word”, this should be approximately min_duplicate_length divided by the average word length

  • date_metadata_key (Optional[str]) – Key to use to retrieve the date of each document from their metadata dicts. When provided, this is used to determine which document should be the source of a duplicate (the older) and which document should be the recipient (the newer). If None, the order of the documents in the collection will be used.

  • case_sensitive (bool) – Whether duplication detection should be case-sensitive or not

  • allow_multiline (bool) – Whether detected duplicates can span multiple lines, or whether each line should be handled separately

  • orf (int) – Step size when building fingerprints; cf. the duptextfinder documentation
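
A minimal usage sketch (the documents are illustrative; building the collection with Collection(text_docs=...) and reading annotations through doc.anns / seg.attrs are assumptions about the medkit.core API):

from medkit.core import Collection
from medkit.core.text import TextDocument
from medkit.text.preprocessing import DuplicateFinder

doc_old = TextDocument(text="Past history: type 2 diabetes since 2010.")
doc_new = TextDocument(text="Past history: type 2 diabetes since 2010. New complaint: cough.")
finder = DuplicateFinder(output_label="duplicate", min_duplicate_length=10)
# without date_metadata_key, document order decides source (older) vs recipient (newer)
finder.run([Collection(text_docs=[doc_old, doc_new])])  # Collection arguments are an assumption
for seg in doc_new.anns.get(label="duplicate"):
    attr = seg.attrs.get(label="is_duplicate")[0]
    print(seg.text, attr.source_doc_id)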

Methods:

run(collections)

Find duplicates in each collection of documents

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(collections)[source]#

Find duplicates in each collection of documents

For each duplicate found, a Segment object with a DuplicationAttribute will be created and attached to the document that is the recipient of the duplication (i.e. not the source document).

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class DuplicationAttribute(value, source_doc_id=None, source_spans=None, source_doc_date=None, metadata=None, uid=None)[source]#

Attribute indicating whether some text is a duplicate of some other text in another document

Variables
  • uid (str) – Identifier of the attribute

  • label (str) – The attribute label, always set to DuplicationAttribute.LABEL

  • value (Optional[Any]) – True if the segment or entity to which the attribute belongs is a duplicate of part of another document, False otherwise.

  • source_doc_id (Optional[str]) – Identifier of the document from which the text was copied

  • source_spans (Optional[List[medkit.core.text.span.AnySpan]]) – Spans of the duplicated text in the source document

  • source_doc_date (Optional[Any]) – Date of the source document, if known
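
A short sketch of how the fields of such an attribute can be read (the values are illustrative; in practice the attribute is created by DuplicateFinder and retrieved from a segment):

from medkit.text.preprocessing import DuplicationAttribute

attr = DuplicationAttribute(value=True, source_doc_id="doc_1")
print(attr.label)          # "is_duplicate", i.e. DuplicationAttribute.LABEL
print(attr.value)          # True
print(attr.source_doc_id)  # "doc_1"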

Attributes:

LABEL

Label used for all duplication attributes

Methods:

copy()

Create a new attribute that is a copy of the current instance, but with a new identifier

from_dict(attr_dict)

Create an Attribute from a dict

get_subclass_for_data_dict(data_dict)

Return the subclass that corresponds to the class name found in a data dict

to_brat()

Return a value compatible with the brat format

to_spacy()

Return a value compatible with spaCy

LABEL: ClassVar[str] = 'is_duplicate'#

Label used for all duplication attributes

classmethod from_dict(attr_dict)[source]#

Create an Attribute from a dict

Parameters

attr_dict (dict) – A dictionary from a serialized Attribute, as generated by to_dict()

Return type

Self

copy()#

Create a new attribute that is a copy of the current instance, but with a new identifier

This is used when we want to duplicate an existing attribute onto a different annotation.

Return type

Attribute

classmethod get_subclass_for_data_dict(data_dict)#

Return the subclass that corresponds to the class name found in a data dict

Parameters

data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)

Return type

Optional[Type[Self]]

Returns

subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.

to_brat()#

Return a value compatible with the brat format

Return type

Optional[Any]

to_spacy()#

Return a value compatible with spaCy

Return type

Optional[Any]

class RegexpReplacer(output_label, rules=None, name=None, uid=None)[source]#

Generic pattern replacer to be used as a pre-processing module

This non-destructive module replaces text matching a regular expression pattern with new text. It preserves span information by creating, for each input segment, a new text-bound annotation that carries the span modifications relative to the input text.

Parameters
  • output_label (str) – The output label of the created annotations

  • rules (Optional[List[Tuple[str, str]]]) – The list of replacement rules, each of the form (pattern_to_replace, new_text)

  • name (Optional[str]) – Name describing the pre-processing module (defaults to the class name)

  • uid (str) – Identifier of the pre-processing module
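
A minimal usage sketch (the sample text and the pattern are illustrative, not part of the API):

from medkit.core.text import TextDocument
from medkit.text.preprocessing import RegexpReplacer

doc = TextDocument(text="Diagnosed with Parkinson's disease in 2019.")
# illustrative rule: normalize the disease mention to a canonical form
replacer = RegexpReplacer(output_label="norm_text", rules=[(r"[Pp]arkinson'?s?\s+disease", "parkinson")])
norm_segments = replacer.run([doc.raw_segment])
print(norm_segments[0].text)  # "Diagnosed with parkinson in 2019."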

Methods:

run(segments)

Run the module on a list of input segments and return a new list of segments

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)[source]#

Run the module on a list of input segments and return a new list of segments

Parameters

segments (List[Segment]) – List of segments to normalize

Return type

List[Segment]

Returns

List[Segment] – List of normalized segments

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class EDSCleaner(output_label='clean_text', keep_endlines=False, handle_parentheses_eds=True, handle_points_eds=True, uid=None)[source]#

EDS pre-processing annotation module

This non-destructive module removes and cleans selected dot and newline characters. It preserves span information by creating, for each input segment, a new text-bound annotation that carries the span modifications relative to the input text.

Instantiate the endlines handler.

Parameters
  • output_label (str) – The output label of the created annotations.

  • keep_endlines (bool) – If True, collapse multiple endlines using “.\n” as the replacement. If False (default), collapse multiple endlines using whitespace (“.\s”) as the replacement.

  • handle_parentheses_eds (bool) – If True (default), modify the text near parentheses or keywords according to predefined rules for French documents. If False, leave that text unmodified.

  • handle_points_eds (bool) – Whether to modify dots near predefined keywords in French documents. If True (default), those dots are modified; if False, they are left unchanged.

  • uid (str) – Identifier of the pre-processing module
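
A minimal usage sketch (the sample French text is illustrative):

from medkit.core.text import TextDocument
from medkit.text.preprocessing import EDSCleaner

doc = TextDocument(text="Antécédents :\ndiabète de type 2 (cf. courrier du 01/02/2020).\nTraitement en cours.")
cleaner = EDSCleaner(output_label="clean_text", keep_endlines=False)
clean_segments = cleaner.run([doc.raw_segment])
print(clean_segments[0].text)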

Methods:

run(segments)

Run the module on a list of input segments and return a new list of segments.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)[source]#

Run the module on a list of input segments and return a new list of segments.

Parameters

segments (List[Segment]) – List of segments to normalize

Return type

List[Segment]

Returns

List[Segment] – List of cleaned segments.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

Subpackages / Submodules#

medkit.text.preprocessing.char_replacer

medkit.text.preprocessing.char_rules

medkit.text.preprocessing.duplicate_finder

medkit.text.preprocessing.eds_cleaner

medkit.text.preprocessing.regexp_replacer