medkit.text.ner.simstring_matcher

Classes:

SimstringMatcher(rules[, threshold, ...])

Entity matcher relying on string similarity

SimstringMatcherNormalization(kb_name, kb_id)

Descriptor of normalization attributes to attach to entities created from a SimstringMatcherRule

SimstringMatcherRule(term, label[, ...])

Rule to use with SimstringMatcher

class SimstringMatcher(rules, threshold=0.9, min_length=3, max_length=50, similarity='jaccard', spacy_tokenization_language=None, blacklist=None, same_beginning=False, attrs_to_copy=None, name=None, uid=None)

Entity matcher relying on string similarity

Uses the simstring fuzzy matching algorithm (http://chokkan.org/software/simstring/).

Note that setting spacy_tokenization_language to a language code (for instance "fr" or "en") might reduce the number of false positives. This requires the spacy optional dependency, which can be installed with pip install medkit-lib[spacy].

Parameters
  • rules (List[SimstringMatcherRule]) – Rules to use for matching entities.

  • min_length (int) – Minimum number of chars in matched entities.

  • max_length (int) – Maximum number of chars in matched entities.

  • threshold (float) – Minimum similarity (between 0.0 and 1.0) between a rule term and the text of an entity matched on that rule.

  • similarity (Literal['cosine', 'dice', 'jaccard', 'overlap']) – Similarity metric to use.

  • spacy_tokenization_language (Optional[str]) – 2-letter code (ex: “fr”, “en”, etc.) designating the language of the spacy model to use for tokenization. If provided, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determiners, conjunctions and prepositions. If None, a simple regexp-based tokenization will be used, which is faster but might give more false positives.

  • blacklist (Optional[List[str]]) – Optional list of exact terms to ignore.

  • same_beginning (bool) – Ignore all matches that start with a different character than the term of the rule. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a prefix, for instance “activation” and “inactivation”.

  • attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).

  • name (Optional[str]) – Name describing the matcher (defaults to the class name).

  • uid (Optional[str]) – Identifier of the matcher.
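
The interaction between similarity, threshold and same_beginning can be illustrated with a minimal pure-Python sketch. It is not medkit's or simstring's implementation: the helper names (char_ngrams, similarity, is_match) are hypothetical, the use of unpadded character trigrams is an assumption made for simplicity, and the real library relies on an indexed database rather than pairwise comparison. Only the four set-based formulas are the standard definitions of these measures.

```python
from __future__ import annotations

import math


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of `text` (no padding, for simplicity)."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, measure: str = "jaccard") -> float:
    """Compare two strings through their n-gram sets with the chosen measure."""
    x, y = char_ngrams(a), char_ngrams(b)
    if not x or not y:
        return 0.0
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "cosine":
        return common / math.sqrt(len(x) * len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    raise ValueError(f"unknown measure: {measure}")


def is_match(
    term: str,
    candidate: str,
    threshold: float = 0.9,
    measure: str = "jaccard",
    same_beginning: bool = False,
) -> bool:
    """Hypothetical filter: optional first-character check, then the threshold."""
    if same_beginning and candidate[:1] != term[:1]:
        return False
    return similarity(term, candidate, measure) >= threshold
```

In this sketch, "activation" and "inactivation" share 8 of their 10 distinct trigrams, so their Jaccard similarity is 0.8 and the default threshold of 0.9 rejects the pair. With the more permissive overlap measure the score reaches 1.0, and it is the same_beginning check that discards the match.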

Methods:

load_rules(path_to_rules[, encoding])

Load all rules stored in a yml file

run(segments)

Return entities (with optional normalization attributes) matched in segments

save_rules(rules, path_to_rules[, encoding])

Store rules in a yml file

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

static load_rules(path_to_rules, encoding=None)

Load all rules stored in a yml file

Parameters
  • path_to_rules (Path) – The path to a yml file containing a list of mappings with the same structure as SimstringMatcherRule

  • encoding (Optional[str]) – The encoding of the file to open

Return type

List[SimstringMatcherRule]

Returns

List[SimstringMatcherRule] – List of all the rules in path_to_rules, which can be used to initialize a SimstringMatcher

static save_rules(rules, path_to_rules, encoding=None)

Store rules in a yml file

Parameters
  • rules (List[SimstringMatcherRule]) – The rules to save

  • path_to_rules (Path) – The path to a yml file that will contain the rules

  • encoding (Optional[str]) – The encoding of the yml file

property description: medkit.core.operation_desc.OperationDescription

Contains all the operation init parameters.

Return type

OperationDescription

run(segments)

Return entities (with optional normalization attributes) matched in segments

Parameters

segments (List[Segment]) – List of segments in which to look for matches

Return type

List[Entity]

Returns

entities (List[Entity]) – Entities found in segments (with optional normalization attributes)

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class SimstringMatcherRule(term, label, case_sensitive=False, unicode_sensitive=False, normalizations=<factory>)

Rule to use with SimstringMatcher

Variables
  • term (str) – Term to match using similarity-based fuzzy matching

  • label (str) – Label to use for the entities created when a match is found

  • case_sensitive (bool) – Whether to take case into account when looking for matches.

  • unicode_sensitive (bool) – Whether to take non-ASCII characters into account when looking for matches. If False, ASCII-only versions of the rule term and input texts are used (non-ASCII chars replaced by closest ASCII chars).

  • normalizations (List[medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization]) – Optional list of normalization attributes that should be attached to the entities created
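
The effect of case_sensitive and unicode_sensitive can be pictured as a normalization applied to both the rule term and the input text before comparison. The sketch below is an assumption about the general idea, not medkit's code: the function names are hypothetical, and accent stripping through the standard unicodedata module stands in for whatever ASCII conversion the library actually performs (it leaves characters without a decomposition, such as "ø", untouched).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(
    text: str,
    case_sensitive: bool = False,
    unicode_sensitive: bool = False,
) -> str:
    """Hypothetical pre-processing applied to both rule terms and input text."""
    if not unicode_sensitive:
        # Accents are ignored: compare accent-stripped versions.
        text = strip_accents(text)
    if not case_sensitive:
        # Case is ignored: compare lowercased versions.
        text = text.lower()
    return text
```

Under these assumptions, a rule with the default flags would treat "Diabète" and "diabete" as the same string, while setting unicode_sensitive to True keeps the accent significant.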

Methods:

from_dict(data)

Creates a SimstringMatcherRule from a dict.

static from_dict(data)

Creates a SimstringMatcherRule from a dict.

Return type

SimstringMatcherRule
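
To make the expected shape of such a dict concrete, here is a sketch built on stand-in dataclasses. RuleSketch and NormalizationSketch are hypothetical classes that only mirror the fields documented on this page; they are not the medkit classes, and the assumption that the dict keys match the attribute names comes from the load_rules description ("mappings with the same structure as SimstringMatcherRule"). The knowledge base version shown is illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class NormalizationSketch:
    """Stand-in mirroring the documented normalization fields."""

    kb_name: str
    kb_id: Union[int, str]
    kb_version: Optional[str] = None
    term: Optional[str] = None


@dataclass
class RuleSketch:
    """Stand-in mirroring the documented rule fields."""

    term: str
    label: str
    case_sensitive: bool = False
    unicode_sensitive: bool = False
    normalizations: List[NormalizationSketch] = field(default_factory=list)

    @staticmethod
    def from_dict(data: Dict[str, Any]) -> "RuleSketch":
        # Nested normalization mappings are converted first.
        norms = [NormalizationSketch(**n) for n in data.get("normalizations", [])]
        return RuleSketch(
            term=data["term"],
            label=data["label"],
            case_sensitive=data.get("case_sensitive", False),
            unicode_sensitive=data.get("unicode_sensitive", False),
            normalizations=norms,
        )


# One mapping, as it might appear as an entry of a rules file.
rule = RuleSketch.from_dict(
    {
        "term": "asthma",
        "label": "problem",
        "normalizations": [
            {"kb_name": "umls", "kb_id": "C0004096", "kb_version": "2021AB"},
        ],
    }
)
```

Keeping the rule a flat mapping with a nested list of normalization mappings is what lets the same structure be written to and read back from a yml file.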

class SimstringMatcherNormalization(kb_name, kb_id, kb_version=None, term=None)

Descriptor of normalization attributes to attach to entities created from a SimstringMatcherRule

Variables
  • kb_name (str) – The name of the knowledge base we are referencing. Ex: “umls”

  • kb_version (Optional[str]) – The version of the knowledge base we are referencing. Ex: “202AB”

  • kb_id (Union[int, str]) – The id of the entity in the knowledge base, for instance a CUI

  • term (Optional[str]) – Optional normalized version of the entity text in the knowledge base

Methods:

from_dict(data)

Creates a SimstringMatcherNormalization object from a dict

to_attribute(score)

Create a normalization attribute based on the normalization descriptor

static from_dict(data)

Creates a SimstringMatcherNormalization object from a dict

Return type

SimstringMatcherNormalization

to_attribute(score)

Create a normalization attribute based on the normalization descriptor

Parameters

score (float) – Score of similarity between the normalized term and the entity text

Return type

EntityNormAttribute

Returns

EntityNormAttribute – Normalization attribute to add to entity