medkit.text.ner.simstring_matcher
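Classes:
SimstringMatcher – Entity matcher relying on string similarity
SimstringMatcherNormalization – Descriptor of normalization attributes to attach to entities created from a SimstringMatcherRule
SimstringMatcherRule – Rule to use with SimstringMatcher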
- class SimstringMatcher(rules, threshold=0.9, min_length=3, max_length=50, similarity='jaccard', spacy_tokenization_language=None, blacklist=None, same_beginning=False, attrs_to_copy=None, name=None, uid=None)
Entity matcher relying on string similarity
Uses the simstring fuzzy matching algorithm (http://chokkan.org/software/simstring/).
Note that setting spacy_tokenization_language to a language code (for instance "fr" or "en") might reduce the number of false positives. This requires the spacy optional dependency, which can be installed with pip install medkit-lib[spacy].
- Parameters
rules (List[SimstringMatcherRule]) – Rules to use for matching entities.
min_length (int) – Minimum number of chars in matched entities.
max_length (int) – Maximum number of chars in matched entities.
threshold (float) – Minimum similarity (between 0.0 and 1.0) between a rule term and the text of an entity matched on that rule.
similarity (Literal['cosine', 'dice', 'jaccard', 'overlap']) – Similarity metric to use.
spacy_tokenization_language (Optional[str]) – 2-letter code (ex: “fr”, “en”, etc.) designating the language of the spacy model to use for tokenization. If provided, spacy will be used to tokenize input segments and filter out some tokens based on their part-of-speech tags, such as determiners, conjunctions and prepositions. If None, a simple regexp-based tokenization will be used, which is faster but might give more false positives.
blacklist (Optional[List[str]]) – Optional list of exact terms to ignore.
same_beginning (bool) – Ignore all matches that start with a different character than the term of the rule. This can be convenient to get rid of false positives on words that are very similar but have opposite meanings because of a prefix, for instance “activation” and “inactivation”.
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.).
name (Optional[str]) – Name describing the matcher (defaults to the class name).
uid (str) – Identifier of the matcher.
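To make the interplay of similarity, threshold and same_beginning more concrete, here is a minimal standard-library sketch. It is not medkit's or simstring's implementation: it assumes strings are compared as sets of character trigrams, and the helper names char_ngrams, similarity and keep_match are hypothetical.

```python
import math


def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of text (assumed feature set)."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, measure: str = "jaccard") -> float:
    """Compute one of the four set-based similarity measures."""
    x, y = char_ngrams(a), char_ngrams(b)
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "cosine":
        return common / math.sqrt(len(x) * len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    raise ValueError(f"unknown measure: {measure}")


def keep_match(term: str, candidate: str, threshold: float = 0.9,
               measure: str = "jaccard", same_beginning: bool = False) -> bool:
    """Hypothetical filter mimicking the threshold and same_beginning options."""
    if same_beginning and candidate[:1] != term[:1]:
        return False
    return similarity(term, candidate, measure) >= threshold


# "activation" has 8 trigrams, all of which also occur in "inactivation" (10 trigrams)
print(similarity("activation", "inactivation", "jaccard"))  # 0.8
print(similarity("activation", "inactivation", "overlap"))  # 1.0
print(keep_match("activation", "inactivation", measure="overlap"))  # True
print(keep_match("activation", "inactivation", measure="overlap",
                 same_beginning=True))  # False
```

Under these assumptions, the default Jaccard threshold of 0.9 already rejects “inactivation” for the term “activation”, whereas a more permissive measure such as overlap lets it through unless same_beginning is enabled.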
Methods:
load_rules(path_to_rules[, encoding]) – Load all rules stored in a yml file
run(segments) – Return entities (with optional normalization attributes) matched in segments
save_rules(rules, path_to_rules[, encoding]) – Store rules in a yml file
set_prov_tracer(prov_tracer) – Enable provenance tracing.
Attributes:
description – Contains all the operation init parameters.
- static load_rules(path_to_rules, encoding=None)
Load all rules stored in a yml file
- Parameters
path_to_rules (Path) – The path to a yml file containing a list of mappings with the same structure as SimstringMatcherRule
encoding (Optional[str]) – The encoding of the file to open
- Return type
List[SimstringMatcherRule]
- Returns
List of all the rules in path_to_rules; can be used to initialize a SimstringMatcher
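The “list of mappings with the same structure as SimstringMatcherRule” can be pictured with the sketch below. It uses hypothetical stand-in dataclasses (RuleSketch, NormalizationSketch) that only mirror the fields documented on this page, so that it runs without medkit installed. The terms, labels and the knowledge base id are placeholders, and the real classes may validate or convert data differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class NormalizationSketch:
    """Stand-in mirroring the documented fields of SimstringMatcherNormalization."""
    kb_name: str
    kb_id: Union[int, str]
    kb_version: Optional[str] = None
    term: Optional[str] = None


@dataclass
class RuleSketch:
    """Stand-in mirroring the documented fields of SimstringMatcherRule."""
    term: str
    label: str
    case_sensitive: bool = False
    unicode_sensitive: bool = False
    normalizations: List[NormalizationSketch] = field(default_factory=list)

    @staticmethod
    def from_dict(data: Dict[str, Any]) -> "RuleSketch":
        # Build nested normalization descriptors first, then the rule itself
        norms = [NormalizationSketch(**n) for n in data.get("normalizations", [])]
        return RuleSketch(
            term=data["term"],
            label=data["label"],
            case_sensitive=data.get("case_sensitive", False),
            unicode_sensitive=data.get("unicode_sensitive", False),
            normalizations=norms,
        )


# What a parsed rules file might look like once loaded: a list of mappings
rule_mappings = [
    {
        "term": "diabetes",
        "label": "problem",
        # "C0000000" is a placeholder id, not a real CUI
        "normalizations": [{"kb_name": "umls", "kb_id": "C0000000"}],
    },
    {"term": "asthma", "label": "problem"},
]

rules = [RuleSketch.from_dict(mapping) for mapping in rule_mappings]
print(rules[0].normalizations[0].kb_name)  # umls
```

Only term and label are required in each mapping here; the other keys fall back to the defaults shown in the class signature.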
- static save_rules(rules, path_to_rules, encoding=None)
Store rules in a yml file
- Parameters
rules (List[SimstringMatcherRule]) – The rules to save
path_to_rules (Path) – The path to a yml file that will contain the rules
encoding (Optional[str]) – The encoding of the yml file
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- run(segments)
Return entities (with optional normalization attributes) matched in segments
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- class SimstringMatcherRule(term, label, case_sensitive=False, unicode_sensitive=False, normalizations=<factory>)
Rule to use with SimstringMatcher
- Variables
term (str) – Term to match using similarity-based fuzzy matching
label (str) – Label to use for the entities created when a match is found
case_sensitive (bool) – Whether to take case into account when looking for matches.
unicode_sensitive (bool) – Whether to use ASCII-only versions of the rule term and input texts when looking for matches (non-ASCII chars replaced by closest ASCII chars).
normalizations (List[medkit.text.ner._base_simstring_matcher.BaseSimstringMatcherNormalization]) – Optional list of normalization attributes that should be attached to the entities created
Methods:
from_dict(data) – Creates a SimstringMatcherRule from a dict.
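As an illustration of what the case_sensitive and unicode_sensitive flags imply, the sketch below folds text before comparison. It is an assumption about the behaviour, not medkit's code: it reads unicode_sensitive=False as “compare ASCII-only versions”, approximates “closest ASCII chars” with Unicode NFKD decomposition from the standard library, and uses the hypothetical helper names to_ascii and normalize_for_matching.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose accented chars and drop the marks.

    Characters with no ASCII decomposition (for instance "ø") are simply
    dropped, which a dedicated transliteration library would handle better.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_for_matching(text: str, case_sensitive: bool = False,
                           unicode_sensitive: bool = False) -> str:
    """Hypothetical pre-processing applied to both rule terms and input text."""
    if not unicode_sensitive:
        text = to_ascii(text)
    if not case_sensitive:
        text = text.lower()
    return text


print(normalize_for_matching("Diabète"))  # diabete
print(normalize_for_matching("Diabète", case_sensitive=True,
                             unicode_sensitive=True))  # Diabète
```

With both flags left at False, a rule term such as “diabete” and the input text “Diabète” reduce to the same string before any similarity is computed.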
- class SimstringMatcherNormalization(kb_name, kb_id, kb_version=None, term=None)
Descriptor of normalization attributes to attach to entities created from a SimstringMatcherRule
- Variables
kb_name (str) – The name of the knowledge base we are referencing. Ex: “umls”
kb_version (Optional[str]) – The version of the knowledge base we are referencing. Ex: “202AB”
kb_id (Union[int, str]) – The id of the entity in the knowledge base, for instance a CUI
term (Optional[str]) – Optional normalized version of the entity text in the knowledge base
Methods:
from_dict(data) – Creates a SimstringMatcherNormalization object from a dict
to_attribute(score) – Create a normalization attribute based on the normalization descriptor
- static from_dict(data)
Creates a SimstringMatcherNormalization object from a dict
- Return type
SimstringMatcherNormalization
- to_attribute(score)
Create a normalization attribute based on the normalization descriptor
- Parameters
score (float) – Score of similarity between the normalized term and the entity text
- Return type
EntityNormAttribute
- Returns
Normalization attribute to add to entity