medkit.text.ner.regexp_matcher

Classes:

RegexpMatcher([rules, attrs_to_copy, name, uid])

Entity annotator relying on regexp-based rules

RegexpMatcherNormalization(kb_name, kb_id[, ...])

Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule

RegexpMatcherRule(regexp, label[, term, id, ...])

Regexp-based rule to use with RegexpMatcher

RegexpMetadata(_typename[, _fields])

Metadata dict added to entities matched by RegexpMatcher

class RegexpMatcher(rules=None, attrs_to_copy=None, name=None, uid=None)

Entity annotator relying on regexp-based rules

To detect entities, the module uses rules that may or may not be sensitive to unicode. When a rule is not unicode-sensitive, the matcher tries to convert unicode characters to their closest ASCII equivalents. However, some characters need to be pre-processed beforehand (e.g., n° -> number), because converting them changes the length of the text. So, if the converted text and the original text have different lengths, the matcher falls back on the original unicode text for detection, even if the rule is not unicode-sensitive. In that case, a warning is logged recommending that the data be pre-processed.
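
The fallback described above can be sketched as follows. This is a minimal illustration using only the standard library, not medkit's actual implementation: the helper names `to_closest_ascii` and `text_for_detection` are hypothetical, and the NFKD-based conversion is an assumption standing in for whatever conversion the library performs.

```python
import logging
import unicodedata

logger = logging.getLogger(__name__)


def to_closest_ascii(text: str) -> str:
    """Hypothetical helper: strip diacritics, drop chars with no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def text_for_detection(text: str, unicode_sensitive: bool) -> str:
    """Return the text a rule would be matched against (illustrative only)."""
    if unicode_sensitive:
        return text
    ascii_text = to_closest_ascii(text)
    if len(ascii_text) != len(text):
        # Character offsets would no longer line up with the original text,
        # so fall back on the unicode text and recommend pre-processing.
        logger.warning("Lengths differ after ASCII conversion; pre-process the text")
        return text
    return ascii_text
```

With this sketch, "diabète" is converted to "diabete" (same length, so the ASCII text is used), while "n°5" is returned unchanged because dropping the degree sign would shorten the text.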

Instantiate the regexp matcher

Parameters
  • rules (Optional[List[RegexpMatcherRule]]) – The set of rules to use when matching entities. If none are provided, the rules in “regexp_matcher_default_rules.yml” will be used

  • attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc.)

  • name (Optional[str]) – Name describing the matcher (defaults to the class name)

  • uid (str) – Identifier of the matcher

Methods:

check_rules_sanity(rules)

Check consistency of a set of rules

load_rules(path_to_rules[, encoding])

Load all rules stored in a yml file

run(segments)

Return entities (with optional normalization attributes) matched in segments

save_rules(rules, path_to_rules[, encoding])

Store rules in a yml file

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Return entities (with optional normalization attributes) matched in segments

Parameters

segments (List[Segment]) – List of segments in which to look for matches

Return type

List[Entity]

Returns

entities (List[Entity]) – Entities found in segments (with optional normalization attributes). Entities have a metadata dict with fields described in RegexpMetadata

static load_rules(path_to_rules, encoding=None)

Load all rules stored in a yml file

Parameters
  • path_to_rules (Path) – Path to a yml file containing a list of mappings with the same structure as RegexpMatcherRule

  • encoding (Optional[str]) – Encoding of the file to open

Return type

List[RegexpMatcherRule]

Returns

List[RegexpMatcherRule] – List of all the rules in path_to_rules, can be used to init a RegexpMatcher

static check_rules_sanity(rules)

Check consistency of a set of rules

static save_rules(rules, path_to_rules, encoding=None)

Store rules in a yml file

Parameters
  • rules (List[RegexpMatcherRule]) – The rules to save

  • path_to_rules (Path) – Path to a .yml file that will contain the rules

  • encoding (Optional[str]) – Encoding of the .yml file

property description: medkit.core.operation_desc.OperationDescription

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class RegexpMatcherRule(regexp, label, term=None, id=None, version=None, index_extract=0, case_sensitive=True, unicode_sensitive=True, exclusion_regexp=None, normalizations=<factory>)

Regexp-based rule to use with RegexpMatcher

Variables
  • regexp (str) – The regexp pattern used to match entities

  • label (str) – The label to attribute to entities created based on this rule

  • term (Optional[str]) – The optional normalized version of the entity text

  • id (Optional[str]) – Unique identifier of the rule to store in the metadata of the entities

  • version (Optional[str]) – Version string to store in the metadata of the entities

  • index_extract (int) – If the regexp has groups, the index of the group to use to extract the entity

  • case_sensitive (bool) – Whether case is taken into account when running regexp and exclusion_regexp. If False, case is ignored

  • unicode_sensitive (bool) – If True, regexp rule matches are searched on the unicode text. If False, regexp rule matches are searched on the closest ASCII text when possible (cf. RegexpMatcher); in that case, regexp and exclusion_regexp shall not contain non-ASCII chars because they would never be matched

  • exclusion_regexp (Optional[str]) – An optional exclusion pattern. Note that this exclusion pattern is applied to the whole input annotation, so when relying on exclusion_regexp make sure the input annotations passed to RegexpMatcher are “local” enough (sentences or syntagmas) rather than the whole text or paragraphs

  • normalizations (List[medkit.text.ner.regexp_matcher.RegexpMatcherNormalization]) – Optional list of normalization attributes that should be attached to the entities created
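
How index_extract, case_sensitive and exclusion_regexp interact can be illustrated with the standard `re` module. The function `find_matches` below is a hypothetical helper written for this illustration, not part of medkit, and it assumes that a hit of the exclusion pattern anywhere in the input text discards all matches of the rule.

```python
import re
from typing import List, Optional, Tuple


def find_matches(
    text: str,
    regexp: str,
    index_extract: int = 0,
    case_sensitive: bool = True,
    exclusion_regexp: Optional[str] = None,
) -> List[Tuple[int, int, str]]:
    """Hypothetical helper returning (start, end, matched_text) for one rule."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Assumption: the exclusion pattern is tested against the whole input text,
    # so a hit anywhere in it discards every match of the rule.
    if exclusion_regexp is not None and re.search(exclusion_regexp, text, flags):
        return []
    matches = []
    for m in re.finditer(regexp, text, flags):
        # index_extract selects which group delimits the entity (0 = whole match).
        start, end = m.span(index_extract)
        matches.append((start, end, text[start:end]))
    return matches
```

For instance, `find_matches("Patient has Diabetes type 2", r"type (\d)", index_extract=1)` keeps only the digit captured by the group, and passing `exclusion_regexp=r"\bno\b"` on the text "no diabetes" returns no match at all, which is why short, local input segments are preferable.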

class RegexpMatcherNormalization(kb_name, kb_id, kb_version=None)

Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule

Variables
  • kb_name (str) – The name of the knowledge base we are referencing. Ex: “umls”

  • kb_version (Optional[str]) – The version of the knowledge base we are referencing. Ex: “202AB”

  • kb_id (Any) – The id of the entity in the knowledge base, for instance a CUI

class RegexpMetadata(_typename, _fields=None, /, **kwargs)

Metadata dict added to entities matched by RegexpMatcher

Parameters
  • rule_id (Union[str, int]) – Identifier of the rule used to match an entity. If the rule has no id, then the index of the rule in the list of rules is used instead.

  • version (Optional[str]) – Optional version of the rule used to match an entity
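
The rule_id fallback described above can be sketched in a few lines. The helper `build_metadata` is hypothetical and only mirrors the two documented fields; it is not how medkit builds the dict internally.

```python
from typing import Any, Dict, Optional


def build_metadata(
    rule_id: Optional[str], rule_index: int, version: Optional[str] = None
) -> Dict[str, Any]:
    """Hypothetical helper: metadata for an entity matched by a given rule."""
    return {
        # Fall back on the position of the rule in the list when it has no id.
        "rule_id": rule_id if rule_id is not None else rule_index,
        "version": version,
    }
```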