medkit.text.context.family_detector
Classes:
FamilyDetector – Annotator creating family attributes with boolean values indicating if a family reference has been detected.
FamilyDetectorRule – Regexp-based rule to use with FamilyDetector
FamilyMetadata – Metadata dict added to family attributes with True value.
- class FamilyDetector(output_label, rules=None, uid=None)[source]#
Annotator creating family attributes with boolean values indicating if a family reference has been detected.
Because family attributes are attached to whole annotations, each input annotation should be local enough (i.e., a sentence or a syntagma) rather than a big chunk of text.
To detect family references, the module uses rules that may or may not be sensitive to unicode. When a rule is not unicode-sensitive, we try to convert unicode chars to the closest ASCII chars. However, some characters need to be pre-processed beforehand (e.g., n° -> number). So, if the converted text and the original text have different lengths, we fall back on the original unicode text for detection, even if the rule is not unicode-sensitive. In this case, a warning is logged recommending that the data be pre-processed.
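The fallback described above can be sketched with the standard library alone. This is only an illustration of the length check, not medkit's implementation: `to_closest_ascii` and `text_for_rule` are hypothetical helpers, and the NFKD-based conversion is an assumed stand-in for whatever transliteration the module actually uses.

```python
import logging
import unicodedata

logger = logging.getLogger(__name__)


def to_closest_ascii(text: str) -> str:
    """Stdlib approximation of an ASCII transliteration (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding with "ignore" then drops every char that has no ASCII form.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def text_for_rule(text: str, unicode_sensitive: bool) -> str:
    """Pick the text a rule would be matched against (hypothetical helper)."""
    if unicode_sensitive:
        return text
    ascii_text = to_closest_ascii(text)
    if len(ascii_text) != len(text):
        # Lengths differ, so the ASCII text cannot be trusted: keep the original.
        logger.warning(
            "Text length changed after ASCII conversion; matching on the "
            "original text. Consider pre-processing the data."
        )
        return text
    return ascii_text
```

With this sketch, `"m\u00e8re"` is matched as `"mere"` by a rule that is not unicode-sensitive, while `"n\u00b0 3"` is left untouched because dropping the degree sign would change its length.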
Note that for better results, family detection should be run at the sentence level (i.e., on sentence segments) rather than at the syntagma level [1].
[1] N. Garcelon, A. Neuraz, V. Benoit, R. Salomon, A. Burgun, “Improving a full-text search engine: the importance of negation detection and family history context to identify cases in a biomedical data warehouse”, Journal of the American Medical Informatics Association, Volume 24, Issue 3, May 2017
- Parameters
output_label (str) – The label of the created attributes
rules (Optional[List[FamilyDetectorRule]]) – The set of rules to use when detecting family references. If none provided, the rules in “family_detector_default_rules.yml” will be used
uid (str) – Identifier of the detector
Methods:
check_rules_sanity(rules) – Check consistency of a set of rules
load_rules(path_to_rules[, encoding]) – Load all rules stored in a yml file
run(segments) – Add a family attribute to each segment with a boolean value indicating if a family reference has been detected.
save_rules(rules, path_to_rules[, encoding]) – Store rules in a yml file
set_prov_tracer(prov_tracer) – Enable provenance tracing.
Attributes:
description – Contains all the operation init parameters.
- run(segments)[source]#
Add a family attribute to each segment with a boolean value indicating if a family reference has been detected.
Family attributes with a True value have a metadata dict with fields described in FamilyMetadata.
- Parameters
segments (List[Segment]) – List of segments to detect as being family references or not
- static load_rules(path_to_rules, encoding=None)[source]#
Load all rules stored in a yml file
- Parameters
path_to_rules (Path) – Path to a yml file containing a list of mappings with the same structure as FamilyDetectorRule
encoding (Optional[str]) – Encoding of the file to open
- Return type
List[FamilyDetectorRule]
- Returns
List[FamilyDetectorRule] – List of all the rules in path_to_rules, can be used to init a FamilyDetector
- static save_rules(rules, path_to_rules, encoding=None)[source]#
Store rules in a yml file
- Parameters
rules (List[FamilyDetectorRule]) – The rules to save
path_to_rules (Path) – Path to a .yml file that will contain the rules
encoding (Optional[str]) – Encoding of the .yml file
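A rules file is described as a list of mappings with the same structure as FamilyDetectorRule. The sketch below shows, under assumptions, what that structure implies once the file has been parsed. `RuleSketch` is a stand-in dataclass that only mirrors the documented fields (it is not the medkit class), the `parsed` list stands for what a YAML parser would return, and the regexps are made-up examples rather than the default rules.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RuleSketch:
    """Stand-in mirroring the documented fields of FamilyDetectorRule."""

    regexp: str
    exclusion_regexps: List[str] = field(default_factory=list)
    id: Optional[str] = None
    case_sensitive: bool = False
    unicode_sensitive: bool = False


def rules_from_mappings(mappings: List[Dict[str, Any]]) -> List[RuleSketch]:
    # Each mapping supplies the rule fields; omitted ones take their defaults.
    return [RuleSketch(**mapping) for mapping in mappings]


def rules_to_mappings(rules: List[RuleSketch]) -> List[Dict[str, Any]]:
    # Reverse direction: plain dicts that a YAML writer could serialize.
    return [asdict(rule) for rule in rules]


# Hypothetical parsed content of a rules file (illustrative patterns only).
parsed = [
    {"regexp": r"\bfather\b", "id": "father"},
    {
        "regexp": r"\bmother\b",
        "exclusion_regexps": [r"\bmother tongue\b"],
        "id": "mother",
    },
]
rules = rules_from_mappings(parsed)
```

In this sketch, converting the rules back to mappings and loading them again yields an equal list, which is the round trip one would expect from save_rules followed by load_rules.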
- property description: medkit.core.operation_desc.OperationDescription#
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)#
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- class FamilyDetectorRule(regexp, exclusion_regexps=<factory>, id=None, case_sensitive=False, unicode_sensitive=False)[source]#
Regexp-based rule to use with FamilyDetector
Input text may be converted before the rule is applied (see unicode_sensitive).
- Parameters
regexp (str) – The regexp pattern used to match a family reference
exclusion_regexps (List[str]) – Optional exclusion patterns
id (Optional[str]) – Unique identifier of the rule to store in the metadata of the created attributes
case_sensitive (bool) – Whether to consider case when running regexp and exclusion_regexps
unicode_sensitive (bool) – If True, rule matches are searched on unicode text. If False, rule matches are searched on closest ASCII text when possible, so regexp and exclusion_regexps shall not contain non-ASCII chars because they would never be matched. (cf. FamilyDetector)
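As a rough illustration of how these fields could interact, here is a minimal sketch. It assumes that an exclusion pattern found anywhere in the text cancels the match, which the description above does not spell out. `RuleSketch` and `rule_matches` are hypothetical names, and the patterns are invented for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RuleSketch:
    """Stand-in mirroring the documented fields of FamilyDetectorRule."""

    regexp: str
    exclusion_regexps: List[str] = field(default_factory=list)
    id: Optional[str] = None
    case_sensitive: bool = False
    unicode_sensitive: bool = False


def rule_matches(rule: RuleSketch, text: str) -> bool:
    # case_sensitive applies to the main pattern and to the exclusions alike.
    flags = 0 if rule.case_sensitive else re.IGNORECASE
    if re.search(rule.regexp, text, flags) is None:
        return False
    # Assumption: any exclusion pattern found in the text cancels the match.
    return not any(
        re.search(pattern, text, flags) for pattern in rule.exclusion_regexps
    )


rule = RuleSketch(
    regexp=r"\bmother\b",
    exclusion_regexps=[r"\bmother tongue\b"],
    id="mother",
)
strict_rule = RuleSketch(regexp=r"\bmother\b", case_sensitive=True)
```

Here `rule` accepts "Her Mother had diabetes" but rejects "English is her mother tongue", while `strict_rule` rejects the capitalized "Mother".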
- class FamilyMetadata(_typename, _fields=None, /, **kwargs)[source]#
Metadata dict added to family attributes with True value.
- Parameters
rule_id (Union[str, int]) – Identifier of the rule used to detect a family reference. If the rule has no id, then the index of the rule in the list of rules is used instead.
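The rule_id fallback can be sketched as follows. `FamilyMetadataSketch` and `metadata_for` are hypothetical names used only for illustration, and the function takes a plain list of optional rule ids rather than real FamilyDetectorRule objects.

```python
from typing import List, Optional, TypedDict, Union


class FamilyMetadataSketch(TypedDict):
    """Stand-in for the documented FamilyMetadata dict."""

    rule_id: Union[str, int]


def metadata_for(
    rule_ids: List[Optional[str]], matched_index: int
) -> FamilyMetadataSketch:
    rule_id = rule_ids[matched_index]
    # Fall back on the position of the rule in the list when it has no id.
    return {"rule_id": rule_id if rule_id is not None else matched_index}
```

For a rule list whose ids are `["id_mother", None]`, a match on the first rule would give `{"rule_id": "id_mother"}` and a match on the second would give `{"rule_id": 1}`.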