Text operation modules#
This page lists all components related to text processing.
Note
For more details about all sub-packages, refer to medkit.text.
Summary#
Here is a list of all the medkit text operations with a direct link to the corresponding API docs. Scrolling down this page, you will find more details about the components of medkit.text.
Preprocessing:
Fast replacement of 1-char strings by n-char strings
Patterns replacement
Cleaning of texts extracted from the APHP EDS
Detection of duplicated parts across documents based on duptextfinder
Segmentation:
Rule-based detection of sections
Rule-based sentence splitting
Sentence splitting based on PyRuSH
Rule-based sub-sentence splitting
Context:
Detection of negation
Detection of hypothesis
Detection of family antecedents
Named Entity Recognition:
Regexp-based entity matching
Fast fuzzy matching based on simstring
Advanced entity matching based on IAMSystem
Entity matcher relying on HuggingFace transformers models
General matcher (dates, quantities, etc) relying on Duckling
Date/time matching based on EDS-NLP
TNM (Tumour/Node/Metastasis) matching based on EDS-NLP
Normalization of pre-existing entities to UMLS CUIs relying on a CODER model
spaCy:
Operation wrapping a spaCy pipeline to work at the annotation level
Operation wrapping a spaCy pipeline to work at the document level
Operation wrapping an EDS-NLP pipeline to work at the annotation level
Operation wrapping an EDS-NLP pipeline to work at the document level
Misc:
Relation detector relying on spaCy’s dependency parser
Translation operation relying on HuggingFace transformers models
Propagation of attributes based on annotation spans
A component to divide text documents using their segments as a reference
Pre-processing modules#
This section provides some information about how to use preprocessing modules.
Note
For more details about public API, refer to medkit.text.preprocessing.
If you need to replace parts of your document texts before processing them, medkit provides operations that do so while keeping span information.
Rule-based operations (such as RegexpMatcher) may require document texts to be pre-processed.
For example, concerning the RegexpMatcher: when a rule is not unicode-sensitive, medkit tries to convert unicode characters to the closest ASCII characters. However, some characters need to be replaced beforehand (e.g., n° -> number). If the converted text and the original text have different lengths, detection falls back on the original unicode text, even if the rule is not unicode-sensitive. In this case, a warning is logged recommending that you pre-process the data.
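The fallback described above can be sketched as follows. This is an illustration of the idea only, not medkit's implementation: the ASCII conversion uses the standard `unicodedata` module as a stand-in, and the helper names `to_ascii` and `text_for_matching` are hypothetical.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate ASCII conversion: split accented characters into
    base character + accent, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def text_for_matching(text: str) -> str:
    """Return the ASCII version of the text when it is safe to match on it,
    otherwise fall back on the original unicode text."""
    ascii_text = to_ascii(text)
    # Character offsets found in the ASCII text are only valid in the
    # original text if both texts have exactly the same length.
    if len(ascii_text) != len(text):
        return text
    return ascii_text


# "é" becomes "e": same length, so the ASCII text can be used
print(text_for_matching("num\u00e9ro 3"))
# the degree sign has no ASCII equivalent here: lengths differ, fallback
print(text_for_matching("n\u00b0 3"))
```

Replacing `n°` with a pre-processing operation before matching removes the length mismatch, so the ASCII text can be used for detection.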
CharReplacer#
CharReplacer is a pre-processing operation that replaces a single character with a string of one or more characters.
It is faster than RegexpReplacer but is limited to character replacement and does not support pattern replacement.
For example, if you want to replace some special characters like +:
from medkit.core.text import TextDocument
from medkit.text.preprocessing import CharReplacer
doc = TextDocument(text="Il suit + ou - son traitement,")
rules = [("+", "plus"), ("-", "moins")]
op = CharReplacer(output_label="preprocessed_text", rules=rules)
new_segment = op.run([doc.raw_segment])[0]
print(new_segment.text)
Results:
new_segment.text: “Il suit plus ou moins son traitement,”
new_segment.spans: [Span(start=0, end=8), ModifiedSpan(length=4, replaced_spans=[Span(start=8, end=9)]), Span(start=9, end=13), ModifiedSpan(length=5, replaced_spans=[Span(start=13, end=14)]), Span(start=14, end=30)]
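To make the span list above easier to read, here is a minimal pure-Python sketch of how such spans can be built during character replacement. It is not medkit's code: `Span` and `ModifiedSpan` below are simplified local stand-ins for the medkit classes, and `replace_chars` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Span:
    """Untouched range of the original text (stand-in for medkit's Span)."""
    start: int
    end: int


@dataclass(frozen=True)
class ModifiedSpan:
    """Replacement text of a given length, with the original ranges it replaces."""
    length: int
    replaced_spans: Tuple[Span, ...]


def replace_chars(text, rules):
    """Replace single characters and record where each piece comes from."""
    mapping = dict(rules)
    parts, spans = [], []
    run_start = 0  # start of the current run of unmodified characters
    for i, char in enumerate(text):
        if char in mapping:
            if i > run_start:
                spans.append(Span(run_start, i))
            new_text = mapping[char]
            spans.append(ModifiedSpan(len(new_text), (Span(i, i + 1),)))
            parts.append(new_text)
            run_start = i + 1
        else:
            parts.append(char)
    if run_start < len(text):
        spans.append(Span(run_start, len(text)))
    return "".join(parts), spans


new_text, spans = replace_chars(
    "Il suit + ou - son traitement,", [("+", "plus"), ("-", "moins")]
)
print(new_text)
print(spans)
```

Each `ModifiedSpan` records the length of the inserted text and the original range it replaces, which is what allows annotations found in the new text to be traced back to the raw document.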
medkit also provides some pre-defined rules that you can import (cf. medkit.text.preprocessing.rules) and then combine with your own rules.
For example:
from medkit.text.preprocessing import (
    CharReplacer,
    LIGATURE_RULES,
    SIGN_RULES,
    SPACE_RULES,
    DOT_RULES,
    FRACTION_RULES,
    QUOTATION_RULES,
)
# my_own_rules stands for your own list of (character, replacement) tuples
my_own_rules = [("+", "plus")]
rules = (
    LIGATURE_RULES
    + SIGN_RULES
    + SPACE_RULES
    + DOT_RULES
    + FRACTION_RULES
    + QUOTATION_RULES
    + my_own_rules
)
# same as rules = ALL_RULES + my_own_rules
op = CharReplacer(output_label="preprocessed_text", rules=rules)
Note
If you do not provide rules when initializing the CharReplacer operation, all pre-defined rules (i.e., ALL_RULES) are used.
RegexpReplacer#
The RegexpReplacer operation uses regular expressions to detect patterns in the text and replace them with new text, while preserving span information.
It is useful when you need to replace sub-texts, or texts appearing in a given context, with other ones.
For example, if you want to replace n° by numéro:
from medkit.core.text import TextDocument
from medkit.text.preprocessing import RegexpReplacer
doc = TextDocument(text="À l'aide d'une canule n ° 3,")
rule = (r"n\s*°", "numéro")
op = RegexpReplacer(output_label="preprocessed_text", rules=[rule])
new_segment = op.run([doc.raw_segment])[0]
print(new_segment.text)
Results:
new_segment.text: “À l’aide d’une canule numéro 3,”
new_segment.spans: [Span(start=0, end=22), ModifiedSpan(length=6, replaced_spans=[Span(start=22, end=25)]), Span(start=25, end=28)]
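The reason span information matters is that a match found in the pre-processed text must be mapped back to the original text. The following standard-library sketch illustrates that mapping under simplifying assumptions; `replace_with_offsets` and `original_range` are hypothetical helpers, not medkit functions.

```python
import re


def replace_with_offsets(text, pattern, replacement):
    """Replace every match of pattern and return the new text plus chunks of
    (new_start, new_end, orig_start, orig_end, replaced)."""
    parts, chunks = [], []
    new_pos = orig_pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > orig_pos:
            kept = text[orig_pos:match.start()]
            chunks.append((new_pos, new_pos + len(kept), orig_pos, match.start(), False))
            parts.append(kept)
            new_pos += len(kept)
        chunks.append((new_pos, new_pos + len(replacement), match.start(), match.end(), True))
        parts.append(replacement)
        new_pos += len(replacement)
        orig_pos = match.end()
    if orig_pos < len(text):
        kept = text[orig_pos:]
        chunks.append((new_pos, new_pos + len(kept), orig_pos, len(text), False))
        parts.append(kept)
    return "".join(parts), chunks


def original_range(chunks, new_start, new_end):
    """Map a range of the new text to the range of the original text covering it."""
    orig_start = orig_end = None
    for n_start, n_end, o_start, o_end, replaced in chunks:
        if n_end <= new_start or n_start >= new_end:
            continue  # this chunk does not overlap the requested range
        if replaced:
            lo, hi = o_start, o_end  # a replaced chunk maps to the whole match
        else:
            lo = o_start + max(new_start, n_start) - n_start
            hi = o_start + min(new_end, n_end) - n_start
        orig_start = lo if orig_start is None else min(orig_start, lo)
        orig_end = hi if orig_end is None else max(orig_end, hi)
    return orig_start, orig_end


text = "À l'aide d'une canule n ° 3,"
new_text, chunks = replace_with_offsets(text, r"n\s*°", "numéro")
start = new_text.index("numéro 3")
o_start, o_end = original_range(chunks, start, start + len("numéro 3"))
print(new_text)
print(repr(text[o_start:o_end]))
```

Here an entity found on `numéro 3` in the cleaned text is traced back to `n ° 3` in the raw text.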
Warning
If you have many single characters to replace, RegexpReplacer is not the most efficient option.
In this case, we recommend using CharReplacer.
Other pre-processing modules#
medkit also provides an operation for cleaning up text. This module was implemented for the specific case of EDS documents.
You can follow this tutorial example for more details about this EDSCleaner module.
Segmentation modules#
This section lists text segmentation modules. They are part of the medkit.text.segmentation package.
Note
For more details about public APIs of each module, refer to medkit.text.segmentation sub-modules.
SectionTokenizer and SyntagmaTokenizer may rely on a description file containing the set of user-defined rules for splitting the document text into a list of medkit Segment objects corresponding respectively to sections or syntagmas.
For SectionTokenizer, here is the YAML schema reference of the file.
sections: dictionary of key-values where the key is the section name and the value is a list of keywords to detect as the start of the section.
rules: list of modification rules whose role is to rename a detected section
rules.section_name: name of the detected section to rename
rules.new_section_name: new name wanted for the section
rules.order: order condition for renaming. Possible values: BEFORE, AFTER
other_sections: list of other section names (i.e., context of the section to rename) to use with the order condition
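As a rough illustration of what the `sections` entry drives, the sketch below splits a text at each keyword occurrence. It only mirrors the idea of keyword-based section starts: the section names and keywords are hypothetical, `split_sections` is not a medkit function, text before the first keyword is simply ignored, and the renaming `rules` are not handled.

```python
import re

# Hypothetical content of the "sections" entry: section name -> start keywords
sections = {
    "antecedents": ["Antécédents"],
    "treatment": ["Traitement"],
}


def split_sections(text, sections):
    """Return (section_name, start, end) tuples, one per detected keyword."""
    hits = []
    for name, keywords in sections.items():
        for keyword in keywords:
            for match in re.finditer(re.escape(keyword), text):
                hits.append((match.start(), name))
    hits.sort()
    result = []
    for index, (start, name) in enumerate(hits):
        # a section runs until the next keyword, or until the end of the text
        end = hits[index + 1][0] if index + 1 < len(hits) else len(text)
        result.append((name, start, end))
    return result


text = "Antécédents: asthme. Traitement: salbutamol."
for name, start, end in split_sections(text, sections):
    print(name, repr(text[start:end]))
```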
Note
You may test the default French rules using section_tokenizer = SectionTokenizer().
The corresponding file content is available here.
For SyntagmaTokenizer, here is the YAML schema reference of the file.
syntagma.separators: list of regular expressions that trigger the start of a new syntagma.
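The role of the separators can be sketched with the standard `re` module. This is a simplified illustration, not medkit's tokenizer: `split_syntagmas` is a hypothetical helper and the separator used in the example is not necessarily one of the default French rules.

```python
import re


def split_syntagmas(sentence, separators):
    """Split a sentence so that each separator match starts a new syntagma."""
    starts = {0}
    for separator in separators:
        for match in re.finditer(separator, sentence):
            starts.add(match.start())
    bounds = sorted(starts) + [len(sentence)]
    pieces = [sentence[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [piece for piece in pieces if piece]


print(split_syntagmas("Pas de fièvre mais toux persistante", [r"\bmais\b"]))
```

The separator itself belongs to the syntagma it starts, so a context detector run on each piece sees the conjunction together with the clause it introduces.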
Note
You may test the default French rules using syntagma_tokenizer = SyntagmaTokenizer().
The corresponding file content is available here.
Examples
For a better understanding, you may follow these tutorial examples:
section: section tokenizer tutorial
syntagma: syntagma tokenizer tutorial
sentence: first steps tutorial
Context detection modules#
This section lists text annotators for detecting context. They are part of the medkit.text.context package.
Hypothesis#
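If you want to test the default French rules of HypothesisDetector, you may use:
from medkit.text.context import HypothesisDetector
detector = HypothesisDetector()
# syntagmas: list of segments to annotate, e.g. the output of a SyntagmaTokenizer
detector.run(syntagmas)
Note
For more details about public APIs, refer to hypothesis_detector.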
Negation#
medkit provides a rule-based negation detector which attaches a negation attribute to a text segment.
Note
For more details about public APIs, refer to negation_detector.
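The principle of a rule-based context detector that attaches an attribute to each segment can be sketched in plain Python. Everything here is illustrative: `Segment` is a simplified stand-in for medkit's class, the patterns are hypothetical examples rather than medkit's default rules, and `detect_negation` is not a medkit function.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Simplified stand-in for a medkit text segment."""
    text: str
    attrs: dict = field(default_factory=dict)


# Hypothetical negation cues, for illustration only
NEGATION_PATTERNS = [r"\bpas de\b", r"\bsans\b", r"\baucune?\b"]


def detect_negation(segments, patterns=NEGATION_PATTERNS, label="is_negated"):
    """Attach a boolean attribute to each segment, in place."""
    compiled = [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for segment in segments:
        segment.attrs[label] = any(p.search(segment.text) for p in compiled)


segments = [Segment("Pas de fièvre"), Segment("Toux persistante")]
detect_negation(segments)
for segment in segments:
    print(segment.text, segment.attrs)
```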
Family reference#
medkit provides a rule-based family detector which attaches a family attribute to a text segment.
Note
For more details about public APIs, refer to family_detector.
NER modules#
This section lists text annotators for detecting entities. They are part of the medkit.text.ner package.
Regular Expression Matcher#
medkit provides a rule-based entity matcher.
For an example of RegexpMatcher usage, you can follow this example tutorial.
Note
For more details about public APIs, refer to regexp_matcher.
IAM system Matcher#
The iamsystem library is available through a dedicated medkit operation.
Note
For more details about public APIs, refer to iamsystem_matcher.
medkit also provides a custom implementation (MedkitKeyword) of the IAM system IEntity, which allows the user:
to associate kb_name to kb_id
to provide a medkit entity label (e.g., category) associated with the IAM system entity label (i.e., text to search).
Examples
For a better understanding, you may follow the iamsystem matcher example tutorial.
Simstring Matcher#
medkit provides an entity matcher using the simstring fuzzy matching algorithm.
Note
For more details about public APIs, refer to simstring_matcher.
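As background, simstring-style matching compares sets of character n-grams and keeps the dictionary terms whose similarity to the query reaches a threshold. The sketch below shows that idea with a brute-force lookup and Jaccard similarity. It is a simplification: the real library uses an index to avoid scanning the whole vocabulary, and the helper names and padding scheme are assumptions made for this example.

```python
def char_ngrams(term, n=3):
    """Set of character n-grams of a term, padded so that word edges count."""
    padded = " " * (n - 1) + term.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets of n-grams."""
    return len(a & b) / len(a | b)


def fuzzy_lookup(query, vocabulary, threshold=0.4):
    """Return (score, term) pairs above the threshold, best match first."""
    query_ngrams = char_ngrams(query)
    scored = [(jaccard(query_ngrams, char_ngrams(term)), term) for term in vocabulary]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


print(fuzzy_lookup("asthme", ["asthma", "diabetes"]))
```

Lowering the threshold tolerates more spelling variation at the cost of more false positives.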
Quick UMLS Matcher#
Important
QuickUMLSMatcher needs additional dependencies that can be installed with pip install medkit-lib[quick-umls-matcher]
QuickUMLSMatcher is a wrapper around the third-party quickumls.core.QuickUMLS, which requires a QuickUMLS install to work. A QuickUMLS install can be created with
python -m quickumls.install <umls_installation_path> <destination_path>
where <umls_installation_path> is the path to the UMLS folder containing the MRCONSO.RRF and MRSTY.RRF files.
You will also need to download the spacy models used by QuickUMLS. If a model is missing, a clear error message will show you how to install it. You may also install it programmatically.
Here are examples of downloads for English and French models:
import spacy
import spacy.cli
if not spacy.util.is_package("en_core_web_sm"):
    spacy.cli.download("en_core_web_sm")
if not spacy.util.is_package("fr_core_news_sm"):
    spacy.cli.download("fr_core_news_sm")
Given a medkit text document named doc with the text "The patient has asthma":
from medkit.text.ner.quick_umls_matcher import QuickUMLSMatcher
umls_matcher = QuickUMLSMatcher(version="2021AB", language="ENG")
entities = umls_matcher.run([doc.raw_segment])
The entity (entities[0]) will have the following description:
entity.text = “asthma”
entity.spans = [Span(16, 22)]
entity.label = “disorder”
Its normalization attribute (norm = entity.get_norms()[0]) will be:
norm is an instance of UMLSNormAttribute
norm.cui = _ASTHMA_CUI
norm.umls_version = “2021AB”
norm.term = “asthma”
norm.score = 1.0
norm.sem_types = [“T047”]
Note
For more details about public APIs, refer to quick_umls_matcher.
UMLS Matcher#
As an alternative to QuickUMLSMatcher, medkit also provides an entity matcher dedicated to UMLS terms that uses the simstring fuzzy matching algorithm but does not rely on QuickUMLS.
Note
For more details about public APIs, refer to umls_matcher.
Duckling Matcher#
medkit provides an entity annotator that uses Duckling.
Refer to DucklingMatcher for more details about the requirements for using this operation.
Note
For more details about public APIs, refer to duckling_matcher.
EDS-NLP Date Matcher#
The EDS-NLP dates pipeline can be used directly inside medkit to identify date and duration mentions in texts.
Important
EDSNLPDateMatcher needs additional dependencies that can be installed with pip install medkit-lib[edsnlp]
Note
For more details about public APIs, refer to edsnlp_date_matcher.
Hugging Face Entity Matcher#
medkit provides an entity matcher based on Hugging Face models.
Important
HFEntityMatcher needs additional dependencies that can be installed with pip install medkit-lib[hf-entity-matcher]
Note
For more details about public APIs, refer to hf_entity_matcher.
UMLS Coder Normalizer#
This operation is not an entity matcher per se but a normalizer that adds normalization attributes to pre-existing entities.
Important
UMLSCoderNormalizer needs additional dependencies that can be installed with pip install medkit-lib[umls-coder-normalizer]
Note
For more details about public APIs, refer to umls_coder_normalizer.
UMLS Normalization#
This module provides a subclass of EntityNormAttribute to facilitate the handling of UMLS information.
Note
For more details, refer to umls_norm_attribute.
Spacy modules#
medkit provides operations and utilities for wrapping spacy pipelines into medkit. They are part of the medkit.text.spacy package.
Important
To use this python module, you need to install spacy.
These dependencies may be installed with pip install medkit-lib[spacy]
Spacy pipelines#
The SpacyPipeline component is an annotation-level operation. It takes medkit segments as inputs, runs a spacy pipeline, and returns medkit segments by converting spacy outputs.
The SpacyDocPipeline component is a document-level operation, similar to DocPipeline. It takes medkit documents as inputs, runs a spacy pipeline, and directly attaches the spacy annotations to the medkit document.
Note
For more info about displacy helpers, refer to displacy_utils.
Translation operations#
Note
For the translation API, refer to translation.
HuggingFace Translator#
Important
HFTranslator needs additional dependencies that can be installed with pip install medkit-lib[hf-translator]
Extraction of syntactic relations#
This module detects syntactic relations between entities using a dependency parser.
Note
For more info about this module, refer to syntactic_relation_extractor.
Post-processing modules#
medkit provides some modules to facilitate post-processing operations.
For the moment, you can use this module to:
align source and target Segments from the same TextDocument
duplicate attributes between segments. For example, you can duplicate an attribute from a sentence to its entities.
filter overlapping entities: useful when creating named entity recognition (NER) datasets
create mini-documents from a TextDocument.
Examples
Creating mini-documents from sections: document splitter
Note
For more details about public API, refer to postprocessing.
Metrics#
This module provides components to evaluate annotations as well as some implementations of MetricsComputer to monitor the training of components in medkit.
The components inside metrics are also known as evaluators. An evaluator allows you to assess performance by task.
Note
For more details about public APIs, refer to metrics.
Text Classification Evaluation#
medkit provides TextClassificationEvaluator, an evaluator for document attributes. You can compute the following metrics depending on your use case:
Classification report#
compute_classification_report: To compare a list of reference and predicted documents. This method uses sklearn as backend to compute precision, recall, and F1-score.
Inter-rater agreement#
compute_cohen_kappa: To compare the degree of agreement between lists of documents made by two annotators.
compute_krippendorff_alpha: To compare the degree of agreement between lists of documents made by multiple annotators.
Note
For more details about public API, refer to TextClassificationEvaluator or irr_utils.
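For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it on two plain lists of labels. It illustrates the formula only: the function name and input format are assumptions, and medkit's compute_cohen_kappa works on documents and may handle edge cases differently.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same, non-empty set of items")
    n = len(labels_a)
    # observed agreement: proportion of items given the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

In this example the annotators agree on 3 items out of 4 (p_o = 0.75) while chance agreement is 0.5, which gives a kappa of 0.5.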
NER Evaluation#
medkit uses seqeval as the evaluation backend.
Important
This module needs additional dependencies that can be installed with pip install medkit-lib[metrics-ner]
Entity detection#
An example with a perfect match:
The document has two entities: PER and GPE.
An operation has detected both entities.
from medkit.core.text import TextDocument, Entity, Span
from medkit.text.metrics.ner import SeqEvalEvaluator
document = TextDocument(
    "Marie lives in Paris",
    anns=[
        Entity(label="PER", spans=[Span(0, 5)], text="Marie"),
        Entity(label="GPE", spans=[Span(15, 20)], text="Paris"),
    ],
)
pred_ents = [
    Entity(label="PER", spans=[Span(0, 5)], text="Marie"),
    Entity(label="GPE", spans=[Span(15, 20)], text="Paris"),
]
# define an evaluator using `iob2` as tagging scheme
evaluator = SeqEvalEvaluator(tagging_scheme="iob2")
metrics = evaluator.compute(documents=[document], predicted_entities=[pred_ents])
assert metrics["macro_precision"] == 1.0
print(metrics)
{'macro_precision': 1.0, 'macro_recall': 1.0, 'macro_f1-score': 1.0, 'support': 2, 'accuracy': 1.0, 'GPE_precision': 1.0, 'GPE_recall': 1.0, 'GPE_f1-score': 1.0, 'GPE_support': 1, 'PER_precision': 1.0, 'PER_recall': 1.0, 'PER_f1-score': 1.0, 'PER_support': 1}
Note
For more details about public APIs, refer to SeqEvalEvaluator.
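seqeval scores sequences of tags such as B-PER or I-PER rather than character spans, so entities have to be converted to tags first. The sketch below shows the iob2 scheme on whitespace tokens. It is an illustration only: `to_iob2` is a hypothetical helper, and the way medkit splits text before tagging is not shown here and may differ.

```python
import re


def to_iob2(text, entities):
    """Tag whitespace tokens with the iob2 scheme.

    entities is a list of (label, start, end) character ranges.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for label, start, end in entities:
            if match.start() >= start and match.end() <= end:
                # B- marks the first token of an entity, I- the following ones
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = to_iob2("Marie lives in Paris", [("PER", 0, 5), ("GPE", 15, 20)])
print(list(zip(tokens, tags)))
```

Comparing the reference tag sequence with the predicted one, token by token and entity by entity, is what produces the precision, recall, and F1 values reported by the evaluator.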
Using for training of NER components#
For example, a trainable component detects PER and GPE entities using iob2 as tagging scheme. The Trainer may compute metrics during its training/evaluation loop.
from medkit.text.metrics.ner import SeqEvalMetricsComputer
from medkit.training import Trainer
seqeval_mc = SeqEvalMetricsComputer(
    # maps each tag id to its label, as the parameter name indicates
    id_to_label={0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-GPE', 4: 'I-GPE'},
    tagging_scheme="iob2"
)
trainer = Trainer(
    ...
    metrics_computer=seqeval_mc
    ...
)
Note
For more details about public APIs, refer to SeqEvalMetricsComputer. About training, refer to the training API.
Hint
There is a utility to convert labels to NER tags if required: hf_tokenization_utils.
See also
You may refer to this tutorial to see how this works in a fine-tuning example.