Using EDS-NLP with medkit#

EDS-NLP provides a set of spaCy components that are used to extract information from clinical notes written in French. Because medkit is spaCy-compatible, using EDS-NLP within medkit is supported, as we will see.

To follow this tutorial, you will need to install medkit spaCy support and EDS-NLP with

pip install 'medkit-lib[edsnlp]'

Running an EDS-NLP spaCy pipeline on entire documents#

We will need a sample text document to annotate:

from medkit.core.text import TextDocument

text = """COMPTE RENDU D'HOSPITALISATION
Monsieur Jean Dupont a été hospitalisé du 11/08/2019 au 17/08/2019 pour attaque d'asthme

ANTÉCÉDENTS
Peut-être atteint de Covid19 en aout 2020"""
doc = TextDocument(text)

and a spaCy pipeline with a few EDS-NLP components:

import spacy

nlp = spacy.blank("eds")

# General-purpose components
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")
# Entity extraction
nlp.add_pipe("eds.covid")
nlp.add_pipe("eds.dates")
# Context detection
nlp.add_pipe("eds.negation")
nlp.add_pipe("eds.hypothesis")

The eds.normalizer and eds.sentences components do some pre-processing, eds.covid and eds.dates perform entity matching and create some spaCy entities and spans, and eds.negation and eds.hypothesis attach some context attributes to these entities and spans.

To be used within medkit, the pipeline could be wrapped into a generic SpacyDocPipeline operation. But medkit also provides a dedicated EDSNLPDocPipeline operation, with some additional support for specific EDS-NLP components:

from medkit.text.spacy.edsnlp import EDSNLPDocPipeline

eds_nlp_pipeline = EDSNLPDocPipeline(nlp)

The operation is executed by applying its run() method on a list of documents:

eds_nlp_pipeline.run([doc])

Let’s look at the entities and segments that were found:

for entity in doc.anns.entities:
    print(f"{entity.label}: {entity.text!r}")
for segment in doc.anns.segments:
    print(f"{segment.label}: {segment.text!r}")
covid: 'Covid19'
dates: '11/08/2019'
dates: '17/08/2019'
dates: 'aout 2020'

Here are the attributes attached to the "covid" entity:

entity = doc.anns.get_entities(label="covid")[0]
for attr in entity.attrs:
    print(f"{attr.label}={attr.value}")
negation=False
hypothesis=True

and the attributes of the first "dates" segment:

date_seg = doc.anns.get_segments(label="dates")[0]
for attr in date_seg.attrs:
    print(f"{attr.label}={attr.value}")
date=2019-08-11

Let’s now examine more closely the "date" attribute:

date_seg = doc.anns.get_segments(label="dates")[0]
date_attr = date_seg.attrs.get(label="date")[0]
date_attr
DateAttribute(label='date', value='2019-08-11', metadata={}, uid='c0443704-8e23-11ee-85fd-0242ac110002', year=2019, month=8, day=11, hour=None, minute=None, second=None)

This attribute is an instance of DateAttribute, a subclass of Attribute.It has year, month, day (etc) fields containing the different parts of the date that was detected, as well as a normalized string representation in its value field:

date_attr.value
'2019-08-11'

One of the benefits of using EDSNLPDocPipeline instead of SpacyDocPipeline is that some special EDS-NLP attributes are automatically converted to a corresponding Attribute subclass.

Here are the supported EDS-NLP attributes values and the corresponding medkit classes:

Note

The transformations performed by EDSNLPDocPipeline can be overriden or extended with the medkit_attribute_factories init parameter. For a list of all the default transformations, see DEFAULT_ATTRIBUTE_FACTORIES and corresponding functions in medkit.text.spacy.edsnlp.

Running an EDL-NLP spaCy pipeline at the annotation level#

So far, we have wrapped a spaCy pipeline and executed it on an entire document with EDSNLPDocPipeline. But it is also possible to run the spaCy pipeline on text annotations instead of a document with EDSNLPPipeline. To illustrate this, let’s create a medkit pipeline using pure medkit operations for sentence tokenization and entity matching, and EDS-NLP spaCy components for covid entity matching:

from medkit.core import Pipeline, PipelineStep
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.segmentation import SentenceTokenizer
from medkit.text.spacy.edsnlp import EDSNLPPipeline

sentence_tokenizer = SentenceTokenizer()
matcher = RegexpMatcher(rules=[RegexpMatcherRule(regexp=r"\basthme\b", label="asthme")])

nlp = spacy.blank("eds")
nlp.add_pipe("eds.covid")
eds_nlp_pipeline = EDSNLPPipeline(nlp)

pipeline = Pipeline(
    steps=[
        PipelineStep(operation=sentence_tokenizer, input_keys=["full_text"], output_keys=["sentences"]),
        PipelineStep(operation=matcher, input_keys=["sentences"], output_keys=["entities"]),
        PipelineStep(operation=eds_nlp_pipeline, input_keys=["sentences"], output_keys=["entities"]),
    ],
    input_keys=["full_text"],
    output_keys=["entities"],
)
doc = TextDocument(text)
entities = pipeline.run([doc.raw_segment])
for entity in entities:
    print(f"{entity.label}: {entity.text!r}")
asthme: 'asthme'
covid: 'Covid19'

For more information about advanced usage of EDSNLPDocPipeline and EDSNLPPipeline, you may refer to the API doc of medkit.text.spacy.edsnlp.