Cleaning text with a predefined operation

Medkit allows us to transform and clean up text without destroying the original text. We could, for example, implement a set of clean-up steps within the run method of an operation to pre-process raw text.

In this example, we use the predefined EDSCleaner operation to show how cleaning works in medkit. This operation was designed for French documents whose formatting was damaged by earlier conversion steps.

Loading a text to clean

Consider the following document:

# You can download the file available in source code
# !wget https://raw.githubusercontent.com/TeamHeka/medkit/main/docs/examples/input/text/text_to_clean.txt

from pathlib import Path
from medkit.core.text import TextDocument

doc = TextDocument.from_file(Path("./input/text/text_to_clean.txt"))
print(doc.text)

Objet : Compte-Rendu de Consultation 
 			du 20/12/1985
M.Machin, âgé de 65 ans, né le 10/10/1930, a été vu en 
		consultation dans le
service	 d'hépatologie avec le Dr. Bernand.
Motif

Suivi de cirrhose éthylique
Traitement en cours
aucun
Evolution depuis la dernière consultation
FOGD:

 VO de grade I/II sans
  signe rouge
va très bien, a arrêté le    

régime 		sans   sel et les diurétiques
(fait 		attention   au sel dans son régime) malgré tout
abstinence totale en alcool
BIO: plqt 227000 TP 72% albu 33 bili 10 AFP 6
Examen clinique
Le 20/12/1985 : Poids : 82.45 kg, Taille : 182.50 cm, IMC : 22.7 kg/m, SC : 1.7 m
Pas d'ictère
pas	 d'ascite
hépatomégalie dure au depens du foie gauche
OMI légers bilatéraux
Conclusion 	Cirrhose 	compensée Child A
VO de grade II prophylaxie par bêtabloquants, information (attention à une poussée de psoriasis qui pourrait
être déclenchée) donnée au patient.
OMI: reprise du lasilix à faibles doses
Traitement à l'issue de la consultation
Médicaments
Lasilix 40 mg le matin
Propranolol LP 160mg: 1 cp le matin

As we can see, the text has:

additional spaces;
multiple newline characters;
long parenthetical passages and numbers in English format (decimal points instead of commas).

This complicates text segmentation, so it is a good idea to clean up the text before segmenting it or creating annotations.

Using EDSCleaner operation

As mentioned before, you can create your own custom cleanup operation. Here, we use the predefined operation designed for French documents (coming from the EDS) to format the document.

The main idea is to transform the raw_segment and keep track of the modifications made by the operation. That segment is defined using the span of the text.

A span in medkit

In medkit, the span of an annotation is a list of simple spans (Span) and modified spans (ModifiedSpan). With this mechanism, we keep track of the modifications and can return to the original version whenever we want.
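
To make the idea concrete, here is a minimal sketch in plain Python. The Span and ModifiedSpan classes below are simplified stand-ins written for this illustration (they only borrow the field names that appear in medkit's printed spans), and replace_char and original_text are hypothetical helpers, not medkit functions.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Span:
    """Simplified stand-in: a range of characters in the raw text."""
    start: int
    end: int


@dataclass
class ModifiedSpan:
    """Simplified stand-in: `length` new characters replacing some raw spans."""
    length: int
    replaced_spans: List[Span]


AnySpan = Union[Span, ModifiedSpan]


def replace_char(raw: str, index: int, new_char: str) -> Tuple[str, List[AnySpan]]:
    """Replace one character and describe the result relative to the raw text."""
    new_text = raw[:index] + new_char + raw[index + 1:]
    spans: List[AnySpan] = [
        Span(0, index),  # untouched prefix
        ModifiedSpan(length=len(new_char), replaced_spans=[Span(index, index + 1)]),
        Span(index + 1, len(raw)),  # untouched suffix
    ]
    return new_text, spans


def original_text(raw: str, spans: List[AnySpan]) -> str:
    """Rebuild the raw text covered by a list of spans."""
    parts = []
    for sp in spans:
        sources = [sp] if isinstance(sp, Span) else sp.replaced_spans
        parts.extend(raw[s.start:s.end] for s in sources)
    return "".join(parts)


raw = "M.Machin"
clean, spans = replace_char(raw, 1, " ")
print(clean)                      # the modified text
print(original_text(raw, spans))  # the raw text, recovered from the spans
```

The modified text and its spans travel together, so the raw characters behind any modification can always be looked up again.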

The EDSCleaner is configurable. Here we set keep_endlines=True to make the output easier to read; otherwise, the output segment would be plain text with no newline (\n) characters.

from medkit.text.preprocessing import EDSCleaner

eds_cleaner = EDSCleaner(keep_endlines=True)
raw_segment = doc.raw_segment
clean_segment = eds_cleaner.run([raw_segment])[0]
print(clean_segment.text)

Objet : Compte-Rendu de Consultation du 20/12/1985.
M Machin, âgé de 65 ans, né le 10/10/1930, a été vu en consultation dans le service d'hépatologie avec le Dr  Bernand..
Motif.
Suivi de cirrhose éthylique.
Traitement en cours aucun.
Evolution depuis la dernière consultation.
FOGD:.
 VO de grade I/II sans signe rouge va très bien, a arrêté le régime sans sel et les diurétiques malgré tout abstinence totale en alcool ; fait attention au sel dans son régime.
BIO: plqt 227000 TP 72% albu 33 bili 10 AFP 6.
Examen clinique.
Le 20/12/1985 : Poids : 82,45 kg, Taille : 182,50 cm, IMC : 22,7 kg/m, SC : 1,7 m.
Pas d'ictère pas d'ascite hépatomégalie dure au depens du foie gauche.
OMI légers bilatéraux.
Conclusion Cirrhose compensée Child A.
VO de grade II prophylaxie par bêtabloquants, information donnée au patient ; attention à une poussée de psoriasis qui pourrait être déclenchée..
OMI: reprise du lasilix à faibles doses.
Traitement à l'issue de la consultation.
Médicaments.
Lasilix 40 mg le matin.
Propranolol LP 160mg: 1 cp le matin.

The class works on Segments. In its run method, it performs several operations to delete or change characters of interest. By default, it performs these operations:

Changes dots between uppercase letters to spaces.
Changes dots between numbers to commas.
Deletes multiple newline characters.
Deletes multiple whitespaces.
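
The four default operations can be approximated with regular expressions. The sketch below is only an illustration of the kind of rewriting involved: rough_clean is a hypothetical helper, the patterns are assumptions rather than the ones EDSCleaner uses, and, unlike EDSCleaner, it does not keep track of spans.

```python
import re


def rough_clean(text: str) -> str:
    """Approximate the four default clean-up steps (no span tracking)."""
    # Dots between uppercase letters become spaces: "M.Machin" -> "M Machin"
    text = re.sub(r"(?<=[A-Z])\.(?=[A-Z])", " ", text)
    # Dots between digits become commas: "82.45" -> "82,45"
    text = re.sub(r"(?<=\d)\.(?=\d)", ",", text)
    # A newline plus any surrounding blank space collapses to one newline
    text = re.sub(r"[ \t]*\n\s*", "\n", text)
    # Runs of spaces and tabs collapse to a single space
    text = re.sub(r"[ \t]+", " ", text)
    return text


sample = "M.Machin  pèse 82.45 kg\n\n\nPas\t d'ictère"
print(rough_clean(sample))
```

Because the result is a plain string, nothing links it back to the raw text; that link is exactly what the span mechanism adds.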

Note

There are two special operations that process parentheses and dots near French keywords such as Dr., Mme. and others. To enable/disable these operations you can use handle_parentheses_eds and handle_points_eds.

Extract text from the clean text

Now that we have a clean segment, we can run an operation on the new segment. We can detect the sentences, for example.

from medkit.text.segmentation import SentenceTokenizer

sentences = SentenceTokenizer().run([clean_segment])
for sent in sentences:
    print(f"{sent.text!r}")

'Objet : Compte-Rendu de Consultation du 20/12/1985'
"M Machin, âgé de 65 ans, né le 10/10/1930, a été vu en consultation dans le service d'hépatologie avec le Dr  Bernand"
'Motif'
'Suivi de cirrhose éthylique'
'Traitement en cours aucun'
'Evolution depuis la dernière consultation'
'FOGD:'
'VO de grade I/II sans signe rouge va très bien, a arrêté le régime sans sel et les diurétiques malgré tout abstinence totale en alcool '
'fait attention au sel dans son régime'
'BIO: plqt 227000 TP 72% albu 33 bili 10 AFP 6'
'Examen clinique'
'Le 20/12/1985 : Poids : 82,45 kg, Taille : 182,50 cm, IMC : 22,7 kg/m, SC : 1,7 m'
"Pas d'ictère pas d'ascite hépatomégalie dure au depens du foie gauche"
'OMI légers bilatéraux'
'Conclusion Cirrhose compensée Child A'
'VO de grade II prophylaxie par bêtabloquants, information donnée au patient '
'attention à une poussée de psoriasis qui pourrait être déclenchée'
'OMI: reprise du lasilix à faibles doses'
"Traitement à l'issue de la consultation"
'Médicaments'
'Lasilix 40 mg le matin'
'Propranolol LP 160mg: 1 cp le matin'

A created sentence in detail

The spans of each generated sentence contain the modifications made by the eds_cleaner object. Let’s look at the second sentence:

sentence = sentences[1]
print(f"text={sentence.text!r}")
print("spans=\n" + "\n".join(f"{sp}" for sp in sentence.spans))

text="M Machin, âgé de 65 ans, né le 10/10/1930, a été vu en consultation dans le service d'hépatologie avec le Dr  Bernand"
spans=
Span(start=56, end=57)
ModifiedSpan(length=1, replaced_spans=[Span(start=57, end=58)])
Span(start=58, end=110)
ModifiedSpan(length=1, replaced_spans=[Span(start=110, end=111), Span(start=111, end=112), Span(start=112, end=114)])
Span(start=114, end=134)
ModifiedSpan(length=1, replaced_spans=[Span(start=134, end=135)])
Span(start=135, end=142)
ModifiedSpan(length=1, replaced_spans=[Span(start=142, end=144)])
Span(start=144, end=168)
ModifiedSpan(length=1, replaced_spans=[Span(start=168, end=169)])
Span(start=169, end=177)

The sentence starts with the character M (index 56), followed by a dot that has been replaced by a space (index 57). The text from there up to the trailing whitespace at the end of the line has not been modified, so it corresponds to an original span (index 58 to 110). Each later modification is recorded as a ModifiedSpan object, up to the end of the sentence at character index 177.
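
As a quick sanity check, the spans printed above can be added up by hand. The snippet below copies them into plain tuples (a representation chosen for this illustration, not a medkit data structure) and compares the length of the cleaned sentence with the length of the raw text it covers.

```python
# Unmodified spans, as (start, end) pairs copied from the printed output
unchanged = [(56, 57), (58, 110), (114, 134), (135, 142), (144, 168), (169, 177)]

# Modified spans, as (length, replaced ranges) copied from the printed output
modified = [
    (1, [(57, 58)]),
    (1, [(110, 111), (111, 112), (112, 114)]),
    (1, [(134, 135)]),
    (1, [(142, 144)]),
    (1, [(168, 169)]),
]

unchanged_length = sum(end - start for start, end in unchanged)

# Length of the cleaned sentence: unchanged characters plus the replacements
clean_length = unchanged_length + sum(length for length, _ in modified)

# Length of the raw text: unchanged characters plus everything that was replaced
raw_length = unchanged_length + sum(
    end - start for _, replaced in modified for start, end in replaced
)

print(clean_length, raw_length)  # 117 121
```

The raw range 56 to 177 covers 121 characters, while the cleaned sentence is 117 characters long: the four-character gap comes from the runs of whitespace that were each collapsed into a single space.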

Displaying in the original text

Since the sentence contains the information from the original spans, it will always be possible to go back and display the information in the raw text.

To get the original spans, we can use normalize_spans(). Next, we can extract the raw text using extract().

from medkit.core.text.span_utils import normalize_spans, extract

spans_sentence = normalize_spans(sentence.spans)
ranges = [(s.start, s.end) for s in spans_sentence]
extracted_text, spans = extract(raw_segment.text, raw_segment.spans, ranges)
print(f"- Sentence in the ORIGINAL version:\n \"{extracted_text}\"")

- Sentence in the ORIGINAL version:
 "M.Machin, âgé de 65 ans, né le 10/10/1930, a été vu en 
		consultation dans le
service	 d'hépatologie avec le Dr. Bernand"
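
The normalization step can be pictured as flattening every modified span into the raw ranges it replaced, then merging ranges that touch. The sketch below works on plain (start, end) tuples; merge_ranges is a hypothetical helper written for this illustration, whereas medkit's normalize_spans operates on span objects.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges and merge those that touch or overlap."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Contiguous with (or overlapping) the previous range: extend it
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged


# Raw ranges behind the second sentence, with modified spans already
# flattened into the ranges they replaced
sentence_ranges = [
    (56, 57), (57, 58), (58, 110),
    (110, 111), (111, 112), (112, 114),
    (114, 134), (134, 135), (135, 142),
    (142, 144), (144, 168), (168, 169), (169, 177),
]

print(merge_ranges(sentence_ranges))    # [(56, 177)]
print(merge_ranges([(0, 3), (5, 8)]))   # [(0, 3), (5, 8)]
```

For this sentence everything collapses into a single raw range, which is why the extracted text is one continuous slice of the original document.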

That’s how an operation transforms text and extracts information without losing the raw text.
