Brat integration

Brat is a web-based text annotation tool that stores annotations in a standoff format. Medkit can import and export Brat files containing the following annotation types:

  • Entities

  • Relations

  • Attributes

Annotations of other types are ignored during conversion.

This example shows how to import Brat-annotated files into medkit and how to convert medkit documents into Brat-annotated collections.

Consider this text file:

# You can download the file available in source code
# !wget https://raw.githubusercontent.com/TeamHeka/medkit/main/docs/examples/input/brat/doc_01.txt

from pathlib import Path

print(Path("./input/brat/doc_01.txt").read_text(encoding="utf-8"))
Le patient est prescrit du Lisinopril parce qu'il souffre d'hypertension.
Le patient avait une déficience en vitamines A et B.

It has the following brat annotation file:

# You can download the file available in source code
# !wget https://raw.githubusercontent.com/TeamHeka/medkit/main/docs/examples/input/brat/doc_01.ann

print(Path("./input/brat/doc_01.ann").read_text(encoding="utf-8"))
T1	medication 27 37	Lisinopril
T2	disease 60 72	hypertension
T3	disease 95 105	déficience
A1	antecedent T3
T4	vitamin 109 120	vitamines A
T5	vitamin 109 118;124 125	vitamines B
R1	treats Arg1:T1 Arg2:T2	
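
To make the standoff format more concrete, here is a minimal sketch of how the lines above could be read with plain Python. It is not medkit's parser: the `parse_ann` helper and the dictionary layout are assumptions made for illustration, and only the three supported line types (`T` entities, `A` attributes, `R` relations) are handled.

```python
# Minimal sketch of a standoff (.ann) reader; illustrative only, not medkit's parser.
ANN = (
    "T1\tmedication 27 37\tLisinopril\n"
    "T2\tdisease 60 72\thypertension\n"
    "T3\tdisease 95 105\tdéficience\n"
    "A1\tantecedent T3\n"
    "T4\tvitamin 109 120\tvitamines A\n"
    "T5\tvitamin 109 118;124 125\tvitamines B\n"
    "R1\ttreats Arg1:T1 Arg2:T2\t\n"
)


def parse_ann(content):
    """Group standoff lines into entities, attributes and relations."""
    entities, attributes, relations = {}, {}, {}
    for line in content.splitlines():
        if not line.strip():
            continue
        ann_id, body, *rest = line.split("\t")
        if ann_id.startswith("T"):
            # "label start end[;start end...]", then the covered text
            label, offsets = body.split(" ", 1)
            spans = [tuple(int(i) for i in frag.split()) for frag in offsets.split(";")]
            entities[ann_id] = {"label": label, "spans": spans, "text": rest[0]}
        elif ann_id.startswith("A"):
            # "label target [value]"; binary attributes carry no value
            label, target, *value = body.split(" ")
            attributes[ann_id] = {
                "label": label,
                "target": target,
                "value": value[0] if value else True,
            }
        elif ann_id.startswith("R"):
            # "label Arg1:Tx Arg2:Ty"
            label, arg1, arg2 = body.split(" ")
            relations[ann_id] = {
                "label": label,
                "subj": arg1.split(":")[1],
                "obj": arg2.split(":")[1],
            }
    return entities, attributes, relations


entities, attributes, relations = parse_ann(ANN)
print(len(entities), len(attributes), len(relations))
```

Note that `T5` is a discontinuous entity: its offsets are two fragments separated by `;`.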

Load brat into a list of TextDocuments

To load Brat files, medkit provides the BratInputConverter class. This converter returns a list of TextDocument objects.

Tip

You can enable provenance tracing by assigning a ProvTracer object to the BratInputConverter with the set_prov_tracer() method.

from medkit.io.brat import BratInputConverter

# Define Input Converter 
brat_converter = BratInputConverter()

# Load brat into a list of documents
docs = brat_converter.load(dir_path="./input/brat")
medkit_doc = docs[0]

# Explore annotations
print(f"The document has {len(medkit_doc.anns)} annotations")
entities_disease = medkit_doc.anns.get(label="disease")
print(f"Where {len(entities_disease)} annotations have 'disease' as label")
The document has 6 annotations
Where 2 annotations have 'disease' as label

Visualize entity information

The created document contains the annotations defined in the brat annotation file. For example, we can display the label, spans and text of each entity.

for entity in medkit_doc.anns.get_entities():
    print(f"label={entity.label}, spans={entity.spans}, text={entity.text!r}")
label=medication, spans=[Span(start=27, end=37)], text='Lisinopril'
label=disease, spans=[Span(start=60, end=72)], text='hypertension'
label=disease, spans=[Span(start=95, end=105)], text='déficience'
label=vitamin, spans=[Span(start=109, end=120)], text='vitamines A'
label=vitamin, spans=[Span(start=109, end=118), ModifiedSpan(length=1, replaced_spans=[]), Span(start=124, end=125)], text='vitamines B'
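
The last entity is discontinuous: it covers two fragments of the text, and the ModifiedSpan of length 1 between them appears to stand for the single space that joins the fragments in the entity text. The sketch below reproduces that joining with plain string slicing. The `text_from_fragments` helper is hypothetical and is not part of medkit.

```python
# Text of doc_01.txt, as printed earlier on this page
TEXT = (
    "Le patient est prescrit du Lisinopril parce qu'il souffre d'hypertension.\n"
    "Le patient avait une déficience en vitamines A et B."
)


def text_from_fragments(text, fragments):
    """Join the text of each (start, end) fragment with a single space."""
    return " ".join(text[start:end] for start, end in fragments)


# Contiguous entity T4
print(text_from_fragments(TEXT, [(109, 120)]))
# Discontinuous entity T5: two fragments joined by one inserted space
print(text_from_fragments(TEXT, [(109, 118), (124, 125)]))
```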

Save TextDocuments to Brat

To save a list of TextDocument objects in Brat format, you can use BratOutputConverter.

You can choose which medkit annotations and attributes to keep in the resulting Brat collection. By default, anns_labels and attrs are both set to None, so all annotations and attributes are written to the generated files.

If you also want to include the segments in the brat collection, set the ignore_segments parameter to False.

Automatic configuration of annotations

Brat is configured through text-based configuration files. It uses four types of configuration, but only the annotation type configuration is required (cf. brat configuration).

To facilitate integration and ensure correct visualisation, medkit automatically generates an annotation.conf for each collection.
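
As a rough idea of what such a file contains, the sketch below builds a minimal annotation.conf by hand for the labels of the example document. It follows the section layout described in the brat configuration documentation (`[entities]`, `[relations]`, `[events]`, `[attributes]`), but the `build_annotation_conf` helper is hypothetical and the file medkit actually generates may differ in content and ordering.

```python
def build_annotation_conf(entity_labels, relations, attributes):
    """Render a minimal brat annotation.conf as a string (illustrative only).

    relations:  {label: (arg1_entity_label, arg2_entity_label)}
    attributes: {label: (target_entity_label, [values])}
                an empty list of values means a binary attribute
    """
    lines = ["[entities]", *entity_labels, ""]
    lines.append("[relations]")
    for label, (arg1, arg2) in relations.items():
        lines.append(f"{label}\tArg1:{arg1}, Arg2:{arg2}")
    # The example document has no events, so that section stays empty
    lines += ["", "[events]", "", "[attributes]"]
    for label, (target, values) in attributes.items():
        line = f"{label}\tArg:{target}"
        if values:
            line += ", Value:" + "|".join(values)
        lines.append(line)
    return "\n".join(lines) + "\n"


conf = build_annotation_conf(
    entity_labels=["medication", "disease", "vitamin"],
    relations={"treats": ("medication", "disease")},
    attributes={"antecedent": ("disease", [])},
)
print(conf)
```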

from medkit.io.brat import BratOutputConverter

# Define the output converter with default params:
# all annotations and attributes are transferred
brat_output_converter = BratOutputConverter()

# Save the medkit documents in `dir_path`
brat_output_converter.save(docs, dir_path="./brat_out", doc_names=["doc_1"])

The collection is saved on disk and includes the following files:

  • doc_1.txt: text of medkit document

  • doc_1.ann: brat annotation file

  • annotation.conf: annotation type configuration

By default, each file is named after the document_id; you can change this with the doc_names parameter.

Note

Since the values of attributes in brat must be declared in the configuration, medkit writes only the top 50 values of each attribute. If you want more values to appear in the configuration, you can change top_values_by_attr in the brat output converter.
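
To illustrate the idea of keeping only the top values, here is a small sketch that assumes "top" means "most frequent". The `top_values` helper and the sample values are hypothetical; this is not medkit's implementation.

```python
from collections import Counter


def top_values(values, top_n=50):
    """Return the top_n most frequent values, most frequent first."""
    counts = Counter(str(value) for value in values)
    return [value for value, _ in counts.most_common(top_n)]


# Hypothetical values observed for one attribute across a collection
observed = ["low", "high", "low", "medium", "low", "high"]
print(top_values(observed, top_n=2))
```

With a limit of 2, the least frequent value ("medium") would be left out of the configuration in this sketch.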

See also

Brat IO module.