medkit.tools.e3c_corpus

This module provides facilities for accessing data from the E3C corpus.

Version: 2.0.0

License: The E3C corpus is released under a Creative Commons NonCommercial license (CC BY-NC).

GitHub: https://github.com/hltfbk/E3C-Corpus

Reference

B. Magnini, B. Altuna, A. Lavelli, M. Speranza, and R. Zanoli. 2020. The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases. In Proceedings of the Seventh Italian Conference on Computational Linguistics, Bologna, Italy, December. Associazione Italiana di Linguistica Computazionale.

Functions:

convert_data_annotation_to_medkit(dir_path, ...)

Convert E3C corpus data annotation to a medkit jsonl file.

convert_data_collection_to_medkit(dir_path, ...)

Convert E3C corpus data collection to a medkit jsonl file.

load_annotated_document(filepath[, ...])

Load an E3C corpus annotated document (XML document) as a medkit text document.

load_data_annotation(dir_path[, encoding, ...])

Load the E3C corpus data annotation as medkit text documents.

load_data_collection(dir_path[, encoding])

Load the E3C corpus data collection as medkit text documents.

load_document(filepath[, encoding])

Load an E3C corpus document (JSON document) as a medkit text document.

Data:

CLINENTITY_LABEL

Label used by medkit for annotated clinical entities of the E3C corpus.

SENTENCE_LABEL

Label used by medkit for annotated sentences of the E3C corpus.

load_document(filepath, encoding='utf-8')

Load an E3C corpus document (JSON document) as a medkit text document, for example one from the data collection folder. The document id is always kept in the medkit document metadata.

Parameters
  • filepath (Union[str, Path]) – The path to the json file of the E3C corpus

  • encoding (str) – The encoding of the file. Default: ‘utf-8’

Return type

TextDocument

Returns

TextDocument – The corresponding medkit text document
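As a usage sketch, the helper below loads the first JSON file found in a data collection folder. The helper name `load_first_document` and the `*.json` file pattern are assumptions made for illustration; only `load_document` and its parameters come from this module.

```python
from pathlib import Path


def load_first_document(dir_path, encoding="utf-8"):
    """Load the first JSON file of a folder with load_document, or return None.

    Hypothetical helper: the '*.json' pattern is an assumption about how
    the corpus files are named.
    """
    json_files = sorted(Path(dir_path).glob("*.json"))
    if not json_files:
        return None
    # Imported lazily so the helper can be defined without medkit installed.
    from medkit.tools.e3c_corpus import load_document

    return load_document(json_files[0], encoding=encoding)
```

Sorting the file list keeps the choice of "first" document deterministic across runs.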

load_data_collection(dir_path, encoding='utf-8')

Load the E3C corpus data collection as medkit text documents.

Parameters
  • dir_path (Union[Path, str]) – The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)

  • encoding (str) – The encoding of the files. Default: ‘utf-8’

Return type

Iterator[TextDocument]

Returns

Iterator[TextDocument] – An iterator on corresponding medkit text documents
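Since the function returns an iterator, it may be convenient to look at only the first few documents rather than materialising the whole collection. The sketch below assumes the iterator yields documents one at a time; `take` and `preview_collection` are hypothetical helper names, not part of medkit.

```python
from itertools import islice


def take(docs, n=3):
    """Return at most n items from an iterator, leaving the rest unread."""
    return list(islice(docs, n))


def preview_collection(dir_path, n=3, encoding="utf-8"):
    """Load only the first n documents of a data collection directory."""
    # Imported lazily so `take` stays usable without medkit installed.
    from medkit.tools.e3c_corpus import load_data_collection

    return take(load_data_collection(dir_path, encoding=encoding), n)
```

For instance, `preview_collection("/tmp/E3C-Corpus-2.0.0/data_collection/French/layer1")` would return a list of up to three text documents.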

convert_data_collection_to_medkit(dir_path, output_file, encoding='utf-8')

Convert E3C corpus data collection to a medkit jsonl file.

Parameters
  • dir_path (Union[Path, str]) – The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)

  • output_file (Union[str, Path]) – The medkit jsonl output file which will contain medkit text documents

  • encoding (Optional[str]) – The encoding of the files. Default: ‘utf-8’
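One possible way to convert several directories in a single pass is sketched below. It assumes that a language directory (such as `French`) holds one sub-directory per layer, which is inferred from the `layer1` example path rather than stated by this module; `plan_conversions` and `convert_language` are hypothetical helpers.

```python
from pathlib import Path


def plan_conversions(language_dir, output_dir):
    """Pair each sub-directory of language_dir with a jsonl output path."""
    language_dir, output_dir = Path(language_dir), Path(output_dir)
    # Plain files are ignored; only sub-directories are treated as layers.
    layer_dirs = sorted(p for p in language_dir.iterdir() if p.is_dir())
    return [(p, output_dir / f"{p.name}.jsonl") for p in layer_dirs]


def convert_language(language_dir, output_dir, encoding="utf-8"):
    """Convert every layer directory of one language to its own jsonl file."""
    # Imported lazily so the planning step works without medkit installed.
    from medkit.tools.e3c_corpus import convert_data_collection_to_medkit

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for layer_dir, output_file in plan_conversions(language_dir, output_dir):
        convert_data_collection_to_medkit(layer_dir, output_file, encoding=encoding)
```

Separating the planning step from the conversion makes it easy to inspect which output files would be written before running anything.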

load_annotated_document(filepath, encoding='utf-8', keep_sentences=False)

Load an E3C corpus annotated document (XML document) as a medkit text document, for example one from the data annotation folder. Each annotation id is always kept in the metadata of the corresponding medkit element.

Currently, only ‘CLINENTITY’ annotations are supported. ‘SENTENCE’ annotations can also be loaded by setting keep_sentences to True.

Parameters
  • filepath (Union[str, Path]) – The path to the xml file of the E3C corpus

  • encoding (str) – The encoding of the file. Default: ‘utf-8’

  • keep_sentences (bool) – Whether to load sentences into medkit documents.

Return type

TextDocument

Returns

TextDocument – The corresponding medkit text document
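To check what a loaded document contains, one could count its annotations per label. The sketch below assumes that the document exposes a `doc.anns.get(label=...)` accessor, as in recent medkit releases; if your version differs, adapt the call. The label values mirror the `CLINENTITY_LABEL` and `SENTENCE_LABEL` constants documented for this module, and `count_e3c_annotations` is a hypothetical helper.

```python
# Values documented for medkit.tools.e3c_corpus, mirrored here so the
# sketch runs without importing medkit.
CLINENTITY_LABEL = "disorder"
SENTENCE_LABEL = "sentence"


def count_e3c_annotations(doc):
    """Count clinical entities and sentences attached to a document.

    Assumption: doc.anns.get(label=...) returns the annotations that
    carry the given label.
    """
    return {
        CLINENTITY_LABEL: len(doc.anns.get(label=CLINENTITY_LABEL)),
        SENTENCE_LABEL: len(doc.anns.get(label=SENTENCE_LABEL)),
    }
```

With `keep_sentences=False`, the sentence count would be expected to stay at zero, since sentences are not loaded.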

load_data_annotation(dir_path, encoding='utf-8', keep_sentences=False)

Load the E3C corpus data annotation as medkit text documents.

Parameters
  • dir_path (Union[Path, str]) – The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)

  • encoding (str) – The encoding of the files. Default: ‘utf-8’

  • keep_sentences (bool) – Whether to load sentences into medkit documents.

Return type

Iterator[TextDocument]

Returns

Iterator[TextDocument] – An iterator on corresponding medkit text documents
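The example paths in this module follow the pattern `<corpus root>/<data_collection or data_annotation>/<language>/<layer>`. A small path builder along those lines might look as follows; `e3c_dir` and `load_annotated_layer` are hypothetical helpers, and the layout is inferred from the example paths only.

```python
from pathlib import Path

# The two top-level folders that appear in the example paths.
E3C_KINDS = ("data_collection", "data_annotation")


def e3c_dir(corpus_root, kind, language, layer):
    """Build <corpus_root>/<kind>/<language>/<layer> as a Path."""
    if kind not in E3C_KINDS:
        raise ValueError(f"kind must be one of {E3C_KINDS}, got {kind!r}")
    return Path(corpus_root) / kind / language / layer


def load_annotated_layer(corpus_root, language, layer, keep_sentences=False):
    """Load the annotated documents of one language and layer."""
    # Imported lazily so the path builder works without medkit installed.
    from medkit.tools.e3c_corpus import load_data_annotation

    dir_path = e3c_dir(corpus_root, "data_annotation", language, layer)
    return load_data_annotation(dir_path, keep_sentences=keep_sentences)
```

Validating `kind` up front helps avoid pointing an XML loader at the JSON folder, or the reverse.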

convert_data_annotation_to_medkit(dir_path, output_file, encoding='utf-8', keep_sentences=False)

Convert E3C corpus data annotation to a medkit jsonl file.

Parameters
  • dir_path (Union[Path, str]) – The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)

  • output_file (Union[str, Path]) – The medkit jsonl output file which will contain medkit text documents

  • encoding (Optional[str]) – The encoding of the files. Default: ‘utf-8’

  • keep_sentences (bool) – Whether to load sentences into medkit documents.
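After a conversion, a quick sanity check is to count the records in the output file. The sketch below relies only on the general JSON Lines convention of one JSON value per line; it assumes the medkit output follows that convention and makes no assumption about the fields inside each record. `count_jsonl_records` is a hypothetical helper.

```python
import json


def count_jsonl_records(jsonl_path, encoding="utf-8"):
    """Count the non-empty lines of a jsonl file, parsing each one as JSON."""
    count = 0
    with open(jsonl_path, encoding=encoding) as f:
        for line in f:
            if line.strip():
                # json.loads raises json.JSONDecodeError on a malformed line.
                json.loads(line)
                count += 1
    return count
```

Comparing this count with the number of XML files in the source directory gives a rough indication that every document was converted.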

SENTENCE_LABEL = 'sentence'

Label used by medkit for annotated sentences of the E3C corpus.

CLINENTITY_LABEL = 'disorder'

Label used by medkit for annotated clinical entities of the E3C corpus.