medkit.tools.e3c_corpus
This module provides facilities for accessing data from the E3C corpus.
Version: 2.0.0. License: The E3C corpus is released under a Creative Commons NonCommercial license (CC BY-NC).
Github: https://github.com/hltfbk/E3C-Corpus
Reference
B. Magnini, B. Altuna, A. Lavelli, M. Speranza, and R. Zanoli. 2020. The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases. In Proceedings of the Seventh Italian Conference on Computational Linguistics, Bologna, Italy, December. Associazione Italiana di Linguistica Computazionale.
Functions:
convert_data_annotation_to_medkit | Convert E3C corpus data annotation to medkit jsonl file. |
convert_data_collection_to_medkit | Convert E3C corpus data collection to medkit jsonl file. |
load_annotated_document | Load an E3C corpus annotated document (xml document) as medkit text document. |
load_data_annotation | Load the E3C corpus data annotation as medkit text documents. |
load_data_collection | Load the E3C corpus data collection as medkit text documents. |
load_document | Load an E3C corpus document (json document) as medkit text document. |
Data:
CLINENTITY_LABEL | Label used by medkit for annotated clinical entities of E3C corpus |
SENTENCE_LABEL | Label used by medkit for annotated sentences of E3C corpus |
- load_document(filepath, encoding='utf-8')
Load an E3C corpus document (json document) as a medkit text document, for example one from the data collection folder. The document id is always kept in the medkit document metadata.
- Parameters
filepath (Union[str, Path]) – The path to the json file of the E3C corpus
encoding (str) – The encoding of the file. Default: ‘utf-8’
- Return type
TextDocument
- Returns
TextDocument – The corresponding medkit text document
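A minimal usage sketch for load_document follows. It assumes medkit is installed and that the returned document exposes `text` and `metadata` attributes. The helper name `preview_document` and its `loader` parameter are hypothetical, added only so the sketch can be exercised without the corpus on disk.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple, Union


def preview_document(
    filepath: Union[str, Path],
    max_chars: int = 80,
    loader: Optional[Callable] = None,
) -> Tuple[str, dict]:
    """Return the first characters of a document and a copy of its metadata.

    `loader` is a hypothetical hook: any callable with the same signature
    as load_document can be passed in, e.g. for testing.
    """
    if loader is None:
        # Imported lazily so this sketch can be defined without medkit installed.
        from medkit.tools.e3c_corpus import load_document

        loader = load_document
    doc = loader(filepath, encoding="utf-8")
    # Assumption: the document has `text` (str) and `metadata` (mapping).
    return doc.text[:max_chars], dict(doc.metadata)


# Example call (the path is a placeholder, not a real corpus file):
# text, metadata = preview_document("path/to/one_document.json")
```

According to the description above, the returned metadata should contain the E3C document id; the exact key name is not stated here, so inspect the dictionary before relying on it.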
- load_data_collection(dir_path, encoding='utf-8')
Load the E3C corpus data collection as medkit text documents.
- Parameters
dir_path (Union[Path, str]) – The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
encoding (str) – The encoding of the files. Default: ‘utf-8’
- Return type
Iterator[TextDocument]
- Returns
Iterator[TextDocument] – An iterator over the corresponding medkit text documents
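Because load_data_collection returns an iterator, a caller can take only the first few documents instead of materializing the whole collection. The sketch below illustrates this with `itertools.islice`. The helper `first_documents` and its `loader` parameter are hypothetical names, and whether the underlying iterator reads files lazily is an assumption, not something stated above.

```python
from itertools import islice
from pathlib import Path
from typing import Callable, List, Optional, Union


def first_documents(
    dir_path: Union[str, Path],
    n: int = 5,
    loader: Optional[Callable] = None,
) -> List:
    """Return at most `n` documents from a data collection directory."""
    if loader is None:
        # Imported lazily so this sketch can be defined without medkit installed.
        from medkit.tools.e3c_corpus import load_data_collection

        loader = load_data_collection
    # islice stops requesting items from the iterator once n have been taken.
    return list(islice(loader(dir_path), n))


# Example call, using the directory layout from the parameter description:
# docs = first_documents("/tmp/E3C-Corpus-2.0.0/data_collection/French/layer1", n=3)
```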
- convert_data_collection_to_medkit(dir_path, output_file, encoding='utf-8')
Convert E3C corpus data collection to medkit jsonl file.
- Parameters
dir_path (Union[Path, str]) – The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
output_file (Union[str, Path]) – The medkit jsonl output file which will contain medkit text documents
encoding (Optional[str]) – The encoding of the files. Default: ‘utf-8’
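As a rough sanity check after conversion, one can count the records in the resulting jsonl file. The sketch below assumes the output contains one document per non-empty line (the usual JSON Lines layout) and is readable as UTF-8. `convert_and_count` and its `converter` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Callable, Optional, Union


def convert_and_count(
    dir_path: Union[str, Path],
    output_file: Union[str, Path],
    converter: Optional[Callable] = None,
) -> int:
    """Run the conversion, then return the number of non-empty lines written."""
    if converter is None:
        # Imported lazily so this sketch can be defined without medkit installed.
        from medkit.tools.e3c_corpus import convert_data_collection_to_medkit

        converter = convert_data_collection_to_medkit
    converter(dir_path=dir_path, output_file=output_file)
    # Assumption: one serialized document per line, UTF-8 encoded.
    with open(output_file, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Example call (the output path is a placeholder):
# n_docs = convert_and_count(
#     "/tmp/E3C-Corpus-2.0.0/data_collection/French/layer1",
#     "french_layer1.jsonl",
# )
```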
- load_annotated_document(filepath, encoding='utf-8', keep_sentences=False)
Load an E3C corpus annotated document (xml document) as a medkit text document, for example one from the data annotation folder. Each annotation id is always kept in the corresponding medkit element metadata.
For the time being, only ‘CLINENTITY’ annotations are supported. ‘SENTENCE’ annotations can also be loaded by setting keep_sentences.
- Parameters
filepath (Union[str, Path]) – The path to the xml file of the E3C corpus
encoding (str) – The encoding of the file. Default: ‘utf-8’
keep_sentences (bool) – Whether to load sentences into medkit documents.
- Return type
TextDocument
- Returns
TextDocument – The corresponding medkit text document
- load_data_annotation(dir_path, encoding='utf-8', keep_sentences=False)
Load the E3C corpus data annotation as medkit text documents.
- Parameters
dir_path (Union[Path, str]) – The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
encoding (str) – The encoding of the files. Default: ‘utf-8’
keep_sentences (bool) – Whether to load sentences into medkit documents.
- Return type
Iterator[TextDocument]
- Returns
Iterator[TextDocument] – An iterator over the corresponding medkit text documents
- convert_data_annotation_to_medkit(dir_path, output_file, encoding='utf-8', keep_sentences=False)
Convert E3C corpus data annotation to medkit jsonl file.
- Parameters
dir_path (Union[Path, str]) – The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
output_file (Union[str, Path]) – The medkit jsonl output file which will contain medkit text documents
encoding (Optional[str]) – The encoding of the files. Default: ‘utf-8’
keep_sentences (bool) – Whether to load sentences into medkit documents.
- SENTENCE_LABEL = 'sentence'
Label used by medkit for annotated sentences of E3C corpus
- CLINENTITY_LABEL = 'disorder'
Label used by medkit for annotated clinical entities of E3C corpus
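The two labels can be used to pick out annotations from a document loaded with load_annotated_document. The sketch below is hedged: it assumes the document exposes its annotations through `doc.anns.get(label=...)`, as in recent medkit releases, which this page does not confirm. The constants are redefined locally to mirror the documented values so the snippet stands alone, and `count_annotations` is a hypothetical helper.

```python
from typing import Dict, Iterable

# Local copies mirroring the documented values of
# medkit.tools.e3c_corpus.CLINENTITY_LABEL and SENTENCE_LABEL.
CLINENTITY_LABEL = "disorder"
SENTENCE_LABEL = "sentence"


def count_annotations(
    doc,
    labels: Iterable[str] = (CLINENTITY_LABEL, SENTENCE_LABEL),
) -> Dict[str, int]:
    """Count the annotations of `doc` for each requested label."""
    # Assumption: doc.anns.get(label=...) returns the annotations with that label.
    return {label: len(doc.anns.get(label=label)) for label in labels}


# Example call (requires medkit and a corpus file; the path is a placeholder):
# from medkit.tools.e3c_corpus import load_annotated_document
# doc = load_annotated_document("path/to/one_document.xml", keep_sentences=True)
# print(count_annotations(doc))
```

With keep_sentences left at its default of False, the count for the sentence label would be expected to be zero, since sentences are then not loaded.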