medkit.text.ner.umls_utils
Classes:
UMLSEntry | Entry in MRCONSO.RRF file of a UMLS dictionary
Functions:
guess_umls_version | Try to infer UMLS version (ex: "2021AB") from any UMLS-related path
load_umls_entries | Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file
preprocess_acronym | Detect if a term contains an acronym with its expanded form in parentheses, and return the acronym if that is the case.
preprocess_term_to_match | Preprocess a UMLS term for matching purposes
Data:
SEMGROUPS | Valid UMLS semgroups
SEMGROUP_LABELS | Labels corresponding to UMLS semgroups
- class UMLSEntry(cui, term, semtypes=None, semgroups=None)
Entry in MRCONSO.RRF file of a UMLS dictionary
- Variables
cui (str) – Unique identifier of the concept designated by the term
term (str) – Original version of the term
semtypes (Optional[List[str]]) – Semantic types of the concept (TUIs)
semgroups (Optional[List[str]]) – Semantic groups of the concept
- load_umls_entries(mrconso_file, mrsty_file=None, sources=None, languages=None, show_progress=False)
Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file
- Parameters
mrconso_file (Union[str, Path]) – Path to the UMLS MRCONSO.RRF file
mrsty_file (Union[str, Path, None]) – Path to the UMLS MRSTY.RRF file. If provided, semtypes info will be included in the entries returned.
sources (Optional[List[str]]) – Sources to consider (ex: ICD10, CCS). If none provided, CUIs and terms of all sources will be taken into account.
languages (Optional[List[str]]) – Languages to consider. If none provided, CUIs and terms of all languages will be taken into account.
show_progress (bool) – Whether to show a progress bar
- Return type
Iterator[UMLSEntry]
- Returns
Iterator over all term entries found in UMLS install
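To illustrate the kind of filtering described above, here is a minimal, self-contained sketch that parses MRCONSO-style rows held in memory. It is not medkit's implementation: `Entry` and `iter_mrconso_entries` are hypothetical names, and the two sample rows are made up. The only thing taken from UMLS itself is the standard pipe-delimited MRCONSO.RRF column order (CUI first, LAT second, SAB twelfth, STR fifteenth).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for UMLSEntry, keeping only cui and term."""

    cui: str
    term: str


def iter_mrconso_entries(
    lines: Iterable[str],
    sources: Optional[List[str]] = None,
    languages: Optional[List[str]] = None,
) -> Iterator[Entry]:
    """Yield one Entry per MRCONSO row that passes the optional filters."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # MRCONSO.RRF column positions: CUI=0, LAT (language)=1,
        # SAB (source abbreviation)=11, STR (term string)=14
        cui, language, source, term = fields[0], fields[1], fields[11], fields[14]
        # None means "no filter", mirroring the documented defaults
        if sources is not None and source not in sources:
            continue
        if languages is not None and language not in languages:
            continue
        yield Entry(cui=cui, term=term)


# Made-up rows for illustration only (not real UMLS content)
sample_rows = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC_A|PT|X1|Example term|0|N||",
    "C0000002|FRE|P|L0000002|PF|S0000002|Y|A0000002||||SRC_B|PT|X2|Terme exemple|0|N||",
]

french_entries = list(iter_mrconso_entries(sample_rows, languages=["FRE"]))
```

Because the function is a generator, rows are processed lazily, which matches the `Iterator` return type documented for `load_umls_entries` and avoids holding a large file in memory.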
- preprocess_term_to_match(term, lowercase, normalize_unicode, clean_nos=True, clean_brackets=False, clean_dashes=False)
Preprocess a UMLS term for matching purposes
- Parameters
term (str) – Term to preprocess
lowercase (bool) – Whether term should be lowercased
normalize_unicode (bool) – Whether term should be converted to ASCII-only (non-ASCII chars replaced by closest ASCII chars)
clean_nos (bool) – Whether to remove “NOS”
clean_brackets (bool) – Whether to remove brackets
clean_dashes (bool) – Whether to remove dashes
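The options above can be pictured with the following standalone sketch. It only approximates the documented behavior and is not medkit's code: `preprocess_sketch` is a hypothetical name, ASCII conversion is done with the standard library's `unicodedata` (which drops characters that have no decomposition rather than finding a closest match), and "remove brackets" is assumed to mean removing the bracket characters while keeping their content.

```python
import re
import unicodedata


def preprocess_sketch(
    term: str,
    lowercase: bool,
    normalize_unicode: bool,
    clean_nos: bool = True,
    clean_brackets: bool = False,
    clean_dashes: bool = False,
) -> str:
    """Apply each requested cleaning step in turn, then collapse whitespace."""
    if lowercase:
        term = term.lower()
    if normalize_unicode:
        # Decompose accented characters, then drop the combining marks
        term = unicodedata.normalize("NFKD", term)
        term = term.encode("ascii", "ignore").decode("ascii")
    if clean_nos:
        # Remove the standalone token "NOS", whatever its case
        term = re.sub(r"\bnos\b", "", term, flags=re.IGNORECASE)
    if clean_brackets:
        # Assumption: only the bracket characters are removed, not their content
        term = re.sub(r"[\[\]()]", " ", term)
    if clean_dashes:
        term = term.replace("-", " ")
    # Collapse the whitespace left behind by the removals
    return " ".join(term.split())


cleaned = preprocess_sketch(
    "Électro-cardiogramme", lowercase=True, normalize_unicode=True, clean_dashes=True
)
```

Applying the same preprocessing to both the dictionary terms and the text being searched is what makes the resulting strings comparable.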
- preprocess_acronym(term)
Detect if a term contains an acronym with its expanded form in parentheses, and return the acronym if that is the case.
This will work for terms such as: “ECG (ÉlectroCardioGramme)”, where the acronym can be rebuilt by taking the ASCII version of each uppercase letter inside the parentheses.
- Parameters
term (str) – Term that may contain an acronym. Ex: “ECG (ÉlectroCardioGramme)”
- Return type
Optional[str]
- Returns
The acronym in the term if any, else None. Ex: “ECG”
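The rule described above (rebuild the acronym from the ASCII version of each uppercase letter inside the parentheses) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not medkit's code: `acronym_sketch` and `to_ascii` are hypothetical names, and the term is assumed to have the exact shape "ACRONYM (Expanded Form)".

```python
import re
import unicodedata
from typing import Optional

# Assumed shape: one whitespace-free token followed by a parenthesised expansion
_ACRONYM_PATTERN = re.compile(r"^\s*(?P<acronym>\S+)\s*\((?P<expanded>[^)]*)\)\s*$")


def to_ascii(text: str) -> str:
    """Strip diacritics by decomposing and dropping non-ASCII code points."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def acronym_sketch(term: str) -> Optional[str]:
    """Return the acronym if the uppercase letters of the expansion rebuild it."""
    match = _ACRONYM_PATTERN.match(term)
    if match is None:
        return None
    acronym = match.group("acronym")
    # Keep uppercase letters only, converted to ASCII (e.g. "É" becomes "E")
    rebuilt = "".join(to_ascii(char) for char in match.group("expanded") if char.isupper())
    return acronym if rebuilt == acronym else None


result = acronym_sketch("ECG (ÉlectroCardioGramme)")  # "ECG"
```

A term such as "ECG (electrocardiogram)" yields None here, since the expansion contains no uppercase letters from which to rebuild the acronym.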
- guess_umls_version(path)
Try to infer UMLS version (ex: “2021AB”) from any UMLS-related path
- Parameters
path (Union[str, Path]) – Path to the root directory of the UMLS install or any file inside that directory
- Return type
str
- Returns
UMLS version, estimated by finding the leaf-most folder in path that is neither “META”, “NET” nor “LEX”, nor a subfolder of these folders
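The heuristic described above can be sketched without touching the filesystem. This is a rough illustration, not medkit's implementation: `guess_version_sketch` is a hypothetical name, and a last path component carrying a file extension is assumed to be a file rather than a folder.

```python
from pathlib import PurePath
from typing import Union

# Standard UMLS subfolders named in the description above
_UMLS_SUBDIRS = {"META", "NET", "LEX"}


def guess_version_sketch(path: Union[str, PurePath]) -> str:
    """Return the leaf-most folder that is not META/NET/LEX or nested in them."""
    path = PurePath(path)
    parts = list(path.parts)
    # Assumption: a final component with an extension is a file, so drop it
    if path.suffix:
        parts = parts[:-1]
    # Cut the path at the first META/NET/LEX folder, discarding its subfolders
    for index, part in enumerate(parts):
        if part in _UMLS_SUBDIRS:
            parts = parts[:index]
            break
    if not parts:
        raise ValueError(f"Cannot guess a UMLS version from {path}")
    return parts[-1]


version = guess_version_sketch("/data/umls/2021AB/META/MRCONSO.RRF")  # "2021AB"
```

The result is only as reliable as the directory naming: if the install folder was renamed, the returned string will be that folder's name rather than an actual release identifier.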
- SEMGROUPS = ['ACTI', 'ANAT', 'CHEM', 'CONC', 'DEVI', 'DISO', 'GENE', 'GEOG', 'LIVB', 'OBJC', 'OCCU', 'ORGA', 'PHEN', 'PHYS', 'PROC']
Valid UMLS semgroups
- SEMGROUP_LABELS = {'ACTI': 'activity', 'ANAT': 'anatomy', 'CHEM': 'chemical', 'CONC': 'concept', 'DEVI': 'device', 'DISO': 'disorder', 'GENE': 'genes_sequence', 'GEOG': 'geographic_area', 'LIVB': 'living_being', 'OBJC': 'object', 'OCCU': 'occupation', 'ORGA': 'organization', 'PHEN': 'phenomenon', 'PHYS': 'physiology', 'PROC': 'procedure'}
Labels corresponding to UMLS semgroups