medkit.text.ner.umls_utils#

Classes:

UMLSEntry(cui, term[, semtypes, semgroups])

Entry in MRCONSO.RRF file of a UMLS dictionary

Functions:

guess_umls_version(path)

Try to infer UMLS version (ex: "2021AB") from any UMLS-related path

load_umls_entries(mrconso_file[, ...])

Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file

preprocess_acronym(term)

Detect whether a term contains an acronym with its expanded form in parentheses, and return the acronym if so.

preprocess_term_to_match(term, lowercase, ...)

Preprocess a UMLS term for matching purposes

Data:

SEMGROUPS

Valid UMLS semgroups

SEMGROUP_LABELS

Labels corresponding to UMLS semgroups

class UMLSEntry(cui, term, semtypes=None, semgroups=None)[source]#

Entry in MRCONSO.RRF file of a UMLS dictionary

Variables
  • cui (str) – Unique identifier of the concept designated by the term

  • term (str) – Original version of the term

  • semtypes (Optional[List[str]]) – Semantic types of the concept (TUIs)

  • semgroups (Optional[List[str]]) – Semantic groups of the concept

load_umls_entries(mrconso_file, mrsty_file=None, sources=None, languages=None, show_progress=False)[source]#

Load all terms and associated CUIs found in a UMLS MRCONSO.RRF file

Parameters
  • mrconso_file (Union[str, Path]) – Path to the UMLS MRCONSO.RRF file

  • mrsty_file (Union[str, Path, None]) – Path to the UMLS MRSTY.RRF file. If provided, semtypes info will be included in the entries returned.

  • sources (Optional[List[str]]) – Sources to consider (ex: ICD10, CCS). If none provided, CUIs and terms of all sources will be taken into account.

  • languages (Optional[List[str]]) – Languages to consider. If none provided, CUIs and terms of all languages will be taken into account.

  • show_progress (bool) – Whether to show a progress bar

Return type

Iterator[UMLSEntry]

Returns

Iterator[UMLSEntry] – Iterator over all term entries found in the UMLS install
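The filtering described above can be illustrated with a minimal sketch. This is not medkit's implementation: EntrySketch and iter_entries_sketch are hypothetical names, the sample rows, CUIs and source names (SRC_A, SRC_B) are made up, and the MRSTY.RRF handling is left out. The sketch assumes only the standard pipe-delimited MRCONSO.RRF layout, where CUI, LAT, SAB and STR sit in columns 0, 1, 11 and 14.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, List, Optional

# Column positions in a pipe-delimited MRCONSO.RRF row
CUI_COL, LAT_COL, SAB_COL, STR_COL = 0, 1, 11, 14


@dataclass
class EntrySketch:
    """Hypothetical stand-in for UMLSEntry (not the medkit class)."""

    cui: str
    term: str
    semtypes: Optional[List[str]] = None
    semgroups: Optional[List[str]] = None


def iter_entries_sketch(mrconso_file, sources=None, languages=None) -> Iterator[EntrySketch]:
    """Yield one entry per MRCONSO row that passes the optional filters."""
    with open(mrconso_file, encoding="utf-8") as f:
        for line in f:
            row = line.rstrip("\n").split("|")
            # None means "no filter", mirroring the documented defaults
            if sources is not None and row[SAB_COL] not in sources:
                continue
            if languages is not None and row[LAT_COL] not in languages:
                continue
            yield EntrySketch(cui=row[CUI_COL], term=row[STR_COL])


# Two made-up rows (fake CUI and fake source names) in MRCONSO layout
SAMPLE_ROWS = [
    "C0000001|ENG|P|L1|PF|S1|Y|A1||||SRC_A|PT|X1|Heart attack|0|N||",
    "C0000001|FRE|P|L2|PF|S2|Y|A2||||SRC_B|PT|X2|Infarctus du myocarde|0|N||",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / "MRCONSO.RRF"
    sample_file.write_text("\n".join(SAMPLE_ROWS) + "\n", encoding="utf-8")
    # Consume the iterator while the temporary file still exists
    french_entries = list(iter_entries_sketch(sample_file, languages=["FRE"]))
```

Because the result is an iterator, rows are read lazily; wrapping the call in list() as above is only needed when every entry must be held in memory at once.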

preprocess_term_to_match(term, lowercase, normalize_unicode, clean_nos=True, clean_brackets=False, clean_dashes=False)[source]#

Preprocess a UMLS term for matching purposes

Parameters
  • term (str) – Term to preprocess

  • lowercase (bool) – Whether term should be lowercased

  • normalize_unicode (bool) – Whether the term should be converted to ASCII-only (non-ASCII chars replaced by closest ASCII chars)

  • clean_nos (bool) – Whether to remove “NOS”

  • clean_brackets (bool) – Whether to remove brackets

  • clean_dashes (bool) – Whether to remove dashes
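As a rough illustration of how these flags could combine, here is a sketch written for this page, not medkit's code. preprocess_sketch is a hypothetical name, and several behaviors are assumptions: "NOS" is removed together with a preceding comma, dashes are replaced by spaces, and ASCII conversion is done with Unicode decomposition. Bracket cleaning is omitted because the description does not say whether the bracketed content itself is dropped.

```python
import re
import unicodedata


def preprocess_sketch(term, lowercase, normalize_unicode, clean_nos=True, clean_dashes=False):
    """Illustrative term normalization; behavior details are assumptions."""
    if clean_nos:
        # Assumption: drop "NOS" and an optional preceding comma.
        # Done before lowercasing so that only the uppercase marker matches.
        term = re.sub(r",?\s*\bNOS\b", "", term)
    if clean_dashes:
        # Assumption: a dash becomes a space rather than being deleted
        term = term.replace("-", " ")
    if normalize_unicode:
        # Decompose accented characters, then drop the non-ASCII marks
        term = unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode("ascii")
    if lowercase:
        term = term.lower()
    # Collapse whitespace left behind by the removals
    return " ".join(term.split())


cleaned = preprocess_sketch("Anémie, NOS", lowercase=True, normalize_unicode=True)
undashed = preprocess_sketch("Beta-Blocker", lowercase=False, normalize_unicode=False, clean_dashes=True)
```

Applying the same preprocessing to both the dictionary terms and the text being searched is what makes the two sides comparable.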

preprocess_acronym(term)[source]#

Detect whether a term contains an acronym with its expanded form in parentheses, and return the acronym if so.

This works for terms such as “ECG (ÉlectroCardioGramme)”, where the acronym can be rebuilt by taking the ASCII version of each uppercase letter inside the parentheses.

Parameters

term (str) – Term that may contain an acronym. Ex: “ECG (ÉlectroCardioGramme)”

Return type

Optional[str]

Returns

Optional[str] – The acronym in the term if any, else None. Ex: “ECG”
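The rebuilding rule described above can be sketched as follows. acronym_sketch and its regular expression are hypothetical and only approximate the documented behavior; the actual medkit function may accept or reject different inputs.

```python
import re
import unicodedata
from typing import Optional

# Assumed shape: one word, then the expanded form in parentheses
_ACRONYM_PATTERN = re.compile(r"^\s*(\w+)\s*\((.+)\)\s*$")


def to_ascii(text: str) -> str:
    """Replace accented characters by their closest ASCII characters."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def acronym_sketch(term: str) -> Optional[str]:
    match = _ACRONYM_PATTERN.match(term)
    if match is None:
        return None
    candidate, expanded = match.group(1), match.group(2)
    # Rebuild the acronym from the uppercase letters of the expanded form
    rebuilt = "".join(to_ascii(char) for char in expanded if char.isupper())
    return candidate if rebuilt == candidate else None


found = acronym_sketch("ECG (ÉlectroCardioGramme)")
not_found = acronym_sketch("ECG (electrocardiogram)")
```

In the second call the expanded form has no uppercase letters, so nothing can be rebuilt and the sketch returns None.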

guess_umls_version(path)[source]#

Try to infer UMLS version (ex: “2021AB”) from any UMLS-related path

Parameters

path (Union[str, Path]) – Path to the root directory of the UMLS install or any file inside that directory

Return type

str

Returns

str – UMLS version, estimated by finding the leaf-most folder in path that is not “META”, “NET” nor “LEX”, nor a subfolder of these folders
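One way to read the rule above is sketched below, under assumptions: guess_version_sketch is a hypothetical name, and the directory layout (a version folder containing META/MRCONSO.RRF) is created in a temporary directory purely for the demonstration. The real function may resolve paths or handle edge cases differently.

```python
import tempfile
from pathlib import Path

RESERVED_DIRS = ("META", "NET", "LEX")


def guess_version_sketch(path) -> str:
    """Return the name of the leaf-most folder outside META, NET and LEX."""
    path = Path(path)
    if path.is_file():
        # Start from the folder that contains the file
        path = path.parent
    # Climb until no reserved folder remains anywhere in the path
    while any(name in path.parts for name in RESERVED_DIRS):
        path = path.parent
    return path.name


with tempfile.TemporaryDirectory() as tmp_dir:
    # Assumed layout: <root>/2021AB/META/MRCONSO.RRF
    meta_dir = Path(tmp_dir) / "2021AB" / "META"
    meta_dir.mkdir(parents=True)
    mrconso = meta_dir / "MRCONSO.RRF"
    mrconso.touch()
    version_from_file = guess_version_sketch(mrconso)
    version_from_root = guess_version_sketch(meta_dir.parent)
```

Both calls give the same answer here because the file path is walked up past META until it reaches the version folder.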

SEMGROUPS = ['ACTI', 'ANAT', 'CHEM', 'CONC', 'DEVI', 'DISO', 'GENE', 'GEOG', 'LIVB', 'OBJC', 'OCCU', 'ORGA', 'PHEN', 'PHYS', 'PROC']#

Valid UMLS semgroups

SEMGROUP_LABELS = {'ACTI': 'activity', 'ANAT': 'anatomy', 'CHEM': 'chemical', 'CONC': 'concept', 'DEVI': 'device', 'DISO': 'disorder', 'GENE': 'genes_sequence', 'GEOG': 'geographic_area', 'LIVB': 'living_being', 'OBJC': 'object', 'OCCU': 'occupation', 'ORGA': 'organization', 'PHEN': 'phenomenon', 'PHYS': 'physiology', 'PROC': 'procedure'}#

Labels corresponding to UMLS semgroups
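A short sketch of how the two constants fit together. The values are copied from the listing above so that the snippet runs on its own; labels_for is a hypothetical helper, not part of the module.

```python
SEMGROUPS = [
    "ACTI", "ANAT", "CHEM", "CONC", "DEVI", "DISO", "GENE", "GEOG",
    "LIVB", "OBJC", "OCCU", "ORGA", "PHEN", "PHYS", "PROC",
]

SEMGROUP_LABELS = {
    "ACTI": "activity", "ANAT": "anatomy", "CHEM": "chemical",
    "CONC": "concept", "DEVI": "device", "DISO": "disorder",
    "GENE": "genes_sequence", "GEOG": "geographic_area",
    "LIVB": "living_being", "OBJC": "object", "OCCU": "occupation",
    "ORGA": "organization", "PHEN": "phenomenon", "PHYS": "physiology",
    "PROC": "procedure",
}


def labels_for(semgroups):
    """Map semgroup codes to labels, rejecting codes that are not valid."""
    unknown = [group for group in semgroups if group not in SEMGROUPS]
    if unknown:
        raise ValueError(f"Unknown semgroups: {unknown}")
    return [SEMGROUP_LABELS[group] for group in semgroups]


labels = labels_for(["DISO", "CHEM"])
```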