medkit.text.ner.umls_coder_normalizer

This module needs extra dependencies that are not installed with the core medkit package. To install them, use pip install medkit-lib[umls-coder-normalizer].

Classes:

UMLSCoderNormalizer(umls_mrconso_file, ...)

Normalizer adding UMLS normalization attributes to pre-existing entities.

class UMLSCoderNormalizer(umls_mrconso_file, language, model, embeddings_cache_dir, summary_method='cls', normalize_embeddings=True, lowercase=False, normalize_unicode=False, threshold=None, max_nb_matches=1, device=-1, batch_size=128, hf_auth_token=None, nb_umls_embeddings_chunks=None, hf_cache_dir=None, name=None, uid=None)

Normalizer adding UMLS normalization attributes to pre-existing entities. Based on https://github.com/GanjinZero/CODER/.

A UMLS MRCONSO.RRF file is required. The normalizer identifies UMLS concepts by comparing the embeddings of reference UMLS terms with the embeddings of the input entities. Any text transformer model from the Hugging Face Hub can be used, but “GanjinZero/UMLSBert_ENG” was specifically trained for this task (for English).
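
To make the role of the MRCONSO.RRF file and of the language parameter more concrete, here is a minimal sketch of how reference terms could be read from it. It relies only on the standard UMLS layout (pipe-delimited rows, with the concept identifier, language code and term string in the CUI, LAT and STR columns). The helper name load_terms and the sample rows are made up for illustration, and which rows medkit actually keeps is not specified here.

```python
from typing import Iterable, List, Tuple

# Column positions in a pipe-delimited MRCONSO.RRF row (standard UMLS layout).
CUI_COL, LAT_COL, STR_COL = 0, 1, 14


def load_terms(rows: Iterable[str], language: str) -> List[Tuple[str, str]]:
    """Return (CUI, term) pairs for the rows whose language code matches."""
    terms = []
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) <= STR_COL:
            continue  # skip rows that are too short to hold a term string
        if fields[LAT_COL] == language:
            terms.append((fields[CUI_COL], fields[STR_COL]))
    return terms


# Two made-up rows in MRCONSO layout (illustrative values, not real UMLS data).
sample_rows = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|example term|0|N||",
    "C0000001|FRE|P|L0000002|PF|S0000002|Y|A0000002||||SRC|PT|X1|terme exemple|0|N||",
]
english_terms = load_terms(sample_rows, "ENG")
```

Each retained term would then be embedded once and compared against the embeddings of the input entities.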

When UMLSCoderNormalizer is used for the first time with a given MRCONSO.RRF file, the embeddings of all UMLS terms are pre-computed (this can take a very long time) and stored in embeddings_cache_dir so that they can be reused the next time.

If another MRCONSO.RRF file is used, or if a parameter affecting the computation of embeddings (model, summary_method, etc.) is changed, then either a different embeddings_cache_dir must be used or the existing embeddings_cache_dir must be deleted so that it can be regenerated.
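
The cache rule above can be pictured with a small sketch. It assumes the parameters used to build the cache are saved next to the embeddings and compared on later instantiations, and that a mismatch is reported as an error. The file name params.json and the helper check_or_init_cache are hypothetical, and the way medkit actually stores and checks these parameters may differ.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical file name; the file medkit actually writes may differ.
PARAMS_FILENAME = "params.json"


def check_or_init_cache(cache_dir: Path, params: dict) -> bool:
    """Return True if embeddings must be computed, False if the cache is reusable.

    Raises ValueError when the cache was built with different params.
    """
    params_file = cache_dir / PARAMS_FILENAME
    if not params_file.exists():
        # First use: record the params the embeddings are about to be built with.
        cache_dir.mkdir(parents=True, exist_ok=True)
        params_file.write_text(json.dumps(params, sort_keys=True))
        return True
    cached = json.loads(params_file.read_text())
    if cached != params:
        raise ValueError(f"Cache in {cache_dir} was built with {cached}, not {params}")
    return False


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "umls_embeddings"
    params = {"model": "GanjinZero/UMLSBert_ENG", "summary_method": "cls"}
    first_run = check_or_init_cache(cache, params)   # cache is created
    second_run = check_or_init_cache(cache, params)  # cache is reused
    try:
        check_or_init_cache(cache, {**params, "summary_method": "mean"})
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```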

If the UMLS embeddings are too big to be held in memory, set nb_umls_embeddings_chunks so that they are loaded a few chunks at a time.
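
As a rough illustration of why chunked loading bounds memory use, the sketch below scans embeddings group by group and keeps only the running best matches between groups. It is a plain-Python approximation, not medkit's implementation: the helpers iter_groups and best_matches are hypothetical, the chunks live in memory rather than on disk, and embeddings are assumed to be unit-length so that a dot product equals the cosine similarity.

```python
import heapq
from typing import Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def iter_groups(chunks: Sequence[List[Vector]], nb_chunks: int) -> Iterator[Tuple[int, List[Vector]]]:
    """Yield (offset, embeddings) for successive groups of nb_chunks chunks."""
    offset = 0
    for start in range(0, len(chunks), nb_chunks):
        group = [emb for chunk in chunks[start:start + nb_chunks] for emb in chunk]
        yield offset, group
        offset += len(group)


def best_matches(query: Vector, chunks: Sequence[List[Vector]], nb_chunks: int, k: int) -> List[Tuple[float, int]]:
    """Return the k highest (similarity, term_index) pairs, scanning group by group."""
    best: List[Tuple[float, int]] = []
    for offset, group in iter_groups(chunks, nb_chunks):
        scores = [
            (sum(q * t for q, t in zip(query, emb)), offset + i)
            for i, emb in enumerate(group)
        ]
        # Only the running top-k survives between groups, so memory stays bounded.
        best = heapq.nlargest(k, best + scores)
    return best


# Toy data: 3 chunks of 2 unit-length embeddings each (real chunks hold 65536).
chunks = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(0.6, 0.8), (0.8, 0.6)],
    [(-1.0, 0.0), (0.0, -1.0)],
]
top = best_matches((1.0, 0.0), chunks, nb_chunks=2, k=2)
```

The result does not depend on the group size; a larger nb_chunks only trades memory for fewer passes.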

Parameters
  • umls_mrconso_file (Union[str, Path]) – Path to the UMLS MRCONSO.RRF file.

  • language (str) – Language of the UMLS terms to use (e.g. “ENG”, “FRE”).

  • model (Union[str, Path]) – Name on the Hugging Face Hub, or local path, of the transformers model used to extract embeddings (e.g. “GanjinZero/UMLSBert_ENG”).

  • embeddings_cache_dir (Union[str, Path]) – Path to the directory into which pre-computed embeddings of UMLS terms should be cached. If it doesn’t exist yet, the embeddings will be automatically generated (it can take a long time) and stored there, ready to be reused on further instantiations. If it already exists, a check will be done to make sure the params used when the embeddings were computed are consistent with the params of the current instance.

  • summary_method (Literal['mean', 'cls']) – If set to “mean”, the embeddings extracted will be the mean of the pooling layers of the model. Otherwise, when set to “cls”, the last hidden layer will be used.

  • normalize_embeddings (bool) – Whether to normalize the extracted embeddings.

  • lowercase (bool) – Whether to use lowercased versions of UMLS terms and input entities.

  • normalize_unicode (bool) – Whether to use ASCII-only versions of UMLS terms and input entities (non-ASCII chars replaced by closest ASCII chars).

  • threshold (Optional[float]) – Minimum similarity (between 0.0 and 1.0) required between the embeddings of an entity and of a UMLS term for a normalization attribute to be added.

  • max_nb_matches (int) – Maximum number of normalization attributes to add to each entity.

  • device (int) – Device to use for transformers models. Follows the Hugging Face convention (-1 for “cpu”, or the device number for a GPU, for instance 0 for “cuda:0”).

  • batch_size (int) – Number of entities in batches processed by the embeddings extraction pipeline.

  • hf_auth_token (Optional[str]) – Hugging Face authentication token (to access private models on the Hub).

  • hf_cache_dir (Union[str, Path, None]) – Directory in which to store downloaded models. If not set, the default Hugging Face cache directory is used.

  • nb_umls_embeddings_chunks (Optional[int]) – Number of UMLS embeddings chunks to load at the same time when computing embedding similarities (a chunk contains 65536 embeddings). If None, all pre-computed UMLS embeddings are pre-loaded in memory and similarities are computed in one shot. Otherwise, at each call to run(), UMLS embeddings are loaded by groups of chunks and similarities are computed for each group. Use this when the UMLS embeddings are too big to be fully loaded in memory. The higher this value, the more memory is needed.

  • name (Optional[str]) – Name describing the normalizer (defaults to the class name).

  • uid (str) – Identifier of the normalizer.
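
The effect of the lowercase and normalize_unicode options can be approximated with the standard library, as in the sketch below. The helper preprocess is hypothetical. Note one assumption in particular: this version drops characters that have no ASCII decomposition, whereas the parameter description speaks of replacing them with the closest ASCII characters, so medkit may well rely on a dedicated transliteration step instead.

```python
import unicodedata


def preprocess(text: str, lowercase: bool = False, normalize_unicode: bool = False) -> str:
    """Apply the optional text normalizations to a term or entity text."""
    if lowercase:
        text = text.lower()
    if normalize_unicode:
        # Decompose accented characters, then drop whatever is not ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    return text


raw = "Diabète"
normalized = preprocess(raw, lowercase=True, normalize_unicode=True)
```

Because the same transformation is applied to both the UMLS terms and the input entities, changing either option changes the embeddings and therefore calls for a fresh embeddings_cache_dir.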

Methods:

run(entities)

Add normalization attributes to each entity in entities.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(entities)

Add normalization attributes to each entity in entities.

Each entity will have zero, one or more normalization attributes depending on max_nb_matches and on how many matches with a similarity above threshold are found.

Parameters

entities (List[Entity]) – List of entities to add normalization attributes to.
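
How threshold and max_nb_matches interact can be sketched as follows. This is an illustration under stated assumptions rather than medkit's code: the helpers cosine and select_matches are hypothetical, the CUIs and two-dimensional embeddings are toy values, and the comparison with the threshold is assumed to be inclusive.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def select_matches(
    entity_emb: Sequence[float],
    term_embs: Dict[str, Sequence[float]],
    threshold: Optional[float],
    max_nb_matches: int,
) -> List[Tuple[str, float]]:
    """Return up to max_nb_matches (CUI, similarity) pairs, best first."""
    scored = [(cui, cosine(entity_emb, emb)) for cui, emb in term_embs.items()]
    if threshold is not None:
        # Assumption: a score equal to the threshold is kept.
        scored = [(cui, score) for cui, score in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_nb_matches]


# Toy reference embeddings keyed by made-up CUIs.
terms = {"C0000001": (1.0, 0.0), "C0000002": (0.6, 0.8), "C0000003": (0.0, 1.0)}

two = select_matches((1.0, 0.0), terms, threshold=0.5, max_nb_matches=2)
one = select_matches((1.0, 0.0), terms, threshold=0.9, max_nb_matches=2)
none = select_matches((0.0, -1.0), terms, threshold=0.5, max_nb_matches=2)
```

Here the first call keeps two concepts, the second keeps one because the stricter threshold filters out the weaker match, and the third keeps none, in which case no normalization attribute would be added to the entity.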

property description: medkit.core.operation_desc.OperationDescription

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer that will record the provenance of the added attributes.
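
A possible way to tie the constructor, set_prov_tracer() and run() together is sketched below. It uses only the parameters and methods documented on this page. The wrapper function normalize_entities is hypothetical, the threshold and max_nb_matches values are illustrative rather than recommended, and the entities and the optional provenance tracer are assumed to be created elsewhere. The import is done inside the function because the module needs the optional extra dependencies.

```python
from pathlib import Path
from typing import List, Optional


def normalize_entities(
    entities: List,
    mrconso_file: Path,
    cache_dir: Path,
    prov_tracer: Optional[object] = None,
) -> None:
    """Add UMLS normalization attributes to the given entities, in place."""
    # Imported lazily: requires the umls-coder-normalizer extra dependencies.
    from medkit.text.ner.umls_coder_normalizer import UMLSCoderNormalizer

    normalizer = UMLSCoderNormalizer(
        umls_mrconso_file=mrconso_file,
        language="ENG",
        model="GanjinZero/UMLSBert_ENG",
        embeddings_cache_dir=cache_dir,
        threshold=0.8,      # illustrative value, not a recommendation
        max_nb_matches=3,   # illustrative value
        device=-1,          # CPU, following the Hugging Face convention
    )
    if prov_tracer is not None:
        normalizer.set_prov_tracer(prov_tracer)
    normalizer.run(entities)
```

The first call for a given MRCONSO.RRF file and cache directory will spend a long time pre-computing embeddings; later calls with the same parameters reuse the cache.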