medkit.text.ner.hf_entity_matcher_trainable

This module requires extra dependencies that are not installed as core dependencies of medkit. To install them, use pip install medkit-lib[hf-entity-matcher].

Classes:

HFEntityMatcherTrainable(model_name_or_path, ...)

Trainable entity matcher based on a HuggingFace transformers model. Any token classification model from the HuggingFace hub can be used (for instance "samrawal/bert-base-uncased_clinical-ner").

class HFEntityMatcherTrainable(model_name_or_path, labels, tagging_scheme, tag_subtokens=False, tokenizer_max_length=None, hf_auth_token=None, device=-1)

Trainable entity matcher based on a HuggingFace transformers model. Any token classification model from the HuggingFace hub can be used (for instance "samrawal/bert-base-uncased_clinical-ner").

Parameters
  • model_name_or_path (Union[str, Path]) – Name (on the HuggingFace models hub) or path of the NER model. Must be a model compatible with the TokenClassification transformers class.

  • labels (List[str]) – List of entity labels to detect.

  • tagging_scheme (Literal['bilou', 'iob2']) – Tagging scheme to use in the segment-entities preprocessing and label mapping definition.

  • tag_subtokens (bool) – Whether to tag all subtokens of a word. Pretrained models require a tokenization step, and any word of the segment that is not in the vocabulary of the model's tokenizer is split into subtokens. It is recommended to tag only the first subtoken of a word, but all subtokens can be tagged by setting this value to True. This may affect fine-tuning time and results (see the tokenization sketch after this list).

  • tokenizer_max_length (Optional[int]) – Optional max length for the tokenizer; by default, the model_max_length of the model is used.

  • hf_auth_token (Optional[str]) – HuggingFace authentication token (to access private models on the hub).

  • device (int) – Device to use for the transformer model. Follows the HuggingFace convention (-1 for CPU, and the device index for GPU, for instance 0 for "cuda:0").
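
To make the tag_subtokens option concrete, here is a minimal sketch of how a HuggingFace tokenizer splits an out-of-vocabulary word into subtokens; the chosen word and the exact split are illustrative and depend on the model's vocabulary:

    from transformers import AutoTokenizer

    # Load the tokenizer of the example model from the HuggingFace hub
    tokenizer = AutoTokenizer.from_pretrained("samrawal/bert-base-uncased_clinical-ner")

    # A word absent from the vocabulary is split into several subtokens;
    # with tag_subtokens=False only the first subtoken receives an entity tag
    print(tokenizer.tokenize("hypercholesterolemia"))
    # e.g. ['hyper', '##cho', '##les', '##tero', '##lemia'] (actual split depends on the vocabulary)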
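
A minimal usage sketch of the constructor follows; the label list and tagging scheme are illustrative assumptions, not values prescribed by medkit:

    from medkit.text.ner.hf_entity_matcher_trainable import HFEntityMatcherTrainable

    # Instantiate a trainable matcher from a hub model; the labels are illustrative
    matcher = HFEntityMatcherTrainable(
        model_name_or_path="samrawal/bert-base-uncased_clinical-ner",
        labels=["problem", "treatment", "test"],
        tagging_scheme="bilou",
        tag_subtokens=False,  # tag only the first subtoken of each word
        device=-1,            # -1 = CPU; use 0 for "cuda:0"
    )

The resulting object can then be fine-tuned on annotated documents using medkit's training utilities.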