medkit.text.ner.hf_tokenization_utils#

Functions:

align_and_map_tokens_with_tags(...[, ...])

Return a list of tag ids aligned with the text encoding.

convert_labels_to_tags(labels[, tagging_scheme])

Convert a list of labels into a mapping of NER tags

transform_entities_to_tags(text_encoding, ...)

Transform entities from an encoded document to a list of BILOU/IOB2 tags.

transform_entities_to_tags(text_encoding, entities, tagging_scheme='bilou')[source]#

Transform entities from an encoded document to a list of BILOU/IOB2 tags.

Parameters
  • text_encoding (EncodingFast) – Encoding of the reference document, created by a HuggingFace fast tokenizer. It contains a tokenized version of the document to tag.

  • entities (List[Entity]) – The list of entities to transform

  • tagging_scheme (Literal['bilou', 'iob2']) – Scheme to use when tagging the tokens; either "bilou" or "iob2"

Return type

List[str]

Returns

List[str] – A list describing the document with tags. With the default "bilou" scheme the tags can be "B", "I", "L", "O", "U"; if tagging_scheme is "iob2" the tags can be "B", "I", "O".

Examples

>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> from medkit.core.text import Entity, Span, TextDocument
>>> document = TextDocument(text="medkit")
>>> entities = [Entity(label="corporation", spans=[Span(start=0, end=6)], text="medkit")]
>>> # Get text encoding of the document using the tokenizer
>>> text_encoding = tokenizer(document.text).encodings[0]
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']

Transform to BILOU tags

>>> tags = transform_entities_to_tags(text_encoding, entities)
>>> assert tags == ['O', 'B-corporation', 'L-corporation', 'O']

Transform to IOB2 tags

>>> tags = transform_entities_to_tags(text_encoding, entities, "iob2")
>>> assert tags == ['O', 'B-corporation', 'I-corporation', 'O']
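The alignment performed above can be sketched without a tokenizer, using only the character offsets and special-token mask that an EncodingFast exposes. The function below is an illustrative approximation, not medkit's actual implementation; its name and signature are invented for this sketch:

```python
def sketch_entities_to_tags(offsets, special_mask, entities, tagging_scheme="bilou"):
    """Illustrative sketch: map character-span entities to BILOU/IOB2 token tags.

    offsets: list of (start, end) character offsets, one per token
    special_mask: 1 for special tokens ([CLS], [SEP], ...), 0 otherwise
    entities: list of (start, end, label) tuples
    """
    tags = ["O"] * len(offsets)
    for start, end, label in entities:
        # indices of non-special tokens whose offsets overlap the entity span
        idx = [
            i for i, (s, e) in enumerate(offsets)
            if not special_mask[i] and s < end and e > start
        ]
        for pos, i in enumerate(idx):
            if tagging_scheme == "bilou":
                if len(idx) == 1:
                    tags[i] = f"U-{label}"       # Unit-length entity
                elif pos == 0:
                    tags[i] = f"B-{label}"       # Beginning
                elif pos == len(idx) - 1:
                    tags[i] = f"L-{label}"       # Last
                else:
                    tags[i] = f"I-{label}"       # Inside
            else:  # iob2
                tags[i] = f"B-{label}" if pos == 0 else f"I-{label}"
    return tags


# Reproducing the documented example: tokens ['[CLS]', 'med', '##kit', '[SEP]']
offsets = [(0, 0), (0, 3), (3, 6), (0, 0)]
special_mask = [1, 0, 0, 1]
entities = [(0, 6, "corporation")]
print(sketch_entities_to_tags(offsets, special_mask, entities))
# ['O', 'B-corporation', 'L-corporation', 'O']
```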
align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, map_sub_tokens=True)[source]#

Return a list of tag ids aligned with the text encoding. Tags corresponding to special tokens are assigned SPECIAL_TAG_ID_HF.

Parameters
  • text_encoding (EncodingFast) – Text encoding after tokenization with a HuggingFace fast tokenizer

  • tags (List[str]) – A list of tags, e.g. BILOU tags

  • tag_to_id (Dict[str, int]) – Mapping tag to id

  • map_sub_tokens (bool) – When a token is not in the tokenizer's vocabulary, the tokenizer may split it into multiple subtokens. If map_sub_tokens is True, the tags of all subtokens are converted; if map_sub_tokens is False, only the tag of the first subtoken of each split token is converted.

Return type

List[int]

Returns

List[int] – A list of tag ids

Examples

>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> # define data to map
>>> text_encoding = tokenizer("medkit").encodings[0]
>>> tags = ["O", "B-corporation", "I-corporation", "O"]
>>> tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']

Mapping all tags to tag ids

>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
>>> assert tags_ids == [-100, 1, 2, -100]

Mapping only the first tag of each token

>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, False)
>>> assert tags_ids == [-100, 1, -100, -100]
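The alignment logic behind both calls can be sketched using the word-to-token mapping that HuggingFace fast tokenizers expose (word_ids, where special tokens map to None). This is an illustrative approximation of the behavior shown above, not medkit's actual implementation; the names are invented for this sketch:

```python
SPECIAL_TAG_ID = -100  # the value PyTorch's cross-entropy loss ignores by default

def sketch_align_tags(word_ids, tags, tag_to_id, map_sub_tokens=True):
    """Illustrative sketch: align per-token tags to tag ids.

    word_ids: one entry per token; None for special tokens ([CLS], [SEP], ...),
    otherwise the index of the word the token belongs to.
    """
    tag_ids = []
    prev_word = None
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:
            # special token: always masked out
            tag_ids.append(SPECIAL_TAG_ID)
        elif map_sub_tokens or word_id != prev_word:
            # either we map every subtoken, or this is the first subtoken of a word
            tag_ids.append(tag_to_id[tag])
        else:
            # trailing subtoken in first-subtoken-only mode
            tag_ids.append(SPECIAL_TAG_ID)
        prev_word = word_id
    return tag_ids


# Reproducing the documented example: ['[CLS]', 'med', '##kit', '[SEP]']
word_ids = [None, 0, 0, None]
tags = ["O", "B-corporation", "I-corporation", "O"]
tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
print(sketch_align_tags(word_ids, tags, tag_to_id))         # [-100, 1, 2, -100]
print(sketch_align_tags(word_ids, tags, tag_to_id, False))  # [-100, 1, -100, -100]
```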
convert_labels_to_tags(labels, tagging_scheme='bilou')[source]#

Convert a list of labels into a mapping of NER tags

Parameters
  • labels (List[str]) – List of labels to convert

  • tagging_scheme (Literal['bilou', 'iob2']) – Scheme to use for the conversion; "iob2" follows the BIO scheme.

Return type

Dict[str, int]

Returns

label_to_id (Dict[str, int]) – Mapping from NER tags to ids.

Examples

>>> convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
{'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
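The conversion can be sketched as follows, reproducing the documented "iob2" output. The "bilou" branch assumes a B, I, L, U prefix order per label; that ordering is an assumption of this sketch, and the function name is invented for illustration:

```python
def sketch_labels_to_tags(labels, tagging_scheme="bilou"):
    """Illustrative sketch: build a tag-to-id mapping from entity labels.

    "O" gets id 0; each label then contributes one tag per scheme prefix,
    numbered in order of appearance.
    """
    # prefix order for "bilou" is an assumption of this sketch
    prefixes = "BILU" if tagging_scheme == "bilou" else "BI"
    tag_to_id = {"O": 0}
    for label in labels:
        for prefix in prefixes:
            tag_to_id[f"{prefix}-{label}"] = len(tag_to_id)
    return tag_to_id


print(sketch_labels_to_tags(["test", "problem"], tagging_scheme="iob2"))
# {'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
```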