medkit.text.ner.hf_tokenization_utils
Functions:

- align_and_map_tokens_with_tags — Return a list of tag ids aligned with the text encoding.
- convert_labels_to_tags — Convert a list of labels into a mapping of NER tags.
- transform_entities_to_tags — Transform entities from an encoded document into a list of BILOU/IOB2 tags.
- transform_entities_to_tags(text_encoding, entities, tagging_scheme='bilou')[source]
Transform entities from an encoded document into a list of BILOU/IOB2 tags.
- Parameters
  - text_encoding (EncodingFast) – Encoding of the reference document, created by a HuggingFace fast tokenizer. It contains a tokenized version of the document to tag.
  - entities (List[Entity]) – The list of entities to transform.
  - tagging_scheme (Literal['bilou', 'iob2']) – Scheme used to tag the tokens; it can be "bilou" or "iob2".
- Return type
  List[str]
- Returns
  List[str] – A list describing the document with tags. By default the tags can be "B", "I", "L", "O" and "U"; if tagging_scheme is "iob2" the tags can be "B", "I" and "O".
Examples
>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> from medkit.core.text import TextDocument, Entity, Span
>>> document = TextDocument(text="medkit")
>>> entities = [Entity(label="corporation", spans=[Span(start=0, end=6)], text="medkit")]
>>> # Get the text encoding of the document using the tokenizer
>>> text_encoding = tokenizer(document.text).encodings[0]
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']
Transform to BILOU tags:

>>> tags = transform_entities_to_tags(text_encoding, entities)
>>> assert tags == ['O', 'B-corporation', 'L-corporation', 'O']

Transform to IOB2 tags:

>>> tags = transform_entities_to_tags(text_encoding, entities, "iob2")
>>> assert tags == ['O', 'B-corporation', 'I-corporation', 'O']
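The difference between the two schemes above can be illustrated with a small standalone sketch. The function below (a hypothetical helper, not part of medkit's API) produces the tags for a run of consecutive tokens covered by a single entity, under the assumption that BILOU uses B/I/L for multi-token entities and U for single-token ones, while IOB2 uses only B followed by I:

```python
def entity_token_tags(n_tokens: int, label: str, scheme: str = "bilou") -> list:
    """Tags for a run of n_tokens consecutive tokens covered by one entity."""
    if scheme == "bilou":
        if n_tokens == 1:
            return [f"U-{label}"]  # Unique: the entity fits in a single token
        # Begin, Inside..., Last
        return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 2) + [f"L-{label}"]
    # iob2: Begin, then Inside for every remaining token
    return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 1)

# "medkit" is split into 2 sub-tokens ('med', '##kit'), as in the doctest above
print(entity_token_tags(2, "corporation"))           # ['B-corporation', 'L-corporation']
print(entity_token_tags(2, "corporation", "iob2"))   # ['B-corporation', 'I-corporation']
```

The outputs match the entity-covering portion of the BILOU and IOB2 doctests above; tokens outside any entity would additionally receive the "O" tag.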
- align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, map_sub_tokens=True)[source]
Return a list of tag ids aligned with the text encoding. Tags on special tokens are assigned SPECIAL_TAG_ID_HF.
- Parameters
  - text_encoding (EncodingFast) – Text encoding after tokenization with a HuggingFace fast tokenizer.
  - tags (List[str]) – A list of tags, e.g. BILOU tags.
  - tag_to_id (Dict[str, int]) – Mapping from tag to id.
  - map_sub_tokens (bool) – When a token is not in the vocabulary of the tokenizer, it may be split into multiple sub-tokens. If map_sub_tokens is True, all tags inside a token will be converted. If map_sub_tokens is False, only the first sub-token of a split token will be converted.
- Return type
  List[int]
- Returns
  List[int] – A list of tag ids.
Examples
>>> # Define a fast tokenizer, e.g. a BERT tokenizer
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> # Define the data to map
>>> text_encoding = tokenizer("medkit").encodings[0]
>>> tags = ["O", "B-corporation", "I-corporation", "O"]
>>> tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
>>> print(text_encoding.tokens)
['[CLS]', 'med', '##kit', '[SEP]']

Mapping all tags to tag ids:

>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id)
>>> assert tags_ids == [-100, 1, 2, -100]

Mapping only the first tag of each token:

>>> tags_ids = align_and_map_tokens_with_tags(text_encoding, tags, tag_to_id, False)
>>> assert tags_ids == [-100, 1, -100, -100]
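The alignment logic can be sketched without a tokenizer. The function below is a hypothetical, simplified stand-in for the alignment step: the special-token mask and per-token word ids are passed in as plain lists (in the real function they come from the EncodingFast object), and -100 stands in for SPECIAL_TAG_ID_HF, the id that HuggingFace loss functions ignore:

```python
SPECIAL_TAG_ID = -100  # stand-in for SPECIAL_TAG_ID_HF; ignored by HF loss functions

def align_tags(tags, special_tokens_mask, word_ids, tag_to_id, map_sub_tokens=True):
    """Map per-token tags to ids, masking special tokens and, optionally,
    the trailing sub-tokens of a split word."""
    tag_ids = []
    prev_word_id = None
    for tag, is_special, word_id in zip(tags, special_tokens_mask, word_ids):
        if is_special:
            tag_ids.append(SPECIAL_TAG_ID)
        elif map_sub_tokens or word_id != prev_word_id:
            tag_ids.append(tag_to_id[tag])
        else:
            tag_ids.append(SPECIAL_TAG_ID)  # trailing sub-token, left unmapped
        prev_word_id = word_id
    return tag_ids

# Same data as the doctest: tokens ['[CLS]', 'med', '##kit', '[SEP]'],
# where 'med' and '##kit' are sub-tokens of the same word (word id 0)
tags = ["O", "B-corporation", "I-corporation", "O"]
tag_to_id = {"O": 0, "B-corporation": 1, "I-corporation": 2}
special = [True, False, False, True]
word_ids = [None, 0, 0, None]
print(align_tags(tags, special, word_ids, tag_to_id))         # [-100, 1, 2, -100]
print(align_tags(tags, special, word_ids, tag_to_id, False))  # [-100, 1, -100, -100]
```

Both calls reproduce the doctest results above; masking trailing sub-tokens (map_sub_tokens=False) is the common choice when training models that predict one tag per word.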
- convert_labels_to_tags(labels, tagging_scheme='bilou')[source]
Convert a list of labels into a mapping of NER tags.
- Parameters
  - labels (List[str]) – List of labels to convert.
  - tagging_scheme (Literal['bilou', 'iob2']) – Scheme to use in the conversion; "iob2" follows the BIO scheme.
- Return type
  Dict[str, int]
- Returns
  label_to_id (Dict[str, int]) – Mapping with NER tags.
Examples
>>> convert_labels_to_tags(labels=["test", "problem"], tagging_scheme="iob2")
{'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
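The shape of the returned mapping can be sketched with a small standalone function. This is a hypothetical reimplementation, not medkit's code: "O" is always id 0, and each label contributes one tag per scheme prefix; the id order of the BILOU prefixes shown here is an assumption:

```python
def build_tag_mapping(labels, tagging_scheme="bilou"):
    """Build a tag-to-id mapping; "O" (outside any entity) is always id 0."""
    prefixes = {"bilou": ["B", "I", "L", "U"], "iob2": ["B", "I"]}[tagging_scheme]
    tag_to_id = {"O": 0}
    for label in labels:
        for prefix in prefixes:
            # ids are assigned in insertion order, so each new tag
            # gets the next free integer
            tag_to_id[f"{prefix}-{label}"] = len(tag_to_id)
    return tag_to_id

print(build_tag_mapping(["test", "problem"], tagging_scheme="iob2"))
# {'O': 0, 'B-test': 1, 'I-test': 2, 'B-problem': 3, 'I-problem': 4}
```

For "iob2" this reproduces the doctest above; with "bilou" each label contributes four tags instead of two, so the mapping grows to 4 * len(labels) + 1 entries.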