medkit.text.spacy.spacy_utils

Functions:

build_spacy_doc_from_medkit_doc(nlp, medkit_doc)

Create a Spacy Doc from a TextDocument.

build_spacy_doc_from_medkit_segment(nlp, segment)

Create a Spacy Doc from a Segment.

extract_anns_and_attrs_from_spacy_doc(spacy_doc)

Given a spacy document, convert selected entities or spans into Segments.

extract_anns_and_attrs_from_spacy_doc(spacy_doc, medkit_source_ann=None, entities=None, span_groups=None, attrs=None, attribute_factories=None, rebuild_medkit_anns_and_attrs=False)

Given a spacy document, convert selected entities or spans into Segments. Extract attributes for each annotation in the document.

Parameters
  • spacy_doc (Doc) – A Spacy Doc with spans to be converted

  • medkit_source_ann (Optional[Segment]) – Segment used to rebuild spans referencing the original text

  • entities (Optional[List[str]]) – Labels of entities to be extracted. If None (default), all new entities will be extracted as annotations.

  • span_groups (Optional[List[str]]) – Names of span groups to be extracted. If None (default), all new spans will be extracted as annotations.

  • attrs (Optional[List[str]]) – Names of custom attributes to extract from the annotations that will be included. If None (default), all the custom attributes will be extracted.

  • attribute_factories (Optional[Dict[str, Callable[[Span, str], Attribute]]]) – Mapping of factories in charge of converting spacy attributes to medkit attributes. Factories will receive a spacy span and an attribute label when called. The key in the mapping is the attribute label.

  • rebuild_medkit_anns_and_attrs (bool) – If True, the annotations and attributes with medkit ids will become new annotations/attributes with new ids. If False (default), the annotations and attributes with medkit ids are not rebuilt; only new annotations and attributes are returned.

Return type

Tuple[List[Segment], Dict[str, List[Attribute]]]

Returns

  • annotations (List[Segment]) – Segments extracted from the spacy Doc object

  • attributes_by_ann (Dict[str, List[Attribute]]) – Attributes extracted for each annotation; the key is a medkit uid

Raises

ValueError – Raised when the given medkit source and the spacy doc do not have the same medkit uid
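
The attribute_factories contract described above (a mapping from attribute label to a callable that receives a span and the label) can be sketched in plain Python. This is only an illustration: FakeSpan and FakeAttribute are hypothetical stand-ins for the spacy Span and the medkit Attribute, convert_attrs is a hypothetical helper, and the fallback to a plain attribute when no factory is registered is an assumption rather than documented medkit behaviour.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List
import uuid


# Hypothetical stand-in for a spacy Span carrying custom attribute values.
@dataclass
class FakeSpan:
    text: str
    label_: str
    extensions: Dict[str, Any] = field(default_factory=dict)


# Hypothetical stand-in for a medkit Attribute.
@dataclass
class FakeAttribute:
    label: str
    value: Any
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)


def negation_factory(span: FakeSpan, label: str) -> FakeAttribute:
    """Factory: receives the span and the attribute label, returns an attribute."""
    return FakeAttribute(label=label, value=bool(span.extensions[label]))


# The key of the mapping is the attribute label handled by the factory.
attribute_factories: Dict[str, Callable[[FakeSpan, str], FakeAttribute]] = {
    "is_negated": negation_factory,
}


def convert_attrs(
    span: FakeSpan,
    labels: List[str],
    factories: Dict[str, Callable[[FakeSpan, str], FakeAttribute]],
) -> List[FakeAttribute]:
    """Use the registered factory for a label, else build a plain attribute (assumed fallback)."""
    attrs = []
    for label in labels:
        if label in factories:
            attrs.append(factories[label](span, label))
        else:
            attrs.append(FakeAttribute(label=label, value=span.extensions[label]))
    return attrs


span = FakeSpan("fever", "symptom", {"is_negated": 1, "score": 0.9})
attrs = convert_attrs(span, ["is_negated", "score"], attribute_factories)
```

Here the factory normalises the raw value of is_negated to a boolean, while score has no registered factory and is copied unchanged.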

build_spacy_doc_from_medkit_doc(nlp, medkit_doc, labels_anns=None, attrs=None, include_medkit_info=True)

Create a Spacy Doc from a TextDocument.

Parameters
  • nlp (Language) – Language object with the loaded pipeline from Spacy

  • medkit_doc (TextDocument) – TextDocument to convert

  • labels_anns (Optional[List[str]]) – Labels of annotations to include in the spacy document. If None (default) all the annotations will be included.

  • attrs (Optional[List[str]]) – Labels of attributes to add in the annotations that will be included. If None (default) all the attributes will be added as custom attributes in each annotation included.

  • include_medkit_info (bool) – If True, medkitID is included as an extension in the Doc object to identify the medkit source annotation. If False, no information about IDs is included.

Return type

Doc

Returns

Doc – A Spacy Doc with the selected annotations included.
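
The labels_anns and attrs parameters share one convention: None selects everything, while a list restricts the selection to the listed labels. A minimal sketch of that convention follows, using a hypothetical FakeAnnotation stand-in and a hypothetical select_by_label helper rather than medkit's own classes. In this sketch an empty list selects nothing; whether medkit treats an empty list the same way is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical stand-in for a medkit annotation; only the label matters here.
@dataclass
class FakeAnnotation:
    label: str
    text: str


def select_by_label(
    anns: List[FakeAnnotation], labels: Optional[List[str]] = None
) -> List[FakeAnnotation]:
    """Return every annotation when labels is None, else only those whose label is listed."""
    if labels is None:
        return list(anns)
    wanted = set(labels)
    return [ann for ann in anns if ann.label in wanted]


anns = [FakeAnnotation("disease", "flu"), FakeAnnotation("drug", "aspirin")]
all_anns = select_by_label(anns)             # None: keep every annotation
drugs_only = select_by_label(anns, ["drug"])  # keep only the "drug" label
```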

build_spacy_doc_from_medkit_segment(nlp, segment, annotations=[], attrs=None, include_medkit_info=True)

Create a Spacy Doc from a Segment.

Parameters
  • nlp (Language) – Language object with the loaded pipeline from Spacy

  • segment (Segment) – Segment to convert; this annotation contains the text used to create the spacy doc

  • annotations (List[Segment]) – List of annotations in segment to include

  • attrs (Optional[List[str]]) – Labels of attributes to add in the annotations that will be included. If None (default) all the attributes will be added as custom attributes in each annotation included.

  • include_medkit_info (bool) – If True, medkitID is included as an extension in the Doc object to identify the medkit source annotation. If False, no information about IDs is included.

Return type

Doc

Returns

Doc – A Spacy Doc with the selected annotations included.