medkit.core#

APIs#

To access these APIs, import them like this:

from medkit.core import <api_to_import>

Classes:

AnnotationContainer(doc_id)

Manage a list of annotations belonging to a document.

Attribute(label[, value, metadata, uid])

Medkit attribute, to be added to an annotation

AttributeContainer(owner_id)

Manage a list of attributes attached to another data structure.

Collection(*[, text_docs, audio_docs])

Collection of documents of any modality (text, audio).

DescribableOperation(*args, **kwargs)

DocOperation([uid, name])

Abstract operation directly executed on text documents.

DocPipeline(pipeline[, labels_by_input_key, uid])

Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.

Document(*args, **kwargs)

Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).

GlobalStore()

Global store

IdentifiableDataItem(*args, **kwargs)

IdentifiableDataItemWithAttrs(*args, **kwargs)

InputConverter()

Abstract class for converting external documents to medkit documents

Operation([uid, name])

Abstract class for all annotator modules

OperationDescription(uid, name[, ...])

Description of a specific instance of an operation

OutputConverter()

Abstract class for converting medkit documents to an external format

Pipeline(steps, input_keys, output_keys[, ...])

Graph of processing operations

PipelineCompatibleOperation(*args, **kwargs)

PipelineStep(operation, input_keys, output_keys)

Pipeline item describing how a processing operation is connected to other operations

Prov(data_item, op_desc, source_data_items, ...)

Provenance information for a specific data item.

ProvCompatibleOperation(*args, **kwargs)

ProvStore(*args, **kwargs)

ProvTracer([store, _graph])

Provenance tracing component.

Store(*args, **kwargs)

Store protocol

Functions:

generate_deterministic_id(reference_id)

Generate a deterministic UUID based on reference_id.

class AnnotationContainer(doc_id)[source]#

Manage a list of annotations belonging to a document.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

The annotations will be stored in a Store, which can rely on a simple dict or something more complicated like a database.

The store is global and may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.

Instantiate the annotation container

Parameters

doc_id (str) – The identifier of the document to which the annotations belong.

Methods:

add(ann)

Attach an annotation to the document.

get(*[, label, key])

Return a list of the annotations of the document, optionally filtering by label or key.

get_by_id(uid)

Return the annotation corresponding to a specific identifier.

get_ids(*[, label, key])

Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.

add(ann)[source]#

Attach an annotation to the document.

Parameters

ann (~AnnotationType) – Annotation to add.

Raises

ValueError – If the annotation is already attached to the document (based on annotation.uid)

get(*, label=None, key=None)[source]#

Return a list of the annotations of the document, optionally filtering by label or key.

Parameters
  • label (Optional[str]) – Label to use to filter annotations.

  • key (Optional[str]) – Key to use to filter annotations.

Return type

List[~AnnotationType]

get_ids(*, label=None, key=None)[source]#

Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.

This method is provided to make it easier to implement additional filtering in subclasses.

Parameters
  • label (Optional[str]) – Label to use to filter annotations.

  • key (Optional[str]) – Key to use to filter annotations.

Return type

Iterator[str]

get_by_id(uid)[source]#

Return the annotation corresponding to a specific identifier.

Parameters

uid (str) – Identifier of the annotation to return.

Return type

~AnnotationType
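
The list-like behaviour and the label/key filtering described for AnnotationContainer can be pictured with a small stand-in class. This is only a sketch under stated assumptions: ToyAnnotation and ToyAnnotationContainer are hypothetical names, the backing store is a plain dict, and none of this is medkit's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set
import uuid


@dataclass
class ToyAnnotation:
    # Hypothetical stand-in for a medkit annotation
    label: str
    keys: Set[str] = field(default_factory=set)
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyAnnotationContainer:
    """Hypothetical container mimicking the documented interface."""

    def __init__(self, doc_id: str):
        self.doc_id = doc_id
        self._store: Dict[str, ToyAnnotation] = {}  # assumed dict-based store
        self._ann_ids: List[str] = []  # identifiers, in insertion order

    def add(self, ann: ToyAnnotation) -> None:
        # Refuse to attach the same annotation twice (based on its uid)
        if ann.uid in self._ann_ids:
            raise ValueError(f"Annotation {ann.uid} is already attached")
        self._ann_ids.append(ann.uid)
        self._store[ann.uid] = ann

    def get_ids(
        self, *, label: Optional[str] = None, key: Optional[str] = None
    ) -> Iterator[str]:
        # All filtering lives here, so a subclass only has to override this method
        for uid in self._ann_ids:
            ann = self._store[uid]
            if label is not None and ann.label != label:
                continue
            if key is not None and key not in ann.keys:
                continue
            yield uid

    def get(
        self, *, label: Optional[str] = None, key: Optional[str] = None
    ) -> List[ToyAnnotation]:
        return [self._store[uid] for uid in self.get_ids(label=label, key=key)]

    def get_by_id(self, uid: str) -> ToyAnnotation:
        return self._store[uid]

    def __len__(self) -> int:
        return len(self._ann_ids)

    def __iter__(self) -> Iterator[ToyAnnotation]:
        return iter(self.get())


container = ToyAnnotationContainer(doc_id="doc-1")
sentence = ToyAnnotation(label="SENTENCE")
entity = ToyAnnotation(label="ENTITY", keys={"ner_output"})
container.add(sentence)
container.add(entity)
```

Routing get(), iteration and len() through the list of identifiers is one way to keep the container independent of where the annotations are actually stored.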

class Attribute(label, value=None, metadata=None, uid=None)[source]#

Medkit attribute, to be added to an annotation

Variables
  • label (str) – The attribute label

  • value (Optional[Any]) – The value of the attribute. Should be either simple built-in types (int, float, bool, str) or collections of these types (list, dict, tuple). If you need structured complex data you should create a subclass of Attribute.

  • metadata (Dict[str, Any]) – The metadata of the attribute

  • uid (str) – The identifier of the attribute

Methods:

copy()

Create a new attribute that is a copy of the current instance, but with a new identifier

from_dict(attribute_dict)

Creates an Attribute from a dict

get_subclass_for_data_dict(data_dict)

Return the subclass that corresponds to the class name found in a data dict

to_brat()

Return a value compatible with the brat format

to_spacy()

Return a value compatible with spaCy

classmethod get_subclass_for_data_dict(data_dict)#

Return the subclass that corresponds to the class name found in a data dict

Parameters

data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)

Return type

Optional[Type[Self]]

Returns

subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.

to_brat()[source]#

Return a value compatible with the brat format

Return type

Optional[Any]

to_spacy()[source]#

Return a value compatible with spaCy

Return type

Optional[Any]

copy()[source]#

Create a new attribute that is a copy of the current instance, but with a new identifier

This is used when we want to duplicate an existing attribute onto a different annotation.

Return type

Attribute

classmethod from_dict(attribute_dict)[source]#

Creates an Attribute from a dict

Parameters

attribute_dict (dict) – A dictionary from a serialized Attribute as generated by to_dict()

Return type

Self
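
The difference between copy() (same content, new identifier) and a to_dict()/from_dict() round trip (same content, same identifier) can be illustrated with a stand-in dataclass. ToyAttribute is a hypothetical name, and the dict layout used here is an assumption made for this sketch, not medkit's serialization format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import uuid


@dataclass
class ToyAttribute:
    # Hypothetical stand-in mirroring the documented fields of Attribute
    label: str
    value: Optional[Any] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def copy(self) -> "ToyAttribute":
        # Same label, value and metadata, but a freshly generated identifier
        return ToyAttribute(
            label=self.label, value=self.value, metadata=dict(self.metadata)
        )

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout, chosen for this sketch only
        return {
            "uid": self.uid,
            "label": self.label,
            "value": self.value,
            "metadata": dict(self.metadata),
        }

    @classmethod
    def from_dict(cls, attribute_dict: Dict[str, Any]) -> "ToyAttribute":
        return cls(
            label=attribute_dict["label"],
            value=attribute_dict["value"],
            metadata=attribute_dict["metadata"],
            uid=attribute_dict["uid"],
        )


negation = ToyAttribute(label="is_negated", value=True)
duplicate = negation.copy()  # to attach to a different annotation
restored = ToyAttribute.from_dict(negation.to_dict())
```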

class AttributeContainer(owner_id)[source]#

Manage a list of attributes attached to another data structure. For example, it may be a document or an annotation.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

The attributes will be stored in a Store, which can rely on a simple dict or something more complicated like a database.

The store is global and may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.

Methods:

add(attr)

Attach an attribute to the annotation.

get(*[, label])

Return a list of the attributes of the annotation, optionally filtering by label.

get_by_id(uid)

Return the attribute corresponding to a specific identifier.

get(*, label=None)[source]#

Return a list of the attributes of the annotation, optionally filtering by label.

Parameters

label (Optional[str]) – Label to use to filter attributes.

Return type

List[Attribute]

add(attr)[source]#

Attach an attribute to the annotation.

Parameters

attr (Attribute) – Attribute to add.

Raises

ValueError – If the attribute is already attached to the annotation (based on attr.uid).

get_by_id(uid)[source]#

Return the attribute corresponding to a specific identifier.

Parameters

uid (str) – Identifier of the attribute to return.

Return type

Attribute

class Collection(*, text_docs=None, audio_docs=None)[source]#

Collection of documents of any modality (text, audio).

This class makes it possible to group together a set of documents representing a common unit (for instance a patient), even if they don’t belong to the same modality.

This class is still a work-in-progress. In the future it should be possible to attach additional information to a Collection.

Parameters
  • text_docs (Optional[List[TextDocument]]) – List of text documents

  • audio_docs (Optional[List[AudioDocument]]) – List of audio documents

Attributes:

all_docs

List of all the documents belonging to the collection, whatever their modality

property all_docs: List[medkit.core.document.Document]#

List of all the documents belonging to the collection, whatever their modality

Return type

List[Document]

class InputConverter[source]#

Abstract class for converting external documents to medkit documents

class OutputConverter[source]#

Abstract class for converting medkit documents to an external format

class IdentifiableDataItem(*args, **kwargs)[source]#

class IdentifiableDataItemWithAttrs(*args, **kwargs)[source]#

class DocPipeline(pipeline, labels_by_input_key=None, uid=None)[source]#

Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.

Initialize the pipeline

Parameters
  • pipeline (Pipeline) – Pipeline to execute on documents. Annotations given to pipeline (corresponding to its input_keys) will be retrieved from documents, according to labels_by_input_key. Annotations returned by pipeline (corresponding to its output_keys) will be added to documents.

  • labels_by_input_key (Optional[Dict[str, List[str]]]) –

    Optional labels of existing annotations that should be retrieved from documents and passed to the pipeline as input. One list of labels per input key.

    When labels_by_input_key is not provided, it is assumed that the pipeline just expects the document raw segments as input.

    For the use case where the documents contain pre-existing sentence segments labelled as “SENTENCE” that we want to pass to the “sentences” input key of the pipeline:

    >>> from medkit.core import DocPipeline
    >>> doc_pipeline = DocPipeline(
    ...     pipeline,
    ...     labels_by_input_key={"sentences": ["SENTENCE"]},
    ... )
    

    Because the values of labels_by_input_key are lists (one per input), it is possible to use annotations with different labels for the same input key.

Methods:

run(docs)

Run the pipeline on a list of documents, adding the output annotations to each document

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

run(docs)[source]#

Run the pipeline on a list of documents, adding the output annotations to each document

Parameters

docs (List[Document[~AnnotationType]]) – The documents on which to run the pipeline. Labels to input keys association will be used to retrieve existing annotations from each document, and all output annotations will also be added to each corresponding document.

Return type

None
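
The association between labels and input keys used by run() can be sketched as a simple selection step. Everything here is an assumption made for illustration: collect_inputs is a hypothetical helper, and annotations are reduced to (label, text) tuples rather than medkit objects.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: each annotation is a (label, text) tuple
Annotation = Tuple[str, str]


def collect_inputs(
    doc_annotations: List[Annotation],
    labels_by_input_key: Dict[str, List[str]],
) -> Dict[str, List[Annotation]]:
    """For each input key, gather the annotations whose label is listed for that key."""
    return {
        input_key: [ann for ann in doc_annotations if ann[0] in labels]
        for input_key, labels in labels_by_input_key.items()
    }


doc_annotations = [
    ("SENTENCE", "The patient has asthma."),
    ("TITLE", "Medical history"),
    ("ENTITY", "asthma"),
]

# Two different labels feed the same "sentences" input key
inputs = collect_inputs(doc_annotations, {"sentences": ["SENTENCE", "TITLE"]})
```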

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription

class Document(*args, **kwargs)[source]#

Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).

Documents can contain Annotation objects.

Variables
generate_deterministic_id(reference_id)[source]#

Generate a deterministic UUID based on reference_id. The generated UUID will be the same if the reference_id is the same.

Parameters

reference_id (str) – A string representation of a UID

Return type

UUID

Returns

uuid.UUID – The UUID object
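
The property described above (same reference_id, same UUID) can be reproduced with the standard library. The sketch below uses uuid.uuid5 with an arbitrarily chosen namespace; toy_deterministic_id is a hypothetical name, and this is not necessarily how medkit derives its identifiers.

```python
import uuid


def toy_deterministic_id(reference_id: str) -> uuid.UUID:
    # uuid5 hashes a namespace and a name, so equal inputs give equal UUIDs.
    # NAMESPACE_OID is an arbitrary choice made for this sketch.
    return uuid.uuid5(uuid.NAMESPACE_OID, reference_id)


first = toy_deterministic_id("doc-42")
second = toy_deterministic_id("doc-42")
other = toy_deterministic_id("doc-43")
```

Deterministic identifiers of this kind are mostly useful in tests and reproducible runs, where randomly generated identifiers would change from one execution to the next.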

class DocOperation(uid=None, name=None, **kwargs)[source]#

Abstract operation directly executed on text documents. It uses a list of documents as input for running the operation and creates annotations that are directly appended to these documents.

Common initialization for all annotators:
  • assigning identifier to operation

  • storing class name, name and config in description

Parameters
  • uid (str) – Operation identifier

  • name – Operation name (defaults to class name)

  • kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

Attributes:

description

Contains all the operation init parameters.

Methods:

set_prov_tracer(prov_tracer)

Enable provenance tracing.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class Operation(uid=None, name=None, **kwargs)[source]#

Abstract class for all annotator modules

Common initialization for all annotators:
  • assigning identifier to operation

  • storing class name, name and config in description

Parameters
  • uid (str) – Operation identifier

  • name – Operation name (defaults to class name)

  • kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

Methods:

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type

OperationDescription
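
The locals() idiom shown in the Examples above can be tried out against a stand-in base class. ToyOperation and ToySentenceSplitter are hypothetical names used only for this sketch. In CPython, locals() inside a method that calls zero-argument super() may also contain a __class__ entry, so the sketch drops it defensively; whether medkit itself needs or does this is not stated in this page.

```python
import uuid
from typing import Any, Dict, Optional


class ToyOperation:
    # Hypothetical base class: keeps uid and name, stores everything else as config
    def __init__(
        self, uid: Optional[str] = None, name: Optional[str] = None, **kwargs: Any
    ):
        self.uid = uid if uid is not None else str(uuid.uuid4())
        self.name = name if name is not None else type(self).__name__
        self.config: Dict[str, Any] = kwargs


class ToySentenceSplitter(ToyOperation):
    def __init__(
        self,
        separator: str = ".",
        keep_empty: bool = False,
        uid: Optional[str] = None,
        name: Optional[str] = None,
    ):
        # Capture all init arguments before any other local variable is defined
        init_args = dict(locals())
        init_args.pop("self")
        init_args.pop("__class__", None)  # may be present because super() is used
        super().__init__(**init_args)

        self.separator = separator
        self.keep_empty = keep_empty


splitter = ToySentenceSplitter(separator=";")
```

Calling locals() on the first line of __init__ matters: any variable assigned earlier in the method would otherwise end up in the recorded configuration.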

class OperationDescription(uid, name, class_name=None, config=<factory>)[source]#

Description of a specific instance of an operation

Parameters
  • uid (str) – The unique identifier of the instance described

  • name (str) – The name of the operation. Can be the same as class_name or something more specific, for operations with a behavior that can be customized (for instance a rule-based entity matcher with user-provided rules, or a model-based entity matcher with a user-provided model)

  • class_name (Optional[str]) – The name of the class of the operation

  • config (Dict[str, Any]) – The specific configuration of the instance

class Pipeline(steps, input_keys, output_keys, name=None, uid=None)[source]#

Graph of processing operations

A pipeline is made of pipeline steps, connecting together different processing operations by the use of input/output keys. Each operation can be seen as a node and the keys are its edges. Two operations can be chained by using the same string as an output key for the first operation and as an input key for the second.

Steps must be added in the order of execution, since there is no dependency detection mechanism.

Initialize the pipeline

Parameters
  • steps (List[PipelineStep]) –

    List of pipeline steps

    Steps will be executed in the order in which they were added, so make sure to first add the steps that generate data used by other steps.

  • input_keys (List[str]) – List of keys corresponding to the inputs passed to run()

  • output_keys (List[str]) – List of keys corresponding to the outputs returned by run()

  • name (Optional[str]) – Name describing the pipeline (defaults to the class name)

  • uid (Optional[str]) – Identifier of the pipeline

Methods:

run(*all_input_data)

Run the pipeline.

run(*all_input_data)[source]#

Run the pipeline.

Parameters

*all_input_data (List[Any]) –

Input data expected by the pipeline, must be of the same length as the pipeline's input_keys.

For each input key, the corresponding input data must be a list of items that can be of any type.

Return type

Union[None, List[Any], Tuple[List[Any], …]]

Returns

Union[None, List[Any], Tuple[List[Any], …]] – All output data returned by the pipeline, will be of the same length as the pipeline's output_keys.

For each output key, the corresponding output will be a list of items that can be of any type.

If the pipeline has only one output key, then the corresponding output will be directly returned, not wrapped in a tuple. If the pipeline doesn’t have any output key, nothing (i.e. None) will be returned.
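
The key-based chaining and the return conventions of run() can be sketched with a toy executor. ToyStep and run_toy_pipeline are hypothetical names and plain functions stand in for operations; this illustrates the documented behaviour, not medkit's implementation.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class ToyStep(NamedTuple):
    # Hypothetical stand-in for a pipeline step; func plays the role of operation.run
    func: Callable[..., Any]
    input_keys: List[str]
    output_keys: List[str]


def run_toy_pipeline(steps, input_keys, output_keys, *all_input_data):
    if len(all_input_data) != len(input_keys):
        raise ValueError("Expected one list of input data per input key")
    data_by_key: Dict[str, List[Any]] = dict(zip(input_keys, all_input_data))

    # Steps run strictly in the order given: there is no dependency resolution
    for step in steps:
        outputs = step.func(*(data_by_key[key] for key in step.input_keys))
        # A single output may be returned directly, without a wrapping tuple
        if len(step.output_keys) == 1 and not isinstance(outputs, tuple):
            outputs = (outputs,)
        for key, output in zip(step.output_keys, outputs or ()):
            data_by_key[key] = output

    results = tuple(data_by_key[key] for key in output_keys)
    if not results:
        return None
    return results[0] if len(results) == 1 else results


def split_sentences(texts):
    return [s.strip() for text in texts for s in text.split(".") if s.strip()]


def uppercase(sentences):
    return [s.upper() for s in sentences]


steps = [
    # "sentences" is the output key of the first step and the input key of the second
    ToyStep(split_sentences, input_keys=["raw"], output_keys=["sentences"]),
    ToyStep(uppercase, input_keys=["sentences"], output_keys=["shouted"]),
]

texts = ["Hello there. Bye."]
shouted = run_toy_pipeline(steps, ["raw"], ["shouted"], texts)
both = run_toy_pipeline(steps, ["raw"], ["sentences", "shouted"], texts)
nothing = run_toy_pipeline(steps, ["raw"], [], texts)
```

With one output key the list comes back directly, with two it comes back as a tuple of lists, and with none the result is None. Swapping the two steps would fail with a KeyError, which is the practical consequence of having no dependency detection.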

class PipelineStep(operation, input_keys, output_keys, aggregate_input_keys=False)[source]#

Pipeline item describing how a processing operation is connected to other operations

Parameters
  • operation (medkit.core.pipeline.PipelineCompatibleOperation) – The operation to use at that step

  • input_keys (List[str]) – For each input of operation, the key to use to retrieve the corresponding annotations (either retrieved from a document or generated by an earlier pipeline step)

  • output_keys (List[str]) – For each output of operation, the key used to pass output annotations to the next Pipeline step. Can be empty if operation doesn’t return new annotations.

  • aggregate_input_keys (bool) – If True, all the annotations from multiple input keys are aggregated in a single list. Defaults to False

class PipelineCompatibleOperation(*args, **kwargs)[source]#

Methods:

run(*all_input_data)

Process one or several lists of data items.

run(*all_input_data)[source]#
Parameters

all_input_data (List[Any]) – One or several lists of data items to process (according to the number of inputs the operation needs)

Return type

Union[None, List[Any], Tuple[List[Any], …]]

Returns

Union[None, List[Any], Tuple[List[Any], …]] – Tuple of lists of all new data items created by the operation. Can be None if the operation does not create any new data items but rather modifies existing items in place (for instance by adding attributes to existing annotations). If there is only one list of created data items, it is possible to return that list directly without wrapping it in a tuple.
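
The two return conventions of run() (returning new items versus modifying inputs in place and returning None) can be sketched as follows. ToyPipelineCompatible, Lowercaser and LengthTagger are hypothetical names, and items are plain strings and dicts rather than medkit data items.

```python
from typing import Any, Dict, List, Protocol, Tuple, Union, runtime_checkable

RunOutput = Union[None, List[Any], Tuple[List[Any], ...]]


@runtime_checkable
class ToyPipelineCompatible(Protocol):
    # Hypothetical protocol mirroring the documented run() signature
    def run(self, *all_input_data: List[Any]) -> RunOutput: ...


class Lowercaser:
    """Creates new items, so it returns them (a single list, not wrapped in a tuple)."""

    def run(self, texts: List[str]) -> List[str]:
        return [text.lower() for text in texts]


class LengthTagger:
    """Modifies existing items in place, so it returns None."""

    def run(self, items: List[Dict[str, Any]]) -> None:
        for item in items:
            item["length"] = len(item["text"])


items = [{"text": "Asthma"}]
lowered = Lowercaser().run(["Asthma", "COPD"])
result = LengthTagger().run(items)
```

Because the protocol is structural, neither class has to inherit from it: having a run() method is enough for an isinstance() check against a runtime-checkable protocol to pass.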

class DescribableOperation(*args, **kwargs)[source]#

class ProvCompatibleOperation(*args, **kwargs)[source]#

class ProvTracer(store=None, _graph=None)[source]#

Provenance tracing component.

ProvTracer is intended to gather provenance information about how all data is generated by medkit. For each data item (for instance an annotation or an attribute), ProvTracer can tell the operation that created it, the data items that were used to create it, and reciprocally, the data items that were derived from it (cf. Prov).

Provenance-compatible operations should inform the provenance tracer of each data item that they create through the add_prov() method.

Users wanting to gather provenance information should instantiate one unique ProvTracer object and provide it to all operations involved in their data processing flow. Once all operations have been executed, they may then retrieve provenance info for specific data items through get_prov(), or for all items with get_provs().

Composite operations relying on inner operations (such as pipelines) shouldn’t call add_prov() method. Instead, they should instantiate their own internal ProvTracer and provide it to the operations they rely on, then use add_prov_from_sub_tracer() to integrate information from this internal sub-provenance tracer into the main provenance tracer that was provided to them.

This will build sub-provenance information, that can be retrieved later through get_sub_prov_tracer() or get_sub_prov_tracers(). The inner operations of a composite operation can themselves be composite operations, leading to a tree-like structure of nested provenance tracers.

Parameters

store (Optional[ProvStore]) – Store that will contain all traced data items.

Methods:

add_prov(data_item, op_desc, source_data_items)

Append provenance information about a specific data item.

add_prov_from_sub_tracer(data_items, ...)

Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.

get_prov(data_item_id)

Return provenance information about a specific data item.

get_provs()

Return all provenance information about all data items known to the tracer.

get_sub_prov_tracer(operation_id)

Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.

get_sub_prov_tracers()

Return all sub-provenance tracers of the provenance tracer.

has_prov(data_item_id)

Check if the provenance tracer has provenance information about a specific data item.

has_sub_prov_tracer(operation_id)

Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).

add_prov(data_item, op_desc, source_data_items)[source]#

Append provenance information about a specific data item.

Parameters
add_prov_from_sub_tracer(data_items, op_desc, sub_tracer)[source]#

Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.

Parameters
  • data_items (List[IdentifiableDataItem]) – Data items created by the composite operation. Should not include internal intermediate data items, only the output of the operation.

  • op_desc (OperationDescription) – Description of the composite operation that created the data items.

  • sub_tracer (ProvTracer) – Internal sub-provenance tracer of the composite operation.

has_prov(data_item_id)[source]#

Check if the provenance tracer has provenance information about a specific data item.

Note

This will return False if we have provenance info about a data item but only in a sub-provenance tracer.

Parameters

data_item_id (str) – Id of the data item.

Return type

bool

Returns

bool – True if there is provenance info that can be retrieved with get_prov().

get_prov(data_item_id)[source]#

Return provenance information about a specific data item.

Parameters

data_item_id (str) – Id of the data item.

Return type

Prov

Returns

Prov – Provenance info about the data item.

get_provs()[source]#

Return all provenance information about all data items known to the tracer.

Note

Nested provenance info from sub-provenance tracers will not be returned.

Return type

List[Prov]

Returns

List[Prov] – Provenance info about all known data items.

has_sub_prov_tracer(operation_id)[source]#

Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).

Note

This will return False if there is a sub-provenance tracer for the operation but that is not a direct child (i.e. that is deeper in the hierarchy).

Parameters

operation_id (str) – Id of the composite operation.

Return type

bool

Returns

bool – True if there is a sub-provenance tracer for the operation.

get_sub_prov_tracer(operation_id)[source]#

Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.

Parameters

operation_id (str) – Id of the composite operation.

Return type

ProvTracer

Returns

ProvTracer – The sub-provenance tracer containing sub-provenance information from the operation.

get_sub_prov_tracers()[source]#

Return all sub-provenance tracers of the provenance tracer.

Note

This will not return sub-provenance tracers that are not direct children of this tracer (i.e. that are deeper in the hierarchy).

Return type

List[ProvTracer]

Returns

List[ProvTracer] – All sub-provenance tracers of this provenance tracer.
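
The bookkeeping behind add_prov(), has_prov(), get_prov() and get_provs() can be pictured with a flat, dict-based stand-in. ToyProv and ToyProvTracer are hypothetical names, data items and operations are reduced to plain strings, and sub-provenance tracers are left out entirely, so this is a sketch of the idea rather than of medkit's graph-based implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyProv:
    # Hypothetical record: which operation created an item, from what, and what it led to
    data_item_id: str
    op_name: str
    source_ids: List[str]
    derived_ids: List[str] = field(default_factory=list)


class ToyProvTracer:
    def __init__(self) -> None:
        self._provs: Dict[str, ToyProv] = {}

    def add_prov(self, data_item_id: str, op_name: str, source_ids: List[str]) -> None:
        if data_item_id in self._provs:
            raise ValueError(f"Provenance already recorded for {data_item_id}")
        self._provs[data_item_id] = ToyProv(data_item_id, op_name, list(source_ids))
        # Record the reverse link on every source that is already known
        for source_id in source_ids:
            if source_id in self._provs:
                self._provs[source_id].derived_ids.append(data_item_id)

    def has_prov(self, data_item_id: str) -> bool:
        return data_item_id in self._provs

    def get_prov(self, data_item_id: str) -> ToyProv:
        return self._provs[data_item_id]

    def get_provs(self) -> List[ToyProv]:
        return list(self._provs.values())


# One tracer shared by every operation of the processing flow
tracer = ToyProvTracer()
tracer.add_prov("sentence-1", "sentence_splitter", source_ids=[])
tracer.add_prov("entity-1", "entity_matcher", source_ids=["sentence-1"])
tracer.add_prov("negation-1", "negation_detector", source_ids=["entity-1"])

entity_prov = tracer.get_prov("entity-1")
```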

class Prov(data_item, op_desc, source_data_items, derived_data_items)[source]#

Provenance information for a specific data item.

Parameters
class Store(*args, **kwargs)[source]#

Store protocol

class GlobalStore[source]#

Global store

Methods:

del_store()

Delete the global store object

get_store()

Returns the global store object

init_store(store)

Initialize the global store for your application

classmethod init_store(store)[source]#

Initialize the global store for your application

Parameters

store (Store) – Store for all the data items

Raises

RuntimeError – If global store is already set

classmethod get_store()[source]#

Returns the global store object

Return type

Store

Returns

Store – the global store

classmethod del_store()[source]#

Delete the global store object
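
The lifecycle of the global store (initialize once, retrieve, delete) can be sketched with class-level state. ToyGlobalStore is a hypothetical name and a plain dict plays the role of the Store. The fallback to a default dict in get_store() follows the container descriptions earlier in this page, but where exactly medkit creates that default is an assumption here.

```python
from typing import Any, Dict, Optional


class ToyGlobalStore:
    # Hypothetical stand-in; the store is shared through a class attribute
    _store: Optional[Dict[str, Any]] = None

    @classmethod
    def init_store(cls, store: Dict[str, Any]) -> None:
        if cls._store is not None:
            raise RuntimeError("Global store is already set")
        cls._store = store

    @classmethod
    def get_store(cls) -> Dict[str, Any]:
        # Assumption: fall back to a default dict store when none was initialized
        if cls._store is None:
            cls._store = {}
        return cls._store

    @classmethod
    def del_store(cls) -> None:
        cls._store = None


custom_store: Dict[str, Any] = {}
ToyGlobalStore.init_store(custom_store)
same_store = ToyGlobalStore.get_store()

try:
    ToyGlobalStore.init_store({})
    second_init_failed = False
except RuntimeError:
    second_init_failed = True

ToyGlobalStore.del_store()
fresh_store = ToyGlobalStore.get_store()
```

Initializing the store before creating any document or annotation avoids the situation where some items already live in the default store when a custom one is set.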

class ProvStore(*args, **kwargs)[source]#

Subpackages / Submodules#

medkit.core.annotation

medkit.core.annotation_container

medkit.core.attribute

medkit.core.attribute_container

medkit.core.audio

medkit.core.collection

medkit.core.conversion

medkit.core.data_item

medkit.core.dict_conv

medkit.core.doc_pipeline

medkit.core.document

medkit.core.id

medkit.core.operation

medkit.core.operation_desc

medkit.core.pipeline

medkit.core.prov_store

medkit.core.prov_tracer

medkit.core.store

medkit.core.text

medkit.core.utils