medkit.core#

APIs#

To access these APIs, import them like this:

from medkit.core import <api_to_import>

Classes:

`AnnotationContainer`(doc_id)	Manage a list of annotations belonging to a document.
`Attribute`(label[, value, metadata, uid])	Medkit attribute, to be added to an annotation
`AttributeContainer`(owner_id)	Manage a list of attributes attached to another data structure.
`Collection`(*[, text_docs, audio_docs])	Collection of documents of any modality (text, audio).
`DescribableOperation`(*args, **kwargs)
`DocOperation`([uid, name])	Abstract operation directly executed on text documents.
`DocPipeline`(pipeline[, labels_by_input_key, uid])	Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.
`Document`(*args, **kwargs)	Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).
`GlobalStore`()	Global store
`IdentifiableDataItem`(*args, **kwargs)
`IdentifiableDataItemWithAttrs`(*args, **kwargs)
`InputConverter`()	Abstract class for converting external documents to medkit documents
`Operation`([uid, name])	Abstract class for all annotator modules
`OperationDescription`(uid, name[, ...])	Description of a specific instance of an operation
`OutputConverter`()	Abstract class for converting medkit documents to an external format
`Pipeline`(steps, input_keys, output_keys[, ...])	Graph of processing operations
`PipelineCompatibleOperation`(*args, **kwargs)
`PipelineStep`(operation, input_keys, output_keys)	Pipeline item describing how a processing operation is connected to other operations
`Prov`(data_item, op_desc, source_data_items, ...)	Provenance information for a specific data item.
`ProvCompatibleOperation`(*args, **kwargs)
`ProvStore`(*args, **kwargs)
`ProvTracer`([store, _graph])	Provenance tracing component.
`Store`(*args, **kwargs)	Store protocol

Functions:

generate_deterministic_id(reference_id)

Generate a deterministic UUID based on reference_id.

class AnnotationContainer(doc_id)[source]#

Manage a list of annotations belonging to a document.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

The annotations will be stored in a Store, which can rely on a simple dict or something more complicated like a database.

This global store may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.

Instantiate the annotation container

Parameters: doc_id (str) – Identifier of the document to which the annotations belong.

Methods:

`add`(ann)	Attach an annotation to the document.
`get`(*[, label, key])	Return a list of the annotations of the document, optionally filtering by label or key.
`get_by_id`(uid)	Return the annotation corresponding to a specific identifier.
`get_ids`(*[, label, key])	Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.

add(ann)[source]#

Attach an annotation to the document.

Parameters: ann (~AnnotationType) – Annotation to add.
Raises: ValueError – If the annotation is already attached to the document (based on annotation.uid)

get(*, label=None, key=None)[source]#

Return a list of the annotations of the document, optionally filtering by label or key.

Parameters

label (Optional[str]) – Label to use to filter annotations.
key (Optional[str]) – Key to use to filter annotations.

Return type

List[~AnnotationType]

get_ids(*, label=None, key=None)[source]#

Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.

This method is provided to make it easier to implement additional filtering in subclasses.

Parameters

label (Optional[str]) – Label to use to filter annotations.
key (Optional[str]) – Key to use to filter annotations.

Return type

Iterator[str]

get_by_id(uid)[source]#

Return the annotation corresponding to a specific identifier.

Parameters: uid (str) – Identifier of the annotation to return.
Return type: ~AnnotationType
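
The container behaviour described above (list-like access, filtering by label, rejection of duplicate identifiers) can be sketched with a plain dict standing in for the store. `SimpleAnnotation` and `SimpleAnnotationContainer` are hypothetical names used for illustration only; this is not medkit's implementation, and filtering by key is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional
import uuid


@dataclass
class SimpleAnnotation:
    # Hypothetical minimal annotation: a label plus a unique identifier
    label: str
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


class SimpleAnnotationContainer:
    """Illustrative stand-in for a container backed by a dict store."""

    def __init__(self, doc_id: str):
        self.doc_id = doc_id
        self._store: Dict[str, SimpleAnnotation] = {}  # uid -> annotation
        self._ann_ids: List[str] = []  # keeps insertion order

    def add(self, ann: SimpleAnnotation) -> None:
        # Duplicates are detected on the annotation identifier
        if ann.uid in self._store:
            raise ValueError(f"Annotation {ann.uid} is already attached")
        self._store[ann.uid] = ann
        self._ann_ids.append(ann.uid)

    def get_ids(self, *, label: Optional[str] = None) -> Iterator[str]:
        # Filtering lives here so that get() can stay a thin wrapper
        for uid in self._ann_ids:
            if label is None or self._store[uid].label == label:
                yield uid

    def get(self, *, label: Optional[str] = None) -> List[SimpleAnnotation]:
        return [self._store[uid] for uid in self.get_ids(label=label)]

    def get_by_id(self, uid: str) -> SimpleAnnotation:
        return self._store[uid]

    def __len__(self) -> int:
        return len(self._ann_ids)

    def __iter__(self) -> Iterator[SimpleAnnotation]:
        return iter(self.get())
```

Keeping the filtering logic in `get_ids()` mirrors the note above: a subclass only needs to override one method to add its own filters.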

class Attribute(label, value=None, metadata=None, uid=None)[source]#

Medkit attribute, to be added to an annotation

Variables

label (str) – The attribute label
value (Optional[Any]) – The value of the attribute. Should be either simple built-in types (int, float, bool, str) or collections of these types (list, dict, tuple). If you need structured complex data you should create a subclass of Attribute.
metadata (Dict[str, Any]) – The metadata of the attribute
uid (str) – The identifier of the attribute

Methods:

`copy`()	Create a new attribute that is a copy of the current instance, but with a new identifier
`from_dict`(attribute_dict)	Creates an Attribute from a dict
`get_subclass_for_data_dict`(data_dict)	Return the subclass that corresponds to the class name found in a data dict
`to_brat`()	Return a value compatible with the brat format
`to_spacy`()	Return a value compatible with spaCy

classmethod get_subclass_for_data_dict(data_dict)#

Return the subclass that corresponds to the class name found in a data dict

Parameters: data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
Return type: Optional[Type[Self]]
Returns: subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.

to_brat()[source]#

Return a value compatible with the brat format

Return type: Optional[Any]

to_spacy()[source]#

Return a value compatible with spaCy

Return type: Optional[Any]

copy()[source]#

Create a new attribute that is a copy of the current instance, but with a new identifier

This is used when we want to duplicate an existing attribute onto a different annotation.

Return type: Attribute

classmethod from_dict(attribute_dict)[source]#

Creates an Attribute from a dict

Parameters: attribute_dict (dict) – A dictionary from a serialized Attribute as generated by to_dict()
Return type: Self
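
As a rough illustration of the `copy()` and `from_dict()` semantics described above, the sketch below uses a hypothetical `SimpleAttribute` dataclass. The dict layout produced by its `to_dict()` is an assumption made for the example and may differ from the one medkit actually uses.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import uuid


@dataclass
class SimpleAttribute:
    # Hypothetical stand-in mirroring the documented fields
    label: str
    value: Optional[Any] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_dict(self) -> Dict[str, Any]:
        # Assumed serialization layout, for illustration only
        return {
            "uid": self.uid,
            "label": self.label,
            "value": self.value,
            "metadata": self.metadata,
        }

    @classmethod
    def from_dict(cls, attribute_dict: Dict[str, Any]) -> "SimpleAttribute":
        # Rebuild an attribute from the output of to_dict()
        return cls(
            label=attribute_dict["label"],
            value=attribute_dict["value"],
            metadata=attribute_dict["metadata"],
            uid=attribute_dict["uid"],
        )

    def copy(self) -> "SimpleAttribute":
        # Same content, but a fresh identifier is generated
        return SimpleAttribute(
            label=self.label,
            value=self.value,
            metadata=dict(self.metadata),
        )
```

A round trip through `to_dict()` and `from_dict()` preserves the identifier, whereas `copy()` deliberately does not, so the copy can be attached to a different annotation.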

class AttributeContainer(owner_id)[source]#

Manage a list of attributes attached to another data structure. For example, it may be a document or an annotation.

This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.

The attributes will be stored in a Store, which can rely on a simple dict or something more complicated like a database.

This global store may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.

Methods:

`add`(attr)	Attach an attribute to the annotation.
`get`(*[, label])	Return a list of the attributes of the annotation, optionally filtering by label.
`get_by_id`(uid)	Return the attribute corresponding to a specific identifier.

get(*, label=None)[source]#

Return a list of the attributes of the annotation, optionally filtering by label.

Parameters: label (Optional[str]) – Label to use to filter attributes.
Return type: List[Attribute]

add(attr)[source]#

Attach an attribute to the annotation.

Parameters: attr (Attribute) – Attribute to add.
Raises: ValueError – If the attribute is already attached to the annotation (based on attr.uid).

get_by_id(uid)[source]#

Return the attribute corresponding to a specific identifier.

Parameters: uid (str) – Identifier of the attribute to return.
Return type: Attribute

class Collection(*, text_docs=None, audio_docs=None)[source]#

Collection of documents of any modality (text, audio).

This class groups together a set of documents representing a common unit (for instance a patient), even if they don’t belong to the same modality.

This class is still a work-in-progress. In the future it should be possible to attach additional information to a Collection.

Parameters

text_docs (Optional[List[TextDocument]]) – List of text documents
audio_docs (Optional[List[AudioDocument]]) – List of audio documents

Attributes:

all_docs

List of all the documents belonging to the collection, whatever their modality

property all_docs: List[medkit.core.document.Document]#

List of all the documents belonging to the collection, whatever their modality

Return type: List[Document]

class InputConverter[source]#

Abstract class for converting external documents to medkit documents

class OutputConverter[source]#

Abstract class for converting medkit documents to an external format

class IdentifiableDataItem(*args, **kwargs)[source]#

class IdentifiableDataItemWithAttrs(*args, **kwargs)[source]#

class DocPipeline(pipeline, labels_by_input_key=None, uid=None)[source]#

Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.

Initialize the pipeline

Parameters

pipeline (Pipeline) – Pipeline to execute on documents. Annotations given to pipeline (corresponding to its input_keys) will be retrieved from documents, according to labels_by_input_key. Annotations returned by pipeline (corresponding to its output_keys) will be added to documents.
labels_by_input_key (Optional[Dict[str, List[str]]]) –
Optional labels of existing annotations that should be retrieved from documents and passed to the pipeline as input. One list of labels per input key.

When labels_by_input_key is not provided, it is assumed that the pipeline just expects the document raw segments as input.

For the use case where the documents contain pre-existing sentence segments labelled as “SENTENCE” that we want to pass to the “sentences” input key of the pipeline:
```
>>> doc_pipeline = DocPipeline(
...     pipeline,
...     labels_by_input_key={"sentences": ["SENTENCE"]},
... )
```
Because the values of labels_by_input_key are lists (one per input), it is possible to use annotations with different labels for the same input key.
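
To make the label-to-input-key mapping concrete, here is a minimal sketch in plain Python. The `gather_inputs` helper and the `(label, text)` pairs standing in for annotations are hypothetical; the sketch only illustrates the selection logic described above and is not how DocPipeline is implemented.

```python
from typing import Dict, List, Tuple


def gather_inputs(
    anns: List[Tuple[str, str]],
    labels_by_input_key: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    # anns: (label, text) pairs standing in for a document's annotations.
    # For each input key, keep the annotations whose label is listed.
    return {
        key: [text for label, text in anns if label in labels]
        for key, labels in labels_by_input_key.items()
    }


doc_anns = [("SENTENCE", "No fever."), ("TITLE", "History"), ("ENTITY", "fever")]
inputs = gather_inputs(doc_anns, {"sentences": ["SENTENCE", "TITLE"]})
# inputs == {"sentences": ["No fever.", "History"]}
```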

Methods:

`run`(docs)	Run the pipeline on a list of documents, adding the output annotations to each document
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

run(docs)[source]#

Run the pipeline on a list of documents, adding the output annotations to each document

Parameters: docs (List[Document[~AnnotationType]]) – The documents on which to run the pipeline. Labels to input keys association will be used to retrieve existing annotations from each document, and all output annotations will also be added to each corresponding document.
Return type: None

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type: OperationDescription

class Document(*args, **kwargs)[source]#

Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).

Documents can contain Annotation objects.

Variables

uid (str) – Unique identifier of the document
anns (medkit.core.annotation_container.AnnotationContainer[medkit.core.annotation.AnnotationType]) – Annotations of the document, stored in an AnnotationContainer for easier access (can be subclassed to add modality-specific features).
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the document, stored in an AttributeContainer for easier access
raw_segment (medkit.core.annotation.AnnotationType) – Auto-generated segment containing the full unprocessed document.

generate_deterministic_id(reference_id)[source]#

Generate a deterministic UUID based on reference_id. The generated UUID will be the same if the reference_id is the same.

Parameters: reference_id (str) – A string representation of a UID
Return type: UUID
Returns: uuid.UUID – The UUID object
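
One way to obtain this kind of determinism with the standard library is a name-based UUID. The sketch below is an assumption made for illustration: the function name `deterministic_id_sketch` is hypothetical, and medkit may well derive its identifiers differently.

```python
import uuid


def deterministic_id_sketch(reference_id: str) -> uuid.UUID:
    # Name-based UUID (version 5): the same namespace and the same
    # name always produce the same UUID.
    return uuid.uuid5(uuid.NAMESPACE_OID, reference_id)


first = deterministic_id_sketch("doc-42")
second = deterministic_id_sketch("doc-42")
other = deterministic_id_sketch("doc-43")
```

Here `first` and `second` are equal, while `other` differs, which is the property the function above documents.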

class DocOperation(uid=None, name=None, **kwargs)[source]#

Abstract operation directly executed on text documents. It uses a list of documents as input for running the operation and creates annotations that are directly appended to these documents.

Common initialization for all annotators:

assigning identifier to operation
storing class name, name and config in description

Parameters

uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)

Attributes:

description

Contains all the operation init parameters.

Methods:

set_prov_tracer(prov_tracer)

Enable provenance tracing.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)#

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class Operation(uid=None, name=None, **kwargs)[source]#

Abstract class for all annotator modules

Common initialization for all annotators:

assigning identifier to operation
storing class name, name and config in description

Parameters

uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation

Examples

In the __init__ function of your annotator, use:

>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
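
The pattern above can be tried out with a small stand-in base class. `BaseOperationSketch` and `Uppercaser` are hypothetical names, and removing `__class__` from the captured arguments is a defensive assumption of this sketch (a method that uses zero-argument `super()` can expose that name through `locals()`).

```python
import uuid


class BaseOperationSketch:
    """Hypothetical base class recording init parameters, for illustration."""

    def __init__(self, uid=None, name=None, **kwargs):
        self.uid = uid if uid is not None else str(uuid.uuid4())
        # The name defaults to the class name, as documented above
        self.name = name if name is not None else type(self).__name__
        # Remaining keyword arguments describe the operation configuration
        self.config = kwargs


class Uppercaser(BaseOperationSketch):
    def __init__(self, output_label, uid=None, name=None):
        # Capture the init arguments before any other local is defined
        init_args = dict(locals())
        init_args.pop("self")
        init_args.pop("__class__", None)  # defensive, see lead-in
        super().__init__(**init_args)
        self.output_label = output_label


op = Uppercaser(output_label="UPPER")
```

Calling `locals()` on the first line of `__init__` matters: any variable defined before the call would otherwise end up in the recorded configuration.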

Methods:

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

set_prov_tracer(prov_tracer)[source]#

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

property description: medkit.core.operation_desc.OperationDescription#

Contains all the operation init parameters.

Return type: OperationDescription

class OperationDescription(uid, name, class_name=None, config=<factory>)[source]#

Description of a specific instance of an operation

Parameters

uid (str) – The unique identifier of the instance described
name (str) – The name of the operation. Can be the same as class_name or something more specific, for operations with a behavior that can be customized (for instance a rule-based entity matcher with user-provided rules, or a model-based entity matcher with a user-provided model)
class_name (Optional[str]) – The name of the class of the operation
config (Dict[str, Any]) – The specific configuration of the instance

class Pipeline(steps, input_keys, output_keys, name=None, uid=None)[source]#

Graph of processing operations

A pipeline is made of pipeline steps, connecting together different processing operations by the use of input/output keys. Each operation can be seen as a node and the keys as its edges. Two operations can be chained by using the same string as an output key for the first operation and as an input key for the second.

Steps must be added in the order of execution; there is no dependency detection mechanism.

Initialize the pipeline

Parameters

steps (List[PipelineStep]) –
List of pipeline steps

Steps will be executed in the order in which they were added, so make sure to add first the steps generating data used by other steps.
input_keys (List[str]) – List of keys corresponding to the inputs passed to run()
output_keys (List[str]) – List of keys corresponding to the outputs returned by run()
name (Optional[str]) – Name describing the pipeline (defaults to the class name)
uid (Optional[str]) – Identifier of the pipeline

Methods:

run(*all_input_data)

Run the pipeline.

run(*all_input_data)[source]#

Run the pipeline.

Parameters

*all_input_data (List[Any]) –

Input data expected by the pipeline, must be of same length as the pipeline input_keys.

For each input key, the corresponding input data must be a list of items that can be of any type.

Return type

Union[None, List[Any], Tuple[List[Any], …]]

Returns

Union[None, List[Any], Tuple[List[Any], …]] – All output data returned by the pipeline, will be of same length as the pipeline output_keys.

For each output key, the corresponding output will be a list of items that can be of any type.

If the pipeline has only one output key, then the corresponding output will be directly returned, not wrapped in a tuple. If the pipeline doesn’t have any output key, nothing (i.e. None) will be returned.
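
The key-based chaining and the return conventions described above can be sketched in a few lines of plain Python. `Step` and `run_steps` are hypothetical stand-ins written for this example, with ordinary callables in place of operations; they are not medkit's Pipeline or PipelineStep classes.

```python
from typing import Any, Callable, Dict, List, NamedTuple, Sequence


class Step(NamedTuple):
    # Hypothetical stand-in for a pipeline step: a callable plus its keys
    func: Callable[..., Any]
    input_keys: List[str]
    output_keys: List[str]


def run_steps(
    steps: Sequence[Step],
    input_keys: List[str],
    output_keys: List[str],
    *all_input_data: List[Any],
) -> Any:
    if len(all_input_data) != len(input_keys):
        raise ValueError("Expected one input list per input key")
    data_by_key: Dict[str, List[Any]] = dict(zip(input_keys, all_input_data))
    for step in steps:
        # A step can only read keys that already exist: order matters
        args = [data_by_key[key] for key in step.input_keys]
        result = step.func(*args)
        if not step.output_keys:
            continue
        if len(step.output_keys) == 1:
            result = (result,)
        for key, value in zip(step.output_keys, result):
            data_by_key[key] = value
    outputs = tuple(data_by_key[key] for key in output_keys)
    if not outputs:
        return None
    # A single output is returned directly, not wrapped in a tuple
    return outputs[0] if len(outputs) == 1 else outputs


steps = [
    Step(lambda texts: [t.split() for t in texts], ["raw"], ["tokens"]),
    Step(lambda tokens: [len(t) for t in tokens], ["tokens"], ["counts"]),
]
counts = run_steps(steps, ["raw"], ["counts"], ["a b c", "d e"])
# counts == [3, 2]
```

Reversing the two steps makes the second callable look up the "tokens" key before anything has produced it, which fails with a `KeyError` in this sketch. That is the practical consequence of having no dependency detection.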

class PipelineStep(operation, input_keys, output_keys, aggregate_input_keys=False)[source]#

Pipeline item describing how a processing operation is connected to other operations

Parameters

operation (medkit.core.pipeline.PipelineCompatibleOperation) – The operation to use at that step
input_keys (List[str]) – For each input of operation, the key to use to retrieve the corresponding annotations (either retrieved from a document or generated by an earlier pipeline step)
output_keys (List[str]) – For each output of operation, the key used to pass output annotations to the next Pipeline step. Can be empty if operation doesn’t return new annotations.
aggregate_input_keys (bool) – If True, all the annotations from multiple input keys are aggregated in a single list. Defaults to False

class PipelineCompatibleOperation(*args, **kwargs)[source]#

Methods:

run(*all_input_data)

Parameter all_input_data: one or several lists of data items to process

run(*all_input_data)[source]#

Parameters: all_input_data (List[Any]) – One or several lists of data items to process (according to the number of inputs the operation needs)
Return type: Union[None, List[Any], Tuple[List[Any], …]]
Returns: Union[None, List[Any], Tuple[List[Any], …]] – Tuple of lists of all new data items created by the operation. Can be None if the operation does not create any new data items but rather modifies existing items in-place (for instance by adding attributes to existing annotations). If there is only one list of created data items, it is possible to return that list directly without wrapping it in a tuple.

class DescribableOperation(*args, **kwargs)[source]#

class ProvCompatibleOperation(*args, **kwargs)[source]#

class ProvTracer(store=None, _graph=None)[source]#

Provenance tracing component.

ProvTracer is intended to gather provenance information about how all data is generated by medkit. For each data item (for instance an annotation or an attribute), ProvTracer can tell the operation that created it, the data items that were used to create it, and reciprocally, the data items that were derived from it (cf. Prov).

Provenance-compatible operations should inform the provenance tracer of each data item they create through the add_prov() method.

Users wanting to gather provenance information should instantiate one unique ProvTracer object and provide it to all operations involved in their data processing flow. Once all operations have been executed, they may then retrieve provenance info for specific data items through get_prov(), or for all items with get_provs().

Composite operations relying on inner operations (such as pipelines) shouldn’t call the add_prov() method. Instead, they should instantiate their own internal ProvTracer and provide it to the operations they rely on, then use add_prov_from_sub_tracer() to integrate information from this internal sub-provenance tracer into the main provenance tracer that was provided to them.

This will build sub-provenance information, that can be retrieved later through get_sub_prov_tracer() or get_sub_prov_tracers(). The inner operations of a composite operation can themselves be composite operations, leading to a tree-like structure of nested provenance tracers.

Parameters: store (Optional[ProvStore]) – Store that will contain all traced data items.
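
The bookkeeping described above (which operation created an item, which items it was created from, and which items were later derived from it) can be sketched as follows. `ProvRecord` and `TinyTracer` are hypothetical names; the sketch tracks plain string identifiers and ignores sub-provenance tracers, so it is only a simplified picture of the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProvRecord:
    # Hypothetical record holding identifiers only, for brevity
    data_item_id: str
    op_name: Optional[str]
    source_ids: List[str] = field(default_factory=list)
    derived_ids: List[str] = field(default_factory=list)


class TinyTracer:
    """Illustrative tracer storing one record per data item."""

    def __init__(self) -> None:
        self._records: Dict[str, ProvRecord] = {}

    def add_prov(self, data_item_id: str, op_name: str, source_ids: List[str]) -> None:
        record = self._records.setdefault(data_item_id, ProvRecord(data_item_id, None))
        record.op_name = op_name
        record.source_ids = list(source_ids)
        for source_id in source_ids:
            # Sources never seen before get a stub record with no operation
            source = self._records.setdefault(source_id, ProvRecord(source_id, None))
            source.derived_ids.append(data_item_id)

    def has_prov(self, data_item_id: str) -> bool:
        return data_item_id in self._records

    def get_prov(self, data_item_id: str) -> ProvRecord:
        return self._records[data_item_id]


tracer = TinyTracer()
tracer.add_prov("sentence-1", "SentenceTokenizer", ["raw-text"])
tracer.add_prov("entity-1", "EntityMatcher", ["sentence-1"])
```

After these two calls, the record for "sentence-1" lists "raw-text" as its source and "entity-1" as a derived item, so the graph can be walked in both directions.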

Methods:

`add_prov`(data_item, op_desc, source_data_items)	Append provenance information about a specific data item.
`add_prov_from_sub_tracer`(data_items, ...)	Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.
`get_prov`(data_item_id)	Return provenance information about a specific data item.
`get_provs`()	Return all provenance information about all data items known to the tracer.
`get_sub_prov_tracer`(operation_id)	Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.
`get_sub_prov_tracers`()	Return all sub-provenance tracers of the provenance tracer.
`has_prov`(data_item_id)	Check if the provenance tracer has provenance information about a specific data item.
`has_sub_prov_tracer`(operation_id)	Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).

add_prov(data_item, op_desc, source_data_items)[source]#

Append provenance information about a specific data item.

Parameters

data_item (IdentifiableDataItem) – Data item that was created.
op_desc (OperationDescription) – Description of the operation that created the data item.
source_data_items (List[IdentifiableDataItem]) – Data items that were used by the operation to create the data item.

add_prov_from_sub_tracer(data_items, op_desc, sub_tracer)[source]#

Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.

Parameters

data_items (List[IdentifiableDataItem]) – Data items created by the composite operation. Should not include internal intermediate data items, only the output of the operation.
op_desc (OperationDescription) – Description of the composite operation that created the data items.
sub_tracer (ProvTracer) – Internal sub-provenance tracer of the composite operation.

has_prov(data_item_id)[source]#

Check if the provenance tracer has provenance information about a specific data item.

Note

This will return False if we have provenance info about a data item but only in a sub-provenance tracer.

Parameters: data_item_id (str) – Id of the data item.
Return type: bool
Returns: bool – True if there is provenance info that can be retrieved with get_prov().

get_prov(data_item_id)[source]#

Return provenance information about a specific data item.

Parameters: data_item_id (str) – Id of the data item.
Return type: Prov
Returns: Prov – Provenance info about the data item.

get_provs()[source]#

Return all provenance information about all data items known to the tracer.

Note

Nested provenance info from sub-provenance tracers will not be returned.

Return type: List[Prov]
Returns: List[Prov] – Provenance info about all known data items.

has_sub_prov_tracer(operation_id)[source]#

Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).

Note

This will return False if there is a sub-provenance tracer for the operation but that is not a direct child (i.e. that is deeper in the hierarchy).

Parameters: operation_id (str) – Id of the composite operation.
Return type: bool
Returns: bool – True if there is a sub-provenance tracer for the operation.

get_sub_prov_tracer(operation_id)[source]#

Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.

Parameters: operation_id (str) – Id of the composite operation.
Return type: ProvTracer
Returns: ProvTracer – The sub-provenance tracer containing sub-provenance information from the operation.

get_sub_prov_tracers()[source]#

Return all sub-provenance tracers of the provenance tracer.

Note

This will not return sub-provenance tracers that are not direct children of this tracer (i.e. that are deeper in the hierarchy).

Return type: List[ProvTracer]
Returns: List[ProvTracer] – All sub-provenance tracers of this provenance tracer.

class Prov(data_item, op_desc, source_data_items, derived_data_items)[source]#

Provenance information for a specific data item.

Parameters

data_item (medkit.core.data_item.IdentifiableDataItem) – Data item that was created (for instance an annotation or an attribute).
op_desc (Optional[medkit.core.operation_desc.OperationDescription]) – Description of the operation that created the data item.
source_data_items (List[medkit.core.data_item.IdentifiableDataItem]) – Data items that were used by the operation to create the data item.
derived_data_items (List[medkit.core.data_item.IdentifiableDataItem]) – Data items that were created by other operations using this data item.

class Store(*args, **kwargs)[source]#

Store protocol

class GlobalStore[source]#

Global store

Methods:

`del_store`()	Delete the global store object
`get_store`()	Returns the global store object
`init_store`(store)	Initialize the global store for your application

classmethod init_store(store)[source]#

Initialize the global store for your application

Parameters: store (Store) – Store for all the data items
Raises: RuntimeError – If global store is already set

classmethod get_store()[source]#

Returns the global store object

Return type: Store
Returns: Store – the global store

classmethod del_store()[source]#

Delete the global store object
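
The lifecycle of the three class methods can be sketched with a class-level variable. `GlobalStoreSketch` is a hypothetical stand-in that uses a plain dict as the store; falling back to a dict in `get_store()` is an assumption based on the default dict store mentioned for the containers.

```python
from typing import Any, Dict, Optional


class GlobalStoreSketch:
    """Illustrative process-wide store holder, not medkit's class."""

    _store: Optional[Dict[str, Any]] = None

    @classmethod
    def init_store(cls, store: Dict[str, Any]) -> None:
        # Refuse to silently replace a store that is already in use
        if cls._store is not None:
            raise RuntimeError("Global store is already set")
        cls._store = store

    @classmethod
    def get_store(cls) -> Dict[str, Any]:
        if cls._store is None:
            # Assumed fallback: create a default dict store on first access
            cls._store = {}
        return cls._store

    @classmethod
    def del_store(cls) -> None:
        cls._store = None
```

In this sketch, calling `init_store()` after `get_store()` has already created the default store also raises, so a custom store has to be registered before any data item is created.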

class ProvStore(*args, **kwargs)[source]#

Subpackages / Submodules#

`medkit.core.annotation`
`medkit.core.annotation_container`
`medkit.core.attribute`
`medkit.core.attribute_container`
`medkit.core.audio`
`medkit.core.collection`
`medkit.core.conversion`
`medkit.core.data_item`
`medkit.core.dict_conv`
`medkit.core.doc_pipeline`
`medkit.core.document`
`medkit.core.id`
`medkit.core.operation`
`medkit.core.operation_desc`
`medkit.core.pipeline`
`medkit.core.prov_store`
`medkit.core.prov_tracer`
`medkit.core.store`
`medkit.core.text`
`medkit.core.utils`
