medkit.core
APIs
To access these APIs, import them like this:
from medkit.core import <api_to_import>
Classes:
AnnotationContainer | Manage a list of annotations belonging to a document.
Attribute | Medkit attribute, to be added to an annotation
AttributeContainer | Manage a list of attributes attached to another data structure.
Collection | Collection of documents of any modality (text, audio).
DocOperation | Abstract operation directly executed on text documents.
DocPipeline | Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.
Document | Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).
GlobalStore | Global store
Abstract class for converting external document to medkit documents
Operation | Abstract class for all annotator modules
OperationDescription | Description of a specific instance of an operation
Abstract class for converting medkit document to external format
Pipeline | Graph of processing operations
PipelineStep | Pipeline item describing how a processing operation is connected to other
Prov | Provenance information for a specific data item.
ProvTracer | Provenance tracing component.
Store protocol
Functions:
generate_deterministic_id | Generate a deterministic UUID based on reference_id.
- class AnnotationContainer(doc_id)
Manage a list of annotations belonging to a document.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
The annotations will be stored in a Store, which can rely on a simple dict or something more complicated like a database. This global store may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.
Instantiate the annotation container
- Parameters
doc_id (str) – The identifier of the document to which the annotations belong.
Methods:
add(ann) | Attach an annotation to the document.
get(*[, label, key]) | Return a list of the annotations of the document, optionally filtering by label or key.
get_by_id(uid) | Return the annotation corresponding to a specific identifier.
get_ids(*[, label, key]) | Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.
- add(ann)
Attach an annotation to the document.
- Parameters
ann (AnnotationType) – Annotation to add.
- Raises
ValueError – If the annotation is already attached to the document (based on annotation.uid)
- get(*, label=None, key=None)
Return a list of the annotations of the document, optionally filtering by label or key.
- Parameters
label (Optional[str]) – Label to use to filter annotations.
key (Optional[str]) – Key to use to filter annotations.
- Return type
List[AnnotationType]
- get_ids(*, label=None, key=None)
Return an iterator of the identifiers of the annotations of the document, optionally filtering by label or key.
This method is provided to make it easier to implement additional filtering in subclasses.
- Parameters
label (Optional[str]) – Label to use to filter annotations.
key (Optional[str]) – Key to use to filter annotations.
- Return type
Iterator[str]
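To illustrate the list-like behaviour and the label/key filtering described for AnnotationContainer, here is a minimal, self-contained sketch. It does not use medkit itself: ToyAnnotation and ToyAnnotationContainer are hypothetical stand-ins, the store is a plain dict, and the idea that an annotation carries a set of keys is an assumption made only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class ToyAnnotation:
    # Hypothetical stand-in for an annotation; only the fields needed here.
    uid: str
    label: str
    keys: Set[str] = field(default_factory=set)


class ToyAnnotationContainer:
    """List-like container that keeps annotations in a dict keyed by uid."""

    def __init__(self, doc_id: str):
        self.doc_id = doc_id
        self._store: Dict[str, ToyAnnotation] = {}  # plays the role of the store
        self._ann_ids: List[str] = []  # preserves insertion order

    def add(self, ann: ToyAnnotation) -> None:
        if ann.uid in self._store:
            raise ValueError(f"Annotation {ann.uid} is already attached")
        self._store[ann.uid] = ann
        self._ann_ids.append(ann.uid)

    def get_ids(
        self, *, label: Optional[str] = None, key: Optional[str] = None
    ) -> Iterator[str]:
        # All filtering lives here, so a subclass only has to override this.
        for uid in self._ann_ids:
            ann = self._store[uid]
            if label is not None and ann.label != label:
                continue
            if key is not None and key not in ann.keys:
                continue
            yield uid

    def get(
        self, *, label: Optional[str] = None, key: Optional[str] = None
    ) -> List[ToyAnnotation]:
        return [self._store[uid] for uid in self.get_ids(label=label, key=key)]

    def get_by_id(self, uid: str) -> ToyAnnotation:
        return self._store[uid]

    def __len__(self) -> int:
        return len(self._ann_ids)

    def __iter__(self) -> Iterator[ToyAnnotation]:
        return iter(self.get())


anns = ToyAnnotationContainer(doc_id="doc-1")
anns.add(ToyAnnotation(uid="a1", label="SENTENCE"))
anns.add(ToyAnnotation(uid="a2", label="ENTITY", keys={"ner_output"}))
sentence_ids = list(anns.get_ids(label="SENTENCE"))
```

Because get() is built on top of get_ids(), the filtering logic exists in one place only, which is the design motivation given for get_ids().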
- class Attribute(label, value=None, metadata=None, uid=None)
Medkit attribute, to be added to an annotation
- Variables
label (str) – The attribute label
value (Optional[Any]) – The value of the attribute. Should be either simple built-in types (int, float, bool, str) or collections of these types (list, dict, tuple). If you need structured complex data you should create a subclass of Attribute.
metadata (Dict[str, Any]) – The metadata of the attribute
uid (str) – The identifier of the attribute
Methods:
copy() | Create a new attribute that is a copy of the current instance, but with a new identifier
from_dict(attribute_dict) | Creates an Attribute from a dict
get_subclass_for_data_dict(data_dict) | Return the subclass that corresponds to the class name found in a data dict
to_brat() | Return a value compatible with the brat format
to_spacy() | Return a value compatible with spaCy
- classmethod get_subclass_for_data_dict(data_dict)
Return the subclass that corresponds to the class name found in a data dict
- Parameters
data_dict (Dict[str, Any]) – Data dict returned by the to_dict() method of a subclass (or of the base class itself)
- Return type
Optional[Type[Self]]
- Returns
subclass – Subclass that generated data_dict, or None if data_dict corresponds to the base class itself.
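The lookup described for get_subclass_for_data_dict() can be pictured with the following self-contained sketch. It is not medkit's implementation: ToyAttribute, ToyDateAttribute and the "class_name" dict key are hypothetical, and only direct subclasses are searched to keep the example short.

```python
from typing import Any, Dict, Optional, Type


class ToyAttribute:
    """Hypothetical base class; the "class_name" key is an assumption."""

    def __init__(self, label: str, value: Any = None):
        self.label = label
        self.value = value

    def to_dict(self) -> Dict[str, Any]:
        # Record the name of the concrete class so it can be found again later.
        return {
            "class_name": type(self).__name__,
            "label": self.label,
            "value": self.value,
        }

    @classmethod
    def get_subclass_for_data_dict(
        cls, data_dict: Dict[str, Any]
    ) -> Optional[Type["ToyAttribute"]]:
        class_name = data_dict["class_name"]
        if class_name == cls.__name__:
            return None  # the dict was produced by the base class itself
        for subclass in cls.__subclasses__():
            if subclass.__name__ == class_name:
                return subclass
        raise ValueError(f"No subclass named {class_name!r}")


class ToyDateAttribute(ToyAttribute):
    pass


base_result = ToyAttribute.get_subclass_for_data_dict(
    ToyAttribute("negated", True).to_dict()
)
sub_result = ToyAttribute.get_subclass_for_data_dict(
    ToyDateAttribute("date", "2020-01-01").to_dict()
)
```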
- class AttributeContainer(owner_id)
Manage a list of attributes attached to another data structure. For example, it may be a document or an annotation.
This behaves more or less like a list: calling len() and iterating are supported. Additional filtering is available through the get() method.
The attributes will be stored in a Store, which can rely on a simple dict or something more complicated like a database. This global store may be initialized using GlobalStore. Otherwise, a default one (i.e. a dict store) is used.
Methods:
add(attr) | Attach an attribute to the annotation.
get(*[, label]) | Return a list of the attributes of the annotation, optionally filtering by label.
get_by_id(uid) | Return the attribute corresponding to a specific identifier.
- get(*, label=None)
Return a list of the attributes of the annotation, optionally filtering by label.
- Parameters
label (Optional[str]) – Label to use to filter attributes.
- Return type
List[Attribute]
- class Collection(*, text_docs=None, audio_docs=None)
Collection of documents of any modality (text, audio).
This class makes it possible to group together a set of documents representing a common unit (for instance a patient), even if they don’t belong to the same modality.
This class is still a work-in-progress. In the future it should be possible to attach additional information to a Collection.
- Parameters
text_docs (Optional[List[TextDocument]]) – List of text documents
audio_docs (Optional[List[AudioDocument]]) – List of audio documents
Attributes:
all_docs | List of all the documents belonging to the collection, whatever their modality
- property all_docs: List[medkit.core.document.Document]
List of all the documents belonging to the collection, whatever their modality
- Return type
List[Document]
- class DocPipeline(pipeline, labels_by_input_key=None, uid=None)
Wrapper around the Pipeline class that runs a pipeline on a list (or collection) of documents, retrieving input annotations from each document and attaching output annotations back to documents.
Initialize the pipeline
- Parameters
pipeline (Pipeline) – Pipeline to execute on documents. Annotations given to pipeline (corresponding to its input_keys) will be retrieved from documents, according to labels_by_input_key. Annotations returned by pipeline (corresponding to its output_keys) will be added to documents.
labels_by_input_key (Optional[Dict[str, List[str]]]) – Optional labels of existing annotations that should be retrieved from documents and passed to the pipeline as input. One list of labels per input key.
When labels_by_input_key is not provided, it is assumed that the pipeline just expects the document raw segments as input.
For the use case where the documents contain pre-existing sentence segments labelled as “SENTENCE” that we want to pass to the “sentences” input key of the pipeline:
>>> doc_pipeline = DocPipeline(
...     pipeline,
...     labels_by_input_key={"sentences": ["SENTENCE"]},
... )
Because the values of labels_by_input_key are lists (one per input), it is possible to use annotations with different labels for the same input key.
Methods:
run(docs) | Run the pipeline on a list of documents, adding the output annotations to each document
set_prov_tracer(prov_tracer) | Enable provenance tracing.
Attributes:
description | Contains all the operation init parameters.
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- run(docs)
Run the pipeline on a list of documents, adding the output annotations to each document
- Parameters
docs (List[Document[AnnotationType]]) – The documents on which to run the pipeline. Labels to input keys association will be used to retrieve existing annotations from each document, and all output annotations will also be added to each corresponding document.
- Return type
None
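The way labels_by_input_key selects the pipeline inputs for one document can be sketched as follows. This is a simplified illustration rather than DocPipeline's actual code: collect_inputs is a hypothetical helper, and documents are reduced to a raw segment string plus (label, text) pairs.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical annotation representation: (label, text).
ToyAnn = Tuple[str, str]


def collect_inputs(
    raw_segment: str,
    anns: List[ToyAnn],
    input_keys: List[str],
    labels_by_input_key: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Build the pipeline input for one document, one list per input key."""
    if labels_by_input_key is None:
        # No labels given: every input key just receives the raw segment.
        return {key: [raw_segment] for key in input_keys}
    inputs: Dict[str, List[str]] = {}
    for key in input_keys:
        labels = labels_by_input_key[key]
        # Several labels may feed the same input key.
        inputs[key] = [text for label, text in anns if label in labels]
    return inputs


anns = [("SENTENCE", "No fever."), ("TITLE", "History"), ("SENTENCE", "Mild cough.")]
default_inputs = collect_inputs("No fever. Mild cough.", anns, ["full_text"])
sentence_inputs = collect_inputs(
    "No fever. Mild cough.", anns, ["sentences"], {"sentences": ["SENTENCE"]}
)
```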
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- class Document(*args, **kwargs)
Base document protocol that must be implemented by document classes of all modalities (text, audio, etc).
Documents can contain Annotation objects.
- Variables
uid (str) – Unique identifier of the document
anns (medkit.core.annotation_container.AnnotationContainer[medkit.core.annotation.AnnotationType]) – Annotations of the document, stored in an AnnotationContainer for easier access (can be subclassed to add modality-specific features).
attrs (medkit.core.attribute_container.AttributeContainer) – Attributes of the document, stored in an AttributeContainer for easier access
raw_segment (medkit.core.annotation.AnnotationType) – Auto-generated segment containing the full unprocessed document.
- generate_deterministic_id(reference_id)
Generate a deterministic UUID based on reference_id. The generated UUID will be the same if the reference_id is the same.
- Parameters
reference_id (str) – A string representation of a UID
- Return type
UUID
- Returns
uuid.UUID – The UUID object
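One common way to obtain a UUID that depends only on a reference string is a name-based UUID from the standard library. The sketch below uses uuid.uuid5 for that purpose; toy_deterministic_id is a hypothetical name, and this is not necessarily how generate_deterministic_id() is implemented.

```python
import uuid


def toy_deterministic_id(reference_id: str) -> uuid.UUID:
    # uuid5 hashes a namespace together with a name, so equal inputs
    # always produce equal UUIDs.
    return uuid.uuid5(uuid.NAMESPACE_OID, reference_id)


first = toy_deterministic_id("doc-42")
second = toy_deterministic_id("doc-42")
other = toy_deterministic_id("doc-43")
```

Here first and second are equal while other differs, which is the property the function documents.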
- class DocOperation(uid=None, name=None, **kwargs)
Abstract operation directly executed on text documents. It uses a list of documents as input for running the operation and creates annotations that are directly appended to these documents.
Common initialization for all annotators: assigning an identifier to the operation, and storing the class name, name and config in the description.
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
Attributes:
description | Contains all the operation init parameters.
Methods:
set_prov_tracer(prov_tracer) | Enable provenance tracing.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- class Operation(uid=None, name=None, **kwargs)
Abstract class for all annotator modules
Common initialization for all annotators: assigning an identifier to the operation, and storing the class name, name and config in the description.
- Parameters
uid (str) – Operation identifier
name – Operation name (defaults to class name)
kwargs – All other arguments of the child init useful to describe the operation
Examples
In the __init__ function of your annotator, use:
>>> init_args = locals()
>>> init_args.pop('self')
>>> super().__init__(**init_args)
Methods:
set_prov_tracer(prov_tracer) | Enable provenance tracing.
Attributes:
description | Contains all the operation init parameters.
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
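The locals()-based initialization idiom shown in the Operation examples can be tried out with the self-contained sketch below. ToyOperation and ToyMatcher are hypothetical classes, and the description is a plain dict here rather than an OperationDescription. One detail worth knowing: because the child __init__ uses zero-argument super(), locals() also exposes a "__class__" entry, which this sketch discards in the base class.

```python
import uuid
from typing import Any, Dict, Optional


class ToyOperation:
    """Hypothetical base class mimicking the initialization described above."""

    def __init__(
        self, uid: Optional[str] = None, name: Optional[str] = None, **kwargs: Any
    ):
        # Drop the "__class__" entry that locals() picks up in child classes.
        kwargs.pop("__class__", None)
        self.uid = uid if uid is not None else str(uuid.uuid4())
        self.description: Dict[str, Any] = {
            "uid": self.uid,
            "name": name if name is not None else type(self).__name__,
            "class_name": type(self).__name__,
            "config": kwargs,  # every remaining init argument of the child
        }


class ToyMatcher(ToyOperation):
    def __init__(
        self, pattern: str, case_sensitive: bool = False, uid: Optional[str] = None
    ):
        # Call locals() first, before any other local variable is defined.
        init_args = locals()
        init_args.pop("self")
        super().__init__(**init_args)
        self.pattern = pattern
        self.case_sensitive = case_sensitive


matcher = ToyMatcher(pattern="fever", uid="op-1")
```

The benefit of the idiom is that new init parameters of the child class end up in the stored config without having to be listed a second time.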
- class OperationDescription(uid, name, class_name=None, config=<factory>)
Description of a specific instance of an operation
- Parameters
uid (str) – The unique identifier of the instance described
name (str) – The name of the operation. Can be the same as class_name or something more specific, for operations with a behavior that can be customized (for instance a rule-based entity matcher with user-provided rules, or a model-based entity matcher with a user-provided model)
class_name (Optional[str]) – The name of the class of the operation
config (Dict[str, Any]) – The specific configuration of the instance
- class Pipeline(steps, input_keys, output_keys, name=None, uid=None)
Graph of processing operations
A pipeline is made of pipeline steps, connecting together different processing operations by the use of input/output keys. Each operation can be seen as a node and the keys as its edges. Two operations can be chained by using the same string as an output key for the first operation and as an input key for the second.
Steps must be added in the order of execution; there is no dependency detection mechanism.
Initialize the pipeline
- Parameters
steps (List[PipelineStep]) – List of pipeline steps
Steps will be executed in the order in which they were added, so make sure to add first the steps generating data used by other steps.
input_keys (List[str]) – List of keys corresponding to the inputs passed to run()
output_keys (List[str]) – List of keys corresponding to the outputs returned by run()
name (Optional[str]) – Name describing the pipeline (defaults to the class name)
uid (Optional[str]) – Identifier of the pipeline
Methods:
run(*all_input_data) | Run the pipeline.
- run(*all_input_data)
Run the pipeline.
- Parameters
*all_input_data (List[Any]) – Input data expected by the pipeline, must be of same length as the pipeline input_keys.
For each input key, the corresponding input data must be a list of items that can be of any type.
- Return type
Union[None, List[Any], Tuple[List[Any], …]]
- Returns
Union[None, List[Any], Tuple[List[Any], …]] – All output data returned by the pipeline, will be of same length as the pipeline output_keys.
For each output key, the corresponding output will be a list of items that can be of any type.
If the pipeline has only one output key, then the corresponding output will be directly returned, not wrapped in a tuple. If the pipeline doesn’t have any output key, nothing (i.e. None) will be returned.
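The key-based chaining and the return convention of run() can be illustrated with a small stand-alone sketch. ToyStep and run_toy_pipeline are hypothetical and far simpler than the real Pipeline; the sketch also assumes that a step with a single output key returns a bare list.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class ToyStep(NamedTuple):
    # Hypothetical step: a callable plus the keys it reads from and writes to.
    func: Callable[..., Any]
    input_keys: List[str]
    output_keys: List[str]


def run_toy_pipeline(steps, input_keys, output_keys, *all_input_data):
    if len(all_input_data) != len(input_keys):
        raise ValueError("Expected one list of items per input key")
    data_by_key: Dict[str, List[Any]] = dict(zip(input_keys, all_input_data))
    for step in steps:  # executed strictly in the order given
        result = step.func(*(data_by_key[key] for key in step.input_keys))
        if result is None:
            continue  # the step only modified existing items
        if len(step.output_keys) == 1:
            result = (result,)  # assumption: single output is a bare list
        data_by_key.update(zip(step.output_keys, result))
    outputs = tuple(data_by_key[key] for key in output_keys)
    if not outputs:
        return None
    return outputs[0] if len(outputs) == 1 else outputs


def split_sentences(texts):
    return [s.strip() for text in texts for s in text.split(".") if s.strip()]


def count_words(sentences):
    return [len(s.split()) for s in sentences]


steps = [
    ToyStep(split_sentences, ["text"], ["sentences"]),
    ToyStep(count_words, ["sentences"], ["counts"]),
]
text = ["No fever. Mild dry cough."]
single = run_toy_pipeline(steps, ["text"], ["counts"], text)
both = run_toy_pipeline(steps, ["text"], ["sentences", "counts"], text)
nothing = run_toy_pipeline(steps, ["text"], [], text)
```

The "sentences" key links the two steps: it is an output key of the first and an input key of the second. Swapping the steps would raise a KeyError, since nothing reorders them.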
- class PipelineStep(operation, input_keys, output_keys, aggregate_input_keys=False)
Pipeline item describing how a processing operation is connected to other
- Parameters
operation (medkit.core.pipeline.PipelineCompatibleOperation) – The operation to use at that step
input_keys (List[str]) – For each input of operation, the key to use to retrieve the corresponding annotations (either retrieved from a document or generated by an earlier pipeline step)
output_keys (List[str]) – For each output of operation, the key used to pass output annotations to the next Pipeline step. Can be empty if operation doesn’t return new annotations.
aggregate_input_keys (bool) – If True, all the annotations from multiple input keys are aggregated in a single list. Defaults to False
- class PipelineCompatibleOperation(*args, **kwargs)
Methods:
run(*all_input_data) | One or several lists of data items to process
- run(*all_input_data)
- Parameters
all_input_data (List[Any]) – One or several lists of data items to process (according to the number of inputs the operation needs)
- Return type
Union[None, List[Any], Tuple[List[Any], …]]
- Returns
Union[None, List[Any], Tuple[List[Any], …]] – Tuple of lists of all new data items created by the operation. Can be None if the operation does not create any new data items but rather modifies existing items in-place (for instance by adding attributes to existing annotations). If there is only one list of created data items, it is possible to return that list directly without wrapping it in a tuple.
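As an illustration of the two return styles described for run(), the sketch below defines a structural protocol and two toy operations. ToyPipelineCompatible, Uppercaser and Tagger are hypothetical names, written with typing.Protocol from the standard library.

```python
from typing import Any, List, Protocol, Tuple, Union, runtime_checkable

RunResult = Union[None, List[Any], Tuple[List[Any], ...]]


@runtime_checkable
class ToyPipelineCompatible(Protocol):
    def run(self, *all_input_data: List[Any]) -> RunResult: ...


class Uppercaser:
    """Creates new items: returns a single list, not wrapped in a tuple."""

    def run(self, texts: List[str]) -> List[str]:
        return [t.upper() for t in texts]


class Tagger:
    """Modifies existing items in place, so it returns None."""

    def run(self, records: List[dict]) -> None:
        for record in records:
            record["checked"] = True


records = [{"text": "fever"}]
created = Uppercaser().run(["fever"])
in_place_result = Tagger().run(records)
```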
- class ProvTracer(store=None, _graph=None)
Provenance tracing component.
ProvTracer is intended to gather provenance information about all data generated by medkit. For each data item (for instance an annotation or an attribute), ProvTracer can tell the operation that created it, the data items that were used to create it, and reciprocally, the data items that were derived from it (cf. Prov).
Provenance-compatible operations should inform the provenance tracer of each data item that they create through the add_prov() method.
Users wanting to gather provenance information should instantiate one unique ProvTracer object and provide it to all operations involved in their data processing flow. Once all operations have been executed, they may then retrieve provenance info for specific data items through get_prov(), or for all items with get_provs().
Composite operations relying on inner operations (such as pipelines) shouldn’t call the add_prov() method. Instead, they should instantiate their own internal ProvTracer and provide it to the operations they rely on, then use add_prov_from_sub_tracer() to integrate information from this internal sub-provenance tracer into the main provenance tracer that was provided to them.
This will build sub-provenance information, which can be retrieved later through get_sub_prov_tracer() or get_sub_prov_tracers(). The inner operations of a composite operation can themselves be composite operations, leading to a tree-like structure of nested provenance tracers.
- Parameters
store (Optional[ProvStore]) – Store that will contain all traced data items.
Methods:
add_prov(data_item, op_desc, source_data_items) | Append provenance information about a specific data item.
add_prov_from_sub_tracer(data_items, ...) | Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.
get_prov(data_item_id) | Return provenance information about a specific data item.
get_provs() | Return all provenance information about all data items known to the tracer.
get_sub_prov_tracer(operation_id) | Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.
get_sub_prov_tracers() | Return all sub-provenance tracers of the provenance tracer.
has_prov(data_item_id) | Check if the provenance tracer has provenance information about a specific data item.
has_sub_prov_tracer(operation_id) | Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).
- add_prov(data_item, op_desc, source_data_items)
Append provenance information about a specific data item.
- Parameters
data_item (IdentifiableDataItem) – Data item that was created.
op_desc (OperationDescription) – Description of the operation that created the data item.
source_data_items (List[IdentifiableDataItem]) – Data items that were used by the operation to create the data item.
- add_prov_from_sub_tracer(data_items, op_desc, sub_tracer)
Append provenance information about data items created by a composite operation relying on inner operations (such as a pipeline) having its own internal sub-provenance tracer.
- Parameters
data_items (List[IdentifiableDataItem]) – Data items created by the composite operation. Should not include internal intermediate data items, only the output of the operation.
op_desc (OperationDescription) – Description of the composite operation that created the data items.
sub_tracer (ProvTracer) – Internal sub-provenance tracer of the composite operation.
- has_prov(data_item_id)
Check if the provenance tracer has provenance information about a specific data item.
Note
This will return False if we have provenance info about a data item but only in a sub-provenance tracer.
- Parameters
data_item_id (str) – Id of the data item.
- Return type
bool
- Returns
bool – True if there is provenance info that can be retrieved with get_prov().
- get_prov(data_item_id)
Return provenance information about a specific data item.
- Parameters
data_item_id (str) – Id of the data item.
- Return type
Prov
- Returns
Prov – Provenance info about the data item.
- get_provs()
Return all provenance information about all data items known to the tracer.
Note
Nested provenance info from sub-provenance tracers will not be returned.
- Return type
List[Prov]
- Returns
List[Prov] – Provenance info about all known data items.
- has_sub_prov_tracer(operation_id)
Check if the provenance tracer has a sub-provenance tracer for a specific composite operation (such as a pipeline).
Note
This will return False if there is a sub-provenance tracer for the operation but that is not a direct child (i.e. that is deeper in the hierarchy).
- Parameters
operation_id (str) – Id of the composite operation.
- Return type
bool
- Returns
bool – True if there is a sub-provenance tracer for the operation.
- get_sub_prov_tracer(operation_id)
Return a sub-provenance tracer containing sub-provenance information from a specific composite operation.
- Parameters
operation_id (str) – Id of the composite operation.
- Return type
ProvTracer
- Returns
ProvTracer – The sub-provenance tracer containing sub-provenance information from the operation.
- get_sub_prov_tracers()
Return all sub-provenance tracers of the provenance tracer.
Note
This will not return sub-provenance tracers that are not direct children of this tracer (i.e. that are deeper in the hierarchy).
- Return type
List[ProvTracer]
- Returns
List[ProvTracer] – All sub-provenance tracers of this provenance tracer.
- class Prov(data_item, op_desc, source_data_items, derived_data_items)
Provenance information for a specific data item.
- Parameters
data_item (medkit.core.data_item.IdentifiableDataItem) – Data item that was created (for instance an annotation or an attribute).
op_desc (Optional[medkit.core.operation_desc.OperationDescription]) – Description of the operation that created the data item.
source_data_items (List[medkit.core.data_item.IdentifiableDataItem]) – Data items that were used by the operation to create the data item.
derived_data_items (List[medkit.core.data_item.IdentifiableDataItem]) – Data items that were created by other operations using this data item.
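The relationship between source and derived data items can be made concrete with a minimal stand-alone tracer. ToyProv and ToyProvTracer are hypothetical simplifications: they work on identifiers and operation names instead of data items and operation descriptions, they have no sub-tracers, and source items unknown to the tracer are simply ignored when updating derived items.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyProv:
    data_item_id: str
    op_name: str
    source_ids: List[str]
    derived_ids: List[str] = field(default_factory=list)


class ToyProvTracer:
    def __init__(self) -> None:
        self._provs: Dict[str, ToyProv] = {}

    def add_prov(self, data_item_id: str, op_name: str, source_ids: List[str]) -> None:
        if data_item_id in self._provs:
            raise ValueError(f"Provenance already recorded for {data_item_id}")
        self._provs[data_item_id] = ToyProv(data_item_id, op_name, list(source_ids))
        # Record the reverse link on every source item the tracer knows about.
        for source_id in source_ids:
            if source_id in self._provs:
                self._provs[source_id].derived_ids.append(data_item_id)

    def has_prov(self, data_item_id: str) -> bool:
        return data_item_id in self._provs

    def get_prov(self, data_item_id: str) -> ToyProv:
        return self._provs[data_item_id]

    def get_provs(self) -> List[ToyProv]:
        return list(self._provs.values())


tracer = ToyProvTracer()
tracer.add_prov("sentence-1", "SentenceTokenizer", [])
tracer.add_prov("entity-1", "EntityMatcher", ["sentence-1"])
entity_prov = tracer.get_prov("entity-1")
sentence_prov = tracer.get_prov("sentence-1")
```

After the two calls, the entity lists the sentence as its source and the sentence lists the entity as derived from it, which is the two-way view that Prov exposes.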
- class GlobalStore
Global store
Methods:
Delete the global store object
Returns the global store object
init_store(store) | Initialize the global store for your application
- classmethod init_store(store)
Initialize the global store for your application
- Parameters
store (Store) – Store for all the data items
- Raises
RuntimeError – If global store is already set
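The "initialize once, otherwise fall back to a default dict store" behaviour can be sketched as a class holding a single shared store. ToyGlobalStore is hypothetical, get_store is an assumed name for the accessor, and a plain dict stands in for a Store.

```python
from typing import Any, Dict, Optional


class ToyGlobalStore:
    _store: Optional[Dict[str, Any]] = None

    @classmethod
    def init_store(cls, store: Dict[str, Any]) -> None:
        if cls._store is not None:
            raise RuntimeError("Global store is already set")
        cls._store = store

    @classmethod
    def get_store(cls) -> Dict[str, Any]:
        # Hypothetical accessor: lazily create a default dict store.
        if cls._store is None:
            cls._store = {}
        return cls._store


custom_store: Dict[str, Any] = {}
ToyGlobalStore.init_store(custom_store)
uses_custom_store = ToyGlobalStore.get_store() is custom_store

try:
    ToyGlobalStore.init_store({})
except RuntimeError:
    second_init_rejected = True
else:
    second_init_rejected = False
```

Under this design the application has to call init_store() before anything reads the store, because the lazily created default would otherwise make a later init_store() fail.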