medkit.text.segmentation.sentence_tokenizer

Classes:

SentenceTokenizer([output_label, ...])

Sentence segmentation annotator based on end punctuation rules

class SentenceTokenizer(output_label='sentence', punct_chars=('.', ';', '?', '!'), keep_punct=False, split_on_newlines=True, attrs_to_copy=None, uid=None)

Sentence segmentation annotator based on end punctuation rules

Instantiate the sentence tokenizer

Parameters
  • output_label (str, Optional) – The output label of the created annotations.

  • punct_chars (Tuple[str], Optional) – The characters treated as end-of-sentence punctuation.

  • keep_punct (bool, Optional) – If True, end punctuation is kept in the detected sentence. If False, it is excluded from the sentence text.

  • split_on_newlines (bool) – Whether newline characters are treated as sentence boundaries.

  • attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the input segment to the derived segment. Useful, for example, for propagating a section name.

  • uid (str, Optional) – Identifier of the tokenizer.
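
The interaction between punct_chars, keep_punct and split_on_newlines can be illustrated with a small standalone sketch. This is not medkit's implementation: split_sentences and _emit are hypothetical helpers written only for illustration, and the real tokenizer may handle edge cases (whitespace, runs of punctuation, empty sentences) differently.

```python
from typing import List, Tuple


def _emit(text: str, start: int, end: int, out: List[Tuple[int, int, str]]) -> None:
    """Append text[start:end] to out, trimmed of whitespace, if non-empty."""
    chunk = text[start:end]
    stripped = chunk.strip()
    if stripped:
        # Shift the start offset past any leading whitespace that was trimmed
        offset = start + (len(chunk) - len(chunk.lstrip()))
        out.append((offset, offset + len(stripped), stripped))


def split_sentences(
    text: str,
    punct_chars: Tuple[str, ...] = (".", ";", "?", "!"),
    keep_punct: bool = False,
    split_on_newlines: bool = True,
) -> List[Tuple[int, int, str]]:
    """Hypothetical helper: return (start, end, sentence) triples."""
    boundaries = set(punct_chars)
    sentences: List[Tuple[int, int, str]] = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in boundaries:
            # Consume a whole run of end punctuation such as "?!" or "..."
            j = i
            while j < n and text[j] in boundaries:
                j += 1
            # With keep_punct the run belongs to the sentence, otherwise it is dropped
            _emit(text, start, j if keep_punct else i, sentences)
            start = i = j
        elif split_on_newlines and ch in "\r\n":
            # A newline closes the current sentence and is never part of it
            _emit(text, start, i, sentences)
            i += 1
            start = i
        else:
            i += 1
    # Trailing text without end punctuation still forms a sentence
    _emit(text, start, n, sentences)
    return sentences


text = "Hello world. How are you?\nFine"
for span in split_sentences(text, keep_punct=True):
    print(span)
```

In this sketch the offsets refer to positions in the input string, so each returned sentence satisfies text[start:end] == sentence.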

Methods:

run(segments)

Return sentences detected in segments.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Return sentences detected in segments.

Parameters

segments (List[Segment]) – List of segments in which to look for sentences.

Return type

List[Segment]

Returns

List[Segment] – Sentence segments found in the input segments.
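
As a rough picture of what run does with a list of segments, the sketch below derives one output segment per sentence, applies the output label, and copies only the attributes named in attrs_to_copy. It assumes a simplified model: SimpleSegment and run_sketch are hypothetical stand-ins, not medkit's Segment class or the actual run method, and the splitting rule is hard-coded to the default punctuation characters and newlines.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SimpleSegment:
    """Hypothetical stand-in for a text segment (not medkit's Segment)."""

    label: str
    text: str
    start: int = 0  # offset of `text` within the original document
    attrs: Dict[str, str] = field(default_factory=dict)


def run_sketch(
    segments: List[SimpleSegment],
    output_label: str = "sentence",
    attrs_to_copy: Optional[List[str]] = None,
) -> List[SimpleSegment]:
    """Hypothetical helper: derive sentence segments from input segments."""
    wanted = set(attrs_to_copy or [])
    sentences: List[SimpleSegment] = []
    for segment in segments:
        # Simplified rule: a sentence is a run of characters that contains
        # no default end punctuation and no line break
        for match in re.finditer(r"[^.;?!\r\n]+", segment.text):
            raw = match.group()
            stripped = raw.strip()
            if not stripped:
                continue
            lead = len(raw) - len(raw.lstrip())
            sentences.append(
                SimpleSegment(
                    label=output_label,
                    text=stripped,
                    # Keep offsets relative to the original document
                    start=segment.start + match.start() + lead,
                    # Propagate only the requested attributes
                    attrs={k: v for k, v in segment.attrs.items() if k in wanted},
                )
            )
    return sentences


section = SimpleSegment(
    label="section",
    text="No fever. Mild cough.",
    start=100,
    attrs={"section": "history", "lang": "en"},
)
for sentence in run_sketch([section], attrs_to_copy=["section"]):
    print(sentence)
```

Here each derived sentence keeps the "section" attribute of its source segment while "lang" is dropped, which mirrors the stated purpose of attrs_to_copy.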

property description: medkit.core.operation_desc.OperationDescription

Contains all the operation init parameters.

Return type

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters

prov_tracer (ProvTracer) – The provenance tracer used to record provenance.