medkit.text.segmentation.sentence_tokenizer
Classes:
SentenceTokenizer – Sentence segmentation annotator based on end punctuation rules
- class SentenceTokenizer(output_label='sentence', punct_chars=('.', ';', '?', '!'), keep_punct=False, split_on_newlines=True, attrs_to_copy=None, uid=None)
Sentence segmentation annotator based on end punctuation rules
Instantiate the sentence tokenizer
- Parameters
output_label (str, Optional) – The output label of the created annotations.
punct_chars (Tuple[str], Optional) – The characters treated as end-of-sentence punctuation.
keep_punct (bool, Optional) – If True, the end punctuation is kept in the detected sentence. If False, the sentence text does not include the end punctuation.
split_on_newlines (bool) – Whether to treat newline characters as sentence boundaries.
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the input segment to the derived segment. Useful, for example, for propagating a section name.
uid (str, Optional) – Identifier of the tokenizer.
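The interaction between punct_chars, keep_punct and split_on_newlines can be illustrated with a small standalone sketch. This is not medkit's implementation: split_sentences is a hypothetical helper written only for illustration, and details such as collapsing runs of punctuation and trimming surrounding whitespace are assumptions that may differ from what SentenceTokenizer actually does.

```python
import re
from typing import List, Tuple


def split_sentences(
    text: str,
    punct_chars: Tuple[str, ...] = (".", ";", "?", "!"),
    keep_punct: bool = False,
    split_on_newlines: bool = True,
) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: return (sentence, start, end) triples for `text`.

    Mimics the documented parameters of SentenceTokenizer; it is not the
    library's own code.
    """
    if not punct_chars:
        raise ValueError("punct_chars must contain at least one character")

    # Escape each character so it is safe inside a regex character class.
    punct = "".join(re.escape(c) for c in punct_chars)
    # Characters that terminate a sentence body.
    stop = punct + (r"\r\n" if split_on_newlines else "")
    # A sentence body is a run of non-boundary characters, optionally
    # followed by a run of end punctuation.
    pattern = re.compile(rf"(?P<body>[^{stop}]+)(?P<punct>[{punct}]+)?")

    sentences = []
    for match in pattern.finditer(text):
        start = match.start("body")
        if keep_punct and match.group("punct"):
            end = match.end("punct")
        else:
            end = match.end("body")
        chunk = text[start:end]
        stripped = chunk.strip()
        if not stripped:
            continue  # skip whitespace-only fragments
        # Shift the offsets so they point at the trimmed sentence.
        start += len(chunk) - len(chunk.lstrip())
        end = start + len(stripped)
        sentences.append((stripped, start, end))
    return sentences


note = "Fever since Monday. No cough\nChest X-ray normal; follow up?"
for sentence, start, end in split_sentences(note):
    print(start, end, repr(sentence))
```

Returning character offsets alongside each sentence mirrors the idea of deriving new segments from an input segment: every result can be mapped back to its position in the original text.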
Methods:
run(segments) – Return sentences detected in segments.
set_prov_tracer(prov_tracer) – Enable provenance tracing.
Attributes:
description – Contains all the operation init parameters.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
medkit.core.operation_desc.OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.