medkit.text.segmentation.rush_sentence_tokenizer
This module requires extra dependencies that are not installed with medkit's core dependencies. To install them, run pip install medkit-lib[rush-sentence-tokenizer].
Classes:
RushSentenceTokenizer – Sentence segmentation annotator based on PyRuSH.
- class RushSentenceTokenizer(output_label='sentence', path_to_rules=None, keep_newlines=True, attrs_to_copy=None, uid=None)
Sentence segmentation annotator based on PyRuSH.
Instantiate the RuSH tokenizer.
- Parameters
output_label (str) – The output label of the created annotations.
path_to_rules (Union[str, Path, None]) – Path to the CSV or TSV rules file to provide to PyRuSH. If none is provided, “rush_tokenizer_default_rules.tsv” is used (it corresponds to “conf/rush_rules.tsv” in the PyRuSH repo).
keep_newlines (bool) – With the default rules, newline characters are not used to split sentences, so a sentence may contain one or more newline characters. If keep_newlines is False, newlines are replaced by spaces.
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the input segment to the derived segment. Useful, for example, for propagating the section name.
uid (str) – Identifier of the tokenizer.
Methods:
run(segments) – Return sentences detected in segments.
set_prov_tracer(prov_tracer) – Enable provenance tracing.
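A typical call sequence might look like the sketch below. It assumes that TextDocument can be imported from medkit.core.text and exposes a raw_segment attribute covering the full text; neither is described on this page, so treat both as assumptions. The wrapper function split_sentences is hypothetical and returns None when the optional dependencies are missing.

```python
def split_sentences(text):
    """Return the sentence texts detected in `text`.

    Returns None if medkit or the PyRuSH extra is not installed.
    """
    try:
        # Assumed import path for TextDocument (not documented in this section).
        from medkit.core.text import TextDocument
        from medkit.text.segmentation.rush_sentence_tokenizer import (
            RushSentenceTokenizer,
        )
    except ImportError:
        return None

    doc = TextDocument(text=text)
    tokenizer = RushSentenceTokenizer(output_label="sentence", keep_newlines=False)
    # run() takes a list of segments; raw_segment (assumed) spans the whole text.
    sentences = tokenizer.run([doc.raw_segment])
    return [sentence.text for sentence in sentences]


print(split_sentences("The patient has asthma. No fever was reported."))
```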
Attributes:
description – Contains all the operation init parameters.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
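As a rough sketch of how provenance tracing could be enabled on the tokenizer: the example below assumes that ProvTracer is importable from medkit.core, that it offers a get_prov(uid) method returning a record with an op_desc attribute, and that TextDocument.raw_segment exists. None of these are described on this page, so check them against the medkit version in use. The function trace_sentences is hypothetical.

```python
def trace_sentences(text):
    """Return the name of the operation that produced each detected sentence.

    Returns None if medkit or the PyRuSH extra is not installed.
    """
    try:
        # Assumed import paths (not documented in this section).
        from medkit.core import ProvTracer
        from medkit.core.text import TextDocument
        from medkit.text.segmentation.rush_sentence_tokenizer import (
            RushSentenceTokenizer,
        )
    except ImportError:
        return None

    tokenizer = RushSentenceTokenizer()
    prov_tracer = ProvTracer()
    # Enable tracing before calling run() so the created sentences are recorded.
    tokenizer.set_prov_tracer(prov_tracer)

    doc = TextDocument(text=text)
    sentences = tokenizer.run([doc.raw_segment])
    # get_prov() (assumed) looks up the provenance record of one data item.
    return [prov_tracer.get_prov(s.uid).op_desc.name for s in sentences]


print(trace_sentences("The patient has asthma. No fever was reported."))
```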