medkit.text.segmentation
APIs
To access these APIs, import them like this:
from medkit.text.segmentation import <api_to_import>
Classes:
SectionTokenizer | Section segmentation annotator based on keyword rules
SentenceTokenizer | Sentence segmentation annotator based on end punctuation rules
SyntagmaTokenizer | Syntagma segmentation annotator based on provided separators
- class SectionTokenizer(section_dict=None, output_label='section', section_rules=(), strip_chars='.;,?! \n\r\t', uid=None)
Section segmentation annotator based on keyword rules
Initialize the section tokenizer
- Parameters
section_dict (Optional[Dict[str, List[str]]]) – Dictionary containing the section name as key and the list of mappings as value. If None, the content of default_section_definition.yml will be used.
output_label (str) – Segment label to use for annotation output.
section_rules (Iterable[SectionModificationRule]) – List of rules for modifying a section name according to its order relative to the other sections. If section_dict is None, the content of default_section_definition.yml will be used.
strip_chars (str) – The list of characters to strip at the beginning of the returned segment.
uid (str, Optional) – Identifier of the tokenizer
Methods:
load_section_definition(filepath[, encoding]) | Load the section definitions stored in a yml file
run(segments) | Return sections detected in segments.
save_section_definition(section_dict, ...[, ...]) | Save section yaml definition file
set_prov_tracer(prov_tracer) | Enable provenance tracing.
Attributes:
description | Contains all the operation init parameters.
- run(segments)
Return sections detected in segments. Each section is a segment with an attached attribute (label: <same as self.output_label>, value: <the name of the section>).
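The keyword-rule idea behind section detection can be illustrated with a minimal standard-library sketch. This is not medkit's implementation: the function `find_sections` and the sample keywords are hypothetical, text before the first keyword is ignored, and trimming `strip_chars` from both ends of each span is an assumption made for readability.

```python
import re
from typing import Dict, List, Tuple

DEFAULT_STRIP = ".;,?! \n\r\t"


def find_sections(
    text: str,
    section_dict: Dict[str, List[str]],
    strip_chars: str = DEFAULT_STRIP,
) -> List[Tuple[str, int, int]]:
    """Return (section_name, start, end) spans.

    Hypothetical helper: each section runs from one keyword match
    to the start of the next keyword match (or the end of the text).
    """
    # Collect the offset of every keyword occurrence, with its section name
    hits = []
    for name, keywords in section_dict.items():
        for keyword in keywords:
            for match in re.finditer(re.escape(keyword), text):
                hits.append((match.start(), name))
    hits.sort()

    sections = []
    for i, (start, name) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        # Assumption: trim strip_chars from both ends of the span
        while start < end and text[start] in strip_chars:
            start += 1
        while end > start and text[end - 1] in strip_chars:
            end -= 1
        if start < end:
            sections.append((name, start, end))
    return sections


report = "History: asthma since childhood.\nTreatment: salbutamol as needed."
definitions = {"antecedent": ["History:"], "treatment": ["Treatment:"]}
for name, start, end in find_sections(report, definitions):
    print(name, repr(report[start:end]))
```

In the real annotator the section name is carried as an attribute on the returned segment, as described above; the tuple's first element plays that role in this sketch.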
- static load_section_definition(filepath, encoding=None)
Load the section definitions stored in a yml file
- Parameters
filepath (Path) – Path to a yml file containing the sections (name + mappings) and rules
encoding (Optional[str]) – Encoding of the file to open
- Return type
Tuple[Dict[str, List[str]], Tuple[SectionModificationRule, …]]
- Returns
A tuple containing (1) the dictionary whose keys are section names and whose values are the lists of all equivalent strings, and (2) the section modification rules. These rules allow renaming some sections according to their order.
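To make the idea of order-based renaming concrete, here is a small sketch under stated assumptions. `RenameRule` and `apply_rules` are hypothetical names invented for illustration; the fields and the "BEFORE"/"AFTER" semantics chosen here are assumptions and may not match `SectionModificationRule` exactly.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class RenameRule:
    """Hypothetical stand-in for a section modification rule."""

    section_name: str                 # section the rule applies to
    new_section_name: str             # name to use when the rule fires
    other_sections: Tuple[str, ...]   # sections that trigger the rule
    order: str                        # "BEFORE" or "AFTER" (assumed semantics)


def apply_rules(names: List[str], rules: Iterable[RenameRule]) -> List[str]:
    """Rename sections depending on where they sit relative to others."""
    rules = list(rules)
    result = list(names)
    for i, name in enumerate(names):
        for rule in rules:
            if rule.section_name != name:
                continue
            # "BEFORE": look at later sections; "AFTER": look at earlier ones
            others = names[i + 1:] if rule.order == "BEFORE" else names[:i]
            if any(other in others for other in rule.other_sections):
                result[i] = rule.new_section_name
                break
    return result


detected = ["antecedent", "treatment", "exam", "treatment"]
rule = RenameRule("treatment", "discharge_treatment", ("exam",), "AFTER")
print(apply_rules(detected, [rule]))
```

Only the second "treatment" section is renamed in this example, because it is the only one that follows an "exam" section.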
- static save_section_definition(section_dict, section_rules, filepath, encoding=None)
Save section yaml definition file
- Parameters
section_dict (Dict[str, List[str]]) – Dictionary containing the section name as key and the list of mappings as value (see the content of default_section_definition.yml for an example)
section_rules (Iterable[SectionModificationRule]) – List of rules for modifying a section name according to its order relative to the other sections.
filepath (Path) – Path to the file to save
encoding (Optional[str]) – File encoding. Default: None
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- class SentenceTokenizer(output_label='sentence', punct_chars=('.', ';', '?', '!'), keep_punct=False, split_on_newlines=True, attrs_to_copy=None, uid=None)
Sentence segmentation annotator based on end punctuation rules
Instantiate the sentence tokenizer
- Parameters
output_label (str, Optional) – The output label of the created annotations.
punct_chars (Tuple[str], Optional) – The set of characters corresponding to end punctuation.
keep_punct (bool, Optional) – If True, the end punctuation is kept in the detected sentence. If False, the sentence text does not include the end punctuation.
split_on_newlines (bool) – Whether to treat newline characters as sentence boundaries.
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the input segment to the derived segment. This is useful, for example, for propagating the section name.
uid (str, Optional) – Identifier of the tokenizer
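The effect of punct_chars, keep_punct and split_on_newlines can be illustrated with a minimal standard-library sketch. It is not medkit's implementation: `split_sentences` is a hypothetical helper that mirrors the parameter names above, and trimming surrounding whitespace from each span is an assumption.

```python
import re
from typing import List, Sequence, Tuple


def split_sentences(
    text: str,
    punct_chars: Sequence[str] = (".", ";", "?", "!"),
    keep_punct: bool = False,
    split_on_newlines: bool = True,
) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of the detected sentences."""
    punct = "".join(re.escape(c) for c in punct_chars)
    # A sentence body stops at end punctuation and, optionally, at newlines
    excluded = punct + ("\\n\\r" if split_on_newlines else "")
    pattern = re.compile(f"(?P<body>[^{excluded}]+)(?P<punct>[{punct}]*)")

    spans = []
    for match in pattern.finditer(text):
        body = match.group("body")
        if not body.strip():
            continue  # nothing but whitespace between two boundaries
        # Assumption: surrounding whitespace is not part of the sentence
        start = match.start("body") + len(body) - len(body.lstrip())
        if keep_punct and match.group("punct"):
            end = match.end("punct")
        else:
            end = match.start("body") + len(body.rstrip())
        spans.append((start, end))
    return spans


note = "No fever. Cough for 3 days!\nChest X-ray normal"
for start, end in split_sentences(note, keep_punct=True):
    print(repr(note[start:end]))
```

A purely character-based rule like this one also splits on decimal points and abbreviations, which is worth keeping in mind when choosing punct_chars.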
Methods:
run(segments) | Return sentences detected in segments.
set_prov_tracer(prov_tracer) | Enable provenance tracing.
Attributes:
description | Contains all the operation init parameters.
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
- class SyntagmaTokenizer(separators=None, output_label='syntagma', strip_chars='.;,?! \n\r\t', attrs_to_copy=None, uid=None)
Syntagma segmentation annotator based on provided separators
Instantiate the syntagma tokenizer
- Parameters
separators (Tuple[str, ...]) – The tuple of regular expressions corresponding to separators. If None, the rules in "default_syntagma_definition.yml" will be used.
output_label (str, Optional) – The output label of the created annotations.
strip_chars (str) – The list of characters to strip at the beginning of the returned segment.
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the input segment to the derived segment. This is useful, for example, for propagating the section name.
uid (str, Optional) – Identifier of the tokenizer
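Separator-based splitting can be sketched with the standard library as follows. This is an illustration, not medkit's implementation: `split_syntagmas` and the two English separators are hypothetical, cutting at the start of each separator match (so the separator opens the next syntagma) is an assumption, and so is trimming `strip_chars` from both ends.

```python
import re
from typing import List, Sequence, Tuple


def split_syntagmas(
    text: str,
    separators: Sequence[str],
    strip_chars: str = ".;,?! \n\r\t",
) -> List[Tuple[int, int]]:
    """Return (start, end) spans obtained by cutting at separator matches."""
    # Combine all separator regular expressions into one alternation
    pattern = re.compile("|".join(f"(?:{sep})" for sep in separators))
    # Assumption: each separator match starts a new syntagma
    cuts = [0] + [m.start() for m in pattern.finditer(text)] + [len(text)]

    spans = []
    for start, end in zip(cuts, cuts[1:]):
        # Assumption: trim strip_chars from both ends of the span
        while start < end and text[start] in strip_chars:
            start += 1
        while end > start and text[end - 1] in strip_chars:
            end -= 1
        if start < end:
            spans.append((start, end))
    return spans


sentence = "Patient has asthma, but no fever and no cough."
for start, end in split_syntagmas(sentence, (r"\bbut\b", r"\band\b")):
    print(repr(sentence[start:end]))
```

Returning character offsets rather than substrings keeps each piece traceable to its position in the input text.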
Methods:
load_syntagma_definition(filepath[, encoding]) | Load the syntagma definition stored in a yml file
run(segments) | Return syntagmas detected in segments.
save_syntagma_definition(syntagma_seps, filepath[, encoding]) | Save syntagma yaml definition file
set_prov_tracer(prov_tracer) | Enable provenance tracing.
Attributes:
description | Contains all the operation init parameters.
- static load_syntagma_definition(filepath, encoding=None)
Load the syntagma definition stored in a yml file
- Parameters
filepath (Path) – Path to a yml file containing the syntagma separators
encoding (Optional[str]) – Encoding of the file to open
- Return type
Tuple[str, …]
- Returns
Tuple containing the separators
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- static save_syntagma_definition(syntagma_seps, filepath, encoding=None)
Save syntagma yaml definition file
- Parameters
syntagma_seps (Tuple[str, …]) – The tuple of regular expressions corresponding to separators
filepath (Path) – The path of the file to save
encoding (Optional[str]) – The encoding of the file. Default: None
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
Subpackages / Submodules
This module needs extra dependencies that are not installed as core dependencies of medkit.