medkit.text.segmentation.section_tokenizer
Classes:
SectionTokenizer – Section segmentation annotator based on keyword rules
- class SectionTokenizer(section_dict=None, output_label='section', section_rules=(), strip_chars='.;,?! \n\r\t', uid=None)
Section segmentation annotator based on keyword rules
Initialize the Section Tokenizer
- Parameters
section_dict (Optional[Dict[str, List[str]]]) – Dictionary containing the section name as key and the list of mappings as value. If None, the content of default_section_definition.yml will be used.
output_label (str) – Segment label to use for annotation output.
section_rules (Iterable[SectionModificationRule]) – List of rules for renaming a section according to its position relative to the other sections. If section_dict is None, the content of default_section_definition.yml will be used.
strip_chars (str) – The characters to strip at the beginning of the returned segment.
uid (Optional[str]) – Identifier of the tokenizer.
Methods:
load_section_definition(filepath[, encoding]) – Load the section definitions stored in a yml file
run(segments) – Return sections detected in segments.
save_section_definition(section_dict, ...[, ...]) – Save section yaml definition file
set_prov_tracer(prov_tracer) – Enable provenance tracing.
Attributes:
description – Contains all the operation init parameters.
- run(segments)
Return sections detected in segments. Each section is a segment with an attached attribute (label: <same as self.output_label>, value: <the name of the section>).
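The idea of keyword-based section detection can be sketched in plain Python. This is an illustration only, not medkit's implementation: the function name `split_into_sections`, the sample text, and the rule that a section runs from its keyword to the next keyword are all assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple


def split_into_sections(
    text: str, section_dict: Dict[str, List[str]]
) -> List[Tuple[str, int, int]]:
    """Return (section_name, start, end) triples for each detected section.

    Hypothetical helper: it mimics the shape of section_dict described above
    (section name -> list of equivalent keywords), nothing more.
    """
    # Map every keyword back to the section name it stands for.
    keyword_to_name = {
        keyword: name
        for name, keywords in section_dict.items()
        for keyword in keywords
    }
    if not keyword_to_name:
        return []
    # Try longer keywords first so overlapping keywords resolve to the longest.
    alternatives = sorted(keyword_to_name, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in alternatives))

    matches = list(pattern.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # Assumption: a section runs from its keyword to the next keyword.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((keyword_to_name[match.group()], match.start(), end))
    return sections


text = "History: asthma since 2010. Treatment: salbutamol as needed."
section_dict = {"antecedent": ["History"], "treatment": ["Treatment", "Medication"]}
for name, start, end in split_into_sections(text, section_dict):
    print(name, repr(text[start:end]))
```

In this sketch each returned triple plays the role of a segment carrying the section name as its attribute value; the real tokenizer works on medkit segments and spans rather than raw offsets.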
- static load_section_definition(filepath, encoding=None)
Load the section definitions stored in a yml file
- Parameters
filepath (Path) – Path to a yml file containing the sections (name + mappings) and rules
encoding (Optional[str]) – Encoding of the file to open
- Return type
Tuple[Dict[str, List[str]], Tuple[SectionModificationRule, …]]
- Returns
A tuple containing:
- the dictionary where each key is a section name and each value is the list of all equivalent strings;
- the section modification rules, which rename some sections according to their order.
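Renaming a section according to its order can be pictured with a small standalone sketch. The `RenameRule` dataclass, its field names, and the "BEFORE"/"AFTER" semantics below are assumptions made for illustration; they are not taken from medkit's SectionModificationRule definition.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RenameRule:
    # Hypothetical stand-in for a section modification rule;
    # the field names and their meaning are assumptions.
    section_name: str
    new_section_name: str
    other_sections: Sequence[str]
    order: str  # "BEFORE" or "AFTER"


def apply_rules(names: List[str], rules: Sequence[RenameRule]) -> List[str]:
    """Rename entries of an ordered list of section names according to the rules."""
    renamed = list(names)
    for i, name in enumerate(names):
        for rule in rules:
            if rule.section_name != name:
                continue
            # "BEFORE": the section precedes one of other_sections, so look ahead.
            # "AFTER": the section follows one of other_sections, so look back.
            neighbours = names[i + 1:] if rule.order == "BEFORE" else names[:i]
            if any(other in neighbours for other in rule.other_sections):
                renamed[i] = rule.new_section_name
                break  # the first matching rule wins
    return renamed


rules = [RenameRule("treatment", "entry_treatment", ["exam"], "BEFORE")]
print(apply_rules(["treatment", "exam", "treatment"], rules))
```

Here only the first "treatment" section is renamed, because it is the only one that occurs before an "exam" section.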
- static save_section_definition(section_dict, section_rules, filepath, encoding=None)
Save the section definitions to a yaml file
- Parameters
section_dict (Dict[str, List[str]]) – Dictionary containing the section name as key and the list of mappings as value (cf. the content of default_section_definition.yml for an example)
section_rules (Iterable[SectionModificationRule]) – List of rules for renaming a section according to its position relative to the other sections.
filepath (Path) – Path to the file to save
encoding (Optional[str]) – File encoding. Default: None
- property description: medkit.core.operation_desc.OperationDescription
Contains all the operation init parameters.
- Return type
OperationDescription
- set_prov_tracer(prov_tracer)
Enable provenance tracing.
- Parameters
prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.