sparknlp.annotator.document_title_splitter
Contains classes for the DocumentTitleSplitter
Module Contents
Classes
DocumentTitleSplitter: Annotator that groups element-level documents into title-aware sections.
- class DocumentTitleSplitter
Annotator that groups element-level documents into title-aware sections.
DocumentTitleSplitter is intended to work with element-level DOCUMENT annotations, such as those produced by Reader2Doc().setOutputAsDocument(False). Whenever an input annotation has metadata["elementType"] == "Title", it starts a new semantic section, and the title stays with the content that follows it.
Optionally, oversized sections can be split by character length after the semantic grouping phase.
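The grouping rule described above can be illustrated with a small pure-Python sketch. It is not the annotator's implementation: the `group_by_title` helper and the `(text, metadata)` tuples are hypothetical stand-ins for DOCUMENT annotations, and keeping text that precedes the first title as its own section is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for an element-level DOCUMENT annotation: (text, metadata).
Element = Tuple[str, Dict[str, str]]


def group_by_title(elements: List[Element], join_string: str = " ") -> List[str]:
    """Group element texts into sections; each Title element opens a new section."""
    sections: List[List[str]] = []
    for text, metadata in elements:
        is_title = metadata.get("elementType") == "Title"
        # Assumption: text before the first title forms its own leading section.
        if is_title or not sections:
            sections.append([])
        sections[-1].append(text)
    # Join the element texts of each section, as joinString is described to do.
    return [join_string.join(parts) for parts in sections]


elements = [
    ("Preface text", {"elementType": "NarrativeText"}),
    ("Introduction", {"elementType": "Title"}),
    ("Spark NLP is a library.", {"elementType": "NarrativeText"}),
    ("Setup", {"elementType": "Title"}),
    ("Install the package.", {"elementType": "NarrativeText"}),
]
sections = group_by_title(elements)
```

Each title stays at the start of the section it opens, so downstream consumers see the heading together with its content.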
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- joinString
String used to join element texts inside a section, by default " ".
- splitOnPageChange
Whether to start a new section when the page number changes, by default False.
- enableOverflowSplitting
Whether to split oversized sections after title grouping, by default False.
- maxCharacters
Maximum size of an overflow-split chunk, by default 500.
- explodeSplits
Whether to explode split chunks to separate rows, by default False.
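As a rough illustration of how enableOverflowSplitting and maxCharacters interact, the sketch below cuts an oversized section into fixed-width slices. The `split_overflow` helper is hypothetical, and plain character slicing is an assumption: the annotator may choose chunk boundaries differently, for example at whitespace.

```python
from typing import List


def split_overflow(section: str, max_characters: int = 500) -> List[str]:
    """Split a section into chunks of at most max_characters characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    # Sections that already fit are returned unchanged.
    if len(section) <= max_characters:
        return [section]
    # Assumption: fixed-width slices with no overlap and no word-boundary handling.
    return [
        section[start:start + max_characters]
        for start in range(0, len(section), max_characters)
    ]


chunks = split_overflow("a" * 1200, max_characters=500)
lengths = [len(chunk) for chunk in chunks]  # [500, 500, 200]
```

The split runs after title grouping, so only sections longer than the limit are affected.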
- setJoinString(value)
Sets the string used to join element texts inside a section.
- Parameters:
- value : str
Join string used between element texts
- setSplitOnPageChange(value)
Sets whether to start a new section when the page number changes.
- Parameters:
- value : bool
Whether to start a new section when the page number changes
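A minimal sketch of what a page-change boundary could look like alongside title grouping follows. The `group_sections` helper is hypothetical, and reading the page from a `pageNumber` metadata key is an assumption, not something this page confirms.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for an element-level DOCUMENT annotation: (text, metadata).
Element = Tuple[str, Dict[str, str]]


def group_sections(
    elements: List[Element], split_on_page_change: bool = False
) -> List[List[str]]:
    """Group texts by title, optionally also breaking when the page changes."""
    sections: List[List[str]] = []
    previous_page = None
    for text, metadata in elements:
        page = metadata.get("pageNumber")  # assumed metadata key
        is_title = metadata.get("elementType") == "Title"
        page_changed = (
            split_on_page_change
            and previous_page is not None
            and page != previous_page
        )
        if is_title or page_changed or not sections:
            sections.append([])
        sections[-1].append(text)
        previous_page = page
    return sections


elements = [
    ("Intro", {"elementType": "Title", "pageNumber": "1"}),
    ("First paragraph.", {"elementType": "NarrativeText", "pageNumber": "1"}),
    ("Continued paragraph.", {"elementType": "NarrativeText", "pageNumber": "2"}),
]
without_page_split = group_sections(elements)
with_page_split = group_sections(elements, split_on_page_change=True)
```

With the flag off, the section spans both pages; with it on, the text on the second page starts a section of its own even though no new title appears.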
- setEnableOverflowSplitting(value)
Sets whether to split oversized sections after title grouping.
- Parameters:
- value : bool
Whether to split oversized sections after title grouping