sparknlp.annotator.document_title_splitter

Contains classes for the DocumentTitleSplitter annotator.

Module Contents

Classes

DocumentTitleSplitter

Annotator that groups element-level documents into title-aware sections.

class DocumentTitleSplitter

Annotator that groups element-level documents into title-aware sections.

DocumentTitleSplitter is intended to work with element-level DOCUMENT annotations, such as those produced by Reader2Doc().setOutputAsDocument(False). Whenever an input annotation has metadata["elementType"] == "Title", it starts a new semantic section and the title stays with the following content.
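The grouping rule described above can be illustrated with a small pure-Python sketch. It is not the annotator's implementation: the `Element` tuple and `group_by_title` function are hypothetical stand-ins for DOCUMENT annotations and the grouping step, and the "NarrativeText" element type is an assumed example value.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for an element-level DOCUMENT annotation: (text, metadata).
Element = Tuple[str, Dict[str, str]]


def group_by_title(elements: List[Element], join_string: str = " ") -> List[str]:
    """Group elements into sections, opening a new section at each Title."""
    sections: List[List[str]] = []
    for text, metadata in elements:
        # A Title opens a new section; so does the very first element,
        # so content that precedes any title still lands in a section.
        if metadata.get("elementType") == "Title" or not sections:
            sections.append([])
        sections[-1].append(text)
    # Join the element texts of each section, mirroring joinString.
    return [join_string.join(parts) for parts in sections]


elements = [
    ("Introduction", {"elementType": "Title"}),
    ("Spark NLP is a library.", {"elementType": "NarrativeText"}),
    ("Installation", {"elementType": "Title"}),
    ("Install it with pip.", {"elementType": "NarrativeText"}),
]
sections = group_by_title(elements)
print(sections)
```

Each title ends up at the start of its own section, followed by the content that came after it in the input.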

Optionally, oversized sections can be split by character length after the semantic grouping phase.
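As a rough illustration of overflow splitting, the sketch below cuts a section into consecutive chunks of at most `max_characters` characters. The `split_overflow` helper is hypothetical, and the fixed-width cut is an assumption: the annotator may choose its split points differently, for example at whitespace.

```python
from typing import List


def split_overflow(section: str, max_characters: int = 500) -> List[str]:
    """Cut a section into consecutive chunks of at most max_characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    # Sections that already fit are returned unchanged.
    if len(section) <= max_characters:
        return [section]
    # Naive fixed-width slicing (an assumption, not the library's strategy).
    return [
        section[start:start + max_characters]
        for start in range(0, len(section), max_characters)
    ]


chunks = split_overflow("a" * 1200, max_characters=500)
print([len(chunk) for chunk in chunks])
```

Because this step runs after the semantic grouping, only sections longer than the limit are affected; shorter sections pass through as a single chunk.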

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Parameters:
joinString

String used to join element texts inside a section, by default " ".

splitOnPageChange

Whether to start a new section when page number changes, by default False.

enableOverflowSplitting

Whether to split oversized sections after title grouping, by default False.

maxCharacters

Maximum size of an overflow-split chunk, by default 500.

explodeSplits

Whether to explode split chunks to separate rows, by default False.
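A configuration sketch using the setters documented on this page might look like the following. The import path follows the module name shown at the top of the page. The column names "document" and "sections" and the chosen parameter values are assumptions for illustration only, and `build_title_splitter` is a hypothetical helper.

```python
def build_title_splitter():
    """Return a DocumentTitleSplitter configured with non-default values."""
    # Imported lazily so this block can be loaded without Spark NLP installed.
    from sparknlp.annotator.document_title_splitter import DocumentTitleSplitter

    return (
        DocumentTitleSplitter()
        .setInputCols(["document"])        # assumed element-level DOCUMENT column
        .setOutputCol("sections")          # assumed output column name
        .setJoinString("\n")               # join element texts with newlines
        .setSplitOnPageChange(True)        # also break sections at page changes
        .setEnableOverflowSplitting(True)  # split sections that are too long
        .setMaxCharacters(1000)            # upper bound for each overflow chunk
        .setExplodeSplits(True)            # one output row per chunk
    )
```

Calling `build_title_splitter()` requires Spark NLP to be installed and a Spark session to be running, as with other annotators. The returned stage can then be placed in a pipeline after a reader that emits element-level documents.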

inputAnnotatorTypes
outputAnnotatorType = 'document'
joinString
splitOnPageChange
enableOverflowSplitting
maxCharacters
explodeSplits
setJoinString(value)

Sets the string used to join element texts inside a section.

Parameters:
value : str

Join string used between element texts

setSplitOnPageChange(value)

Sets whether to start a new section when page number changes.

Parameters:
value : bool

Whether to start a new section when page number changes

setEnableOverflowSplitting(value)

Sets whether to split oversized sections after title grouping.

Parameters:
value : bool

Whether to split oversized sections after title grouping

setMaxCharacters(value)

Sets the maximum size of an overflow-split chunk.

Parameters:
value : int

Maximum size of an overflow-split chunk

setExplodeSplits(value)

Sets whether to explode split chunks to separate rows.

Parameters:
value : bool

Whether to explode split chunks to separate rows
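
To make the effect of exploding concrete, the sketch below shows the row-shape change on plain Python dictionaries. It is only an analogy for the DataFrame behavior: the `explode_splits` helper and the "file" and "splits" column names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def explode_splits(rows: List[Row], column: str = "splits") -> List[Row]:
    """Return one output row per chunk, copying the other columns."""
    exploded: List[Row] = []
    for row in rows:
        for chunk in row[column]:
            # Each chunk gets its own row; remaining columns are repeated.
            exploded.append({**row, column: chunk})
    return exploded


rows = [{"file": "a.pdf", "splits": ["Intro text", "Setup text"]}]
exploded_rows = explode_splits(rows)
print(exploded_rows)
```

Without exploding, all chunks of a document stay together in a single row; with it, each chunk becomes a separate row that repeats the other column values.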