sparknlp.annotator.document_character_text_splitter#

Contains classes for the DocumentCharacterTextSplitter.

Module Contents#

Classes#

DocumentCharacterTextSplitter

Annotator which splits large documents into chunks of roughly a given size.

class DocumentCharacterTextSplitter[source]#

Annotator which splits large documents into chunks of roughly a given size.

DocumentCharacterTextSplitter takes a list of separators. It applies the separators in order, splitting any subtext that is still longer than the chunk size, and can optionally overlap consecutive chunks.

For example, given chunk size 20 and overlap 5:

"He was, I take it, the most perfect reasoning and observing machine that the world has seen."

["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
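The splitting strategy described above can be sketched in plain Python. This is a simplified illustration and not the library's implementation: the names `split_text` and `merge_pieces` are hypothetical, and the boundary handling (chunks of at most `chunk_size` characters, separators dropped when splitting) is an assumption, so the chunks it produces are not guaranteed to match the annotator's output chunk for chunk.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def merge_pieces(pieces: Sequence[str], sep: str,
                 chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily join pieces into chunks of at most chunk_size characters.

    Trailing pieces totalling at most chunk_overlap characters are carried
    over into the next chunk.
    """
    chunks: List[str] = []
    window: List[str] = []
    for piece in pieces:
        if window and len(sep.join(window + [piece])) > chunk_size:
            chunks.append(sep.join(window))
            # Drop leading pieces until the carried-over tail fits the
            # overlap budget and leaves room for the new piece.
            while window and (
                len(sep.join(window)) > chunk_overlap
                or len(sep.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(sep.join(window))
    return chunks


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0,
               separators: Sequence[str] = DEFAULT_SEPARATORS) -> List[str]:
    """Split text on the first separator that occurs in it, recursing with
    the lower-priority separators for pieces that are still too long."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # "" always matches and means "split into single characters".
    sep = next((s for s in separators if s in text), "")
    remaining = separators[separators.index(sep) + 1:] if sep in separators else ()
    pieces = text.split(sep) if sep else list(text)

    chunks: List[str] = []
    small: List[str] = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush what fits so far, then split the oversized piece further.
            chunks.extend(merge_pieces(small, sep, chunk_size, chunk_overlap))
            small = []
            chunks.extend(split_text(piece, chunk_size, chunk_overlap, remaining))
    chunks.extend(merge_pieces(small, sep, chunk_size, chunk_overlap))
    # Trim whitespace and drop empty chunks.
    return [c.strip() for c in chunks if c.strip()]


sentence = ("He was, I take it, the most perfect reasoning and observing "
            "machine that the world has seen.")
for chunk in split_text(sentence, chunk_size=20, chunk_overlap=5):
    print(repr(chunk))
```

Every chunk the sketch returns is at most `chunk_size` characters long, and a chunk starts with the tail of the previous one whenever that tail fits within `chunk_overlap`.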

Additionally, you can set

  • custom patterns with setSplitPatterns

  • whether patterns should be interpreted as regex with setPatternsAreRegex

  • whether to keep the separators with setKeepSeparators

  • whether to trim whitespaces with setTrimWhitespace

  • whether to explode the splits to individual rows with setExplodeSplits

For extended examples of usage, see the DocumentCharacterTextSplitterTest.

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Parameters:
chunkSize

Size of each chunk of text.

chunkOverlap

Length of the overlap between text chunks, by default 0.

splitPatterns

Patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.

patternsAreRegex

Whether to interpret the split patterns as regular expressions, by default False.

keepSeparators

Whether to keep the separators in the final result, by default True.

explodeSplits

Whether to explode split chunks to separate rows, by default False.

trimWhitespace

Whether to trim whitespaces of extracted chunks, by default True.
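The parameters above might be wired into a Spark ML pipeline roughly as follows. This is a sketch under stated assumptions: the helper name `build_splitter_pipeline`, the column names `text`, `document` and `splits`, and the values 20000 and 200 are illustrative choices, not values prescribed by this page. Constructing the annotators requires a running Spark NLP session (for example one created with `sparknlp.start()`).

```python
def build_splitter_pipeline(chunk_size: int = 20000, chunk_overlap: int = 200):
    """Return an unfitted pipeline: DocumentAssembler -> DocumentCharacterTextSplitter.

    The default sizes and the column names are illustrative assumptions.
    """
    # Imported lazily so that defining this helper does not require Spark.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import DocumentCharacterTextSplitter
    from sparknlp.base import DocumentAssembler

    # Turns the raw "text" column into DOCUMENT annotations.
    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    )

    # Splits each document into overlapping chunks, one chunk per row.
    text_splitter = (
        DocumentCharacterTextSplitter()
        .setInputCols(["document"])
        .setOutputCol("splits")
        .setChunkSize(chunk_size)
        .setChunkOverlap(chunk_overlap)
        .setExplodeSplits(True)
    )

    return Pipeline(stages=[document_assembler, text_splitter])


# Usage, assuming `spark` is an active session and `df` has a "text" column:
#   result = build_splitter_pipeline().fit(df).transform(df)
#   result.selectExpr("splits.result", "splits[0].begin", "splits[0].end").show(truncate=80)
```

With `setExplodeSplits(True)` each chunk lands in its own row, which is why the usage comment indexes `splits[0]` to read the begin and end offsets of a row's single chunk.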

Example output (fragment). Each row shows the truncated chunk text, its begin offset, its end offset, and its length (end minus begin):

“Is Briony Lodge, Serpentine Aven…| 19798| 39395| 19597|
|[“How did that help you?”
“It was all-important. When a woman thinks that …| 39371| 59242| 19871|
|[“‘But,’ said I, ‘there would be millions of red-headed men who
would apply….| 59166| 77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a
very capab…| 77835| 97769| 19934|
|[“And yet I am not convinced of it,” I answered. “The cases which
come to li…| 97771| 117248| 19477|
|[“Well, she had a slate-coloured, broad-brimmed straw hat, with a
feather of…| 117250| 137242| 19992|
|[“That sounds a little paradoxical.”
“But it is profoundly true. Singulari…| 137244| 157171| 19927|

setChunkSize(value)[source]#

Sets size of each chunk of text.

Parameters:
value : int

Size of each chunk of text

setChunkOverlap(value)[source]#

Sets length of the overlap between text chunks, by default 0.

Parameters:
value : int

Length of the overlap between text chunks

setSplitPatterns(value)[source]#

Sets patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.

Parameters:
value : List[str]

Patterns to separate the text by in decreasing priority

setPatternsAreRegex(value)[source]#

Sets whether to interpret the split patterns as regular expressions, by default False.

Parameters:
value : bool

Whether to interpret the split patterns as regular expressions
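The difference between literal and regular-expression patterns can be illustrated with Python's `re` module. This is only an analogy: the annotator runs on the JVM, whose regex dialect differs in details, and the helper `split_on` is a hypothetical name introduced for this sketch.

```python
import re
from typing import List


def split_on(text: str, pattern: str, patterns_are_regex: bool = False) -> List[str]:
    """Split text on one pattern, treating it as a literal string unless
    patterns_are_regex is set. Empty pieces are dropped."""
    regex = pattern if patterns_are_regex else re.escape(pattern)
    return [part for part in re.split(regex, text) if part]


# Literal pattern: "." only matches an actual full stop.
print(split_on("alpha.beta gamma", "."))

# Regex pattern: a character class matching a full stop or any whitespace.
print(split_on("alpha.beta gamma", r"[.\s]", patterns_are_regex=True))

# Pitfall: as a regex, "." matches every character, so nothing is left.
print(split_on("a.b", ".", patterns_are_regex=True))
```

Patterns that contain regex metacharacters such as `.` or `|` therefore behave very differently depending on this setting.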

setKeepSeparators(value)[source]#

Sets whether to keep the separators in the final result, by default True.

Parameters:
value : bool

Whether to keep the separators in the final result

setExplodeSplits(value)[source]#

Sets whether to explode split chunks to separate rows, by default False.

Parameters:
value : bool

Whether to explode split chunks to separate rows

setTrimWhitespace(value)[source]#

Sets whether to trim whitespaces of extracted chunks, by default True.

Parameters:
value : bool

Whether to trim whitespaces of extracted chunks