sparknlp.annotator.document_character_text_splitter
Contains classes for the DocumentCharacterTextSplitter.
Module Contents
Classes
DocumentCharacterTextSplitter: Annotator which splits large documents into chunks of roughly given size.
- class DocumentCharacterTextSplitter
Annotator which splits large documents into chunks of roughly given size.
DocumentCharacterTextSplitter takes a list of separators. It applies the separators in order and splits any subtext that is still longer than the chunk size, taking the optional overlap between chunks into account.
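The procedure described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation. It assumes literal (non-regex) separators, that each separator is kept at the start of the piece that follows it, and that chunks are trimmed after merging. The helper names `split_keep`, `merge` and `split_text` are hypothetical.

```python
from typing import List


def split_keep(text: str, separator: str) -> List[str]:
    """Split on a literal separator, keeping it at the start of the next piece."""
    if separator == "":
        return list(text)  # empty separator: fall back to single characters
    parts = text.split(separator)
    pieces = [parts[0]] + [separator + part for part in parts[1:]]
    return [piece for piece in pieces if piece]


def merge(pieces: List[str], chunk_size: int, overlap: int) -> List[str]:
    """Greedily join pieces into chunks, carrying over at most `overlap` characters."""
    chunks: List[str] = []
    window: List[str] = []  # pieces that make up the chunk being built
    total = 0               # combined length of the pieces in `window`
    for piece in pieces:
        if window and total + len(piece) > chunk_size:
            chunks.append("".join(window))
            # Drop leading pieces until the carried tail fits the overlap
            # and leaves room for the incoming piece.
            while window and (total > overlap or total + len(piece) > chunk_size):
                total -= len(window.pop(0))
        window.append(piece)
        total += len(piece)
    if window:
        chunks.append("".join(window))
    stripped = [chunk.strip() for chunk in chunks]
    return [chunk for chunk in stripped if chunk]


def split_text(text: str, chunk_size: int, overlap: int,
               separators: List[str]) -> List[str]:
    """Try separators in priority order, recursing on pieces that are still too long."""
    if not separators:
        return [text]  # nothing left to split by
    separator, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    small: List[str] = []
    for piece in split_keep(text, separator):
        if len(piece) < chunk_size:
            small.append(piece)
            continue
        chunks.extend(merge(small, chunk_size, overlap))
        small = []
        chunks.extend(split_text(piece, chunk_size, overlap, remaining))
    chunks.extend(merge(small, chunk_size, overlap))
    return chunks


sentence = ("He was, I take it, the most perfect reasoning and observing "
            "machine that the world has seen.")
for chunk in split_text(sentence, 20, 5, ["\n\n", "\n", " ", ""]):
    print(repr(chunk))
```

With chunk size 20 and overlap 5, this sketch yields seven chunks for the sentence, from "He was, I take it," to "the world has seen.". Other inputs or settings may produce boundaries that differ from the annotator's actual output.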
For example, given chunk size 20 and overlap 5, the text
"He was, I take it, the most perfect reasoning and observing machine that the world has seen."
is split into
["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
Additionally, you can set:
- custom patterns with setSplitPatterns
- whether patterns should be interpreted as regex with setPatternsAreRegex
- whether to keep the separators with setKeepSeparators
- whether to trim whitespaces with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
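The setters listed above can be chained when the annotator is configured. The following is a hedged sketch of a two-stage pipeline. The column names `text`, `document` and `splits` and the function name `build_splitter_pipeline` are assumptions made for illustration, and the chunk size and overlap are simply the values from the example in this section.

```python
def build_splitter_pipeline():
    """Assemble raw text -> DOCUMENT -> split DOCUMENT annotations."""
    # Imported inside the function so that defining it does not require
    # a Spark or Spark NLP installation.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import DocumentCharacterTextSplitter

    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")       # assumed name of the raw text column
        .setOutputCol("document")
    )

    splitter = (
        DocumentCharacterTextSplitter()
        .setInputCols(["document"])
        .setOutputCol("splits")    # assumed name of the output column
        .setChunkSize(20)
        .setChunkOverlap(5)
        .setSplitPatterns(["\n\n", "\n", " ", ""])
        .setPatternsAreRegex(False)
        .setKeepSeparators(True)
        .setTrimWhitespace(True)
        .setExplodeSplits(True)
    )

    return Pipeline(stages=[document_assembler, splitter])
```

Given an active Spark NLP session and a DataFrame `df` with a `text` column, the pipeline would typically be applied with `build_splitter_pipeline().fit(df).transform(df)`.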
For extended examples of usage, see the DocumentCharacterTextSplitterTest.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- chunkSize
Size of each chunk of text.
- chunkOverlap
Length of the overlap between text chunks, by default 0.
- splitPatterns
Patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.
- patternsAreRegex
Whether to interpret the split patterns as regular expressions, by default False.
- keepSeparators
Whether to keep the separators in the final result, by default True.
- explodeSplits
Whether to explode split chunks to separate rows, by default False.
- trimWhitespace
Whether to trim whitespaces of extracted chunks, by default True.
Example output (partial):
|…“Is Briony Lodge, Serpentine Aven…| 19798| 39395| 19597|
|[“How did that help you?” “It was all-important. When a woman thinks that …| 39371| 59242| 19871|
|[“‘But,’ said I, ‘there would be millions of red-headed men who would apply….| 59166| 77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a very capab…| 77835| 97769| 19934|
|[“And yet I am not convinced of it,” I answered. “The cases which come to li…| 97771| 117248| 19477|
|[“Well, she had a slate-coloured, broad-brimmed straw hat, with a feather of…| 117250| 137242| 19992|
|[“That sounds a little paradoxical.” “But it is profoundly true. Singulari…| 137244| 157171| 19927|
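The three numeric columns in the rows above appear to be the begin offset, the end offset and the length of each chunk. That reading is an assumption, since the column header is not shown here, but the arithmetic is consistent with it. A small check, with the numbers copied from the rows above:

```python
# (begin, end, length) triples copied from the output rows above.
rows = [
    (19798, 39395, 19597),
    (39371, 59242, 19871),
    (59166, 77833, 18667),
    (77835, 97769, 19934),
    (97771, 117248, 19477),
    (117250, 137242, 19992),
    (137244, 157171, 19927),
]

# Under the assumed column meaning, length is end minus begin in every row.
lengths_match = all(end - begin == length for begin, end, length in rows)

# Positive values mean consecutive chunks share characters;
# negative values mean there is a gap between them.
overlaps = [
    previous_end - next_begin
    for (_, previous_end, _), (next_begin, _, _) in zip(rows, rows[1:])
]

print(lengths_match)  # True
print(overlaps)       # [24, 76, -2, -2, -2, -2]
```

The first two boundaries overlap by 24 and 76 characters. The remaining ones leave a two-character gap, which might correspond to a separator removed by trimming, although the rows alone do not confirm this.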
- setChunkSize(value)
Sets the size of each chunk of text.
- Parameters:
- value : int
Size of each chunk of text
- setChunkOverlap(value)
Sets the length of the overlap between text chunks, by default 0.
- Parameters:
- value : int
Length of the overlap between text chunks
- setSplitPatterns(value)
Sets the patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.
- Parameters:
- value : List[str]
Patterns to separate the text by in decreasing priority
- setPatternsAreRegex(value)
Sets whether to interpret the split patterns as regular expressions, by default False.
- Parameters:
- value : bool
Whether to interpret the split patterns as regular expressions
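The difference between literal and regex interpretation can be illustrated with Python's `re` module. This is only an analogy: the annotator's own regex dialect and matching rules may differ, and `compile_split_pattern` is a hypothetical helper.

```python
import re


def compile_split_pattern(pattern: str, patterns_are_regex: bool) -> re.Pattern:
    """Compile `pattern` as a regex, or escape it so it matches literally."""
    return re.compile(pattern if patterns_are_regex else re.escape(pattern))


literal_dot = compile_split_pattern(".", patterns_are_regex=False)
regex_whitespace = compile_split_pattern(r"\s+", patterns_are_regex=True)
literal_whitespace = compile_split_pattern(r"\s+", patterns_are_regex=False)

# A literal "." only splits at actual dots.
print(literal_dot.split("a.b.c"))             # ['a', 'b', 'c']
# As a regex, \s+ splits on any run of whitespace.
print(regex_whitespace.split("a  b\tc"))      # ['a', 'b', 'c']
# As a literal, the four characters \s+ never occur, so nothing is split.
print(literal_whitespace.split("a  b\tc"))    # ['a  b\tc']
```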
- setKeepSeparators(value)
Sets whether to keep the separators in the final result, by default True.
- Parameters:
- value : bool
Whether to keep the separators in the final result
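One practical consequence of keeping separators is that no characters are lost, so the pieces can be joined back into the original text. The sketch below illustrates that idea in plain Python. Where exactly the annotator attaches a kept separator is not stated here, so attaching it to the start of the following piece is an assumption, and `split_on` is a hypothetical helper.

```python
from typing import List


def split_on(text: str, separator: str, keep_separator: bool) -> List[str]:
    """Split on a non-empty literal separator, optionally keeping it."""
    parts = text.split(separator)
    if keep_separator:
        # Assumed convention: the separator stays with the piece that follows it.
        parts = [parts[0]] + [separator + part for part in parts[1:]]
    return [part for part in parts if part]


text = "first\n\nsecond\n\nthird"
kept = split_on(text, "\n\n", keep_separator=True)
dropped = split_on(text, "\n\n", keep_separator=False)

print(kept)                    # ['first', '\n\nsecond', '\n\nthird']
print(dropped)                 # ['first', 'second', 'third']
print("".join(kept) == text)   # True
```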