`sparknlp.annotator.document_character_text_splitter`#

Contains classes for the DocumentCharacterTextSplitter.

Module Contents#

Classes#

DocumentCharacterTextSplitter

Annotator which splits large documents into chunks of roughly the given size.

class DocumentCharacterTextSplitter[source]#

Annotator which splits large documents into chunks of roughly the given size.

DocumentCharacterTextSplitter takes a list of separators. It tries the separators in order and splits any subtext that is still longer than the chunk size, taking the optional overlap between chunks into account.

For example, with a chunk size of 20 and an overlap of 5, the input text below is split into the list of chunks that follows it:

"He was, I take it, the most perfect reasoning and observing machine that the world has seen."

["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
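The strategy described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Spark NLP's implementation: the helper names `merge_pieces` and `split_text` are hypothetical, overlap is carried over in whole pieces, and the chunks it produces need not match the library's output exactly.

```python
from typing import List, Sequence


def merge_pieces(pieces: List[str], joiner: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily join pieces into chunks of at most chunk_size characters,
    carrying trailing pieces over as overlap for the next chunk."""
    chunks: List[str] = []
    window: List[str] = []
    for piece in pieces:
        if window and len(joiner.join(window + [piece])) > chunk_size:
            chunks.append(joiner.join(window))
            # Drop leading pieces until the carried-over tail fits the overlap
            # budget and still leaves room for the new piece.
            while window and (
                len(joiner.join(window)) > chunk_overlap
                or len(joiner.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(joiner.join(window))
    return chunks


def split_text(
    text: str,
    chunk_size: int,
    chunk_overlap: int = 0,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split on the first separator found in the text, recursing with the
    remaining separators for any piece that is still too long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # The empty string is contained in every text, so it acts as the last resort.
    sep = next((s for s in separators if s in text), "")
    remaining = separators[separators.index(sep) + 1:] if sep in separators else ()
    pieces = list(text) if sep == "" else text.split(sep)

    chunks: List[str] = []
    small: List[str] = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            chunks.extend(merge_pieces(small, sep, chunk_size, chunk_overlap))
            small = []
            chunks.extend(split_text(piece, chunk_size, chunk_overlap, remaining))
    chunks.extend(merge_pieces(small, sep, chunk_size, chunk_overlap))
    # Trim whitespace and drop empty chunks.
    return [c.strip() for c in chunks if c.strip()]


sentence = (
    "He was, I take it, the most perfect reasoning and observing "
    "machine that the world has seen."
)
chunks = split_text(sentence, chunk_size=20, chunk_overlap=5)
print(chunks)
```

Every chunk in this sketch stays within 20 characters, and a chunk may begin with the last short piece of its predecessor when that piece fits the overlap budget.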

Additionally, you can set

custom patterns with setSplitPatterns
whether patterns should be interpreted as regex with setPatternsAreRegex
whether to keep the separators with setKeepSeparators
whether to trim whitespace with setTrimWhitespace
whether to explode the splits into individual rows with setExplodeSplits
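To make the difference between literal and regex patterns concrete, here is a small plain-Python sketch. The function `split_on` is hypothetical and only mimics the idea behind setPatternsAreRegex and setKeepSeparators. In particular, attaching a kept separator to the preceding piece is an assumption made for this illustration, not documented library behaviour.

```python
import re
from typing import List


def split_on(text: str, pattern: str, is_regex: bool = False, keep_separator: bool = True) -> List[str]:
    """Split text on pattern, treating it as a literal string unless is_regex is set."""
    # Escaping makes characters such as "." lose their special regex meaning.
    regex = pattern if is_regex else re.escape(pattern)
    pieces: List[str] = []
    start = 0
    for match in re.finditer(regex, text):
        if match.start() == match.end():
            continue  # ignore empty matches
        # Assumption: a kept separator stays at the end of the preceding piece.
        end = match.end() if keep_separator else match.start()
        pieces.append(text[start:end])
        start = match.end()
    pieces.append(text[start:])
    return [p for p in pieces if p]


print(split_on("a.b.c", "."))                         # ['a.', 'b.', 'c']
print(split_on("a.b.c", ".", keep_separator=False))   # ['a', 'b', 'c']
print(split_on("a1b22c", r"\d+", is_regex=True, keep_separator=False))  # ['a', 'b', 'c']
print(split_on("a.b.c", ".", is_regex=True, keep_separator=False))      # []
```

The last call shows why the flag matters: read as a regex, `.` matches every character, so nothing is left between the separators.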

For extended examples of usage, see the DocumentCharacterTextSplitterTest.

Input Annotation types	Output Annotation type
`DOCUMENT`	`DOCUMENT`

Parameters:

chunkSize: Size of each chunk of text.
chunkOverlap: Length of the overlap between text chunks, by default 0.
splitPatterns: Patterns to separate the text by, in decreasing priority, by default `["\n\n", "\n", " ", ""]`.
patternsAreRegex: Whether to interpret the split patterns as regular expressions, by default False.
keepSeparators: Whether to keep the separators in the final result, by default True.
explodeSplits: Whether to explode split chunks to separate rows, by default False.
trimWhitespace: Whether to trim whitespace from extracted chunks, by default True.
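As a quick reference, the sketch below sets each parameter through its setter, using the documented defaults where there is one. The wrapper function `build_splitter` and the chunk size of 1000 are illustrative assumptions. Calling the function assumes Spark NLP is installed and a Spark session has been started.

```python
def build_splitter():
    """Return a DocumentCharacterTextSplitter with every documented parameter set explicitly."""
    # Imported lazily so that defining this function does not need Spark NLP;
    # calling it does (assumption: a session was created with sparknlp.start()).
    from sparknlp.annotator import DocumentCharacterTextSplitter

    return (
        DocumentCharacterTextSplitter()
        .setInputCols(["document"])
        .setOutputCol("splits")
        .setChunkSize(1000)                         # illustrative value, no default is documented
        .setChunkOverlap(0)                         # documented default
        .setSplitPatterns(["\n\n", "\n", " ", ""])  # documented default, tried in order
        .setPatternsAreRegex(False)                 # documented default
        .setKeepSeparators(True)                    # documented default
        .setExplodeSplits(False)                    # documented default
        .setTrimWhitespace(True)                    # documented default
    )
```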

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> textDF = spark.read.text(
...    "sherlockholmes.txt",
...    wholetext=True
... ).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text")
>>> textSplitter = DocumentCharacterTextSplitter() \
...     .setInputCols(["document"]) \
...     .setOutputCol("splits") \
...     .setChunkSize(20000) \
...     .setChunkOverlap(200) \
...     .setExplodeSplits(True)
>>> pipeline = Pipeline().setStages([documentAssembler, textSplitter])
>>> result = pipeline.fit(textDF).transform(textDF)
>>> result.selectExpr(
...       "splits.result",
...       "splits[0].begin",
...       "splits[0].end",
...       "splits[0].end - splits[0].begin as length") \
...     .show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
|                                                                          result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...|          19798|        39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...|          39371|        59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....|          59166|        77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...|          77835|        97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...|          97771|       117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...|         117250|       137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...|         137244|       157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
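The begin and end columns can be used to check how the chunks relate to the configured chunk size (20000) and overlap (200). The short calculation below copies the first five rows of the output and uses the same `end - begin` arithmetic as the example's length column. Whether `end` is inclusive is not stated here, so treat the figures as accurate to within one character.

```python
# (begin, end) pairs copied from the first five rows of the output above.
spans = [(0, 19994), (19798, 39395), (39371, 59242), (59166, 77833), (77835, 97769)]
chunk_size, chunk_overlap = 20000, 200

lengths = [end - begin for begin, end in spans]
# How far each chunk's end reaches past the begin of the next chunk.
overlaps = [
    prev_end - next_begin
    for (_, prev_end), (next_begin, _) in zip(spans, spans[1:])
]

print(lengths)   # [19994, 19597, 19871, 18667, 19934]
print(overlaps)  # [196, 24, 76, -2]
```

In these rows the chunk size and overlap act as upper bounds rather than exact targets. The negative value marks a small gap between two chunks, which presumably corresponds to a separator or whitespace that was not kept in either chunk.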

inputAnnotatorTypes[source]#

outputAnnotatorType = 'document'[source]#

chunkSize[source]#

chunkOverlap[source]#

splitPatterns[source]#

patternsAreRegex[source]#

keepSeparators[source]#

explodeSplits[source]#

trimWhitespace[source]#

setChunkSize(value)[source]#

Sets the size of each chunk of text.

Parameters:

value (int): Size of each chunk of text.

setChunkOverlap(value)[source]#

Sets the length of the overlap between text chunks, by default 0.

Parameters:

value (int): Length of the overlap between text chunks.

setSplitPatterns(value)[source]#

Sets the patterns to separate the text by, in decreasing priority, by default `["\n\n", "\n", " ", ""]`.

Parameters:

value (List[str]): Patterns to separate the text by, in decreasing priority.

setPatternsAreRegex(value)[source]#

Sets whether to interpret the split patterns as regular expressions, by default False.

Parameters:

value (bool): Whether to interpret the split patterns as regular expressions.

setKeepSeparators(value)[source]#

Sets whether to keep the separators in the final result, by default True.

Parameters:

value (bool): Whether to keep the separators in the final result.

setExplodeSplits(value)[source]#

Sets whether to explode split chunks to separate rows, by default False.

Parameters:

value (bool): Whether to explode split chunks to separate rows.
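What exploding means can be shown with plain Python data structures. The helper `explode_splits` and the dictionary-based rows are hypothetical stand-ins for DataFrame rows, used only to illustrate the effect: one row holding several chunks becomes several rows holding one chunk each.

```python
from typing import Dict, List


def explode_splits(rows: List[Dict[str, List[str]]], column: str = "splits") -> List[Dict[str, List[str]]]:
    """Turn each row holding N chunks into N rows holding one chunk each."""
    return [{**row, column: [chunk]} for row in rows for chunk in row[column]]


rows = [{"splits": ["first chunk", "second chunk", "third chunk"]}]
exploded = explode_splits(rows)
print(exploded)
# [{'splits': ['first chunk']}, {'splits': ['second chunk']}, {'splits': ['third chunk']}]
```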

setTrimWhitespace(value)[source]#

Sets whether to trim whitespace from extracted chunks, by default True.

Parameters:

value (bool): Whether to trim whitespace from extracted chunks.
