sparknlp.annotator.document_token_splitter
Contains classes for the DocumentTokenSplitter.
Module Contents
Classes
DocumentTokenSplitter: Annotator that splits large documents into smaller documents based on the number of tokens in the text.
- class DocumentTokenSplitter
- Annotator that splits large documents into smaller documents based on the number of tokens in the text.
Currently, DocumentTokenSplitter splits the text on whitespace to create the tokens. The number of these tokens is then used as a measure of the text length. In the future, other tokenization techniques will be supported.
For example, given 3 tokens and overlap 1:
He was, I take it, the most perfect reasoning and observing machine that the world has seen.
is split into:
["He was, I", "I take it,", "it, the most", "most perfect reasoning", "reasoning and observing", "observing machine that", "that the world", "world has seen."]
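The windowing in this example can be reproduced in plain Python. The following is a minimal sketch, not the Spark NLP implementation: the function name `split_by_tokens` and its arguments are hypothetical, and it assumes that each new chunk starts `num_tokens - token_overlap` tokens after the previous one and that splitting stops once a chunk reaches the last token.

```python
def split_by_tokens(text, num_tokens, token_overlap=0):
    """Split text into chunks of at most num_tokens whitespace tokens.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if not 0 <= token_overlap < num_tokens:
        raise ValueError("token_overlap must be in [0, num_tokens)")

    tokens = text.split()  # whitespace tokenization, as described above
    step = num_tokens - token_overlap  # assumed stride between chunk starts
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + num_tokens]))
        if start + num_tokens >= len(tokens):
            break  # this chunk already covers the final token
        start += step
    return chunks


sentence = ("He was, I take it, the most perfect reasoning and observing "
            "machine that the world has seen.")
for chunk in split_by_tokens(sentence, num_tokens=3, token_overlap=1):
    print(chunk)
```

Under these assumptions the loop prints the eight chunks listed above. Note that the sketch rejoins tokens with single spaces, so it does not preserve the original whitespace of the input.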
Additionally, you can set:
- whether to trim whitespace with setTrimWhitespace
- whether to explode the splits into individual rows with setExplodeSplits
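To make the two options concrete, here is a small plain-Python sketch of what trimming and exploding mean conceptually. It does not call Spark NLP; the row layout (`doc_id`, `splits`, `split`) and the helper names are assumptions chosen for illustration.

```python
def trim_chunks(chunks):
    """Strip leading and trailing whitespace from each chunk."""
    return [chunk.strip() for chunk in chunks]


def explode_rows(rows):
    """Turn one row holding a list of splits into one row per split."""
    return [
        {"doc_id": row["doc_id"], "split": split}
        for row in rows
        for split in row["splits"]
    ]


# One hypothetical input row whose splits still carry surrounding whitespace.
rows = [{"doc_id": 0, "splits": [" He was, I ", " I take it, "]}]

trimmed = [
    {"doc_id": row["doc_id"], "splits": trim_chunks(row["splits"])}
    for row in rows
]
exploded = explode_rows(trimmed)
print(exploded)
```

Without exploding, all splits of a document stay together in a single row; with exploding, each split becomes its own row, which is usually easier to pass to downstream per-chunk processing.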
For extended examples of usage, see the DocumentTokenSplitterTest.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- numTokens
Limit on the number of tokens in each text chunk.
- tokenOverlap
Length of the token overlap between text chunks, by default 0.
- explodeSplits
Whether to explode split chunks to separate rows, by default False.
- trimWhitespace
Whether to trim whitespaces of extracted chunks, by default True.
immense faculties and extraordinary powers…| 2950| 5707| 2757| 512|
|[but as I have changed my clothes I can’t imagine how you
deduce it. As to M…| 5659| 8483| 2824| 512|
|[quarters received. Be in your chamber then at that hour, and do
not take it…| 8427|11241| 2814| 512|
|[a pity
to miss it.”
“But your client–”
“Never mind him. I may want y…|11188|13970| 2782| 512|
|[person who employs me wishes his agent to be unknown to
you, and I may conf…|13918|16898| 2980| 512|
|[letters back.”
“Precisely so. But how–”
“Was there a secret marriage?…|16836|19744| 2908| 512|
|[seven hundred in
notes,” he said.
Holmes scribbled a receipt upon a shee…|19683|22551| 2868| 512|
- setNumTokens(value)
Sets the limit on the number of tokens in each text chunk.
- Parameters:
- value : int
Number of tokens in each text chunk
- setTokenOverlap(value)
Sets the length of the token overlap between text chunks, by default 0.
- Parameters:
- value : int
Length of the token overlap between text chunks
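When choosing an overlap, it can help to estimate how many chunks a document will produce. The sketch below is a back-of-the-envelope calculation, not a Spark NLP function: `expected_chunk_count` is a hypothetical helper, and it assumes consecutive chunks start `num_tokens - token_overlap` tokens apart, which is consistent with the 3-token, overlap-1 example in the class description (17 tokens, 8 chunks).

```python
def expected_chunk_count(total_tokens, num_tokens, token_overlap=0):
    """Estimate the number of chunks for a text of total_tokens tokens.

    Hypothetical helper for illustration only; assumes a stride of
    num_tokens - token_overlap between chunk starts.
    """
    if not 0 <= token_overlap < num_tokens:
        raise ValueError("token_overlap must be in [0, num_tokens)")
    if total_tokens <= 0:
        return 0
    step = num_tokens - token_overlap
    # Integer ceiling of (total_tokens - token_overlap) / step, at least 1.
    return max(1, (total_tokens - token_overlap + step - 1) // step)


print(expected_chunk_count(17, num_tokens=3, token_overlap=1))
print(expected_chunk_count(17, num_tokens=3, token_overlap=0))
```

A larger overlap shrinks the stride, so the same text yields more chunks and more repeated tokens across them.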