sparknlp.annotator.token.recursive_tokenizer
Contains classes for the RecursiveTokenizer.
Module Contents
Classes
RecursiveTokenizer
: Tokenizes raw text recursively based on a handful of definable rules.
RecursiveTokenizerModel
: Instantiated model of the RecursiveTokenizer.
- class RecursiveTokenizer(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizer')
Tokenizes raw text recursively based on a handful of definable rules.
Unlike the Tokenizer, the RecursiveTokenizer operates based on these string-array parameters only:
prefixes
: Strings that will be split off when found at the beginning of a token.
suffixes
: Strings that will be split off when found at the end of a token.
infixes
: Strings that will be split off when found in the middle of a token.
whitelist
: Whitelist of strings that will not be split.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- prefixes
Strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n']
- suffixes
Strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"]
- infixes
Strings to be considered independent tokens when found in the middle of a word, by default ['\n', '(', ')']
- whitelist
Strings to be considered as single tokens, by default ["it's", "that's", "there's", "he's", "she's", "what's", "let's", "who's", "It's", "That's", "There's", "He's", "She's", "What's", "Let's", "Who's"]
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer
... ])
>>> data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
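The splitting rules can be overridden through the setter methods described below. A minimal sketch, assuming the same document column as in the pipeline above; the rule lists here are illustrative examples, not the library defaults:

>>> customTokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setPrefixes(["(", "[", "\"", "'"]) \
...     .setSuffixes([".", ",", "!", "?", ")", "]"])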
- setPrefixes(p)
Sets strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n'].
- Parameters:
- p : List[str]
Strings to be considered independent tokens when found at the beginning of a word
- setSuffixes(s)
Sets strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"].
- Parameters:
- s : List[str]
Strings to be considered independent tokens when found at the end of a word
- class RecursiveTokenizerModel(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizerModel', java_model=None)
Instantiated model of the RecursiveTokenizer.
This is the instantiated model of the RecursiveTokenizer. For training your own model, please see the documentation of that class.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- None
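A RecursiveTokenizerModel is not constructed directly; it is obtained by fitting a pipeline that contains a RecursiveTokenizer. A minimal sketch reusing the pipeline and data from the Examples above (the stage index assumes that two-stage pipeline):

>>> pipelineModel = pipeline.fit(data)
>>> tokenizerModel = pipelineModel.stages[-1]  # the fitted RecursiveTokenizerModel
>>> result = tokenizerModel.transform(documentAssembler.transform(data))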