sparknlp.annotator.token.recursive_tokenizer

Contains classes for the RecursiveTokenizer.

Module Contents

Classes

RecursiveTokenizer

Tokenizes raw text recursively based on a handful of definable rules.

RecursiveTokenizerModel

Instantiated model of the RecursiveTokenizer.

class RecursiveTokenizer(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizer')

Tokenizes raw text recursively based on a handful of definable rules.

Unlike the Tokenizer, the RecursiveTokenizer operates only on the following string-array parameters:

  • prefixes: Strings that will be split off when found at the beginning of a token.

  • suffixes: Strings that will be split off when found at the end of a token.

  • infixes: Strings that will be split off when found in the middle of a token.

  • whitelist: Strings that will never be split.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT

Output Annotation type: TOKEN

Parameters:
prefixes

Strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n']

suffixes

Strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"]

infixes

Strings to be considered independent tokens when found in the middle of a word, by default ['\n', '(', ')']

whitelist

Strings to be considered as single tokens, by default ["it's", "that's", "there's", "he's", "she's", "what's", "let's", "who's", "It's", "That's", "There's", "He's", "She's", "What's", "Let's", "Who's"]

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer
... ])
>>> data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
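
The splitting rules can also be customized with the setters documented below. The following is a minimal sketch with purely illustrative rule values (not the defaults listed above), assuming the rule setters return the annotator and can be chained like the other setters:

>>> customTokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setPrefixes(["(", "[", "\""]) \
...     .setSuffixes([".", ",", "!", "?", ")", "]"]) \
...     .setInfixes(["-", "/"]) \
...     .setWhitelist(["e.g.", "i.e."])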
setPrefixes(p)

Sets strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n'].

Parameters:
p : List[str]

Strings to be considered independent tokens when found at the beginning of a word

setSuffixes(s)

Sets strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"].

Parameters:
s : List[str]

Strings to be considered independent tokens when found at the end of a word

setInfixes(i)

Sets strings to be considered independent tokens when found in the middle of a word, by default ['\n', '(', ')'].

Parameters:
i : List[str]

Strings to be considered independent tokens when found in the middle of a word


setWhitelist(w)

Sets strings to be considered as single tokens, by default ["it's", "that's", "there's", "he's", "she's", "what's", "let's", "who's", "It's", "That's", "There's", "He's", "She's", "What's", "Let's", "Who's"].

Parameters:
w : List[str]

Strings to be considered as single tokens

class RecursiveTokenizerModel(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizerModel', java_model=None)

Instantiated model of the RecursiveTokenizer.

This is the instantiated model of the RecursiveTokenizer. For training your own model, please see the documentation of that class.
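
The model is typically obtained from a fitted pipeline rather than constructed directly. A minimal sketch, reusing the pipeline and data from the example above:

>>> pipelineModel = pipeline.fit(data)
>>> tokenizerModel = pipelineModel.stages[-1]  # the fitted RecursiveTokenizerModel stage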

Input Annotation types: DOCUMENT

Output Annotation type: TOKEN

Parameters:
None