sparknlp.annotator.token.recursive_tokenizer
Contains classes for the RecursiveTokenizer.
Module Contents
Classes
RecursiveTokenizer
: Tokenizes raw text recursively based on a handful of definable rules.
RecursiveTokenizerModel
: Instantiated model of the RecursiveTokenizer.
- class RecursiveTokenizer(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizer')
Tokenizes raw text recursively based on a handful of definable rules.
Unlike the Tokenizer, the RecursiveTokenizer operates based on these string-array parameters only:
prefixes
: Strings that will be split off when found at the beginning of a token.
suffixes
: Strings that will be split off when found at the end of a token.
infixes
: Strings that will be split off when found in the middle of a token.
whitelist
: Whitelist of strings that will not be split.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- prefixes
Strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n']
- suffixes
Strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"]
- infixes
Strings to be considered independent tokens when found in the middle of a word, by default ['\n', '(', ')']
- whitelist
Strings to be considered as single tokens, by default ["it's", "that's", "there's", "he's", "she's", "what's", "let's", "who's", "It's", "That's", "There's", "He's", "She's", "What's", "Let's", "Who's"]
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer
... ])
>>> data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
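The splitting rules can be overridden through the setter methods described below. A minimal sketch, assuming the same document column as in the pipeline above; the rule lists here are illustrative examples, not the library defaults:

>>> customTokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setPrefixes(["(", "[", "\"", "'"]) \
...     .setSuffixes([".", ",", "!", "?", ")", "]"])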
- setPrefixes(p)
Sets strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n'].
- Parameters:
- p : List[str]
Strings to be considered independent tokens when found at the beginning of a word
- setSuffixes(s)
Sets strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"].
- Parameters:
- s : List[str]
Strings to be considered independent tokens when found at the end of a word
- class RecursiveTokenizerModel(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizerModel', java_model=None)
Instantiated model of the RecursiveTokenizer.
This is the instantiated model of the RecursiveTokenizer. For training your own model, please see the documentation of that class.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- None
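A RecursiveTokenizerModel is not constructed directly; it is obtained by fitting a pipeline that contains a RecursiveTokenizer. A minimal sketch reusing the pipeline and data from the Examples above (the stage index assumes that two-stage pipeline):

>>> pipelineModel = pipeline.fit(data)
>>> tokenizerModel = pipelineModel.stages[-1]  # the fitted RecursiveTokenizerModel
>>> result = tokenizerModel.transform(documentAssembler.transform(data))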