sparknlp.annotator.token.recursive_tokenizer#
Contains classes for the RecursiveTokenizer.
Module Contents#
Classes#
RecursiveTokenizer
  Tokenizes raw text recursively based on a handful of definable rules.

RecursiveTokenizerModel
  Instantiated model of the RecursiveTokenizer.
- class RecursiveTokenizer(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizer')[source]#
Tokenizes raw text recursively based on a handful of definable rules.
Unlike the Tokenizer, the RecursiveTokenizer operates based on these string-array parameters only:

- prefixes: Strings that will be split off when found at the beginning of a token.
- suffixes: Strings that will be split off when found at the end of a token.
- infixes: Strings that will be split on when found in the middle of a token.
- whitelist: Whitelist of strings not to split.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- prefixes
Strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n']
- suffixes
Strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"]
- infixes
Strings to be considered independent tokens when found in the middle of a word, by default ['\n', '(', ')']
- whitelist
Strings to be considered as single tokens, by default ["it's", "that's", "there's", "he's", "she's", "what's", "let's", "who's", "It's", "That's", "There's", "He's", "She's", "What's", "Let's", "Who's"]
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer
... ])
>>> data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
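The splitting rules themselves can be overridden before fitting. As a minimal sketch (the rule values below are illustrative, not the defaults), the tokenizer from the example above could be configured with custom prefixes and suffixes:

>>> customTokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setPrefixes(["(", "[", "\"", "'"]) \
...     .setSuffixes([".", ",", "!", "?", ")", "]"])
>>> customPipeline = Pipeline().setStages([documentAssembler, customTokenizer])
>>> customResult = customPipeline.fit(data).transform(data)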
- setPrefixes(p)[source]#
Sets strings to be considered independent tokens when found at the beginning of a word, by default ["'", '"', '(', '[', '\n'].
- Parameters:
- pList[str]
Strings to be considered independent tokens when found at the beginning of a word
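For example (a sketch only; the prefix values shown are illustrative):

>>> tokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setPrefixes(["(", "[", "\"", "'", "\n"])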
- setSuffixes(s)[source]#
Sets strings to be considered independent tokens when found at the end of a word, by default ['.', ':', '%', ',', ';', '?', "'", '"', ')', ']', '\n', '!', "'s"].
- Parameters:
- sList[str]
Strings to be considered independent tokens when found at the end of a word
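For example (a sketch only; the suffix values shown are illustrative):

>>> tokenizer = RecursiveTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setSuffixes([".", ",", ";", "!", "?", ")", "]"])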
- class RecursiveTokenizerModel(classname='com.johnsnowlabs.nlp.annotators.RecursiveTokenizerModel', java_model=None)[source]#
Instantiated model of the RecursiveTokenizer.
This is the instantiated model of the RecursiveTokenizer. For training your own model, please see the documentation of that class.

Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
  - None
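As a sketch of where this class appears in practice (assuming the pipeline from the RecursiveTokenizer example above), fitting the pipeline produces the model as a stage of the resulting PipelineModel:

>>> pipelineModel = pipeline.fit(data)
>>> tokenizerModel = pipelineModel.stages[1]  # the fitted RecursiveTokenizerModel
>>> tokenizerModel.transform(data).select("token.result").show(truncate=False)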