sparknlp.annotator.token.regex_tokenizer#
Contains classes for the RegexTokenizer.
Module Contents#
Classes#
RegexTokenizer: A tokenizer that splits text by a regex pattern.
- class RegexTokenizer[source]#
A tokenizer that splits text by a regex pattern.
The delimiting pattern, i.e. how the tokens should be split, is set with setPattern(). By default this pattern is \s+, which means that tokens are split by one or more whitespace characters.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- minLength
Set the minimum allowed length for each token, by default 1
- maxLength
Set the maximum allowed length for each token
- toLowercase
Indicates whether to convert all characters to lowercase before tokenizing, by default False
- pattern
Regex pattern used for tokenizing, by default \s+
- positionalMask
Whether to use a positional mask to guarantee the incremental progression of the tokenization, by default False
- trimWhitespace
Whether to trim whitespace from identified tokens, by default False
- preservePosition
Whether to preserve the initial token indexes before any whitespace removal, by default True
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> regexTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setToLowercase(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     regexTokenizer
... ])
>>> data = spark.createDataFrame([["This is my first sentence.\nThis is my second."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("regexToken.result").show(truncate=False)
+-------------------------------------------------------+
|result                                                 |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
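For delimiters other than whitespace, the pattern can be changed with setPattern(). The following sketch is a hypothetical variation of the example above that splits on commas and any surrounding whitespace; the column name "csvToken" and the sample data are illustrative only.
>>> csvTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("csvToken") \
...     .setPattern(r"\s*,\s*")
>>> csvPipeline = Pipeline().setStages([documentAssembler, csvTokenizer])
>>> csvData = spark.createDataFrame([["one, two , three,four"]]).toDF("text")
>>> csvPipeline.fit(csvData).transform(csvData) \
...     .selectExpr("csvToken.result").show(truncate=False)
With this pattern each comma-separated field would come out as its own token, e.g. one, two, three and four.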
- setMinLength(value)[source]#
Sets the minimum allowed length for each token, by default 1.
- Parameters:
- valueint
Minimum allowed length for each token
- setMaxLength(value)[source]#
Sets the maximum allowed length for each token.
- Parameters:
- valueint
Maximum allowed length for each token
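As a sketch of how the two length bounds can be combined (the values 2 and 15 are arbitrary examples), tokens outside this range would be dropped from the output:
>>> boundedTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setMinLength(2) \
...     .setMaxLength(15)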
- setToLowercase(value)[source]#
Sets whether to convert all characters to lowercase before tokenizing, by default False.
- Parameters:
- valuebool
Whether to convert all characters to lowercase before tokenizing
- setPattern(value)[source]#
Sets the regex pattern used for tokenizing, by default \s+.
- Parameters:
- valuestr
Regex pattern used for tokenizing
- setPositionalMask(value)[source]#
Sets whether to use a positional mask to guarantee the incremental progression of the tokenization, by default False.
- Parameters:
- valuebool
Whether to use a positional mask
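A minimal sketch of enabling this flag, reusing the tokenizer configuration from the example above (only the setPositionalMask call is new):
>>> maskedTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setPositionalMask(True)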