sparknlp.annotator.token.regex_tokenizer#

Contains classes for the RegexTokenizer.

Module Contents#

Classes#

RegexTokenizer

A tokenizer that splits text by a regex pattern.

class RegexTokenizer[source]#

A tokenizer that splits text by a regex pattern.

The delimiting pattern is set with setPattern() and defines how the tokens are split. By default the pattern is \s+, which means tokens are split on one or more whitespace characters.

Input Annotation types

Output Annotation type

DOCUMENT

TOKEN

Parameters:
minLength

Minimum allowed length for each token, by default 1

maxLength

Maximum allowed length for each token

toLowercase

Whether to convert all characters to lowercase before tokenizing, by default False

pattern

Regex pattern used for tokenizing, by default \s+

positionalMask

Whether to use a positional mask to guarantee the incremental progression of the tokenization, by default False

trimWhitespace

Whether to remove whitespace from identified tokens, by default False

preservePosition

Whether to preserve the initial indexes of tokens before whitespace removal, by default True

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> regexTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setToLowercase(True)
>>> pipeline = Pipeline().setStages([
...       documentAssembler,
...       regexTokenizer
...     ])
>>> data = spark.createDataFrame([["This is my first sentence.\nThis is my second."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("regexToken.result").show(truncate=False)
+-------------------------------------------------------+
|result                                                 |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
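The delimiter does not have to be whitespace. As a further illustration (not part of the original example; it reuses the documentAssembler and spark session defined above, and the variable names are arbitrary), a pattern splitting on commas with optional surrounding whitespace could be configured like this:

>>> csvTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setPattern("\\s*,\\s*")
>>> csvPipeline = Pipeline().setStages([documentAssembler, csvTokenizer])
>>> csvData = spark.createDataFrame([["one, two ,three"]]).toDF("text")
>>> csvPipeline.fit(csvData).transform(csvData).selectExpr("regexToken.result").show(truncate=False)

With this pattern the expected tokens would be one, two and three, since the whitespace around the commas is consumed by the delimiter (output not shown here).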
setMinLength(value)[source]#

Sets the minimum allowed length for each token, by default 1.

Parameters:
value : int

Minimum allowed length for each token

setMaxLength(value)[source]#

Sets the maximum allowed length for each token.

Parameters:
value : int

Maximum allowed length for each token
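Taken together, setMinLength() and setMaxLength() act as a length filter on the produced tokens. A minimal sketch (illustrative only; it assumes the documentAssembler and data from the example above) that drops tokens shorter than 3 or longer than 10 characters:

>>> boundedTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setMinLength(3) \
...     .setMaxLength(10)
>>> boundedPipeline = Pipeline().setStages([documentAssembler, boundedTokenizer])
>>> boundedPipeline.fit(data).transform(data).selectExpr("regexToken.result").show(truncate=False)

In this configuration, short tokens such as is or my would be excluded from regexToken.result.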

setToLowercase(value)[source]#

Sets whether to convert all characters to lowercase before tokenizing, by default False.

Parameters:
value : bool

Whether to convert all characters to lowercase before tokenizing

setPattern(value)[source]#

Sets the regex pattern used for tokenizing, by default \s+.

Parameters:
value : str

Regex pattern used for tokenizing

setPositionalMask(value)[source]#

Sets whether to use a positional mask to guarantee the incremental progression of the tokenization, by default False.

Parameters:
value : bool

Whether to use a positional mask
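The positional mask influences how token begin and end indexes are computed, so its effect is easiest to inspect through the annotation fields. A minimal sketch (variable names are illustrative; it assumes the documentAssembler and data from the example above):

>>> maskedTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setPositionalMask(True)
>>> maskedPipeline = Pipeline().setStages([documentAssembler, maskedTokenizer])
>>> maskedPipeline.fit(data).transform(data) \
...     .selectExpr("explode(regexToken) as token") \
...     .selectExpr("token.begin", "token.end", "token.result") \
...     .show(truncate=False)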

setTrimWhitespace(value)[source]#

Sets whether to remove whitespace from identified tokens, by default False.

Parameters:
value : bool

Whether to remove whitespace from identified tokens

setPreservePosition(value)[source]#

Sets whether to preserve the initial indexes of tokens before whitespace removal, by default True.

Parameters:
value : bool

Whether to preserve initial token indexes before whitespace removal
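The last two flags interact: trimWhitespace removes whitespace from the identified tokens, while preservePosition keeps begin and end pointing at the original, untrimmed spans. A small sketch (illustrative only; paddedData and the other names are made up, and it assumes the documentAssembler and spark session from the example above):

>>> paddedData = spark.createDataFrame([["alpha ; beta ; gamma"]]).toDF("text")
>>> trimmedTokenizer = RegexTokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("regexToken") \
...     .setPattern(";") \
...     .setTrimWhitespace(True) \
...     .setPreservePosition(True)
>>> trimmedPipeline = Pipeline().setStages([documentAssembler, trimmedTokenizer])
>>> trimmedPipeline.fit(paddedData).transform(paddedData) \
...     .selectExpr("regexToken.result").show(truncate=False)

With preservePosition left at its default (True), begin and end would still refer to the spans that include the surrounding spaces; setting setPreservePosition(False) would instead shift them to the trimmed token boundaries.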