sparknlp.annotator.token.tokenizer#

Contains classes for the Tokenizer.

Module Contents#

Classes#

Tokenizer

Tokenizes raw text in document type columns into TokenizedSentence.

TokenizerModel

Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards.

class Tokenizer[source]#

Tokenizes raw text in document type columns into TokenizedSentence.

This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.

Identifies tokens with tokenization open standards. A few rules can be customized if the defaults do not fit the user's needs.

For extended examples of usage see the Examples.

Input Annotation types: DOCUMENT

Output Annotation type: TOKEN

Parameters:
targetPattern

Pattern to grab from text as token candidates, by default \S+

prefixPattern

Regex with groups and begins with \A to match target prefix, by default \A([^\s\w\$\.]*)

suffixPattern

Regex with groups and ends with \z to match target suffix, by default ([^\s\w]?)([^\s\w]*)\z

infixPatterns

Regex patterns that match tokens within a single target. Groups identify different sub-tokens. Has multiple default patterns.

exceptions

Words that won’t be affected by tokenization rules

exceptionsPath

Path to file containing list of exceptions

caseSensitiveExceptions

Whether the exceptions are case sensitive, by default True

contextChars

Character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"]

splitPattern

Pattern to separate from the inside of tokens. Takes priority over splitChars.

splitChars

Character list used to separate from the inside of tokens

minLength

Set the minimum allowed length for each token, by default 0

maxLength

Set the maximum allowed length for each token, by default 99999

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("token.result").show(truncate=False)
+-----------------------------------------------------------------------+
|result                                                                 |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
getInfixPatterns()[source]#

Gets regex patterns that match tokens within a single target. Groups identify different sub-tokens.

Returns:
List[str]

The infix patterns

getSuffixPattern()[source]#

Gets regex with groups and ends with \z to match target suffix.

Returns:
str

The suffix pattern

getPrefixPattern()[source]#

Gets regex with groups and begins with \A to match target prefix.

Returns:
str

The prefix pattern

getContextChars()[source]#

Gets character list used to separate from token boundaries.

Returns:
List[str]

Character list used to separate from token boundaries

getSplitChars()[source]#

Gets character list used to separate from the inside of tokens.

Returns:
List[str]

Character list used to separate from the inside of tokens

setTargetPattern(value)[source]#

Sets pattern to grab from text as token candidates, by default \S+.

Parameters:
value : str

Pattern to grab from text as token candidates
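
A minimal sketch of a custom target pattern (assuming the Spark session and documentAssembler from the example above); the pattern "[a-zA-Z]+" is chosen purely for illustration and keeps only alphabetic runs as token candidates:

>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setTargetPattern("[a-zA-Z]+")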

setPrefixPattern(value)[source]#

Sets regex with groups and begins with \A to match target prefix, by default \A([^\s\w\$\.]*).

Parameters:
value : str

Regex with groups and begins with \A to match target prefix

setSuffixPattern(value)[source]#

Sets regex with groups and ends with \z to match target suffix, by default ([^\s\w]?)([^\s\w]*)\z.

Parameters:
value : str

Regex with groups and ends with \z to match target suffix

setInfixPatterns(value)[source]#

Sets regex patterns that match tokens within a single target. Groups identify different sub-tokens.

Parameters:
value : List[str]

Regex patterns that match tokens within a single target
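
As an illustrative sketch (the pattern below is an assumption for the example, not one of the library defaults), a single infix pattern with two groups can split negative contractions such as "didn't" into "did" and "n't":

>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setInfixPatterns(["(\\p{L}+)(n't\\b)"])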

addInfixPattern(value)[source]#

Adds an additional regex pattern that matches tokens within a single target. Groups identify different sub-tokens.

Parameters:
value : str

Regex pattern that matches tokens within a single target

setExceptions(value)[source]#

Sets words that won’t be affected by tokenization rules.

Parameters:
value : List[str]

Words that won’t be affected by tokenization rules
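
A minimal sketch (reusing the pipeline setup from the main example) of keeping selected terms untouched by the tokenization rules; the exception strings are chosen for illustration:

>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setExceptions(["New York", "e-mail"])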

getExceptions()[source]#

Gets words that won’t be affected by tokenization rules.

Returns:
List[str]

Words that won’t be affected by tokenization rules

setExceptionsPath(path, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets the path to a text file containing a list of token exceptions.

Parameters:
path : str

Path to the source file

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {"format": "text"}

addException(value)[source]#

Adds an additional word that won’t be affected by tokenization rules.

Parameters:
value : str

Additional word that won’t be affected by tokenization rules

setCaseSensitiveExceptions(value)[source]#

Sets whether the exceptions are case sensitive, by default True.

Parameters:
value : bool

Whether the exceptions are case sensitive

getCaseSensitiveExceptions()[source]#

Gets whether the exceptions are case sensitive.

Returns:
bool

Whether the exceptions are case sensitive

setContextChars(value)[source]#

Sets character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"].

Parameters:
value : List[str]

Character list used to separate from token boundaries
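
For illustration (a sketch, not a recommended configuration), the boundary characters can be reduced to plain sentence punctuation:

>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setContextChars([".", ",", ";", ":", "!", "?"])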

addContextChars(value)[source]#

Adds an additional character to the list used to separate from token boundaries.

Parameters:
value : str

Additional context character

setSplitPattern(value)[source]#

Sets pattern to separate from the inside of tokens. Takes priority over splitChars.

Parameters:
value : str

Pattern used to separate from the inside of tokens
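
A hedged sketch of splitting inside tokens with a regex; the hyphen pattern is an illustrative assumption:

>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setSplitPattern("-")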

setSplitChars(value)[source]#

Sets character list used to separate from the inside of tokens.

Parameters:
value : List[str]

Character list used to separate from the inside of tokens

addSplitChars(value)[source]#

Adds an additional character to separate from the inside of tokens.

Parameters:
value : str

Additional character to separate from the inside of tokens

setMinLength(value)[source]#

Sets the minimum allowed length for each token, by default 0.

Parameters:
value : int

Minimum allowed length for each token

setMaxLength(value)[source]#

Sets the maximum allowed length for each token, by default 99999.

Parameters:
value : int

Maximum allowed length for each token
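
A short sketch combining both length bounds (values chosen for illustration), dropping single-character tokens and anything longer than 30 characters:

>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token") \
...     .setMinLength(2) \
...     .setMaxLength(30)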

class TokenizerModel(classname='com.johnsnowlabs.nlp.annotators.TokenizerModel', java_model=None)[source]#

Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards. A few rules can be customized if the defaults do not fit the user's needs.

This class represents an already fitted Tokenizer.

See the main class Tokenizer for more examples of usage.

Input Annotation types: DOCUMENT

Output Annotation type: TOKEN

Parameters:
splitPattern

Pattern to separate from the inside of tokens. Takes priority over splitChars.

splitChars

Character list used to separate from the inside of tokens

setSplitPattern(value)[source]#

Sets pattern to separate from the inside of tokens. Takes priority over splitChars.

Parameters:
value : str

Pattern used to separate from the inside of tokens

setSplitChars(value)[source]#

Sets character list used to separate from the inside of tokens.

Parameters:
value : List[str]

Character list used to separate from the inside of tokens

addSplitChars(value)[source]#

Adds an additional character to separate from the inside of tokens.

Parameters:
value : str

Additional character to separate from the inside of tokens

static pretrained(name='token_rules', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "token_rules"

lang : str, optional

Language of the pretrained model, by default "en"

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
TokenizerModel

The restored model
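
A minimal sketch of loading the default pretrained rules and plugging the resulting model into a pipeline (assuming the documentAssembler and data from the Tokenizer example above):

>>> from sparknlp.annotator import TokenizerModel
>>> from pyspark.ml import Pipeline
>>> tokenizerModel = TokenizerModel.pretrained("token_rules", "en")
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizerModel]).fit(data)
>>> result = pipeline.transform(data)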