sparknlp.annotator.token.tokenizer#
Contains classes for the Tokenizer.
Module Contents#
Classes#
Tokenizer: Tokenizes raw text in document type columns into TokenizedSentence.
TokenizerModel: Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards.
- class Tokenizer[source]#
Tokenizes raw text in document type columns into TokenizedSentence.
This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.
Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs.
For extended examples of usage see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- targetPattern
Pattern to grab from text as token candidates, by default \S+
- prefixPattern
Regex with groups that begins with \A to match the target prefix, by default \A([^\s\w\$\.]*)
- suffixPattern
Regex with groups that ends with \z to match the target suffix, by default ([^\s\w]?)([^\s\w]*)\z
- infixPatterns
Regex patterns that match tokens within a single target. Groups identify different sub-tokens. Multiple defaults are provided.
- exceptions
Words that won’t be affected by tokenization rules
- exceptionsPath
Path to a file containing a list of exceptions
- caseSensitiveExceptions
Whether the exceptions are case sensitive, by default True
- contextChars
Character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"]
- splitPattern
Pattern to separate from the inside of tokens. Takes priority over splitChars.
- splitChars
Character list used to separate from the inside of tokens
- minLength
Minimum allowed length for each token, by default 0
- maxLength
Maximum allowed length for each token, by default 99999
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("token.result").show(truncate=False)
+-----------------------------------------------------------------------+
|output                                                                 |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
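A further, hedged sketch (not part of the original example) showing the length parameters in use before fitting; the thresholds are arbitrary and the data DataFrame and document column from the example above are assumed.
>>> # Sketch: discard tokens shorter than 2 or longer than 30 characters.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").setMinLength(2).setMaxLength(30)
>>> tokenizerModel = tokenizer.fit(data)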
- getInfixPatterns()[source]#
Gets regex patterns that match tokens within a single target. Groups identify different sub-tokens.
- Returns:
- List[str]
The infix patterns
- getSuffixPattern()[source]#
Gets regex with groups that ends with \z to match the target suffix.
- Returns:
- str
The suffix pattern
- getPrefixPattern()[source]#
Gets regex with groups that begins with \A to match the target prefix.
- Returns:
- str
The prefix pattern
- getContextChars()[source]#
Gets character list used to separate from token boundaries.
- Returns:
- List[str]
Character list used to separate from token boundaries
- getSplitChars()[source]#
Gets character list used to separate from the inside of tokens.
- Returns:
- List[str]
Character list used to separate from the inside of tokens
- setTargetPattern(value)[source]#
Sets pattern to grab from text as token candidates, by default \S+.
- Parameters:
- value : str
Pattern to grab from text as token candidates
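A brief, hedged sketch (not from the original reference) of overriding the target pattern; the comma-based regex is purely illustrative.
>>> # Sketch: grab runs of non-comma characters as token candidates
>>> # instead of the default whitespace-delimited \S+.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").setTargetPattern("[^,]+")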
- setPrefixPattern(value)[source]#
Sets regex with groups that begins with \A to match the target prefix, by default \A([^\s\w\$\.]*).
- Parameters:
- value : str
Regex with groups that begins with \A to match the target prefix
- setSuffixPattern(value)[source]#
Sets regex with groups that ends with \z to match the target suffix, by default ([^\s\w]?)([^\s\w]*)\z.
- Parameters:
- value : str
Regex with groups that ends with \z to match the target suffix
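A hedged sketch (assumed values, not library defaults) showing the two affix setters above used together; note that the required \A and \z anchors are kept.
>>> # Sketch: treat any leading or trailing non-word, non-space characters
>>> # as the prefix and suffix groups, respectively.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setPrefixPattern(r"\A([^\s\w]*)").setSuffixPattern(r"([^\s\w]*)\z")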
- setInfixPatterns(value)[source]#
Sets regex patterns that match tokens within a single target. Groups identify different sub-tokens.
- Parameters:
- value : List[str]
Regex patterns that match tokens within a single target
- addInfixPattern(value)[source]#
Adds an additional regex pattern that matches tokens within a single target. Groups identify different sub-tokens.
- Parameters:
- value : str
Regex pattern that matches tokens within a single target
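The following is an illustrative sketch (the regex is an assumption, not a library default) of adding a custom infix rule on top of the defaults.
>>> # Sketch: the capture groups split a hyphenated word such as "rule-based"
>>> # into the two word parts and the hyphen as separate sub-tokens.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.addInfixPattern(r"(\w+)(-)(\w+)")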
- setExceptions(value)[source]#
Sets words that won’t be affected by tokenization rules.
- Parameters:
- value : List[str]
Words that won’t be affected by tokenization rules
- getExceptions()[source]#
Gets words that won’t be affected by tokenization rules.
- Returns:
- List[str]
Words that won’t be affected by tokenization rules
- setExceptionsPath(path, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the path to a text file with a list of token exceptions.
- Parameters:
- path : str
Path to the source file
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
- addException(value)[source]#
Adds an additional word that won’t be affected by tokenization rules.
- Parameters:
- value : str
Additional word that won’t be affected by tokenization rules
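A minimal, hedged sketch combining the exception setters above; the example terms are assumptions chosen only for illustration.
>>> # Sketch: keep "New York" and "e.g." intact instead of splitting them.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setExceptions(["New York"]).addException("e.g.")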
- setCaseSensitiveExceptions(value)[source]#
Sets whether the exceptions are case sensitive, by default True.
- Parameters:
- value : bool
Whether the exceptions are case sensitive
- getCaseSensitiveExceptions()[source]#
Gets whether the exceptions are case sensitive.
- Returns:
- bool
Whether the exceptions are case sensitive
- setContextChars(value)[source]#
Sets character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"].
- Parameters:
- value : List[str]
Character list used to separate from token boundaries
- addContextChars(value)[source]#
Adds an additional character to the list used to separate from token boundaries.
- Parameters:
- value : str
Additional context character
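An illustrative sketch (the reduced character set is chosen arbitrarily) of overriding the boundary characters and then extending them.
>>> # Sketch: use a smaller set of boundary characters, then add '#' as well.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setContextChars([".", ",", "!", "?"]).addContextChars("#")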
- setSplitPattern(value)[source]#
Sets pattern to separate from the inside of tokens. Takes priority over splitChars.
- Parameters:
- value : str
Pattern used to separate from the inside of tokens
- setSplitChars(value)[source]#
Sets character list used to separate from the inside of tokens.
- Parameters:
- value : List[str]
Character list used to separate from the inside of tokens
- addSplitChars(value)[source]#
Adds an additional character to separate from the inside of tokens.
- Parameters:
- value : str
Additional character to separate from the inside of tokens
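A hedged sketch of the split settings; the characters are illustrative, and if a splitPattern were also set it would take priority over them.
>>> # Sketch: split inside tokens on hyphens and underscores.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setSplitChars(["-"]).addSplitChars("_")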
- class TokenizerModel(classname='com.johnsnowlabs.nlp.annotators.TokenizerModel', java_model=None)[source]#
Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs.
This class represents an already fitted Tokenizer.
See the main class Tokenizer for more examples of usage.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- splitPattern
Pattern to separate from the inside of tokens. Takes priority over splitChars.
- splitChars
Character list used to separate from the inside of tokens
- setSplitPattern(value)[source]#
Sets pattern to separate from the inside of tokens. Takes priority over splitChars.
- Parameters:
- value : str
Pattern used to separate from the inside of tokens
- setSplitChars(value)[source]#
Sets character list used to separate from the inside of tokens.
- Parameters:
- value : List[str]
Character list used to separate from the inside of tokens
- addSplitChars(value)[source]#
Adds an additional character to separate from the inside of tokens.
- Parameters:
- value : str
Additional character to separate from the inside of tokens
- static pretrained(name='token_rules', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "token_rules"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- TokenizerModel
The restored model
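A short, hedged sketch of loading the default pretrained rules into a pipeline; it assumes access to the Spark NLP model repository and reuses the column names from the Tokenizer example above.
>>> from sparknlp.base import DocumentAssembler
>>> from sparknlp.annotator import TokenizerModel
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> tokenizer = TokenizerModel.pretrained("token_rules", "en").setInputCols(["document"]).setOutputCol("token")
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer])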