sparknlp.annotator.token.tokenizer#
Contains classes for the Tokenizer.
Module Contents#
Classes#
Tokenizer: Tokenizes raw text in document type columns into TokenizedSentence.
TokenizerModel: Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards.
- class Tokenizer[source]#
Tokenizes raw text in document type columns into TokenizedSentence.
This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.
Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs.
For extended examples of usage see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- targetPattern
Pattern to grab from text as token candidates, by default \S+
- prefixPattern
Regex with groups that begins with \A to match the target prefix, by default \A([^\s\w\$\.]*)
- suffixPattern
Regex with groups that ends with \z to match the target suffix, by default ([^\s\w]?)([^\s\w]*)\z
- infixPatterns
Regex patterns that match tokens within a single target. Groups identify different sub-tokens. Multiple defaults are provided.
- exceptions
Words that won’t be affected by tokenization rules
- exceptionsPath
Path to a file containing a list of exceptions
- caseSensitiveExceptions
Whether the exceptions are case sensitive, by default True
- contextChars
Character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"]
- splitPattern
Pattern to separate from the inside of tokens. Takes priority over splitChars.
- splitChars
Character list used to separate from the inside of tokens
- minLength
Minimum allowed length for each token, by default 0
- maxLength
Maximum allowed length for each token, by default 99999
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
>>> result = pipeline.transform(data)
>>> result.selectExpr("token.result").show(truncate=False)
+-----------------------------------------------------------------------+
|output                                                                 |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
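A further, hedged sketch (not part of the original example) showing the length parameters in use before fitting; the thresholds are arbitrary and the data DataFrame and document column from the example above are assumed.
>>> # Sketch: discard tokens shorter than 2 or longer than 30 characters.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").setMinLength(2).setMaxLength(30)
>>> tokenizerModel = tokenizer.fit(data)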
- getInfixPatterns()[source]#
Gets regex patterns that match tokens within a single target. Groups identify different sub-tokens.
- Returns:
- List[str]
The infix patterns
- getSuffixPattern()[source]#
Gets regex with groups that ends with \z to match the target suffix.
- Returns:
- str
The suffix pattern
- getPrefixPattern()[source]#
Gets regex with groups that begins with \A to match the target prefix.
- Returns:
- str
The prefix pattern
- getContextChars()[source]#
Gets character list used to separate from token boundaries.
- Returns:
- List[str]
Character list used to separate from token boundaries
- getSplitChars()[source]#
Gets character list used to separate from the inside of tokens.
- Returns:
- List[str]
Character list used to separate from the inside of tokens
- setTargetPattern(value)[source]#
Sets pattern to grab from text as token candidates, by default \S+.
- Parameters:
- value : str
Pattern to grab from text as token candidates
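A brief, hedged sketch (not from the original reference) of overriding the target pattern; the comma-based regex is purely illustrative.
>>> # Sketch: grab runs of non-comma characters as token candidates
>>> # instead of the default whitespace-delimited \S+.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").setTargetPattern("[^,]+")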
- setPrefixPattern(value)[source]#
Sets regex with groups that begins with \A to match the target prefix, by default \A([^\s\w\$\.]*).
- Parameters:
- value : str
Regex with groups that begins with \A to match the target prefix
- setSuffixPattern(value)[source]#
Sets regex with groups that ends with \z to match the target suffix, by default ([^\s\w]?)([^\s\w]*)\z.
- Parameters:
- value : str
Regex with groups that ends with \z to match the target suffix
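A hedged sketch (assumed values, not library defaults) showing the two affix setters above used together; note that the required \A and \z anchors are kept.
>>> # Sketch: treat any leading or trailing non-word, non-space characters
>>> # as the prefix and suffix groups, respectively.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setPrefixPattern(r"\A([^\s\w]*)").setSuffixPattern(r"([^\s\w]*)\z")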
- setInfixPatterns(value)[source]#
Sets regex patterns that match tokens within a single target. Groups identify different sub-tokens.
- Parameters:
- value : List[str]
Regex patterns that match tokens within a single target
- addInfixPattern(value)[source]#
Adds an additional regex pattern that matches tokens within a single target. Groups identify different sub-tokens.
- Parameters:
- value : str
Regex pattern that matches tokens within a single target
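The following is an illustrative sketch (the regex is an assumption, not a library default) of adding a custom infix rule on top of the defaults.
>>> # Sketch: the capture groups split a hyphenated word such as "rule-based"
>>> # into the two word parts and the hyphen as separate sub-tokens.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.addInfixPattern(r"(\w+)(-)(\w+)")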
- setExceptions(value)[source]#
Sets words that won’t be affected by tokenization rules.
- Parameters:
- value : List[str]
Words that won’t be affected by tokenization rules
- getExceptions()[source]#
Gets words that won’t be affected by tokenization rules.
- Returns:
- List[str]
Words that won’t be affected by tokenization rules
- setExceptionsPath(path, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the path to a text file with a list of token exceptions.
- Parameters:
- path : str
Path to the source file
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
- addException(value)[source]#
Adds an additional word that won’t be affected by tokenization rules.
- Parameters:
- value : str
Additional word that won’t be affected by tokenization rules
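A minimal, hedged sketch combining the exception setters above; the example terms are assumptions chosen only for illustration.
>>> # Sketch: keep "New York" and "e.g." intact instead of splitting them.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setExceptions(["New York"]).addException("e.g.")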
- setCaseSensitiveExceptions(value)[source]#
Sets whether the exceptions are case sensitive, by default True.
- Parameters:
- value : bool
Whether the exceptions are case sensitive
- getCaseSensitiveExceptions()[source]#
Gets whether the exceptions are case sensitive.
- Returns:
- bool
Whether the exceptions are case sensitive
- setContextChars(value)[source]#
Sets character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"].
- Parameters:
- value : List[str]
Character list used to separate from token boundaries
- addContextChars(value)[source]#
Adds an additional character to the list used to separate from token boundaries.
- Parameters:
- value : str
Additional context character
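An illustrative sketch (the reduced character set is chosen arbitrarily) of overriding the boundary characters and then extending them.
>>> # Sketch: use a smaller set of boundary characters, then add '#' as well.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setContextChars([".", ",", "!", "?"]).addContextChars("#")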
- setSplitPattern(value)[source]#
Sets pattern to separate from the inside of tokens. Takes priority over splitChars.
- Parameters:
- value : str
Pattern used to separate from the inside of tokens
- setSplitChars(value)[source]#
Sets character list used to separate from the inside of tokens.
- Parameters:
- value : List[str]
Character list used to separate from the inside of tokens
- addSplitChars(value)[source]#
Adds an additional character to separate from the inside of tokens.
- Parameters:
- value : str
Additional character to separate from the inside of tokens
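A hedged sketch of the split settings; the characters are illustrative, and if a splitPattern were also set it would take priority over them.
>>> # Sketch: split inside tokens on hyphens and underscores.
>>> tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
>>> tokenizer = tokenizer.setSplitChars(["-"]).addSplitChars("_")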
- class TokenizerModel(classname='com.johnsnowlabs.nlp.annotators.TokenizerModel', java_model=None)[source]#
Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs.
This class represents an already fitted Tokenizer.
See the main class Tokenizer for more examples of usage.
Input Annotation types: DOCUMENT
Output Annotation type: TOKEN
- Parameters:
- splitPattern
Pattern to separate from the inside of tokens. Takes priority over splitChars.
- splitChars
Character list used to separate from the inside of tokens
- setSplitPattern(value)[source]#
Sets pattern to separate from the inside of tokens. Takes priority over splitChars.
- Parameters:
- value : str
Pattern used to separate from the inside of tokens
- setSplitChars(value)[source]#
Sets character list used to separate from the inside of tokens.
- Parameters:
- value : List[str]
Character list used to separate from the inside of tokens
- addSplitChars(value)[source]#
Adds an additional character to separate from the inside of tokens.
- Parameters:
- value : str
Additional character to separate from the inside of tokens
- static pretrained(name='token_rules', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "token_rules"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- TokenizerModel
The restored model
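A short, hedged sketch of loading the default pretrained rules into a pipeline; it assumes access to the Spark NLP model repository and reuses the column names from the Tokenizer example above.
>>> from sparknlp.base import DocumentAssembler
>>> from sparknlp.annotator import TokenizerModel
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> tokenizer = TokenizerModel.pretrained("token_rules", "en").setInputCols(["document"]).setOutputCol("token")
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer])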