sparknlp.annotator.spell_check.context_spell_checker#

Contains classes for the ContextSpellChecker.

Module Contents#

Classes#

ContextSpellCheckerApproach

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

ContextSpellCheckerModel

Implements a deep-learning based Noisy Channel Model Spell Algorithm.

class ContextSpellCheckerApproach[source]#

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

Correction candidates are extracted combining context information and word information.

For instantiated/pretrained models, see ContextSpellCheckerModel.

Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

  1. Different correction candidates for each word (word level).

  2. The surrounding text of each word, i.e. its context (sentence level).

  3. The relative cost of different correction candidates, according to the edit operations at the character level they require (subword level).

For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language and the Examples.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
languageModelClasses

Number of classes to use during factorization of the softmax output in the LM.

wordMaxDistance

Maximum distance for the generated candidates for every word.

maxCandidates

Maximum number of candidates for every word.

caseStrategy

What case combinations to try when generating candidates, by default 2. Possible values are:

  • 0: All uppercase letters

  • 1: First letter capitalized

  • 2: All letters

errorThreshold

Threshold perplexity for a word to be considered as an error.

epochs

Number of epochs to train the language model.

batchSize

Batch size for training the neural language model.

initialRate

Initial learning rate for the LM.

finalRate

Final learning rate for the LM.

validationFraction

Percentage of datapoints to use for validation.

minCount

Min number of times a token should appear to be included in vocab.

compoundCount

Min number of times a compound word should appear to be included in vocab.

classCount

Min number of times a word needs to appear in the corpus to not be considered part of a special class.

tradeoff

Tradeoff between the cost of a word error and a transition in the language model.

weightedDistPath

The path to the file containing the weights for the levenshtein distance.

maxWindowLen

Maximum size for the window used to remember history prior to every correction.

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

maxSentLen

Maximum length for a sentence - internal use during training.

graphFolder

Folder path that contains external graph files.

See also

NorvigSweetingApproach, SymmetricDeleteApproach

For alternative approaches to spell checking

References

For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

For this example, we use the first Sherlock Holmes book as the training dataset.

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .setWordMaxDistance(3) \
...     .setBatchSize(24) \
...     .setEpochs(8) \
...     .setLanguageModelClasses(1650)  # dependent on vocabulary size
...     # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path) \
...     .toDF("text")
>>> pipelineModel = pipeline.fit(dataset)
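
The fitted PipelineModel can then be applied to data in the usual Spark ML way. A minimal sketch, using the column names defined above:

>>> corrected = pipelineModel.transform(dataset)
>>> corrected.select("corrected.result").show(truncate=False)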
setLanguageModelClasses(count)[source]#

Sets number of classes to use during factorization of the softmax output in the Language Model.

Parameters:
count : int

Number of classes

setWordMaxDistance(dist)[source]#

Sets maximum distance for the generated candidates for every word.

Parameters:
dist : int

Maximum distance for the generated candidates for every word

setMaxCandidates(candidates)[source]#

Sets maximum number of candidates for every word.

Parameters:
candidates : int

Maximum number of candidates for every word.

setCaseStrategy(strategy)[source]#

Sets what case combinations to try when generating candidates.

Possible values are:

  • 0: All uppercase letters

  • 1: First letter capitalized

  • 2: All letters

Parameters:
strategy : int

Case combinations to try when generating candidates

setErrorThreshold(threshold)[source]#

Sets threshold perplexity for a word to be considered as an error.

Parameters:
threshold : float

Threshold perplexity for a word to be considered as an error

setEpochs(count)[source]#

Sets number of epochs to train the language model.

Parameters:
count : int

Number of epochs

setBatchSize(size)[source]#

Sets batch size.

Parameters:
size : int

Batch size

setInitialRate(rate)[source]#

Sets initial learning rate for the LM.

Parameters:
rate : float

Initial learning rate for the LM

setFinalRate(rate)[source]#

Sets final learning rate for the LM.

Parameters:
rate : float

Final learning rate for the LM

setValidationFraction(fraction)[source]#

Sets percentage of datapoints to use for validation.

Parameters:
fraction : float

Percentage of datapoints to use for validation

setMinCount(count)[source]#

Sets min number of times a token should appear to be included in vocab.

Parameters:
count : float

Min number of times a token should appear to be included in vocab

setCompoundCount(count)[source]#

Sets min number of times a compound word should appear to be included in vocab.

Parameters:
count : int

Min number of times a compound word should appear to be included in vocab.

setClassCount(count)[source]#

Sets the min number of times a word needs to appear in the corpus to not be considered part of a special class.

Parameters:
count : float

Min number of times a word needs to appear in the corpus to not be considered part of a special class.

setTradeoff(alpha)[source]#

Sets tradeoff between the cost of a word error and a transition in the language model.

Parameters:
alpha : float

Tradeoff between the cost of a word error and a transition in the language model

setWeightedDistPath(path)[source]#

Sets the path to the file containing the weights for the levenshtein distance.

Parameters:
path : str

Path to the file containing the weights for the levenshtein distance.

setMaxWindowLen(length)[source]#

Sets the maximum size for the window used to remember history prior to every correction.

Parameters:
length : int

Maximum size for the window used to remember history prior to every correction

setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setGraphFolder(path)[source]#

Sets the folder path that contains external graph files.

Parameters:
path : str

Folder path that contains external graph files.

setMaxSentLen(sentlen)[source]#

Sets the maximum length of a sentence.

Parameters:
sentlen : int

Maximum length of a sentence

addVocabClass(label, vocab, userdist=3)[source]#

Adds a new class of words to correct, based on a vocabulary.

Parameters:
label : str

Name of the class

vocab : List[str]

Vocabulary as a list

userdist : int, optional

Maximal distance to the word, by default 3
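
For example, an extra class of proper names could be registered on the approach before fitting, as hinted at in the training example above. A minimal sketch; the label and vocabulary list are purely illustrative:

>>> names = ["Arthur", "Conan", "Doyle"]
>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .addVocabClass("_NAME_", names)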

addRegexClass(label, regex, userdist=3)[source]#

Adds a new class of words to correct, based on regex.

Parameters:
label : str

Name of the class

regex : str

Regex to add

userdist : int, optional

Maximal distance to the word, by default 3
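
Similarly, a regex-based class could be added. A sketch; the label and pattern are illustrative only:

>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .addRegexClass("_DATE_", "([0-9]{2})/([0-9]{2})/([0-9]{4})")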

class ContextSpellCheckerModel(classname='com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerModel', java_model=None)[source]#

Implements a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.

Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

  1. Different correction candidates for each word (word level).

  2. The surrounding text of each word, i.e. its context (sentence level).

  3. The relative cost of different correction candidates, according to the edit operations at the character level they require (subword level).

This is the instantiated model of the ContextSpellCheckerApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> spellChecker = ContextSpellCheckerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("checked")

The default model is "spellcheck_dl", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
wordMaxDistance

Maximum distance for the generated candidates for every word.

maxCandidates

Maximum number of candidates for every word.

caseStrategy

What case combinations to try when generating candidates.

errorThreshold

Threshold perplexity for a word to be considered as an error.

tradeoff

Tradeoff between the cost of a word error and a transition in the language model.

maxWindowLen

Maximum size for the window used to remember history prior to every correction.

gamma

Controls the influence of individual word frequency in the decision.

correctSymbols

Whether to correct special symbols or skip spell checking for them

compareLowcase

If true, tokens will be compared in lowercase with the vocabulary.

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

vocabFreq

Frequencies of words from the vocabulary.

idsVocab

Mapping of ids to vocabulary.

vocabIds

Mapping of vocabulary to ids.

classes

Classes the spell checker recognizes.

weights

Levenshtein weights.

useNewLines

When set to true, new lines will be treated as any other character. When set to false, correction is applied on paragraphs as defined by newline characters.

See also

NorvigSweetingModel, SymmetricDeleteModel

For alternative approaches to spell checking

References

For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("doc")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["doc"]) \
...     .setOutputCol("token")
>>> spellChecker = ContextSpellCheckerModel \
...     .pretrained() \
...     .setTradeoff(12.0) \
...     .setInputCols("token") \
...     .setOutputCol("checked")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["It was a cold , dreary day and the country was white with smow ."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("checked.result").show(truncate=False)
+--------------------------------------------------------------------------------+
|result                                                                          |
+--------------------------------------------------------------------------------+
|[It, was, a, cold, ,, dreary, day, and, the, country, was, white, with, snow, .]|
+--------------------------------------------------------------------------------+
setWordMaxDistance(dist)[source]#

Sets maximum distance for the generated candidates for every word.

Parameters:
dist : int

Maximum distance for the generated candidates for every word.

setMaxCandidates(candidates)[source]#

Sets maximum number of candidates for every word.

Parameters:
candidates : int

Maximum number of candidates for every word.

setCaseStrategy(strategy)[source]#

Sets what case combinations to try when generating candidates.

Parameters:
strategy : int

Case combinations to try when generating candidates.

setErrorThreshold(threshold)[source]#

Sets threshold perplexity for a word to be considered as an error.

Parameters:
threshold : float

Threshold perplexity for a word to be considered as an error

setTradeoff(alpha)[source]#

Sets tradeoff between the cost of a word error and a transition in the language model.

Parameters:
alpha : float

Tradeoff between the cost of a word error and a transition in the language model

setWeights(weights)[source]#

Sets weights of each word for Levenshtein distance.

Parameters:
weights : Dict[str, float]

Weights for Levenshtein distance as a mapping.

setMaxWindowLen(length)[source]#

Sets the maximum size for the window used to remember history prior to every correction.

Parameters:
length : int

Maximum size for the window used to remember history prior to every correction

setGamma(g)[source]#

Sets the influence of individual word frequency in the decision.

Parameters:
g : float

Controls the influence of individual word frequency in the decision.

setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setVocabFreq(value: dict)[source]#

Sets frequencies of words from the vocabulary.

Parameters:
value : dict

Frequencies of words from the vocabulary.

setIdsVocab(idsVocab: dict)[source]#

Sets mapping of ids to vocabulary.

Parameters:
idsVocab : dict

Mapping of ids to vocabulary.

setVocabIds(vocabIds: dict)[source]#

Sets mapping of vocabulary to ids.

Parameters:
vocabIds : dict

Mapping of vocabulary to ids.

setClasses(value)[source]#

Sets classes the spell checker recognizes.

Parameters:
value : list

Classes the spell checker recognizes.

getWordClasses()[source]#

Gets the classes of words to be corrected.

Returns:
List[str]

Classes of words to be corrected
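
Assuming a pretrained model loaded as in the example above, the registered classes can be inspected like this (the exact classes depend on the model):

>>> spellChecker = ContextSpellCheckerModel.pretrained() \
...     .setInputCols("token") \
...     .setOutputCol("checked")
>>> classes = spellChecker.getWordClasses()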

updateRegexClass(label, regex)[source]#

Updates an existing class to correct, based on a regex.

Parameters:
label : str

Label of the class

regex : str

Regex to parse the class
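
A sketch of updating a regex class on a loaded model; the class label and pattern are assumed for illustration:

>>> spellChecker.updateRegexClass("_DATE_", "([0-9]{2})/([0-9]{2})/([0-9]{4})")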

updateVocabClass(label, vocab, append=True)[source]#

Updates an existing class to correct, based on a vocabulary.

Parameters:
label : str

Label of the class

vocab : List[str]

Vocabulary as a list

append : bool, optional

Whether to append to the existing vocabulary, by default True
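
A sketch of extending an existing vocabulary class on a loaded model; the label and entries are assumed for illustration:

>>> spellChecker.updateVocabClass("_NAME_", ["Irene", "Adler"], append=True)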

setCorrectSymbols(value)[source]#

Sets whether to correct special symbols or skip spell checking for them.

Parameters:
value : bool

Whether to correct special symbols or skip spell checking for them

setCompareLowcase(value)[source]#

Sets whether to compare tokens in lower case with vocabulary.

Parameters:
value : bool

Whether to compare tokens in lower case with vocabulary.

static pretrained(name='spellcheck_dl', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “spellcheck_dl”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
ContextSpellCheckerModel

The restored model
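
The same call with its arguments spelled out, using the default model name and language documented above:

>>> spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "en") \
...     .setInputCols("token") \
...     .setOutputCol("checked")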