sparknlp.annotator.spell_check.context_spell_checker#

Contains classes for the ContextSpellChecker.

Module Contents#

Classes#

ContextSpellCheckerApproach

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

ContextSpellCheckerModel

Implements a deep-learning based Noisy Channel Model Spell Algorithm.

class ContextSpellCheckerApproach[source]#

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

Correction candidates are extracted combining context information and word information.

For instantiated/pretrained models, see ContextSpellCheckerModel.

Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

  1. Different correction candidates for each word (word level).

  2. The surrounding text of each word, i.e. its context (sentence level).

  3. The relative cost of different correction candidates, according to the edit operations at the character level they require (subword level).

For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language and the Examples.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
languageModelClasses

Number of classes to use during factorization of the softmax output in the LM.

wordMaxDistance

Maximum distance for the generated candidates for every word.

maxCandidates

Maximum number of candidates for every word.

caseStrategy

What case combinations to try when generating candidates, by default 2. Possible values are:

  • 0: All uppercase letters

  • 1: First letter capitalized

  • 2: All letters

errorThreshold

Threshold perplexity for a word to be considered as an error.

epochs

Number of epochs to train the language model.

batchSize

Batch size for training the neural language model.

initialRate

Initial learning rate for the LM.

finalRate

Final learning rate for the LM.

validationFraction

Percentage of datapoints to use for validation.

minCount

Min number of times a token should appear to be included in vocab.

compoundCount

Min number of times a compound word should appear to be included in vocab.

classCount

Min number of times a word needs to appear in the corpus to not be considered part of a special class.

tradeoff

Tradeoff between the cost of a word error and a transition in the language model.

weightedDistPath

The path to the file containing the weights for the levenshtein distance.

maxWindowLen

Maximum size for the window used to remember history prior to every correction.

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

maxSentLen

Maximum length for a sentence - internal use during training.

graphFolder

Folder path that contains external graph files.

See also

NorvigSweetingApproach, SymmetricDeleteApproach

For alternative approaches to spell checking

References

For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

For this example, we use the first Sherlock Holmes book as the training dataset.

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .setWordMaxDistance(3) \
...     .setBatchSize(24) \
...     .setEpochs(8) \
...     .setLanguageModelClasses(1650)  # dependent on vocabulary size
...     # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path) \
...     .toDF("text")
>>> pipelineModel = pipeline.fit(dataset)
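
The fitted PipelineModel can then be applied to data in the usual Spark ML way. A minimal sketch, using the column names defined above:

>>> corrected = pipelineModel.transform(dataset)
>>> corrected.select("corrected.result").show(truncate=False)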
setLanguageModelClasses(count)[source]#

Sets number of classes to use during factorization of the softmax output in the Language Model.

Parameters:
count : int

Number of classes

setWordMaxDistance(dist)[source]#

Sets maximum distance for the generated candidates for every word.

Parameters:
dist : int

Maximum distance for the generated candidates for every word

setMaxCandidates(candidates)[source]#

Sets maximum number of candidates for every word.

Parameters:
candidates : int

Maximum number of candidates for every word.

setCaseStrategy(strategy)[source]#

Sets what case combinations to try when generating candidates.

Possible values are:

  • 0: All uppercase letters

  • 1: First letter capitalized

  • 2: All letters

Parameters:
strategy : int

Case combinations to try when generating candidates

setErrorThreshold(threshold)[source]#

Sets threshold perplexity for a word to be considered as an error.

Parameters:
threshold : float

Threshold perplexity for a word to be considered as an error

setEpochs(count)[source]#

Sets number of epochs to train the language model.

Parameters:
count : int

Number of epochs

setBatchSize(size)[source]#

Sets batch size.

Parameters:
size : int

Batch size

setInitialRate(rate)[source]#

Sets initial learning rate for the LM.

Parameters:
rate : float

Initial learning rate for the LM

setFinalRate(rate)[source]#

Sets final learning rate for the LM.

Parameters:
rate : float

Final learning rate for the LM

setValidationFraction(fraction)[source]#

Sets percentage of datapoints to use for validation.

Parameters:
fraction : float

Percentage of datapoints to use for validation

setMinCount(count)[source]#

Sets min number of times a token should appear to be included in vocab.

Parameters:
count : float

Min number of times a token should appear to be included in vocab

setCompoundCount(count)[source]#

Sets min number of times a compound word should appear to be included in vocab.

Parameters:
count : int

Min number of times a compound word should appear to be included in vocab.

setClassCount(count)[source]#

Sets the min number of times a word needs to appear in the corpus to not be considered part of a special class.

Parameters:
count : float

Min number of times a word needs to appear in the corpus to not be considered part of a special class.

setTradeoff(alpha)[source]#

Sets tradeoff between the cost of a word error and a transition in the language model.

Parameters:
alpha : float

Tradeoff between the cost of a word error and a transition in the language model

setWeightedDistPath(path)[source]#

Sets the path to the file containing the weights for the levenshtein distance.

Parameters:
path : str

Path to the file containing the weights for the levenshtein distance.

setMaxWindowLen(length)[source]#

Sets the maximum size for the window used to remember history prior to every correction.

Parameters:
length : int

Maximum size for the window used to remember history prior to every correction

setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setGraphFolder(path)[source]#

Sets the folder path that contains external graph files.

Parameters:
path : str

Folder path that contains external graph files.

setMaxSentLen(sentlen)[source]#

Sets the maximum length of a sentence.

Parameters:
sentlen : int

Maximum length of a sentence

addVocabClass(label, vocab, userdist=3)[source]#

Adds a new class of words to correct, based on a vocabulary.

Parameters:
label : str

Name of the class

vocab : List[str]

Vocabulary as a list

userdist : int, optional

Maximal distance to the word, by default 3
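
For example, an extra class of proper names could be registered on the approach before fitting, as hinted at in the training example above. A minimal sketch; the label and vocabulary list are purely illustrative:

>>> names = ["Arthur", "Conan", "Doyle"]
>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .addVocabClass("_NAME_", names)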

addRegexClass(label, regex, userdist=3)[source]#

Adds a new class of words to correct, based on regex.

Parameters:
label : str

Name of the class

regex : str

Regex to add

userdist : int, optional

Maximal distance to the word, by default 3
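
Similarly, a regex-based class could be added. A sketch; the label and pattern are illustrative only:

>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .addRegexClass("_DATE_", "([0-9]{2})/([0-9]{2})/([0-9]{4})")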

class ContextSpellCheckerModel(classname='com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerModel', java_model=None)[source]#

Implements a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.

Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

  1. Different correction candidates for each word (word level).

  2. The surrounding text of each word, i.e. its context (sentence level).

  3. The relative cost of different correction candidates, according to the edit operations at the character level they require (subword level).

This is the instantiated model of the ContextSpellCheckerApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> spellChecker = ContextSpellCheckerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("checked")

The default model is "spellcheck_dl", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
wordMaxDistance

Maximum distance for the generated candidates for every word.

maxCandidates

Maximum number of candidates for every word.

caseStrategy

What case combinations to try when generating candidates.

errorThreshold

Threshold perplexity for a word to be considered as an error.

tradeoff

Tradeoff between the cost of a word error and a transition in the language model.

maxWindowLen

Maximum size for the window used to remember history prior to every correction.

gamma

Controls the influence of individual word frequency in the decision.

correctSymbols

Whether to correct special symbols or skip spell checking for them

compareLowcase

If true, tokens will be compared in lowercase with the vocabulary.

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

vocabFreq

Frequencies of words from the vocabulary.

idsVocab

Mapping of ids to vocabulary.

vocabIds

Mapping of vocabulary to ids.

classes

Classes the spell checker recognizes.

weights

Levenshtein weights.

useNewLines

When set to true, new lines will be treated as any other character. When set to false, correction is applied on paragraphs as defined by newline characters.

See also

NorvigSweetingModel, SymmetricDeleteModel

For alternative approaches to spell checking

References

For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("doc")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["doc"]) \
...     .setOutputCol("token")
>>> spellChecker = ContextSpellCheckerModel \
...     .pretrained() \
...     .setTradeoff(12.0) \
...     .setInputCols("token") \
...     .setOutputCol("checked")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["It was a cold , dreary day and the country was white with smow ."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("checked.result").show(truncate=False)
+--------------------------------------------------------------------------------+
|result                                                                          |
+--------------------------------------------------------------------------------+
|[It, was, a, cold, ,, dreary, day, and, the, country, was, white, with, snow, .]|
+--------------------------------------------------------------------------------+
setWordMaxDistance(dist)[source]#

Sets maximum distance for the generated candidates for every word.

Parameters:
dist : int

Maximum distance for the generated candidates for every word.

setMaxCandidates(candidates)[source]#

Sets maximum number of candidates for every word.

Parameters:
candidates : int

Maximum number of candidates for every word.

setCaseStrategy(strategy)[source]#

Sets what case combinations to try when generating candidates.

Parameters:
strategy : int

Case combinations to try when generating candidates.

setErrorThreshold(threshold)[source]#

Sets threshold perplexity for a word to be considered as an error.

Parameters:
threshold : float

Threshold perplexity for a word to be considered as an error

setTradeoff(alpha)[source]#

Sets tradeoff between the cost of a word error and a transition in the language model.

Parameters:
alpha : float

Tradeoff between the cost of a word error and a transition in the language model

setWeights(weights)[source]#

Sets weights of each word for Levenshtein distance.

Parameters:
weights : Dict[str, float]

Weights for Levenshtein distance as a mapping.

setMaxWindowLen(length)[source]#

Sets the maximum size for the window used to remember history prior to every correction.

Parameters:
length : int

Maximum size for the window used to remember history prior to every correction

setGamma(g)[source]#

Sets the influence of individual word frequency in the decision.

Parameters:
g : float

Controls the influence of individual word frequency in the decision.

setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setVocabFreq(value: dict)[source]#

Sets frequencies of words from the vocabulary.

Parameters:
value : dict

Frequencies of words from the vocabulary.

setIdsVocab(idsVocab: dict)[source]#

Sets mapping of ids to vocabulary.

Parameters:
idsVocab : dict

Mapping of ids to vocabulary.

setVocabIds(vocabIds: dict)[source]#

Sets mapping of vocabulary to ids.

Parameters:
vocabIds : dict

Mapping of vocabulary to ids.

setClasses(value)[source]#

Sets classes the spell checker recognizes.

Parameters:
value : list

Classes the spell checker recognizes.

getWordClasses()[source]#

Gets the classes of words to be corrected.

Returns:
List[str]

Classes of words to be corrected
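
Assuming a pretrained model loaded as in the example above, the registered classes can be inspected like this (the exact classes depend on the model):

>>> spellChecker = ContextSpellCheckerModel.pretrained() \
...     .setInputCols("token") \
...     .setOutputCol("checked")
>>> classes = spellChecker.getWordClasses()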

updateRegexClass(label, regex)[source]#

Updates an existing class to correct, based on a regex.

Parameters:
label : str

Label of the class

regex : str

Regex to parse the class
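
A sketch of updating a regex class on a loaded model; the class label and pattern are assumed for illustration:

>>> spellChecker.updateRegexClass("_DATE_", "([0-9]{2})/([0-9]{2})/([0-9]{4})")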

updateVocabClass(label, vocab, append=True)[source]#

Updates an existing class to correct, based on a vocabulary.

Parameters:
label : str

Label of the class

vocab : List[str]

Vocabulary as a list

append : bool, optional

Whether to append to the existing vocabulary, by default True
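
A sketch of extending an existing vocabulary class on a loaded model; the label and entries are assumed for illustration:

>>> spellChecker.updateVocabClass("_NAME_", ["Irene", "Adler"], append=True)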

setCorrectSymbols(value)[source]#

Sets whether to correct special symbols or skip spell checking for them.

Parameters:
value : bool

Whether to correct special symbols or skip spell checking for them

setCompareLowcase(value)[source]#

Sets whether to compare tokens in lower case with vocabulary.

Parameters:
value : bool

Whether to compare tokens in lower case with vocabulary.

static pretrained(name='spellcheck_dl', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “spellcheck_dl”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
ContextSpellCheckerModel

The restored model
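
The same call with its arguments spelled out, using the default model name and language documented above:

>>> spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "en") \
...     .setInputCols("token") \
...     .setOutputCol("checked")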