sparknlp.annotator.spell_check.context_spell_checker#
Contains classes for the ContextSpellChecker.
Module Contents#
Classes#
ContextSpellCheckerApproach: Trains a deep-learning based Noisy Channel Model Spell Algorithm.
ContextSpellCheckerModel: Implements a deep-learning based Noisy Channel Model Spell Algorithm.
- class ContextSpellCheckerApproach[source]#
Trains a deep-learning based Noisy Channel Model Spell Algorithm.
Correction candidates are extracted combining context information and word information.
For instantiated/pretrained models, see ContextSpellCheckerModel.
Spell checking is a sequence-to-sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:
- Different correction candidates for each word — word level.
- The surrounding text of each word, i.e. its context — sentence level.
- The relative cost of different correction candidates, according to the character-level edit operations required — subword level.
For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language and the Examples.
Input Annotation types
Output Annotation type
TOKEN
TOKEN
- Parameters:
- languageModelClasses
Number of classes to use during factorization of the softmax output in the LM.
- wordMaxDistance
Maximum distance for the generated candidates for every word.
- maxCandidates
Maximum number of candidates for every word.
- caseStrategy
What case combinations to try when generating candidates, by default 2. Possible values are:
0: All uppercase letters
1: First letter capitalized
2: All letters
- errorThreshold
Threshold perplexity for a word to be considered as an error.
- epochs
Number of epochs to train the language model.
- batchSize
Batch size for the training in NLM.
- initialRate
Initial learning rate for the LM.
- finalRate
Final learning rate for the LM.
- validationFraction
Percentage of datapoints to use for validation.
- minCount
Min number of times a token should appear to be included in vocab.
- compoundCount
Min number of times a compound word should appear to be included in vocab.
- classCount
Minimum number of times a word needs to appear in the corpus to not be considered part of a special class.
- tradeoff
Tradeoff between the cost of a word error and a transition in the language model.
- weightedDistPath
The path to the file containing the weights for the Levenshtein distance.
- maxWindowLen
Maximum size for the window used to remember history prior to every correction.
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- maxSentLen
Maximum length for a sentence - internal use during training.
- graphFolder
Folder path that contains external graph files.
See also
NorvigSweetingApproach, SymmetricDeleteApproach
For alternative approaches to spell checking
References
For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
For this example, we use the first Sherlock Holmes book as the training dataset.
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .setWordMaxDistance(3) \
...     .setBatchSize(24) \
...     .setEpochs(8) \
...     .setLanguageModelClasses(1650)  # dependent on vocabulary size
...     # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path) \
...     .toDF("text")
>>> pipelineModel = pipeline.fit(dataset)
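Once trained, the fitted pipelineModel can be applied to new data to obtain corrected tokens. A minimal sketch (the test sentence is made up for illustration):

>>> test = spark.createDataFrame([["Sherlok Holmes was a fictional detectiv ."]]).toDF("text")
>>> corrected = pipelineModel.transform(test)
>>> corrected.select("corrected.result").show(truncate=False)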
- setLanguageModelClasses(count)[source]#
Sets number of classes to use during factorization of the softmax output in the Language Model.
- Parameters:
- countint
Number of classes
- setWordMaxDistance(dist)[source]#
Sets maximum distance for the generated candidates for every word.
- Parameters:
- distint
Maximum distance for the generated candidates for every word
- setMaxCandidates(candidates)[source]#
Sets maximum number of candidates for every word.
- Parameters:
- candidatesint
Maximum number of candidates for every word.
- setCaseStrategy(strategy)[source]#
Sets what case combinations to try when generating candidates.
Possible values are:
0: All uppercase letters
1: First letter capitalized
2: All letters
- Parameters:
- strategyint
Case combinations to try when generating candidates
- setErrorThreshold(threshold)[source]#
Sets threshold perplexity for a word to be considered as an error.
- Parameters:
- thresholdfloat
Threshold perplexity for a word to be considered as an error
- setEpochs(count)[source]#
Sets number of epochs to train the language model.
- Parameters:
- countint
Number of epochs
- setInitialRate(rate)[source]#
Sets initial learning rate for the LM.
- Parameters:
- ratefloat
Initial learning rate for the LM
- setFinalRate(rate)[source]#
Sets final learning rate for the LM.
- Parameters:
- ratefloat
Final learning rate for the LM
- setValidationFraction(fraction)[source]#
Sets percentage of datapoints to use for validation.
- Parameters:
- fractionfloat
Percentage of datapoints to use for validation
- setMinCount(count)[source]#
Sets min number of times a token should appear to be included in vocab.
- Parameters:
- countfloat
Min number of times a token should appear to be included in vocab
- setCompoundCount(count)[source]#
Sets min number of times a compound word should appear to be included in vocab.
- Parameters:
- countint
Min number of times a compound word should appear to be included in vocab.
- setClassCount(count)[source]#
Sets the minimum number of times a word needs to appear in the corpus to not be considered part of a special class.
- Parameters:
- countfloat
Minimum number of times a word needs to appear in the corpus to not be considered part of a special class.
- setTradeoff(alpha)[source]#
Sets tradeoff between the cost of a word error and a transition in the language model.
- Parameters:
- alphafloat
Tradeoff between the cost of a word error and a transition in the language model
- setWeightedDistPath(path)[source]#
Sets the path to the file containing the weights for the Levenshtein distance.
- Parameters:
- pathstr
Path to the file containing the weights for the Levenshtein distance.
- setMaxWindowLen(length)[source]#
Sets the maximum size for the window used to remember history prior to every correction.
- Parameters:
- lengthint
Maximum size for the window used to remember history prior to every correction
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- bList[int]
ConfigProto from tensorflow, serialized into byte array
- setGraphFolder(path)[source]#
Sets the folder path that contains external graph files.
- Parameters:
- pathstr
Folder path that contains external graph files.
- setMaxSentLen(sentlen)[source]#
Sets the maximum length of a sentence.
- Parameters:
- sentlenint
Maximum length of a sentence
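The remaining training hyperparameters follow the same setter pattern. A sketch combining several of them on a fresh annotator (the values below are illustrative only, not recommended defaults):

>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .setInitialRate(0.7) \
...     .setFinalRate(0.0005) \
...     .setValidationFraction(0.1) \
...     .setMinCount(3.0) \
...     .setMaxWindowLen(5) \
...     .setCaseStrategy(2)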
- class ContextSpellCheckerModel(classname='com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerModel', java_model=None)[source]#
Implements a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.
Spell checking is a sequence-to-sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:
- Different correction candidates for each word — word level.
- The surrounding text of each word, i.e. its context — sentence level.
- The relative cost of different correction candidates, according to the character-level edit operations required — subword level.
This is the instantiated model of the ContextSpellCheckerApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained() of the companion object:
>>> spellChecker = ContextSpellCheckerModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("checked")
The default model is "spellcheck_dl", if no name is provided. For available pretrained models, please see the Models Hub.
For extended examples of usage, see the Examples.
Input Annotation types
Output Annotation type
TOKEN
TOKEN
- Parameters:
- wordMaxDistance
Maximum distance for the generated candidates for every word.
- maxCandidates
Maximum number of candidates for every word.
- caseStrategy
What case combinations to try when generating candidates.
- errorThreshold
Threshold perplexity for a word to be considered as an error.
- tradeoff
Tradeoff between the cost of a word error and a transition in the language model.
- maxWindowLen
Maximum size for the window used to remember history prior to every correction.
- gamma
Controls the influence of individual word frequency in the decision.
- correctSymbols
Whether to correct special symbols or skip spell checking for them
- compareLowcase
If true, tokens will be compared in lower case with the vocabulary.
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- vocabFreq
Frequencies of words from the vocabulary.
- idsVocab
Mapping of ids to vocabulary.
- vocabIds
Mapping of vocabulary to ids.
- classes
Classes the spell checker recognizes.
- weights
Levenshtein weights.
- useNewLines
When set to true, new lines will be treated as any other character. When set to false, correction is applied to paragraphs as defined by newline characters.
See also
NorvigSweetingModel, SymmetricDeleteModel
For alternative approaches to spell checking
References
For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("doc")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["doc"]) \
...     .setOutputCol("token")
>>> spellChecker = ContextSpellCheckerModel \
...     .pretrained() \
...     .setTradeoff(12.0) \
...     .setInputCols("token") \
...     .setOutputCol("checked")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["It was a cold , dreary day and the country was white with smow ."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("checked.result").show(truncate=False)
+--------------------------------------------------------------------------------+
|result                                                                          |
+--------------------------------------------------------------------------------+
|[It, was, a, cold, ,, dreary, day, and, the, country, was, white, with, snow, .]|
+--------------------------------------------------------------------------------+
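For quick experiments on plain strings, the same stages can be wrapped in a LightPipeline. A sketch assuming the pipeline and data defined above:

>>> from sparknlp.base import LightPipeline
>>> light = LightPipeline(pipeline.fit(data))
>>> light.annotate("It was a cold , dreary day and the country was white with smow .")["checked"]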
- setWordMaxDistance(dist)[source]#
Sets maximum distance for the generated candidates for every word.
- Parameters:
- distint
Maximum distance for the generated candidates for every word.
- setMaxCandidates(candidates)[source]#
Sets maximum number of candidates for every word.
- Parameters:
- candidatesint
Maximum number of candidates for every word.
- setCaseStrategy(strategy)[source]#
Sets what case combinations to try when generating candidates.
- Parameters:
- strategyint
Case combinations to try when generating candidates.
- setErrorThreshold(threshold)[source]#
Sets threshold perplexity for a word to be considered as an error.
- Parameters:
- thresholdfloat
Threshold perplexity for a word to be considered as an error
- setTradeoff(alpha)[source]#
Sets tradeoff between the cost of a word error and a transition in the language model.
- Parameters:
- alphafloat
Tradeoff between the cost of a word error and a transition in the language model
- setWeights(weights)[source]#
Sets weights of each word for Levenshtein distance.
- Parameters:
- weightsDict[str, float]
Weights for Levenshtein distance as a mapping.
- setMaxWindowLen(length)[source]#
Sets the maximum size for the window used to remember history prior to every correction.
- Parameters:
- lengthint
Maximum size for the window used to remember history prior to every correction
- setGamma(g)[source]#
Sets the influence of individual word frequency in the decision.
- Parameters:
- gfloat
Controls the influence of individual word frequency in the decision.
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- bList[int]
ConfigProto from tensorflow, serialized into byte array
- setVocabFreq(value: dict)[source]#
Sets the frequencies of words from the vocabulary.
- Parameters:
- valuedict
Frequencies of words from the vocabulary.
- setIdsVocab(idsVocab: dict)[source]#
Sets mapping of ids to vocabulary.
- Parameters:
- idsVocabdict
Mapping of ids to vocabulary.
- setVocabIds(vocabIds: dict)[source]#
Sets mapping of vocabulary to ids.
- Parameters:
- vocabIdsdict
Mapping of vocabulary to ids.
- setClasses(value)[source]#
Sets classes the spell checker recognizes.
- Parameters:
- valuelist
Classes the spell checker recognizes.
- getWordClasses()[source]#
Gets the classes of words to be corrected.
- Returns:
- List[str]
Classes of words to be corrected
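For example, the recognized classes of a pretrained model can be inspected directly (a sketch; the returned classes depend on the loaded model):

>>> spellChecker = ContextSpellCheckerModel.pretrained()
>>> spellChecker.getWordClasses()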
- updateRegexClass(label, regex)[source]#
Updates an existing class to correct, based on a regex.
- Parameters:
- labelstr
Label of the class
- regexstr
Regex to parse the class
- updateVocabClass(label, vocab, append=True)[source]#
Updates an existing class to correct, based on a vocabulary (see the sketch below).
- Parameters:
- labelstr
Label of the class
- vocabList[str]
Vocabulary as a list
- appendbool, optional
Whether to append to the existing vocabulary, by default True
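Both update methods can be used to adapt a pretrained model to domain-specific terms. A sketch with hypothetical class labels and values:

>>> spellChecker = ContextSpellCheckerModel.pretrained()
>>> # extend an existing vocabulary-based class with additional names (hypothetical values)
>>> spellChecker.updateVocabClass("_NAME_", ["Aurora", "Mariela"], append=True)
>>> # redefine an existing regex-based class, e.g. a hypothetical date pattern
>>> spellChecker.updateRegexClass("_DATE_", "[0-9]{2}/[0-9]{2}/[0-9]{4}")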
- setCorrectSymbols(value)[source]#
Sets whether to correct special symbols or skip spell checking for them.
- Parameters:
- valuebool
Whether to correct special symbols or skip spell checking for them
- setCompareLowcase(value)[source]#
Sets whether to compare tokens in lower case with vocabulary.
- Parameters:
- valuebool
Whether to compare tokens in lower case with vocabulary.
- static pretrained(name='spellcheck_dl', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- namestr, optional
Name of the pretrained model, by default “spellcheck_dl”
- langstr, optional
Language of the pretrained model, by default “en”
- remote_locstr, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- ContextSpellCheckerModel
The restored model
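A specific model can also be requested explicitly by name and language; a sketch using the defaults mentioned above:

>>> spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("checked")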