sparknlp.annotator.spell_check.norvig_sweeting#
Contains classes for the NorvigSweeting spell checker.
Module Contents#
Classes#
NorvigSweetingApproach
Trains an annotator that retrieves tokens and makes corrections automatically if not found in an English dictionary.
NorvigSweetingModel
This annotator retrieves tokens and makes corrections automatically if not found in an English dictionary.
- class NorvigSweetingApproach[source]#
Trains an annotator that retrieves tokens and makes corrections automatically if a token is not found in an English dictionary, based on the algorithm by Peter Norvig.
The algorithm is based on a Bayesian approach to spell checking: given a word, we look in the provided dictionary and choose the word with the highest probability of being the correct one.
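The idea above can be sketched in plain Python. This is a hypothetical, simplified illustration (the annotator's actual implementation lives in Spark NLP's Scala code); the toy word list and frequencies are made up for the example:

```python
# Minimal sketch of Norvig-style spell correction: generate all strings one
# edit away from the input, keep those found in the dictionary, and pick the
# most frequent (most probable) one.
from collections import Counter

# Toy "dictionary" with word frequencies; a real run would load words from
# the text file passed to setDictionary().
WORDS = Counter({"sometimes": 50, "write": 80, "words": 120, "wrong": 60, "i": 300})

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Pick the known candidate with the highest frequency (probability)."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correction("somtimes"))  # -> sometimes
print(correction("wrrite"))    # -> write
```

The real annotator adds refinements on top of this core idea (case handling, vowel swaps, double variants), which the parameters below control.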
A dictionary of correct spellings must be provided with setDictionary() in the form of a text file, where each word is parsed by a regex pattern.
For instantiated/pretrained models, see NorvigSweetingModel.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- dictionary
Dictionary of correct spellings; needs the 'tokenPattern' regex for separating words
- caseSensitive
Whether the spell checker is case sensitive, by default False
- doubleVariants
Whether to use the more expensive double-variant spell check, by default False
Increases search at the cost of performance by enabling extra checks for word combinations.
- shortCircuit
Whether to use the faster mode, by default False
Increases performance at the cost of accuracy.
- frequencyPriority
Whether to prioritize word frequency over Hamming distance in intersections; when False, Hamming distance takes priority. By default True
- wordSizeIgnore
Minimum word length below which a word is ignored, by default 3
- dupsLimit
Maximum duplicate of characters in a word to consider, by default 2
- reductLimit
Word reductions limit, by default 3
- intersections
Hamming intersections to attempt, by default 10
- vowelSwapLimit
Vowel swap attempts, by default 6
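Two of the parameters above can be illustrated with a short, hypothetical sketch (this mirrors the behavior the parameters describe, not Spark NLP's internal code): wordSizeIgnore skips words that are too short to correct, and vowelSwapLimit caps how many vowel-substitution candidates are generated.

```python
# Conceptual illustration of wordSizeIgnore and vowelSwapLimit.
VOWELS = "aeiou"

def vowel_swaps(word, limit=6):
    """Generate up to `limit` candidates by swapping one vowel for another."""
    out = []
    for i, ch in enumerate(word):
        if ch in VOWELS:
            for v in VOWELS:
                if v != ch:
                    out.append(word[:i] + v + word[i + 1:])
                    if len(out) >= limit:
                        return out
    return out

def should_check(word, word_size_ignore=3):
    """Words shorter than wordSizeIgnore are left untouched."""
    return len(word) >= word_size_ignore

print(should_check("on"))    # -> False (too short to check)
print(vowel_swaps("gummy"))  # -> ['gammy', 'gemmy', 'gimmy', 'gommy']
```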
See also
SymmetricDeleteApproach
for an alternative approach to spell checking
ContextSpellCheckerApproach
for a DL based approach
References
Inspired by the spell checker by Peter Norvig: How to Write a Spelling Corrector
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
In this example, the dictionary "words.txt" has the form of:
...
gummy
gummic
gummier
gummiest
gummiferous
...
This dictionary is then set to be the basis of the spell checker.
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = NorvigSweetingApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
- setDictionary(path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the dictionary, which needs the 'tokenPattern' regex for separating words.
- Parameters:
- path : str
Path to the source file
- token_pattern : str, optional
Pattern for token separation, by default \S+
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
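A small sketch of what the token_pattern regex does when the dictionary file is read (the file contents are inlined here instead of loaded from a real path):

```python
# The default pattern \S+ treats every run of non-whitespace characters as
# one dictionary word, so a file with one word per line (or words separated
# by spaces) parses the same way.
import re

dictionary_text = "gummy\ngummic\ngummier\ngummiest\ngummiferous"

words = re.findall(r"\S+", dictionary_text)
print(words)  # -> ['gummy', 'gummic', 'gummier', 'gummiest', 'gummiferous']
```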
- setCaseSensitive(value)[source]#
Sets whether the spell checker is case sensitive, by default False.
- Parameters:
- value : bool
Whether the spell checker is case sensitive
- setDoubleVariants(value)[source]#
Sets whether to use the more expensive double-variant spell check, by default False.
Increases search at the cost of performance by enabling extra checks for word combinations.
- Parameters:
- value : bool
Whether to use the more expensive double-variant spell check
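Why doubleVariants is more expensive can be seen by applying the edit step twice, in the style of Norvig's edits2 (a hypothetical sketch, not Spark NLP's internal code): the candidate set grows multiplicatively.

```python
# Applying single-character edits twice blows up the candidate space,
# which is why the double-variant check trades performance for recall.
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away: one edit applied to every one-edit result."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

one = len(edits1("spell"))
two = len(edits2("spell"))
print(one, two)  # the two-edit set is vastly larger than the one-edit set
```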
- class NorvigSweetingModel(classname='com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel', java_model=None)[source]#
This annotator retrieves tokens and makes corrections automatically if not found in an English dictionary.
Correction candidates are generated from single-character edits of the input token and scored against the provided dictionary, choosing the candidate with the highest probability of being the correct word, following the algorithm by Peter Norvig.
This is the instantiated model of the NorvigSweetingApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained() of the companion object:
>>> spellChecker = NorvigSweetingModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
The default model is "spellcheck_norvig", if no name is provided. For available pretrained models, please see the Models Hub.
For extended examples of usage, see the Examples.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- None
See also
SymmetricDeleteModel
for an alternative approach to spell checking
ContextSpellCheckerModel
for a DL based approach
References
Inspired by the spell checker by Peter Norvig: How to Write a Spelling Corrector.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = NorvigSweetingModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["somtimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
- static pretrained(name='spellcheck_norvig', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "spellcheck_norvig"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- NorvigSweetingModel
The restored model