sparknlp.annotator.spell_check.norvig_sweeting#
Contains classes for the NorvigSweeting spell checker.
Module Contents#
Classes#
NorvigSweetingApproach
Trains an annotator that retrieves tokens and makes corrections automatically if not found in an English dictionary.
NorvigSweetingModel
This annotator retrieves tokens and makes corrections automatically if not found in an English dictionary.
- class NorvigSweetingApproach[source]#
Trains an annotator that retrieves tokens and makes corrections automatically if a token is not found in an English dictionary, based on the algorithm by Peter Norvig.
The algorithm is based on a Bayesian approach to spell checking: given a word, we look in the provided dictionary and choose the word with the highest probability of being the correct one.
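The idea above can be sketched in plain Python. This is a hypothetical, simplified illustration (the annotator's actual implementation lives in Spark NLP's Scala code); the toy word list and frequencies are made up for the example:

```python
# Minimal sketch of Norvig-style spell correction: generate all strings one
# edit away from the input, keep those found in the dictionary, and pick the
# most frequent (most probable) one.
from collections import Counter

# Toy "dictionary" with word frequencies; a real run would load words from
# the text file passed to setDictionary().
WORDS = Counter({"sometimes": 50, "write": 80, "words": 120, "wrong": 60, "i": 300})

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Pick the known candidate with the highest frequency (probability)."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correction("somtimes"))  # -> sometimes
print(correction("wrrite"))    # -> write
```

The real annotator adds refinements on top of this core idea (case handling, vowel swaps, double variants), which the parameters below control.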
A dictionary of correct spellings must be provided with setDictionary() in the form of a text file, where each word is parsed by a regex pattern.
For instantiated/pretrained models, see NorvigSweetingModel.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- dictionary
Dictionary of correct spellings; needs the 'tokenPattern' regex for separating words
- caseSensitive
Whether the spell checker is case sensitive, by default False
- doubleVariants
Whether to use the more expensive double-variant spell check, by default False
Increases search at the cost of performance by enabling extra checks for word combinations.
- shortCircuit
Whether to use the faster mode, by default False
Increases performance at the cost of accuracy.
- frequencyPriority
Whether to prioritize word frequency over Hamming distance in intersections; when False, Hamming distance takes priority. By default True
- wordSizeIgnore
Minimum word length below which a word is ignored, by default 3
- dupsLimit
Maximum duplicate of characters in a word to consider, by default 2
- reductLimit
Word reductions limit, by default 3
- intersections
Hamming intersections to attempt, by default 10
- vowelSwapLimit
Vowel swap attempts, by default 6
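Two of the parameters above can be illustrated with a short, hypothetical sketch (this mirrors the behavior the parameters describe, not Spark NLP's internal code): wordSizeIgnore skips words that are too short to correct, and vowelSwapLimit caps how many vowel-substitution candidates are generated.

```python
# Conceptual illustration of wordSizeIgnore and vowelSwapLimit.
VOWELS = "aeiou"

def vowel_swaps(word, limit=6):
    """Generate up to `limit` candidates by swapping one vowel for another."""
    out = []
    for i, ch in enumerate(word):
        if ch in VOWELS:
            for v in VOWELS:
                if v != ch:
                    out.append(word[:i] + v + word[i + 1:])
                    if len(out) >= limit:
                        return out
    return out

def should_check(word, word_size_ignore=3):
    """Words shorter than wordSizeIgnore are left untouched."""
    return len(word) >= word_size_ignore

print(should_check("on"))    # -> False (too short to check)
print(vowel_swaps("gummy"))  # -> ['gammy', 'gemmy', 'gimmy', 'gommy']
```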
See also
SymmetricDeleteApproach
for an alternative approach to spell checking
ContextSpellCheckerApproach
for a DL based approach
References
Inspired by the spell checker by Peter Norvig: How to Write a Spelling Corrector
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
In this example, the dictionary "words.txt" has the form of:
...
gummy
gummic
gummier
gummiest
gummiferous
...
This dictionary is then set to be the basis of the spell checker.
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = NorvigSweetingApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
- setDictionary(path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets the dictionary, which needs the 'tokenPattern' regex for separating words.
- Parameters:
- path : str
Path to the source file
- token_pattern : str, optional
Pattern for token separation, by default \S+
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {"format": "text"}
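A small sketch of what the token_pattern regex does when the dictionary file is read (the file contents are inlined here instead of loaded from a real path):

```python
# The default pattern \S+ treats every run of non-whitespace characters as
# one dictionary word, so a file with one word per line (or words separated
# by spaces) parses the same way.
import re

dictionary_text = "gummy\ngummic\ngummier\ngummiest\ngummiferous"

words = re.findall(r"\S+", dictionary_text)
print(words)  # -> ['gummy', 'gummic', 'gummier', 'gummiest', 'gummiferous']
```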
- setCaseSensitive(value)[source]#
Sets whether the spell checker is case sensitive, by default False.
- Parameters:
- value : bool
Whether the spell checker is case sensitive
- setDoubleVariants(value)[source]#
Sets whether to use the more expensive double-variant spell check, by default False.
Increases search at the cost of performance by enabling extra checks for word combinations.
- Parameters:
- value : bool
Whether to use the more expensive double-variant spell check
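Why doubleVariants is more expensive can be seen by applying the edit step twice, in the style of Norvig's edits2 (a hypothetical sketch, not Spark NLP's internal code): the candidate set grows multiplicatively.

```python
# Applying single-character edits twice blows up the candidate space,
# which is why the double-variant check trades performance for recall.
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away: one edit applied to every one-edit result."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

one = len(edits1("spell"))
two = len(edits2("spell"))
print(one, two)  # the two-edit set is vastly larger than the one-edit set
```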
- class NorvigSweetingModel(classname='com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel', java_model=None)[source]#
This annotator retrieves tokens and makes corrections automatically if not found in an English dictionary.
Correction candidates are generated from single-character edits of the input token and scored against the provided dictionary, choosing the candidate with the highest probability of being the correct word, following the algorithm by Peter Norvig.
This is the instantiated model of the NorvigSweetingApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained() of the companion object:
>>> spellChecker = NorvigSweetingModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
The default model is "spellcheck_norvig", if no name is provided. For available pretrained models, please see the Models Hub.
For extended examples of usage, see the Examples.
Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- None
See also
SymmetricDeleteModel
for an alternative approach to spell checking
ContextSpellCheckerModel
for a DL based approach
References
Inspired by the spell checker by Peter Norvig: How to Write a Spelling Corrector.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = NorvigSweetingModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["somtimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
- static pretrained(name='spellcheck_norvig', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "spellcheck_norvig"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- NorvigSweetingModel
The restored model