sparknlp.annotator.spell_check.symmetric_delete#
Contains classes for SymmetricDelete.
Module Contents#
Classes#
SymmetricDeleteApproach
Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.
SymmetricDeleteModel
Symmetric Delete spelling correction algorithm.
- class SymmetricDeleteApproach[source]#
Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
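The deletes-only candidate generation described above can be sketched as follows. This is an illustrative Python sketch of the general SymSpell idea, not Spark NLP's internal implementation; the `deletes`, `candidates`, and `index` names are made up for the example:

```python
# Sketch of the Symmetric Delete idea: both the dictionary and the query
# word are expanded into their delete variants, so lookup needs deletes
# only (no transposes, replaces, or inserts).

def deletes(word, max_edit_distance=1):
    """Return all variants of `word` with up to `max_edit_distance`
    characters deleted (including the word itself)."""
    variants = {word}
    frontier = {word}
    for _ in range(max_edit_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants

# Index a tiny dictionary by its delete variants.
dictionary = ["gummy", "gummier"]
index = {}
for entry in dictionary:
    for variant in deletes(entry):
        index.setdefault(variant, set()).add(entry)

def candidates(word):
    """A misspelling matches a dictionary word iff they share a delete variant."""
    found = set()
    for variant in deletes(word):
        found |= index.get(variant, set())
    return found

print(candidates("gummyy"))  # {'gummy'}
```

Because only deletions are enumerated on both sides, the candidate space stays small compared with generating all transposes, replaces, and inserts of the query word.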
A dictionary of correct spellings must be provided with
setDictionary()
in the form of a text file, where each word is parsed by a regex pattern. For instantiated/pretrained models, see
SymmetricDeleteModel.
Input Annotation types
Output Annotation type
TOKEN
TOKEN
- Parameters:
- dictionary
folder or file with text that teaches about the language
- maxEditDistance
max edit distance characters to derive strings from a word, by default 3
- frequencyThreshold
minimum frequency of words to be considered from training, by default 0
- deletesThreshold
minimum frequency of corrections a word needs to have to be considered from training, by default 0
See also
NorvigSweetingApproach
for an alternative approach to spell checking
ContextSpellCheckerApproach
for a DL based approach
References
Inspired by SymSpell.
Examples
In this example, the dictionary
"words.txt"
has the form of:
...
gummy
gummic
gummier
gummiest
gummiferous
...
This dictionary is then set to be the basis of the spell checker.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
- setDictionary(path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets folder or file with text that teaches about the language.
- Parameters:
- path : str
Path to the resource
- token_pattern : str, optional
Regex pattern to extract tokens, by default “\\S+”
- read_as : str, optional
How to read the resource, by default ReadAs.TEXT
- options : dict, optional
Options for reading the resource, by default {“format”: “text”}
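The role of the token pattern can be illustrated with Python's standard `re` module. This is only a sketch of how a pattern like the default “\\S+” splits the dictionary text into words, not a reproduction of Spark NLP's resource loader:

```python
import re

# The default token pattern "\\S+" matches runs of non-whitespace
# characters, so each whitespace-separated entry becomes one word.
text = "gummy gummic gummier\ngummiest gummiferous"
words = re.findall(r"\S+", text)
print(words)  # ['gummy', 'gummic', 'gummier', 'gummiest', 'gummiferous']
```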
- setMaxEditDistance(v)[source]#
Sets max edit distance characters to derive strings from a word, by default 3.
- Parameters:
- v : int
Max edit distance characters to derive strings from a word
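A larger maximum edit distance makes the checker more tolerant but enlarges the set of delete variants that must be generated and stored. The growth can be sketched as follows (illustrative only; `count_delete_variants` is a made-up helper, not part of the Spark NLP API):

```python
# Count distinct variants of a word reachable by deleting up to
# `max_edit_distance` characters. The count grows quickly with the
# distance, which is one reason the default is kept small (3).
def count_delete_variants(word, max_edit_distance):
    variants = {word}
    frontier = {word}
    for _ in range(max_edit_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return len(variants)

for d in (1, 2, 3):
    print(d, count_delete_variants("gummiferous", d))
```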
- class SymmetricDeleteModel(classname='com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteModel', java_model=None)[source]#
Symmetric Delete spelling correction algorithm.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
Pretrained models can be loaded with
pretrained()
of the companion object:
>>> spell = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
The default model is
"spellcheck_sd"
, if no name is provided. For available pretrained models please see the Models Hub.
Input Annotation types
Output Annotation type
TOKEN
TOKEN
- Parameters:
- None
See also
NorvigSweetingModel
for an alternative approach to spell checking
ContextSpellCheckerModel
for a DL based approach
References
Inspired by SymSpell.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["spmetimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
- static pretrained(name='spellcheck_sd', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default “spellcheck_sd”
- lang : str, optional
Language of the pretrained model, by default “en”
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- SymmetricDeleteModel
The restored model