class ContextSpellCheckerApproach extends AnnotatorApproach[ContextSpellCheckerModel] with HasFeatures with WeightedLevenshtein
Trains a deep-learning-based noisy channel model for spell checking. Correction candidates are extracted by combining context information and word information.
For instantiated/pretrained models, see ContextSpellCheckerModel.
Spell checking is a sequence-to-sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:
- Different correction candidates for each word (word level).
- The surrounding text of each word, i.e. its context (sentence level).
- The relative cost of different correction candidates according to the character-level edit operations each one requires (subword level).
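As a rough illustration of how these three levels could be combined, the sketch below scores every candidate for one token by adding a character-level edit cost to a weighted context cost and sorting by the total. The names `Candidate`, `RankingSketch`, `contextCost` and `alpha`, the toy bigram table, and the scoring rule itself are assumptions made for illustration; this is not the library's implementation.

```scala
// Hypothetical sketch: rank correction candidates for a single token.
// editCost stands in for the subword-level cost of reaching the candidate.
final case class Candidate(word: String, editCost: Double)

object RankingSketch {
  // Toy context model (an assumption): a lower cost means the word fits
  // better after the previous word. A real system would use a language model.
  val bigramCost: Map[(String, String), Double] = Map(
    ("the", "cat") -> 1.0,
    ("the", "cut") -> 3.0,
    ("the", "cot") -> 4.0
  )

  // Sentence level: unseen word pairs receive a high default cost.
  def contextCost(previous: String, word: String): Double =
    bigramCost.getOrElse((previous, word), 10.0)

  // Word level: total cost = edit cost + alpha * context cost.
  // The candidate with the lowest total is ranked first.
  def rank(previous: String, candidates: Seq[Candidate], alpha: Double): Seq[Candidate] =
    candidates.sortBy(c => c.editCost + alpha * contextCost(previous, c.word))
}
```

For example, `RankingSketch.rank("the", Seq(Candidate("cut", 1.0), Candidate("cat", 1.0)), alpha = 1.0)` places `cat` first, because both candidates have the same edit cost and `cat` is cheaper in context.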
For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.
For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language, the Examples and the ContextSpellCheckerTestSpec.
Example
For this example, we use the first Sherlock Holmes book as the training dataset.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val spellChecker = new ContextSpellCheckerApproach()
  .setInputCols("token")
  .setOutputCol("corrected")
  .setWordMaxDistance(3)
  .setBatchSize(24)
  .setEpochs(8)
  .setLanguageModelClasses(1650) // dependent on vocabulary size
  // .addVocabClass("_NAME_", names) // Extra classes for correction could be added like this

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  spellChecker
))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)
- See also
NorvigSweetingApproach and SymmetricDeleteApproach for alternative approaches to spell checking
Instance Constructors
Type Members
-
type
AnnotatorType = String
- Definition Classes
- HasOutputAnnotatorType
- implicit class ArrayHelper extends AnyRef
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
$[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
-
def
$$[T](feature: StructFeature[T]): T
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[K, V](feature: MapFeature[K, V]): Map[K, V]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: SetFeature[T]): Set[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: ArrayFeature[T]): Array[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
_fit(dataset: Dataset[_], recursiveStages: Option[PipelineModel]): ContextSpellCheckerModel
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
def
addRegexClass(usrLabel: String, usrRegex: String, userDist: Int = 3): ContextSpellCheckerApproach.this.type
Adds a new class of words to correct, based on regex.
- usrLabel
Name of the class
- usrRegex
Regex to add
- userDist
Maximal distance to the word
-
def
addVocabClass(usrLabel: String, vocabList: ArrayList[String], userDist: Int = 3): ContextSpellCheckerApproach.this.type
Adds a new class of words to correct, based on a vocabulary.
- usrLabel
Name of the class
- vocabList
Vocabulary as a list
- userDist
Maximal distance to the word
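A minimal configuration sketch for the two methods above, using only the signatures listed on this page. The labels `_NAME_` and `_TICKET_`, the word list and the regular expression are illustrative assumptions, and the regex syntax the library accepts is not specified here.

```scala
import java.util.ArrayList
import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach

// Illustrative vocabulary for a hypothetical "_NAME_" class.
val names = new ArrayList[String]()
names.add("Sherlock")
names.add("Watson")

val spellChecker = new ContextSpellCheckerApproach()
  .setInputCols("token")
  .setOutputCol("corrected")
  // userDist = 2: maximal distance to the word, per the parameter description.
  .addVocabClass("_NAME_", names, 2)
  // Hypothetical class for identifiers such as "AB-1234"; userDist keeps its default of 3.
  .addRegexClass("_TICKET_", "[A-Z]{2}-[0-9]{4}")
```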
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
backTrack(dist: Array[Array[Float]], s2: String, s1: String, j: Int, i: Int, acc: Seq[(String, String)]): Seq[(String, String)]
- Definition Classes
- WeightedLevenshtein
-
val
batchSize: IntParam
Batch size for the training in NLM (Default: 24).
-
def
beforeTraining(spark: SparkSession): Unit
- Definition Classes
- AnnotatorApproach
-
val
caseStrategy: IntParam
What case combinations to try when generating candidates (Default: CandidateStrategy.ALL).
-
final
def
checkSchema(schema: StructType, inputAnnotatorType: String): Boolean
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
val
classCount: Param[Double]
Minimum number of times a word needs to appear in the corpus to not be considered part of a special class (Default: 15.0).
-
final
def
clear(param: Param[_]): ContextSpellCheckerApproach.this.type
- Definition Classes
- Params
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
val
compoundCount: Param[Int]
Min number of times a compound word should appear to be included in vocab (Default: 5).
- def computeClasses(vocab: HashMap[String, Double], total: Double, k: Int): Map[String, (Int, Int)]
-
val
configProtoBytes: IntArrayParam
ConfigProto from TensorFlow, serialized into a byte array. Get with config_proto.SerializeToString()
-
final
def
copy(extra: ParamMap): Estimator[ContextSpellCheckerModel]
- Definition Classes
- AnnotatorApproach → Estimator → PipelineStage → Params
-
def
copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
final
def
defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
val
description: String
- Definition Classes
- ContextSpellCheckerApproach → AnnotatorApproach
-
val
epochs: IntParam
Number of epochs to train the language model (Default: 2).
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
val
errorThreshold: FloatParam
Threshold perplexity for a word to be considered as an error (Default: 10f).
-
def
explainParam(param: Param[_]): String
- Definition Classes
- Params
-
def
explainParams(): String
- Definition Classes
- Params
-
final
def
extractParamMap(): ParamMap
- Definition Classes
- Params
-
final
def
extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
-
val
features: ArrayBuffer[Feature[_, _, _]]
- Definition Classes
- HasFeatures
-
val
finalRate: FloatParam
Final learning rate for the LM (Default: 0.0005f).
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
fit(dataset: Dataset[_]): ContextSpellCheckerModel
- Definition Classes
- AnnotatorApproach → Estimator
-
def
fit(dataset: Dataset[_], paramMaps: Seq[ParamMap]): Seq[ContextSpellCheckerModel]
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], paramMap: ParamMap): ContextSpellCheckerModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): ContextSpellCheckerModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" ) @varargs()
- def genVocab(dataset: Dataset[_]): (HashMap[String, Double], Map[String, (Int, Int)])
-
def
get[T](feature: StructFeature[T]): Option[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: SetFeature[T]): Option[Set[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: ArrayFeature[T]): Option[Array[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def getConfigProtoBytes: Option[Array[Byte]]
-
final
def
getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
def
getInputCols: Array[String]
- returns
input annotations columns currently used
- Definition Classes
- HasInputAnnotationCols
-
def
getLazyAnnotator: Boolean
- Definition Classes
- CanBeLazy
-
final
def
getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
-
final
def
getOutputCol: String
Gets the name of the annotation column that will be generated
- Definition Classes
- HasOutputAnnotationCol
-
def
getParam(paramName: String): Param[Any]
- Definition Classes
- Params
-
val
graphFolder: Param[String]
Folder path that contains external graph files
-
final
def
hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
-
def
hasParam(paramName: String): Boolean
- Definition Classes
- Params
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
val
initialRate: FloatParam
Initial learning rate for the LM (Default: .7f).
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
inputAnnotatorTypes: Array[String]
Input Annotator Types: TOKEN
- Definition Classes
- ContextSpellCheckerApproach → HasInputAnnotationCols
-
final
val
inputCols: StringArrayParam
Columns that contain the annotations necessary to run this annotator. AnnotatorType is used both as input and output columns if not specified.
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
isSet(param: Param[_]): Boolean
- Definition Classes
- Params
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
val
languageModelClasses: Param[Int]
Number of classes to use during factorization of the softmax output in the LM (Default: 2000).
-
val
lazyAnnotator: BooleanParam
- Definition Classes
- CanBeLazy
-
def
learnDist(s1: String, s2: String): Seq[(String, String)]
- Definition Classes
- WeightedLevenshtein
-
def
levenshteinDist(s11: String, s22: String)(cost: (String, String) ⇒ Float): Float
- Definition Classes
- WeightedLevenshtein
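To make the role of the `cost` argument concrete, here is a standalone sketch of an edit distance that takes a cost function with the same shape as the signature above. It is an assumption-laden illustration, not the library's code: the object name, the use of the empty string to mean "no character", and both example cost functions are hypothetical.

```scala
object WeightedLevenshteinSketch {
  // Marker for "no character" (an assumption of this sketch):
  // cost(a, Eps) prices deleting a, cost(Eps, b) prices inserting b.
  val Eps = ""

  def distance(s1: String, s2: String)(cost: (String, String) => Float): Float = {
    // dist(i)(j): cheapest way to turn the first i chars of s1
    // into the first j chars of s2.
    val dist = Array.ofDim[Float](s1.length + 1, s2.length + 1)
    for (i <- 1 to s1.length)
      dist(i)(0) = dist(i - 1)(0) + cost(s1(i - 1).toString, Eps)
    for (j <- 1 to s2.length)
      dist(0)(j) = dist(0)(j - 1) + cost(Eps, s2(j - 1).toString)
    for (i <- 1 to s1.length; j <- 1 to s2.length) {
      val a = s1(i - 1).toString
      val b = s2(j - 1).toString
      // Matching characters are free; otherwise pay the substitution cost.
      val substitute = dist(i - 1)(j - 1) + (if (a == b) 0f else cost(a, b))
      val delete = dist(i - 1)(j) + cost(a, Eps)
      val insert = dist(i)(j - 1) + cost(Eps, b)
      dist(i)(j) = math.min(substitute, math.min(delete, insert))
    }
    dist(s1.length)(s2.length)
  }

  // Uniform costs reproduce the classic Levenshtein distance.
  val uniform: (String, String) => Float = (_, _) => 1f

  // Hypothetical weighting: confusing 'l' with '1' is cheap.
  val ocrLike: (String, String) => Float = {
    case ("l", "1") | ("1", "l") => 0.2f
    case _ => 1f
  }
}
```

With `uniform`, `distance("kitten", "sitting")` is the usual 3 edits; with `ocrLike`, turning `he1lo` into `hello` costs only the single cheap substitution.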
-
def
loadWeights(filename: String): Map[String, Map[String, Float]]
- Definition Classes
- WeightedLevenshtein
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
maxCandidates: IntParam
Maximum number of candidates for every word (Default: 6).
-
val
maxSentLen: IntParam
Maximum length for a sentence - internal use during training (Default: 250).
-
val
maxWindowLen: IntParam
Maximum size for the window used to remember history prior to every correction (Default: 5).
-
val
minCount: Param[Double]
Min number of times a token should appear to be included in vocab (Default: 3.0).
-
def
msgHelper(schema: StructType): String
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
onTrained(model: ContextSpellCheckerModel, spark: SparkSession): Unit
- Definition Classes
- AnnotatorApproach
-
val
optionalInputAnnotatorTypes: Array[String]
- Definition Classes
- HasInputAnnotationCols
-
val
outputAnnotatorType: AnnotatorType
Output Annotator Types: TOKEN
- Definition Classes
- ContextSpellCheckerApproach → HasOutputAnnotatorType
-
final
val
outputCol: Param[String]
- Attributes
- protected
- Definition Classes
- HasOutputAnnotationCol
-
lazy val
params: Array[Param[_]]
- Definition Classes
- Params
-
def
save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @Since( "1.6.0" ) @throws( ... )
-
def
set[T](feature: StructFeature[T], value: T): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[K, V](feature: MapFeature[K, V], value: Map[K, V]): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: SetFeature[T], value: Set[T]): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: ArrayFeature[T], value: Array[T]): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
set(paramPair: ParamPair[_]): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set(param: String, value: Any): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set[T](param: Param[T], value: T): ContextSpellCheckerApproach.this.type
- Definition Classes
- Params
- def setBatchSize(k: Int): ContextSpellCheckerApproach.this.type
- def setCaseStrategy(k: Int): ContextSpellCheckerApproach.this.type
- def setClassCount(t: Double): ContextSpellCheckerApproach.this.type
- def setCompoundCount(k: Int): ContextSpellCheckerApproach.this.type
- def setConfigProtoBytes(bytes: Array[Int]): ContextSpellCheckerApproach.this.type
-
def
setDefault[T](feature: StructFeature[T], value: () ⇒ T): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[K, V](feature: MapFeature[K, V], value: () ⇒ Map[K, V]): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: SetFeature[T], value: () ⇒ Set[T]): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: ArrayFeature[T], value: () ⇒ Array[T]): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
setDefault(paramPairs: ParamPair[_]*): ContextSpellCheckerApproach.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
setDefault[T](param: Param[T], value: T): ContextSpellCheckerApproach.this.type
- Attributes
- protected[org.apache.spark.ml]
- Definition Classes
- Params
- def setEpochs(k: Int): ContextSpellCheckerApproach.this.type
- def setErrorThreshold(t: Float): ContextSpellCheckerApproach.this.type
- def setFinalRate(r: Float): ContextSpellCheckerApproach.this.type
-
def
setGraphFolder(path: String): ContextSpellCheckerApproach.this.type
Folder path that contains external graph files
- def setInitialRate(r: Float): ContextSpellCheckerApproach.this.type
-
final
def
setInputCols(value: String*): ContextSpellCheckerApproach.this.type
- Definition Classes
- HasInputAnnotationCols
-
def
setInputCols(value: Array[String]): ContextSpellCheckerApproach.this.type
Overrides the required annotator columns if different from the default
- Definition Classes
- HasInputAnnotationCols
- def setLanguageModelClasses(k: Int): ContextSpellCheckerApproach.this.type
-
def
setLazyAnnotator(value: Boolean): ContextSpellCheckerApproach.this.type
- Definition Classes
- CanBeLazy
- def setMaxCandidates(k: Int): ContextSpellCheckerApproach.this.type
- def setMaxWindowLen(w: Int): ContextSpellCheckerApproach.this.type
- def setMinCount(threshold: Double): ContextSpellCheckerApproach.this.type
-
final
def
setOutputCol(value: String): ContextSpellCheckerApproach.this.type
Overrides annotation column name when transforming
- Definition Classes
- HasOutputAnnotationCol
- def setSpecialClasses(parsers: List[SpecialClassParser]): ContextSpellCheckerApproach.this.type
- def setTradeoff(alpha: Float): ContextSpellCheckerApproach.this.type
- def setValidationFraction(r: Float): ContextSpellCheckerApproach.this.type
- def setWeightedDistPath(filePath: String): ContextSpellCheckerApproach.this.type
- def setWordMaxDistance(k: Int): ContextSpellCheckerApproach.this.type
-
val
specialClasses: Param[List[SpecialClassParser]]
List of parsers for special classes (Default: List(new DateToken, new NumberToken)).
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
-
val
tradeoff: Param[Float]
Tradeoff between the cost of a word error and a transition in the language model (Default: 18.0f).
-
def
train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): ContextSpellCheckerModel
- Definition Classes
- ContextSpellCheckerApproach → AnnotatorApproach
-
final
def
transformSchema(schema: StructType): StructType
Requirement for pipeline transformation validation. It is called on fit().
- Definition Classes
- AnnotatorApproach → PipelineStage
-
def
transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
-
val
uid: String
- Definition Classes
- ContextSpellCheckerApproach → Identifiable
-
def
validate(schema: StructType): Boolean
Takes a schema and checks whether all the required annotation types are present.
- schema
to be validated
- returns
True if all the required types are present, else false
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
val
validationFraction: FloatParam
Percentage of datapoints to use for validation (Default: .1f).
-
def
wLevenshteinDist(s1: String, s2: String, weights: Map[String, Map[String, Float]]): Float
- Definition Classes
- WeightedLevenshtein
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
val
weightedDistPath: Param[String]
The path to the file containing the weights for the levenshtein distance.
-
val
wordMaxDistance: IntParam
Maximum distance for the generated candidates for every word (Default: 3).
-
def
write: MLWriter
- Definition Classes
- DefaultParamsWritable → MLWritable
Inherited from WeightedLevenshtein
Inherited from HasFeatures
Inherited from AnnotatorApproach[ContextSpellCheckerModel]
Inherited from CanBeLazy
Inherited from DefaultParamsWritable
Inherited from MLWritable
Inherited from HasOutputAnnotatorType
Inherited from HasOutputAnnotationCol
Inherited from HasInputAnnotationCols
Inherited from Estimator[ContextSpellCheckerModel]
Inherited from PipelineStage
Inherited from Logging
Inherited from Params
Inherited from Serializable
Inherited from Serializable
Inherited from Identifiable
Inherited from AnyRef
Inherited from Any
Parameters
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
Annotator types
Required input and expected output annotator types