Description
This model was imported from Hugging Face and was trained on the NeuSpell corpus to detect typos. It uses DistilBERT embeddings with DistilBertForTokenClassification for NER purposes, and it labels typo tokens as PO.
Predicted Entities
PO
How to use
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetectorDLModel, Tokenizer,
    DistilBertForTokenClassification, NerConverter,
)
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

# Group consecutive labeled tokens into chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

text = """He had also stgruggled with addiction during his tine in Congress."""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

// Group consecutive labeled tokens into chunks
val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))
import spark.implicits._ // provides .toDS on a Seq
val example = Seq("He had also stgruggled with addiction during his tine in Congress.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
import nlu
nlu.load("en.classify.typos.distilbert").predict("""He had also stgruggled with addiction during his tine in Congress.""")
Results
+------------+---------+
|chunk       |ner_label|
+------------+---------+
|stgruggled  |PO       |
|tine        |PO       |
+------------+---------+
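Once the typo chunks are known, a common next step is to mark them in the original sentence. The following is a minimal sketch in plain Python, not part of Spark NLP: `mark_typos` is a hypothetical helper, and it assumes each chunk comes with `(begin, end)` character offsets where `end` is inclusive, as Spark NLP annotations report them. For the demo, the offsets are derived from the example text rather than read from a pipeline result.

```python
from typing import List, Tuple


def mark_typos(text: str, spans: List[Tuple[int, int]]) -> str:
    """Wrap each flagged span in square brackets.

    `spans` holds (begin, end) character offsets with an inclusive end
    (assumed to match the begin/end fields of the ner_chunk annotations).
    """
    pieces = []
    cursor = 0
    for begin, end in sorted(spans):
        pieces.append(text[cursor:begin])            # untouched text before the typo
        pieces.append("[" + text[begin:end + 1] + "]")  # the flagged chunk
        cursor = end + 1
    pieces.append(text[cursor:])                     # remainder of the sentence
    return "".join(pieces)


text = "He had also stgruggled with addiction during his tine in Congress."

# Stand-in for offsets taken from the pipeline output: computed from the text here.
spans = []
for chunk in ["stgruggled", "tine"]:
    begin = text.index(chunk)
    spans.append((begin, begin + len(chunk) - 1))

marked = mark_typos(text, spans)
print(marked)
# He had also [stgruggled] with addiction during his [tine] in Congress.
```

The helper assumes the spans do not overlap; overlapping spans would need to be merged first.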
Model Information
Model Name: distilbert_token_classifier_typo_detector
Compatibility: Spark NLP 3.3.4+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Size: 244.1 MB
Case sensitive: true
Max sentence length: 256
Data Source
https://github.com/neuspell/neuspell
Benchmarking
label         precision  recall    f1-score  support
micro-avg     0.992332   0.985997  0.989154  416054.0
macro-avg     0.992332   0.985997  0.989154  416054.0
weighted-avg  0.992332   0.985997  0.989154  416054.0
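As a quick sanity check, the f1-score column can be recomputed from the precision and recall columns as their harmonic mean, F1 = 2PR / (P + R). A minimal sketch in plain Python; the function name `f1_score` is illustrative and not part of Spark NLP.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the benchmark table.
precision, recall = 0.992332, 0.985997
f1 = f1_score(precision, recall)
print(round(f1, 6))  # prints 0.989154
```

The three averages coincide here because the model predicts a single entity label (PO), so micro, macro, and weighted averaging all reduce to that one class's scores.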