Spam Classifier

Description

Automatically classifies messages as either regular messages (ham) or spam.

Predicted Entities

spam, ham

How to use

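import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLModel

spark = sparknlp.start()

# Wrap the raw text column in a document annotation.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
# Embed each document with the Universal Sentence Encoder.
use = UniversalSentenceEncoder.pretrained(lang="en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")
# Classify the embedded document as spam or ham.
document_classifier = ClassifierDLModel.pretrained("classifierdl_use_spam", "en") \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

nlp_pipeline = Pipeline(stages=[documentAssembler, use, document_classifier])
# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline.
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

annotations = light_pipeline.fullAnnotate("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.")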

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDF; assumes an active SparkSession named `spark`

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val use = UniversalSentenceEncoder.pretrained("tfhub_use", "en")
  .setInputCols(Array("document"))
  .setOutputCol("sentence_embeddings")
val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_spam", "en")
  .setInputCols(Array("document", "sentence_embeddings"))
  .setOutputCol("class")
val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))

val data = Seq("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu

text = ["""Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now."""]
spam_df = nlu.load('classify.spam.use').predict(text, output_level='document')
spam_df[["document", "spam"]]

Results

+------------------------------------------------------------------------------------------------+------------+
|document                                                                                        |class       |
+------------------------------------------------------------------------------------------------+------------+
|Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.  | spam       |
+------------------------------------------------------------------------------------------------+------------+
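
The table above is the DataFrame view of the prediction. With the LightPipeline call from the Python example, the same label can be read from the `annotations` list, where each element maps an output column to a list of annotation objects carrying `result` and `metadata`. The sketch below is a minimal illustration under stated assumptions: `extract_label` and `FakeAnnotation` are hypothetical names, the stand-in object exists only so the snippet runs without a Spark session, the score values are made up, and it is assumed that the classifier stores per-label scores as strings in `metadata`.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation, used only so this sketch runs
# without Spark; it mimics the `result` and `metadata` attributes.
FakeAnnotation = namedtuple("FakeAnnotation", ["result", "metadata"])


def extract_label(annotated_doc, output_col="class"):
    """Return (label, confidence) for one fullAnnotate() result dict."""
    annotations = annotated_doc.get(output_col, [])
    if not annotations:
        return None, None
    label = annotations[0].result
    # Assumption: metadata maps each label to its score as a string.
    score = annotations[0].metadata.get(label)
    return label, (float(score) if score is not None else None)


# Made-up values shaped like a fullAnnotate() result for one message.
sample = [{"class": [FakeAnnotation("spam", {"spam": "0.99", "ham": "0.01"})]}]
label, confidence = extract_label(sample[0])
print(label, confidence)
```

With a real pipeline, `extract_label(annotations[0])` would be called on the list returned by `fullAnnotate`.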

Model Information

Model Name	classifierdl_use_spam
Model Class	ClassifierDLModel
Spark Compatibility	2.4
Spark NLP Compatibility	2.5.3
License	open source
Edition	public
Input Labels	[document, sentence_embeddings]
Output Labels	[class]
Language	en
Upstream Dependencies	tfhub_use

Data Source

This model is trained on the SMS Spam Collection dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
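
The archive linked above holds a plain-text file, SMSSpamCollection, with one message per line in the form label, tab, message text. A minimal sketch of reading that layout follows; it runs on two made-up lines instead of the real file, and the function name `read_sms_spam` is illustrative and not part of Spark NLP.

```python
import io
from collections import Counter


def read_sms_spam(lines):
    """Parse 'label<TAB>message' lines into (label, text) pairs."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so tabs inside a message survive.
        label, _, text = line.partition("\t")
        rows.append((label, text))
    return rows


# Two made-up lines in the same layout, standing in for the real file.
sample = io.StringIO(
    "ham\tSee you at lunch?\n"
    "spam\tYou have won a prize, call now!\n"
)
rows = read_sms_spam(sample)
label_counts = Counter(label for label, _ in rows)
print(label_counts)
```

For the downloaded file, the same function can be called on an open file handle, for example `read_sms_spam(open("SMSSpamCollection", encoding="utf-8"))`.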

Benchmarking

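Accuracy of the model with USE embeddings is 0.86 on 1,678 evaluation messages. Spam precision and recall are both 0.00 in the report below, so this accuracy matches the share of ham messages (1440 of 1678).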

              precision    recall  f1-score   support

         ham       0.86      1.00      0.92      1440
        spam       0.00      0.00      0.00       238

    accuracy                           0.86      1678
   macro avg       0.43      0.50      0.46      1678
weighted avg       0.74      0.86      0.79      1678
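
The figures in the report can be reproduced from a confusion matrix. The sketch below assumes that the reported recall values of 1.00 (ham) and 0.00 (spam) are exact and not rounded, which implies that all 1,678 messages were predicted as ham; the counts are inferred from the report and are not taken from the original evaluation output.

```python
# confusion[true_label][predicted_label], inferred from the report
# (assumption: recall of 1.00 and 0.00 is exact, not rounded).
confusion = {
    "ham": {"ham": 1440, "spam": 0},
    "spam": {"ham": 238, "spam": 0},
}


def per_class_metrics(confusion, label):
    """Return (precision, recall, f1) for one label."""
    tp = confusion[label][label]
    predicted = sum(row[label] for row in confusion.values())
    actual = sum(confusion[label].values())
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[l][l] for l in confusion) / total
metrics = {l: per_class_metrics(confusion, l) for l in confusion}

# Macro average weights classes equally; weighted average uses support.
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)
weighted_f1 = sum(
    metrics[l][2] * sum(confusion[l].values()) for l in confusion
) / total

print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

Under that assumption, the accuracy and ham precision are both 1440 / 1678, and the macro average is pulled down by the zero scores for the spam class.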
