Description
This model was imported from Hugging Face Models and classifies a text as hate speech, offensive, or normal. It was trained on data from Gab and Twitter, and human rationales were included in the training data to improve performance.
Citing:
@article{mathew2020hatexplain,
  title={HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection},
  author={Mathew, Binny and Saha, Punyajoy and Yimam, Seid Muhie and Biemann, Chris and Goyal, Pawan and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2012.10289},
  year={2020}
}
Predicted Entities
hate speech, normal, offensive
How to use
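import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline
# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()
# Wrap the raw text column as Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')
# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')
# Load the pretrained classifier; predictions land in the 'class' column
sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_hatexplain', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)
pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])
example = spark.createDataFrame([['I love you very much!']]).toDF("text")
result = pipeline.fit(example).transform(example)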
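import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.classifier.dl.BertForSequenceClassification
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes a SparkSession named spark
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
// Load the pretrained classifier; predictions land in the "class" column
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_hatexplain", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))
val example = Seq("I love you very much!").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)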
import nlu
nlu.load("en.classify.bert.hate.").predict("""I love you very much!""")
Results
['normal']
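The result above is the list of labels stored in the output column for one input text. As a minimal sketch of how that value might be read and checked against the predicted entities listed earlier, the snippet below works on plain Python lists. The `rows` value is a hypothetical stand-in for what a live session would return, and `first_label` is an illustrative helper, not part of Spark NLP.

```python
# In a live session the rows would typically come from the transformed DataFrame, e.g.:
#   rows = [r["result"] for r in result.select("class.result").collect()]
# Here a hypothetical stand-in is used so the sketch runs without Spark.
rows = [["normal"]]

# Labels listed under "Predicted Entities"
EXPECTED_LABELS = {"hate speech", "normal", "offensive"}


def first_label(row):
    """Return the predicted label for one input text, or None if the row is empty."""
    if not row:
        return None
    label = row[0]
    if label not in EXPECTED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


labels = [first_label(row) for row in rows]
print(labels)  # ['normal']
```

Checking against the known label set is a design choice for the sketch: it surfaces a mismatched model or column name early instead of passing an unknown string downstream.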
Model Information
| Model Name: | bert_sequence_classifier_hatexplain |
| Compatibility: | Spark NLP 3.3.2+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [token, sentence] |
| Output Labels: | [label] |
| Language: | en |
| Case sensitive: | true |
Data Source
https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain
Benchmarking
+-------+------------+--------+
| Acc | Macro F1 | AUROC |
+-------+------------+--------+
| 0.698 | 0.687 | 0.851 |
+-------+------------+--------+
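For readers unfamiliar with the first two columns, the following is a minimal sketch of how accuracy and macro F1 are commonly computed for a three-class problem like this one. The toy labels are hypothetical and are not HateXplain data, so the numbers they produce are unrelated to the table above. AUROC is left out because it needs per-class probability scores rather than hard labels.

```python
LABELS = ["hate speech", "normal", "offensive"]


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Hypothetical toy data, for illustration only
y_true = ["normal", "normal", "offensive", "hate speech", "offensive", "hate speech"]
y_pred = ["normal", "offensive", "offensive", "hate speech", "normal", "hate speech"]

print(round(accuracy(y_true, y_pred), 3))
print(round(macro_f1(y_true, y_pred), 3))
```

Macro averaging gives each of the three classes equal weight regardless of how often it occurs, which is why it can differ from accuracy on imbalanced data.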