BERT Sequence Classification - Detecting Hate Speech (bert_sequence_classifier_hatexplain)

Description

This model is imported from Hugging Face models and classifies a text as hate speech, offensive, or normal. It was trained on data from Gab and Twitter, and human rationales were included in the training data to improve performance.

Citing:

@article{mathew2020hatexplain,
title={HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection},
author={Mathew, Binny and Saha, Punyajoy and Yimam, Seid Muhie and Biemann, Chris and Goyal, Pawan and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2012.10289},
year={2020}
}

Predicted Entities

hate speech, normal, offensive


How to use

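from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification

# Assumes an active Spark NLP session named `spark` (for example from sparknlp.start())

# Wrap the raw text column into Spark NLP document annotations
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained classifier; one label per document goes to the 'class' column
sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_hatexplain', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

example = spark.createDataFrame([['I love you very much!']]).toDF("text")
result = pipeline.fit(example).transform(example)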

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS on a local Seq

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

// Named sequenceClassifier so it matches the stage passed to the pipeline below
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_hatexplain", "en")
    .setInputCols("document", "token")
    .setOutputCol("class")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("I love you very much!").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

import nlu
nlu.load("en.classify.bert.hate.").predict("""I love you very much!""")

Results

['normal']
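
The label shown above sits inside the `class` annotation column of the transformed DataFrame. The following is a minimal sketch of pulling it out, assuming the usual Spark NLP annotation layout in which each annotation exposes a `result` field. The helper name `predicted_labels` and the `sample_rows` stand-in are hypothetical and only for illustration; plain dictionaries are used in place of collected Spark rows so the snippet runs without a cluster.

```python
def predicted_labels(rows, col="class"):
    """Return one list of label strings per row.

    `rows` is any iterable of mappings whose `col` entry is a list of
    annotations, each exposing a "result" field (assumed schema).
    """
    return [[annotation["result"] for annotation in row[col]] for row in rows]


# Stand-in for result.select("class").collect(); the value is illustrative only.
sample_rows = [{"class": [{"result": "normal"}]}]

labels = predicted_labels(sample_rows)
print(labels[0])
```

With the pipeline output from the Python example, the equivalent call would be `predicted_labels(result.select("class").collect())`, since collected Spark rows can also be indexed by field name.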

Model Information

Model Name:	bert_sequence_classifier_hatexplain
Compatibility:	Spark NLP 3.3.2+
License:	Open Source
Edition:	Official
Input Labels:	[token, sentence]
Output Labels:	[label]
Language:	en
Case sensitive:	true

Data Source

https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain

Benchmarking

+-------+------------+--------+
| Acc   | Macro F1   | AUROC  |
+-------+------------+--------+
| 0.698 | 0.687      | 0.851  |
+-------+------------+--------+
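
For readers unfamiliar with the first two columns, the sketch below computes accuracy and macro F1 under their usual definitions: accuracy is the fraction of correct predictions, and macro F1 is the unweighted mean of the per-class F1 scores. The labels in `y_true` and `y_pred` are made-up toy values, not HateXplain data, and this is not the evaluation script behind the reported numbers.

```python
LABELS = ["hate speech", "normal", "offensive"]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy gold labels and predictions (illustrative only)
y_true = ["hate speech", "hate speech", "normal", "normal", "offensive", "offensive"]
y_pred = ["hate speech", "offensive", "normal", "normal", "offensive", "normal"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, LABELS)
print(round(acc, 3), round(f1, 3))
```

Because macro F1 weights every class equally, a weak score on a rare class pulls it down more than it pulls down accuracy.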
