Description
This model detects hate speech in English. The "mono" in the name refers to the monolingual setting, in which the model is trained using only English-language data. It is fine-tuned from the multilingual BERT model. The model was trained with several learning rates; the best validation score, 0.726030, was achieved with a learning rate of 2e-5. The training code is available at https://github.com/punyajoy/DE-LIMIT
For more details, see our paper:
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. “Deep Learning Models for Multilingual Hate Speech Detection”. Accepted at ECML-PKDD 2020.
@article{aluru2020deep,
  title={Deep Learning Models for Multilingual Hate Speech Detection},
  author={Aluru, Sai Saketh and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2004.06465},
  year={2020}
}
Predicted Entities
NON_HATE, HATE
How to use
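# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Convert raw text into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained hate speech classifier
sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_dehatebert_mono', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['I love you!']]).toDF("text")
result = pipeline.fit(example).transform(example)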
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Needed for .toDS; assumes an active SparkSession named spark
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Load the pretrained hate speech classifier
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_dehatebert_mono", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("I love you!").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
# Python (NLU)
import nlu

nlu.load("en.classify.bert_sequence.dehatebert_mono").predict("""I love you!""")
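The pipeline examples above stop at the transformed DataFrame without showing how to read the prediction. The following is a minimal sketch of one way to pull the predicted label out of the collected output. It assumes the `class` column holds a list of annotations that each expose a `result` field, as Spark NLP annotations generally do. The helper name `extract_labels` and the stand-in rows are hypothetical and used only for illustration; the labels in the stand-in rows are placeholders, not actual model output.

```python
from typing import Any, Iterable, List, Optional, Tuple


def extract_labels(
    rows: Iterable[Any],
    text_col: str = "text",
    class_col: str = "class",
) -> List[Tuple[str, Optional[str]]]:
    """Return (text, label) pairs from collected pipeline output rows.

    Hypothetical helper: each row must support item access by column name,
    and each annotation must support item access to its "result" field.
    Plain dicts satisfy this, and so do PySpark Row objects.
    """
    pairs = []
    for row in rows:
        annotations = row[class_col]
        # Take the first annotation's label; None if the list is empty
        label = annotations[0]["result"] if annotations else None
        pairs.append((row[text_col], label))
    return pairs


# Stand-in rows shaped like collected output (placeholder labels,
# not produced by the model)
sample_rows = [
    {"text": "first example text", "class": [{"result": "NON_HATE"}]},
    {"text": "second example text", "class": [{"result": "HATE"}]},
    {"text": "row with no annotation", "class": []},
]

for text, label in extract_labels(sample_rows):
    print(f"{label}: {text}")
```

With the `result` DataFrame from the Python pipeline, the same helper could be called as `extract_labels(result.select("text", "class").collect())`. For a quick look without any helper, `result.select("class.result").show(truncate=False)` displays the labels directly.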
Model Information
Model Name: | bert_sequence_classifier_dehatebert_mono |
Compatibility: | Spark NLP 3.3.2+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [token, document] |
Output Labels: | [class] |
Language: | en |
Case sensitive: | false |
Max sentence length: | 512 |
Data Source
https://huggingface.co/Hate-speech-CNERG/dehatebert-mono-english