Description
This model was imported from Hugging Face and fine-tuned for Russian. It uses BERT embeddings with BertForSequenceClassification for text classification.
Predicted Entities
neutral, toxic
How to use
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_toxicity', 'ru') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class')

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

example = spark.createDataFrame([["I hate you, idiot."]]).toDF("text")  # Russian: "Ненавижу тебя, идиот."
result = pipeline.fit(example).transform(example)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Tokenizer, BertForSequenceClassification}
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_toxicity", "ru")
    .setInputCols("document", "token")
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("Ненавижу тебя, идиот.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
import nlu
nlu.load("ru.classify.toxic").predict("""Ненавижу тебя, идиот.""")
Results
['toxic']
Model Information
| Model Name: | bert_sequence_classifier_toxicity |
| Compatibility: | Spark NLP 3.3.4+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [document, token] |
| Output Labels: | [class] |
| Language: | ru |
| Size: | 665.1 MB |
| Case sensitive: | true |
| Max sentence length: | 512 |
Data Source
https://huggingface.co/SkolkovoInstitute/russian_toxicity_classifier
Benchmarking
| label | precision | recall | f1-score | support |
| neutral | 0.98 | 0.99 | 0.98 | 21384 |
| toxic | 0.94 | 0.92 | 0.93 | 4886 |
| accuracy | - | - | 0.97 | 26270 |
| macro-avg | 0.96 | 0.96 | 0.96 | 26270 |
| weighted-avg | 0.97 | 0.97 | 0.97 | 26270 |
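As a sanity check on the summary rows above, the macro and weighted averages can be recomputed from the per-class rows. A minimal sketch in plain Python, using the per-class numbers copied directly from the benchmark table (variable names are illustrative):

```python
# Per-class scores and support, taken from the benchmark table above.
scores = {
    "neutral": {"precision": 0.98, "recall": 0.99, "f1": 0.98, "support": 21384},
    "toxic":   {"precision": 0.94, "recall": 0.92, "f1": 0.93, "support": 4886},
}

total = sum(s["support"] for s in scores.values())  # 26270 examples overall

# Macro average: unweighted mean over the classes.
macro_precision = sum(s["precision"] for s in scores.values()) / len(scores)

# Weighted average: mean weighted by each class's support.
weighted_f1 = sum(s["f1"] * s["support"] for s in scores.values()) / total

print(round(macro_precision, 2))  # 0.96, matching the macro-avg row
print(round(weighted_f1, 2))      # 0.97, matching the weighted-avg row
```

The macro average treats both classes equally, while the weighted average is dominated by the much larger neutral class, which is why the weighted scores track the neutral row so closely.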