Cyberbullying Classifier for Turkish Texts

Description

Identifies whether a Turkish text contains cyberbullying or not.

Predicted Entities

Negative, Positive

How to use

...
# BERTurk token embeddings, computed over the lemmatized tokens
berturk_embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr") \
    .setInputCols("document", "lemma") \
    .setOutputCol("embeddings")

# Average the token embeddings into a single vector per document
embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

# Pretrained cyberbullying classifier that consumes the pooled embeddings
document_classifier = ClassifierDLModel.pretrained('classifierdl_berturk_cyberbullying', 'tr') \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

berturk_pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer, stopwords_cleaner, lemma, berturk_embeddings, embeddingsSentence, document_classifier])

# Fit on a placeholder DataFrame, then wrap the fitted model for in-memory inference
light_pipeline = LightPipeline(berturk_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

result = light_pipeline.annotate("""Gidişin olsun, dönüşün olmasın inşallah senin..""")
result["class"]
...
val berturk_embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr")
  .setInputCols("document", "lemma")
  .setOutputCol("embeddings")

val embeddingsSentence = SentenceEmbeddings()
  .setInputCols(Array("document", "embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val document_classifier = ClassifierDLModel.pretrained("classifierdl_berturk_cyberbullying", "tr")
  .setInputCols(Array("document", "sentence_embeddings"))
  .setOutputCol("class")

val berturk_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, normalizer, stopwords_cleaner, lemma, berturk_embeddings, embeddingsSentence, document_classifier))

val light_pipeline = new LightPipeline(berturk_pipeline.fit(spark.createDataFrame(Seq(Tuple1(""))).toDF("text")))

val result = light_pipeline.annotate("Gidişin olsun, dönüşün olmasın inşallah senin..")

import nlu
nlu.load("tr.classify.cyberbullying").predict("""Gidişin olsun, dönüşün olmasın inşallah senin..""")
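
The "AVERAGE" pooling strategy set on SentenceEmbeddings in the pipelines above collapses the per-token embeddings into a single vector by taking their element-wise mean. The sketch below illustrates that idea in plain Python; the function name `average_pool` and the toy 3-dimensional vectors are assumptions made for illustration and are not part of Spark NLP.

```python
from typing import List


def average_pool(token_vectors: List[List[float]]) -> List[float]:
    """Return the element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy vectors standing in for real BERTurk token embeddings (hypothetical values)
toy_tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
pooled = average_pool(toy_tokens)
print(pooled)  # [2.0, 3.0, 4.0]
```

The classifier then receives one such pooled vector per document instead of a variable-length list of token vectors.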

Results

['Negative']
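
LightPipeline's annotate call returns a dictionary that maps each output column name to a list of strings, which is why result["class"] above yields a one-element list. A minimal sketch of pulling out the single label follows; the helper `first_label` and the hand-written `sample_result` dictionary are hypothetical stand-ins, not output captured from the model.

```python
from typing import Dict, List, Optional


def first_label(annotation: Dict[str, List[str]], column: str = "class") -> Optional[str]:
    """Return the first label in the given output column, or None if it is absent or empty."""
    labels = annotation.get(column, [])
    return labels[0] if labels else None


# Hand-written stand-in for the dictionary returned by light_pipeline.annotate(...)
sample_result = {"class": ["Negative"]}
print(first_label(sample_result))  # Negative
```

Returning None for a missing column keeps downstream code from failing on an IndexError when a pipeline is configured with a different output column name.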

Model Information

Model Name: classifierdl_berturk_cyberbullying
Compatibility: Spark NLP 3.1.2+
License: Open Source
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: tr

Data Source

Trained on a custom dataset using Turkish BERT embeddings (BERTurk).

Benchmarking

              precision    recall  f1-score   support

    Negative       0.83      0.80      0.81       970
    Positive       0.84      0.87      0.86      1225

    accuracy                           0.84      2195
   macro avg       0.84      0.83      0.84      2195
weighted avg       0.84      0.84      0.84      2195
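
The summary rows can be approximately recomputed from the per-class rows, assuming the table follows the usual classification-report convention: the macro average is the unweighted mean over classes, and the weighted average uses each class's support as its weight. The sketch below is a sanity check only. Because the published figures are rounded to two decimals, the recomputed values agree with the table only to within rounding.

```python
# Per-class figures copied from the benchmarking table (already rounded to 2 decimals)
rows = {
    "Negative": {"precision": 0.83, "recall": 0.80, "f1": 0.81, "support": 970},
    "Positive": {"precision": 0.84, "recall": 0.87, "f1": 0.86, "support": 1225},
}

total_support = sum(r["support"] for r in rows.values())


def macro_avg(metric: str) -> float:
    """Unweighted mean of a metric across classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)


def weighted_avg(metric: str) -> float:
    """Mean of a metric across classes, weighted by class support."""
    return sum(r[metric] * r["support"] for r in rows.values()) / total_support


for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(metric), 3), round(weighted_avg(metric), 3))
```

The support-weighted recall is, by definition, the overall accuracy, so it should land close to the 0.84 reported in the accuracy row.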