News Classifier for Turkish Text

Description

Classifies Turkish news texts into one of six categories.

Predicted Entities

kultur, saglik, ekonomi, teknoloji, siyaset, spor


How to use

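# Python
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Wrap the raw text column in a document annotation
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Multilingual LaBSE sentence embeddings
embeddings = BertSentenceEmbeddings\
    .pretrained('labse', 'xx')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "tr")\
    .setInputCols(["document", "sentence_embeddings"])\
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[document, embeddings, document_classifier])
# Every stage is pretrained, so fitting on an empty DataFrame is enough
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result = light_pipeline.annotate('Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı.')
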
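// Scala (assumes an active SparkSession named `spark`, as in spark-shell)
import com.johnsnowlabs.nlp.{DocumentAssembler, LightPipeline}
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings
  .pretrained("labse", "xx")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "tr")
  .setInputCols(Array("document", "sentence_embeddings"))
  .setOutputCol("class")

val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier))
// Every stage is pretrained, so fitting on an empty DataFrame is enough
val light_pipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDF("text")))
val result = light_pipeline.annotate("Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı.")
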
# NLU
import nlu
nlu.load("tr.classify.news").predict("""Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı.""")

Results

["spor"]
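To read the prediction programmatically, the label can be pulled out of the dictionary that `LightPipeline.annotate` returns in Python, which is keyed by output column name. The sketch below is illustrative only: `result` is a hand-written stand-in with the same shape rather than real pipeline output, and `predicted_label` and `LABEL_GLOSSES` are hypothetical helpers that are not part of Spark NLP.

```python
# Hand-written stand-in for the dictionary returned by light_pipeline.annotate(...).
# Assumption: the "class" key matches the classifier's setOutputCol("class").
result = {
    "class": ["spor"],
}

# English glosses for the six labels (added for illustration, not part of the model).
LABEL_GLOSSES = {
    "kultur": "culture",
    "saglik": "health",
    "ekonomi": "economy",
    "teknoloji": "technology",
    "siyaset": "politics",
    "spor": "sports",
}


def predicted_label(annotation, output_col="class"):
    """Return the first label in the given output column, or None if it is empty."""
    labels = annotation.get(output_col, [])
    return labels[0] if labels else None


label = predicted_label(result)
print(label, "->", LABEL_GLOSSES.get(label, "unknown"))
```

Returning `None` for a missing or empty column keeps the helper safe to call on inputs where the classifier produced no annotation.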

Model Information

Model Name: classifierdl_bert_news
Compatibility: Spark NLP 3.0.2+
License: Open Source
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: tr
Dependencies: labse_BERT

Data Source

Trained on a custom dataset with multilingual LaBSE BERT sentence embeddings (labse).

Benchmarking

              precision    recall  f1-score   support

     ekonomi       0.88      0.86      0.87       263
      kultur       0.93      0.96      0.94       277
      saglik       0.95      0.96      0.95       273
     siyaset       0.89      0.91      0.90       257
        spor       0.97      0.97      0.97       279
   teknoloji       0.94      0.88      0.91       250

    accuracy                           0.93      1599
   macro avg       0.93      0.92      0.93      1599
weighted avg       0.93      0.93      0.93      1599
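
As a rough cross-check of the summary rows, the macro average is the unweighted mean of the per-class scores and the weighted average is the mean weighted by each class's support. The sketch below recomputes both from the rounded per-class figures in the table; because the inputs are already rounded to two decimals, the results can differ from the reported values by about 0.01. The names `report`, `macro_avg`, and `weighted_avg` are illustrative.

```python
# Per-class figures copied from the benchmark table: (precision, recall, f1, support).
report = {
    "ekonomi":   (0.88, 0.86, 0.87, 263),
    "kultur":    (0.93, 0.96, 0.94, 277),
    "saglik":    (0.95, 0.96, 0.95, 273),
    "siyaset":   (0.89, 0.91, 0.90, 257),
    "spor":      (0.97, 0.97, 0.97, 279),
    "teknoloji": (0.94, 0.88, 0.91, 250),
}


def macro_avg(col):
    """Unweighted mean of one metric column across the classes."""
    values = [row[col] for row in report.values()]
    return sum(values) / len(values)


def weighted_avg(col):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(row[3] for row in report.values())
    return sum(row[col] * row[3] for row in report.values()) / total


total_support = sum(row[3] for row in report.values())
print("total support:", total_support)
for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:>9}  macro={macro_avg(col):.3f}  weighted={weighted_avg(col):.3f}")
```

The per-class supports sum to the 1599 shown in the summary rows, which confirms that the six classes cover the whole test set.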