Description
Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. LanguageDetectorDL
is an annotator that detects the language of documents or sentences depending on the inputCols
. In addition, LanguageDetetorDL
can accurately detect language from documents with mixed languages by coalescing sentences and select the best candidate.
We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The model is trained on Wikipedia dataset with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: https://en.wikipedia.org/wiki/List_of_Wikipedias.
This model can detect the following languages:
Achinese
, Afrikaans
, Tosk Albanian
, Amharic
, Aragonese
, Old English
, Arabic
, Egyptian Arabic
, Assamese
, Asturian
, Avaric
, Aymara
, Azerbaijani
, South Azerbaijani
, Bashkir
, Bavarian
, bat-smg
, Central Bikol
, Belarusian
, Bulgarian
, bh
, Banjar
, Bengali
, Tibetan
, Bishnupriya
, Breton
, Bosnian
, Russia Buriat
, Catalan
, cbk-zam
, Min Dong Chinese
, Chechen
, Cebuano
, Central Kurdish (Soranî)
, Corsican
, Crimean Tatar
, Czech
, Kashubian
, Chuvash
, Welsh
, Danish
, German
, Dimli (individual language)
, Lower Sorbian
, dty
, Dhivehi
, Greek
, eml
, English
, Esperanto
, Spanish
, Estonian
, Extremaduran
, Persian
, Finnish
, fiu-vro
, Faroese
, French
, Arpitan
, Friulian
, Frisian
, Irish
, Gagauz
, Scottish Gaelic
, Galician
, Gilaki
, Guarani
, Konkani (Goan)
, Gujarati
, Manx
, Hausa
, Hakka Chinese
, Hebrew
, Hindi
, Fiji Hindi
, Croatian
, Upper Sorbian
, Haitian Creole
, Hungarian
, Armenian
, Interlingua
, Indonesian
, Interlingue
, Igbo
, Ilocano
, Ido
, Icelandic
, Italian
, Japanese
, Jamaican Patois
, Lojban
, Javanese
, Georgian
, Karakalpak
, Kabyle
, Kabardian
, Kazakh
, Khmer
, Kannada
, Korean
, Komi-Permyak
, Karachay-Balkar
, Kölsch
, Kurdish
, Komi
, Cornish
, Kyrgyz
, Latin
, Ladino
, Luxembourgish
, Lezghian
, Luganda
, Limburgan
, Ligurian
, Lombard
, Lingala
, Lao
, Northern Luri
, Lithuanian
, Latgalian
, Latvian
, Maithili
, map-bms
, Moksha
, Malagasy
, Meadow Mari
, Maori
, Minangkabau
, Macedonian
, Malayalam
, Mongolian
, Marathi
, Hill Mari
, Malay
, Maltese
, Mirandese
, Burmese
, Erzya
, Mazanderani
, Nahuatl
, Neapolitan
, Low German (Low Saxon)
, nds-nl
, Nepali
, Newari
, Dutch
, Norwegian Nynorsk
, Norwegian
, Narom
, Pedi
, Navajo
, Occitan
, Livvi
, Oromo
, Odia (Oriya)
, Ossetian
, Punjabi (Eastern)
, Pangasinan
, Kapampangan
, Papiamento
, Picard
, Pennsylvania German
, Palatine German
, Polish
, Punjabi (Western)
, Pashto
, Portuguese
, Quechua
, Romansh
, Romanian
, roa-rup
, roa-tara
, Russian
, Rusyn
, Kinyarwanda
, Sanskrit
, Yakut
, Sardinian
, Sicilian
, Sindhi
, Northern Sami
, Serbo-Croatian
, Sinhala
, Slovak
, Slovenian
, Shona
, Somali
, Albanian
, Serbian
, Sranan Tongo
, Saterland Frisian
, Sundanese
, Swedish
, Swahili
, Silesian
, Tamil
, Tulu
, Telugu
, Tetun
, Tajik
, Thai
, Turkmen
, Tagalog
, Setswana
, Tongan
, Turkish
, Tatar
, Tuvinian
, Udmurt
, Uyghur
, Ukrainian
, Urdu
, Uzbek
, Venetian
, Veps
, Vietnamese
, Vlaams
, Volapük
, Walloon
, Waray
, Wolof
, Shanghainese
, Xhosa
, Mingrelian
, Yiddish
, Yoruba
, Zeeuws
, Chinese
, zh-classical
, zh-min-nan
, zh-yue
.
Predicted Entities
ace
, af
, als
, am
, an
, ang
, ar
, arz
, as
, ast
, av
, ay
, az
, azb
, ba
, bar
, bat-smg
, bcl
, be
, bg
, bh
, bjn
, bn
, bo
, bpy
, br
, bs
, bxr
, ca
, cbk-zam
, cdo
, ce
, ceb
, ckb
, co
, crh
, cs
, csb
, cv
, cy
, da
, de
, diq
, dsb
, dty
, dv
, el
, eml
, en
, eo
, es
, et
, ext
, fa
, fi
, fiu-vro
, fo
, fr
, frp
, fur
, fy
, ga
, gag
, gd
, gl
, glk
, gn
, gom
, gu
, gv
, ha
, hak
, he
, hi
, hif
, hr
, hsb
, ht
, hu
, hy
, ia
, id
, ie
, ig
, ilo
, io
, is
, it
, ja
, jam
, jbo
, jv
, ka
, kaa
, kab
, kbd
, kk
, km
, kn
, ko
, koi
, krc
, ksh
, ku
, kv
, kw
, ky
, la
, lad
, lb
, lez
, lg
, li
, lij
, lmo
, ln
, lo
, lrc
, lt
, ltg
, lv
, mai
, map-bms
, mdf
, mg
, mhr
, mi
, min
, mk
, ml
, mn
, mr
, mrj
, ms
, mt
, mwl
, my
, myv
, mzn
, nah
, nap
, nds
, nds-nl
, ne
, new
, nl
, nn
, no
, nrm
, nso
, nv
, oc
, olo
, om
, or
, os
, pa
, pag
, pam
, pap
, pcd
, pdc
, pfl
, pl
, pnb
, ps
, pt
, qu
, rm
, ro
, roa-rup
, roa-tara
, ru
, rue
, rw
, sa
, sah
, sc
, scn
, sd
, se
, sh
, si
, sk
, sl
, sn
, so
, sq
, sr
, srn
, stq
, su
, sv
, sw
, szl
, ta
, tcy
, te
, tet
, tg
, th
, tk
, tl
, tn
, to
, tr
, tt
, tyv
, udm
, ug
, uk
, ur
, uz
, vec
, vep
, vi
, vls
, vo
, wa
, war
, wo
, wuu
, xh
, xmf
, yi
, yo
, zea
, zh
, zh-classical
, zh-min-nan
, zh-yue
.
Open in Colab Download Copy S3 URI
How to use
...
language_detector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("language")
languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector])
light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.")
...
val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx")
.setInputCols("sentence")
.setOutputCol("language")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector))
val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."]
lang_df = nlu.load('xx.classify.wiki_231').predict(text, output_level='sentence')
lang_df
Results
'fr'
Model Information
Model Name: | ld_wiki_cnn_231 |
Compatibility: | Spark NLP 2.7.0+ |
Edition: | Official |
Input Labels: | [sentence] |
Output Labels: | [language] |
Language: | xx |
Data Source
Wikipedia
Benchmarking
Evaluated on Europarl dataset which the model has never seen:
+--------+-----+-------+------------------+
|src_lang|count|correct| precision|
+--------+-----+-------+------------------+
| fr| 1000| 996| 0.996|
| fi| 1000| 995| 0.995|
| sv| 1000| 994| 0.994|
| en| 1000| 991| 0.991|
| pt| 1000| 988| 0.988|
| de| 1000| 986| 0.986|
| it| 1000| 982| 0.982|
| es| 1000| 977| 0.977|
| nl| 1000| 974| 0.974|
| lt| 1000| 969| 0.969|
| hu| 880| 850|0.9659090909090909|
| lv| 916| 884|0.9650655021834061|
| el| 1000| 965| 0.965|
| pl| 914| 882|0.9649890590809628|
| cs| 1000| 964| 0.964|
| da| 1000| 963| 0.963|
| et| 928| 892|0.9612068965517241|
| bg| 1000| 954| 0.954|
| sk| 1000| 945| 0.945|
| ro| 784| 738|0.9413265306122449|
| sl| 914| 850|0.9299781181619255|
+--------+-----+-------+------------------+
+-------+--------------------+
|summary| precision|
+-------+--------------------+
| count| 21|
| mean| 0.9700702474999693|
| stddev|0.018256955176991118|
| min| 0.9299781181619255|
| max| 0.996|
+-------+--------------------+