Fast and Accurate Language Identification - 231 Languages (CNN)

Description

Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. LanguageDetectorDL is an annotator that detects the language of documents or sentences depending on the inputCols. In addition, LanguageDetetorDL can accurately detect language from documents with mixed languages by coalescing sentences and select the best candidate.

We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The model is trained on Wikipedia dataset with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: https://en.wikipedia.org/wiki/List_of_Wikipedias.

This model can detect the following languages:

Achinese, Afrikaans, Tosk Albanian, Amharic, Aragonese, Old English, Arabic, Egyptian Arabic, Assamese, Asturian, Avaric, Aymara, Azerbaijani, South Azerbaijani, Bashkir, Bavarian, bat-smg, Central Bikol, Belarusian, Bulgarian, bh, Banjar, Bengali, Tibetan, Bishnupriya, Breton, Bosnian, Russia Buriat, Catalan, cbk-zam, Min Dong Chinese, Chechen, Cebuano, Central Kurdish (Soranî), Corsican, Crimean Tatar, Czech, Kashubian, Chuvash, Welsh, Danish, German, Dimli (individual language), Lower Sorbian, dty, Dhivehi, Greek, eml, English, Esperanto, Spanish, Estonian, Extremaduran, Persian, Finnish, fiu-vro, Faroese, French, Arpitan, Friulian, Frisian, Irish, Gagauz, Scottish Gaelic, Galician, Gilaki, Guarani, Konkani (Goan), Gujarati, Manx, Hausa, Hakka Chinese, Hebrew, Hindi, Fiji Hindi, Croatian, Upper Sorbian, Haitian Creole, Hungarian, Armenian, Interlingua, Indonesian, Interlingue, Igbo, Ilocano, Ido, Icelandic, Italian, Japanese, Jamaican Patois, Lojban, Javanese, Georgian, Karakalpak, Kabyle, Kabardian, Kazakh, Khmer, Kannada, Korean, Komi-Permyak, Karachay-Balkar, Kölsch, Kurdish, Komi, Cornish, Kyrgyz, Latin, Ladino, Luxembourgish, Lezghian, Luganda, Limburgan, Ligurian, Lombard, Lingala, Lao, Northern Luri, Lithuanian, Latgalian, Latvian, Maithili, map-bms, Moksha, Malagasy, Meadow Mari, Maori, Minangkabau, Macedonian, Malayalam, Mongolian, Marathi, Hill Mari, Malay, Maltese, Mirandese, Burmese, Erzya, Mazanderani, Nahuatl, Neapolitan, Low German (Low Saxon), nds-nl, Nepali, Newari, Dutch, Norwegian Nynorsk, Norwegian, Narom, Pedi, Navajo, Occitan, Livvi, Oromo, Odia (Oriya), Ossetian, Punjabi (Eastern), Pangasinan, Kapampangan, Papiamento, Picard, Pennsylvania German, Palatine German, Polish, Punjabi (Western), Pashto, Portuguese, Quechua, Romansh, Romanian, roa-rup, roa-tara, Russian, Rusyn, Kinyarwanda, Sanskrit, Yakut, Sardinian, Sicilian, Sindhi, Northern Sami, Serbo-Croatian, Sinhala, Slovak, Slovenian, Shona, Somali, Albanian, Serbian, Sranan Tongo, Saterland Frisian, Sundanese, Swedish, Swahili, Silesian, Tamil, Tulu, Telugu, Tetun, Tajik, Thai, Turkmen, Tagalog, Setswana, Tongan, Turkish, Tatar, Tuvinian, Udmurt, Uyghur, Ukrainian, Urdu, Uzbek, Venetian, Veps, Vietnamese, Vlaams, Volapük, Walloon, Waray, Wolof, Shanghainese, Xhosa, Mingrelian, Yiddish, Yoruba, Zeeuws, Chinese, zh-classical, zh-min-nan, zh-yue.

Predicted Entities

ace, af, als, am, an, ang, ar, arz, as, ast, av, ay, az, azb, ba, bar, bat-smg, bcl, be, bg, bh, bjn, bn, bo, bpy, br, bs, bxr, ca, cbk-zam, cdo, ce, ceb, ckb, co, crh, cs, csb, cv, cy, da, de, diq, dsb, dty, dv, el, eml, en, eo, es, et, ext, fa, fi, fiu-vro, fo, fr, frp, fur, fy, ga, gag, gd, gl, glk, gn, gom, gu, gv, ha, hak, he, hi, hif, hr, hsb, ht, hu, hy, ia, id, ie, ig, ilo, io, is, it, ja, jam, jbo, jv, ka, kaa, kab, kbd, kk, km, kn, ko, koi, krc, ksh, ku, kv, kw, ky, la, lad, lb, lez, lg, li, lij, lmo, ln, lo, lrc, lt, ltg, lv, mai, map-bms, mdf, mg, mhr, mi, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, nds-nl, ne, new, nl, nn, no, nrm, nso, nv, oc, olo, om, or, os, pa, pag, pam, pap, pcd, pdc, pfl, pl, pnb, ps, pt, qu, rm, ro, roa-rup, roa-tara, ru, rue, rw, sa, sah, sc, scn, sd, se, sh, si, sk, sl, sn, so, sq, sr, srn, stq, su, sv, sw, szl, ta, tcy, te, tet, tg, th, tk, tl, tn, to, tr, tt, tyv, udm, ug, uk, ur, uz, vec, vep, vi, vls, vo, wa, war, wo, wuu, xh, xmf, yi, yo, zea, zh, zh-classical, zh-min-nan, zh-yue.

Open in Colab Download Copy S3 URI

How to use

...
language_detector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("language")
languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector])
light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text")))
result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.")
...
val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx")
.setInputCols("sentence")
.setOutputCol("language")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector))
val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu

text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."]
lang_df = nlu.load('xx.classify.wiki_231').predict(text, output_level='sentence')
lang_df

Results

'fr'

Model Information

Model Name: ld_wiki_cnn_231
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [sentence]
Output Labels: [language]
Language: xx

Data Source

Wikipedia

Benchmarking

Evaluated on Europarl dataset which the model has never seen:

+--------+-----+-------+------------------+
|src_lang|count|correct|         precision|
+--------+-----+-------+------------------+
|      fr| 1000|    996|             0.996|
|      fi| 1000|    995|             0.995|
|      sv| 1000|    994|             0.994|
|      en| 1000|    991|             0.991|
|      pt| 1000|    988|             0.988|
|      de| 1000|    986|             0.986|
|      it| 1000|    982|             0.982|
|      es| 1000|    977|             0.977|
|      nl| 1000|    974|             0.974|
|      lt| 1000|    969|             0.969|
|      hu|  880|    850|0.9659090909090909|
|      lv|  916|    884|0.9650655021834061|
|      el| 1000|    965|             0.965|
|      pl|  914|    882|0.9649890590809628|
|      cs| 1000|    964|             0.964|
|      da| 1000|    963|             0.963|
|      et|  928|    892|0.9612068965517241|
|      bg| 1000|    954|             0.954|
|      sk| 1000|    945|             0.945|
|      ro|  784|    738|0.9413265306122449|
|      sl|  914|    850|0.9299781181619255|
+--------+-----+-------+------------------+

+-------+--------------------+
|summary|           precision|
+-------+--------------------+
|  count|                  21|
|   mean|  0.9700702474999693|
| stddev|0.018256955176991118|
|    min|  0.9299781181619255|
|    max|               0.996|
+-------+--------------------+