sparknlp.annotator.ld_dl.language_detector_dl#

Contains classes for LanguageDetectorDL.

Module Contents#

Classes#

LanguageDetectorDL

Language Identification and Detection using CNN and RNN architectures

class LanguageDetectorDL(classname='com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL', java_model=None)[source]#

Language Identification and Detection using CNN and RNN architectures in TensorFlow.

LanguageDetectorDL is an annotator that detects the language of documents or sentences, depending on the inputCols. The models are trained on large datasets such as Wikipedia and Tatoeba. Depending on the language (i.e., how similar its characters are to those of other languages), LanguageDetectorDL works best with text longer than 140 characters. The output is a language code in Wiki Code style.

Pretrained models can be loaded with pretrained() of the companion object:

>>> languageDetector = LanguageDetectorDL.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("language")

If no values are provided, the default model is "ld_wiki_tatoeba_cnn_21" and the default language is "xx" (meaning multilingual).

For available pretrained models, please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT

Output Annotation type: LANGUAGE

Parameters:
configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

threshold

The minimum threshold for the final result; otherwise the result will be either neutral or the value set in thresholdLabel, by default 0.5 (see the sketch after this list)

thresholdLabel

The label to assign in case the score is less than threshold, by default Unknown

coalesceSentences

If set to true, the output of all sentences will be averaged to one output instead of one output per sentence, by default True.

languages

The languages used to train the model

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> languageDetector = LanguageDetectorDL.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("language")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       languageDetector
...     ])
>>> data = spark.createDataFrame([
...     ["Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages."],
...     ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."],
...     ["Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("language.result").show(truncate=False)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+
setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setThreshold(v)[source]#

Sets the minimum threshold for the final result; otherwise the result will be either neutral or the value set in thresholdLabel, by default 0.5.

Parameters:
v : float

Minimum threshold for the final result

setThresholdLabel(p)[source]#

Sets the label to assign in case the score is less than threshold, by default Unknown.

Parameters:
p : str

The replacement label.

setCoalesceSentences(value)[source]#

Sets whether the output of all sentences will be averaged to one output instead of one output per sentence, by default True.

Parameters:
value : bool

Whether the output of all sentences will be averaged to one output

static pretrained(name='ld_wiki_tatoeba_cnn_21', lang='xx', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "ld_wiki_tatoeba_cnn_21"

lang : str, optional

Language of the pretrained model, by default "xx"

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
LanguageDetectorDL

The restored model