sparknlp.annotator.ld_dl.language_detector_dl
Contains classes for LanguageDetectorDL.
Module Contents#
Classes#
LanguageDetectorDL: Language Identification and Detection by using CNN and RNN architectures
- class LanguageDetectorDL(classname='com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL', java_model=None)[source]#
Language Identification and Detection by using CNN and RNN architectures in TensorFlow.
LanguageDetectorDL is an annotator that detects the language of documents or sentences, depending on the inputCols. The models are trained on large datasets such as Wikipedia and Tatoeba. Depending on the language (how similar the characters are), LanguageDetectorDL works best with text longer than 140 characters. The output is a language code in Wiki Code style.
Pretrained models can be loaded with pretrained() of the companion object:
>>> languageDetector = LanguageDetectorDL.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("language")
The default model is "ld_wiki_tatoeba_cnn_21" and the default language is "xx" (meaning multi-lingual), if no values are provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: LANGUAGE
- Parameters:
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- threshold
The minimum threshold for the final result; otherwise the result will be either neutral or the value set in thresholdLabel, by default 0.5
- thresholdLabel
The label to assign in case the score is less than threshold, by default Unknown
- coalesceSentences
If set to true, the output of all sentences will be averaged into one output instead of one output per sentence, by default True.
- languages
The languages used to train the model
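How threshold and thresholdLabel combine can be sketched in plain Python. This is a hypothetical illustration of the documented behavior, not Spark NLP's actual implementation; the function name and score dictionaries are made up for the example.

```python
# Hypothetical sketch (an assumption, not Spark NLP's code) of how the
# threshold and thresholdLabel parameters plausibly interact: the
# top-scoring language is kept only if its confidence reaches threshold.

def resolve_label(scores, threshold=0.5, threshold_label="Unknown"):
    """Pick the best-scoring language, falling back to threshold_label
    when the best score is below threshold."""
    best_lang = max(scores, key=scores.get)
    if scores[best_lang] >= threshold:
        return best_lang
    return threshold_label

print(resolve_label({"en": 0.91, "fr": 0.06, "de": 0.03}))  # confident: en
print(resolve_label({"en": 0.34, "fr": 0.33, "de": 0.33}))  # below 0.5: Unknown
```

Lowering the threshold (setThreshold) trades fewer Unknown labels for less reliable predictions on short or ambiguous text.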
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> languageDetector = LanguageDetectorDL.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("language")
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         languageDetector
...     ])
>>> data = spark.createDataFrame([
...     ["Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages."],
...     ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."],
...     ["Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("language.result").show(truncate=False)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- bList[int]
ConfigProto from tensorflow, serialized into byte array
- setThreshold(v)[source]#
Sets the minimum threshold for the final result; otherwise it will be either neutral or the value set in thresholdLabel, by default 0.5.
- Parameters:
- vfloat
Minimum threshold for the final result
- setThresholdLabel(p)[source]#
Sets the label to assign in case the score is less than threshold, by default Unknown.
- Parameters:
- pstr
The replacement label.
- setCoalesceSentences(value)[source]#
Sets whether the output of all sentences will be averaged into one output instead of one output per sentence, by default True.
- Parameters:
- valuebool
Whether the output of all sentences will be averaged into one output
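The effect of coalescing can be sketched as averaging per-sentence score maps into one document-level score map. This is a minimal illustration of the documented behavior under that assumption, not the annotator's actual code; the function name and scores are invented for the example.

```python
# Hypothetical sketch (an assumption, not Spark NLP's code) of what
# coalesceSentences=True plausibly does: per-sentence language scores
# are averaged into a single score map for the whole document.

def coalesce_sentences(sentence_scores):
    """Average per-language confidence scores across all sentences."""
    n = len(sentence_scores)
    languages = sentence_scores[0].keys()
    return {lang: sum(s[lang] for s in sentence_scores) / n for lang in languages}

doc_scores = coalesce_sentences([
    {"en": 1.0, "fr": 0.0},  # sentence 1: clearly English
    {"en": 0.5, "fr": 0.5},  # sentence 2: ambiguous
])
print(doc_scores)  # {'en': 0.75, 'fr': 0.25}
```

With coalescing disabled, each sentence would instead keep its own LANGUAGE annotation, which is useful for mixed-language documents.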
- static pretrained(name='ld_wiki_tatoeba_cnn_21', lang='xx', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- namestr, optional
Name of the pretrained model, by default “ld_wiki_tatoeba_cnn_21”
- langstr, optional
Language of the pretrained model, by default “xx”
- remote_locstr, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- LanguageDetectorDL
The restored model