class LanguageDetectorDL extends AnnotatorModel[LanguageDetectorDL] with HasSimpleAnnotate[LanguageDetectorDL] with WriteTensorflowModel with HasEngine

Language Identification and Detection by using CNN and RNN architectures in TensorFlow.

LanguageDetectorDL is an annotator that detects the language of documents or sentences, depending on the inputCols. The models are trained on large datasets such as Wikipedia and Tatoeba. Depending on the language (how similar its characters are to those of other languages), LanguageDetectorDL works best with text longer than 140 characters. The output is a language code in Wiki Code style.

Pretrained models can be loaded with pretrained of the companion object:

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("sentence")
  .setOutputCol("language")

If no values are provided, the default model is "ld_wiki_tatoeba_cnn_21" and the default language is "xx" (meaning multi-lingual). For available pretrained models, please see the Models Hub.
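
As a minimal sketch, the defaults named above can also be written out explicitly. This assumes Spark NLP is on the classpath, that a Spark session is already running, and that an upstream stage produces a "sentence" column; the two-argument pretrained(name, lang) form is assumed to be available on the companion object alongside the no-argument form.

```scala
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL

// Same model name and language as the documented defaults, spelled out.
// Assumption: pretrained(name, lang) is available on the companion object.
val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx")
  .setInputCols("sentence")   // assumes an upstream stage writes a "sentence" column
  .setOutputCol("language")
```

Passing a different model name from the Models Hub in place of the default should select that model instead.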

For extended examples of usage, see the Examples and the LanguageDetectorDLTestSpec.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("document")
  .setOutputCol("language")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    languageDetector
  ))

val data = Seq(
  "Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.",
  "Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.",
  "Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("language.result").show(false)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+
Linear Supertypes
  1. LanguageDetectorDL
  2. HasEngine
  3. WriteTensorflowModel
  4. HasSimpleAnnotate
  5. AnnotatorModel
  6. CanBeLazy
  7. RawAnnotator
  8. HasOutputAnnotationCol
  9. HasInputAnnotationCols
  10. HasOutputAnnotatorType
  11. ParamsAndFeaturesWritable
  12. HasFeatures
  13. DefaultParamsWritable
  14. MLWritable
  15. Model
  16. Transformer
  17. PipelineStage
  18. Logging
  19. Params
  20. Serializable
  21. Serializable
  22. Identifiable
  23. AnyRef
  24. Any

Parameters

A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.

  1. val alphabet: MapFeature[String, Int]

    Alphabet used to feed the TensorFlow model for prediction

  2. val coalesceSentences: BooleanParam

    Output average of sentences instead of one output per sentence (Default: true).

  3. val configProtoBytes: IntArrayParam

    ConfigProto from TensorFlow, serialized into a byte array. Get it with config_proto.SerializeToString().

  4. val engine: Param[String]

    This param is set internally once via loadSavedModel, which is why there is no setter.

    Definition Classes
    HasEngine
  5. val language: MapFeature[String, Int]

    Language used to map prediction to ISO 639-1 language codes

  6. val languages: StringArrayParam

    Languages the model was trained with.

  7. val threshold: FloatParam

    The minimum threshold for the final result. If the confidence is lower, the result will be either "unk" or the value set in thresholdLabel (Default: 0.1f). The value must be between 0.0 and 1.0. Try setting it lower if your text is hard to predict.

  8. val thresholdLabel: Param[String]

    Value for the classification, if confidence is less than threshold (Default: "unk").
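
The interaction between coalesceSentences, threshold and thresholdLabel described above can be illustrated with a small, library-free sketch. It is not the annotator's actual implementation: the object ThresholdSketch, the Scores type and both helper functions are hypothetical names, and the averaging step reflects only an assumed reading of "output average of sentences".

```scala
// Illustrative sketch only; this is NOT how LanguageDetectorDL is implemented.
object ThresholdSketch {
  // Hypothetical representation: language code -> confidence for one sentence.
  type Scores = Map[String, Float]

  // Average the per-language scores over all sentences
  // (the assumed meaning of coalesceSentences = true).
  def averageScores(sentences: Seq[Scores]): Scores = {
    val languages = sentences.flatMap(_.keys).distinct
    languages.map { lang =>
      val total = sentences.map(_.getOrElse(lang, 0.0f)).sum
      lang -> total / sentences.size
    }.toMap
  }

  // Return the most confident language, or thresholdLabel when the
  // best confidence is less than threshold.
  def pickLabel(
      scores: Scores,
      threshold: Float = 0.1f,
      thresholdLabel: String = "unk"): String =
    if (scores.isEmpty) thresholdLabel
    else {
      val (lang, confidence) = scores.reduceLeft((a, b) => if (b._2 > a._2) b else a)
      if (confidence >= threshold) lang else thresholdLabel
    }
}
```

With two hypothetical sentences scored as Map("en" -> 0.9f, "fr" -> 0.1f) and Map("en" -> 0.5f, "fr" -> 0.5f), averageScores gives "en" the higher average, so pickLabel returns "en"; a single score of 0.05f for "en" falls under the default threshold and yields "unk".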

Members

  1. type AnnotatorType = String
    Definition Classes
    HasOutputAnnotatorType
  1. def annotate(annotations: Seq[Annotation]): Seq[Annotation]

    Takes a document and annotations and produces new annotations of this annotator's annotation type

    annotations

    Annotations that correspond to inputAnnotationCols generated by previous annotators if any

    returns

    Any number of annotations processed for every input annotation; not necessarily a one-to-one relationship.

    Definition Classes
    LanguageDetectorDL → HasSimpleAnnotate
  2. final def clear(param: Param[_]): LanguageDetectorDL.this.type
    Definition Classes
    Params
  3. def copy(extra: ParamMap): LanguageDetectorDL

    Requirement for annotator copies.

    Definition Classes
    RawAnnotator → Model → Transformer → PipelineStage → Params
  4. def dfAnnotate: UserDefinedFunction

    Wraps annotate in a Spark SQL user-defined function so that it can operate on an org.apache.spark.sql.Column.

    returns

    udf function to be applied to inputCols using this annotator's annotate function as part of ML transformation

    Definition Classes
    HasSimpleAnnotate
  5. def explainParam(param: Param[_]): String
    Definition Classes
    Params
  6. def explainParams(): String
    Definition Classes
    Params
  7. final def extractParamMap(): ParamMap
    Definition Classes
    Params
  8. final def extractParamMap(extra: ParamMap): ParamMap
    Definition Classes
    Params
  9. val features: ArrayBuffer[Feature[_, _, _]]
    Definition Classes
    HasFeatures
  10. final def get[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  11. final def getDefault[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  12. def getInputCols: Array[String]

    returns

    input annotations columns currently used

    Definition Classes
    HasInputAnnotationCols
  13. def getLazyAnnotator: Boolean
    Definition Classes
    CanBeLazy
  14. final def getOrDefault[T](param: Param[T]): T
    Definition Classes
    Params
  15. final def getOutputCol: String

    Gets the name of the annotation column that will be generated.

    Definition Classes
    HasOutputAnnotationCol
  16. def getParam(paramName: String): Param[Any]
    Definition Classes
    Params
  17. final def hasDefault[T](param: Param[T]): Boolean
    Definition Classes
    Params
  18. def hasParam(paramName: String): Boolean
    Definition Classes
    Params
  19. def hasParent: Boolean
    Definition Classes
    Model
  20. val inputAnnotatorTypes: Array[String]

    Annotator reference id. Used to identify elements in metadata or to refer to this annotator type

    Definition Classes
    LanguageDetectorDL → HasInputAnnotationCols
  21. final def isDefined(param: Param[_]): Boolean
    Definition Classes
    Params
  22. final def isSet(param: Param[_]): Boolean
    Definition Classes
    Params
  23. val lazyAnnotator: BooleanParam
    Definition Classes
    CanBeLazy
  24. def onWrite(path: String, spark: SparkSession): Unit
  25. val optionalInputAnnotatorTypes: Array[String]
    Definition Classes
    HasInputAnnotationCols
  26. val outputAnnotatorType: AnnotatorType
  27. lazy val params: Array[Param[_]]
    Definition Classes
    Params
  28. var parent: Estimator[LanguageDetectorDL]
    Definition Classes
    Model
  29. def save(path: String): Unit
    Definition Classes
    MLWritable
    Annotations
    @Since( "1.6.0" ) @throws( ... )
  30. final def set[T](param: Param[T], value: T): LanguageDetectorDL.this.type
    Definition Classes
    Params
  31. final def setInputCols(value: String*): LanguageDetectorDL.this.type
    Definition Classes
    HasInputAnnotationCols
  32. def setInputCols(value: Array[String]): LanguageDetectorDL.this.type

    Overrides the required annotator columns if they differ from the default.

    Definition Classes
    HasInputAnnotationCols
  33. def setLazyAnnotator(value: Boolean): LanguageDetectorDL.this.type
    Definition Classes
    CanBeLazy
  34. final def setOutputCol(value: String): LanguageDetectorDL.this.type

    Overrides annotation column name when transforming

    Definition Classes
    HasOutputAnnotationCol
  35. def setParent(parent: Estimator[LanguageDetectorDL]): LanguageDetectorDL
    Definition Classes
    Model
  36. def toString(): String
    Definition Classes
    Identifiable → AnyRef → Any
  37. final def transform(dataset: Dataset[_]): DataFrame

    Given that the requirements are met, this applies the ML transformation within a Pipeline or stand-alone. The output annotation is generated as a new column, and previous annotations remain available separately. Metadata is built at schema level to record the annotations' structural information outside their content.

    dataset

    Dataset[Row]

    Definition Classes
    AnnotatorModel → Transformer
  38. def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
    Definition Classes
    Transformer
    Annotations
    @Since( "2.0.0" )
  39. def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
    Definition Classes
    Transformer
    Annotations
    @Since( "2.0.0" ) @varargs()
  40. final def transformSchema(schema: StructType): StructType

    Requirement for pipeline transformation validation. It is called on fit().

    Definition Classes
    RawAnnotator → PipelineStage
  41. val uid: String
    Definition Classes
    LanguageDetectorDL → Identifiable
  42. def write: MLWriter
    Definition Classes
    ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable
  43. def writeTensorflowHub(path: String, tfPath: String, spark: SparkSession, suffix: String = "_use"): Unit
    Definition Classes
    WriteTensorflowModel
  44. def writeTensorflowModel(path: String, spark: SparkSession, tensorflow: TensorflowWrapper, suffix: String, filename: String, configProtoBytes: Option[Array[Byte]] = None): Unit
    Definition Classes
    WriteTensorflowModel
  45. def writeTensorflowModelV2(path: String, spark: SparkSession, tensorflow: TensorflowWrapper, suffix: String, filename: String, configProtoBytes: Option[Array[Byte]] = None, savedSignatures: Option[Map[String, String]] = None): Unit
    Definition Classes
    WriteTensorflowModel

Parameter setters

  1. def setAlphabet(value: Map[String, Int]): LanguageDetectorDL.this.type

  2. def setCoalesceSentences(value: Boolean): LanguageDetectorDL.this.type

  3. def setConfigProtoBytes(bytes: Array[Int]): LanguageDetectorDL.this.type

  4. def setLanguage(value: Map[String, Int]): LanguageDetectorDL.this.type

  5. def setModelIfNotSet(spark: SparkSession, tensorflow: TensorflowWrapper): LanguageDetectorDL.this.type

  6. def setThreshold(threshold: Float): LanguageDetectorDL.this.type

  7. def setThresholdLabel(label: String): LanguageDetectorDL.this.type
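
The setters listed above return the annotator itself, so they can be chained when configuring a stage. The following is a minimal sketch, assuming Spark NLP is on the classpath, a Spark session is running, and an upstream stage produces a "sentence" column; the threshold value and label are arbitrary illustrative choices, not recommendations.

```scala
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("sentence")      // assumes an upstream stage writes a "sentence" column
  .setOutputCol("language")
  .setThreshold(0.5f)            // illustrative value; the documented default is 0.1f
  .setThresholdLabel("unknown")  // illustrative label; the documented default is "unk"
  .setCoalesceSentences(false)   // one output per sentence instead of the average
```

Raising the threshold makes the fallback label more likely on short or ambiguous text, while lowering it accepts less confident predictions.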

Parameter getters

  1. def getCoalesceSentences: Boolean

  2. def getConfigProtoBytes: Option[Array[Byte]]

  3. def getEngine: String

    Definition Classes
    HasEngine
  4. def getLanguage: Array[String]

  5. def getModelIfNotSet: TensorflowLD

  6. def getThreshold: Float

  7. def getThresholdLabel: String