class M2M100Transformer extends AnnotatorModel[M2M100Transformer] with HasBatchedAnnotate[M2M100Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

M2M100: multilingual translation model

M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.

The model can directly translate between the 9,900 directions of 100 languages.
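
The 9,900 figure follows from counting ordered (source, target) pairs of distinct languages: 100 × 99. A small check in plain Scala, with no Spark NLP dependency (the object and method names are made up for illustration):

```scala
object TranslationDirections {
  // Each direction is an ordered (source, target) pair with source != target.
  def count(numLanguages: Int): Int = numLanguages * (numLanguages - 1)

  def main(args: Array[String]): Unit =
    println(count(100)) // prints 9900
}
```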

Pretrained models can be loaded with pretrained of the companion object:

val m2m100 = M2M100Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")

The default model is "m2m100_418M", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see M2M100TestSpec.

References:

Paper Abstract:

Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.

Languages Covered:

Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val m2m100 = M2M100Transformer.pretrained("m2m100_418M")
  .setInputCols(Array("documents"))
  .setSrcLang("zh")
  .setTgtLang("en")
  .setMaxOutputLength(100)
  .setDoSample(false)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, m2m100))

val data = Seq(
  "生活就像一盒巧克力。"
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+-------------------------------------------------------------------------------------------+
|result                                                                                     |
+-------------------------------------------------------------------------------------------+
|[ Life is like a box of chocolate.]                                                        |
+-------------------------------------------------------------------------------------------+
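
The example above uses greedy decoding (setDoSample(false)). The following is a minimal sketch of the same pipeline with sampling enabled, using only setters listed on this page. The object name, the input sentence, and the values for topK, temperature, and the seed are illustrative assumptions, not recommended settings. It also assumes Spark NLP is on the classpath, that the session is created with SparkNLP.start(), and that the pretrained model can be downloaded. No expected output is shown because sampled translations can vary.

```scala
import com.johnsnowlabs.nlp.SparkNLP
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer
import org.apache.spark.ml.Pipeline

object M2M100SamplingExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark session started through Spark NLP's helper.
    val spark = SparkNLP.start()
    import spark.implicits._

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")

    // Sampling instead of greedy decoding; all numeric values are illustrative.
    val m2m100 = M2M100Transformer.pretrained("m2m100_418M")
      .setInputCols(Array("documents"))
      .setSrcLang("en")
      .setTgtLang("fr")
      .setMaxOutputLength(50)
      .setDoSample(true)
      .setTopK(10)
      .setTemperature(0.7)
      .setRandomSeed(42) // intended to make sampling repeatable
      .setOutputCol("generation")

    val pipeline = new Pipeline().setStages(Array(documentAssembler, m2m100))

    val data = Seq("Life is like a box of chocolates.").toDF("text")
    val result = pipeline.fit(data).transform(data)

    result.select("generation.result").show(truncate = false)
  }
}
```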

Parameters

A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.

  1. val batchSize: IntParam

    Size of every batch (Default depends on model).

    Definition Classes
    HasBatchedAnnotate
  2. val beamSize: IntParam

    Beam size for the beam search algorithm (Default: 4)

    Definition Classes
    HasGeneratorProperties
  3. val doSample: BooleanParam

    Whether to use sampling; greedy decoding is used otherwise (Default: false)

    Definition Classes
    HasGeneratorProperties
  4. val engine: Param[String]

    This param is set internally once via loadSavedModel, which is why there is no setter.

    Definition Classes
    HasEngine
  5. var ignoreTokenIds: IntArrayParam

    A list of token ids which are ignored in the decoder's output (Default: Array())

  6. val inputAnnotatorTypes: Array[AnnotatorType]

    Input annotator type : DOCUMENT

    Definition Classes
    M2M100Transformer → HasInputAnnotationCols
  7. val maxInputLength: IntParam

    Maximum length of the input sequence (Default: 0)

    Definition Classes
    HasGeneratorProperties
  8. val maxOutputLength: IntParam

    Maximum length of the sequence to be generated (Default: 20)

    Definition Classes
    HasGeneratorProperties
  9. val minOutputLength: IntParam

    Minimum length of the sequence to be generated (Default: 0)

    Definition Classes
    HasGeneratorProperties
  10. val nReturnSequences: IntParam

    The number of sequences to return from the beam search.

    Definition Classes
    HasGeneratorProperties
  11. val noRepeatNgramSize: IntParam

    If set to int > 0, all ngrams of that size can only occur once (Default: 0)

    Definition Classes
    HasGeneratorProperties
  12. val outputAnnotatorType: String

    Output annotator type : DOCUMENT

    Definition Classes
    M2M100Transformer → HasOutputAnnotatorType
  13. val randomSeed: Option[Long]

    Optional random seed for the model. The value is stored as a Long; setRandomSeed accepts either an Int or a Long.

    Definition Classes
    HasGeneratorProperties
  14. val repetitionPenalty: DoubleParam

    The parameter for repetition penalty (Default: 1.0). 1.0 means no penalty. See the CTRL paper (arXiv:1909.05858) for more details.

    Definition Classes
    HasGeneratorProperties
  15. var srcLang: Param[String]

    Source Language (Default: en)

  16. val stopTokenIds: IntArrayParam

    Stop tokens to terminate the generation

    Definition Classes
    HasGeneratorProperties
  17. val task: Param[String]

    Set transformer task, e.g. "summarize:" (Default: "").

    Definition Classes
    HasGeneratorProperties
  18. val temperature: DoubleParam

    The value used to modulate the next token probabilities (Default: 1.0)

    Definition Classes
    HasGeneratorProperties
  19. var tgtLang: Param[String]

    Target Language (Default: fr)

  20. val topK: IntParam

    The number of highest probability vocabulary tokens to keep for top-k-filtering (Default: 50)

    Definition Classes
    HasGeneratorProperties
  21. val topP: DoubleParam

    If set to float < 1.0, only the most probable tokens with probabilities that add up to topP or higher are kept for generation (Default: 1.0)

    Definition Classes
    HasGeneratorProperties
  22. val vocabulary: MapFeature[String, Int]

    Vocabulary used to encode the words to ids with bpeTokenizer.encode
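
As a rough illustration of how temperature, topK and topP interact when doSample is enabled, the sketch below follows the convention commonly used by Hugging Face-style decoders: scale logits by the temperature, keep the topK most probable tokens, then keep the smallest prefix whose probabilities add up to topP or higher. This is an assumption about the general technique, not Spark NLP's internal implementation, and every name in it is hypothetical.

```scala
object SamplingSketch {
  /** Softmax over logits divided by `temperature` (lower values sharpen the distribution). */
  def softmax(logits: Array[Double], temperature: Double): Array[Double] = {
    val scaled = logits.map(_ / temperature)
    val max = scaled.max // subtract the max for numerical stability
    val exps = scaled.map(x => math.exp(x - max))
    val sum = exps.sum
    exps.map(_ / sum)
  }

  /** Token ids that survive top-k filtering followed by top-p (nucleus) filtering. */
  def keptTokens(probs: Array[Double], topK: Int, topP: Double): List[Int] = {
    // Rank token ids by descending probability and keep at most topK of them.
    val ranked = probs.zipWithIndex.sortBy { case (p, _) => -p }.take(topK)
    // Probability mass accumulated *before* each ranked token.
    val massBefore = ranked.scanLeft(0.0) { case (acc, (p, _)) => acc + p }
    // Keep tokens until the accumulated mass has reached topP.
    ranked.zip(massBefore)
      .takeWhile { case (_, before) => before < topP }
      .map { case ((_, id), _) => id }
      .toList
  }

  def main(args: Array[String]): Unit = {
    val probs = softmax(Array(2.0, 1.0, 0.5, -1.0), temperature = 1.0)
    println(keptTokens(probs, topK = 3, topP = 0.8)) // prints List(0, 1)
  }
}
```

With topP = 1.0 and topK at least the vocabulary size, no token is filtered out, which matches the documented defaults acting as "no filtering" for topP.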

Members

  1. type AnnotatorType = String
    Definition Classes
    HasOutputAnnotatorType
  1. def batchAnnotate(batchedAnnotations: Seq[Array[Annotation]]): Seq[Seq[Annotation]]

    Takes a document and annotations and produces new annotations of this annotator's annotation type

    batchedAnnotations

    Annotations that correspond to inputAnnotationCols generated by previous annotators if any

    returns

    any number of annotations processed for every input annotation. Not necessarily a one-to-one relationship

    Definition Classes
    M2M100Transformer → HasBatchedAnnotate
  2. def batchProcess(rows: Iterator[_]): Iterator[Row]
    Definition Classes
    HasBatchedAnnotate
  3. final def clear(param: Param[_]): M2M100Transformer.this.type
    Definition Classes
    Params
  4. def copy(extra: ParamMap): M2M100Transformer

    Requirement for annotator copies

    Definition Classes
    RawAnnotator → Model → Transformer → PipelineStage → Params
  5. def explainParam(param: Param[_]): String
    Definition Classes
    Params
  6. def explainParams(): String
    Definition Classes
    Params
  7. final def extractParamMap(): ParamMap
    Definition Classes
    Params
  8. final def extractParamMap(extra: ParamMap): ParamMap
    Definition Classes
    Params
  9. val features: ArrayBuffer[Feature[_, _, _]]
    Definition Classes
    HasFeatures
  10. val generationConfig: StructFeature[GenerationConfig]
  11. final def get[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  12. final def getDefault[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  13. def getGenerationConfig: GenerationConfig
  14. def getInputCols: Array[String]

    returns

    input annotations columns currently used

    Definition Classes
    HasInputAnnotationCols
  15. def getLazyAnnotator: Boolean
    Definition Classes
    CanBeLazy
  16. final def getOrDefault[T](param: Param[T]): T
    Definition Classes
    Params
  17. final def getOutputCol: String

    Gets the name of the annotation column that will be generated

    Definition Classes
    HasOutputAnnotationCol
  18. def getParam(paramName: String): Param[Any]
    Definition Classes
    Params
  19. def getSrcLangToken: Int
  20. def getTgtLangToken: Int
  21. final def hasDefault[T](param: Param[T]): Boolean
    Definition Classes
    Params
  22. def hasParam(paramName: String): Boolean
    Definition Classes
    Params
  23. def hasParent: Boolean
    Definition Classes
    Model
  24. final def isDefined(param: Param[_]): Boolean
    Definition Classes
    Params
  25. final def isSet(param: Param[_]): Boolean
    Definition Classes
    Params
  26. val lazyAnnotator: BooleanParam
    Definition Classes
    CanBeLazy
  27. def onWrite(path: String, spark: SparkSession): Unit
  28. val optionalInputAnnotatorTypes: Array[String]
    Definition Classes
    HasInputAnnotationCols
  29. lazy val params: Array[Param[_]]
    Definition Classes
    Params
  30. var parent: Estimator[M2M100Transformer]
    Definition Classes
    Model
  31. def save(path: String): Unit
    Definition Classes
    MLWritable
    Annotations
    @Since( "1.6.0" ) @throws( ... )
  32. final def set[T](param: Param[T], value: T): M2M100Transformer.this.type
    Definition Classes
    Params
  33. def setGenerationConfig(value: GenerationConfig): M2M100Transformer.this.type
  34. final def setInputCols(value: String*): M2M100Transformer.this.type
    Definition Classes
    HasInputAnnotationCols
  35. def setInputCols(value: Array[String]): M2M100Transformer.this.type

    Overrides required annotators column if different than default

    Definition Classes
    HasInputAnnotationCols
  36. def setLazyAnnotator(value: Boolean): M2M100Transformer.this.type
    Definition Classes
    CanBeLazy
  37. def setMaxInputLength(value: Int): M2M100Transformer.this.type
    Definition Classes
    HasGeneratorProperties
  38. final def setOutputCol(value: String): M2M100Transformer.this.type

    Overrides annotation column name when transforming

    Definition Classes
    HasOutputAnnotationCol
  39. def setParent(parent: Estimator[M2M100Transformer]): M2M100Transformer
    Definition Classes
    Model
  40. def setSrcLang(value: String): M2M100Transformer.this.type
  41. def setTgtLang(value: String): M2M100Transformer.this.type
  42. def toString(): String
    Definition Classes
    Identifiable → AnyRef → Any
  43. final def transform(dataset: Dataset[_]): DataFrame

    Given requirements are met, this applies the ML transformation within a Pipeline or stand-alone. The output annotation is generated as a new column; previous annotations remain available separately. Metadata is built at schema level to record the annotations' structural information outside their content.

    dataset

    Dataset[Row]

    Definition Classes
    AnnotatorModel → Transformer
  44. def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
    Definition Classes
    Transformer
    Annotations
    @Since( "2.0.0" )
  45. def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
    Definition Classes
    Transformer
    Annotations
    @Since( "2.0.0" ) @varargs()
  46. final def transformSchema(schema: StructType): StructType

    Requirement for pipeline transformation validation. It is called on fit()

    Definition Classes
    RawAnnotator → PipelineStage
  47. val uid: String
    Definition Classes
    M2M100Transformer → Identifiable
  48. def write: MLWriter
    Definition Classes
    ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable
  49. def writeOnnxModel(path: String, spark: SparkSession, onnxWrapper: OnnxWrapper, suffix: String, fileName: String): Unit
    Definition Classes
    WriteOnnxModel
  50. def writeOnnxModels(path: String, spark: SparkSession, onnxWrappersWithNames: Seq[(OnnxWrapper, String)], suffix: String): Unit
    Definition Classes
    WriteOnnxModel
  51. def writeOpenvinoModel(path: String, spark: SparkSession, openvinoWrapper: OpenvinoWrapper, suffix: String, fileName: String): Unit
    Definition Classes
    WriteOpenvinoModel
  52. def writeOpenvinoModels(path: String, spark: SparkSession, ovWrappersWithNames: Seq[(OpenvinoWrapper, String)], suffix: String): Unit
    Definition Classes
    WriteOpenvinoModel
  53. def writeSentencePieceModel(path: String, spark: SparkSession, spp: SentencePieceWrapper, suffix: String, filename: String): Unit
    Definition Classes
    WriteSentencePieceModel

Parameter setters

  1. def setBatchSize(size: Int): M2M100Transformer.this.type

    Size of every batch.

    Definition Classes
    HasBatchedAnnotate
  2. def setBeamSize(beamNum: Int): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  3. def setDoSample(value: Boolean): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  4. def setIgnoreTokenIds(tokenIds: Array[Int]): M2M100Transformer.this.type

  5. def setMaxOutputLength(value: Int): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  6. def setMinOutputLength(value: Int): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  7. def setModelIfNotSet(spark: SparkSession, onnxWrappers: Option[EncoderDecoderWithoutPastWrappers], openvinoWrapper: Option[EncoderDecoderWithoutPastWrappers], spp: SentencePieceWrapper): M2M100Transformer.this.type

  8. def setNReturnSequences(beamNum: Int): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  9. def setNoRepeatNgramSize(value: Int): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  10. def setRandomSeed(value: Int): M2M100Transformer.this.type

  11. def setRandomSeed(value: Long): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  12. def setRepetitionPenalty(value: Double): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  13. def setStopTokenIds(value: Array[Int]): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  14. def setTask(value: String): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  15. def setTemperature(value: Double): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  16. def setTopK(value: Int): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  17. def setTopP(value: Double): M2M100Transformer.this.type

    Definition Classes
    HasGeneratorProperties
  18. def setVocabulary(value: Map[String, Int]): M2M100Transformer.this.type
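
To make the effect of setRepetitionPenalty and setNoRepeatNgramSize more concrete, here is a minimal sketch of the two techniques as they are commonly implemented in sequence generators: the CTRL-style penalty rescales the logits of already generated tokens, and the n-gram rule bans any next token that would complete an n-gram already present in the output. Whether Spark NLP applies them in exactly this way is an assumption, and all names below are hypothetical.

```scala
object RepetitionSketch {
  /** Lowers the logits of tokens that were already generated; a penalty of 1.0 changes nothing. */
  def applyRepetitionPenalty(logits: Array[Double], generated: Seq[Int], penalty: Double): Array[Double] = {
    val out = logits.clone()
    generated.distinct.foreach { id =>
      // Dividing positive logits and multiplying negative ones both reduce the token's score.
      out(id) = if (out(id) > 0) out(id) / penalty else out(id) * penalty
    }
    out
  }

  /** Token ids that would repeat an n-gram already present in `generated`. */
  def bannedNextTokens(generated: Seq[Int], n: Int): Set[Int] =
    if (n <= 0 || generated.length < n) Set.empty
    else {
      // The last n - 1 tokens form the prefix of the n-gram about to be completed.
      val prefix = generated.takeRight(n - 1)
      generated.sliding(n).collect { case w if w.init == prefix => w.last }.toSet
    }

  def main(args: Array[String]): Unit = {
    println(applyRepetitionPenalty(Array(2.0, -1.0, 0.5), Seq(0, 1), 2.0).toList) // prints List(1.0, -2.0, 0.5)
    println(bannedNextTokens(Seq(5, 7, 9, 5, 7), 3)) // prints Set(9)
  }
}
```

In the second call the output already contains the 3-gram (5, 7, 9) and currently ends in (5, 7), so emitting 9 again would repeat it.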

Parameter getters

  1. def getBatchSize: Int

    Size of every batch.

    Definition Classes
    HasBatchedAnnotate
  2. def getBeamSize: Int

    Definition Classes
    HasGeneratorProperties
  3. def getDoSample: Boolean

    Definition Classes
    HasGeneratorProperties
  4. def getEngine: String

    Definition Classes
    HasEngine
  5. def getIgnoreTokenIds: Array[Int]

  6. def getMaxOutputLength: Int

    Definition Classes
    HasGeneratorProperties
  7. def getMinOutputLength: Int

    Definition Classes
    HasGeneratorProperties
  8. def getModelIfNotSet: M2M100

  9. def getNReturnSequences: Int

    Definition Classes
    HasGeneratorProperties
  10. def getNoRepeatNgramSize: Int

    Definition Classes
    HasGeneratorProperties
  11. def getRandomSeed: Option[Long]

    Definition Classes
    HasGeneratorProperties
  12. def getRepetitionPenalty: Double

    Definition Classes
    HasGeneratorProperties
  13. def getStopTokenIds: Array[Int]

    Definition Classes
    HasGeneratorProperties
  14. def getTask: Option[String]

    Definition Classes
    HasGeneratorProperties
  15. def getTemperature: Double

    Definition Classes
    HasGeneratorProperties
  16. def getTopK: Int

    Definition Classes
    HasGeneratorProperties
  17. def getTopP: Double

    Definition Classes
    HasGeneratorProperties