M2M100 Multilingual Translation 1.2B

Description

M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. It can translate directly between any pair of its 100 languages, which gives 9,900 translation directions. To translate into a target language, the target language id is forced as the first generated token. This is done by passing the forced_bos_token_id parameter to the generate method.
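The mechanism described above can be sketched outside Spark NLP with the Hugging Face Transformers library, where forced_bos_token_id is a parameter of generate. This is a minimal sketch, not the way the Spark NLP annotator is implemented: the checkpoint name facebook/m2m100_1.2B is assumed to correspond to this model, the helper names count_directions and translate are hypothetical, and the transformers, torch, and sentencepiece packages are assumed to be installed.

```python
def count_directions(n_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


def translate(text: str, src_lang: str, tgt_lang: str,
              model_name: str = "facebook/m2m100_1.2B") -> str:
    """Translate `text` by forcing the target language id as the first token.

    `model_name` is an assumption: it is the Hugging Face checkpoint presumed
    to match this model card.
    """
    # Imported lazily so count_directions works without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    # The tokenizer tags the input with the source language.
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")

    # The target language id becomes the first generated token.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
        max_new_tokens=50,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


if __name__ == "__main__":
    # 100 languages, each translatable into the 99 others.
    print(count_directions(100))
```

Calling translate("My name is Leonardo.", "en", "zh") downloads the checkpoint on first use, so it is left out of the main guard. The count follows from 100 × 99 ordered language pairs.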


How to use

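# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import M2M100Transformer

# Assumes an active Spark NLP session named `spark`, e.g. spark = sparknlp.start()
# Wrap the raw text column in Spark NLP document annotations.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Load the pretrained model and translate from English ("en") into Chinese ("zh").
m2m100 = M2M100Transformer.pretrained("m2m100_1.2B", "xx") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation") \
    .setSrcLang("en") \
    .setTgtLang("zh")

pipeline = Pipeline().setStages([documentAssembler, m2m100])
data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.show(truncate=False)
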
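// Scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

// Wrap the raw text column in Spark NLP document annotations.
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("documents")

// Load the pretrained model and translate from English ("en") into Chinese ("zh").
val m2m100 = M2M100Transformer.pretrained("m2m100_1.2B", "xx")
    .setInputCols(Array("documents"))
    .setMaxOutputLength(50)
    .setOutputCol("generation")
    .setSrcLang("en")
    .setTgtLang("zh")

val pipeline = new Pipeline().setStages(Array(documentAssembler, m2m100))

val data = Seq("My name is Leonardo.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.show(truncate = false)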
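The pipeline above writes its output to the generation column as a list of annotations, and the translated text sits in each annotation's result field. The following is a small sketch of pulling out only the translated strings after collecting rows to the driver. The helper name extract_translations is hypothetical, and it assumes each row can be indexed by column name, which holds for both PySpark Row objects and plain dictionaries.

```python
from typing import Any, Iterable, List


def extract_translations(rows: Iterable[Any],
                         output_col: str = "generation") -> List[List[str]]:
    """Return the `result` text of every annotation in `output_col`, per row.

    Works with anything that supports row[column] and annotation["result"],
    such as collected PySpark Row objects or plain dicts.
    """
    return [[annotation["result"] for annotation in row[output_col]]
            for row in rows]


if __name__ == "__main__":
    # Placeholder rows shaped like collected output; the text is not real model output.
    sample = [{"generation": [{"result": "<translated text>"}]}]
    print(extract_translations(sample))
```

With the Python pipeline, this would be called as extract_translations(result.select("generation").collect()). To inspect the same field without collecting, result.select("generation.result").show(truncate=False) displays it directly.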

Model Information

Model Name: m2m100_1.2B
Compatibility: Spark NLP 5.3.0+
License: Open Source
Edition: Official
Input Labels: [documents]
Output Labels: [generation]
Language: xx