Description
M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. The model can translate directly between any pair of its 100 languages, which gives 9,900 translation directions (100 × 99 ordered pairs). To translate into a target language, the target language id is forced as the first generated token, which is done by passing the forced_bos_token_id parameter to the generate method. In the Spark NLP examples below, the source and target languages are set with setSrcLang and setTgtLang.
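To illustrate what "forcing the first generated token" means, here is a toy greedy decoder in plain Python. It is only a sketch of the idea, not the model's or the library's implementation: the function `greedy_decode`, the scoring function `toy_scores`, and the token ids are all hypothetical names made up for this example.

```python
from typing import Callable, Dict, List

# Hypothetical token ids for a tiny four-token vocabulary.
ZH, FR, HELLO, EOS = 0, 1, 2, 3


def greedy_decode(
    next_token_scores: Callable[[List[int]], Dict[int, float]],
    forced_bos_token_id: int,
    eos_token_id: int,
    max_length: int = 10,
) -> List[int]:
    """Greedy decoding in which the first generated token is forced."""
    generated: List[int] = []
    while len(generated) < max_length:
        scores = next_token_scores(generated)
        if not generated:
            # First step: mask every token except the forced one,
            # so the argmax below can only pick the target language id.
            scores = {
                tok: (0.0 if tok == forced_bos_token_id else float("-inf"))
                for tok in scores
            }
        token = max(scores, key=scores.get)
        generated.append(token)
        if token == eos_token_id:
            break
    return generated


def toy_scores(prefix: List[int]) -> Dict[int, float]:
    """Stand-in for a real model: returns a score per token id."""
    if not prefix:
        # Left alone, this toy model would start with FR.
        return {ZH: 0.1, FR: 0.9, HELLO: 0.0, EOS: 0.0}
    if len(prefix) == 1:
        return {ZH: 0.0, FR: 0.0, HELLO: 1.0, EOS: 0.2}
    return {ZH: 0.0, FR: 0.0, HELLO: 0.1, EOS: 1.0}


tokens = greedy_decode(toy_scores, forced_bos_token_id=ZH, eos_token_id=EOS)
print(tokens)  # [0, 2, 3]: the forced ZH id, then HELLO, then EOS
```

Even though the toy model scores FR highest at the first step, the output starts with the forced ZH id; decoding then continues normally from that prefix.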
How to use
# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import M2M100Transformer
from pyspark.ml import Pipeline

spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Translate from English ("en") into Chinese ("zh")
m2m100 = M2M100Transformer.pretrained("m2m100_418M", "xx") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation") \
    .setSrcLang("en") \
    .setTgtLang("zh")

pipeline = Pipeline().setStages([documentAssembler, m2m100])
data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.show(truncate=False)
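// Scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for Seq(...).toDF; assumes an active SparkSession named spark

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

// Translate from English ("en") into Chinese ("zh")
val m2m100 = M2M100Transformer.pretrained("m2m100_418M", "xx")
  .setInputCols(Array("documents"))
  .setMaxOutputLength(50)
  .setOutputCol("generation")
  .setSrcLang("en")
  .setTgtLang("zh")

val pipeline = new Pipeline().setStages(Array(documentAssembler, m2m100))
val data = Seq("My name is Leonardo.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.show(truncate = false)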
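The examples above print the full annotation column. To pull out only the translated strings in Python, a small helper along the following lines may be convenient. This is a sketch, and `extract_translations` is a hypothetical name: it assumes the output column holds the usual Spark NLP annotation structs with a `result` field, and it uses `SimpleNamespace` objects with placeholder text as stand-ins for collected rows so that it runs without a Spark session.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def extract_translations(
    rows: Iterable[Any], output_col: str = "generation"
) -> List[List[str]]:
    """Return the `result` strings of every annotation, one list per row."""
    return [
        [annotation.result for annotation in getattr(row, output_col)]
        for row in rows
    ]


# Stand-in for result.select("generation").collect().
# The text is a placeholder, not real model output.
fake_rows = [
    SimpleNamespace(generation=[SimpleNamespace(result="<translated text>")])
]
print(extract_translations(fake_rows))
```

With the Python pipeline above, the call would presumably be `extract_translations(result.select("generation").collect())`. If the strings are only needed for display, `result.select("generation.result").show(truncate=False)` should achieve the same without collecting rows to the driver.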
Model Information
| Model Name: | m2m100_418M |
| Compatibility: | Spark NLP 5.3.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [documents] |
| Output Labels: | [generation] |
| Language: | xx |