Fast Neural Machine Translation Model from English to Tetun Dili

Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is developed mainly by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

  • source languages: en

  • target languages: tdt


How to use

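# Python
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

spark = sparknlp.start()

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# The output column name must match the input column of MarianTransformer below.
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_tdt", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
# `data` is the English text to translate (a string or a list of strings).
result = light_pipeline.fullAnnotate(data)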

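// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_tdt", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
// `data` is a DataFrame with a "text" column holding the English input.
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)

# Python (NLU)
import nlu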

text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.tdt').predict(text, output_level='sentence')
opus_df
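
In the Spark NLP Python example, fullAnnotate returns one dictionary per input text, keyed by output column name, whose values are lists of annotation objects carrying the text in a result field. The following is a minimal sketch, assuming that structure, of collecting the translated sentences. The helper name extract_translations is hypothetical (it is not part of Spark NLP), and the stand-in objects only mimic annotations so that the snippet runs without Spark or the model.

```python
from types import SimpleNamespace


def extract_translations(results, column="translation"):
    """Collect the translated sentences for each input text.

    `results` is assumed to look like the output of LightPipeline.fullAnnotate:
    a list with one dict per input, mapping output column names to lists of
    annotation objects that expose a `result` attribute.
    """
    translations = []
    for doc in results:
        # A missing column yields an empty list rather than an error.
        annotations = doc.get(column, [])
        translations.append([ann.result for ann in annotations])
    return translations


# Stand-in data mimicking the assumed structure (placeholders, not model output).
fake_results = [
    {
        "sentence": [
            SimpleNamespace(result="First sentence."),
            SimpleNamespace(result="Second sentence."),
        ],
        "translation": [
            SimpleNamespace(result="<translation 1>"),
            SimpleNamespace(result="<translation 2>"),
        ],
    }
]

print(extract_translations(fake_results))
```

With the real pipeline, the same helper would be called as extract_translations(result), where result is the value returned by light_pipeline.fullAnnotate(data).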

Model Information

Model Name: opus_mt_en_tdt
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [sentence]
Output Labels: [translation]
Language: xx
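
The compatibility entry above means the model needs Spark NLP 2.7.0 or later. A minimal sketch of checking a version string against that minimum follows; the helper name meets_minimum_version is hypothetical, and it assumes versions are dot-separated numbers, ignoring any pre-release suffix.

```python
import re


def meets_minimum_version(installed, minimum=(2, 7, 0)):
    """Return True if a version string such as "3.4.1" is at least `minimum`."""
    # Keep the first three numeric components; suffixes such as "rc1" are ignored.
    parts = [int(p) for p in re.findall(r"\d+", installed)[:3]]
    # Pad short versions such as "3.0" with zeros so tuples compare cleanly.
    parts += [0] * (3 - len(parts))
    return tuple(parts) >= tuple(minimum)


for candidate in ["2.6.5", "2.7.0", "2.10.1"]:
    print(candidate, meets_minimum_version(candidate))
```

Components are compared as integers, so 2.10.1 is correctly treated as newer than 2.7.0. With Spark NLP installed, the string returned by sparknlp.version() could be passed to this helper.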

Data Source

https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models