Translation is a natural language processing task where models convert text from one language into another while preserving its meaning, grammar, and context. For example, given the input “My name is Omar and I live in Zürich”, a translation model might output “Mein Name ist Omar und ich wohne in Zürich”. Modern translation models, especially multilingual neural models like mBART, can handle a wide variety of language pairs and can also be fine-tuned on custom data to improve accuracy for specific domains or dialects.
Translation models are widely used to build multilingual conversational agents and cross-lingual applications. They can either translate datasets of user intents and responses to train a new model in the target language or translate live user inputs and chatbot outputs for real-time interaction. These capabilities make translation essential for global communication, content localization, cross-border business, and international customer support, enabling systems to operate seamlessly across multiple languages.
Picking a Model
The choice of model for translation depends on the languages, domain, and whether real-time or batch translation is required. For general-purpose multilingual translation, encoder–decoder architectures like mBART, M2M100, and MarianMT perform well across a wide range of language pairs. For high-quality domain-specific translation, fine-tuned versions of these models can be used, such as models trained on legal, medical, or technical corpora. Lightweight or faster models like DistilMarianMT or distilled versions of mBART are suitable for real-time applications or deployment in resource-constrained environments. Finally, when rare or low-resource languages are involved, models like NLLB-200 or language-adapted versions of M2M100 provide improved coverage and accuracy.
Recommended Models for Translation Tasks
- General-Purpose Multilingual Translation: Models such as mbart-large-50-many-to-many-mmt, m2m100_418M, and Helsinki-NLP/opus-mt handle a wide variety of language pairs effectively.
- Domain-Specific Translation: For legal, medical, technical, or other specialized texts, fine-tuned variants of mBART, M2M100, or MarianMT trained on domain-specific corpora provide higher accuracy.
- Lightweight or Real-Time Translation: Distilled or smaller models like Helsinki-NLP/opus-mt and distilled mBART versions are optimized for low-latency, resource-constrained deployment.
- Low-Resource Languages: Models such as NLLB-200 or language-adapted versions of M2M100 are recommended for improved performance on rare or low-resource language pairs.
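As an illustration of the lightweight option, here is a minimal sketch that loads a MarianMT model into the same kind of pipeline shown in the How to use section below. The pretrained name opus_mt_en_fr (English to French) is used as an assumed example; check the Spark NLP Models hub for the exact name of the language pair you need.

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MarianTransformer
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# MarianMT (Helsinki-NLP/opus-mt) checkpoints are small, single-direction translators.
# "opus_mt_en_fr" is assumed here as an English-to-French example.
marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("translation") \
    .setMaxOutputLength(50)

pipeline = Pipeline().setStages([documentAssembler, marian])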
Explore the available translation models at Spark NLP Models to find the one that best suits your translation tasks.
How to use
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")
m2m100 = M2M100Transformer.pretrained("m2m100_418M") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation") \
    .setSrcLang("zh") \
    .setTgtLang("en")  # translate from Chinese (zh) to English (en)
pipeline = Pipeline().setStages([
    documentAssembler,
    m2m100
])
data = spark.createDataFrame([["生活就像一盒巧克力。"]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
result.select("summaries.generation").show(truncate=False)
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotators._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")
val m2m100 = M2M100Transformer.pretrained("m2m100_418M")
  .setInputCols(Array("documents"))
  .setSrcLang("zh") // Source language: Chinese
  .setTgtLang("en") // Target language: English
  .setMaxOutputLength(100)
  .setDoSample(false)
  .setOutputCol("generation")
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  m2m100
))
val data = Seq("生活就像一盒巧克力。").toDF("text")
val model = pipeline.fit(data)
val result = model.transform(data)
result.select("generation.result").show(truncate = false)
+-------------------------------------------------------------------------------------------+
|result |
+-------------------------------------------------------------------------------------------+
|[ Life is like a box of chocolate.] |
+-------------------------------------------------------------------------------------------+
Try Real-Time Demos!
If you want to see the outputs of translation models in real time, visit our interactive demos:
- Text-To-Text Transfer Transformer (Google T5)
- Multilingual Text Translation with MarianMT
- M2M100 Multilingual Translation Model
Useful Resources
Want to dive deeper into translation with Spark NLP? Here are some curated resources to help you get started and explore further:
Articles and Guides
- Multilingual machine translation with Spark NLP
- Use Spark NLP offline models for Language Translation
Notebooks