sparknlp.annotator.seq2seq.marian_transformer#

Contains classes for the MarianTransformer.

Module Contents#

Classes#

MarianTransformer

MarianTransformer: Fast Neural Machine Translation

class MarianTransformer(classname='com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer', java_model=None)[source]#

MarianTransformer: Fast Neural Machine Translation

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses models trained by MarianNMT.

It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences.

Pretrained models can be loaded with pretrained() of the companion object:

>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")

If no values are provided, the default model is "opus_mt_en_fr" and the default language is "xx" (meaning multilingual).

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Parameters:
batchSize

Size of every batch, by default 1

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

langId

Language ID token prepended to the input of multilingual models, e.g. “>>fr<<”, by default “”

maxInputLength

Controls the maximum length for encoder inputs (source language texts), by default 40

maxOutputLength

Controls the maximum length for decoder outputs (target language texts), by default 40
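
As a quick illustration, these parameters can be configured on the annotator before it is placed in a pipeline. This is a minimal sketch with illustrative values, using the default model:

>>> marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation") \
...     .setBatchSize(4) \
...     .setMaxInputLength(60) \
...     .setMaxOutputLength(60)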

Notes

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

References

MarianNMT at GitHub

Marian: Fast Neural Machine Translation in C++

Paper Abstract:

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> marian = MarianTransformer.pretrained() \
...     .setInputCols("sentence") \
...     .setOutputCol("translation") \
...     .setMaxInputLength(30)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentence,
...       marian
...     ])
>>> data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(translation.result) as result").show(truncate=False)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+
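
For quick, local experimentation, the fitted pipeline can also be wrapped in a LightPipeline from sparknlp.base. This is a sketch assuming the pipeline defined above:

>>> from sparknlp.base import LightPipeline
>>> light = LightPipeline(pipeline.fit(data))
>>> light.annotate("What is the capital of France?")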
setIgnoreTokenIds(value)[source]#

A list of token ids which are ignored in the decoder’s output.

Parameters:
value : List[int]

The token ids to be filtered out
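
For example, to filter ids out of the generated output (a sketch; the ids below are placeholders, not ids from a real vocabulary):

>>> marian = marian.setIgnoreTokenIds([0, 3])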

setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setLangId(value)[source]#

Sets the language ID token prepended to the input of multilingual models, e.g. “>>fr<<”, by default “”.

Parameters:
value : str

The language ID token, e.g. “>>fr<<”
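
For example, with a multilingual many-to-many model the target language can be selected through this token. This is a sketch: the model name "opus_mt_en_ROMANCE" and the ">>fr<<" token follow MarianNMT's multilingual conventions and should be verified against the Models Hub:

>>> marianMulti = MarianTransformer.pretrained("opus_mt_en_ROMANCE", "xx") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation") \
...     .setLangId(">>fr<<")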

setMaxInputLength(value)[source]#

Sets the maximum length for encoder inputs (source language texts), by default 40. The value should be less than 512, as the Marian Transformer does not support inputs longer than 512 tokens.

Parameters:
value : int

The maximum length for encoder inputs (source language texts)

setMaxOutputLength(value)[source]#

Sets the maximum length for decoder outputs (target language texts), by default 40.

Parameters:
value : int

The maximum length for decoder outputs (target language texts)

setDoSample(value)[source]#

Sets whether or not to use sampling; greedy decoding is used otherwise.

Parameters:
value : bool

Whether or not to use sampling; greedy decoding is used otherwise

setTemperature(value)[source]#

Sets the value used to modulate the next token probabilities.

Parameters:
value : float

The value used to modulate the next token probabilities

setTopK(value)[source]#

Sets the number of highest probability vocabulary tokens to keep for top-k-filtering.

Parameters:
value : int

Number of highest probability vocabulary tokens to keep

setTopP(value)[source]#

Sets the top cumulative probability for vocabulary tokens.

If set to float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.

Parameters:
value : float

Cumulative probability for vocabulary tokens

setRepetitionPenalty(value)[source]#

Sets the parameter for repetition penalty. 1.0 means no penalty.

Parameters:
value : float

The repetition penalty

References

See Ctrl: A Conditional Transformer Language Model For Controllable Generation for more details.

setNoRepeatNgramSize(value)[source]#

Sets the size of n-grams that can occur only once.

If set to int > 0, all n-grams of that size can occur only once.

Parameters:
value : int

The n-gram size; n-grams of this size can occur only once

setRandomSeed(seed)[source]#

Sets random seed.

Parameters:
seed : int

Random seed
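
The decoding setters above can be chained. A minimal sketch with illustrative values (not tuned recommendations):

>>> marian = marian \
...     .setDoSample(True) \
...     .setTemperature(0.7) \
...     .setTopK(50) \
...     .setTopP(0.9) \
...     .setRepetitionPenalty(1.2) \
...     .setNoRepeatNgramSize(3)
>>> marian.setRandomSeed(42)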

static loadSavedModel(folder, spark_session)[source]#

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

The current SparkSession

Returns:
MarianTransformer

The restored model
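
A sketch of restoring an exported model from a local folder; the path is a placeholder:

>>> import sparknlp
>>> spark = sparknlp.start()
>>> marian = MarianTransformer.loadSavedModel("/path/to/exported_model", spark) \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")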

static pretrained(name='opus_mt_en_fr', lang='xx', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “opus_mt_en_fr”

lang : str, optional

Language of the pretrained model, by default “xx”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
MarianTransformer

The restored model
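
For example, a specific translation direction can be requested by name. This is a sketch; "opus_mt_de_en" (German to English) is one of the names listed on the Models Hub:

>>> marianDeEn = MarianTransformer.pretrained("opus_mt_de_en", "xx") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")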