sparknlp.annotator.seq2seq.marian_transformer
Contains classes for the MarianTransformer.
Module Contents
Classes
MarianTransformer: Fast Neural Machine Translation
- class MarianTransformer(classname='com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer', java_model=None)
MarianTransformer: Fast Neural Machine Translation
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences, as shown in the Examples section below.
Pretrained models can be loaded with pretrained() of the companion object:

>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")

The default model is "opus_mt_en_fr" and the default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.
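A specific language pair can be selected by name. As an illustration (assuming, for example, that "opus_mt_de_en" is among the published models; check the Models Hub for the exact names):

>>> marianDeEn = MarianTransformer.pretrained("opus_mt_de_en", "xx") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")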
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- batchSize
Size of every batch, by default 1
- configProtoBytes
ConfigProto from TensorFlow, serialized into a byte array.
- langId
Language ID token for multilingual models, by default ""
- maxInputLength
Controls the maximum length for encoder inputs (source language texts), by default 40
- maxOutputLength
Controls the maximum length for decoder outputs (target language texts), by default 40
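A minimal sketch of setting these parameters (the values below are illustrative, not recommendations):

>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation") \
...     .setBatchSize(4) \
...     .setMaxInputLength(60) \
...     .setMaxOutputLength(60)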
Notes
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
References
Marian: Fast Neural Machine Translation in C++
Paper Abstract:
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> marian = MarianTransformer.pretrained() \
...     .setInputCols("sentence") \
...     .setOutputCol("translation") \
...     .setMaxInputLength(30)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         sentence,
...         marian
...     ])
>>> data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(translation.result) as result").show(truncate=False)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+
- setIgnoreTokenIds(value)
A list of token ids which are ignored in the decoder's output.
- Parameters:
- value : List[int]
The token ids to be filtered out of the decoder's output
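For example, to keep a particular id out of the generated output (the id 0 below is purely illustrative; meaningful ids depend on the model's vocabulary):

>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation") \
...     .setIgnoreTokenIds([0])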
- setConfigProtoBytes(b)
Sets configProto from TensorFlow, serialized into a byte array.
- Parameters:
- b : List[int]
ConfigProto from TensorFlow, serialized into a byte array
- setLangId(value)
Sets the language ID token for multilingual models, by default "".
- Parameters:
- value : str
The language ID token, e.g. ">>fr<<"
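A sketch of pairing a multilingual model with a target-language token (both the model name "opus_mt_en_ROMANCE" and the token ">>fr<<" are illustrative assumptions; consult the Models Hub and the model's documentation for the real values):

>>> marianMulti = MarianTransformer.pretrained("opus_mt_en_ROMANCE", "xx") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation") \
...     .setLangId(">>fr<<")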
- setMaxInputLength(value)
Sets the maximum length for encoder inputs (source language texts), by default 40. The value should be less than 512, as the Marian Transformer does not support inputs longer than 512 tokens.
- Parameters:
- value : int
The maximum length for encoder inputs (source language texts)
- setMaxOutputLength(value)
Sets the maximum length for decoder outputs (target language texts), by default 40.
- Parameters:
- value : int
The maximum length for decoder outputs (target language texts)
- setDoSample(value)
Sets whether or not to use sampling; greedy decoding is used otherwise.
- Parameters:
- value : bool
Whether or not to use sampling; greedy decoding is used otherwise
- setTemperature(value)
Sets the value used to modulate the next-token probabilities.
- Parameters:
- value : float
The value used to modulate the next-token probabilities
- setTopK(value)
Sets the number of highest-probability vocabulary tokens to keep for top-k filtering.
- Parameters:
- value : int
Number of highest-probability vocabulary tokens to keep
- setTopP(value)
Sets the top cumulative probability for vocabulary tokens.
If set to a float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.
- Parameters:
- value : float
Cumulative probability for vocabulary tokens
- setRepetitionPenalty(value)
Sets the parameter for repetition penalty. 1.0 means no penalty.
- Parameters:
- value : float
The repetition penalty
References
See CTRL: A Conditional Transformer Language Model for Controllable Generation for more details.
- setNoRepeatNgramSize(value)
Sets the size of n-grams that can only occur once.
If set to an int > 0, all n-grams of that size can only occur once.
- Parameters:
- value : int
The n-gram size; n-grams of this size can only occur once in the output
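Putting the generation controls together, a sampled decoding setup might look like this sketch (all values are illustrative, not tuned recommendations):

>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation") \
...     .setDoSample(True) \
...     .setTemperature(0.7) \
...     .setTopK(50) \
...     .setTopP(0.9) \
...     .setRepetitionPenalty(1.2) \
...     .setNoRepeatNgramSize(3)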
- static loadSavedModel(folder, spark_session)
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- Returns:
- MarianTransformer
The restored model
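A minimal sketch of restoring a model from disk (the folder path is a placeholder for wherever the saved model actually lives):

>>> import sparknlp
>>> spark = sparknlp.start()
>>> marian = MarianTransformer.loadSavedModel("/path/to/saved_marian_model", spark) \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")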
- static pretrained(name='opus_mt_en_fr', lang='xx', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "opus_mt_en_fr"
- lang : str, optional
Language of the pretrained model, by default "xx"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- MarianTransformer
The restored model
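For quick ad-hoc translation of plain strings, the fitted pipeline from the Examples section can also be wrapped in a LightPipeline (a sketch reusing the pipeline and data defined above):

>>> from sparknlp.base import LightPipeline
>>> light = LightPipeline(pipeline.fit(data))
>>> annotations = light.annotate("What is the capital of France?")
>>> annotations["translation"]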