sparknlp.annotator.cleaners.cleaner
Contains classes for Cleaner.
Module Contents
Classes
MarianTransformer: Fast Neural Machine Translation
- class Cleaner(classname='com.johnsnowlabs.nlp.annotators.cleaners.Cleaner', java_model=None)
MarianTransformer: Fast Neural Machine Translation
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences.
Pretrained models can be loaded with pretrained() of the companion object:
>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")
The default model is "opus_mt_en_fr", default language is "xx" (meaning multi-lingual), if no values are provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- batchSize
Size of every batch, by default 1
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- langId
Transformer’s task, e.g. “summarize>”, by default “”
- maxInputLength
Controls the maximum length for encoder inputs (source language texts), by default 40
- maxOutputLength
Controls the maximum length for decoder outputs (target language texts), by default 40
Notes
This is a very computationally expensive module, especially on longer sequences. Using an accelerator such as a GPU is recommended.
References
Marian: Fast Neural Machine Translation in C++
Paper Abstract:
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> marian = MarianTransformer.pretrained() \
...     .setInputCols("sentence") \
...     .setOutputCol("translation") \
...     .setMaxInputLength(30)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         sentence,
...         marian
...     ])
>>> data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(translation.result) as result").show(truncate=False)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+
- setEncoding(value)
Sets the encoding to be used for decoding the byte string (default is utf-8).
- Parameters:
- value : str
The encoding to be used for decoding the byte string (default is utf-8)
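As a rough illustration of what decoding a byte string with a given encoding means, the plain-Python sketch below treats each character of the input as one raw byte and decodes the result. The function name `decode_byte_string` is hypothetical, and this is not Spark NLP's implementation; it only mirrors the idea behind the `bytes_string_to_string` mode and the encoding parameter.

```python
def decode_byte_string(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret a str whose characters are raw byte values (0-255).

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    # latin-1 maps code points 0-255 one-to-one onto bytes 0-255,
    # so this recovers the original byte sequence.
    raw = text.encode("latin-1")
    # Decode the recovered bytes with the requested encoding.
    return raw.decode(encoding)
```

Under these assumptions, `decode_byte_string("caf\xc3\xa9")` returns `"café"`, because the two characters `\xc3\xa9` are the UTF-8 bytes of `é`.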
- setCleanPrefixPattern(value)
Sets the pattern for the prefix. Can be a simple string or a regex pattern.
- Parameters:
- value : str
The pattern for the prefix. Can be a simple string or a regex pattern.
- setCleanPostfixPattern(value)
Sets the pattern for the postfix. Can be a simple string or a regex pattern.
- Parameters:
- value : str
The pattern for the postfix. Can be a simple string or a regex pattern.
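To make the prefix and postfix patterns more concrete, here is a minimal plain-Python sketch of how a pattern could be anchored to the start or end of a text and removed. The helper names `strip_prefix` and `strip_postfix` are hypothetical, and the handling of surrounding whitespace is an assumption; the annotator's actual behavior may differ.

```python
import re


def strip_prefix(text: str, pattern: str) -> str:
    """Remove a match of `pattern` at the start of `text` (illustrative only)."""
    # Anchor at the start and also drop whitespace that follows the prefix.
    return re.sub(rf"^(?:{pattern})\s*", "", text)


def strip_postfix(text: str, pattern: str) -> str:
    """Remove a match of `pattern` at the end of `text` (illustrative only)."""
    # Anchor at the end and also drop whitespace that precedes the postfix.
    return re.sub(rf"\s*(?:{pattern})$", "", text)
```

For example, `strip_prefix("SUMMARY: The report is final.", r"(SUMMARY|DESCRIPTION):")` returns `"The report is final."`, and `strip_postfix("The end! END", r"(END|STOP)")` returns `"The end!"`. A simple string such as `"Note"` works the same way, since a literal string without special characters is also a valid regex.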
- setCleanerMode(value)
Sets the cleaner mode.
- Possible values:
clean, bytes_string_to_string, clean_non_ascii_chars, clean_ordered_bullets, clean_postfix, clean_prefix, remove_punctuation, replace_unicode_quotes
- Parameters:
- value : str
The mode for cleaning operations.
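The following sketch shows one way the cleaner mode might be wired into a pipeline. It assumes that Cleaner accepts `setInputCols` and `setOutputCol` like other annotators, that it can be imported from the module path shown at the top of this page, and that a Spark session has already been started (for example with `sparknlp.start()`). The column names `"text"`, `"document"`, and `"cleaned"` and the helper names are illustrative choices, not part of the documented API.

```python
# Mode names copied from the list of possible values above.
CLEANER_MODES = frozenset({
    "clean",
    "bytes_string_to_string",
    "clean_non_ascii_chars",
    "clean_ordered_bullets",
    "clean_postfix",
    "clean_prefix",
    "remove_punctuation",
    "replace_unicode_quotes",
})


def check_cleaner_mode(mode: str) -> str:
    """Fail early on a misspelled mode (hypothetical convenience helper)."""
    if mode not in CLEANER_MODES:
        raise ValueError(f"Unknown cleaner mode: {mode!r}")
    return mode


def build_cleaner_pipeline(mode: str = "clean"):
    """Build a DocumentAssembler -> Cleaner pipeline (sketch, see assumptions)."""
    # Imports are local so the helpers above work without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator.cleaners.cleaner import Cleaner

    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")
    )
    cleaner = (
        Cleaner()
        .setInputCols(["document"])   # assumed: standard annotator setter
        .setOutputCol("cleaned")      # assumed: standard annotator setter
        .setCleanerMode(check_cleaner_mode(mode))
    )
    return Pipeline().setStages([document_assembler, cleaner])
```

With a DataFrame `data` that has a `text` column, the pipeline would be run as `build_cleaner_pipeline("clean").fit(data).transform(data)`, following the same fit/transform pattern as the example earlier on this page.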
- setExtraWhitespace(value)
Sets whether to remove extra whitespace.
- Parameters:
- value : bool
Whether to remove extra whitespace.
- setDashes(value)
Sets whether to handle dashes in text.
- Parameters:
- value : bool
Whether to handle dashes in text.
- setBullets(value)
Sets whether to handle bullets in text.
- Parameters:
- value : bool
Whether to handle bullets in text.
- setTrailingPunctuation(value)
Sets whether to remove trailing punctuation from text.
- Parameters:
- value : bool
Whether to remove trailing punctuation from text.
- setLowercase(value)
Sets whether to convert text to lowercase.
- Parameters:
- value : bool
Whether to convert text to lowercase.
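The boolean options above (extra whitespace, dashes, bullets, trailing punctuation, lowercase) can be pictured as independent switches applied to the same text. The plain-Python sketch below is one possible interpretation, not Spark NLP's implementation: the function name `clean_text`, the order of the steps, the set of bullet glyphs, and the choice to replace dashes with spaces are all assumptions made for illustration.

```python
import re


def clean_text(
    text: str,
    extra_whitespace: bool = False,
    dashes: bool = False,
    bullets: bool = False,
    trailing_punctuation: bool = False,
    lowercase: bool = False,
) -> str:
    """Apply the selected cleaning steps (hypothetical sketch)."""
    cleaned = text
    if bullets:
        # Drop one leading bullet glyph: bullet, black circle, small square, or '*'.
        cleaned = re.sub(r"^\s*[\u2022\u25cf\u25aa*]\s+", "", cleaned)
    if dashes:
        # Replace hyphens, en dashes, and em dashes with a space.
        cleaned = re.sub(r"[-\u2013\u2014]", " ", cleaned)
    if extra_whitespace:
        # Collapse runs of whitespace and trim both ends.
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if trailing_punctuation:
        # Remove trailing whitespace, then trailing '.', ',', ':' or ';'.
        cleaned = cleaned.rstrip().rstrip(".,:;")
    if lowercase:
        cleaned = cleaned.lower()
    return cleaned
```

With every switch enabled, `clean_text("● An  ITEM-list -- done.", True, True, True, True, True)` returns `"an item list done"`. With all switches left at `False`, the text is returned unchanged.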