sparknlp.annotator.cleaners.cleaner#

Contains classes for Cleaner.

Module Contents#

Classes#

Cleaner

Annotator that cleans text according to a configurable cleaner mode (see setCleanerMode).

class Cleaner(classname='com.johnsnowlabs.nlp.annotators.cleaners.Cleaner', java_model=None)[source]#

MarianTransformer: Fast Neural Machine Translation

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.

It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences.
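
The length limit above can be respected by chunking text before it reaches the annotator. The following is a minimal pure-Python sketch, not Spark NLP code. It approximates tokens by whitespace-separated words, which is an assumption (a model tokenizer will usually produce more tokens than words, so a smaller budget may be needed), and `split_by_token_budget` is a hypothetical helper name.

```python
def split_by_token_budget(text, max_tokens=512):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


chunks = split_by_token_budget("one two three four five", max_tokens=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Splitting on sentence boundaries, as SentenceDetectorDL does, usually preserves meaning better than a fixed word budget; the sketch only illustrates the size constraint.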

Pretrained models can be loaded with pretrained() of the companion object:

>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")

If no values are provided, the default model is "opus_mt_en_fr" and the default language is "xx" (multilingual).

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT

Output Annotation type: CHUNK

Parameters:
batchSize

Size of every batch, by default 1

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

langId

Transformer’s task, e.g. “summarize>”, by default “”

maxInputLength

Controls the maximum length for encoder inputs (source language texts), by default 40

maxOutputLength

Controls the maximum length for decoder outputs (target language texts), by default 40

Notes

This module is computationally expensive, especially on longer sequences. Using an accelerator such as a GPU is recommended.

References

MarianNMT at GitHub

Marian: Fast Neural Machine Translation in C++

Paper Abstract:

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> marian = MarianTransformer.pretrained() \
...     .setInputCols("sentence") \
...     .setOutputCol("translation") \
...     .setMaxInputLength(30)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentence,
...       marian
...     ])
>>> data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(translation.result) as result").show(truncate=False)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+
name = 'Cleaner'[source]#
inputAnnotatorTypes[source]#
outputAnnotatorType = 'chunk'[source]#
encoding[source]#
cleanPrefixPattern[source]#
cleanPostfixPattern[source]#
cleanerMode[source]#
extraWhitespace[source]#
dashes[source]#
bullets[source]#
trailingPunctuation[source]#
lowercase[source]#
ignoreCase[source]#
strip[source]#
setEncoding(value)[source]#

Sets the encoding to be used for decoding the byte string (default is utf-8).

Parameters:
value : str

The encoding to be used for decoding the byte string (default is utf-8)
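
To illustrate why the encoding matters when a byte string is decoded, here is a small pure-Python sketch. It does not call Spark NLP, and `decode_bytes` is a hypothetical helper; it only shows that the same bytes yield different text under different encodings.

```python
def decode_bytes(raw, encoding="utf-8"):
    """Decode a byte string into text using the given encoding."""
    return raw.decode(encoding)


utf8_bytes = "café".encode("utf-8")  # b'caf\xc3\xa9'

print(decode_bytes(utf8_bytes))             # café
print(decode_bytes(utf8_bytes, "latin-1"))  # cafÃ©
```

Decoding with the wrong encoding does not necessarily raise an error; it can silently produce mojibake, as the second call shows.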

setCleanPrefixPattern(value)[source]#

Sets the pattern for the prefix. Can be a simple string or a regex pattern.

Parameters:
value : str

The pattern for the prefix. Can be a simple string or a regex pattern.

setCleanPostfixPattern(value)[source]#

Sets the pattern for the postfix. Can be a simple string or a regex pattern.

Parameters:
value : str

The pattern for the postfix. Can be a simple string or a regex pattern.
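
As a rough illustration of prefix and postfix patterns, the sketch below uses Python's `re` module to anchor a pattern at the start or the end of the text. It is an assumption about the general technique, not the annotator's implementation, and the helper names `clean_prefix` and `clean_postfix` are hypothetical.

```python
import re


def clean_prefix(text, pattern):
    """Remove a match of pattern anchored at the start of text."""
    return re.sub(rf"^(?:{pattern})", "", text)


def clean_postfix(text, pattern):
    """Remove a match of pattern anchored at the end of text."""
    return re.sub(rf"(?:{pattern})$", "", text)


print(clean_prefix("SUMMARY: Quarterly results", r"SUMMARY:\s*"))  # Quarterly results
print(clean_postfix("Quarterly results [END]", r"\s*\[END\]"))     # Quarterly results
```

A plain string works as a pattern too, provided it contains no regex metacharacters; otherwise escape it first with `re.escape`.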

setCleanerMode(value)[source]#

Sets the cleaner mode.

Possible values:

clean, bytes_string_to_string, clean_non_ascii_chars, clean_ordered_bullets, clean_postfix, clean_prefix, remove_punctuation, replace_unicode_quotes

Parameters:
value : str

The mode for cleaning operations.
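
One way to picture mode selection is as a lookup from mode name to cleaning function. The sketch below is pure Python and covers only three of the listed mode names; the function bodies are assumptions about what those names suggest, not the annotator's actual behavior, and `apply_mode` is a hypothetical helper.

```python
import string


def remove_punctuation(text):
    """Drop ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def replace_unicode_quotes(text):
    """Map curly quotes to their ASCII equivalents."""
    quotes = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
    return text.translate(str.maketrans(quotes))


def clean_non_ascii_chars(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")


MODES = {
    "remove_punctuation": remove_punctuation,
    "replace_unicode_quotes": replace_unicode_quotes,
    "clean_non_ascii_chars": clean_non_ascii_chars,
}


def apply_mode(text, mode):
    """Look up the cleaning function for mode and apply it to text."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return MODES[mode](text)


print(apply_mode("Hello, world!", "remove_punctuation"))           # Hello world
print(apply_mode("\u201cquoted\u201d", "replace_unicode_quotes"))  # "quoted"
print(apply_mode("na\u00efve caf\u00e9", "clean_non_ascii_chars")) # nave caf
```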

setExtraWhitespace(value)[source]#

Sets whether to remove extra whitespace.

Parameters:
value : bool

Whether to remove extra whitespace.
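
A common reading of "extra whitespace" is any run of spaces, tabs, or newlines, collapsed to a single space. The pure-Python sketch below shows that interpretation; it is an assumption, not the annotator's code, and `collapse_whitespace` is a hypothetical name.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


print(repr(collapse_whitespace("ITEM 1.\n\n   BUSINESS")))  # 'ITEM 1. BUSINESS'
```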

setDashes(value)[source]#

Sets whether to handle dashes in text.

Parameters:
value : bool

Whether to handle dashes in text.
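
The description above does not say how dashes are handled. One plausible interpretation, shown here purely as an assumption, is to replace hyphens and en/em dashes with spaces so that joined words separate. `replace_dashes` is a hypothetical helper and does not call Spark NLP.

```python
import re


def replace_dashes(text):
    """Replace hyphens, en dashes and em dashes with a single space each."""
    return re.sub(r"[-\u2013\u2014]", " ", text)


print(replace_dashes("state-of-the-art"))  # state of the art
print(replace_dashes("2019\u20132021"))    # 2019 2021
```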

setBullets(value)[source]#

Sets whether to handle bullets in text.

Parameters:
value : bool

Whether to handle bullets in text.
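
Bullet handling is likewise unspecified above. A reasonable guess, labeled here as an assumption, is that a leading bullet marker is removed from the start of the text. The set of bullet characters and the helper name `strip_leading_bullet` are illustrative choices, not taken from the library.

```python
import re

# Bullet marker at the start of the text, followed by at least one space.
_LEADING_BULLET = re.compile(r"^\s*[\u2022\u25cf\u25e6\u2023\u2043*\-]\s+")


def strip_leading_bullet(text):
    """Remove one leading bullet marker, if present."""
    return _LEADING_BULLET.sub("", text, count=1)


print(strip_leading_bullet("\u25cf An excellent point!"))  # An excellent point!
print(strip_leading_bullet("-5 degrees"))                  # -5 degrees
```

Requiring whitespace after the marker keeps text such as negative numbers intact.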

setTrailingPunctuation(value)[source]#

Sets whether to remove trailing punctuation from text.

Parameters:
value : bool

Whether to remove trailing punctuation from text.
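
Removing trailing punctuation can be sketched with `str.rstrip`. Which characters count as punctuation is an assumption here (period, comma, colon, semicolon), and `strip_trailing_punctuation` is a hypothetical name.

```python
def strip_trailing_punctuation(text, chars=".,:;"):
    """Remove any of chars from the end of text, leaving the rest intact."""
    return text.rstrip(chars)


print(strip_trailing_punctuation("ITEM 1A: RISK FACTORS."))  # ITEM 1A: RISK FACTORS
```

Only the end of the string is affected, so the colon inside the text survives.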

setLowercase(value)[source]#

Sets whether to convert text to lowercase.

Parameters:
value : bool

Whether to convert text to lowercase.

setIgnoreCase(value)[source]#

Sets whether to ignore case in the pattern.

Parameters:
value : bool

If true, ignores case in the pattern.
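
The effect of ignoring case in a pattern can be seen with Python's `re.IGNORECASE` flag. This is a generic regex illustration under the assumption that the option behaves like a case-insensitive match; it does not invoke the annotator.

```python
import re

pattern = r"^summary:\s*"
text = "SUMMARY: Quarterly results"

case_sensitive = re.sub(pattern, "", text)
case_insensitive = re.sub(pattern, "", text, flags=re.IGNORECASE)

print(case_sensitive)    # SUMMARY: Quarterly results
print(case_insensitive)  # Quarterly results
```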

setStrip(value)[source]#

Sets whether to remove leading and trailing whitespace from the cleaned string.

Parameters:
value : bool

If true, removes leading and trailing whitespace from the cleaned string.
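
To show how several boolean options might combine, the sketch below applies them in a fixed order to one string. The order of operations, the `CleanOptions` dataclass, and the `clean` function are all assumptions made for illustration; the annotator may apply its options differently.

```python
import re
from dataclasses import dataclass


@dataclass
class CleanOptions:
    extra_whitespace: bool = False
    trailing_punctuation: bool = False
    lowercase: bool = False
    strip: bool = True


def clean(text, options):
    """Apply the enabled cleaning steps in a fixed, assumed order."""
    if options.extra_whitespace:
        text = re.sub(r"\s+", " ", text)   # collapse whitespace runs
    if options.strip:
        text = text.strip()                # drop leading/trailing whitespace
    if options.trailing_punctuation:
        text = text.rstrip(".,:;")         # drop trailing punctuation
    if options.lowercase:
        text = text.lower()
    return text


opts = CleanOptions(extra_whitespace=True, trailing_punctuation=True, lowercase=True)
print(clean("  ITEM 1A:   Risk  Factors.  ", opts))  # item 1a: risk factors
```

Stripping before removing trailing punctuation matters in this sketch: with the order reversed, the trailing spaces would shield the final period from `rstrip`.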