Packages

package seq2seq


Type Members

  1. class BartTransformer extends AnnotatorModel[BartTransformer] with HasBatchedAnnotate[BartTransformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with HasEngine with HasGeneratorProperties

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer

    The Facebook BART (Bidirectional and Auto-Regressive Transformer) model is a state-of-the-art language generation model that was introduced by Facebook AI in 2019. It is based on the transformer architecture and is designed to handle a wide range of natural language processing tasks such as text generation, summarization, and machine translation.

    BART combines a bidirectional encoder with an auto-regressive decoder. The encoder reads the whole input and captures context from both preceding and following tokens, while the decoder generates the output text from left to right. Each generated token is therefore conditioned on the full input, resulting in more accurate and natural language generation.

    The model was trained on a large corpus of text data using a combination of unsupervised and supervised learning techniques. It incorporates pretraining and fine-tuning phases, where the model is first trained on a large unlabeled corpus of text, and then fine-tuned on specific downstream tasks.

    BART has achieved state-of-the-art performance on a wide range of NLP tasks, including summarization, question-answering, and language translation. Its ability to handle multiple tasks and its high performance on each of these tasks make it a versatile and valuable tool for natural language processing applications.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val bart = BartTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "distilbart_xsum_12_6", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see BartTestSpec.

    References:

    Paper Abstract:

    We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 3.5 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also replicate other pretraining schemes within the BART framework, to understand their effect on end-task performance.
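
    The two noising operations named in the abstract, text in-filling and sentence shuffling, can be illustrated with a toy sketch. The object and function names below are hypothetical, and the span and the sentence order are passed in explicitly rather than sampled, so this shows only the shape of the corruption, not BART's actual preprocessing.

    ```scala
    // Toy illustration of BART-style noising (hypothetical names, not Spark NLP API).
    object BartNoisingSketch {
      val Mask = "<mask>"

      // Text in-filling: replace the tokens in [start, end) with ONE mask token,
      // so a model reconstructing the text must also infer how many tokens are missing.
      def infillSpan(tokens: Seq[String], start: Int, end: Int): Seq[String] = {
        require(0 <= start && start <= end && end <= tokens.length, "invalid span")
        (tokens.take(start) :+ Mask) ++ tokens.drop(end)
      }

      // Sentence permutation: reorder sentences according to a permutation of their indices.
      def permuteSentences(sentences: Seq[String], order: Seq[Int]): Seq[String] = {
        require(order.sorted.toList == sentences.indices.toList, "order must be a permutation")
        order.map(i => sentences(i))
      }
    }
    ```

    During pretraining the model would receive the corrupted sequence as input and be trained to reproduce the original one.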

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.BartTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val bart = BartTransformer.pretrained("distilbart_xsum_12_6")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(30)
      .setDoSample(true)
      .setTopK(50)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, bart))
    
    val data = Seq(
      "PG&E stated it scheduled the blackouts in response to forecasts for high winds " +
      "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were " +
      "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +--------------------------------------------------------------+
    |result                                                        |
    +--------------------------------------------------------------+
    |[Nearly 800 thousand customers were affected by the shutoffs.]|
    +--------------------------------------------------------------+
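
    The example above enables sampling with setDoSample(true) and setTopK(50). As a rough sketch of what top-k filtering conventionally means, the snippet below keeps the k most probable tokens and renormalises them before a token would be drawn. The names are hypothetical and the real annotator's internals may differ.

    ```scala
    // Conventional top-k filtering over a toy next-token distribution
    // (hypothetical helper, not part of Spark NLP).
    object TopKSketch {
      // Keep the k highest-probability tokens and renormalise so they sum to 1.
      def topK(probs: Map[String, Double], k: Int): Map[String, Double] = {
        require(k > 0, "k must be positive")
        val kept = probs.toSeq.sortBy { case (_, p) => -p }.take(k)
        val total = kept.map(_._2).sum
        kept.map { case (tok, p) => tok -> (p / total) }.toMap
      }
    }
    ```

    With probabilities 0.5, 0.3 and 0.2 and k = 2, the two surviving tokens end up with 0.625 and 0.375.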
  2. class GPT2Transformer extends AnnotatorModel[GPT2Transformer] with HasBatchedAnnotate[GPT2Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with HasEngine

    GPT-2: the OpenAI Text-To-Text Transformer

    GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
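
    The next-word objective described above implies a simple generation loop: predict one word from everything seen so far, append it, and repeat. The sketch below shows that loop with a hypothetical NextWord function standing in for the model; it illustrates the idea only and is not how the annotator is implemented.

    ```scala
    // Auto-regressive generation loop with a stand-in "model" (hypothetical names).
    object NextWordSketch {
      // Given all previous words, return the next word, or None to stop early.
      type NextWord = Seq[String] => Option[String]

      def generate(prompt: Seq[String], next: NextWord, maxNewTokens: Int): Seq[String] = {
        @annotation.tailrec
        def loop(acc: Seq[String], remaining: Int): Seq[String] =
          if (remaining <= 0) acc
          else next(acc) match {
            case Some(word) => loop(acc :+ word, remaining - 1)
            case None       => acc
          }
        loop(prompt, maxNewTokens)
      }
    }
    ```

    The maxNewTokens argument plays a role similar to a maximum output length: generation stops once the budget is spent, even if the stand-in model could continue.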

    GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val gpt2 = GPT2Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "gpt2", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see GPT2TestSpec.

    References:

    Paper Abstract:

    Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val gpt2 = GPT2Transformer.pretrained("gpt2")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
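
    The example above sets setNoRepeatNgramSize(3). A common way to enforce such a constraint is to ban, at each step, any token that would complete an n-gram already present in the generated text. The sketch below shows that check on plain strings; the names are hypothetical and the annotator's own implementation may differ.

    ```scala
    // No-repeat n-gram check on a toy token sequence (hypothetical helper).
    object NoRepeatNgramSketch {
      // Tokens that, if appended next, would repeat an n-gram already in `generated`.
      def bannedTokens(generated: Seq[String], n: Int): Set[String] = {
        require(n > 0, "n must be positive")
        if (generated.length < n) Set.empty
        else {
          // The last n - 1 tokens form the prefix of the n-gram about to be completed.
          val prefix = generated.takeRight(n - 1)
          generated.sliding(n)
            .collect { case gram if gram.init == prefix => gram.last }
            .toSet
        }
      }
    }
    ```

    For the sequence a b c a b with n = 3, the token c is banned because appending it would repeat the trigram a b c.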
  3. class LLAMA2Transformer extends AnnotatorModel[LLAMA2Transformer] with HasBatchedAnnotate[LLAMA2Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    The Llama 2 release introduces a family of pretrained and fine-tuned LLMs, ranging in scale from 7B to 70B parameters (7B, 13B, 70B). The pretrained models come with significant improvements over the Llama 1 models, including being trained on 40% more tokens, having a much longer context length (4k tokens), and using grouped-query attention for fast inference of the 70B model.

    However, the most exciting part of this release is the fine-tuned models (Llama 2-Chat), which have been optimized for dialogue applications using Reinforcement Learning from Human Feedback (RLHF). Across a wide range of helpfulness and safety benchmarks, the Llama 2-Chat models perform better than most open models and achieve comparable performance to ChatGPT according to human evaluations.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val llama2 = LLAMA2Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "llama_2_7b_chat_hf_int4", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see LLAMA2TestSpec.

    References:

    Paper Abstract:

    In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.LLAMA2Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val llama2 = LLAMA2Transformer.pretrained("llama_2_7b_chat_hf_int4")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, llama2))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  4. class M2M100Transformer extends AnnotatorModel[M2M100Transformer] with HasBatchedAnnotate[M2M100Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    M2M100: multilingual translation model

    M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.

    The model can directly translate between the 9,900 directions of 100 languages.
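
    The figure of 9,900 follows from counting ordered (source, target) pairs of distinct languages: 100 × 99. A one-line check, using a hypothetical helper name:

    ```scala
    // Number of translation directions among n languages (hypothetical helper).
    object TranslationDirections {
      // Ordered (source, target) pairs with source != target.
      def directions(languages: Int): Int = languages * (languages - 1)
    }
    ```

    Calling TranslationDirections.directions(100) gives 9900.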

    Pretrained models can be loaded with the pretrained method of the companion object:

    val m2m100 = M2M100Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "m2m100_418M", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see M2M100TestSpec.

    References:

    Paper Abstract:

    Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.

    Languages Covered:

    Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val m2m100 = M2M100Transformer.pretrained("m2m100_418M")
      .setInputCols(Array("documents"))
      .setSrcLang("zh")
      .setTgtLang("en")
      .setMaxOutputLength(100)
      .setDoSample(false)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, m2m100))
    
    val data = Seq(
      "生活就像一盒巧克力。"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +-------------------------------------------------------------------------------------------+
    |result                                                                                     |
    +-------------------------------------------------------------------------------------------+
    |[ Life is like a box of chocolate.]                                                        |
    +-------------------------------------------------------------------------------------------+
  5. class MarianTransformer extends AnnotatorModel[MarianTransformer] with HasBatchedAnnotate[MarianTransformer] with WriteTensorflowModel with WriteOnnxModel with WriteSentencePieceModel with HasEngine with HasProtectedParams

    MarianTransformer: Fast Neural Machine Translation

    Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.

    It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

    Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences first.
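
    Once a text has been split into sentences, one possible way to respect an input budget is to pack consecutive sentences greedily into chunks. The sketch below is only an illustration: the names are hypothetical, and it counts whitespace-separated words, which is an assumption and only approximates the model's own subword token count.

    ```scala
    // Greedy packing of sentences under an approximate token budget (hypothetical helper).
    object InputBudgetSketch {
      // Approximate token count: whitespace-separated words (an assumption, not the model's tokenizer).
      private def size(s: String): Int = s.split("\\s+").count(_.nonEmpty)

      // Append each sentence to the current chunk while the budget allows, else start a new chunk.
      // A single sentence longer than the budget still gets its own chunk.
      def pack(sentences: Seq[String], maxTokens: Int): Seq[Seq[String]] =
        sentences.foldLeft(Vector.empty[Vector[String]]) { (chunks, s) =>
          chunks.lastOption match {
            case Some(last) if last.map(size).sum + size(s) <= maxTokens =>
              chunks.init :+ (last :+ s)
            case _ => chunks :+ Vector(s)
          }
        }
    }
    ```

    In practice a safety margin below the 512-token limit would be sensible, since subword tokenization usually produces more tokens than words.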

    Pretrained models can be loaded with the pretrained method of the companion object:

    val marian = MarianTransformer.pretrained()
      .setInputCols("sentence")
      .setOutputCol("translation")

    The default model is "opus_mt_en_fr", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Examples and the MarianTransformerTestSpec.

    Sources:

    MarianNMT at GitHub

    Marian: Fast Neural Machine Translation in C++

    Paper Abstract:

    We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetectorDLModel
    import com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val marian = MarianTransformer.pretrained()
      .setInputCols("sentence")
      .setOutputCol("translation")
      .setMaxInputLength(30)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        marian
      ))
    
    val data = Seq("What is the capital of France? We should know this in french.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(translation.result) as result").show(false)
    +-------------------------------------+
    |result                               |
    +-------------------------------------+
    |Quelle est la capitale de la France ?|
    |On devrait le savoir en français.    |
    +-------------------------------------+
  6. class MistralTransformer extends AnnotatorModel[MistralTransformer] with HasBatchedAnnotate[MistralTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    Mistral 7B

    Mistral 7B is a 7.3 billion-parameter model that stands out for its efficient and effective performance in natural language processing. It surpasses Llama 2 13B across all benchmarks and outperforms Llama 1 34B in several areas. Mistral 7B strikes a balance between English language tasks and code comprehension, rivaling the capabilities of CodeLlama 7B in the latter.

    Mistral 7B introduces Grouped-query attention (GQA) for quicker inference, enhancing processing speed without compromising accuracy. This streamlined approach ensures a smoother user experience, making Mistral 7B a practical choice for real-world applications.
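
    In grouped-query attention, several query heads share one key/value head, which shrinks the key/value cache that must be kept during inference. The sketch below shows only the head-to-group mapping. The names are hypothetical, and the 32 query heads and 8 key/value heads used in the usage note are the configuration reported in the Mistral 7B paper, not something stated on this page.

    ```scala
    // Mapping from query heads to shared key/value heads (hypothetical helper).
    object GroupedQuerySketch {
      // Index of the key/value head used by a given query head.
      def kvHeadFor(queryHead: Int, numQueryHeads: Int, numKvHeads: Int): Int = {
        require(numKvHeads > 0 && numQueryHeads % numKvHeads == 0,
          "query heads must divide evenly into key/value groups")
        require(queryHead >= 0 && queryHead < numQueryHeads, "query head out of range")
        queryHead / (numQueryHeads / numKvHeads)
      }
    }
    ```

    With 32 query heads and 8 key/value heads, query heads 0 to 3 share key/value head 0, heads 4 to 7 share head 1, and so on up to head 7.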

    Additionally, Mistral 7B adopts Sliding Window Attention (SWA) to efficiently handle longer sequences at a reduced computational cost. This feature enhances the model's ability to process extensive textual input, expanding its utility in handling more complex tasks.
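
    Sliding window attention can be pictured as a causal attention mask in which each position sees only the most recent positions. The sketch below builds such a mask for a toy sequence. The names are hypothetical and the exact boundary convention (whether the window counts the current position) is an assumption.

    ```scala
    // Causal sliding-window attention mask for a toy sequence (hypothetical helper).
    object SlidingWindowSketch {
      // mask(i)(j) is true when position i may attend to position j:
      // j must not lie in the future (j <= i) and must be among the last `window` positions.
      def mask(seqLen: Int, window: Int): Vector[Vector[Boolean]] =
        Vector.tabulate(seqLen, seqLen)((i, j) => j <= i && i - j < window)
    }
    ```

    Each row contains at most `window` true entries, so the work per position stays bounded as the sequence grows, which is where the reduced cost on long inputs comes from.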

    In summary, Mistral 7B represents a notable advancement in language models, offering a reliable and versatile solution for various natural language processing challenges.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val mistral = MistralTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "mistral_7b", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see MistralTestSpec.

    References:

    Paper Abstract:

    We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.MistralTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val mistral = MistralTransformer.pretrained("mistral_7b")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, mistral))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     |result                                                                                                                                                                                              |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     |[Leonardo Da Vinci invented the microscope?\n Question: Leonardo Da Vinci invented the microscope?\n Answer: No, Leonardo Da Vinci did not invent the microscope. The first microscope was invented |
     | in the late 16th century, long after Leonardo']                                                                                                                                                    |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  7. class Phi2Transformer extends AnnotatorModel[Phi2Transformer] with HasBatchedAnnotate[Phi2Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Phi-2: Textbooks Are All You Need.

    Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.

    Phi-2 hasn't been fine-tuned through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val phi2 = Phi2Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "Phi2-13b", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see Phi2TestSpec.

    References:

    Paper Abstract:

    The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

    Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

    Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following upon our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality. Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but shows clear boost in Phi-2 benchmark scores.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.Phi2Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val phi2 = Phi2Transformer.pretrained("phi2")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, phi2))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo . I am a student of the University of California, Berkeley. I am interested in the field of Artificial Intelligence and its applications in the real world. I have a strong   |
    | passion for learning and am always looking for ways to improve my knowledge and skills]                                                                                                            |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
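
    The example above calls setNoRepeatNgramSize(3). As a minimal sketch, assuming the parameter follows the usual convention that no n-gram of the given size may appear twice in the generated text, the check below shows what that constraint means for a token sequence. NoRepeatNgramSketch and hasRepeatedNgram are hypothetical names used for illustration and are not part of Spark NLP.

```scala
// Hypothetical illustration only; not part of the Spark NLP API.
object NoRepeatNgramSketch {

  // Returns true if any n-gram of size n occurs more than once in tokens.
  def hasRepeatedNgram(tokens: Seq[String], n: Int): Boolean = {
    require(n > 0, "n must be positive")
    // sliding can yield one short window when tokens.size < n, so drop it.
    val ngrams = tokens.sliding(n).filter(_.size == n).toSeq
    ngrams.distinct.size != ngrams.size
  }

  def main(args: Array[String]): Unit = {
    val repeated = Seq("I", "am", "a", "student", "I", "am", "a", "teacher")
    val clean = Seq("I", "am", "a", "student", "of", "the", "university")
    println(hasRepeatedNgram(repeated, 3)) // "I am a" appears twice
    println(hasRepeatedNgram(clean, 3))
  }
}
```

    A generator that enforces this constraint would reject any candidate token that completes an n-gram already present in the output.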
  8. trait ReadBartTransformerDLModel extends ReadTensorflowModel
  9. trait ReadGPT2TransformerDLModel extends ReadTensorflowModel
  10. trait ReadLLAMA2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  11. trait ReadM2M100TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  12. trait ReadMarianMTDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel
  13. trait ReadMistralTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  14. trait ReadPhi2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
  15. trait ReadT5TransformerDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
  16. trait ReadablePretrainedBartTransformerModel extends ParamsAndFeaturesReadable[BartTransformer] with HasPretrained[BartTransformer]
  17. trait ReadablePretrainedGPT2TransformerModel extends ParamsAndFeaturesReadable[GPT2Transformer] with HasPretrained[GPT2Transformer]
  18. trait ReadablePretrainedLLAMA2TransformerModel extends ParamsAndFeaturesReadable[LLAMA2Transformer] with HasPretrained[LLAMA2Transformer]
  19. trait ReadablePretrainedM2M100TransformerModel extends ParamsAndFeaturesReadable[M2M100Transformer] with HasPretrained[M2M100Transformer]
  20. trait ReadablePretrainedMarianMTModel extends ParamsAndFeaturesReadable[MarianTransformer] with HasPretrained[MarianTransformer]
  21. trait ReadablePretrainedMistralTransformerModel extends ParamsAndFeaturesReadable[MistralTransformer] with HasPretrained[MistralTransformer]
  22. trait ReadablePretrainedPhi2TransformerModel extends ParamsAndFeaturesReadable[Phi2Transformer] with HasPretrained[Phi2Transformer]
  23. trait ReadablePretrainedT5TransformerModel extends ParamsAndFeaturesReadable[T5Transformer] with HasPretrained[T5Transformer]
  24. class T5Transformer extends AnnotatorModel[T5Transformer] with HasBatchedAnnotate[T5Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasCaseSensitiveProperties with WriteSentencePieceModel with HasProtectedParams with HasEngine

    T5: the Text-To-Text Transfer Transformer

    T5 reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework can use the same model, loss function, and hyper-parameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). T5 can even be applied to regression tasks by training it to predict the string representation of a number instead of the number itself.
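
    The text-to-text framing can be sketched in plain Scala. The snippet below is illustrative only: TextToTextSketch, toModelInput, encodeScore, and decodeScore are hypothetical helpers, not part of Spark NLP. It shows how a task prefix turns any task into a string input, and how a regression target can be written as a string and parsed back.

```scala
import java.util.Locale
import scala.util.Try

// Hypothetical illustration only; not part of the Spark NLP API.
object TextToTextSketch {

  // Prepend the task prefix so every task becomes "text in, text out".
  def toModelInput(task: String, text: String): String =
    s"${task.trim} ${text.trim}"

  // A regression target is rendered as a string for training.
  // Locale.ROOT keeps the decimal separator a dot on every machine.
  def encodeScore(score: Double): String =
    "%.1f".formatLocal(Locale.ROOT, score)

  // Generated text is parsed back into a number; None if it is not numeric.
  def decodeScore(generated: String): Option[Double] =
    Try(generated.trim.toDouble).toOption

  def main(args: Array[String]): Unit = {
    println(toModelInput("summarize:", "Transfer learning has emerged as a powerful technique."))
    println(encodeScore(3.8))
    println(decodeScore("3.8"))
    println(decodeScore("not a number"))
  }
}
```

    In the annotator itself, the prefix is supplied through setTask, as in the examples on this page.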

    Pretrained models can be loaded with the pretrained method of the companion object:

    val t5 = T5Transformer.pretrained()
      .setTask("summarize:")
      .setInputCols("document")
      .setOutputCol("summaries")

    If no name is provided, the default model is "t5_small". For available pretrained models, please see the Models Hub.

    For extended examples of usage, see the Examples and the T5TestSpec.

    Paper Abstract:

    Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val t5 = T5Transformer.pretrained("t5_small")
      .setTask("summarize:")
      .setInputCols(Array("documents"))
      .setMaxOutputLength(200)
      .setOutputCol("summaries")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
    
    val data = Seq(
      "Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
        "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
        " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
        "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
        "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
        "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
        "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
        "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
        "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
        "learning for NLP, we release our data set, pre-trained models, and code."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("summaries.result").show(false)
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                        |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Value Members

  1. object BartTransformer extends ReadablePretrainedBartTransformerModel with ReadBartTransformerDLModel with Serializable
  2. object GPT2Transformer extends ReadablePretrainedGPT2TransformerModel with ReadGPT2TransformerDLModel with Serializable
  3. object LLAMA2Transformer extends ReadablePretrainedLLAMA2TransformerModel with ReadLLAMA2TransformerDLModel with Serializable
  4. object M2M100Transformer extends ReadablePretrainedM2M100TransformerModel with ReadM2M100TransformerDLModel with Serializable
  5. object MarianTransformer extends ReadablePretrainedMarianMTModel with ReadMarianMTDLModel with ReadSentencePieceModel with Serializable

    This is the companion object of MarianTransformer. Please refer to that class for the documentation.

  6. object MistralTransformer extends ReadablePretrainedMistralTransformerModel with ReadMistralTransformerDLModel with Serializable
  7. object Phi2Transformer extends ReadablePretrainedPhi2TransformerModel with ReadPhi2TransformerDLModel with Serializable
  8. object T5Transformer extends ReadablePretrainedT5TransformerModel with ReadT5TransformerDLModel with ReadSentencePieceModel with Serializable

    This is the companion object of T5Transformer. Please refer to that class for the documentation.

Ungrouped