Packages

package seq2seq


Type Members

  1. class AutoGGUFModel extends AnnotatorModel[AutoGGUFModel] with HasBatchedAnnotate[AutoGGUFModel] with HasEngine with HasLlamaCppProperties with HasProtectedParams

    Annotator that uses the llama.cpp library to generate text completions with large language models.

    For settable parameters, and their explanations, see HasLlamaCppProperties and refer to the llama.cpp documentation of server.cpp for more information.

    If the parameters are not set, the annotator defaults to the parameters provided by the model.

    Pretrained models can be loaded with pretrained of the companion object:

    val autoGGUFModel = AutoGGUFModel.pretrained()
      .setInputCols("document")
      .setOutputCol("completions")

    The default model is "phi3.5_mini_4k_instruct_q4_gguf", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the AutoGGUFModelTest and the example notebook.

    Note

    To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

    When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
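
    As a rough sketch of the tuning described above, the snippet below lowers the context size and offloads only part of the model to the GPU. The values 2048 and 20 are illustrative assumptions, not recommendations; suitable values depend on the model and on the available GPU memory.

```scala
import com.johnsnowlabs.nlp.annotators.seq2seq.AutoGGUFModel

// Assumed values for illustration only: tune both to your hardware.
val memoryConstrainedModel = AutoGGUFModel.pretrained()
  .setInputCols("document")
  .setOutputCol("completions")
  .setNCtx(2048)      // smaller context window, lower memory use (assumed value)
  .setNGpuLayers(20)  // offload only some layers to the GPU (assumed value)
```

    If out-of-memory errors persist, reducing either value further is a reasonable next step.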

    Example

    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    import spark.implicits._
    
    val document = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val autoGGUFModel = AutoGGUFModel
      .pretrained()
      .setInputCols("document")
      .setOutputCol("completions")
      .setBatchSize(4)
      .setNPredict(20)
      .setNGpuLayers(99)
      .setTemperature(0.4f)
      .setTopK(40)
      .setTopP(0.9f)
      .setPenalizeNl(true)
    
    val pipeline = new Pipeline().setStages(Array(document, autoGGUFModel))
    
    val data = Seq("Hello, I am a").toDF("text")
    val result = pipeline.fit(data).transform(data)
    result.select("completions").show(truncate = false)
    +-----------------------------------------------------------------------------------------------------------------------------------+
    |completions                                                                                                                        |
    +-----------------------------------------------------------------------------------------------------------------------------------+
    |[{document, 0, 78,  new user.  I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
    +-----------------------------------------------------------------------------------------------------------------------------------+
  2. class BartTransformer extends AnnotatorModel[BartTransformer] with HasBatchedAnnotate[BartTransformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteOnnxModel with HasEngine with HasGeneratorProperties

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer

    The Facebook BART (Bidirectional and Auto-Regressive Transformer) model is a state-of-the-art language generation model that was introduced by Facebook AI in 2019. It is based on the transformer architecture and is designed to handle a wide range of natural language processing tasks such as text generation, summarization, and machine translation.

    BART combines a bidirectional encoder with an auto-regressive decoder: the encoder reads the whole input, so it captures context from both preceding and following tokens, while the decoder generates the output text from left to right. This combination results in more accurate and natural language generation.

    The model was trained on a large corpus of text data using a combination of unsupervised and supervised learning techniques. It incorporates pretraining and fine-tuning phases, where the model is first trained on a large unlabeled corpus of text, and then fine-tuned on specific downstream tasks.

    BART has achieved state-of-the-art performance on a wide range of NLP tasks, including summarization, question-answering, and language translation. Its ability to handle multiple tasks and its high performance on each of these tasks make it a versatile and valuable tool for natural language processing applications.

    Pretrained models can be loaded with pretrained of the companion object:

    val bart = BartTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "distilbart_xsum_12_6", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see BartTestSpec.

    References:

    Paper Abstract:

    We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 3.5 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also replicate other pretraining schemes within the BART framework, to understand their effect on end-task performance.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.BartTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val bart = BartTransformer.pretrained("distilbart_xsum_12_6")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(30)
      .setDoSample(true)
      .setTopK(50)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, bart))
    
    val data = Seq(
      "PG&E stated it scheduled the blackouts in response to forecasts for high winds " +
      "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were " +
      "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +--------------------------------------------------------------+
    |result                                                        |
    +--------------------------------------------------------------+
    |[Nearly 800 thousand customers were affected by the shutoffs.]|
    +--------------------------------------------------------------+
  3. class CPMTransformer extends AnnotatorModel[CPMTransformer] with HasBatchedAnnotate[CPMTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    MiniCPM: Unveiling the Potential of End-side Large Language Models

    MiniCPM is a series of edge-side large language models, with the base model, MiniCPM-2B, having 2.4B non-embedding parameters. It ranks closely with Mistral-7B on comprehensive benchmarks (with better performance in Chinese, mathematics, and coding abilities), surpassing models like Llama2-13B, MPT-30B, and Falcon-40B. On the MTBench benchmark, which is closest to user experience, MiniCPM-2B also outperforms many representative open-source models such as Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, and Zephyr-7B-alpha.

    After DPO, MiniCPM outperforms Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, Zephyr-7B-alpha, etc. on MTBench.

    MiniCPM-V, based on MiniCPM-2B, achieves the best overall performance among multimodal models of the same scale, surpassing existing multimodal large models built on Phi-2 and achieving performance comparable to or even better than the 9.6B Qwen-VL-Chat on some tasks.

    MiniCPM can be deployed and run inference on smartphones, and its streaming output speed is somewhat higher than the speed of human speech.

    Pretrained models can be loaded with pretrained of the companion object:

    val cpm = CPMTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "llama_2_7b_chat_hf_int4", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see CPMTestSpec.

    References:

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.CPMTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val cpm = CPMTransformer.pretrained("llama_2_7b_chat_hf_int4")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, cpm))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                 |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo. I am a student at the University of California, Los Angeles. I have a passion for writing and learning about different cultures. I enjoy playing basketball and watching movies]|
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  4. class GPT2Transformer extends AnnotatorModel[GPT2Transformer] with HasBatchedAnnotate[GPT2Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteOnnxModel with HasEngine

    GPT-2: the OpenAI Text-To-Text Transformer

    GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

    GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

    Pretrained models can be loaded with pretrained of the companion object:

    val gpt2 = GPT2Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "gpt2", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see GPT2TestSpec.

    References:

    Paper Abstract:

    Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val gpt2 = GPT2Transformer.pretrained("gpt2")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  5. class LLAMA2Transformer extends AnnotatorModel[LLAMA2Transformer] with HasBatchedAnnotate[LLAMA2Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    The Llama 2 release introduces a family of pretrained and fine-tuned LLMs, ranging in scale from 7B to 70B parameters (7B, 13B, 70B). The pretrained models come with significant improvements over the Llama 1 models, including being trained on 40% more tokens, having a much longer context length (4k tokens), and using grouped-query attention for fast inference of the 70B model.

    However, the most exciting part of this release is the fine-tuned models (Llama 2-Chat), which have been optimized for dialogue applications using Reinforcement Learning from Human Feedback (RLHF). Across a wide range of helpfulness and safety benchmarks, the Llama 2-Chat models perform better than most open models and achieve comparable performance to ChatGPT according to human evaluations.

    Pretrained models can be loaded with pretrained of the companion object:

    val llama2 = LLAMA2Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "llama_2_7b_chat_hf_int4", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see LLAMA2TestSpec.

    References:

    Paper Abstract:

    In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.LLAMA2Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val llama2 = LLAMA2Transformer.pretrained("llama_2_7b_chat_hf_int4")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, llama2))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  6. class LLAMA3Transformer extends AnnotatorModel[LLAMA3Transformer] with HasBatchedAnnotate[LLAMA3Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Llama 3: Cutting-Edge Foundation and Fine-Tuned Chat Models

    The Llama 3 release introduces a new family of large language models, ranging from 8B to 70B parameters. Llama 3 models are designed with a greater emphasis on efficiency, performance, and safety, achieving remarkable advancements in training and deployment processes. These models are trained on a diversified dataset that significantly enhances their capability to generate more accurate and contextually relevant outputs.

    The fine-tuned variants, known as Llama 3-instruct, are specifically optimized for dialogue-based applications, making use of Reinforcement Learning from Human Feedback (RLHF) with an advanced reward model. Llama 3-instruct models demonstrate state-of-the-art performance across multiple benchmarks and surpass the capabilities of Llama 2, particularly in conversational settings.

    Pretrained models can be loaded with pretrained of the companion object:

    val llama3 = LLAMA3Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "llama_3_7b_chat_hf_int8", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see LLAMA3TestSpec.

    References:

    Paper Abstract:

    Llama 3 represents Meta’s latest innovation in the development of large language models (LLMs), offering a series of models from 1 billion to 70 billion parameters. These models have been fine-tuned for dialogue applications under the Llama 3-Chat series, ensuring they are highly responsive and context-aware. Our Llama 3 models not only excel in various benchmarks but also incorporate enhanced safety and alignment features to address ethical concerns and ensure responsible AI deployment. We invite the community to explore the capabilities of Llama 3 and contribute to ongoing research in the field of natural language processing.

    Note:

    This is a resource-intensive module, especially with larger models and sequences. Use of accelerators such as GPUs is strongly recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.LLAMA3Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val llama3 = LLAMA3Transformer.pretrained("llama_3_7b_chat_hf_int8")
      .setInputCols(Array("documents"))
      .setMinOutputLength(15)
      .setMaxOutputLength(60)
      .setDoSample(false)
      .setTopK(40)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, llama3))
    
    val data = Seq(
      (
        1,
        """<|start_header_id|>system<|end_header_id|>
    
        You are a minion chatbot who always responds in minion speak!
    
        <|start_header_id|>user<|end_header_id|>
    
        Who are you?
    
        <|start_header_id|>assistant<|end_header_id|>
        """.stripMargin)
    ).toDF("id", "text")
    
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +------------------------------------------------------------------------------------------+
    |result                                                                                    |
    +------------------------------------------------------------------------------------------+
    |[Oooh, me am Minion! Me help you with things! Me speak Minion language, yeah! Bana-na-na!]|
    +------------------------------------------------------------------------------------------+
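
    The chat-style prompt in the example above can also be assembled programmatically. The helper below is a minimal sketch, not part of Spark NLP: the object name Llama3Prompt and the method build are hypothetical, and the layout only mirrors the header markers shown in the example. Whether a given model expects additional special tokens is not covered here.

```scala
// Hypothetical helper (not a Spark NLP API): builds a prompt in the layout
// used by the example above.
object Llama3Prompt {
  // Wraps a role name in the header markers shown in the example.
  private def header(role: String): String =
    s"<|start_header_id|>$role<|end_header_id|>"

  // Joins the system and user messages, then leaves an open assistant
  // header so the model continues from that point.
  def build(system: String, user: String): String =
    Seq(
      header("system"), "", system, "",
      header("user"), "", user, "",
      header("assistant"), ""
    ).mkString("\n")

  def main(args: Array[String]): Unit =
    println(build(
      "You are a minion chatbot who always responds in minion speak!",
      "Who are you?"))
}
```

    The resulting string can be placed in the "text" column in the same way as the literal prompt in the example.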
  7. class M2M100Transformer extends AnnotatorModel[M2M100Transformer] with HasBatchedAnnotate[M2M100Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    M2M100: Multilingual translation model

    M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.

    The model can directly translate between the 9,900 directions of 100 languages.
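
    The figure of 9,900 directions follows from counting ordered (source, target) pairs of distinct languages: 100 × 99 = 9,900. The short sketch below only illustrates that arithmetic; the object and method names are hypothetical and not part of Spark NLP.

```scala
// Illustrative arithmetic only: counts ordered pairs of distinct languages.
object TranslationDirections {
  // Each of the n source languages can be paired with n - 1 target languages.
  def count(numLanguages: Int): Int = numLanguages * (numLanguages - 1)

  def main(args: Array[String]): Unit =
    println(count(100)) // 100 * 99 = 9900
}
```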

    Pretrained models can be loaded with pretrained of the companion object:

    val m2m100 = M2M100Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "m2m100_418M", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see M2M100TestSpec.

    References:

    Paper Abstract:

    Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.

    Languages Covered:

    Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val m2m100 = M2M100Transformer.pretrained("m2m100_418M")
      .setInputCols(Array("documents"))
      .setSrcLang("zh")
      .setTgtLang("en")
      .setMaxOutputLength(100)
      .setDoSample(false)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, m2m100))
    
    val data = Seq(
      "生活就像一盒巧克力。"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +-------------------------------------------------------------------------------------------+
    |result                                                                                     |
    +-------------------------------------------------------------------------------------------+
    |[ Life is like a box of chocolate.]                                                        |
    +-------------------------------------------------------------------------------------------+
  8. class MarianTransformer extends AnnotatorModel[MarianTransformer] with HasBatchedAnnotate[MarianTransformer] with WriteTensorflowModel with WriteOnnxModel with WriteSentencePieceModel with HasEngine with HasProtectedParams

    MarianTransformer: Fast Neural Machine Translation

    Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.

    It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects.

    Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences first.
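
    As a rough illustration of splitting long inputs before translation, the sketch below greedily packs already-split sentences into chunks that stay under a token budget. It is plain Scala and not part of Spark NLP: the names `ChunkSketch`, `approxTokens`, and `packSentences` are hypothetical, and counting whitespace-separated words is only an assumed proxy for the model's subword tokens, so a budget well below 512 would be the cautious choice.

```scala
object ChunkSketch {
  // Hypothetical helper: approximates the token count by whitespace-separated words.
  // The model itself counts subword tokens, so this is only a rough proxy.
  def approxTokens(text: String): Int =
    text.trim.split("\\s+").count(_.nonEmpty)

  // Greedily packs sentences into chunks whose approximate token count stays
  // within maxTokens. A single sentence longer than maxTokens is kept as its
  // own (oversized) chunk rather than being cut mid-sentence.
  def packSentences(sentences: Seq[String], maxTokens: Int): Seq[String] = {
    val chunks = scala.collection.mutable.ArrayBuffer.empty[String]
    var current = ""
    for (s <- sentences) {
      val candidate = if (current.isEmpty) s else current + " " + s
      if (current.nonEmpty && approxTokens(candidate) > maxTokens) {
        chunks += current // close the current chunk and start a new one
        current = s
      } else {
        current = candidate
      }
    }
    if (current.nonEmpty) chunks += current
    chunks.toList
  }
}
```

    For instance, `ChunkSketch.packSentences(Seq("a b c", "d e", "f g h i"), 5)` groups the first two sentences into one chunk and leaves the third in its own chunk.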

    Pretrained models can be loaded with pretrained of the companion object:

    val marian = MarianTransformer.pretrained()
      .setInputCols("sentence")
      .setOutputCol("translation")

    If no values are provided, the default model is "opus_mt_en_fr" and the default language is "xx" (meaning multilingual). For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Examples and the MarianTransformerTestSpec.

    Sources:

    MarianNMT at GitHub

    Marian: Fast Neural Machine Translation in C++

    Paper Abstract:

    We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

    Note:

    This is a very computationally expensive module, especially on longer sequences. Using an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetectorDLModel
    import com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val marian = MarianTransformer.pretrained()
      .setInputCols("sentence")
      .setOutputCol("translation")
      .setMaxInputLength(30)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        marian
      ))
    
    val data = Seq("What is the capital of France? We should know this in french.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(translation.result) as result").show(false)
    +-------------------------------------+
    |result                               |
    +-------------------------------------+
    |Quelle est la capitale de la France ?|
    |On devrait le savoir en français.    |
    +-------------------------------------+
  9. class MistralTransformer extends AnnotatorModel[MistralTransformer] with HasBatchedAnnotate[MistralTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    Mistral 7B

    Mistral 7B is a 7.3 billion-parameter model that stands out for its efficient and effective performance in natural language processing. It surpasses Llama 2 13B across all benchmarks and outperforms Llama 1 34B in various aspects. Mistral 7B strikes a balance between English language tasks and code comprehension, rivaling the capabilities of CodeLlama 7B in the latter.

    Mistral 7B introduces Grouped-query attention (GQA) for quicker inference, enhancing processing speed without compromising accuracy. This streamlined approach ensures a smoother user experience, making Mistral 7B a practical choice for real-world applications.
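
    As a rough illustration of the grouping idea behind GQA, the sketch below maps each query head to the key/value head it shares with its group. The names `GqaSketch` and `kvHeadFor`, and the head counts used in the note after the code, are hypothetical and are not taken from Spark NLP or from the Mistral configuration.

```scala
object GqaSketch {
  // Maps a query head index to the key/value head it shares under
  // grouped-query attention: consecutive query heads form one group.
  def kvHeadFor(queryHead: Int, numQueryHeads: Int, numKvHeads: Int): Int = {
    require(numKvHeads > 0 && numQueryHeads % numKvHeads == 0,
      "query heads must divide evenly into key/value groups")
    require(queryHead >= 0 && queryHead < numQueryHeads,
      "query head index out of range")
    val groupSize = numQueryHeads / numKvHeads
    queryHead / groupSize
  }
}
```

    With 8 query heads and 2 key/value heads, query heads 0 to 3 share key/value head 0 and heads 4 to 7 share head 1, so fewer key/value projections have to be computed and cached during inference.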

    Additionally, Mistral 7B adopts Sliding Window Attention (SWA) to efficiently handle longer sequences at a reduced computational cost. This feature enhances the model's ability to process extensive textual input, expanding its utility in handling more complex tasks.
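
    Sliding window attention can be pictured as a causal mask that only looks back a fixed number of positions. The sketch below builds such a mask in plain Scala; `SwaSketch`, `canAttend`, and `mask` are hypothetical names, and the window size is an illustrative parameter rather than the value Mistral 7B uses.

```scala
object SwaSketch {
  // True when the token at position `query` may attend to position `key`:
  // the key must not lie in the future (causal) and must fall within the
  // last `window` positions, counting the query position itself.
  def canAttend(query: Int, key: Int, window: Int): Boolean =
    key <= query && key > query - window

  // Builds the full seqLen x seqLen boolean mask (rows are query positions).
  def mask(seqLen: Int, window: Int): Vector[Vector[Boolean]] =
    Vector.tabulate(seqLen, seqLen)((q, k) => canAttend(q, k, window))
}
```

    With `seqLen = 5` and `window = 3`, the last position attends only to positions 2, 3, and 4, so each row has at most `window` active entries regardless of sequence length.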

    In summary, Mistral 7B represents a notable advancement in language models, offering a reliable and versatile solution for various natural language processing challenges.

    Pretrained models can be loaded with pretrained of the companion object:

    val mistral = MistralTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "mistral_7b", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see MistralTestSpec.

    References:

    Paper Abstract:

    We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.

    Note:

    This is a very computationally expensive module, especially on longer sequences. Using an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.MistralTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val mistral = MistralTransformer.pretrained("mistral_7b")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, mistral))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     |result                                                                                                                                                                                              |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
     |[Leonardo Da Vinci invented the microscope?\n Question: Leonardo Da Vinci invented the microscope?\n Answer: No, Leonardo Da Vinci did not invent the microscope. The first microscope was invented |
     | in the late 16th century, long after Leonardo']                                                                                                                                                    |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
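
    The example above sets setNoRepeatNgramSize(3). As a sketch of what such a constraint usually means, the plain-Scala function below computes which next tokens would repeat an n-gram that already occurs in the generated sequence. This is the general technique, assuming integer token IDs; `NoRepeatNgramSketch` and `bannedNextTokens` are hypothetical names, and the code is not Spark NLP's internal implementation.

```scala
object NoRepeatNgramSketch {
  // Returns the tokens that, if generated next, would complete an n-gram
  // that already appears somewhere in `generated`.
  def bannedNextTokens(generated: Seq[Int], n: Int): Set[Int] = {
    require(n > 0, "n-gram size must be positive")
    // The last n - 1 tokens form the prefix of the n-gram being built.
    val prefix = generated.takeRight(n - 1)
    generated.sliding(n).collect {
      // Only full-length windows count; a match on the prefix bans the
      // token that followed it earlier in the sequence.
      case gram if gram.length == n && gram.take(n - 1) == prefix => gram.last
    }.toSet
  }
}
```

    For the token sequence 1, 2, 3, 1, 2 with n = 3, the current prefix is (1, 2), which was previously followed by 3, so token 3 is banned at the next step.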
  10. class NLLBTransformer extends AnnotatorModel[NLLBTransformer] with HasBatchedAnnotate[NLLBTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    NLLB: multilingual translation model

    NLLB is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.

    The model can directly translate between 200+ languages.

    Pretrained models can be loaded with pretrained of the companion object:

    val nllb = NLLBTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "nllb_418M", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see NLLBTestSpec.

    References:

    Paper Abstract:

    Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work.

    Languages Covered:

    Acehnese (Arabic script) (ace_Arab), Acehnese (Latin script) (ace_Latn), Mesopotamian Arabic (acm_Arab), Ta’izzi-Adeni Arabic (acq_Arab), Tunisian Arabic (aeb_Arab), Afrikaans (afr_Latn), South Levantine Arabic (ajp_Arab), Akan (aka_Latn), Amharic (amh_Ethi), North Levantine Arabic (apc_Arab), Modern Standard Arabic (arb_Arab), Modern Standard Arabic (Romanized) (arb_Latn), Najdi Arabic (ars_Arab), Moroccan Arabic (ary_Arab), Egyptian Arabic (arz_Arab), Assamese (asm_Beng), Asturian (ast_Latn), Awadhi (awa_Deva), Central Aymara (ayr_Latn), South Azerbaijani (azb_Arab), North Azerbaijani (azj_Latn), Bashkir (bak_Cyrl), Bambara (bam_Latn), Balinese (ban_Latn), Belarusian (bel_Cyrl), Bemba (bem_Latn), Bengali (ben_Beng), Bhojpuri (bho_Deva), Banjar (Arabic script) (bjn_Arab), Banjar (Latin script) (bjn_Latn), Standard Tibetan (bod_Tibt), Bosnian (bos_Latn), Buginese (bug_Latn), Bulgarian (bul_Cyrl), Catalan (cat_Latn), Cebuano (ceb_Latn), Czech (ces_Latn), Chokwe (cjk_Latn), Central Kurdish (ckb_Arab), Crimean Tatar (crh_Latn), Welsh (cym_Latn), Danish (dan_Latn), German (deu_Latn), Southwestern Dinka (dik_Latn), Dyula (dyu_Latn), Dzongkha (dzo_Tibt), Greek (ell_Grek), English (eng_Latn), Esperanto (epo_Latn), Estonian (est_Latn), Basque (eus_Latn), Ewe (ewe_Latn), Faroese (fao_Latn), Fijian (fij_Latn), Finnish (fin_Latn), Fon (fon_Latn), French (fra_Latn), Friulian (fur_Latn), Nigerian Fulfulde (fuv_Latn), Scottish Gaelic (gla_Latn), Irish (gle_Latn), Galician (glg_Latn), Guarani (grn_Latn), Gujarati (guj_Gujr), Haitian Creole (hat_Latn), Hausa (hau_Latn), Hebrew (heb_Hebr), Hindi (hin_Deva), Chhattisgarhi (hne_Deva), Croatian (hrv_Latn), Hungarian (hun_Latn), Armenian (hye_Armn), Igbo (ibo_Latn), Ilocano (ilo_Latn), Indonesian (ind_Latn), Icelandic (isl_Latn), Italian (ita_Latn), Javanese (jav_Latn), Japanese (jpn_Jpan), Kabyle (kab_Latn), Jingpho (kac_Latn), Kamba (kam_Latn), Kannada (kan_Knda), Kashmiri (Arabic script) (kas_Arab), Kashmiri (Devanagari script) 
(kas_Deva), Georgian (kat_Geor), Central Kanuri (Arabic script) (knc_Arab), Central Kanuri (Latin script) (knc_Latn), Kazakh (kaz_Cyrl), Kabiyè (kbp_Latn), Kabuverdianu (kea_Latn), Khmer (khm_Khmr), Kikuyu (kik_Latn), Kinyarwanda (kin_Latn), Kyrgyz (kir_Cyrl), Kimbundu (kmb_Latn), Northern Kurdish (kmr_Latn), Kikongo (kon_Latn), Korean (kor_Hang), Lao (lao_Laoo), Ligurian (lij_Latn), Limburgish (lim_Latn), Lingala (lin_Latn), Lithuanian (lit_Latn), Lombard (lmo_Latn), Latgalian (ltg_Latn), Luxembourgish (ltz_Latn), Luba-Kasai (lua_Latn), Ganda (lug_Latn), Luo (luo_Latn), Mizo (lus_Latn), Standard Latvian (lvs_Latn), Magahi (mag_Deva), Maithili (mai_Deva), Malayalam (mal_Mlym), Marathi (mar_Deva), Minangkabau (Arabic script) (min_Arab), Minangkabau (Latin script) (min_Latn), Macedonian (mkd_Cyrl), Plateau Malagasy (plt_Latn), Maltese (mlt_Latn), Meitei (Bengali script) (mni_Beng), Halh Mongolian (khk_Cyrl), Mossi (mos_Latn), Maori (mri_Latn), Burmese (mya_Mymr), Dutch (nld_Latn), Norwegian Nynorsk (nno_Latn), Norwegian Bokmål (nob_Latn), Nepali (npi_Deva), Northern Sotho (nso_Latn), Nuer (nus_Latn), Nyanja (nya_Latn), Occitan (oci_Latn), West Central Oromo (gaz_Latn), Odia (ory_Orya), Pangasinan (pag_Latn), Eastern Panjabi (pan_Guru), Papiamento (pap_Latn), Western Persian (pes_Arab), Polish (pol_Latn), Portuguese (por_Latn), Dari (prs_Arab), Southern Pashto (pbt_Arab), Ayacucho Quechua (quy_Latn), Romanian (ron_Latn), Rundi (run_Latn), Russian (rus_Cyrl), Sango (sag_Latn), Sanskrit (san_Deva), Santali (sat_Olck), Sicilian (scn_Latn), Shan (shn_Mymr), Sinhala (sin_Sinh), Slovak (slk_Latn), Slovenian (slv_Latn), Samoan (smo_Latn), Shona (sna_Latn), Sindhi (snd_Arab), Somali (som_Latn), Southern Sotho (sot_Latn), Spanish (spa_Latn), Tosk Albanian (als_Latn), Sardinian (srd_Latn), Serbian (srp_Cyrl), Swati (ssw_Latn), Sundanese (sun_Latn), Swedish (swe_Latn), Swahili (swh_Latn), Silesian (szl_Latn), Tamil (tam_Taml), Tatar (tat_Cyrl), Telugu (tel_Telu), Tajik 
(tgk_Cyrl), Tagalog (tgl_Latn), Thai (tha_Thai), Tigrinya (tir_Ethi), Tamasheq (Latin script) (taq_Latn), Tamasheq (Tifinagh script) (taq_Tfng), Tok Pisin (tpi_Latn), Tswana (tsn_Latn), Tsonga (tso_Latn), Turkmen (tuk_Latn), Tumbuka (tum_Latn), Turkish (tur_Latn), Twi (twi_Latn), Central Atlas Tamazight (tzm_Tfng), Uyghur (uig_Arab), Ukrainian (ukr_Cyrl), Umbundu (umb_Latn), Urdu (urd_Arab), Northern Uzbek (uzn_Latn), Venetian (vec_Latn), Vietnamese (vie_Latn), Waray (war_Latn), Wolof (wol_Latn), Xhosa (xho_Latn), Eastern Yiddish (ydd_Hebr), Yoruba (yor_Latn), Yue Chinese (yue_Hant), Chinese (Simplified) (zho_Hans), Chinese (Traditional) (zho_Hant), Standard Malay (zsm_Latn), Zulu (zul_Latn).
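
    The codes listed above follow a `language_Script` shape, for example "zho_Hans" or "eng_Latn". The sketch below checks that a string has this shape before it is passed on as a source or target language. It only validates the format, not membership in the list, and `NllbCodeSketch`, `LangCode`, and `parse` are hypothetical names that do not belong to Spark NLP.

```scala
object NllbCodeSketch {
  // A language code split into its two parts, e.g. LangCode("zho", "Hans").
  final case class LangCode(language: String, script: String)

  // Three lowercase letters, an underscore, then a four-letter script tag
  // with a leading capital, matching the shape of the codes listed above.
  private val Pattern = "([a-z]{3})_([A-Z][a-z]{3})".r

  // Returns the parsed code, or None when the string has a different shape.
  def parse(code: String): Option[LangCode] = code match {
    case Pattern(lang, script) => Some(LangCode(lang, script))
    case _                     => None
  }
}
```

    A short two-letter code such as "zh" does not match this shape, so `NllbCodeSketch.parse("zh")` yields `None`, while `NllbCodeSketch.parse("zho_Hans")` yields the language and script parts.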

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.NLLBTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val nllb = NLLBTransformer.pretrained("nllb_418M")
      .setInputCols(Array("documents"))
      .setSrcLang("zho_Hans")
      .setTgtLang("eng_Latn")
      .setMaxOutputLength(100)
      .setDoSample(false)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, nllb))
    
    val data = Seq(
      "生活就像一盒巧克力。"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +-------------------------------------------------------------------------------------------+
    |result                                                                                     |
    +-------------------------------------------------------------------------------------------+
    |[ Life is like a box of chocolate.]                                                        |
    +-------------------------------------------------------------------------------------------+
  11. class Phi2Transformer extends AnnotatorModel[Phi2Transformer] with HasBatchedAnnotate[Phi2Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Phi-2: Textbooks Are All You Need.

    Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.

    Phi-2 hasn't been fine-tuned through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more.

    Pretrained models can be loaded with pretrained of the companion object:

    val Phi2 = Phi2Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "Phi2-13b", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see Phi2TestSpec.

    References:

    Paper Abstract:

    The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

    Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

    Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following upon our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality. Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but shows clear boost in Phi-2 benchmark scores.

    Note:

    This is a very computationally expensive module, especially on longer sequences. Using an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.Phi2Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val Phi2 = Phi2Transformer.pretrained("phi2")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, Phi2))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo . I am a student of the University of California, Berkeley. I am interested in the field of Artificial Intelligence and its applications in the real world. I have a strong   |
    | passion for learning and am always looking for ways to improve my knowledge and skills]                                                                                                            |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
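
    The example above combines setDoSample(false) with setTopK(50). As a sketch of the two ideas in general terms, the code below contrasts greedy selection (take the highest-scoring token) with top-k filtering (keep only the k highest scores as sampling candidates). In most decoders top-k only matters when sampling is enabled; whether Spark NLP behaves exactly this way is an assumption here, and `DecodingSketch`, `greedy`, and `topKFilter` are hypothetical names.

```scala
object DecodingSketch {
  // Greedy decoding: index of the highest logit (first one wins on ties).
  def greedy(logits: Seq[Double]): Int = {
    require(logits.nonEmpty, "logits must not be empty")
    logits.indices.reduceLeft((a, b) => if (logits(b) > logits(a)) b else a)
  }

  // Top-k filtering: keeps logits at or above the k-th largest value and
  // masks the rest with negative infinity so they can never be sampled.
  // Ties at the threshold may leave more than k candidates.
  def topKFilter(logits: Seq[Double], k: Int): Seq[Double] = {
    require(k > 0, "k must be positive")
    if (k >= logits.length) logits
    else {
      val threshold = logits.sortWith(_ > _)(k - 1)
      logits.map(l => if (l >= threshold) l else Double.NegativeInfinity)
    }
  }
}
```

    For the logits 0.1, 2.0, -1.0, 1.5, greedy selection picks index 1, and top-k filtering with k = 2 keeps only the entries 2.0 and 1.5.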
  12. class Phi3Transformer extends AnnotatorModel[Phi3Transformer] with HasBatchedAnnotate[Phi3Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine

    Phi-3

    The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) that each can support.

    After initial training, the model underwent a post-training process that involved supervised fine-tuning and direct preference optimization to enhance its ability to follow instructions and adhere to safety measures. When evaluated against benchmarks that test common sense, language understanding, mathematics, coding, long-term context, and logical reasoning, the Phi-3 Mini-128K-Instruct demonstrated robust and state-of-the-art performance among models with fewer than 13 billion parameters.

    Pretrained models can be loaded with pretrained of the companion object:

    val phi3 = Phi3Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "phi_3_mini_128k_instruct_int8", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see Phi3TestSpec.

    References:

    Paper Abstract:

    We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.

    Note:

    This is a very computationally expensive module, especially on longer sequences. Using an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.Phi3Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val phi3 = Phi3Transformer.pretrained("phi_3_mini_128k_instruct_int8")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, phi3))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  13. class QwenTransformer extends AnnotatorModel[QwenTransformer] with HasBatchedAnnotate[QwenTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    Qwen: comprehensive language model series

    Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Compared with the previously released Qwen, the improvements include:

    Six model sizes (0.5B, 1.8B, 4B, 7B, 14B, and 72B); significant performance improvement in chat models; multilingual support in both base and chat models; and stable support for a 32K context length for models of all sizes.

    Qwen1.5 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and code. The beta version temporarily does not include GQA or the mixture of SWA and full attention.
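
    As a small illustration of the SwiGLU activation named above, the sketch below applies the gating element-wise in plain Scala. In an actual feed-forward layer the `gate` and `up` vectors would come from two separate learned linear projections of the same input; here they are passed in directly, and `SwiGluSketch`, `silu`, and `swiglu` are hypothetical names.

```scala
object SwiGluSketch {
  // SiLU (also called Swish): x * sigmoid(x), written as x / (1 + e^-x).
  def silu(x: Double): Double = x / (1.0 + math.exp(-x))

  // Element-wise gating: silu(gate(i)) * up(i) for every position i.
  def swiglu(gate: Seq[Double], up: Seq[Double]): Seq[Double] = {
    require(gate.length == up.length, "gate and up must have the same length")
    gate.zip(up).map { case (g, u) => silu(g) * u }
  }
}
```

    A gate value of 0 suppresses the corresponding `up` value entirely, while large positive gate values pass it through almost unchanged.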

    Pretrained models can be loaded with pretrained of the companion object:

    val Qwen = QwenTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "Qwen-13b", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see QwenTestSpec.

    References:

    Paper Abstract:

    Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

    Note:

    This is a very computationally expensive module, especially on longer sequences. Using an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.QwenTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val Qwen = QwenTransformer.pretrained("Qwen-7b")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, Qwen))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo . I am a student of the University of California, Berkeley. I am interested in the field of Artificial Intelligence and its applications in the real world. I have a strong   |
    | passion for learning and am always looking for ways to improve my knowledge and skills]                                                                                                            |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  14. trait ReadAutoGGUFModel extends AnyRef
  15. trait ReadBartTransformerDLModel extends ReadTensorflowModel with ReadOnnxModel
  16. trait ReadCPMTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  17. trait ReadGPT2TransformerDLModel extends ReadTensorflowModel with ReadOnnxModel
  18. trait ReadLLAMA2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  19. trait ReadLLAMA3TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
  20. trait ReadM2M100TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  21. trait ReadMarianMTDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel
  22. trait ReadMistralTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  23. trait ReadNLLBTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  24. trait ReadPhi2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
  25. trait ReadPhi3TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
  26. trait ReadQwenTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
  27. trait ReadStarCoderTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
  28. trait ReadT5TransformerDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
  29. trait ReadablePretrainedAutoGGUFModel extends ParamsAndFeaturesReadable[AutoGGUFModel] with HasPretrained[AutoGGUFModel]
  30. trait ReadablePretrainedBartTransformerModel extends ParamsAndFeaturesReadable[BartTransformer] with HasPretrained[BartTransformer]
  31. trait ReadablePretrainedCPMTransformerModel extends ParamsAndFeaturesReadable[CPMTransformer] with HasPretrained[CPMTransformer]
  32. trait ReadablePretrainedGPT2TransformerModel extends ParamsAndFeaturesReadable[GPT2Transformer] with HasPretrained[GPT2Transformer]
  33. trait ReadablePretrainedLLAMA2TransformerModel extends ParamsAndFeaturesReadable[LLAMA2Transformer] with HasPretrained[LLAMA2Transformer]
  34. trait ReadablePretrainedLLAMA3TransformerModel extends ParamsAndFeaturesReadable[LLAMA3Transformer] with HasPretrained[LLAMA3Transformer]
  35. trait ReadablePretrainedM2M100TransformerModel extends ParamsAndFeaturesReadable[M2M100Transformer] with HasPretrained[M2M100Transformer]
  36. trait ReadablePretrainedMarianMTModel extends ParamsAndFeaturesReadable[MarianTransformer] with HasPretrained[MarianTransformer]
  37. trait ReadablePretrainedMistralTransformerModel extends ParamsAndFeaturesReadable[MistralTransformer] with HasPretrained[MistralTransformer]
  38. trait ReadablePretrainedNLLBTransformerModel extends ParamsAndFeaturesReadable[NLLBTransformer] with HasPretrained[NLLBTransformer]
  39. trait ReadablePretrainedPhi2TransformerModel extends ParamsAndFeaturesReadable[Phi2Transformer] with HasPretrained[Phi2Transformer]
  40. trait ReadablePretrainedPhi3TransformerModel extends ParamsAndFeaturesReadable[Phi3Transformer] with HasPretrained[Phi3Transformer]
  41. trait ReadablePretrainedQwenTransformerModel extends ParamsAndFeaturesReadable[QwenTransformer] with HasPretrained[QwenTransformer]
  42. trait ReadablePretrainedStarCoderTransformerModel extends ParamsAndFeaturesReadable[StarCoderTransformer] with HasPretrained[StarCoderTransformer]
  43. trait ReadablePretrainedT5TransformerModel extends ParamsAndFeaturesReadable[T5Transformer] with HasPretrained[T5Transformer]
  44. class StarCoderTransformer extends AnnotatorModel[StarCoderTransformer] with HasBatchedAnnotate[StarCoderTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

    StarCoder2: The Versatile Code Companion.

    StarCoder2 is a Transformer model designed specifically for code generation and understanding. Released in 3B, 7B, and 15B parameter sizes, it builds upon the advancements of its predecessors and is trained on a diverse dataset that includes multiple programming languages. This extensive training allows StarCoder2 to support a wide array of coding tasks, from code completion to generation.

    StarCoder2 was developed to assist developers in writing and understanding code more efficiently, making it a valuable tool for various software development and data science tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val starcoder2 = StarCoderTransformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    The default model is "StarCoder2-3B", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see StarCoder2TestSpec.

    References:

    Paper Abstract:

    The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4× larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks.

    We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

    Note:

    This is a computationally intensive module, especially for larger code sequences. The use of an accelerator such as GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.StarCoderTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val starcoder2 = StarCoderTransformer.pretrained("starcoder2")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, starcoder2))
    
    val data = Seq(
      "def add(a, b):"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[def add(a, b): return a + b]                                                                                                                                                                       |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
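
    The example prints the generated text with show. As a minimal sketch of how the same column could be collected as plain strings instead, the helper below flattens the array-typed result field of an annotation column. The object name GenerationResults and the method generatedTexts are hypothetical names introduced here for illustration and are not part of Spark NLP; only standard Spark SQL functions are used.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper, not part of Spark NLP.
object GenerationResults {
  // `<outputCol>.result` is an array of strings (one per annotation);
  // explode turns it into one row per generated text.
  def generatedTexts(transformed: DataFrame, outputCol: String): Array[String] =
    transformed
      .select(explode(col(s"$outputCol.result")).as("text"))
      .collect() // brings all rows to the driver; only suitable for small outputs
      .map(_.getString(0))
}
```

    With the transformed DataFrame from the example, GenerationResults.generatedTexts(result, "generation") would return one string per generated annotation.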
  45. class T5Transformer extends AnnotatorModel[T5Transformer] with HasBatchedAnnotate[T5Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasCaseSensitiveProperties with WriteSentencePieceModel with HasProtectedParams with HasEngine

    T5: the Text-To-Text Transfer Transformer

    T5 reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework is able to use the same model, loss function, and hyper-parameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). T5 can even be applied to regression tasks by training it to predict the string representation of a number instead of the number itself.
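
    The text-to-text framing described above can be sketched in plain Scala. This is an illustration of the idea only, not Spark NLP code: TextToTextExample, frame, and frameRegression are hypothetical names, and joining the task prefix to the text with a single space is an assumption made for this sketch.

```scala
import java.util.Locale

// Illustrative model of the text-to-text framing; not part of Spark NLP.
final case class TextToTextExample(input: String, target: String)

object TextToTextSketch {
  // Prepend a task prefix such as "summarize:" to the raw text (assumed separator: one space).
  def frame(task: String, text: String, target: String): TextToTextExample =
    TextToTextExample(s"${task.trim} ${text.trim}", target)

  // Regression: the numeric label is rendered as a string, so the target is still text.
  def frameRegression(task: String, text: String, score: Double): TextToTextExample =
    frame(task, text, String.format(Locale.ROOT, "%.1f", Double.box(score)))

  def main(args: Array[String]): Unit = {
    val summary = frame(
      "summarize:",
      "Transfer learning has emerged as a powerful technique in NLP.",
      "transfer learning is a powerful NLP technique")
    val similarity = frameRegression(
      "stsb",
      "sentence1: A man is singing. sentence2: A man sings.",
      4.6)
    println(summary)
    println(similarity)
  }
}
```

    Both tasks end up as a pair of strings, which is why a single model and loss function can serve summarization, classification, and regression alike.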

    Pretrained models can be loaded with pretrained of the companion object:

    val t5 = T5Transformer.pretrained()
      .setTask("summarize:")
      .setInputCols("document")
      .setOutputCol("summaries")

    The default model is "t5_small", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Examples and the T5TestSpec.

    References:

    Paper Abstract:

    Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

    Note:

    This is a very computationally expensive module, especially for longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val t5 = T5Transformer.pretrained("t5_small")
      .setTask("summarize:")
      .setInputCols(Array("documents"))
      .setMaxOutputLength(200)
      .setOutputCol("summaries")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
    
    val data = Seq(
      "Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
        "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
        " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
        "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
        "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
        "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
        "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
        "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
        "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
        "learning for NLP, we release our data set, pre-trained models, and code."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("summaries.result").show(false)
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                        |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Value Members

  1. object AutoGGUFModel extends ReadablePretrainedAutoGGUFModel with ReadAutoGGUFModel with Serializable

    This is the companion object of AutoGGUFModel. Please refer to that class for the documentation.

  2. object BartTransformer extends ReadablePretrainedBartTransformerModel with ReadBartTransformerDLModel with Serializable
  3. object CPMTransformer extends ReadablePretrainedCPMTransformerModel with ReadCPMTransformerDLModel with Serializable
  4. object GPT2Transformer extends ReadablePretrainedGPT2TransformerModel with ReadGPT2TransformerDLModel with Serializable
  5. object LLAMA2Transformer extends ReadablePretrainedLLAMA2TransformerModel with ReadLLAMA2TransformerDLModel with Serializable
  6. object LLAMA3Transformer extends ReadablePretrainedLLAMA3TransformerModel with ReadLLAMA3TransformerDLModel with Serializable
  7. object M2M100Transformer extends ReadablePretrainedM2M100TransformerModel with ReadM2M100TransformerDLModel with Serializable
  8. object MarianTransformer extends ReadablePretrainedMarianMTModel with ReadMarianMTDLModel with ReadSentencePieceModel with Serializable

    This is the companion object of MarianTransformer. Please refer to that class for the documentation.

  9. object MistralTransformer extends ReadablePretrainedMistralTransformerModel with ReadMistralTransformerDLModel with Serializable
  10. object NLLBTransformer extends ReadablePretrainedNLLBTransformerModel with ReadNLLBTransformerDLModel with Serializable
  11. object Phi2Transformer extends ReadablePretrainedPhi2TransformerModel with ReadPhi2TransformerDLModel with Serializable
  12. object Phi3Transformer extends ReadablePretrainedPhi3TransformerModel with ReadPhi3TransformerDLModel with Serializable
  13. object QwenTransformer extends ReadablePretrainedQwenTransformerModel with ReadQwenTransformerDLModel with Serializable
  14. object StarCoderTransformer extends ReadablePretrainedStarCoderTransformerModel with ReadStarCoderTransformerDLModel with Serializable
  15. object T5Transformer extends ReadablePretrainedT5TransformerModel with ReadT5TransformerDLModel with ReadSentencePieceModel with Serializable

    This is the companion object of T5Transformer. Please refer to that class for the documentation.

Ungrouped