seq2seq

package seq2seq


Type Members

class AutoGGUFModel extends AnnotatorModel[AutoGGUFModel] with HasBatchedAnnotate[AutoGGUFModel] with HasEngine with HasLlamaCppModelProperties with HasLlamaCppInferenceProperties with HasProtectedParams

Annotator that uses the llama.cpp library to generate text completions with large language models.

For settable parameters, and their explanations, see HasLlamaCppInferenceProperties, HasLlamaCppModelProperties and refer to the llama.cpp documentation of server.cpp for more information.

If a parameter is not set, the annotator falls back to the value provided by the model.

Pretrained models can be loaded with pretrained of the companion object:

val autoGGUFModel = AutoGGUFModel.pretrained()
  .setInputCols("document")
  .setOutputCol("completions")

The default model is "phi3.5_mini_4k_instruct_q4_gguf", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the AutoGGUFModelTest and the example notebook.

Note

To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.

Example

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val autoGGUFModel = AutoGGUFModel
  .pretrained()
  .setInputCols("document")
  .setOutputCol("completions")
  .setBatchSize(4)
  .setNPredict(20)
  .setNGpuLayers(99)
  .setTemperature(0.4f)
  .setTopK(40)
  .setTopP(0.9f)
  .setPenalizeNl(true)

val pipeline = new Pipeline().setStages(Array(document, autoGGUFModel))

val data = Seq("Hello, I am a").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("completions").show(truncate = false)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions                                                                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78,  new user.  I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+
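The example above sets `setTemperature`, `setTopK` and `setTopP`. As a rough illustration of what these sampling parameters usually mean, here is a minimal, self-contained sketch in plain Scala. It is not the llama.cpp implementation, and the order in which llama.cpp applies its samplers may differ; the object `SamplingSketch` and its methods are hypothetical names used only for this illustration. The sketch assumes a non-empty list of logits and a temperature greater than zero.

```scala
object SamplingSketch {
  /** Converts logits to probabilities after dividing them by the temperature. */
  def softmax(logits: Seq[Double], temperature: Double): Seq[Double] = {
    val scaled = logits.map(_ / temperature)
    val maxLogit = scaled.max // subtract the maximum for numerical stability
    val exps = scaled.map(l => math.exp(l - maxLogit))
    val total = exps.sum
    exps.map(_ / total)
  }

  /** Returns (tokenIndex, probability) pairs that survive top-k and then top-p,
    * renormalized so that the kept probabilities sum to 1. */
  def filterCandidates(
      logits: Seq[Double],
      temperature: Double,
      topK: Int,
      topP: Double): Seq[(Int, Double)] = {
    val ranked = softmax(logits, temperature).zipWithIndex
      .map { case (p, i) => (i, p) }
      .sortBy { case (_, p) => -p }
    // Top-k: keep only the k most likely tokens (at least one).
    val afterTopK = ranked.take(math.max(1, topK))
    // Running totals: cumulative(n) is the mass of the n most likely tokens.
    val cumulative = afterTopK.scanLeft(0.0) { case (acc, (_, p)) => acc + p }
    // Top-p: keep the smallest prefix whose mass reaches topP.
    val reached = cumulative.indexWhere(_ >= topP)
    val keepCount = if (reached < 0) afterTopK.length else math.max(1, reached)
    val kept = afterTopK.take(keepCount)
    val mass = kept.map(_._2).sum
    kept.map { case (i, p) => (i, p / mass) }
  }
}
```

A lower temperature sharpens the distribution, so fewer tokens are needed to reach the `topP` mass; a higher temperature flattens it and lets more candidates through.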

class AutoGGUFVisionModel extends AnnotatorModel[AutoGGUFVisionModel] with HasBatchedAnnotateTextImage[AutoGGUFVisionModel] with HasEngine with HasLlamaCppModelProperties with HasLlamaCppInferenceProperties with HasProtectedParams

Multimodal annotator that uses the llama.cpp library to generate text completions with large language models. It supports ingesting images for captioning.

At the moment, only CLIP-based models are supported.

For settable parameters, and their explanations, see HasLlamaCppInferenceProperties, HasLlamaCppModelProperties and refer to the llama.cpp documentation of server.cpp for more information.

If a parameter is not set, the annotator falls back to the value provided by the model.

This annotator expects a column of annotator type AnnotationImage for the image and Annotation for the caption. Note that the image bytes in the image annotation need to be raw image bytes without preprocessing. We provide the helper function ImageAssembler.loadImagesAsBytes to load the image bytes from a directory.

Pretrained models can be loaded with pretrained of the companion object:

val autoGGUFVisionModel = AutoGGUFVisionModel.pretrained()
  .setInputCols("image", "document")
  .setOutputCol("completions")

The default model is "llava_v1.5_7b_Q4_0_gguf", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the AutoGGUFVisionModelTest and the example notebook.

Note

To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.

Example

import com.johnsnowlabs.nlp.ImageAssembler
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.util.io.ResourceHelper // provides the Spark session used below
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val documentAssembler = new DocumentAssembler()
  .setInputCol("caption")
  .setOutputCol("caption_document")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val imagesPath = "src/test/resources/image/"
val data: DataFrame = ImageAssembler
  .loadImagesAsBytes(ResourceHelper.spark, imagesPath)
  .withColumn("caption", lit("Caption this image.")) // Add a caption to each image.

val nPredict = 40
val model = AutoGGUFVisionModel.pretrained()
  .setInputCols("caption_document", "image_assembler")
  .setOutputCol("completions")
  .setBatchSize(4)
  .setNGpuLayers(99)
  .setNCtx(4096)
  .setMinKeep(0)
  .setMinP(0.05f)
  .setNPredict(nPredict)
  .setNProbs(0)
  .setPenalizeNl(false)
  .setRepeatLastN(256)
  .setRepeatPenalty(1.18f)
  .setStopStrings(Array("</s>", "Llama:", "User:"))
  .setTemperature(0.05f)
  .setTfsZ(1)
  .setTypicalP(1)
  .setTopK(40)
  .setTopP(0.95f)

val pipeline = new Pipeline().setStages(Array(documentAssembler, imageAssembler, model))
pipeline
  .fit(data)
  .transform(data)
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "completions.result")
  .show(truncate = false)
+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|image_name       |result                                                                                                                                                                                        |
+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|palace.JPEG      |[ The image depicts a large, ornate room with high ceilings and beautifully decorated walls. There are several chairs placed throughout the space, some of which have cushions]               |
|egyptian_cat.jpeg|[ The image features two cats lying on a pink surface, possibly a bed or sofa. One cat is positioned towards the left side of the scene and appears to be sleeping while holding]             |
|hippopotamus.JPEG|[ A large brown hippo is swimming in a body of water, possibly an aquarium. The hippo appears to be enjoying its time in the water and seems relaxed as it floats]                            |
|hen.JPEG         |[ The image features a large chicken standing next to several baby chickens. In total, there are five birds in the scene: one adult and four young ones. They appear to be gathered together] |
|ostrich.JPEG     |[ The image features a large, long-necked bird standing in the grass. It appears to be an ostrich or similar species with its head held high and looking around. In addition to]              |
|junco.JPEG       |[ A small bird with a black head and white chest is standing on the snow. It appears to be looking at something, possibly food or another animal in its vicinity. The scene takes place out]  |
|bluetick.jpg     |[ A dog with a red collar is sitting on the floor, looking at something. The dog appears to be staring into the distance or focusing its attention on an object in front of it.]              |
|chihuahua.jpg    |[ A small brown dog wearing a sweater is sitting on the floor. The dog appears to be looking at something, possibly its owner or another animal in the room. It seems comfortable and relaxed]|
|tractor.JPEG     |[ A man is sitting in the driver's seat of a green tractor, which has yellow wheels and tires. The tractor appears to be parked on top of an empty field with]                                |
|ox.JPEG          |[ A large bull with horns is standing in a grassy field.]                                                                                                                                     |
+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

class BartTransformer extends AnnotatorModel[BartTransformer] with HasBatchedAnnotate[BartTransformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine with HasGeneratorProperties
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer
The Facebook BART (Bidirectional and Auto-Regressive Transformer) model is a state-of-the-art language generation model that was introduced by Facebook AI in 2019. It is based on the transformer architecture and is designed to handle a wide range of natural language processing tasks such as text generation, summarization, and machine translation.
BART combines a bidirectional encoder with an auto-regressive decoder. The encoder reads the whole input and captures contextual information from both past and future tokens in a sentence, while the decoder generates text from left to right. This pairing results in more accurate and natural language generation.
The model was trained on a large corpus of text data using a combination of unsupervised and supervised learning techniques. It incorporates pretraining and fine-tuning phases, where the model is first trained on a large unlabeled corpus of text, and then fine-tuned on specific downstream tasks.
BART has achieved state-of-the-art performance on a wide range of NLP tasks, including summarization, question-answering, and language translation. Its ability to handle multiple tasks and its high performance on each of these tasks make it a versatile and valuable tool for natural language processing applications.
Pretrained models can be loaded with pretrained of the companion object:
```
val bart = BartTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "distilbart_xsum_12_6", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see BartTestSpec.
References:
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- https://github.com/pytorch/fairseq
Paper Abstract:
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 3.5 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also replicate other pretraining schemes within the BART framework, to understand their effect on end-task performance.
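To make the in-filling scheme from the abstract concrete, the following is a minimal sketch in plain Scala of replacing spans of tokens with a single mask token. It only illustrates the corruption step, not BART's training code: the object name, the `<mask>` string, and the explicitly supplied spans are assumptions made for this example, and spans are assumed to be non-overlapping and within bounds.

```scala
object TextInfillingSketch {
  // Placeholder mask symbol; the real tokenizer's mask token may differ.
  val Mask = "<mask>"

  /** Replaces each (start, length) span of `tokens` with one mask token. */
  def infill(tokens: Vector[String], spans: Seq[(Int, Int)]): Vector[String] =
    // Process spans from right to left so earlier indices stay valid.
    spans.sortBy { case (start, _) => -start }.foldLeft(tokens) {
      case (acc, (start, length)) => acc.patch(start, Vector(Mask), length)
    }
}

val tokens = "the quick brown fox jumps over the lazy dog".split(" ").toVector
// Mask "quick brown" (2 tokens) and "over the lazy" (3 tokens).
val corrupted = TextInfillingSketch.infill(tokens, Seq((1, 2), (5, 3)))
// corrupted: Vector(the, <mask>, fox, jumps, <mask>, dog)
```

Because each span collapses to one mask regardless of its length, a model trained to reconstruct the original text also has to predict how many tokens are missing.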
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.BartTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val bart = BartTransformer.pretrained("distilbart_xsum_12_6")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(30)
  .setDoSample(true)
  .setTopK(50)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, bart))

val data = Seq(
  "PG&E stated it scheduled the blackouts in response to forecasts for high winds " +
  "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were " +
  "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+--------------------------------------------------------------+
|result                                                        |
+--------------------------------------------------------------+
|[Nearly 800 thousand customers were affected by the shutoffs.]|
+--------------------------------------------------------------+
```
class CPMTransformer extends AnnotatorModel[CPMTransformer] with HasBatchedAnnotate[CPMTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine
MiniCPM: Unveiling the Potential of End-side Large Language Models
MiniCPM is a series of edge-side large language models, with the base model, MiniCPM-2B, having 2.4B non-embedding parameters. It ranks closely with Mistral-7B on comprehensive benchmarks (with better performance in Chinese, mathematics, and coding abilities), surpassing models like Llama2-13B, MPT-30B, and Falcon-40B. On the MTBench benchmark, which is closest to user experience, MiniCPM-2B also outperforms many representative open-source models such as Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, and Zephyr-7B-alpha.
After DPO, MiniCPM outperforms Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, Zephyr-7B-alpha, etc. on MTBench.
MiniCPM-V, based on MiniCPM-2B, achieves the best overall performance among multimodal models of the same scale, surpassing existing multimodal large models built on Phi-2 and achieving performance comparable to or even better than 9.6B Qwen-VL-Chat on some tasks.
MiniCPM can be deployed and run inference on smartphones, with streaming output somewhat faster than human speech.
Pretrained models can be loaded with pretrained of the companion object:
```
val cpm = CPMTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "mini_cpm_2b_8bit", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see CPMTestSpec.
References:
- MiniCPM: Unveiling the Potential of End-side Large Language Models
- https://github.com/OpenBMB/MiniCPM
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.CPMTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val cpm = CPMTransformer.pretrained("mini_cpm_2b_8bit")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, cpm))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I am a student at the University of California, Los Angeles. I have a passion for writing and learning about different cultures. I enjoy playing basketball and watching movies]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
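The example above calls `setNoRepeatNgramSize(3)`. As a hedged illustration of what a no-repeat n-gram constraint usually does during decoding, the sketch below computes which next tokens would repeat an n-gram that already occurs in the generated sequence. It is a plain-Scala illustration, not Spark NLP's internal code, and `NoRepeatNgramSketch` and `bannedNextTokens` are hypothetical names.

```scala
object NoRepeatNgramSketch {
  /** Tokens that, if generated next, would repeat an n-gram already in `generated`. */
  def bannedNextTokens[T](generated: Vector[T], ngramSize: Int): Set[T] =
    if (ngramSize <= 0 || generated.length < ngramSize) Set.empty[T]
    else {
      // The last (n - 1) tokens form the prefix of the n-gram about to be completed.
      val prefix = generated.takeRight(ngramSize - 1)
      generated
        .sliding(ngramSize)
        .collect { case gram if gram.init == prefix => gram.last }
        .toSet
    }
}

val soFar = Vector("I", "am", "a", "student", "and", "I", "am")
// The trigram "I am a" already occurred, so "a" may not follow "I am" again.
val banned = NoRepeatNgramSketch.bannedNextTokens(soFar, 3)
```

A decoder applying this constraint would remove the banned tokens from the candidate list before picking the next token.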

class CoHereTransformer extends AnnotatorModel[CoHereTransformer] with HasBatchedAnnotate[CoHereTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine

Cohere: Command-R Transformer

C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.

Pretrained models can be loaded with pretrained of the companion object:

val CoHere = CoHereTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")

The default model is "c4ai_command_r_v01_int4", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see CoHereTestSpec.

References:

CoHere

Note:

This is a resource-intensive module, especially with larger models and sequences. Use of accelerators such as GPUs is strongly recommended.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.CoHereTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val CoHere = CoHereTransformer.pretrained("c4ai_command_r_v01_int4","en")
  .setInputCols(Array("documents"))
  .setMinOutputLength(15)
  .setMaxOutputLength(60)
  .setDoSample(false)
  .setTopK(40)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, CoHere))

val data = Seq(
  (
    1,
    """
    <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
    """.stripMargin)
).toDF("id", "text")

val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Hello! I'm doing well, thank you for asking! I'm excited to help you with whatever questions you have today. How can I assist you?]                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
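The input text in the example above is wrapped in Command-R's turn tokens by hand. A small helper can assemble the same single-turn prompt. This is only a sketch: the special tokens are copied from the example above, while the object `CommandRPromptSketch` and the method `singleTurn` are hypothetical names that are not part of Spark NLP.

```scala
object CommandRPromptSketch {
  // Special tokens as they appear in the example prompt above.
  val Bos = "<BOS_TOKEN>"
  val StartOfTurn = "<|START_OF_TURN_TOKEN|>"
  val EndOfTurn = "<|END_OF_TURN_TOKEN|>"
  val User = "<|USER_TOKEN|>"
  val Chatbot = "<|CHATBOT_TOKEN|>"

  /** Wraps one user message and opens the chatbot turn for the model to complete. */
  def singleTurn(userMessage: String): String =
    Bos + StartOfTurn + User + userMessage + EndOfTurn + StartOfTurn + Chatbot
}

val prompt = CommandRPromptSketch.singleTurn("Hello, how are you?")
```

The resulting string can be placed in the `text` column in place of the hand-written literal, which also avoids the leading and trailing whitespace that the triple-quoted string carries.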

class GPT2Transformer extends AnnotatorModel[GPT2Transformer] with HasBatchedAnnotate[GPT2Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEngine
GPT-2: the OpenAI Text-To-Text Transformer
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
Pretrained models can be loaded with pretrained of the companion object:
```
val gpt2 = GPT2Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "gpt2", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see GPT2TestSpec.
References:
- Language Models are Unsupervised Multitask Learners
- https://github.com/openai/gpt-2
Paper Abstract:
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val gpt2 = GPT2Transformer.pretrained("gpt2")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
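GPT-2 is trained to predict the next word given all previous words, and generation repeats that prediction one token at a time. The sketch below shows this loop in plain Scala with a toy scoring function, assuming greedy selection (always taking the highest-scoring token) and a cap on the number of new tokens. It is an illustration only: the bigram table stands in for the real model, and `GreedyDecodingSketch`, `generate` and `toyModel` are hypothetical names. How Spark NLP decodes when `setDoSample(false)` is set may differ.

```scala
object GreedyDecodingSketch {
  /** Appends the highest-scoring next token until `maxNewTokens` tokens
    * have been added, the scorer returns nothing, or `endToken` is chosen. */
  def generate(
      prompt: Vector[String],
      nextTokenScores: Vector[String] => Map[String, Double],
      maxNewTokens: Int,
      endToken: String): Vector[String] = {
    @annotation.tailrec
    def loop(tokens: Vector[String], remaining: Int): Vector[String] =
      if (remaining <= 0) tokens
      else {
        val scores = nextTokenScores(tokens)
        if (scores.isEmpty) tokens
        else {
          val (best, _) = scores.maxBy { case (_, score) => score }
          if (best == endToken) tokens else loop(tokens :+ best, remaining - 1)
        }
      }
    loop(prompt, maxNewTokens)
  }
}

// Toy stand-in for a language model: scores depend only on the last token.
val bigrams: Map[String, Map[String, Double]] = Map(
  "My" -> Map("name" -> 0.9, "dog" -> 0.1),
  "name" -> Map("is" -> 1.0),
  "is" -> Map("Leonardo" -> 0.6, "Bob" -> 0.4),
  "Leonardo" -> Map("</s>" -> 1.0)
)
val toyModel: Vector[String] => Map[String, Double] =
  tokens => tokens.lastOption.flatMap(bigrams.get).getOrElse(Map.empty)

val generated = GreedyDecodingSketch.generate(Vector("My"), toyModel, 10, "</s>")
```

In the annotator, `setMaxOutputLength` plays the role of the cap on generated length, and sampling parameters such as `setTopK` only influence the result when sampling is enabled.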
class LLAMA2Transformer extends AnnotatorModel[LLAMA2Transformer] with HasBatchedAnnotate[LLAMA2Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine
Llama 2: Open Foundation and Fine-Tuned Chat Models
The Llama 2 release introduces a family of pretrained and fine-tuned LLMs, ranging in scale from 7B to 70B parameters (7B, 13B, 70B). The pretrained models come with significant improvements over the Llama 1 models, including being trained on 40% more tokens, having a much longer context length (4k tokens 🤯), and using grouped-query attention for fast inference of the 70B model🔥!
However, the most exciting part of this release is the fine-tuned models (Llama 2-Chat), which have been optimized for dialogue applications using Reinforcement Learning from Human Feedback (RLHF). Across a wide range of helpfulness and safety benchmarks, the Llama 2-Chat models perform better than most open models and achieve comparable performance to ChatGPT according to human evaluations.
Pretrained models can be loaded with pretrained of the companion object:
```
val llama2 = LLAMA2Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "llama_2_7b_chat_hf_int4", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see LLAMA2TestSpec.
References:
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- https://github.com/facebookresearch/llama
Paper Abstract:
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.LLAMA2Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val llama2 = LLAMA2Transformer.pretrained("llama_2_7b_chat_hf_int4")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, llama2))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
class LLAMA3Transformer extends AnnotatorModel[LLAMA3Transformer] with HasBatchedAnnotate[LLAMA3Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Llama 3: Cutting-Edge Foundation and Fine-Tuned Chat Models
The Llama 3 release introduces a new family of large language models, ranging from 8B to 70B parameters. Llama 3 models are designed with a greater emphasis on efficiency, performance, and safety, achieving remarkable advancements in training and deployment processes. These models are trained on a diversified dataset that significantly enhances their capability to generate more accurate and contextually relevant outputs.
The fine-tuned variants, known as Llama 3-instruct, are specifically optimized for dialogue-based applications, making use of Reinforcement Learning from Human Feedback (RLHF) with an advanced reward model. Llama 3-instruct models demonstrate state-of-the-art performance across multiple benchmarks and surpass the capabilities of Llama 2, particularly in conversational settings.
Pretrained models can be loaded with pretrained of the companion object:
```
val llama3 = LLAMA3Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "llama_3_7b_chat_hf_int4", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see LLAMA3TestSpec.
References:
- Meta's Llama 3: Cutting-Edge Foundation and Fine-Tuned Chat Models
- https://github.com/facebookresearch/llama
Paper Abstract:
Llama 3 represents Meta’s latest innovation in the development of large language models (LLMs), offering a series of models from 1 billion to 70 billion parameters. These models have been fine-tuned for dialogue applications under the Llama 3-Chat series, ensuring they are highly responsive and context-aware. Our Llama 3 models not only excel in various benchmarks but also incorporate enhanced safety and alignment features to address ethical concerns and ensure responsible AI deployment. We invite the community to explore the capabilities of Llama 3 and contribute to ongoing research in the field of natural language processing.
Note:
This is a resource-intensive module, especially with larger models and sequences. Use of accelerators such as GPUs is strongly recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.LLAMA3Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val llama3 = LLAMA3Transformer.pretrained("llama_3_7b_chat_hf_int4")
  .setInputCols(Array("documents"))
  .setMinOutputLength(15)
  .setMaxOutputLength(60)
  .setDoSample(false)
  .setTopK(40)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, llama3))

val data = Seq(
  (
    1,
    """<|start_header_id|>system<|end_header_id|>

    You are a minion chatbot who always responds in minion speak!

    <|start_header_id|>user<|end_header_id|>

    Who are you?

    <|start_header_id|>assistant<|end_header_id|>
    """.stripMargin)
).toDF("id", "text")

val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Oooh, me am Minion! Me help you with things! Me speak Minion language, yeah! Bana-na-na!]                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
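The prompt in the example above is written by hand with the header tokens. A minimal sketch of a helper that assembles the same single-turn layout is shown below. `Llama3Prompt`, `block` and `build` are hypothetical names and not part of Spark NLP; the sketch reproduces only the tokens that appear in the example and adds no other special tokens the model's tokenizer may expect.

```scala
object Llama3Prompt {
  // Hypothetical helper: wraps one role's text in the header tokens
  // used by the example above.
  private def block(role: String, content: String): String =
    s"<|start_header_id|>${role}<|end_header_id|>\n\n${content}\n\n"

  // Builds a single-turn prompt that ends with an open assistant header,
  // so that generation continues with the assistant's reply.
  def build(system: String, user: String): String =
    block("system", system) +
      block("user", user) +
      "<|start_header_id|>assistant<|end_header_id|>\n\n"

  // Same content as the prompt in the example, without the indentation
  // that the triple-quoted literal carries into the string.
  val example: String = build(
    "You are a minion chatbot who always responds in minion speak!",
    "Who are you?")
}
```

The resulting string can be placed in the "text" column in the same way as the literal in the example.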
class M2M100Transformer extends AnnotatorModel[M2M100Transformer] with HasBatchedAnnotate[M2M100Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine
M2M100 : multilingual translation model
M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.
The model can directly translate between the 9,900 directions of 100 languages.
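The figure of 9,900 follows from counting ordered (source, target) pairs of distinct languages: 100 × 99. A small sketch of that arithmetic is shown below; `TranslationDirections` is a hypothetical name used only for illustration.

```scala
object TranslationDirections {
  // Each ordered (source, target) pair with source != target is one direction.
  def count(numLanguages: Int): Int = numLanguages * (numLanguages - 1)

  // Enumerates the directions for a small, illustrative set of language codes.
  def enumerate(codes: Seq[String]): Seq[(String, String)] =
    for (src <- codes; tgt <- codes if src != tgt) yield (src, tgt)
}
```

`TranslationDirections.count(100)` evaluates to 9900, and three codes such as "en", "fr" and "zh" yield six directions.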
Pretrained models can be loaded with pretrained of the companion object:
```
val m2m100 = M2M100Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "m2m100_418M", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see M2M100TestSpec.
References:
- Beyond English-Centric Multilingual Machine Translation
- https://github.com/pytorch/fairseq/tree/master/examples/m2m_100
Paper Abstract:
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.
Languages Covered:
Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val m2m100 = M2M100Transformer.pretrained("m2m100_418M")
  .setInputCols(Array("documents"))
  .setSrcLang("zh")
  .setTgtLang("en")
  .setMaxOutputLength(100)
  .setDoSample(false)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, m2m100))

val data = Seq(
  "生活就像一盒巧克力。"
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+-------------------------------------------------------------------------------------------+
|result                                                                                     |
+-------------------------------------------------------------------------------------------+
|[ Life is like a box of chocolate.]                                                        |
+-------------------------------------------------------------------------------------------+
```
class MarianTransformer extends AnnotatorModel[MarianTransformer] with HasBatchedAnnotate[MarianTransformer] with WriteTensorflowModel with WriteOnnxModel with WriteSentencePieceModel with HasEngine with HasProtectedParams
MarianTransformer: Fast Neural Machine Translation
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences first.
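As a rough illustration of pre-splitting, the sketch below greedily packs already-split sentences into chunks under a token budget. It is not part of Spark NLP: `InputChunker` is a hypothetical helper, and it approximates token counts with whitespace-separated words, which typically undercounts what the model's SentencePiece tokenizer produces. The budget should therefore be chosen well below 512.

```scala
import scala.collection.mutable.ArrayBuffer

object InputChunker {
  // Assumption: whitespace-separated words are a rough proxy for tokens.
  private def approxTokens(s: String): Int =
    s.trim.split("\\s+").count(_.nonEmpty)

  // Greedily packs consecutive sentences into chunks whose approximate
  // token count stays within maxTokens. A single sentence that exceeds
  // the budget is kept as its own chunk rather than cut mid-sentence.
  def pack(sentences: Seq[String], maxTokens: Int): Seq[String] = {
    val chunks = ArrayBuffer.empty[String]
    var current = ""
    for (s <- sentences) {
      if (current.isEmpty) current = s
      else if (approxTokens(current) + approxTokens(s) <= maxTokens)
        current = current + " " + s
      else {
        chunks += current
        current = s
      }
    }
    if (current.nonEmpty) chunks += current
    chunks.toSeq
  }
}
```

Inside a pipeline, the SentenceDetectorDL annotator mentioned above does the sentence splitting; a helper like this only matters when text is prepared outside of Spark NLP.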
Pretrained models can be loaded with pretrained of the companion object:
```
val marian = MarianTransformer.pretrained()
  .setInputCols("sentence")
  .setOutputCol("translation")
```
If no values are provided, the default model is "opus_mt_en_fr" and the default language is "xx" (meaning multilingual). For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the MarianTransformerTestSpec.
Sources :
MarianNMT at GitHub
Marian: Fast Neural Machine Translation in C++
Paper Abstract:
We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetectorDLModel
import com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained()
  .setInputCols("sentence")
  .setOutputCol("translation")
  .setMaxInputLength(30)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    marian
  ))

val data = Seq("What is the capital of France? We should know this in french.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(translation.result) as result").show(false)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+
```
class MistralTransformer extends AnnotatorModel[MistralTransformer] with HasBatchedAnnotate[MistralTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine
Mistral 7B
Mistral 7B is a 7.3 billion-parameter model that stands out for its efficient and effective performance in natural language processing. It surpasses Llama 2 13B across all benchmarks and outperforms Llama 1 34B in various aspects, striking a balance between English language tasks and code comprehension and rivaling the capabilities of CodeLlama 7B in the latter.
Mistral 7B introduces Grouped-query attention (GQA) for quicker inference, enhancing processing speed without compromising accuracy. This streamlined approach ensures a smoother user experience, making Mistral 7B a practical choice for real-world applications.
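The idea behind grouped-query attention is that several query heads share one key/value head, which reduces the size of the key/value cache. The sketch below shows only the head-index mapping, under the assumption that consecutive query heads form a group. `GroupedQueryAttention` is a hypothetical name, and the head counts in the usage note are illustrative rather than taken from this page.

```scala
object GroupedQueryAttention {
  // Maps a query head index to the key/value head it shares.
  // Consecutive query heads form one group per key/value head.
  def kvHeadFor(queryHead: Int, numQueryHeads: Int, numKvHeads: Int): Int = {
    require(numKvHeads > 0 && numQueryHeads % numKvHeads == 0,
      "query heads must divide evenly into key/value heads")
    require(queryHead >= 0 && queryHead < numQueryHeads,
      "query head index out of range")
    val groupSize = numQueryHeads / numKvHeads
    queryHead / groupSize
  }
}
```

With 32 query heads and 8 key/value heads, query heads 0 to 3 map to key/value head 0, heads 4 to 7 map to head 1, and so on.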
Additionally, Mistral 7B adopts Sliding Window Attention (SWA) to efficiently handle longer sequences at a reduced computational cost. This feature enhances the model's ability to process extensive textual input, expanding its utility in handling more complex tasks.
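Sliding window attention can be pictured as a causal mask in which each position sees only a fixed number of the most recent positions. The sketch below builds such a mask; `SlidingWindowMask` is a hypothetical name, the window size is an illustrative parameter, and whether the window boundary is inclusive varies between implementations.

```scala
object SlidingWindowMask {
  // Position i may attend to position j when j is not in the future and
  // lies within the last `window` positions (including i itself).
  def canAttend(i: Int, j: Int, window: Int): Boolean =
    j <= i && j > i - window

  // Builds the full seqLen x seqLen boolean mask.
  def mask(seqLen: Int, window: Int): Vector[Vector[Boolean]] =
    Vector.tabulate(seqLen, seqLen)((i, j) => canAttend(i, j, window))

  // Renders the mask with 'x' for allowed and '.' for masked positions.
  def render(seqLen: Int, window: Int): String =
    mask(seqLen, window)
      .map(row => row.map(allowed => if (allowed) 'x' else '.').mkString)
      .mkString("\n")
}
```

For a sequence of length 4 and a window of 2, `render` produces the rows `x...`, `xx..`, `.xx.` and `..xx`: each row has at most two allowed positions, so the per-token attention cost is bounded by the window rather than by the sequence length.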
In summary, Mistral 7B represents a notable advancement in language models, offering a reliable and versatile solution for various natural language processing challenges.
Pretrained models can be loaded with pretrained of the companion object:
```
val mistral = MistralTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "mistral_7b", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see MistralTestSpec.
References:
- Mistral 7B
- https://github.com/mistralai/mistral-src
Paper Abstract:
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.MistralTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val mistral = MistralTransformer.pretrained("mistral_7b")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, mistral))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 |result                                                                                                                                                                                              |
 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 |[Leonardo Da Vinci invented the microscope?\n Question: Leonardo Da Vinci invented the microscope?\n Answer: No, Leonardo Da Vinci did not invent the microscope. The first microscope was invented |
 | in the late 16th century, long after Leonardo']                                                                                                                                                    |
 +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
class NLLBTransformer extends AnnotatorModel[NLLBTransformer] with HasBatchedAnnotate[NLLBTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine
NLLB : multilingual translation model
NLLB is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.
The model can directly translate between 200+ languages.
Pretrained models can be loaded with pretrained of the companion object:
```
val nllb = NLLBTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "nllb_distilled_600M_8int", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see NLLBTestSpec.
References:
- No Language Left Behind: Scaling Human-Centered Machine Translation
- https://github.com/facebookresearch/fairseq/tree/nllb
Paper Abstract:
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at this https URL.
Languages Covered:
Acehnese (Arabic script) (ace_Arab), Acehnese (Latin script) (ace_Latn), Mesopotamian Arabic (acm_Arab), Ta’izzi-Adeni Arabic (acq_Arab), Tunisian Arabic (aeb_Arab), Afrikaans (afr_Latn), South Levantine Arabic (ajp_Arab), Akan (aka_Latn), Amharic (amh_Ethi), North Levantine Arabic (apc_Arab), Modern Standard Arabic (arb_Arab), Modern Standard Arabic (Romanized) (arb_Latn), Najdi Arabic (ars_Arab), Moroccan Arabic (ary_Arab), Egyptian Arabic (arz_Arab), Assamese (asm_Beng), Asturian (ast_Latn), Awadhi (awa_Deva), Central Aymara (ayr_Latn), South Azerbaijani (azb_Arab), North Azerbaijani (azj_Latn), Bashkir (bak_Cyrl), Bambara (bam_Latn), Balinese (ban_Latn), Belarusian (bel_Cyrl), Bemba (bem_Latn), Bengali (ben_Beng), Bhojpuri (bho_Deva), Banjar (Arabic script) (bjn_Arab), Banjar (Latin script) (bjn_Latn), Standard Tibetan (bod_Tibt), Bosnian (bos_Latn), Buginese (bug_Latn), Bulgarian (bul_Cyrl), Catalan (cat_Latn), Cebuano (ceb_Latn), Czech (ces_Latn), Chokwe (cjk_Latn), Central Kurdish (ckb_Arab), Crimean Tatar (crh_Latn), Welsh (cym_Latn), Danish (dan_Latn), German (deu_Latn), Southwestern Dinka (dik_Latn), Dyula (dyu_Latn), Dzongkha (dzo_Tibt), Greek (ell_Grek), English (eng_Latn), Esperanto (epo_Latn), Estonian (est_Latn), Basque (eus_Latn), Ewe (ewe_Latn), Faroese (fao_Latn), Fijian (fij_Latn), Finnish (fin_Latn), Fon (fon_Latn), French (fra_Latn), Friulian (fur_Latn), Nigerian Fulfulde (fuv_Latn), Scottish Gaelic (gla_Latn), Irish (gle_Latn), Galician (glg_Latn), Guarani (grn_Latn), Gujarati (guj_Gujr), Haitian Creole (hat_Latn), Hausa (hau_Latn), Hebrew (heb_Hebr), Hindi (hin_Deva), Chhattisgarhi (hne_Deva), Croatian (hrv_Latn), Hungarian (hun_Latn), Armenian (hye_Armn), Igbo (ibo_Latn), Ilocano (ilo_Latn), Indonesian (ind_Latn), Icelandic (isl_Latn), Italian (ita_Latn), Javanese (jav_Latn), Japanese (jpn_Jpan), Kabyle (kab_Latn), Jingpho (kac_Latn), Kamba (kam_Latn), Kannada (kan_Knda), Kashmiri (Arabic script) (kas_Arab), Kashmiri (Devanagari script) 
(kas_Deva), Georgian (kat_Geor), Central Kanuri (Arabic script) (knc_Arab), Central Kanuri (Latin script) (knc_Latn), Kazakh (kaz_Cyrl), Kabiyè (kbp_Latn), Kabuverdianu (kea_Latn), Khmer (khm_Khmr), Kikuyu (kik_Latn), Kinyarwanda (kin_Latn), Kyrgyz (kir_Cyrl), Kimbundu (kmb_Latn), Northern Kurdish (kmr_Latn), Kikongo (kon_Latn), Korean (kor_Hang), Lao (lao_Laoo), Ligurian (lij_Latn), Limburgish (lim_Latn), Lingala (lin_Latn), Lithuanian (lit_Latn), Lombard (lmo_Latn), Latgalian (ltg_Latn), Luxembourgish (ltz_Latn), Luba-Kasai (lua_Latn), Ganda (lug_Latn), Luo (luo_Latn), Mizo (lus_Latn), Standard Latvian (lvs_Latn), Magahi (mag_Deva), Maithili (mai_Deva), Malayalam (mal_Mlym), Marathi (mar_Deva), Minangkabau (Arabic script) (min_Arab), Minangkabau (Latin script) (min_Latn), Macedonian (mkd_Cyrl), Plateau Malagasy (plt_Latn), Maltese (mlt_Latn), Meitei (Bengali script) (mni_Beng), Halh Mongolian (khk_Cyrl), Mossi (mos_Latn), Maori (mri_Latn), Burmese (mya_Mymr), Dutch (nld_Latn), Norwegian Nynorsk (nno_Latn), Norwegian Bokmål (nob_Latn), Nepali (npi_Deva), Northern Sotho (nso_Latn), Nuer (nus_Latn), Nyanja (nya_Latn), Occitan (oci_Latn), West Central Oromo (gaz_Latn), Odia (ory_Orya), Pangasinan (pag_Latn), Eastern Panjabi (pan_Guru), Papiamento (pap_Latn), Western Persian (pes_Arab), Polish (pol_Latn), Portuguese (por_Latn), Dari (prs_Arab), Southern Pashto (pbt_Arab), Ayacucho Quechua (quy_Latn), Romanian (ron_Latn), Rundi (run_Latn), Russian (rus_Cyrl), Sango (sag_Latn), Sanskrit (san_Deva), Santali (sat_Olck), Sicilian (scn_Latn), Shan (shn_Mymr), Sinhala (sin_Sinh), Slovak (slk_Latn), Slovenian (slv_Latn), Samoan (smo_Latn), Shona (sna_Latn), Sindhi (snd_Arab), Somali (som_Latn), Southern Sotho (sot_Latn), Spanish (spa_Latn), Tosk Albanian (als_Latn), Sardinian (srd_Latn), Serbian (srp_Cyrl), Swati (ssw_Latn), Sundanese (sun_Latn), Swedish (swe_Latn), Swahili (swh_Latn), Silesian (szl_Latn), Tamil (tam_Taml), Tatar (tat_Cyrl), Telugu (tel_Telu), Tajik 
(tgk_Cyrl), Tagalog (tgl_Latn), Thai (tha_Thai), Tigrinya (tir_Ethi), Tamasheq (Latin script) (taq_Latn), Tamasheq (Tifinagh script) (taq_Tfng), Tok Pisin (tpi_Latn), Tswana (tsn_Latn), Tsonga (tso_Latn), Turkmen (tuk_Latn), Tumbuka (tum_Latn), Turkish (tur_Latn), Twi (twi_Latn), Central Atlas Tamazight (tzm_Tfng), Uyghur (uig_Arab), Ukrainian (ukr_Cyrl), Umbundu (umb_Latn), Urdu (urd_Arab), Northern Uzbek (uzn_Latn), Venetian (vec_Latn), Vietnamese (vie_Latn), Waray (war_Latn), Wolof (wol_Latn), Xhosa (xho_Latn), Eastern Yiddish (ydd_Hebr), Yoruba (yor_Latn), Yue Chinese (yue_Hant), Chinese (Simplified) (zho_Hans), Chinese (Traditional) (zho_Hant), Standard Malay (zsm_Latn), Zulu (zul_Latn).
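Each code in the list above has the shape of a three-letter language identifier, an underscore, and a four-letter script tag (for example "zho_Hans"). The sketch below checks that shape before a code is passed to setSrcLang or a target-language setter. `NllbCode` and `LangCode` are hypothetical names; the check validates the format only and does not confirm that a given model supports the code.

```scala
object NllbCode {
  // Hypothetical container for the two parts of a code such as "zho_Hans".
  final case class LangCode(language: String, script: String)

  // Three lowercase letters, an underscore, then a four-letter script tag
  // with an initial capital. Regex extractors match the whole string.
  private val CodePattern = "([a-z]{3})_([A-Z][a-z]{3})".r

  // Returns the parsed parts, or None when the shape does not match.
  def parse(code: String): Option[LangCode] = code match {
    case CodePattern(language, script) => Some(LangCode(language, script))
    case _ => None
  }
}
```

`NllbCode.parse("zho_Hans")` yields `Some(LangCode("zho", "Hans"))`, while a two-letter code such as "en" (the style used by M2M100) yields `None`.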
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.NLLBTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val nllb = NLLBTransformer.pretrained("nllb_distilled_600M_8int")
  .setInputCols(Array("documents"))
  .setSrcLang("zho_Hans")
  .setTgtLang("eng_Latn")
  .setMaxOutputLength(100)
  .setDoSample(false)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, nllb))

val data = Seq(
  "生活就像一盒巧克力。"
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+-------------------------------------------------------------------------------------------+
|result                                                                                     |
+-------------------------------------------------------------------------------------------+
|[ Life is like a box of chocolate.]                                                        |
+-------------------------------------------------------------------------------------------+
```

class OLMoTransformer extends AnnotatorModel[OLMoTransformer] with HasBatchedAnnotate[OLMoTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with HasGeneratorProperties with HasEngine

OLMo: Open Language Models

OLMo is a series of Open Language Models designed to enable the science of language models. The OLMo models are trained on the Dolma dataset.

Pretrained models can be loaded with pretrained of the companion object:

```
val OLMo = OLMoTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```

The default model is "olmo_1b_int4", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see OLMoTestSpec.

References:

Paper Abstract:

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation.

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

Example

```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.OLMoTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val OLMo = OLMoTransformer.pretrained("olmo_1b_int4")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, OLMo))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo . I am a student of the University of California, Berkeley. I am interested in the field of Artificial Intelligence and its applications in the real world. I have a strong   |
| passion for learning and am always looking for ways to improve my knowledge and skills]                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

class Phi2Transformer extends AnnotatorModel[Phi2Transformer] with HasBatchedAnnotate[Phi2Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Phi-2: Textbooks Are All You Need.
Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.
Phi-2 hasn't been fine-tuned through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more.
Pretrained models can be loaded with pretrained of the companion object:
```
val Phi2 = Phi2Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "Phi2-13b", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see Phi2TestSpec.
References:
- Phi-2: Textbooks Are All You Need.
- https://huggingface.co/microsoft/phi-2
Paper Abstract:
The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.
Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:
Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following upon our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality. Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but shows clear boost in Phi-2 benchmark scores.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.Phi2Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val Phi2 = Phi2Transformer.pretrained("phi2")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, Phi2))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo . I am a student of the University of California, Berkeley. I am interested in the field of Artificial Intelligence and its applications in the real world. I have a strong   |
| passion for learning and am always looking for ways to improve my knowledge and skills]                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
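The `setNoRepeatNgramSize(3)` setting in the example above can be read as "do not emit any 3-gram twice". The following is a minimal sketch of that idea in plain Scala, assuming the usual meaning of this option in text-generation libraries; `NoRepeatNgram` and `bannedNextTokens` are hypothetical names for illustration and are not part of the Spark NLP API, and the annotator's internal implementation may differ.

```scala
object NoRepeatNgram {

  /** Returns the tokens that would complete an n-gram already present in `generated`.
    * A decoder enforcing a no-repeat n-gram size of `n` would exclude these
    * tokens from the next generation step.
    */
  def bannedNextTokens(generated: Seq[String], n: Int): Set[String] = {
    if (n <= 0 || generated.length < n) Set.empty
    else {
      // The last n-1 tokens are the prefix of the n-gram about to be completed.
      val prefix = generated.takeRight(n - 1)
      generated
        .sliding(n)
        // Keep the final token of every earlier n-gram that starts with the same prefix.
        .collect { case gram if gram.take(n - 1) == prefix => gram.last }
        .toSet
    }
  }
}
```

For instance, after the tokens `I am a man I am a` with `n = 3`, the only banned continuation is `man`, because `am a man` has already been generated.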
class Phi3Transformer extends AnnotatorModel[Phi3Transformer] with HasBatchedAnnotate[Phi3Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with WriteSentencePieceModel with HasEngine
Phi-3
The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, named after the context length (in tokens) that each supports.
After initial training, the model underwent a post-training process that involved supervised fine-tuning and direct preference optimization to enhance its ability to follow instructions and adhere to safety measures. When evaluated against benchmarks that test common sense, language understanding, mathematics, coding, long-term context, and logical reasoning, the Phi-3 Mini-128K-Instruct demonstrated robust and state-of-the-art performance among models with fewer than 13 billion parameters.
Pretrained models can be loaded with pretrained of the companion object:
```
val phi3 = Phi3Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "phi_3_mini_128k_instruct", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see Phi3TestSpec.
References:
- https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/
- https://arxiv.org/abs/2404.14219
Paper Abstract:
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.Phi3Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val phi3 = Phi3Transformer.pretrained("phi_3_mini_128k_instruct")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, phi3))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
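The example above combines `setDoSample(false)` with `setTopK(50)`. In most generation libraries, top-k only has an effect when sampling is enabled, while disabling sampling means the most likely token is chosen at each step. The sketch below illustrates both behaviours over a toy score table; it assumes those conventional semantics, and `TopKFilter`, `topKProbabilities`, and `greedy` are hypothetical helpers rather than Spark NLP functions.

```scala
object TopKFilter {

  /** Keeps the k highest-scoring candidates and renormalizes them with a softmax. */
  def topKProbabilities(logits: Map[String, Double], k: Int): Map[String, Double] = {
    require(logits.nonEmpty, "logits must not be empty")
    require(k > 0, "k must be positive")
    // Sort by descending score and keep only the first k candidates.
    val kept = logits.toSeq.sortBy { case (_, score) => -score }.take(k)
    // Subtract the maximum score before exponentiating for numerical stability.
    val maxScore = kept.map(_._2).max
    val weights = kept.map { case (token, score) => token -> math.exp(score - maxScore) }
    val total = weights.map(_._2).sum
    weights.map { case (token, weight) => token -> weight / total }.toMap
  }

  /** Greedy choice: the single highest-scoring token. */
  def greedy(logits: Map[String, Double]): String = {
    require(logits.nonEmpty, "logits must not be empty")
    logits.maxBy(_._2)._1
  }
}
```

With sampling enabled, the next token would be drawn from the distribution returned by `topKProbabilities`; with sampling disabled, `greedy` describes the choice and the top-k value is not consulted.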
class Phi4Transformer extends AnnotatorModel[Phi4Transformer] with HasBatchedAnnotate[Phi4Transformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Phi-4: State-of-the-art open model by Microsoft Research
phi-4 is a 14B parameter, dense decoder-only Transformer model trained on 9.8T tokens, designed for advanced reasoning, code, and general NLP tasks. For more details, see: https://huggingface.co/microsoft/phi-4
Model Overview
- 14B parameters, dense decoder-only Transformer
- 16K context length
- Trained on 9.8T tokens (synthetic, public domain, academic, Q&A, code)
- Focus on high-quality, advanced reasoning, math, code, and general NLP
- Multilingual data: ~8% (primarily English)
- Released under MIT License
Intended Use
- General-purpose AI, research, and generative features
- Memory/compute constrained and latency-bound environments
- Reasoning, logic, and code generation
Benchmarks
- MMLU: 84.8 | HumanEval: 82.6 | GPQA: 56.1 | DROP: 75.5 | MATH: 80.6
- Outperforms or matches other 14B/70B models on many tasks
Safety & Limitations
- Safety alignment via SFT and DPO, red-teamed by Microsoft AIRT
- Not intended for high-risk or consequential domains without further assessment
- Primarily English; other languages may have reduced performance
- May generate inaccurate, offensive, or biased content; use with care
Usage
Pretrained models can be loaded with pretrained of the companion object:
```
val phi4 = Phi4Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "phi-4", if no name is provided. For available pretrained models please see the Models Hub.
Note: This is a resource-intensive module, especially with larger models and sequences. Use of accelerators such as GPUs is strongly recommended.
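To give a rough sense of scale for the note above, the memory needed just to hold the model weights can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch only: it ignores activations, the attention cache, and runtime overhead, and the numeric precision of any particular pretrained export is an assumption here, not something stated in this documentation. `WeightMemory` is a hypothetical helper.

```scala
object WeightMemory {
  private val BytesPerGiB: Double = 1024.0 * 1024.0 * 1024.0

  /** Approximate size of the weights alone, in GiB. */
  def weightGiB(parameters: Long, bytesPerParameter: Double): Double =
    parameters * bytesPerParameter / BytesPerGiB

  def main(args: Array[String]): Unit = {
    val phi4Parameters = 14000000000L // 14B parameters, as listed in the model overview
    // 2 bytes per parameter corresponds to 16-bit weights (an assumed precision).
    println(f"16-bit weights: ${weightGiB(phi4Parameters, 2.0)}%.1f GiB")
    // 0.5 bytes per parameter corresponds to 4-bit quantization (also an assumption).
    println(f"4-bit weights:  ${weightGiB(phi4Parameters, 0.5)}%.1f GiB")
  }
}
```

Under these assumptions the weights alone take about 26 GiB at 16 bits and about 6.5 GiB at 4 bits, which is why an accelerator with sufficient memory is recommended.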
References:
- https://huggingface.co/microsoft/phi-4
- arXiv:2412.08905
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val phi4 = Phi4Transformer.pretrained("phi-4")
  .setInputCols(Array("documents"))
  .setMaxOutputLength(60)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, phi4))

val data = Seq(
  (1, "<|im_start|>system<|im_sep|>\nYou are a helpful assistant!\n<|im_start|>user<|im_sep|>\nWhat is Phi-4?\n<|im_start|>assistant<|im_sep|>\n")
).toDF("id", "text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
```
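The input text in the example above follows a chat layout with a system turn, a user turn, and an open assistant turn. Writing that string by hand is error-prone, so a small helper can assemble it. The sketch below copies the special tokens exactly from the example prompt; `Phi4Prompt` and `chatPrompt` are hypothetical names and are not provided by Spark NLP.

```scala
object Phi4Prompt {

  /** Builds a single-turn chat prompt in the layout used by the example above.
    * The trailing assistant header is left open so the model continues from it.
    */
  def chatPrompt(system: String, user: String): String =
    s"<|im_start|>system<|im_sep|>\n$system\n" +
      s"<|im_start|>user<|im_sep|>\n$user\n" +
      "<|im_start|>assistant<|im_sep|>\n"
}
```

Calling `Phi4Prompt.chatPrompt("You are a helpful assistant!", "What is Phi-4?")` reproduces the string used in the example, and the result can be placed in the `text` column in the same way.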
class QwenTransformer extends AnnotatorModel[QwenTransformer] with HasBatchedAnnotate[QwenTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
Qwen: comprehensive language model series
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. In comparison with the previously released Qwen, the improvements include:
- 6 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, and 72B
- Significant performance improvement in chat models
- Multilingual support of both base and chat models
- Stable support of 32K context length for models of all sizes
Qwen1.5 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes. For the beta version, temporarily we did not include GQA and the mixture of SWA and full attention.
Pretrained models can be loaded with pretrained of the companion object:
```
val Qwen = QwenTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "qwen_7.5b_chat", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see QwenTestSpec.
References:
Paper Abstract:
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.
Note:
This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.QwenTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val Qwen = QwenTransformer.pretrained("qwen_7.5b_chat")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, Qwen))

val data = Seq(
  "My name is Leonardo."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ My name is Leonardo . I am a student of the University of California, Berkeley. I am interested in the field of Artificial Intelligence and its applications in the real world. I have a strong   |
| passion for learning and am always looking for ways to improve my knowledge and skills]                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
trait ReadAutoGGUFModel extends AnyRef
trait ReadAutoGGUFVisionModel extends AnyRef
trait ReadBartTransformerDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadCPMTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
trait ReadCoHereTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
trait ReadGPT2TransformerDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadLLAMA2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
trait ReadLLAMA3TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
trait ReadM2M100TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
trait ReadMarianMTDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel
trait ReadMistralTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
trait ReadNLLBTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
trait ReadOLMoTransformerDLModel extends ReadOnnxModel
trait ReadPhi2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
trait ReadPhi3TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel with ReadSentencePieceModel
trait ReadPhi4TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
trait ReadQwenTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
trait ReadStarCoderTransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel
trait ReadT5TransformerDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadablePretrainedAutoGGUFModel extends ParamsAndFeaturesReadable[AutoGGUFModel] with HasPretrained[AutoGGUFModel]
trait ReadablePretrainedAutoGGUFVisionModel extends ParamsAndFeaturesReadable[AutoGGUFVisionModel] with HasPretrained[AutoGGUFVisionModel]
trait ReadablePretrainedBartTransformerModel extends ParamsAndFeaturesReadable[BartTransformer] with HasPretrained[BartTransformer]
trait ReadablePretrainedCPMTransformerModel extends ParamsAndFeaturesReadable[CPMTransformer] with HasPretrained[CPMTransformer]
trait ReadablePretrainedCoHereTransformerModel extends ParamsAndFeaturesReadable[CoHereTransformer] with HasPretrained[CoHereTransformer]
trait ReadablePretrainedGPT2TransformerModel extends ParamsAndFeaturesReadable[GPT2Transformer] with HasPretrained[GPT2Transformer]
trait ReadablePretrainedLLAMA2TransformerModel extends ParamsAndFeaturesReadable[LLAMA2Transformer] with HasPretrained[LLAMA2Transformer]
trait ReadablePretrainedLLAMA3TransformerModel extends ParamsAndFeaturesReadable[LLAMA3Transformer] with HasPretrained[LLAMA3Transformer]
trait ReadablePretrainedM2M100TransformerModel extends ParamsAndFeaturesReadable[M2M100Transformer] with HasPretrained[M2M100Transformer]
trait ReadablePretrainedMarianMTModel extends ParamsAndFeaturesReadable[MarianTransformer] with HasPretrained[MarianTransformer]
trait ReadablePretrainedMistralTransformerModel extends ParamsAndFeaturesReadable[MistralTransformer] with HasPretrained[MistralTransformer]
trait ReadablePretrainedNLLBTransformerModel extends ParamsAndFeaturesReadable[NLLBTransformer] with HasPretrained[NLLBTransformer]
trait ReadablePretrainedOLMoTransformerModel extends ParamsAndFeaturesReadable[OLMoTransformer] with HasPretrained[OLMoTransformer]
trait ReadablePretrainedPhi2TransformerModel extends ParamsAndFeaturesReadable[Phi2Transformer] with HasPretrained[Phi2Transformer]
trait ReadablePretrainedPhi3TransformerModel extends ParamsAndFeaturesReadable[Phi3Transformer] with HasPretrained[Phi3Transformer]
trait ReadablePretrainedPhi4TransformerModel extends ParamsAndFeaturesReadable[Phi4Transformer] with HasPretrained[Phi4Transformer]
trait ReadablePretrainedQwenTransformerModel extends ParamsAndFeaturesReadable[QwenTransformer] with HasPretrained[QwenTransformer]
trait ReadablePretrainedStarCoderTransformerModel extends ParamsAndFeaturesReadable[StarCoderTransformer] with HasPretrained[StarCoderTransformer]
trait ReadablePretrainedT5TransformerModel extends ParamsAndFeaturesReadable[T5Transformer] with HasPretrained[T5Transformer]
class StarCoderTransformer extends AnnotatorModel[StarCoderTransformer] with HasBatchedAnnotate[StarCoderTransformer] with ParamsAndFeaturesWritable with WriteOnnxModel with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
StarCoder2: The Versatile Code Companion.
StarCoder2 is a Transformer model designed specifically for code generation and understanding. Released in 3B, 7B, and 15B parameter sizes, it builds upon the advancements of its predecessors and is trained on a diverse dataset that includes multiple programming languages. This extensive training allows StarCoder2 to support a wide array of coding tasks, from code completion to generation.
StarCoder2 was developed to assist developers in writing and understanding code more efficiently, making it a valuable tool for various software development and data science tasks.
Pretrained models can be loaded with pretrained of the companion object:
```
val starcoder2 = StarCoderTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
The default model is "StarCoder2-3B", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see StarCoder2TestSpec.
References:
- StarCoder2: The Versatile Code Companion
- https://github.com/bigcode-project/starcoder
Paper Abstract:
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4× larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks.
We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
Note:
This is a computationally intensive module, especially for longer code sequences. The use of an accelerator such as a GPU is recommended.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.StarCoderTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val starcoder2 = StarCoderTransformer.pretrained("starcoder2")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, starcoder2))

val data = Seq(
  "def add(a, b):"
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("generation.result").show(truncate = false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[def add(a, b): return a + b]                                                                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

class T5Transformer extends AnnotatorModel[T5Transformer] with HasBatchedAnnotate[T5Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasCaseSensitiveProperties with WriteSentencePieceModel with HasProtectedParams with HasEngine

T5: the Text-To-Text Transfer Transformer

T5 reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework is able to use the same model, loss function, and hyper-parameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). T5 can even apply to regression tasks by training it to predict the string representation of a number instead of the number itself.
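
In the text-to-text framing, the task itself is expressed as a short text prefix on the input, which is what a value such as `"summarize:"` passed to `setTask` represents. The sketch below shows the idea in plain Scala; `T5Task` and `withTask` are hypothetical names, and the exact way the annotator joins the prefix and the document internally is an assumption.

```scala
object T5Task {

  /** Prepends a task prefix such as "summarize:" to the input text.
    * An empty task leaves the text unchanged.
    */
  def withTask(task: String, text: String): String = {
    val prefix = task.trim
    if (prefix.isEmpty) text else s"$prefix $text"
  }
}
```

Because the task is part of the input string, the same model can be pointed at a different task by changing only the prefix.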

Pretrained models can be loaded with pretrained of the companion object:

val t5 = T5Transformer.pretrained()
  .setTask("summarize:")
  .setInputCols("document")
  .setOutputCol("summaries")

The default model is "t5_small", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the T5TestSpec.

References:

Paper Abstract:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val t5 = T5Transformer.pretrained("t5_small")
  .setTask("summarize:")
  .setInputCols(Array("documents"))
  .setMaxOutputLength(200)
  .setOutputCol("summaries")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq(
  "Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
    "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
    " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
    "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
    "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
    "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
    "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
    "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
    "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
    "learning for NLP, we release our data set, pre-trained models, and code."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("summaries.result").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Value Members

object AutoGGUFModel extends ReadablePretrainedAutoGGUFModel with ReadAutoGGUFModel with Serializable
This is the companion object of AutoGGUFModel. Please refer to that class for the documentation.
object AutoGGUFVisionModel extends ReadablePretrainedAutoGGUFVisionModel with ReadAutoGGUFVisionModel with Serializable
This is the companion object of AutoGGUFVisionModel. Please refer to that class for the documentation.
object BartTransformer extends ReadablePretrainedBartTransformerModel with ReadBartTransformerDLModel with Serializable
object CPMTransformer extends ReadablePretrainedCPMTransformerModel with ReadCPMTransformerDLModel with Serializable
object CoHereTransformer extends ReadablePretrainedCoHereTransformerModel with ReadCoHereTransformerDLModel with Serializable
object GPT2Transformer extends ReadablePretrainedGPT2TransformerModel with ReadGPT2TransformerDLModel with Serializable
object LLAMA2Transformer extends ReadablePretrainedLLAMA2TransformerModel with ReadLLAMA2TransformerDLModel with Serializable
object LLAMA3Transformer extends ReadablePretrainedLLAMA3TransformerModel with ReadLLAMA3TransformerDLModel with Serializable
object M2M100Transformer extends ReadablePretrainedM2M100TransformerModel with ReadM2M100TransformerDLModel with Serializable
object MarianTransformer extends ReadablePretrainedMarianMTModel with ReadMarianMTDLModel with ReadSentencePieceModel with Serializable
This is the companion object of MarianTransformer. Please refer to that class for the documentation.
object MistralTransformer extends ReadablePretrainedMistralTransformerModel with ReadMistralTransformerDLModel with Serializable
object NLLBTransformer extends ReadablePretrainedNLLBTransformerModel with ReadNLLBTransformerDLModel with Serializable
object OLMoTransformer extends ReadablePretrainedOLMoTransformerModel with ReadOLMoTransformerDLModel with Serializable
object Phi2Transformer extends ReadablePretrainedPhi2TransformerModel with ReadPhi2TransformerDLModel with Serializable
object Phi3Transformer extends ReadablePretrainedPhi3TransformerModel with ReadPhi3TransformerDLModel with Serializable
object Phi4Transformer extends ReadablePretrainedPhi4TransformerModel with ReadPhi4TransformerDLModel with Serializable
object QwenTransformer extends ReadablePretrainedQwenTransformerModel with ReadQwenTransformerDLModel with Serializable
object StarCoderTransformer extends ReadablePretrainedStarCoderTransformerModel with ReadStarCoderTransformerDLModel with Serializable
object T5Transformer extends ReadablePretrainedT5TransformerModel with ReadT5TransformerDLModel with ReadSentencePieceModel with Serializable
This is the companion object of T5Transformer. Please refer to that class for the documentation.
