embeddings

package embeddings

Ordering

Alphabetic

Visibility

Public
All

Type Members

class AlbertEmbeddings extends AnnotatorModel[AlbertEmbeddings] with HasBatchedAnnotate[AlbertEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS - Google Research, Toyota Technological Institute at Chicago
ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS - Google Research, Toyota Technological Institute at Chicago
These word embeddings represent the outputs generated by the Albert model. All official Albert releases by google in TF-HUB are supported with this Albert Wrapper:
Ported TF-Hub Models:
"albert_base_uncased" | albert_base | 768-embed-dim, 12-layer, 12-heads, 12M parameters
"albert_large_uncased" | albert_large | 1024-embed-dim, 24-layer, 16-heads, 18M parameters
"albert_xlarge_uncased" | albert_xlarge | 2048-embed-dim, 24-layer, 32-heads, 60M parameters
"albert_xxlarge_uncased" | albert_xxlarge | 4096-embed-dim, 12-layer, 64-heads, 235M parameters
This model requires input tokenization with SentencePiece model, which is provided by Spark-NLP (See tokenizers package).
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = AlbertEmbeddings.pretrained()
 .setInputCols("sentence", "token")
 .setOutputCol("embeddings")
```
The default model is "albert_base_uncased", if no name is provided.
For extended examples of usage, see the Examples and the AlbertEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
References:
ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS
https://github.com/google-research/ALBERT
https://tfhub.dev/s?q=albert
Paper abstract:
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
Tips: ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same number of (repeating) layers.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.AlbertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = AlbertEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[1.1342473030090332,-1.3855540752410889,0.9818322062492371,-0.784737348556518...|
|[0.847029983997345,-1.047153353691101,-0.1520637571811676,-0.6245765686035156...|
|[-0.009860038757324219,-0.13450059294700623,2.707749128341675,1.2916892766952...|
|[-0.04192575812339783,-0.5764210224151611,-0.3196685314178467,-0.527840495109...|
|[0.15583214163780212,-0.1614152491092682,-0.28423872590065,-0.135491415858268...|
+--------------------------------------------------------------------------------+
```
See also
AlbertForTokenClassification for AlbertEmbeddings with a token classification layer on top
Annotators Main Page for a list of transformer based embeddings

class AutoGGUFEmbeddings extends AnnotatorModel[AutoGGUFEmbeddings] with HasBatchedAnnotate[AutoGGUFEmbeddings] with HasEngine with HasLlamaCppModelProperties with HasProtectedParams

Annotator that uses the llama.cpp library to generate text embeddings with large language models.

The type of embedding pooling can be set with the setPoolingType method. The default is "MEAN". The available options are "MEAN", "CLS", and "LAST".

For all settable parameters, and their explanations, see HasLlamaCppModelProperties.

Pretrained models can be loaded with pretrained of the companion object:

val autoGGUFModel = AutoGGUFEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")

The default model is "Qwen3_Embedding_0.6B_Q8_0_gguf", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the AutoGGUFEmbeddingsTest and the example notebook.

Note

To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.

Example

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val autoGGUFModel = AutoGGUFEmbeddings
  .pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")
  .setBatchSize(4)
  .setPoolingType("MEAN")

val pipeline = new Pipeline().setStages(Array(document, autoGGUFModel))

val data = Seq(
  "The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones.")
  .toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(truncate = false)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+

class BGEEmbeddings extends AnnotatorModel[BGEEmbeddings] with HasBatchedAnnotate[BGEEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasClsTokenProperties with HasEngine
Sentence embeddings using BGE.
Sentence embeddings using BGE.
BGE, or BAAI General Embeddings, a model that can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
Note that this annotator is only supported for Spark Versions 3.4 and up.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = BGEEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")
```
The default model is "bge_small_en_v1.5", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see BGEEmbeddingsTestSpec.
Sources :
C-Pack: Packaged Resources To Advance General Chinese Embedding
BGE Github Repository
Paper abstract
We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve stateof-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BGEEmbeddings.pretrained("bge_small_en_v1.5", "en")
  .setInputCols("document")
  .setOutputCol("bge_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("bge_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day." +
But, as you can see from this chart, you'll need to increase that if you're expecting or training for a" +
marathon. Check out the chart below to see how much protein you should be eating each day."

).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings
class BertEmbeddings extends AnnotatorModel[BertEmbeddings] with HasBatchedAnnotate[BertEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Token-level embeddings using BERT.
Token-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = BertEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("bert_embeddings")
```
The default model is "small_bert_L2_768", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the BertEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
Sources :
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
https://github.com/google-research/bert
Paper abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en")
  .setInputCols("token", "document")
  .setOutputCol("bert_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("bert_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-2.3497989177703857,0.480538547039032,-0.3238905668258667,-1.612930893898010...|
|[-2.1357314586639404,0.32984697818756104,-0.6032363176345825,-1.6791689395904...|
|[-1.8244884014129639,-0.27088963985443115,-1.059438943862915,-0.9817547798156...|
|[-1.1648050546646118,-0.4725411534309387,-0.5938255786895752,-1.5780693292617...|
|[-0.9125322699546814,0.4563939869403839,-0.3975459933280945,-1.81611204147338...|
+--------------------------------------------------------------------------------+
```
See also
BertSentenceEmbeddings for sentence-level embeddings
BertForTokenClassification For BertEmbeddings with a token classification layer on top
Annotators Main Page for a list of transformer based embeddings
class BertSentenceEmbeddings extends AnnotatorModel[BertSentenceEmbeddings] with HasBatchedAnnotate[BertSentenceEmbeddings] with WriteTensorflowModel with WriteOpenvinoModel with WriteOnnxModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine with HasProtectedParams
Sentence-level embeddings using BERT.
Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = BertSentenceEmbeddings.pretrained()
  .setInputCols("sentence")
  .setOutputCol("sentence_bert_embeddings")
```
The default model is "sent_small_bert_L2_768", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the BertSentenceEmbeddingsTestSpec.
Sources :
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
https://github.com/google-research/bert
Paper abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128")
  .setInputCols("sentence")
  .setOutputCol("sentence_bert_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_bert_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  embeddings,
  embeddingsFinisher
))

val data = Seq("John loves apples. Mary loves oranges. John loves Mary.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.8951074481010437,0.13753940165042877,0.3108254075050354,-1.65693199634552...|
|[-0.6180210709571838,-0.12179657071828842,-0.191165953874588,-1.4497021436691...|
|[-0.822715163230896,0.7568016648292542,-0.1165061742067337,-1.59048593044281,...|
+--------------------------------------------------------------------------------+
```
See also
BertSentenceEmbeddings for sentence-level embeddings
BertForSequenceClassification for embeddings with a sequence classification layer on top
Annotators Main Page for a list of transformer based embeddings
class CamemBertEmbeddings extends AnnotatorModel[CamemBertEmbeddings] with HasBatchedAnnotate[CamemBertEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
The CamemBERT model was proposed in CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot.
The CamemBERT model was proposed in CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook’s RoBERTa model released in 2019. It is a model trained on 138GB of French text.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = CamemBertEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("camembert_embeddings")
```
The default model is "camembert_base", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the CamemBertEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
Sources :
CamemBERT: a Tasty French Language Model
https://huggingface.co/camembert
Paper abstract
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models --in all languages except English-- very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.CamemBertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = CamemBertEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("camembert_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("camembert_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("C'est une phrase.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.08442357927560806,-0.12863239645957947,-0.03835778683423996,0.200479581952...|
|[0.048462312668561935,0.12637358903884888,-0.27429091930389404,-0.07516729831...|
|[0.02690504491329193,0.12104076147079468,0.012526623904705048,-0.031543646007...|
|[0.05877285450696945,-0.08773420006036758,-0.06381352990865707,0.122621834278...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings

class ChunkEmbeddings extends AnnotatorModel[ChunkEmbeddings] with HasSimpleAnnotate[ChunkEmbeddings]

This annotator utilizes WordEmbeddings, BertEmbeddings etc.

This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.

For extended examples of usage, see the Examples and the ChunkEmbeddingsTestSpec.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.{NGramGenerator, Tokenizer}
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings
import org.apache.spark.ml.Pipeline

// Extract the Embeddings from the NGrams
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val nGrams = new NGramGenerator()
  .setInputCols("token")
  .setOutputCol("chunk")
  .setN(2)

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

// Convert the NGram chunks into Word Embeddings
val chunkEmbeddings = new ChunkEmbeddings()
  .setInputCols("chunk", "embeddings")
  .setOutputCol("chunk_embeddings")
  .setPoolingStrategy("AVERAGE")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    nGrams,
    embeddings,
    chunkEmbeddings
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk_embeddings) as result")
  .select("result.annotatorType", "result.result", "result.embeddings")
  .show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
|  annotatorType|    result|                                                                      embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+

class DeBertaEmbeddings extends AnnotatorModel[DeBertaEmbeddings] with HasBatchedAnnotate[DeBertaEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with WriteSentencePieceModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.
The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.
This model requires input tokenization with SentencePiece model, which is provided by Spark NLP (See tokenizers package).
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = DeBertaEmbeddings.pretrained()
 .setInputCols("sentence", "token")
 .setOutputCol("embeddings")
```
The default model is "deberta_v3_base", if no name is provided.
For extended examples see DeBertaEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in RoBERTa.
References:
https://github.com/microsoft/DeBERTa
https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/
Paper abstract:
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.DeBertaEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = DeBertaEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[1.1342473030090332,-1.3855540752410889,0.9818322062492371,-0.784737348556518...|
|[0.847029983997345,-1.047153353691101,-0.1520637571811676,-0.6245765686035156...|
|[-0.009860038757324219,-0.13450059294700623,2.707749128341675,1.2916892766952...|
|[-0.04192575812339783,-0.5764210224151611,-0.3196685314178467,-0.527840495109...|
|[0.15583214163780212,-0.1614152491092682,-0.28423872590065,-0.135491415858268...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings
class DistilBertEmbeddings extends AnnotatorModel[DistilBertEmbeddings] with HasBatchedAnnotate[DistilBertEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base.
DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than bert-base-uncased, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = DistilBertEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
```
The default model is "distilbert_base_cased", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the DistilBertEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
The DistilBERT model was proposed in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
Paper Abstract:
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
Tips:
- DistilBERT doesn't have :obj:token_type_ids, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token :obj:tokenizer.sep_token (or :obj:[SEP]).
- DistilBERT doesn't have options to select the input positions (:obj:position_ids input). This could be added if necessary though, just let us know if you need this option.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.DistilBertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = DistilBertEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.1127224713563919,-0.1982710212469101,0.5360898375511169,-0.272536993026733...|
|[0.35534414649009705,0.13215228915214539,0.40981462597846985,0.14036104083061...|
|[0.328085333108902,-0.06269335001707077,-0.017595693469047546,-0.024373905733...|
|[0.15617232024669647,0.2967822253704071,0.22324979305267334,-0.04568954557180...|
|[0.45411425828933716,0.01173491682857275,0.190129816532135,0.1178255230188369...|
+--------------------------------------------------------------------------------+
```
See also
DistilBertForTokenClassification for DistilBertEmbeddings with a token classification layer on top
DistilBertForSequenceClassification for DistilBertEmbeddings with a sequence classification layer on top
Annotators Main Page for a list of transformer based embeddings
class Doc2VecApproach extends AnnotatorApproach[Doc2VecModel] with HasStorageRef with HasEnableCachingProperties with HasProtectedParams
Trains a Word2Vec model that creates vector representations of words in a text corpus.
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We use Word2Vec implemented in Spark ML. It uses skip-gram model in our implementation and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Doc2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new Doc2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings
  ))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)
```

class Doc2VecModel extends AnnotatorModel[Doc2VecModel] with HasSimpleAnnotate[Doc2VecModel] with HasStorageRef with HasEmbeddingsProperties with ParamsAndFeaturesWritable

Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

We use Word2Vec implemented in Spark ML. It uses skip-gram model in our implementation and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.

This is the instantiated model of the Doc2VecApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = Doc2VecModel.pretrained()
  .setInputCols("token")
  .setOutputCol("embeddings")

The default model is "doc2vec_gigaword_300", if no name is provided.

For available pretrained models please see the Models Hub.

Sources :

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecModel}
import com.johnsnowlabs.nlp.EmbeddingsFinisher

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = Doc2VecModel.pretrained()
  .setInputCols("token")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.06222493574023247,0.011579325422644615,0.009919632226228714,0.109361454844...|
+--------------------------------------------------------------------------------+

class E5Embeddings extends AnnotatorModel[E5Embeddings] with HasBatchedAnnotate[E5Embeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine

Sentence embeddings using E5.

E5, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.)

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = E5Embeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("e5_embeddings")

The default model is "e5_small", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see E5EmbeddingsTestSpec.

Sources :

Text Embeddings by Weakly-Supervised Contrastive Pre-training

E5 Github Repository

Paper abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40× more parameters.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.E5Embeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = E5Embeddings.pretrained("e5_small", "en")
  .setInputCols("document")
  .setOutputCol("e5_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("e5_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day." +
But, as you can see from this chart, you'll need to increase that if you're expecting or training for a" +
marathon. Check out the chart below to see how much protein you should be eating each day."

).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+

See also: Annotators Main Page for a list of transformer based embeddings

class E5VEmbeddings extends AnnotatorModel[E5VEmbeddings] with HasBatchedAnnotateImage[E5VEmbeddings] with HasImageFeatureProperties with WriteOpenvinoModel with HasGeneratorProperties with HasEngine
E5VEmbeddings provides universal multimodal embeddings using the E5-V model, which is fine-tuned from lmms-lab/llama3-llava-next-8b.
E5VEmbeddings provides universal multimodal embeddings using the E5-V model, which is fine-tuned from lmms-lab/llama3-llava-next-8b.
E5-V bridges the modality gap between different input types (text, image) and demonstrates strong performance in multimodal embeddings, even without fine-tuning. It also supports a single-modality training approach, where the model is trained exclusively on text pairs, often yielding better performance than multimodal training.
For more details, see the Hugging Face model card: https://huggingface.co/royokong/e5-v
Overview
E5-V can embed both text and images into a shared space, enabling cross-modal retrieval and similarity tasks. The model is designed for universal embeddings and is suitable for scenarios where you want to compare or retrieve across modalities.
Example
Image + Text Embedding
{{ { import org.apache.spark.sql.functions.lit import com.johnsnowlabs.nlp.base.ImageAssembler import com.johnsnowlabs.nlp.embeddings.E5VEmbeddings import org.apache.spark.ml.Pipeline
val imageDF = spark.read.format("image").option("dropInvalid", value = true).load(imageFolder) val imagePrompt = "<|start_header_id|>user<|end_header_id|>\n\n<image>\\nSummary above image in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n" val testDF = imageDF.withColumn("text", lit(imagePrompt))
val imageAssembler = new ImageAssembler().setInputCol("image").setOutputCol("image_assembler") val e5vEmbeddings = E5VEmbeddings.pretrained() .setInputCols("image_assembler") .setOutputCol("e5v")
val pipeline = new Pipeline().setStages(Array(imageAssembler, e5vEmbeddings)) val result = pipeline.fit(testDF).transform(testDF) result.select("e5v.embeddings").show(truncate = false) }}
Text-Only Embedding
{{ { import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions.lit import com.johnsnowlabs.nlp.util.EmbeddingsDataFrameUtils.{emptyImageRow, imageSchema} import com.johnsnowlabs.nlp.embeddings.E5VEmbeddings
val spark: SparkSession = ... val textPrompt = "<|start_header_id|>user<|end_header_id|>\n\n<sent>\\nSummary above sentence in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n" val textDesc = "A cat sitting in a box." val nullImageDF = spark.createDataFrame(spark.sparkContext.parallelize(Seq(emptyImageRow)), imageSchema) val textDF = nullImageDF.withColumn("text", lit(textPrompt.replace("<sent>", textDesc)))
val e5vEmbeddings = E5VEmbeddings.pretrained() .setInputCols("image") .setOutputCol("e5v") val result = e5vEmbeddings.transform(textDF) result.select("e5v.embeddings").show(truncate = false) }}
References
- Hugging Face model card: https://huggingface.co/royokong/e5-v
- Paper: https://arxiv.org/abs/2407.12580
- Code: https://github.com/kongds/E5-V
See also
CLIPForZeroShotClassification for Zero Shot Image Classifier
Annotators Main Page for a list of transformer based classifiers
class ElmoEmbeddings extends AnnotatorModel[ElmoEmbeddings] with HasSimpleAnnotate[ElmoEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.
Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.
Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = ElmoEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("elmo_embeddings")
```
The default model is "elmo", if no name is provided.
For available pretrained models please see the Models Hub.
The pooling layer can be set with setPoolingLayer to the following values:
- "word_emb": the character-based word representations with shape [batch_size, max_length, 512].
- "lstm_outputs1": the first LSTM hidden state with shape [batch_size, max_length, 1024].
- "lstm_outputs2": the second LSTM hidden state with shape [batch_size, max_length, 1024].
- "elmo": the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].
For extended examples of usage, see the Examples and the ElmoEmbeddingsTestSpec.
References:
https://tfhub.dev/google/elmo/3
Deep contextualized word representations
Paper abstract:
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.ElmoEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = ElmoEmbeddings.pretrained()
  .setPoolingLayer("word_emb")
  .setInputCols("token", "document")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[6.662458181381226E-4,-0.2541114091873169,-0.6275503039360046,0.5787073969841...|
|[0.19154725968837738,0.22998669743537903,-0.2894386649131775,0.21524395048618...|
|[0.10400570929050446,0.12288510054349899,-0.07056470215320587,-0.246389418840...|
|[0.49932169914245605,-0.12706467509269714,0.30969417095184326,0.2643227577209...|
|[-0.8871506452560425,-0.20039963722229004,-1.0601330995559692,0.0348707810044...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of other transformer based embeddings
trait EmbeddingsCoverage extends AnyRef
trait HasEmbeddingsProperties extends Params with HasProtectedParams
class InstructorEmbeddings extends AnnotatorModel[InstructorEmbeddings] with HasBatchedAnnotate[InstructorEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine
Sentence embeddings using INSTRUCTOR.
Sentence embeddings using INSTRUCTOR.
Instructor👨‍🏫, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domains (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor👨‍ achieves sota on 70 diverse embedding tasks!
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = InstructorEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("instructor_embeddings")
```
The default model is "instructor_base", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see InstructorEmbeddingsTestSpec.
Sources :
One Embedder, Any Task: Instruction-Finetuned Text Embeddings
INSTRUCTOR Github Repository
Paper abstract
We introduce INSTRUCTOR, a new method for computing text embeddings given task instructions: every text input is embedded together with instructions explaining the use case (e.g., task and domain descriptions). Unlike encoders from prior work that are more specialized, INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains, without any further training. We first annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss. We evaluate INSTRUCTOR on 70 embedding evaluation tasks (66 of which are unseen during training), ranging from classification and information retrieval to semantic textual similarity and text generation evaluation. INSTRUCTOR, while having an order of magnitude fewer parameters than the previous best model, achieves state-of-the-art performance, with an average improvement of 3.4% compared to the previous best results on the 70 diverse datasets. Our analysis suggests that INSTRUCTOR is robust to changes in instructions, and that instruction finetuning mitigates the challenge of training a single model on diverse datasets. Our model, code, and data are available at this https URL. https://instructor-embedding.github.io/
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.InstructorEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = InstructorEmbeddings.pretrained("instructor_base", "en")
  .setInputCols("document")
  .setInstruction("Represent the Medicine sentence for clustering: ")
  .setOutputCol("instructor_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("instructor_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-2.3497989177703857,0.480538547039032,-0.3238905668258667,-1.612930893898010...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings
class LongformerEmbeddings extends AnnotatorModel[LongformerEmbeddings] with HasBatchedAnnotate[LongformerEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Longformer is a transformer model for long documents.
Longformer is a transformer model for long documents. The Longformer model was presented in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan. longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = LongformerEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
```
The default model is "longformer_base_4096", if no name is provided. For available pretrained models please see the Models Hub.
For some examples of usage, see LongformerEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
Paper Abstract:
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
The original code can be found here https://github.com/allenai/longformer.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = LongformerEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
|[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
|[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
|[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
|[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
+--------------------------------------------------------------------------------+
```
See also
LongformerForTokenClassification for Longformer embeddings with a token classification layer on top
Annotators Main Page for a list of transformer based embeddings
class MPNetEmbeddings extends AnnotatorModel[MPNetEmbeddings] with HasBatchedAnnotate[MPNetEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Sentence embeddings using MPNet.
Sentence embeddings using MPNet.
The MPNet model was proposed in MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. MPNet adopts a novel pre-training method, named masked and permuted language modeling, to inherit the advantages of masked language modeling and permuted language modeling for natural language understanding.
Note that this annotator is only supported for Spark Versions 3.4 and up.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = MPNetEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("mpnet_embeddings")
```
The default model is "all_mpnet_base_v2", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see MPNetEmbeddingsTestSpec.
Sources :
MPNet: Masked and Permuted Pre-training for Language Understanding
MPNet Github Repository
Paper abstract
BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.MPNetEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en")
  .setInputCols("document")
  .setOutputCol("mpnet_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("mpnet_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is an example sentence", "Each sentence is converted").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[0.022502584, -0.078291744, -0.023030775, -0.0051000593, -0.080340415, 0.039...|
|[[0.041702367, 0.0010974605, -0.015534201, 0.07092203, -0.0017729357, 0.04661...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings

class MiniLMEmbeddings extends AnnotatorModel[MiniLMEmbeddings] with HasBatchedAnnotate[MiniLMEmbeddings] with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine

Sentence embeddings using MiniLM.

MiniLM, a lightweight and efficient sentence embedding model that can generate text embeddings for various NLP tasks (e.g., classification, retrieval, clustering, text evaluation, etc.)

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = MiniLMEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("minilm_embeddings")

The default model is "minilm_l6_v2", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see MiniLMEmbeddingsTestSpec.

Sources :

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

MiniLM Github Repository

Paper abstract

We present a simple and effective approach to compress large pre-trained Transformer models by distilling the self-attention module of the last Transformer layer. The compressed model (called MiniLM) can be trained with task-agnostic distillation and then fine-tuned on various downstream tasks. We evaluate MiniLM on the GLUE benchmark and show that it achieves comparable results with BERT-base while being 4.3x smaller and 5.5x faster. We also show that MiniLM can be further compressed to 22x smaller and 12x faster than BERT-base while maintaining comparable performance.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.MiniLMEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = MiniLMEmbeddings.pretrained("minilm_l6_v2", "en")
  .setInputCols("document")
  .setOutputCol("minilm_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("minilm_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sample sentence for embedding generation.",
"Another example sentence to demonstrate MiniLM embeddings.")

).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[0.1234567, -0.2345678, 0.3456789, -0.4567890, 0.5678901, -0.6789012...|
|[[0.2345678, -0.3456789, 0.4567890, -0.5678901, 0.6789012, -0.7890123...|
+--------------------------------------------------------------------------------+

See also: Annotators Main Page for a list of transformer based embeddings

class MxbaiEmbeddings extends AnnotatorModel[MxbaiEmbeddings] with HasBatchedAnnotate[MxbaiEmbeddings] with WriteTensorflowModel with WriteOnnxModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine

Sentence embeddings using Mxbai Embeddings.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = MxbaiEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("Mxbai_embeddings")

The default model is "mxbai_large_v1", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see MxbaiEmbeddingsTestSpec.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.MxbaiEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = MxbaiEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("Mxbai_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("Mxbai_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("hello world", "hello moon").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.50387806, 0.5861606, 0.35129607, -0.76046336, -0.32446072, -0.117674336, 0...|
|[0.6660665, 0.961762, 0.24854276, -0.1018044, -0.6569202, 0.027635604, 0.1915...|
+--------------------------------------------------------------------------------+

See also: Annotators Main Page for a list of transformer based embeddings

class NomicEmbeddings extends AnnotatorModel[NomicEmbeddings] with HasBatchedAnnotate[NomicEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine

Sentence embeddings using NomicEmbeddings.

nomic-embed-text-v1 is 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = NomicEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("nomic_embeddings")

The default model is "nomic_embed_v1", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see NomicEmbeddingsTestSpec.

Sources :

Nomic Embed: Training a Reproducible Long Context Text Embedder

NomicEmbeddings Github Repository

Paper abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, opendata, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embedtext-v1. You can find code and data to replicate the model at https://github.com/nomicai/contrastors.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.NomicEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = NomicEmbeddings.pretrained("nomic_embed_v1", "en")
  .setInputCols("document")
  .setOutputCol("nomic_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("nomic_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("query: how much protein should a female eat",
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day." +
But, as you can see from this chart, you'll need to increase that if you're expecting or training for a" +
marathon. Check out the chart below to see how much protein you should be eating each day."

).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+

See also: Annotators Main Page for a list of transformer based embeddings

trait ReadAlbertDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadAutoGGUFEmbeddings extends AnyRef
trait ReadBGEDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadBertDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadBertSentenceDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadCamemBertDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadDeBertaDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadDistilBertDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadE5DLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadE5VEmbeddingsDLModel extends ReadOpenvinoModel
trait ReadElmoDLModel extends ReadTensorflowModel
trait ReadInstructorDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadLongformerDLModel extends ReadTensorflowModel
trait ReadMPNetDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadMiniLMDLModel extends ReadOnnxModel with ReadOpenvinoModel
trait ReadMxbaiDLModel extends ReadTensorflowModel with ReadOnnxModel
trait ReadNomicEmbeddingsDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadRobertaDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadRobertaSentenceDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadSnowFlakeDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadUAEDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadUSEDLModel extends ReadTensorflowModel
trait ReadXlmRobertaDLModel extends ReadTensorflowModel with ReadSentencePieceModel with ReadOnnxModel with ReadOpenvinoModel
trait ReadXlmRobertaSentenceDLModel extends ReadTensorflowModel with ReadOnnxModel with ReadSentencePieceModel with ReadOpenvinoModel
trait ReadXlnetDLModel extends ReadTensorflowModel with ReadSentencePieceModel
trait ReadablePretrainedAlbertModel extends ParamsAndFeaturesReadable[AlbertEmbeddings] with HasPretrained[AlbertEmbeddings]
trait ReadablePretrainedAutoGGUFEmbeddings extends ParamsAndFeaturesFallbackReadable[AutoGGUFEmbeddings] with HasPretrained[AutoGGUFEmbeddings]
trait ReadablePretrainedBGEModel extends ParamsAndFeaturesReadable[BGEEmbeddings] with HasPretrained[BGEEmbeddings]
trait ReadablePretrainedBertModel extends ParamsAndFeaturesReadable[BertEmbeddings] with HasPretrained[BertEmbeddings]
trait ReadablePretrainedBertSentenceModel extends ParamsAndFeaturesReadable[BertSentenceEmbeddings] with HasPretrained[BertSentenceEmbeddings]
trait ReadablePretrainedCamemBertModel extends ParamsAndFeaturesReadable[CamemBertEmbeddings] with HasPretrained[CamemBertEmbeddings]
trait ReadablePretrainedDeBertaModel extends ParamsAndFeaturesReadable[DeBertaEmbeddings] with HasPretrained[DeBertaEmbeddings]
trait ReadablePretrainedDistilBertModel extends ParamsAndFeaturesReadable[DistilBertEmbeddings] with HasPretrained[DistilBertEmbeddings]
trait ReadablePretrainedDoc2Vec extends ParamsAndFeaturesReadable[Doc2VecModel] with HasPretrained[Doc2VecModel]
trait ReadablePretrainedE5Model extends ParamsAndFeaturesReadable[E5Embeddings] with HasPretrained[E5Embeddings]
trait ReadablePretrainedE5VEmbeddings extends ParamsAndFeaturesReadable[E5VEmbeddings] with HasPretrained[E5VEmbeddings]
trait ReadablePretrainedElmoModel extends ParamsAndFeaturesReadable[ElmoEmbeddings] with HasPretrained[ElmoEmbeddings]
trait ReadablePretrainedInstructorModel extends ParamsAndFeaturesReadable[InstructorEmbeddings] with HasPretrained[InstructorEmbeddings]
trait ReadablePretrainedLongformerModel extends ParamsAndFeaturesReadable[LongformerEmbeddings] with HasPretrained[LongformerEmbeddings]
trait ReadablePretrainedMPNetModel extends ParamsAndFeaturesReadable[MPNetEmbeddings] with HasPretrained[MPNetEmbeddings]
trait ReadablePretrainedMiniLMModel extends ParamsAndFeaturesReadable[MiniLMEmbeddings] with HasPretrained[MiniLMEmbeddings]
trait ReadablePretrainedMxbaiModel extends ParamsAndFeaturesReadable[MxbaiEmbeddings] with HasPretrained[MxbaiEmbeddings]
trait ReadablePretrainedNomicEmbeddingsModel extends ParamsAndFeaturesReadable[NomicEmbeddings] with HasPretrained[NomicEmbeddings]
trait ReadablePretrainedRobertaModel extends ParamsAndFeaturesReadable[RoBertaEmbeddings] with HasPretrained[RoBertaEmbeddings]
trait ReadablePretrainedRobertaSentenceModel extends ParamsAndFeaturesReadable[RoBertaSentenceEmbeddings] with HasPretrained[RoBertaSentenceEmbeddings]
trait ReadablePretrainedSnowFlakeModel extends ParamsAndFeaturesReadable[SnowFlakeEmbeddings] with HasPretrained[SnowFlakeEmbeddings]
trait ReadablePretrainedUAEModel extends ParamsAndFeaturesReadable[UAEEmbeddings] with HasPretrained[UAEEmbeddings]
trait ReadablePretrainedUSEModel extends ParamsAndFeaturesReadable[UniversalSentenceEncoder] with HasPretrained[UniversalSentenceEncoder]
trait ReadablePretrainedWord2Vec extends ParamsAndFeaturesReadable[Word2VecModel] with HasPretrained[Word2VecModel]
trait ReadablePretrainedWordEmbeddings extends StorageReadable[WordEmbeddingsModel] with HasPretrained[WordEmbeddingsModel]
trait ReadablePretrainedXlmRobertaModel extends ParamsAndFeaturesReadable[XlmRoBertaEmbeddings] with HasPretrained[XlmRoBertaEmbeddings]
trait ReadablePretrainedXlmRobertaSentenceModel extends ParamsAndFeaturesReadable[XlmRoBertaSentenceEmbeddings] with HasPretrained[XlmRoBertaSentenceEmbeddings]
trait ReadablePretrainedXlnetModel extends ParamsAndFeaturesReadable[XlnetEmbeddings] with HasPretrained[XlnetEmbeddings]
trait ReadsFromBytes extends AnyRef
class RoBertaEmbeddings extends AnnotatorModel[RoBertaEmbeddings] with HasBatchedAnnotate[RoBertaEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = RoBertaEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
```
The default model is "roberta_base", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the RoBertaEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
Paper Abstract:
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Tips:
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
- RoBERTa doesn't have :obj:token_type_ids, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token :obj:tokenizer.sep_token (or :obj:</s>)
The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.RoBertaEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
|[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
|[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
|[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
|[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
+--------------------------------------------------------------------------------+
```
See also
RoBertaSentenceEmbeddings for sentence-level embeddings
RoBertaForTokenClassification For RoBerta embeddings with a token classification layer on top
Annotators Main Page for a list of transformer based embeddings
class RoBertaSentenceEmbeddings extends AnnotatorModel[RoBertaSentenceEmbeddings] with HasBatchedAnnotate[RoBertaSentenceEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Sentence-level embeddings using RoBERTa.
Sentence-level embeddings using RoBERTa. The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = RoBertaSentenceEmbeddings.pretrained()
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")
```
The default model is "sent_roberta_base", if no name is provided. For available pretrained models please see the Models Hub.
To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see RoBertaEmbeddingsTestSpec.
Paper Abstract:
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Tips:
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
- RoBERTa doesn't have :obj:token_type_ids, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token :obj:tokenizer.sep_token (or :obj:</s>)
The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val sentenceEmbeddings = RoBertaSentenceEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")
  .setCaseSensitive(true)

// you can either use the output to train ClassifierDL, SentimentDL, or MultiClassifierDL
// or you can use EmbeddingsFinisher to prepare the results for Spark ML functions

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    sentenceEmbeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
|[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
|[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
|[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
|[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
+--------------------------------------------------------------------------------+
```
See also
RoBertaEmbeddings for token-level embeddings
Annotators Main Page for a list of transformer based embeddings

class SentenceEmbeddings extends AnnotatorModel[SentenceEmbeddings] with HasSimpleAnnotate[SentenceEmbeddings] with HasEmbeddingsProperties with HasStorageRef

Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).

This can be configured with setPoolingStrategy, which either be "AVERAGE" or "SUM".

For more extended examples see the Examples. and the SentenceEmbeddingsTestSpec.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.SentenceEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val embeddingsSentence = new SentenceEmbeddings()
  .setInputCols(Array("document", "embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsSentence,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+

class SnowFlakeEmbeddings extends AnnotatorModel[SnowFlakeEmbeddings] with HasBatchedAnnotate[SnowFlakeEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Sentence embeddings using SnowFlake.
Sentence embeddings using SnowFlake.
snowflake-arctic-embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = SnowFlakeEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("snowflake_embeddings")
```
The default model is "snowflake_artic_m", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see SnowFlakeEmbeddingsTestSpec.
Sources :
Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
Snowflake Arctic-Embed Models
Paper abstract
The models are trained by leveraging existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch—pretraining leverages about 400m samples of a mix of public datasets and proprietary web search data. Following pretraining models are further optimized with long training on a smaller dataset (about 1m samples) of triplets of query, positive document, and negative document derived from hard harmful mining. Mining of the negatives and data curation is crucial to retrieval accuracy. A detailed technical report will be available shortly.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.SnowFlakeEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = SnowFlakeEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("snowflake_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("snowflake_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("hello world", "hello moon").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
--------------------+
finished_embeddings|
--------------------+
[[-0.45763275, 0....|
[[-0.43076283, 0....|
--------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings
class UAEEmbeddings extends AnnotatorModel[UAEEmbeddings] with HasBatchedAnnotate[UAEEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Sentence embeddings using Universal AnglE Embedding (UAE).
Sentence embeddings using Universal AnglE Embedding (UAE).
UAE is a novel angle-optimized text embedding model, designed to improve semantic textual similarity tasks, which are crucial for Large Language Model (LLM) applications. By introducing angle optimization in a complex space, AnglE effectively mitigates saturation of the cosine similarity function.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = UAEEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("UAE_embeddings")
```
The default model is "uae_large_v1", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see UAEEmbeddingsTestSpec.
Sources :
AnglE-optimized Text Embeddings
UAE Github Repository
Paper abstract
High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradient and hinder optimization processes. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.UAEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = UAEEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("UAE_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("UAE_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("hello world", "hello moon").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.50387806, 0.5861606, 0.35129607, -0.76046336, -0.32446072, -0.117674336, 0...|
|[0.6660665, 0.961762, 0.24854276, -0.1018044, -0.6569202, 0.027635604, 0.1915...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings
class UniversalSentenceEncoder extends AnnotatorModel[UniversalSentenceEncoder] with HasBatchedAnnotate[UniversalSentenceEncoder] with HasEmbeddingsProperties with HasStorageRef with WriteTensorflowModel with HasEngine
The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
Pretrained models can be loaded with pretrained of the companion object:
```
val useEmbeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")
```
The default model is "tfhub_use", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the UniversalSentenceEncoderTestSpec.
References:
Universal Sentence Encoder
https://tfhub.dev/google/universal-sentence-encoder/2
Paper abstract:
We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val embeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.04616805538535118,0.022307956591248512,-0.044395286589860916,-0.0016493503...|
+--------------------------------------------------------------------------------+
```
See also
Annotators Main Page for a list of transformer based embeddings
class Word2VecApproach extends AnnotatorApproach[Word2VecModel] with HasStorageRef with HasEnableCachingProperties with HasProtectedParams
Trains a Word2Vec model that creates vector representations of words in a text corpus.
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We use Word2Vec implemented in Spark ML. It uses skip-gram model in our implementation and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Word2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Word2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new Word2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings
  ))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)
```

class Word2VecModel extends AnnotatorModel[Word2VecModel] with HasSimpleAnnotate[Word2VecModel] with HasStorageRef with HasEmbeddingsProperties with ParamsAndFeaturesWritable

Word2Vec model that creates vector representations of words in a text corpus.

This is the instantiated model of the Word2VecApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = Word2VecModel.pretrained()
  .setInputCols("token")
  .setOutputCol("embeddings")

The default model is "word2vec_gigaword_300", if no name is provided.

For available pretrained models please see the Models Hub.

Sources :

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Word2VecModel}
import com.johnsnowlabs.nlp.EmbeddingsFinisher

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = Word2VecModel.pretrained()
  .setInputCols("token")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.06222493574023247,0.011579325422644615,0.009919632226228714,0.109361454844...|
+--------------------------------------------------------------------------------+

class WordEmbeddings extends AnnotatorApproach[WordEmbeddingsModel] with HasStorage with HasEmbeddingsProperties

Word Embeddings lookup annotator that maps tokens to vectors.

For instantiated/pretrained models, see WordEmbeddingsModel.

A custom token lookup dictionary for embeddings can be set with setStoragePath. Each line of the provided file needs to have a token, followed by their vector representation, delimited by a spaces.

...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...

If a token is not found in the dictionary, then the result will be a zero vector of the same dimension. Statistics about the rate of converted tokens, can be retrieved with WordEmbeddingsModel.withCoverageColumn and WordEmbeddingsModel.overallCoverage.

For extended examples of usage, see the Examples and the WordEmbeddingsTestSpec.

Example

In this example, the file random_embeddings_dim4.txt has the form of the content above.

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddings
import com.johnsnowlabs.nlp.util.io.ReadAs
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new WordEmbeddings()
  .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
  .setStorageRef("glove_4d")
  .setDimension(4)
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("The patient was diagnosed with diabetes.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(false)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+

See also: SentenceEmbeddings to combine embeddings into a sentence-level representation
Annotators Main Page for a list of transformer based embeddings

class WordEmbeddingsModel extends AnnotatorModel[WordEmbeddingsModel] with HasSimpleAnnotate[WordEmbeddingsModel] with HasEmbeddingsProperties with HasStorageModel with ParamsAndFeaturesWritable with ReadsFromBytes

Word Embeddings lookup annotator that maps tokens to vectors

This is the instantiated model of WordEmbeddings.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = WordEmbeddingsModel.pretrained()
    .setInputCols("document", "token")
    .setOutputCol("embeddings")

The default model is "glove_100d", if no name is provided. For available pretrained models please see the Models Hub.

There are also two convenient functions to retrieve the embeddings coverage with respect to the transformed dataset:

withCoverageColumn(dataset, embeddingsCol, outputCol): Adds a custom column with word coverage stats for the embedded field: (coveredWords, totalWords, coveragePercentage). This creates a new column with statistics for each row.

val wordsCoverage = WordEmbeddingsModel.withCoverageColumn(resultDF, "embeddings", "cov_embeddings")
wordsCoverage.select("text","cov_embeddings").show(false)
+-------------------+--------------+
|text               |cov_embeddings|
+-------------------+--------------+
|This is a sentence.|[5, 5, 1.0]   |
+-------------------+--------------+

overallCoverage(dataset, embeddingsCol): Calculates overall word coverage for the whole data in the embedded field. This returns a single coverage object considering all rows in the field.

val wordsOverallCoverage = WordEmbeddingsModel.overallCoverage(wordsCoverage,"embeddings").percentage
1.0

For extended examples of usage, see the Examples and the WordEmbeddingsTestSpec.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.570580005645752,0.44183000922203064,0.7010200023651123,-0.417129993438720...|
|[-0.542639970779419,0.4147599935531616,1.0321999788284302,-0.4024400115013122...|
|[-0.2708599865436554,0.04400600120425224,-0.020260000601410866,-0.17395000159...|
|[0.6191999912261963,0.14650000631809235,-0.08592499792575836,-0.2629800140857...|
|[-0.3397899866104126,0.20940999686717987,0.46347999572753906,-0.6479200124740...|
+--------------------------------------------------------------------------------+

See also: SentenceEmbeddings to combine embeddings into a sentence-level representation
Annotators Main Page for a list of transformer based embeddings

class WordEmbeddingsReader extends StorageReader[Array[Float]] with ReadsFromBytes
class WordEmbeddingsWriter extends StorageBatchWriter[Array[Float]] with ReadsFromBytes
class XlmRoBertaEmbeddings extends AnnotatorModel[XlmRoBertaEmbeddings] with HasBatchedAnnotate[XlmRoBertaEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with WriteOnnxModel with WriteOpenvinoModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco GuzmÃ¡n, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco GuzmÃ¡n, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = XlmRoBertaEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
```
The default model is "xlm_roberta_base", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples and the XlmRoBertaEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
Paper Abstract:
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.
Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang parameter to understand which language is used, and should be able to determine the correct language from the input ids.
- This implementation is the same as RoBERTa. Refer to the RoBertaEmbeddings for usage examples as well as the information relative to the inputs and outputs.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = XlmRoBertaEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.05969233065843582,-0.030789051204919815,0.04443822056055069,0.09564960747...|
|[-0.038839809596538544,0.011712731793522835,0.019954433664679527,0.0667808502...|
|[-0.03952755779027939,-0.03455188870429993,0.019103847444057465,0.04311436787...|
|[-0.09579929709434509,0.02494969218969345,-0.014753809198737144,0.10259044915...|
|[0.004710011184215546,-0.022148698568344116,0.011723337695002556,-0.013356896...|
+--------------------------------------------------------------------------------+
```
See also
XlmRoBertaSentenceEmbeddings for sentence-level embeddings
XlmRoBertaForTokenClassification For XlmRoBerta embeddings with a token classification layer on top
Annotators Main Page for a list of transformer based embeddings
class XlmRoBertaSentenceEmbeddings extends AnnotatorModel[XlmRoBertaSentenceEmbeddings] with HasBatchedAnnotate[XlmRoBertaSentenceEmbeddings] with WriteTensorflowModel with WriteOnnxModel with WriteOpenvinoModel with WriteSentencePieceModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
Sentence-level embeddings using XLM-RoBERTa.
Sentence-level embeddings using XLM-RoBERTa. The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco GuzmÃ¡n, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = XlmRoBertaSentenceEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")
```
The default model is "sent_xlm_roberta_base", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.
To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see XlmRoBertaSentenceEmbeddingsTestSpec.
Paper Abstract:
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.
Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang parameter to understand which language is used, and should be able to determine the correct language from the input ids.
- This implementation is the same as RoBERTa. Refer to the RoBertaEmbeddings for usage examples as well as the information relative to the inputs and outputs.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val sentenceEmbeddings = XlmRoBertaSentenceEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")
  .setCaseSensitive(true)

// you can either use the output to train ClassifierDL, SentimentDL, or MultiClassifierDL
// or you can use EmbeddingsFinisher to prepare the results for Spark ML functions

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    sentenceEmbeddings,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.05969233065843582,-0.030789051204919815,0.04443822056055069,0.09564960747...|
|[-0.038839809596538544,0.011712731793522835,0.019954433664679527,0.0667808502...|
|[-0.03952755779027939,-0.03455188870429993,0.019103847444057465,0.04311436787...|
|[-0.09579929709434509,0.02494969218969345,-0.014753809198737144,0.10259044915...|
|[0.004710011184215546,-0.022148698568344116,0.011723337695002556,-0.013356896...|
+--------------------------------------------------------------------------------+
```
See also
XlmRoBertaEmbeddings for token-level embeddings
Annotators Main Page for a list of transformer based embeddings
class XlnetEmbeddings extends AnnotatorModel[XlnetEmbeddings] with HasBatchedAnnotate[XlnetEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties with HasEngine
XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding
XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding
XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.
These word embeddings represent the outputs generated by the XLNet models.
Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.
"xlnet_large_cased" = XLNet-Large \| 24-layer, 1024-hidden, 16-heads
"xlnet_base_cased" = XLNet-Base \| 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).
Pretrained models can be loaded with pretrained of the companion object:
```
val embeddings = XlnetEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
```
The default model is "xlnet_base_cased", if no name is provided.
For extended examples of usage, see the Examples and the XlnetEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
Sources :
XLNet: Generalized Autoregressive Pretraining for Language Understanding
https://github.com/zihangdai/xlnet
Paper abstract:
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.XlnetEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = XlnetEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings,
  embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.6287205219268799,-0.4865287244319916,-0.186111718416214,0.234187275171279...|
|[-1.1967450380325317,0.2746637463569641,0.9481253027915955,0.3431355059146881...|
|[-1.0777631998062134,-2.092679977416992,-1.5331977605819702,-1.11190271377563...|
|[-0.8349916934967041,-0.45627787709236145,-0.7890847325325012,-1.028069257736...|
|[-0.134845569729805,-0.11672890186309814,0.4945235550403595,-0.66587203741073...|
+--------------------------------------------------------------------------------+
```
See also
XlnetForTokenClassification For Xlnet embeddings with a token classification layer on top
Annotators Main Page for a list of transformer based embeddings

Value Members

object AlbertEmbeddings extends ReadablePretrainedAlbertModel with ReadAlbertDLModel with Serializable
This is the companion object of AlbertEmbeddings.
This is the companion object of AlbertEmbeddings. Please refer to that class for the documentation.
object AutoGGUFEmbeddings extends ReadablePretrainedAutoGGUFEmbeddings with ReadAutoGGUFEmbeddings with Serializable
This is the companion object of AutoGGUFEmbeddings.
This is the companion object of AutoGGUFEmbeddings. Please refer to that class for the documentation.
object BGEEmbeddings extends ReadablePretrainedBGEModel with ReadBGEDLModel with Serializable
This is the companion object of BGEEmbeddings.
This is the companion object of BGEEmbeddings. Please refer to that class for the documentation.
object BertEmbeddings extends ReadablePretrainedBertModel with ReadBertDLModel with Serializable
This is the companion object of BertEmbeddings.
This is the companion object of BertEmbeddings. Please refer to that class for the documentation.
object BertSentenceEmbeddings extends ReadablePretrainedBertSentenceModel with ReadBertSentenceDLModel with Serializable
This is the companion object of BertSentenceEmbeddings.
This is the companion object of BertSentenceEmbeddings. Please refer to that class for the documentation.
object CamemBertEmbeddings extends ReadablePretrainedCamemBertModel with ReadCamemBertDLModel with Serializable
This is the companion object of CamemBertEmbeddings.
This is the companion object of CamemBertEmbeddings. Please refer to that class for the documentation.
object ChunkEmbeddings extends DefaultParamsReadable[ChunkEmbeddings] with Serializable
This is the companion object of ChunkEmbeddings.
This is the companion object of ChunkEmbeddings. Please refer to that class for the documentation.
object DeBertaEmbeddings extends ReadablePretrainedDeBertaModel with ReadDeBertaDLModel with Serializable
This is the companion object of DeBertaEmbeddings.
This is the companion object of DeBertaEmbeddings. Please refer to that class for the documentation.
object DistilBertEmbeddings extends ReadablePretrainedDistilBertModel with ReadDistilBertDLModel with Serializable
This is the companion object of DistilBertEmbeddings.
This is the companion object of DistilBertEmbeddings. Please refer to that class for the documentation.
object Doc2VecApproach extends DefaultParamsReadable[Doc2VecApproach] with Serializable
This is the companion object of Doc2VecApproach.
This is the companion object of Doc2VecApproach. Please refer to that class for the documentation.
object Doc2VecModel extends ReadablePretrainedDoc2Vec with Serializable
This is the companion object of Doc2VecModel.
This is the companion object of Doc2VecModel. Please refer to that class for the documentation.
object E5Embeddings extends ReadablePretrainedE5Model with ReadE5DLModel with Serializable
This is the companion object of E5Embeddings.
This is the companion object of E5Embeddings. Please refer to that class for the documentation.
object E5VEmbeddings extends ReadablePretrainedE5VEmbeddings with ReadE5VEmbeddingsDLModel with Serializable
object ElmoEmbeddings extends ReadablePretrainedElmoModel with ReadElmoDLModel with Serializable
This is the companion object of ElmoEmbeddings.
This is the companion object of ElmoEmbeddings. Please refer to that class for the documentation.
object InstructorEmbeddings extends ReadablePretrainedInstructorModel with ReadInstructorDLModel with ReadSentencePieceModel with Serializable
This is the companion object of InstructorEmbeddings.
This is the companion object of InstructorEmbeddings. Please refer to that class for the documentation.
object LongformerEmbeddings extends ReadablePretrainedLongformerModel with ReadLongformerDLModel with Serializable
This is the companion object of LongformerEmbeddings.
This is the companion object of LongformerEmbeddings. Please refer to that class for the documentation.
object MPNetEmbeddings extends ReadablePretrainedMPNetModel with ReadMPNetDLModel with Serializable
This is the companion object of MPNetEmbeddings.
This is the companion object of MPNetEmbeddings. Please refer to that class for the documentation.
object MiniLMEmbeddings extends ReadablePretrainedMiniLMModel with ReadMiniLMDLModel with Serializable
This is the companion object of MiniLMEmbeddings.
This is the companion object of MiniLMEmbeddings. Please refer to that class for the documentation.
object MxbaiEmbeddings extends ReadablePretrainedMxbaiModel with ReadMxbaiDLModel with Serializable
This is the companion object of MxbaiEmbeddings.
This is the companion object of MxbaiEmbeddings. Please refer to that class for the documentation.
object NomicEmbeddings extends ReadablePretrainedNomicEmbeddingsModel with ReadNomicEmbeddingsDLModel with Serializable
This is the companion object of NomicEmbeddings.
This is the companion object of NomicEmbeddings. Please refer to that class for the documentation.
object PoolingStrategy
object RoBertaEmbeddings extends ReadablePretrainedRobertaModel with ReadRobertaDLModel with Serializable
This is the companion object of RoBertaEmbeddings.
This is the companion object of RoBertaEmbeddings. Please refer to that class for the documentation.
object RoBertaSentenceEmbeddings extends ReadablePretrainedRobertaSentenceModel with ReadRobertaSentenceDLModel with Serializable
This is the companion object of RoBertaSentenceEmbeddings.
This is the companion object of RoBertaSentenceEmbeddings. Please refer to that class for the documentation.
object SentenceEmbeddings extends DefaultParamsReadable[SentenceEmbeddings] with Serializable
This is the companion object of SentenceEmbeddings.
This is the companion object of SentenceEmbeddings. Please refer to that class for the documentation.
object SnowFlakeEmbeddings extends ReadablePretrainedSnowFlakeModel with ReadSnowFlakeDLModel with Serializable
This is the companion object of SnowFlakeEmbeddings.
This is the companion object of SnowFlakeEmbeddings. Please refer to that class for the documentation.
object UAEEmbeddings extends ReadablePretrainedUAEModel with ReadUAEDLModel with Serializable
This is the companion object of UAEEmbeddings.
This is the companion object of UAEEmbeddings. Please refer to that class for the documentation.
object UniversalSentenceEncoder extends ReadablePretrainedUSEModel with ReadUSEDLModel with Serializable
This is the companion object of UniversalSentenceEncoder.
This is the companion object of UniversalSentenceEncoder. Please refer to that class for the documentation.
object Word2VecApproach extends DefaultParamsReadable[Word2VecApproach] with Serializable
This is the companion object of Word2VecApproach.
This is the companion object of Word2VecApproach. Please refer to that class for the documentation.
object Word2VecModel extends ReadablePretrainedWord2Vec with Serializable
This is the companion object of Word2VecModel.
This is the companion object of Word2VecModel. Please refer to that class for the documentation.
object WordEmbeddings extends DefaultParamsReadable[WordEmbeddings] with Serializable
This is the companion object of WordEmbeddings.
This is the companion object of WordEmbeddings. Please refer to that class for the documentation.
object WordEmbeddingsBinaryIndexer
object WordEmbeddingsModel extends ReadablePretrainedWordEmbeddings with EmbeddingsCoverage with Serializable
This is the companion object of WordEmbeddingsModel.
This is the companion object of WordEmbeddingsModel. Please refer to that class for the documentation.
object WordEmbeddingsTextIndexer
object XlmRoBertaEmbeddings extends ReadablePretrainedXlmRobertaModel with ReadXlmRobertaDLModel with Serializable
object XlmRoBertaSentenceEmbeddings extends ReadablePretrainedXlmRobertaSentenceModel with ReadXlmRobertaSentenceDLModel with Serializable
object XlnetEmbeddings extends ReadablePretrainedXlnetModel with ReadXlnetDLModel with Serializable
This is the companion object of XlnetEmbeddings.
This is the companion object of XlnetEmbeddings. Please refer to that class for the documentation.

Packages

embeddings 

package embeddings

Type Members

Example

Note

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Overview

Example

Image + Text Embedding

Text-Only Embedding

References

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Value Members

Ungrouped

embeddings