sparknlp.annotator.embeddings.nomic_embeddings#
Contains classes for NomicEmbeddings.
Module Contents#
Classes#
Sentence embeddings using NomicEmbeddings.
- class NomicEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.NomicEmbeddings', java_model=None)[source]#
Sentence embeddings using NomicEmbeddings.
nomic-embed-text-v1 is an 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small in performance on both short and long context tasks.
Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = NomicEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")

The default model is "nomic_small", if no name is provided.

For available pretrained models please see the Models Hub.
Input Annotation types: DOCUMENT
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- batchSize
Size of every batch, by default 8
- dimension
Number of embedding dimensions, by default 768
- caseSensitive
Whether to ignore case in tokens for embeddings matching, by default False
- maxSentenceLength
Max sentence length to process, by default 512
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
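A minimal sketch of configuring these parameters explicitly on the annotator (the values shown are the documented defaults, used here only for illustration):

>>> embeddings = NomicEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings") \
...     .setBatchSize(8) \
...     .setMaxSentenceLength(512) \
...     .setCaseSensitive(False)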
References
Nomic Embed: Training a Reproducible Long Context Text Embedder
Paper abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> embeddings = NomicEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["nomic_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([
...     ["query: how much protein should a female eat"],
...     ["passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
...      "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
...      "marathon. Check out the chart below to see how much protein you should be eating each day."],
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
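A minimal sketch of producing such a byte array with TensorFlow (this assumes a TensorFlow backend and that the tensorflow package is available on the driver; the GPU option shown is an illustrative assumption, not part of this API):

>>> import tensorflow as tf  # assumed available in the driver environment
>>> config = tf.compat.v1.ConfigProto()
>>> config.gpu_options.allow_growth = True  # illustrative session option
>>> embeddings.setConfigProtoBytes(list(config.SerializeToString()))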
- static loadSavedModel(folder, spark_session, use_openvino=False)[source]#
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- use_openvino : bool, optional
Whether to load the model with the OpenVINO engine, by default False
- Returns:
- NomicEmbeddings
The restored model
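A minimal usage sketch (the folder path here is hypothetical and must contain a model exported in a format Spark NLP can import):

>>> embeddings = NomicEmbeddings.loadSavedModel("/path/to/exported_model", spark) \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")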
- static pretrained(name='nomic_small', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "nomic_small"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- NomicEmbeddings
The restored model
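For example, the default model can also be requested explicitly by name and language:

>>> embeddings = NomicEmbeddings.pretrained("nomic_small", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")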