sparknlp.annotator.embeddings.nomic_embeddings#
Contains classes for NomicEmbeddings.
Module Contents#
Classes#
Sentence embeddings using NomicEmbeddings.
- class NomicEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.NomicEmbeddings', java_model=None)[source]#
Sentence embeddings using NomicEmbeddings.
nomic-embed-text-v1 is an 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small in performance on both short and long context tasks.
Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = NomicEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")

The default model is "nomic_small", if no name is provided.

For available pretrained models please see the Models Hub.
Input Annotation types: DOCUMENT
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- batchSize
Size of every batch, by default 8
- dimension
Number of embedding dimensions, by default 768
- caseSensitive
Whether to ignore case in tokens for embeddings matching, by default False
- maxSentenceLength
Max sentence length to process, by default 512
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
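A minimal sketch of configuring these parameters explicitly on the annotator (the values shown are the documented defaults, used here only for illustration):

>>> embeddings = NomicEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings") \
...     .setBatchSize(8) \
...     .setMaxSentenceLength(512) \
...     .setCaseSensitive(False)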
References
Nomic Embed: Training a Reproducible Long Context Text Embedder
Paper abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source models, we release a training data loader with 235 million curated text pairs that allows for the full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> embeddings = NomicEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["nomic_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([
...     ["query: how much protein should a female eat"],
...     ["passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
...      "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
...      "marathon. Check out the chart below to see how much protein you should be eating each day."],
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
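A minimal sketch of producing such a byte array with TensorFlow (this assumes a TensorFlow backend and that the tensorflow package is available on the driver; the GPU option shown is an illustrative assumption, not part of this API):

>>> import tensorflow as tf  # assumed available in the driver environment
>>> config = tf.compat.v1.ConfigProto()
>>> config.gpu_options.allow_growth = True  # illustrative session option
>>> embeddings.setConfigProtoBytes(list(config.SerializeToString()))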
- static loadSavedModel(folder, spark_session, use_openvino=False)[source]#
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- use_openvino : bool, optional
Whether to load the model with the OpenVINO engine, by default False
- Returns:
- NomicEmbeddings
The restored model
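A minimal usage sketch (the folder path here is hypothetical and must contain a model exported in a format Spark NLP can import):

>>> embeddings = NomicEmbeddings.loadSavedModel("/path/to/exported_model", spark) \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")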
- static pretrained(name='nomic_small', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "nomic_small"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- NomicEmbeddings
The restored model
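For example, the default model can also be requested explicitly by name and language:

>>> embeddings = NomicEmbeddings.pretrained("nomic_small", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("nomic_embeddings")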