sparknlp.annotator.embeddings.bge_embeddings
Contains classes for BGEEmbeddings.
Module Contents
Classes
- BGEEmbeddings: Sentence embeddings using BGE.
- class BGEEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.BGEEmbeddings', java_model=None)
Sentence embeddings using BGE.
BGE, or BAAI General Embeddings, is a model that maps any text to a low-dimensional dense
vector that can be used for tasks such as retrieval, classification, clustering, or semantic search.
Pretrained models can be loaded with the pretrained method of the companion object:
>>> embeddings = BGEEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("bge_embeddings")
The default model is "bge_base", if no name is provided. For available pretrained models, please see the Models Hub.
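====================== ======================
Input Annotation types Output Annotation type
====================== ======================
DOCUMENT               SENTENCE_EMBEDDINGS
====================== ======================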
- Parameters:
- batchSize
Size of every batch, by default 8
- dimension
Number of embedding dimensions, by default 768
- caseSensitive
Whether tokens are treated as case-sensitive for embeddings matching, by default False
- maxSentenceLength
Max sentence length to process, by default 512
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
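To make batchSize and maxSentenceLength concrete, the following is a minimal sketch in plain Python of how tokenized inputs might be truncated and then grouped into batches. It is not Spark NLP's implementation: the helper name `make_batches`, the whitespace tokenization, and the toy data are assumptions made for illustration only.

```python
from typing import List


def make_batches(token_lists: List[List[str]],
                 batch_size: int = 8,
                 max_sentence_length: int = 512) -> List[List[List[str]]]:
    """Truncate each token list, then group the results into batches."""
    if batch_size < 1 or max_sentence_length < 1:
        raise ValueError("batch_size and max_sentence_length must be positive")
    # Keep at most max_sentence_length tokens per input.
    truncated = [tokens[:max_sentence_length] for tokens in token_lists]
    # Slice the truncated inputs into consecutive groups of batch_size.
    return [truncated[i:i + batch_size]
            for i in range(0, len(truncated), batch_size)]


# 20 toy documents of 600 whitespace tokens each (illustrative data only).
docs = [("word " * 600).split() for _ in range(20)]
batches = make_batches(docs)

print(len(batches))        # → 3 (batches of 8, 8, and 4 documents)
print(len(batches[-1]))    # → 4
print(len(batches[0][0]))  # → 512 tokens kept after truncation
```

With the defaults listed above, inputs longer than 512 tokens lose their tail, and a larger batch size trades memory for fewer model invocations.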
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> embeddings = BGEEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("bge_embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["bge_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     embeddings,
...     embeddingsFinisher
... ])
>>> # One row per text, so that each text gets its own embedding.
>>> data = spark.createDataFrame([
...     ["query: how much protein should a female eat"],
...     ["passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
...      "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
...      "marathon. Check out the chart below to see how much protein you should be eating each day."],
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
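For retrieval, sentence embeddings like the ones in this example are commonly compared with cosine similarity. The following is a minimal sketch in plain Python that ranks passages against a query. The three-dimensional vectors are made up for readability and are not real BGE output, and the helper names `cosine_similarity` and `rank_passages` are hypothetical, not part of Spark NLP.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_passages(query_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v))
              for i, v in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings (illustrative data only).
query = [1.0, 0.0, 1.0]
passages = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_passages(query, passages)
print([index for index, _ in ranking])  # → [1, 2, 0]
```

Cosine similarity ignores vector length, which is why the second passage, a scaled copy of the query, ranks first.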
- setConfigProtoBytes(b)
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
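Since b is a list of integers, one per byte, a serialized byte string can be converted to that form with `list(...)`. The sketch below uses placeholder bytes as an assumption. In practice the bytes would presumably come from serializing a TensorFlow ConfigProto message, which is not shown here.

```python
# Placeholder bytes standing in for a serialized ConfigProto (assumption,
# not produced by TensorFlow in this sketch).
serialized = bytes([0x32, 0x02, 0x20, 0x01])

# Iterating over a bytes object yields integers in the range 0..255.
config_proto_bytes = list(serialized)
print(config_proto_bytes)  # → [50, 2, 32, 1]

# The list converts back to the same byte string.
round_tripped = bytes(config_proto_bytes)
print(round_tripped == serialized)  # → True
```

A list built this way could then be passed as `setConfigProtoBytes(config_proto_bytes)` on the annotator.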
- static loadSavedModel(folder, spark_session)
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- Returns:
- BGEEmbeddings
The restored model
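As a minimal sketch of how loadSavedModel might be wrapped in application code, the hypothetical helper below first checks that the folder exists and then delegates to the method documented above. The helper name `load_local_bge`, the column names, and the assumption that the folder is on the local filesystem (rather than, say, a distributed store) are illustrative and not part of Spark NLP.

```python
from pathlib import Path


def load_local_bge(folder: str, spark_session):
    """Check that `folder` exists locally, then delegate to loadSavedModel."""
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"Saved model folder not found: {folder}")
    # Imported lazily so the path check runs even where Spark NLP is absent.
    from sparknlp.annotator import BGEEmbeddings
    return (
        BGEEmbeddings.loadSavedModel(str(path), spark_session)
        .setInputCols(["document"])
        .setOutputCol("bge_embeddings")
    )
```

Failing early on a missing folder gives a plain Python error before any Spark job is started.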
- static pretrained(name='bge_base', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "bge_base"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. If not set, Spark NLP's repositories are used.
- Returns:
- BGEEmbeddings
The restored model