`sparknlp.annotator.embeddings.mxbai_embeddings`

Contains classes for MxbaiEmbeddings.

Module Contents

Classes

MxbaiEmbeddings

Sentence embeddings using Mxbai Embeddings.

class MxbaiEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.MxbaiEmbeddings', java_model=None)

Sentence embeddings using Mxbai Embeddings.

Pretrained models can be loaded with the static method pretrained():

>>> embeddings = MxbaiEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("Mxbai_embeddings")

If no name is provided, the default model is "mxbai_large_v1".

For available pretrained models please see the Models Hub.

Input Annotation types	Output Annotation type
`DOCUMENT`	`SENTENCE_EMBEDDINGS`

Parameters:

batchSize: Size of every batch, by default 8
dimension: Number of embedding dimensions, by default 768
caseSensitive: Whether tokens are treated as case-sensitive when matching embeddings, by default False
maxSentenceLength: Max sentence length to process, by default 512
configProtoBytes: ConfigProto from tensorflow, serialized into byte array.
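
The parameters above map onto setter methods of the annotator. The following is a minimal sketch, not a definitive recipe: it assumes Spark NLP is installed and a session has already been started with sparknlp.start(). The helper name configure_mxbai_embeddings is hypothetical, and the values passed are simply the defaults documented above.

```python
def configure_mxbai_embeddings():
    """Return a MxbaiEmbeddings annotator with its parameters set explicitly.

    Hypothetical helper for illustration only. It needs an active Spark NLP
    session, because pretrained() downloads the model when it is called.
    """
    # Imported lazily so this sketch can be loaded without Spark NLP installed.
    from sparknlp.annotator import MxbaiEmbeddings

    return (
        MxbaiEmbeddings.pretrained()      # "mxbai_large_v1" when no name is given
        .setInputCols(["document"])
        .setOutputCol("embeddings")
        .setBatchSize(8)                  # batchSize, documented default
        .setCaseSensitive(False)          # caseSensitive, documented default
        .setMaxSentenceLength(512)        # maxSentenceLength, documented default
        .setPoolingStrategy("cls")        # one of the strategies listed under setPoolingStrategy
    )
```

Larger batch sizes usually trade memory for throughput, so the value worth choosing depends on the cluster and is not something this page specifies.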

Examples

>>> import sparknlp
>>> spark = sparknlp.start()
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> embeddings = MxbaiEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols("embeddings") \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["hello world"], ["hello moon"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.50387806, 0.5861606, 0.35129607, -0.76046336, -0.32446072, -0.117674336, 0...|
|[0.6660665, 0.961762, 0.24854276, -0.1018044, -0.6569202, 0.027635604, 0.1915...|
+--------------------------------------------------------------------------------+

name = 'MxbaiEmbeddings'

inputAnnotatorTypes

outputAnnotatorType = 'sentence_embeddings'

poolingStrategy

setPoolingStrategy(value)

Pooling strategy to use for sentence embeddings.

Available pooling strategies for sentence embeddings are:

“cls”: leading [CLS] token
“cls_avg”: leading [CLS] token + mean of all other tokens
“last”: embeddings of the last token in the sequence
“avg”: mean of all tokens
“max”: max of all embedding features of the entire token sequence
“int”: An integer number, which represents the index of the token to use as the embedding

Parameters:

value (str): Pooling strategy to use for sentence embeddings
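
To make the pooling strategies listed above concrete, here is a small pure-Python sketch of how each one could reduce per-token vectors to a single sentence vector. It is an illustration, not Spark NLP's implementation: the function name pool is hypothetical, padding and attention masks are ignored, and "cls_avg" is read literally as the element-wise sum of the [CLS] vector and the mean of the remaining tokens, which is an assumption (the library may combine them differently).

```python
def pool(token_vectors, strategy):
    """Reduce a list of per-token vectors to one sentence vector.

    token_vectors: list of equal-length lists of floats, [CLS] token first.
    strategy: "cls", "cls_avg", "last", "avg", "max", or an int token index.
    """
    n_dims = len(token_vectors[0])

    def mean(vectors):
        # Element-wise mean over a list of vectors.
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(n_dims)]

    if isinstance(strategy, int):
        return list(token_vectors[strategy])      # token at the given index
    if strategy == "cls":
        return list(token_vectors[0])             # leading [CLS] token
    if strategy == "last":
        return list(token_vectors[-1])            # last token in the sequence
    if strategy == "avg":
        return mean(token_vectors)                # mean of all tokens
    if strategy == "max":
        # Per-feature maximum over the whole token sequence.
        return [max(v[d] for v in token_vectors) for d in range(n_dims)]
    if strategy == "cls_avg":
        # ASSUMPTION: "[CLS] + mean of all other tokens" taken as an
        # element-wise sum; falls back to [CLS] alone for one-token input.
        others = token_vectors[1:] or token_vectors[:1]
        others_mean = mean(others)
        return [c + m for c, m in zip(token_vectors[0], others_mean)]
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


# Three tokens with two features each; the first row plays the role of [CLS].
tokens = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]
print(pool(tokens, "avg"))  # [3.0, 2.0]
print(pool(tokens, "max"))  # [5.0, 4.0]
print(pool(tokens, 1))      # [3.0, 0.0]
```

In the annotator itself the strategy is selected with setPoolingStrategy(value) rather than computed by hand.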

static loadSavedModel(folder, spark_session)

Loads a locally saved model.

Parameters:

folder (str): Folder of the saved model
spark_session (pyspark.sql.SparkSession): The current SparkSession

Returns:

MxbaiEmbeddings: The restored model
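
A typical use of loadSavedModel is to import a locally exported model once and then persist it in Spark NLP's own format for later reuse. The sketch below assumes Spark NLP is installed. The helper name import_and_save and both directory arguments are hypothetical, and the expected layout of the export folder is not described on this page.

```python
def import_and_save(export_dir, output_dir, spark):
    """Import a locally saved model, persist it, and load it back.

    Hypothetical helper for illustration only. `export_dir` is the folder
    passed to loadSavedModel, `output_dir` is where the Spark NLP model is
    written, and `spark` is the active SparkSession.
    """
    # Imported lazily so this sketch can be loaded without Spark NLP installed.
    from sparknlp.annotator import MxbaiEmbeddings

    embeddings = (
        MxbaiEmbeddings.loadSavedModel(export_dir, spark)
        .setInputCols(["document"])
        .setOutputCol("embeddings")
    )

    # Standard Spark ML persistence: write once, reload without re-importing.
    embeddings.write().overwrite().save(output_dir)
    return MxbaiEmbeddings.load(output_dir)
```

The reloaded annotator can be placed in a Pipeline in the same way as one obtained from pretrained().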

static pretrained(name='mxbai_large_v1', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:

name (str, optional): Name of the pretrained model, by default “mxbai_large_v1”
lang (str, optional): Language of the pretrained model, by default “en”
remote_loc (str, optional): Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:

MxbaiEmbeddings: The restored model
