sparknlp.annotator.embeddings.auto_gguf_embeddings#

Contains classes for the AutoGGUFEmbeddings.

Module Contents#

Classes#

AutoGGUFEmbeddings

Annotator that uses the llama.cpp library to generate text embeddings with large language models.

class AutoGGUFEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.AutoGGUFEmbeddings', java_model=None)[source]#

Annotator that uses the llama.cpp library to generate text embeddings with large language models.

The type of embedding pooling can be set with the setPoolingType method. The default is “MEAN”. The available options are “NONE”, “MEAN”, “CLS”, and “LAST”.

Pretrained models can be loaded with pretrained() of the companion object:

>>> auto_gguf_model = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")

The default model is "nomic-embed-text-v1.5.Q8_0.gguf", if no name is provided.
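For example, a minimal sketch that loads the default model and switches the pooling type described above (all setters shown are documented on this page):

>>> auto_gguf_model = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setPoolingType("CLS")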

For extended examples of usage, see the AutoGGUFEmbeddingsTest and the example notebook.

For available pretrained models please see the Models Hub.

Input Annotation types: DOCUMENT

Output Annotation type: SENTENCE_EMBEDDINGS

Parameters:
nThreads

Set the number of threads to use during generation

nThreadsBatch

Set the number of threads to use during batch and prompt processing

nCtx

Set the size of the prompt context

nBatch

Set the logical batch size for prompt processing (must be >=32 to use BLAS)

nUbatch

Set the physical batch size for prompt processing (must be >=32 to use BLAS)

nChunks

Set the maximum number of chunks to process

nSequences

Set the number of sequences to decode

nGpuLayers

Set the number of layers to store in VRAM (-1 to use the default)

gpuSplitMode

Set how to split the model across GPUs

mainGpu

Set the main GPU that is used for scratch and small tensors.

tensorSplit

Set how split tensors should be distributed across GPUs

grpAttnN

Set the group-attention factor

grpAttnW

Set the group-attention width

ropeFreqBase

Set the RoPE base frequency, used by NTK-aware scaling

ropeFreqScale

Set the RoPE frequency scaling factor, expands context by a factor of 1/N

yarnExtFactor

Set the YaRN extrapolation mix factor

yarnAttnFactor

Set the YaRN scale sqrt(t) or attention magnitude

yarnBetaFast

Set the YaRN low correction dim or beta

yarnBetaSlow

Set the YaRN high correction dim or alpha

yarnOrigCtx

Set the YaRN original context size of the model

defragmentationThreshold

Set the KV cache defragmentation threshold

numaStrategy

Set optimization strategies that help on some NUMA systems (if available)

ropeScalingType

Set the RoPE frequency scaling method, defaults to linear unless specified by the model

poolingType

Set the pooling type for embeddings; the model default is used if unspecified

flashAttention

Whether to enable Flash Attention

useMmap

Whether to memory-map the model (faster load, but may increase pageouts if not using mlock)

useMlock

Whether to force the system to keep the model in RAM rather than swapping or compressing it

noKvOffload

Whether to disable KV offload
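Many of these options can be combined on a single instance. The following sketch shows an illustrative CPU-oriented configuration (the values are placeholders, not recommendations):

>>> embeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setNThreads(8) \
...     .setNBatch(512) \
...     .setUseMlock(True)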

Notes

To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
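A sketch of such an adjustment (the values are illustrative; suitable numbers depend on the model size and available VRAM):

>>> embeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setNCtx(2048) \
...     .setNGpuLayers(99)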

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setBatchSize(4) \
...     .setNGpuLayers(99) \
...     .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("embeddings.embeddings").show(truncate=False)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
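Each annotation's embeddings field holds the raw float vector. A follow-up sketch using standard PySpark (assuming the result DataFrame from above) flattens the output to one vector per row:

>>> from pyspark.sql import functions as F
>>> vectors = result.select(F.explode("embeddings.embeddings").alias("vector"))
>>> vectors.show(1)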
setNThreads(nThreads: int)[source]#

Set the number of threads to use during generation

setNThreadsBatch(nThreadsBatch: int)[source]#

Set the number of threads to use during batch and prompt processing

setNCtx(nCtx: int)[source]#

Set the size of the prompt context

setNBatch(nBatch: int)[source]#

Set the logical batch size for prompt processing (must be >=32 to use BLAS)

setNUbatch(nUbatch: int)[source]#

Set the physical batch size for prompt processing (must be >=32 to use BLAS)

setNChunks(nChunks: int)[source]#

Set the maximum number of chunks to process

setNSequences(nSequences: int)[source]#

Set the number of sequences to decode

setNGpuLayers(nGpuLayers: int)[source]#

Set the number of layers to store in VRAM (-1 to use the default)

setGpuSplitMode(gpuSplitMode: str)[source]#

Set how to split the model across GPUs

setMainGpu(mainGpu: int)[source]#

Set the main GPU that is used for scratch and small tensors.

setTensorSplit(tensorSplit: List[float])[source]#

Set how split tensors should be distributed across GPUs

setGrpAttnN(grpAttnN: int)[source]#

Set the group-attention factor

setGrpAttnW(grpAttnW: int)[source]#

Set the group-attention width

setRopeFreqBase(ropeFreqBase: float)[source]#

Set the RoPE base frequency, used by NTK-aware scaling

setRopeFreqScale(ropeFreqScale: float)[source]#

Set the RoPE frequency scaling factor, expands context by a factor of 1/N

setYarnExtFactor(yarnExtFactor: float)[source]#

Set the YaRN extrapolation mix factor

setYarnAttnFactor(yarnAttnFactor: float)[source]#

Set the YaRN scale sqrt(t) or attention magnitude

setYarnBetaFast(yarnBetaFast: float)[source]#

Set the YaRN low correction dim or beta

setYarnBetaSlow(yarnBetaSlow: float)[source]#

Set the YaRN high correction dim or alpha

setYarnOrigCtx(yarnOrigCtx: int)[source]#

Set the YaRN original context size of the model

setDefragmentationThreshold(defragmentationThreshold: float)[source]#

Set the KV cache defragmentation threshold

setNumaStrategy(numaStrategy: str)[source]#

Set optimization strategies that help on some NUMA systems (if available)

setRopeScalingType(ropeScalingType: str)[source]#

Set the RoPE frequency scaling method, defaults to linear unless specified by the model

setPoolingType(poolingType: str)[source]#

Set the pooling type for embeddings; the model default is used if unspecified

setFlashAttention(flashAttention: bool)[source]#

Whether to enable Flash Attention

setUseMmap(useMmap: bool)[source]#

Whether to memory-map the model (faster load, but may increase pageouts if not using mlock)

setUseMlock(useMlock: bool)[source]#

Whether to force the system to keep the model in RAM rather than swapping or compressing it

setNoKvOffload(noKvOffload: bool)[source]#

Whether to disable KV offload

getMetadata()[source]#

Gets the metadata of the model
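A minimal sketch of inspecting a loaded model (the exact metadata contents depend on the GGUF file):

>>> embeddings = AutoGGUFEmbeddings.pretrained()
>>> print(embeddings.getMetadata())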

static loadSavedModel(folder, spark_session)[source]#

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

The current SparkSession

Returns:
AutoGGUFEmbeddings

The restored model
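A minimal usage sketch (the path is hypothetical; spark is an active SparkSession):

>>> embeddings = AutoGGUFEmbeddings.loadSavedModel("/models/my_gguf_model", spark) \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")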

static pretrained(name='nomic-embed-text-v1.5.Q8_0.gguf', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "nomic-embed-text-v1.5.Q8_0.gguf"

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
AutoGGUFEmbeddings

The restored model
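For example, requesting the documented default explicitly (a sketch; any other name must be available on the Models Hub):

>>> embeddings = AutoGGUFEmbeddings.pretrained("nomic-embed-text-v1.5.Q8_0.gguf", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")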