sparknlp.annotator.embeddings.auto_gguf_embeddings#

Contains classes for the AutoGGUFEmbeddings.

Module Contents#

Classes#

AutoGGUFEmbeddings

Annotator that uses the llama.cpp library to generate text embeddings with large language models.

class AutoGGUFEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.AutoGGUFEmbeddings', java_model=None)[source]#

Annotator that uses the llama.cpp library to generate text embeddings with large language models.

The type of embedding pooling can be set with the setPoolingType method. The default is “MEAN”. The available options are “NONE”, “MEAN”, “CLS”, and “LAST”.

Pretrained models can be loaded with pretrained() of the companion object:

>>> auto_gguf_model = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")

The default model is "nomic-embed-text-v1.5.Q8_0.gguf", if no name is provided.
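For example, a minimal sketch that loads the default model and switches the pooling type described above (all setters shown are documented on this page):

>>> auto_gguf_model = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setPoolingType("CLS")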

For extended examples of usage, see the AutoGGUFEmbeddingsTest and the example notebook.

For available pretrained models please see the Models Hub.

Input Annotation types: DOCUMENT

Output Annotation type: SENTENCE_EMBEDDINGS

Parameters:
nThreads

Set the number of threads to use during generation

nThreadsBatch

Set the number of threads to use during batch and prompt processing

nCtx

Set the size of the prompt context

nBatch

Set the logical batch size for prompt processing (must be >=32 to use BLAS)

nUbatch

Set the physical batch size for prompt processing (must be >=32 to use BLAS)

nChunks

Set the maximum number of chunks to process

nSequences

Set the number of sequences to decode

nGpuLayers

Set the number of layers to store in VRAM (-1 to use the default)

gpuSplitMode

Set how to split the model across GPUs

mainGpu

Set the main GPU that is used for scratch and small tensors.

tensorSplit

Set how split tensors should be distributed across GPUs

grpAttnN

Set the group-attention factor

grpAttnW

Set the group-attention width

ropeFreqBase

Set the RoPE base frequency, used by NTK-aware scaling

ropeFreqScale

Set the RoPE frequency scaling factor, expands context by a factor of 1/N

yarnExtFactor

Set the YaRN extrapolation mix factor

yarnAttnFactor

Set the YaRN scale sqrt(t) or attention magnitude

yarnBetaFast

Set the YaRN low correction dim or beta

yarnBetaSlow

Set the YaRN high correction dim or alpha

yarnOrigCtx

Set the YaRN original context size of the model

defragmentationThreshold

Set the KV cache defragmentation threshold

numaStrategy

Set optimization strategies that help on some NUMA systems (if available)

ropeScalingType

Set the RoPE frequency scaling method, defaults to linear unless specified by the model

poolingType

Set the pooling type for embeddings; the model default is used if unspecified

flashAttention

Whether to enable Flash Attention

useMmap

Whether to memory-map the model (faster load, but may increase pageouts if not using mlock)

useMlock

Whether to force the system to keep the model in RAM rather than swapping or compressing it

noKvOffload

Whether to disable KV offload
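Many of these options can be combined on a single instance. The following sketch shows an illustrative CPU-oriented configuration (the values are placeholders, not recommendations):

>>> embeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setNThreads(8) \
...     .setNBatch(512) \
...     .setUseMlock(True)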

Notes

To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.

When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
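A sketch of such an adjustment (the values are illustrative; suitable numbers depend on the model size and available VRAM):

>>> embeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setNCtx(2048) \
...     .setNGpuLayers(99)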

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setBatchSize(4) \
...     .setNGpuLayers(99) \
...     .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("embeddings.embeddings").show(truncate=False)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
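Each annotation's embeddings field holds the raw float vector. A follow-up sketch using standard PySpark (assuming the result DataFrame from above) flattens the output to one vector per row:

>>> from pyspark.sql import functions as F
>>> vectors = result.select(F.explode("embeddings.embeddings").alias("vector"))
>>> vectors.show(1)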
setNThreads(nThreads: int)[source]#

Set the number of threads to use during generation

setNThreadsBatch(nThreadsBatch: int)[source]#

Set the number of threads to use during batch and prompt processing

setNCtx(nCtx: int)[source]#

Set the size of the prompt context

setNBatch(nBatch: int)[source]#

Set the logical batch size for prompt processing (must be >=32 to use BLAS)

setNUbatch(nUbatch: int)[source]#

Set the physical batch size for prompt processing (must be >=32 to use BLAS)

setNChunks(nChunks: int)[source]#

Set the maximum number of chunks to process

setNSequences(nSequences: int)[source]#

Set the number of sequences to decode

setNGpuLayers(nGpuLayers: int)[source]#

Set the number of layers to store in VRAM (-1 to use the default)

setGpuSplitMode(gpuSplitMode: str)[source]#

Set how to split the model across GPUs

setMainGpu(mainGpu: int)[source]#

Set the main GPU that is used for scratch and small tensors.

setTensorSplit(tensorSplit: List[float])[source]#

Set how split tensors should be distributed across GPUs

setGrpAttnN(grpAttnN: int)[source]#

Set the group-attention factor

setGrpAttnW(grpAttnW: int)[source]#

Set the group-attention width

setRopeFreqBase(ropeFreqBase: float)[source]#

Set the RoPE base frequency, used by NTK-aware scaling

setRopeFreqScale(ropeFreqScale: float)[source]#

Set the RoPE frequency scaling factor, expands context by a factor of 1/N

setYarnExtFactor(yarnExtFactor: float)[source]#

Set the YaRN extrapolation mix factor

setYarnAttnFactor(yarnAttnFactor: float)[source]#

Set the YaRN scale sqrt(t) or attention magnitude

setYarnBetaFast(yarnBetaFast: float)[source]#

Set the YaRN low correction dim or beta

setYarnBetaSlow(yarnBetaSlow: float)[source]#

Set the YaRN high correction dim or alpha

setYarnOrigCtx(yarnOrigCtx: int)[source]#

Set the YaRN original context size of the model

setDefragmentationThreshold(defragmentationThreshold: float)[source]#

Set the KV cache defragmentation threshold

setNumaStrategy(numaStrategy: str)[source]#

Set optimization strategies that help on some NUMA systems (if available)

setRopeScalingType(ropeScalingType: str)[source]#

Set the RoPE frequency scaling method, defaults to linear unless specified by the model

setPoolingType(poolingType: str)[source]#

Set the pooling type for embeddings; the model default is used if unspecified

setFlashAttention(flashAttention: bool)[source]#

Whether to enable Flash Attention

setUseMmap(useMmap: bool)[source]#

Whether to memory-map the model (faster load, but may increase pageouts if not using mlock)

setUseMlock(useMlock: bool)[source]#

Whether to force the system to keep the model in RAM rather than swapping or compressing it

setNoKvOffload(noKvOffload: bool)[source]#

Whether to disable KV offload

getMetadata()[source]#

Gets the metadata of the model
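A minimal sketch of inspecting a loaded model (the exact metadata contents depend on the GGUF file):

>>> embeddings = AutoGGUFEmbeddings.pretrained()
>>> print(embeddings.getMetadata())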

static loadSavedModel(folder, spark_session)[source]#

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

The current SparkSession

Returns:
AutoGGUFEmbeddings

The restored model
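A minimal usage sketch (the path is hypothetical; spark is an active SparkSession):

>>> embeddings = AutoGGUFEmbeddings.loadSavedModel("/models/my_gguf_model", spark) \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")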

static pretrained(name='nomic-embed-text-v1.5.Q8_0.gguf', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "nomic-embed-text-v1.5.Q8_0.gguf"

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
AutoGGUFEmbeddings

The restored model
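For example, requesting the documented default explicitly (a sketch; any other name must be available on the Models Hub):

>>> embeddings = AutoGGUFEmbeddings.pretrained("nomic-embed-text-v1.5.Q8_0.gguf", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")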