sparknlp.annotator.embeddings.auto_gguf_embeddings
Contains classes for the AutoGGUFEmbeddings.
Module Contents
Classes
AutoGGUFEmbeddings: Annotator that uses the llama.cpp library to generate text embeddings with large language models.
- class AutoGGUFEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.AutoGGUFEmbeddings', java_model=None)[source]
Annotator that uses the llama.cpp library to generate text embeddings with large language models.
The type of embedding pooling can be set with the setPoolingType method. The default is “MEAN”. The available options are “NONE”, “MEAN”, “CLS”, and “LAST”.
Pretrained models can be loaded with pretrained() of the companion object:

>>> auto_gguf_model = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")

The default model is “nomic-embed-text-v1.5.Q8_0.gguf”, if no name is provided.
For extended examples of usage, see the AutoGGUFEmbeddingsTest and the example notebook.
For available pretrained models please see the Models Hub.
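For example, a short sketch of selecting a non-default pooling strategy on a pretrained model:

>>> embeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setPoolingType("CLS")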
Input Annotation types: DOCUMENT
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- nThreads
Set the number of threads to use during generation
- nThreadsBatch
Set the number of threads to use during batch and prompt processing
- nCtx
Set the size of the prompt context
- nBatch
Set the logical batch size for prompt processing (must be >=32 to use BLAS)
- nUbatch
Set the physical batch size for prompt processing (must be >=32 to use BLAS)
- nChunks
Set the maximum number of chunks to process
- nSequences
Set the number of sequences to decode
- nGpuLayers
Set the number of layers to store in VRAM (-1 to use the default)
- gpuSplitMode
Set how to split the model across GPUs
- mainGpu
Set the main GPU that is used for scratch and small tensors.
- tensorSplit
Set how split tensors should be distributed across GPUs
- grpAttnN
Set the group-attention factor
- grpAttnW
Set the group-attention width
- ropeFreqBase
Set the RoPE base frequency, used by NTK-aware scaling
- ropeFreqScale
Set the RoPE frequency scaling factor, which expands the context by a factor of 1/N
- yarnExtFactor
Set the YaRN extrapolation mix factor
- yarnAttnFactor
Set the YaRN scale sqrt(t) or attention magnitude
- yarnBetaFast
Set the YaRN low correction dim or beta
- yarnBetaSlow
Set the YaRN high correction dim or alpha
- yarnOrigCtx
Set the YaRN original context size of the model
- defragmentationThreshold
Set the KV cache defragmentation threshold
- numaStrategy
Set optimization strategies that help on some NUMA systems (if available)
- ropeScalingType
Set the RoPE frequency scaling method; defaults to linear unless specified by the model
- poolingType
Set the pooling type for embeddings; uses the model default if unspecified
- flashAttention
Whether to enable Flash Attention
- useMmap
Whether to memory-map the model (faster load, but may increase pageouts if mlock is not used)
- useMlock
Whether to force the system to keep the model in RAM rather than swapping or compressing it
- noKvOffload
Whether to disable KV offload
Notes
To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method.
When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
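For example, a minimal GPU-oriented configuration sketch; the values are illustrative and should be tuned to your hardware:

>>> gpu_embeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setNCtx(4096) \
...     .setNGpuLayers(99)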
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setBatchSize(4) \
...     .setNGpuLayers(99) \
...     .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("embeddings.embeddings").show(truncate=False)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
- setNThreadsBatch(nThreadsBatch: int)[source]
Set the number of threads to use during batch and prompt processing
- setNBatch(nBatch: int)[source]
Set the logical batch size for prompt processing (must be >=32 to use BLAS)
- setNUbatch(nUbatch: int)[source]
Set the physical batch size for prompt processing (must be >=32 to use BLAS)
- setNGpuLayers(nGpuLayers: int)[source]
Set the number of layers to store in VRAM (-1 to use the default)
- setTensorSplit(tensorSplit: List[float])[source]
Set how split tensors should be distributed across GPUs
- setRopeFreqBase(ropeFreqBase: float)[source]
Set the RoPE base frequency, used by NTK-aware scaling
- setRopeFreqScale(ropeFreqScale: float)[source]
Set the RoPE frequency scaling factor, which expands the context by a factor of 1/N
- setDefragmentationThreshold(defragmentationThreshold: float)[source]
Set the KV cache defragmentation threshold
- setNumaStrategy(numaStrategy: str)[source]
Set optimization strategies that help on some NUMA systems (if available)
- setRopeScalingType(ropeScalingType: str)[source]
Set the RoPE frequency scaling method; defaults to linear unless specified by the model
- setPoolingType(poolingType: str)[source]
Set the pooling type for embeddings; uses the model default if unspecified
- setUseMmap(useMmap: bool)[source]
Whether to memory-map the model (faster load, but may increase pageouts if mlock is not used)
- setUseMlock(useMlock: bool)[source]
Whether to force the system to keep the model in RAM rather than swapping or compressing it
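For example, a sketch of pinning the model in memory on a host where swapping is a concern:

>>> pinned_embeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setUseMmap(True) \
...     .setUseMlock(True)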
- static loadSavedModel(folder, spark_session)[source]
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- Returns:
- AutoGGUFEmbeddings
The restored model
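A minimal usage sketch, where the folder path is a placeholder for a locally saved GGUF model:

>>> local_embeddings = AutoGGUFEmbeddings.loadSavedModel("/path/to/saved/model", spark) \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")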
- static pretrained(name='nomic-embed-text-v1.5.Q8_0.gguf', lang='en', remote_loc=None)[source]
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default “nomic-embed-text-v1.5.Q8_0.gguf”
- lang : str, optional
Language of the pretrained model, by default “en”
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- AutoGGUFEmbeddings
The restored model
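For example, requesting the default model explicitly by name and language:

>>> embeddings = AutoGGUFEmbeddings.pretrained("nomic-embed-text-v1.5.Q8_0.gguf", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings")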