com.johnsnowlabs.nlp

HasLlamaCppModelProperties

trait HasLlamaCppModelProperties extends AnyRef

Contains settable model parameters for the AutoGGUFModel.

Self Type
HasLlamaCppModelProperties with ParamsAndFeaturesWritable with HasProtectedParams
Linear Supertypes
AnyRef, Any
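
For example, a minimal configuration sketch (assuming Spark NLP with the AutoGGUFModel annotator on the classpath; the pretrained default model and all parameter values are illustrative):

  import com.johnsnowlabs.nlp.base.DocumentAssembler
  import com.johnsnowlabs.nlp.annotators.seq2seq.AutoGGUFModel
  import org.apache.spark.ml.Pipeline

  val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

  // Parameters from this trait control how the GGUF model is loaded and run
  val autoGGUFModel = AutoGGUFModel
    .pretrained()
    .setInputCols("document")
    .setOutputCol("completions")
    .setNCtx(4096)     // size of the prompt context
    .setNBatch(512)    // logical batch size for prompt processing
    .setNGpuLayers(99) // offload as many layers as possible to VRAM

  val pipeline = new Pipeline().setStages(Array(documentAssembler, autoGGUFModel))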
Known Subclasses
AutoGGUFModel

Parameter setters

  1. def setChatTemplate(chatTemplate: String): HasLlamaCppModelProperties.this

    The chat template to use

  2. def setDefragmentationThreshold(defragThold: Float): HasLlamaCppModelProperties.this

    Set the KV cache defragmentation threshold

  3. def setEmbedding(embedding: Boolean): HasLlamaCppModelProperties.this

    Whether to load model with embedding support

  4. def setFlashAttention(flashAttention: Boolean): HasLlamaCppModelProperties.this

    Whether to enable Flash Attention

  5. def setGpuSplitMode(splitMode: String): HasLlamaCppModelProperties.this

    Set how to split the model across GPUs (see the GPU placement sketch after this list)

    • NONE: No GPU split
    • LAYER: Split the model across GPUs by layer
    • ROW: Split the model across GPUs by rows
  6. def setGrpAttnN(grpAttnN: Int): HasLlamaCppModelProperties.this

    Set the group-attention factor

  7. def setGrpAttnW(grpAttnW: Int): HasLlamaCppModelProperties.this

    Set the group-attention width

  8. def setInputPrefixBos(inputPrefixBos: Boolean): HasLlamaCppModelProperties.this

    Whether to add prefix BOS to user inputs, preceding the --in-prefix string

  9. def setLookupCacheDynamicFilePath(lookupCacheDynamicFilePath: String): HasLlamaCppModelProperties.this

    Set path to dynamic lookup cache to use for lookup decoding (updated by generation)

  10. def setLookupCacheStaticFilePath(lookupCacheStaticFilePath: String): HasLlamaCppModelProperties.this

    Set path to static lookup cache to use for lookup decoding (not updated by generation)

  11. def setLoraAdapters(loraAdapters: HashMap[String, Double]): HasLlamaCppModelProperties.this

    Sets paths to LoRA adapters with user-defined scale. (PySpark override)

  12. def setLoraAdapters(loraAdapters: Map[String, Float]): HasLlamaCppModelProperties.this

    Sets paths to LoRA adapters with user-defined scale (see the LoRA sketch after this list).

  13. def setMainGpu(mainGpu: Int): HasLlamaCppModelProperties.this

    Set the GPU that is used for scratch and small tensors

  14. def setMetadata(metadata: String): HasLlamaCppModelProperties.this

    Set the metadata for the model

  15. def setModelDraft(modelDraft: String): HasLlamaCppModelProperties.this

    Set the draft model for speculative decoding (see the speculative decoding sketch after this list)

  16. def setNBatch(nBatch: Int): HasLlamaCppModelProperties.this

    Set the logical batch size for prompt processing (must be >=32 to use BLAS)

  17. def setNChunks(nChunks: Int): HasLlamaCppModelProperties.this

    Set the maximal number of chunks to process

  18. def setNCtx(nCtx: Int): HasLlamaCppModelProperties.this

    Set the size of the prompt context

  19. def setNDraft(nDraft: Int): HasLlamaCppModelProperties.this

    Set the number of tokens to draft for speculative decoding

  20. def setNGpuLayers(nGpuLayers: Int): HasLlamaCppModelProperties.this

    Set the number of layers to store in VRAM (-1: use the default)

  21. def setNGpuLayersDraft(nGpuLayersDraft: Int): HasLlamaCppModelProperties.this

    Set the number of layers to store in VRAM for the draft model (-1: use the default)

  22. def setNSequences(nSequences: Int): HasLlamaCppModelProperties.this

    Set the number of sequences to decode

  23. def setNThreads(nThreads: Int): HasLlamaCppModelProperties.this

    Set the number of threads to use during generation

  24. def setNThreadsBatch(nThreadsBatch: Int): HasLlamaCppModelProperties.this

    Set the number of threads to use during batch and prompt processing

  25. def setNThreadsBatchDraft(nThreadsBatchDraft: Int): HasLlamaCppModelProperties.this

    Set the number of threads to use during batch and prompt processing for the draft model

  26. def setNThreadsDraft(nThreadsDraft: Int): HasLlamaCppModelProperties.this

    Set the number of threads to use during draft generation

  27. def setNUbatch(nUbatch: Int): HasLlamaCppModelProperties.this

    Set the physical batch size for prompt processing (must be >=32 to use BLAS)

  28. def setNoKvOffload(noKvOffload: Boolean): HasLlamaCppModelProperties.this

    Whether to disable KV offload

  29. def setNumaStrategy(numa: String): HasLlamaCppModelProperties.this

    Set optimization strategies that help on some NUMA systems, if available (see the NUMA sketch after this list).

    Available Strategies:

    • DISABLED: No NUMA optimizations
    • DISTRIBUTE: Spread execution evenly over all nodes
    • ISOLATE: Only spawn threads on CPUs on the node that execution started on
    • NUMA_CTL: Use the CPU map provided by numactl
    • MIRROR: Mirrors the model across NUMA nodes
  30. def setPSplit(pSplit: Float): HasLlamaCppModelProperties.this

    Set the speculative decoding split probability

  31. def setPoolingType(poolingType: String): HasLlamaCppModelProperties.this

    Set the pooling type for embeddings, using the model default if unspecified (see the embeddings sketch after this list).

    • 0 NONE: Don't use any pooling and return token embeddings (if the model supports it)
    • 1 MEAN: Mean pooling
    • 2 CLS: Choose the CLS token
    • 3 LAST: Choose the last token
  32. def setRopeFreqBase(ropeFreqBase: Float): HasLlamaCppModelProperties.this

    Set the RoPE base frequency, used by NTK-aware scaling

  33. def setRopeFreqScale(ropeFreqScale: Float): HasLlamaCppModelProperties.this

    Set the RoPE frequency scaling factor, expands context by a factor of 1/N

  34. def setRopeScalingType(ropeScalingType: String): HasLlamaCppModelProperties.this

    Set the RoPE frequency scaling method, defaults to linear unless specified by the model (see the YaRN sketch after this list).

    • UNSPECIFIED: Don't use any scaling
    • LINEAR: Linear scaling
    • YARN: YaRN RoPE scaling
  35. def setSystemPrompt(systemPrompt: String): HasLlamaCppModelProperties.this

    Set a system prompt to use

  36. def setTensorSplit(tensorSplit: Array[Double]): HasLlamaCppModelProperties.this

    Set how split tensors should be distributed across GPUs

  37. def setUseMlock(useMlock: Boolean): HasLlamaCppModelProperties.this

    Whether to force the system to keep the model in RAM rather than swapping or compressing it

  38. def setUseMmap(useMmap: Boolean): HasLlamaCppModelProperties.this

    Whether to memory-map the model (faster load, but may increase pageouts if not using mlock)

  39. def setYarnAttnFactor(yarnAttnFactor: Float): HasLlamaCppModelProperties.this

    Set the YaRN scale sqrt(t) or attention magnitude

  40. def setYarnBetaFast(yarnBetaFast: Float): HasLlamaCppModelProperties.this

    Set the YaRN low correction dim or beta

  41. def setYarnBetaSlow(yarnBetaSlow: Float): HasLlamaCppModelProperties.this

    Set the YaRN high correction dim or alpha

  42. def setYarnExtFactor(yarnExtFactor: Float): HasLlamaCppModelProperties.this

    Set the YaRN extrapolation mix factor

  43. def setYarnOrigCtx(yarnOrigCtx: Int): HasLlamaCppModelProperties.this

    Set the YaRN original context size of the model
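
Examples

The following sketches assume an AutoGGUFModel instance named model. All values (paths, scales, sizes, thread counts) are illustrative assumptions, not library defaults.

GPU placement on a machine with two GPUs:

  // Split the model across two GPUs by layer, keeping scratch buffers on GPU 0
  model
    .setGpuSplitMode("LAYER")
    .setMainGpu(0)
    .setTensorSplit(Array(0.6, 0.4)) // fraction of the model placed on each GPU
    .setNGpuLayers(99)               // offload all layers to VRAM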
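
LoRA adapters with user-defined scale (the adapter path is hypothetical):

  // Scala overload taking a Map of adapter path -> scale
  model.setLoraAdapters(Map("/path/to/adapter.gguf" -> 0.5f))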
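
Speculative decoding with a smaller draft model (the draft model path is hypothetical):

  // A small draft model proposes tokens that the main model then verifies
  model
    .setModelDraft("/path/to/draft-model.gguf")
    .setNDraft(8)           // number of tokens to draft per step
    .setNGpuLayersDraft(99) // offload the draft model to VRAM as well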
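
NUMA and threading on a multi-socket machine:

  // Spread execution evenly over all NUMA nodes and pin thread counts
  model
    .setNumaStrategy("DISTRIBUTE")
    .setNThreads(16)      // threads used during generation
    .setNThreadsBatch(32) // threads used during batch and prompt processing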
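
Embeddings with mean pooling (assumes the underlying GGUF model supports embeddings):

  // Return one mean-pooled embedding per input instead of generated text
  model
    .setEmbedding(true)
    .setPoolingType("MEAN")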
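
Extending the context window with YaRN RoPE scaling, assuming a model originally trained on a 4096-token context:

  // Scale a 4096-token model up to a 16384-token context via YaRN
  model
    .setRopeScalingType("YARN")
    .setNCtx(16384)
    .setYarnOrigCtx(4096)
    .setRopeFreqScale(0.25f) // expands context by a factor of 1/0.25 = 4x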

Parameter getters

  1. def getChatTemplate: String

  2. def getDefragmentationThreshold: Float

  3. def getEmbedding: Boolean

  4. def getFlashAttention: Boolean

  5. def getGrpAttnW: Int

  6. def getInputPrefixBos: Boolean

  7. def getLookupCacheDynamicFilePath: String

  8. def getLookupCacheStaticFilePath: String

  9. def getLoraAdapters: Map[String, Float]

  10. def getMainGpu: Int

  11. def getMetadata: String

    Get the metadata for the model

  12. def getModelDraft: String

  13. def getNBatch: Int

  14. def getNChunks: Int

  15. def getNCtx: Int

  16. def getNDraft: Int

  17. def getNGpuLayers: Int

  18. def getNGpuLayersDraft: Int

  19. def getNSequences: Int

  20. def getNThreads: Int

  21. def getNThreadsBatch: Int

  22. def getNThreadsBatchDraft: Int

  23. def getNThreadsDraft: Int

  24. def getNUbatch: Int

  25. def getNoKvOffload: Boolean

  26. def getNuma: String

  27. def getPSplit: Float

  28. def getPoolingType: String

  29. def getRopeFreqBase: Float

  30. def getRopeFreqScale: Float

  31. def getRopeScalingType: String

  32. def getSplitMode: String

  33. def getSystemPrompt: String

  34. def getTensorSplit: Array[Double]

  35. def getUseMlock: Boolean

  36. def getUseMmap: Boolean

  37. def getYarnAttnFactor: Float

  38. def getYarnBetaFast: Float

  39. def getYarnBetaSlow: Float

  40. def getYarnExtFactor: Float

  41. def getYarnOrigCtx: Int

Parameters

A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.

  1. val chatTemplate: Param[String]

  2. val defragmentationThreshold: FloatParam

  3. val embedding: BooleanParam

  4. val flashAttention: BooleanParam

  5. val gpuSplitMode: Param[String]

    Set how to split the model across GPUs

    • NONE: No GPU split
    • LAYER: Split the model across GPUs by layer
    • ROW: Split the model across GPUs by rows
  6. val grpAttnN: IntParam

  7. val grpAttnW: IntParam

  8. val inputPrefixBos: BooleanParam

  9. val lookupCacheDynamicFilePath: Param[String]

  10. val lookupCacheStaticFilePath: Param[String]

  11. val loraAdapters: StructFeature[Map[String, Float]]

  12. val mainGpu: IntParam

  13. val modelDraft: Param[String]

  14. val nBatch: IntParam

  15. val nChunks: IntParam

  16. val nCtx: IntParam

  17. val nDraft: IntParam

  18. val nGpuLayers: IntParam

  19. val nGpuLayersDraft: IntParam

  20. val nSequences: IntParam

  21. val nThreads: IntParam

  22. val nThreadsBatch: IntParam

  23. val nThreadsBatchDraft: IntParam

  24. val nThreadsDraft: IntParam

  25. val nUbatch: IntParam

  26. val noKvOffload: BooleanParam

  27. val numaStrategy: Param[String]

    Set optimization strategies that help on some NUMA systems (if available)

    Available Strategies:

    • DISABLED: No NUMA optimizations
    • DISTRIBUTE: Spread execution evenly over all nodes
    • ISOLATE: Only spawn threads on CPUs on the node that execution started on
    • NUMA_CTL: Use the CPU map provided by numactl
    • MIRROR: Mirrors the model across NUMA nodes
  28. val pSplit: FloatParam

  29. val poolingType: Param[String]

    Set the pooling type for embeddings, use model default if unspecified

    • 0 NONE: Don't use any pooling
    • 1 MEAN: Mean Pooling
    • 2 CLS: Choose the CLS token
    • 3 LAST: Choose the last token
  30. val ropeFreqBase: FloatParam

  31. val ropeFreqScale: FloatParam

  32. val ropeScalingType: Param[String]

    Set the RoPE frequency scaling method, defaults to linear unless specified by the model.

    • UNSPECIFIED: Don't use any scaling
    • LINEAR: Linear scaling
    • YARN: YaRN RoPE scaling
  33. val systemPrompt: Param[String]

  34. val tensorSplit: DoubleArrayParam

  35. val useMlock: BooleanParam

  36. val useMmap: BooleanParam

  37. val yarnAttnFactor: FloatParam

  38. val yarnBetaFast: FloatParam

  39. val yarnBetaSlow: FloatParam

  40. val yarnExtFactor: FloatParam

  41. val yarnOrigCtx: IntParam

Ungrouped

  1. def getGrpAttnN: Int
  2. def getMetadataMap: Map[String, String]
  3. val metadata: (HasLlamaCppModelProperties.this)#ProtectedParam[String]