sparknlp.annotator.seq2seq.starcoder_transformer

Contains classes for the StarCoderTransformer.

Module Contents

Classes

StarCoderTransformer

StarCoder2: The Versatile Code Companion.

class StarCoderTransformer(classname='com.johnsnowlabs.nlp.annotators.seq2seq.StarCoderTransformer', java_model=None)

StarCoder2: The Versatile Code Companion.

StarCoder2 is a Transformer model designed specifically for code generation and understanding. Released in 3B, 7B, and 15B parameter sizes, it builds upon its predecessors and is trained on a diverse dataset that covers many programming languages. This extensive training allows StarCoder2 to support a wide array of coding tasks, from code completion to generation.

StarCoder2 was developed to assist developers in writing and understanding code more efficiently, making it a valuable tool for various software development and data science tasks.

Pretrained models can be loaded with pretrained() of the companion object:

>>> starcoder2 = StarCoderTransformer.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("generation")

The default model is "starcoder" if no name is provided. For available pretrained models, please see the Models Hub.

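====================== ======================
Input Annotation types Output Annotation type
====================== ======================
DOCUMENT               DOCUMENT
====================== ======================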

Parameters:
configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

minOutputLength

Minimum length of the sequence to be generated, by default 0

maxOutputLength

Maximum length of output text, by default 20

doSample

Whether or not to use sampling; use greedy decoding otherwise, by default False

temperature

The value used to modulate the next token probabilities, by default 1.0

topK

The number of highest probability vocabulary tokens to keep for top-k-filtering, by default 50

topP

Top cumulative probability for vocabulary tokens, by default 1.0

If set to float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.

repetitionPenalty

The parameter for repetition penalty, where 1.0 means no penalty, by default 1.0

noRepeatNgramSize

If set to int > 0, all ngrams of that size can only occur once, by default 0

ignoreTokenIds

A list of token ids which are ignored in the decoder’s output, by default []

Notes

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

References

Paper Abstract:

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4× larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks.

We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the Software Heritage persistent Identifiers (SWHIDs) of the source code data.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("documents")
>>> starcoder2 = StarCoderTransformer.pretrained("starcoder2") \
...     .setInputCols(["documents"]) \
...     .setMaxOutputLength(50) \
...     .setOutputCol("generation")
>>> pipeline = Pipeline().setStages([documentAssembler, starcoder2])
>>> data = spark.createDataFrame([["def add(a, b):"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("generation.result").show(truncate=False)
+-----------------------------+
|result                       |
+-----------------------------+
|[def add(a, b): return a + b]|
+-----------------------------+

setIgnoreTokenIds(value)

A list of token ids which are ignored in the decoder’s output.

Parameters:
value : List[int]

The token ids to ignore
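The documentation only states that these ids are ignored in the decoder's output. One plausible reading, sketched below in plain Python, is that their scores are masked before a token is chosen so they can never be selected. The helper `suppress_token_ids` and the sample scores are hypothetical and do not describe the annotator's internals.

```python
def suppress_token_ids(logits, ignore_token_ids):
    """Mask the listed token ids so that they can never be selected."""
    blocked = set(ignore_token_ids)
    # A score of negative infinity gives the token zero probability after softmax.
    return [float("-inf") if i in blocked else x for i, x in enumerate(logits)]


# Hypothetical scores for a five-token vocabulary; ids 1 and 3 are ignored.
masked = suppress_token_ids([0.4, 2.5, 1.1, 3.0, 0.2], ignore_token_ids=[1, 3])
best = max(range(len(masked)), key=lambda i: masked[i])
```

After masking, the best remaining token is id 2, even though ids 1 and 3 originally scored higher.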

setConfigProtoBytes(b)

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setMinOutputLength(value)

Sets minimum length of the sequence to be generated.

Parameters:
value : int

Minimum length of the sequence to be generated

setMaxOutputLength(value)

Sets maximum length of output text.

Parameters:
value : int

Maximum length of output text
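How the two length bounds interact can be illustrated with a toy decoding loop. This is a minimal sketch and not the annotator's implementation: `generate`, `step_fn`, and `eos_id` are hypothetical names, and the sketch assumes both limits count only newly generated tokens.

```python
def generate(step_fn, eos_id, min_length=0, max_length=20):
    """Run a toy decoding loop bounded by min_length and max_length."""
    output = []
    while len(output) < max_length:
        # step_fn returns candidate token ids ordered from most to least likely.
        candidates = step_fn(output)
        if len(output) < min_length:
            # End-of-sequence is not allowed yet, so skip it.
            # Assumes at least one non-EOS candidate is always available.
            candidates = [t for t in candidates if t != eos_id]
        token = candidates[0]
        if token == eos_id:
            break
        output.append(token)
    return output


def eager_to_stop(prefix):
    """Hypothetical model that always prefers end-of-sequence (id 0) over token 1."""
    return [0, 1]


def never_stops(prefix):
    """Hypothetical model that always prefers token 1 over end-of-sequence."""
    return [1, 0]


forced = generate(eager_to_stop, eos_id=0, min_length=3, max_length=5)
capped = generate(never_stops, eos_id=0, min_length=0, max_length=4)
```

The first call is forced to emit three tokens before it may stop; the second is cut off at four tokens because end-of-sequence is never chosen.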

setDoSample(value)

Sets whether or not to use sampling; use greedy decoding otherwise.

Parameters:
value : bool

Whether or not to use sampling; use greedy decoding otherwise
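The difference between the two modes can be sketched in plain Python. The helper `pick_next_token` and the probabilities are hypothetical; the annotator's own decoding code may differ.

```python
import random


def pick_next_token(probs, do_sample=False, rng=None):
    """Choose a token index greedily or by sampling from the distribution."""
    if not do_sample:
        # Greedy decoding: always take the most probable token.
        return max(range(len(probs)), key=lambda i: probs[i])
    rng = rng or random.Random()
    # Sampling: draw an index with probability proportional to its weight.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]  # hypothetical next-token probabilities
greedy_choice = pick_next_token(probs)
sampled_choice = pick_next_token(probs, do_sample=True, rng=random.Random(0))
```

Greedy decoding is deterministic for a given input, while sampling can return any token with non-zero probability, which is why temperature, topK, and topP mainly matter when sampling is enabled.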

setTemperature(value)

Sets the value used to modulate the next token probabilities.

Parameters:
value : float

The value used to modulate the next token probabilities
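Temperature is commonly applied by dividing the raw scores by the temperature before the softmax. The following is a minimal sketch of that idea with hypothetical scores, assuming the annotator follows the usual formulation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Values below 1.0 concentrate probability on the top token, while values above 1.0 flatten the distribution.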

setTopK(value)

Sets the number of highest probability vocabulary tokens to keep for top-k-filtering.

Parameters:
value : int

Number of highest probability vocabulary tokens to keep
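Top-k filtering can be sketched as keeping the k best scores and masking everything else. The helper `top_k_filter` and its input are hypothetical and only illustrate the general technique.

```python
def top_k_filter(logits, k):
    """Keep the k highest logits and set the rest to negative infinity."""
    if k <= 0 or k >= len(logits):
        # Nothing to filter when k is disabled or covers the whole vocabulary.
        return list(logits)
    # Indices of the k largest scores; ties are broken by position.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(ranked[:k])
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]


filtered = top_k_filter([1.5, 0.2, 3.0, -1.0, 2.1], k=2)
```

Only the two highest-scoring tokens (ids 2 and 4) remain eligible after filtering.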

setTopP(value)

Sets the top cumulative probability for vocabulary tokens.

If set to float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.

Parameters:
value : float

Cumulative probability for vocabulary tokens
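The rule above (often called nucleus sampling) can be sketched as follows. The helper `top_p_filter` is hypothetical, works on probabilities rather than raw scores, and is not taken from the annotator's source.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalise the surviving probabilities so they sum to 1 again.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
```

With `top_p=0.9`, the first three tokens are kept because the top two alone only reach 0.8; the least likely token is dropped.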

setRepetitionPenalty(value)

Sets the parameter for repetition penalty. 1.0 means no penalty.

Parameters:
value : float

The repetition penalty

References

See CTRL: A Conditional Transformer Language Model for Controllable Generation for more details.
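A minimal sketch of a repetition penalty in the spirit of the CTRL paper is shown below. The sign handling (divide positive scores, multiply negative ones) is a commonly used variant and is an assumption here, not a statement about what the annotator does; `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.0):
    """Discount the scores of tokens that were already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one
        # both lower it when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Tokens 0 and 1 were already generated; token 2 was not.
penalised = apply_repetition_penalty(
    [2.0, -1.0, 0.5], generated_ids=[0, 1, 0], penalty=2.0
)
```

With a penalty of 1.0 the scores are returned unchanged, which matches the documented "no penalty" default.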

setNoRepeatNgramSize(value)

Sets size of n-grams that can only occur once.

If set to int > 0, all ngrams of that size can only occur once.

Parameters:
value : int

Size of the n-grams that can only occur once
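The constraint can be sketched as computing, at each step, which next tokens would complete an n-gram that already appears in the generated sequence. The helper `banned_next_tokens` is hypothetical and only illustrates the idea.

```python
def banned_next_tokens(generated_ids, ngram_size):
    """Return tokens that would complete an n-gram already present in the sequence."""
    if ngram_size <= 0 or len(generated_ids) < ngram_size:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = [5, 7, 9, 5, 7]  # hypothetical token ids generated so far
banned = banned_next_tokens(history, ngram_size=3)
```

Here the sequence ends in `5, 7`, and the 3-gram `5, 7, 9` already occurred, so token 9 is banned for the next step. With a size of 0 nothing is banned, matching the documented default.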

static loadSavedModel(folder, spark_session, use_openvino=False)

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

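The current SparkSession

use_openvino : bool, optional

Whether to load the model with the OpenVINO backend, by default False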

Returns:
StarCoderTransformer

The restored model

static pretrained(name='starcoder', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “starcoder”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None, in which case Spark NLP's repositories are used.

Returns:
StarCoderTransformer

The restored model