sparknlp.annotator.seq2seq.cohere_transformer
Contains classes for the CoHereTransformer.
Module Contents
Classes
CoHereTransformer | Cohere: Command-R Transformer
- class CoHereTransformer(classname='com.johnsnowlabs.nlp.annotators.seq2seq.CoHereTransformer', java_model=None)
Cohere: Command-R Transformer
C4AI Command-R is a research release of a highly performant generative model with 35 billion parameters. Command-R is a large language model with open weights, optimized for a variety of use cases including reasoning, summarization, and question answering. It supports multilingual generation, evaluated in 10 languages, and has highly performant RAG capabilities.
Pretrained models can be loaded with pretrained() of the companion object:
>>> CoHere = CoHereTransformer.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("generation")
The default model is "c4ai_command_r_v01_int4" if no name is provided. For available pretrained models, please see the Models Hub.
Input Annotation types | Output Annotation type
DOCUMENT | DOCUMENT
- Parameters:
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- minOutputLength
Minimum length of the sequence to be generated, by default 0
- maxOutputLength
Maximum length of output text, by default 60
- doSample
Whether or not to use sampling; use greedy decoding otherwise, by default False
- temperature
The value used to modulate the next token probabilities, by default 1.0
- topK
The number of highest-probability vocabulary tokens to keep for top-k filtering, by default 40
- topP
Top cumulative probability for vocabulary tokens, by default 1.0
If set to float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.
- repetitionPenalty
The parameter for repetition penalty, by default 1.0 (1.0 means no penalty)
- noRepeatNgramSize
If set to int > 0, all ngrams of that size can only occur once, by default 0
- ignoreTokenIds
A list of token ids which are ignored in the decoder’s output, by default []
Notes
This is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as GPU is recommended.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("documents")
>>> CoHere = CoHereTransformer.pretrained("c4ai_command_r_v01_int4","en") \
...     .setInputCols(["documents"]) \
...     .setMaxOutputLength(60) \
...     .setOutputCol("generation")
>>> pipeline = Pipeline().setStages([documentAssembler, CoHere])
>>> data = spark.createDataFrame([
...     (
...         1,
...         "<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
...     )
... ]).toDF("id", "text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("generation.result").show(truncate=False)
+------------------------------------------------+
|result |
+------------------------------------------------+
|[Hello! I'm doing well, thank you for asking! I'm excited to help you with whatever questions you have today. How can I assist you?]|
+------------------------------------------------+
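The input text in the example above wraps the user message in Command-R turn markers. The following is a minimal sketch of a helper that builds such a single-turn prompt. The function name `build_single_turn_prompt` is hypothetical and not part of Spark NLP; the token strings are copied from the example itself.

```python
# Special tokens copied verbatim from the example prompt.
BOS = "<BOS_TOKEN>"
START_TURN = "<|START_OF_TURN_TOKEN|>"
END_TURN = "<|END_OF_TURN_TOKEN|>"
USER = "<|USER_TOKEN|>"
CHATBOT = "<|CHATBOT_TOKEN|>"


def build_single_turn_prompt(user_message: str) -> str:
    """Hypothetical helper: wrap one user message in the turn markers
    used by the example, leaving the chatbot turn open for generation."""
    return f"{BOS}{START_TURN}{USER}{user_message}{END_TURN}{START_TURN}{CHATBOT}"


prompt = build_single_turn_prompt("Hello, how are you?")
print(prompt)
```

The resulting string can be placed in the "text" column that the DocumentAssembler reads.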
- setIgnoreTokenIds(value)
Sets a list of token ids which are ignored in the decoder’s output.
- Parameters:
- value : List[int]
The token ids to be filtered out
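As an illustration of what ignoring token ids can mean during decoding, the sketch below masks the listed ids so they can never be selected. This is a generic technique written in plain Python; it is an assumption for illustration, not a description of how Spark NLP implements the parameter, and `mask_ignored_tokens` is a hypothetical name.

```python
import math


def mask_ignored_tokens(logits, ignore_token_ids):
    """Return a copy of logits where ignored token ids are set to -inf,
    so neither greedy decoding nor sampling can pick them."""
    ignored = set(ignore_token_ids)
    return [-math.inf if i in ignored else x for i, x in enumerate(logits)]


logits = [1.5, 0.2, 3.0, -0.7]           # toy scores for a 4-token vocabulary
masked = mask_ignored_tokens(logits, [2])  # token id 2 would otherwise win
best = max(range(len(masked)), key=masked.__getitem__)
print(masked, best)
```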
- setConfigProtoBytes(b)
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
- setMinOutputLength(value)
Sets minimum length of the sequence to be generated.
- Parameters:
- value : int
Minimum length of the sequence to be generated
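A minimum output length is commonly enforced by preventing the end-of-sequence token from being chosen until enough tokens have been generated. The sketch below shows that idea in plain Python. It is an assumption about the general technique, not Spark NLP's actual implementation, and the function name and `eos_token_id` argument are hypothetical.

```python
import math


def suppress_eos_before_min_length(logits, eos_token_id, generated_len, min_output_length):
    """Block the end-of-sequence token while the output is shorter
    than min_output_length; otherwise leave the logits unchanged."""
    out = list(logits)
    if generated_len < min_output_length:
        out[eos_token_id] = -math.inf
    return out


logits = [0.3, 2.5, 0.1]  # toy vocabulary; id 1 plays the role of EOS
too_short = suppress_eos_before_min_length(logits, 1, generated_len=2, min_output_length=5)
long_enough = suppress_eos_before_min_length(logits, 1, generated_len=5, min_output_length=5)
print(too_short, long_enough)
```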
- setMaxOutputLength(value)
Sets maximum length of output text.
- Parameters:
- value : int
Maximum length of output text
- setDoSample(value)
Sets whether or not to use sampling; greedy decoding is used otherwise.
- Parameters:
- value : bool
Whether or not to use sampling; use greedy decoding otherwise
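The difference between greedy decoding and sampling can be sketched in a few lines of plain Python. This is a generic illustration under the assumption of a toy three-token vocabulary, not Spark NLP code; `pick_next_token` is a hypothetical name.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample, rng):
    """Greedy decoding takes the argmax; sampling draws from the softmax distribution."""
    if not do_sample:
        return max(range(len(logits)), key=logits.__getitem__)
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


logits = [0.1, 2.0, 1.9]
rng = random.Random(0)  # seeded so the run is reproducible
greedy = pick_next_token(logits, False, rng)
sampled = [pick_next_token(logits, True, rng) for _ in range(200)]
print(greedy, sorted(set(sampled)))
```

Greedy decoding returns the same token every time for the same scores, whereas sampling can return different tokens across draws.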
- setTemperature(value)
Sets the value used to modulate the next token probabilities.
- Parameters:
- value : float
The value used to modulate the next token probabilities
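Temperature is typically applied by dividing the logits by the temperature before the softmax. The sketch below illustrates that standard formulation on toy values; it is not taken from Spark NLP's source, and `apply_temperature` is a hypothetical name.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize with softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return softmax([x / temperature for x in logits])


logits = [1.0, 2.0, 3.0]
base = apply_temperature(logits, 1.0)   # unchanged distribution
sharp = apply_temperature(logits, 0.5)  # lower temperature: more peaked
flat = apply_temperature(logits, 2.0)   # higher temperature: more uniform
print(base, sharp, flat)
```

Values below 1.0 concentrate probability on the highest-scoring token, and values above 1.0 spread it more evenly.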
- setTopK(value)
Sets the number of highest-probability vocabulary tokens to keep for top-k filtering.
- Parameters:
- value : int
Number of highest-probability vocabulary tokens to keep
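Top-k filtering keeps only the k highest-scoring tokens and removes the rest from consideration. A minimal plain-Python sketch of the general technique follows; it is an illustration, not Spark NLP's implementation, and `top_k_filter` is a hypothetical name. Note that in this sketch ties at the k-th score are all kept.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits; set everything else to -inf.
    A non-positive k, or k >= vocabulary size, leaves the logits unchanged."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    kth_largest = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth_largest else -math.inf for x in logits]


logits = [2.0, 0.5, 1.0, 3.0]
filtered = top_k_filter(logits, 2)
print(filtered)
```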
- setTopP(value)
Sets the top cumulative probability for vocabulary tokens.
If set to float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.
- Parameters:
- value : float
Cumulative probability for vocabulary tokens
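The rule described above (keep the most probable tokens until their probabilities add up to topP or higher) is often called nucleus sampling. The sketch below shows the selection step in plain Python on toy probabilities; it is an illustration of the general technique rather than Spark NLP's code, and `top_p_filter` is a hypothetical name.

```python
def top_p_filter(probs, top_p):
    """Return {token_id: renormalized probability} for the smallest set of
    most probable tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)  # needs tokens 0, 1 and 2 to reach 0.9
print(nucleus)
```

With topP set to 1.0 the whole vocabulary is kept, which matches the default of no filtering.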
- setRepetitionPenalty(value)
Sets the parameter for repetition penalty. 1.0 means no penalty.
- Parameters:
- value : float
The repetition penalty
References
See CTRL: A Conditional Transformer Language Model for Controllable Generation for more details.
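A widely used variant of the CTRL-style penalty lowers the score of every token that has already been generated: positive logits are divided by the penalty and negative logits are multiplied by it. The sketch below assumes that variant and is not confirmed to match Spark NLP's internals; `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Discourage tokens that already appear in generated_ids.
    With penalty == 1.0 the logits are returned unchanged."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] = out[token_id] / penalty
        else:
            out[token_id] = out[token_id] * penalty
    return out


logits = [2.0, -1.0, 0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0)
print(penalized)
```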
- setNoRepeatNgramSize(value)
Sets size of n-grams that can only occur once.
If set to int > 0, all ngrams of that size can only occur once.
- Parameters:
- value : int
Size of the n-grams that can only occur once
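One way to guarantee that no n-gram of a given size occurs twice is to compute, at each step, the tokens that would complete an n-gram already present in the output and then exclude them. The sketch below shows that bookkeeping in plain Python; it is an illustration of the general technique, not Spark NLP's implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(generated_ids, ngram_size):
    """Return the set of token ids that would repeat an existing n-gram
    if appended to generated_ids."""
    if ngram_size <= 0 or len(generated_ids) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = [5, 6, 7, 5, 6]
print(banned_next_tokens(history, 3))  # appending 7 would repeat (5, 6, 7)
```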
- setBeamSize(value)
Sets the number of beams to use for beam search.
- Parameters:
- value : int
The number of beams to use for beam search
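Beam search keeps the best few partial sequences at every step instead of only the single best one. The toy sketch below illustrates why a larger beam can find a more probable sequence than greedy decoding. The probability table and all names are hypothetical, and the code is a generic illustration rather than Spark NLP's decoder.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TABLE = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}


def toy_step(prefix):
    """Return log-probabilities of each possible next token for a prefix."""
    return {tok: math.log(p) for tok, p in TABLE[prefix].items()}


def beam_search(step_log_probs, beam_size, length):
    """Keep the beam_size highest-scoring prefixes at each of `length` steps."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


greedy_like = beam_search(toy_step, beam_size=1, length=2)
wider = beam_search(toy_step, beam_size=2, length=2)
print(greedy_like[0], wider[0])
```

With a beam size of 1 the search commits to "A" first and ends with total probability 0.3, while a beam size of 2 also explores "B" and finds the sequence ("B", "x") with probability 0.36.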
- setStopTokenIds(value)
Sets a list of token ids which are considered as stop tokens in the decoder’s output.
- Parameters:
- value : List[int]
The token ids to be considered as stop tokens
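Stop tokens usually end generation as soon as one of them is produced, with the maximum output length acting as a fallback limit. The sketch below shows such a loop with a toy stand-in for the model. It assumes the stop token itself is not appended to the output, which may differ from Spark NLP's behavior; `generate_until_stop` and `toy_next_token` are hypothetical names.

```python
def generate_until_stop(next_token_fn, prompt_ids, stop_token_ids, max_output_length):
    """Generate token ids until a stop token appears or the length limit is hit."""
    stop = set(stop_token_ids)
    context = list(prompt_ids)
    output = []
    while len(output) < max_output_length:
        token = next_token_fn(context)
        if token in stop:
            break  # assumption: the stop token is dropped from the output
        output.append(token)
        context.append(token)
    return output


def toy_next_token(context):
    """Toy stand-in for a model: always emits the previous token id plus one."""
    return context[-1] + 1


stopped = generate_until_stop(toy_next_token, [10], stop_token_ids=[14], max_output_length=60)
capped = generate_until_stop(toy_next_token, [10], stop_token_ids=[14], max_output_length=2)
print(stopped, capped)
```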
- static loadSavedModel(folder, spark_session, use_openvino=False)
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- use_openvino : bool, optional
Whether to load the model with the OpenVINO backend, by default False
- Returns:
- CoHereTransformer
The restored model
- static pretrained(name='c4ai_command_r_v01_int4', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default “c4ai_command_r_v01_int4”
- lang : str, optional
Language of the pretrained model, by default “en”
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- CoHereTransformer
The restored model