sparknlp.annotator.seq2seq.phi2_transformer
Contains classes for the Phi2Transformer.
Module Contents
Classes
| Phi2Transformer | Phi-2: Textbooks Are All You Need. |
- class Phi2Transformer(classname='com.johnsnowlabs.nlp.annotators.seq2seq.Phi2Transformer', java_model=None)
- Phi-2: Textbooks Are All You Need.
- Phi-2 is a Transformer with 2.7 billion parameters. It was trained on the same data sources as Phi-1.5, augmented with a new data source consisting of various NLP synthetic texts and filtered websites (selected for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showed nearly state-of-the-art performance among models with fewer than 13 billion parameters.
- Phi-2 has not been fine-tuned through reinforcement learning from human feedback. The intention behind this open-source model is to provide the research community with a non-restricted small model for exploring vital safety challenges, such as reducing toxicity, understanding societal biases, and enhancing controllability.
- Pretrained models can be loaded with pretrained() of the companion object:
>>> phi2 = Phi2Transformer.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("generation")
- The default model is "phi2", if no name is provided. For available pretrained models please see the Models Hub.
| Input Annotation types | Output Annotation type |
| DOCUMENT | DOCUMENT |
- Parameters:
- configProtoBytes
- ConfigProto from TensorFlow, serialized into a byte array.
- minOutputLength
- Minimum length of the sequence to be generated, by default 0 
- maxOutputLength
- Maximum length of output text, by default 20 
- doSample
- Whether or not to use sampling; use greedy decoding otherwise, by default False 
- temperature
- The value used to modulate the next token probabilities, by default 1.0 
- topK
- The number of highest probability vocabulary tokens to keep for top-k-filtering, by default 50 
- topP
- Top cumulative probability for vocabulary tokens, by default 1.0. If set to float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.
- repetitionPenalty
- The parameter for repetition penalty, by default 1.0. A value of 1.0 means no penalty.
- noRepeatNgramSize
- If set to int > 0, all n-grams of that size can only occur once, by default 0 
- ignoreTokenIds
- A list of token ids which are ignored in the decoder’s output, by default [] 
 
 - Notes
 - This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
 - Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("documents")
>>> phi2 = Phi2Transformer.pretrained("phi2") \
...     .setInputCols(["documents"]) \
...     .setMaxOutputLength(50) \
...     .setOutputCol("generation")
>>> pipeline = Pipeline().setStages([documentAssembler, phi2])
>>> data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("generation.result").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[My name is Leonardo . I am a student of the University of California, Berkeley. I am interested in the field of Artificial Intelligence and its applications in the real world. I have a strong passion for learning and am always looking for ways to improve my knowledge and skills]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 - setIgnoreTokenIds(value)
- Sets a list of token ids which are ignored in the decoder’s output.
- Parameters:
- value : List[int]
- The token ids to be filtered out 
 
 
 - setConfigProtoBytes(b)
- Sets configProto from TensorFlow, serialized into a byte array.
- Parameters:
- b : List[int]
- ConfigProto from TensorFlow, serialized into a byte array 
 
 
 - setMinOutputLength(value)
- Sets minimum length of the sequence to be generated.
- Parameters:
- value : int
- Minimum length of the sequence to be generated 
 
 
 - setMaxOutputLength(value)
- Sets maximum length of output text.
- Parameters:
- value : int
- Maximum length of output text 
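How minimum and maximum output lengths typically bound a generation loop can be sketched in plain Python. This is an illustration of the general idea only, not Spark NLP's implementation: `generate`, `next_token`, `toy_step`, and the end-of-sequence id `0` are all hypothetical names and values introduced here, and whether the annotator counts prompt tokens toward the limits is not stated in this page.

```python
def generate(next_token, eos_id, min_output_length=0, max_output_length=20):
    """Run a decoding loop bounded by a minimum and a maximum length.

    next_token(tokens, allow_eos) is a hypothetical stand-in for one decoder
    step; it must not return eos_id while allow_eos is False.
    """
    tokens = []
    while len(tokens) < max_output_length:
        # End-of-sequence is only permitted once the minimum length is reached.
        allow_eos = len(tokens) >= min_output_length
        token = next_token(tokens, allow_eos)
        if token == eos_id:
            break
        tokens.append(token)
    return tokens


def toy_step(tokens, allow_eos):
    """Toy decoder step: emits EOS (id 0) as soon as allowed, else token 7."""
    return 0 if allow_eos else 7
```

With this toy step, `generate(toy_step, 0, min_output_length=3)` keeps producing tokens until three have been emitted, and a step that never emits the end-of-sequence id is cut off at `max_output_length`.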
 
 
 - setDoSample(value)
- Sets whether or not to use sampling; greedy decoding is used otherwise.
- Parameters:
- value : bool
- Whether or not to use sampling; use greedy decoding otherwise 
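The difference between greedy decoding and sampling can be illustrated with a small standalone sketch. It is not the annotator's internal code; `softmax` and `pick_next_token` are hypothetical helpers, and the logits are made-up numbers.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample=False, rng=None):
    """Return a token index: the argmax when greedy, a random draw when sampling."""
    probs = softmax(logits)
    if not do_sample:
        # Greedy decoding: always take the most probable token.
        return max(range(len(probs)), key=probs.__getitem__)
    # Sampling: draw a token in proportion to its probability.
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Greedy decoding is deterministic for a given input, while sampling can return a different token on each call unless the random generator is seeded.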
 
 
 - setTemperature(value)
- Sets the value used to modulate the next token probabilities.
- Parameters:
- value : float
- The value used to modulate the next token probabilities 
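As a rough illustration of what temperature does, the sketch below divides the logits by the temperature before applying softmax, which is the usual formulation. The helper name `apply_temperature` and the example logits are assumptions made for this illustration, not part of the Spark NLP API.

```python
import math


def apply_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalise with softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Temperatures below 1.0 sharpen the distribution toward the most likely token, values above 1.0 flatten it, and 1.0 leaves the probabilities unchanged.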
 
 
 - setTopK(value)
- Sets the number of highest probability vocabulary tokens to keep for top-k-filtering.
- Parameters:
- value : int
- Number of highest probability vocabulary tokens to keep 
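Top-k filtering can be sketched as follows: every logit outside the k highest is masked to negative infinity so it receives zero probability after softmax. `top_k_filter` is a hypothetical helper written for this illustration; it is not how the annotator is necessarily implemented.

```python
def top_k_filter(logits, top_k=50):
    """Keep the top_k highest logits and mask the rest with -inf."""
    if top_k <= 0 or top_k >= len(logits):
        # Nothing to filter.
        return list(logits)
    # The k-th largest value is the cut-off; ties at the cut-off are all kept.
    threshold = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]
```

For example, with `top_k=2` only the two largest scores survive and the remaining entries become `-inf`.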
 
 
 - setTopP(value)
- Sets the top cumulative probability for vocabulary tokens.
- If set to float < 1, only the most probable tokens with probabilities that add up to topP or higher are kept for generation.
- Parameters:
- value : float
- Cumulative probability for vocabulary tokens 
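The rule described above (often called nucleus sampling) can be sketched in a few lines. The helper `top_p_filter` and the example probabilities are assumptions for illustration; the sketch keeps the smallest set of most probable tokens whose probabilities add up to at least `top_p` and renormalises them.

```python
def top_p_filter(probs, top_p=1.0):
    """Zero out tokens outside the top_p nucleus and renormalise the rest."""
    # Visit tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    keep = set(kept)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]
```

With probabilities `[0.5, 0.3, 0.15, 0.05]` and `top_p=0.7`, the first two tokens together reach 0.8, so only they are kept and the other two are set to zero.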
 
 
 - setRepetitionPenalty(value)
- Sets the parameter for repetition penalty. 1.0 means no penalty.
- Parameters:
- value : float
- The repetition penalty 
 
 - References - See CTRL: A Conditional Transformer Language Model for Controllable Generation for more details. 
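A common way to apply a repetition penalty, used in several open-source implementations of the idea from the CTRL paper, is to make the logits of already generated tokens less attractive. The sketch below follows that variant; `apply_repetition_penalty` is a hypothetical helper, and this page does not confirm that the annotator uses exactly this formula.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.0):
    """Penalise tokens that already appear in generated_ids.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a penalty > 1 always lowers the score.
    """
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] = out[token_id] / penalty
        else:
            out[token_id] = out[token_id] * penalty
    return out
```

A penalty of 1.0 leaves every logit unchanged, which matches the "no penalty" default described above.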
 - setNoRepeatNgramSize(value)
- Sets size of n-grams that can only occur once.
- If set to int > 0, all n-grams of that size can only occur once.
- Parameters:
- value : int
- Size of n-grams that can only occur once 
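The effect of a no-repeat n-gram constraint can be sketched by computing which next tokens would complete an n-gram that has already been generated. `banned_next_tokens` is a hypothetical helper for illustration; a decoder would typically mask these tokens before choosing the next one.

```python
def banned_next_tokens(generated_ids, no_repeat_ngram_size):
    """Return token ids that would repeat an existing n-gram if emitted next."""
    n = no_repeat_ngram_size
    if n <= 0 or len(generated_ids) < n:
        return set()
    # The last n-1 tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated_ids[len(generated_ids) - (n - 1):])
    banned = set()
    for i in range(len(generated_ids) - n + 1):
        ngram = generated_ids[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            # Emitting ngram[-1] now would repeat this earlier n-gram.
            banned.add(ngram[-1])
    return banned
```

For the sequence `[1, 2, 3, 1, 2]` with size 3, the trigram `1 2 3` has already occurred, so token `3` is banned as the next token.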
 
 
 - static loadSavedModel(folder, spark_session, use_openvino=False)
- Loads a locally saved model.
- Parameters:
- folder : str
- Folder of the saved model 
- spark_session : pyspark.sql.SparkSession
- The current SparkSession 
- use_openvino : bool, optional
- Whether to load the model using OpenVINO, by default False 
 
- Returns:
- Phi2Transformer
- The restored model 
 
 
 - static pretrained(name='phi2', lang='en', remote_loc=None)
- Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
- Name of the pretrained model, by default “phi2” 
- lang : str, optional
- Language of the pretrained model, by default “en” 
- remote_loc : str, optional
- Optional remote address of the resource, by default None. Will use Spark NLP’s repositories otherwise. 
 
- Returns:
- Phi2Transformer
- The restored model