sparknlp.annotator.seq2seq.gpt2_transformer
#
Contains classes for the GPT2Transformer.
Module Contents#
Classes#
GPT2: the OpenAI Text-To-Text Transformer |
- class GPT2Transformer(classname='com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer', java_model=None)[source]#
GPT2: the OpenAI Text-To-Text Transformer
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
Pretrained models can be loaded with
pretrained()
of the companion object:>>> gpt2 = GPT2Transformer.pretrained() \ ... .setInputCols(["document"]) \ ... .setOutputCol("generation")
The default model is
"gpt2"
, if no name is provided. For available pretrained models please see the Models Hub.Input Annotation types
Output Annotation type
DOCUMENT
DOCUMENT
- Parameters:
- task
Transformer’s task, e.g.
summarize:
, by default “”- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- minOutputLength
Minimum length of the sequence to be generated, by default 0
- maxOutputLength
Maximum length of output text, by default 20
- doSample
Whether or not to use sampling; use greedy decoding otherwise, by default False
- temperature
The value used to module the next token probabilities, by default 1.0
- topK
The number of highest probability vocabulary tokens to keep for top-k-filtering, by default 50
- topP
Top cumulative probability for vocabulary tokens, by default 1.0
If set to float < 1, only the most probable tokens with probabilities that add up to
topP
or higher are kept for generation.- repetitionPenalty
The parameter for repetition penalty, 1.0 means no penalty. , by default 1.0
- noRepeatNgramSize
If set to int > 0, all ngrams of that size can only occur once, by default 0
- ignoreTokenIds
A list of token ids which are ignored in the decoder’s output, by default []
Notes
This is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended.
References
Paper Abstract:
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on taskspecific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Examples
>>> import sparknlp >>> from sparknlp.base import * >>> from sparknlp.annotator import * >>> from pyspark.ml import Pipeline >>> documentAssembler = DocumentAssembler() \ ... .setInputCol("text") \ ... .setOutputCol("documents") >>> gpt2 = GPT2Transformer.pretrained("gpt2") \ ... .setInputCols(["documents"]) \ ... .setMaxOutputLength(50) \ ... .setOutputCol("generation") >>> pipeline = Pipeline().setStages([documentAssembler, gpt2]) >>> data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text") >>> result = pipeline.fit(data).transform(data) >>> result.select("summaries.generation").show(truncate=False) +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776.]| -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- setTask(value)[source]#
Sets the transformer’s task, e.g.
summarize:
.- Parameters:
- valuestr
The transformer’s task
- setIgnoreTokenIds(value)[source]#
A list of token ids which are ignored in the decoder’s output.
- Parameters:
- valueList[int]
The words to be filtered out
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- bList[int]
ConfigProto from tensorflow, serialized into byte array
- setMinOutputLength(value)[source]#
Sets minimum length of the sequence to be generated.
- Parameters:
- valueint
Minimum length of the sequence to be generated
- setMaxOutputLength(value)[source]#
Sets maximum length of output text.
- Parameters:
- valueint
Maximum length of output text
- setDoSample(value)[source]#
Sets whether or not to use sampling, use greedy decoding otherwise.
- Parameters:
- valuebool
Whether or not to use sampling; use greedy decoding otherwise
- setTemperature(value)[source]#
Sets the value used to module the next token probabilities.
- Parameters:
- valuefloat
The value used to module the next token probabilities
- setTopK(value)[source]#
Sets the number of highest probability vocabulary tokens to keep for top-k-filtering.
- Parameters:
- valueint
Number of highest probability vocabulary tokens to keep
- setTopP(value)[source]#
Sets the top cumulative probability for vocabulary tokens.
If set to float < 1, only the most probable tokens with probabilities that add up to
topP
or higher are kept for generation.- Parameters:
- valuefloat
Cumulative probability for vocabulary tokens
- setRepetitionPenalty(value)[source]#
Sets the parameter for repetition penalty. 1.0 means no penalty.
- Parameters:
- valuefloat
The repetition penalty
References
See Ctrl: A Conditional Transformer Language Model For Controllable Generation for more details.
- setNoRepeatNgramSize(value)[source]#
Sets size of n-grams that can only occur once.
If set to int > 0, all ngrams of that size can only occur once.
- Parameters:
- valueint
N-gram size can only occur once
- static loadSavedModel(folder, spark_session)[source]#
Loads a locally saved model.
- Parameters:
- folderstr
Folder of the saved model
- spark_sessionpyspark.sql.SparkSession
The current SparkSession
- Returns:
- GPT2Transformer
The restored model
- static pretrained(name='gpt2', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- namestr, optional
Name of the pretrained model, by default “gpt2”
- langstr, optional
Language of the pretrained model, by default “en”
- remote_locstr, optional
Optional remote address of the resource, by default None. Will use Spark NLPs repositories otherwise.
- Returns:
- GPT2Transformer
The restored model