sparknlp.annotator.embeddings.bert_sentence_embeddings#
Contains classes for BertSentenceEmbeddings.
Module Contents#
Classes#
BertSentenceEmbeddings: Sentence-level embeddings using BERT (Bidirectional Encoder Representations from Transformers).
- class BertSentenceEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings', java_model=None)[source]#
Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.
Pretrained models can be loaded with pretrained() of the companion object:
>>> embeddings = BertSentenceEmbeddings.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("sentence_bert_embeddings")
The default model is "sent_small_bert_L2_768", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- batchSize
Size of every batch, by default 8
- caseSensitive
Whether to ignore case in tokens for embeddings matching, by default False
- dimension
Number of embedding dimensions, by default 768
- maxSentenceLength
Max sentence length to process, by default 128
- isLong
Use Long type instead of Int type for inputs buffer. Some BERT models require Long instead of Int.
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
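A minimal sketch of how these parameters might be configured on the annotator; the values shown are illustrative defaults, not recommendations for any particular model:
>>> from sparknlp.annotator import BertSentenceEmbeddings
>>> # Illustrative configuration only; adjust values for your model and data
>>> embeddings = BertSentenceEmbeddings.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("sentence_bert_embeddings") \
...     .setBatchSize(8) \
...     .setCaseSensitive(False) \
...     .setMaxSentenceLength(128)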
References
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("sentence_bert_embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["sentence_bert_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["John loves apples. Mary loves oranges. John loves Mary."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.8951074481010437,0.13753940165042877,0.3108254075050354,-1.65693199634552...|
|[-0.6180210709571838,-0.12179657071828842,-0.191165953874588,-1.4497021436691...|
|[-0.822715163230896,0.7568016648292542,-0.1165061742067337,-1.59048593044281,...|
+--------------------------------------------------------------------------------+
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
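A minimal sketch of how such a byte array might be produced, assuming the tensorflow package is installed; the GPU option shown is only an example of something you might serialize:
>>> import tensorflow as tf
>>> # Build a ConfigProto and serialize it to a list of ints for Spark NLP
>>> config = tf.compat.v1.ConfigProto()
>>> config.gpu_options.allow_growth = True  # example option only
>>> embeddings.setConfigProtoBytes(list(config.SerializeToString()))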
- setIsLong(value)[source]#
Sets whether to use Long type instead of Int type for inputs buffer.
Some Bert models require Long instead of Int.
- Parameters:
- value : bool
Whether to use Long type instead of Int type for inputs buffer
- static loadSavedModel(folder, spark_session, use_openvino=False)[source]#
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- use_openvino : bool, optional
Whether to use the OpenVINO backend, by default False
- Returns:
- BertSentenceEmbeddings
The restored model
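A minimal sketch of importing a locally exported model; the folder paths below are placeholders, and it is assumed the folder contains a model exported in a format Spark NLP can read:
>>> import sparknlp
>>> spark = sparknlp.start()
>>> # "/models/exported_sent_bert" is a placeholder path to a locally exported model
>>> embeddings = BertSentenceEmbeddings.loadSavedModel("/models/exported_sent_bert", spark) \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("sentence_bert_embeddings")
>>> # Optionally persist the imported annotator for later reuse
>>> embeddings.write().overwrite().save("/models/spark_nlp_sent_bert")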
- static pretrained(name='sent_small_bert_L2_768', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "sent_small_bert_L2_768"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- BertSentenceEmbeddings
The restored model
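A minimal sketch of loading a non-default pretrained model; the model name and language code below are examples only, so check the Models Hub for the names actually available:
>>> # Model name and language are illustrative; see the Models Hub for available models
>>> embeddings = BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", lang="xx") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("sentence_bert_embeddings")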