`sparknlp.annotator.embeddings.bert_sentence_embeddings`

Contains classes for BertSentenceEmbeddings.

Module Contents

Classes

BertSentenceEmbeddings

Sentence-level embeddings using BERT.

class BertSentenceEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings', java_model=None)

Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = BertSentenceEmbeddings.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("sentence_bert_embeddings")

If no name is provided, the default model is "sent_small_bert_L2_768".

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types	Output Annotation type
`DOCUMENT`	`SENTENCE_EMBEDDINGS`

Parameters:

batchSize: Size of every batch, by default 8
caseSensitive: Whether token case is preserved when matching embeddings (False means case is ignored), by default False
dimension: Number of embedding dimensions, by default 768
maxSentenceLength: Max sentence length to process, by default 128
isLong: Whether to use the Long type instead of Int for the inputs buffer. Some BERT models require Long instead of Int.
configProtoBytes: ConfigProto from TensorFlow, serialized into a byte array.
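The parameters above are normally set through chained setter calls on the annotator. The helper below is a minimal sketch of that pattern; the function name `configure_sentence_embeddings` is hypothetical, and it assumes the annotator exposes `setBatchSize`, `setCaseSensitive` and `setMaxSentenceLength`, each returning the annotator itself.

```python
def configure_sentence_embeddings(embeddings, batch_size=8,
                                  case_sensitive=False,
                                  max_sentence_length=128):
    """Apply common parameters to an embeddings annotator and return it.

    `embeddings` is assumed to be a BertSentenceEmbeddings instance (or any
    object with the same chained setters). The defaults mirror the values
    listed in the parameter documentation.
    """
    return (
        embeddings
        .setBatchSize(batch_size)                  # sentences per batch
        .setCaseSensitive(case_sensitive)          # False: case is ignored
        .setMaxSentenceLength(max_sentence_length) # max sentence length to process
    )

# Usage (requires Spark NLP and an active Spark session):
# from sparknlp.annotator import BertSentenceEmbeddings
# embeddings = configure_sentence_embeddings(
#     BertSentenceEmbeddings.pretrained(), batch_size=16)
```

Keeping the settings in one function makes it easier to reuse the same configuration across several pipelines.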

References

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

google-research/bert

Paper abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("sentence_bert_embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["sentence_bert_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["John loves apples. Mary loves oranges. John loves Mary."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.8951074481010437,0.13753940165042877,0.3108254075050354,-1.65693199634552...|
|[-0.6180210709571838,-0.12179657071828842,-0.191165953874588,-1.4497021436691...|
|[-0.822715163230896,0.7568016648292542,-0.1165061742067337,-1.59048593044281,...|
+--------------------------------------------------------------------------------+
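One common use of the resulting vectors is comparing sentences by cosine similarity after collecting them to the driver. The sketch below is plain Python and is not part of Spark NLP; the short vectors are made-up placeholders, not real model output.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real sentence embeddings.
v1 = [1.0, 0.0, 2.0, 0.0]
v2 = [2.0, 0.0, 4.0, 0.0]   # same direction as v1
v3 = [0.0, 3.0, 0.0, 1.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # 0.0: no overlap
```

With real data, each row of `finished_embeddings` would take the place of the toy vectors, for example after calling `collect()` on a small result set.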

name = 'BertSentenceEmbeddings'

inputAnnotatorTypes

outputAnnotatorType = 'sentence_embeddings'

isLong

configProtoBytes

setConfigProtoBytes(b)

Sets configProto from TensorFlow, serialized into a byte array.

Parameters:

b (List[int]): ConfigProto from TensorFlow, serialized into a byte array
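Because the parameter is typed as a list of ints rather than raw bytes, a serialized ConfigProto usually has to be converted first. The sketch below shows one way to do that; the helper name `config_proto_to_int_list` is hypothetical, and the sample bytes are placeholders rather than a meaningful configuration.

```python
from typing import List

def config_proto_to_int_list(serialized: bytes) -> List[int]:
    """Convert serialized protobuf bytes into a list of ints in 0..255."""
    return list(serialized)

# Placeholder bytes for illustration only. In practice they would come from
# a serialized TensorFlow config, for example (assumption):
#   tf.compat.v1.ConfigProto(...).SerializeToString()
example_bytes = bytes([8, 1, 16, 2])
int_list = config_proto_to_int_list(example_bytes)
print(int_list)

# Usage with an annotator (requires Spark NLP):
# embeddings.setConfigProtoBytes(int_list)
```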

setIsLong(value)

Sets whether to use Long type instead of Int type for inputs buffer.

Some Bert models require Long instead of Int.

Parameters:

value (bool): Whether to use Long type instead of Int type for inputs buffer

static loadSavedModel(folder, spark_session, use_openvino=False)

Loads a locally saved model.

Parameters:

folder (str): Folder of the saved model
spark_session (pyspark.sql.SparkSession): The current SparkSession
use_openvino (bool): Whether to use the OpenVINO backend, by default False

Returns:

BertSentenceEmbeddings: The restored model
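A typical call pattern is sketched below. The wrapper name `load_local_bert_sentence_model` and the column names are illustrative assumptions. The import sits inside the function so the sketch can be defined without Spark NLP installed.

```python
def load_local_bert_sentence_model(folder, spark_session):
    """Load a locally saved model and wire up its input/output columns.

    `folder` is the path to the saved model and `spark_session` is the
    active SparkSession, matching the documented parameters.
    """
    # Lazy import: Spark NLP is only needed when the function is called.
    from sparknlp.annotator import BertSentenceEmbeddings

    return (
        BertSentenceEmbeddings.loadSavedModel(folder, spark_session)
        .setInputCols(["sentence"])
        .setOutputCol("sentence_bert_embeddings")
    )

# Usage (requires Spark NLP; the path is a placeholder):
# import sparknlp
# spark = sparknlp.start()
# embeddings = load_local_bert_sentence_model("/path/to/saved_model", spark)
```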

static pretrained(name='sent_small_bert_L2_768', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:

name (str, optional): Name of the pretrained model, by default "sent_small_bert_L2_768"
lang (str, optional): Language of the pretrained model, by default "en"
remote_loc (str, optional): Remote address of the resource, by default None. If not set, Spark NLP's repositories are used.

Returns:

BertSentenceEmbeddings: The restored model
