sparknlp.annotator.embeddings.sentence_embeddings#

Contains classes for SentenceEmbeddings.

Module Contents#

Classes#

SentenceEmbeddings

Converts the results from WordEmbeddings, BertEmbeddings, or other word embeddings into sentence or document embeddings.

class SentenceEmbeddings[source]#

Converts the results from WordEmbeddings, BertEmbeddings, or other word embeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).

The pooling can be configured with setPoolingStrategy(), which can be either "AVERAGE" or "SUM".
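Conceptually, pooling collapses the per-word vectors of a sentence into a single vector of the same dimension. A minimal NumPy sketch of the two strategies (illustrative only, not Spark NLP's internal implementation):

>>> import numpy as np
>>> word_vectors = np.array([
...     [1.0, 2.0, 3.0],
...     [3.0, 4.0, 5.0]])
>>> word_vectors.mean(axis=0)  # "AVERAGE" pooling
array([2., 3., 4.])
>>> word_vectors.sum(axis=0)  # "SUM" pooling
array([4., 6., 8.])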

For more extended examples, see the Examples.

Input Annotation types: DOCUMENT, WORD_EMBEDDINGS

Output Annotation type: SENTENCE_EMBEDDINGS

Parameters:

dimension
    Number of embedding dimensions.

poolingStrategy
    How to aggregate the word embeddings into sentence embeddings: AVERAGE or SUM, by default AVERAGE.

Notes

If you choose document as your input for Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings, then it averages/sums all the embeddings into one array of embeddings. However, if you choose sentence as the inputCols, then SentenceEmbeddings generates one array of embeddings for each sentence (see the sketch below).
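For instance, to pool per sentence instead of per document, a SentenceDetector can be placed between the document assembler and the tokenizer. A minimal sketch along the lines of the full example below (the column names are illustrative):

>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsSentence = SentenceEmbeddings() \
...     .setInputCols(["sentence", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("AVERAGE")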

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsSentence = SentenceEmbeddings() \
...     .setInputCols(["document", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("AVERAGE")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       tokenizer,
...       embeddings,
...       embeddingsSentence,
...       embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+
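Because setOutputAsVector(True) was set, each element of finished_embeddings is a Spark ML Vector, so the exploded column can be handed straight to downstream Spark ML stages. A small illustrative follow-up (the features column name is our choice):

>>> vectors = result.selectExpr("explode(finished_embeddings) AS features")
>>> vectors.count()
1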
setPoolingStrategy(strategy)[source]#

Sets how to aggregate the word embeddings into sentence embeddings, by default AVERAGE.

Can be either AVERAGE or SUM.

Parameters:

strategy : str
    Pooling strategy, either AVERAGE or SUM.

Returns:

SentenceEmbeddings
    This annotator instance, so that setter calls can be chained.
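As the pipeline example above shows, the returned annotator allows the call to be chained with the other setters:

>>> embeddingsSentence = SentenceEmbeddings() \
...     .setInputCols(["document", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("SUM")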