sparknlp.annotator.embeddings.sentence_embeddings
Contains classes for SentenceEmbeddings.
Module Contents
Classes
SentenceEmbeddings: Converts the results from WordEmbeddings, BertEmbeddings, or other word embeddings into sentence or document embeddings.
class SentenceEmbeddings
Converts the results from WordEmbeddings, BertEmbeddings, or other word embeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).
This can be configured with setPoolingStrategy(), which can be either "AVERAGE" or "SUM".
For more extended examples, see the Examples section.
Input Annotation types: DOCUMENT, WORD_EMBEDDINGS
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- dimension
Number of embedding dimensions
- poolingStrategy
Choose how you would like to aggregate Word Embeddings to Sentence Embeddings: AVERAGE or SUM, by default AVERAGE
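For instance, a minimal sketch of switching the aggregation from the default averaging to summation (not part of the original docstring; the column names are illustrative):

>>> sentenceEmbeddings = SentenceEmbeddings() \
...     .setInputCols(["document", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("SUM")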
Notes
If you choose document as the input for the Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings, then all the word embeddings are averaged/summed into one array of embeddings per document. However, if you choose sentence as the inputCols, then SentenceEmbeddings generates one array of embeddings for each sentence, as shown in the sketch below.
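As a sketch of the sentence-level behavior (assuming the standard SentenceDetector annotator; this snippet is not part of the original docstring), routing detected sentences into the downstream stages yields one embedding array per sentence:

>>> from sparknlp.base import DocumentAssembler
>>> from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, SentenceEmbeddings
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsSentence = SentenceEmbeddings() \
...     .setInputCols(["sentence", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("AVERAGE")

Each SENTENCE_EMBEDDINGS annotation then corresponds to one detected sentence rather than to the whole document.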
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsSentence = SentenceEmbeddings() \
...     .setInputCols(["document", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("AVERAGE")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings,
...         embeddingsSentence,
...         embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+