sparknlp.annotator.embeddings.sentence_embeddings
Contains classes for SentenceEmbeddings.
Module Contents
Classes
SentenceEmbeddings: Converts the results from WordEmbeddings, BertEmbeddings, or other word embeddings into sentence or document embeddings.
class SentenceEmbeddings
Converts the results from WordEmbeddings, BertEmbeddings, or other word embeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).
This can be configured with setPoolingStrategy(), which can be either "AVERAGE" or "SUM".
For more extended examples, see the Examples section.
Input Annotation types: DOCUMENT, WORD_EMBEDDINGS
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- dimension
Number of embedding dimensions
- poolingStrategy
Choose how you would like to aggregate Word Embeddings to Sentence Embeddings: AVERAGE or SUM, by default AVERAGE
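For instance, a minimal sketch of switching the aggregation from the default averaging to summation (not part of the original docstring; the column names are illustrative):

>>> sentenceEmbeddings = SentenceEmbeddings() \
...     .setInputCols(["document", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("SUM")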
Notes
If you choose document as the input for the Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings, then all the word embeddings are averaged/summed into one array of embeddings per document. However, if you choose sentence as the inputCols, then SentenceEmbeddings generates one array of embeddings for each sentence, as shown in the sketch below.
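As a sketch of the sentence-level behavior (assuming the standard SentenceDetector annotator; this snippet is not part of the original docstring), routing detected sentences into the downstream stages yields one embedding array per sentence:

>>> from sparknlp.base import DocumentAssembler
>>> from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, SentenceEmbeddings
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsSentence = SentenceEmbeddings() \
...     .setInputCols(["sentence", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("AVERAGE")

Each SENTENCE_EMBEDDINGS annotation then corresponds to one detected sentence rather than to the whole document.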
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsSentence = SentenceEmbeddings() \
...     .setInputCols(["document", "embeddings"]) \
...     .setOutputCol("sentence_embeddings") \
...     .setPoolingStrategy("AVERAGE")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings,
...         embeddingsSentence,
...         embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+