sparknlp.annotator.embeddings.elmo_embeddings
Contains classes for ElmoEmbeddings.
Module Contents
Classes
ElmoEmbeddings: Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.
- class ElmoEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.ElmoEmbeddings', java_model=None)
Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.
Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.
Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("elmo_embeddings")
The default model is "elmo", if no name is provided.

For available pretrained models please see the Models Hub.
The pooling layer can be set with setPoolingLayer() to the following values:

- "word_emb": the character-based word representations with shape [batch_size, max_length, 512].
- "lstm_outputs1": the first LSTM hidden state with shape [batch_size, max_length, 1024].
- "lstm_outputs2": the second LSTM hidden state with shape [batch_size, max_length, 1024].
- "elmo": the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].
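The choice of layer also determines the vector size per token: "word_emb" yields 512-dimensional embeddings, while the other layers yield 1024-dimensional ones. A minimal sketch (the input and output column names are illustrative):

>>> # character-based representations, 512 dimensions per token
>>> elmo_word_emb = ElmoEmbeddings.pretrained() \
...     .setPoolingLayer("word_emb") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> # trainable weighted sum of all three layers, 1024 dimensions per token
>>> elmo_weighted = ElmoEmbeddings.pretrained() \
...     .setPoolingLayer("elmo") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")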
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: WORD_EMBEDDINGS
- Parameters:
- batchSize
Batch size. Larger values allow faster processing but require more memory, by default 32
- dimension
Number of embedding dimensions
- caseSensitive
Whether to ignore case in tokens for embeddings matching
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- poolingLayer
ELMo pooling layer, one of word_emb, lstm_outputs1, lstm_outputs2, or elmo, by default word_emb
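These parameters are configured through the corresponding setters. A minimal sketch (the values shown are illustrative, not tuned recommendations):

>>> # smaller batches reduce memory pressure at the cost of throughput
>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setBatchSize(8) \
...     .setCaseSensitive(True) \
...     .setPoolingLayer("elmo") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")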
References
https://tfhub.dev/google/elmo/3
Deep contextualized word representations
Paper abstract:
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setPoolingLayer("word_emb") \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[6.662458181381226E-4,-0.2541114091873169,-0.6275503039360046,0.5787073969841...|
|[0.19154725968837738,0.22998669743537903,-0.2894386649131775,0.21524395048618...|
|[0.10400570929050446,0.12288510054349899,-0.07056470215320587,-0.246389418840...|
|[0.49932169914245605,-0.12706467509269714,0.30969417095184326,0.2643227577209...|
|[-0.8871506452560425,-0.20039963722229004,-1.0601330995559692,0.0348707810044...|
+--------------------------------------------------------------------------------+
- setConfigProtoBytes(b)
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
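A minimal sketch of producing such a byte array, assuming TensorFlow is available and using its ConfigProto message (the session options shown are only an example):

>>> import tensorflow as tf
>>> # serialize a ConfigProto into the List[int] form expected by the setter
>>> config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setConfigProtoBytes(list(config.SerializeToString()))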
- setPoolingLayer(layer)
Sets the ELMo pooling layer to word_emb, lstm_outputs1, lstm_outputs2, or elmo, by default word_emb.
- Parameters:
- layer : str
ELMo pooling layer
- static loadSavedModel(folder, spark_session)
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- Returns:
- ElmoEmbeddings
The restored model
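A minimal sketch of restoring a model from disk and wiring it into a pipeline (the folder path is hypothetical, and spark is an active SparkSession):

>>> # the folder must contain an ELMo model previously exported for Spark NLP
>>> elmo = ElmoEmbeddings.loadSavedModel("/models/elmo_saved", spark) \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")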
- static pretrained(name='elmo', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "elmo"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- ElmoEmbeddings
The restored model
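For instance, requesting the default English model explicitly, which is equivalent to calling pretrained() with no arguments:

>>> # downloads and loads the "elmo" model for English
>>> embeddings = ElmoEmbeddings.pretrained("elmo", "en") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")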