sparknlp.annotator.embeddings.elmo_embeddings
Contains classes for ElmoEmbeddings.
Module Contents
Classes
ElmoEmbeddings: Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.
- class ElmoEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.ElmoEmbeddings', java_model=None)
Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.
Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.
Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("elmo_embeddings")
The default model is "elmo", if no name is provided.

For available pretrained models please see the Models Hub.
The pooling layer can be set with setPoolingLayer() to the following values:

- "word_emb": the character-based word representations with shape [batch_size, max_length, 512].
- "lstm_outputs1": the first LSTM hidden state with shape [batch_size, max_length, 1024].
- "lstm_outputs2": the second LSTM hidden state with shape [batch_size, max_length, 1024].
- "elmo": the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].
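The choice of layer also determines the vector size per token: "word_emb" yields 512-dimensional embeddings, while the other layers yield 1024-dimensional ones. A minimal sketch (the input and output column names are illustrative):

>>> # character-based representations, 512 dimensions per token
>>> elmo_word_emb = ElmoEmbeddings.pretrained() \
...     .setPoolingLayer("word_emb") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> # trainable weighted sum of all three layers, 1024 dimensions per token
>>> elmo_weighted = ElmoEmbeddings.pretrained() \
...     .setPoolingLayer("elmo") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")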
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: WORD_EMBEDDINGS
- Parameters:
- batchSize
Batch size. Larger values allow faster processing but require more memory, by default 32
- dimension
Number of embedding dimensions
- caseSensitive
Whether to ignore case in tokens for embeddings matching
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- poolingLayer
ELMo pooling layer, one of word_emb, lstm_outputs1, lstm_outputs2, or elmo, by default word_emb
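These parameters are configured through the corresponding setters. A minimal sketch (the values shown are illustrative, not tuned recommendations):

>>> # smaller batches reduce memory pressure at the cost of throughput
>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setBatchSize(8) \
...     .setCaseSensitive(True) \
...     .setPoolingLayer("elmo") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")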
References
https://tfhub.dev/google/elmo/3
Deep contextualized word representations
Paper abstract:
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setPoolingLayer("word_emb") \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[6.662458181381226E-4,-0.2541114091873169,-0.6275503039360046,0.5787073969841...|
|[0.19154725968837738,0.22998669743537903,-0.2894386649131775,0.21524395048618...|
|[0.10400570929050446,0.12288510054349899,-0.07056470215320587,-0.246389418840...|
|[0.49932169914245605,-0.12706467509269714,0.30969417095184326,0.2643227577209...|
|[-0.8871506452560425,-0.20039963722229004,-1.0601330995559692,0.0348707810044...|
+--------------------------------------------------------------------------------+
- setConfigProtoBytes(b)
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
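A minimal sketch of producing such a byte array, assuming TensorFlow is available and using its ConfigProto message (the session options shown are only an example):

>>> import tensorflow as tf
>>> # serialize a ConfigProto into the List[int] form expected by the setter
>>> config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
>>> embeddings = ElmoEmbeddings.pretrained() \
...     .setConfigProtoBytes(list(config.SerializeToString()))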
- setPoolingLayer(layer)
Sets the ELMo pooling layer to word_emb, lstm_outputs1, lstm_outputs2, or elmo, by default word_emb.
- Parameters:
- layer : str
ELMo pooling layer
- static loadSavedModel(folder, spark_session)
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- Returns:
- ElmoEmbeddings
The restored model
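A minimal sketch of restoring a model from disk and wiring it into a pipeline (the folder path is hypothetical, and spark is an active SparkSession):

>>> # the folder must contain an ELMo model previously exported for Spark NLP
>>> elmo = ElmoEmbeddings.loadSavedModel("/models/elmo_saved", spark) \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")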
- static pretrained(name='elmo', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "elmo"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- ElmoEmbeddings
The restored model
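For instance, requesting the default English model explicitly, which is equivalent to calling pretrained() with no arguments:

>>> # downloads and loads the "elmo" model for English
>>> embeddings = ElmoEmbeddings.pretrained("elmo", "en") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")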