sparknlp.annotator.embeddings.doc2vec
Contains classes for Doc2Vec.
Module Contents

Classes

- Doc2VecApproach: Trains a Word2Vec model that creates vector representations of words in a text corpus.
- Doc2VecModel: Word2Vec model that creates vector representations of words in a text corpus.
- class Doc2VecApproach
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation in Spark ML. Our implementation uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Doc2VecModel. For available pretrained models, please see the Models Hub.
Input Annotation types: TOKEN
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- vectorSize
The dimension of codes after transforming from words (> 0), by default 100
- windowSize
The window size (context words from [-window, window]) (> 0), by default 5
- numPartitions
Number of partitions for sentences of words (> 0), by default 1
- minCount
The minimum number of times a token must appear to be included in the word2vec model’s vocabulary (>= 0), by default 1
- maxSentenceLength
Maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to this size (> 0), by default 1000
- stepSize
Step size (learning rate) to be used for each iteration of optimization (> 0), by default 0.025
- maxIter
Maximum number of iterations (>= 0), by default 1
- seed
Random seed, by default 44
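As a minimal configuration sketch, assuming each parameter above has the standard setXxx setter (the values shown are simply the documented defaults, not tuning advice):

>>> embeddings = Doc2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings") \
...     .setVectorSize(100) \
...     .setWindowSize(5) \
...     .setNumPartitions(1) \
...     .setMinCount(1) \
...     .setMaxSentenceLength(1000) \
...     .setStepSize(0.025) \
...     .setMaxIter(1) \
...     .setSeed(44)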
References
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Doc2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings
...     ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path).toDF("text")
>>> pipelineModel = pipeline.fit(dataset)
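A possible follow-up sketch, assuming the fitted pipelineModel from above: transforming the dataset yields the SENTENCE_EMBEDDINGS annotations in the embeddings column.

>>> result = pipelineModel.transform(dataset)
>>> embeddingsDF = result.select("embeddings")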
- setNumPartitions(numPartitions)
Sets the number of partitions (default: 1). Use a small number for accuracy.
- setMaxIter(numIterations)
Sets the number of iterations (default: 1), which should be smaller than or equal to the number of partitions.
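An illustrative sketch combining the two setters, keeping maxIter no larger than numPartitions per the note above (the values here are arbitrary):

>>> embeddings = Doc2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings") \
...     .setNumPartitions(4) \
...     .setMaxIter(4)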
- class Doc2VecModel(classname='com.johnsnowlabs.nlp.embeddings.Doc2VecModel', java_model=None)
Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation in Spark ML. Our implementation uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
This is the instantiated model of the Doc2VecApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = Doc2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
The default model is “doc2vec_gigaword_300”, if no name is provided.
Input Annotation types: TOKEN
Output Annotation type: SENTENCE_EMBEDDINGS
- Parameters:
- vectorSize
The dimension of codes after transforming from words (> 0), by default 100
References
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Doc2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.06222493574023247,0.011579325422644615,0.009919632226228714,0.109361454844...|
+--------------------------------------------------------------------------------+
- static pretrained(name='doc2vec_gigaword_300', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default “doc2vec_gigaword_300”
- lang : str, optional
Language of the pretrained model, by default “en”
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- Doc2VecModel
The restored model
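As a sketch, the call below spells out the defaults from the signature above explicitly; it is equivalent to calling pretrained() with no arguments:

>>> embeddings = Doc2VecModel.pretrained("doc2vec_gigaword_300", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")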