sparknlp.annotator.embeddings.word2vec#
Contains classes for Word2Vec.
Module Contents#
Classes#
| Word2VecApproach | Trains a Word2Vec model that creates vector representations of words in a text corpus. |
| Word2VecModel | Word2Vec model that creates vector representations of words in a text corpus. |
- class Word2VecApproach[source]#
- Trains a Word2Vec model that creates vector representations of words in a text corpus.
- The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
- We use the Word2Vec implementation in Spark ML. It uses the skip-gram model and hierarchical softmax to train the model. The variable names in the implementation match the original C implementation.
- For instantiated/pretrained models, see Word2VecModel.
- For available pretrained models please see the Models Hub.

| Input Annotation types | Output Annotation type |
| TOKEN | WORD_EMBEDDINGS |

- Parameters:
- vectorSize
- The dimension of the vectors that words are mapped to (> 0), by default 100
- windowSize
- The window size (context words from [-window, window]) (> 0), by default 5 
- numPartitions
- Number of partitions for sentences of words (> 0), by default 1 
- minCount
- The minimum number of times a token must appear to be included in the word2vec model’s vocabulary (>= 0), by default 1 
- maxSentenceLength
- Maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to this size (> 0), by default 1000
- stepSize
- Step size (learning rate) to be used for each iteration of optimization (> 0), by default 0.025 
- maxIter
- Maximum number of iterations (>= 0), by default 1 
- seed
- Random seed, by default 44 
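As a minimal configuration sketch, each parameter listed above is assumed to have a matching setter on the annotator (setVectorSize, setWindowSize, and so on); the values below are illustrative only, not recommended settings:

>>> # illustrative values; each parameter above is assumed to have a matching setX setter
>>> embeddings = Word2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings") \
...     .setVectorSize(300) \
...     .setWindowSize(5) \
...     .setMinCount(5) \
...     .setStepSize(0.025) \
...     .setSeed(42)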
 
- References
- For the original C implementation, see https://code.google.com/p/word2vec/
- For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
- Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Word2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings
...     ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path).toDF("text")
>>> pipelineModel = pipeline.fit(dataset)

 - setNumPartitions(numPartitions)[source]#
- Sets number of partitions (default: 1). Use a small number for accuracy. 
 - setMaxIter(numIterations)[source]#
- Sets maximum number of iterations (default: 1), which should be smaller than or equal to the number of partitions.
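Taken together, these two setters control the training parallelism. A small sketch that keeps maxIter no larger than the number of partitions, as recommended above (the values are illustrative only):

>>> # illustrative: maxIter kept <= numPartitions
>>> embeddings = Word2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings") \
...     .setNumPartitions(4) \
...     .setMaxIter(2)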
 
- class Word2VecModel(classname='com.johnsnowlabs.nlp.embeddings.Word2VecModel', java_model=None)[source]#
- Word2Vec model that creates vector representations of words in a text corpus.
- The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
- We use the Word2Vec implementation in Spark ML. It uses the skip-gram model and hierarchical softmax to train the model. The variable names in the implementation match the original C implementation.
- This is the instantiated model of the Word2VecApproach. For training your own model, please see the documentation of that class.
- Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = Word2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")

- The default model is “word2vec_gigaword_300”, if no name is provided.

| Input Annotation types | Output Annotation type |
| TOKEN | WORD_EMBEDDINGS |

- Parameters:
- vectorSize
- The dimension of the vectors that words are mapped to (> 0), by default 100
 
- References
- For the original C implementation, see https://code.google.com/p/word2vec/
- For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
- Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Word2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.06222493574023247,0.011579325422644615,0.009919632226228714,0.109361454844...|
+--------------------------------------------------------------------------------+

 - static pretrained(name='word2vec_gigaword_300', lang='en', remote_loc=None)[source]#
- Downloads and loads a pretrained model.
- Parameters:
- namestr, optional
- Name of the pretrained model, by default “word2vec_gigaword_300”
- langstr, optional
- Language of the pretrained model, by default “en” 
- remote_locstr, optional
- Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
 
- Returns:
- Word2VecModel
- The restored model
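A minimal loading sketch with an explicit model name and language; the default English “word2vec_gigaword_300” is used here, and any other model name listed on the Models Hub can be substituted:

>>> # assumes an active Spark NLP session with access to the pretrained model repository
>>> embeddings = Word2VecModel.pretrained("word2vec_gigaword_300", lang="en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")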