class Doc2VecApproach extends AnnotatorApproach[Doc2VecModel] with HasStorageRef with HasEnableCachingProperties with HasProtectedParams

Trains a Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in that vocabulary. These vectors can be used as features in natural language processing and machine learning algorithms.

We use the Word2Vec implementation from Spark ML. Our implementation uses the skip-gram model and trains it with hierarchical softmax. The variable names match those in the original C implementation.

For instantiated/pretrained models, see Doc2VecModel.

Sources:

For the original C implementation, see https://code.google.com/p/word2vec/

For the research papers, see "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".

Example

import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new Doc2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings
  ))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)
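
To inspect what a fitted pipeline produces, the following is a minimal, self-contained sketch rather than reference usage. It assumes Spark NLP is on the classpath and a local SparkSession; the two input sentences are made up, and `minCount` is lowered to 1 only so that such a tiny corpus yields a non-empty vocabulary.

```scala
import com.johnsnowlabs.nlp.annotator.{Doc2VecApproach, Tokenizer}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .appName("Doc2VecSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up training data, one document per row.
val data = Seq(
  "Spark NLP trains embeddings from plain text.",
  "Embeddings are vectors of numbers."
).toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new Doc2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")
  .setMinCount(1) // keep every token of this tiny corpus in the vocabulary

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, tokenizer, embeddings))

// Fit on the data, then annotate the same data with the trained model.
val result = pipeline.fit(data).transform(data)

// Each annotation carries its vector in the `embeddings` field.
result.selectExpr("explode(embeddings.embeddings) as vector").show(5, 80)
```

With the default `vectorSize`, each displayed vector should have 100 entries.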
Linear Supertypes
HasProtectedParams, HasEnableCachingProperties, HasStorageRef, ParamsAndFeaturesWritable, HasFeatures, AnnotatorApproach[Doc2VecModel], CanBeLazy, DefaultParamsWritable, MLWritable, HasOutputAnnotatorType, HasOutputAnnotationCol, HasInputAnnotationCols, Estimator[Doc2VecModel], PipelineStage, Logging, Params, Serializable, Serializable, Identifiable, AnyRef, Any

Parameters

A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.

  1. val enableCaching: BooleanParam

    Whether to enable caching DataFrames or RDDs during the training

    Definition Classes
    HasEnableCachingProperties
  2. val maxIter: IntParam

    Param for maximum number of iterations (>= 0) (Default: 1)

  3. val maxSentenceLength: IntParam

    Sets the maximum length (in words) of each sentence in the input data (Default: 1000). Any sentence longer than this threshold will be divided into chunks of up to maxSentenceLength size.

  4. val minCount: IntParam

    The minimum number of times a token must appear to be included in the word2vec model's vocabulary (Default: 5).

  5. val numPartitions: IntParam

    Number of partitions for sentences of words (Default: 1).

  6. val seed: IntParam

    Random seed for shuffling the dataset (Default: 44)

  7. val stepSize: DoubleParam

    Param for the step size to be used for each iteration of optimization (> 0) (Default: 0.025).

  8. val storageRef: Param[String]

    Unique identifier for storage (Default: this.uid)

    Definition Classes
    HasStorageRef
  9. val vectorSize: ProtectedParam[Int]

    The dimension of the vector that each word is transformed into (Default: 100).

  10. val windowSize: IntParam

    The window size (context words from [-window, window]) (Default: 5)
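
The effect of `maxSentenceLength` can be pictured with plain Scala collections. This is only an illustration of the description above, not the library's internal code: it assumes that an over-long sentence is cut into consecutive chunks, and it uses a small illustrative limit and made-up tokens.

```scala
// Illustrative value; the documented default is 1000.
val maxSentenceLength = 4

// A made-up "sentence" of 10 tokens.
val tokens = List("w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10")

// Cut the sentence into consecutive chunks of at most maxSentenceLength tokens.
val chunks = tokens.grouped(maxSentenceLength).toList

// Print one chunk per line.
chunks.foreach(chunk => println(chunk.mkString(" ")))
```

Here the 10 tokens end up in three chunks of 4, 4, and 2 tokens, and no token is dropped.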

Annotator types

Required input and expected output annotator types

  1. val inputAnnotatorTypes: Array[AnnotatorType]

    Input Annotator Types: TOKEN

    Definition Classes
    Doc2VecApproach → HasInputAnnotationCols
  2. val outputAnnotatorType: String

    Output Annotator Types: SENTENCE_EMBEDDINGS

    Definition Classes
    Doc2VecApproach → HasOutputAnnotatorType

Members

  1. implicit class ProtectedParam[T] extends Param[T]
    Definition Classes
    HasProtectedParams
  2. type AnnotatorType = String
    Definition Classes
    HasOutputAnnotatorType
  1. def beforeTraining(spark: SparkSession): Unit
    Definition Classes
    Doc2VecApproach → AnnotatorApproach
  2. final def clear(param: Param[_]): Doc2VecApproach.this.type
    Definition Classes
    Params
  3. final def copy(extra: ParamMap): Estimator[Doc2VecModel]
    Definition Classes
    AnnotatorApproach → Estimator → PipelineStage → Params
  4. def createDatabaseConnection(database: Name): RocksDBConnection
    Definition Classes
    HasStorageRef
  5. val description: String
    Definition Classes
    Doc2VecApproach → AnnotatorApproach
  6. def explainParam(param: Param[_]): String
    Definition Classes
    Params
  7. def explainParams(): String
    Definition Classes
    Params
  8. final def extractParamMap(): ParamMap
    Definition Classes
    Params
  9. final def extractParamMap(extra: ParamMap): ParamMap
    Definition Classes
    Params
  10. val features: ArrayBuffer[Feature[_, _, _]]
    Definition Classes
    HasFeatures
  11. final def fit(dataset: Dataset[_]): Doc2VecModel
    Definition Classes
    AnnotatorApproach → Estimator
  12. def fit(dataset: Dataset[_], paramMaps: Seq[ParamMap]): Seq[Doc2VecModel]
    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" )
  13. def fit(dataset: Dataset[_], paramMap: ParamMap): Doc2VecModel
    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" )
  14. def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): Doc2VecModel
    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" ) @varargs()
  15. final def get[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  16. final def getDefault[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  17. def getInputCols: Array[String]

    returns: the input annotation columns currently used

    Definition Classes
    HasInputAnnotationCols
  18. def getLazyAnnotator: Boolean
    Definition Classes
    CanBeLazy
  19. final def getOrDefault[T](param: Param[T]): T
    Definition Classes
    Params
  20. final def getOutputCol: String

    Gets the name of the annotation column that will be generated

    Definition Classes
    HasOutputAnnotationCol
  21. def getParam(paramName: String): Param[Any]
    Definition Classes
    Params
  22. def getStorageRef: String
    Definition Classes
    HasStorageRef
  23. final def hasDefault[T](param: Param[T]): Boolean
    Definition Classes
    Params
  24. def hasParam(paramName: String): Boolean
    Definition Classes
    Params
  25. final def isDefined(param: Param[_]): Boolean
    Definition Classes
    Params
  26. final def isSet(param: Param[_]): Boolean
    Definition Classes
    Params
  27. val lazyAnnotator: BooleanParam
    Definition Classes
    CanBeLazy
  28. def onTrained(model: Doc2VecModel, spark: SparkSession): Unit
    Definition Classes
    AnnotatorApproach
  29. val optionalInputAnnotatorTypes: Array[String]
    Definition Classes
    HasInputAnnotationCols
  30. lazy val params: Array[Param[_]]
    Definition Classes
    Params
  31. def save(path: String): Unit
    Definition Classes
    MLWritable
    Annotations
    @Since( "1.6.0" ) @throws( ... )
  32. def set[T](param: ProtectedParam[T], value: T): Doc2VecApproach.this.type

    Sets the value for a protected Param.

    If the parameter was already set, it will not be set again. Default values do not count as a set value and can be overridden.

    T: Type of the parameter

    param: Protected parameter to set

    value: Value for the parameter

    returns: This object

    Definition Classes
    HasProtectedParams
  33. final def set[T](param: Param[T], value: T): Doc2VecApproach.this.type
    Definition Classes
    Params
  34. final def setInputCols(value: String*): Doc2VecApproach.this.type
    Definition Classes
    HasInputAnnotationCols
  35. def setInputCols(value: Array[String]): Doc2VecApproach.this.type

    Overrides the required annotator columns if they differ from the default

    Definition Classes
    HasInputAnnotationCols
  36. def setLazyAnnotator(value: Boolean): Doc2VecApproach.this.type
    Definition Classes
    CanBeLazy
  37. final def setOutputCol(value: String): Doc2VecApproach.this.type

    Overrides annotation column name when transforming

    Definition Classes
    HasOutputAnnotationCol
  38. def setStorageRef(value: String): Doc2VecApproach.this.type
    Definition Classes
    HasStorageRef
  39. def toString(): String
    Definition Classes
    Identifiable → AnyRef → Any
  40. def train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): Doc2VecModel
    Definition Classes
    Doc2VecApproach → AnnotatorApproach
  41. final def transformSchema(schema: StructType): StructType

    Requirement for pipeline transformation validation. It is called on fit().

    Definition Classes
    AnnotatorApproach → PipelineStage
  42. val uid: String
    Definition Classes
    Doc2VecApproach → Identifiable
  43. def validateStorageRef(dataset: Dataset[_], inputCols: Array[String], annotatorType: String): Unit
    Definition Classes
    HasStorageRef
  44. def write: MLWriter
    Definition Classes
    ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable

Parameter setters

  1. def setEnableCaching(value: Boolean): Doc2VecApproach.this.type

    Definition Classes
    HasEnableCachingProperties
  2. def setMaxIter(value: Int): Doc2VecApproach.this.type

  3. def setMaxSentenceLength(value: Int): Doc2VecApproach.this.type

  4. def setMinCount(value: Int): Doc2VecApproach.this.type

  5. def setNumPartitions(value: Int): Doc2VecApproach.this.type

  6. def setSeed(value: Int): Doc2VecApproach.this.type

  7. def setStepSize(value: Double): Doc2VecApproach.this.type

  8. def setVectorSize(value: Int): Doc2VecApproach.this.type

  9. def setWindowSize(value: Int): Doc2VecApproach.this.type

Parameter getters

  1. def getEnableCaching: Boolean

    Definition Classes
    HasEnableCachingProperties
  2. def getMaxIter: Int

  3. def getMaxSentenceLength: Int

  4. def getMinCount: Int

  5. def getNumPartitions: Int

  6. def getSeed: Int

  7. def getStepSize: Double

  8. def getVectorSize: Int

  9. def getWindowSize: Int
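
The setters and getters listed above can be chained and read back as in the following sketch. It assumes Spark NLP is on the classpath; the chosen values are arbitrary illustrations, not recommendations.

```scala
import com.johnsnowlabs.nlp.annotator.Doc2VecApproach

// Every setter returns the approach itself, so calls can be chained.
val embeddings = new Doc2VecApproach()
  .setInputCols("token")
  .setOutputCol("embeddings")
  .setVectorSize(200)        // default: 100
  .setWindowSize(10)         // default: 5
  .setMinCount(2)            // default: 5
  .setMaxIter(5)             // default: 1
  .setStepSize(0.05)         // default: 0.025
  .setMaxSentenceLength(500) // default: 1000
  .setNumPartitions(1)       // default: 1
  .setSeed(42)               // default: 44

// Getters return the values currently in effect.
println(s"vectorSize = ${embeddings.getVectorSize}")
println(s"windowSize = ${embeddings.getWindowSize}")
println(s"minCount   = ${embeddings.getMinCount}")
println(s"stepSize   = ${embeddings.getStepSize}")
```

Because `vectorSize` is declared as a `ProtectedParam`, it is set only once here, before any other call could have fixed its value.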