sparknlp.annotator.embeddings.word_embeddings#
Contains classes for WordEmbeddings.
Module Contents#
Classes#
WordEmbeddings
Word Embeddings lookup annotator that maps tokens to vectors.
WordEmbeddingsModel
Word Embeddings lookup annotator that maps tokens to vectors.
- class WordEmbeddings[source]#
Word Embeddings lookup annotator that maps tokens to vectors.
For instantiated/pretrained models, see WordEmbeddingsModel.

A custom token lookup dictionary for embeddings can be set with setStoragePath(). Each line of the provided file needs to have a token, followed by its vector representation, delimited by spaces:

...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...
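To make the format concrete, here is a minimal sketch (not part of the library) of writing such a lookup file; the file name, vocabulary, and use of random vectors are illustrative assumptions:

>>> import random
>>> # Write one token per line, followed by its space-delimited vector.
>>> with open("random_embeddings_dim4.txt", "w") as f:  # hypothetical path
...     for token in ["are", "were", "stress", "induced"]:
...         vector = " ".join(str(random.random()) for _ in range(4))
...         f.write(f"{token} {vector}\n")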
If a token is not found in the dictionary, then the result will be a zero vector of the same dimension. Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn() and WordEmbeddingsModel.overallCoverage().

For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: WORD_EMBEDDINGS

- Parameters:
- writeBufferSize
Buffer size limit before dumping to disk storage while writing, by default 10000
- readCacheSize
Cache size for items retrieved from storage. Increase for better performance at the cost of higher memory consumption.
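Both parameters can be tuned when the annotator is created. A minimal sketch, assuming the setter names mirror the parameters above (setWriteBufferSize, setReadCacheSize):

>>> embeddings = WordEmbeddings() \
...     .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
...     .setDimension(4) \
...     .setWriteBufferSize(20000) \
...     .setReadCacheSize(5000)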
See also

SentenceEmbeddings to combine embeddings into a sentence-level representation
Examples
In this example, the file random_embeddings_dim4.txt has the form of the content above.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddings() \
...     .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
...     .setStorageRef("glove_4d") \
...     .setDimension(4) \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings,
...         embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(truncate=False)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+
- class WordEmbeddingsModel(classname='com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel', java_model=None)[source]#
Word Embeddings lookup annotator that maps tokens to vectors.

This is the instantiated model of WordEmbeddings.

Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
The default model is "glove_100d", if no name is provided. For available pretrained models, please see the Models Hub.

For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: WORD_EMBEDDINGS

- Parameters:
- dimension
Number of embedding dimensions
- readCacheSize
Cache size for items retrieved from storage. Increase for better performance at the cost of higher memory consumption.
See also

SentenceEmbeddings to combine embeddings into a sentence-level representation
Notes
There are also two convenient functions to retrieve the embeddings coverage with respect to the transformed dataset:
- withCoverageColumn(): Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.
- overallCoverage(): Calculates overall word coverage for the whole data in the embedded field. This returns a single coverage object considering all rows in the field.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         tokenizer,
...         embeddings,
...         embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.570580005645752,0.44183000922203064,0.7010200023651123,-0.417129993438720...|
|[-0.542639970779419,0.4147599935531616,1.0321999788284302,-0.4024400115013122...|
|[-0.2708599865436554,0.04400600120425224,-0.020260000601410866,-0.17395000159...|
|[0.6191999912261963,0.14650000631809235,-0.08592499792575836,-0.2629800140857...|
|[-0.3397899866104126,0.20940999686717987,0.46347999572753906,-0.6479200124740...|
+--------------------------------------------------------------------------------+
- setReadCacheSize(v)[source]#
Sets cache size for items retrieved from storage. Increase for better performance at the cost of higher memory consumption.
- Parameters:
- v : int
Cache size for items retrieved from storage
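A brief usage sketch, combining this setter with the pretrained loading shown above:

>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings") \
...     .setReadCacheSize(10000)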
- static overallCoverage(dataset, embeddings_col)[source]#
Calculates overall word coverage for the whole data in the embedded field.
This returns a single coverage object considering all rows in the field.
- Parameters:
- dataset : pyspark.sql.DataFrame
The dataset with embeddings column
- embeddings_col : str
Name of the embeddings column
- Returns:
CoverageResult
CoverageResult object with extracted information
Examples
>>> wordsOverallCoverage = WordEmbeddingsModel.overallCoverage(
...     resultDF, "embeddings"
... ).percentage
1.0
- static withCoverageColumn(dataset, embeddings_col, output_col='coverage')[source]#
Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.
- Parameters:
- dataset : pyspark.sql.DataFrame
The dataset with embeddings column
- embeddings_col : str
Name of the embeddings column
- output_col : str, optional
Name for the resulting column, by default 'coverage'
- Returns:
pyspark.sql.DataFrame
DataFrame with calculated coverage
Examples
>>> wordsCoverage = WordEmbeddingsModel.withCoverageColumn(resultDF, "embeddings", "cov_embeddings")
>>> wordsCoverage.select("text", "cov_embeddings").show(truncate=False)
+-------------------+--------------+
|text               |cov_embeddings|
+-------------------+--------------+
|This is a sentence.|[5, 5, 1.0]   |
+-------------------+--------------+
- static pretrained(name='glove_100d', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "glove_100d"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- WordEmbeddingsModel
The restored model
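Examples

Usage follows the pattern shown above; the explicit arguments here simply restate the documented defaults:

>>> embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")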
- static loadStorage(path, spark, storage_ref)[source]#
Loads the model from storage.
- Parameters:
- path : str
Path to the model
- spark : pyspark.sql.SparkSession
The current SparkSession
- storage_ref : str
Identifiers for the model parameters
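Examples

A minimal sketch of the documented signature; the storage path and storage reference below are illustrative assumptions, not shipped resources:

>>> # Load previously saved embeddings storage so that a model with a
>>> # matching storage reference can resolve it.
>>> WordEmbeddingsModel.loadStorage("/path/to/saved/storage", spark, "glove_4d")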