sparknlp.annotator.embeddings.word_embeddings

Contains classes for WordEmbeddings.

Module Contents

Classes

WordEmbeddings

Word Embeddings lookup annotator that maps tokens to vectors.

WordEmbeddingsModel

Word Embeddings lookup annotator that maps tokens to vectors.

class WordEmbeddings[source]

Word Embeddings lookup annotator that maps tokens to vectors.

For instantiated/pretrained models, see WordEmbeddingsModel.

A custom token lookup dictionary for embeddings can be set with setStoragePath(). Each line of the provided file needs to have a token, followed by its vector representation, delimited by spaces:

...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...
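For illustration, a file in this format could be generated with plain Python; the file name and the shortened vector values below are only placeholders for the sketch:

>>> from pathlib import Path
>>> # each line: a token followed by its space-delimited vector components
>>> lines = [
...     "are 0.3965 0.6309 0.5393 0.8428",
...     "were 0.7535 0.9699 0.1039 0.1183",
...     "stress 0.0492 0.9415 0.4762 0.1679",
...     "induced 0.1535 0.3349 0.9235 0.1158",
... ]
>>> _ = Path("random_embeddings_dim4.txt").write_text("\n".join(lines) + "\n")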

If a token is not found in the dictionary, the result will be a zero vector of the same dimension. Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn() and WordEmbeddingsModel.overallCoverage().

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: WORD_EMBEDDINGS

Parameters:
writeBufferSize

Buffer size limit before dumping to disk storage while writing, by default 10000

readCacheSize

Cache size for items retrieved from storage. Increase for better performance at the cost of higher memory consumption

See also

SentenceEmbeddings

to combine embeddings into a sentence-level representation

Examples

In this example, the file random_embeddings_dim4.txt has the form shown above.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddings() \
...     .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
...     .setStorageRef("glove_4d") \
...     .setDimension(4) \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       tokenizer,
...       embeddings,
...       embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(truncate=False)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+
setWriteBufferSize(v)[source]

Sets buffer size limit before dumping to disk storage while writing, by default 10000.

Parameters:
v : int

Buffer size limit
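A minimal usage sketch; the buffer size value here is only illustrative:

>>> embeddings = WordEmbeddings() \
...     .setStoragePath("random_embeddings_dim4.txt", ReadAs.TEXT) \
...     .setDimension(4) \
...     .setWriteBufferSize(20000)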

setReadCacheSize(v)[source]

Sets cache size for items retrieved from storage. Increase for better performance at the cost of higher memory consumption.

Parameters:
v : int

Cache size for items retrieved from storage
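For example, to trade memory for faster lookups (the value is only illustrative):

>>> embeddings = WordEmbeddings() \
...     .setReadCacheSize(50000)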

class WordEmbeddingsModel(classname='com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel', java_model=None)[source]

Word Embeddings lookup annotator that maps tokens to vectors.

This is the instantiated model of WordEmbeddings.

Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = WordEmbeddingsModel.pretrained() \
...       .setInputCols(["document", "token"]) \
...       .setOutputCol("embeddings")

The default model is "glove_100d" if no name is provided. For available pretrained models, please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: WORD_EMBEDDINGS

Parameters:
dimension

Number of embedding dimensions

readCacheSize

Cache size for items retrieved from storage. Increase for better performance at the cost of higher memory consumption

See also

SentenceEmbeddings

to combine embeddings into a sentence-level representation

Notes

There are also two convenient functions to retrieve the embeddings coverage with respect to the transformed dataset:

  • withCoverageColumn(): Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.

  • overallCoverage(): Calculates overall word coverage for the whole data in the embedded field. This returns a single coverage object considering all rows in the field.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       tokenizer,
...       embeddings,
...       embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.570580005645752,0.44183000922203064,0.7010200023651123,-0.417129993438720...|
|[-0.542639970779419,0.4147599935531616,1.0321999788284302,-0.4024400115013122...|
|[-0.2708599865436554,0.04400600120425224,-0.020260000601410866,-0.17395000159...|
|[0.6191999912261963,0.14650000631809235,-0.08592499792575836,-0.2629800140857...|
|[-0.3397899866104126,0.20940999686717987,0.46347999572753906,-0.6479200124740...|
+--------------------------------------------------------------------------------+
setReadCacheSize(v)[source]

Sets cache size for items retrieved from storage. Increase for better performance at the cost of higher memory consumption.

Parameters:
v : int

Cache size for items retrieved from storage
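For example (the cache size is only illustrative):

>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings") \
...     .setReadCacheSize(10000)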

static overallCoverage(dataset, embeddings_col)[source]

Calculates overall word coverage for the whole data in the embedded field.

This returns a single coverage object considering all rows in the field.

Parameters:
dataset : pyspark.sql.DataFrame

The dataset with embeddings column

embeddings_col : str

Name of the embeddings column

Returns:
CoverageResult

CoverageResult object with extracted information

Examples

>>> wordsOverallCoverage = WordEmbeddingsModel.overallCoverage(
...     resultDF,"embeddings"
... ).percentage
1.0
static withCoverageColumn(dataset, embeddings_col, output_col='coverage')[source]

Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.

Parameters:
dataset : pyspark.sql.DataFrame

The dataset with embeddings column

embeddings_col : str

Name of the embeddings column

output_col : str, optional

Name for the resulting column, by default ‘coverage’

Returns:
pyspark.sql.DataFrame

Dataframe with calculated coverage

Examples

>>> wordsCoverage = WordEmbeddingsModel.withCoverageColumn(resultDF, "embeddings", "cov_embeddings")
>>> wordsCoverage.select("text","cov_embeddings").show(truncate=False)
+-------------------+--------------+
|text               |cov_embeddings|
+-------------------+--------------+
|This is a sentence.|[5, 5, 1.0]   |
+-------------------+--------------+
static pretrained(name='glove_100d', lang='en', remote_loc=None)[source]

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “glove_100d”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
WordEmbeddingsModel

The restored model
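For example, requesting the default English GloVe model explicitly by name and language:

>>> embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")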

static loadStorage(path, spark, storage_ref)[source]

Loads the model from storage.

Parameters:
path : str

Path to the model

spark : pyspark.sql.SparkSession

The current SparkSession

storage_ref : str

Identifiers for the model parameters
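Examples

A minimal sketch; the path below is a placeholder, and the storage reference must match the one the embeddings were saved with (for example the "glove_4d" reference used earlier on this page):

>>> WordEmbeddingsModel.loadStorage(
...     "/tmp/glove_4d_storage", spark, "glove_4d"
... )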