sparknlp.annotator.embeddings.modernbert_embeddings

Contains classes for ModernBertEmbeddings.

Module Contents

Classes

ModernBertEmbeddings

Token-level embeddings using ModernBERT.

class ModernBertEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.ModernBertEmbeddings', java_model=None)

Token-level embeddings using ModernBERT.

ModernBERT is a modernized bidirectional encoder model that is 8x faster, uses 5x less memory, and achieves better downstream performance than traditional BERT models. ModernBERT incorporates modern improvements including Flash Attention, unpadding, and GeGLU activation functions.

Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = ModernBertEmbeddings.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("modernbert_embeddings")

If no name is provided, the default model is "modernbert-base".

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: WORD_EMBEDDINGS

Parameters:
batchSize

Size of every batch, by default 8

dimension

Number of embedding dimensions, by default 768

caseSensitive

Whether tokens are treated as case-sensitive when matching embeddings, by default False (case is ignored)

maxSentenceLength

Max sentence length to process, by default 8192

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.
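The effect of batchSize and maxSentenceLength can be pictured with a small pure-Python sketch. This is not Spark NLP's internal implementation: `prepare_batches` is a hypothetical helper, and it assumes that tokens beyond the length limit are simply dropped. The real annotator works on word pieces inside Spark, which this sketch ignores.

```python
from typing import Iterator, List


def prepare_batches(
    sentences: List[List[str]],
    batch_size: int = 8,
    max_sentence_length: int = 8192,
) -> Iterator[List[List[str]]]:
    """Yield groups of token lists, each list cut to max_sentence_length.

    Hypothetical illustration only; the names mirror the batchSize and
    maxSentenceLength parameters but do not call Spark NLP.
    """
    if batch_size < 1 or max_sentence_length < 1:
        raise ValueError("batch_size and max_sentence_length must be positive")
    # Assumption: tokens past the limit are discarded.
    truncated = [tokens[:max_sentence_length] for tokens in sentences]
    # Group the truncated sentences into consecutive batches.
    for start in range(0, len(truncated), batch_size):
        yield truncated[start:start + batch_size]


sentences = [["This", "is", "a", "sentence", "."], ["Short", "one"], ["Third"]]
batches = list(prepare_batches(sentences, batch_size=2, max_sentence_length=3))
print(batches)
```

With these toy values, three sentences are split into two batches, and the first sentence keeps only its first three tokens.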

References

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Applications

https://huggingface.co/answerdotai/ModernBERT-base

Paper abstract

We introduce ModernBERT, a modernized bidirectional encoder model that is 8x faster, uses 5x less memory, and achieves better downstream performance than traditional BERT models. ModernBERT incorporates modern improvements including Flash Attention, unpadding, and GeGLU activation functions. The model supports sequence lengths up to 8192 tokens while maintaining competitive performance on tasks requiring long context understanding.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = ModernBertEmbeddings.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       tokenizer,
...       embeddings,
...       embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.8951656818389893,0.13753339648246765,0.11818419396877289,-0.6969502568244...|
|[-0.9860016107559204,-0.6775270700454712,-0.046373113244771957,-1.5230885744094...|
|[-0.9671071767807007,-0.17220760881900787,-0.09954319149255753,-1.1178797483444...|
|[-0.9847850799560547,-0.6675535440444946,-0.06431620568037033,-1.4423584938049...|
|[-0.8978064060211182,0.16901421546936035,0.1306578516960144,-0.6813133358955383...|
+--------------------------------------------------------------------------------+
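Once the finished embeddings are collected to the driver, each row is an ordinary numeric vector, so tokens can be compared with cosine similarity. The following is a minimal sketch in plain Python. `cosine_similarity` is a hypothetical helper, and the two vectors are only the first three components of the first and last rows shown above, used for illustration rather than as real 768-dimensional embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Truncated copies of two rows from the output above (illustration only).
first_row = [-0.8951656818389893, 0.13753339648246765, 0.11818419396877289]
last_row = [-0.8978064060211182, 0.16901421546936035, 0.1306578516960144]

similarity = cosine_similarity(first_row, last_row)
print(f"{similarity:.4f}")
```

For real use, the full vectors would be passed instead of these truncated samples.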
name = 'ModernBertEmbeddings'
inputAnnotatorTypes
outputAnnotatorType = 'word_embeddings'
maxSentenceLength
configProtoBytes
setConfigProtoBytes(b)

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setMaxSentenceLength(value)

Sets max sentence length to process.

Parameters:
value : int

Max sentence length to process
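Because the documented default is 8192, it can be useful to validate a requested length before passing it to the setter. The sketch below shows one way to do that. `set_checked_max_length` and `MODEL_MAX_LENGTH` are hypothetical names, and the helper assumes only that the annotator exposes the `setMaxSentenceLength` method described here and that the method returns the annotator.

```python
MODEL_MAX_LENGTH = 8192  # default value stated in this documentation


def set_checked_max_length(annotator, value, limit=MODEL_MAX_LENGTH):
    """Validate `value`, then forward it to annotator.setMaxSentenceLength.

    Hypothetical helper, not part of Spark NLP. Any object with a
    setMaxSentenceLength(value) method can be passed in.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("max sentence length must be an int")
    if value < 1:
        raise ValueError("max sentence length must be positive")
    if value > limit:
        raise ValueError(f"max sentence length must not exceed {limit}")
    return annotator.setMaxSentenceLength(value)


# Intended usage (requires Spark NLP and a downloaded model):
# embeddings = set_checked_max_length(ModernBertEmbeddings.pretrained(), 512)
```

Whether a hard upper bound is appropriate depends on the model that is loaded, so the `limit` argument is left configurable.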

static loadSavedModel(folder, spark_session, use_openvino=False)

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

The current SparkSession

use_openvino : bool

Whether to use the OpenVINO backend, by default False

Returns:
ModernBertEmbeddings

The restored model

static pretrained(name='modernbert-base', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “modernbert-base”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None, in which case the Spark NLP repositories are used.

Returns:
ModernBertEmbeddings

The restored model