`sparknlp.annotator.embeddings.modernbert_embeddings`

Contains classes for ModernBertEmbeddings.

Module Contents

Classes

ModernBertEmbeddings

Token-level embeddings using ModernBERT.

class ModernBertEmbeddings(classname='com.johnsnowlabs.nlp.embeddings.ModernBertEmbeddings', java_model=None)

Token-level embeddings using ModernBERT.

ModernBERT is a modernized bidirectional encoder model that is 8x faster, uses 5x less memory, and achieves better downstream performance than traditional BERT models. ModernBERT incorporates modern improvements including Flash Attention, unpadding, and GeGLU activation functions.

Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = ModernBertEmbeddings.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("modernbert_embeddings")

If no name is provided, the default model is "modernbert-base".

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotation types	Output Annotation type
`DOCUMENT, TOKEN`	`WORD_EMBEDDINGS`

Parameters:

batchSize: Size of every batch, by default 8
dimension: Number of embedding dimensions, by default 768
caseSensitive: Whether to ignore case in tokens for embeddings matching, by default False
maxSentenceLength: Max sentence length to process, by default 8192
configProtoBytes: ConfigProto from TensorFlow, serialized into a byte array.
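
The parameters above are normally set through the annotator's setter methods. The following is a minimal sketch rather than library code: `configure_embeddings` is a hypothetical helper, `setMaxSentenceLength` is documented on this page, and `setBatchSize` and `setCaseSensitive` are assumed to be available as on other Spark NLP embeddings annotators. The helper only calls setters on the object it receives, so the definition itself runs without Spark.

```python
def configure_embeddings(annotator, batch_size=8, max_sentence_length=8192,
                         case_sensitive=False):
    """Apply common settings to an embeddings annotator and return it.

    Hypothetical helper: the defaults mirror the values listed above.
    setBatchSize and setCaseSensitive are assumed to exist on the annotator.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if max_sentence_length < 1:
        raise ValueError("max_sentence_length must be at least 1")

    # Each setter is called on its own line, so the helper does not rely on
    # the setters returning the annotator for chaining.
    annotator.setBatchSize(batch_size)
    annotator.setMaxSentenceLength(max_sentence_length)
    annotator.setCaseSensitive(case_sensitive)
    return annotator
```

With Spark NLP installed, a call might look like `configure_embeddings(ModernBertEmbeddings.pretrained(), batch_size=4, max_sentence_length=512)`. The values 4 and 512 are illustrative only.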

References

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Applications

https://huggingface.co/answerdotai/ModernBERT-base

Paper abstract

We introduce ModernBERT, a modernized bidirectional encoder model that is 8x faster, uses 5x less memory, and achieves better downstream performance than traditional BERT models. ModernBERT incorporates modern improvements including Flash Attention, unpadding, and GeGLU activation functions. The model supports sequence lengths up to 8192 tokens while maintaining competitive performance on tasks requiring long context understanding.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = ModernBertEmbeddings.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True) \
...     .setCleanAnnotations(False)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       tokenizer,
...       embeddings,
...       embeddingsFinisher
...     ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.8951656818389893,0.13753339648246765,0.11818419396877289,-0.6969502568244...|
|[-0.9860016107559204,-0.6775270700454712,-0.046373113244771957,-1.5230885744094...|
|[-0.9671071767807007,-0.17220760881900787,-0.09954319149255753,-1.1178797483444...|
|[-0.9847850799560547,-0.6675535440444946,-0.06431620568037033,-1.4423584938049...|
|[-0.8978064060211182,0.16901421546936035,0.1306578516960144,-0.6813133358955383...|
+--------------------------------------------------------------------------------+
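
Each row in the output above is the vector for one token. As a rough sketch of how such vectors could be handled after collecting them to the driver, the hypothetical helper below converts them to plain Python lists and checks their length against the documented default dimension of 768. It accepts either Spark ML vectors (which expose `toArray()`) or plain sequences; the helper name and its error message are assumptions for illustration.

```python
def token_vectors(finished_embeddings, expected_dim=768):
    """Convert one row's finished embeddings into plain lists of floats.

    Hypothetical helper. Accepts Spark ML vectors (which expose toArray())
    or plain sequences, and checks that every token vector has
    `expected_dim` entries (768 is the documented default dimension).
    """
    vectors = []
    for position, vector in enumerate(finished_embeddings):
        # Spark ML vectors are converted via toArray(); lists pass through.
        values = vector.toArray() if hasattr(vector, "toArray") else vector
        values = [float(x) for x in values]
        if len(values) != expected_dim:
            raise ValueError(
                f"token {position}: expected {expected_dim} values, "
                f"got {len(values)}"
            )
        vectors.append(values)
    return vectors
```

Assuming `result` is the DataFrame produced above, `token_vectors(result.first()["finished_embeddings"])` would return one list per token of the first row.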

name = 'ModernBertEmbeddings'

inputAnnotatorTypes

outputAnnotatorType = 'word_embeddings'

maxSentenceLength

configProtoBytes

setConfigProtoBytes(b)

Sets the ConfigProto from TensorFlow, serialized into a byte array.

Parameters:

b (List[int]): ConfigProto from TensorFlow, serialized into a byte array

setMaxSentenceLength(value)

Sets max sentence length to process.

Parameters:

value (int): Max sentence length to process

static loadSavedModel(folder, spark_session, use_openvino=False)

Loads a locally saved model.

Parameters:

folder (str): Folder of the saved model
spark_session (pyspark.sql.SparkSession): The current SparkSession
use_openvino (bool): Use OpenVINO backend

Returns:

ModernBertEmbeddings: The restored model
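
A minimal sketch of how `loadSavedModel` might be wrapped, assuming a model has already been exported to a local folder. The helper name `load_local_modernbert` and its `annotator_cls` argument are hypothetical; the latter exists only so the function can be exercised without Spark NLP installed. When it is left as `None`, the real class is imported lazily.

```python
def load_local_modernbert(folder, spark_session, annotator_cls=None,
                          use_openvino=False):
    """Load an exported ModernBERT model from `folder` and set its columns.

    `annotator_cls` is a hypothetical injection point for testing; when it
    is None, the real ModernBertEmbeddings class is imported lazily.
    """
    if annotator_cls is None:
        from sparknlp.annotator import ModernBertEmbeddings
        annotator_cls = ModernBertEmbeddings

    # Signature as documented: loadSavedModel(folder, spark_session, use_openvino)
    model = annotator_cls.loadSavedModel(
        folder, spark_session, use_openvino=use_openvino
    )
    # Same column names as in the pipeline example on this page.
    model.setInputCols(["document", "token"])
    model.setOutputCol("embeddings")
    return model
```

The returned annotator can then be used as a pipeline stage in place of the `pretrained()` call, for example `load_local_modernbert("/path/to/exported_model", spark)`, where the path is a placeholder.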

static pretrained(name='modernbert-base', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:

name (str, optional): Name of the pretrained model, by default "modernbert-base"
lang (str, optional): Language of the pretrained model, by default "en"
remote_loc (str, optional): Optional remote address of the resource, by default None. Will use Spark NLP repositories otherwise.

Returns:

ModernBertEmbeddings: The restored model
