sparknlp.annotator.classifier_dl.deberta_for_token_classification#
Contains classes for DeBertaForTokenClassification.
Module Contents#
Classes#
DeBertaForTokenClassification: can load DeBERTa v2&v3 Models with a token classification head on top.
- class DeBertaForTokenClassification(classname='com.johnsnowlabs.nlp.annotators.classifier.dl.DeBertaForTokenClassification', java_model=None)[source]#
DeBertaForTokenClassification can load DeBERTa v2&v3 Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
Pretrained models can be loaded with pretrained() of the companion object:

>>> embeddings = DeBertaForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label")
The default model is "deberta_v3_xsmall_token_classifier_conll03", if no name is provided.
For available pretrained models, please see the Models Hub.
To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: NAMED_ENTITY
- Parameters:
- batchSize
Batch size. Larger values allow faster processing but require more memory, by default 8
- caseSensitive
Whether to ignore case in tokens for embeddings matching, by default True
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- maxSentenceLength
Max sentence length to process, by default 128
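The parameters above are configured through the corresponding setter methods. A minimal sketch (the values shown are the documented defaults and purely illustrative):

>>> tokenClassifier = DeBertaForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label") \
...     .setBatchSize(8) \
...     .setCaseSensitive(True) \
...     .setMaxSentenceLength(128)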
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> tokenClassifier = DeBertaForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label") \
...     .setCaseSensitive(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     tokenClassifier
... ])
>>> data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                               |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- bList[int]
ConfigProto from tensorflow, serialized into byte array
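A minimal sketch of producing such a byte array with TensorFlow (assumes TensorFlow is installed; the ConfigProto options shown are illustrative, not required):

>>> import tensorflow as tf
>>> config = tf.compat.v1.ConfigProto()
>>> config.gpu_options.allow_growth = True
>>> tokenClassifier = DeBertaForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label") \
...     .setConfigProtoBytes(list(config.SerializeToString()))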
- static loadSavedModel(folder, spark_session)[source]#
Loads a locally saved model.
- Parameters:
- folderstr
Folder of the saved model
- spark_sessionpyspark.sql.SparkSession
The current SparkSession
- Returns:
- DeBertaForTokenClassification
The restored model
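A minimal sketch of restoring a locally exported model (the folder path is hypothetical and should point to a model exported for Spark NLP, see Import Transformers into Spark NLP; spark is the current SparkSession):

>>> tokenClassifier = DeBertaForTokenClassification \
...     .loadSavedModel("/path/to/exported_model", spark) \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label")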
- static pretrained(name='deberta_v3_xsmall_token_classifier_conll03', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- namestr, optional
Name of the pretrained model, by default “deberta_v3_xsmall_token_classifier_conll03”
- langstr, optional
Language of the pretrained model, by default “en”
- remote_locstr, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- DeBertaForTokenClassification
The restored model
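A minimal sketch of requesting a pretrained model explicitly by name and language (using the default model name documented above):

>>> tokenClassifier = DeBertaForTokenClassification \
...     .pretrained("deberta_v3_xsmall_token_classifier_conll03", lang="en") \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label")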