`sparknlp.annotator.classifier_dl.distil_bert_for_token_classification`

Contains classes for DistilBertForTokenClassification.

Module Contents

Classes

DistilBertForTokenClassification

class DistilBertForTokenClassification(classname='com.johnsnowlabs.nlp.annotators.classifier.dl.DistilBertForTokenClassification', java_model=None)

DistilBertForTokenClassification can load DistilBERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity Recognition (NER) tasks.

Pretrained models can be loaded with pretrained() of the companion object:

>>> labels = DistilBertForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label")

The default model is "distilbert_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotation types	Output Annotation type
`DOCUMENT, TOKEN`	`NAMED_ENTITY`

Parameters:

batchSize: Batch size. Larger values allow faster processing but require more memory, by default 8
caseSensitive: Whether tokens are matched case-sensitively (set to False to ignore case), by default True
configProtoBytes: ConfigProto from TensorFlow, serialized into a byte array.
maxSentenceLength: Max sentence length to process, by default 128
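
The tunable parameters above can be applied through the annotator's setters. The following is a minimal sketch rather than part of the library: `configure_token_classifier` is a hypothetical helper name, and it assumes the annotator passed in exposes `setBatchSize`, `setMaxSentenceLength` and `setCaseSensitive`, each returning the annotator so calls can be chained.

```python
def configure_token_classifier(annotator, batch_size=8,
                               max_sentence_length=128, case_sensitive=True):
    """Apply batchSize, maxSentenceLength and caseSensitive to an annotator.

    Hypothetical helper: `annotator` is assumed to expose the Spark NLP
    setters used below, each returning the annotator itself.
    The defaults mirror the documented defaults (8, 128, True).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    if max_sentence_length < 1:
        raise ValueError("max_sentence_length must be a positive integer")
    return (
        annotator
        .setBatchSize(batch_size)                  # larger: faster, more memory
        .setMaxSentenceLength(max_sentence_length)
        .setCaseSensitive(case_sensitive)
    )
```

With Spark NLP installed, this could be called as `configure_token_classifier(DistilBertForTokenClassification.pretrained(), batch_size=16)` before setting the input and output columns.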

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> tokenClassifier = DistilBertForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label") \
...     .setCaseSensitive(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     tokenClassifier
... ])
>>> data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
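
The result above is one IOB tag per token (`B-` opens an entity, `I-` continues it, `O` is outside any entity). Inside a pipeline, Spark NLP's NerConverter annotator can merge such tags into entity chunks. The plain-Python sketch below only illustrates the grouping idea. `group_iob_entities` is a hypothetical helper, and the token list is an assumption about how the Tokenizer splits the example sentence.

```python
def group_iob_entities(tokens, labels):
    """Merge IOB-tagged tokens into (text, entity_type) pairs."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        elif prefix in ("B", "I"):
            # "B-" opens a new entity; a stray "I-" is treated the same way.
            flush()
            current_tokens, current_type = [token], entity_type
        else:
            # "O": close any open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Assumed tokenization of the example sentence (20 tokens, 20 labels).
tokens = ["John", "Lenon", "was", "born", "in", "London", "and", "lived",
          "in", "Paris", ".", "My", "name", "is", "Sarah", "and", "I",
          "live", "in", "London"]
labels = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC",
          "O", "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]

print(group_iob_entities(tokens, labels))
# [('John Lenon', 'PER'), ('London', 'LOC'), ('Paris', 'LOC'),
#  ('Sarah', 'PER'), ('London', 'LOC')]
```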

name = 'DistilBertForTokenClassification'

inputAnnotatorTypes

outputAnnotatorType = 'named_entity'

configProtoBytes

getClasses(): Returns the labels used to train this model.

setConfigProtoBytes(b)

Sets the ConfigProto from TensorFlow, serialized into a byte array.

Parameters:

b (List[int]): ConfigProto from TensorFlow, serialized into a byte array

static loadSavedModel(folder, spark_session)

Loads a locally saved model.

Parameters:

folder (str): Folder of the saved model
spark_session (pyspark.sql.SparkSession): The current SparkSession

Returns:

DistilBertForTokenClassification: The restored model

static pretrained(name='distilbert_base_token_classifier_conll03', lang='en', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:

name (str, optional): Name of the pretrained model, by default “distilbert_base_token_classifier_conll03”
lang (str, optional): Language of the pretrained model, by default “en”
remote_loc (str, optional): Remote address of the resource, by default None, in which case Spark NLP's repositories are used.

Returns:

DistilBertForTokenClassification: The restored model
