sparknlp.annotator.classifier_dl.camembert_for_token_classification

Contains classes for CamemBertForTokenClassification.

Module Contents

Classes

CamemBertForTokenClassification : Can load CamemBERT Models with a token classification head on top.
- class CamemBertForTokenClassification(classname='com.johnsnowlabs.nlp.annotators.classifier.dl.CamemBertForTokenClassification', java_model=None)
CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
Pretrained models can be loaded with pretrained() of the companion object:

>>> token_classifier = CamemBertForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label")
The default model is "camembert_base_token_classifier_wikiner" if no name is provided. For available pretrained models, please see the Models Hub. To see which models are compatible and how to import them, see Import Transformers into Spark NLP 🚀.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: NAMED_ENTITY
- Parameters:
- batchSize
Batch size. Larger values allow faster processing but require more memory, by default 8
- caseSensitive
Whether to ignore case in tokens for embeddings matching, by default True
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- maxSentenceLength
Max sentence length to process, by default 128
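A minimal sketch of how these parameters can be tuned through their setters; the values here are illustrative only, not recommendations:

>>> token_classifier = CamemBertForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label") \
...     .setBatchSize(16) \
...     .setCaseSensitive(True) \
...     .setMaxSentenceLength(128)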
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> tokenClassifier = CamemBertForTokenClassification.pretrained() \
...     .setInputCols(["token", "document"]) \
...     .setOutputCol("label") \
...     .setCaseSensitive(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     tokenClassifier
... ])
>>> data = spark.createDataFrame([["george washington est allé à washington"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("label.result").show(truncate=False)
+------------------------------+
|result                        |
+------------------------------+
|[I-PER, I-PER, O, O, O, I-LOC]|
+------------------------------+
- setConfigProtoBytes(b)
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- b : List[int]
ConfigProto from tensorflow, serialized into byte array
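As a sketch, such a byte array can be produced with TensorFlow's ConfigProto; this assumes TensorFlow is installed locally, and the gpu_options setting is only an illustrative choice:

>>> import tensorflow as tf
>>> config = tf.compat.v1.ConfigProto()
>>> config.gpu_options.allow_growth = True  # illustrative option only
>>> token_classifier.setConfigProtoBytes(list(config.SerializeToString()))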
- static loadSavedModel(folder, spark_session)
Loads a locally saved model.
- Parameters:
- folder : str
Folder of the saved model
- spark_session : pyspark.sql.SparkSession
The current SparkSession
- Returns:
- CamemBertForTokenClassification
The restored model
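A minimal sketch of the loading workflow, assuming a compatible model has already been exported to the hypothetical folder /tmp/camembert_token_cls (see the import guide linked above for how to produce such an export):

>>> token_classifier = CamemBertForTokenClassification.loadSavedModel(
...     "/tmp/camembert_token_cls",
...     spark
... ).setInputCols(["token", "document"]) \
...     .setOutputCol("label")
>>> # optionally persist as a Spark NLP model (hypothetical path)
>>> token_classifier.write().overwrite().save("/tmp/camembert_token_cls_spark_nlp")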
- static pretrained(name='camembert_base_token_classifier_wikiner', lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "camembert_base_token_classifier_wikiner"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- CamemBertForTokenClassification
The restored model
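For example, the default model can also be requested explicitly by name and language, using the default values documented above:

>>> token_classifier = CamemBertForTokenClassification.pretrained(
...     "camembert_base_token_classifier_wikiner", "en"
... ).setInputCols(["token", "document"]) \
...     .setOutputCol("label")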