sparknlp.annotator.ner.ner_dl#

Contains classes for NerDL.

Module Contents#

Classes#

NerDLApproach

This Named Entity recognition annotator allows one to train a generic NER model based on neural networks.

NerDLModel

This Named Entity recognition annotator is a generic NER model based on neural networks.

class NerDLApproach[source]#

This Named Entity recognition annotator allows one to train a generic NER model based on neural networks.

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

For instantiated/pretrained models, see NerDLModel.

The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.

Excluding the label, these columns can be produced with, for example:

  • a SentenceDetector,

  • a Tokenizer and

  • a WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).

Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe with the same required columns as the training dataframe. The pre-processing steps applied to the training dataframe should also be applied to the test dataframe. The following example shows how to create the test dataset from a CoNLL dataset:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> embeddings = WordEmbeddingsModel \
...     .pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("embeddings")
>>> preProcessingPipeline = Pipeline().setStages([documentAssembler, embeddings])
>>> conll = CoNLL()
>>> (train, test) = conll \
...     .readDataset(spark, "src/test/resources/conll2003/eng.train") \
...     .randomSplit([0.8, 0.2])
>>> preProcessingPipeline \
...     .fit(test) \
...     .transform(test) \
...     .write \
...     .mode("overwrite") \
...     .parquet("test_data")
>>> tagger = NerDLApproach() \
...     .setInputCols(["document", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setTestDataset("test_data")
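
To monitor the metrics while training, the evaluation logs can be extended and written out as well (a minimal sketch; the log folder "ner_logs" is a hypothetical path):

>>> tagger = tagger \
...     .setEvaluationLogExtended(True) \
...     .setEnableOutputLogs(True) \
...     .setOutputLogsPath("ner_logs")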

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotation type: NAMED_ENTITY

Parameters:
labelColumn

Column with the label for each token

entities

Entities to recognize

minEpochs

Minimum number of epochs to train, by default 0

maxEpochs

Maximum number of epochs to train, by default 50

verbose

Level of verbosity during training, by default 2

randomSeed

Random seed

lr

Learning Rate, by default 0.001

po

Learning rate decay coefficient. The effective learning rate is lr / (1 + po * epoch), by default 0.005
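
For example, with the defaults lr = 0.001 and po = 0.005, the effective learning rate at epoch 10 is 0.001 / (1 + 0.005 * 10) ≈ 0.00095.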

batchSize

Batch size, by default 8

dropout

Dropout coefficient, by default 0.5

graphFolder

Folder path that contains external graph files

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

useContrib

Whether to use contrib LSTM cells. Not compatible with Windows. Might slightly improve accuracy

validationSplit

Proportion of the training dataset to validate against the model on each epoch. The value should be between 0.0 and 1.0; by default 0.0 (disabled)

evaluationLogExtended

Whether validation logs should be extended, by default False

testDataset

Path to a parquet file of a test dataset. If set, it is used to calculate statistics during training.

includeConfidence

Whether to include confidence scores in annotation metadata, by default False

includeAllConfidenceScores

Whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False

enableOutputLogs

Whether to use stdout in addition to Spark logs, by default False

outputLogsPath

Folder path to save training logs

enableMemoryOptimizer

Whether to optimize for large datasets or not. Enabling this option can slow down training, by default False

useBestModel

Whether to restore and use the model that has achieved the best performance at the end of the training.

bestModelMetric

Whether to use the F1 Micro-average or F1 Macro-average as the final metric for the best model
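
As an illustration of how these parameters combine, the following is a hedged configuration sketch (the chosen values are arbitrary examples, not tuned recommendations):

>>> nerTagger = NerDLApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setLr(0.003) \
...     .setPo(0.005) \
...     .setBatchSize(32) \
...     .setDropout(0.5) \
...     .setValidationSplit(0.1)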

See also

NerCrfApproach

for a generic CRF approach

NerConverter

to further process the results

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline

This CoNLL dataset already includes a sentence, token and label column with their respective annotator types. If a custom dataset is used, these need to be defined with, for example:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")

Then the training can start:

>>> embeddings = BertEmbeddings.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerTagger = NerDLApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setMaxEpochs(1) \
...     .setRandomSeed(0) \
...     .setVerbose(0)
>>> pipeline = Pipeline().setStages([
...     embeddings,
...     nerTagger
... ])

We use the sentences, tokens, and labels from the CoNLL dataset.

>>> conll = CoNLL()
>>> trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> pipelineModel = pipeline.fit(trainingData)
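
The fitted NER stage can then be retrieved from the pipeline model and saved for later use (a minimal sketch; the stage index matches the two-stage pipeline above, and the output path is a hypothetical example):

>>> nerModel = pipelineModel.stages[1]
>>> nerModel.write().overwrite().save("trained_ner_model")
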
setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array
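
As a sketch of how such a byte array might be produced, assuming TensorFlow is installed and its TF1-compatible API is used (the session options shown are arbitrary examples):

>>> import tensorflow as tf
>>> config = tf.compat.v1.ConfigProto()
>>> config.gpu_options.allow_growth = True
>>> approach = NerDLApproach().setConfigProtoBytes(list(config.SerializeToString()))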

setGraphFolder(p)[source]#

Sets folder path that contains external graph files.

Parameters:
p : str

Folder path that contains external graph files

setUseContrib(v)[source]#

Sets whether to use contrib LSTM Cells. Not compatible with Windows. Might slightly improve accuracy.

Parameters:
v : bool

Whether to use contrib LSTM Cells

Raises:
Exception

If run on Windows, as contrib is not supported on that platform

setLr(v)[source]#

Sets Learning Rate, by default 0.001.

Parameters:
v : float

Learning Rate

setPo(v)[source]#

Sets Learning rate decay coefficient, by default 0.005.

The effective learning rate is lr / (1 + po * epoch).

Parameters:
v : float

Learning rate decay coefficient

setBatchSize(v)[source]#

Sets batch size, by default 8.

Parameters:
v : int

Batch size

setDropout(v)[source]#

Sets dropout coefficient, by default 0.5.

Parameters:
v : float

Dropout coefficient

setIncludeConfidence(value)[source]#

Sets whether to include confidence scores in annotation metadata, by default False.

Parameters:
value : bool

Whether to include the confidence value in the output.

setIncludeAllConfidenceScores(value)[source]#

Sets whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False.

Parameters:
value : bool

Whether to include all confidence scores in annotation metadata or just the score of the predicted tag

setEnableMemoryOptimizer(value)[source]#

Sets whether to optimize for large datasets or not, by default False. Enabling this option can slow down training.

Parameters:
value : bool

Whether to optimize for large datasets

setUseBestModel(value)[source]#

Sets whether to restore and use the model that has achieved the best performance at the end of training. The metric monitored is F1 on the testDataset; if that is not set, the validationSplit is used, and if neither is set, the training loss is monitored.

Parameters:
value : bool

Whether to restore and use the model that has achieved the best performance at the end of the training.

setBestModelMetric(value)[source]#

Sets whether to use the F1 Micro-average or F1 Macro-average as the final metric for the best model when setUseBestModel is True.

Parameters:
value : str

Whether to use the F1 Micro-average or F1 Macro-average as the final metric for the best model

class NerDLModel(classname='com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel', java_model=None)[source]#

This Named Entity recognition annotator is a generic NER model based on neural networks.

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

This is the instantiated model of the NerDLApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> nerModel = NerDLModel.pretrained() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")

The default model is "ner_dl", if no name is provided.

For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.

Note that some pretrained models require specific types of embeddings, depending on the embeddings they were trained with. For example, the default model "ner_dl" requires the WordEmbeddings "glove_100d".
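
For example, to match the default model, the "glove_100d" embeddings can be loaded explicitly:

>>> embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")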

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotation type: NAMED_ENTITY

Parameters:
batchSize

Size of every batch, by default 8

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

includeConfidence

Whether to include confidence scores in annotation metadata, by default False

includeAllConfidenceScores

Whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False

classes

Tags used to train this NerDLModel

See also

NerCrfModel

for a generic CRF approach

NerConverter

to further process the results

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

First extract the prerequisites for the NerDLModel

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")

Then the NER tags can be extracted:

>>> nerTagger = NerDLModel.pretrained() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     nerTagger
... ])
>>> data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[B-ORG, O, O, B-PER, O, O, B-LOC, O]|
+------------------------------------+
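
To group the predicted IOB tags into whole entity chunks, a NerConverter stage can be appended to the pipeline (a minimal sketch; the "entities" column name is an arbitrary choice):

>>> converter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("entities")
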
setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setIncludeConfidence(value)[source]#

Sets whether to include confidence scores in annotation metadata, by default False.

Parameters:
value : bool

Whether to include the confidence value in the output.

setIncludeAllConfidenceScores(value)[source]#

Sets whether to include all confidence scores in annotation metadata or just the score of the predicted tag, by default False.

Parameters:
value : bool

Whether to include all confidence scores in annotation metadata or just the score of the predicted tag

static pretrained(name='ner_dl', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “ner_dl”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Otherwise, Spark NLP's repositories are used.

Returns:
NerDLModel

The restored model
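
For instance, a different model could be loaded by name and language (whether a given name, such as "ner_dl_bert" below, is available should be verified on the Models Hub):

>>> nerModel = NerDLModel.pretrained("ner_dl_bert", "en") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")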