package dl
Type Members
- class NerDLApproach extends AnnotatorApproach[NerDLModel] with NerApproach[NerDLApproach] with Logging with ParamsAndFeaturesWritable with EvaluationDLParams
This Named Entity recognition annotator allows one to train a generic NER model based on neural networks. The architecture of the neural network is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.
For instantiated/pretrained models, see NerDLModel.
The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.
Excluding the label, this can be done with, for example,
- a SentenceDetector,
- a Tokenizer and
- a WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).
Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example will show how to create the test dataset with a CoNLL dataset:

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = WordEmbeddingsModel
  .pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))

val conll = CoNLL()
val Array(train, test) = conll
  .readDataset(spark, "src/test/resources/conll2003/eng.train")
  .randomSplit(Array(0.8, 0.2))

preProcessingPipeline
  .fit(test)
  .transform(test)
  .write
  .mode("overwrite")
  .parquet("test_data")

val nerTagger = new NerDLApproach()
  .setInputCols("document", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setTestDataset("test_data")
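Since NerDLApproach also mixes in EvaluationDLParams, the evaluation on the test set can be logged while training. A minimal sketch, assuming the "test_data" parquet written above (the logs path is a hypothetical folder name):

// A minimal sketch, assuming the "test_data" parquet created above.
// These setters come from the EvaluationDLParams trait.
val nerTaggerWithLogs = new NerDLApproach()
  .setInputCols("document", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setTestDataset("test_data")
  .setEvaluationLogExtended(true) // per-entity precision, recall and f1
  .setEnableOutputLogs(true)      // persist the training logs
  .setOutputLogsPath("ner_logs")  // hypothetical output folder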
For extended examples of usage, see the Examples and the NerDLSpec.
Example
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

// This CoNLL dataset already includes a sentence, token and label
// column with their respective annotator types. If a custom dataset is used,
// these need to be defined with for example:
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Then the training can start
val embeddings = BertEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(1)
  .setRandomSeed(0)
  .setVerbose(0)

val pipeline = new Pipeline().setStages(Array(
  embeddings,
  nerTagger
))

// We use the sentences, tokens and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
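The ner column produced by this pipeline contains raw IOB tags. To group them into entity chunks, a NerConverter stage (see also below) can be appended after nerTagger; a minimal sketch:

// A minimal sketch: NerConverter merges IOB tags into entity chunks.
import com.johnsnowlabs.nlp.annotators.ner.NerConverter

val nerConverter = new NerConverter()
  .setInputCols("sentence", "token", "ner")
  .setOutputCol("ner_chunk")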
- See also
NerCrfApproach for a generic CRF approach
NerConverter to further process the results
- class NerDLModel extends AnnotatorModel[NerDLModel] with HasBatchedAnnotate[NerDLModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable with HasEngine
This Named Entity recognition annotator is a generic NER model based on neural networks. The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.
This is the instantiated model of the NerDLApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained of the companion object:

val nerModel = NerDLModel.pretrained()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
The default model is "ner_dl", if no name is provided. For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.
Note that some pretrained models require specific types of embeddings, depending on which embeddings they were trained on. For example, the default model "ner_dl" requires the WordEmbeddings "glove_100d".
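A matching pair can be loaded explicitly; a minimal sketch, using the default English "glove_100d" and "ner_dl" models mentioned above:

// A minimal sketch: load the "glove_100d" embeddings that the default
// "ner_dl" model was trained on, so the embeddings match what it expects.
val glove = WordEmbeddingsModel.pretrained("glove_100d", "en")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val ner = NerDLModel.pretrained("ner_dl", "en")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")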
For extended examples of usage, see the Examples and the NerDLSpec.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import org.apache.spark.ml.Pipeline

// First extract the prerequisites for the NerDLModel
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("bert")

// Then NER can be extracted
val nerTagger = NerDLModel.pretrained()
  .setInputCols("sentence", "token", "bert")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

val data = Seq("U.N. official Ekeus heads for Baghdad.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("ner.result").show(false)
+------------------------------------+
|result                              |
+------------------------------------+
|[B-ORG, O, O, B-PER, O, O, B-LOC, O]|
+------------------------------------+
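For quick, ad-hoc inference on plain strings, the fitted pipeline can also be wrapped in a LightPipeline; a minimal sketch, reusing the model fitted in the example above:

// A minimal sketch: LightPipeline avoids DataFrame overhead for small inputs.
import com.johnsnowlabs.nlp.LightPipeline

val lightModel = new LightPipeline(pipeline.fit(data))
val annotations = lightModel.annotate("U.N. official Ekeus heads for Baghdad.")
// annotations("ner") holds the predicted IOB tags as plain strings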
- See also
NerCrfModel for a generic CRF approach
NerConverter to further process the results
- trait ReadZeroShotNerDLModel extends ReadTensorflowModel with ReadOnnxModel
- trait ReadablePretrainedNerDL extends ParamsAndFeaturesReadable[NerDLModel] with HasPretrained[NerDLModel]
- trait ReadablePretrainedZeroShotNer extends ParamsAndFeaturesReadable[ZeroShotNerModel] with HasPretrained[ZeroShotNerModel]
- trait ReadsNERGraph extends ParamsAndFeaturesReadable[NerDLModel] with ReadTensorflowModel
- trait WithGraphResolver extends AnyRef
- class ZeroShotNerModel extends RoBertaForQuestionAnswering
ZeroShotNerModel implements zero-shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task. Its input is a list of document annotations and it automatically generates questions which are used to recognize entities. The definition of entities is given by a dictionary structure, specifying a set of questions for each entity. The model is based on RoBertaForQuestionAnswering.
For more extended examples, see the Examples.
Pretrained models can be loaded with pretrained of the companion object:

val zeroShotNer = ZeroShotNerModel.pretrained()
  .setInputCols("document")
  .setOutputCol("zero_shot_ner")
For available pretrained models please see the Models Hub.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.ner.dl.ZeroShotNerModel
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.col

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")

val zeroShotNer = ZeroShotNerModel
  .pretrained()
  .setEntityDefinitions(
    Map(
      "NAME" -> Array("What is his name?", "What is her name?"),
      "CITY" -> Array("Which city?")))
  .setPredictionThreshold(0.01f)
  .setInputCols("sentences")
  .setOutputCol("zero_shot_ner")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    zeroShotNer))

val model = pipeline.fit(Seq("").toDS.toDF("text"))
val results = model.transform(
  Seq("Clara often travels between New York and Paris.").toDS.toDF("text"))

results
  .selectExpr("document", "explode(zero_shot_ner) AS entity")
  .select(
    col("entity.result"),
    col("entity.metadata.word"),
    col("entity.metadata.sentence"),
    col("entity.begin"),
    col("entity.end"),
    col("entity.metadata.confidence"),
    col("entity.metadata.question"))
  .show(truncate = false)

+------+-----+--------+-----+---+----------+-----------+
|result|word |sentence|begin|end|confidence|question   |
+------+-----+--------+-----+---+----------+-----------+
|B-CITY|Paris|0       |41   |45 |0.78655756|Which city?|
|B-CITY|New  |0       |28   |30 |0.29346612|Which city?|
|I-CITY|York |0       |32   |35 |0.29346612|Which city?|
+------+-----+--------+-----+---+----------+-----------+
- See also
https://arxiv.org/abs/1907.11692 for details about the RoBERTa transformer
RoBertaForQuestionAnswering for the Spark NLP implementation of RoBERTa question answering
Value Members
- object LoadsContrib
- object NerDLApproach extends DefaultParamsReadable[NerDLApproach] with WithGraphResolver with Serializable
This is the companion object of NerDLApproach. Please refer to that class for the documentation.
- object NerDLModel extends ReadablePretrainedNerDL with ReadsNERGraph with Serializable
This is the companion object of NerDLModel. Please refer to that class for the documentation.
- object NerDLModelPythonReader
- object ZeroShotNerModel extends ReadablePretrainedZeroShotNer with ReadZeroShotNerDLModel with Serializable