sparknlp.annotator.ner.zero_shot_ner_model
#
Module Contents#
Classes#
ZeroShotNerModel: implements zero shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task
- class ZeroShotNerModel(classname='com.johnsnowlabs.nlp.annotators.ner.dl.ZeroShotNerModel', java_model=None)[source]#
ZeroShotNerModel implements zero shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task.
Its input is a list of document annotations, and it automatically generates questions which are used to recognize entities. The definitions of the entities are given by a dictionary structure, specifying a set of questions for each entity. The model is based on RoBertaForQuestionAnswering.
For more extended examples see the Examples.
Pretrained models can be loaded with the pretrained method of the companion object:

zeroShotNer = ZeroShotNerModel.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("zero_shot_ner")
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: NAMED_ENTITY
- Parameters:
- entityDefinitions
A dictionary with definitions of named entities. The keys of dictionary are the entity labels and the values are lists of questions. For example:
{
    "CITY": ["Which city?", "Which town?"],
    "NAME": ["What is her name?", "What is his name?"]
}
- predictionThreshold
Minimal confidence score to encode an entity (default: 0.01).
- ignoreEntities
A list of entity labels which are discarded from the output.
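To make the interplay of these parameters concrete, here is a minimal plain-Python sketch of the filtering semantics. This is an illustration only, not the library's internal code; the function name `filter_candidates` is hypothetical.

```python
# Conceptual sketch (NOT Spark NLP internals): candidates below the
# prediction threshold or with an ignored label are dropped.

def filter_candidates(candidates, prediction_threshold=0.01, ignore_entities=()):
    """candidates: list of (label, word, confidence) tuples.
    Keeps those that clear the threshold and whose label is not ignored."""
    return [
        (label, word, conf)
        for (label, word, conf) in candidates
        if conf >= prediction_threshold and label not in ignore_entities
    ]

candidates = [
    ("NAME", "Clara", 0.936),
    ("CITY", "Paris", 0.533),
    ("CITY", "London", 0.005),  # below the 0.01 threshold, dropped
]

kept = filter_candidates(candidates,
                         prediction_threshold=0.01,
                         ignore_entities=["NAME"])
# only ("CITY", "Paris", 0.533) survives: "London" falls under the
# threshold and "NAME" entities are on the ignore list
```
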
References
RoBERTa: A Robustly Optimized BERT Pretraining Approach: for details about the RoBERTa transformer
RoBertaForQuestionAnswering: for the Spark NLP implementation of RoBERTa question answering
Examples
>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence_detector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> zero_shot_ner = ZeroShotNerModel.pretrained() \
...     .setEntityDefinitions(
...         {
...             "NAME": ["What is his name?", "What is my name?", "What is her name?"],
...             "CITY": ["Which city?", "Which is the city?"]
...         }) \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("zero_shot_ner")
>>> data = spark.createDataFrame(
...     [["My name is Clara, I live in New York and Hellen lives in Paris."]]
... ).toDF("text")
>>> Pipeline() \
...     .setStages([document_assembler, sentence_detector, tokenizer, zero_shot_ner]) \
...     .fit(data) \
...     .transform(data) \
...     .selectExpr("document", "explode(zero_shot_ner) AS entity") \
...     .select(
...         "document.result",
...         "entity.result",
...         "entity.metadata.word",
...         "entity.metadata.confidence",
...         "entity.metadata.question") \
...     .show(truncate=False)
+-----------------------------------------------------------------+------+------+----------+------------------+
|result                                                           |result|word  |confidence|question          |
+-----------------------------------------------------------------+------+------+----------+------------------+
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|Paris |0.5328949 |Which is the city?|
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Clara |0.9360068 |What is my name?  |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|New   |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|I-CITY|York  |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Hellen|0.45366877|What is her name? |
+-----------------------------------------------------------------+------+------+----------+------------------+
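The output above uses BIO tags: B- marks the first token of an entity and I- a continuation, so "New" (B-CITY) and "York" (I-CITY) form one span. A hypothetical helper (not part of Spark NLP) that merges such tagged tokens into full entity spans could look like this:

```python
# Hypothetical post-processing helper (not a Spark NLP API): merges
# BIO-tagged tokens, like those in the example output, into entity spans.

def merge_bio(tagged_tokens):
    """tagged_tokens: list of (tag, word) pairs, e.g. ("B-CITY", "New").
    Returns a list of (label, text) entity spans."""
    spans = []
    for tag, word in tagged_tokens:
        if tag.startswith("B-"):
            # a B- tag always starts a new span
            spans.append((tag[2:], word))
        elif tag.startswith("I-") and spans and spans[-1][0] == tag[2:]:
            # an I- tag with a matching label extends the previous span
            label, text = spans[-1]
            spans[-1] = (label, text + " " + word)
    return spans

tokens = [("B-NAME", "Clara"), ("B-CITY", "New"), ("I-CITY", "York"),
          ("B-NAME", "Hellen"), ("B-CITY", "Paris")]
print(merge_bio(tokens))
# [('NAME', 'Clara'), ('CITY', 'New York'), ('NAME', 'Hellen'), ('CITY', 'Paris')]
```
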
- setPredictionThreshold(threshold)[source]#
Sets the minimal confidence score to encode an entity.
- Parameters:
- threshold : float
minimal confidence score to encode an entity (default is 0.01)
- setEntityDefinitions(definitions)[source]#
Sets the entity definitions.
- Parameters:
- definitions : dict[str, list[str]]
A dictionary mapping entity labels to lists of questions, as described above.
- static pretrained(name='zero_shot_ner_roberta', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default "zero_shot_ner_roberta"
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- ZeroShotNerModel
The restored model