sparknlp.annotator.ner.zero_shot_ner_model#

Module Contents#

Classes#

ZeroShotNerModel

ZeroShotNerModel implements zero shot named entity recognition by utilizing RoBERTa

class ZeroShotNerModel(classname='com.johnsnowlabs.nlp.annotators.ner.dl.ZeroShotNerModel', java_model=None)[source]#

ZeroShotNerModel implements zero-shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task.

Its input is a list of document annotations, and it automatically generates questions which are used to recognize entities. The entity definitions are given as a dictionary structure that specifies a set of questions for each entity. The model is based on RoBertaForQuestionAnswering.
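The mechanics can be illustrated outside Spark NLP with a plain-Python sketch (the helper names `zero_shot_ner_sketch`, `answer_question`, and `toy_qa` are assumptions for illustration, not the library's internals): each entity label expands into its questions, a QA model proposes an answer span per question, and spans that clear the confidence threshold are converted to BIO tags.

```python
def zero_shot_ner_sketch(tokens, entity_definitions, answer_question, threshold=0.01):
    """Illustrative sketch: label tokens via QA answers.

    answer_question(question, tokens) -> (start, end, confidence) or None,
    where start/end are token indices (end exclusive).
    """
    tags = ["O"] * len(tokens)
    for label, questions in entity_definitions.items():
        for question in questions:
            answer = answer_question(question, tokens)
            if answer is None:
                continue
            start, end, confidence = answer
            if confidence < threshold:
                continue
            # First token of the span opens the entity, the rest continue it.
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
    return tags

# Toy QA "model": hard-coded answer spans for demonstration only.
def toy_qa(question, tokens):
    answers = {
        "What is her name?": (0, 1, 0.94),  # "Clara"
        "Which city?": (4, 6, 0.83),        # "New York"
    }
    return answers.get(question)

tokens = ["Clara", "lives", "in", "the", "New", "York", "area"]
definitions = {"NAME": ["What is her name?"], "CITY": ["Which city?"]}
print(zero_shot_ner_sketch(tokens, definitions, toy_qa))
# ['B-NAME', 'O', 'O', 'O', 'B-CITY', 'I-CITY', 'O']
```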

For more extended examples, see the Examples section.

Pretrained models can be loaded with pretrained of the companion object:

zeroShotNer = ZeroShotNerModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("zero_shot_ner")

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: NAMED_ENTITY

Parameters:
entityDefinitions

A dictionary with definitions of named entities. The keys of the dictionary are the entity labels and the values are lists of questions. For example:

{
    "CITY": ["Which city?", "Which town?"],
    "NAME": ["What is her name?", "What is his name?"]
}

predictionThreshold

Minimal confidence score to encode an entity (default: 0.01)

ignoreEntities

A list of entity labels to discard from the output.
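Taken together, predictionThreshold and ignoreEntities can be pictured as a post-processing filter over candidate predictions. The following is a minimal sketch of that filtering, not the library's implementation; `filter_predictions` and the `(label, word, score)` tuples are illustrative assumptions.

```python
def filter_predictions(candidates, prediction_threshold=0.01, ignore_entities=()):
    """Keep candidates that clear the confidence threshold and whose
    label is not in the ignore list (illustrative sketch only)."""
    ignored = set(ignore_entities)
    return [
        (label, word, score)
        for label, word, score in candidates
        if score >= prediction_threshold and label not in ignored
    ]

candidates = [("CITY", "Paris", 0.53), ("NAME", "Clara", 0.94), ("CITY", "New", 0.005)]
print(filter_predictions(candidates, prediction_threshold=0.01, ignore_entities=["NAME"]))
# [('CITY', 'Paris', 0.53)]
```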

References

RoBERTa: A Robustly Optimized BERT Pretraining Approach: for details about the RoBERTa transformer

RoBertaForQuestionAnswering: for the Spark NLP implementation of RoBERTa question answering

Examples

>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence_detector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> zero_shot_ner = ZeroShotNerModel() \
...     .pretrained() \
...     .setEntityDefinitions(
...         {
...             "NAME": ["What is his name?", "What is my name?", "What is her name?"],
...             "CITY": ["Which city?", "Which is the city?"]
...         }) \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("zero_shot_ner")
>>> data = spark.createDataFrame(
...         [["My name is Clara, I live in New York and Hellen lives in Paris."]]
...     ).toDF("text")
>>> Pipeline() \
...     .setStages([document_assembler, sentence_detector, tokenizer, zero_shot_ner]) \
...     .fit(data) \
...     .transform(data) \
...     .selectExpr("document", "explode(zero_shot_ner) AS entity") \
...     .select(
...         "document.result",
...         "entity.result",
...         "entity.metadata.word",
...         "entity.metadata.confidence",
...         "entity.metadata.question") \
...     .show(truncate=False)
+-----------------------------------------------------------------+------+------+----------+------------------+
|result                                                           |result|word  |confidence|question          |
+-----------------------------------------------------------------+------+------+----------+------------------+
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|Paris |0.5328949 |Which is the city?|
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Clara |0.9360068 |What is my name?  |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|New   |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|I-CITY|York  |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Hellen|0.45366877|What is her name? |
+-----------------------------------------------------------------+------+------+----------+------------------+
setPredictionThreshold(threshold)[source]#

Sets the minimal confidence score to encode an entity

Parameters:
threshold : float

Minimal confidence score to encode an entity (default: 0.01)

setEntityDefinitions(definitions)[source]#

Set entity definitions

Parameters:
definitions : dict[str, list[str]]

A dictionary mapping entity labels to lists of questions.
getClasses()[source]#

Returns the list of entity labels which are recognized.

static pretrained(name='zero_shot_ner_roberta', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "zero_shot_ner_roberta"

lang : str, optional

Language of the pretrained model, by default "en"

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
ZeroShotNerModel

The restored model

static load(path)[source]#

Reads an ML instance from the input path, a shortcut of read().load(path).