sparknlp.annotator.coref.spanbert_coref#

Contains classes for the SpanBertCorefModel.

Module Contents#

Classes#

SpanBertCorefModel

A coreference resolution model based on SpanBert.

class SpanBertCorefModel(classname='com.johnsnowlabs.nlp.annotators.coref.SpanBertCorefModel', java_model=None)[source]#

A coreference resolution model based on SpanBert.

A coreference resolution model identifies expressions which refer to the same entity in a text. For example, given the sentence “John told Mary he would like to borrow a book from her.”, the model links “he” to “John” and “her” to “Mary”.

This model is based on SpanBert, which is fine-tuned on the OntoNotes 5.0 data set.

Pretrained models can be loaded with pretrained() of the companion object:

>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("coref")

If no name is provided, the default model "spanbert_base_coref" is used. For available pretrained models, please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: DEPENDENCY

Parameters:
maxSentenceLength

Maximum sentence length to process

maxSegmentLength

Maximum segment length

textGenre

Text genre. One of the following values:

- “bc” : Broadcast conversation (default)
- “bn” : Broadcast news
- “nw” : News wire
- “pt” : Pivot text (Old Testament and New Testament text)
- “tc” : Telephone conversation
- “wb” : Web data
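The genre codes are short strings passed verbatim to setTextGenre. As a minimal pure-Python sketch (the helper name and dictionary below are illustrative, not part of the Spark NLP API), one might validate a code before configuring the model:

```python
# Hypothetical helper (not part of Spark NLP): validate a text-genre code
# against the set of values listed above, with "bc" as the default.
GENRES = {
    "bc": "Broadcast conversation",
    "bn": "Broadcast news",
    "nw": "News wire",
    "pt": "Pivot text (Old and New Testament)",
    "tc": "Telephone conversation",
    "wb": "Web data",
}

def resolve_genre(code=None):
    """Return a valid genre code, defaulting to "bc"; reject unknown codes."""
    if code is None:
        return "bc"
    if code not in GENRES:
        raise ValueError(f"Unknown text genre {code!r}; expected one of {sorted(GENRES)}")
    return code

print(resolve_genre())      # bc
print(resolve_genre("nw"))  # nw
```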

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("corefs")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     corefResolution
... ])
>>> data = spark.createDataFrame([
...     ["John told Mary he would like to borrow a book from her."]
... ]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results \
...     .selectExpr("explode(corefs) AS coref") \
...     .selectExpr("coref.result as token", "coref.metadata") \
...     .show(truncate=False)
+-----+------------------------------------------------------------------------------------+
|token|metadata                                                                            |
+-----+------------------------------------------------------------------------------------+
|John |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|he   |{head.sentence -> 0, head -> John, head.begin -> 0, head.end -> 3, sentence -> 0}   |
|Mary |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|her  |{head.sentence -> 0, head -> Mary, head.begin -> 10, head.end -> 13, sentence -> 0} |
+-----+------------------------------------------------------------------------------------+
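The metadata above links each anaphoric mention to its head mention (head == "ROOT" marks a mention that starts its own chain). A minimal pure-Python sketch of grouping such rows into coreference clusters — the row dicts below mirror the table and are illustrative, not an actual Spark NLP result object:

```python
# Group coreference mentions into clusters using the head metadata,
# mirroring the rows shown in the table above (illustrative data).
rows = [
    {"token": "John", "metadata": {"head": "ROOT", "head.begin": "-1"}},
    {"token": "he",   "metadata": {"head": "John", "head.begin": "0"}},
    {"token": "Mary", "metadata": {"head": "ROOT", "head.begin": "-1"}},
    {"token": "her",  "metadata": {"head": "Mary", "head.begin": "10"}},
]

clusters = {}
for row in rows:
    head = row["metadata"]["head"]
    # A mention with head == "ROOT" starts its own cluster;
    # otherwise it joins the cluster of its head mention.
    key = row["token"] if head == "ROOT" else head
    clusters.setdefault(key, []).append(row["token"])

print(clusters)  # {'John': ['John', 'he'], 'Mary': ['Mary', 'her']}
```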
setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setMaxSegmentLength(value)[source]#

Sets max segment length

Parameters:
value : int

Max segment length

setTextGenre(value)[source]#

Sets the text genre. One of the following values:

- “bc” : Broadcast conversation (default)
- “bn” : Broadcast news
- “nw” : News wire
- “pt” : Pivot text (Old Testament and New Testament text)
- “tc” : Telephone conversation
- “wb” : Web data

Parameters:
value : str

Text genre code, by default “bc”

static loadSavedModel(folder, spark_session)[source]#

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

The current SparkSession

Returns:
SpanBertCorefModel

The restored model

static pretrained(name='spanbert_base_coref', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “spanbert_base_coref”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
SpanBertCorefModel

The restored model