sparknlp.annotator.coref.spanbert_coref#

Contains classes for the SpanBertCorefModel.

Module Contents#

Classes#

SpanBertCorefModel

A coreference resolution model based on SpanBert.

class SpanBertCorefModel(classname='com.johnsnowlabs.nlp.annotators.coref.SpanBertCorefModel', java_model=None)[source]#

A coreference resolution model based on SpanBert.

A coreference resolution model identifies expressions which refer to the same entity in a text. For example, given the sentence “John told Mary he would like to borrow a book from her.”, the model links “he” to “John” and “her” to “Mary”.

This model is based on SpanBert, which is fine-tuned on the OntoNotes 5.0 data set.

Pretrained models can be loaded with pretrained() of the companion object:

>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("coref")

If no name is provided, the default model "spanbert_base_coref" is used. For available pretrained models, please see the Models Hub.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT, TOKEN

Output Annotation type: DEPENDENCY

Parameters:
maxSentenceLength

Maximum sentence length to process

maxSegmentLength

Maximum segment length

textGenre

Text genre. One of the following values:

- “bc” : Broadcast conversation (default)
- “bn” : Broadcast news
- “nw” : News wire
- “pt” : Pivot text (Old Testament and New Testament text)
- “tc” : Telephone conversation
- “wb” : Web data
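The genre codes are short strings passed verbatim to setTextGenre. As a minimal pure-Python sketch (the helper name and dictionary below are illustrative, not part of the Spark NLP API), one might validate a code before configuring the model:

```python
# Hypothetical helper (not part of Spark NLP): validate a text-genre code
# against the set of values listed above, with "bc" as the default.
GENRES = {
    "bc": "Broadcast conversation",
    "bn": "Broadcast news",
    "nw": "News wire",
    "pt": "Pivot text (Old and New Testament)",
    "tc": "Telephone conversation",
    "wb": "Web data",
}

def resolve_genre(code=None):
    """Return a valid genre code, defaulting to "bc"; reject unknown codes."""
    if code is None:
        return "bc"
    if code not in GENRES:
        raise ValueError(f"Unknown text genre {code!r}; expected one of {sorted(GENRES)}")
    return code

print(resolve_genre())      # bc
print(resolve_genre("nw"))  # nw
```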

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("corefs")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     corefResolution
... ])
>>> data = spark.createDataFrame([
...     ["John told Mary he would like to borrow a book from her."]
... ]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results \
...     .selectExpr("explode(corefs) AS coref") \
...     .selectExpr("coref.result as token", "coref.metadata") \
...     .show(truncate=False)
+-----+------------------------------------------------------------------------------------+
|token|metadata                                                                            |
+-----+------------------------------------------------------------------------------------+
|John |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|he   |{head.sentence -> 0, head -> John, head.begin -> 0, head.end -> 3, sentence -> 0}   |
|Mary |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|her  |{head.sentence -> 0, head -> Mary, head.begin -> 10, head.end -> 13, sentence -> 0} |
+-----+------------------------------------------------------------------------------------+
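The metadata above links each anaphoric mention to its head mention (head == "ROOT" marks a mention that starts its own chain). A minimal pure-Python sketch of grouping such rows into coreference clusters — the row dicts below mirror the table and are illustrative, not an actual Spark NLP result object:

```python
# Group coreference mentions into clusters using the head metadata,
# mirroring the rows shown in the table above (illustrative data).
rows = [
    {"token": "John", "metadata": {"head": "ROOT", "head.begin": "-1"}},
    {"token": "he",   "metadata": {"head": "John", "head.begin": "0"}},
    {"token": "Mary", "metadata": {"head": "ROOT", "head.begin": "-1"}},
    {"token": "her",  "metadata": {"head": "Mary", "head.begin": "10"}},
]

clusters = {}
for row in rows:
    head = row["metadata"]["head"]
    # A mention with head == "ROOT" starts its own cluster;
    # otherwise it joins the cluster of its head mention.
    key = row["token"] if head == "ROOT" else head
    clusters.setdefault(key, []).append(row["token"])

print(clusters)  # {'John': ['John', 'he'], 'Mary': ['Mary', 'her']}
```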
setConfigProtoBytes(b)[source]#

Sets configProto from tensorflow, serialized into byte array.

Parameters:
b : List[int]

ConfigProto from tensorflow, serialized into byte array

setMaxSegmentLength(value)[source]#

Sets max segment length

Parameters:
value : int

Max segment length

setTextGenre(value)[source]#

Sets the text genre. One of the following values:

- “bc” : Broadcast conversation (default)
- “bn” : Broadcast news
- “nw” : News wire
- “pt” : Pivot text (Old Testament and New Testament text)
- “tc” : Telephone conversation
- “wb” : Web data

Parameters:
value : str

Text genre code, by default “bc”

static loadSavedModel(folder, spark_session)[source]#

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

The current SparkSession

Returns:
SpanBertCorefModel

The restored model

static pretrained(name='spanbert_base_coref', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “spanbert_base_coref”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
SpanBertCorefModel

The restored model