sparknlp.annotator.coref.spanbert_coref
Contains classes for the SpanBertCorefModel.
Module Contents#
Classes#
SpanBertCorefModel: A coreference resolution model based on SpanBert.
- class SpanBertCorefModel(classname='com.johnsnowlabs.nlp.annotators.coref.SpanBertCorefModel', java_model=None)[source]#
A coreference resolution model based on SpanBert.
A coreference resolution model identifies expressions which refer to the same entity in a text. For example, given a sentence “John told Mary he would like to borrow a book from her.” the model will link “he” to “John” and “her” to “Mary”.
This model is based on SpanBert, which is fine-tuned on the OntoNotes 5.0 data set.
Pretrained models can be loaded with pretrained() of the companion object:

>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("coref")

The default model is "spanbert_base_coref", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: DEPENDENCY
- Parameters:
- maxSentenceLength
Maximum sentence length to process
- maxSegmentLength
Maximum segment length
- textGenre
Text genre. One of the following values:
- "bc" : Broadcast conversation (default)
- "bn" : Broadcast news
- "nw" : News wire
- "pt" : Pivot text: Old Testament and New Testament text
- "tc" : Telephone conversation
- "wb" : Web data
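Since setTextGenre silently receives whatever string it is given, it can be convenient to check a genre code before building the pipeline. The following is a plain-Python sketch; the `TEXT_GENRES` table mirrors the list above, and the `validate_genre` helper is illustrative, not part of Spark NLP:

```python
# Valid genre codes for the textGenre parameter, from the list above.
TEXT_GENRES = {
    "bc": "Broadcast conversation (default)",
    "bn": "Broadcast news",
    "nw": "News wire",
    "pt": "Pivot text: Old Testament and New Testament text",
    "tc": "Telephone conversation",
    "wb": "Web data",
}

def validate_genre(code):
    """Return the code unchanged if it is a known genre, else raise ValueError."""
    if code not in TEXT_GENRES:
        raise ValueError(
            f"Unknown text genre {code!r}; expected one of {sorted(TEXT_GENRES)}"
        )
    return code

validate_genre("nw")  # passes; validate_genre("xx") would raise ValueError
```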
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("corefs")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     corefResolution
... ])
>>> data = spark.createDataFrame([
...     ["John told Mary he would like to borrow a book from her."]
... ]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results \
...     .selectExpr("explode(corefs) AS coref") \
...     .selectExpr("coref.result as token", "coref.metadata") \
...     .show(truncate=False)
+-----+------------------------------------------------------------------------------------+
|token|metadata                                                                            |
+-----+------------------------------------------------------------------------------------+
|John |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|he   |{head.sentence -> 0, head -> John, head.begin -> 0, head.end -> 3, sentence -> 0}   |
|Mary |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|her  |{head.sentence -> 0, head -> Mary, head.begin -> 10, head.end -> 13, sentence -> 0} |
+-----+------------------------------------------------------------------------------------+
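The metadata in the output above links each mention back to its antecedent: rows with `head -> ROOT` are antecedents themselves, while other rows name their antecedent via `head` (with its character span in `head.begin`/`head.end`). As a plain-Python sketch of how a downstream consumer might group these rows into coreference chains (the `rows` data mirrors the example output; the `link_corefs` helper is ours, not part of Spark NLP):

```python
# Each row from the example output above: (token, metadata).
rows = [
    ("John", {"head": "ROOT", "head.begin": -1, "head.end": -1, "sentence": 0}),
    ("he",   {"head": "John", "head.begin": 0,  "head.end": 3,  "sentence": 0}),
    ("Mary", {"head": "ROOT", "head.begin": -1, "head.end": -1, "sentence": 0}),
    ("her",  {"head": "Mary", "head.begin": 10, "head.end": 13, "sentence": 0}),
]

def link_corefs(rows):
    """Group mentions into chains keyed by their antecedent token."""
    chains = {}
    for token, meta in rows:
        # "ROOT" means the token is itself the antecedent of a chain.
        antecedent = token if meta["head"] == "ROOT" else meta["head"]
        chains.setdefault(antecedent, []).append(token)
    return chains

print(link_corefs(rows))
# {'John': ['John', 'he'], 'Mary': ['Mary', 'her']}
```

Note that `head` only carries the antecedent's surface form; for texts where the same string appears in several chains, a robust consumer should key on (`head.sentence`, `head.begin`, `head.end`) instead.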
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
- bList[int]
ConfigProto from tensorflow, serialized into byte array
- setTextGenre(value)[source]#
- Sets the text genre, one of the following values:
- "bc" : Broadcast conversation (default)
- "bn" : Broadcast news
- "nw" : News wire
- "pt" : Pivot text: Old Testament and New Testament text
- "tc" : Telephone conversation
- "wb" : Web data
- Parameters:
- valuestring
Text genre code, default is ‘bc’
- static loadSavedModel(folder, spark_session)[source]#
Loads a locally saved model.
- Parameters:
- folderstr
Folder of the saved model
- spark_sessionpyspark.sql.SparkSession
The current SparkSession
- Returns:
- SpanBertCorefModel
The restored model
- static pretrained(name='spanbert_base_coref', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- namestr, optional
Name of the pretrained model, by default “spanbert_base_coref”
- langstr, optional
Language of the pretrained model, by default “en”
- remote_locstr, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- SpanBertCorefModel
The restored model