sparknlp.annotator.coref.spanbert_coref
Contains classes for the SpanBertCorefModel.
Module Contents
Classes
| SpanBertCorefModel | A coreference resolution model based on SpanBert. |
- class SpanBertCorefModel(classname='com.johnsnowlabs.nlp.annotators.coref.SpanBertCorefModel', java_model=None)
- A coreference resolution model based on SpanBert.
- A coreference resolution model identifies expressions that refer to the same entity in a text. For example, given the sentence “John told Mary he would like to borrow a book from her.”, the model links “he” to “John” and “her” to “Mary”.
- This model is based on SpanBert, fine-tuned on the OntoNotes 5.0 data set.
- Pretrained models can be loaded with pretrained() of the companion object:
>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("coref")
- The default model is "spanbert_base_coref", if no name is provided. For available pretrained models please see the Models Hub.
- For extended examples of usage, see the Examples.
| Input Annotation types | Output Annotation type |
| DOCUMENT, TOKEN | DEPENDENCY |
- Parameters:
- maxSentenceLength
- Maximum sentence length to process 
- maxSegmentLength
- Maximum segment length 
- textGenre
- Text genre. One of the following values:
- “bc” : Broadcast conversation, default
- “bn” : Broadcast news
- “nw” : News wire
- “pt” : Pivot text: Old Testament and New Testament text
- “tc” : Telephone conversation
- “wb” : Web data
 
 - Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> corefResolution = SpanBertCorefModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("corefs")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     corefResolution
... ])
>>> data = spark.createDataFrame([
...     ["John told Mary he would like to borrow a book from her."]
... ]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results \
...     .selectExpr("explode(corefs) AS coref") \
...     .selectExpr("coref.result as token", "coref.metadata") \
...     .show(truncate=False)
+-----+------------------------------------------------------------------------------------+
|token|metadata                                                                            |
+-----+------------------------------------------------------------------------------------+
|John |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|he   |{head.sentence -> 0, head -> John, head.begin -> 0, head.end -> 3, sentence -> 0}   |
|Mary |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|her  |{head.sentence -> 0, head -> Mary, head.begin -> 10, head.end -> 13, sentence -> 0} |
+-----+------------------------------------------------------------------------------------+
 - setConfigProtoBytes(b)
- Sets configProto from TensorFlow, serialized into a byte array.
- Parameters:
- b : List[int]
- ConfigProto from TensorFlow, serialized into a byte array
 
 
 - setTextGenre(value)
- Sets the text genre, one of the following values:
- “bc” : Broadcast conversation, default
- “bn” : Broadcast news
- “nw” : News wire
- “pt” : Pivot text: Old Testament and New Testament text
- “tc” : Telephone conversation
- “wb” : Web data
 - Parameters:
- value : str
- Text genre code, default is ‘bc’ 
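A genre code can be checked before it is handed to setTextGenre. The following is a minimal sketch in plain Python, not part of Spark NLP: the names TEXT_GENRES, DEFAULT_GENRE and check_text_genre are hypothetical, and the codes are copied from the list above. Whether the annotator itself rejects unknown codes is not stated here, so the check is done on the caller's side.

```python
# Genre codes and descriptions as listed in the setTextGenre documentation.
TEXT_GENRES = {
    "bc": "Broadcast conversation",
    "bn": "Broadcast news",
    "nw": "News wire",
    "pt": "Pivot text: Old Testament and New Testament text",
    "tc": "Telephone conversation",
    "wb": "Web data",
}
DEFAULT_GENRE = "bc"  # documented default


def check_text_genre(value=DEFAULT_GENRE):
    """Hypothetical helper: return a normalised genre code or raise ValueError."""
    code = value.strip().lower()
    if code not in TEXT_GENRES:
        allowed = ", ".join(sorted(TEXT_GENRES))
        raise ValueError(f"Unknown text genre {value!r}; expected one of: {allowed}")
    return code


# The returned code is what would be passed on, e.g. model.setTextGenre(code).
code = check_text_genre("NW")
print(code, "=", TEXT_GENRES[code])
```

Keeping the codes in one mapping also gives a readable description for logging or error messages.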
 
 
 - static loadSavedModel(folder, spark_session)
- Loads a locally saved model.
- Parameters:
- folder : str
- Folder of the saved model
- spark_session : pyspark.sql.SparkSession
- The current SparkSession 
 
- Returns:
- SpanBertCorefModel
- The restored model 
 
 
 - static pretrained(name='spanbert_base_coref', lang='en', remote_loc=None)
- Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
- Name of the pretrained model, by default “spanbert_base_coref”
- lang : str, optional
- Language of the pretrained model, by default “en”
- remote_loc : str, optional
- Optional remote address of the resource, by default None. If None, Spark NLP's repositories are used.
 
- Returns:
- SpanBertCorefModel
- The restored model
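Returning to the class Examples: the annotator emits one flat row per mention, and each row points to its antecedent through the head, head.begin and head.end metadata keys, with ROOT marking a mention that has no antecedent. A minimal sketch of grouping those rows into clusters follows, in plain Python on rows copied from the Examples output. The helper name group_mentions is hypothetical, metadata values are assumed to be strings, and clusters are keyed by the head's surface text, so two distinct heads with identical text would be merged.

```python
# Rows copied from the Examples output: (mention text, metadata).
ROOT_META = {"head.sentence": "-1", "head": "ROOT",
             "head.begin": "-1", "head.end": "-1", "sentence": "0"}
rows = [
    ("John", dict(ROOT_META)),
    ("he", {"head.sentence": "0", "head": "John",
            "head.begin": "0", "head.end": "3", "sentence": "0"}),
    ("Mary", dict(ROOT_META)),
    ("her", {"head.sentence": "0", "head": "Mary",
             "head.begin": "10", "head.end": "13", "sentence": "0"}),
]


def group_mentions(rows):
    """Hypothetical helper: map each head mention to the mentions referring to it."""
    clusters = {}
    for mention, meta in rows:
        if meta["head"] == "ROOT":
            # A head mention: make sure it has an entry even without referents.
            clusters.setdefault(mention, [])
        else:
            clusters.setdefault(meta["head"], []).append(mention)
    return clusters


print(group_mentions(rows))  # {'John': ['he'], 'Mary': ['her']}

# In this example, head.begin and head.end behave as inclusive character offsets.
text = "John told Mary he would like to borrow a book from her."
for mention, meta in rows:
    if meta["head"] != "ROOT":
        begin, end = int(meta["head.begin"]), int(meta["head.end"])
        assert text[begin:end + 1] == meta["head"]
```

For documents where the same name can denote different entities, keying on the head offsets together with head.sentence would be the safer choice.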