Description

This model is imported from Hugging Face.
RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the RoBERTa base model and has been pretrained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain (Biblioteca Nacional de España) from 2009 to 2019.
Predicted Entities
OTH, PER, LOC, ORG
How to use
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, RoBertaForTokenClassification, NerConverter
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bne_capitel_ner", "es")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)

text = """Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid."""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bne_capitel_ner", "es")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
import nlu
nlu.load("es.ner.roberta").predict("""Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.""")
Results
+------------------------+---------+
|chunk |ner_label|
+------------------------+---------+
|Antonio |PER |
|fábrica de Mercedes-Benz|ORG |
|Madrid. |LOC |
+------------------------+---------+
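The NerConverter stage produces these chunks by grouping the classifier's per-token B-/I-/O tags into spans. A minimal plain-Python sketch of that grouping logic (illustrative only, not Spark NLP code; the token/tag pairs below are assumed tags for the example sentence):

```python
def merge_iob(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk; flush any open one first.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the open chunk.
            current.append(tok)
        else:
            # An O tag (or label mismatch) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Me", "llamo", "Antonio", "y", "trabajo", "en", "la",
          "fábrica", "de", "Mercedes-Benz", "en", "Madrid", "."]
tags = ["O", "O", "B-PER", "O", "O", "O", "O",
        "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC", "O"]
print(merge_iob(tokens, tags))
# [('Antonio', 'PER'), ('fábrica de Mercedes-Benz', 'ORG'), ('Madrid', 'LOC')]
```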
Model Information
Model Name: | roberta_token_classifier_bne_capitel_ner |
Compatibility: | Spark NLP 3.3.2+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [ner] |
Language: | es |
Case sensitive: | true |
Max sentence length: | 256 |
Data Source
https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner-plus
Benchmarking
label score
f1 0.8867
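For reference, the F1 score is the harmonic mean of precision and recall. A small sketch of the computation (the precision/recall values below are made-up inputs for illustration, not the model's actual metrics):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative values only; the table above reports the model's F1 directly.
print(round(f1_score(0.89, 0.88), 4))
```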