Description
This NER model was trained on the MIT Movie Corpus complex queries dataset to detect movie trivia entities. It was trained with embeddings from the BertEmbeddings model (bert_base_cased), so the same embeddings must be used at inference time.
Predicted Entities
- Actor
- Award
- Character_Name
- Director
- Genre
- Opinion
- Origin
- Plot
- Quote
- Relationship
- Soundtrack
- Year
How to use
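# Python
import pandas as pd
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel, NerConverter

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Wrap the raw text column into Spark NLP document annotations
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# The NER model expects the embeddings it was trained with
embeddings = BertEmbeddings \
    .pretrained('bert_base_cased', 'en') \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

ner_model = NerDLModel.pretrained('ner_mit_movie_complex_bert_base_cased', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

# Merge token-level tags into entity chunks
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

example = spark.createDataFrame(pd.DataFrame({'text': ['My name is John!']}))
result = pipeline.fit(example).transform(example)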
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Assumes an active SparkSession named `spark` (as in spark-shell); needed for .toDS
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val ner_model = NerDLModel.pretrained("ner_mit_movie_complex_bert_base_cased", "en")
  .setInputCols("document", "token", "embeddings")
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols("document", "token", "ner")
  .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, ner_model, ner_converter))

val data = Seq("My name is John!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
# NLU
import nlu
text = ["My name is John!"]
ner_df = nlu.load('en.ner.ner_mit_movie_complex_bert_base_cased').predict(text, output_level='token')
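The NerConverter stage in the pipelines above merges token-level tags into entity chunks. As a rough illustration of that grouping step, the plain-Python sketch below assumes the model emits IOB-style tags such as `B-Actor`, `I-Actor`, and `O`. The function name `group_iob_chunks` and the sample query are hypothetical and hand-labeled for illustration; they are not part of Spark NLP and not actual model output.

```python
from typing import List, Optional, Tuple


def group_iob_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")

        # An I- tag with the same label extends the open chunk
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue

        # Any other tag closes the open chunk, if there is one
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

        # B- (or a stray I-) starts a new chunk; O leaves no chunk open
        if prefix in ("B", "I") and label:
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None

    # Flush a chunk that runs to the end of the sentence
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hypothetical, hand-labeled query (assumed tag format, not model output)
tokens = ["a", "1994", "crime", "film", "directed", "by", "quentin", "tarantino"]
tags = ["O", "B-Year", "B-Genre", "O", "O", "O", "B-Director", "I-Director"]
chunks = group_iob_chunks(tokens, tags)
print(chunks)
```

In a real pipeline this grouping is already done for you: the `entities` column produced by NerConverter holds the merged chunks, so a helper like this is only useful for understanding or post-processing raw `ner` output.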
Model Information
Model Name: | ner_mit_movie_complex_bert_base_cased |
Type: | ner |
Compatibility: | Spark NLP 3.1.3+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Data Source
https://groups.csail.mit.edu/sls/downloads/movie/
Benchmarking
processed 15904 tokens with 2278 phrases; found: 2292 phrases; correct: 1664.
accuracy: 88.81%; (non-O)
accuracy: 88.78%; precision: 72.60%; recall: 73.05%; FB1: 72.82
Actor: precision: 96.46%; recall: 94.97%; FB1: 95.71 509
Award: precision: 63.64%; recall: 61.76%; FB1: 62.69 33
Character_Name: precision: 61.62%; recall: 68.54%; FB1: 64.89 99
Director: precision: 83.43%; recall: 84.36%; FB1: 83.89 181
Genre: precision: 74.07%; recall: 73.62%; FB1: 73.85 324
Opinion: precision: 39.18%; recall: 46.91%; FB1: 42.70 97
Origin: precision: 35.37%; recall: 40.85%; FB1: 37.91 82
Plot: precision: 53.95%; recall: 53.60%; FB1: 53.77 621
Quote: precision: 64.29%; recall: 39.13%; FB1: 48.65 14
Relationship: precision: 48.00%; recall: 50.00%; FB1: 48.98 50
Soundtrack: precision: 80.00%; recall: 57.14%; FB1: 66.67 5
Year: precision: 94.22%; recall: 93.88%; FB1: 94.05 277
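The overall precision, recall, and FB1 values can be re-derived from the phrase counts in the first benchmark line. The sketch below assumes the figures follow the usual CoNLL-style chunk evaluation, where precision is correct/found and recall is correct/gold; the helper name `prf1` is hypothetical.

```python
from typing import Tuple


def prf1(correct: int, found: int, gold: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions from phrase counts."""
    precision = correct / found if found else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the benchmark summary: 2278 gold phrases,
# 2292 predicted phrases, 1664 of them correct
precision, recall, f1 = prf1(correct=1664, found=2292, gold=2278)
print(f"precision: {precision:.2%}; recall: {recall:.2%}; FB1: {100 * f1:.2f}")
# prints: precision: 72.60%; recall: 73.05%; FB1: 72.82
```

The trailing number on each per-entity row sums to 2292 across all rows, which suggests it is the count of phrases the model predicted for that label.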