Description
This NER model was trained on the MIT Movie Corpus complex queries dataset to detect movie trivia entities. The embeddings used to train it come from the DistilBertEmbeddings model distilbert_base_cased, so the same embeddings must be used in the pipeline at inference time.
Predicted Entities
- Actor
- Award
- Character_Name
- Director
- Genre
- Opinion
- Origin
- Plot
- Quote
- Relationship
- Soundtrack
- Year
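The labels above are reported per entity chunk rather than per token. As a minimal sketch of how token-level tags can be merged into chunks, the plain-Python function below assumes the tags follow the IOB convention (`B-` starts a chunk, `I-` continues it, `O` marks a non-entity token), which is the kind of input NerConverter is documented to merge. The name `group_bio` and the sample tokens and tags are hypothetical, hand-written illustrations; they are not Spark NLP code and not output from this model.

```python
from typing import List, Optional, Tuple


def group_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level B-/I-/O tags into (label, chunk text) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the chunk being built, if there is one.
        if current_tokens:
            chunks.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag with a different label, starts a new chunk.
            flush()
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            # Same label as the open chunk: extend it.
            current_tokens.append(token)
        else:
            # "O" (or anything unrecognised) ends the open chunk.
            flush()
            current_label, current_tokens = None, []
    flush()
    return chunks


# Hand-written example tags (an assumption for illustration, not model output).
tokens = ["a", "movie", "directed", "by", "Steven", "Spielberg", "from", "1993"]
tags = ["O", "O", "O", "O", "B-Director", "I-Director", "O", "B-Year"]
print(group_bio(tokens, tags))
# [('Director', 'Steven Spielberg'), ('Year', '1993')]
```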
How to use
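# Python (Spark NLP)
import pandas as pd
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertEmbeddings, NerDLModel, NerConverter

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

# Wrap the raw text column into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Must be the same embeddings the NER model was trained with
embeddings = DistilBertEmbeddings \
    .pretrained('distilbert_base_cased', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('embeddings')

ner_model = NerDLModel.pretrained('ner_mit_movie_complex_distilbert_base_cased', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

# Merge token-level tags into entity chunks
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

example = spark.createDataFrame(pd.DataFrame({'text': ['My name is John!']}))
result = pipeline.fit(example).transform(example)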
// Scala (Spark NLP); assumes an active SparkSession named `spark`
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Must be the same embeddings the NER model was trained with
val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val ner_model = NerDLModel.pretrained("ner_mit_movie_complex_distilbert_base_cased", "en")
  .setInputCols("document", "token", "embeddings")
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols("document", "token", "ner")
  .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, ner_model, ner_converter))

val data = Seq("My name is John!").toDF("text")
val result = pipeline.fit(data).transform(data)
# Python (NLU)
import nlu
text = ["My name is John!"]
ner_df = nlu.load('en.ner.ner_mit_movie_complex_distilbert_base_cased').predict(text, output_level='token')
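To read the detected entities back out of the pipeline result, a small helper like the one below may be enough. It is a sketch that assumes each annotation in the `entities` column follows Spark NLP's annotation layout, with the chunk text in `result` and the label under `metadata["entity"]`. The helper name `entity_pairs` is hypothetical, and the sample annotation is a hand-written placeholder, not output from this model.

```python
from typing import Dict, List, Tuple


def entity_pairs(annotations: List[Dict]) -> List[Tuple[str, str]]:
    """Return (chunk text, entity label) for each chunk annotation."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]


# Hand-written placeholder in the assumed annotation shape; NOT real model output.
sample = [
    {
        "annotatorType": "chunk",
        "begin": 0,
        "end": 15,
        "result": "Steven Spielberg",
        "metadata": {"entity": "Director"},
    },
]
print(entity_pairs(sample))
# [('Steven Spielberg', 'Director')]
```

With the Python pipeline, the annotations for the first row could be collected as plain dictionaries with something like `[a.asDict(recursive=True) for a in result.select('entities').first()['entities']]`, which relies on PySpark's `Row.asDict`.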
Model Information
| Model Name: | ner_mit_movie_complex_distilbert_base_cased |
| Type: | ner |
| Compatibility: | Spark NLP 3.1.3+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | en |
Data Source
https://groups.csail.mit.edu/sls/downloads/movie/
Benchmarking
processed 15904 tokens with 2278 phrases; found: 2277 phrases; correct: 1674.
accuracy: 89.18%; (non-O)
accuracy: 88.41%; precision: 73.52%; recall: 73.49%; FB1: 73.50
| Entity | Precision | Recall | FB1 | Found phrases |
|---|---|---|---|---|
| Actor | 96.50% | 96.13% | 96.32 | 515 |
| Award | 51.85% | 41.18% | 45.90 | 27 |
| Character_Name | 72.53% | 74.16% | 73.33 | 91 |
| Director | 81.77% | 87.71% | 84.64 | 192 |
| Genre | 75.00% | 74.54% | 74.77 | 324 |
| Opinion | 41.94% | 48.15% | 44.83 | 93 |
| Origin | 37.70% | 32.39% | 34.85 | 61 |
| Plot | 53.43% | 53.60% | 53.51 | 627 |
| Quote | 56.25% | 39.13% | 46.15 | 16 |
| Relationship | 55.10% | 56.25% | 55.67 | 49 |
| Soundtrack | 42.86% | 42.86% | 42.86 | 7 |
| Year | 94.91% | 93.88% | 94.39 | 275 |
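The overall precision, recall and FB1 follow directly from the phrase counts in the summary line (2278 gold phrases, 2277 found, 1674 correct). The short check below recomputes them with the standard phrase-level definitions; the function name `prf` is just an illustrative helper.

```python
def prf(correct: int, found: int, gold: int):
    """Phrase-level precision, recall and F1 from raw counts."""
    precision = correct / found   # share of predicted phrases that are right
    recall = correct / gold       # share of gold phrases that were recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the benchmark summary line.
p, r, f1 = prf(correct=1674, found=2277, gold=2278)
print(f"precision: {p:.2%}; recall: {r:.2%}; FB1: {100 * f1:.2f}")
# precision: 73.52%; recall: 73.49%; FB1: 73.50
```

The same counts also explain the last number reported for each entity: the per-entity values sum to 2277, the total number of phrases found.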