Description
This NER model was trained on the simple queries dataset of the MIT Movie Corpus to detect movie trivia entities such as actors, directors, and titles. It was trained with DistilBertEmbeddings (distilbert_base_cased) embeddings, so the same embeddings must be used in the pipeline at inference time.
Predicted Entities
- ACTOR
- CHARACTER
- DIRECTOR
- GENRE
- PLOT
- RATING
- RATINGS_AVERAGE
- REVIEW
- SONG
- TITLE
- TRAILER
- YEAR
How to use
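import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertEmbeddings, NerDLModel, NerConverter
from pyspark.ml import Pipeline
# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()
# Wrap the raw text column in a document annotation
document_assembler = DocumentAssembler() \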
.setInputCol('text') \
.setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
embeddings = DistilBertEmbeddings\
.pretrained('distilbert_base_cased', 'en')\
.setInputCols(["token", "document"])\
.setOutputCol("embeddings")
ner_model = NerDLModel.pretrained('ner_mit_movie_simple_distilbert_base_cased', 'en') \
.setInputCols(['document', 'token', 'embeddings']) \
.setOutputCol('ner')
ner_converter = NerConverter() \
.setInputCols(['document', 'token', 'ner']) \
.setOutputCol('entities')
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter
])
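# Build a one-row DataFrame with a 'text' column (no pandas needed)
example = spark.createDataFrame([['My name is John!']]).toDF('text')
# The stages are pretrained, so fit() only assembles the pipeline model
result = pipeline.fit(example).transform(example)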
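import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val document_assembler = DocumentAssembler()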
.setInputCol("text")
.setOutputCol("document")
val tokenizer = Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en")
.setInputCols("document", "token")
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("ner_mit_movie_simple_distilbert_base_cased", "en")
.setInputCols("document", "token", "embeddings")
.setOutputCol("ner")
val ner_converter = NerConverter()
.setInputCols("document", "token", "ner")
.setOutputCol("entities")
val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, ner_model, ner_converter))
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell
val data = Seq("My name is John!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
text = ["My name is John!"]
ner_df = nlu.load('en.ner.ner_mit_movie_simple_distilbert_base_cased').predict(text, output_level='token')
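The NerConverter stage in the pipelines above turns token-level IOB tags into entity chunks. The following is a minimal plain-Python sketch of that merging idea, not Spark NLP's actual implementation. The function name `iob_to_chunks` is hypothetical, and the query and tags are hand-written for illustration (using labels from the Predicted Entities list); they are not real model output.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (entity_text, label) chunks."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-DIRECTOR" -> ("B", "DIRECTOR"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        # A new chunk starts on B-, or on an I- whose label differs
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        # Close the open chunk when we leave it or a new one begins
        if current_label is not None and (prefix == "O" or starts_new):
            chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_label = label

    # Flush a chunk that runs to the end of the sentence
    if current_label is not None:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hand-written example (assumed tags, not model predictions)
tokens = "show me movies directed by Christopher Nolan from 2010".split()
tags = ["O", "O", "O", "O", "O", "B-DIRECTOR", "I-DIRECTOR", "O", "B-YEAR"]
print(iob_to_chunks(tokens, tags))
```

In the Spark pipeline, the equivalent information ends up in the `entities` output column, with one annotation per merged chunk.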
Model Information
| Model Name: | ner_mit_movie_simple_distilbert_base_cased |
| Type: | ner |
| Compatibility: | Spark NLP 3.1.3+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token, embeddings] |
| Output Labels: | [ner] |
| Language: | en |
Data Source
https://groups.csail.mit.edu/sls/downloads/movie/
Benchmarking
processed 24686 tokens with 5339 phrases; found: 5331 phrases; correct: 4677.
accuracy: 87.88%; (non-O)
accuracy: 93.74%; precision: 87.73%; recall: 87.60%; FB1: 87.67
ACTOR: precision: 88.34%; recall: 95.20%; FB1: 91.64 875
CHARACTER: precision: 64.56%; recall: 56.67%; FB1: 60.36 79
DIRECTOR: precision: 93.00%; recall: 84.43%; FB1: 88.51 414
GENRE: precision: 91.04%; recall: 94.63%; FB1: 92.80 1161
PLOT: precision: 70.86%; recall: 72.30%; FB1: 71.57 501
RATING: precision: 93.16%; recall: 92.60%; FB1: 92.88 497
RATINGS_AVERAGE: precision: 83.94%; recall: 86.92%; FB1: 85.40 467
REVIEW: precision: 47.06%; recall: 14.29%; FB1: 21.92 17
SONG: precision: 76.32%; recall: 53.70%; FB1: 63.04 38
TITLE: precision: 84.60%; recall: 83.10%; FB1: 83.84 552
TRAILER: precision: 83.87%; recall: 86.67%; FB1: 85.25 31
YEAR: precision: 95.99%; recall: 93.19%; FB1: 94.57 699
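The overall precision, recall, and FB1 figures follow from the phrase counts on the first line of the report. As a small sketch of that arithmetic (the helper name `prf` is ours, not part of any evaluation script), precision is correct/found, recall is correct/total gold phrases, and FB1 is their harmonic mean:

```python
def prf(correct: int, found: int, gold: int):
    """Phrase-level precision, recall, and F1 from raw counts."""
    precision = correct / found   # share of predicted phrases that are right
    recall = correct / gold       # share of gold phrases that were recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the report: 5339 gold phrases, 5331 found, 4677 correct
p, r, f1 = prf(correct=4677, found=5331, gold=5339)
print(f"precision: {p:.2%}; recall: {r:.2%}; FB1: {100 * f1:.2f}")
# prints: precision: 87.73%; recall: 87.60%; FB1: 87.67

# The per-entity FB1 values combine that row's precision and recall the same way,
# e.g. ACTOR with precision 88.34 and recall 95.20
actor_f1 = 2 * 88.34 * 95.20 / (88.34 + 95.20)
print(f"ACTOR FB1: {actor_f1:.2f}")
# prints: ACTOR FB1: 91.64
```

The trailing number on each per-entity row is the count of phrases the model predicted for that label; those counts sum to the 5331 found phrases.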