Description
This NER model, trained with GloVe 100d word embeddings, annotates text to detect entities such as the names of people, places, and organizations.
nerdl_model = NerDLModel.pretrained("Ner_conll2003_100d", "en", "@gokhanturer")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
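The snippet above only loads the model. A minimal sketch of a full pipeline around it is shown below; the stage and column names (document, sentence, token, embeddings) follow standard Spark NLP conventions and mirror the prediction pipeline built later in this notebook, and the sample sentence is illustrative only.

import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
nerdl_model = NerDLModel.pretrained("Ner_conll2003_100d", "en", "@gokhanturer")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, sentence, token, embeddings, nerdl_model, converter])

# Fit on an empty DataFrame (no trainable stages) and wrap in a LightPipeline for quick inference.
light = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
print(light.annotate("Mo Salah plays for Liverpool in England."))  # illustrative sentence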
Predicted Entities
PER, LOC, ORG, MISC
How to use
Colab Setup
In [1]:
! pip install -q pyspark==3.1.2 spark-nlp
! pip install -q spark-nlp-display
In [3]:
import sparknlp
spark = sparknlp.start(gpu = True)
from sparknlp.base import *
from sparknlp.annotator import *
import pyspark.sql.functions as F
from sparknlp.training import CoNLL
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)
spark
Spark NLP version 3.4.0
Apache Spark version: 3.1.2
Out[3]:
SparkSession - in-memory
SparkContext: version v3.1.2, master local[*], appName Spark NLP
CoNLL Data Prep
In [2]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
Train Data
In [5]:
with open("eng.train") as f:
    train_data = f.read()
print(train_data[:500])
-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
on IN B-PP O
Thursday NNP B-NP O
it PRP B-NP O
disagreed VBD B-VP O
with IN B-PP O
German JJ B-NP B-MISC
advice NN I-NP O
to TO B-PP O
consumers NNS B-NP
In [6]:
train_data = CoNLL().readDataset(spark, 'eng.train')
train_data.show(3)
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| document| sentence| token| pos| label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
| Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
In [7]:
train_data.count()
Out[7]:
14041
In [8]:
train_data.select(F.explode(F.arrays_zip('token.result', 'pos.result', 'label.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("pos"),
F.expr("cols['2']").alias("ner_label")).show(truncate=50)
+----------+---+---------+
| token|pos|ner_label|
+----------+---+---------+
| EU|NNP| B-ORG|
| rejects|VBZ| O|
| German| JJ| B-MISC|
| call| NN| O|
| to| TO| O|
| boycott| VB| O|
| British| JJ| B-MISC|
| lamb| NN| O|
| .| .| O|
| Peter|NNP| B-PER|
| Blackburn|NNP| I-PER|
| BRUSSELS|NNP| B-LOC|
|1996-08-22| CD| O|
| The| DT| O|
| European|NNP| B-ORG|
|Commission|NNP| I-ORG|
| said|VBD| O|
| on| IN| O|
| Thursday|NNP| O|
| it|PRP| O|
+----------+---+---------+
only showing top 20 rows
In [9]:
train_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)
+------------+------+
|ground_truth|count |
+------------+------+
|O |169578|
|B-LOC |7140 |
|B-PER |6600 |
|B-ORG |6321 |
|I-PER |4528 |
|I-ORG |3704 |
|B-MISC |3438 |
|I-LOC |1157 |
|I-MISC |1155 |
+------------+------+
In [10]:
#conll_data.select(F.countDistinct("label.result")).show()
#conll_data.groupBy("label.result").count().show(truncate=False)
train_data = train_data.withColumn('unique', F.array_distinct("label.result"))\
.withColumn('c', F.size('unique'))\
.filter(F.col('c')>1)
train_data.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth"))\
.groupBy('ground_truth')\
.count()\
.orderBy('count', ascending=False)\
.show(100,truncate=False)
+------------+------+
|ground_truth|count |
+------------+------+
|O |137736|
|B-LOC |7125 |
|B-PER |6596 |
|B-ORG |6288 |
|I-PER |4528 |
|I-ORG |3704 |
|B-MISC |3437 |
|I-LOC |1157 |
|I-MISC |1155 |
+------------+------+
Test Data
In [11]:
with open("eng.testa") as f:
    test_data = f.read()
print(test_data[:500])
-DOCSTART- -X- -X- O
CRICKET NNP B-NP O
- : O O
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O
LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O
West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER
Simmons NNP I-NP I-PER
took VBD B-VP O
four CD B-NP O
for IN B-PP O
38 CD B-NP O
on IN B-PP O
Friday NNP B-NP O
as IN B-PP O
Leicestershire NNP B-NP B-ORG
beat VBD B-VP
In [12]:
test_data = CoNLL().readDataset(spark, 'eng.testa')
test_data.show(3)
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| document| sentence| token| pos| label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|
| LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
In [13]:
test_data.count()
Out[13]:
3250
In [14]:
test_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)
+------------+-----+
|ground_truth|count|
+------------+-----+
|O |42759|
|B-PER |1842 |
|B-LOC |1837 |
|B-ORG |1341 |
|I-PER |1307 |
|B-MISC |922 |
|I-ORG |751 |
|I-MISC |346 |
|I-LOC |257 |
+------------+-----+
NerDL Model with GloVe 100d
In [15]:
glove_embeddings = WordEmbeddingsModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
In [16]:
glove_embeddings.transform(test_data).write.parquet('test_data_embeddings.parquet')
In [17]:
nerTagger = NerDLApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(8)\
.setLr(0.002)\
.setDropout(0.5)\
.setBatchSize(16)\
.setRandomSeed(0)\
.setVerbose(1)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setTestDataset('test_data_embeddings.parquet')\
.setEnableMemoryOptimizer(False)
ner_pipeline = Pipeline(stages=[
glove_embeddings,
nerTagger
])
In [19]:
%%time
ner_model = ner_pipeline.fit(train_data)
CPU times: user 10.6 s, sys: 1.08 s, total: 11.7 s
Wall time: 35min 21s
In [20]:
!cd ~/annotator_logs/ && ls -lt
total 16
-rw-r--r-- 1 root root 13178 Feb 6 17:05 NerDLApproach_c5bf4e4c6211.log
In [21]:
!cat ~/annotator_logs/NerDLApproach_c5bf4e4c6211.log
Name of the selected graph: ner-dl/blstm_10_100_128_120.pb
Training started - total epochs: 8 - lr: 0.002 - batch size: 16 - labels: 9 - chars: 82 - training examples: 11079
Epoch 1/8 started, lr: 0.002, dataset size: 11079
Epoch 1/8 - 159.93s - loss: 2234.436 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.74s
label tp fp fn prec rec f1
B-LOC 1695 94 142 0.94745666 0.92270005 0.93491447
I-ORG 528 76 223 0.8741722 0.7030626 0.77933586
I-MISC 255 88 91 0.7434402 0.7369942 0.74020314
I-LOC 189 14 68 0.9310345 0.73540854 0.8217391
I-PER 1270 59 37 0.95560575 0.9716909 0.9635812
B-MISC 797 142 125 0.84877527 0.8644252 0.85652876
B-ORG 1139 170 202 0.8701299 0.8493661 0.85962266
B-PER 1802 176 40 0.91102123 0.9782845 0.94345546
tp: 7675 fp: 819 fn: 928 labels: 8
Macro-average prec: 0.8852045, rec: 0.84524155, f1: 0.86476153
Micro-average prec: 0.903579, rec: 0.8921307, f1: 0.8978184
Epoch 2/8 started, lr: 0.0019900498, dataset size: 11079
Name of the selected graph: ner-dl/blstm_10_100_128_120.pb
Training started - total epochs: 8 - lr: 0.002 - batch size: 16 - labels: 9 - chars: 82 - training examples: 11079
Epoch 1/8 started, lr: 0.002, dataset size: 11079
Epoch 2/8 - 246.28s - loss: 839.1736 - batches: 695
Quality on test dataset:
time to finish evaluation: 19.66s
label tp fp fn prec rec f1
B-LOC 1762 124 75 0.9342524 0.95917255 0.9465484
I-ORG 585 76 166 0.8850227 0.77896136 0.82861185
I-MISC 247 39 99 0.8636364 0.71387285 0.7816456
I-LOC 233 74 24 0.7589577 0.9066148 0.8262412
I-PER 1275 54 32 0.95936793 0.97551644 0.9673748
B-MISC 791 70 131 0.9186992 0.85791755 0.88726866
B-ORG 1150 151 191 0.88393545 0.857569 0.8705526
B-PER 1800 147 42 0.9244992 0.9771987 0.9501188
tp: 7843 fp: 735 fn: 760 labels: 8
Macro-average prec: 0.89104635, rec: 0.8783529, f1: 0.88465416
Micro-average prec: 0.9143157, rec: 0.9116587, f1: 0.9129852
Epoch 3/8 started, lr: 0.001980198, dataset size: 11079
Epoch 1/8 - 254.10s - loss: 2203.116 - batches: 695
Quality on test dataset:
time to finish evaluation: 22.30s
label tp fp fn prec rec f1
B-LOC 1660 82 177 0.95292765 0.90364724 0.9276334
I-ORG 560 123 191 0.81991214 0.74567246 0.781032
I-MISC 227 65 119 0.7773973 0.65606934 0.7115987
I-LOC 155 10 102 0.93939394 0.6031128 0.73459715
I-PER 1259 60 48 0.954511 0.96327466 0.9588728
B-MISC 762 110 160 0.8738532 0.82646424 0.8494984
B-ORG 1160 237 181 0.83035076 0.8650261 0.8473338
B-PER 1785 170 57 0.9130435 0.96905535 0.94021595
tp: 7568 fp: 857 fn: 1035 labels: 8
Macro-average prec: 0.88267374, rec: 0.81654024, f1: 0.84832007
Micro-average prec: 0.89827895, rec: 0.87969315, f1: 0.8888889
Epoch 2/8 started, lr: 0.0019900498, dataset size: 11079
Epoch 3/8 - 257.88s - loss: 610.81525 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.07s
label tp fp fn prec rec f1
B-LOC 1764 104 73 0.9443255 0.9602613 0.9522267
I-ORG 640 140 111 0.82051283 0.85219705 0.8360548
I-MISC 227 22 119 0.9116466 0.65606934 0.7630252
I-LOC 223 43 34 0.8383459 0.8677043 0.8527725
I-PER 1265 31 42 0.97608024 0.96786535 0.9719554
B-MISC 785 62 137 0.9268005 0.85141 0.8875071
B-ORG 1207 174 134 0.87400436 0.90007454 0.8868479
B-PER 1795 94 47 0.9502382 0.97448426 0.96220857
tp: 7906 fp: 670 fn: 697 labels: 8
Macro-average prec: 0.90524435, rec: 0.87875825, f1: 0.8918047
Micro-average prec: 0.921875, rec: 0.91898173, f1: 0.9204261
Epoch 4/8 started, lr: 0.0019704434, dataset size: 11079
Epoch 2/8 - 252.19s - loss: 828.8285 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.37s
label tp fp fn prec rec f1
B-LOC 1722 67 115 0.9625489 0.93739796 0.9498069
I-ORG 624 134 127 0.823219 0.83089215 0.8270378
I-MISC 230 30 116 0.88461536 0.6647399 0.75907594
I-LOC 199 13 58 0.9386792 0.77431905 0.8486141
I-PER 1274 44 33 0.9666161 0.97475135 0.9706667
B-MISC 787 70 135 0.9183197 0.85357916 0.8847667
B-ORG 1212 204 129 0.8559322 0.9038031 0.87921643
B-PER 1807 109 35 0.94311064 0.98099893 0.9616817
tp: 7855 fp: 671 fn: 748 labels: 8
Macro-average prec: 0.9116301, rec: 0.8650602, f1: 0.88773483
Micro-average prec: 0.9212996, rec: 0.9130536, f1: 0.917158
Epoch 3/8 started, lr: 0.001980198, dataset size: 11079
Epoch 4/8 - 250.45s - loss: 512.68085 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.78s
label tp fp fn prec rec f1
B-LOC 1767 78 70 0.95772356 0.9618944 0.9598045
I-ORG 658 75 93 0.89768076 0.8761651 0.8867924
I-MISC 257 38 89 0.87118644 0.74277455 0.801872
I-LOC 229 18 28 0.9271255 0.8910506 0.9087302
I-PER 1264 21 43 0.9836576 0.9671002 0.97530866
B-MISC 841 127 81 0.86880165 0.9121475 0.8899471
B-ORG 1202 114 139 0.9133739 0.89634603 0.90477985
B-PER 1799 87 43 0.95387065 0.9766558 0.9651288
tp: 8017 fp: 558 fn: 586 labels: 8
Macro-average prec: 0.92167753, rec: 0.9030168, f1: 0.9122517
Micro-average prec: 0.9349271, rec: 0.9318842, f1: 0.93340325
Epoch 5/8 started, lr: 0.0019607844, dataset size: 11079
Epoch 3/8 - 252.61s - loss: 604.5874 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.28s
label tp fp fn prec rec f1
B-LOC 1764 112 73 0.9402985 0.9602613 0.95017505
I-ORG 614 84 137 0.87965614 0.8175766 0.847481
I-MISC 244 34 102 0.8776978 0.70520234 0.78205127
I-LOC 220 29 37 0.88353413 0.8560311 0.8695652
I-PER 1268 38 39 0.9709035 0.97016066 0.97053194
B-MISC 799 96 123 0.89273745 0.8665944 0.87947166
B-ORG 1205 123 136 0.9073795 0.8985832 0.90295994
B-PER 1792 110 50 0.94216615 0.97285557 0.95726496
tp: 7906 fp: 626 fn: 697 labels: 8
Macro-average prec: 0.9117967, rec: 0.88090813, f1: 0.89608634
Micro-average prec: 0.9266292, rec: 0.91898173, f1: 0.92278963
Epoch 4/8 started, lr: 0.0019704434, dataset size: 11079
Epoch 5/8 - 257.56s - loss: 437.73123 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.89s
label tp fp fn prec rec f1
B-LOC 1806 163 31 0.91721684 0.9831247 0.94902784
I-ORG 606 26 145 0.95886075 0.8069241 0.8763557
I-MISC 287 99 59 0.7435233 0.82947975 0.78415304
I-LOC 233 54 24 0.8118467 0.9066148 0.85661757
I-PER 1273 26 34 0.9799846 0.9739862 0.9769762
B-MISC 846 146 76 0.8528226 0.9175705 0.8840125
B-ORG 1149 37 192 0.9688027 0.85682327 0.9093787
B-PER 1797 77 45 0.9589114 0.97557 0.96716905
tp: 7997 fp: 628 fn: 606 labels: 8
Macro-average prec: 0.8989962, rec: 0.90626174, f1: 0.9026143
Micro-average prec: 0.9271884, rec: 0.92955947, f1: 0.9283724
Epoch 6/8 started, lr: 0.0019512196, dataset size: 11079
Epoch 4/8 - 255.39s - loss: 508.80334 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.63s
label tp fp fn prec rec f1
B-LOC 1799 270 38 0.8695022 0.9793141 0.921147
I-ORG 616 97 135 0.86395514 0.82023966 0.8415301
I-MISC 253 33 93 0.88461536 0.73121387 0.8006329
I-LOC 236 117 21 0.66855526 0.91828793 0.77377045
I-PER 1256 18 51 0.98587126 0.96097934 0.9732662
B-MISC 799 66 123 0.92369944 0.8665944 0.89423615
B-ORG 1162 106 179 0.9164038 0.86651754 0.89076275
B-PER 1754 52 88 0.9712071 0.95222586 0.96162283
tp: 7875 fp: 759 fn: 728 labels: 8
Macro-average prec: 0.8854762, rec: 0.8869216, f1: 0.8861983
Micro-average prec: 0.91209173, rec: 0.91537833, f1: 0.9137321
Epoch 5/8 started, lr: 0.0019607844, dataset size: 11079
Epoch 6/8 - 262.11s - loss: 382.8735 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.92s
label tp fp fn prec rec f1
B-LOC 1749 61 88 0.96629834 0.9520958 0.95914453
I-ORG 682 136 69 0.83374083 0.9081225 0.8693435
I-MISC 268 40 78 0.8701299 0.7745665 0.81957185
I-LOC 215 14 42 0.93886465 0.83657587 0.8847737
I-PER 1280 39 27 0.97043216 0.979342 0.97486675
B-MISC 837 96 85 0.8971061 0.90780914 0.90242594
B-ORG 1232 120 109 0.9112426 0.9187174 0.91496474
B-PER 1795 93 47 0.9507415 0.97448426 0.96246654
tp: 8058 fp: 599 fn: 545 labels: 8
Macro-average prec: 0.91731954, rec: 0.9064642, f1: 0.9118596
Micro-average prec: 0.9308074, rec: 0.93665, f1: 0.9337195
Epoch 7/8 started, lr: 0.0019417477, dataset size: 11079
Epoch 5/8 - 263.75s - loss: 450.50388 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.58s
label tp fp fn prec rec f1
B-LOC 1749 64 88 0.9646994 0.9520958 0.95835614
I-ORG 689 180 62 0.79286534 0.9174434 0.85061723
I-MISC 275 83 71 0.7681564 0.79479766 0.78124994
I-LOC 210 16 47 0.9292035 0.8171206 0.8695652
I-PER 1271 33 36 0.97469324 0.972456 0.9735733
B-MISC 825 103 97 0.88900864 0.8947939 0.89189196
B-ORG 1239 127 102 0.90702784 0.9239374 0.9154045
B-PER 1791 71 51 0.96186894 0.9723127 0.96706253
tp: 8049 fp: 677 fn: 554 labels: 8
Macro-average prec: 0.89844036, rec: 0.90561974, f1: 0.90201575
Micro-average prec: 0.9224158, rec: 0.93560386, f1: 0.92896307
Epoch 6/8 started, lr: 0.0019512196, dataset size: 11079
Epoch 7/8 - 262.34s - loss: 330.9146 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.09s
label tp fp fn prec rec f1
B-LOC 1760 74 77 0.95965105 0.9580838 0.9588668
I-ORG 630 36 121 0.9459459 0.8388815 0.88920254
I-MISC 283 93 63 0.75265956 0.8179191 0.7839335
I-LOC 225 20 32 0.9183673 0.8754864 0.8964143
I-PER 1273 32 34 0.97547895 0.9739862 0.974732
B-MISC 837 113 85 0.8810526 0.90780914 0.8942308
B-ORG 1230 96 111 0.9276018 0.91722596 0.92238474
B-PER 1801 70 41 0.9625869 0.9777416 0.97010505
tp: 8039 fp: 534 fn: 564 labels: 8
Macro-average prec: 0.915418, rec: 0.9083917, f1: 0.91189134
Micro-average prec: 0.9377114, rec: 0.93444145, f1: 0.93607366
Epoch 8/8 started, lr: 0.0019323673, dataset size: 11079
Epoch 6/8 - 264.34s - loss: 384.8886 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.07s
label tp fp fn prec rec f1
B-LOC 1772 84 65 0.95474136 0.96461624 0.9596534
I-ORG 635 50 116 0.9270073 0.8455393 0.88440114
I-MISC 274 78 72 0.77840906 0.7919075 0.7851002
I-LOC 228 19 29 0.9230769 0.8871595 0.9047619
I-PER 1273 32 34 0.97547895 0.9739862 0.974732
B-MISC 842 125 80 0.8707342 0.9132321 0.8914769
B-ORG 1218 86 123 0.93404907 0.9082774 0.920983
B-PER 1791 65 51 0.96497846 0.9723127 0.9686317
tp: 8033 fp: 539 fn: 570 labels: 8
Macro-average prec: 0.9160594, rec: 0.90712893, f1: 0.9115723
Micro-average prec: 0.93712085, rec: 0.933744, f1: 0.9354294
Epoch 7/8 started, lr: 0.0019417477, dataset size: 11079
Epoch 8/8 - 266.21s - loss: 301.41052 - batches: 695
Quality on test dataset:
time to finish evaluation: 18.45s
label tp fp fn prec rec f1
B-LOC 1768 68 69 0.962963 0.96243876 0.96270084
I-ORG 658 49 93 0.9306931 0.8761651 0.9026063
I-MISC 267 56 79 0.8266254 0.7716763 0.7982063
I-LOC 228 14 29 0.94214875 0.8871595 0.91382766
I-PER 1272 35 35 0.9732211 0.9732211 0.9732211
B-MISC 834 98 88 0.8948498 0.9045553 0.8996764
B-ORG 1239 97 102 0.9273952 0.9239374 0.925663
B-PER 1806 94 36 0.9505263 0.98045605 0.9652592
tp: 8072 fp: 511 fn: 531 labels: 8
Macro-average prec: 0.9260528, rec: 0.90995115, f1: 0.9179313
Micro-average prec: 0.9404637, rec: 0.93827736, f1: 0.9393693
Epoch 7/8 - 256.62s - loss: 335.06775 - batches: 695
Quality on test dataset:
time to finish evaluation: 8.79s
label tp fp fn prec rec f1
B-LOC 1791 128 46 0.9332986 0.9749592 0.95367414
I-ORG 639 78 112 0.8912134 0.8508655 0.8705722
I-MISC 262 48 84 0.8451613 0.75722545 0.7987805
I-LOC 238 60 19 0.7986577 0.92607003 0.8576577
I-PER 1260 19 47 0.9851446 0.9640398 0.97447795
B-MISC 811 72 111 0.9184598 0.8796095 0.89861494
B-ORG 1215 95 126 0.92748094 0.90604025 0.9166353
B-PER 1786 56 56 0.96959823 0.96959823 0.96959823
tp: 8002 fp: 556 fn: 601 labels: 8
Macro-average prec: 0.90862685, rec: 0.903551, f1: 0.9060818
Micro-average prec: 0.93503153, rec: 0.9301407, f1: 0.9325797
Epoch 8/8 started, lr: 0.0019323673, dataset size: 11079
Epoch 8/8 - 133.22s - loss: 299.64578 - batches: 695
Quality on test dataset:
time to finish evaluation: 8.91s
label tp fp fn prec rec f1
B-LOC 1746 56 91 0.9689234 0.9504627 0.95960426
I-ORG 673 77 78 0.8973333 0.8961385 0.8967355
I-MISC 270 43 76 0.8626198 0.7803468 0.8194234
I-LOC 223 10 34 0.95708156 0.8677043 0.9102041
I-PER 1272 41 35 0.9687738 0.9732211 0.9709923
B-MISC 832 109 90 0.88416576 0.9023861 0.893183
B-ORG 1264 143 77 0.8983653 0.94258016 0.9199418
B-PER 1801 76 41 0.95950985 0.9777416 0.9685399
tp: 8081 fp: 555 fn: 522 labels: 8
Macro-average prec: 0.9245966, rec: 0.9113227, f1: 0.91791165
Micro-average prec: 0.93573415, rec: 0.9393235, f1: 0.9375254
In [22]:
import pyspark.sql.functions as F
predictions = ner_model.transform(test_data)
predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth"),
F.expr("cols['2']").alias("prediction")).show(truncate=False)
+--------------+------------+----------+
|token |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET |O |O |
|- |O |O |
|LEICESTERSHIRE|B-ORG |B-ORG |
|TAKE |O |O |
|OVER |O |O |
|AT |O |O |
|TOP |O |O |
|AFTER |O |O |
|INNINGS |O |O |
|VICTORY |O |O |
|. |O |O |
|LONDON |B-LOC |B-LOC |
|1996-08-30 |O |O |
|West |B-MISC |B-MISC |
|Indian |I-MISC |I-MISC |
|all-rounder |O |O |
|Phil |B-PER |B-PER |
|Simmons |I-PER |I-PER |
|took |O |O |
|four |O |O |
+--------------+------------+----------+
only showing top 20 rows
In [23]:
from sklearn.metrics import classification_report
preds_df = predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
F.expr("cols['1']").alias("ground_truth"),
F.expr("cols['2']").alias("prediction")).toPandas()
print (classification_report(preds_df['ground_truth'], preds_df['prediction']))
precision recall f1-score support
B-LOC 0.97 0.95 0.96 1837
B-MISC 0.88 0.90 0.89 922
B-ORG 0.90 0.94 0.92 1341
B-PER 0.96 0.98 0.97 1842
I-LOC 0.96 0.87 0.91 257
I-MISC 0.86 0.78 0.82 346
I-ORG 0.90 0.90 0.90 751
I-PER 0.97 0.97 0.97 1307
O 1.00 1.00 1.00 42759
accuracy 0.99 51362
macro avg 0.93 0.92 0.93 51362
weighted avg 0.99 0.99 0.99 51362
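The classification_report above scores individual IOB tags. For chunk-level (CoNLL-style) metrics, where an entity only counts as correct when both its full span and its type match, something along the lines of the sketch below could be used; it assumes the seqeval package has been installed (pip install -q seqeval), which is not part of this notebook.

# Entity-level (chunk) evaluation with seqeval -- assumes seqeval is installed.
from seqeval.metrics import classification_report as chunk_report

# Each row of `predictions` carries one example's gold and predicted tag sequences.
rows = predictions.select(F.col("label.result").alias("y_true"),
                          F.col("ner.result").alias("y_pred")).collect()
y_true = [r["y_true"] for r in rows]
y_pred = [r["y_pred"] for r in rows]
print(chunk_report(y_true, y_pred))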
Saving the Trained Model
In [24]:
ner_model.stages
Out[24]:
[WORD_EMBEDDINGS_MODEL_48cffc8b9a76, NerDLModel_6a88a8ead3fd]
In [25]:
ner_model.stages[1].write().overwrite().save("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")
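Only the NerDLModel stage is persisted above; at inference time the GloVe embeddings stage is reloaded separately with WordEmbeddingsModel.pretrained(). As an alternative, the whole fitted pipeline (embeddings plus NER) could be saved and restored in one step; the path below is illustrative.

from pyspark.ml import PipelineModel

# Save the complete fitted pipeline (illustrative path).
ner_model.write().overwrite().save("/content/drive/MyDrive/SparkNLPTask/ner_glove_100d_pipeline")

# Restore it later without rebuilding the stages.
restored_pipeline = PipelineModel.load("/content/drive/MyDrive/SparkNLPTask/ner_glove_100d_pipeline")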
Prediction Pipeline
In [28]:
document = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence = SentenceDetector()\
.setInputCols(['document'])\
.setOutputCol('sentence')
token = Tokenizer()\
.setInputCols(['sentence'])\
.setOutputCol('token')
glove_embeddings = WordEmbeddingsModel.pretrained()\
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
loaded_ner_model = NerDLModel.load("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_span")
ner_prediction_pipeline = Pipeline(stages = [
document,
sentence,
token,
glove_embeddings,
loaded_ner_model,
converter
])
empty_data = spark.createDataFrame([['']]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(empty_data)
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
In [33]:
text = '''
The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga.
'''
sample_data = spark.createDataFrame([[text]]).toDF("text")
sample_data.show(truncate=False)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga.
|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
In [34]:
preds = prediction_model.transform(sample_data)
result_df = preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
F.expr("entities['1'].entity").alias("entity")).show(truncate=False)
+---------------+------+
|chunk |entity|
+---------------+------+
|Merseyside |ORG |
|Liverpool |ORG |
|Mo Salah |PER |
|Egypt |LOC |
|Sadio Mané |PER |
|Senegal |LOC |
|African |MISC |
|European |MISC |
|English |MISC |
|Premier League |ORG |
|Mr Salah |PER |
|Mr Mané |PER |
|Riyad Mahrez |PER |
|Algeria |LOC |
|Manchester City|LOC |
|Wilfred Ndidi |PER |
|Nigeria |LOC |
|Chelsea |ORG |
|Edouard Mendy |PER |
|Senegal’s |PER |
+---------------+------+
only showing top 20 rows
In [35]:
from sparknlp.base import LightPipeline
light_model = LightPipeline(prediction_model)
result = light_model.annotate(text)
list(zip(result['token'], result['ner']))
Out[35]:
[('The', 'O'),
('final', 'O'),
('has', 'O'),
('its', 'O'),
('own', 'O'),
('Merseyside', 'B-ORG'),
('subplot', 'O'),
(',', 'O'),
('as', 'O'),
('it', 'O'),
('will', 'O'),
('pit', 'O'),
('Liverpool', 'B-ORG'),
('forwards', 'O'),
('Mo', 'B-PER'),
('Salah', 'I-PER'),
('(', 'O'),
('of', 'O'),
('Egypt', 'B-LOC'),
(':', 'O'),
('pictured', 'O'),
('above', 'O'),
(',', 'O'),
('in', 'O'),
('white', 'O'),
(',', 'O'),
('in', 'O'),
('the', 'O'),
('semi-final', 'O'),
(')', 'O'),
('and', 'O'),
('Sadio', 'B-PER'),
('Mané', 'I-PER'),
('(', 'O'),
('of', 'O'),
('Senegal', 'B-LOC'),
(')', 'O'),
('against', 'O'),
('each', 'O'),
('other', 'O'),
('.', 'O'),
('They', 'O'),
('are', 'O'),
('just', 'O'),
('two', 'O'),
('of', 'O'),
('the', 'O'),
('African', 'B-MISC'),
('stars', 'O'),
('to', 'O'),
('play', 'O'),
('for', 'O'),
('European', 'B-MISC'),
('clubs—the', 'O'),
('world’s', 'O'),
('strongest', 'O'),
('.', 'O'),
('In', 'O'),
('fact', 'O'),
(',', 'O'),
('only', 'O'),
('four', 'O'),
('teams', 'O'),
('in', 'O'),
('the', 'O'),
('English', 'B-MISC'),
('Premier', 'B-ORG'),
('League', 'I-ORG'),
('don’t', 'O'),
('have', 'O'),
('a', 'O'),
('player', 'O'),
('from', 'O'),
('the', 'O'),
('continent', 'O'),
('.', 'O'),
('Besides', 'O'),
('Mr', 'B-PER'),
('Salah', 'I-PER'),
('and', 'O'),
('Mr', 'B-PER'),
('Mané', 'I-PER'),
(',', 'O'),
('Riyad', 'B-PER'),
('Mahrez', 'I-PER'),
('of', 'O'),
('Algeria', 'B-LOC'),
('is', 'O'),
('at', 'O'),
('Manchester', 'B-LOC'),
('City', 'I-LOC'),
(',', 'O'),
('Wilfred', 'B-PER'),
('Ndidi', 'I-PER'),
('of', 'O'),
('Nigeria', 'B-LOC'),
('and', 'O'),
('Chelsea', 'B-ORG'),
('boasts', 'O'),
('Edouard', 'B-PER'),
('Mendy', 'I-PER'),
(',', 'O'),
('Senegal’s', 'B-PER'),
('goalkeeper', 'O'),
(',', 'O'),
('and', 'O'),
('Hakim', 'B-PER'),
('Ziyech', 'I-PER'),
('of', 'O'),
('Morocco', 'B-LOC'),
('.', 'O'),
('In', 'O'),
('Italy’s', 'B-MISC'),
('Serie', 'I-MISC'),
('A', 'I-MISC'),
(',', 'O'),
('Kalidou', 'B-PER'),
('Koulibaly', 'I-PER'),
('of', 'O'),
('Senegal', 'B-LOC'),
('plays', 'O'),
('for', 'O'),
('Napoli', 'B-ORG'),
('and', 'O'),
('Franck', 'B-PER'),
('Kessie', 'I-PER'),
('of', 'O'),
('the', 'O'),
('Ivory', 'B-LOC'),
('Coast', 'I-LOC'),
('turns', 'O'),
('out', 'O'),
('for', 'O'),
('AC', 'B-ORG'),
('Milan', 'I-ORG'),
('.', 'O'),
('Eric', 'B-PER'),
('Maxim', 'I-PER'),
('Choupo-Moting', 'I-PER'),
('of', 'O'),
('Cameroon', 'B-LOC'),
('and', 'O'),
('Bouna', 'B-PER'),
('Sarr', 'I-PER'),
('of', 'O'),
('Senegal', 'B-LOC'),
('both', 'O'),
('play', 'O'),
('for', 'O'),
('Bayern', 'B-ORG'),
('Munich', 'I-ORG'),
(',', 'O'),
('the', 'O'),
('dominant', 'O'),
('club', 'O'),
('in', 'O'),
('Germany’s', 'B-MISC'),
('Bundesliga', 'I-MISC'),
('.', 'O')]
In [37]:
import pandas as pd
result = light_model.fullAnnotate(text)
ner_df= pd.DataFrame([(int(x.metadata['sentence']), x.result, x.begin, x.end, y.result) for x,y in zip(result[0]["token"], result[0]["ner"])],
columns=['sent_id','token','start','end','ner'])
ner_df.head(15)
Out[37]:
sent_id token start end ner
0 0 The 1 3 O
1 0 final 5 9 O
2 0 has 11 13 O
3 0 its 15 17 O
4 0 own 19 21 O
5 0 Merseyside 23 32 B-ORG
6 0 subplot 34 40 O
7 0 , 41 41 O
8 0 as 43 44 O
9 0 it 46 47 O
10 0 will 49 52 O
11 0 pit 54 56 O
12 0 Liverpool 58 66 B-ORG
13 0 forwards 68 75 O
14 0 Mo 77 78 B-PER
Highlight Entities
In [38]:
ann_text = light_model.fullAnnotate(text)[0]
ann_text.keys()
Out[38]:
dict_keys(['document', 'ner_span', 'token', 'ner', 'embeddings', 'sentence'])
In [39]:
from sparknlp_display import NerVisualizer
visualiser = NerVisualizer()
print ('Standard Output')
visualiser.display(ann_text, label_col='ner_span', document_col='document')
Standard Output
The final has its own Merseyside ORG subplot, as it will pit Liverpool ORG forwards Mo Salah PER (of Egypt LOC: pictured above, in white, in the semi-final) and Sadio Mané PER (of Senegal LOC) against each other. They are just two of the African MISC stars to play for European MISC clubs—the world’s strongest. In fact, only four teams in the English MISC Premier League ORG don’t have a player from the continent. Besides Mr Salah PER and Mr Mané PER, Riyad Mahrez PER of Algeria LOC is at Manchester City LOC, Wilfred Ndidi PER of Nigeria LOC and Chelsea ORG boasts Edouard Mendy PER, Senegal’s PER goalkeeper, and Hakim Ziyech PER of Morocco LOC. In Italy’s Serie A MISC, Kalidou Koulibaly PER of Senegal LOC plays for Napoli ORG and Franck Kessie PER of the Ivory Coast LOC turns out for AC Milan ORG. Eric Maxim Choupo-Moting PER of Cameroon LOC and Bouna Sarr PER of Senegal LOC both play for Bayern Munich ORG, the dominant club in Germany’s Bundesliga MISC.
Streamlit
In [14]:
! pip install -q pyspark==3.1.2 spark-nlp
! pip install -q spark-nlp-display
In [ ]:
!pip install streamlit
!pip install pyngrok==4.1.1
In [2]:
! wget https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py
--2022-02-06 22:39:33-- https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7979 (7.8K) [text/plain]
Saving to: ‘streamlit_me_ner_model.py.3’
streamlit_me_ner_mo 100%[===================>] 7.79K --.-KB/s in 0s
2022-02-06 22:39:34 (93.4 MB/s) - ‘streamlit_me_ner_model.py.3’ saved [7979/7979]
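The downloaded streamlit_me_ner_model.py is not reproduced here. As a rough idea of what such an app looks like, the sketch below builds the same prediction pipeline and shows the extracted chunks in a table; the file name, demo sentence, and widget labels are assumptions, and the actual script may differ.

# streamlit_ner_sketch.py -- hypothetical minimal Streamlit front end for this NER model.
import streamlit as st
import pandas as pd
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

# In a real app the Spark session and pipeline would be cached (e.g. with st.cache)
# so they are not rebuilt on every rerun.
spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
ner = NerDLModel.pretrained("Ner_conll2003_100d", "en", "@gokhanturer")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_span")

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, converter])
light_model = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

st.title("NER with GloVe 100d (CoNLL 2003)")
text = st.text_area("Text to annotate", "Mo Salah plays for Liverpool.")
if st.button("Annotate"):
    annotated = light_model.fullAnnotate(text)[0]
    chunks = [(c.result, c.metadata["entity"]) for c in annotated["ner_span"]]
    st.table(pd.DataFrame(chunks, columns=["chunk", "entity"]))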
In [3]:
!ngrok authtoken 24jtZ2Watn1mc1bSG6v19fel7p1_2bYeRjRkniKqqhfgRs6ub
Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml
In [5]:
!streamlit run streamlit_me_ner_model.py &>/dev/null&
In [6]:
from pyngrok import ngrok
public_url = ngrok.connect(port='8501')
public_url
Out[6]:
'http://2d54-34-125-109-11.ngrok.io'
In [7]:
!killall ngrok
public_url = ngrok.connect(port='8501')
public_url
Out[7]:
'http://df30-34-125-109-11.ngrok.io'
Results
+---------------+------+
|chunk |entity|
+---------------+------+
|Merseyside |ORG |
|Liverpool |ORG |
|Mo Salah |PER |
|Egypt |LOC |
|Sadio Mané |PER |
|Senegal |LOC |
|African |MISC |
|European |MISC |
|English |MISC |
|Premier League |ORG |
|Mr Salah |PER |
|Mr Mané |PER |
|Riyad Mahrez |PER |
|Algeria |LOC |
|Manchester City|LOC |
|Wilfred Ndidi |PER |
|Nigeria |LOC |
|Chelsea |ORG |
|Edouard Mendy |PER |
|Senegal’s |PER |
+---------------+------+
Model Information
Model Name:     Ner_conll2003_100d
Type:           ner
Compatibility:  Spark NLP 3.1.2+
License:        Open Source
Edition:        Community
Input Labels:   [sentence, token, embeddings]
Output Labels:  [ner]
Language:       en
Size:           14.3 MB
Dependencies:   glove_100d
References
This model was trained on the CoNLL 2003 shared task data:
https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa
Benchmarking
label precision recall f1-score support
B-LOC 0.97 0.95 0.96 1837
B-MISC 0.88 0.90 0.89 922
B-ORG 0.90 0.94 0.92 1341
B-PER 0.96 0.98 0.97 1842
I-LOC 0.96 0.87 0.91 257
I-MISC 0.86 0.78 0.82 346
I-ORG 0.90 0.90 0.90 751
I-PER 0.97 0.97 0.97 1307
O 1.00 1.00 1.00 42759
accuracy - - 0.99 51362
macro-avg 0.93 0.92 0.93 51362
weighted-avg 0.99 0.99 0.99 51362