Dutch NER Model

Description

This model was imported from Hugging Face and fine-tuned on the Universal Dependencies Lassy dataset for Dutch, leveraging BERT embeddings and BertForTokenClassification for NER purposes.

Predicted Entities

CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART


How to use

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, BertForTokenClassification, NerConverter
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
       .setInputCols(["document"])\
       .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_dutch_udlassy_ner", "nl")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("ner")

ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")
      
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)
text = """Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor."""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
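
The NerConverter stage at the end of the pipeline assembles the token-level IOB tags emitted by the classifier into the chunks shown under Results. A minimal plain-Python sketch of that merging logic (a hypothetical helper for illustration, not the Spark NLP implementation):

```python
def merge_iob_chunks(tokens, tags):
    """Merge token-level IOB tags (B-PERSON, I-PERSON, O, ...) into
    (chunk_text, label) pairs, mimicking what NerConverter does."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the open chunk
            current.append(token)
        else:
            # O tag (or label mismatch) closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(merge_iob_chunks(
    ["Mijn", "naam", "is", "Peter", "Fergusson", "."],
    ["O", "O", "O", "B-PERSON", "I-PERSON", "O"]))
# [('Peter Fergusson', 'PERSON')]
```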
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
       .setInputCols(Array("document"))
       .setOutputCol("sentence")

val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_dutch_udlassy_ner", "nl")
  .setInputCols(Array("sentence","token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
      .setInputCols(Array("sentence", "token", "ner"))
      .setOutputCol("ner_chunk")
      
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq("Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.").toDF("text")

val result = pipeline.fit(example).transform(example)
import nlu
nlu.load("nl.ner.bert").predict("""Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.""")

Results

+------------------------+---------+
|chunk                   |ner_label|
+------------------------+---------+
|Peter Fergusson         |PERSON   |
|oktober 2011            |DATE     |
|New York                |GPE      |
|5 jaar                  |DATE     |
|Tesla Motor             |ORG      |
+------------------------+---------+

Model Information

Model Name: bert_token_classifier_dutch_udlassy_ner
Compatibility: Spark NLP 3.3.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: nl
Case sensitive: true
Max sentence length: 256

Data Source

https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner