German BERT Base Uncased Model

Description

This is the uncased German BERT base model from dbmdz, made available as Spark NLP BertEmbeddings. The source data consists of a recent Wikipedia dump, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl, and News Crawl, yielding a training corpus of roughly 16 GB and 2,350,234,427 tokens. The model was trained with an initial sequence length of 512 subwords for 1.5M steps.


How to use

Python

```python
embeddings = BertEmbeddings.pretrained("bert_base_german_uncased", "de") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```

Scala

```scala
val embeddings = BertEmbeddings.pretrained("bert_base_german_uncased", "de")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```

NLU

```python
import nlu
nlu.load("de.embed.bert.uncased").predict("""Put your text here.""")
```
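The Python and Scala snippets above assume that document_assembler, sentence_detector, and tokenizer have already been created. A minimal end-to-end sketch in Python, with an illustrative German example sentence, might look like this:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Raw text -> document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Document -> sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Sentences -> tokens
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Token-level German BERT embeddings
embeddings = BertEmbeddings.pretrained("bert_base_german_uncased", "de") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

# Example sentence is illustrative only
data = spark.createDataFrame([["Ich liebe Spark NLP."]]).toDF("text")
result = nlp_pipeline.fit(data).transform(data)

# One embedding vector per token in the "embeddings" output column
result.selectExpr("explode(embeddings.embeddings) as token_embedding").show(truncate=False)
```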

Model Information

Model Name: bert_base_german_uncased
Compatibility: Spark NLP 3.1.0+
License: Open Source
Edition: Official
Input Labels: [token, sentence]
Output Labels: [embeddings]
Language: de
Case sensitive: true

Data Source

https://huggingface.co/dbmdz/bert-base-german-uncased

Benchmarking

For results on downstream tasks like NER or PoS tagging, please refer to
[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).