Description
Pre-Training with Whole Word Masking for Chinese BERT. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu.
More resources by HFL: https://github.com/ymcui/HFL-Anthology
If you find the technical report or this resource useful, please cite the following technical reports in your paper.
- Primary: https://arxiv.org/abs/2004.13922
- Secondary: https://arxiv.org/abs/1906.08101
How to use
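# Python
from pyspark.ml import Pipeline
from sparknlp.annotator import BertEmbeddings
# document_assembler, sentence_detector and tokenizer are assumed to be defined earlier
embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])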
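// Scala
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline
// document_assembler, sentence_detector and tokenizer are assumed to be defined earlier
val embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))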
# NLU (Python)
import nlu
nlu.load("zh.embed.bert.wwm").predict("""Put your text here.""")
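The snippets above reference `document_assembler`, `sentence_detector` and `tokenizer` without showing how they are built. A minimal Python sketch of one possible setup follows. The three upstream stages, their column names, and the helper name `build_embedding_pipeline` are assumptions made for illustration and are not part of this model card; only the `BertEmbeddings` stage is taken from it. For unsegmented Chinese text, a dedicated word segmenter stage may be a better fit than the default `Tokenizer`.

```python
def build_embedding_pipeline():
    """Assemble document -> sentence -> token -> BERT embedding stages."""
    # Imported inside the function so the sketch can be loaded
    # on a machine where Spark NLP is not installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings

    # Assumed upstream stages; the column names are illustrative choices.
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence")
    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")

    # Embedding stage configured as in the model card.
    embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh") \
        .setInputCols("sentence", "token") \
        .setOutputCol("embeddings")

    return Pipeline(stages=[document_assembler, sentence_detector,
                            tokenizer, embeddings])
```

With a running session (for example `spark = sparknlp.start()`) and a DataFrame `df` that has a `text` column, the pipeline could then be applied as `build_embedding_pipeline().fit(df).transform(df)`.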
Model Information
| Field | Value |
|---|---|
| Model Name: | chinese_bert_wwm |
| Compatibility: | Spark NLP 3.1.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [token, sentence] |
| Output Labels: | [embeddings] |
| Language: | zh |
| Case sensitive: | true |
Data Source
https://huggingface.co/hfl/chinese-bert-wwm
Benchmarking
| - | BERT (Google) | BERT-wwm | BERT-wwm-ext | RoBERTa-wwm-ext | RoBERTa-wwm-ext-large |
|---|---|---|---|---|---|
| Masking | WordPiece | WWM[1] | WWM | WWM | WWM |
| Type | base | base | base | base | large |
| Data Source | wiki | wiki | wiki+ext[2] | wiki+ext | wiki+ext |
| Training Tokens # | 0.4B | 0.4B | 5.4B | 5.4B | 5.4B |
| Device | TPU Pod v2 | TPU v3 | TPU v3 | TPU v3 | TPU Pod v3-32[3] |
| Training Steps | ? | 100K (MAX128) + 100K (MAX512) | 1M (MAX128) + 400K (MAX512) | 1M (MAX512) | 2M (MAX512) |
| Batch Size | ? | 2,560 / 384 | 2,560 / 384 | 384 | 512 |
| Optimizer | AdamW | LAMB | LAMB | AdamW | AdamW |
| Vocabulary | 21,128 | ~BERT[4] | ~BERT | ~BERT | ~BERT |
| Init Checkpoint | Random Init | ~BERT | ~BERT | ~BERT | Random Init |
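
The Masking row above contrasts WordPiece masking with whole word masking (WWM). Chinese BERT tokenizes Chinese text character by character, so WordPiece masking can hide a single character of a multi-character word, whereas WWM hides all characters of a chosen word together. The toy Python sketch below only illustrates that difference. The function names and the pre-segmented example sentence are hypothetical, and the sketch leaves out the details of the real pre-training procedure (masking ratio, random and unchanged replacements, the word segmenter).

```python
import random

MASK = "[MASK]"

def wordpiece_mask(words, n_chars, rng):
    """Mask n_chars individual characters, ignoring word boundaries."""
    chars = [ch for word in words for ch in word]
    picked = set(rng.sample(range(len(chars)), n_chars))
    return [MASK if i in picked else ch for i, ch in enumerate(chars)]

def whole_word_mask(words, n_words, rng):
    """Mask every character of n_words randomly chosen words."""
    picked = set(rng.sample(range(len(words)), n_words))
    masked = []
    for i, word in enumerate(words):
        if i in picked:
            masked.extend([MASK] * len(word))  # hide the whole word
        else:
            masked.extend(word)                # keep each character
    return masked

# Hypothetical sentence, already split into words by some segmenter
# ("use a language model to predict the next word").
words = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词"]
rng = random.Random(0)
print(" ".join(wordpiece_mask(words, 3, rng)))
print(" ".join(whole_word_mask(words, 2, rng)))
```

In the second output every word is either fully masked or fully visible, which is the property that distinguishes the WWM columns of the table from the original BERT column.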