Chinese BERT with Whole Word Masking


Pre-Training with Whole Word Masking for Chinese BERT. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu.

If you find the technical report or this resource useful, please cite the technical report in your paper.

How to use

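from pyspark.ml import Pipeline
from sparknlp.annotator import BertEmbeddings
embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh") \
      .setInputCols("sentence", "token") \
      .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])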

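import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
val embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))

import nlu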
nlu.load("zh.embed.bert.wwm").predict("""Put your text here.""")
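
The snippets above refer to document_assembler, sentence_detector and tokenizer without defining them. The following is a minimal sketch of how those upstream stages might be set up in Python. The column names "text" and "document" and the helper names build_pipeline and run_example are assumptions made for illustration; "sentence", "token" and "embeddings" match the input and output labels listed for this model. Running it requires spark-nlp and pyspark, and the model is downloaded on first use.

```python
def build_pipeline():
    """Assemble the stages that feed BertEmbeddings (illustrative sketch)."""
    # Imports are deferred so this module can be loaded without Spark NLP installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings

    # Raw text column -> document annotation ("text"/"document" are assumed names).
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    # Document -> sentences.
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence")
    # Sentences -> tokens.
    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")
    # Tokens -> BERT-wwm embeddings.
    embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")
    return Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])


def run_example(text="Put your text here."):
    """Start a Spark session, fit the pipeline and return the transformed DataFrame."""
    import sparknlp

    spark = sparknlp.start()
    data = spark.createDataFrame([[text]]).toDF("text")
    return build_pipeline().fit(data).transform(data)
```

Calling run_example().select("embeddings.embeddings").show() would then print the embedding vectors, one per token.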

Model Information

Model Name: chinese_bert_wwm
Compatibility: Spark NLP 3.1.0+
License: Open Source
Edition: Official
Input Labels: [token, sentence]
Output Labels: [embeddings]
Language: zh
Case sensitive: true

Data Source


-	BERT (Google)	BERT-wwm	BERT-wwm-ext	RoBERTa-wwm-ext	RoBERTa-wwm-ext-large
Masking	WordPiece	WWM[1]	WWM	WWM	WWM
Type	base	base	base	base	large
Data Source	wiki	wiki	wiki+ext[2]	wiki+ext	wiki+ext
Training Tokens #	0.4B	0.4B	5.4B	5.4B	5.4B
Device	TPU Pod v2	TPU v3	TPU v3	TPU v3	TPU Pod v3-32[3]
Training Steps	?	100K (MAX128) + 100K (MAX512)	1M (MAX128) + 400K (MAX512)	1M (MAX512)	2M (MAX512)
Batch Size	?	2,560 / 384	2,560 / 384	384	512
Optimizer	AdamW	LAMB	LAMB	AdamW	AdamW
Vocabulary	21,128	~BERT[4]	~BERT	~BERT	~BERT
Init Checkpoint	Random Init	~BERT	~BERT	~BERT	Random Init
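
The Masking row in the table contrasts WordPiece masking with whole word masking (WWM). Chinese BERT tokenizes text into single characters, and with WWM all characters belonging to the same segmented word are masked together instead of independently. The sketch below illustrates only that grouping idea. It assumes the word segmentation is already given as input, and it leaves out the other details of BERT's masking procedure (such as replacing some selected tokens with random or unchanged tokens). The function name and parameters are hypothetical and not part of any library.

```python
import random


def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words at once.

    words: list of words, each given as a list of its tokens
           (single characters for Chinese). Segmentation is assumed
           to have been done beforehand.
    Returns a flat list of tokens in which every word is either fully
    masked or left fully intact.
    """
    rng = random.Random(seed)
    masked = []
    for word in words:
        if rng.random() < mask_prob:
            # Mask every token of the selected word, never just a part of it.
            masked.extend([mask_token] * len(word))
        else:
            masked.extend(word)
    return masked


# Three two-character words, pre-segmented (illustrative input).
segmented = [["使", "用"], ["语", "言"], ["模", "型"]]
print(whole_word_mask(segmented, mask_prob=0.5, seed=1))
```

Per-token masking would instead draw the random number once per character, so a word such as 模型 could end up with only one of its two characters masked.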