Description
A version of Google’s BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.
FinBERT features a custom 50,000-wordpiece vocabulary that covers Finnish words much better than the vocabulary of Multilingual BERT. FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, and the Finnish Wikipedia text amounts to approximately 3% of the text used to train FinBERT.
These features allow FinBERT to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks.
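To put the corpus figures in perspective, the following back-of-envelope sketch derives two rough numbers from the values quoted above. It assumes the 3% figure applies to token counts, which the text does not state explicitly, and it treats "over 3 billion" as exactly 3 billion, so the results are lower-bound estimates rather than reported statistics.

```python
# Back-of-envelope figures derived from the numbers quoted in the description.
# Assumption: "over 3 billion tokens" is treated as exactly 3 billion.
FINBERT_TOKENS = 3_000_000_000   # "over 3 billion tokens"
FINBERT_CHARS = 24_000_000_000   # "24B characters"
WIKI_PERCENT = 3                 # Finnish Wikipedia: about 3% of that amount

# Average number of characters per token in the pre-training corpus.
chars_per_token = FINBERT_CHARS / FINBERT_TOKENS

# Rough size of the Finnish Wikipedia text, assuming the 3% refers to tokens.
wiki_tokens = FINBERT_TOKENS * WIKI_PERCENT // 100

print(f"Average characters per token: {chars_per_token:.1f}")
print(f"Approximate Finnish Wikipedia tokens: {wiki_tokens:,}")
```

Under these assumptions, the Finnish Wikipedia text would be on the order of 90 million tokens.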
How to use
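# Python
...
# Load the pretrained Finnish BERT embeddings and wire them to the sentence and token columns.
embeddings = BertEmbeddings.pretrained("bert_finnish_cased", "fi") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")
# document_assembler, sentence_detector, and tokenizer come from the elided setup above.
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
# Fit on an empty DataFrame; none of the stages learns from the data.
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['Rakastan NLP: tä']], ["text"]))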
// Scala
...
val embeddings = BertEmbeddings.pretrained("bert_finnish_cased", "fi")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
// toDF on a Seq requires spark.implicits._ to be in scope.
val data = Seq("Rakastan NLP: tä").toDF("text")
val result = pipeline.fit(data).transform(data)
# NLU (Python)
import nlu
text = ["Rakastan NLP: tä"]
# output_level='token' returns one row per token.
embeddings_df = nlu.load('fi.embed.bert.cased.').predict(text, output_level='token')
embeddings_df
Results
fi_embed_bert_cased__embeddings token
[0.09888151288032532, -0.72500079870224, 1.001... Rakastan
[0.46280959248542786, -0.7008669972419739, 0.9... NLP
[0.061913054436445236, 1.1024340391159058, 0.9... :
[1.0134484767913818, -0.822027325630188, 1.353... tä
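Each row above pairs a token with its embedding vector. One common way to compare two such vectors is cosine similarity. The sketch below is a minimal illustration in plain Python, not part of Spark NLP or NLU; the function name `cosine_similarity` and the short 3-dimensional vectors are hypothetical stand-ins, since real vectors from this model have 768 values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional stand-ins, not real model output.
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 4.0, 4.0]   # same direction as vec_a
vec_c = [2.0, -1.0, 0.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score close to 0. With real output, the two arguments would be the embedding lists of the tokens being compared.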
Model Information
Model Name: | bert_finnish_cased |
Type: | embeddings |
Compatibility: | Spark NLP 2.6.0+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [word_embeddings] |
Language: | [fi] |
Dimension: | 768 |
Case sensitive: | true |
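Because the model produces one 768-dimensional vector per token, a single vector for a whole sentence has to be derived separately. The sketch below shows one simple option, averaging the token vectors (mean pooling), as an assumption about downstream use rather than anything this model card prescribes. The helper `mean_pool` and the placeholder vectors are hypothetical and assume the token vectors have already been collected into plain Python lists.

```python
EXPECTED_DIM = 768  # "Dimension" from the model information table

def mean_pool(token_vectors, dim=EXPECTED_DIM):
    """Average per-token vectors into a single vector of length `dim`."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError(f"expected {dim} values, got {len(vec)}")
    count = len(token_vectors)
    # Average each component across all token vectors.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Hypothetical placeholder vectors, not real model output.
tokens = [[1.0] * EXPECTED_DIM, [3.0] * EXPECTED_DIM]
sentence_vector = mean_pool(tokens)
print(len(sentence_vector), sentence_vector[0])
```

The dimension check catches vectors that were truncated or produced by a different model before they are silently averaged.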
Data Source
The model is imported from https://github.com/TurkuNLP/FinBERT