Finnish BERT Embeddings (Base Cased)

Description

FinBERT is a version of Google’s BERT deep transfer learning model trained for Finnish. The model can be fine-tuned to achieve state-of-the-art results on a range of Finnish natural language processing tasks.

FinBERT features a custom 50,000-wordpiece vocabulary with much better coverage of Finnish words than, for example, the previously released multilingual BERT models from Google.

FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24 billion characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, and the Finnish portion of Wikipedia is only about 3% of the amount of text used to train FinBERT.

How to use

Python

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

# Turn raw text into document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split documents into sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Split sentences into tokens
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Produce a FinBERT embedding for each token
embeddings = BertEmbeddings.pretrained("bert_base_finnish_cased", "fi") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sample_data = spark.createDataFrame([["Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon."]], ["text"])
result = pipeline_model.transform(sample_data)
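
The embeddings column holds one annotation per token. A minimal sketch, using the result DataFrame from the pipeline above, that explodes it into the (embeddings, token) pairs shown under Results below:

from pyspark.sql import functions as F

# One row per token: the annotation's `embeddings` field is the vector,
# its `result` field is the token text
result.select(F.explode("embeddings").alias("e")) \
    .select(F.col("e.embeddings").alias("embeddings"), F.col("e.result").alias("token")) \
    .show(truncate=20)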

Scala

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

// Turn raw text into document annotations
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Split documents into sentences
val sentence_detector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

// Split sentences into tokens
val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

// Produce a FinBERT embedding for each token
val embeddings = BertEmbeddings.pretrained("bert_base_finnish_cased", "fi")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.").toDF("text")
val result = pipeline.fit(data).transform(data)
NLU

import nlu
nlu.load("fi.embed_sentence.bert.cased").predict("""Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.""")
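
For quick checks on plain strings, the fitted pipeline_model from the Python example can also be wrapped in Spark NLP's LightPipeline, which avoids building a DataFrame. A minimal sketch:

from sparknlp.base import LightPipeline

light = LightPipeline(pipeline_model)
# fullAnnotate returns one dict per input string, keyed by output column name
annotations = light.fullAnnotate("Syväoppiminen perustuu keinotekoisiin hermoihin.")
for ann in annotations[0]["embeddings"]:
    print(ann.result, ann.embeddings[:3])  # token text and first 3 vector components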

Results

+--------------------+---------------+
|          embeddings|          token|
+--------------------+---------------+
|[0.53366333, -0.4...|  Syväoppiminen|
|[0.49171034, -1.1...|       perustuu|
|[-0.0017492473, -...| keinotekoisiin|
|[0.61259747, -0.7...|      hermoihin|
|[-0.008151092, -0...|              ,|
|[-0.4050159, -0.2...|          jotka|
|[-0.69079936, 0.6...|    muodostavat|
|[-0.45641452, 0.4...|monikerroksisen|
|[1.278124, -1.218...|    neuroverkon|
|[0.42451048, -1.2...|              .|
+--------------------+---------------+
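
Each vector is 768-dimensional, as with any BERT-base model. To hand these vectors to downstream Spark ML stages, Spark NLP's EmbeddingsFinisher can convert the annotation column into plain Spark vectors; a sketch continuing from the Python result above:

from sparknlp.base import EmbeddingsFinisher

finisher = EmbeddingsFinisher() \
    .setInputCols(["embeddings"]) \
    .setOutputCols(["finished_embeddings"]) \
    .setOutputAsVector(True)

# Each row gains an array with one vector per token annotation
finished = finisher.transform(result)
finished.selectExpr("explode(finished_embeddings) as vector").show(3, truncate=40)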

Model Information

Model Name: bert_base_finnish_cased
Compatibility: Spark NLP 3.3.4+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [bert]
Language: fi
Size: 464.2 MB
Case sensitive: true