Description
FinBERT is a version of Google’s BERT deep transfer learning model for Finnish. It can be fine-tuned to achieve state-of-the-art results on a range of Finnish natural language processing tasks.
FinBERT features a custom 50,000-wordpiece vocabulary that covers Finnish words much better than, for example, the multilingual BERT models previously released by Google.
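To illustrate why vocabulary coverage matters, the following is a minimal sketch of greedy longest-match-first WordPiece splitting. The two vocabularies are hypothetical toy sets invented for this example, not the real FinBERT or multilingual BERT vocabularies, and the function `wordpiece_tokenize` is an illustrative helper rather than part of any library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions made for illustration only).
small_vocab = {"n", "##e", "##u", "##r", "##o", "##v", "##k"}
finnish_vocab = small_vocab | {"neuro", "##verkko"}

word = "neuroverkko"
few = wordpiece_tokenize(word, finnish_vocab)
many = wordpiece_tokenize(word, small_vocab)
print(few)        # two pieces when Finnish subwords are in the vocabulary
print(len(many))  # one piece per character when they are not
```

With poor coverage a word breaks into many short pieces, so each piece carries less meaning and sequences get longer. A vocabulary built from Finnish text tends to keep words in fewer, larger pieces.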
FinBERT was pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, and the Finnish Wikipedia text amounts to approximately 3% of the text used to train FinBERT.
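The figures above imply some rough ratios, worked out in the short sketch below. It assumes that "over 3 billion" is treated as exactly 3 billion (a lower bound) and that the 3% figure can be applied to the token count, which the text does not state explicitly, so the results are order-of-magnitude estimates only.

```python
# Figures quoted in the description, with "over 3 billion" taken as a lower bound.
finbert_tokens = 3_000_000_000
finbert_chars = 24_000_000_000
wiki_percent = 3  # "approximately 3%"; applying it to tokens is an assumption

# Average number of characters per token in the pre-training text.
chars_per_token = finbert_chars / finbert_tokens

# Rough size of the Finnish Wikipedia text, expressed in tokens.
wiki_tokens_estimate = finbert_tokens * wiki_percent // 100

print(chars_per_token)
print(wiki_tokens_estimate)
```

Under these assumptions the corpus averages about 8 characters per token, and the Finnish Wikipedia portion corresponds to roughly 90 million tokens.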
How to use
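# Python (assumes Spark NLP and PySpark are installed)
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()  # Spark session with Spark NLP loaded

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
# Download the pretrained Finnish BERT embeddings
embeddings = BertEmbeddings.pretrained("bert_base_finnish_uncased", "fi")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
sample_data = spark.createDataFrame([['Syväoppimisen tavoitteena on luoda algoritmien avulla neuroverkko, joka pystyy ratkaisemaan sille annetut ongelmat.']], ["text"])
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
# Every stage is pretrained or rule-based, so fitting on an empty DataFrame is enough
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(sample_data)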
// Scala (assumes an active SparkSession named `spark`)
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // enables .toDF on a Seq

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val sentence_detector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")
val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")
// Download the pretrained Finnish BERT embeddings
val embeddings = BertEmbeddings.pretrained("bert_base_finnish_uncased", "fi")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("Syväoppimisen tavoitteena on luoda algoritmien avulla neuroverkko, joka pystyy ratkaisemaan sille annetut ongelmat.").toDF("text")
val result = pipeline.fit(data).transform(data)
# Python, using the NLU library
import nlu
nlu.load("fi.embed_sentence.bert").predict("""Syväoppimisen tavoitteena on luoda algoritmien avulla neuroverkko, joka pystyy ratkaisemaan sille annetut ongelmat.""")
Results
+--------------------+-------------+
|          embeddings|        token|
+--------------------+-------------+
|[0.9422476, -0.14...|Syväoppimisen|
|[2.0408847, -1.45...|  tavoitteena|
|[2.33223, -1.7228...|           on|
|[0.6425015, -0.96...|        luoda|
|[0.10455999, -0.2...|  algoritmien|
|[0.28626734, -0.2...|       avulla|
|[1.0091506, -0.75...|  neuroverkko|
|[1.501086, -0.651...|            ,|
|[1.2654709, -0.82...|         joka|
|[1.710053, -0.406...|       pystyy|
|[0.43736708, -0.2...| ratkaisemaan|
|[1.0496894, 0.191...|        sille|
|[0.8630942, -0.16...|      annetut|
|[0.50174934, -1.3...|     ongelmat|
|[0.27278847, -0.9...|            .|
+--------------------+-------------+
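Each row above pairs a token with its embedding vector. One common way to use such vectors downstream is to average them into a single sentence vector and compare sentences by cosine similarity. The sketch below shows this in plain Python. The three-dimensional vectors are made up for illustration (real vectors from the model are much longer), and `mean_pool` and `cosine_similarity` are hypothetical helper names, not Spark NLP functions.

```python
import math


def mean_pool(vectors):
    """Average equal-length token vectors into one sentence vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors standing in for two tokens of one sentence.
token_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_vector = mean_pool(token_vectors)

# Made-up vector standing in for a second sentence.
other_sentence_vector = [2.0, 1.0, 0.0]

print(sentence_vector)
print(cosine_similarity(sentence_vector, other_sentence_vector))
```

Mean pooling is only one simple option. Whether it suits a given task is a design choice to validate on real data.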
Model Information
| Model Name: | bert_base_finnish_uncased |
| Compatibility: | Spark NLP 3.4.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence] |
| Output Labels: | [bert] |
| Language: | fi |
| Size: | 464.1 MB |
| Case sensitive: | false |