Description
This model contains pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. The details are described in the paper “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”.
How to use
...
# Python: load the pre-trained BioBERT embeddings and read from the sentence and token columns
embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")
# Assemble the pipeline, fit it on a placeholder DataFrame, then transform the sample text
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"]))
...
// Scala: load the pre-trained BioBERT embeddings and read from the sentence and token columns
val embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
// Assemble the pipeline, then fit and transform the sample text
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)
# Python, using the NLU library: one embedding row per token
import nlu
text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.pmc_base_cased').predict(text, output_level='token')
embeddings_df
Results
token    en_embed_biobert_pmc_base_cased_embeddings
I        [0.0654267892241478, 0.06330983340740204, 0.13...
hate     [0.3058323264122009, 0.4778319299221039, -0.09...
cancer   [0.3130614757537842, 0.024675076827406883, -0....
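Once collected, each entry in the embeddings column is a plain list of floats, so token vectors can be compared directly. The following is a minimal sketch, assuming the vectors have already been pulled into Python lists. `cosine_similarity` is a hypothetical helper, not part of Spark NLP or NLU, and the short toy vectors are invented for illustration (they are not model output).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, invented for illustration; real vectors hold 768 values.
vec_a = [1.0, 0.0, 0.0]
vec_b = [0.0, 1.0, 0.0]
vec_c = [2.0, 0.0, 0.0]

print(cosine_similarity(vec_a, vec_b))  # orthogonal vectors
print(cosine_similarity(vec_a, vec_c))  # same direction, different length
```

Cosine similarity ignores vector length, which is why it is a common choice for comparing embeddings; other distance measures may suit a given task better.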
Model Information
| Model Name: | biobert_pmc_base_cased |
| Type: | embeddings |
| Compatibility: | Spark NLP 2.6.0 |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [sentence, token] |
| Output Labels: | [word_embeddings] |
| Language: | [en] |
| Dimension: | 768 |
| Case sensitive: | true |
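The Dimension entry above means every token vector should hold 768 floats. A quick sanity check before downstream use can catch a mismatched model or column. This is only a sketch: `check_dimension` and `EXPECTED_DIM` are hypothetical names introduced here, and the all-zero vectors are placeholders rather than model output.

```python
EXPECTED_DIM = 768  # "Dimension" value from the Model Information table

def check_dimension(vectors, expected_dim=EXPECTED_DIM):
    """Return the indices of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]

# Placeholder vectors (all zeros), used only to exercise the check.
good = [0.0] * 768
bad = [0.0] * 300

print(check_dimension([good, bad, good]))
```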
Data Source
The model is imported from https://github.com/dmis-lab/biobert