Description
This model is version 1.2 of the biobert_pubmed_base_cased model. It contains pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering. The details are described in the paper “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”.
How to use
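# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Wrap the raw text column as a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
# Download the pre-trained BioBERT weights and embed each token
embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased_v1.2", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])
data = spark.createDataFrame([["I hate cancer"]]).toDF("text")
result = pipeline.fit(data).transform(data)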
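// Scala
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")
// Download the pre-trained BioBERT weights and embed each token
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased_v1.2", "en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)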
# NLU (Python)
import nlu
nlu.load("en.embed.biobert.pubmed.cased_base").predict("I hate cancer")
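The pipelines above produce one embedding vector per token. A common next step is to pool those token vectors into a single sentence vector and compare sentences by cosine similarity. The following is a minimal sketch of that idea in plain Python, not part of the Spark NLP API: `mean_pool` and `cosine_similarity` are hypothetical helper names, and the toy 3-dimensional vectors stand in for real BioBERT token embeddings.

```python
from math import sqrt
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token vectors element-wise into a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (assumption): two "sentences" with made-up 3-dimensional token vectors.
toy_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
toy_b = [[2.0, 0.0, 1.0]]

pooled_a = mean_pool(toy_a)  # [2.0, 0.0, 1.0]
pooled_b = mean_pool(toy_b)
similarity = cosine_similarity(pooled_a, pooled_b)  # close to 1.0: same direction
```

With a real pipeline result, the token vectors for each row could likely be collected with something like `[a.embeddings for a in row.embeddings]` over `result.select("embeddings").collect()`, assuming the output column is named `embeddings` as in the examples above.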
Model Information
| Property | Value |
|---|---|
| Model Name | biobert_pubmed_base_cased_v1.2 |
| Compatibility | Spark NLP 4.0.0+ |
| License | Open Source |
| Edition | Official |
| Input Labels | [sentence, token] |
| Output Labels | [embeddings] |
| Language | en |
| Size | 406.5 MB |
| Case sensitive | true |
References
- https://arxiv.org/abs/1901.08746v2
- https://huggingface.co/dmis-lab/biobert-base-cased-v1.2