BioBERT Embeddings (Pubmed)

Description

This model is v1.2 of the biobert_pubmed_base_cased model and contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed especially for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. The details are described in the paper “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”.

How to use

Python

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")

tokenizer = Tokenizer() \
      .setInputCols("document") \
      .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased_v1.2","en") \
      .setInputCols(["document", "token"]) \
      .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I hate cancer"]]).toDF("text")

result = pipeline.fit(data).transform(data)
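
Each row of result carries the embeddings column produced above: one annotation per token, with the token text in the result field and its vector in the embeddings field (768 dimensions for this BERT-base model). A minimal sketch of pulling the vectors out, assuming the column names from the pipeline above:

from pyspark.sql import functions as F

# One row per token: the token text and its 768-dimensional vector.
vectors = result.select(F.explode("embeddings").alias("emb")) \
    .select(F.col("emb.result").alias("token"),
            F.col("emb.embeddings").alias("vector"))

vectors.show(truncate=80)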

Scala

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer() 
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased_v1.2","en") 
  .setInputCols(Array("document", "token")) 
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I hate cancer").toDF("text")

val result = pipeline.fit(data).transform(data)

NLU

import nlu
nlu.load("en.embed.biobert.pubmed.cased_base").predict("""I hate cancer""")
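
With NLU, predict() returns a pandas DataFrame with the token-level embeddings already flattened. The exact output column names depend on the NLU version, so the sketch below only captures the result and lists the columns:

import nlu

# predict() yields a pandas DataFrame; inspect its columns to find the
# embedding column produced for this model.
df = nlu.load("en.embed.biobert.pubmed.cased_base").predict("""I hate cancer""")
print(df.columns)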

Model Information

Model Name: biobert_pubmed_base_cased_v1.2
Compatibility: Spark NLP 4.0.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [embeddings]
Language: en
Size: 406.5 MB
Case sensitive: true
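
Because the checkpoint is cased, token case carries signal and the input text should not be lower-cased before embedding. BertEmbeddings also exposes a case-sensitivity parameter; the pretrained model already ships with it set to true, so setting it explicitly, as sketched below, is illustrative only:

embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased_v1.2", "en") \
      .setInputCols(["document", "token"]) \
      .setOutputCol("embeddings") \
      .setCaseSensitive(True)  # keep case information, matching the cased checkpoint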

References

  • https://arxiv.org/abs/1901.08746v2
  • https://huggingface.co/dmis-lab/biobert-base-cased-v1.2