BioBERT Embeddings (PubMed)

Description

This model is version 1.2 of the biobert_pubmed_base_cased model. It contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering. Details are described in the paper “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”.


How to use

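import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

# Download the pre-trained BioBERT weights and compute one vector per token
embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased_v1.2", "en") \
      .setInputCols(["document", "token"]) \
      .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I hate cancer"]]).toDF("text")

result = pipeline.fit(data).transform(data)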
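
The pipeline above leaves one annotation per token in the "embeddings" column, each carrying the token text in its result field and the vector in its embeddings field. The following is a minimal sketch of how those annotations might be post-processed in plain Python. The helper names token_vectors and mean_pool, and the FakeAnnotation stand-in, are hypothetical and not part of Spark NLP. Mean pooling is only one simple way to get a single vector per text, and the model card does not prescribe it.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP annotation row; only the two
# fields used below are modelled.
FakeAnnotation = namedtuple("FakeAnnotation", ["result", "embeddings"])


def token_vectors(annotations):
    """Return (token text, vector) pairs in token order."""
    return [(a.result, list(a.embeddings)) for a in annotations]


def mean_pool(annotations):
    """Average the token vectors into one fixed-size vector for the text."""
    vectors = [a.embeddings for a in annotations]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# With the pipeline above, the annotations would be collected roughly as:
#   annotations = result.select("embeddings").first()["embeddings"]
# Collected rows expose .result and .embeddings just like the stand-in.
```

Collecting to the driver like this is only reasonable for small samples; for larger datasets the same logic would need to run inside Spark.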

Model Information

Model Name: biobert_pubmed_base_cased_v1.2
Compatibility: Spark NLP 4.0.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [embeddings]
Language: en
Size: 406.5 MB
Case sensitive: true

References

  • https://arxiv.org/abs/1901.08746v2
  • https://huggingface.co/dmis-lab/biobert-base-cased-v1.2