BioBERT Sentence Embeddings (Pubmed PMC)

Description

This model contains pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering. The details are described in the paper “BioBERT: a pre-trained biomedical language representation model for biomedical text mining”.


How to use

...
embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_pmc_base_cased", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
# One row per sentence, each row holding a single "text" value
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))

...
val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_pmc_base_cased", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu

text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_pmc_base_cased').predict(text, output_level='sentence')
embeddings_df

Results

sentence                         en_embed_sentence_biobert_pubmed_pmc_base_cased_embeddings

I hate cancer                    [0.2354733943939209, 0.30127033591270447, -0.1...
Antibiotics aren't painkiller    [0.2837969958782196, 0.03842488303780556, 0.04...
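
Sentence embeddings like the ones above are commonly compared with cosine similarity. The following is a minimal sketch in plain Python, not part of Spark NLP or NLU. The function name cosine_similarity and the three-dimensional toy vectors are assumptions made for illustration. Real vectors from this model would have 768 values each and would be taken from the embeddings column of the output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy placeholder vectors (hypothetical, NOT real model output).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1.0 for parallel vectors
print(cosine_similarity(vec_a, vec_c))  # 0.0 for orthogonal vectors
```

Values near 1 indicate vectors pointing in a similar direction, and values near 0 indicate unrelated directions. How well this tracks semantic similarity for a given pair of biomedical sentences depends on the model and the task.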

Model Information

Model Name:	sent_biobert_pubmed_pmc_base_cased
Type:	embeddings
Compatibility:	Spark NLP 2.6.0
License:	Open Source
Edition:	Official
Input Labels:	[sentence]
Output Labels:	[sentence_embeddings]
Language:	[en]
Dimension:	768
Case sensitive:	true

Data Source

The model is imported from https://github.com/dmis-lab/biobert
