Description
This model is the English version of CODER, a knowledge-infused biomedical embedding model designed for medical concept normalization and cross-lingual representation learning.
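To illustrate what concept normalization with an embedding model can look like, the sketch below maps a mention vector to the closest entry in a small dictionary of concept vectors using cosine similarity. The vectors, the concept labels (`CONCEPT_A`, `CONCEPT_B`), and the helper names `cosine` and `normalize_mention` are hypothetical and for illustration only. In practice the vectors would come from the model's embeddings, which have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def normalize_mention(mention_vec, concept_vecs):
    """Return (label, score) of the concept vector closest to the mention."""
    return max(
        ((label, cosine(mention_vec, vec)) for label, vec in concept_vecs.items()),
        key=lambda pair: pair[1],
    )

# Toy 3-dimensional vectors; hypothetical values, not real model output.
concept_vecs = {
    "CONCEPT_A": [1.0, 0.0, 0.0],
    "CONCEPT_B": [0.0, 1.0, 0.0],
}
best_label, best_score = normalize_mention([0.9, 0.1, 0.0], concept_vecs)
print(best_label, round(best_score, 3))
```

A linear scan like this is only reasonable for small dictionaries; a larger concept vocabulary would likely call for an approximate nearest-neighbor index.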
How to use
# Python
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
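# Wrap the raw text column in a Spark NLP document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")
# Load the pretrained model; it produces one embedding vector per token
embeddings = BertEmbeddings.pretrained("umlsbert_eng_onnx", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")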
pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    embeddings
])
# Assumes an active SparkSession named `spark`
data = spark.createDataFrame([
    ["Artificial intelligence is transforming the world."],
    ["Machine learning enables powerful data-driven systems."]
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show()
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
val embeddings = BertEmbeddings
  .pretrained("umlsbert_eng_onnx", "en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  embeddings
))
val data = Seq(
  "Artificial intelligence is transforming the world.",
  "Machine learning enables powerful data-driven systems."
).toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(truncate = false)
Results
+--------------------+
| embeddings|
+--------------------+
|[[-0.6771237, 0.5...|
|[[-1.016453, 0.21...|
+--------------------+
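Each row in the output holds one vector per token. If a single vector per input text is needed, for example to compare two terms, one common option is to average the token vectors. The sketch below does this in plain Python on toy data. `mean_pool` is a hypothetical helper, and the nested-list input mirrors the shape of the `embeddings.embeddings` column once it has been collected to the driver.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("expected at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Average each dimension across all tokens
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Toy stand-in for one collected row: three tokens, two dimensions each.
# Hypothetical values, not real model output.
row = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(row)
print(pooled)
```

With a real pipeline result, the same helper could be applied to each collected row, for instance `[mean_pool(r[0]) for r in result.select("embeddings.embeddings").collect()]`. For larger datasets, pooling inside the pipeline with Spark NLP's `SentenceEmbeddings` annotator is likely a better fit than collecting to the driver.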
Model Information
| Model Name: | umlsbert_eng_onnx |
| Compatibility: | Healthcare NLP 6.1.0+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [document, token] |
| Output Labels: | [embeddings] |
| Language: | en |
| Size: | 408.0 MB |
| Case sensitive: | false |
| Max sentence length: | 512 |