Detect Biomedical Entities in English

Description

A Named Entity Recognition (NER) model that finds Protein entities in biomedical texts.

Predicted Entities

Protein

How to use

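# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
)
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")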

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')
    
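# glove_100d is the default pretrained WordEmbeddingsModel; it is named here to match the model's dependency
glove_embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")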

loaded_ner_model = NerDLModel.pretrained("ner_protein_glove", "en")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(stages = [
      document,
      sentence,
      token,
      glove_embeddings,
      loaded_ner_model,
      converter
  ])

text = '''
MACROPHAGES ARE MONONUCLEAR phagocytes that reside within almost all tissues including adipose tissue, where they are identifiable as distinct populations with tissue-specific morphology, localization, and function (1). During the process of atherosclerosis, monocytes adhere to the endothelium and migrate into the intima, express scavenger receptors, and bind internalized lipoprotein particles resulting in the formation of foam cells (2). In obesity, adipose tissue contains an increased number of resident macrophages (3, 4). Macrophage accumulation in proportion to adipocyte size may increase the adipose tissue production of proinflammatory and acute-phase molecules and thereby contribute to the pathophysiological consequences of obesity (1, 3). These facts indicate that macrophages play an important role in a variety of diseases. When activated, macrophages release stereotypical profiles of cytokines and biological molecules such as nitric oxide TNF-α, IL-6, and IL-1 (5). TNF-α is a potent chemoattractant (6) and originates predominantly from residing mouse peritoneal macrophages (MPM) and mast cells (7). TNF-α induces leukocyte adhesion and degranulation, stimulates nicotinamide adenine dinucleotide phosphate (NADPH) oxidase, and enhances expression of IL-2 receptors and expression of E-selectin and intercellular adhesion molecules on the endothelium (8). TNF-α also stimulates expression of IL-1, IL-2, IL-6, and platelet-activating factor receptor (9). In addition, TNF-α decreases insulin sensitivity and increases lipolysis in adipocytes (10, 11). IL-6 also increase lipolysis and has been implicated in the hypertriglyceridemia and increased serum free fatty acid levels associated with obesity (12). Increased IL-6 signaling induces the expression of C-reactive protein and haptoglubin in liver (13). Recombinant IL-6 treatment increases atherosclerotic lesion size 5-fold (14). 
IL-6 also dose-dependently increases macrophage oxidative low-density lipoprotein (LDL) degradation and CD36 mRNA expression in vitro (15). These data clearly indicate that IL-6 and TNF-α are important pathogenetic factors associated with obesity, insulin resistance, and atherosclerosis. However, the factors regulating gene expression of these cytokines in macrophages have not been fully clarified.
'''

sample_data = spark.createDataFrame([[text]]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(sample_data)
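preds = prediction_model.transform(sample_data)

// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // enables toDF on a Seq; assumes an active SparkSession named spark

val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")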

val sentence = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")
    
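// glove_100d is the default pretrained WordEmbeddingsModel; it is named here to match the model's dependency
val glove_embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")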

val loaded_ner_model = NerDLModel.pretrained("ner_protein_glove", "en")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_span")

val ner_prediction_pipeline = new Pipeline().setStages(Array(
      document,
      sentence,
      token,
      glove_embeddings,
      loaded_ner_model,
      converter
  ))

// A multi-line string literal needs triple quotes in Scala
val text = Seq("""
MACROPHAGES ARE MONONUCLEAR phagocytes that reside within almost all tissues including adipose tissue, where they are identifiable as distinct populations with tissue-specific morphology, localization, and function (1). During the process of atherosclerosis, monocytes adhere to the endothelium and migrate into the intima, express scavenger receptors, and bind internalized lipoprotein particles resulting in the formation of foam cells (2). In obesity, adipose tissue contains an increased number of resident macrophages (3, 4). Macrophage accumulation in proportion to adipocyte size may increase the adipose tissue production of proinflammatory and acute-phase molecules and thereby contribute to the pathophysiological consequences of obesity (1, 3). These facts indicate that macrophages play an important role in a variety of diseases. When activated, macrophages release stereotypical profiles of cytokines and biological molecules such as nitric oxide TNF-α, IL-6, and IL-1 (5). TNF-α is a potent chemoattractant (6) and originates predominantly from residing mouse peritoneal macrophages (MPM) and mast cells (7). TNF-α induces leukocyte adhesion and degranulation, stimulates nicotinamide adenine dinucleotide phosphate (NADPH) oxidase, and enhances expression of IL-2 receptors and expression of E-selectin and intercellular adhesion molecules on the endothelium (8). TNF-α also stimulates expression of IL-1, IL-2, IL-6, and platelet-activating factor receptor (9). In addition, TNF-α decreases insulin sensitivity and increases lipolysis in adipocytes (10, 11). IL-6 also increase lipolysis and has been implicated in the hypertriglyceridemia and increased serum free fatty acid levels associated with obesity (12). Increased IL-6 signaling induces the expression of C-reactive protein and haptoglubin in liver (13). Recombinant IL-6 treatment increases atherosclerotic lesion size 5-fold (14). 
IL-6 also dose-dependently increases macrophage oxidative low-density lipoprotein (LDL) degradation and CD36 mRNA expression in vitro (15). These data clearly indicate that IL-6 and TNF-α are important pathogenetic factors associated with obesity, insulin resistance, and atherosclerosis. However, the factors regulating gene expression of these cytokines in macrophages have not been fully clarified.
""").toDF("text")

// Fit the pipeline defined above on the sample DataFrame, then annotate it
val preds = ner_prediction_pipeline.fit(text).transform(text)

Results

+--------------+-------+
|chunk         |entity |
+--------------+-------+
|IL-6          |Protein|
|IL-1 (5).     |Protein|
|TNF-α         |Protein|
|IL-2 receptors|Protein|
|IL-2          |Protein|
|IL-6          |Protein|
|insulin       |Protein|
|haptoglubin   |Protein|
|CD36          |Protein|
|IL-6          |Protein|
|insulin       |Protein|
+--------------+-------+
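
The chunks in this table are built from token-level tags: the NER stage labels each token B-Protein, I-Protein, or O, and the converter stage merges consecutive tagged tokens into one span. The snippet below is a minimal pure-Python sketch of that grouping idea, not Spark NLP's NerConverter implementation. The function name iob_to_chunks and the example tags are hypothetical and chosen only for illustration; they are not actual model output.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, entity_label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far (if any) and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open chunk of the same label.
            current_tokens.append(token)
        else:
            # "O" closes any open chunk; an I- tag with no matching
            # open chunk is discarded in this simplified version.
            flush()
    flush()
    return chunks


# Hypothetical tags for a fragment of the sample text.
tokens = ["enhances", "expression", "of", "IL-2", "receptors", "and", "E-selectin"]
tags = ["O", "O", "O", "B-Protein", "I-Protein", "O", "O"]
print(iob_to_chunks(tokens, tags))
```

Joining tokens with a single space is a simplification; a real converter would typically use character offsets to recover the original span text.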

Model Information

Model Name: Ner_glove_100d
Type: ner
Compatibility: Spark NLP 3.1.2+
License: Open Source
Edition: Community
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.3 MB
Dependencies: glove100d

References

This model was trained on the following dataset

Benchmarking

       label  precision    recall  f1-score   support
   B-Protein       0.86      0.87      0.87      3589
   I-Protein       0.84      0.86      0.85      4078
           O       0.99      0.98      0.98     66957
    accuracy         -         -       0.97     74624
   macro-avg       0.90      0.91      0.90     74624
weighted-avg       0.97      0.97      0.97     74624
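
To make the two summary rows easier to read, the sketch below recomputes the macro and weighted averages of the f1-score from the three per-label rows. It assumes the usual definitions (macro is the unweighted mean over labels, weighted is the support-weighted mean), and the helper names are illustrative. Because the per-label values in the table are already rounded to two decimals, averages recomputed this way can differ from the table in the last digit for some columns.

```python
# Per-label rows copied from the benchmark table:
# label -> (precision, recall, f1, support)
rows = {
    "B-Protein": (0.86, 0.87, 0.87, 3589),
    "I-Protein": (0.84, 0.86, 0.85, 4078),
    "O": (0.99, 0.98, 0.98, 66957),
}


def macro_average(values):
    """Unweighted mean: every label counts equally."""
    values = list(values)
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Mean weighted by support: frequent labels count more."""
    values, weights = list(values), list(weights)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


f1_scores = [row[2] for row in rows.values()]
supports = [row[3] for row in rows.values()]

total_support = sum(supports)
macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, supports)

print("total support:", total_support)
print("macro-avg f1:", round(macro_f1, 2))
print("weighted-avg f1:", round(weighted_f1, 2))
```

The weighted average is dominated by the O label, which accounts for most of the 74624 tokens, so the macro average is the more informative summary of performance on the Protein tags themselves.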