English DistilBertForSequenceClassification Base Uncased model (from Aureliano)

Description

Pretrained DistilBertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. distilbert-base-uncased-if is an English model originally trained by Aureliano.

Predicted Entities

charge.v.17, kill.v.01, put.v.01, switch_off.v.01, ask.v.01, dig.v.01, search.v.04, repeat.v.01, wear.v.02, play.v.03, ask.v.02, wait.v.01, smash.v.02, clean.v.01, drink.v.01, inventory.v.01, climb.v.01, close.v.01, set.v.05, hit.v.03, remove.v.01, hit.v.02, sit_down.v.01, memorize.v.01, stand.v.03, write.v.07, insert.v.01, light_up.v.05, show.v.01, travel.v.01, listen.v.01, sequence.n.02, brandish.v.01, take_off.v.06, wake_up.v.02, connect.v.01, say.v.08, burn.v.01, talk.v.02, turn.v.09, smell.v.01, pull.v.04, move.v.02, shoot.v.01, press.v.01, exit.v.01, take.v.04, examine.v.02, read.v.01, follow.v.01, jump.v.01, rub.v.01, throw.v.01, answer.v.01, shake.v.01, drive.v.01, buy.v.01, eat.v.01, open.v.01, break.v.05, note.v.04, sleep.v.01, drop.v.01, blow.v.01, fill.v.01, choose.v.01, enter.v.01, pray.v.01, skid.v.04, lower.v.01, lie_down.v.01, cut.v.01, look.v.01, unlock.v.01, give.v.03, tell.v.03, unknown, switch_on.v.01, consult.v.02, raise.v.02, insert.v.02, pour.v.01, touch.v.01, push.v.01
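Most of the labels above look like WordNet synset names in the lemma.pos.sense form (for example switch_off.v.01), with unknown as a catch-all. That reading is an assumption; the model card does not state it. Under that assumption, a minimal sketch for turning a predicted label into its parts might look like the following. The names SynsetLabel, parse_label, and readable_lemma are hypothetical helpers for illustration, not part of Spark NLP.

```python
from typing import NamedTuple, Optional


class SynsetLabel(NamedTuple):
    """Parts of a label such as 'switch_off.v.01' (hypothetical helper type)."""
    lemma: str
    pos: Optional[str]    # part-of-speech tag, e.g. 'v' or 'n'; None if absent
    sense: Optional[int]  # sense number, e.g. 1; None if absent


def parse_label(label: str) -> SynsetLabel:
    """Split a label into lemma, part of speech, and sense number.

    Labels that do not match the assumed lemma.pos.sense pattern
    (such as 'unknown') are returned whole, with pos and sense set to None.
    """
    parts = label.rsplit(".", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        return SynsetLabel(label, None, None)
    lemma, pos, sense = parts
    return SynsetLabel(lemma, pos, int(sense))


def readable_lemma(label: str) -> str:
    """Return the lemma with underscores replaced by spaces."""
    return parse_label(label).lemma.replace("_", " ")


if __name__ == "__main__":
    for label in ["put.v.01", "switch_off.v.01", "sequence.n.02", "unknown"]:
        print(label, "->", parse_label(label), "|", readable_lemma(label))
```

A helper like this can be applied to the strings in the classifier's output column after they are collected, for instance to group predictions by lemma regardless of sense number.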

How to use

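# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

# Load the pretrained classifier; predictions are written to the "class" column
sequenceClassifier_loaded = DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_distilbert_base_uncased_if", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, sequenceClassifier_loaded])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)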
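
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // enables .toDF on a Seq; assumes a SparkSession named spark

// Wrap the raw text column in a document annotation
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Split each document into tokens
val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

// Load the pretrained classifier; predictions are written to the "class" column
val sequenceClassifier_loaded = DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_distilbert_base_uncased_if", "en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, sequenceClassifier_loaded))

val data = Seq("PUT YOUR STRING HERE").toDF("text")

val result = pipeline.fit(data).transform(data)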

# NLU (Python)
import nlu
nlu.load("en.classify.distil_bert.uncased_base").predict("""PUT YOUR STRING HERE""")

Model Information

Model Name: distilbert_sequence_classifier_distilbert_base_uncased_if
Compatibility: Spark NLP 4.1.0+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [class]
Language: en
Size: 250.0 MB
Case sensitive: false
Max sentence length: 128

References

  • https://huggingface.co/Aureliano/distilbert-base-uncased-if
  • https://rasa.com/docs/rasa/components#languagemodelfeaturizer
  • https://github.com/aporporato/jericho-corpora