Description
Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. FastPDN-distiluse is a Polish model originally trained by clarin-pl.
Predicted Entities
nam_num_phone, nam_loc_land_mountain, nam_loc_gpe_subdivision, nam_adj_country, nam_oth_currency, nam_loc_country_region, nam_pro_title_document, nam_loc_gpe_admin1, nam_eve_human_sport, nam_loc_hydronym_sea, nam_fac_park, nam_adj_city, nam_loc_land_region, nam_liv_animal, nam_liv_person, nam_pro_media_tv, nam_fac_bridge, nam_pro_model_car, nam_oth_tech, nam_oth_position, nam_loc_land_island, nam_liv_habitant, nam_pro_award, nam_pro_title_article, nam_org_group, nam_num_house, nam_pro_title_book, nam_pro_media_periodic, nam_pro_media_web, nam_pro_title_treaty, nam_loc_gpe_conurbation, nam_pro_software_game, nam_pro_brand, nam_fac_goe, nam_loc_historical_region, nam_pro, nam_pro_media_radio, nam_pro_title, nam_loc_hydronym_river, nam_loc_land, nam_org_group_team, nam_fac_system, nam_org_company, nam_pro_title_song, nam_loc_land_peak, nam_eve, nam_loc_hydronym_ocean, nam_org_group_band, nam_liv_character, nam_loc_gpe_admin2, nam_org_organization, nam_adj_person, nam_eve_human, nam_org_nation, nam_loc_gpe_district, nam_liv_god, nam_org_political_party, nam_oth_data_format, nam_loc_land_continent, nam_fac_goe_stop, nam_loc, nam_oth, nam_loc_gpe_admin3, nam_pro_media, nam_loc_gpe_city, nam_loc_hydronym_lake, nam_pro_title_tv, nam_oth_license, nam_org_organization_sub, nam_adj, nam_loc_hydronym, nam_oth_www, nam_org_institution, nam_pro_vehicle, nam_pro_software, nam_loc_gpe_country, nam_eve_human_cultural, nam_fac_road, nam_pro_title_album, nam_loc_astronomical, nam_eve_human_holiday, nam_fac_square
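The label names above appear to be hierarchical: each starts with nam, followed by a short group code such as loc, liv, org or pro, and then optional finer subtypes. As a minimal sketch, assuming that second segment names the broad category (an inference from the label names, not something this card states), the helper below collapses fine-grained tags into coarse groups. The function name `coarse_category` and the sample list are hypothetical, and the handling of `B-`/`I-` prefixes is an assumption about how the tags may be emitted.

```python
from collections import Counter

def coarse_category(tag):
    """Return the second underscore-separated segment of a tag, or None.

    Assumption: a tag may carry a "B-" or "I-" prefix; it is stripped first.
    Tags that do not start with "nam" (for example "O") yield None.
    """
    label = tag[2:] if tag[:2] in ("B-", "I-") else tag
    parts = label.split("_")
    if len(parts) < 2 or parts[0] != "nam":
        return None
    return parts[1]

# Hypothetical sample of tags, chosen only to illustrate the grouping.
sample = ["nam_loc_gpe_city", "nam_liv_person", "nam_oth_data_format", "nam_loc", "O"]
counts = Counter(c for c in map(coarse_category, sample) if c is not None)
print(counts)
```

Only the second segment is used because later segments cannot be split reliably on underscores: in nam_oth_data_format, for instance, data_format reads as a single subtype.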
How to use
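# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertForTokenClassification
from pyspark.ml import Pipeline

spark = sparknlp.start()  # creates or reuses a Spark session with Spark NLP loaded

# Convert the raw "text" column into document annotations.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
# Split each document into tokens.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
# Load the pretrained Polish model; it writes one tag per token to "ner".
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_fastpdn_distiluse", "pl") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)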
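// Scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.classifier.dl.DistilBertForTokenClassification
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes an active SparkSession named spark

// Convert the raw "text" column into document annotations.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
// Split each document into tokens.
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
// Load the pretrained Polish model; it writes one tag per token to "ner".
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_fastpdn_distiluse", "pl")
  .setInputCols(Array("document", "token"))
  .setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)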
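The pipeline output holds one tag per token, so consecutive tokens usually need to be merged into entity spans before use. The sketch below does that in plain Python on two parallel lists, such as those collected from `result.select("token.result", "ner.result")`. It assumes tags may carry `B-`/`I-` prefixes and that `O` marks non-entity tokens; neither is stated in this card. The function names are hypothetical, and the sentence and tags are made up for illustration rather than taken from real model output. Spark NLP's `NerConverter` annotator performs a similar merge inside the pipeline.

```python
def split_tag(tag):
    """Return (prefix, label); prefix is "B", "I", or "" for un-prefixed tags."""
    if tag[:2] in ("B-", "I-"):
        return tag[0], tag[2:]
    return "", tag

def group_entities(tokens, tags):
    """Merge runs of tokens that share a label into (text, label) pairs."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, label = split_tag(tag)
        if label == "O":
            # A non-entity token always closes the open span.
            starts_new, label = True, None
        else:
            # A "B" prefix or a label change opens a new span.
            starts_new = prefix == "B" or label != current_label
        if starts_new:
            if current_label is not None:
                entities.append((" ".join(current_tokens), current_label))
            current_label, current_tokens = label, []
        if label is not None:
            current_tokens.append(token)
    if current_label is not None:
        entities.append((" ".join(current_tokens), current_label))
    return entities

# Made-up tokens and tags, for illustration only (not real model output).
tokens = ["Jan", "Kowalski", "mieszka", "w", "Warszawie", "."]
tags = ["B-nam_liv_person", "I-nam_liv_person", "O", "O", "B-nam_loc_gpe_city", "O"]
entities = group_entities(tokens, tags)
print(entities)
```

Tokens are joined with a single space, which is a simplification; recovering the exact original text would require the character offsets stored in each annotation.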
Model Information
| Model Name: | distilbert_token_classifier_fastpdn_distiluse |
| Compatibility: | Spark NLP 4.3.1+ |
| License: | Open Source |
| Edition: | Official |
| Input Labels: | [document, token] |
| Output Labels: | [ner] |
| Language: | pl |
| Size: | 509.2 MB |
| Case sensitive: | true |
| Max sentence length: | 128 |
References
- https://huggingface.co/clarin-pl/FastPDN-distiluse
- https://gitlab.clarin-pl.eu/information-extraction/poldeepner2
- https://gitlab.clarin-pl.eu/grupa-wieszcz/ner/fast-pdn
- https://clarin-pl.eu/dspace/bitstream/handle/11321/294/WytyczneKPWr-jednostkiidentyfikacyjne.pdf