Description
This model detects cyberbullying in tweets with a multi-class classification framework that assigns each tweet to one of six classes: age, ethnicity, gender, religion, other cyberbullying, and not cyberbullying. We trained it on a Twitter dataset from Kaggle, applying text cleaning, data augmentation, document assembly, Universal Sentence Encoder embeddings, and a TensorFlow-based classification model (ClassifierDL) to process and analyze the data. We also used snscrape to retrieve additional tweets for validating the model's performance. The model reached an accuracy of 85% on the test data and 89% on the training data.
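The exact text-cleaning steps are not listed here, so the following is only a minimal sketch of what a typical tweet-cleaning function might look like. The function name `clean_tweet` and the specific rules (lowercasing, dropping URLs and @mentions, keeping hashtag words, removing non-letters) are assumptions for illustration, not the steps actually used to build the `cleaned_text` column.

```python
import re

# Hypothetical cleaning rules; the original preprocessing may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Return a lowercased tweet with URLs, mentions and symbols removed."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = text.replace("#", " ")       # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)  # drop digits, punctuation, emoji
    return WHITESPACE_RE.sub(" ", text).strip()
```

A function like this could be applied to the raw tweet column (for example through a Spark UDF or pandas `apply`) to produce the `cleaned_text` input that the pipeline expects.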
How to use
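from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

# Wrap the cleaned tweet text in a Spark NLP document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("cleaned_text")\
    .setOutputCol("document")

# Encode each document with the pretrained Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(name="tfhub_use_lg", lang="en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")\
    .setDimension(768)

# Trainable multi-class classifier on top of the sentence embeddings
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("cyberbullying_type")\
    .setBatchSize(16)\
    .setMaxEpochs(42)\
    .setDropout(0.4)\
    .setEnableOutputLogs(True)\
    .setLr(4e-3)

# Chain the three stages into a single Spark ML pipeline
use_clf_pipeline = Pipeline(
    stages=[documentAssembler,
            use,
            classifierdl])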
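The pipeline above is only a definition; it still has to be fitted before it can classify anything. The sketch below shows one way this might be done. The helper name `fit_and_predict` and the DataFrame names `train_df` and `new_df` are hypothetical, and it is assumed that `train_df` holds the `cleaned_text` and `cyberbullying_type` columns while `new_df` holds at least `cleaned_text`.

```python
def fit_and_predict(pipeline, train_df, new_df):
    """Fit the pipeline on labeled tweets, then classify unlabeled ones.

    `pipeline` is expected to behave like a Spark ML Pipeline, and both
    DataFrames are assumed to contain a `cleaned_text` column.
    """
    # Fitting trains the classifier stage and returns a fitted pipeline model
    model = pipeline.fit(train_df)

    # Transforming appends the `class` annotation column to each row
    predictions = model.transform(new_df)

    # `class.result` holds the predicted label(s) for each tweet
    return predictions.select("cleaned_text", "class.result")
```

With a Spark session started through `sparknlp.start()`, a call such as `fit_and_predict(use_clf_pipeline, train_df, new_df).show(truncate=False)` would print each tweet next to its predicted class.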
Results
                     precision    recall  f1-score   support

                age       0.94      0.96      0.95       796
          ethnicity       0.94      0.94      0.94       810
             gender       0.87      0.86      0.86       816
  not_cyberbullying       0.74      0.67      0.70       766
other_cyberbullying       0.67      0.71      0.69       775
           religion       0.94      0.96      0.95       731

           accuracy                           0.85      4694
          macro avg       0.85      0.85      0.85      4694
       weighted avg       0.85      0.85      0.85      4694
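As a quick sanity check on the table, the per-class F1 scores and the two averaged rows can be recomputed from the reported precision, recall and support. This is only an illustrative sketch: it works from the rounded values shown above, so small rounding differences from the original evaluation are possible, and the names `report`, `macro_f1` and `weighted_f1` are introduced here for the example.

```python
# (precision, recall, support) per class, copied from the table above
report = {
    "age": (0.94, 0.96, 796),
    "ethnicity": (0.94, 0.94, 810),
    "gender": (0.87, 0.86, 816),
    "not_cyberbullying": (0.74, 0.67, 766),
    "other_cyberbullying": (0.67, 0.71, 775),
    "religion": (0.94, 0.96, 731),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


f1_scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total_support = sum(s for _, _, s in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(
    f1_scores[label] * s for label, (_, _, s) in report.items()
) / total_support

for label, score in f1_scores.items():
    print(f"{label:>20} f1 = {score:.2f}")
print(f"macro f1 = {macro_f1:.2f}, weighted f1 = {weighted_f1:.2f}")
```

The two weakest classes, not_cyberbullying and other_cyberbullying, pull both averages down to 0.85 even though the four remaining classes each score 0.86 or higher.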
Model Information
| Model Name: | CyberbullyingDetection_ClassifierDL_tfhub |
| Type: | pipeline |
| Compatibility: | Spark NLP 4.4.0+ |
| License: | Open Source |
| Edition: | Community |
| Language: | en |
| Size: | 811.9 MB |
Included Models
- DocumentAssembler
- UniversalSentenceEncoder
- ClassifierDLModel