
Hinglish Abusive Comment Detection Using Transformer-Based Models

A comparative study of multilingual transformer baselines and hybrid Transformer-BiLSTM architectures for detecting abusive content in code-mixed Hinglish text.

Python 3.9+ PyTorch Hugging Face License: MIT


Abstract

Detecting abusive language in code-mixed Hinglish content presents distinct challenges due to script mixing, transliteration variation, and deliberate obfuscation. This repository accompanies a research paper that benchmarks four model configurations — two fine-tuned transformer baselines (mBERT and XLM-RoBERTa) and two hybrid Transformer-BiLSTM variants — on a large-scale dataset of approximately 824K user comments sourced from the ShareChat-IndoML Datathon. A key preprocessing contribution is an obfuscation-aware normalization pipeline that combines regex-based character substitution with Levenshtein distance matching to recover intentionally disguised profanity. All training code, evaluation plots, and preprocessing utilities are provided for full reproducibility. Quantitative results and per-class analysis are presented in the Results and Analysis section.


Motivation and Challenges

Hinglish — a code-mixed register of Hindi and English — is among the most widely used informal registers on Indian social media platforms. Automated moderation of abusive Hinglish content is complicated by several factors:

  • Script mixing: Users freely alternate between Devanagari and Roman scripts, often within a single sentence.
  • Transliteration variation: The same Hindi word may appear in numerous romanized spellings (e.g., "kutta", "kuttaa", "kutha").
  • Deliberate obfuscation: Profanity is frequently disguised through character substitution (e.g., @ for a, $ for s), vowel insertion, and repeated characters.
  • Dialectal diversity: The dataset spans six languages — Hindi, English, Punjabi, Bhojpuri, Haryanvi, and Rajasthani — each contributing distinct vocabulary and grammatical patterns.
  • Class imbalance: The dataset exhibits an approximate 68.8% / 31.2% split between non-abusive and abusive labels, requiring careful metric selection during evaluation.

Repository Structure

abusive-detection/
├── abusive-detection-dataset.ipynb   # Data loading, preprocessing, baseline model training & evaluation
├── hybrid_model.ipynb                # Hybrid Transformer-BiLSTM training & evaluation
├── Dataset/
│   ├── output.csv                    # Preprocessed dataset (~824K comments)
│   └── profane_words.sample.txt      # Format specification for profanity lexicon (placeholder)
├── plots/
│   ├── loss_plot_bert_base_multilingual.png
│   ├── accuracy_plot_bert_base_multilingual.png
│   ├── eval_metrics_bar_bert_base_multilingual.png
│   ├── roc_curve_bert_base_multilingual.png
│   ├── loss_plot_xlm-roberta-base.png
│   ├── accuracy_plot_xlm-roberta-base.png
│   ├── eval_metrics_bar_xlm-roberta-base.png
│   └── roc_curve_xlm-roberta-base.png
├── LICENSE                           # MIT License
├── .gitignore
└── README.md

The two notebooks are self-contained and sequential: abusive-detection-dataset.ipynb handles data preparation and baseline experiments, while hybrid_model.ipynb implements the hybrid architectures. Both notebooks share the same preprocessed dataset (Dataset/output.csv).


Model Architectures

Baselines

Two pre-trained multilingual transformers are fine-tuned with a standard sequence classification head (AutoModelForSequenceClassification, 2-class output):

| Model | Identifier | Parameters | Pre-training Languages |
| --- | --- | --- | --- |
| mBERT | bert-base-multilingual-cased | ~178M | 104 languages |
| XLM-RoBERTa | xlm-roberta-base | ~278M | 100 languages |

Both baselines use the default pooled [CLS] representation followed by a linear classification layer.
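As a sketch of how such a baseline is typically wired up with the Hugging Face API (the function names build_baseline and compute_metrics here are illustrative, not necessarily those used in the notebooks):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Binary classification metrics; F1 drives best-checkpoint selection."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

def build_baseline(model_name="bert-base-multilingual-cased"):
    """Load tokenizer and a 2-class sequence classification head."""
    # Deferred import so compute_metrics stays usable without model weights.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    return tokenizer, model
```

Swapping model_name for xlm-roberta-base yields the second baseline with no other changes.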

Hybrid Transformer-BiLSTM

The hybrid architecture augments the transformer encoder with a bidirectional LSTM to capture sequential dependencies beyond the transformer's attention patterns:

Input Tokens
     │
     ▼
┌─────────────────────┐
│  Transformer Encoder │   (mBERT or XLM-R, frozen/fine-tuned)
│  [last_hidden_state] │
└─────────┬───────────┘
          │  (batch, seq_len, hidden_size)
          ▼
┌─────────────────────┐
│   Bidirectional LSTM │   hidden_size=128, num_layers=1
│                      │
└─────────┬───────────┘
          │  (batch, seq_len, 256)
          ▼
┌─────────────────────┐
│     Mean Pooling     │   average over sequence dimension
└─────────┬───────────┘
          │  (batch, 256)
          ▼
┌─────────────────────┐
│    Dropout (0.3)     │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Linear (256 → 2)   │   classification logits
└─────────────────────┘

Two variants are implemented:

  • TransformerLSTMClassifier — uses bert-base-multilingual-cased as the encoder
  • XLMRobertaLSTMClassifier — uses xlm-roberta-base as the encoder

Both are implemented as PreTrainedModel subclasses, enabling seamless integration with the Hugging Face Trainer API.
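A minimal sketch of the hybrid head described by the diagram above. The class name and constructor arguments are illustrative; the repository's variants subclass PreTrainedModel, whereas this sketch uses a plain nn.Module and accepts any encoder whose forward returns a last_hidden_state:

```python
import torch
import torch.nn as nn

class HybridLSTMClassifier(nn.Module):
    """Transformer encoder -> BiLSTM -> mean pool -> dropout -> linear.

    `encoder` is any module returning an object with a `last_hidden_state`
    of shape (batch, seq_len, encoder_hidden), e.g.
    AutoModel.from_pretrained("bert-base-multilingual-cased").
    """
    def __init__(self, encoder, encoder_hidden=768, lstm_hidden=128,
                 num_labels=2, dropout=0.3):
        super().__init__()
        self.encoder = encoder
        self.lstm = nn.LSTM(encoder_hidden, lstm_hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden)               # (batch, seq_len, 256)
        pooled = lstm_out.mean(dim=1)                 # mean over sequence dim
        return self.classifier(self.dropout(pooled))  # (batch, 2)
```

The bidirectional LSTM doubles the hidden size (128 × 2 = 256), matching the shapes in the diagram.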


Dataset Description

| Property | Value |
| --- | --- |
| Source | ShareChat-IndoML Datathon (NSFW Comment Challenge) |
| Languages | Hindi, English, Punjabi, Bhojpuri, Haryanvi, Rajasthani |
| Script | Mixed (Devanagari + Roman) |
| Size | ~824K comments (after filtering and deduplication) |
| Columns | commentText, label |
| Labels | 0 = Not Abusive, 1 = Abusive |
| Class distribution | ~68.8% Not Abusive / ~31.2% Abusive |
| Preprocessing | Obfuscation-aware pipeline (see Preprocessing Pipeline) |

The raw dataset originally included additional metadata columns (CommentId, user_index, post_index, report_count_comment, report_count_post, like_count_comment, like_count_post, language), which are dropped during preparation. Rows with missing or empty commentText values are removed, duplicate comments are dropped, and only comments in the six target languages are retained.
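The filtering steps above can be sketched in pandas; the function name prepare_dataset is illustrative, and the exact order of operations in the notebook may differ:

```python
import pandas as pd

TARGET_LANGUAGES = {"Hindi", "English", "Punjabi", "Bhojpuri",
                    "Haryanvi", "Rajasthani"}
METADATA_COLS = ["CommentId", "user_index", "post_index",
                 "report_count_comment", "report_count_post",
                 "like_count_comment", "like_count_post"]

def prepare_dataset(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the six target languages, drop metadata columns, remove
    missing/empty comments, and deduplicate on comment text."""
    df = raw[raw["language"].isin(TARGET_LANGUAGES)].copy()
    df = df.drop(columns=[c for c in METADATA_COLS + ["language"]
                          if c in df.columns])
    df = df.dropna(subset=["commentText"])
    df = df[df["commentText"].str.strip().astype(bool)]  # drop empty strings
    df = df.drop_duplicates(subset="commentText")
    return df[["commentText", "label"]].reset_index(drop=True)
```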


Experimental Setup

Hyperparameters

| Parameter | Baselines (mBERT / XLM-R) | Hybrid (mBERT) | Hybrid (XLM-R) |
| --- | --- | --- | --- |
| Learning rate | 2e-5 | 2e-5 | 2e-5 |
| Weight decay | 0.01 | 0.01 | 0.01 |
| Max epochs | 50 | 50 | 50 |
| Max sequence length | 128 | 128 | 128 |
| Train batch size | 512 | 512 | 256 |
| Eval batch size | 256 | 512 | 256 |
| Early stopping patience | 3 | 3 | 3 |
| LR scheduler | ReduceLROnPlateau (patience=2, factor=0.1) | ReduceLROnPlateau (patience=2, factor=0.1) | ReduceLROnPlateau (patience=2, factor=0.1) |
| Train / Val / Test split | 80% / — / 20% | 75% / 12.5% / 12.5% | 75% / 12.5% / 12.5% |
| Best model selection | F1 score | F1 score | F1 score |

Training Procedure

All models are trained using the Hugging Face Trainer API with the following configuration:

  • GPU auto-detection via torch.cuda.is_available()
  • Epoch-level evaluation with logging, saving, and evaluation at each epoch boundary
  • Best checkpoint selection by validation F1 score (load_best_model_at_end=True)
  • Custom callbacks:
    • ReduceLROnPlateauCallback — reduces learning rate by a factor of 0.1 when F1 plateaus for 2 consecutive evaluations
    • TrainingAccuracyCallback — computes and logs training set accuracy at the end of each epoch
  • Early stopping via EarlyStoppingCallback with a patience of 3 epochs
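A configuration fragment illustrating how these settings map onto the Trainer API. Argument names follow recent versions of transformers (older releases spell eval_strategy as evaluation_strategy), and output_dir is a placeholder:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Values mirror the hyperparameter table above (mBERT baseline column).
args = TrainingArguments(
    output_dir="checkpoints/mbert-baseline",   # placeholder path
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=50,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=256,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```

The ReduceLROnPlateau and training-accuracy callbacks are custom TrainerCallback subclasses defined in the notebooks and are added alongside early_stop via the callbacks argument of Trainer.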

Results and Analysis

mBERT (bert-base-multilingual-cased)

Figures: training/validation loss, accuracy, evaluation metric bars, and ROC curve for mBERT (see plots/).

The mBERT model converges within approximately 8 epochs. The divergence between training and validation loss curves in later epochs indicates some degree of overfitting, which early stopping mitigates. Per-class classification reports are available in the notebooks.

XLM-RoBERTa (xlm-roberta-base)

Figures: training/validation loss, accuracy, evaluation metric bars, and ROC curve for XLM-RoBERTa (see plots/).

The XLM-RoBERTa model exhibits a similar convergence profile, stabilizing within approximately 7 epochs. Training dynamics are comparable to mBERT, with marginal differences in the overfitting gap.

Discussion

Performance across the four model configurations is broadly comparable, suggesting that the preprocessing pipeline and dataset characteristics are the dominant factors in this task. The observed train-validation gap across all models indicates room for additional regularization (e.g., increased dropout, data augmentation, or label smoothing). Detailed per-class precision, recall, and F1 scores, along with confusion matrices, are available in the respective notebooks.


Preprocessing Pipeline

The preprocessing pipeline is implemented in abusive-detection-dataset.ipynb and consists of four core functions:

  1. load_profanity_lexicon(file_path) — Reads a comma-separated profanity word list from disk, normalizes entries to lowercase, and removes punctuation.
  2. build_profanity_patterns(profanity_list) — Compiles regex patterns for each profanity term using character-level substitution maps (e.g., a → [a@4], s → [s5$z]) to match obfuscated variants.
  3. preprocess_text(text, profanity_patterns, profanity_list) — Applies the full normalization pipeline to a single text string.
  4. preprocess_dataframe_parallel(df, profanity_patterns, profanity_list) — Wraps preprocess_text with joblib.Parallel for multi-core execution across the dataframe.
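A minimal sketch of the pattern-building step (2 above). The substitution map here is illustrative, and a harmless word stands in for actual lexicon entries; the notebook's map and lexicon differ. Lookaround assertions are used instead of \b because word boundaries fail on leading symbol substitutions like $:

```python
import re

# Illustrative character-substitution map; the notebook's actual map may differ.
SUBS = {"a": "[a@4]", "s": "[s5$z]", "i": "[i1!]", "o": "[o0]", "e": "[e3]"}

def build_profanity_patterns(profanity_list):
    """Compile one regex per lexicon entry that tolerates character
    substitution ('b@d') and character repetition ('baaad')."""
    patterns = {}
    for word in profanity_list:
        parts = [SUBS.get(ch, re.escape(ch)) + "+"   # '+' absorbs repeats
                 for ch in word.lower()]
        # (?<!\w) / (?!\w) instead of \b: '$' is a non-word character, so
        # \b would never match at the start of an obfuscated token like '$pam'.
        patterns[word] = re.compile(r"(?<!\w)" + "".join(parts) + r"(?!\w)",
                                    re.IGNORECASE)
    return patterns

def normalize_profanity(text, patterns):
    """Replace any obfuscated variant with its canonical lexicon spelling."""
    for canonical, pattern in patterns.items():
        text = pattern.sub(canonical, text)
    return text
```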

Processing flow:

Raw Text
  │
  ├─► Lowercase
  ├─► Emoji removal
  ├─► URL removal
  ├─► Profanity normalization (regex substitution patterns)
  ├─► Mention / hashtag removal
  ├─► Repeated character reduction + Levenshtein fuzzy matching
  ├─► Special character / digit removal
  └─► Whitespace normalization
  │
  ▼
Cleaned Text

The Levenshtein matching step is particularly important: after repeated-character reduction (e.g., worrrd → worrd), the pipeline computes edit distances between each word and the profanity lexicon and normalizes any word within the threshold (edit distance ≤ 1, or ≤ 20% of the target word's length).
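The reduction-plus-fuzzy-matching step can be sketched as follows. This uses a plain dynamic-programming edit distance for self-containment (the repository installs python-Levenshtein/rapidfuzz for speed), and the function names and example words are illustrative:

```python
import re

def reduce_repeats(word: str) -> str:
    """Collapse runs of 3+ identical characters to 2 ('worrrd' -> 'worrd')."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_normalize(word: str, lexicon: list[str]) -> str:
    """Map a word to a lexicon entry when within the stated threshold:
    edit distance <= 1, or <= 20% of the target word's length."""
    reduced = reduce_repeats(word)
    for target in lexicon:
        if levenshtein(reduced, target) <= max(1, int(0.2 * len(target))):
            return target
    return word
```

So a disguised variant like worrrd first becomes worrd, which sits at edit distance 1 from the target word and is normalized; unrelated words pass through unchanged.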


Reproducibility

Environment Setup

git clone https://github.com/CodeNinjaSarthak/abusive-detection.git
cd abusive-detection
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install torch transformers datasets scikit-learn matplotlib seaborn \
    fuzzywuzzy[speedup] python-Levenshtein emoji pandas numpy joblib tqdm rapidfuzz

A CUDA-compatible GPU is recommended for training. The notebooks auto-detect GPU availability via torch.cuda.is_available() and will fall back to CPU if unavailable (training times will increase substantially).

Running the Experiments

  1. Prepare the profanity lexicon (if reproducing from the raw datathon data): Place your lexicon file at Dataset/Profane words.txt following the format documented in Dataset/profane_words.sample.txt. This step is not required if using the provided Dataset/output.csv, which is already preprocessed.

  2. Run the dataset notebook: Open and execute abusive-detection-dataset.ipynb top-to-bottom. This notebook handles data loading, preprocessing, and baseline model training and evaluation.

  3. Run the hybrid model notebook: Open and execute hybrid_model.ipynb. This notebook loads output.csv and trains both hybrid Transformer-BiLSTM variants.

  4. View outputs: Training plots are saved to plots/. Classification reports and confusion matrices are printed inline in the notebooks.

Note: File paths in the notebooks assume execution from the repository root. Adjust paths if your working directory differs. Due to non-determinism in GPU operations, exact numerical results may vary slightly across runs.


Data Availability and Ethics

Dataset

The file Dataset/output.csv is included in this repository. It is derived from the publicly available ShareChat-IndoML Datathon NSFW Comment Challenge dataset. All personally identifiable information (PII) has been removed during preprocessing; the dataset contains only comment text and binary labels.

Profanity Lexicon

The file Dataset/Profane words.txt has been intentionally excluded from this repository. The lexicon was compiled by aggregating content from multiple publicly accessible web sources. It is withheld for two reasons:

  1. Ambiguous redistribution rights — the provenance of the aggregated sources does not clearly permit redistribution.
  2. Ethical concerns — unrestricted distribution of a concentrated list of abusive terms in multiple languages and scripts poses risks of misuse.

This omission does not affect methodological transparency. The preprocessing pipeline, model architectures, training procedures, and evaluation protocols are fully documented in the notebooks and remain independently verifiable. The lexicon serves as a configurable input to the preprocessing stage; any equivalent word list can be substituted to replicate the general workflow. The expected format is documented in Dataset/profane_words.sample.txt.

Requesting Access

Researchers who require the original lexicon for academic, non-commercial purposes may request access on a case-by-case basis:

  • GitHub Issues: Open an issue in this repository with the subject line [Lexicon Access Request], including a brief description of intended use and institutional affiliation.
  • Email: Contact the author directly at <author-email>.

All requests are reviewed individually. Access is granted solely for research and educational use and may not be redistributed without written permission.


License

This project is licensed under the MIT License. See LICENSE for details.


Citation

If you use this code or dataset in your research, please cite:

@article{<citation-key>,
  title     = {Hinglish Abusive Comment Detection Using Transformer-Based Models},
  author    = {},
  journal   = {},
  year      = {},
  url       = {https://github.com/CodeNinjaSarthak/abusive-detection}
}

Paper reference will be updated upon publication.
