A comparative study of multilingual transformer baselines and hybrid Transformer-BiLSTM architectures for detecting abusive content in code-mixed Hinglish text.
Detecting abusive language in code-mixed Hinglish content presents distinct challenges due to script mixing, transliteration variation, and deliberate obfuscation. This repository accompanies a research paper that benchmarks four model configurations — two fine-tuned transformer baselines (mBERT and XLM-RoBERTa) and two hybrid Transformer-BiLSTM variants — on a large-scale dataset of approximately 824K user comments sourced from the ShareChat-IndoML Datathon. A key preprocessing contribution is an obfuscation-aware normalization pipeline that combines regex-based character substitution with Levenshtein distance matching to recover intentionally disguised profanity. All training code, evaluation plots, and preprocessing utilities are provided for full reproducibility. Quantitative results and per-class analysis are presented in the Results and Analysis section.
Hinglish — a code-mixed register of Hindi and English — is among the most widely used informal registers on Indian social media platforms. Automated moderation of abusive Hinglish content is complicated by several factors:
- Script mixing: Users freely alternate between Devanagari and Roman scripts, often within a single sentence.
- Transliteration variation: The same Hindi word may appear in numerous romanized spellings (e.g., "kutta", "kuttaa", "kutha").
- Deliberate obfuscation: Profanity is frequently disguised through character substitution (e.g., `@` for `a`, `$` for `s`), vowel insertion, and repeated characters.
- Dialectal diversity: The dataset spans six languages — Hindi, English, Punjabi, Bhojpuri, Haryanvi, and Rajasthani — each contributing distinct vocabulary and grammatical patterns.
- Class imbalance: The dataset exhibits an approximate 68.8% / 31.2% split between non-abusive and abusive labels, requiring careful metric selection during evaluation.
abusive-detection/
├── abusive-detection-dataset.ipynb # Data loading, preprocessing, baseline model training & evaluation
├── hybrid_model.ipynb # Hybrid Transformer-BiLSTM training & evaluation
├── Dataset/
│ ├── output.csv # Preprocessed dataset (~824K comments)
│ └── profane_words.sample.txt # Format specification for profanity lexicon (placeholder)
├── plots/
│ ├── loss_plot_bert_base_multilingual.png
│ ├── accuracy_plot_bert_base_multilingual.png
│ ├── eval_metrics_bar_bert_base_multilingual.png
│ ├── roc_curve_bert_base_multilingual.png
│ ├── loss_plot_xlm-roberta-base.png
│ ├── accuracy_plot_xlm-roberta-base.png
│ ├── eval_metrics_bar_xlm-roberta-base.png
│ └── roc_curve_xlm-roberta-base.png
├── LICENSE # MIT License
├── .gitignore
└── README.md
The two notebooks are self-contained and sequential: `abusive-detection-dataset.ipynb` handles data preparation and baseline experiments, while `hybrid_model.ipynb` implements the hybrid architectures. Both notebooks share the same preprocessed dataset (`Dataset/output.csv`).
Two pre-trained multilingual transformers are fine-tuned with a standard sequence classification head (AutoModelForSequenceClassification, 2-class output):
| Model | Identifier | Parameters | Pre-training Languages |
|---|---|---|---|
| mBERT | `bert-base-multilingual-cased` | ~178M | 104 languages |
| XLM-RoBERTa | `xlm-roberta-base` | ~278M | 100 languages |
Both baselines use the default pooled `[CLS]` representation followed by a linear classification layer.
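A minimal sketch of the baseline setup (standard Hugging Face classes; the full tokenization and training code lives in `abusive-detection-dataset.ipynb`):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # or "xlm-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # 0 = Not Abusive, 1 = Abusive
)

# Tokenize at the max sequence length used in training (128).
batch = tokenizer(
    ["tu bahut accha hai"],  # illustrative romanized Hindi input
    truncation=True,
    padding=True,
    max_length=128,
    return_tensors="pt",
)
logits = model(**batch).logits  # shape: (1, 2)
```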
The hybrid architecture augments the transformer encoder with a bidirectional LSTM to capture sequential dependencies beyond the transformer's attention patterns:
Input Tokens
│
▼
┌─────────────────────┐
│ Transformer Encoder │ (mBERT or XLM-R, frozen/fine-tuned)
│ [last_hidden_state] │
└─────────┬───────────┘
│ (batch, seq_len, hidden_size)
▼
┌─────────────────────┐
│ Bidirectional LSTM │ hidden_size=128, num_layers=1
│ │
└─────────┬───────────┘
│ (batch, seq_len, 256)
▼
┌─────────────────────┐
│ Mean Pooling │ average over sequence dimension
└─────────┬───────────┘
│ (batch, 256)
▼
┌─────────────────────┐
│ Dropout (0.3) │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ Linear (256 → 2) │ classification logits
└─────────────────────┘
Two variants are implemented:
- `TransformerLSTMClassifier` — uses `bert-base-multilingual-cased` as the encoder
- `XLMRobertaLSTMClassifier` — uses `xlm-roberta-base` as the encoder
Both are implemented as `PreTrainedModel` subclasses, enabling seamless integration with the Hugging Face `Trainer` API.
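For orientation, here is a simplified sketch of the same forward pass as a plain `nn.Module` (the class name and plain-module structure are illustrative; the repository's actual classes subclass `PreTrainedModel` and differ in detail):

```python
import torch.nn as nn
from transformers import AutoModel

class TransformerLSTMSketch(nn.Module):
    """Illustrative forward pass matching the diagram above."""

    def __init__(self, encoder_name="bert-base-multilingual-cased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.lstm = nn.LSTM(
            input_size=self.encoder.config.hidden_size,  # 768 for both encoders
            hidden_size=128,
            num_layers=1,
            bidirectional=True,
            batch_first=True,
        )
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(2 * 128, num_labels)  # BiLSTM doubles the width

    def forward(self, input_ids, attention_mask=None, labels=None):
        # (batch, seq_len, hidden_size) contextual token embeddings
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden)  # (batch, seq_len, 256)
        # Mean over the sequence dimension (padding positions included, for simplicity).
        pooled = lstm_out.mean(dim=1)    # (batch, 256)
        logits = self.classifier(self.dropout(pooled))  # (batch, num_labels)
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return {"loss": loss, "logits": logits}
```

Returning a dict with `loss` and `logits` keys is what lets a model like this plug into the Hugging Face `Trainer` loop.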
| Property | Value |
|---|---|
| Source | ShareChat-IndoML Datathon (NSFW Comment Challenge) |
| Languages | Hindi, English, Punjabi, Bhojpuri, Haryanvi, Rajasthani |
| Script | Mixed (Devanagari + Roman) |
| Size | ~824K comments (after filtering and deduplication) |
| Columns | commentText, label |
| Labels | 0 = Not Abusive, 1 = Abusive |
| Class Distribution | ~68.8% Not Abusive / ~31.2% Abusive |
| Preprocessing | Obfuscation-aware pipeline (see Preprocessing Pipeline) |
The raw dataset originally included additional metadata columns (`CommentId`, `user_index`, `post_index`, `report_count_comment`, `report_count_post`, `like_count_comment`, `like_count_post`, `language`), which are dropped during preparation. Rows with missing or empty `commentText` values are removed, duplicate comments are dropped, and only comments from the six target languages are retained.
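A pandas sketch of this preparation (the raw file name is hypothetical, and the `language` values are assumed to be human-readable names; the exact steps live in `abusive-detection-dataset.ipynb`):

```python
import pandas as pd

df = pd.read_csv("raw_datathon_comments.csv")  # hypothetical raw export

TARGET_LANGUAGES = {"Hindi", "English", "Punjabi", "Bhojpuri", "Haryanvi", "Rajasthani"}
META_COLUMNS = [
    "CommentId", "user_index", "post_index",
    "report_count_comment", "report_count_post",
    "like_count_comment", "like_count_post",
]

df = df[df["language"].isin(TARGET_LANGUAGES)]     # keep the six target languages
df = df.drop(columns=META_COLUMNS + ["language"])  # drop metadata columns
df = df.dropna(subset=["commentText"])             # remove missing text
df = df[df["commentText"].str.strip() != ""]       # remove empty text
df = df.drop_duplicates(subset="commentText")      # deduplicate comments
df.to_csv("Dataset/output.csv", index=False)
```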
| Parameter | Baselines (mBERT / XLM-R) | Hybrid (mBERT) | Hybrid (XLM-R) |
|---|---|---|---|
| Learning rate | 2e-5 | 2e-5 | 2e-5 |
| Weight decay | 0.01 | 0.01 | 0.01 |
| Max epochs | 50 | 50 | 50 |
| Max sequence length | 128 | 128 | 128 |
| Train batch size | 512 | 512 | 256 |
| Eval batch size | 256 | 512 | 256 |
| Early stopping patience | 3 | 3 | 3 |
| LR scheduler | ReduceLROnPlateau (patience=2, factor=0.1) | ReduceLROnPlateau (patience=2, factor=0.1) | ReduceLROnPlateau (patience=2, factor=0.1) |
| Train / Val / Test split | 80% / — / 20% | 75% / 12.5% / 12.5% | 75% / 12.5% / 12.5% |
| Best model selection | F1 score | F1 score | F1 score |
All models are trained using the Hugging Face Trainer API with the following configuration:
- GPU auto-detection via `torch.cuda.is_available()`
- Epoch-level strategy: logging, checkpoint saving, and evaluation at each epoch boundary
- Best checkpoint selection by validation F1 score (`load_best_model_at_end=True`)
- Custom callbacks:
  - `ReduceLROnPlateauCallback` — reduces the learning rate by a factor of 0.1 when F1 plateaus for 2 consecutive evaluations
  - `TrainingAccuracyCallback` — computes and logs training-set accuracy at the end of each epoch
- Early stopping via `EarlyStoppingCallback` with a patience of 3 epochs
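A condensed sketch of this setup (standard `Trainer` API; `model`, `train_ds`, and `val_ds` are prepared upstream, and the custom `ReduceLROnPlateauCallback` and `TrainingAccuracyCallback` are defined in the notebooks and omitted here):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # binary F1 on the abusive class
    }

args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=50,
    per_device_train_batch_size=512,  # see the hyperparameter table above
    per_device_eval_batch_size=512,
    evaluation_strategy="epoch",  # "eval_strategy" on recent transformers versions
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,             # any of the four configurations
    args=args,
    train_dataset=train_ds,  # tokenized splits prepared upstream
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```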
| ![Training and validation loss](plots/loss_plot_bert_base_multilingual.png) | ![Training and validation accuracy](plots/accuracy_plot_bert_base_multilingual.png) |
|---|---|
| ![Evaluation metrics](plots/eval_metrics_bar_bert_base_multilingual.png) | ![ROC curve](plots/roc_curve_bert_base_multilingual.png) |
The mBERT model converges within approximately 8 epochs. The divergence between training and validation loss curves in later epochs indicates some degree of overfitting, which early stopping mitigates. Per-class classification reports are available in the notebooks.
| ![Training and validation loss](plots/loss_plot_xlm-roberta-base.png) | ![Training and validation accuracy](plots/accuracy_plot_xlm-roberta-base.png) |
|---|---|
| ![Evaluation metrics](plots/eval_metrics_bar_xlm-roberta-base.png) | ![ROC curve](plots/roc_curve_xlm-roberta-base.png) |
The XLM-RoBERTa model exhibits a similar convergence profile, stabilizing within approximately 7 epochs. Training dynamics are comparable to mBERT, with marginal differences in the overfitting gap.
Performance across the four model configurations is broadly comparable, suggesting that the preprocessing pipeline and dataset characteristics are the dominant factors in this task. The observed train-validation gap across all models indicates room for additional regularization (e.g., increased dropout, data augmentation, or label smoothing). Detailed per-class precision, recall, and F1 scores, along with confusion matrices, are available in the respective notebooks.
The preprocessing pipeline is implemented in `abusive-detection-dataset.ipynb` and consists of four core functions:
- `load_profanity_lexicon(file_path)` — Reads a comma-separated profanity word list from disk, normalizes entries to lowercase, and removes punctuation.
- `build_profanity_patterns(profanity_list)` — Compiles regex patterns for each profanity term using character-level substitution maps (e.g., `a → [a@4]`, `s → [s5$z]`) to match obfuscated variants.
- `preprocess_text(text, profanity_patterns, profanity_list)` — Applies the full normalization pipeline to a single text string.
- `preprocess_dataframe_parallel(df, profanity_patterns, profanity_list)` — Wraps `preprocess_text` with `joblib.Parallel` for multi-core execution across the dataframe.
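A minimal sketch of the substitution-pattern step (the character map below is a small illustrative subset; the notebook's full map covers more characters):

```python
import re

# Illustrative subset of the character-substitution map.
CHAR_CLASSES = {
    "a": "[a@4]",
    "i": "[i1!]",
    "o": "[o0]",
    "s": "[s5$z]",
}

def build_profanity_patterns(profanity_list):
    """Compile one obfuscation-tolerant regex per lexicon entry."""
    patterns = []
    for word in profanity_list:
        # Map each character to its substitution class (identity otherwise)
        # and allow repeats, so "kuttttta" still matches "kutta".
        body = "".join(CHAR_CLASSES.get(ch, re.escape(ch)) + "+" for ch in word)
        # Lookarounds instead of \b so matches can end on symbols like "@".
        patterns.append((re.compile(rf"(?<!\w){body}(?!\w)", re.IGNORECASE), word))
    return patterns

def normalize_profanity(text, patterns):
    """Replace any obfuscated match with its canonical lexicon form."""
    for pattern, canonical in patterns:
        text = pattern.sub(canonical, text)
    return text

# e.g., normalize_profanity("tu kutt@ hai", build_profanity_patterns(["kutta"]))
# -> "tu kutta hai"
```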
Processing flow:
Raw Text
│
├─► Lowercase
├─► Emoji removal
├─► URL removal
├─► Profanity normalization (regex substitution patterns)
├─► Mention / hashtag removal
├─► Repeated character reduction + Levenshtein fuzzy matching
├─► Special character / digit removal
└─► Whitespace normalization
│
▼
Cleaned Text
The Levenshtein matching step is particularly important: after reducing repeated characters (e.g., worrrd → worrd), the pipeline computes edit distances against the profanity lexicon and normalizes words within a threshold (edit distance ≤ 1 or ≤ 20% of the target word length).
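A minimal sketch of this step using `rapidfuzz` (installed in the setup section below; the function names here are illustrative):

```python
import re
from rapidfuzz.distance import Levenshtein

def reduce_repeats(word):
    """Collapse runs of 3+ identical characters to two (worrrd -> worrd)."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

def fuzzy_normalize(word, lexicon):
    """Snap a word onto a lexicon entry within the thresholds above."""
    word = reduce_repeats(word)
    for target in lexicon:
        # Edit distance <= 1, or <= 20% of the target word's length.
        threshold = max(1, int(0.2 * len(target)))
        if Levenshtein.distance(word, target) <= threshold:
            return target
    return word

# e.g., fuzzy_normalize("worrrd", ["word"]) -> "word"
```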
git clone https://github.com/CodeNinjaSarthak/abusive-detection.git
cd abusive-detection
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install torch transformers datasets scikit-learn matplotlib seaborn \
    fuzzywuzzy[speedup] python-Levenshtein emoji pandas numpy joblib tqdm rapidfuzz

A CUDA-compatible GPU is recommended for training. The notebooks auto-detect GPU availability via `torch.cuda.is_available()` and will fall back to CPU if unavailable (training times will increase substantially).
- Prepare the profanity lexicon (if reproducing from the raw datathon data): Place your lexicon file at `Dataset/Profane words.txt` following the format documented in `Dataset/profane_words.sample.txt`. This step is not required if using the provided `Dataset/output.csv`, which is already preprocessed.
- Run the dataset notebook: Open and execute `abusive-detection-dataset.ipynb` top to bottom. This notebook handles data loading, preprocessing, and baseline model training and evaluation.
- Run the hybrid model notebook: Open and execute `hybrid_model.ipynb`. This notebook loads `output.csv` and trains both hybrid Transformer-BiLSTM variants.
- View outputs: Training plots are saved to `plots/`. Classification reports and confusion matrices are printed inline in the notebooks.
Note: File paths in the notebooks assume execution from the repository root. Adjust paths if your working directory differs. Due to non-determinism in GPU operations, exact numerical results may vary slightly across runs.
The file `Dataset/output.csv` is included in this repository. It is derived from the publicly available ShareChat-IndoML Datathon NSFW Comment Challenge dataset. All personally identifiable information (PII) has been removed during preprocessing; the dataset contains only comment text and binary labels.
The file `Dataset/Profane words.txt` has been intentionally excluded from this repository. The lexicon was compiled by aggregating content from multiple publicly accessible web sources. It is withheld for two reasons:
- Ambiguous redistribution rights — the provenance of the aggregated sources does not clearly permit redistribution.
- Ethical concerns — unrestricted distribution of a concentrated list of abusive terms in multiple languages and scripts poses risks of misuse.
This omission does not affect methodological transparency. The preprocessing pipeline, model architectures, training procedures, and evaluation protocols are fully documented in the notebooks and remain independently verifiable. The lexicon serves as a configurable input to the preprocessing stage; any equivalent word list can be substituted to replicate the general workflow. The expected format is documented in `Dataset/profane_words.sample.txt`.
Researchers who require the original lexicon for academic, non-commercial purposes may request access on a case-by-case basis:
- GitHub Issues: Open an issue in this repository with the subject line `[Lexicon Access Request]`, including a brief description of intended use and institutional affiliation.
- Email: Contact the author directly at `<author-email>`.
All requests are reviewed individually. Access is granted solely for research and educational use and may not be redistributed without written permission.
This project is licensed under the MIT License. See LICENSE for details.
If you use this code or dataset in your research, please cite:
@article{<citation-key>,
title = {Hinglish Abusive Comment Detection Using Transformer-Based Models},
author = {},
journal = {},
year = {},
  url = {https://github.com/CodeNinjaSarthak/abusive-detection}
}

Paper reference will be updated upon publication.







