
Hinglish Abusive Comment Detection Using Transformer-Based Models

A comparative study of multilingual transformer baselines and hybrid Transformer-BiLSTM architectures for detecting abusive content in code-mixed Hinglish text.

Python 3.9+ PyTorch Hugging Face License: MIT


Abstract

Detecting abusive language in code-mixed Hinglish content presents distinct challenges due to script mixing, transliteration variation, and deliberate obfuscation. This repository accompanies a research paper that benchmarks four model configurations — two fine-tuned transformer baselines (mBERT and XLM-RoBERTa) and two hybrid Transformer-BiLSTM variants — on a large-scale dataset of approximately 824K user comments sourced from the ShareChat-IndoML Datathon. A key preprocessing contribution is an obfuscation-aware normalization pipeline that combines regex-based character substitution with Levenshtein distance matching to recover intentionally disguised profanity. All training code, evaluation plots, and preprocessing utilities are provided for full reproducibility. Quantitative results and per-class analysis are presented in the Results and Analysis section.


Motivation and Challenges

Hinglish — a code-mixed register of Hindi and English — is among the most widely used informal registers on Indian social media platforms. Automated moderation of abusive Hinglish content is complicated by several factors:

  • Script mixing: Users freely alternate between Devanagari and Roman scripts, often within a single sentence.
  • Transliteration variation: The same Hindi word may appear in numerous romanized spellings (e.g., "kutta", "kuttaa", "kutha").
  • Deliberate obfuscation: Profanity is frequently disguised through character substitution (e.g., @ for a, $ for s), vowel insertion, and repeated characters.
  • Dialectal diversity: The dataset spans six languages — Hindi, English, Punjabi, Bhojpuri, Haryanvi, and Rajasthani — each contributing distinct vocabulary and grammatical patterns.
  • Class imbalance: The dataset exhibits an approximate 68.8% / 31.2% split between non-abusive and abusive labels, requiring careful metric selection during evaluation.

Repository Structure

abusive-detection/
├── abusive-detection-dataset.ipynb   # Data loading, preprocessing, baseline model training & evaluation
├── hybrid_model.ipynb                # Hybrid Transformer-BiLSTM training & evaluation
├── Dataset/
│   ├── output.csv                    # Preprocessed dataset (~824K comments)
│   └── profane_words.sample.txt      # Format specification for profanity lexicon (placeholder)
├── plots/
│   ├── loss_plot_bert_base_multilingual.png
│   ├── accuracy_plot_bert_base_multilingual.png
│   ├── eval_metrics_bar_bert_base_multilingual.png
│   ├── roc_curve_bert_base_multilingual.png
│   ├── loss_plot_xlm-roberta-base.png
│   ├── accuracy_plot_xlm-roberta-base.png
│   ├── eval_metrics_bar_xlm-roberta-base.png
│   └── roc_curve_xlm-roberta-base.png
├── LICENSE                           # MIT License
├── .gitignore
└── README.md

The two notebooks are self-contained and sequential: abusive-detection-dataset.ipynb handles data preparation and baseline experiments, while hybrid_model.ipynb implements the hybrid architectures. Both notebooks share the same preprocessed dataset (Dataset/output.csv).


Model Architectures

Baselines

Two pre-trained multilingual transformers are fine-tuned with a standard sequence classification head (AutoModelForSequenceClassification, 2-class output):

| Model | Identifier | Parameters | Pre-training Languages |
| --- | --- | --- | --- |
| mBERT | bert-base-multilingual-cased | ~178M | 104 languages |
| XLM-RoBERTa | xlm-roberta-base | ~278M | 100 languages |

Both baselines use the default pooled [CLS] representation followed by a linear classification layer.
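As a sketch of how such a baseline is typically wired up with the Hugging Face API (the function names build_baseline and compute_metrics here are illustrative, not necessarily those used in the notebooks):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Binary classification metrics; F1 drives best-checkpoint selection."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

def build_baseline(model_name="bert-base-multilingual-cased"):
    """Load tokenizer and a 2-class sequence classification head."""
    # Deferred import so compute_metrics stays usable without model weights.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    return tokenizer, model
```

Swapping model_name for xlm-roberta-base yields the second baseline with no other changes.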

Hybrid Transformer-BiLSTM

The hybrid architecture augments the transformer encoder with a bidirectional LSTM to capture sequential dependencies beyond the transformer's attention patterns:

Input Tokens
     │
     ▼
┌─────────────────────┐
│  Transformer Encoder │   (mBERT or XLM-R, frozen/fine-tuned)
│  [last_hidden_state] │
└─────────┬───────────┘
          │  (batch, seq_len, hidden_size)
          ▼
┌─────────────────────┐
│   Bidirectional LSTM │   hidden_size=128, num_layers=1
│                      │
└─────────┬───────────┘
          │  (batch, seq_len, 256)
          ▼
┌─────────────────────┐
│     Mean Pooling     │   average over sequence dimension
└─────────┬───────────┘
          │  (batch, 256)
          ▼
┌─────────────────────┐
│    Dropout (0.3)     │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Linear (256 → 2)   │   classification logits
└─────────────────────┘

Two variants are implemented:

  • TransformerLSTMClassifier — uses bert-base-multilingual-cased as the encoder
  • XLMRobertaLSTMClassifier — uses xlm-roberta-base as the encoder

Both are implemented as PreTrainedModel subclasses, enabling seamless integration with the Hugging Face Trainer API.
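A minimal sketch of the hybrid head described by the diagram above. The class name and constructor arguments are illustrative; the repository's variants subclass PreTrainedModel, whereas this sketch uses a plain nn.Module and accepts any encoder whose forward returns a last_hidden_state:

```python
import torch
import torch.nn as nn

class HybridLSTMClassifier(nn.Module):
    """Transformer encoder -> BiLSTM -> mean pool -> dropout -> linear.

    `encoder` is any module returning an object with a `last_hidden_state`
    of shape (batch, seq_len, encoder_hidden), e.g.
    AutoModel.from_pretrained("bert-base-multilingual-cased").
    """
    def __init__(self, encoder, encoder_hidden=768, lstm_hidden=128,
                 num_labels=2, dropout=0.3):
        super().__init__()
        self.encoder = encoder
        self.lstm = nn.LSTM(encoder_hidden, lstm_hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden)               # (batch, seq_len, 256)
        pooled = lstm_out.mean(dim=1)                 # mean over sequence dim
        return self.classifier(self.dropout(pooled))  # (batch, 2)
```

The bidirectional LSTM doubles the hidden size (128 × 2 = 256), matching the shapes in the diagram.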


Dataset Description

| Property | Value |
| --- | --- |
| Source | ShareChat-IndoML Datathon (NSFW Comment Challenge) |
| Languages | Hindi, English, Punjabi, Bhojpuri, Haryanvi, Rajasthani |
| Script | Mixed (Devanagari + Roman) |
| Size | ~824K comments (after filtering and deduplication) |
| Columns | commentText, label |
| Labels | 0 = Not Abusive, 1 = Abusive |
| Class distribution | ~68.8% Not Abusive / ~31.2% Abusive |
| Preprocessing | Obfuscation-aware pipeline (see Preprocessing Pipeline) |

The raw dataset originally included additional metadata columns (CommentId, user_index, post_index, report_count_comment, report_count_post, like_count_comment, like_count_post, language), which are dropped during preparation. Rows with missing or empty commentText values are removed, duplicate comments are dropped, and only comments in the six target languages are retained.
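The filtering steps above can be sketched in pandas; the function name prepare_dataset is illustrative, and the exact order of operations in the notebook may differ:

```python
import pandas as pd

TARGET_LANGUAGES = {"Hindi", "English", "Punjabi", "Bhojpuri",
                    "Haryanvi", "Rajasthani"}
METADATA_COLS = ["CommentId", "user_index", "post_index",
                 "report_count_comment", "report_count_post",
                 "like_count_comment", "like_count_post"]

def prepare_dataset(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the six target languages, drop metadata columns, remove
    missing/empty comments, and deduplicate on comment text."""
    df = raw[raw["language"].isin(TARGET_LANGUAGES)].copy()
    df = df.drop(columns=[c for c in METADATA_COLS + ["language"]
                          if c in df.columns])
    df = df.dropna(subset=["commentText"])
    df = df[df["commentText"].str.strip().astype(bool)]  # drop empty strings
    df = df.drop_duplicates(subset="commentText")
    return df[["commentText", "label"]].reset_index(drop=True)
```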


Experimental Setup

Hyperparameters

| Parameter | Baselines (mBERT / XLM-R) | Hybrid (mBERT) | Hybrid (XLM-R) |
| --- | --- | --- | --- |
| Learning rate | 2e-5 | 2e-5 | 2e-5 |
| Weight decay | 0.01 | 0.01 | 0.01 |
| Max epochs | 50 | 50 | 50 |
| Max sequence length | 128 | 128 | 128 |
| Train batch size | 512 | 512 | 256 |
| Eval batch size | 256 | 512 | 256 |
| Early stopping patience | 3 | 3 | 3 |
| LR scheduler | ReduceLROnPlateau (patience=2, factor=0.1) | ReduceLROnPlateau (patience=2, factor=0.1) | ReduceLROnPlateau (patience=2, factor=0.1) |
| Train / Val / Test split | 80% / — / 20% | 75% / 12.5% / 12.5% | 75% / 12.5% / 12.5% |
| Best model selection | F1 score | F1 score | F1 score |

Training Procedure

All models are trained using the Hugging Face Trainer API with the following configuration:

  • GPU auto-detection via torch.cuda.is_available()
  • Epoch-level evaluation with logging, saving, and evaluation at each epoch boundary
  • Best checkpoint selection by validation F1 score (load_best_model_at_end=True)
  • Custom callbacks:
    • ReduceLROnPlateauCallback — reduces learning rate by a factor of 0.1 when F1 plateaus for 2 consecutive evaluations
    • TrainingAccuracyCallback — computes and logs training set accuracy at the end of each epoch
  • Early stopping via EarlyStoppingCallback with a patience of 3 epochs
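A configuration fragment illustrating how these settings map onto the Trainer API. Argument names follow recent versions of transformers (older releases spell eval_strategy as evaluation_strategy), and output_dir is a placeholder:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Values mirror the hyperparameter table above (mBERT baseline column).
args = TrainingArguments(
    output_dir="checkpoints/mbert-baseline",   # placeholder path
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=50,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=256,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```

The ReduceLROnPlateau and training-accuracy callbacks are custom TrainerCallback subclasses defined in the notebooks and are added alongside early_stop via the callbacks argument of Trainer.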

Results and Analysis

mBERT (bert-base-multilingual-cased)

Figures: training/validation loss, accuracy, evaluation metric bars, and ROC curve for mBERT (see plots/).

The mBERT model converges within approximately 8 epochs. The divergence between training and validation loss curves in later epochs indicates some degree of overfitting, which early stopping mitigates. Per-class classification reports are available in the notebooks.

XLM-RoBERTa (xlm-roberta-base)

Figures: training/validation loss, accuracy, evaluation metric bars, and ROC curve for XLM-RoBERTa (see plots/).

The XLM-RoBERTa model exhibits a similar convergence profile, stabilizing within approximately 7 epochs. Training dynamics are comparable to mBERT, with marginal differences in the overfitting gap.

Discussion

Performance across the four model configurations is broadly comparable, suggesting that the preprocessing pipeline and dataset characteristics are the dominant factors in this task. The observed train-validation gap across all models indicates room for additional regularization (e.g., increased dropout, data augmentation, or label smoothing). Detailed per-class precision, recall, and F1 scores, along with confusion matrices, are available in the respective notebooks.


Preprocessing Pipeline

The preprocessing pipeline is implemented in abusive-detection-dataset.ipynb and consists of four core functions:

  1. load_profanity_lexicon(file_path) — Reads a comma-separated profanity word list from disk, normalizes entries to lowercase, and removes punctuation.
  2. build_profanity_patterns(profanity_list) — Compiles regex patterns for each profanity term using character-level substitution maps (e.g., a → [a@4], s → [s5$z]) to match obfuscated variants.
  3. preprocess_text(text, profanity_patterns, profanity_list) — Applies the full normalization pipeline to a single text string.
  4. preprocess_dataframe_parallel(df, profanity_patterns, profanity_list) — Wraps preprocess_text with joblib.Parallel for multi-core execution across the dataframe.
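A minimal sketch of the pattern-building step (2 above). The substitution map here is illustrative, and a harmless word stands in for actual lexicon entries; the notebook's map and lexicon differ. Lookaround assertions are used instead of \b because word boundaries fail on leading symbol substitutions like $:

```python
import re

# Illustrative character-substitution map; the notebook's actual map may differ.
SUBS = {"a": "[a@4]", "s": "[s5$z]", "i": "[i1!]", "o": "[o0]", "e": "[e3]"}

def build_profanity_patterns(profanity_list):
    """Compile one regex per lexicon entry that tolerates character
    substitution ('b@d') and character repetition ('baaad')."""
    patterns = {}
    for word in profanity_list:
        parts = [SUBS.get(ch, re.escape(ch)) + "+"   # '+' absorbs repeats
                 for ch in word.lower()]
        # (?<!\w) / (?!\w) instead of \b: '$' is a non-word character, so
        # \b would never match at the start of an obfuscated token like '$pam'.
        patterns[word] = re.compile(r"(?<!\w)" + "".join(parts) + r"(?!\w)",
                                    re.IGNORECASE)
    return patterns

def normalize_profanity(text, patterns):
    """Replace any obfuscated variant with its canonical lexicon spelling."""
    for canonical, pattern in patterns.items():
        text = pattern.sub(canonical, text)
    return text
```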

Processing flow:

Raw Text
  │
  ├─► Lowercase
  ├─► Emoji removal
  ├─► URL removal
  ├─► Profanity normalization (regex substitution patterns)
  ├─► Mention / hashtag removal
  ├─► Repeated character reduction + Levenshtein fuzzy matching
  ├─► Special character / digit removal
  └─► Whitespace normalization
  │
  ▼
Cleaned Text

The Levenshtein matching step is particularly important: after repeated-character reduction (e.g., worrrd → worrd), the pipeline computes edit distances between each word and the profanity lexicon and normalizes any word within the threshold (edit distance ≤ 1, or ≤ 20% of the target word's length).
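The reduction-plus-fuzzy-matching step can be sketched as follows. This uses a plain dynamic-programming edit distance for self-containment (the repository installs python-Levenshtein/rapidfuzz for speed), and the function names and example words are illustrative:

```python
import re

def reduce_repeats(word: str) -> str:
    """Collapse runs of 3+ identical characters to 2 ('worrrd' -> 'worrd')."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_normalize(word: str, lexicon: list[str]) -> str:
    """Map a word to a lexicon entry when within the stated threshold:
    edit distance <= 1, or <= 20% of the target word's length."""
    reduced = reduce_repeats(word)
    for target in lexicon:
        if levenshtein(reduced, target) <= max(1, int(0.2 * len(target))):
            return target
    return word
```

So a disguised variant like worrrd first becomes worrd, which sits at edit distance 1 from the target word and is normalized; unrelated words pass through unchanged.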


Reproducibility

Environment Setup

git clone https://github.com/CodeNinjaSarthak/abusive-detection.git
cd abusive-detection
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install torch transformers datasets scikit-learn matplotlib seaborn \
    fuzzywuzzy[speedup] python-Levenshtein emoji pandas numpy joblib tqdm rapidfuzz

A CUDA-compatible GPU is recommended for training. The notebooks auto-detect GPU availability via torch.cuda.is_available() and will fall back to CPU if unavailable (training times will increase substantially).

Running the Experiments

  1. Prepare the profanity lexicon (if reproducing from the raw datathon data): Place your lexicon file at Dataset/Profane words.txt following the format documented in Dataset/profane_words.sample.txt. This step is not required if using the provided Dataset/output.csv, which is already preprocessed.

  2. Run the dataset notebook: Open and execute abusive-detection-dataset.ipynb top-to-bottom. This notebook handles data loading, preprocessing, and baseline model training and evaluation.

  3. Run the hybrid model notebook: Open and execute hybrid_model.ipynb. This notebook loads output.csv and trains both hybrid Transformer-BiLSTM variants.

  4. View outputs: Training plots are saved to plots/. Classification reports and confusion matrices are printed inline in the notebooks.

Note: File paths in the notebooks assume execution from the repository root. Adjust paths if your working directory differs. Due to non-determinism in GPU operations, exact numerical results may vary slightly across runs.


Data Availability and Ethics

Dataset

The file Dataset/output.csv is included in this repository. It is derived from the publicly available ShareChat-IndoML Datathon NSFW Comment Challenge dataset. All personally identifiable information (PII) has been removed during preprocessing; the dataset contains only comment text and binary labels.

Profanity Lexicon

The file Dataset/Profane words.txt has been intentionally excluded from this repository. The lexicon was compiled by aggregating content from multiple publicly accessible web sources. It is withheld for two reasons:

  1. Ambiguous redistribution rights — the provenance of the aggregated sources does not clearly permit redistribution.
  2. Ethical concerns — unrestricted distribution of a concentrated list of abusive terms in multiple languages and scripts poses risks of misuse.

This omission does not affect methodological transparency. The preprocessing pipeline, model architectures, training procedures, and evaluation protocols are fully documented in the notebooks and remain independently verifiable. The lexicon serves as a configurable input to the preprocessing stage; any equivalent word list can be substituted to replicate the general workflow. The expected format is documented in Dataset/profane_words.sample.txt.

Requesting Access

Researchers who require the original lexicon for academic, non-commercial purposes may request access on a case-by-case basis:

  • GitHub Issues: Open an issue in this repository with the subject line [Lexicon Access Request], including a brief description of intended use and institutional affiliation.
  • Email: Contact the author directly at <author-email>.

All requests are reviewed individually. Access is granted solely for research and educational use and may not be redistributed without written permission.


License

This project is licensed under the MIT License. See LICENSE for details.


Citation

If you use this code or dataset in your research, please cite:

@article{<citation-key>,
  title     = {Hinglish Abusive Comment Detection Using Transformer-Based Models},
  author    = {},
  journal   = {},
  year      = {},
  url       = {https://github.com/CodeNinjaSarthak/abusive-detection}
}

Paper reference will be updated upon publication.
