
🧠 Sentiment Analysis: TF-IDF + LR vs LSTM vs BERT


A comprehensive, production-ready comparison of classical ML, deep learning, and transformer-based approaches for binary sentiment classification on the IMDB Movie Reviews dataset.

📊 Results · 🚀 Quick Start · 📁 Project Structure · 📓 Notebooks · 🌐 Demo


📌 Table of Contents

  • 🔍 Overview
  • 📂 Dataset
  • 🤖 Models Compared
  • 📊 Results
  • 📁 Project Structure
  • 🚀 Quick Start
  • 🛠️ Installation
  • 📓 Notebooks
  • 🔬 Error Analysis
  • ⚖️ Class Imbalance Handling
  • 🌐 Live Demo
  • 📦 Dependencies
  • 🤝 Contributing
  • 📜 License


🔍 Overview

This project benchmarks three sentiment analysis approaches across the IMDB 50K Movie Reviews dataset:

| Approach | Type | Library |
|---|---|---|
| TF-IDF + Logistic Regression | Classical ML | Scikit-learn |
| LSTM | Deep Learning (RNN) | PyTorch |
| BERT (bert-base-uncased) | Transformer | 🤗 HuggingFace |

Key highlights:

  • ✅ Full error analysis with misclassified sample inspection
  • ✅ Class imbalance handled via class weights + SMOTE
  • ✅ Confusion matrix, F1-score, ROC-AUC for every model
  • ✅ Fully reproducible Jupyter notebooks
  • ✅ Interactive Gradio web demo

📂 Dataset

🗂️ IMDB Movie Reviews Dataset (50,000 Reviews)

| Property | Value |
|---|---|
| Source | Kaggle / HuggingFace Datasets |
| Size | 50,000 reviews |
| Classes | Positive / Negative (binary) |
| Balance | 25,000 positive + 25,000 negative |
| Split | 80% train / 10% val / 10% test |

📥 How to Get the Dataset

Option A — Via Kaggle (Recommended)

  1. Go to https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
  2. Click Download
  3. Place IMDB Dataset.csv inside the data/raw/ folder

Option B — Via Kaggle CLI (Automated)

```bash
pip install kaggle
# Place your kaggle.json API key in ~/.kaggle/
kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
unzip imdb-dataset-of-50k-movie-reviews.zip -d data/raw/
```

Option C — Via HuggingFace Datasets (No download needed)

```python
from datasets import load_dataset
dataset = load_dataset("imdb")
```

💡 The notebooks auto-detect which method to use — just run them!
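The 80/10/10 stratified split listed above can be produced with two chained `train_test_split` calls — a hedged sketch assuming scikit-learn, not necessarily the exact code in the notebooks:

```python
from sklearn.model_selection import train_test_split

def split_80_10_10(texts, labels, seed=42):
    """Stratified 80/10/10 train/val/test split."""
    # First carve off 20% for val+test, stratified on the label.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split that 20% in half: 10% val, 10% test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Toy balanced labels, mirroring the IMDB 50/50 balance.
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]
(train, _), (val, _), (test, _) = split_80_10_10(texts, labels)
print(len(train), len(val), len(test))  # → 80 10 10
```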


🤖 Models Compared

1. 📊 TF-IDF + Logistic Regression (Baseline)

  • Feature extraction: TF-IDF with unigrams + bigrams (max 50,000 features)
  • Model: Logistic Regression with L2 regularization
  • Class imbalance: class_weight='balanced'
  • Pros: Extremely fast, interpretable, strong baseline
  • Cons: Loses word order and context
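A minimal sketch of this baseline with the hyperparameters from the bullets above (shown on a toy corpus; the notebooks fit on the full training split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Unigrams + bigrams, capped at 50k features; L2-regularized LR
# with balanced class weights, as described above.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
)

# Tiny illustrative corpus — stand-in for the real IMDB data.
texts = ["a wonderful heartfelt film", "dull plot and terrible acting",
         "loved every minute", "a boring waste of time"]
labels = [1, 0, 1, 0]
pipeline.fit(texts, labels)
print(pipeline.predict(["what a wonderful film"]))
```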

2. 🔁 LSTM (Deep Learning)

  • Embedding: Pretrained GloVe 100d embeddings
  • Architecture: Bidirectional LSTM (128 hidden units) → Dropout(0.5) → FC → Sigmoid
  • Class imbalance: Weighted BCELoss
  • Training: Adam optimizer, early stopping
  • Pros: Captures sequential patterns
  • Cons: Slower than TF-IDF, weaker than BERT on long text
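The architecture above can be sketched in PyTorch as follows (a sketch matching the bullets; the real model additionally loads the GloVe 100d vectors into the embedding layer):

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Bi-LSTM (128 hidden units) → Dropout(0.5) → FC → Sigmoid."""
    def __init__(self, vocab_size, embed_dim=100, hidden=128):
        super().__init__()
        # In the project this embedding is initialized from GloVe 100d.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden * 2, 1)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)            # (B, T, 100)
        _, (h_n, _) = self.lstm(emb)               # h_n: (2, B, 128)
        h = torch.cat([h_n[0], h_n[1]], dim=1)     # forward+backward → (B, 256)
        return torch.sigmoid(self.fc(self.dropout(h))).squeeze(1)

model = SentimentLSTM(vocab_size=20_000)
probs = model(torch.randint(1, 20_000, (4, 50)))   # batch of 4, length 50
print(probs.shape)  # → torch.Size([4])
```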

3. 🤗 BERT (bert-base-uncased)

  • Model: bert-base-uncased from HuggingFace Transformers
  • Fine-tuning: Last 4 transformer layers + classification head
  • Class imbalance: Weighted cross-entropy loss
  • Training: AdamW optimizer, linear warmup schedule, 3 epochs
  • Pros: State-of-the-art contextual understanding
  • Cons: Computationally expensive (GPU recommended)
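The partial fine-tuning setup (freeze everything except the last 4 encoder layers and the classification head) can be sketched as below. A tiny randomly-initialized config is used here so the snippet runs without downloading weights; the project instead loads `BertForSequenceClassification.from_pretrained("bert-base-uncased")`:

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny random config for illustration only.
config = BertConfig(hidden_size=64, num_hidden_layers=6,
                    num_attention_heads=4, intermediate_size=128,
                    num_labels=2)
model = BertForSequenceClassification(config)

# Freeze all parameters, then unfreeze the last 4 encoder layers
# and the classification head.
for param in model.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-4:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total}")
```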

📊 Results

Model Performance Comparison

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| TF-IDF + Log. Reg. | 89.4% | 89.2% | 89.4% | 89.3% | 0.964 |
| Bi-LSTM | 91.8% | 91.7% | 91.8% | 91.7% | 0.972 |
| BERT | 94.1% | 94.0% | 94.1% | 94.0% | 0.988 |

📌 Results are on the held-out test set. Full metrics and confusion matrices are in the notebooks.

Confusion Matrices

See notebooks/05_comparison_report.ipynb for full visualizations.


📁 Project Structure

```
Sentiment_Analysis/
│
├── 📁 data/
│   ├── raw/                    # Raw IMDB dataset (.csv)
│   └── processed/              # Cleaned, split datasets
│
├── 📁 notebooks/
│   ├── 01_EDA_preprocessing.ipynb          # Exploratory Data Analysis
│   ├── 02_tfidf_logistic_regression.ipynb  # TF-IDF + LR model
│   ├── 03_lstm_model.ipynb                 # LSTM model
│   ├── 04_bert_model.ipynb                 # BERT fine-tuning
│   └── 05_comparison_report.ipynb          # Side-by-side comparison
│
├── 📁 src/
│   ├── preprocess.py           # Text cleaning & preprocessing
│   ├── tfidf_model.py          # TF-IDF + LR pipeline
│   ├── lstm_model.py           # LSTM architecture
│   ├── bert_model.py           # BERT fine-tuning code
│   ├── evaluate.py             # Metrics, confusion matrix, error analysis
│   └── utils.py                # Helper functions
│
├── 📁 models/
│   ├── tfidf_vectorizer.pkl    # Saved TF-IDF vectorizer
│   ├── lr_model.pkl            # Saved Logistic Regression model
│   ├── lstm_model.pth          # Saved LSTM weights
│   └── bert_finetuned/         # Saved BERT model (HuggingFace format)
│
├── 📁 results/
│   ├── confusion_matrices/     # PNG outputs
│   ├── metrics_summary.csv     # All model metrics
│   └── error_analysis.csv      # Misclassified samples
│
├── 📁 app/
│   └── demo.py                 # Gradio web demo
│
├── requirements.txt
├── environment.yml             # Conda environment
├── setup.py
├── .gitignore
└── README.md
```

🚀 Quick Start

```bash
# 1. Clone the repo
git clone https://github.com/najahaja/Sentiment-Analysis.git
cd Sentiment-Analysis

# 2. Create conda environment
conda env create -f environment.yml
conda activate sentiment-env

# OR use pip
pip install -r requirements.txt

# 3. Download dataset (auto via HuggingFace — no Kaggle account needed)
python src/utils.py --download

# 4. Run all notebooks in order, OR run the full pipeline:
python src/train_all.py

# 5. Launch the demo
python app/demo.py
```

🛠️ Installation

Prerequisites

  • Python 3.9+
  • CUDA GPU (for BERT fine-tuning, optional but recommended)
  • 8GB+ RAM

Step-by-Step

```bash
# Clone
git clone https://github.com/najahaja/Sentiment-Analysis.git
cd Sentiment-Analysis

# Option 1: Conda (recommended)
conda env create -f environment.yml
conda activate sentiment-env

# Option 2: pip virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # Mac/Linux
pip install -r requirements.txt
```

📓 Notebooks

Run these notebooks in order for the full pipeline:

| # | Notebook | Description |
|---|---|---|
| 01 | 01_EDA_preprocessing.ipynb | Load IMDB data, clean HTML/special chars, visualize class distribution, word clouds |
| 02 | 02_tfidf_logistic_regression.ipynb | TF-IDF feature extraction, train LR, evaluate, confusion matrix |
| 03 | 03_lstm_model.ipynb | Load GloVe, train Bi-LSTM, plot training curves, evaluate |
| 04 | 04_bert_model.ipynb | Fine-tune bert-base-uncased, evaluate, save model |
| 05 | 05_comparison_report.ipynb | Side-by-side metrics, error analysis, final conclusions |
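The HTML/special-character cleaning in notebook 01 amounts to something like the following (a hedged sketch; the actual `src/preprocess.py` may differ in detail):

```python
import re

def clean_review(text: str) -> str:
    """Strip HTML tags, drop stray symbols, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)             # drop tags like <br />
    text = re.sub(r"[^a-zA-Z0-9!?.,' ]", " ", text)  # drop other special chars
    text = re.sub(r"\s+", " ", text)                 # collapse whitespace runs
    return text.strip().lower()

print(clean_review("Great movie!<br /><br />10/10 ★ would watch again."))
# → great movie! 10 10 would watch again.
```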

🔬 Error Analysis

The project includes a dedicated error analysis module in src/evaluate.py and notebooks/05_comparison_report.ipynb:

  • False Positives: Reviews predicted as positive but actually negative
  • False Negatives: Reviews predicted as negative but actually positive
  • Confidence scores for misclassified samples
  • Word importance via LIME for BERT predictions
  • Common error patterns: Sarcasm, negation, domain-specific vocabulary

Example output:

```
❌ Misclassified by BERT:
Text: "This film tries SO hard to be profound that it ends up being unintentionally hilarious."
True Label: Negative | Predicted: Positive | Confidence: 0.61
Pattern: Sarcasm / Mixed Sentiment
```
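Extracting misclassified samples with their confidence scores is a small pandas operation — a sketch of the idea; `src/evaluate.py` may implement it differently:

```python
import numpy as np
import pandas as pd

def misclassified_report(texts, y_true, prob_positive):
    """Return misclassified samples, sorted by confidence in the wrong class."""
    prob_positive = np.asarray(prob_positive)
    y_pred = (prob_positive >= 0.5).astype(int)
    df = pd.DataFrame({
        "text": texts,
        "true": y_true,
        "pred": y_pred,
        # Confidence in the (wrong) predicted class.
        "confidence": np.where(y_pred == 1, prob_positive, 1 - prob_positive),
    })
    errors = df[df["true"] != df["pred"]]
    return errors.sort_values("confidence", ascending=False)

report = misclassified_report(
    ["great", "awful", "so bad it's funny"],
    y_true=[1, 0, 0],
    prob_positive=[0.9, 0.2, 0.61],
)
print(report)  # one row: the sarcastic review, predicted positive at 0.61
```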

⚖️ Class Imbalance Handling

Although IMDB is balanced (50/50), the project demonstrates techniques for imbalanced datasets:

| Technique | Applied To |
|---|---|
| `class_weight='balanced'` | Logistic Regression |
| Weighted BCELoss | LSTM |
| Weighted CrossEntropyLoss | BERT |
| SMOTE (oversampling demo) | TF-IDF features |
| Stratified train/val/test split | All models |
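For genuinely skewed data, the weighted-loss entries above boil down to passing per-class weights into the loss. A sketch of how that might look (assumed code, not the project's exact implementation; `pos_weight` via `BCEWithLogitsLoss` is one common variant of weighted BCE):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# Imagine a skewed label distribution: 90 negative, 10 positive.
labels = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
print(weights)  # minority class gets the larger weight: roughly [0.556, 5.0]

# BERT head (2 logits): weighted cross-entropy.
ce = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))

# LSTM with a single logit: up-weight positives via pos_weight.
bce = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([90 / 10]))
```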

🌐 Live Demo

Launch the Gradio interactive demo locally:

```bash
python app/demo.py
```

Then open: http://localhost:7860

The demo lets you:

  • Type any review text
  • See predictions from all 3 models side-by-side
  • View confidence scores and sentiment bars

Deploy to Hugging Face Spaces (Free Hosting)

```bash
# 1. Create account at huggingface.co/spaces
# 2. Create a new Space with Gradio SDK
# 3. Push your code
git remote add space https://huggingface.co/spaces/najahaja/Sentiment-Analysis
git push space main
```

📦 Dependencies

Key packages (see requirements.txt for full list):

```text
transformers>=4.35.0
torch>=2.0.0
scikit-learn>=1.3.0
pandas>=2.0.0
numpy>=1.24.0
datasets>=2.14.0
gradio>=4.0.0
matplotlib>=3.7.0
seaborn>=0.12.0
nltk>=3.8.0
imbalanced-learn>=0.11.0
lime>=0.2.0.1
```

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch: `git checkout -b feature/add-roberta`
  3. Commit your changes: `git commit -m 'Add RoBERTa comparison'`
  4. Push to the branch: `git push origin feature/add-roberta`
  5. Open a Pull Request

📜 License

© 2025 Ahamed Najah — All Rights Reserved.

This project is protected. You may view the code for learning purposes only. Redistribution, modification, or commercial use without explicit permission is prohibited. See the LICENSE file for full details.


👤 Author

Ahamed Najah

GitHub LinkedIn


🏷️ Topics

sentiment-analysis nlp bert lstm transformers huggingface scikit-learn machine-learning deep-learning python pytorch imdb text-classification


⭐ Star this repo if you found it useful!

About

Comparing TF-IDF + Logistic Regression vs Bi-LSTM vs BERT for sentiment analysis on IMDB 50K reviews | PyTorch · HuggingFace · Gradio Demo
