Using NLP and sentiment analysis to score YouTube tutorials by the quality of their audience reception
A modular, end-to-end pipeline combining transformer models and classical ML to surface the best tutorials on any topic. By Shivam
Finding a good tutorial on YouTube is harder than it should be. View counts and likes are gameable, thumbnails are misleading, and the algorithm optimizes for engagement — not educational quality. But comments don't lie. Viewers who genuinely learn something say so. Viewers who waste 20 minutes say that too.
This project treats comment sentiment as a proxy for tutorial quality and builds a full pipeline — from search to scored output — that lets you compare tutorials on any topic through the lens of how their audiences actually responded.
The goal was to build a reproducible, modular scoring system that:
- Searches YouTube for tutorials on a given keyword
- Fetches and preprocesses comments at scale
- Classifies sentiment using one of three ML approaches
- Aggregates comment-level scores into a per-video ranking
Three independent modeling strategies were implemented and compared — from a classical TF-IDF pipeline to a custom fine-tuned transformer.
-
I implemented three distinct sentiment classification approaches, each fully self-contained:
- Custom Fine-Tuned DistilBERT (
Complex model/): A DistilBERT model fine-tuned end-to-end on labeled YouTube comment data using TensorFlow and HuggingFace Transformers. Includes scripts for fine-tuning, model conversion, and inference. This is the best performing model — see results below. - Pre-Trained DistilBERT (
Dist-Bert model/): Usesdistilbert-base-uncased-finetuned-sst-2-englishoff the shelf — no training required. Fast and strong baseline for comparison. - SGDClassifier (
SDGClassifier model/): A classical ML pipeline using TF-IDF vectorization and a Stochastic Gradient Descent classifier (scikit-learn). Supports 3-class classification (Negative / Neutral / Positive). Model and vectorizer serialized withjoblibfor lightweight deployment.
- Custom Fine-Tuned DistilBERT (
-
Each approach follows the same modular pipeline structured under
src/:Stage Module Responsibility Search yt_search.pyQuery YouTube for videos by keyword Fetch data_fetch.pyDownload comments for a given video ID Preprocess preprocess.pyTokenization, stopword removal, lemmatization Classify classify.pySentiment inference via the approach-specific model Aggregate aggregate.pyRoll up comment-level scores into a per-video ranking Visualize visualize.pyGenerate sentiment distribution charts per video Evaluate evaluate.pyCompute F-score and accuracy on held-out test data -
A Flask backend with SocketIO powers real-time updates as comments are fetched and classified, exposed through a React frontend for live demo and visualization.
-
Jupyter notebooks are included throughout for experimentation, model evaluation, and reproducibility.
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ React Frontend │ │ Flask Backend │ │ Sentiment Models │
│ │ │ │ │ │
│ • Keyword Search │◄──►│ • Route Engine │◄──►│ • Fine-tuned BERT │
│ • Scored Results │ │ • SocketIO (RT) │ │ • Pre-trained BERT │
│ • Visualizations │ │ • Video ID Parser │ │ • TF-IDF + SGD │
│ • Score Display │ │ • Pipeline Trigger │ │ • Score Aggregator │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
│ │ │
└──────────────────────────┼──────────────────────────┘
│
┌────────────────▼────────────────┐
│ Data Pipeline │
│ │
│ • yt_search.py (Search) │
│ • data_fetch.py (Comments) │
│ • preprocess.py (Clean & NLP) │
│ • classify.py (Inference) │
│ • aggregate.py (Score/Rank) │
│ • visualize.py (Charts) │
└─────────────────────────────────┘
All three models were evaluated on held-out test data. Below are the results from the metric report.
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| NEGATIVE | 0.75 | 0.83 | 0.79 |
| POSITIVE | 0.81 | 0.72 | 0.76 |
Confusion Matrix:
Predicted
NEG POS
Actual NEG 14,349 2,838
POS 4,867 12,425
A solid out-of-the-box result for a model trained on movie reviews (SST-2), applied without any fine-tuning to YouTube comments.
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| NEGATIVE | 0.59 | 0.59 | 0.59 |
| NEUTRAL | 0.53 | 0.61 | 0.57 |
| POSITIVE | 0.70 | 0.60 | 0.65 |
Confusion Matrix:
Predicted
NEG NEU POS
Actual NEG 10,076 5,219 1,892
NEU 4,318 10,369 2,445
POS 3,048 3,865 10,379
The only model to support 3-class classification including a Neutral label. Performance is lower overall, which is expected given the harder task and lighter model — but inference is extremely fast with no GPU required.
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| NEGATIVE | 0.86 | 0.91 | 0.88 |
| POSITIVE | 0.90 | 0.86 | 0.88 |
Confusion Matrix:
Predicted
NEG POS
Actual NEG 7,758 816
POS 1,209 7,407
Fine-tuning DistilBERT directly on YouTube comment data yields a substantial improvement over the pre-trained SST-2 baseline — +9 F1 points across both classes. This is the recommended model for production use.
| Model | Task | F1 (Negative) | F1 (Positive) | Speed | GPU Required |
|---|---|---|---|---|---|
| Pre-trained DistilBERT (SST-2) | Binary | 0.79 | 0.76 | Medium | Recommended |
| TF-IDF + SGDClassifier | Multiclass | 0.59 | 0.65 | Fast | No |
| Fine-tuned DistilBERT | Binary | 0.88 | 0.88 | Medium | Recommended |
A live deployment of the app is hosted on Render and accessible here:
🔗 https://rankingyt-web.onrender.com
Note: The app is hosted on a free Render instance — it may take 30–60 seconds to wake up on first load.
Each modeling approach is self-contained in its own folder with its own requirements.txt. To get started:
git clone https://github.com/your-username/CommentScore.git
cd CommentScore/Choose your preferred approach and install its dependencies:
# Example: Fine-tuned DistilBERT approach (recommended)
cd "Complex model/"
pip install -r requirements.txtThen launch the Flask backend:
python app.pyAnd in a separate terminal, start the React frontend:
cd yt-sentiment-ui/
npm install
npm start-
YouTube Comments Sentiment Analysis — Ritika Singh et al.: Motivated the multi-algorithm comparison and the preprocessing pipeline (lemmatization, tokenization, stopword removal, n-grams). Guided evaluation using F-score and accuracy.
-
Ranking of tutorials on YouTube based on the analysis of feelings made to their comments — Goyzueta Torres et al., Innosoft Journal: Directly inspired the core idea of using aggregated comment sentiment as a ranking signal, and the rationale for BERT-based approaches at scale.
- ... to the HuggingFace team for the Transformers library and the pre-trained SST-2 DistilBERT model.
- ... to Ritika Singh et al. and Goyzueta Torres et al. for the research that shaped this project's design.
- ... to the open-source community behind
scikit-learn,Flask, andSocketIOfor making rapid ML prototyping accessible.
For questions or suggestions, feel free to open an issue or reach out.



