rogueslasher/document_classifier

Document Classification with Active Learning

An NLP system that classifies documents into 20 categories using Active Learning, reducing labeling effort by 35% while achieving 85.3% accuracy.

This project demonstrates how uncertainty-based sample selection improves model performance while requiring fewer labeled examples.

Built using Python, scikit-learn, NLTK, and Streamlit.


Overview

Text classification models typically require large labeled datasets, which are expensive and time-consuming to create.

This project implements Active Learning using uncertainty sampling, where the model iteratively selects the most informative samples to label, improving learning efficiency and performance.
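The iterative selection described above can be sketched as a pool-based loop. This is a minimal illustration, not the code from the notebooks: the synthetic data from `make_classification` stands in for the TF-IDF features, and the seed size, `BATCH_SIZE`, and `N_ROUNDS` are hypothetical values chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for document features (assumption: the project itself
# uses TF-IDF vectors built from 20 Newsgroups).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Start from a small random seed set; everything else is the unlabeled pool.
rng = np.random.default_rng(0)
labeled = rng.choice(len(X_pool), size=20, replace=False).tolist()
labeled_set = set(labeled)
unlabeled = [i for i in range(len(X_pool)) if i not in labeled_set]

BATCH_SIZE = 10  # hypothetical number of labels requested per round
N_ROUNDS = 5     # hypothetical number of rounds

history = []
for _ in range(N_ROUNDS):
    # Retrain on everything labeled so far and record test accuracy.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_pool[labeled], y_pool[labeled])
    history.append((len(labeled), model.score(X_test, y_test)))

    # Least-confidence uncertainty for every sample still in the pool.
    proba = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)

    # "Label" the most uncertain samples by moving them out of the pool.
    query = np.argsort(uncertainty)[-BATCH_SIZE:]
    picked = [unlabeled[i] for i in query]
    labeled.extend(picked)
    picked_set = set(picked)
    unlabeled = [i for i in unlabeled if i not in picked_set]

for n_labels, acc in history:
    print(f"{n_labels} labels -> test accuracy {acc:.3f}")
```

In a real setting the labels for the queried samples would come from a human annotator; here they are simply read from `y_pool`, which is the usual way to simulate the loop on an already labeled dataset.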

Key outcomes:

  • Final accuracy: 85.3%
  • +3.2 percentage points of accuracy over random sampling (85.3% vs. 82.1%)
  • 35% reduction in labeled data required
  • Multi-class classification across 20 categories

Dataset used: 20 Newsgroups (11,314 training documents)


Features

  • Active Learning implementation using uncertainty sampling
  • Complete NLP pipeline from preprocessing to evaluation
  • TF-IDF feature extraction
  • Logistic Regression classifier
  • Model evaluation and performance analysis
  • Visualization of learning performance
  • Streamlit interface for interactive predictions

Tech Stack

  • Python
  • scikit-learn
  • NLTK
  • Streamlit
  • matplotlib
  • seaborn
  • NumPy
  • pandas

Results

Metric            Active Learning    Random Sampling
Accuracy          85.3%              82.1%
F1 Score          0.847              0.814
Label Efficiency  35% fewer labels   -

Active Learning achieves better performance with significantly fewer labeled examples.
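For readers who want to reproduce this kind of comparison, the sketch below shows one way accuracy and F1 could be computed with scikit-learn. The labels are made up for illustration, and macro averaging is an assumption: the README does not state which F1 averaging mode was used.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only; these are not the project's predictions.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Assumption: macro averaging, which weights every class equally.
# average="weighted" or "micro" would give different numbers.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```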


Architecture

Pipeline workflow:

Raw Text
   ↓
Text Preprocessing
   ↓
TF-IDF Feature Extraction
   ↓
Baseline Model Training
   ↓
Active Learning Loop
   ↓
Model Evaluation and Analysis
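
The feature extraction and baseline training steps above could look roughly like the following scikit-learn pipeline. This is a sketch under assumptions: the corpus and label names are made up, and the vectorizer's built-in lowercasing and English stop-word removal stand in for the NLTK preprocessing done in the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; the project itself trains on 20 Newsgroups.
train_texts = [
    "the team won the hockey game last night",
    "the goalie made a great save in overtime",
    "the new graphics card renders images faster",
    "install the driver to fix the display resolution",
]
train_labels = ["sport", "sport", "computing", "computing"]

# TF-IDF features feeding a Logistic Regression classifier.
classifier = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_texts, train_labels)

new_doc = ["the hockey team scored in overtime"]
prediction = classifier.predict(new_doc)
probabilities = classifier.predict_proba(new_doc)

print(prediction[0], dict(zip(classifier.classes_, probabilities[0].round(3))))
```

Wrapping both steps in one pipeline keeps the vectorizer's vocabulary tied to the training data, so new documents are transformed the same way at prediction time.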

Uncertainty sampling formula:

uncertainty = 1 - max(predicted_probability)

Samples with the highest uncertainty are selected for labeling.
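
As a worked example of the formula, the sketch below scores a few made-up probability rows and picks the most uncertain ones. The function names `least_confidence` and `select_most_uncertain` are hypothetical helpers introduced here, not names taken from the notebooks.

```python
import numpy as np

def least_confidence(probabilities):
    """Return 1 - max class probability for each row."""
    probabilities = np.asarray(probabilities, dtype=float)
    return 1.0 - probabilities.max(axis=1)

def select_most_uncertain(probabilities, n_queries):
    """Return indices of the n_queries most uncertain rows, most uncertain first."""
    scores = least_confidence(probabilities)
    order = np.argsort(-scores, kind="stable")
    return order[:n_queries]

# Made-up predicted probabilities for three documents over three classes.
proba = [
    [0.90, 0.05, 0.05],  # confident prediction, low uncertainty
    [0.40, 0.35, 0.25],  # no clear winner, high uncertainty
    [0.60, 0.30, 0.10],
]

scores = least_confidence(proba)
chosen = select_most_uncertain(proba, n_queries=2)

print(scores.round(2))  # uncertainty per document
print(chosen)           # indices of the documents to label next
```

With these inputs the second document scores 1 - 0.40 = 0.60 and is queried first, followed by the third at 1 - 0.60 = 0.40.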


Installation

Clone the repository:

git clone https://github.com/rogueslasher/document-classifier-active-learning.git
cd document-classifier-active-learning

Create and activate a virtual environment:

python -m venv venv
venv\Scripts\activate          (Windows)
source venv/bin/activate       (macOS/Linux)

Install dependencies:

pip install -r requirements.txt

Download NLTK resources:

python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"

Running the Project

Open the notebooks in order:

notebooks/
│
├── explore_data.ipynb
├── text_preprocess.ipynb
├── feature_engineering.ipynb
├── baseline_model.ipynb
├── active_learning.ipynb
└── analysis.ipynb

Run each notebook sequentially to execute the full pipeline.

To run the Streamlit interface:

streamlit run app.py

Project Structure

document-classifier-active-learning/
│
├── notebooks/
│   ├── explore_data.ipynb
│   ├── text_preprocess.ipynb
│   ├── feature_engineering.ipynb
│   ├── baseline_model.ipynb
│   ├── active_learning.ipynb
│   └── analysis.ipynb
│
├── notebooks/
│   ├── complete_analysis.png
│   ├── per_class_performance.png
│   ├── project_summary.txt
│   └── sample_data.csv
│
├── app.py
├── requirements.txt
└── README.md

Key Learnings

  • Active Learning implementation in real-world classification
  • Natural Language Processing pipeline development
  • TF-IDF feature engineering
  • Model evaluation using accuracy and F1 score
  • Efficient data utilization strategies
  • End-to-end machine learning workflow

Future Improvements

  • Deploy using Streamlit Cloud or AWS
  • Implement BERT-based classification
  • Add FastAPI backend for inference
  • Implement model explainability using SHAP or LIME
  • Convert notebooks into production pipeline scripts

Author

Aniket Pandey
GitHub: https://github.com/rogueslasher
