An NLP system that classifies documents into 20 categories using Active Learning, reducing labeling effort by 35% while achieving 85.3% accuracy.
This project demonstrates how uncertainty-based sample selection improves model performance while requiring fewer labeled examples.
Built using Python, scikit-learn, NLTK, and Streamlit.
Text classification models typically require large labeled datasets, which are expensive and time-consuming to create.
This project implements Active Learning using uncertainty sampling, where the model iteratively selects the most informative samples to label, improving learning efficiency and performance.
Key outcomes:
- Final accuracy: 85.3%
- +3.2% accuracy improvement over random sampling
- 35% reduction in labeled data required
- Multi-class classification across 20 categories
Dataset used: 20 Newsgroups (11,314 training documents)
- Active Learning implementation using uncertainty sampling
- Complete NLP pipeline from preprocessing to evaluation
- TF-IDF feature extraction
- Logistic Regression classifier
- Model evaluation and performance analysis
- Visualization of learning performance
- Streamlit interface for interactive predictions
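The feature-extraction and classifier components listed above can be sketched as follows. This is a minimal illustration on a tiny made-up corpus (the project itself uses the 20 Newsgroups dataset), so the texts and labels here are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; the project trains on 20 Newsgroups.
texts = [
    "the spacecraft launched into orbit",
    "nasa announced a new space mission",
    "the team won the hockey game",
    "a great goal in the final match",
]
labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

# TF-IDF features feeding a Logistic Regression classifier,
# mirroring the pipeline described above.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["the rocket reached orbit"]))
```

The same two-stage pattern (vectorizer then linear classifier) scales directly from this toy corpus to the full 20-class dataset.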
- Python
- scikit-learn
- NLTK
- Streamlit
- matplotlib
- seaborn
- NumPy
- pandas
| Metric | Active Learning | Random Sampling |
|---|---|---|
| Accuracy | 85.3% | 82.1% |
| F1 Score | 0.847 | 0.814 |
| Label Efficiency | 35% fewer labels | baseline |
Active Learning achieves better performance with significantly fewer labeled examples.
Pipeline workflow:
```
Raw Text
   ↓
Text Preprocessing
   ↓
TF-IDF Feature Extraction
   ↓
Baseline Model Training
   ↓
Active Learning Loop
   ↓
Model Evaluation and Analysis
```
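The preprocessing step in the workflow above can be sketched as below. This is a simplified stand-in using a small hardcoded stopword list rather than NLTK's full resources, so the stopword set and token output here are illustrative only:

```python
import re

# Small illustrative stopword list; the project uses NLTK's full English list.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}

def preprocess(text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The Spacecraft is in Orbit!"))  # ['spacecraft', 'orbit']
```

The cleaned token lists produced at this stage are what the TF-IDF vectorizer consumes in the next step.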
Uncertainty sampling formula:
`uncertainty = 1 - max(predicted_probability)`
Samples with highest uncertainty are selected for labeling.
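A minimal sketch of this least-confidence sampling loop is shown below. It uses synthetic feature vectors in place of real TF-IDF documents, and the seed size, batch size, and round count are assumed values for illustration (the notebooks' actual parameters may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pool of "documents" already converted to feature vectors.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

labeled = list(rng.choice(len(X), size=20, replace=False))  # initial seed set
pool = [i for i in range(len(X)) if i not in labeled]
batch_size = 10  # assumed per-round labeling budget

model = LogisticRegression(max_iter=1000)
for _ in range(5):  # five labeling rounds (illustrative)
    model.fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])
    # uncertainty = 1 - max(predicted_probability)
    uncertainty = 1 - probs.max(axis=1)
    top = np.argsort(uncertainty)[-batch_size:]  # most uncertain samples
    newly_labeled = [pool[i] for i in top]
    labeled.extend(newly_labeled)                # "query the oracle"
    pool = [i for i in pool if i not in newly_labeled]

print(f"labeled set size: {len(labeled)}")  # 20 + 5 * 10 = 70
```

Each round retrains on the labeled set, scores the unlabeled pool, and moves the least-confident samples to the labeled set, which is what lets the model reach a given accuracy with fewer labels than random selection.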
Clone the repository:

```bash
git clone https://github.com/rogueslasher/document-classifier-active-learning.git
cd document-classifier-active-learning
```

Create and activate a virtual environment:

```bash
python -m venv venv
venv\Scripts\activate
```

(On macOS/Linux, activate with `source venv/bin/activate` instead.)

Install dependencies:

```bash
pip install -r requirements.txt
```

Download the required NLTK resources:

```bash
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"
```

Open the notebooks in order:
```
notebooks/
│
├── explore_data.ipynb
├── text_preprocess.ipynb
├── feature_engineering.ipynb
├── baseline_model.ipynb
├── active_learning.ipynb
└── analysis.ipynb
```
Run each notebook sequentially to execute the full pipeline.
To run the Streamlit interface:

```bash
streamlit run app.py
```

Project structure:

```
document-classifier-active-learning/
│
├── notebooks/
│   ├── explore_data.ipynb
│   ├── text_preprocess.ipynb
│   ├── feature_engineering.ipynb
│   ├── baseline_model.ipynb
│   ├── active_learning.ipynb
│   └── analysis.ipynb
│
├── results/
│   ├── complete_analysis.png
│   ├── per_class_performance.png
│   ├── project_summary.txt
│   └── sample_data.csv
│
├── app.py
├── requirements.txt
└── README.md
```
- Active Learning implementation in real-world classification
- Natural Language Processing pipeline development
- TF-IDF feature engineering
- Model evaluation using accuracy and F1 score
- Efficient data utilization strategies
- End-to-end machine learning workflow
- Deploy using Streamlit Cloud or AWS
- Implement BERT-based classification
- Add FastAPI backend for inference
- Implement model explainability using SHAP or LIME
- Convert notebooks into production pipeline scripts
Aniket Pandey
GitHub: https://github.com/rogueslasher