This project classifies bank transactions as either tax-deductible or non-deductible using a hybrid system that combines rule-based logic and a machine learning model.
Live Demo: https://taxclassifier.streamlit.app/
├── app.py # Streamlit web app interface
├── classifier.py # ML prediction logic
├── data_loader.py # Preprocessing and CSV loading
├── rules.py # Rule matching function
├── rules_config.py # Regex-based tax rules
├── train_model.py # Training pipeline for ML model
├── requirements.txt # Python dependencies
- Rule-Based Classifier: Uses regex patterns (e.g., business travel, meals, equipment) to assign labels and explanations.
- ML Classifier: Falls back to a trained logistic regression model when no rule matches, using TF-IDF on text and one-hot encoding on merchant name.
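Every prediction carries an explanation: the label of the matched rule (e.g., "Business travel") or, for ML predictions, the model's probability (e.g., "ML (p=0.84)").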
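As a rough illustration of the rule-based side, the sketch below matches a description against an ordered list of regex rules and returns a label plus explanation. The rule patterns, the `RULES` list, and the `match_rule` name are assumptions made for this example; the actual contents of rules_config.py and rules.py may differ.

```python
import re

# Hypothetical rules: (compiled pattern, deductible flag, explanation).
# These patterns are illustrative and are not taken from rules_config.py.
RULES = [
    (re.compile(r"\b(flight|airlines?|hotel)\b", re.IGNORECASE), True, "Business travel"),
    (re.compile(r"\b(macbook|laptop|monitor)\b", re.IGNORECASE), True, "Business equipment purchase"),
    (re.compile(r"\b(netflix|spotify)\b", re.IGNORECASE), False, "Personal entertainment"),
]


def match_rule(text):
    """Return (deductible, reason) for the first rule that matches, or None."""
    for pattern, deductible, reason in RULES:
        if pattern.search(text):
            return deductible, reason
    return None  # no rule matched; the caller would fall back to the ML model


print(match_rule("Flight to NYC for business conference"))
print(match_rule("Grocery run"))
```

Returning `None` for unmatched text keeps the rule layer independent of the ML fallback, so either part can be swapped out separately.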
The input CSV must contain:
- date – e.g., 2024-05-10
- amount – e.g., 125.75
- merchant – e.g., Delta Airlines
- description – e.g., Flight to NYC for business conference
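A minimal sketch of how the required columns could be checked before classification, using only the standard library. The `load_transactions` helper and `REQUIRED_COLUMNS` constant are hypothetical names for this example; data_loader.py may validate input differently.

```python
import csv
import io

# Column names taken from the input format described above.
REQUIRED_COLUMNS = ("date", "amount", "merchant", "description")


def load_transactions(handle):
    """Read transactions from an open CSV file, failing early on missing columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")
    rows = []
    for row in reader:
        row["amount"] = float(row["amount"])  # amounts arrive as text in CSV
        rows.append(row)
    return rows


# An in-memory file stands in for an uploaded CSV.
sample = io.StringIO(
    "date,amount,merchant,description\n"
    "2024-05-10,125.75,Delta Airlines,Flight to NYC for business conference\n"
)
rows = load_transactions(sample)
print(rows[0]["merchant"], rows[0]["amount"])
```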
The output is available in both CSV and JSON formats. Each record contains:
- date
- merchant
- description
- deductible – true or false
- reason – explanation (e.g., "Business travel", or "ML (p=0.84)")
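The two export formats could be produced along these lines. This is a sketch with the standard library only; the `to_json` and `to_csv` helpers are hypothetical, and writing the CSV boolean as lowercase text is an assumption chosen to mirror the JSON output.

```python
import csv
import io
import json

# Field order follows the output record described above.
FIELDS = ["date", "merchant", "description", "deductible", "reason"]


def to_json(records):
    """Serialize classified records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize classified records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Assumption: write booleans as lowercase text to match the JSON form.
        writer.writerow({**record, "deductible": str(record["deductible"]).lower()})
    return buffer.getvalue()


records = [
    {
        "date": "2025-07-25",
        "merchant": "Apple",
        "description": "MacBook purchase for work",
        "deductible": True,
        "reason": "Business equipment purchase",
    }
]
print(to_json(records))
print(to_csv(records))
```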
    pip install -r requirements.txt
    python train_model.py --input data/sample_transactions-2.csv --output models/tax_deductible_clf.joblib
    streamlit run app.py

You can upload your own transaction CSV or load the sample.
- ✅ Explainable rule-based deductions
- ✅ ML fallback with threshold tuning
- ✅ Streamlit UI with file upload and downloads
- ✅ JSON + CSV export support
- ✅ Modular and extensible codebase
- Transactions that don’t match any rule default to ML classification.
- The threshold for ML confidence is set to 0.5 by default (user-adjustable).
- Rules are evaluated in order, and the first match wins.
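The decision flow in these notes can be sketched as a single function: try the rules first, then fall back to the model and compare its probability against the threshold. The `classify` signature, the stand-in `demo_rule` and `demo_model` callables, and the use of `>=` for the threshold comparison are all assumptions for illustration, not the project's confirmed implementation.

```python
def classify(description, match_rule, predict_proba, threshold=0.5):
    """Rule-first classification with an ML fallback.

    match_rule(text) returns (deductible, reason) or None.
    predict_proba(text) returns the probability that the text is deductible.
    """
    rule_hit = match_rule(description)
    if rule_hit is not None:
        deductible, reason = rule_hit
        return {"deductible": deductible, "reason": reason}
    probability = predict_proba(description)
    # Assumption: a probability at or above the threshold counts as deductible.
    return {
        "deductible": probability >= threshold,
        "reason": f"ML (p={probability:.2f})",
    }


# Stand-ins for the real rule matcher and trained model (hypothetical).
def demo_rule(text):
    return (True, "Business travel") if "flight" in text.lower() else None


def demo_model(text):
    return 0.84


print(classify("Flight to NYC for business conference", demo_rule, demo_model))
print(classify("Office chair", demo_rule, demo_model))
print(classify("Office chair", demo_rule, demo_model, threshold=0.9))
```

Passing the rule matcher and model in as callables keeps the sketch testable without a trained model on disk.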
[
  {
    "date": "2025-07-25",
    "merchant": "Apple",
    "description": "MacBook purchase for work",
    "deductible": true,
    "reason": "Business equipment purchase"
  }
]

Arnav Gupta
AI/ML Internship Candidate
arnavgupta.info
source .venv/bin/activate