This project explores a fraud detection problem using bank transaction data. The objective is not to make hard fraud decisions, but to build a risk-oriented predictive model that estimates the probability of a transaction being fraudulent, supporting decision-making under uncertainty.
The project covers the full data science pipeline:
- Exploratory Data Analysis (EDA)
- Feature engineering
- Handling strong class imbalance
- Model training and evaluation
- Feature importance and explainability (SHAP)
- Risk-based prediction
The dataset was obtained from Kaggle: Bank Transaction Fraud Detection
Due to size and licensing constraints, the full dataset is not included in this repository. A representative sample is used for preview and reproducibility. The complete dataset can be accessed directly on Kaggle.
- Highly imbalanced dataset (~5% fraud)
- Fraud patterns are subtle and overlap strongly with non-fraud transactions
- Plain accuracy is misleading: a model that always predicts "not fraud" would already score about 95%
- Evaluation therefore focuses on ROC-AUC, relative risk, and probabilistic outputs
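To make the point about accuracy concrete, here is a minimal sketch on synthetic labels with roughly 5% positives (not the Kaggle data). It compares the accuracy of a trivial "never fraud" predictor with the ROC-AUC of its constant score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly 5% positives (illustrative only).
y_true = (rng.random(10_000) < 0.05).astype(int)

# A trivial "model" that never flags fraud: constant label, constant score.
y_pred_never_fraud = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true), dtype=float)

accuracy = accuracy_score(y_true, y_pred_never_fraud)
auc = roc_auc_score(y_true, constant_scores)

print(f"accuracy of always predicting non-fraud: {accuracy:.3f}")
print(f"ROC-AUC of a constant score: {auc:.3f}")
```

The accuracy looks strong even though no fraud is ever caught, while the ROC-AUC stays at chance level because a constant score cannot rank transactions.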
Key feature engineering steps include:
- Removal of non-informative columns (UUIDs, IDs, timestamps)
- Transformation of repeated identifiers into frequency-based features
- Selection of financially meaningful numerical variables
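The frequency-based transformation mentioned above could look roughly like the sketch below. The column names `customer_id` and `customer_txn_count` are hypothetical and the data is made up for illustration; the project's actual columns may differ.

```python
import pandas as pd

# Toy transactions; "customer_id" is a hypothetical repeated identifier.
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1", "c3", "c1", "c2"],
    "amount": [120.0, 35.5, 980.0, 15.0, 60.0, 410.0],
})

# Count how often each identifier occurs and map the count back to each row.
id_counts = df["customer_id"].value_counts()
df["customer_txn_count"] = df["customer_id"].map(id_counts)

# The raw identifier is dropped once its frequency has been captured.
features = df.drop(columns=["customer_id"])
print(features)
```

In a real pipeline the counts would normally be computed on the training split only and then mapped onto the test split, to avoid leaking information across the split.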
Final core features used for modeling:
- Transaction Amount
- Account Balance
- Customer Age
Several exploratory analyses were performed, including:
- Fraud distribution and class imbalance
- Fraud rates across bank branches
- Relative fraud risk by bank (normalized by global average)
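As an illustration of the relative-risk analysis, the sketch below divides each group's fraud rate by the global fraud rate, so a value above 1 means higher-than-average risk. The column names `bank_branch` and `is_fraud` and the toy data are assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "bank_branch": ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "is_fraud":    [1,   0,   0,   0,   0,   0,   0,   0,   0,   1],
})

# Global fraud rate across all transactions.
global_rate = df["is_fraud"].mean()

# Fraud rate within each branch.
branch_rate = df.groupby("bank_branch")["is_fraud"].mean()

# Relative risk: branch rate normalized by the global average.
relative_risk = (branch_rate / global_rate).sort_values(ascending=False)
print(relative_risk)
```

With few transactions per group these ratios are noisy, so they are best read alongside the group sizes.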
A Random Forest Classifier was chosen due to:
- Non-linear decision boundaries
- Robustness to feature scaling
- Interpretability via feature importance and SHAP
Class imbalance was handled using:
- Stratified train-test split
- Class weighting (`class_weight='balanced'`)
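A minimal sketch of this setup, assuming synthetic stand-ins for the three core features and arbitrary hyperparameters (`n_estimators=100`, a 25% test split, fixed seeds); the project's actual settings are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000

# Synthetic stand-ins for Transaction Amount, Account Balance, Customer Age.
X = np.column_stack([
    rng.lognormal(mean=8.0, sigma=1.0, size=n),
    rng.lognormal(mean=9.0, sigma=1.0, size=n),
    rng.integers(18, 80, size=n),
])

# Roughly 5% positives, weakly dependent on the (log) amount.
logit = -3.1 + 0.6 * (np.log(X[:, 0]) - 8.0)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# stratify=y keeps the fraud rate (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

# Probability of the positive (fraud) class, scored with ROC-AUC.
fraud_proba = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, fraud_proba)
print(f"train fraud rate: {y_train.mean():.3f}, test fraud rate: {y_test.mean():.3f}")
print(f"test ROC-AUC: {test_auc:.3f}")
```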
Model evaluation focused on:
- ROC-AUC
- Feature ablation (AUC drop)
Two complementary approaches were used:
First, Random Forest feature importances highlighted the dominance of the numerical variables.
Second, each feature was removed in turn and the model retrained, so that the resulting drop in ROC-AUC measured its real impact on predictive performance.
This confirmed that:
- Transaction Amount
- Account Balance
- Customer Age
are the most influential features in fraud prediction.
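The ablation procedure can be sketched as follows: train once on all features, then retrain with each feature left out and record how far ROC-AUC falls. The data is synthetic, and the column names and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000

# Synthetic stand-ins for the three core features (hypothetical names).
X = pd.DataFrame({
    "amount": rng.lognormal(8.0, 1.0, n),
    "balance": rng.lognormal(9.0, 1.0, n),
    "age": rng.integers(18, 80, n),
})
logit = -3.1 + 0.8 * (np.log(X["amount"].to_numpy()) - 8.0)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def fit_and_score(columns):
    """Train on the given columns only and return the test ROC-AUC."""
    model = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    )
    model.fit(X_train[columns], y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test[columns])[:, 1])


all_features = list(X.columns)
baseline_auc = fit_and_score(all_features)

# AUC drop = baseline AUC minus the AUC obtained without the feature.
auc_drop = {}
for feature in all_features:
    remaining = [c for c in all_features if c != feature]
    auc_drop[feature] = baseline_auc - fit_and_score(remaining)

for feature, drop in sorted(auc_drop.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: AUC drop = {drop:+.4f}")
```

A larger positive drop suggests the model relied more on that feature; small or negative values are within the noise of a single train-test split.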
SHAP values were used to analyze how features influence predictions at an individual level.
Key observations:
- Most SHAP values are concentrated around the base value
- The model rarely receives strong evidence to push predictions toward fraud
- This behavior is consistent with strong class imbalance and subtle fraud patterns
Instead of outputting a binary decision, the final model produces a fraud risk score: the estimated probability that a transaction is fraudulent.
This allows:
- Flexible thresholding
- Integration with business rules
- Manual review prioritization
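One way such a risk function and a threshold-based triage rule might be written is sketched below. The body of `predict_fraud_risk` is an assumption about its implementation, and the `triage` helper, its thresholds, the feature names, and the toy training data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["amount", "balance", "age"]  # hypothetical column names


def predict_fraud_risk(model, amount, balance, age):
    """Return the estimated probability that one transaction is fraudulent."""
    row = pd.DataFrame([[amount, balance, age]], columns=FEATURES)
    fraud_index = list(model.classes_).index(1)  # column of the fraud class
    return float(model.predict_proba(row)[0, fraud_index])


def triage(risk, review_threshold=0.3, block_threshold=0.7):
    """Map a risk score to an action; both thresholds are illustrative."""
    if risk >= block_threshold:
        return "block"
    if risk >= review_threshold:
        return "manual review"
    return "approve"


# Toy training data with exactly 5% positives, only to make the sketch runnable.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "amount": rng.lognormal(8.0, 1.0, 500),
    "balance": rng.lognormal(9.0, 1.0, 500),
    "age": rng.integers(18, 80, 500),
})
labels = np.zeros(500, dtype=int)
labels[:25] = 1

model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=1
).fit(train[FEATURES], labels)

risk = predict_fraud_risk(model, amount=1200, balance=5400, age=35)
print(f"risk = {risk:.3f} -> {triage(risk)}")
```

Keeping the score and the decision rule separate means the thresholds can be changed by business rules without retraining the model.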
Example usage: `predict_fraud_risk(model, amount=50000, balance=20000, age=60)`
Limitations:
- Dataset limited to a single country and month
- Fraud patterns may evolve over time
- Model performance constrained by data imbalance and feature overlap
Possible future improvements:
- Cost-sensitive learning
- Time-based validation
- Threshold optimization based on business cost
- Integration with real-time alert systems
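As a rough idea of what cost-based threshold optimization could involve, the sketch below scans candidate thresholds and picks the one with the lowest total cost. The cost values (`cost_fn`, `cost_fp`), labels, and scores are invented for illustration and do not come from the project.

```python
import numpy as np


def expected_cost(y_true, scores, threshold, cost_fn=100.0, cost_fp=5.0):
    """Total cost of flagging every transaction with score >= threshold."""
    flagged = scores >= threshold
    false_negatives = np.sum((y_true == 1) & ~flagged)  # missed fraud
    false_positives = np.sum((y_true == 0) & flagged)   # needless reviews
    return cost_fn * false_negatives + cost_fp * false_positives


def best_threshold(y_true, scores, thresholds, **costs):
    """Return the candidate threshold with the lowest total cost."""
    totals = [expected_cost(y_true, scores, t, **costs) for t in thresholds]
    return float(thresholds[int(np.argmin(totals))])


# Made-up labels and risk scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.03, 0.08, 0.12, 0.18, 0.27, 0.33, 0.42, 0.56, 0.71, 0.88])

thresholds = np.linspace(0.05, 0.95, 19)
chosen = best_threshold(y_true, scores, thresholds)
print(f"chosen threshold: {chosen:.2f}")
print(f"cost at chosen threshold: {expected_cost(y_true, scores, chosen):.1f}")
```

Because a missed fraud is assumed to cost far more than a needless review, the chosen threshold sits low enough to catch every fraud case in this toy example.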
Tools used:
- Python
- Pandas / NumPy
- Scikit-learn
- SHAP
- Matplotlib / Seaborn
Author: Bastián B., BSc in Mathematics, aspiring Data Analyst / Data Scientist