Bank Transaction Fraud Detection – Machine Learning Project

Overview

This project explores a fraud detection problem using bank transaction data. The objective is not to make hard fraud decisions, but to build a risk-oriented predictive model that estimates the probability of a transaction being fraudulent, supporting decision-making under uncertainty.

The project covers the full data science pipeline:

Exploratory Data Analysis (EDA)
Feature engineering
Handling strong class imbalance
Model training and evaluation
Feature importance and explainability (SHAP)
Risk-based prediction

Dataset

The dataset was obtained from Kaggle: Bank Transaction Fraud Detection

Due to size and licensing constraints, the full dataset is not included in this repository. A representative sample is used for preview and reproducibility. The complete dataset can be accessed directly on Kaggle.

Problem Characteristics

Highly imbalanced dataset (~5% fraud)
Fraud patterns are subtle and overlap strongly with non-fraud transactions
Plain accuracy is misleading: a model that labels every transaction as legitimate would still be about 95% accurate
Focus on ROC-AUC, relative risk, and probabilistic outputs
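
To see why accuracy misleads at this fraud rate, consider a baseline that labels every transaction as legitimate. The sketch below uses made-up labels (1,000 transactions, 50 of them fraudulent, matching the ~5% rate) rather than the real dataset.

```python
# Illustrative labels only: 1,000 transactions, 5% fraud (1 = fraud, 0 = legitimate).
y_true = [1] * 50 + [0] * 950

# A trivial "model" that never flags anything.
y_pred = [0] * len(y_true)

# Share of transactions labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of actual fraud cases that were caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")    # accuracy = 0.95
print(f"fraud recall = {recall:.2f}")  # fraud recall = 0.00
```

The baseline scores 95% accuracy while catching no fraud at all, which is why ranking-based and probabilistic metrics are more informative here.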

Feature Engineering

Key feature engineering steps include:

Removal of non-informative fields (UUIDs, other unique IDs, raw timestamps)
Transformation of repeated identifiers into frequency-based features
Selection of financially meaningful numerical variables
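
As a rough illustration of the frequency-based transformation, the sketch below replaces a repeated identifier with the number of times it occurs. The field name `customer_id`, the new `customer_freq` feature, and the sample rows are all hypothetical; the notebook may use different columns.

```python
from collections import Counter

# Hypothetical rows; "customer_id" stands in for any identifier that repeats.
transactions = [
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "C2", "amount": 75.5},
    {"customer_id": "C1", "amount": 990.0},
    {"customer_id": "C3", "amount": 15.0},
    {"customer_id": "C1", "amount": 42.0},
]

# Count how often each identifier occurs in the data.
id_counts = Counter(row["customer_id"] for row in transactions)

# Replace the raw identifier with its frequency.
encoded = [
    {"customer_freq": id_counts[row["customer_id"]], "amount": row["amount"]}
    for row in transactions
]

print(encoded[0])  # {'customer_freq': 3, 'amount': 120.0}
```

With pandas the same idea is roughly `df[col].map(df[col].value_counts())`. In a real pipeline the counts would normally be computed on the training split only and then applied to the test split, to avoid leaking information.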

Final core features used for modeling:

Transaction Amount
Account Balance
Customer Age

Exploratory Analysis

Several exploratory analyses were performed, including:

Fraud distribution and class imbalance
Fraud rates across bank branches
Relative fraud risk by bank (normalized by global average)
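
The relative-risk normalization can be sketched as each bank's fraud rate divided by the global fraud rate. The bank names and rows below are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (bank, is_fraud) pairs; the real column names may differ.
rows = [
    ("Bank A", 0), ("Bank A", 0), ("Bank A", 0), ("Bank A", 1),
    ("Bank B", 0), ("Bank B", 0), ("Bank B", 0), ("Bank B", 0),
    ("Bank C", 0), ("Bank C", 1),
]

totals = defaultdict(int)
frauds = defaultdict(int)
for bank, is_fraud in rows:
    totals[bank] += 1
    frauds[bank] += is_fraud

# Fraud rate over all transactions, used as the normalizer.
global_rate = sum(frauds.values()) / sum(totals.values())

# Relative risk > 1 means the bank's fraud rate is above the global average.
relative_risk = {
    bank: (frauds[bank] / totals[bank]) / global_rate for bank in totals
}

for bank, risk in relative_risk.items():
    print(f"{bank}: {risk:.2f}")
# Bank A: 1.25
# Bank B: 0.00
# Bank C: 2.50
```

Groups with few transactions (such as the two-row "Bank C" here) produce noisy ratios, so relative risk is easier to interpret alongside the transaction count.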

Modeling Approach

A Random Forest classifier was chosen for its:

Ability to model non-linear decision boundaries
Insensitivity to feature scaling
Interpretability via feature importance and SHAP

Class imbalance was handled using:

Stratified train-test split
Class weighting (class_weight='balanced')
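
A minimal sketch of this setup with scikit-learn is shown below. It runs on synthetic data from `make_classification` as a stand-in for the real transactions, and `n_estimators`, `test_size`, and `random_state` are assumptions rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 3 numeric features, roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

# stratify=y keeps the fraud rate (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" weights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# ROC-AUC is computed from predicted probabilities, not hard labels.
fraud_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, fraud_proba)

print(f"train fraud rate: {y_train.mean():.3f}")
print(f"test fraud rate:  {y_test.mean():.3f}")
print(f"ROC-AUC: {auc:.3f}")
```

The printed fraud rates should be nearly equal across the two splits, which is the point of stratification. The AUC on synthetic data says nothing about performance on the real dataset.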

Model evaluation focused on:

ROC-AUC
Feature ablation (AUC drop)

Feature Importance & Ablation Study

Two complementary approaches were used:

Impurity-Based Importance

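Random Forest feature importances highlighted the dominance of numerical variables. Impurity-based importance is known to favor continuous, high-cardinality features, so this ranking is best read alongside the ablation results.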

AUC-Based Feature Ablation

Each feature was removed in turn and the model was retrained, so that the resulting drop in ROC-AUC measures that feature's actual contribution to predictive performance.

This confirmed that:

Transaction Amount
Account Balance
Customer Age

are the most influential features in fraud prediction.
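
The ablation procedure could look roughly like the sketch below: train once with all features, then retrain with each feature left out and record the drop in test ROC-AUC. The data is synthetic, the feature names are only labels attached to synthetic columns, and the helper `fit_and_score` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = ["amount", "balance", "age"]  # labels for the synthetic columns


def fit_and_score(X_train, X_test, y_train, y_test, cols):
    """Train on the given column indices and return the test ROC-AUC."""
    model = RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    )
    model.fit(X_train[:, cols], y_train)
    proba = model.predict_proba(X_test[:, cols])[:, 1]
    return roc_auc_score(y_test, proba)


# Synthetic stand-in data with roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

all_cols = list(range(len(FEATURES)))
baseline_auc = fit_and_score(X_train, X_test, y_train, y_test, all_cols)

# Retrain without each feature; a larger drop means a more useful feature.
auc_drop = {}
for i, name in enumerate(FEATURES):
    kept = [c for c in all_cols if c != i]
    ablated_auc = fit_and_score(X_train, X_test, y_train, y_test, kept)
    auc_drop[name] = baseline_auc - ablated_auc

for name, drop in sorted(auc_drop.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: AUC drop = {drop:+.3f}")
```

A drop near zero (or negative) suggests the remaining features carry the same information. Repeating the procedure over several random seeds would show how stable the ranking is.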

Model Explainability (SHAP)

SHAP values were used to analyze how features influence predictions at an individual level.

Key observations:

Most SHAP values are close to zero, so most predictions stay near the base value (the model's average output)
The model rarely receives strong evidence to push predictions toward fraud
This behavior is consistent with strong class imbalance and subtle fraud patterns
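
These observations can be read against the additivity property of SHAP: for each transaction, the explained model output decomposes into a base value plus one contribution per feature. The notation below is generic rather than taken from the project's notebook; $M$ denotes the number of model features and $f$ the model output being explained (for a Random Forest classifier, typically the predicted fraud probability).

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x), \qquad \phi_0 = \mathbb{E}[f(X)]
```

When every $\phi_i(x)$ is small, $f(x)$ stays close to the base value $\phi_0$, which is the pattern described above.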

Risk-Oriented Prediction

Instead of outputting a binary decision, the final model produces a fraud risk score:

Probability that a transaction is fraudulent

This allows:

Flexible thresholding
Integration with business rules
Manual review prioritization
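
As an illustration of how a risk score supports thresholding and review prioritization, the sketch below flags transactions above a threshold and returns the riskiest ones first, up to a review capacity. The function `prioritize_for_review`, the transaction ids, and the scores are hypothetical.

```python
def prioritize_for_review(scored, threshold, capacity):
    """Return up to `capacity` ids with risk >= `threshold`, riskiest first."""
    flagged = [(tx_id, risk) for tx_id, risk in scored if risk >= threshold]
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return [tx_id for tx_id, _ in flagged[:capacity]]


# Hypothetical (transaction_id, fraud_probability) pairs.
scored = [("t1", 0.02), ("t2", 0.41), ("t3", 0.07), ("t4", 0.88), ("t5", 0.35)]

# The threshold and review capacity are business choices, not model outputs.
print(prioritize_for_review(scored, threshold=0.30, capacity=2))
# ['t4', 't2']
print(prioritize_for_review(scored, threshold=0.05, capacity=10))
# ['t4', 't2', 't5', 't3']
```

Because the model output is a probability rather than a label, the threshold can be changed without retraining.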

Example usage:

predict_fraud_risk(model, amount=50000, balance=20000, age=60)
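
The body of `predict_fraud_risk` is not shown here. One possible implementation, consistent with the call above, is sketched below. It assumes that the model follows the scikit-learn classifier interface, that it was trained on the features in the order amount, balance, age, and that fraud is labelled as class 1; none of these details are confirmed by the notebook.

```python
def predict_fraud_risk(model, amount, balance, age):
    """Return the model's estimated fraud probability for one transaction.

    Assumes `model` exposes `predict_proba` and `classes_` (scikit-learn
    style), was trained on [amount, balance, age] in that order, and
    labels fraud as class 1.
    """
    features = [[amount, balance, age]]  # one row, three columns
    proba = model.predict_proba(features)[0]
    # Look up the fraud column explicitly instead of assuming it is index 1.
    fraud_index = list(model.classes_).index(1)
    return float(proba[fraud_index])
```

The returned value is a score between 0 and 1. Whether it behaves like a calibrated probability depends on the model and the class weighting used in training.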

Limitations

Dataset limited to a single country and month
Fraud patterns may evolve over time
Model performance constrained by data imbalance and feature overlap

Future Work

Cost-sensitive learning
Time-based validation
Threshold optimization based on business cost
Integration with real-time alert systems

Technologies Used

Python
Pandas / NumPy
Scikit-learn
SHAP
Matplotlib / Seaborn

Author

Bastián B.
BSc in Mathematics
Aspiring Data Analyst / Data Scientist
