An advanced Machine Learning pipeline for detecting malicious cryptocurrency wallets.
The Wallet Risk Scorer is a data-driven security tool designed to classify blockchain addresses as "Safe" or "Scam" based on behavioral analysis. By ingesting transaction history and derived metrics, the system trains a high-performance XGBoost Classifier to predict the likelihood of malicious activity.
This project moves beyond simple blocklists by analyzing behavioral features—such as transaction frequency, activity duration, and token transfer patterns—to flag suspicious wallets that may not yet be reported.
- 🤖 Advanced Gradient Boosting: Uses XGBoost (Extreme Gradient Boosting), a gradient-boosted decision tree method well suited to tabular risk data.
- 🎯 Automated Optimization: Implements RandomizedSearchCV to automatically tune hyperparameters (n_estimators, max_depth, learning_rate, etc.) and select the best-scoring configuration found.
- 📉 Robust Validation: Uses Stratified K-Fold Cross-Validation (K=5) to estimate how well the model generalizes to unseen data and to help detect overfitting.
- 📊 Detailed Analytics: Generates comprehensive Classification Reports and Confusion Matrices to evaluate precision, recall, and F1-scores.
- 🧠 Feature Engineering: Derives key behavioral signals like transaction_frequency (transactions per active day) to enhance model discriminability.
- Language: Python 3.12+
- Machine Learning: XGBoost, Scikit-Learn
- Data Manipulation: Pandas, NumPy
- Visualization: Matplotlib
- Data Source: Etherscan / Moralis (via efficient CSV datasets)
wallet-risk-scorer-ML/
├── data/ # Source CSV datasets (Safe vs Scam wallets)
├── models/ # Serialized trained models (.joblib)
├── src/
│ ├── risk_scorer/
│ │ ├── data_collection/ # Scripts to fetch transactions & labels
│ │ ├── main.py # 🚀 MASTER PIPELINE: Preprocessing -> Tuning -> Training -> Evaluation
│ │ └── config.py # Configuration & Path definitions
│ └── utils/ # Helper functions for data fetching
├── pyproject.toml # Project dependencies & configuration
└── README.md # Documentation
Ensure you have Python 3.9+ installed on your machine.
Clone the repository and install the dependencies.
git clone https://github.com/yourusername/wallet-risk-scorer-ML.git
cd wallet-risk-scorer-ML
# It is recommended to use a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install pandas scikit-learn xgboost matplotlib joblib python-dotenv requests
To run the full Training & Evaluation Pipeline:
python -m src.risk_scorer.main
What happens when you run this?
- Data Ingestion: Loads scam and safe wallet datasets.
- Preprocessing: Cleans data, handles missing values, and calculates transaction_frequency.
- Hyperparameter Tuning: Runs a Randomized Search to find the best XGBoost parameters.
- Training: Trains the model on the full dataset using the best parameters.
- Evaluation: Performs 5-Fold Cross-Validation and prints detailed accuracy metrics.
- Serialization: Saves the optimized model to models/xgboost_optimized_v5.joblib.
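The preprocessing step above can be illustrated with a short pandas sketch. The column names (tx_count, first_tx, last_tx), the example addresses, and the helper name add_transaction_frequency are all hypothetical. "Active days" is assumed here to mean the span between a wallet's first and last transaction, floored at one day; the project may define it differently.

```python
import pandas as pd


def add_transaction_frequency(wallets: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing values handled and transaction_frequency added."""
    df = wallets.copy()

    # Assumption: a missing transaction count means no recorded activity.
    df["tx_count"] = df["tx_count"].fillna(0)

    first = pd.to_datetime(df["first_tx"], utc=True)
    last = pd.to_datetime(df["last_tx"], utc=True)

    # Activity duration in days; the floor of one day avoids division by zero
    # for wallets whose transactions all fall on a single day.
    active_days = (last - first).dt.total_seconds() / 86_400
    df["active_days"] = active_days.fillna(0).clip(lower=1)

    df["transaction_frequency"] = df["tx_count"] / df["active_days"]
    return df


# Hypothetical input rows, including one wallet with no data at all.
wallets = pd.DataFrame({
    "address": ["0xaaa", "0xbbb", "0xccc"],
    "tx_count": [120, 4, None],
    "first_tx": ["2023-01-01", "2023-06-01", None],
    "last_tx": ["2023-01-11", "2023-06-01", None],
})

features = add_transaction_frequency(wallets)
print(features[["address", "active_days", "transaction_frequency"]])
```

Working on a copy leaves the raw frame untouched, which makes it easier to rerun the step with a different definition of active days.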
The core logic resides in src/risk_scorer/main.py. The pipeline follows these steps:
- Labeling: Assigns 1 to scam datasets and 0 to safe datasets.
- Merging: Combines base wallet data with token transfer data.
- Hyperparameter Search:
    param_dist = {
        'n_estimators': [100, 300, 500, 700],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 4, 5, 6, 8],
        # ... and more
    }
- Model Serialization: The final high-performing model is saved using joblib for future inference integration.
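The labeling and merging steps above might look roughly like the following. The miniature frames, the address key column, and the token_transfer_count column are hypothetical placeholders; the real CSV schemas in data/ may differ.

```python
import pandas as pd

# Hypothetical miniature datasets; real column names may differ.
scam_wallets = pd.DataFrame({"address": ["0xs1", "0xs2"], "tx_count": [40, 7]})
safe_wallets = pd.DataFrame(
    {"address": ["0xa1", "0xa2", "0xa3"], "tx_count": [12, 90, 3]}
)
token_transfers = pd.DataFrame({
    "address": ["0xs1", "0xa2"],
    "token_transfer_count": [15, 2],
})

# Labeling: 1 for scam wallets, 0 for safe wallets.
scam_wallets = scam_wallets.assign(label=1)
safe_wallets = safe_wallets.assign(label=0)
base = pd.concat([scam_wallets, safe_wallets], ignore_index=True)

# Merging: a left join keeps every wallet, even those without token transfers.
# validate raises if an address appears more than once on either side.
dataset = base.merge(
    token_transfers, on="address", how="left", validate="one_to_one"
)
dataset["token_transfer_count"] = (
    dataset["token_transfer_count"].fillna(0).astype(int)
)

# Feature matrix and target vector for the model.
X = dataset.drop(columns=["address", "label"])
y = dataset["label"]
print(dataset)
```

A left join is used in this sketch so that wallets with no token activity stay in the training set with a count of zero instead of being dropped.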
- Real-time API: Expose the model via a FastAPI/Flask endpoint.
- Live Inference: Script to fetch data for a new address and predict immediately.
- Deep Learning: Explore LSTM/RNNs for sequential transaction analysis.
- Explainability: Integrate SHAP (SHapley Additive exPlanations) to explain individual risk scores.
Disclaimer: This tool is for educational and research purposes. Cryptocurrency markets are volatile and high-risk. Always do your own research.