This project classifies URLs as malicious or benign using an ensemble of machine learning models (RF, KNN, MLP, SVM and GNB) with a combination phase using weighted voting on an imbalanced dataset.
- Dataset
- Feature Engineering
- Models
- Workflow
- Evaluation
- Prerequisites
- How to Run with Docker
- Running Locally
The dataset was acquired from Kaggle.
The dataset consists of 651,191 unique URLs across four categories, which makes it large enough for deep learning and ensemble methods.
According to our findings, the dataset is highly imbalanced. The majority of the 651,191 samples are benign (65.7%), while the malicious categories, defacement (14.8%), phishing (14.5%), and malware (5.0%), make up the minority. Because of this imbalance, accuracy alone can be misleading, so we rely on our ensemble approach and on metrics such as the F1 score to check that minority classes are predicted accurately.
| url | type |
|---|---|
| br-icloud.com.br | phishing |
| mp3raid.com/music/krizz_kaliko.html | benign |
| http://www.garage-pirenne.be/index.php... | defacement |
Because the dataset has only a single feature, url, a model trained on it directly would not have enough information to classify a URL as malicious or benign. To give our ensemble more information, we derived further sub-features from the url feature. The new features are:
- URL Length
- Dot count
- Digits count
- Domain name
- Hostname check
- Secure HTTP
- Contain digit
- Letters count
- Has shortening service
- Has standard IP Address
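As a rough illustration of how several of these features could be derived from a raw URL string, here is a minimal sketch using only the Python standard library. The function name `extract_features`, the `SHORTENERS` list, and the exact definition of each feature are assumptions made for this example, not the contents of `src/features.py`. The domain name and hostname check features are left out because their definitions are not given here.

```python
import re
from urllib.parse import urlparse

# Hypothetical short list of shortening services; the project's real list is not shown.
SHORTENERS = ("bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly")

# Matches a dotted-quad IPv4 host such as 192.168.0.1 (octet ranges are not validated).
IPV4_PATTERN = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def extract_features(url: str) -> dict:
    """Derive simple lexical features from a URL string (illustrative definitions)."""
    # urlparse only fills in the host when a scheme or '//' is present,
    # so bare URLs like 'br-icloud.com.br' get a '//' prefix before parsing.
    parsed = urlparse(url if "://" in url else "//" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "dot_count": url.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "letter_count": sum(ch.isalpha() for ch in url),
        "contains_digit": int(any(ch.isdigit() for ch in url)),
        "is_https": int(parsed.scheme == "https"),
        "has_shortener": int(host in SHORTENERS),
        "has_ip_address": int(bool(IPV4_PATTERN.match(host))),
    }


features = extract_features("br-icloud.com.br")
print(features)
```

Each feature is encoded as an integer so that the resulting dictionary can be turned directly into a numeric row of a DataFrame.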
According to this study, an effective ensemble should contain diverse classifiers of different types. The classifiers used in this study are:
- Random Forest Classifier
- K-Nearest Neighbor
- MLP
- Gaussian Naive Bayes
- SGD
The diversity of classifiers in an ensemble allows for effective search through the search space. This enables ensembles to find niche results all across the solution space that single classifiers cannot.
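To make the weighted-voting combination phase concrete, the sketch below combines the predicted labels of several classifiers by summing a weight per vote. It assumes hard (label-based) voting, and the function name `weighted_vote` and the weight values are hypothetical. The project's actual voting scheme and weights may differ.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose votes carry the largest total weight."""
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # On a tie, the label that received its first vote earliest wins.
    return max(totals, key=totals.get)


# One predicted label per classifier for a single URL (made-up values).
votes = ["benign", "phishing", "phishing", "benign", "benign"]

# Hypothetical weights, for example proportional to each model's validation F1 score.
weights = [0.1, 0.4, 0.3, 0.1, 0.1]

result = weighted_vote(votes, weights)
majority = weighted_vote(votes, [1, 1, 1, 1, 1])
print(result, majority)
```

With equal weights the three benign votes win, while the weighted version lets two more trusted classifiers override them. This is the practical difference between weighted voting and a plain majority vote.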
The project follows a standard machine learning pipeline as orchestrated in main.py:
- Data Loading: Downloads the dataset from Kaggle using `kagglehub` and loads it into a pandas DataFrame (`src/data_loader.py`).
- Feature Engineering: Extracts the 10 numerical and categorical features from the raw URL strings (`src/features.py`).
- Label Encoding: Converts the categorical `type` labels (benign, defacement, phishing, malware) into numerical values for the models.
- Model Initialization: Initializes the individual classifiers (XGBoost, KNN, MLP, SVM, GNB) and the ensemble VotingClassifier (`src/models.py`).
- Evaluation: Trains and evaluates the models using a 5-fold cross-validation pipeline with standard scaling, and reports the performance metrics (`src/evaluate.py`).
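The label encoding, model initialization, and evaluation steps above can be sketched with scikit-learn as follows. This is a simplified stand-in and not the code in `src/models.py` or `src/evaluate.py`. It uses synthetic data instead of the real URL features, only three of the classifiers, soft voting, and made-up voting weights, all of which are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for the engineered features: 10 numeric columns and
# four classes with roughly the class proportions reported for the dataset.
X, class_ids = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4,
    n_clusters_per_class=1, weights=[0.657, 0.148, 0.145, 0.05],
    random_state=0,
)
names = np.array(["benign", "defacement", "phishing", "malware"])
y_text = names[class_ids]

# Label encoding: map the string labels to integers (sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(y_text)

# A reduced ensemble; the weights are hypothetical, not the project's values.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("gnb", GaussianNB()),
    ],
    voting="soft",
    weights=[2, 1, 1],
)

# Putting the scaler inside the pipeline means it is fitted on the training
# part of each fold only, so no information leaks from the held-out part.
pipeline = make_pipeline(StandardScaler(), ensemble)
scores = cross_validate(pipeline, X, y, cv=5, scoring=["accuracy", "f1_macro"])

print("accuracy per fold:", scores["test_accuracy"])
print("macro F1 per fold:", scores["test_f1_macro"])
```

For a classifier, passing `cv=5` to `cross_validate` uses stratified folds, which keeps the minority classes represented in every fold.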
The metrics used in this study are:
- Accuracy: The proportion of all classifications that were correct, whether positive or negative.
- Recall: The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives.
- Precision: The proportion of all the model's positive classifications that are actually positive.
- F1 Score: The harmonic mean of the precision and recall.
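The definitions above can be checked with a small, self-contained sketch. The helper names `accuracy` and `class_metrics` and the toy labels are made up for this example. It shows why accuracy alone is not enough on imbalanced data: a model that always predicts benign still scores 80% accuracy here while never detecting malware.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1), treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # Undefined ratios (zero denominators) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy imbalanced example: 8 benign URLs and 2 malware URLs.
y_true = ["benign"] * 8 + ["malware"] * 2
y_pred = ["benign"] * 10  # a model that always predicts the majority class

acc = accuracy(y_true, y_pred)
malware_scores = class_metrics(y_true, y_pred, "malware")
print(acc, malware_scores)
```

Averaging the per-class F1 scores (macro F1) gives every class equal weight, which is one common way to keep minority classes visible in the final score.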
- Docker installed on your machine.
- Build the Docker image:

  Open your terminal in the project directory and run:

  `docker build -t malicious-url-classifier .`

- Run the container:

  Once the build is complete, run the application:

  `docker run --rm -v $(pwd):/app -w /app malicious-url-classifier`

  The `--rm` flag automatically removes the container after it finishes running to save space.

- Run the analysis script:

  To inspect the dataset using the cached download:

  `docker run --rm -v $(pwd)/kaggle_cache:/root/.cache/kagglehub malicious-url-classifier python -m src.analyze`

- Plot analysis graphs:

  To create plots of the dataset using the container:

  `docker run -v $(pwd)/outputs:/app/outputs malicious-url-classifier python -m src.analyze`

- Run mock test:

  To check that the expected printed output and plots are produced:

  `docker run --rm -v $(pwd):/app -w /app malicious-url-classifier python -m src.mock_test`
- The script uses `kagglehub` to download the dataset. If the dataset requires authentication, you may need to pass your Kaggle credentials as environment variables:

  `docker run --rm -e KAGGLE_USERNAME=your_username -e KAGGLE_KEY=your_key malicious-url-classifier`

- The dataset is downloaded inside the container. To persist the dataset between runs (to avoid re-downloading), you can mount a volume:

  `docker run --rm -v $(pwd)/kaggle_cache:/root/.cache/kagglehub malicious-url-classifier`
To run the analysis script locally without Docker, execute the following from the project root:

`pip install -r requirements.txt`

`python3 -m src.analyze`

Docker has built-in commands that are meant for housekeeping tasks:
- `docker image prune`: delete all dangling images (those without an assigned tag)
- `docker image prune -a`: delete all images not used by any container
- `docker system prune`: delete stopped containers, unused networks, dangling images, and dangling build cache
- `docker system prune -a`: delete stopped containers, unused networks, images not used by any container, and all build cache
