Malicious URL Classifier

This project classifies URLs as malicious or benign using an ensemble of machine learning models (RF, KNN, MLP, SVM and GNB) whose predictions are combined by weighted voting. The models are trained and evaluated on an imbalanced dataset.

Dataset

The dataset was acquired from Kaggle.

The dataset consists of 651,191 unique URLs across four categories, which is large enough to train both deep learning models and ensemble methods.

According to our findings, the dataset is highly imbalanced. The majority of the 651,191 samples are benign (65.7%), while malicious categories like defacement (14.8%), phishing (14.5%), and malware (5.0%) make up the minority. This imbalance necessitates the use of our ensemble approach and metrics like F1-Score to ensure minority classes are predicted accurately.
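
To make the effect of the imbalance concrete, the sketch below scores a hypothetical classifier that always predicts the majority class. It uses only the class shares quoted above; the per-class counts are approximations derived from those shares, and the function name is illustrative rather than part of the project.

```python
# Class shares as reported above; counts are approximations derived from them.
TOTAL = 651_191
SHARES = {"benign": 0.657, "defacement": 0.148, "phishing": 0.145, "malware": 0.050}

approx_counts = {label: round(TOTAL * share) for label, share in SHARES.items()}


def majority_baseline_scores(shares):
    """Scores of a hypothetical classifier that always predicts the largest class."""
    majority = max(shares, key=shares.get)
    accuracy = shares[majority]  # correct exactly when the sample is in the majority class
    f1_per_class = {}
    for label in shares:
        if label == majority:
            precision, recall = shares[majority], 1.0
            f1_per_class[label] = 2 * precision * recall / (precision + recall)
        else:
            f1_per_class[label] = 0.0  # never predicted, so recall is 0
    macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
    return accuracy, macro_f1


accuracy, macro_f1 = majority_baseline_scores(SHARES)
print(approx_counts)
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```

Such a baseline reaches roughly 0.66 accuracy while detecting no malicious URL at all, and its macro-averaged F1 is only about 0.20, which is why accuracy alone is a poor guide on this dataset.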

Raw Data Sample

url	type
br-icloud.com.br	phishing
mp3raid.com/music/krizz_kaliko.html	benign
http://www.garage-pirenne.be/index.php...	defacement

Feature Engineering

The dataset provides only a single feature, url, which on its own does not give the models enough information to decide whether a URL is malicious or benign. To give the ensemble more to work with, we derived the following sub-features from the url feature:

URL Length
Dot count
Digits count
Domain name
Hostname check
Secure HTTP
Contain digit
Letters count
Has shortening service
Has standard IP Address

Models

According to this study, an effective ensemble should contain diverse classifiers of different types. The classifiers used in this study are:

Random Forest Classifier
K-Nearest Neighbor
MLP
Gaussian Naive Bayes
SGD

Classifiers of different types tend to make different kinds of errors. Combining them lets the ensemble cover regions of the solution space that any single classifier would miss.
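
The combination step can be sketched as weighted soft voting, where each model's class probabilities are averaged with a per-model weight and the class with the highest average wins. This is an assumption about the voting scheme: the project may use hard (majority) voting instead, and the function name, probabilities, and weights below are made up for illustration.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-model class probabilities into one prediction.

    probabilities: one list of class probabilities per model, same class order.
    weights: one non-negative weight per model.
    Returns (index of the winning class, weighted-average probabilities).
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(probabilities[0])
    combined = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=combined.__getitem__)
    return winner, combined


CLASSES = ["benign", "defacement", "malware", "phishing"]
# Made-up probabilities from three hypothetical models for a single URL.
model_probs = [
    [0.60, 0.10, 0.10, 0.20],
    [0.30, 0.10, 0.10, 0.50],
    [0.20, 0.10, 0.10, 0.60],
]
winner, combined = weighted_soft_vote(model_probs, weights=[1.0, 2.0, 2.0])
print(CLASSES[winner], [round(p, 2) for p in combined])
```

In this example the first model favours benign, but the two more heavily weighted models favour phishing, so the ensemble predicts phishing with a combined probability of 0.48.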

Workflow

The project follows a standard machine learning pipeline as orchestrated in main.py:

Data Loading: Downloads the dataset from Kaggle using kagglehub and loads it into a pandas DataFrame (src/data_loader.py).
Feature Engineering: Extracts the 10 numerical and categorical features from the raw URL strings (src/features.py).
Label Encoding: Converts the categorical type labels (benign, defacement, phishing, malware) into numerical values for the models.
Model Initialization: Initializes the individual classifiers (XGBoost, KNN, MLP, SVM, GNB) and the ensemble VotingClassifier (src/models.py).
Evaluation: Trains and evaluates the models using a 5-fold cross-validation pipeline with standard scaling, and reports the performance metrics (src/evaluate.py).
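
The last three steps can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic stand-in data, not the contents of main.py or src/evaluate.py: the data, the choice of three base models, the voting mode, and the weights are all assumptions made to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in data: 200 samples, 10 numeric features, 4 labels.
rng = np.random.default_rng(0)
names = np.array(["benign", "defacement", "phishing", "malware"])
labels = names[rng.integers(0, 4, 200)]
y = LabelEncoder().fit_transform(labels)  # strings -> integers 0..3
X = rng.normal(size=(200, 10)) + y[:, None]  # shift by class so there is signal

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("gnb", GaussianNB()),
    ],
    voting="soft",
    weights=[2, 1, 1],  # illustrative weights, not the project's
)

# Keeping the scaler inside the pipeline refits it on each training fold,
# so no statistics from the held-out fold leak into training.
pipeline = make_pipeline(StandardScaler(), ensemble)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
print(scores["test_accuracy"].mean(), scores["test_f1_macro"].mean())
```

Stratified folds keep the class proportions similar in every split, which matters when some classes are as rare as malware is in this dataset.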

Evaluation

The metrics used in this study are:

Accuracy: The proportion of all classifications that were correct, whether positive or negative.
Recall: The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives.
Precision: The proportion of all the model's positive classifications that are actually positive.
F1 Score: The harmonic mean of the precision and recall.
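
The four definitions above can be written out directly from confusion-matrix counts for a single positive class. The sketch below uses made-up counts and a hypothetical function name; in a multi-class setting such as this one, the same calculation is typically applied to each class in turn (one class versus the rest) and then averaged.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for one positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, treating "malware" as the positive class.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=930)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these counts accuracy is 0.97 even though a third of the positive samples are missed (recall of about 0.667), while F1 comes out at about 0.727 and reflects that gap.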

Prerequisites

Docker installed on your machine.

How to Run with Docker

Build the Docker image:

Open your terminal in the project directory and run:
```
docker build -t malicious-url-classifier .
```
Run the container:

Once the build is complete, run the application:
```
docker run --rm -v $(pwd):/app -w /app malicious-url-classifier
```
The --rm flag automatically removes the container after it finishes running to save space.

Run the analysis script:

To inspect the dataset using the cached download:
```
docker run --rm -v $(pwd)/kaggle_cache:/root/.cache/kagglehub malicious-url-classifier python -m src.analyze
```

Plot analysis graphs:

To create plots of the dataset using the container:

```
docker run -v $(pwd)/outputs:/app/outputs malicious-url-classifier python -m src.analyze
```

Run mock test

To check that the expected printed output and plots are produced:
```
docker run --rm -v $(pwd):/app -w /app malicious-url-classifier python -m src.mock_test
```

Notes

The script uses kagglehub to download the dataset. If the dataset requires authentication, you may need to pass your Kaggle credentials as environment variables:
```
docker run --rm -e KAGGLE_USERNAME=your_username -e KAGGLE_KEY=your_key malicious-url-classifier
```
The dataset is downloaded inside the container. To persist the dataset between runs (to avoid re-downloading), you can mount a volume:
```
docker run --rm -v $(pwd)/kaggle_cache:/root/.cache/kagglehub malicious-url-classifier
```

Running Locally

To run the analysis script locally without Docker, execute the following from the project root:

```
pip install -r requirements.txt
python3 -m src.analyze
```

Docker cleaning commands

Docker has built-in commands meant for housekeeping tasks:

docker image prune: delete all dangling images (as in without an assigned tag)
docker image prune -a: delete all images not used by any container
docker system prune: delete stopped containers, unused networks and dangling images + dangling build cache
docker system prune -a: delete stopped containers, unused networks, images not used by any container + all build cache
