This project presents the design and evaluation of a Machine Learning-based Intrusion Detection System (IDS) built using Python. It was completed as part of the 32130 Fundamentals of Data Analytics course at UTS.
The system classifies network intrusions based on labeled traffic features using various ML models. Pre-processing, model comparison, and evaluation were conducted to determine the most effective approach.
Intrusion Classes:
- `Mirai-greip_flood`
- `Recon-OSScan`
- `DictionaryBruteForce`
- `train_IoT_Intrusion_Detection.csv`: labeled dataset used for training
- `unknowndataset.csv`: unlabeled dataset used for prediction
Each record represents a network traffic snapshot with 48 features including:
- Packet durations
- Flag indicators (HTTP, DNS, TCP)
- Binary protocol usage
- Header lengths and rates
🧪 Missing or constant columns such as `Drate` and `Unnamed` were excluded based on EDA.
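One way to surface such columns during EDA is to count the distinct non-missing values per column. The snippet below is a minimal sketch on a synthetic frame, not the notebook's actual code; the helper `find_uninformative_columns` and the columns `flow_duration` and `empty_col` are hypothetical names used only for illustration.

```python
import pandas as pd


def find_uninformative_columns(df):
    """Return columns that are entirely missing or hold a single distinct value."""
    uninformative = []
    for col in df.columns:
        # nunique(dropna=True) is 0 for all-missing columns and 1 for constant ones
        if df[col].nunique(dropna=True) <= 1:
            uninformative.append(col)
    return uninformative


# Synthetic stand-in for the training data (only "Drate" is a real column name)
demo = pd.DataFrame({
    "flow_duration": [0.1, 2.3, 0.7],
    "Drate": [0.0, 0.0, 0.0],
    "empty_col": [None, None, None],
})

uninformative = find_uninformative_columns(demo)
print(uninformative)
```

Index-like columns such as `Unnamed` are neither constant nor missing, so they would still need to be dropped by name.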
- Language: Python
- Libraries: `pandas`, `numpy`, `matplotlib`, `scikit-learn`
- Notebook: `intrusion_detection_system-2.ipynb`
Preprocessing involved the following steps:
- Feature Standardization: all numerical features were scaled to have zero mean and unit variance.
- Label Encoding:
  - `Mirai-greip_flood` → 0
  - `Recon-OSScan` → 1
  - `DictionaryBruteForce` → 2
- Dropped Columns:
  - `Drate` (constant = 0)
  - `Unnamed` (irrelevant index)
  - `Label` (during scaling)
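The preprocessing steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions rather than the notebook's code: the `preprocess` helper, the feature columns `flow_duration` and `Header_Length`, and the `Unnamed: 0` spelling of the index column are assumptions, and the demo frame is synthetic.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Class-to-integer mapping as listed in the README
LABEL_MAP = {"Mirai-greip_flood": 0, "Recon-OSScan": 1, "DictionaryBruteForce": 2}


def preprocess(df):
    """Drop unused columns, encode the label, and standardize the features."""
    # Assumption: the index column appears as "Unnamed: 0", as pandas names it on re-import
    to_drop = [c for c in df.columns if c == "Drate" or c.startswith("Unnamed")]
    df = df.drop(columns=to_drop)

    # The label is separated before scaling so the target is never standardized
    y = df["Label"].map(LABEL_MAP)
    X = df.drop(columns=["Label"])

    scaler = StandardScaler()  # zero mean, unit variance per feature
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
    return X_scaled, y, scaler


# Synthetic demo frame with hypothetical feature columns
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "flow_duration": [0.5, 1.5, 2.5, 3.5],
    "Header_Length": [20.0, 40.0, 60.0, 80.0],
    "Drate": [0.0, 0.0, 0.0, 0.0],
    "Label": ["Mirai-greip_flood", "Recon-OSScan", "DictionaryBruteForce", "Mirai-greip_flood"],
})

X_scaled, y, scaler = preprocess(demo)
```

Returning the fitted `scaler` lets the same transformation be applied to the unlabeled dataset with `scaler.transform(...)` instead of refitting on it.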
The following classifiers were evaluated:
| Model | Description |
|---|---|
| Random Forest | Ensemble of decision trees |
| K-Nearest Neighbors | Based on Euclidean distance |
| Logistic Regression | Linear classification algorithm |
| Support Vector Machine | Margin-based classifier |
| Multi-Layer Perceptron | Feed-forward neural network |
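For reference, the five classifiers in the table map onto scikit-learn estimators roughly as shown below. The hyperparameters (`n_estimators`, `n_neighbors`, `hidden_layer_sizes`, `max_iter`, `random_state`) and the `build_models` helper are illustrative assumptions; the README does not state the settings used in the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def build_models(random_state=42):
    """Return one estimator per model in the table (assumed hyperparameters)."""
    return {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=random_state),
        "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        # probability=True enables predict_proba, which ROC curves need
        "Support Vector Machine": SVC(probability=True, random_state=random_state),
        "Multi-Layer Perceptron": MLPClassifier(
            hidden_layer_sizes=(64,), max_iter=500, random_state=random_state
        ),
    }


models = build_models()
```

Keeping the estimators in a dictionary makes it straightforward to loop over them with identical training and scoring code.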
We evaluated both unprocessed and pre-processed datasets on the following metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC Curves
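The metrics listed above can be computed with `sklearn.metrics` along the following lines. This is a sketch only: macro averaging and one-vs-rest ROC-AUC are assumptions about how the three-class scores might be aggregated, the `evaluate` helper is hypothetical, and the arrays are hand-made rather than model output.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_pred, y_proba):
    """Compute the five README metrics for a multi-class classifier."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Macro averaging weights the three classes equally (an assumption)
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        # One-vs-rest AUC needs class probabilities whose rows sum to 1
        "roc_auc": roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"),
    }


# Hand-made example: six samples, one class-1 sample mislabeled as class 2
y_true = np.array([0, 0, 1, 1, 2, 2])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
])
y_pred = y_proba.argmax(axis=1)

scores = evaluate(y_true, y_pred, y_proba)
```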
| Model | Unprocessed | Preprocessed |
|---|---|---|
| Random Forest | 99.42% | 99.73% |
| Logistic Regression | 97.92% | 99.37% |
| KNN | 98.1% | 99.45% |
| SVM | 98.5% | 99.37% |
| MLP Classifier | 98.0% | 99.37% |
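A comparison of this kind can be set up by training the same estimator once on raw features and once behind a scaling pipeline. The sketch below uses synthetic data from `make_classification`, so its scores are unrelated to the figures in the table; the dataset shape, the split, and the choice of Logistic Regression are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the intrusion dataset
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X[:, 0] *= 1000.0  # exaggerate one feature's scale so standardization matters

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "unprocessed": LogisticRegression(max_iter=2000),
    # The pipeline fits the scaler on the training split only, avoiding leakage
    "preprocessed": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

print(results)
```

Wrapping the scaler and the classifier in one pipeline keeps the two variants directly comparable, since both see exactly the same train and test rows.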
- Random Forest consistently outperformed the other models on both the raw and the preprocessed datasets.
- Preprocessing improved performance across all metrics; in the table above every model's score rises, with the largest gain for Logistic Regression (97.92% to 99.37%).
- Surprisingly, predictions from the raw dataset scored higher on the unknown test set, which suggests that preserving outliers has real-world value.
- ✅ Preprocessing (especially outlier removal and standardization) boosts accuracy
- ✅ Random Forest is highly robust for intrusion detection
- ✅ Real-world testing must balance clean data with natural variability
- ✅ Visual analytics like ROC curves help in interpreting model quality
📂 intrusion-detection-ml/
├── intrusion_detection_system-2.ipynb # Jupyter Notebook with code
├── fda_a3_25203896.pdf # Detailed report
├── images/
│ ├── outlier_duration.png
│ ├── code_flowchart.png
│ ├── roc_unprocessed.png
│ └── roc_preprocessed.png
└── README.md
- Expand attack classes and retrain with new datasets
- Integrate with a live packet sniffer (e.g., Wireshark + Scapy)
- Explore deep learning and real-time alerting mechanisms



