RenatoMignone/network-flow-classification-dl

Laboratory 1 for the AIC Course


This repository contains all the materials, scripts, figures and the formal write-up for Laboratory 1 of the AI & Cybersecurity course. The laboratory focuses on building and evaluating Feed Forward Neural Networks (FFNNs) on a reduced subset of the CICIDS2017 dataset. Activities cover dataset exploration and cleaning, baseline and deep neural models, treatment of class imbalance, analysis of feature biases, and regularization experiments.

Overview

The goal of Laboratory 1 is to teach the full modelling pipeline for a network-flow classification task using FFNNs. Key deliverables and activities performed in the lab are:

  • Exploratory Data Analysis (EDA) and preprocessing: removing missing values and duplicates, understanding feature distributions and outliers, choosing normalization schemes, and creating reproducible data splits (60% train / 20% validation / 20% test).
  • Baseline models (shallow FFNNs): train single-hidden-layer networks with varying neuron counts and activations to establish baselines and visualize loss curves and classification reports.
  • Class imbalance and loss experiments: quantify imbalance, compute class weights (scikit-learn compute_class_weight) and use weighted CrossEntropyLoss to improve rare-class detection.
  • Feature-bias analysis: test how the Destination Port feature influences results (e.g., BruteForce attacks concentrated on port 80) and measure the model’s sensitivity by modifying test data and by removing the port feature.
  • Deep networks and hyperparameter sweeps: explore deeper architectures (3–6 layers), batch-size effects, optimizer comparisons (SGD, SGD+momentum, AdamW), and learning rate tuning.
  • Regularization: test dropout, batch normalization and weight decay (AdamW) to reduce overfitting on larger models.
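The reproducible 60/20/20 split mentioned above can be sketched with a two-step `train_test_split`. This is a minimal illustration, not the lab's exact code: the column names, toy data, and seed value are assumptions.

```python
# Hypothetical sketch of a stratified 60/20/20 split; "Label" and the
# seed are illustrative, not necessarily what the notebooks use.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feat": range(100), "Label": [i % 2 for i in range(100)]})

# first peel off 20% as the held-out test set...
train_val, test = train_test_split(df, test_size=0.20, random_state=0,
                                   stratify=df["Label"])
# ...then 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20 overall)
train, val = train_test_split(train_val, test_size=0.25, random_state=0,
                              stratify=train_val["Label"])
```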

Detailed experiments, plots and numeric outputs are contained in the Jupyter notebooks inside lab/notebooks/ and exported images are saved under lab/Plots/.
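The class-weighting step listed above combines scikit-learn's `compute_class_weight` with a weighted `CrossEntropyLoss`. The sketch below uses made-up toy labels and logits purely to show the wiring; it is not the notebooks' code.

```python
# Hypothetical sketch: "balanced" class weights feeding a weighted loss.
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])  # imbalanced toy labels

# "balanced" weights: n_samples / (n_classes * bincount(y))
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train),
                               y=y_train)
criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))

logits = torch.randn(4, 3)          # stand-in model outputs (batch of 4, 3 classes)
targets = torch.tensor([0, 1, 2, 0])
loss = criterion(logits, targets)   # rare classes now contribute more to the loss
```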

Repository Structure

Laboratory1/
├── lab/            # Data, notebooks and support material (see lab/README.md for a focused summary)
│   ├── notebooks/  # Jupyter notebooks: First_Part.ipynb, Second_Part.ipynb, Third_Part.ipynb
│   ├── data/       # dataset_lab_1.csv (subset of CICIDS2017 used in the lab)
│   └── Plots/      # Exported figures used in the report
├── report/         # LaTeX source and compiled report (main.tex + figures/tables)
├── resources/      # Logos, PDFs and other supporting files
└── README.md       # This file (overview, reproduction steps and links)

Note

The formal lab report and compiled PDF are available in report/ (see report/main.tex). For a runnable summary of the experiments and step-by-step code, open lab/notebooks/ or read the focused lab/README.md.

Lab Objectives

The main learning objectives for the lab were:

  1. Learn an end-to-end machine learning workflow for tabular network data: EDA, preprocessing, splitting and feature normalization.
  2. Build, train and compare shallow and deep FFNN architectures for multi-class flow classification.
  3. Understand the impact of class imbalance and how to use weighted loss functions to mitigate it.
  4. Analyze how spurious features (e.g., destination port) can introduce shortcuts and bias model predictions.
  5. Learn regularization techniques (dropout, batch normalization, weight decay) to reduce overfitting in deeper models.

Requirements (recommended environment)

We used a standard Python data-science stack; the notebooks run on Python 3.8 or later. Recommended packages (install with pip):

# create and activate virtual environment (zsh)
python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install jupyterlab notebook pandas numpy scikit-learn matplotlib seaborn torch torchvision tqdm

Notes:

  • If you plan to train large models and have an NVIDIA GPU, install the CUDA-enabled PyTorch build for faster training.
  • For reproducibility, set the same random seeds for numpy, torch and sklearn; the notebooks include seed-setting cells.
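A typical seed-setting cell looks like the following. The notebooks include their own; the value 42 here is illustrative, not necessarily the seed used in the lab.

```python
# One possible seed-setting cell. Seeding NumPy also covers scikit-learn,
# which draws its randomness from NumPy unless given explicit random_state.
import random
import numpy as np
import torch

SEED = 42  # illustrative value
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)                  # CPU RNG
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)     # all GPU RNGs, if a GPU is present
```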

Experiment procedure (how the notebooks map to the lab tasks)

Run the notebooks in order to reproduce the workflow. High-level mapping between tasks and notebooks:

  • lab/notebooks/First_Part.ipynb — Task 1: Data preprocessing and EDA

    • Load lab/data/dataset_lab_1.csv and inspect column types and missing values.
    • Remove NaNs and duplicate rows, report counts before/after cleaning.
    • Visualize feature distributions and justify normalization choices.
    • Create train/validation/test splits (60/20/20) with fixed seeds and save split metadata.
  • lab/notebooks/Second_Part.ipynb — Tasks 2–4: Shallow FFNNs, activation experiments, feature-bias and weighted loss

    • Train three single-hidden-layer FFNNs (32, 64, 128 neurons) under the given hyperparameter configuration: batch size 64, AdamW, lr=5e-4, CrossEntropyLoss, early stopping.
    • Plot training/validation loss curves and produce classification reports for validation and test sets.
    • Compare Linear vs ReLU activations, choose best model by validation metrics, and evaluate test generalization.
    • Perform targeted feature-bias experiments: replace destination port 80 with 8080 in the test set and measure the impact; then drop the port feature and re-run preprocessing and models to evaluate robustness.
    • Experiment with weighted CrossEntropyLoss (class weights estimated on the training partition using sklearn's compute_class_weight).
  • lab/notebooks/Third_Part.ipynb — Tasks 5–6: Deep FFNNs, optimizer/batch-size studies and regularization

    • Design and train deeper FFNNs (3–5 layers) with several neuron configurations and identify the best architecture using validation performance.
    • Study the effect of batch sizes {4, 64, 256, 1024} on convergence and training time.
    • Compare optimizers (SGD, SGD with momentum 0.1/0.5/0.9, AdamW) and analyze loss trends and timings.
    • Apply regularization: dropout, batch normalization, and AdamW weight decay; measure their effect on overfitting and validation/test performance.

Each notebook contains code cells that produce the exact training loops, metric computation and figures used in the report.
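The shallow-FFNN setup described for Second_Part.ipynb (single hidden layer, batch size 64, AdamW at lr=5e-4, CrossEntropyLoss, early stopping) can be sketched as follows. The feature count, class count, random data, and patience value are illustrative assumptions, not the lab's actual values.

```python
# Minimal sketch of the shallow-FFNN training loop under the stated
# hyperparameters; data shapes and patience are made up for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

N_FEATURES, N_CLASSES, HIDDEN = 20, 5, 64  # assumed dimensions

model = nn.Sequential(
    nn.Linear(N_FEATURES, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, N_CLASSES),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
criterion = nn.CrossEntropyLoss()

# toy random data standing in for the preprocessed flows
X = torch.randn(512, N_FEATURES)
y = torch.randint(0, N_CLASSES, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(50):
    model.train()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    # the lab evaluates on its validation split; we reuse X, y as a stand-in
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X), y).item()
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # simple early stopping
            break
```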

Where to find results

  • Notebooks: lab/notebooks/ — run them to reproduce results and regenerate figures.
  • Dataset: lab/data/dataset_lab_1.csv — the CSV used for all experiments.
  • Plots: lab/Plots/ — exported PNGs for loss curves, confusion matrices and other figures included in the LaTeX report.
  • Report: report/main.tex and compiled outputs in report/.

Reproducing the experiments (quick guide)

  1. Set up the virtual environment and install dependencies (see Requirements section above).
  2. Open Jupyter Lab/Notebook:

jupyter lab
# or
jupyter notebook

  3. Run the notebooks in order:
    • First_Part.ipynb — run all cells to produce cleaned data and inspect preprocessing outputs.
    • Second_Part.ipynb — run model training and shallow-net experiments.
    • Third_Part.ipynb — run deep-net and regularization experiments.

Tips:

  • For quicker iteration, reduce the number of epochs or skip the long training cells; each notebook exposes these parameters near its top for easy modification.
  • To reproduce the exact results, run each notebook start to finish without modifications, keeping the seed values already set in the notebooks.

Key findings (short summary)

  • Cleaning the dataset (NaNs and duplicates) changed sample counts and can disproportionately affect certain classes (PortScan in our experiments).
  • Shallow FFNNs provide robust baselines; activation functions and neuron counts affect convergence and class-wise performance.
  • Class-weighted loss helps detect rare classes better; the trade-off between overall accuracy and rare-class recall is discussed in the notebooks.
  • Feature shortcuts (e.g., destination port) can bias models; replacing or removing the port feature revealed how much the model relied on it.
  • Deeper architectures increase expressivity but require regularization to avoid overfitting; dropout, batch normalization and AdamW weight decay were effective to varying degrees.
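The regularization combination named in the last point can be sketched as a small PyTorch model. Layer widths, the dropout rate, and the weight-decay value below are illustrative assumptions, not the configurations compared in the lab.

```python
# Sketch of a regularized deep FFNN: dropout + batch normalization in the
# model, weight decay via AdamW. All hyperparameter values are illustrative.
import torch
import torch.nn as nn

def regularized_ffnn(n_features: int, n_classes: int, p_drop: float = 0.3) -> nn.Module:
    return nn.Sequential(
        nn.Linear(n_features, 128),
        nn.BatchNorm1d(128),   # normalizes activations across the batch
        nn.ReLU(),
        nn.Dropout(p_drop),    # randomly zeroes units during training
        nn.Linear(128, 64),
        nn.BatchNorm1d(64),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(64, n_classes),
    )

model = regularized_ffnn(20, 5)
# weight decay is AdamW's built-in decoupled L2-style regularization
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)
```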

Next steps and suggestions

  • Add a requirements.txt or environment.yml with pinned library versions for exact reproducibility.
  • Export model checkpoints and create a small scripts/inference.py to run fast inference on the test set.
  • Add lightweight unit tests for preprocessing steps (e.g., expected counts after NaN/duplicate removal).

Authors

  • Renato Mignone
  • Claudia Sanna
  • Chiara Iorio
