# CPU vs GPU Performance Analysis

A benchmarking study comparing CPU and GPU performance for training a Feed-Forward Neural Network (FNN) on the MNIST handwritten-digit classification dataset. Six gradient-descent optimisation algorithms are implemented from scratch, each with a CPU-only and a CUDA GPU-accelerated version, and evaluated across two network sizes.


## Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce GTX 1660 Ti |
| CPU | AMD Ryzen 7 |

## Dataset

| Split | Samples | Input Shape |
|---|---|---|
| Train | 60 000 | 784 (28×28 flattened, normalised) |
| Test | 10 000 | 784 |

Pre-processing: zero-mean, unit-variance normalisation using training-set statistics.
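The pre-processing step above can be sketched as follows. This is a minimal illustration; the function and variable names are not taken from `GD_optimisers.py`:

```python
import numpy as np

def normalise(x_train, x_test, eps=1e-8):
    """Zero-mean, unit-variance standardisation using training-set statistics.

    The test split reuses the training mean and std, so no test
    information leaks into the pre-processing.
    """
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0)
    x_train_n = (x_train - mu) / (sigma + eps)
    x_test_n = (x_test - mu) / (sigma + eps)  # training statistics reused
    return x_train_n, x_test_n
```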


## Model Architecture

Feed-Forward Neural Network (FNN)

| Hyperparameter | Value |
|---|---|
| Hidden layers | 2 |
| Hidden-layer activation | ReLU |
| Output activation | Softmax |
| Output classes (K) | 10 |
| Learning rate (η) | 0.001 |
| Epochs | 1 |
| Weight initialisation | Normal(0, 1/fan-in) |
| Bias initialisation | Zero |

Two hidden-layer widths are compared: N = 8 and N = 64 neurons per hidden layer.
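A minimal NumPy sketch of the architecture described above. It assumes "Normal(0, 1/fan-in)" denotes a variance of 1/fan-in; all function names here are illustrative, not the ones used in `GD_optimisers.py`:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def init_layer(fan_in, fan_out, rng):
    # Normal(0, 1/fan-in) weights (read as variance), zero biases
    w = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return w, b

def build_fnn(n_hidden, rng, n_in=784, n_out=10):
    # Two hidden layers of n_hidden neurons, softmax output over 10 classes
    return [init_layer(n_in, n_hidden, rng),
            init_layer(n_hidden, n_hidden, rng),
            init_layer(n_hidden, n_out, rng)]

def forward(x, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(x @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return softmax(h2 @ w3 + b3)
```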


## Implemented Optimisation Algorithms

Each algorithm is implemented twice: a pure-NumPy CPU version and a PyCUDA GPU version with custom CUDA C kernels.

| # | Algorithm | Update Rule Summary | Batch Strategy |
|---|---|---|---|
| 1 | Stochastic GD | w ← w − η·∇w | Online (1 sample) |
| 2 | Momentum GD | u ← γu + η·∇w ; w ← w − u | Mini-batch (32/64) |
| 3 | Nesterov GD | Look-ahead: w̃ = w − γu ; gradient evaluated at w̃ | Mini-batch (32/64) |
| 4 | RMSProp | v ← βv + (1−β)(∇w)² ; w ← w − (η/√(v+ε))·∇w | Mini-batch (64) |
| 5 | Adam | Bias-corrected 1st & 2nd moment estimates | Mini-batch (64) |
| 6 | Nadam | Adam + Nesterov look-ahead momentum | Mini-batch (64) |

### Hyperparameters per Algorithm

| Algorithm | γ (momentum) | β / β₁ | β₂ | ε |
|---|---|---|---|---|
| Momentum | 0.9 | | | |
| Nesterov | 0.9 | | | |
| RMSProp | | 0.98 | | 1e-10 |
| Adam | | 0.9 | 0.999 | 1e-10 |
| Nadam | | 0.9 | 0.999 | 1e-10 |
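As a worked example of the update rules above, here is a hedged NumPy sketch of a single Adam step with the table's hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 1e-10). `adam_step` is an illustrative name, not the project's actual API:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-10):
    """One Adam update with bias-corrected moment estimates (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # 1st-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # 2nd-moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction for warm-up
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Nadam differs only in applying the Nesterov look-ahead to the first-moment term before the weight update.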

## GPU Implementation (PyCUDA)

Custom CUDA kernels handle the performance-critical operations:

| Kernel | Purpose |
|---|---|
| compute_fw | Weighted sum + bias (forward pass) |
| relu_ac | ReLU activation |
| grad_mul | Outer-product weight gradient |
| grad_relu | ReLU derivative |
| ele_grad | Element-wise gradient multiplication |
| grad_wt | Back-propagate weight gradient |
| update | SGD weight update |
| add / reset | Gradient accumulation / reset |
| lookahead_c | Nesterov look-ahead computation |
| update_1 | Momentum update |
| update_2 | RMSProp / Adam second-moment update |
| update_3 | RMSProp weight update |
| update_4 | Adam first-moment update |
| update_5 | Adam weight update |
| update_6 | Nadam weight update |

Block size: 32 threads; grid size computed as ⌈dim/32⌉.
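The launch-geometry rule above (blocks of 32 threads, ⌈dim/32⌉ blocks) can be sketched as follows. The commented launch call mirrors the typical PyCUDA pattern but is illustrative; the actual kernel names and launch code live in `GD_optimisers.py`:

```python
BLOCK = 32  # threads per block, as stated above

def grid_size(dim, block=BLOCK):
    """Blocks needed to cover `dim` elements: integer ceil(dim / block)."""
    return (dim + block - 1) // block

# With a compiled PyCUDA SourceModule, a kernel such as `update` would be
# launched roughly as (sketch only, requires a CUDA-capable GPU):
#   update = mod.get_function("update")
#   update(w_gpu, grad_gpu, np.float32(eta), np.int32(dim),
#          block=(BLOCK, 1, 1), grid=(grid_size(dim), 1))
```

Each thread then guards against out-of-range indices, since the last block may be only partially filled.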


## Results

### N = 8 Hidden-Layer Neurons

| Algorithm | GPU Accuracy (%) | GPU Time (s) | CPU Accuracy (%) | CPU Time (s) | Speedup (CPU/GPU) |
|---|---|---|---|---|---|
| Stochastic GD | 89.12 | 119.75 | 89.69 | 58.99 | 0.49× |
| Momentum GD | 86.13 | 119.71 | 86.70 | 63.81 | 0.53× |
| Nesterov GD | 90.02 | 135.48 | 86.12 | 95.68 | 0.71× |
| RMSProp | 80.97 | 120.49 | 83.45 | 63.64 | 0.53× |
| Adam | 87.37 | 120.11 | 84.45 | 64.73 | 0.54× |
| Nadam | 89.52 | 120.55 | 88.46 | 65.39 | 0.54× |

**Key observation (N=8):** the CPU is faster than the GPU for every algorithm. The network is too small to amortise GPU kernel-launch and data-transfer overhead.


### N = 64 Hidden-Layer Neurons

| Algorithm | GPU Accuracy (%) | GPU Time (s) | CPU Accuracy (%) | CPU Time (s) | Speedup (CPU/GPU) |
|---|---|---|---|---|---|
| Stochastic GD | 91.58 | 120.35 | 92.30 | 435.93 | 3.62× |
| Momentum GD | 94.34 | 120.08 | 94.67 | 393.47 | 3.28× |
| Nesterov GD | 95.45 | 136.75 | 95.57 | 727.68 | 5.32× |
| RMSProp | 92.19 | 122.67 | 92.93 | 456.06 | 3.72× |
| Adam | 92.81 | 124.10 | 92.94 | 461.69 | 3.72× |
| Nadam | 93.52 | 121.45 | 93.19 | 451.05 | 3.71× |

**Key observation (N=64):** the GPU delivers a 3–5× speedup over the CPU, with Nesterov GD showing the greatest advantage (~5.3×). Accuracy is comparable between devices, with the CPU slightly edging the GPU in most cases, consistent with float64 (CPU) vs float32 (GPU) numeric precision.


## Summary: GPU Speedup vs Network Size

| Algorithm | Speedup (N=8) | Speedup (N=64) |
|---|---|---|
| Stochastic GD | 0.49× | 3.62× |
| Momentum GD | 0.53× | 3.28× |
| Nesterov GD | 0.71× | 5.32× |
| RMSProp | 0.53× | 3.72× |
| Adam | 0.54× | 3.72× |
| Nadam | 0.54× | 3.71× |

## Training Loss Curves

Loss curves (GPU vs CPU) and a runtime comparison bar chart are saved in the `results/` folder.

### N = 8 Neurons

Loss curves for Stochastic GD, Momentum GD, Nesterov GD, RMSProp, Adam and Nadam (`results/N8/Figure_1.png` to `Figure_6.png`), plus the runtime comparison chart (`results/N8/rt.png`).

### N = 64 Neurons

Loss curves for the same six optimisers (`results/N64/Figure_1.png` to `Figure_6.png`), plus the runtime comparison chart (`results/N64/rt.png`).

## Key Findings

  1. GPU advantage scales with model size. For N=8, GPU is slower than CPU due to kernel-launch and PCIe-transfer overhead. At N=64, GPU achieves a 3–5× speedup.
  2. Nesterov GD benefits most from GPU acceleration (~5.3× at N=64), because its look-ahead step doubles the number of forward/backward passes per batch update.
  3. Accuracy is nearly identical between GPU and CPU implementations, with CPU marginally higher in most cases — consistent with float32 (GPU) vs float64 (CPU) precision differences.
  4. Best accuracy at N=64 is achieved by Nesterov GD (GPU: 95.45%, CPU: 95.57%).
  5. GPU run times are highly consistent (~120–137 s) across all algorithms and both N values, confirming that GPU execution time is dominated by kernel overhead rather than compute. For N=64, the CPU time variance is large (394–728 s), showing strong sensitivity to algorithmic complexity.
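Finding 2 hinges on the look-ahead step: the gradient is evaluated at the shifted point w − γu rather than at w, which in this project's implementation requires an extra forward/backward evaluation per batch. A hedged NumPy sketch, where `grad_fn` is an illustrative stand-in for the network's backward pass:

```python
def nesterov_step(w, u, grad_fn, eta=0.001, gamma=0.9):
    """One Nesterov update: gradient taken at the look-ahead point."""
    w_look = w - gamma * u   # look-ahead point (extra pass evaluated here)
    g = grad_fn(w_look)      # gradient at w_look, not at w
    u = gamma * u + eta * g  # velocity update
    return w - u, u
```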

## Project Structure

```text
CPU-vs-GPU-Performance-Analysis/
│
├── GD_optimisers.py          # Main script: all algorithms, training, evaluation & plots
│
└── results/
    ├── 8.txt                 # Benchmark output — N=8 neurons
    ├── 64.txt                # Benchmark output — N=64 neurons
    ├── N8/                   # Training loss & runtime plots for N=8
    │   ├── Figure_1.png      # Stochastic GD loss curve
    │   ├── Figure_2.png      # Momentum GD loss curve
    │   ├── Figure_3.png      # Nesterov GD loss curve
    │   ├── Figure_4.png      # RMSProp loss curve
    │   ├── Figure_5.png      # Adam loss curve
    │   ├── Figure_6.png      # Nadam loss curve
    │   └── rt.png            # Runtime comparison bar chart
    └── N64/                  # Training loss & runtime plots for N=64
        ├── Figure_1.png
        ├── Figure_2.png
        ├── Figure_3.png
        ├── Figure_4.png
        ├── Figure_5.png
        ├── Figure_6.png
        └── rt.png
```

## Dependencies

| Package | Purpose |
|---|---|
| numpy | Array operations (CPU path) |
| matplotlib | Loss & runtime plots |
| scikit-learn | Dataset shuffling |
| keras / tensorflow | MNIST dataset loader |
| pycuda | CUDA Python bindings |
| CUDA Toolkit | CUDA C kernel compilation (SourceModule) |

Install Python dependencies:

```bash
pip install numpy matplotlib scikit-learn tensorflow pycuda
```

**Note:** PyCUDA requires a working NVIDIA CUDA Toolkit installation. Ensure `nvcc` is on your `PATH`.


## Running the Script

```bash
python GD_optimisers.py
```

The script will:

  1. Load and pre-process the MNIST dataset.
  2. Sequentially run all six optimisers (GPU first, then CPU) for the configured network size.
  3. Print accuracy and runtime for each run.
  4. Display training-loss curves (GPU vs CPU) and a final runtime comparison chart.

To switch between N=8 and N=64, change line 1116 in `GD_optimisers.py`:

```python
N = 8   # or 64
```
