A benchmarking study comparing CPU and GPU performance for training a
Feed-Forward Neural Network (FNN) on the MNIST handwritten digit
classification dataset. Six gradient-descent optimisation algorithms are
implemented from scratch, each with a CPU-only and a CUDA GPU-accelerated
version, and evaluated across two network sizes.
## Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce GTX 1660 Ti |
| CPU | AMD Ryzen 7 |
## Dataset

| Split | Samples | Input Shape |
|---|---|---|
| Train | 60 000 | 784 (28×28 flattened, normalised) |
| Test | 10 000 | 784 |

Pre-processing: zero-mean, unit-variance normalisation using training-set
statistics.
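As a sketch, the pre-processing step can be written in NumPy as follows (the function and array names here are illustrative, not taken from `GD_optimisers.py`):

```python
import numpy as np

def normalise(x_train, x_test):
    # Statistics come from the training split only and are applied to both
    # splits, so the test set sees the same transform as training data.
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0) + 1e-8  # guard against constant pixels
    return (x_train - mu) / sigma, (x_test - mu) / sigma

# Toy stand-in for the flattened 784-dimensional MNIST vectors
x_train = np.random.rand(100, 784)
x_test = np.random.rand(20, 784)
x_train_n, x_test_n = normalise(x_train, x_test)
```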
## Model Architecture

Feed-Forward Neural Network (FNN)

| Hyperparameter | Value |
|---|---|
| Hidden layers | 2 |
| Hidden-layer activation | ReLU |
| Output activation | Softmax |
| Output classes (K) | 10 |
| Learning rate (η) | 0.001 |
| Epochs | 1 |
| Weight initialisation | Normal(0, 1/fan-in) |
| Bias initialisation | Zero |

Two hidden-layer widths are compared: N = 8 and N = 64 neurons per
hidden layer.
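The stated initialisation scheme (weights from Normal(0, 1/fan-in), i.e. standard deviation 1/√fan-in, and zero biases) can be sketched as follows; the function name and layer-size list are illustrative, using the N = 64 configuration:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng):
    # Normal(0, 1/fan_in): variance 1/fan_in, so std = sqrt(1/fan_in)
    w = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
    b = np.zeros(fan_out)  # biases initialised to zero
    return w, b

rng = np.random.default_rng(0)
sizes = [784, 64, 64, 10]  # input, two hidden layers, softmax output
params = [init_layer(fan_in, fan_out, rng)
          for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]
```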
## Implemented Optimisation Algorithms

Each algorithm is implemented twice: a pure-NumPy CPU version and a PyCUDA
GPU version with custom CUDA C kernels.

| # | Algorithm | Update Rule Summary | Batch Strategy |
|---|---|---|---|
| 1 | Stochastic GD | w ← w − η·∇w | Online (1 sample) |
| 2 | Momentum GD | u ← γu + η·∇w ; w ← w − u | Mini-batch (32/64) |
| 3 | Nesterov GD | Look-ahead: w̃ = w − γu ; grad at w̃ | Mini-batch (32/64) |
| 4 | RMSProp | v ← βv + (1−β)·∇w² ; w ← w − (η/√(v+ε))·∇w | Mini-batch (64) |
| 5 | Adam | Bias-corrected 1st & 2nd moment estimates | Mini-batch (64) |
| 6 | Nadam | Adam + Nesterov look-ahead momentum | Mini-batch (64) |
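As a concrete reference for rules 1–3, a minimal NumPy sketch (the function names, toy objective, and gradient callback are illustrative, not the repository's code):

```python
import numpy as np

eta, gamma = 0.001, 0.9  # learning rate and momentum from the tables

def sgd_step(w, grad):
    # Rule 1: w <- w - eta * grad
    return w - eta * grad

def momentum_step(w, u, grad_fn):
    # Rule 2: u <- gamma*u + eta*grad ; w <- w - u
    u = gamma * u + eta * grad_fn(w)
    return w - u, u

def nesterov_step(w, u, grad_fn):
    # Rule 3: evaluate the gradient at the look-ahead point w - gamma*u
    u = gamma * u + eta * grad_fn(w - gamma * u)
    return w - u, u

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w
w, u = np.ones(4), np.zeros(4)
for _ in range(1000):
    w, u = nesterov_step(w, u, lambda w: 2 * w)
```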
### Hyperparameters per Algorithm

| Algorithm | γ (momentum) | β / β₁ | β₂ | ε |
|---|---|---|---|---|
| Momentum | 0.9 | — | — | — |
| Nesterov | 0.9 | — | — | — |
| RMSProp | — | 0.98 | — | 1e-10 |
| Adam | — | 0.9 | 0.999 | 1e-10 |
| Nadam | — | 0.9 | 0.999 | 1e-10 |
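The Adam row can be made concrete with a minimal sketch using the η, β₁, β₂, and ε values listed above (the function name and toy objective are illustrative, not the repository's exact code):

```python
import numpy as np

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-10

def adam_step(w, m, v, grad, t):
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias-corrected first- and second-moment estimates
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimise f(w) = ||w||^2 (gradient 2w)
w = np.ones(4)
m = v = np.zeros(4)
for t in range(1, 5001):
    w, m, v = adam_step(w, m, v, 2 * w, t)
```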
## GPU Implementation (PyCUDA)

Custom CUDA kernels handle the performance-critical operations:

| Kernel | Purpose |
|---|---|
| compute_fw | Weighted sum + bias (forward pass) |
| relu_ac | ReLU activation |
| grad_mul | Outer-product weight gradient |
| grad_relu | ReLU derivative |
| ele_grad | Element-wise gradient multiplication |
| grad_wt | Back-propagate weight gradient |
| update | SGD weight update |
| add / reset | Gradient accumulation / reset |
| lookahead_c | Nesterov look-ahead computation |
| update_1 | Momentum update |
| update_2 | RMSProp / Adam second-moment update |
| update_3 | RMSProp weight update |
| update_4 | Adam first-moment update |
| update_5 | Adam weight update |
| update_6 | Nadam weight update |

Block size: 32 threads; grid size computed as ⌈dim/32⌉.
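The launch configuration amounts to a one-line helper; a sketch (the helper name is hypothetical, ⌈dim/32⌉ computed via integer arithmetic):

```python
BLOCK = 32  # threads per block, as stated above

def grid_size(dim: int) -> int:
    # Ceiling division: enough blocks of BLOCK threads to cover dim elements
    return (dim + BLOCK - 1) // BLOCK
```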
## Results

### N = 8 Hidden-Layer Neurons

| Algorithm | GPU Accuracy (%) | GPU Time (s) | CPU Accuracy (%) | CPU Time (s) | Speedup (CPU/GPU) |
|---|---|---|---|---|---|
| Stochastic GD | 89.12 | 119.75 | 89.69 | 58.99 | 0.49× |
| Momentum GD | 86.13 | 119.71 | 86.70 | 63.81 | 0.53× |
| Nesterov GD | 90.02 | 135.48 | 86.12 | 95.68 | 0.71× |
| RMSProp | 80.97 | 120.49 | 83.45 | 63.64 | 0.53× |
| Adam | 87.37 | 120.11 | 84.45 | 64.73 | 0.54× |
| Nadam | 89.52 | 120.55 | 88.46 | 65.39 | 0.54× |
**Key observation (N=8):** the CPU is faster than the GPU for every
algorithm. The network is too small to amortise GPU kernel-launch and
data-transfer overhead.
### N = 64 Hidden-Layer Neurons

| Algorithm | GPU Accuracy (%) | GPU Time (s) | CPU Accuracy (%) | CPU Time (s) | Speedup (CPU/GPU) |
|---|---|---|---|---|---|
| Stochastic GD | 91.58 | 120.35 | 92.30 | 435.93 | 3.62× |
| Momentum GD | 94.34 | 120.08 | 94.67 | 393.47 | 3.28× |
| Nesterov GD | 95.45 | 136.75 | 95.57 | 727.68 | 5.32× |
| RMSProp | 92.19 | 122.67 | 92.93 | 456.06 | 3.72× |
| Adam | 92.81 | 124.10 | 92.94 | 461.69 | 3.72× |
| Nadam | 93.52 | 121.45 | 93.19 | 451.05 | 3.71× |
**Key observation (N=64):** the GPU delivers a 3–5× speedup over the CPU,
with Nesterov GD showing the greatest advantage (~5.3×). Accuracy is
comparable between devices, with the CPU slightly edging the GPU in most
cases, consistent with the float64 (CPU) vs float32 (GPU) precision
difference.
## Summary: GPU Speedup vs Network Size

| Algorithm | Speedup (N=8) | Speedup (N=64) |
|---|---|---|
| Stochastic GD | 0.49× | 3.62× |
| Momentum GD | 0.53× | 3.28× |
| Nesterov GD | 0.71× | 5.32× |
| RMSProp | 0.53× | 3.72× |
| Adam | 0.54× | 3.72× |
| Nadam | 0.54× | 3.71× |
## Training Loss Curves

Loss curves (GPU vs CPU) and runtime comparison bar charts are saved in the
results/ folder for each network size: one figure per optimiser (Stochastic
GD, Momentum GD, Nesterov GD, RMSProp, Adam, Nadam) plus a runtime
comparison chart, for both N = 8 and N = 64.
## Key Findings

- **GPU advantage scales with model size.** For N=8, the GPU is slower than
  the CPU due to kernel-launch and PCIe-transfer overhead. At N=64, the GPU
  achieves a 3–5× speedup.
- **Nesterov GD benefits most from GPU acceleration** (~5.3× at N=64),
  because its look-ahead step doubles the number of forward/backward passes
  per batch update.
- **Accuracy is nearly identical** between GPU and CPU implementations, with
  the CPU marginally higher in most cases, consistent with the float32 (GPU)
  vs float64 (CPU) precision difference.
- **Best accuracy at N=64** is achieved by Nesterov GD (GPU: 95.45%, CPU:
  95.57%).
- **GPU run times are highly consistent** (~120–137 s) across all algorithms
  and both N values, confirming that GPU execution time is dominated by
  kernel overhead rather than compute. For N=64, the CPU time variance is
  large (394–728 s), showing strong sensitivity to algorithmic complexity.
## Project Structure

```text
CPU-vs-GPU-Performance-Analysis/
│
├── GD_optimisers.py      # Main script: all algorithms, training, evaluation & plots
│
└── results/
    ├── 8.txt             # Benchmark output — N=8 neurons
    ├── 64.txt            # Benchmark output — N=64 neurons
    ├── N8/               # Training loss & runtime plots for N=8
    │   ├── Figure_1.png  # Stochastic GD loss curve
    │   ├── Figure_2.png  # Momentum GD loss curve
    │   ├── Figure_3.png  # Nesterov GD loss curve
    │   ├── Figure_4.png  # RMSProp loss curve
    │   ├── Figure_5.png  # Adam loss curve
    │   ├── Figure_6.png  # Nadam loss curve
    │   └── rt.png        # Runtime comparison bar chart
    └── N64/              # Training loss & runtime plots for N=64
        ├── Figure_1.png
        ├── Figure_2.png
        ├── Figure_3.png
        ├── Figure_4.png
        ├── Figure_5.png
        ├── Figure_6.png
        └── rt.png
```