A benchmarking study comparing CPU and GPU performance for training a
Feed-Forward Neural Network (FNN) on the MNIST handwritten digit
classification dataset. Six gradient-descent optimisation algorithms are
implemented from scratch, each with a CPU-only and a CUDA GPU-accelerated
version, and evaluated across two network sizes.
## Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce GTX 1660 Ti |
| CPU | AMD Ryzen 7 |
## Dataset

| Split | Samples | Input Shape |
|---|---|---|
| Train | 60 000 | 784 (28×28 flattened, normalised) |
| Test | 10 000 | 784 |

Pre-processing: zero-mean, unit-variance normalisation using training-set
statistics.
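As a sketch, the pre-processing step can be written in NumPy as follows (the function and array names here are illustrative, not taken from `GD_optimisers.py`):

```python
import numpy as np

def normalise(x_train, x_test):
    # Statistics come from the training split only and are applied to both
    # splits, so the test set sees the same transform as training data.
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0) + 1e-8  # guard against constant pixels
    return (x_train - mu) / sigma, (x_test - mu) / sigma

# Toy stand-in for the flattened 784-dimensional MNIST vectors
x_train = np.random.rand(100, 784)
x_test = np.random.rand(20, 784)
x_train_n, x_test_n = normalise(x_train, x_test)
```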
## Model Architecture

Feed-Forward Neural Network (FNN)

| Hyperparameter | Value |
|---|---|
| Hidden layers | 2 |
| Hidden-layer activation | ReLU |
| Output activation | Softmax |
| Output classes (K) | 10 |
| Learning rate (η) | 0.001 |
| Epochs | 1 |
| Weight initialisation | Normal(0, 1/fan-in) |
| Bias initialisation | Zero |

Two hidden-layer widths are compared: N = 8 and N = 64 neurons per
hidden layer.
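The stated initialisation scheme (weights from Normal(0, 1/fan-in), i.e. standard deviation 1/√fan-in, and zero biases) can be sketched as follows; the function name and layer-size list are illustrative, using the N = 64 configuration:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng):
    # Normal(0, 1/fan_in): variance 1/fan_in, so std = sqrt(1/fan_in)
    w = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
    b = np.zeros(fan_out)  # biases initialised to zero
    return w, b

rng = np.random.default_rng(0)
sizes = [784, 64, 64, 10]  # input, two hidden layers, softmax output
params = [init_layer(fan_in, fan_out, rng)
          for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]
```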
## Implemented Optimisation Algorithms

Each algorithm is implemented twice: a pure-NumPy CPU version and a PyCUDA
GPU version with custom CUDA C kernels.

| # | Algorithm | Update Rule Summary | Batch Strategy |
|---|---|---|---|
| 1 | Stochastic GD | w ← w − η·∇w | Online (1 sample) |
| 2 | Momentum GD | u ← γu + η·∇w ; w ← w − u | Mini-batch (32/64) |
| 3 | Nesterov GD | Look-ahead: w̃ = w − γu ; grad at w̃ | Mini-batch (32/64) |
| 4 | RMSProp | v ← βv + (1−β)·∇w² ; w ← w − (η/√(v+ε))·∇w | Mini-batch (64) |
| 5 | Adam | Bias-corrected 1st & 2nd moment estimates | Mini-batch (64) |
| 6 | Nadam | Adam + Nesterov look-ahead momentum | Mini-batch (64) |
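As a concrete reference for rules 1–3, a minimal NumPy sketch (the function names, toy objective, and gradient callback are illustrative, not the repository's code):

```python
import numpy as np

eta, gamma = 0.001, 0.9  # learning rate and momentum from the tables

def sgd_step(w, grad):
    # Rule 1: w <- w - eta * grad
    return w - eta * grad

def momentum_step(w, u, grad_fn):
    # Rule 2: u <- gamma*u + eta*grad ; w <- w - u
    u = gamma * u + eta * grad_fn(w)
    return w - u, u

def nesterov_step(w, u, grad_fn):
    # Rule 3: evaluate the gradient at the look-ahead point w - gamma*u
    u = gamma * u + eta * grad_fn(w - gamma * u)
    return w - u, u

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w
w, u = np.ones(4), np.zeros(4)
for _ in range(1000):
    w, u = nesterov_step(w, u, lambda w: 2 * w)
```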
### Hyperparameters per Algorithm

| Algorithm | γ (momentum) | β / β₁ | β₂ | ε |
|---|---|---|---|---|
| Momentum | 0.9 | — | — | — |
| Nesterov | 0.9 | — | — | — |
| RMSProp | — | 0.98 | — | 1e-10 |
| Adam | — | 0.9 | 0.999 | 1e-10 |
| Nadam | — | 0.9 | 0.999 | 1e-10 |
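The Adam row can be made concrete with a minimal sketch using the η, β₁, β₂, and ε values listed above (the function name and toy objective are illustrative, not the repository's exact code):

```python
import numpy as np

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-10

def adam_step(w, m, v, grad, t):
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias-corrected first- and second-moment estimates
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimise f(w) = ||w||^2 (gradient 2w)
w = np.ones(4)
m = v = np.zeros(4)
for t in range(1, 5001):
    w, m, v = adam_step(w, m, v, 2 * w, t)
```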
## GPU Implementation (PyCUDA)

Custom CUDA kernels handle the performance-critical operations:

| Kernel | Purpose |
|---|---|
| compute_fw | Weighted sum + bias (forward pass) |
| relu_ac | ReLU activation |
| grad_mul | Outer-product weight gradient |
| grad_relu | ReLU derivative |
| ele_grad | Element-wise gradient multiplication |
| grad_wt | Back-propagate weight gradient |
| update | SGD weight update |
| add / reset | Gradient accumulation / reset |
| lookahead_c | Nesterov look-ahead computation |
| update_1 | Momentum update |
| update_2 | RMSProp / Adam second-moment update |
| update_3 | RMSProp weight update |
| update_4 | Adam first-moment update |
| update_5 | Adam weight update |
| update_6 | Nadam weight update |

Block size: 32 threads; grid size computed as ⌈dim/32⌉.
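The launch configuration amounts to a one-line helper; a sketch (the helper name is hypothetical, ⌈dim/32⌉ computed via integer arithmetic):

```python
BLOCK = 32  # threads per block, as stated above

def grid_size(dim: int) -> int:
    # Ceiling division: enough blocks of BLOCK threads to cover dim elements
    return (dim + BLOCK - 1) // BLOCK
```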
## Results

### N = 8 Hidden-Layer Neurons

| Algorithm | GPU Accuracy (%) | GPU Time (s) | CPU Accuracy (%) | CPU Time (s) | Speedup (CPU/GPU) |
|---|---|---|---|---|---|
| Stochastic GD | 89.12 | 119.75 | 89.69 | 58.99 | 0.49× |
| Momentum GD | 86.13 | 119.71 | 86.70 | 63.81 | 0.53× |
| Nesterov GD | 90.02 | 135.48 | 86.12 | 95.68 | 0.71× |
| RMSProp | 80.97 | 120.49 | 83.45 | 63.64 | 0.53× |
| Adam | 87.37 | 120.11 | 84.45 | 64.73 | 0.54× |
| Nadam | 89.52 | 120.55 | 88.46 | 65.39 | 0.54× |
**Key observation (N=8):** the CPU is faster than the GPU for every
algorithm. The network is too small to amortise GPU kernel-launch and
data-transfer overhead.
### N = 64 Hidden-Layer Neurons

| Algorithm | GPU Accuracy (%) | GPU Time (s) | CPU Accuracy (%) | CPU Time (s) | Speedup (CPU/GPU) |
|---|---|---|---|---|---|
| Stochastic GD | 91.58 | 120.35 | 92.30 | 435.93 | 3.62× |
| Momentum GD | 94.34 | 120.08 | 94.67 | 393.47 | 3.28× |
| Nesterov GD | 95.45 | 136.75 | 95.57 | 727.68 | 5.32× |
| RMSProp | 92.19 | 122.67 | 92.93 | 456.06 | 3.72× |
| Adam | 92.81 | 124.10 | 92.94 | 461.69 | 3.72× |
| Nadam | 93.52 | 121.45 | 93.19 | 451.05 | 3.71× |
**Key observation (N=64):** the GPU delivers a 3–5× speedup over the CPU,
with Nesterov GD showing the greatest advantage (~5.3×). Accuracy is
comparable between devices, with the CPU slightly edging the GPU in most
cases, consistent with the float64 (CPU) vs float32 (GPU) precision
difference.
## Summary: GPU Speedup vs Network Size

| Algorithm | Speedup (N=8) | Speedup (N=64) |
|---|---|---|
| Stochastic GD | 0.49× | 3.62× |
| Momentum GD | 0.53× | 3.28× |
| Nesterov GD | 0.71× | 5.32× |
| RMSProp | 0.53× | 3.72× |
| Adam | 0.54× | 3.72× |
| Nadam | 0.54× | 3.71× |
## Training Loss Curves

Loss curves (GPU vs CPU) and runtime comparison bar charts are saved in the
results/ folder for each network size: one figure per optimiser (Stochastic
GD, Momentum GD, Nesterov GD, RMSProp, Adam, Nadam) plus a runtime
comparison chart, for both N = 8 and N = 64.
## Key Findings

- **GPU advantage scales with model size.** For N=8, the GPU is slower than
  the CPU due to kernel-launch and PCIe-transfer overhead. At N=64, the GPU
  achieves a 3–5× speedup.
- **Nesterov GD benefits most from GPU acceleration** (~5.3× at N=64),
  because its look-ahead step doubles the number of forward/backward passes
  per batch update.
- **Accuracy is nearly identical** between GPU and CPU implementations, with
  the CPU marginally higher in most cases, consistent with the float32 (GPU)
  vs float64 (CPU) precision difference.
- **Best accuracy at N=64** is achieved by Nesterov GD (GPU: 95.45%, CPU:
  95.57%).
- **GPU run times are highly consistent** (~120–137 s) across all algorithms
  and both N values, confirming that GPU execution time is dominated by
  kernel overhead rather than compute. For N=64, the CPU time variance is
  large (394–728 s), showing strong sensitivity to algorithmic complexity.
## Project Structure

```text
CPU-vs-GPU-Performance-Analysis/
│
├── GD_optimisers.py      # Main script: all algorithms, training, evaluation & plots
│
└── results/
    ├── 8.txt             # Benchmark output — N=8 neurons
    ├── 64.txt            # Benchmark output — N=64 neurons
    ├── N8/               # Training loss & runtime plots for N=8
    │   ├── Figure_1.png  # Stochastic GD loss curve
    │   ├── Figure_2.png  # Momentum GD loss curve
    │   ├── Figure_3.png  # Nesterov GD loss curve
    │   ├── Figure_4.png  # RMSProp loss curve
    │   ├── Figure_5.png  # Adam loss curve
    │   ├── Figure_6.png  # Nadam loss curve
    │   └── rt.png        # Runtime comparison bar chart
    └── N64/              # Training loss & runtime plots for N=64
        ├── Figure_1.png
        ├── Figure_2.png
        ├── Figure_3.png
        ├── Figure_4.png
        ├── Figure_5.png
        ├── Figure_6.png
        └── rt.png
```