A ground-up implementation of backpropagation and three gradient-based optimizers on a custom computational graph, built as an educational deep learning project. Every gradient is derived analytically using the chain rule, verified against numerical approximation via gradient checking, and benchmarked across Vanilla SGD, Momentum SGD, and Adam.
```
BACKPROPOGATION-FROM-SCRATCH---Deep-Learning/
├── Backpropogation_deep_learning.ipynb   # Full implementation & results
├── requirements.txt
└── README.md
```
Task: Regression on a dataset of 506 samples with 5 features (f1–f5, denoted x0–x4 in the formulas below). Goal: Learn 9 weights (w0–w8) by minimizing MSE loss using manual backpropagation.
The network is split into 3 sub-graphs for clarity:
Part 1 (Exponential branch):
z1 = w0*x0 + w1*x1
part1 = exp(z1² + w5)
Part 2 (Tanh branch):
part2 = tanh(part1 + w6)
Part 3 (Sigmoid branch):
z2 = sin(w2*x2)
z3 = w3*x3 + w4*x4
part3 = sigmoid(z2 * z3 + w7)
Output:
y_pred = part3 * w8 + part2
Loss (MSE per sample):
L = (y - y_pred)²
Weight initialization: Normal distribution, mean=0, std=0.01
- Computes all intermediate activations through the graph
- Stores values needed for the backward pass
- Calculates per-sample MSE loss and the seed gradient
∂L/∂y_pred = -2(y - y_pred)
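The forward pass described above can be sketched in NumPy as follows. This is a minimal illustration: the function and variable names are my own, not necessarily those used in the notebook.

```python
import numpy as np

def forward(w, x, y):
    """Forward pass for one sample through the three sub-graphs.

    w: 9 weights (w0..w8), x: 5 features (x0..x4), y: scalar target.
    Returns the per-sample MSE loss and a cache of intermediates
    needed by the backward pass.
    """
    z1 = w[0] * x[0] + w[1] * x[1]
    part1 = np.exp(z1 ** 2 + w[5])                    # exponential branch
    part2 = np.tanh(part1 + w[6])                     # tanh branch
    z2 = np.sin(w[2] * x[2])
    z3 = w[3] * x[3] + w[4] * x[4]
    part3 = 1.0 / (1.0 + np.exp(-(z2 * z3 + w[7])))   # sigmoid branch
    y_pred = part3 * w[8] + part2
    loss = (y - y_pred) ** 2                          # per-sample MSE
    dy_pred = -2.0 * (y - y_pred)                     # seed gradient ∂L/∂y_pred
    cache = dict(z1=z1, part1=part1, part2=part2, z2=z2, z3=z3,
                 part3=part3, y_pred=y_pred, dy_pred=dy_pred)
    return loss, cache
```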
Derives all 9 gradients analytically:
| Gradient | Expression summary |
|---|---|
| dw0 | ∂L/∂y_pred · 1 · ∂tanh · ∂exp · 2z1 · x0 |
| dw1 | ∂L/∂y_pred · 1 · ∂tanh · ∂exp · 2z1 · x1 |
| dw2 | ∂L/∂y_pred · w8 · ∂sigmoid · z3 · cos(w2·x2) · x2 |
| dw3 | ∂L/∂y_pred · w8 · ∂sigmoid · z2 · x3 |
| dw4 | ∂L/∂y_pred · w8 · ∂sigmoid · z2 · x4 |
| dw5 | ∂L/∂y_pred · 1 · ∂tanh · ∂exp |
| dw6 | ∂L/∂y_pred · 1 · ∂tanh |
| dw7 | ∂L/∂y_pred · w8 · ∂sigmoid |
| dw8 | ∂L/∂y_pred · part3 |
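The gradient table above translates directly into code. Here is a hedged sketch (names are illustrative; `cache` is assumed to hold the forward-pass intermediates z1, z2, z3, part1, part2, part3, and the seed gradient dy_pred):

```python
import numpy as np

def backward(w, x, cache):
    """Analytical gradients for all 9 weights via the chain rule."""
    d = cache["dy_pred"]                               # ∂L/∂y_pred
    dtanh = 1.0 - cache["part2"] ** 2                  # tanh'(part1 + w6)
    dsig = cache["part3"] * (1.0 - cache["part3"])     # sigmoid'(z2*z3 + w7)
    dexp = cache["part1"]                              # d/du exp(u) = exp(u)
    grads = np.empty(9)
    grads[0] = d * dtanh * dexp * 2 * cache["z1"] * x[0]
    grads[1] = d * dtanh * dexp * 2 * cache["z1"] * x[1]
    grads[2] = d * w[8] * dsig * cache["z3"] * np.cos(w[2] * x[2]) * x[2]
    grads[3] = d * w[8] * dsig * cache["z2"] * x[3]
    grads[4] = d * w[8] * dsig * cache["z2"] * x[4]
    grads[5] = d * dtanh * dexp
    grads[6] = d * dtanh
    grads[7] = d * w[8] * dsig
    grads[8] = d * cache["part3"]
    return grads
```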
Validates analytical gradients against numerical approximation:
f'(x) ≈ (f(x+ε) − f(x−ε)) / (2ε), ε = 1e-7
Relative error = |dW − dW_approx| / (|dW| + |dW_approx|)
All gradients passed with relative error < 1e-3 (max observed magnitude: 2.55e-04).
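The central-difference check can be written generically over any scalar loss function. A minimal sketch (the function name and signature are my own, not the notebook's):

```python
import numpy as np

def gradient_check(loss_fn, w, analytic, eps=1e-7):
    """Compare analytic gradients against central differences.

    loss_fn: scalar loss as a function of the weight vector.
    analytic: the analytically derived gradient at w.
    Returns the per-weight relative error.
    """
    numeric = np.empty_like(w)
    for i in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        numeric[i] = (loss_fn(wp) - loss_fn(wm)) / (2 * eps)
    # small constant in the denominator guards against division by zero
    return np.abs(analytic - numeric) / (np.abs(analytic) + np.abs(numeric) + 1e-12)
```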
w_{t+1} = w_t − η · ∇w
- Learning rate η = 0.05
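As a one-line sketch of the update rule above (function name is illustrative):

```python
import numpy as np

def sgd_step(w, grads, lr=0.05):
    """Vanilla SGD: move each weight against its gradient."""
    return w - lr * grads
```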
v_t = γ · v_{t-1} + η · ∇w
w_{t+1} = w_t − v_t
- Momentum γ = 0.9, Learning rate η = 0.05
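The velocity update can be sketched as follows (illustrative names; the caller keeps the velocity state between steps):

```python
import numpy as np

def momentum_step(w, v, grads, lr=0.05, gamma=0.9):
    """Momentum SGD: velocity is an exponentially decayed sum of past gradients."""
    v = gamma * v + lr * grads
    return w - v, v
```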
m_t = β1 · m_{t-1} + (1−β1) · ∇w
v_t = β2 · v_{t-1} + (1−β2) · ∇w²
m̂_t = m_t / (1−β1^t)
v̂_t = v_t / (1−β2^t)
w_{t+1} = w_t − α · m̂_t / (√v̂_t + ε)
- β1 = 0.9, β2 = 0.99, α = 0.001, ε = 1e-4
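The five Adam equations above map directly onto a step function; a minimal sketch using the hyperparameters listed (names are my own, and the caller maintains m, v, and the step counter t, which starts at 1):

```python
import numpy as np

def adam_step(w, m, v, grads, t, alpha=0.001, beta1=0.9, beta2=0.99, eps=1e-4):
    """Adam: bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grads            # first moment
    v = beta2 * v + (1 - beta2) * grads ** 2       # second moment
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```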
| Optimizer | Loss @ Epoch 1 | Loss @ Epoch 20 |
|---|---|---|
| Vanilla SGD | 0.0423 | 3.21e-04 |
| Momentum SGD | 0.0430 | 1.67e-05 |
| Adam | 0.4663 | 1.62e-06 |
Key observations:
- Adam achieves the lowest final loss (1.62e-06) despite a higher initial loss, due to adaptive learning rates.
- Momentum reaches a final loss roughly 19× lower than Vanilla SGD (1.67e-05 vs 3.21e-04).
- All three show a monotonic loss decrease; Adam's curve is nearly linear on a log scale, i.e., roughly exponential decay.
All tasks passed built-in grader checks:
- Sigmoid: output = 0.8807970779778823 for z = 2
- Forward pass: loss = 0.9298, dy_pred = −1.9285
- Backward pass: all 9 gradients verified to 6 decimal places
- Gradient checking: max relative error < 1e-3
```bash
# Clone the repo
git clone https://github.com/<your-username>/BACKPROPOGATION-FROM-SCRATCH---Deep-Learning.git
cd BACKPROPOGATION-FROM-SCRATCH---Deep-Learning

# Install dependencies
pip install -r requirements.txt

# Launch notebook
jupyter notebook Backpropogation_deep_learning.ipynb
```

The notebook was originally developed on Google Colab. The `google.colab` import is only needed for file upload; remove it if running locally.
- Computational graphs & directed acyclic graph (DAG) traversal
- Chain rule for composite functions (exp, tanh, sigmoid, sin)
- Numerical gradient checking as a debugging tool
- SGD, Momentum, and Adam optimizers from first principles
- Effect of adaptive learning rates on convergence speed