Gradient Descent
What the parent explanation assumed you knew:
You understand that neural networks have parameters (weights) that get adjusted during training to minimize a 'loss' that measures how wrong the predictions are.
What this page explains:
How gradient descent actually adjusts those parameters, what gradients are, and why this simple idea powers all of deep learning.
The Explanation
Gradient descent is an optimization algorithm that searches for a minimum of a differentiable function by repeatedly taking small steps in the direction of steepest descent.
The Core Idea: Imagine you're blindfolded on a hilly landscape, trying to find the lowest point. You can feel the slope under your feet. The strategy: always step downhill.
Mathematically:
θ_new = θ_old - α × ∇L(θ_old)
Where:
- θ = the parameters (weights) of the neural network
- α = learning rate (step size)
- ∇L(θ_old) = gradient of the loss function, evaluated at the current parameters
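As a minimal sketch of the update rule, the snippet below applies it to a one-parameter loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3). The loss, the starting point, and the learning rate are assumptions chosen for illustration; a real network has many parameters and a far more complicated loss.

```python
def loss(theta):
    # Toy loss with its minimum at theta = 3 (an illustrative assumption)
    return (theta - 3.0) ** 2

def grad(theta):
    # Derivative of the toy loss with respect to theta
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, alpha, steps):
    # Repeatedly apply: theta_new = theta_old - alpha * grad(theta_old)
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

theta_final = gradient_descent(theta=0.0, alpha=0.1, steps=100)
print(theta_final, loss(theta_final))
```

Each step shrinks the distance to the minimum by a constant factor here, so the parameter approaches 3 and the loss approaches 0.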
What is a Gradient? The gradient is a vector of partial derivatives of the loss, one per parameter. It points in the direction of steepest increase, so we go the opposite way (hence the minus sign) to decrease the loss.
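To make "a vector of partial derivatives" concrete, here is a small sketch that estimates a gradient numerically with central differences. The two-parameter loss and the helper name `numerical_gradient` are hypothetical, chosen only for illustration; real frameworks compute gradients analytically rather than by finite differences.

```python
def loss(params):
    # Toy loss L(x, y) = x^2 + 3*y^2 (an illustrative assumption)
    x, y = params
    return x ** 2 + 3.0 * y ** 2

def numerical_gradient(f, params, eps=1e-6):
    # One partial derivative per parameter, via central differences
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

point = [1.0, 2.0]
g = numerical_gradient(loss, point)  # analytic gradient is [2x, 6y] = [2, 12]

# Stepping along the gradient raises the loss; stepping against it lowers it
step = 0.01
uphill = [p + step * gi for p, gi in zip(point, g)]
downhill = [p - step * gi for p, gi in zip(point, g)]
print(g, loss(uphill) > loss(point) > loss(downhill))
```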
**For a neural network:**
1. Forward pass: Compute predictions
2. Loss: Measure how wrong we are
3. Backward pass: Compute gradients via chain rule (backpropagation)
4. Update: Move each parameter against its gradient, scaled by the learning rate
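The four steps above can be sketched for the smallest possible "network", a linear model y = w·x + b trained with mean squared error. The data, learning rate, and epoch count are assumptions for illustration, and the gradients are derived by hand here, whereas a deep learning framework would obtain them through automatic backpropagation.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0   # parameters, starting from zero
alpha = 0.05      # learning rate
n = len(xs)

for epoch in range(2000):
    # 1. Forward pass: compute predictions
    preds = [w * x + b for x in xs]

    # 2. Loss: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n

    # 3. Backward pass: chain rule applied by hand
    #    dL/dw = mean(2 * (pred - y) * x), dL/db = mean(2 * (pred - y))
    grad_w = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / n

    # 4. Update: step against the gradient
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b, loss)
```

On this data the parameters approach w = 2 and b = 1, the values used to generate it.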
**Variants:**
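- **SGD** (stochastic gradient descent): Estimate the gradient from random mini-batches for faster, noisier updates
- **Adam**: Adapt the learning rate per parameter using running averages of past gradients and squared gradients
- **AdaGrad**: Accumulate squared gradients so that each parameter's effective learning rate shrinks as its gradient history grows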
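The variants differ only in how they turn a gradient into a step. The sketch below writes the three update rules side by side on a toy two-parameter loss. The loss, the hyperparameter values, and the function names are assumptions for illustration. The gradient here is exact, whereas real SGD would estimate it from a random mini-batch.

```python
import math

def grad(params):
    # Gradient of the toy loss L(x, y) = x^2 + 10*y^2 (illustrative assumption)
    x, y = params
    return [2.0 * x, 20.0 * y]

def sgd_step(params, g, state, lr=0.05):
    # Plain update; in real SGD, g would come from a random mini-batch
    return [p - lr * gi for p, gi in zip(params, g)]

def adagrad_step(params, g, state, lr=0.5, eps=1e-8):
    # Accumulate squared gradients; a larger history means smaller steps
    state["sum_sq"] = [s + gi * gi for s, gi in zip(state["sum_sq"], g)]
    return [p - lr * gi / (math.sqrt(s) + eps)
            for p, gi, s in zip(params, g, state["sum_sq"])]

def adam_step(params, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient (m) and squared gradient (v),
    # with bias correction for the zero initialization
    state["t"] += 1
    state["m"] = [beta1 * m + (1 - beta1) * gi for m, gi in zip(state["m"], g)]
    state["v"] = [beta2 * v + (1 - beta2) * gi * gi for v, gi in zip(state["v"], g)]
    m_hat = [m / (1 - beta1 ** state["t"]) for m in state["m"]]
    v_hat = [v / (1 - beta2 ** state["t"]) for v in state["v"]]
    return [p - lr * mh / (math.sqrt(vh) + eps)
            for p, mh, vh in zip(params, m_hat, v_hat)]

def run(step_fn, steps=500):
    params = [1.0, 1.0]
    state = {"sum_sq": [0.0, 0.0], "m": [0.0, 0.0], "v": [0.0, 0.0], "t": 0}
    for _ in range(steps):
        params = step_fn(params, grad(params), state)
    return params

results = {name: run(fn) for name, fn in
           [("sgd", sgd_step), ("adagrad", adagrad_step), ("adam", adam_step)]}
print(results)
```

All three should end close to the minimum at (0, 0) on this simple loss; the differences between them matter more on noisy, poorly scaled problems.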
**Learning Rate Matters:**
- Too high: Overshoot and diverge
- Too low: Painfully slow convergence
- Just right: Smooth descent to minimum
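These regimes can be reproduced on the toy loss L(θ) = θ², where each update multiplies θ by (1 - 2α). The specific learning rates below are assumptions picked to show each behavior on this particular loss; the thresholds differ for every problem.

```python
def run(alpha, steps=20, theta=1.0):
    # Gradient descent on L(theta) = theta^2, whose gradient is 2*theta
    path = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta
        path.append(theta)
    return path

too_high = run(alpha=1.1)      # factor -1.2 per step: overshoots and diverges
oscillating = run(alpha=0.9)   # factor -0.8 per step: flips sign but shrinks
too_low = run(alpha=0.001)     # factor 0.998 per step: barely moves
good = run(alpha=0.3)          # factor 0.4 per step: fast, smooth descent

for name, path in [("too high", too_high), ("oscillating", oscillating),
                   ("too low", too_low), ("good", good)]:
    print(name, abs(path[-1]))
```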
Local vs Global Minima: In high dimensions (millions of parameters), the loss landscape has many 'valleys', so gradient descent is not guaranteed to reach the global minimum. Fortunately, research on overparameterized networks suggests that most local minima are nearly as good as the global minimum.
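A one-dimensional toy can show what "many valleys" means in practice: the same algorithm lands in different minima depending on where it starts. The function below is a hypothetical non-convex loss chosen for illustration. It only demonstrates that multiple valleys exist and says nothing about the high-dimensional behavior described above.

```python
def f(x):
    # Non-convex toy loss with two valleys (an illustrative assumption)
    return x ** 4 - 3.0 * x ** 2 + x

def df(x):
    # Derivative of the toy loss
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def descend(x, alpha=0.01, steps=500):
    for _ in range(steps):
        x = x - alpha * df(x)
    return x

left = descend(-2.0)   # settles in the deeper valley (the global minimum)
right = descend(2.0)   # settles in the shallower valley (a local minimum)
print(left, f(left))
print(right, f(right))
```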
Visual Aid
Adjust the learning rate and watch the optimization path on a 2D loss landscape. See how different rates lead to convergence, oscillation, or divergence.
The "Aha" Moment
Gradient descent finds patterns in data by treating learning as a slow downhill walk on a mathematical landscape where altitude represents error.
Go Even Deeper
This explanation assumes you understand these fundamentals:
- Derivatives basics (Level 1 fundamental)
- Vectors and matrices (Level 1 fundamental)