Gradient Descent and Optimization
Lesson, slides, and applied problem sets.
Why this module exists
Machine learning is optimization. We have a loss function, and we need to find parameters that minimize it. Gradient descent is the workhorse algorithm that makes this possible. Understanding optimization is understanding how models learn.
1) The optimization problem
Given:
- A model with parameters θ (weights)
- A loss function L(θ) measuring error
- Training data
Find: θ* that minimizes L(θ)
θ* = argmin_θ L(θ)
For simple models, we might solve analytically. For neural networks, we need iterative optimization.
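For concreteness, here is a small sketch (NumPy, with synthetic data invented for this example): least-squares linear regression can be solved analytically via the normal equations, which is exactly the kind of shortcut that is unavailable for neural networks.

import numpy as np

# Toy least-squares problem: find theta minimizing ||X @ theta - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

# Closed-form solution via the normal equations: theta* = (X^T X)^{-1} X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)   # close to [2, -1, 0.5]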
2) Gradient descent intuition
Imagine standing on a hilly landscape, blindfolded:
- Feel which way is downhill (compute gradient)
- Take a step in that direction
- Repeat until you reach the bottom
The gradient ∇L(θ) points in the direction of steepest increase (uphill), so we step the opposite way:
θ = θ - learning_rate * ∇L(θ)
3) The gradient descent algorithm
def gradient_descent(loss_fn, grad_fn, theta_init, lr, n_steps):
    theta = theta_init
    for step in range(n_steps):
        loss = loss_fn(theta)
        gradient = grad_fn(theta)
        theta = theta - lr * gradient
        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss}")
    return theta
Parameters:
- learning_rate (lr): Step size
- n_steps: Number of iterations
- theta_init: Starting point
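As a quick sanity check, here is a usage sketch (assuming NumPy and the gradient_descent function above) on the convex loss L(θ) = ||θ||², whose minimum is at θ = 0:

import numpy as np

loss_fn = lambda theta: np.sum(theta ** 2)   # L(theta) = ||theta||^2
grad_fn = lambda theta: 2 * theta            # its gradient
theta0 = np.array([3.0, -4.0])

theta_final = gradient_descent(loss_fn, grad_fn, theta0, lr=0.1, n_steps=500)
print(theta_final)   # very close to [0, 0]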
4) Learning rate: The critical hyperparameter
Too small:
- Slow convergence
- Can stall on plateaus or in shallow local minima
Too large:
- Overshoot the minimum
- Oscillate or diverge
Just right:
- Steady progress toward minimum
lr = 0.001 is a common starting point
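To make the three regimes concrete, here is a tiny sketch on the one-dimensional loss L(θ) = θ² (gradient 2θ, minimum at 0); the specific values are only illustrative:

def run(lr, n_steps=20, theta=1.0):
    # Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta
    for _ in range(n_steps):
        theta = theta - lr * 2 * theta
    return theta

print(run(lr=0.001))   # too small: barely moves after 20 steps
print(run(lr=0.1))     # reasonable: close to the minimum at 0
print(run(lr=1.5))     # too large: |theta| doubles each step (diverges)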
5) Batch, Stochastic, and Mini-batch
Batch Gradient Descent: Use all training data per step
gradient = average_gradient(all_data)
- Stable but slow for large datasets
Stochastic Gradient Descent (SGD): Use one sample per step
gradient = gradient(one_random_sample)
- Fast but noisy
Mini-batch SGD: Use a small batch (e.g., 32 samples)
gradient = average_gradient(batch_of_32)
- Best of both worlds: stable and efficient
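A minimal mini-batch training loop might look like the sketch below; loss_grad is a hypothetical function that returns the average gradient over the batch it is given, not part of any particular library:

import numpy as np

def minibatch_sgd(loss_grad, theta, X, y, lr=0.01, batch_size=32, n_epochs=10):
    n = len(X)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # indices of this mini-batch
            grad = loss_grad(theta, X[idx], y[idx])
            theta = theta - lr * grad
    return theta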
6) Why mini-batch works
- Parallelism: GPUs process batches efficiently
- Regularization: Noise helps escape local minima
- Memory: Fits in GPU memory
- Convergence: Good balance of speed and stability
Typical batch sizes: 32, 64, 128, 256
7) Momentum
SGD can oscillate in "ravines" (steep in one direction, shallow in another).
Momentum accumulates gradients over time:
velocity = 0
for step in range(n_steps):
    gradient = compute_gradient()
    velocity = momentum * velocity + gradient
    theta = theta - lr * velocity
Like a ball rolling downhill—it builds speed and smooths oscillations.
Typical momentum: 0.9
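Filling in the missing pieces, here is a self-contained sketch of SGD with momentum on a made-up "ravine" loss 10x² + 0.1y² (steep along x, shallow along y):

import numpy as np

def momentum_gd(grad_fn, theta, lr=0.01, momentum=0.9, n_steps=200):
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        velocity = momentum * velocity + gradient   # accumulate past gradients
        theta = theta - lr * velocity
    return theta

# Gradient of the ravine loss 10x^2 + 0.1y^2
grad_fn = lambda t: np.array([20.0 * t[0], 0.2 * t[1]])
print(momentum_gd(grad_fn, np.array([1.0, 1.0])))   # both coordinates head toward 0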
8) Adaptive learning rates
Different parameters may need different learning rates.
AdaGrad: Scale learning rate by past gradients
accumulated_grad_sq += gradient ** 2
theta = theta - lr * gradient / (sqrt(accumulated_grad_sq) + eps)   # eps: small constant for numerical stability
Problem: Learning rate shrinks to zero over time.
RMSprop: Use exponential moving average
avg_grad_sq = decay * avg_grad_sq + (1 - decay) * gradient ** 2
theta = theta - lr * gradient / (sqrt(avg_grad_sq) + eps)
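Putting the RMSprop update into a runnable sketch (the gradient 2θ of a simple quadratic is just for demonstration; decay = 0.9 and a small eps are conventional choices):

import numpy as np

def rmsprop(grad_fn, theta, lr=0.01, decay=0.9, eps=1e-8, n_steps=500):
    avg_grad_sq = np.zeros_like(theta)
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        avg_grad_sq = decay * avg_grad_sq + (1 - decay) * gradient ** 2
        theta = theta - lr * gradient / (np.sqrt(avg_grad_sq) + eps)
    return theta

print(rmsprop(lambda t: 2 * t, np.array([3.0, -4.0])))   # heads toward [0, 0]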
9) Adam: The default optimizer
Combines momentum and adaptive learning rates:
m = beta1 * m + (1 - beta1) * gradient # momentum
v = beta2 * v + (1 - beta2) * gradient ** 2 # RMSprop
m_hat = m / (1 - beta1 ** t) # bias correction
v_hat = v / (1 - beta2 ** t)
theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
Default hyperparameters work well:
- lr = 0.001
- beta1 = 0.9
- beta2 = 0.999
- eps = 1e-8
Adam is the default choice for most deep learning.
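Putting the pieces together, here is a minimal Adam sketch using the defaults above (demonstrated again on the quadratic gradient 2θ; an illustration, not a library implementation):

import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=5000):
    m = np.zeros_like(theta)   # first moment (momentum term)
    v = np.zeros_like(theta)   # second moment (RMSprop term)
    for t in range(1, n_steps + 1):
        gradient = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(adam(lambda t: 2 * t, np.array([3.0, -4.0])))   # ends close to [0, 0]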
10) Convergence criteria
When to stop training?
- Fixed iterations: Run for N epochs
- Loss threshold: Stop when loss < target
- Patience: Stop when loss hasn't improved for K epochs
- Gradient norm: Stop when gradients are tiny
Early stopping (patience) helps prevent overfitting.
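A patience-based stopping rule might be sketched as follows; train_one_epoch and val_loss_fn are placeholder callables standing in for your own training and validation code:

def train_with_early_stopping(train_one_epoch, val_loss_fn, max_epochs=100, patience=5):
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = val_loss_fn()
        if val_loss < best_loss:
            best_loss = val_loss                # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # no improvement for `patience` epochs
    return best_loss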
11) Learning rate schedules
Reduce learning rate during training:
- Step decay: lr = lr * 0.1 every N epochs
- Exponential decay: lr = lr * decay^epoch
- Cosine annealing: lr follows a cosine curve
- Warmup: Start small, increase, then decrease
# Warmup + decay
if epoch < warmup_epochs:
    lr = initial_lr * epoch / warmup_epochs
else:
    lr = initial_lr * decay ** (epoch - warmup_epochs)
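A cosine-annealing schedule can be written the same way; the function below is a sketch (epoch counts from 0, and min_lr is an assumed floor):

import math

def cosine_lr(epoch, total_epochs, initial_lr, min_lr=0.0):
    # lr follows half a cosine: initial_lr at epoch 0 down to min_lr at the last epoch
    progress = epoch / max(1, total_epochs - 1)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 100, 0.001), cosine_lr(99, 100, 0.001))   # ~0.001 ... ~0.0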
12) Local minima and saddle points
For neural networks:
- Local minima: Rarely a problem; most local minima are good
- Saddle points: More problematic; gradient is zero but not a minimum
- Plateaus: Flat regions slow down learning
Momentum and Adam help escape these.
13) Convex vs non-convex
Convex functions: Any local minimum is a global minimum; gradient descent with a suitable learning rate finds it
- Linear regression with MSE is convex
Non-convex functions: Multiple local minima
- Neural networks are non-convex
- We find a good minimum, not necessarily the best
Good news: Random initialization + SGD noise often find good solutions.
14) Practical tips
- Start with Adam: lr=0.001 is a good default
- Monitor training: Plot loss curves
- Watch for divergence: If loss explodes, reduce lr
- Try a learning rate finder: Sweep lr and pick where loss drops fastest (see the sketch after this list)
- Use early stopping: Prevent overfitting
- Normalize inputs: Helps optimization converge faster
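The learning rate finder idea can be approximated with a crude sweep like the one below (a simplification of the usual LR range test, which increases lr within a single run; here we just compare a few fixed values on the quadratic loss from earlier):

import numpy as np

def lr_finder(loss_fn, grad_fn, theta0, lrs, n_steps=20):
    # Run a few gradient steps from the same start for each candidate lr and
    # record the final loss; favor the largest lr whose loss still drops sharply.
    results = {}
    for lr in lrs:
        theta = np.array(theta0, dtype=float)
        for _ in range(n_steps):
            theta = theta - lr * grad_fn(theta)
        results[lr] = float(loss_fn(theta))
    return results

print(lr_finder(lambda t: np.sum(t ** 2), lambda t: 2 * t, [3.0, -4.0],
                lrs=[1e-4, 1e-3, 1e-2, 1e-1, 1.0]))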
Key takeaways
- Gradient descent: θ = θ - lr × ∇L(θ)
- Learning rate is critical: not too big, not too small
- Mini-batch SGD balances efficiency and stability
- Momentum smooths oscillations, builds speed
- Adam combines momentum + adaptive lr; use it by default
- Learning rate schedules improve final performance
- Non-convex landscapes are OK; SGD finds good solutions