Gradient Descent and Optimization

Lesson, slides, and applied problem sets.


Why this module exists

Machine learning is optimization. We have a loss function, and we need to find parameters that minimize it. Gradient descent is the workhorse algorithm that makes this possible. Understanding optimization is understanding how models learn.


1) The optimization problem

Given:

  • A model with parameters θ (weights)
  • A loss function L(θ) measuring error
  • Training data

Find: θ* that minimizes L(θ)

θ* = argmin_θ L(θ)

For simple models, we might solve analytically. For neural networks, we need iterative optimization.
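As a concrete illustration, least-squares linear regression has a closed-form solution via the normal equations. A minimal NumPy sketch (the data here is made up purely for illustration):

import numpy as np

# Toy data: y is roughly 2x + 1 plus noise (made-up example)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

# Analytic solution: theta = (X^T X)^{-1} X^T y
theta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_closed_form)  # close to [1, 2]

Neural network losses have no such closed form, which is why we turn to gradient descent.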


2) Gradient descent intuition

Imagine standing on a hilly landscape, blindfolded:

  1. Feel which way is downhill (compute gradient)
  2. Take a step in that direction
  3. Repeat until you reach the bottom

The gradient ∇L(θ) points uphill. We go the opposite way:

θ = θ - learning_rate * ∇L(θ)
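For example, if L(θ) = θ², then ∇L(θ) = 2θ. Starting at θ = 3 with learning_rate = 0.1, one update gives θ = 3 - 0.1 × 6 = 2.4, a step toward the minimum at θ = 0.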

3) The gradient descent algorithm

def gradient_descent(loss_fn, grad_fn, theta_init, lr, n_steps):
    """Minimize loss_fn by repeatedly stepping against its gradient."""
    theta = theta_init

    for step in range(n_steps):
        loss = loss_fn(theta)          # current loss, for monitoring only
        gradient = grad_fn(theta)      # direction of steepest ascent
        theta = theta - lr * gradient  # step downhill

        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss}")

    return theta

Parameters:

  • learning_rate (lr): Step size
  • n_steps: Number of iterations
  • theta_init: Starting point
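As a quick usage sketch, the function above can minimize a simple quadratic (the specific loss is chosen just for illustration):

# Minimize L(theta) = (theta - 3)^2, whose minimum is at theta = 3
loss_fn = lambda theta: (theta - 3.0) ** 2
grad_fn = lambda theta: 2.0 * (theta - 3.0)

theta_star = gradient_descent(loss_fn, grad_fn, theta_init=0.0, lr=0.1, n_steps=500)
print(theta_star)  # approximately 3.0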

4) Learning rate: The critical hyperparameter

Too small:

  • Slow convergence
  • May get stuck in local minima

Too large:

  • Overshoot the minimum
  • Oscillate or diverge

Just right:

  • Steady progress toward minimum

lr = 0.001 is a common starting point.
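To make the effect concrete, the same quadratic from the earlier sketch can be run with different learning rates (the values below are chosen for illustration):

# Same quadratic as the earlier sketch: L(theta) = (theta - 3)^2
loss_fn = lambda theta: (theta - 3.0) ** 2
grad_fn = lambda theta: 2.0 * (theta - 3.0)

for lr in (0.001, 0.1, 1.5):
    theta = gradient_descent(loss_fn, grad_fn, theta_init=0.0, lr=lr, n_steps=100)
    print(f"lr={lr}: final theta = {theta}")
# lr=0.001 barely moves, lr=0.1 converges, lr=1.5 overshoots and diverges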

5) Batch, Stochastic, and Mini-batch

Batch Gradient Descent: Use all training data per step

gradient = average_gradient(all_data)
  • Stable but slow for large datasets

Stochastic Gradient Descent (SGD): Use one sample per step

gradient = gradient(one_random_sample)
  • Fast but noisy

Mini-batch SGD: Use a small batch (e.g., 32 samples)

gradient = average_gradient(batch_of_32)
  • Best of both worlds: stable and efficient
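To make the variants concrete, here is a minimal mini-batch SGD loop for linear least squares, sketched in NumPy (the function name and defaults are illustrative, not from any particular library):

import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, n_epochs=10, seed=0):
    """Fit a linear model with mini-batch SGD on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)

    for epoch in range(n_epochs):
        perm = rng.permutation(n)              # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the MSE on this batch only
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)
            theta -= lr * grad
    return theta

Setting batch_size = n recovers batch gradient descent; batch_size = 1 recovers plain SGD.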

6) Why mini-batch works

  • Parallelism: GPUs process batches efficiently
  • Regularization: Noise helps escape local minima
  • Memory: Fits in GPU memory
  • Convergence: Good balance of speed and stability

Typical batch sizes: 32, 64, 128, 256


7) Momentum

SGD can oscillate in "ravines" (steep in one direction, shallow in another).

Momentum accumulates gradients over time:

velocity = 0
for step in range(n_steps):
    gradient = compute_gradient()
    velocity = momentum * velocity + gradient
    theta = theta - lr * velocity

Like a ball rolling downhill—it builds speed and smooths oscillations.

Typical momentum: 0.9
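Filling in the pseudocode above, a runnable sketch (the quadratic loss and constants are just for illustration):

def sgd_momentum(grad_fn, theta_init, lr=0.01, momentum=0.9, n_steps=200):
    """Gradient descent with classical momentum."""
    theta, velocity = theta_init, 0.0
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        velocity = momentum * velocity + gradient   # accumulate past gradients
        theta = theta - lr * velocity               # step along the velocity
    return theta

# Example: minimize (theta - 3)^2
print(sgd_momentum(lambda x: 2.0 * (x - 3.0), theta_init=0.0))  # approximately 3.0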


8) Adaptive learning rates

Different parameters may need different learning rates.

AdaGrad: Scale learning rate by past gradients

accumulated_grad_sq += gradient ** 2
theta = theta - lr * gradient / (sqrt(accumulated_grad_sq) + eps)  # eps: small constant, e.g. 1e-8

Problem: Learning rate shrinks to zero over time.

RMSprop: Use exponential moving average

avg_grad_sq = decay * avg_grad_sq + (1 - decay) * gradient ** 2
theta = theta - lr * gradient / (sqrt(avg_grad_sq) + eps)
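A runnable scalar version of the RMSprop update, with illustrative defaults:

import math

def rmsprop(grad_fn, theta_init, lr=0.01, decay=0.9, eps=1e-8, n_steps=500):
    """Scale each step by a running average of squared gradients."""
    theta, avg_grad_sq = theta_init, 0.0
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        avg_grad_sq = decay * avg_grad_sq + (1 - decay) * gradient ** 2
        theta -= lr * gradient / (math.sqrt(avg_grad_sq) + eps)
    return theta

print(rmsprop(lambda x: 2.0 * (x - 3.0), theta_init=0.0))  # approximately 3.0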

9) Adam: The default optimizer

Combines momentum and adaptive learning rates:

m = beta1 * m + (1 - beta1) * gradient      # momentum
v = beta2 * v + (1 - beta2) * gradient ** 2  # RMSprop

m_hat = m / (1 - beta1 ** t)  # bias correction
v_hat = v / (1 - beta2 ** t)

theta = theta - lr * m_hat / (sqrt(v_hat) + eps)

Default hyperparameters work well:

  • lr = 0.001
  • beta1 = 0.9
  • beta2 = 0.999
  • eps = 1e-8

Adam is the default choice for most deep learning.
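Putting the pieces together, a minimal scalar implementation of the update above (the test loss is just an illustration):

import math

def adam(grad_fn, theta_init, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=5000):
    """Adam: momentum on the gradient plus RMSprop-style scaling, with bias correction."""
    theta, m, v = theta_init, 0.0, 0.0
    for t in range(1, n_steps + 1):                     # t starts at 1 for bias correction
        gradient = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * gradient          # first-moment estimate (momentum)
        v = beta2 * v + (1 - beta2) * gradient ** 2     # second-moment estimate (RMSprop)
        m_hat = m / (1 - beta1 ** t)                    # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Example: minimize (theta - 3)^2
print(adam(lambda x: 2.0 * (x - 3.0), theta_init=0.0))  # approximately 3.0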


10) Convergence criteria

When to stop training?

  • Fixed iterations: Run for N epochs
  • Loss threshold: Stop when loss < target
  • Patience: Stop when loss hasn't improved for K epochs
  • Gradient norm: Stop when gradients are tiny

Early stopping (patience) helps prevent overfitting.
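A minimal patience-based early-stopping loop; train_one_epoch and validate below are placeholders for whatever training and validation routines a project actually uses:

def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()

        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0      # reset the patience counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}")
                break
    return best_loss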


11) Learning rate schedules

Reduce learning rate during training:

  • Step decay: lr = lr * 0.1 every N epochs
  • Exponential decay: lr = lr * decay^epoch
  • Cosine annealing: lr follows a cosine curve
  • Warmup: Start small, increase, then decrease

# Warmup + decay
if epoch < warmup_epochs:
    lr = initial_lr * (epoch + 1) / warmup_epochs   # +1 so lr isn't zero at epoch 0
else:
    lr = initial_lr * decay ** (epoch - warmup_epochs)
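Cosine annealing is a one-line formula; a small sketch with illustrative constants:

import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=0.001, lr_min=0.0):
    """Learning rate follows half a cosine from lr_max down to lr_min."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print([round(cosine_annealing_lr(e, 10), 5) for e in range(10)])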

12) Local minima and saddle points

For neural networks:

  • Local minima: Rarely a problem; most local minima are good
  • Saddle points: More problematic; gradient is zero but not a minimum
  • Plateaus: Flat regions slow down learning

Momentum and Adam help escape these.


13) Convex vs non-convex

Convex functions: One global minimum, gradient descent finds it

  • Linear regression with MSE is convex

Non-convex functions: Multiple local minima

  • Neural networks are non-convex
  • We find a good minimum, not necessarily the best

Good news: Random initialization + SGD noise often find good solutions.


14) Practical tips

  1. Start with Adam: lr=0.001 is a good default
  2. Monitor training: Plot loss curves
  3. Watch for divergence: If loss explodes, reduce lr
  4. Try learning rate finder: Sweep lr, find where loss drops fastest (see the sketch after this list)
  5. Use early stopping: Prevent overfitting
  6. Normalize inputs: Helps optimization converge faster
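Tip 4 can be sketched in a few lines: sweep the learning rate exponentially over training steps, record the loss at each one, and pick a value somewhat below where the loss drops fastest. Here train_batch is a placeholder for one optimization step at the given lr:

def lr_finder(train_batch, lr_start=1e-7, lr_end=1.0, n_iters=100):
    """Sweep lr exponentially and record the loss at each step."""
    factor = (lr_end / lr_start) ** (1 / n_iters)
    lr, history = lr_start, []
    for _ in range(n_iters):
        loss = train_batch(lr)        # one training step at this lr (placeholder)
        history.append((lr, loss))
        lr *= factor
    return history                    # plot loss vs lr to choose a value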

Key takeaways

  • Gradient descent: θ = θ - lr × ∇L(θ)
  • Learning rate is critical: not too big, not too small
  • Mini-batch SGD balances efficiency and stability
  • Momentum smooths oscillations, builds speed
  • Adam combines momentum + adaptive lr; use it by default
  • Learning rate schedules improve final performance
  • Non-convex landscapes are OK; SGD finds good solutions
