Gradient Descent and Optimization
Lesson, slides, and applied problem sets.
Why this module exists
Machine learning is optimization. We have a loss function, and we need to find parameters that minimize it. Gradient descent is the workhorse algorithm that makes this possible. Understanding optimization is understanding how models learn.
1) The optimization problem
Given:
- A model with parameters θ (weights)
- A loss function L(θ) measuring error
- Training data
Find: θ* that minimizes L(θ)
θ* = argmin_θ L(θ)
For simple models, we might solve analytically. For neural networks, we need iterative optimization.
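For concreteness, here is a small sketch (NumPy, with synthetic data invented for this example): least-squares linear regression can be solved analytically via the normal equations, which is exactly the kind of shortcut that is unavailable for neural networks.

import numpy as np

# Toy least-squares problem: find theta minimizing ||X @ theta - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

# Closed-form solution via the normal equations: theta* = (X^T X)^{-1} X^T y
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)   # close to [2, -1, 0.5]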
2) Gradient descent intuition
Imagine standing on a hilly landscape, blindfolded:
- Feel which way is downhill (compute gradient)
- Take a step in that direction
- Repeat until you reach the bottom
The gradient ∇L(θ) points in the direction of steepest increase (uphill), so we step the opposite way:
θ = θ - learning_rate * ∇L(θ)
3) The gradient descent algorithm
def gradient_descent(loss_fn, grad_fn, theta_init, lr, n_steps):
    theta = theta_init
    for step in range(n_steps):
        loss = loss_fn(theta)
        gradient = grad_fn(theta)
        theta = theta - lr * gradient
        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss}")
    return theta
Parameters:
- learning_rate (lr): Step size
- n_steps: Number of iterations
- theta_init: Starting point
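As a quick sanity check, here is a usage sketch (assuming NumPy and the gradient_descent function above) on the convex loss L(θ) = ||θ||², whose minimum is at θ = 0:

import numpy as np

loss_fn = lambda theta: np.sum(theta ** 2)   # L(theta) = ||theta||^2
grad_fn = lambda theta: 2 * theta            # its gradient
theta0 = np.array([3.0, -4.0])

theta_final = gradient_descent(loss_fn, grad_fn, theta0, lr=0.1, n_steps=500)
print(theta_final)   # very close to [0, 0]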
4) Learning rate: The critical hyperparameter
Too small:
- Slow convergence
- Can stall on plateaus or in shallow local minima
Too large:
- Overshoot the minimum
- Oscillate or diverge
Just right:
- Steady progress toward minimum
lr = 0.001 is a common starting point
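To make the three regimes concrete, here is a tiny sketch on the one-dimensional loss L(θ) = θ² (gradient 2θ, minimum at 0); the specific values are only illustrative:

def run(lr, n_steps=20, theta=1.0):
    # Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta
    for _ in range(n_steps):
        theta = theta - lr * 2 * theta
    return theta

print(run(lr=0.001))   # too small: barely moves after 20 steps
print(run(lr=0.1))     # reasonable: close to the minimum at 0
print(run(lr=1.5))     # too large: |theta| doubles each step (diverges)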
5) Batch, Stochastic, and Mini-batch
Batch Gradient Descent: Use all training data per step
gradient = average_gradient(all_data)
- Stable but slow for large datasets
Stochastic Gradient Descent (SGD): Use one sample per step
gradient = gradient(one_random_sample)
- Fast but noisy
Mini-batch SGD: Use a small batch (e.g., 32 samples)
gradient = average_gradient(batch_of_32)
- Best of both worlds: stable and efficient
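A minimal mini-batch training loop might look like the sketch below; loss_grad is a hypothetical function that returns the average gradient over the batch it is given, not part of any particular library:

import numpy as np

def minibatch_sgd(loss_grad, theta, X, y, lr=0.01, batch_size=32, n_epochs=10):
    n = len(X)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # indices of this mini-batch
            grad = loss_grad(theta, X[idx], y[idx])
            theta = theta - lr * grad
    return theta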
6) Why mini-batch works
- Parallelism: GPUs process batches efficiently
- Regularization: Noise helps escape local minima
- Memory: Fits in GPU memory
- Convergence: Good balance of speed and stability
Typical batch sizes: 32, 64, 128, 256
7) Momentum
SGD can oscillate in "ravines" (steep in one direction, shallow in another).
Momentum accumulates gradients over time:
velocity = 0
for step in range(n_steps):
    gradient = compute_gradient()
    velocity = momentum * velocity + gradient
    theta = theta - lr * velocity
Like a ball rolling downhill—it builds speed and smooths oscillations.
Typical momentum: 0.9
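Filling in the missing pieces, here is a self-contained sketch of SGD with momentum on a made-up "ravine" loss 10x² + 0.1y² (steep along x, shallow along y):

import numpy as np

def momentum_gd(grad_fn, theta, lr=0.01, momentum=0.9, n_steps=200):
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        velocity = momentum * velocity + gradient   # accumulate past gradients
        theta = theta - lr * velocity
    return theta

# Gradient of the ravine loss 10x^2 + 0.1y^2
grad_fn = lambda t: np.array([20.0 * t[0], 0.2 * t[1]])
print(momentum_gd(grad_fn, np.array([1.0, 1.0])))   # both coordinates head toward 0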
8) Adaptive learning rates
Different parameters may need different learning rates.
AdaGrad: Scale learning rate by past gradients
accumulated_grad_sq += gradient ** 2
theta = theta - lr * gradient / (sqrt(accumulated_grad_sq) + eps)   # eps: small constant for numerical stability
Problem: Learning rate shrinks to zero over time.
RMSprop: Use exponential moving average
avg_grad_sq = decay * avg_grad_sq + (1 - decay) * gradient ** 2
theta = theta - lr * gradient / (sqrt(avg_grad_sq) + eps)
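Putting the RMSprop update into a runnable sketch (the gradient 2θ of a simple quadratic is just for demonstration; decay = 0.9 and a small eps are conventional choices):

import numpy as np

def rmsprop(grad_fn, theta, lr=0.01, decay=0.9, eps=1e-8, n_steps=500):
    avg_grad_sq = np.zeros_like(theta)
    for _ in range(n_steps):
        gradient = grad_fn(theta)
        avg_grad_sq = decay * avg_grad_sq + (1 - decay) * gradient ** 2
        theta = theta - lr * gradient / (np.sqrt(avg_grad_sq) + eps)
    return theta

print(rmsprop(lambda t: 2 * t, np.array([3.0, -4.0])))   # heads toward [0, 0]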
9) Adam: The default optimizer
Combines momentum and adaptive learning rates:
m = beta1 * m + (1 - beta1) * gradient # momentum
v = beta2 * v + (1 - beta2) * gradient ** 2 # RMSprop
m_hat = m / (1 - beta1 ** t) # bias correction
v_hat = v / (1 - beta2 ** t)
theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
Default hyperparameters work well:
- lr = 0.001
- beta1 = 0.9
- beta2 = 0.999
- eps = 1e-8
Adam is the default choice for most deep learning.
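Putting the pieces together, here is a minimal Adam sketch using the defaults above (demonstrated again on the quadratic gradient 2θ; an illustration, not a library implementation):

import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=5000):
    m = np.zeros_like(theta)   # first moment (momentum term)
    v = np.zeros_like(theta)   # second moment (RMSprop term)
    for t in range(1, n_steps + 1):
        gradient = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(adam(lambda t: 2 * t, np.array([3.0, -4.0])))   # ends close to [0, 0]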
10) Convergence criteria
When to stop training?
- Fixed iterations: Run for N epochs
- Loss threshold: Stop when loss < target
- Patience: Stop when loss hasn't improved for K epochs
- Gradient norm: Stop when gradients are tiny
Early stopping (patience) helps prevent overfitting.
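A patience-based stopping rule might be sketched as follows; train_one_epoch and val_loss_fn are placeholder callables standing in for your own training and validation code:

def train_with_early_stopping(train_one_epoch, val_loss_fn, max_epochs=100, patience=5):
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = val_loss_fn()
        if val_loss < best_loss:
            best_loss = val_loss                # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # no improvement for `patience` epochs
    return best_loss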
11) Learning rate schedules
Reduce learning rate during training:
- Step decay: lr = lr * 0.1 every N epochs
- Exponential decay: lr = lr * decay^epoch
- Cosine annealing: lr follows a cosine curve
- Warmup: Start small, increase, then decrease
# Warmup + decay
if epoch < warmup_epochs:
    lr = initial_lr * epoch / warmup_epochs
else:
    lr = initial_lr * decay ** (epoch - warmup_epochs)
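A cosine-annealing schedule can be written the same way; the function below is a sketch (epoch counts from 0, and min_lr is an assumed floor):

import math

def cosine_lr(epoch, total_epochs, initial_lr, min_lr=0.0):
    # lr follows half a cosine: initial_lr at epoch 0 down to min_lr at the last epoch
    progress = epoch / max(1, total_epochs - 1)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 100, 0.001), cosine_lr(99, 100, 0.001))   # ~0.001 ... ~0.0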
12) Local minima and saddle points
For neural networks:
- Local minima: Rarely a problem; most local minima are good
- Saddle points: More problematic; gradient is zero but not a minimum
- Plateaus: Flat regions slow down learning
Momentum and Adam help escape these.
13) Convex vs non-convex
Convex functions: Any local minimum is a global minimum; gradient descent with a suitable learning rate finds it
- Linear regression with MSE is convex
Non-convex functions: Multiple local minima
- Neural networks are non-convex
- We find a good minimum, not necessarily the best
Good news: Random initialization + SGD noise often find good solutions.
14) Practical tips
- Start with Adam: lr=0.001 is a good default
- Monitor training: Plot loss curves
- Watch for divergence: If loss explodes, reduce lr
- Try a learning rate finder: Sweep lr and pick where loss drops fastest (see the sketch after this list)
- Use early stopping: Prevent overfitting
- Normalize inputs: Helps optimization converge faster
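The learning rate finder idea can be approximated with a crude sweep like the one below (a simplification of the usual LR range test, which increases lr within a single run; here we just compare a few fixed values on the quadratic loss from earlier):

import numpy as np

def lr_finder(loss_fn, grad_fn, theta0, lrs, n_steps=20):
    # Run a few gradient steps from the same start for each candidate lr and
    # record the final loss; favor the largest lr whose loss still drops sharply.
    results = {}
    for lr in lrs:
        theta = np.array(theta0, dtype=float)
        for _ in range(n_steps):
            theta = theta - lr * grad_fn(theta)
        results[lr] = float(loss_fn(theta))
    return results

print(lr_finder(lambda t: np.sum(t ** 2), lambda t: 2 * t, [3.0, -4.0],
                lrs=[1e-4, 1e-3, 1e-2, 1e-1, 1.0]))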
Key takeaways
- Gradient descent: θ = θ - lr × ∇L(θ)
- Learning rate is critical: not too big, not too small
- Mini-batch SGD balances efficiency and stability
- Momentum smooths oscillations, builds speed
- Adam combines momentum + adaptive lr; use it by default
- Learning rate schedules improve final performance
- Non-convex landscapes are OK; SGD finds good solutions