Calculus for Machine Learning
Lesson, slides, and applied problem sets.
Why this module exists
Machine learning is optimization. We have a function (loss) that measures how wrong our model is, and we want to minimize it. Calculus gives us the tools to find the direction of improvement: gradients.
You don't need to be a calculus expert. You need to understand what derivatives and gradients mean, and how they guide learning.
1) The derivative: Rate of change
The derivative of f(x) at point x tells you how fast f changes when you nudge x:
f'(x) = lim[h→0] (f(x+h) - f(x)) / h
Intuition: the slope of the tangent line at x.
- Positive derivative: the function is increasing
- Negative derivative: the function is decreasing
- Zero derivative: a local minimum, maximum, or saddle point
2) Common derivatives
These come up constantly in ML:
| Function | Derivative |
|---|---|
| x^n | n * x^(n-1) |
| e^x | e^x |
| ln(x) | 1/x |
| sin(x) | cos(x) |
| sigmoid(x) | sigmoid(x) * (1 - sigmoid(x)) |
The sigmoid derivative is special: it can be computed from the output itself.
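For example, here's a small NumPy sketch (the names are just illustrative) that computes the sigmoid derivative from the output alone and checks it with a finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.5
s = sigmoid(x)
analytic = s * (1 - s)  # derivative computed from the output s alone

h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
print(analytic, numeric)  # both ≈ 0.235
```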
3) Derivative rules
Sum Rule
(f + g)' = f' + g'
Product Rule
(f * g)' = f' * g + f * g'
Chain Rule (the most important!)
(f(g(x)))' = f'(g(x)) * g'(x)
The chain rule is how backpropagation works. Complex functions are compositions; gradients flow backward through the chain.
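As a quick sanity check (a sketch, not part of the lesson's code), applying the chain rule to sin(x²) gives cos(x²) · 2x, which a finite difference confirms:

```python
import numpy as np

x = 1.3
analytic = np.cos(x**2) * 2 * x  # f'(g(x)) * g'(x) with f = sin, g(x) = x**2

h = 1e-6
numeric = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)
print(analytic, numeric)  # both ≈ -0.309
```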
4) Partial derivatives
When a function has multiple inputs, the partial derivative with respect to one variable treats others as constants:
f(x, y) = x² + xy + y²
∂f/∂x = 2x + y (treat y as constant)
∂f/∂y = x + 2y (treat x as constant)
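For example, at (x, y) = (1, 2): ∂f/∂x = 2(1) + 2 = 4 and ∂f/∂y = 1 + 2(2) = 5.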
5) The gradient: Direction of steepest ascent
The gradient is a vector of all partial derivatives:
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
The gradient points in the direction of steepest increase.
To minimize a function, go in the opposite direction of the gradient. This is gradient descent.
6) Gradient descent intuition
Imagine you're blindfolded on a hilly landscape. To find the lowest point:
- Feel the slope under your feet (compute gradient)
- Take a step downhill (negative gradient direction)
- Repeat until flat (gradient ≈ 0)
```python
# Gradient descent update
x = x - learning_rate * gradient
```
The learning rate controls step size:
- Too large: overshoot, oscillate, diverge
- Too small: slow convergence
- Just right: steady progress to minimum
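Putting the update rule and the learning rate together, here is a minimal sketch that minimizes f(x) = x² (the starting point and learning_rate value are arbitrary choices):

```python
def grad_f(x):
    return 2 * x  # derivative of f(x) = x**2

x = 5.0               # start away from the minimum at x = 0
learning_rate = 0.1

for step in range(50):
    x = x - learning_rate * grad_f(x)  # gradient descent update

print(x)  # ≈ 0: each step multiplies x by (1 - 2 * learning_rate) = 0.8
```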
7) Computing gradients numerically
You can approximate gradients without deriving them analytically by using finite differences:
```python
def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at the point x."""
    grad = []
    for i in range(len(x)):
        # Nudge the i-th coordinate up and down by h, keeping the others fixed
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad
```
This is slow but useful for:
- Debugging analytical gradients
- Functions without closed-form derivatives
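For example, checking it against the analytical gradient of f(x, y) = x² + xy + y² from section 4 (the test point (1, 2) is arbitrary):

```python
def f(v):
    x, y = v
    return x**2 + x*y + y**2

print(numerical_gradient(f, [1.0, 2.0]))  # ≈ [4.0, 5.0], matching [2x + y, x + 2y]
```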
8) The chain rule in neural networks
Consider a simple network: input → hidden → output → loss
loss = L(output)
output = f(hidden)
hidden = g(input)
To update weights in g, we need ∂loss/∂weights_g:
∂loss/∂weights_g = ∂loss/∂output * ∂output/∂hidden * ∂hidden/∂weights_g
This is backpropagation: gradients flow backward through the chain.
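Here is a minimal scalar sketch of that product of local derivatives (the tanh activation, squared-error loss, and variable names are illustrative assumptions, not a fixed recipe):

```python
import numpy as np

# Tiny chain: hidden = w_g * x, output = tanh(hidden), loss = (output - target)**2
w_g, x, target = 0.5, 2.0, 1.0

# Forward pass
hidden = w_g * x
output = np.tanh(hidden)
loss = (output - target) ** 2

# Backward pass: multiply the local derivatives along the chain
d_loss_d_output = 2 * (output - target)        # ∂loss/∂output
d_output_d_hidden = 1 - np.tanh(hidden) ** 2   # ∂output/∂hidden
d_hidden_d_wg = x                              # ∂hidden/∂w_g

d_loss_d_wg = d_loss_d_output * d_output_d_hidden * d_hidden_d_wg
print(d_loss_d_wg)  # the gradient used to update w_g
```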
9) Convexity matters
A function is convex if the line segment between any two points on its graph lies on or above the graph. For a convex function, any local minimum is a global minimum.
Good news: Linear regression loss (MSE) is convex. Bad news: Neural network losses are non-convex (many local minima).
For non-convex functions, gradient descent finds a minimum, not necessarily the minimum. That's okay in practice.
10) Jacobians and Hessians (advanced)
The Jacobian is the matrix of all first partial derivatives for vector-valued functions. It describes how a transformation stretches/rotates locally.
The Hessian is the matrix of second partial derivatives. It describes curvature. At a critical point (where the gradient is zero):
- Positive definite Hessian: local minimum
- Negative definite: local maximum
- Indefinite (mixed positive and negative eigenvalues): saddle point
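As a sketch (reusing f(x, y) = x² + xy + y² from section 4), the Hessian is constant, and its eigenvalues show it is positive definite, so the single critical point is a global minimum:

```python
import numpy as np

# Hessian of f(x, y) = x**2 + x*y + y**2
H = np.array([[2.0, 1.0],   # ∂²f/∂x², ∂²f/∂x∂y
              [1.0, 2.0]])  # ∂²f/∂y∂x, ∂²f/∂y²

print(np.linalg.eigvalsh(H))  # [1. 3.] -- all positive, so positive definite
```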
Practical tips
- Check gradients numerically: Always verify analytical gradients with finite differences during development.
- Gradient clipping: If gradients explode (become huge), clip them to a max value.
- Watch for vanishing gradients: Deep networks can have gradients that shrink to zero. Activations and architectures matter.
- Automatic differentiation: Modern frameworks (PyTorch, TensorFlow) compute gradients automatically. Understand what they do, but let them do it.
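For example, a minimal PyTorch sketch (assuming torch is installed) that recovers the gradient of the section 4 function automatically:

```python
import torch

v = torch.tensor([1.0, 2.0], requires_grad=True)
f = v[0]**2 + v[0]*v[1] + v[1]**2  # f(x, y) = x^2 + xy + y^2
f.backward()                       # autodiff applies the chain rule for us
print(v.grad)                      # tensor([4., 5.]) -- matches [2x + y, x + 2y]
```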
Key takeaways
- Derivatives measure rate of change; gradients extend this to multiple dimensions
- The gradient points toward steepest increase; negate it to descend
- Chain rule is the foundation of backpropagation
- Numerical gradients are slow but useful for debugging
- Learning rate is critical: not too big, not too small