Calculus for Machine Learning

Lesson, slides, and applied problem sets.


Why this module exists

Machine learning is optimization. We have a function (loss) that measures how wrong our model is, and we want to minimize it. Calculus gives us the tools to find the direction of improvement: gradients.

You don't need to be a calculus expert. You need to understand what derivatives and gradients mean, and how they guide learning.


1) The derivative: Rate of change

The derivative of f(x) at point x tells you how fast f changes when you nudge x:

f'(x) = lim[h→0] (f(x+h) - f(x)) / h

Intuition: the slope of the tangent line at x.

  • Positive derivative: function is increasing
  • Negative derivative: function is decreasing
  • Zero derivative: local minimum, maximum, or saddle point
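
A quick numerical sanity check of this definition (a sketch; the test function f(x) = x² and the step sizes are arbitrary choices):

# Approximate f'(x) with the difference quotient (f(x+h) - f(x)) / h for shrinking h.
def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h

f = lambda x: x**2                               # f(x) = x^2, so f'(x) = 2x
for h in (1e-1, 1e-3, 1e-5):
    print(h, difference_quotient(f, 3.0, h))     # approaches f'(3) = 6 as h shrinks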


2) Common derivatives

These come up constantly in ML:

Function        Derivative
x^n             n * x^(n-1)
e^x             e^x
ln(x)           1/x
sin(x)          cos(x)
sigmoid(x)      sigmoid(x) * (1 - sigmoid(x))

The sigmoid derivative is special: it can be computed from the output itself.
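
To see that identity in action, here is a small sketch comparing it against a central-difference estimate (the test points and step size are arbitrary):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                          # derivative computed from the output alone

h = 1e-5
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, sigmoid_grad(x), numeric)            # the two estimates should agree closely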


3) Derivative rules

Sum Rule

(f + g)' = f' + g'

Product Rule

(f * g)' = f' * g + f * g'

Chain Rule (the most important!)

(f(g(x)))' = f'(g(x)) * g'(x)

The chain rule is how backpropagation works. Complex functions are compositions; gradients flow backward through the chain.
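
A minimal sketch of the chain rule on the composition sin(x²) (the example function and evaluation point are arbitrary choices):

import math

# f(u) = sin(u), g(x) = x^2, so (f(g(x)))' = cos(x^2) * 2x
def composed(x):
    return math.sin(x**2)

def composed_grad(x):
    return math.cos(x**2) * 2 * x                 # f'(g(x)) * g'(x)

x, h = 1.3, 1e-6
numeric = (composed(x + h) - composed(x - h)) / (2 * h)
print(composed_grad(x), numeric)                  # should match to several decimal places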


4) Partial derivatives

When a function has multiple inputs, the partial derivative with respect to one variable treats others as constants:

f(x, y) = x² + xy + y²

∂f/∂x = 2x + y    (treat y as constant)
∂f/∂y = x + 2y    (treat x as constant)
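
The same example as code, checked against finite differences (a sketch; the evaluation point is arbitrary):

# f(x, y) = x^2 + x*y + y^2
def f(x, y):
    return x**2 + x*y + y**2

def df_dx(x, y):
    return 2*x + y                                # treat y as a constant

def df_dy(x, y):
    return x + 2*y                                # treat x as a constant

x, y, h = 1.0, 2.0, 1e-5
print(df_dx(x, y), (f(x + h, y) - f(x - h, y)) / (2 * h))   # both ≈ 4
print(df_dy(x, y), (f(x, y + h) - f(x, y - h)) / (2 * h))   # both ≈ 5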

5) The gradient: Direction of steepest ascent

The gradient is a vector of all partial derivatives:

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

The gradient points in the direction of steepest increase.

To minimize a function, go in the opposite direction of the gradient. This is gradient descent.
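
A tiny numeric check of that claim (a sketch; the function f(x, y) = x² + y² and the point (3, 4) are arbitrary choices):

# For f(x, y) = x^2 + y^2, the gradient is [2x, 2y].
def f(x, y):
    return x**2 + y**2

x, y = 3.0, 4.0
grad = [2*x, 2*y]                                 # ∇f at (3, 4)
step = 0.01
downhill = f(x - step*grad[0], y - step*grad[1])
uphill   = f(x + step*grad[0], y + step*grad[1])
print(f(x, y), downhill, uphill)                  # stepping along -∇f decreases f; along +∇f increases it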


6) Gradient descent intuition

Imagine you're blindfolded on a hilly landscape. To find the lowest point:

  1. Feel the slope under your feet (compute gradient)
  2. Take a step downhill (negative gradient direction)
  3. Repeat until flat (gradient ≈ 0)
# Gradient descent update
x = x - learning_rate * gradient

The learning rate controls step size:

  • Too large: overshoot, oscillate, diverge
  • Too small: slow convergence
  • Just right: steady progress to minimum
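
A compact sketch of the full loop on f(x) = x², whose gradient is 2x (the starting point, step count, and learning rates are illustrative choices):

# Plain gradient descent on f(x) = x^2.
def gradient_descent(learning_rate, x=5.0, steps=20):
    for _ in range(steps):
        gradient = 2 * x                          # analytic gradient of x^2
        x = x - learning_rate * gradient          # step downhill
    return x

for lr in (1.5, 1e-4, 0.1):                       # too large, too small, reasonable
    print(lr, gradient_descent(lr))               # diverges, barely moves, ends near 0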

7) Computing gradients numerically

You can approximate gradients without calculus using finite differences:

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x (x is a list of floats)."""
    grad = []
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h                            # nudge coordinate i up
        x_minus = x.copy()
        x_minus[i] -= h                           # nudge coordinate i down
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

This is slow but useful for:

  • Debugging analytical gradients
  • Functions without closed-form derivatives
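
For example, applied to f(x, y) = x² + xy + y² from section 4 (assuming x is passed as a plain Python list):

f = lambda v: v[0]**2 + v[0]*v[1] + v[1]**2
print(numerical_gradient(f, [1.0, 2.0]))          # ≈ [4.0, 5.0], matching 2x + y and x + 2y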

8) The chain rule in neural networks

Consider a simple network: input → hidden → output → loss

loss = L(output)
output = f(hidden)
hidden = g(input)

To update weights in g, we need ∂loss/∂weights_g:

∂loss/∂weights_g = ∂loss/∂output * ∂output/∂hidden * ∂hidden/∂weights_g

This is backpropagation: gradients flow backward through the chain.
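
A minimal scalar sketch of that product of derivatives, with made-up choices for L, f, and g (this is not a full network, just the chain rule applied twice):

import math

w, x_in = 0.5, 2.0                                # weight inside g and the input

hidden = w * x_in                                 # g(input) = w * input
output = math.tanh(hidden)                        # f(hidden) = tanh(hidden)
loss = output**2                                  # L(output) = output^2

dloss_doutput = 2 * output                        # ∂loss/∂output
doutput_dhidden = 1 - output**2                   # ∂output/∂hidden (tanh')
dhidden_dw = x_in                                 # ∂hidden/∂w

dloss_dw = dloss_doutput * doutput_dhidden * dhidden_dw   # chain rule, applied right to left
print(dloss_dw)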


9) Convexity matters

A function is convex if the line segment between any two points on its graph lies on or above the graph. For convex functions, every local minimum is also a global minimum.

Good news: Linear regression loss (MSE) is convex. Bad news: Neural network losses are non-convex (many local minima).

For non-convex functions, gradient descent finds a minimum, not necessarily the minimum. That's okay in practice.
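
As a small check of the first claim (a sketch with made-up data): for a one-parameter model y ≈ w·x, the MSE loss is a parabola in w, so its second derivative is the nonnegative constant 2·mean(x²).

# MSE for y_hat = w * x on toy data: L(w) = mean((w*x - y)^2), a parabola in w.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]                              # made-up data

def mse(w):
    return sum((w * x - y)**2 for x, y in zip(xs, ys)) / len(xs)

h, w = 1e-3, 0.7                                  # arbitrary point
second_derivative = (mse(w + h) - 2 * mse(w) + mse(w - h)) / h**2
print(second_derivative, 2 * sum(x*x for x in xs) / len(xs))   # both ≈ 9.33, and always >= 0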


10) Jacobians and Hessians (advanced)

The Jacobian is the matrix of all first partial derivatives for vector-valued functions. It describes how a transformation stretches/rotates locally.

The Hessian is the matrix of second partial derivatives. It describes curvature:

  • Positive definite Hessian: local minimum
  • Negative definite Hessian: local maximum
  • Indefinite (mixed-sign eigenvalues): saddle point
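
A brief sketch using NumPy (assumed available) that classifies the critical point at the origin for two quadratics by the signs of the Hessian's eigenvalues:

import numpy as np

# f(x, y) = x^2 + y^2 has Hessian [[2, 0], [0, 2]]; f(x, y) = x^2 - y^2 has [[2, 0], [0, -2]].
bowl = np.array([[2.0, 0.0], [0.0, 2.0]])
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])

print(np.linalg.eigvalsh(bowl))                   # [2. 2.]  -> all positive: local minimum
print(np.linalg.eigvalsh(saddle))                 # [-2. 2.] -> mixed signs: saddle point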

Practical tips

  1. Check gradients numerically: Always verify analytical gradients with finite differences during development.
  2. Gradient clipping: If gradients explode (become huge), clip them to a maximum value or norm (see the sketch after this list).
  3. Watch for vanishing gradients: Deep networks can have gradients that shrink to zero. Activations and architectures matter.
  4. Automatic differentiation: Modern frameworks (PyTorch, TensorFlow) compute gradients automatically. Understand what they do, but let them do it.
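
A minimal sketch of clipping by global norm, one common variant (the threshold and gradient values are arbitrary; real frameworks provide their own clipping utilities):

import numpy as np

def clip_by_norm(gradient, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm, keeping its direction."""
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        gradient = gradient * (max_norm / norm)
    return gradient

g = np.array([30.0, -40.0])                       # an "exploding" gradient with norm 50
print(clip_by_norm(g, max_norm=5.0))              # [ 3. -4.] -> norm 5, same direction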

Key takeaways

  • Derivatives measure rate of change; gradients extend this to multiple dimensions
  • The gradient points toward steepest increase; negate it to descend
  • Chain rule is the foundation of backpropagation
  • Numerical gradients are slow but useful for debugging
  • Learning rate is critical: not too big, not too small
