Calculus for Machine Learning
Lesson, slides, and applied problem sets.
Why this module exists
Machine learning is optimization. We have a function (loss) that measures how wrong our model is, and we want to minimize it. Calculus gives us the tools to find the direction of improvement: gradients.
You don't need to be a calculus expert. You need to understand what derivatives and gradients mean, and how they guide learning.
1) The derivative: Rate of change
The derivative of f(x) at point x tells you how fast f changes when you nudge x:
f'(x) = lim[h→0] (f(x+h) - f(x)) / h
Intuition: the slope of the tangent line at x.
- Positive derivative: the function is increasing
- Negative derivative: the function is decreasing
- Zero derivative: a local minimum, maximum, or saddle point
2) Common derivatives
These come up constantly in ML:
| Function | Derivative |
|---|---|
| x^n | n * x^(n-1) |
| e^x | e^x |
| ln(x) | 1/x |
| sin(x) | cos(x) |
| sigmoid(x) | sigmoid(x) * (1 - sigmoid(x)) |
The sigmoid derivative is special: it can be computed from the output itself.
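For example, here's a small NumPy sketch (the names are just illustrative) that computes the sigmoid derivative from the output alone and checks it with a finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.5
s = sigmoid(x)
analytic = s * (1 - s)  # derivative computed from the output s alone

h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
print(analytic, numeric)  # both ≈ 0.235
```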
3) Derivative rules
Sum Rule
(f + g)' = f' + g'
Product Rule
(f * g)' = f' * g + f * g'
Chain Rule (the most important!)
(f(g(x)))' = f'(g(x)) * g'(x)
The chain rule is how backpropagation works. Complex functions are compositions; gradients flow backward through the chain.
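As a quick sanity check (a sketch, not part of the lesson's code), applying the chain rule to sin(x²) gives cos(x²) · 2x, which a finite difference confirms:

```python
import numpy as np

x = 1.3
analytic = np.cos(x**2) * 2 * x  # f'(g(x)) * g'(x) with f = sin, g(x) = x**2

h = 1e-6
numeric = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)
print(analytic, numeric)  # both ≈ -0.309
```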
4) Partial derivatives
When a function has multiple inputs, the partial derivative with respect to one variable treats others as constants:
f(x, y) = x² + xy + y²
∂f/∂x = 2x + y (treat y as constant)
∂f/∂y = x + 2y (treat x as constant)
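For example, at (x, y) = (1, 2): ∂f/∂x = 2(1) + 2 = 4 and ∂f/∂y = 1 + 2(2) = 5.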
5) The gradient: Direction of steepest ascent
The gradient is a vector of all partial derivatives:
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
The gradient points in the direction of steepest increase.
To minimize a function, go in the opposite direction of the gradient. This is gradient descent.
6) Gradient descent intuition
Imagine you're blindfolded on a hilly landscape. To find the lowest point:
- Feel the slope under your feet (compute gradient)
- Take a step downhill (negative gradient direction)
- Repeat until flat (gradient ≈ 0)
```python
# Gradient descent update
x = x - learning_rate * gradient
```
The learning rate controls step size:
- Too large: overshoot, oscillate, diverge
- Too small: slow convergence
- Just right: steady progress to minimum
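Putting the update rule and the learning rate together, here is a minimal sketch that minimizes f(x) = x² (the starting point and learning_rate value are arbitrary choices):

```python
def grad_f(x):
    return 2 * x  # derivative of f(x) = x**2

x = 5.0               # start away from the minimum at x = 0
learning_rate = 0.1

for step in range(50):
    x = x - learning_rate * grad_f(x)  # gradient descent update

print(x)  # ≈ 0: each step multiplies x by (1 - 2 * learning_rate) = 0.8
```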
7) Computing gradients numerically
You can approximate gradients without deriving them analytically by using finite differences:
```python
def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at the point x."""
    grad = []
    for i in range(len(x)):
        # Nudge the i-th coordinate up and down by h, keeping the others fixed
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad
```
This is slow but useful for:
- Debugging analytical gradients
- Functions without closed-form derivatives
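For example, checking it against the analytical gradient of f(x, y) = x² + xy + y² from section 4 (the test point (1, 2) is arbitrary):

```python
def f(v):
    x, y = v
    return x**2 + x*y + y**2

print(numerical_gradient(f, [1.0, 2.0]))  # ≈ [4.0, 5.0], matching [2x + y, x + 2y]
```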
8) The chain rule in neural networks
Consider a simple network: input → hidden → output → loss
loss = L(output)
output = f(hidden)
hidden = g(input)
To update weights in g, we need ∂loss/∂weights_g:
∂loss/∂weights_g = ∂loss/∂output * ∂output/∂hidden * ∂hidden/∂weights_g
This is backpropagation: gradients flow backward through the chain.
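Here is a minimal scalar sketch of that product of local derivatives (the tanh activation, squared-error loss, and variable names are illustrative assumptions, not a fixed recipe):

```python
import numpy as np

# Tiny chain: hidden = w_g * x, output = tanh(hidden), loss = (output - target)**2
w_g, x, target = 0.5, 2.0, 1.0

# Forward pass
hidden = w_g * x
output = np.tanh(hidden)
loss = (output - target) ** 2

# Backward pass: multiply the local derivatives along the chain
d_loss_d_output = 2 * (output - target)        # ∂loss/∂output
d_output_d_hidden = 1 - np.tanh(hidden) ** 2   # ∂output/∂hidden
d_hidden_d_wg = x                              # ∂hidden/∂w_g

d_loss_d_wg = d_loss_d_output * d_output_d_hidden * d_hidden_d_wg
print(d_loss_d_wg)  # the gradient used to update w_g
```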
9) Convexity matters
A function is convex if the line segment between any two points on its graph lies on or above the graph. For a convex function, any local minimum is a global minimum.
Good news: Linear regression loss (MSE) is convex. Bad news: Neural network losses are non-convex (many local minima).
For non-convex functions, gradient descent finds a minimum, not necessarily the minimum. That's okay in practice.
10) Jacobians and Hessians (advanced)
The Jacobian is the matrix of all first partial derivatives for vector-valued functions. It describes how a transformation stretches/rotates locally.
The Hessian is the matrix of second partial derivatives. It describes curvature. At a critical point (where the gradient is zero):
- Positive definite Hessian: local minimum
- Negative definite: local maximum
- Indefinite (mixed positive and negative eigenvalues): saddle point
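As a sketch (reusing f(x, y) = x² + xy + y² from section 4), the Hessian is constant, and its eigenvalues show it is positive definite, so the single critical point is a global minimum:

```python
import numpy as np

# Hessian of f(x, y) = x**2 + x*y + y**2
H = np.array([[2.0, 1.0],   # ∂²f/∂x², ∂²f/∂x∂y
              [1.0, 2.0]])  # ∂²f/∂y∂x, ∂²f/∂y²

print(np.linalg.eigvalsh(H))  # [1. 3.] -- all positive, so positive definite
```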
Practical tips
- Check gradients numerically: Always verify analytical gradients with finite differences during development.
- Gradient clipping: If gradients explode (become huge), clip them to a max value.
- Watch for vanishing gradients: Deep networks can have gradients that shrink to zero. Activations and architectures matter.
- Automatic differentiation: Modern frameworks (PyTorch, TensorFlow) compute gradients automatically. Understand what they do, but let them do it.
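For example, a minimal PyTorch sketch (assuming torch is installed) that recovers the gradient of the section 4 function automatically:

```python
import torch

v = torch.tensor([1.0, 2.0], requires_grad=True)
f = v[0]**2 + v[0]*v[1] + v[1]**2  # f(x, y) = x^2 + xy + y^2
f.backward()                       # autodiff applies the chain rule for us
print(v.grad)                      # tensor([4., 5.]) -- matches [2x + y, x + 2y]
```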
Key takeaways
- Derivatives measure rate of change; gradients extend this to multiple dimensions
- The gradient points toward steepest increase; negate it to descend
- Chain rule is the foundation of backpropagation
- Numerical gradients are slow but useful for debugging
- Learning rate is critical: not too big, not too small