Autograd: Building an Automatic Differentiation Engine

Lesson, slides, and applied problem sets.


Lesson


Goal

Build a minimal scalar autograd engine (micrograd-style). The forward pass builds a DAG of Value nodes; the backward pass applies the chain rule to compute dL/dx for every node.

Prerequisites: ML Foundations pack.


1) The Value node (scalar)

A Value stores:

  • data: float
  • grad: float, accumulator for dL/d(this)
  • _prev: set of parent Values
  • _op: debug label (e.g., "+", "*")
  • _backward: closure that pushes out.grad into parents

All later "tensors" are just nested lists of Value.
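
A minimal constructor, as a sketch (the parameter names _children and _op follow the micrograd convention and are assumptions, not fixed by this lesson):

class Value:
    """A scalar node in the computation graph."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0                  # accumulator for dL/d(this)
        self._prev = set(_children)      # parent Values that produced this node
        self._op = _op                   # debug label, e.g. "+" or "*"
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"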


2) Local derivatives on the forward pass

Each op constructs a new Value and records how gradients flow to its parents.

# a + b
out = Value(a.data + b.data, (a, b), "+")

def _backward():
    a.grad += 1.0 * out.grad
    b.grad += 1.0 * out.grad
out._backward = _backward

# a * b
out = Value(a.data * b.data, (a, b), "*")

def _backward():
    a.grad += b.data * out.grad
    b.grad += a.data * out.grad
out._backward = _backward

Note the +=: a node can feed multiple downstream paths, so each path's gradient contribution must be accumulated rather than overwritten.
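
To see why, a small sketch (assuming the completed Value class from this lesson, with __mul__ and backward implemented): a value that feeds two paths must sum both contributions; plain assignment would keep only the last one.

a = Value(3.0)
b = a * a          # a enters the multiply through both operands
b.backward()

print(a.grad)      # 6.0 == 2*a.data; with "=" instead of "+=" this would wrongly be 3.0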


3) Supported ops (core)

Arithmetic:

  • __add__, __mul__, __pow__ (int/float exponents only), __neg__, __sub__, __truediv__
  • reverse ops: __radd__, __rmul__, __rsub__, __rtruediv__

Nonlinear:

  • tanh, relu, exp, log (defined only for data > 0)

All ops should accept a Value or a Python number; numbers are wrapped as Value.
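
A sketch of how number wrapping and a nonlinear op fit together (the constructor is repeated in compressed form so the snippet stands alone; tanh caches its output t so the local derivative 1 - t**2 can be reused):

import math

class Value:
    def __init__(self, data, _children=(), _op=""):
        self.data, self.grad = data, 0.0
        self._prev, self._op = set(_children), _op
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)  # wrap plain numbers
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    __radd__ = __add__  # so 2 + v works as well as v + 2

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")

        def _backward():
            self.grad += (1.0 - t ** 2) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out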


4) Backward pass (reverse topological order)

A node may run its _backward only after its own grad is complete, i.e., after every node that consumes it has already pushed its contribution. Build a topological order with a DFS, then traverse it in reverse.

def backward(self):
    topo = []
    visited = set()

    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)

    build(self)
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()

self.grad = 1.0 implements dL/dL = 1 for the loss node.
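
Assuming the completed engine (with __mul__, __add__, number wrapping, and tanh from section 3), a quick end-to-end check of the ordering:

x = Value(0.5)
y = (x * 2.0 + 1.0).tanh()   # y = tanh(2x + 1)
y.backward()

print(y.data)   # tanh(2.0) ≈ 0.9640
print(x.grad)   # 2 * (1 - tanh(2.0)**2) ≈ 0.1413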


5) Gradient accumulation

grad accumulates by design. For iterative optimization, zero grads before each new backward:

for p in params:
    p.grad = 0.0
loss.backward()
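
In a full optimization step this becomes (a sketch; params, compute_loss, and the learning rate lr are placeholders for whatever your model defines, not part of the lesson code):

lr = 0.01                            # hypothetical learning rate

for step in range(100):
    loss = compute_loss(params)      # placeholder: forward pass builds a fresh graph each step
    for p in params:
        p.grad = 0.0                 # clear stale gradients from the previous step
    loss.backward()                  # fill p.grad in reverse topological order
    for p in params:
        p.data -= lr * p.grad        # plain gradient descent update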

6) Sanity check

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)

f = (a * b + c) ** 2
f.backward()

# a*b + c = -6 + 10 = 4
# df/da = 2*(a*b + c) * b = 2*4*(-3) = -24
# df/db = 2*(a*b + c) * a = 2*4*2   = 16
# df/dc = 2*(a*b + c) * 1 = 2*4     = 8
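
A finite-difference check is a useful complement to the hand-derived values; this sketch (eps and the helper f_of are assumptions, not part of the lesson code) compares df/da against a central difference:

def f_of(av, bv, cv):
    # same expression as above, evaluated on plain floats
    return (av * bv + cv) ** 2

eps = 1e-6
numeric = (f_of(2.0 + eps, -3.0, 10.0) - f_of(2.0 - eps, -3.0, 10.0)) / (2 * eps)
print(numeric)   # ≈ -24.0, matching a.grad from the engine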

Key takeaways

  1. Autograd is just a DAG of scalar nodes plus local derivatives.
  2. The forward pass builds the graph; the backward pass applies the chain rule.
  3. Reverse topological order is required for correctness.
  4. Gradients accumulate; zero them explicitly between steps.

Next: build Module, Linear, and Sequential to compose networks cleanly.

