Autograd: Building an Automatic Differentiation Engine

Lesson, slides, and applied problem sets.


Lesson


Goal

Build a minimal scalar autograd engine (micrograd-style). The forward pass builds a DAG of Value nodes; the backward pass applies the chain rule to compute dL/dx for every node.

Prerequisites: ML Foundations pack.


1) The Value node (scalar)

A Value stores:

  • data: float
  • grad: float, accumulator for dL/d(this)
  • _prev: set of parent Values
  • _op: debug label (e.g., "+", "*")
  • _backward: closure that pushes out.grad into parents

All later "tensors" are just nested lists of Value.
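
A minimal constructor, as a sketch (the parameter names _children and _op follow the micrograd convention and are assumptions, not fixed by this lesson):

class Value:
    """A scalar node in the computation graph."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0                  # accumulator for dL/d(this)
        self._prev = set(_children)      # parent Values that produced this node
        self._op = _op                   # debug label, e.g. "+" or "*"
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"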


2) Local derivatives on the forward pass

Each op constructs a new Value and records how gradients flow to its parents.

# a + b
out = Value(a.data + b.data, (a, b), "+")

def _backward():
    a.grad += 1.0 * out.grad
    b.grad += 1.0 * out.grad
out._backward = _backward

# a * b
out = Value(a.data * b.data, (a, b), "*")

def _backward():
    a.grad += b.data * out.grad
    b.grad += a.data * out.grad
out._backward = _backward

Note the +=: a node can feed multiple downstream paths, so each path's gradient contribution must be accumulated rather than overwritten.
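
To see why, a small sketch (assuming the completed Value class from this lesson, with __mul__ and backward implemented): a value that feeds two paths must sum both contributions; plain assignment would keep only the last one.

a = Value(3.0)
b = a * a          # a enters the multiply through both operands
b.backward()

print(a.grad)      # 6.0 == 2*a.data; with "=" instead of "+=" this would wrongly be 3.0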


3) Supported ops (core)

Arithmetic:

  • __add__, __mul__, __pow__ (int/float exponents only), __neg__, __sub__, __truediv__
  • reverse ops: __radd__, __rmul__, __rsub__, __rtruediv__

Nonlinear:

  • tanh, relu, exp, log (defined only for data > 0)

All ops should accept a Value or a Python number; numbers are wrapped as Value.
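
A sketch of how number wrapping and a nonlinear op fit together (the constructor is repeated in compressed form so the snippet stands alone; tanh caches its output t so the local derivative 1 - t**2 can be reused):

import math

class Value:
    def __init__(self, data, _children=(), _op=""):
        self.data, self.grad = data, 0.0
        self._prev, self._op = set(_children), _op
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)  # wrap plain numbers
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    __radd__ = __add__  # so 2 + v works as well as v + 2

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")

        def _backward():
            self.grad += (1.0 - t ** 2) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out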


4) Backward pass (reverse topological order)

A node may run its _backward only after its own grad is complete, i.e., after every node that consumes it has already pushed its contribution. Build a topological order with a DFS, then traverse it in reverse.

def backward(self):
    topo = []
    visited = set()

    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)

    build(self)
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()

self.grad = 1.0 implements dL/dL = 1 for the loss node.
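
Assuming the completed engine (with __mul__, __add__, number wrapping, and tanh from section 3), a quick end-to-end check of the ordering:

x = Value(0.5)
y = (x * 2.0 + 1.0).tanh()   # y = tanh(2x + 1)
y.backward()

print(y.data)   # tanh(2.0) ≈ 0.9640
print(x.grad)   # 2 * (1 - tanh(2.0)**2) ≈ 0.1413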


5) Gradient accumulation

grad accumulates by design. For iterative optimization, zero grads before each new backward:

for p in params:
    p.grad = 0.0
loss.backward()
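
In a full optimization step this becomes (a sketch; params, compute_loss, and the learning rate lr are placeholders for whatever your model defines, not part of the lesson code):

lr = 0.01                            # hypothetical learning rate

for step in range(100):
    loss = compute_loss(params)      # placeholder: forward pass builds a fresh graph each step
    for p in params:
        p.grad = 0.0                 # clear stale gradients from the previous step
    loss.backward()                  # fill p.grad in reverse topological order
    for p in params:
        p.data -= lr * p.grad        # plain gradient descent update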

6) Sanity check

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)

f = (a * b + c) ** 2
f.backward()

# a*b + c = -6 + 10 = 4
# df/da = 2*(a*b + c) * b = 2*4*(-3) = -24
# df/db = 2*(a*b + c) * a = 2*4*2   = 16
# df/dc = 2*(a*b + c) * 1 = 2*4     = 8
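
A finite-difference check is a useful complement to the hand-derived values; this sketch (eps and the helper f_of are assumptions, not part of the lesson code) compares df/da against a central difference:

def f_of(av, bv, cv):
    # same expression as above, evaluated on plain floats
    return (av * bv + cv) ** 2

eps = 1e-6
numeric = (f_of(2.0 + eps, -3.0, 10.0) - f_of(2.0 - eps, -3.0, 10.0)) / (2 * eps)
print(numeric)   # ≈ -24.0, matching a.grad from the engine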

Key takeaways

  1. Autograd is just a DAG of scalar nodes plus local derivatives.
  2. The forward pass builds the graph; the backward pass applies the chain rule.
  3. Reverse topological order is required for correctness.
  4. Gradients accumulate; zero them explicitly between steps.

Next: build Module, Linear, and Sequential to compose networks cleanly.

