Transformer Block

1 / 6

Pre-norm block

y = x + Attn(LN(x))
z = y + FFN(LN(y))

Input/output shape: (T, D), where T is the sequence length and D the model dimension.
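
A minimal PyTorch sketch of this block (illustrative names Block, attn, ffn; not the pack's code). The attention and FFN sublayers are passed in as any modules mapping (T, D) to (T, D):

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = attn   # sublayer: (T, D) -> (T, D)
        self.ffn = ffn     # sublayer: (T, D) -> (T, D)

    def forward(self, x):                # x: (T, D)
        y = x + self.attn(self.ln1(x))   # y = x + Attn(LN(x))
        z = y + self.ffn(self.ln2(y))    # z = y + FFN(LN(y))
        return z                         # (T, D)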

2 / 6

LayerNorm

Normalize per position across features:

(x - mean) / sqrt(var + eps)

Then scale/shift with gamma, beta.
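
The same computation written out by hand, as a sketch (layer_norm is an illustrative name; eps and the biased variance match PyTorch's nn.LayerNorm defaults):

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (T, D); each position (row) is normalized across its D features
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta          # gamma, beta: learned (D,) scale and shift

Quick check against the built-in:

x = torch.randn(4, 8)
ln = torch.nn.LayerNorm(8)
print(torch.allclose(layer_norm(x, ln.weight, ln.bias), ln(x), atol=1e-6))  # True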

3 / 6

FFN

Two linear layers:

D -> 4D -> D

Activation: ReLU (GELU is common in GPTs).
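
A sketch of the position-wise D -> 4D -> D network in PyTorch (ffn is an illustrative name; GELU shown, swap in nn.ReLU() for the simpler variant):

import torch.nn as nn

def ffn(d_model):
    # applied independently to each of the T positions
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model),   # D -> 4D
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),   # 4D -> D
    )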

4 / 6

Residuals

x + sublayer(x): the identity path gives gradients a direct route back through deep stacks of blocks, which keeps them flowing during training.
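
A toy illustration of why (hypothetical example, not from the pack): even if the sublayer passes back no gradient at all, the skip path still delivers gradient directly to x.

import torch

x = torch.randn(3, requires_grad=True)
dead = lambda t: 0.0 * t        # sublayer whose gradient contribution is zero
(x + dead(x)).sum().backward()
print(x.grad)                   # tensor([1., 1., 1.]) -- the identity path alone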

5 / 6

Note

The pack's implementation uses single-head attention for simplicity.
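
A minimal single-head scaled dot-product attention in that spirit (a sketch with illustrative names, not the pack's actual code; causal masking left as a comment):

import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                         # x: (T, D)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.T / math.sqrt(k.size(-1))  # (T, T)
        # for GPT-style decoding, mask future positions here before the softmax
        weights = scores.softmax(dim=-1)
        return self.out(weights @ v)              # (T, D)

Combined with the earlier sketches, Block(64, SingleHeadAttention(64), ffn(64)) maps a (T, 64) input to a (T, 64) output.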
