The Transformer Block: Assembling the Pieces
Lesson, slides, and applied problem sets.
Goal
Assemble self-attention, layer normalization, feed-forward networks, and residuals into the core transformer block used by GPT-style models.
Prerequisites: Attention module.
1) Block structure (pre-norm)
For input x with shape (T, D):
y = x + Attn(LN(x))
z = y + FFN(LN(y))
This is the pre-norm layout used by GPT-2 and later models; it trains more stably than the original post-norm arrangement.
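A minimal sketch of this wiring, assuming ln1, attn, ln2, and ffn are callables that map a (T, D) array to a (T, D) array (the names and the NumPy layout are illustrative, not the pack's API):

```python
import numpy as np

def transformer_block(x, ln1, attn, ln2, ffn):
    # Pre-norm: normalize *before* each sublayer, then add the residual.
    y = x + attn(ln1(x))   # attention sublayer + residual
    z = y + ffn(ln2(y))    # feed-forward sublayer + residual
    return z

# Toy shape check with stand-in sublayers.
x = np.random.randn(5, 16)                      # (T, D)
out = transformer_block(x, ln1=lambda h: h, attn=lambda h: 0.1 * h,
                        ln2=lambda h: h, ffn=lambda h: 0.1 * h)
assert out.shape == (5, 16)
```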
2) LayerNorm (per position)
LayerNorm normalizes across the feature dimension for each position independently:
mu = mean(x)
var = mean((x - mu)^2)
ln(x) = (x - mu) / sqrt(var + eps)
output = gamma * ln(x) + beta
gamma and beta are learnable vectors of length D.
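A NumPy sketch of this per-position normalization, following the formulas above (the (T, D) array layout is an assumption for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (T, D). Statistics are taken over the feature axis, so each of the
    # T positions is normalized independently.
    mu = x.mean(axis=-1, keepdims=True)                  # (T, 1)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)   # (T, 1)
    ln = (x - mu) / np.sqrt(var + eps)
    return gamma * ln + beta                             # gamma, beta: (D,)

x = np.random.randn(5, 16)
out = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
assert out.shape == (5, 16)
```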
3) Feed-forward network (FFN)
Two linear layers with a nonlinearity:
FFN(x) = W2 * act(W1 * x + b1) + b2
The hidden dimension is conventionally 4D, i.e. four times the model width. In this pack, ReLU is fine (GELU is the usual choice in modern models).
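A NumPy sketch with a 4D hidden width and ReLU; note it uses the row-vector convention x @ W rather than W * x, and the weight shapes are illustrative:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # x: (T, D). The same two linear layers are applied to every position.
    h = np.maximum(0.0, x @ W1 + b1)   # (T, 4D), ReLU nonlinearity
    return h @ W2 + b2                 # back to (T, D)

D = 16
rng = np.random.default_rng(0)
W1, b1 = 0.02 * rng.normal(size=(D, 4 * D)), np.zeros(4 * D)
W2, b2 = 0.02 * rng.normal(size=(4 * D, D)), np.zeros(D)
out = ffn(rng.normal(size=(5, D)), W1, b1, W2, b2)
assert out.shape == (5, D)
```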
4) Residual connections
Residuals preserve information and improve gradient flow:
x -> x + sublayer(x)
Without residuals, deep transformers are much harder to train.
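A quick way to see the "preserve information" point: if a sublayer contributes almost nothing (as at small-weight initialization), the residual path still carries x through unchanged. A toy NumPy check, not the pack's code:

```python
import numpy as np

x = np.random.randn(4, 8)                # (T, D)
sublayer = lambda h: np.zeros_like(h)    # a sublayer contributing ~nothing
out = x + sublayer(x)                    # residual form: x -> x + sublayer(x)
assert np.allclose(out, x)               # the block starts near the identity
```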
5) Implementation notes (list-of-Value)
- Apply LayerNorm to each position: [ln(pos) for pos in x]
- SelfAttention returns (T, D)
- Residual sums are elementwise list additions (see the sketch below)
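A minimal sketch of that wiring, using nested Python lists of plain floats as a stand-in for the pack's list-of-Value tensors (ln1, ln2, attn, and ffn are assumed callables, not the pack's actual classes):

```python
def block_forward(x, ln1, attn, ln2, ffn):
    # x: list of T positions, each a list of D scalars.
    attended = attn([ln1(pos) for pos in x])         # LayerNorm per position, then attention -> (T, D)
    y = [[xi + ai for xi, ai in zip(x_pos, a_pos)]   # residual: elementwise list addition
         for x_pos, a_pos in zip(x, attended)]
    transformed = [ffn(ln2(pos)) for pos in y]       # per-position LayerNorm + FFN
    z = [[yi + ti for yi, ti in zip(y_pos, t_pos)]
         for y_pos, t_pos in zip(y, transformed)]
    return z

# Toy usage with identity-like stand-ins.
T, D = 3, 4
x = [[0.1 * (t + d) for d in range(D)] for t in range(T)]
out = block_forward(x, ln1=lambda p: p, attn=lambda xs: xs,
                    ln2=lambda p: p, ffn=lambda p: [0.5 * v for v in p])
assert len(out) == T and len(out[0]) == D
```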
6) Single-head vs multi-head
The standard transformer block uses multi-head attention, but the pack code uses a single-head SelfAttention for clarity. The block API still takes a num_heads argument for future extension.
Key takeaways
- TransformerBlock = pre-norm attention + pre-norm FFN with residuals.
- LayerNorm is per-position, across features.
- FFN adds per-position nonlinearity and capacity.
- Residuals are essential for deep training stability.
Next: combine blocks into a full GPT-style language model.