The Transformer Block: Assembling the Pieces


Lesson

The Transformer Block

Goal

Assemble self-attention, layer normalization, feed-forward networks, and residual connections into the core transformer block used by GPT-style models.

Prerequisites: Attention module.


1) Block structure (pre-norm)

For input x with shape (T, D), where T is the sequence length and D the model (embedding) dimension:

y = x + Attn(LN(x))
z = y + FFN(LN(y))

This is the pre-norm layout used in GPT-2 and later models: LayerNorm is applied before each sub-layer, which trains more stably than the original post-norm arrangement.
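
A minimal sketch of this wiring in Python (attn, ffn, ln1, and ln2 are placeholder callables standing in for the sub-layers, not the pack's classes):

def transformer_block(x, attn, ffn, ln1, ln2):
    # Pre-norm block; x has shape (T, D), each callable maps (T, D) -> (T, D).
    y = x + attn(ln1(x))   # attention sub-layer with residual
    z = y + ffn(ln2(y))    # feed-forward sub-layer with residual
    return z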


2) LayerNorm (per position)

LayerNorm normalizes across the feature dimension for each position independently (the mean and variance below are taken over that position's D features):

mu  = mean(x)
var = mean((x - mu)^2)
ln(x) = (x - mu) / sqrt(var + eps)
output = gamma * ln(x) + beta

gamma and beta are learnable vectors of length D.
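
A NumPy sketch of the same computation, applied row by row over a (T, D) input (function and argument names here are illustrative):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (T, D); gamma, beta: length-D learnable vectors.
    mu = x.mean(axis=-1, keepdims=True)                  # per-position mean
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)   # per-position variance
    x_hat = (x - mu) / np.sqrt(var + eps)                # normalized features
    return gamma * x_hat + beta                          # learnable scale and shift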


3) Feed-forward network (FFN)

Two linear layers with a nonlinearity:

FFN(x) = W2 * act(W1 * x + b1) + b2

The hidden dimension is conventionally 4D. In this pack, ReLU is fine (GELU is the common choice in modern GPT-style models).
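
A NumPy sketch with ReLU and a 4D hidden size (the weight shapes shown are one common convention; the pack's own classes may differ):

import numpy as np

def ffn(x, W1, b1, W2, b2):
    # x: (T, D); W1: (D, 4D); b1: (4D,); W2: (4D, D); b2: (D,).
    # The same weights are applied independently at every position.
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU nonlinearity
    return h @ W2 + b2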


4) Residual connections

Residuals preserve information and improve gradient flow:

x -> x + sublayer(x)

Without residuals, deep transformers are much harder to train.
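
A small illustration of the pattern (sublayer is any callable such as attention or the FFN): since d/dx [x + f(x)] = 1 + f'(x), gradients always have a direct path back through the identity term, even when f'(x) is small.

def with_residual(x, sublayer):
    # Apply a sub-layer and add the input back in.
    return x + sublayer(x)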


5) Implementation notes (list-of-Value)

  • Apply LayerNorm to each position: [ln(pos) for pos in x]
  • SelfAttention returns (T, D)
  • Residual sums are elementwise list additions
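
A sketch of how the forward pass might look in this list-of-Value style, assuming ln1, ln2, and ffn act on a single position (a length-D list of Values) and attn takes and returns the full (T, D) list of lists; the names and APIs here are illustrative, not the pack's exact code:

def block_forward(x, ln1, attn, ln2, ffn):
    # Attention sub-layer: per-position LayerNorm, then (T, D) attention output.
    a = attn([ln1(pos) for pos in x])
    y = [[xi + ai for xi, ai in zip(x_pos, a_pos)]   # elementwise residual add
         for x_pos, a_pos in zip(x, a)]
    # Feed-forward sub-layer: LayerNorm and FFN applied position by position.
    f = [ffn(ln2(pos)) for pos in y]
    z = [[yi + fi for yi, fi in zip(y_pos, f_pos)]
         for y_pos, f_pos in zip(y, f)]
    return z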

6) Single-head vs multi-head

The conceptual block often uses multi-head attention, but the pack code uses a single-head SelfAttention for clarity. The block API still accepts num_heads so it can be extended later.


Key takeaways

  1. TransformerBlock = pre-norm attention + pre-norm FFN with residuals.
  2. LayerNorm is per-position, across features.
  3. FFN adds per-position nonlinearity and capacity.
  4. Residuals are essential for deep training stability.

Next: combine blocks into a full GPT-style language model.

