Multi-Head Attention and Transformer Blocks

Goal

Implement multi-head attention and the pre-norm transformer block used by GPT-style models.


1) Multi-head attention

Split the embedding dimension D into H heads of size Dh = D / H. For each head:

  • project to Q, K, V
  • apply scaled dot-product attention
  • produce a (T, Dh) output

Concatenate the H head outputs back to (T, D) and apply a final output projection.
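
A minimal PyTorch sketch of the above (class and variable names are my own choices, and the causal mask a GPT-style decoder would add to the attention scores is omitted for brevity):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "D must be divisible by H"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads            # Dh = D / H
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)      # final output projection

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)      # each (B, T, D)
        # Reshape to (B, H, T, Dh) so attention runs independently per head.
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, H, T, T)
        heads = F.softmax(scores, dim=-1) @ v                      # (B, H, T, Dh)
        # Concatenate heads back to (B, T, D), then project.
        heads = heads.transpose(1, 2).contiguous().view(B, T, D)
        return self.out(heads)

mha = MultiHeadAttention(d_model=64, n_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)            # torch.Size([2, 10, 64])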


2) Pre-norm transformer block

For an input x of shape (T, D):

y = x + MHA(LN(x))
z = y + FFN(LN(y))

Normalizing before each sublayer (pre-norm) trains more stably than the original post-norm arrangement as networks get deep.
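
A sketch of the block in PyTorch, assuming torch.nn.MultiheadAttention for the attention sublayer and GELU in the FFN (again omitting the causal mask a GPT decoder would pass in):

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # y = x + MHA(LN(x)): normalize first, attend, then add the residual.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        y = x + attn_out
        # z = y + FFN(LN(y))
        return y + self.ffn(self.ln2(y))

block = PreNormBlock(d_model=64, n_heads=8)
print(block(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])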


3) LayerNorm

Normalize across features for each position:

mu = mean(x)
var = mean((x - mu)^2)
ln(x) = (x - mu) / sqrt(var + eps)

Then scale/shift with learnable gamma, beta.
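
A hand-rolled version, checked against PyTorch's built-in LayerNorm (eps = 1e-5 is the PyTorch default; function and variable names are illustrative):

import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the feature dimension, separately at each position.
    mu = x.mean(dim=-1, keepdim=True)
    var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta                  # learnable scale and shift

x = torch.randn(2, 10, 64)
gamma, beta = torch.ones(64), torch.zeros(64)
print(torch.allclose(layer_norm(x, gamma, beta),
                     F.layer_norm(x, (64,)), atol=1e-5))   # True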


4) FFN (position-wise)

Two linear layers with activation:

  • D -> 4D -> D
  • ReLU is acceptable; GELU is common in GPTs.
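
A position-wise FFN sketch with the usual 4x expansion (GELU chosen here; names are illustrative):

import torch
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(64, 4 * 64),   # expand D -> 4D
    nn.GELU(),               # ReLU also works; GELU is common in GPTs
    nn.Linear(4 * 64, 64),   # project back 4D -> D
)
# Applied independently at every position: the same weights map each (D,) vector.
print(ffn(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
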

5) Residuals

Residual connections preserve information and stabilize gradients:

x = x + sublayer(x)
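
A toy check that the residual adds an identity path for gradients (the zero sublayer is chosen only to make the point obvious):

import torch

x = torch.randn(5, requires_grad=True)
sublayer = lambda t: 0.0 * t       # a sublayer that contributes no gradient of its own
(x + sublayer(x)).sum().backward()
print(x.grad)                      # tensor([1., 1., 1., 1., 1.]) -- gradient still flows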

Key takeaways

  1. Multi-head attention lets the model attend to different representation subspaces in parallel.
  2. Transformer blocks are pre-norm attention + FFN with residuals.
  3. LayerNorm and residuals are essential for deep training.

Next: combine blocks into a complete GPT model.

