Multi-Head Attention and Transformer Blocks

Goal

Implement multi-head attention and the pre-norm transformer block used by GPT-style models.


1) Multi-head attention

Split the embedding dimension D into H heads of size Dh = D / H. For each head:

  • project to Q, K, V
  • apply scaled dot-product attention
  • produce a (T, Dh) output

Concatenate the H head outputs back to (T, D) and apply a final output projection.
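
A minimal PyTorch sketch of the above (class and variable names are my own choices, and the causal mask a GPT-style decoder would add to the attention scores is omitted for brevity):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "D must be divisible by H"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads            # Dh = D / H
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)      # final output projection

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)      # each (B, T, D)
        # Reshape to (B, H, T, Dh) so attention runs independently per head.
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, H, T, T)
        heads = F.softmax(scores, dim=-1) @ v                      # (B, H, T, Dh)
        # Concatenate heads back to (B, T, D), then project.
        heads = heads.transpose(1, 2).contiguous().view(B, T, D)
        return self.out(heads)

mha = MultiHeadAttention(d_model=64, n_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)            # torch.Size([2, 10, 64])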


2) Pre-norm transformer block

For an input x of shape (T, D):

y = x + MHA(LN(x))
z = y + FFN(LN(y))

Normalizing before each sublayer (pre-norm) trains more stably than the original post-norm arrangement as networks get deep.
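
A sketch of the block in PyTorch, assuming torch.nn.MultiheadAttention for the attention sublayer and GELU in the FFN (again omitting the causal mask a GPT decoder would pass in):

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # y = x + MHA(LN(x)): normalize first, attend, then add the residual.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        y = x + attn_out
        # z = y + FFN(LN(y))
        return y + self.ffn(self.ln2(y))

block = PreNormBlock(d_model=64, n_heads=8)
print(block(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])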


3) LayerNorm

Normalize across features for each position:

mu = mean(x)
var = mean((x - mu)^2)
ln(x) = (x - mu) / sqrt(var + eps)

Then scale/shift with learnable gamma, beta.
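
A hand-rolled version, checked against PyTorch's built-in LayerNorm (eps = 1e-5 is the PyTorch default; function and variable names are illustrative):

import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the feature dimension, separately at each position.
    mu = x.mean(dim=-1, keepdim=True)
    var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta                  # learnable scale and shift

x = torch.randn(2, 10, 64)
gamma, beta = torch.ones(64), torch.zeros(64)
print(torch.allclose(layer_norm(x, gamma, beta),
                     F.layer_norm(x, (64,)), atol=1e-5))   # True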


4) FFN (position-wise)

Two linear layers with activation:

  • D -> 4D -> D
  • ReLU is acceptable; GELU is common in GPTs.
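
A position-wise FFN sketch with the usual 4x expansion (GELU chosen here; names are illustrative):

import torch
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(64, 4 * 64),   # expand D -> 4D
    nn.GELU(),               # ReLU also works; GELU is common in GPTs
    nn.Linear(4 * 64, 64),   # project back 4D -> D
)
# Applied independently at every position: the same weights map each (D,) vector.
print(ffn(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
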

5) Residuals

Residual connections preserve information and stabilize gradients:

x = x + sublayer(x)
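
A toy check that the residual adds an identity path for gradients (the zero sublayer is chosen only to make the point obvious):

import torch

x = torch.randn(5, requires_grad=True)
sublayer = lambda t: 0.0 * t       # a sublayer that contributes no gradient of its own
(x + sublayer(x)).sum().backward()
print(x.grad)                      # tensor([1., 1., 1., 1., 1.]) -- gradient still flows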

Key takeaways

  1. Multi-head attention lets the model attend to different representation subspaces in parallel.
  2. Transformer blocks are pre-norm attention + FFN with residuals.
  3. LayerNorm and residuals are essential for deep training.

Next: combine blocks into a complete GPT model.

