Multi-Head Attention and Transformer Blocks
Lesson, slides, and applied problem sets.
Goal
Implement multi-head attention and the pre-norm transformer block used by GPT-style models.
1) Multi-head attention
Split the embedding dimension D into H heads of size Dh = D / H. For each head:
- project to Q, K, V
- apply scaled dot-product attention
- produce a (T, Dh) output
Concatenate all heads and apply a final output projection.
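A minimal sketch in PyTorch, operating on a batched (B, T, D) input; the class name MultiHeadAttention and the d_model/n_heads parameter names are illustrative, not from a specific library:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention over a (B, T, D) input."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "D must be divisible by H"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One linear layer produces Q, K, V for all heads at once.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)  # final output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape to (B, H, T, Dh) so each head attends independently.
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        # Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, H, T, T)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v                                         # (B, H, T, Dh)

        # Concatenate heads back to (B, T, D) and apply the output projection.
        merged = heads.transpose(1, 2).contiguous().view(B, T, D)
        return self.out(merged)
```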
2) Pre-norm transformer block
For an input x of shape (T, D):
y = x + MHA(LN(x))
z = y + FFN(LN(y))
This is more stable than post-norm for deep networks.
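A sketch of the pre-norm wiring, assuming the MultiHeadAttention sketch above and the FeedForward sketch from section 4 below; nn.LayerNorm stands in for the hand-written version in section 3:

```python
class TransformerBlock(nn.Module):
    """Pre-norm block: y = x + MHA(LN(x)); z = y + FFN(LN(y))."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm is applied *before* each sublayer (pre-norm),
        # and each sublayer's output is added back to its input (residual).
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
```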
3) LayerNorm
Normalize across features for each position:
mu = mean(x)
var = mean((x - mu)^2)
ln(x) = (x - mu) / sqrt(var + eps)
Then scale/shift with learnable gamma, beta.
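The same computation written out by hand, as a sketch equivalent to nn.LayerNorm over the last (feature) dimension:

```python
class LayerNorm(nn.Module):
    """Normalize each position's feature vector, then scale and shift."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # mean((x - mu)^2)
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta
```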
4) FFN (position-wise)
Two linear layers with activation:
D -> 4D -> D
- ReLU is acceptable; GELU is common in GPTs.
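A sketch of the position-wise FFN using GELU, matching common GPT implementations; the FeedForward name is illustrative:

```python
class FeedForward(nn.Module):
    """Position-wise MLP: D -> 4D -> D, applied independently at each position."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),                        # ReLU would also work
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```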
5) Residuals
Residual connections preserve information and stabilize gradients:
x = x + sublayer(x)
Key takeaways
- Multi-head attention lets the model attend to different subspaces.
- Transformer blocks are pre-norm attention + FFN with residuals.
- LayerNorm and residuals are essential for deep training.
Next: combine blocks into a complete GPT model.