Self-Attention
Goal
Implement scaled dot-product self-attention over a sequence.
1) Shapes
- sequence length: T
- embedding dim: D
- input x: (T, D)
Projections: Q, K, V, each of shape (T, D)
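A minimal NumPy sketch of these shapes. The weight matrices Wq, Wk, Wv and the random input are assumptions for illustration; the lesson only specifies that Q, K, V are each (T, D).

```python
import numpy as np

T, D = 4, 8
rng = np.random.default_rng(0)

x = rng.normal(size=(T, D))                        # input sequence, (T, D)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))  # assumed projection weights

Q, K, V = x @ Wq, x @ Wk, x @ Wv                   # projections, each (T, D)
```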
2) Scaled dot-product attention
score[i, j] = (Q[i] dot K[j]) / sqrt(D)
weights[i] = softmax(score[i, :])
output[i] = sum_j weights[i, j] * V[j]
Matrix form:
Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) @ V
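A minimal NumPy sketch of the matrix form above; `attention` is a hypothetical helper name, not a library function. With Q, K, V from the projection sketch in section 1, it returns a (T, D) array.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax; subtracts the max for stability (see section 4).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)   # (T, T): score[i, j] = Q[i] . K[j] / sqrt(D)
    weights = softmax(scores)       # each row sums to 1
    return weights @ V              # (T, D): weighted mix of value vectors
```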
3) Causal masking
For language modeling, block future positions:
mask[i, j] = 0 if j <= i else -1e9
score[i, j] += mask[i, j]
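A sketch of this additive causal mask in NumPy; it is added to the scores before the softmax so that position i never attends to j > i.

```python
import numpy as np

T = 4
mask = np.triu(np.full((T, T), -1e9), k=1)  # -1e9 above the diagonal, 0 on and below it
# scores = scores + mask   # future positions get ~0 weight after the softmax
```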
4) Softmax stability
Always subtract the per-row maximum score before exponentiating; softmax is invariant to this shift, and it prevents exp() from overflowing.
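A sketch of the max-subtracted softmax: shifting each row by its maximum leaves the result unchanged but keeps every exponent at or below zero.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # largest entry becomes 0
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

print(stable_softmax(np.array([1000.0, 0.0])))  # no overflow: [1., 0.]
```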
5) Complexity
- Time: O(T^2 * D)
- Memory: O(T^2)
Key takeaways
- Attention mixes information across positions.
- Causal masking enforces autoregressive behavior.
- Stability matters: use max-subtracted softmax.
Next: multi-head attention and transformer blocks.