Self-Attention


Goal

Implement scaled dot-product self-attention over a sequence.


1) Shapes

  • sequence length: T
  • embedding dim: D
  • input x: (T, D)

Learned linear projections of x (sketched below):

  • Q, K, V each (T, D)
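
A minimal sketch of these projections, assuming NumPy and randomly initialized weight matrices W_q, W_k, W_v (the lesson does not specify how the weights are obtained):

import numpy as np

T, D = 8, 16                   # sequence length, embedding dim
rng = np.random.default_rng(0)

x = rng.normal(size=(T, D))    # input sequence, shape (T, D)

# Learned projection matrices, randomly initialized here for illustration.
W_q = rng.normal(size=(D, D))
W_k = rng.normal(size=(D, D))
W_v = rng.normal(size=(D, D))

Q = x @ W_q                    # queries, (T, D)
K = x @ W_k                    # keys,    (T, D)
V = x @ W_v                    # values,  (T, D)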

2) Scaled dot-product attention

score[i, j] = (Q[i] dot K[j]) / sqrt(D)
weights[i] = softmax(score[i, :])
output[i] = sum_j weights[i, j] * V[j]

Matrix form:

Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) @ V
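
A NumPy sketch of this matrix form, continuing with the Q, K, V above (the softmax helper is an assumption, written here with the row-wise max subtraction discussed in section 4):

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stability, see section 4
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)             # (T, T) score matrix
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (T, D) output

out = attention(Q, K, V)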

3) Causal masking

For language modeling, block attention to future positions so that position i may only attend to positions j <= i:

mask[i, j] = 0 if j <= i else -1e9
score[i, j] += mask[i, j]
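
One way to build this additive mask with NumPy (a sketch, reusing the softmax helper above; np.triu with k=1 keeps only the strictly upper triangle, i.e. the future positions j > i):

def causal_attention(Q, K, V):
    T, D = Q.shape
    scores = Q @ K.T / np.sqrt(D)               # (T, T)
    mask = np.triu(np.full((T, T), -1e9), k=1)  # 0 where j <= i, -1e9 where j > i
    weights = softmax(scores + mask, axis=-1)   # future positions get ~0 weight
    return weights @ V

out = causal_attention(Q, K, V)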

4) Softmax stability

Subtract the row-wise maximum of the scores before exponentiating to avoid overflow; softmax is invariant to adding a constant to every input, so the result is unchanged.
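
A quick numeric illustration of why this matters (assuming NumPy):

z = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax overflows: np.exp(1000.0) is inf in float64, giving nan.
naive = np.exp(z) / np.exp(z).sum()

# Subtracting the max gives the same mathematical result without overflow.
shifted = z - z.max()                            # [-2, -1, 0]
stable = np.exp(shifted) / np.exp(shifted).sum() # [0.090, 0.245, 0.665]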


5) Complexity

  • Time: O(T^2 * D) (computing Q K^T and the weighted sum over V)
  • Memory: O(T^2) (the full attention weight matrix)
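
For a sense of scale (an illustrative calculation, not from the lesson): with T = 4096 and float32 scores, the (T, T) weight matrix alone takes 4096 * 4096 * 4 bytes, roughly 64 MiB per attention computation.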

Key takeaways

  1. Attention mixes information across positions.
  2. Causal masking enforces autoregressive behavior.
  3. Stability matters: use max-subtracted softmax.

Next: multi-head attention and transformer blocks.

