Self-Attention: The Core Transformer Primitive
Goal
Implement scaled dot-product self-attention over a sequence. This is the operation that lets each position mix information from all other positions.
Prerequisites: Embeddings module.
1) Notation and shapes
Let:
- T = sequence length
- D = embedding dimension
Input x has shape (T, D) (a list of T vectors of length D).
We compute:
Q = x W_q, K = x W_k, V = x W_v
Each of Q, K, V has shape (T, D).
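As a quick shape check, here is a minimal sketch of the three projections using plain Python lists and random weights (the matmul helper and the initialization scale are illustrative assumptions, not part of the problem set):

import random

T, D = 4, 8  # small example sizes

def matmul(x, w):
    # multiply a (T, D) list-of-lists by a (D, D) list-of-lists -> (T, D)
    return [[sum(row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
            for row in x]

x = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]    # input embeddings
W_q = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(D)]
W_k = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(D)]
W_v = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(D)]

Q, K, V = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)
assert len(Q) == T and len(Q[0]) == D   # each projection keeps shape (T, D)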
2) Scaled dot-product attention
For each position i, compute similarity to all positions j:
score[i, j] = (Q[i] dot K[j]) / sqrt(D)
weights[i] = softmax(score[i, :])
output[i] = sum_j weights[i, j] * V[j]
Matrix form:
Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) @ V
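As a vectorized cross-check of the matrix form, here is a minimal NumPy sketch (a reference only, not the list-of-Value implementation the coding problems ask for):

import numpy as np

def attention(Q, K, V):
    # Q, K, V: arrays of shape (T, D)
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                            # (T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)     # row-wise stability shift
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (T, D)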
3) Softmax (numerical stability)
Always subtract the max before exponentiating:
def softmax(scores):
    # scores: one row of the score matrix, a list of Value objects
    m = max(s.data for s in scores)               # row max as a plain float
    exp_scores = [(s - m).exp() for s in scores]  # shift, then exponentiate
    total = sum(exp_scores)
    return [e / total for e in exp_scores]        # weights sum to 1
Shifting by the row maximum leaves the result unchanged (softmax is invariant to adding a constant to every score) while keeping exp from overflowing.
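A quick sanity check with plain floats (using math.exp directly rather than the Value class) shows why the shift matters:

import math

def softmax_floats(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_floats([1000.0, 1001.0]))   # ~[0.269, 0.731]
# without the shift, math.exp(1000.0) overflows and raises OverflowError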
4) Causal masking (autoregressive)
For language models, position i may only attend to j <= i. Add a large negative value to disallowed scores:
mask[i][j] = 0.0 if j <= i else -1e9
score[i][j] += mask[i][j]
After softmax, masked positions have ~0 probability.
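A minimal sketch of building and applying the mask to a (T, T) score matrix stored as nested lists:

def causal_mask(T):
    # mask[i][j] is 0.0 where attention is allowed (j <= i) and -1e9 otherwise
    return [[0.0 if j <= i else -1e9 for j in range(T)] for i in range(T)]

def apply_causal_mask(scores, mask):
    # add the mask in place; masked entries become huge negatives and vanish after softmax
    for i in range(len(scores)):
        for j in range(len(scores[i])):
            scores[i][j] = scores[i][j] + mask[i][j]
    return scores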
5) Implementation sketch (list-of-Value)
# Q, K, V: lists of length T, each entry a vector of length D
scores = attention_scores(Q, K, D)            # (T, T) scaled dot products
if causal:
    for i in range(T):
        for j in range(T):
            scores[i][j] += mask[i][j]        # mask from section 4: -1e9 where j > i
weights = [softmax(row) for row in scores]    # row-wise softmax over each row
output = apply_attention(weights, V)          # weighted sums of V rows, (T, D)
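For a concrete reference, here is one possible end-to-end version using plain Python floats instead of Value objects. The bodies of attention_scores and apply_attention are assumptions about what the problem set expects, and the float softmax here stands in for the Value-based one in section 3:

import math

def softmax_row(scores):
    # same idea as the float softmax in section 3
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(Q, K, D):
    # score[i][j] = (Q[i] . K[j]) / sqrt(D), shape (T, T)
    T = len(Q)
    return [[sum(Q[i][d] * K[j][d] for d in range(D)) / math.sqrt(D)
             for j in range(T)] for i in range(T)]

def apply_attention(weights, V):
    # output[i] = sum_j weights[i][j] * V[j], shape (T, D)
    T, D = len(V), len(V[0])
    return [[sum(weights[i][j] * V[j][d] for j in range(T)) for d in range(D)]
            for i in range(T)]

def self_attention(Q, K, V, causal=False):
    T, D = len(Q), len(Q[0])
    scores = attention_scores(Q, K, D)
    if causal:
        for i in range(T):
            for j in range(i + 1, T):
                scores[i][j] += -1e9          # disallow attending to future positions
    weights = [softmax_row(row) for row in scores]
    return apply_attention(weights, V)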
6) Multi-head attention (concept)
In full transformers, we split the D dimensions into H heads of size D/H, run attention independently in each head, and concatenate the head outputs. This pack's coding problems implement a single-head SelfAttention for clarity; multi-head is a straightforward extension once the single-head core is correct (see the sketch below).
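A minimal sketch of that extension over the list-based Q, K, V, reusing self_attention from the reference code in section 5 (the contiguous-slice head split is one common convention, and the output projection of a full transformer block is omitted):

def multi_head_attention(Q, K, V, H, causal=False):
    # split the D columns into H contiguous slices of size D // H,
    # run single-head attention on each slice, then concatenate the results
    T, D = len(Q), len(Q[0])
    head_dim = D // H                     # assumes D is divisible by H
    heads = []
    for h in range(H):
        lo, hi = h * head_dim, (h + 1) * head_dim
        Qh = [row[lo:hi] for row in Q]
        Kh = [row[lo:hi] for row in K]
        Vh = [row[lo:hi] for row in V]
        heads.append(self_attention(Qh, Kh, Vh, causal=causal))
    # concatenate the H head outputs back into shape (T, D)
    return [sum((heads[h][t] for h in range(H)), []) for t in range(T)]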
7) Complexity
- Time: O(T^2 * D)
- Memory: O(T^2)
The quadratic term is the main bottleneck for long sequences.
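A rough back-of-the-envelope check, with sizes chosen only for illustration:

T, D = 4096, 64
score_entries = T * T        # 16,777,216 entries in the (T, T) score matrix
score_flops = T * T * D      # ~1.07e9 multiply-adds just to fill it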
Key takeaways
- Attention computes each position's output as a weighted sum of the value vectors at all positions.
- Softmax is row-wise and must be numerically stable.
- Causal masking enforces autoregressive generation.
- Single-head attention is the core; multi-head is a structured extension.