Self-Attention: The Core Transformer Primitive
Goal
Implement scaled dot-product self-attention over a sequence. This is the operation that lets each position mix information from all other positions.
Prerequisites: Embeddings module.
1) Notation and shapes
Let:
- T = sequence length
- D = embedding dimension
Input x has shape (T, D) (a list of T vectors of length D).
We compute:
Q = x W_q, K = x W_k, V = x W_v
Each of Q, K, V has shape (T, D).
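As a quick shape check, here is a minimal sketch of the three projections using plain Python lists and random weights (the matmul helper and the initialization scale are illustrative assumptions, not part of the problem set):

import random

T, D = 4, 8  # small example sizes

def matmul(x, w):
    # multiply a (T, D) list-of-lists by a (D, D) list-of-lists -> (T, D)
    return [[sum(row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
            for row in x]

x = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]    # input embeddings
W_q = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(D)]
W_k = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(D)]
W_v = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(D)]

Q, K, V = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)
assert len(Q) == T and len(Q[0]) == D   # each projection keeps shape (T, D)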
2) Scaled dot-product attention
For each position i, compute similarity to all positions j:
score[i, j] = (Q[i] dot K[j]) / sqrt(D)
weights[i] = softmax(score[i, :])
output[i] = sum_j weights[i, j] * V[j]
Matrix form:
Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) @ V
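As a vectorized cross-check of the matrix form, here is a minimal NumPy sketch (a reference only, not the list-of-Value implementation the coding problems ask for):

import numpy as np

def attention(Q, K, V):
    # Q, K, V: arrays of shape (T, D)
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                            # (T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)     # row-wise stability shift
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (T, D)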
3) Softmax (numerical stability)
Always subtract the max before exponentiating:
def softmax(scores):
    # scores: one row of the score matrix, a list of Value objects
    m = max(s.data for s in scores)               # row max as a plain float
    exp_scores = [(s - m).exp() for s in scores]  # shift, then exponentiate
    total = sum(exp_scores)
    return [e / total for e in exp_scores]        # weights sum to 1
Shifting by the row maximum leaves the result unchanged (softmax is invariant to adding a constant to every score) while keeping exp from overflowing.
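A quick sanity check with plain floats (using math.exp directly rather than the Value class) shows why the shift matters:

import math

def softmax_floats(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_floats([1000.0, 1001.0]))   # ~[0.269, 0.731]
# without the shift, math.exp(1000.0) overflows and raises OverflowError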
4) Causal masking (autoregressive)
For language models, position i may only attend to j <= i. Add a large negative value to disallowed scores:
mask[i][j] = 0.0 if j <= i else -1e9
score[i][j] += mask[i][j]
After softmax, masked positions have ~0 probability.
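A minimal sketch of building and applying the mask to a (T, T) score matrix stored as nested lists:

def causal_mask(T):
    # mask[i][j] is 0.0 where attention is allowed (j <= i) and -1e9 otherwise
    return [[0.0 if j <= i else -1e9 for j in range(T)] for i in range(T)]

def apply_causal_mask(scores, mask):
    # add the mask in place; masked entries become huge negatives and vanish after softmax
    for i in range(len(scores)):
        for j in range(len(scores[i])):
            scores[i][j] = scores[i][j] + mask[i][j]
    return scores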
5) Implementation sketch (list-of-Value)
# Q, K, V: lists of length T, each entry a vector of length D
scores = attention_scores(Q, K, D)            # (T, T) scaled dot products
if causal:
    for i in range(T):
        for j in range(T):
            scores[i][j] += mask[i][j]        # mask from section 4: -1e9 where j > i
weights = [softmax(row) for row in scores]    # row-wise softmax over each row
output = apply_attention(weights, V)          # weighted sums of V rows, (T, D)
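For a concrete reference, here is one possible end-to-end version using plain Python floats instead of Value objects. The bodies of attention_scores and apply_attention are assumptions about what the problem set expects, and the float softmax here stands in for the Value-based one in section 3:

import math

def softmax_row(scores):
    # same idea as the float softmax in section 3
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(Q, K, D):
    # score[i][j] = (Q[i] . K[j]) / sqrt(D), shape (T, T)
    T = len(Q)
    return [[sum(Q[i][d] * K[j][d] for d in range(D)) / math.sqrt(D)
             for j in range(T)] for i in range(T)]

def apply_attention(weights, V):
    # output[i] = sum_j weights[i][j] * V[j], shape (T, D)
    T, D = len(V), len(V[0])
    return [[sum(weights[i][j] * V[j][d] for j in range(T)) for d in range(D)]
            for i in range(T)]

def self_attention(Q, K, V, causal=False):
    T, D = len(Q), len(Q[0])
    scores = attention_scores(Q, K, D)
    if causal:
        for i in range(T):
            for j in range(i + 1, T):
                scores[i][j] += -1e9          # disallow attending to future positions
    weights = [softmax_row(row) for row in scores]
    return apply_attention(weights, V)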
6) Multi-head attention (concept)
In full transformers, we split the D dimensions into H heads of size D/H, run attention independently in each head, and concatenate the head outputs. This pack's coding problems implement a single-head SelfAttention for clarity; multi-head is a straightforward extension once the single-head core is correct (see the sketch below).
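A minimal sketch of that extension over the list-based Q, K, V, reusing self_attention from the reference code in section 5 (the contiguous-slice head split is one common convention, and the output projection of a full transformer block is omitted):

def multi_head_attention(Q, K, V, H, causal=False):
    # split the D columns into H contiguous slices of size D // H,
    # run single-head attention on each slice, then concatenate the results
    T, D = len(Q), len(Q[0])
    head_dim = D // H                     # assumes D is divisible by H
    heads = []
    for h in range(H):
        lo, hi = h * head_dim, (h + 1) * head_dim
        Qh = [row[lo:hi] for row in Q]
        Kh = [row[lo:hi] for row in K]
        Vh = [row[lo:hi] for row in V]
        heads.append(self_attention(Qh, Kh, Vh, causal=causal))
    # concatenate the H head outputs back into shape (T, D)
    return [sum((heads[h][t] for h in range(H)), []) for t in range(T)]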
7) Complexity
- Time: O(T^2 * D)
- Memory: O(T^2)
The quadratic term is the main bottleneck for long sequences.
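A rough back-of-the-envelope check, with sizes chosen only for illustration:

T, D = 4096, 64
score_entries = T * T        # 16,777,216 entries in the (T, T) score matrix
score_flops = T * T * D      # ~1.07e9 multiply-adds just to fill it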
Key takeaways
- Attention computes each position's output as a weighted sum of the value vectors at all positions.
- Softmax is row-wise and must be numerically stable.
- Causal masking enforces autoregressive generation.
- Single-head attention is the core; multi-head is a structured extension.