Self-Attention
Goal
Implement scaled dot-product self-attention over a sequence.
1) Shapes
- sequence length: T
- embedding dim: D
- input x: (T, D)
Projections: Q, K, V, each of shape (T, D)
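A minimal NumPy sketch of these shapes. The weight matrices Wq, Wk, Wv and the random input are assumptions for illustration; the lesson only specifies that Q, K, V are each (T, D).

```python
import numpy as np

T, D = 4, 8
rng = np.random.default_rng(0)

x = rng.normal(size=(T, D))                        # input sequence, (T, D)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))  # assumed projection weights

Q, K, V = x @ Wq, x @ Wk, x @ Wv                   # projections, each (T, D)
```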
2) Scaled dot-product attention
score[i, j] = (Q[i] dot K[j]) / sqrt(D)
weights[i] = softmax(score[i, :])
output[i] = sum_j weights[i, j] * V[j]
Matrix form:
Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) @ V
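A minimal NumPy sketch of the matrix form above; `attention` is a hypothetical helper name, not a library function. With Q, K, V from the projection sketch in section 1, it returns a (T, D) array.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax; subtracts the max for stability (see section 4).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)   # (T, T): score[i, j] = Q[i] . K[j] / sqrt(D)
    weights = softmax(scores)       # each row sums to 1
    return weights @ V              # (T, D): weighted mix of value vectors
```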
3) Causal masking
For language modeling, block future positions:
mask[i, j] = 0 if j <= i else -1e9
score[i, j] += mask[i, j]
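A sketch of this additive causal mask in NumPy; it is added to the scores before the softmax so that position i never attends to j > i.

```python
import numpy as np

T = 4
mask = np.triu(np.full((T, T), -1e9), k=1)  # -1e9 above the diagonal, 0 on and below it
# scores = scores + mask   # future positions get ~0 weight after the softmax
```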
4) Softmax stability
Always subtract the per-row maximum score before exponentiating; softmax is invariant to this shift, and it prevents exp() from overflowing.
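A sketch of the max-subtracted softmax: shifting each row by its maximum leaves the result unchanged but keeps every exponent at or below zero.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # largest entry becomes 0
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

print(stable_softmax(np.array([1000.0, 0.0])))  # no overflow: [1., 0.]
```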
5) Complexity
- Time: O(T^2 * D)
- Memory: O(T^2)
Key takeaways
- Attention mixes information across positions.
- Causal masking enforces autoregressive behavior.
- Stability matters: use max-subtracted softmax.
Next: multi-head attention and transformer blocks.