Self-Attention

Shapes

  • Sequence length: T
  • Embedding dim: D
  • Input x: (T, D)
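
A minimal NumPy sketch of these shapes (NumPy is an assumption; the deck names no framework); W_q, W_k, W_v are illustrative stand-ins for learned projection weights:

  import numpy as np

  T, D = 8, 16                   # sequence length, embedding dim
  x = np.random.randn(T, D)      # input: (T, D)

  # random stand-ins for learned projection weights
  W_q = np.random.randn(D, D)
  W_k = np.random.randn(D, D)
  W_v = np.random.randn(D, D)

  Q = x @ W_q                    # (T, D)
  K = x @ W_k                    # (T, D)
  V = x @ W_v                    # (T, D)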

Formula

Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) V

Row-wise softmax over scores.
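
A sketch of the formula in NumPy, reusing the Q, K, V of shape (T, D) from the previous slide; the stabilized softmax is covered on a later slide:

  import numpy as np

  def softmax(s, axis=-1):
      e = np.exp(s)              # stabilized version on a later slide
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, T) score matrix
      return softmax(scores) @ V                # (T, D)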

Causal mask

Autoregressive mask:

  • allow position i to attend to positions j <= i
  • disallow future positions (j > i) by setting their scores to -1e9 before the softmax (see the sketch below)
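
A sketch of the mask in NumPy, assuming a (T, T) score matrix as on the previous slide; np.triu with k=1 marks the strictly-future positions:

  import numpy as np

  T = 8
  scores = np.random.randn(T, T)     # stand-in for Q K^T / sqrt(D)

  # True where j > i, i.e. where position i would look at the future
  future = np.triu(np.ones((T, T), dtype=bool), k=1)

  masked = np.where(future, -1e9, scores)   # disallowed scores -> -1e9
  # after the row-wise softmax these entries are effectively zero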

Softmax stability

Subtract the row-wise max before exponentiating; the softmax output is unchanged because the shift cancels in the normalization:

exp(s - max(s))
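
A sketch of a numerically stable row-wise softmax in NumPy; shifting by the row max avoids overflow in exp for large scores:

  import numpy as np

  def stable_softmax(s, axis=-1):
      s = s - s.max(axis=axis, keepdims=True)   # largest value per row becomes 0
      e = np.exp(s)                             # no overflow for large scores
      return e / e.sum(axis=axis, keepdims=True)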

Output

  • Each position's output is a weighted sum of the value vectors at the positions it may attend to
  • Complexity: O(T^2 * D)
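
Putting the slides together, a minimal single-head causal self-attention sketch in NumPy; the weight matrices are random stand-ins for learned parameters:

  import numpy as np

  def causal_self_attention(x, W_q, W_k, W_v):
      T, D = x.shape
      Q, K, V = x @ W_q, x @ W_k, x @ W_v           # each (T, D)
      scores = Q @ K.T / np.sqrt(D)                 # (T, T): the T^2 * D cost
      future = np.triu(np.ones((T, T), dtype=bool), k=1)
      scores = np.where(future, -1e9, scores)       # causal mask
      scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
      w = np.exp(scores)
      w /= w.sum(axis=-1, keepdims=True)
      return w @ V                                  # each row: weighted sum of values

  x = np.random.randn(8, 16)
  out = causal_self_attention(x, *(np.random.randn(16, 16) for _ in range(3)))
  print(out.shape)                                  # (8, 16)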