Let x be the input sequence, a matrix of shape (T, D): T tokens, each a D-dimensional embedding.
Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) @ V
The softmax is applied row-wise over the (T, T) score matrix Q K^T / sqrt(D), so each output row is a convex combination of the rows of V.
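The formula above can be sketched directly in NumPy (a minimal single-head version; the function name and shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D)) @ V."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)          # (T, T) score matrix
    # Row-wise softmax: each row of `weights` sums to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                     # (T, D)
```

With Q, K, V all of shape (T, D), the output has shape (T, D): one mixed value vector per token.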
Autoregressive (causal) mask: position i may attend only to positions j <= i. Scores at disallowed positions (j > i) are set to a large negative value such as -1e9, so they receive approximately zero weight after the softmax.
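One common way to build this mask (a sketch; the variable names are illustrative):

```python
import numpy as np

T = 4
# Lower-triangular boolean mask: row i is True at columns j <= i.
mask = np.tril(np.ones((T, T), dtype=bool))

scores = np.zeros((T, T))                  # stand-in for Q K^T / sqrt(D)
masked = np.where(mask, scores, -1e9)      # -1e9 -> ~0 weight after softmax
```

After the row-wise softmax, row 0 puts all of its weight on position 0, row 1 spreads weight over positions 0 and 1, and so on.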
For numerical stability, subtract the row maximum before exponentiating:
softmax(s) = exp(s - max(s)) / sum(exp(s - max(s)))
This leaves the result unchanged (the shift cancels in the ratio) but prevents exp from overflowing for large scores.
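A small demonstration of why the shift matters (illustrative values):

```python
import numpy as np

def softmax(s):
    # Subtracting the max cancels in the ratio:
    # exp(s - m) / sum(exp(s - m)) == exp(s) / sum(exp(s))
    z = np.exp(s - s.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

s = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(s) overflows to inf
p = softmax(s)                           # finite, sums to 1
```

The stable version gives the same probabilities as softmax([0, 1, 2]), since only score differences matter.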