Self-Attention

Shapes

Input: (T, D)
Q, K, V: each (T, D)
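
A minimal NumPy sketch of these shapes. The slide only gives the shapes, so the projection matrices `W_q`, `W_k`, `W_v` (each `(D, D)`) and the values of `T` and `D` are assumptions made here for illustration.

```python
import numpy as np

T, D = 4, 8  # sequence length and model width (illustrative values)
rng = np.random.default_rng(0)

X = rng.standard_normal((T, D))  # input: one row per token

# Hypothetical projection weights; the slide states only the resulting shapes.
W_q = rng.standard_normal((D, D))
W_k = rng.standard_normal((D, D))
W_v = rng.standard_normal((D, D))

# Each projection maps (T, D) @ (D, D) -> (T, D).
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)
```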

Formula

softmax(Q @ K^T / sqrt(D)) @ V
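
The formula can be sketched in NumPy as follows, assuming a single head and unbatched `(T, D)` inputs. The function name `self_attention` is illustrative, and the softmax is taken row-wise over the `(T, T)` score matrix.

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention for unbatched (T, D) inputs."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)  # (T, T): row i scores query i against every key
    # Row-wise softmax; the max is subtracted first for numerical safety.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (T, D): each row is a weighted average of the rows of V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(self_attention(Q, K, V).shape)
```

With all-zero queries and keys every score is equal, so each output row reduces to the plain mean of the rows of `V`.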

Causal mask

Block future tokens: add a mask of -1e9 to the scores at positions j > i, before the softmax.
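
A sketch of the mask under the same single-head, unbatched assumption. The helper names `causal_mask` and `causal_self_attention` are made up for this example. The additive form is used: 0 where attention is allowed, -1e9 where key j lies in the future of query i.

```python
import numpy as np

def causal_mask(T, neg=-1e9):
    """(T, T) additive mask: 0 on and below the diagonal, `neg` above it."""
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > i
    return np.where(future, neg, 0.0)

def causal_self_attention(Q, K, V):
    T, D = Q.shape
    scores = Q @ K.T / np.sqrt(D) + causal_mask(T)  # mask is added before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # masked entries underflow to 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 3))  # lower-triangular attention pattern
```

The first token can only attend to itself, so the first output row equals the first row of `V`.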

Softmax stability

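Subtract the row-wise max from the scores before exp. Softmax is shift-invariant, so the output is unchanged and exp cannot overflow.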
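
A small sketch of the trick, with an illustrative `softmax` helper. The inputs are chosen so that a naive `np.exp(x)` would overflow to `inf` in float64, while the shifted version stays finite.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax that subtracts the max along `axis` before exponentiating."""
    shifted = x - np.max(x, axis=axis, keepdims=True)  # largest entry becomes 0
    e = np.exp(shifted)                                 # every value is in (0, 1]
    return e / e.sum(axis=axis, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000) overflows in float64
small = np.array([0.0, 1.0, 2.0])         # same values shifted by 1000

print(softmax(big))
print(np.allclose(softmax(big), softmax(small)))
```

Both inputs differ only by a constant shift, so they produce the same probabilities.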
