Let x be the input sequence, a matrix of shape (T, D): T tokens, each a D-dimensional embedding.
Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) @ V
The softmax is applied row-wise over the (T, T) score matrix Q K^T / sqrt(D), so each output row is a convex combination of the rows of V.
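The formula above can be sketched directly in NumPy (a minimal single-head version; the function name and shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D)) @ V."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)          # (T, T) score matrix
    # Row-wise softmax: each row of `weights` sums to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                     # (T, D)
```

With Q, K, V all of shape (T, D), the output has shape (T, D): one mixed value vector per token.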
Autoregressive (causal) mask: position i may attend only to positions j <= i. Scores at disallowed positions (j > i) are set to a large negative value such as -1e9, so they receive approximately zero weight after the softmax.
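One common way to build this mask (a sketch; the variable names are illustrative):

```python
import numpy as np

T = 4
# Lower-triangular boolean mask: row i is True at columns j <= i.
mask = np.tril(np.ones((T, T), dtype=bool))

scores = np.zeros((T, T))                  # stand-in for Q K^T / sqrt(D)
masked = np.where(mask, scores, -1e9)      # -1e9 -> ~0 weight after softmax
```

After the row-wise softmax, row 0 puts all of its weight on position 0, row 1 spreads weight over positions 0 and 1, and so on.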
For numerical stability, subtract the row maximum before exponentiating:
softmax(s) = exp(s - max(s)) / sum(exp(s - max(s)))
This leaves the result unchanged (the shift cancels in the ratio) but prevents exp from overflowing for large scores.
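A small demonstration of why the shift matters (illustrative values):

```python
import numpy as np

def softmax(s):
    # Subtracting the max cancels in the ratio:
    # exp(s - m) / sum(exp(s - m)) == exp(s) / sum(exp(s))
    z = np.exp(s - s.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

s = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(s) overflows to inf
p = softmax(s)                           # finite, sums to 1
```

The stable version gives the same probabilities as softmax([0, 1, 2]), since only score differences matter.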