Split the model dimension D into H heads of size Dh = D / H. Run attention independently in each head, concatenate the H outputs back to size D, then apply an output projection.
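The split, per-head attention, concatenation, and projection can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `matmul`, `softmax`, `multi_head_attention`, and the weights `wq`, `wk`, `wv`, `wo` are hypothetical, and it assumes scaled dot-product attention with no masking, no biases, and D divisible by H.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    """x is T x D; wq, wk, wv, wo are D x D. Returns a T x D matrix."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "D must be divisible by H"
    d_head = d_model // num_heads  # Dh = D / H

    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)

    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        # Slice out this head's Dh features for every position.
        qh = [row[lo:hi] for row in q]
        kh = [row[lo:hi] for row in k]
        vh = [row[lo:hi] for row in v]
        # Assumed scoring rule: softmax(Q K^T / sqrt(Dh)) V.
        kh_t = [list(col) for col in zip(*kh)]
        scores = matmul(qh, kh_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, vh))

    # Concatenate the H head outputs along the feature axis, back to size D.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    # Final output projection.
    return matmul(concat, wo)
```

With a single position, each head's softmax is over one score, so the output reduces to x projected by `wv` and then `wo`.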
y = x + MHA(LN(x))
z = y + FFN(LN(y))
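The two residual updates above can be written as a small function that takes the sublayers as callables. This is only a sketch: `transformer_block`, `add`, `ln1`, and `ln2` are hypothetical names, and using a separate LN instance for each sublayer is an assumption (the formulas write both simply as LN).

```python
def add(a, b):
    """Elementwise sum of two T x D matrices given as lists of rows."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x, mha, ffn, ln1, ln2):
    """Pre-norm block: normalize, apply the sublayer, then add the residual."""
    y = add(x, mha(ln1(x)))   # y = x + MHA(LN(x))
    z = add(y, ffn(ln2(y)))   # z = y + FFN(LN(y))
    return z
```

Because LN sits inside each branch, the residual path from x to z is a plain sum: if both sublayers output zeros, the block returns x unchanged.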
LN: normalize each position independently across its D features (zero mean, unit variance), then apply a learned scale and shift.
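A minimal sketch of this normalization, assuming the usual formulation with a small `eps` for numerical stability and optional learned scale `gamma` and shift `beta` (all three names, and the function name `layer_norm`, are illustrative assumptions):

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize each row (one position) of a T x D matrix across its D features."""
    out = []
    for row in x:
        d = len(row)
        mean = sum(row) / d
        var = sum((v - mean) ** 2 for v in row) / d  # biased variance over features
        inv_std = 1.0 / math.sqrt(var + eps)
        normed = [(v - mean) * inv_std for v in row]
        if gamma is not None:  # optional learned per-feature scale
            normed = [g * n for g, n in zip(gamma, normed)]
        if beta is not None:   # optional learned per-feature shift
            normed = [n + b for n, b in zip(normed, beta)]
        out.append(normed)
    return out
```

Statistics are computed per row only, so one position's values never affect another position's output.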
FFN: applied to each position independently. Project D -> 4D, apply an activation, then project back 4D -> D.
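The D -> 4D -> D shape can be sketched as two linear layers with an activation between them. The notes do not say which activation is used; ReLU is assumed here purely for illustration, and `linear`, `feed_forward`, `w1`, `b1`, `w2`, `b2` are hypothetical names.

```python
def relu(v):
    """Assumed activation; the notes leave the choice open."""
    return max(0.0, v)

def linear(x, w, b):
    """Apply x @ w + b row by row; x is T x n, w is n x m, b has length m."""
    return [
        [sum(xi * wi for xi, wi in zip(row, col)) + bj for col, bj in zip(zip(*w), b)]
        for row in x
    ]

def feed_forward(x, w1, b1, w2, b2, activation=relu):
    """Position-wise FFN: expand D -> 4D, apply the activation, project 4D -> D."""
    d_model = len(x[0])
    assert len(w1[0]) == 4 * d_model, "hidden width is assumed to be 4D"
    hidden = [[activation(v) for v in row] for row in linear(x, w1, b1)]
    return linear(hidden, w2, b2)
```

Each row of x is processed with the same weights and without reference to other rows, which is what makes the layer position-wise.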