Multi-Head Attention + Transformer Block


Multi-head

Split the model dimension D into H heads of size Dh = D / H. Run scaled dot-product attention independently per head, concatenate the head outputs, and apply a final linear projection (sketched below).
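A minimal PyTorch sketch of the split / attend / concatenate / project pattern. The class name `MultiHeadAttention` and the joint QKV projection are illustrative choices, not from the slides.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch: split D into H heads of size Dh = D // H, attend per head, concat, project."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q, K, V projection
        self.out = nn.Linear(d_model, d_model)       # final output projection

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, seq, d_head)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # scaled dot-product
        attn = scores.softmax(dim=-1)
        out = attn @ v                                # per-head attention output
        out = out.transpose(1, 2).reshape(B, T, D)    # concatenate heads back to D
        return self.out(out)                          # final projection
```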


Pre-norm block

y = x + MHA(LN(x))
z = y + FFN(LN(y))
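A hedged sketch of the pre-norm residual structure above. It reuses the `MultiHeadAttention` sketch from the previous card; the inline D -> 4D -> D feed-forward network with GELU is an assumed default (see the FFN card).

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Sketch of a pre-norm transformer block: LayerNorm before each sub-layer, residual add after."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mha = MultiHeadAttention(d_model, n_heads)   # sketch from the previous card
        self.ffn = nn.Sequential(                         # D -> 4D -> D with activation
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        y = x + self.mha(self.ln1(x))   # y = x + MHA(LN(x))
        z = y + self.ffn(self.ln2(y))   # z = y + FFN(LN(y))
        return z
```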


LayerNorm

Normalize each position's activations across the feature dimension (zero mean, unit variance), then apply a learned per-feature scale and shift.
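Illustrative only: what "per position across features" means, written with plain tensor ops instead of `nn.LayerNorm`. The parameters `gamma` and `beta` stand in for the learned scale and shift.

```python
import torch

def layer_norm(x, gamma, beta, eps: float = 1e-5):
    # x: (batch, seq, d_model). Statistics are taken over the last (feature) axis,
    # so every position is normalized independently of every other position.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta   # learned per-feature scale and shift
```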


FFN

Two linear layers, D -> 4D -> D, with a nonlinearity in between, applied to each position independently.
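A sketch of the position-wise feed-forward network; the 4x expansion factor is stated on the card, while the GELU activation is an assumed common default.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Sketch: position-wise MLP, D -> 4D -> D, applied to each position independently."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),   # expand D -> 4D
            nn.GELU(),                                 # activation (assumed)
            nn.Linear(expansion * d_model, d_model),   # project back 4D -> D
        )

    def forward(self, x):                              # x: (batch, seq, d_model)
        return self.net(x)
```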
