Multi-Head Attention and Transformer Block
Implement multi-head attention and a pre-norm transformer block.
Components to implement
1) MultiHeadAttention(Module)
```python
class MultiHeadAttention(Module):
    def __init__(self, embed_dim: int, num_heads: int):
        # embed_dim must be divisible by num_heads
        pass

    def forward(self, x: List[List[Value]], causal: bool = True) -> List[List[Value]]:
        # x: (T, D)
        # returns: (T, D)
        pass
```
Rules:
- Split the embedding dimension into `num_heads` heads of size `head_dim = embed_dim // num_heads`.
- Run self-attention independently per head.
- Concatenate head outputs and apply a final output projection.
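A minimal sketch of one way to satisfy these rules. It assumes the micrograd-style `Value` from earlier parts of this exercise (supporting `+`, `*`, `/`, and `.exp()`) and a `Linear(in_features, out_features)` module whose instances map a list of `Value`s to a list of `Value`s; the import paths are placeholders, not the real module names.

```python
import math

# Placeholder imports: adjust to wherever your Value, Module, and Linear classes live.
from engine import Value
from nn import Module, Linear

class MultiHeadAttention(Module):
    def __init__(self, embed_dim: int, num_heads: int):
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One projection each for queries, keys, and values, plus the output projection.
        self.wq = Linear(embed_dim, embed_dim)
        self.wk = Linear(embed_dim, embed_dim)
        self.wv = Linear(embed_dim, embed_dim)
        self.wo = Linear(embed_dim, embed_dim)

    def forward(self, x, causal: bool = True):
        T = len(x)
        q = [self.wq(row) for row in x]  # (T, D)
        k = [self.wk(row) for row in x]
        v = [self.wv(row) for row in x]
        out = [[] for _ in range(T)]  # concatenated head outputs, built head by head
        scale = 1.0 / math.sqrt(self.head_dim)
        for h in range(self.num_heads):
            lo, hi = h * self.head_dim, (h + 1) * self.head_dim
            for t in range(T):
                # Causal masking: position t only attends to positions <= t.
                limit = t + 1 if causal else T
                scores = []
                for s in range(limit):
                    dot = sum((q[t][i] * k[s][i] for i in range(lo, hi)), Value(0.0))
                    scores.append(dot * scale)
                # Softmax over the attended positions.
                exps = [sc.exp() for sc in scores]
                total = sum(exps, Value(0.0))
                weights = [e / total for e in exps]
                # Weighted sum of this head's slice of the value vectors.
                for i in range(lo, hi):
                    out[t].append(sum((weights[s] * v[s][i] for s in range(limit)), Value(0.0)))
        # out is (T, D); apply the final output projection per position.
        return [self.wo(row) for row in out]
```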
2) LayerNorm(Module)
Normalize a single vector of length D and apply learnable scale/shift.
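A possible sketch, again assuming the exercise's `Value` (with `+`, `-`, `*`, `/`, and float powers via `**`) and its `Module` base class. The `eps` constant and the 1/0 initialization are conventional choices, not requirements from the spec.

```python
from engine import Value   # placeholder import, as above
from nn import Module

class LayerNorm(Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        self.eps = eps  # small constant for numerical stability (an assumption, not in the spec)
        # Learnable scale (gamma) initialized to 1 and shift (beta) initialized to 0.
        self.gamma = [Value(1.0) for _ in range(dim)]
        self.beta = [Value(0.0) for _ in range(dim)]

    def forward(self, x):
        # x is a single position's vector of D Values.
        n = len(x)
        mean = sum(x, Value(0.0)) / n
        var = sum(((xi - mean) ** 2 for xi in x), Value(0.0)) / n
        inv_std = (var + self.eps) ** -0.5
        return [self.gamma[i] * (x[i] - mean) * inv_std + self.beta[i] for i in range(n)]
```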
3) FeedForward(Module)
Two-layer MLP applied per position: `D -> 4D -> D`.
- Use ReLU activation.
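A sketch under the same assumptions as above (placeholder imports, a callable `Linear`, and a `Value.relu()` method):

```python
from nn import Module, Linear   # placeholder imports, as above

class FeedForward(Module):
    def __init__(self, embed_dim: int):
        # Expand to 4*D, apply ReLU, then project back down to D.
        self.fc1 = Linear(embed_dim, 4 * embed_dim)
        self.fc2 = Linear(4 * embed_dim, embed_dim)

    def forward(self, x):
        # x is one position's vector of D Values.
        hidden = [h.relu() for h in self.fc1(x)]
        return self.fc2(hidden)
```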
4) TransformerBlock(Module)
Pre-norm block with residuals:
```
y = x + MHA(LN(x))
z = y + FFN(LN(y))
```
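Putting the pieces together, a sketch of the pre-norm block that reuses the sketches above and assumes module instances are callable (i.e. `__call__` dispatches to `forward`):

```python
class TransformerBlock(Module):
    def __init__(self, embed_dim: int, num_heads: int):
        self.ln1 = LayerNorm(embed_dim)
        self.attn = MultiHeadAttention(embed_dim, num_heads)
        self.ln2 = LayerNorm(embed_dim)
        self.ffn = FeedForward(embed_dim)

    def forward(self, x, causal: bool = True):
        # y = x + MHA(LN(x)): normalize each position, attend, then add the residual.
        normed = [self.ln1(row) for row in x]
        attn_out = self.attn(normed, causal=causal)
        y = [[a + b for a, b in zip(x_row, a_row)] for x_row, a_row in zip(x, attn_out)]
        # z = y + FFN(LN(y)): per-position MLP with its own residual.
        z = []
        for row in y:
            ff = self.ffn(self.ln2(row))
            z.append([a + b for a, b in zip(row, ff)])
        return z
```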
Notes
- Use list-of-`Value` math; no NumPy.
- `num_heads` is validated with `assert embed_dim % num_heads == 0`.
- Input/output shapes must match `(T, D)`.
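A quick shape check under the same assumptions (the sizes here are arbitrary example values):

```python
import random
from engine import Value   # placeholder import, as above

T, D, H = 4, 8, 2   # example sequence length, embedding dim, and head count
block = TransformerBlock(embed_dim=D, num_heads=H)

# Random (T, D) input of Value objects.
x = [[Value(random.uniform(-1.0, 1.0)) for _ in range(D)] for _ in range(T)]
out = block(x, causal=True)   # assumes Module.__call__ dispatches to forward

assert len(out) == T and all(len(row) == D for row in out)   # shape is preserved
```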