Multi-Head Attention and Transformer Block

hard · transformers, multi-head, layer-norm


Implement multi-head attention and a pre-norm transformer block.

Components to implement

1) MultiHeadAttention(Module)

from typing import List
# Module and Value are provided by the exercise's autograd framework.

class MultiHeadAttention(Module):
    def __init__(self, embed_dim: int, num_heads: int):
        # embed_dim must be divisible by num_heads
        pass

    def forward(self, x: List[List[Value]], causal: bool = True) -> List[List[Value]]:
        # x: (T, D) -- a sequence of T embedding vectors of width D
        # returns: (T, D)
        pass

Rules:

  • Split the embedding dimension into num_heads heads of size head_dim = embed_dim // num_heads.
  • Run self-attention independently per head; when causal is True, mask attention so position t only attends to positions ≤ t.
  • Concatenate head outputs and apply a final output projection.
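
A minimal per-head sketch in plain Python floats (so it runs on its own, unlike the Value-based version you will write). The weight names wq/wk/wv/wo and the 1/sqrt(head_dim) score scaling are illustrative assumptions, not requirements stated above.

import math
from typing import List

def mha_forward(x: List[List[float]],
                wq: List[List[float]], wk: List[List[float]],
                wv: List[List[float]], wo: List[List[float]],
                num_heads: int, causal: bool = True) -> List[List[float]]:
    # x: (T, D); wq/wk/wv/wo: (D, D) weight matrices (hypothetical names).
    T, D = len(x), len(x[0])
    head_dim = D // num_heads

    def matmul(a, b):
        # (n, k) @ (k, m) -> (n, m), plain-Python loops
        return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
                 for j in range(len(b[0]))] for i in range(len(a))]

    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)

    out = [[0.0] * D for _ in range(T)]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim      # this head's column slice
        for t in range(T):
            # Scaled dot-product scores of position t against every position s.
            scores = []
            for s in range(T):
                if causal and s > t:
                    scores.append(float("-inf"))       # mask future positions
                else:
                    dot = sum(q[t][j] * k[s][j] for j in range(lo, hi))
                    scores.append(dot / math.sqrt(head_dim))
            # Softmax over positions (numerically stabilised).
            m = max(scores)
            exps = [math.exp(sc - m) for sc in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Weighted sum of values, written into this head's slice.
            for j in range(lo, hi):
                out[t][j] = sum(weights[s] * v[s][j] for s in range(T))

    return matmul(out, wo)   # heads already sit in disjoint slices; final projection to (T, D)

Slicing columns [lo, hi) of the projected q/k/v is equivalent to splitting into heads after the projections, so no explicit reshape is needed.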

2) LayerNorm(Module)

Normalize a single vector of length D and apply learnable scale/shift.
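
A float-only sketch, assuming the conventional gamma/beta parameter names and a small eps for numerical stability:

from typing import List

def layer_norm(v: List[float], gamma: List[float], beta: List[float],
               eps: float = 1e-5) -> List[float]:
    # Normalise one length-D vector to zero mean / unit variance,
    # then apply the learnable scale (gamma) and shift (beta).
    D = len(v)
    mean = sum(v) / D
    var = sum((x - mean) ** 2 for x in v) / D
    return [gamma[i] * (v[i] - mean) / (var + eps) ** 0.5 + beta[i]
            for i in range(D)]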

3) FeedForward(Module)

Two-layer MLP applied per position:

  • D -> 4D -> D
  • Use ReLU activation
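
A float-only sketch of the per-position MLP; the weight/bias names and the explicit (4D, D) / (D, 4D) shapes are illustrative assumptions:

from typing import List

def feed_forward(v: List[float],
                 w1: List[List[float]], b1: List[float],
                 w2: List[List[float]], b2: List[float]) -> List[float]:
    # v: length D; w1: (4D, D), b1: length 4D; w2: (D, 4D), b2: length D.
    hidden = [max(0.0, sum(w1[j][i] * v[i] for i in range(len(v))) + b1[j])
              for j in range(len(b1))]            # D -> 4D, then ReLU
    return [sum(w2[j][h] * hidden[h] for h in range(len(hidden))) + b2[j]
            for j in range(len(b2))]              # 4D -> D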

4) TransformerBlock(Module)

Pre-norm block with residuals:

y = x + MHA(LN(x))
z = y + FFN(LN(y))
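
A sketch of the wiring, with mha, ln1, ln2, and ffn standing in for the modules above; using two separate LayerNorm instances is an assumption (conventional for pre-norm blocks):

from typing import Callable, List

Vec = List[float]

def transformer_block(x: List[Vec],
                      mha: Callable[[List[Vec]], List[Vec]],
                      ln1: Callable[[Vec], Vec],
                      ln2: Callable[[Vec], Vec],
                      ffn: Callable[[Vec], Vec]) -> List[Vec]:
    # y = x + MHA(LN(x)): normalise each row, attend over the sequence, add the residual.
    attn_out = mha([ln1(row) for row in x])
    y = [[xi + ai for xi, ai in zip(xr, ar)] for xr, ar in zip(x, attn_out)]
    # z = y + FFN(LN(y)): the feed-forward sub-layer is applied independently per position.
    ffn_out = [ffn(ln2(row)) for row in y]
    return [[yi + fi for yi, fi in zip(yr, fr)] for yr, fr in zip(y, ffn_out)]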

Notes

  • Use list-of-Value math; no NumPy.
  • Validate num_heads in __init__ with assert embed_dim % num_heads == 0.
  • Input/output shapes must match (T, D).
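
A hypothetical shape check once MultiHeadAttention is implemented; the Value(float) constructor is assumed from the exercise's autograd framework:

# Value and MultiHeadAttention come from your implementation above (assumed API).
T, D, H = 4, 8, 2
x = [[Value(0.01 * (i + j)) for j in range(D)] for i in range(T)]
mha = MultiHeadAttention(embed_dim=D, num_heads=H)
out = mha.forward(x, causal=True)
assert len(out) == T and all(len(row) == D for row in out)   # (T, D) preserved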