Self-Attention: The Core of Transformers

hard · attention, transformers, self-attention


Implement scaled dot-product self-attention using scalar Value operations and list-based vectors.

What you are building

1) softmax(scores: List[Value]) -> List[Value]

  • Subtract max for numerical stability
  • Return probabilities that sum to 1
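
A minimal sketch, assuming a micrograd-style Value that supports arithmetic with plain floats, exposes a .data attribute, and provides an exp() method (add one if yours does not):

def softmax(scores):
    # Shift by the max raw score (a plain float) for numerical stability;
    # subtracting a constant changes neither the result nor the gradients.
    m = max(s.data for s in scores)
    exps = [(s - m).exp() for s in scores]
    total = exps[0]
    for e in exps[1:]:
        total = total + e
    return [e / total for e in exps]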

2) attention_scores(Q, K, d_k) -> List[List[Value]]

  • Q, K: lists of length T, each vector length d_k
  • scores[i][j] = dot(Q[i], K[j]) / sqrt(d_k)
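
One possible implementation; it assumes Python's built-in sum works on Value objects (i.e. Value defines __radd__ so the integer 0 start value is absorbed), as the dot-product hint below suggests:

import math

def attention_scores(Q, K, d_k):
    scale = 1.0 / math.sqrt(d_k)          # the "scaled" in scaled dot-product
    scores = []
    for q in Q:                           # row i: query position
        row = []
        for k in K:                       # column j: key position
            dot = sum(qi * ki for qi, ki in zip(q, k))
            row.append(dot * scale)
        scores.append(row)
    return scores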

3) causal_mask(seq_len: int) -> List[List[float]]

  • mask[i][j] = 0.0 if j <= i, else -1e9
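
The mask is plain floats, so no Value machinery is needed; a straightforward version:

def causal_mask(seq_len):
    # 0.0 where position j is visible from position i (j <= i),
    # a large negative number where j lies in the future (j > i).
    return [[0.0 if j <= i else -1e9 for j in range(seq_len)]
            for i in range(seq_len)]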

4) apply_attention(weights, V) -> List[List[Value]]

  • weights: (T, T) probabilities
  • V: (T, d_v) values
  • Output: (T, d_v) weighted sums
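
A sketch of the weighted sum, again assuming sum() works on Values:

def apply_attention(weights, V):
    # Output row i, dimension d is the sum over j of weights[i][j] * V[j][d].
    T, d_v = len(V), len(V[0])
    return [[sum(weights[i][j] * V[j][d] for j in range(T))
             for d in range(d_v)]
            for i in range(T)]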

5) SelfAttention(Module)

class SelfAttention(Module):
    def __init__(self, embed_dim: int):
        # W_q, W_k, W_v, W_o: four separate Linear(embed_dim, embed_dim) projections
        pass

    def forward(self, x: List[List[Value]], causal: bool = True) -> List[List[Value]]:
        # x: (T, embed_dim)
        # returns: (T, embed_dim)
        pass
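
One way the skeleton might be filled in, wiring the helpers above together. This sketch assumes Module provides a __call__ that dispatches to forward (so the usage example below works) and that Linear(in_features, out_features) is callable on a list of Values and returns a list of Values; adjust to your framework's actual API.

class SelfAttention(Module):
    def __init__(self, embed_dim: int):
        self.embed_dim = embed_dim
        self.W_q = Linear(embed_dim, embed_dim)
        self.W_k = Linear(embed_dim, embed_dim)
        self.W_v = Linear(embed_dim, embed_dim)
        self.W_o = Linear(embed_dim, embed_dim)

    def forward(self, x, causal=True):
        T = len(x)
        Q = [self.W_q(t) for t in x]                # (T, embed_dim)
        K = [self.W_k(t) for t in x]
        V = [self.W_v(t) for t in x]
        scores = attention_scores(Q, K, self.embed_dim)
        if causal:
            mask = causal_mask(T)
            # Adding float mask entries works because Value.__add__ wraps plain numbers.
            scores = [[scores[i][j] + mask[i][j] for j in range(T)]
                      for i in range(T)]
        weights = [softmax(row) for row in scores]  # row-wise softmax over (T, T)
        attended = apply_attention(weights, V)      # (T, embed_dim)
        return [self.W_o(row) for row in attended]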

Notes

  • Use Value ops for all computations so gradients flow.
  • The causal mask can be added as a float; Value.__add__ handles it.
  • Softmax should be computed row-wise on the (T, T) score matrix.

Example

attn = SelfAttention(embed_dim=4)
x = [[Value(0.1) for _ in range(4)] for _ in range(3)]
out = attn(x, causal=True)
assert len(out) == 3 and len(out[0]) == 4

Hints

  • Use sum(w * v for w, v in zip(vec1, vec2)) for dot products.
  • Add the causal mask to the scores before softmax; the large negative entries drive the probability mass on future positions to effectively zero after softmax.