Self-Attention: The Core of Transformers
Implement scaled dot-product self-attention using scalar Value operations and list-based vectors.
What you are building
1) softmax(scores: List[Value]) -> List[Value]
- Subtract max for numerical stability
- Return probabilities that sum to 1
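A minimal sketch of how this could look, assuming a micrograd-style Value that exposes .data, supports arithmetic with floats, division, and an exp() method (assumptions about this exercise's Value class):

def softmax(scores):
    # Subtract the max raw value so the exponentials stay in a safe range.
    m = max(s.data for s in scores)
    exps = [(s - m).exp() for s in scores]
    total = exps[0]
    for e in exps[1:]:
        total = total + e
    return [e / total for e in exps]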
2) attention_scores(Q, K, d_k) -> List[List[Value]]
- Q, K: lists of length T, each vector of length d_k
- scores[i][j] = dot(Q[i], K[j]) / sqrt(d_k)
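Under the same Value assumptions, plus a plain Python sum() over Value products (which needs the 0 + Value case to work), one possible sketch:

import math

def attention_scores(Q, K, d_k):
    # scores[i][j] = dot(Q[i], K[j]) / sqrt(d_k)
    scale = 1.0 / math.sqrt(d_k)
    return [[sum(qi * ki for qi, ki in zip(q, k)) * scale for k in K] for q in Q]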
3) causal_mask(seq_len: int) -> List[List[float]]
- mask[i][j] = 0.0 if j <= i, else -1e9
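Since the mask is plain floats, a nested comprehension is enough, for example:

def causal_mask(seq_len):
    # 0.0 on and below the diagonal, -1e9 strictly above it (future positions).
    return [[0.0 if j <= i else -1e9 for j in range(seq_len)] for i in range(seq_len)]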
4) apply_attention(weights, V) -> List[List[Value]]
- weights: (T, T) probabilities
- V: (T, d_v) values
- Output: (T, d_v) weighted sums
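A sketch of the weighted sum, under the same assumption that sum() works over Value products:

def apply_attention(weights, V):
    T, d_v = len(V), len(V[0])
    # out[i][d] = sum over j of weights[i][j] * V[j][d]
    return [[sum(weights[i][j] * V[j][d] for j in range(T)) for d in range(d_v)]
            for i in range(T)]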
5) SelfAttention(Module)
class SelfAttention(Module):
    def __init__(self, embed_dim: int):
        # W_q, W_k, W_v, W_o = Linear(embed_dim, embed_dim)
        pass

    def forward(self, x: List[List[Value]], causal: bool = True) -> List[List[Value]]:
        # x: (T, embed_dim)
        # returns: (T, embed_dim)
        pass
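One possible way the pieces could fit together, assuming the four Linear layers are created in __init__ as in the comment above, that embed_dim is stored on self, and that a Linear maps a length-embed_dim list of Values to another such list (all assumptions about this exercise's Module/Linear API); a sketch, not the reference solution:

    def forward(self, x: List[List[Value]], causal: bool = True) -> List[List[Value]]:
        T = len(x)
        Q = [self.W_q(row) for row in x]   # (T, embed_dim) queries
        K = [self.W_k(row) for row in x]   # (T, embed_dim) keys
        V = [self.W_v(row) for row in x]   # (T, embed_dim) values
        scores = attention_scores(Q, K, self.embed_dim)
        if causal:
            mask = causal_mask(T)
            # Adding the float mask pushes future positions toward -1e9 before softmax.
            scores = [[s + m for s, m in zip(s_row, m_row)]
                      for s_row, m_row in zip(scores, mask)]
        weights = [softmax(row) for row in scores]   # row-wise softmax over (T, T)
        ctx = apply_attention(weights, V)            # (T, embed_dim) context vectors
        return [self.W_o(row) for row in ctx]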
Notes
- Use Value ops for all computations so gradients flow.
- The causal mask can be added as a float; Value.__add__ handles it.
- Softmax should be computed row-wise on the (T, T) score matrix.
Example
attn = SelfAttention(embed_dim=4)
x = [[Value(0.1) for _ in range(4)] for _ in range(3)]
out = attn(x, causal=True)
assert len(out) == 3 and len(out[0]) == 4
Hints
- Use sum(w * v for w, v in zip(vec1, vec2)) for dot products.
- Apply the causal mask before the row-wise softmax; after softmax, future positions should carry essentially zero probability mass.