Embeddings and Positional Encoding
Goal
Transform discrete token IDs into continuous vectors and inject position information so a transformer can reason about order.
Prerequisites: Autograd + NN abstractions.
1) Token embeddings
An embedding is a lookup table:
- Vocabulary size: V
- Embedding dimension: D
- Weight matrix: W with shape (V, D)
Lookup for token IDs [t0, t1, ...] returns rows W[t0], W[t1], ....
import random

class Embedding(Module):
    def __init__(self, num_embeddings, embedding_dim):
        # W has shape (V, D); rows are initialized uniformly in [-k, k]
        k = 1.0 / (embedding_dim ** 0.5)
        self.weight = [[Value(random.uniform(-k, k)) for _ in range(embedding_dim)]
                       for _ in range(num_embeddings)]

    def forward(self, indices):
        # Lookup: return the weight row for each token ID
        return [self.weight[i] for i in indices]
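A minimal usage sketch (vocab_size, embed_dim, and the token IDs are placeholder values; Value and Module are assumed to come from the pack's autograd and NN abstractions):

    vocab_size, embed_dim = 65, 32                   # placeholder V and D
    token_embedding = Embedding(vocab_size, embed_dim)
    rows = token_embedding.forward([5, 12, 5])       # three rows of W, each of length embed_dim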
2) Sparse gradient updates
Only the rows that were actually looked up receive gradients. This is expected and efficient:
- embedding.weight[42] gets updated if token 42 appears
- Unused rows get no gradient
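A sketch of this behaviour, assuming a micrograd-style Value whose .grad starts at 0 and which supports summation and backward(); the sizes and token IDs below are arbitrary:

    emb = Embedding(num_embeddings=10, embedding_dim=4)
    rows = emb.forward([2, 7])                       # only rows 2 and 7 are looked up
    loss = sum(v for row in rows for v in row)       # toy scalar loss over the used rows
    loss.backward()

    print(any(v.grad != 0 for v in emb.weight[2]))   # True: row 2 participated in the loss
    print(any(v.grad != 0 for v in emb.weight[5]))   # False: row 5 was never used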
3) Positional embeddings
Self-attention is permutation-invariant: it sees a set of tokens, not a sequence, so we must inject position information explicitly.
Learned positional embedding:
- Max sequence length Tmax
- Table P with shape (Tmax, D)
- Positions are indices 0..T-1
pos = position_embedding(list(range(T)))
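For example, the positional table can reuse the same Embedding class; Tmax, D, and T below are placeholder values:

    Tmax, D = 128, 64                                # assumed max context length and model width
    position_embedding = Embedding(Tmax, D)          # table P with shape (Tmax, D)

    T = 8                                            # current sequence length, T <= Tmax
    pos_emb = position_embedding.forward(list(range(T)))   # positions 0..T-1 -> T rows of P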
4) Combine token + position
For each position:
input[i] = token_emb[i] + pos_emb[i]
Shapes:
- token_emb: (T, D)
- pos_emb: (T, D)
- input: (T, D)
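A minimal sketch of the sum, assuming token_embedding and position_embedding exist as in the sketches above and token_ids is a list of T token IDs:

    T = len(token_ids)
    tok_emb = token_embedding.forward(token_ids)             # T rows of length D
    pos_emb = position_embedding.forward(list(range(T)))     # T rows of length D

    # Elementwise sum per position: input[i][j] = tok_emb[i][j] + pos_emb[i][j]
    x = [[t + p for t, p in zip(tok_row, pos_row)]
         for tok_row, pos_row in zip(tok_emb, pos_emb)]      # (T, D) input to self-attention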
5) Tokenization (this pack)
We use character-level tokenization for clarity:
- Build char_to_idx from the dataset
- Encode text -> list of ints
- Decode ints -> string
The same embedding logic applies to subword tokenization (BPE) used in real GPTs.
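A minimal character-level tokenizer sketch; the text string below is a stand-in for the pack's dataset:

    text = "hello world"                              # placeholder corpus

    chars = sorted(set(text))                         # vocabulary = characters that occur
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}

    def encode(s):
        return [char_to_idx[ch] for ch in s]          # text -> list of ints

    def decode(ids):
        return "".join(idx_to_char[i] for i in ids)   # ints -> string

    assert decode(encode("hello")) == "hello"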
Key takeaways
- Embeddings are just a learned lookup table W.
- Positional embeddings inject order information.
- The model input is the elementwise sum of token and position vectors.
Next: self-attention, which consumes these (T, D) vectors.