Embeddings and Positional Encoding
Goal
Turn token IDs into continuous vectors and add position information. These vectors are the inputs to both attention-based and MLP language models.
1) Token embeddings
An embedding is a lookup table:
- Vocabulary size V
- Embedding dimension D
- Weight matrix W of shape (V, D)
Lookup for token IDs returns rows of W.
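A minimal sketch of this lookup in PyTorch, assuming `nn.Embedding`; the values of V and D (and the token IDs) are example choices, not fixed by the lesson:

```python
import torch
import torch.nn as nn

V, D = 50_000, 512                  # example vocabulary size and embedding dimension
token_emb = nn.Embedding(V, D)      # weight matrix W of shape (V, D)

token_ids = torch.tensor([5, 17, 5, 42])   # a short sequence of token IDs
vectors = token_emb(token_ids)             # each ID selects the corresponding row of W
print(vectors.shape)                       # torch.Size([4, 512])
```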
2) Positional embeddings
Self-attention is permutation-invariant; position must be injected.
Learned positional embeddings:
- Max sequence length Tmax
- Table P of shape (Tmax, D)
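A learned positional table can be sketched the same way, again assuming PyTorch's `nn.Embedding` and example sizes:

```python
import torch
import torch.nn as nn

Tmax, D = 1024, 512                 # example max sequence length and embedding dimension
pos_emb = nn.Embedding(Tmax, D)     # table P of shape (Tmax, D)

T = 4                               # actual sequence length (must be <= Tmax)
positions = torch.arange(T)         # position indices [0, 1, 2, 3]
vectors = pos_emb(positions)        # one position vector per index
print(vectors.shape)                # torch.Size([4, 512])
```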
3) Combine
For each position i:
input[i] = token_emb[i] + pos_emb[i]
Shapes:
- token_emb: (T, D)
- pos_emb: (T, D)
- input: (T, D)
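Putting the two together is an element-wise sum; this sketch assumes the PyTorch modules from the examples above:

```python
import torch
import torch.nn as nn

V, Tmax, D = 50_000, 1024, 512
token_emb = nn.Embedding(V, D)
pos_emb = nn.Embedding(Tmax, D)

token_ids = torch.tensor([5, 17, 5, 42])       # shape (T,) with T = 4
positions = torch.arange(token_ids.shape[0])   # shape (T,)

x = token_emb(token_ids) + pos_emb(positions)  # (T, D): token vector + position vector
print(x.shape)                                 # torch.Size([4, 512])
```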
4) Initialization
Use scaled uniform init for stability:
k = 1 / sqrt(D)
weight ~ U(-k, k)
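One way to apply this init in PyTorch (note that `nn.Embedding` defaults to a normal init, so the scaled uniform is applied explicitly here):

```python
import math
import torch.nn as nn

V, D = 50_000, 512
emb = nn.Embedding(V, D)

# Scaled uniform init: weight ~ U(-k, k) with k = 1 / sqrt(D)
k = 1.0 / math.sqrt(D)
nn.init.uniform_(emb.weight, -k, k)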
5) Sparse updates
Only the embedding rows that were looked up in a batch receive gradients; every other row's gradient stays zero. This is expected.
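A quick way to see this in PyTorch (toy sizes, chosen only for illustration):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
token_ids = torch.tensor([2, 7, 2])     # only rows 2 and 7 are looked up

loss = emb(token_ids).sum()
loss.backward()

# Rows 2 and 7 have nonzero gradients; all other rows are exactly zero.
print(emb.weight.grad.abs().sum(dim=1))
```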
Key takeaways
- Embeddings are just learned lookup tables.
- Positional embeddings inject order.
- Combined embeddings are the standard transformer input.
Next: build a bigram language model using embeddings.