Embeddings and Positional Encoding
Goal
Turn token IDs into continuous vectors and add position information. These vectors are the inputs to both attention-based and MLP language models.
1) Token embeddings
An embedding is a lookup table:
- Vocabulary size V
- Embedding dimension D
- Weight matrix W of shape (V, D)
Lookup for token IDs returns rows of W.
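A minimal sketch of this lookup in PyTorch, assuming `nn.Embedding`; the values of V and D (and the token IDs) are example choices, not fixed by the lesson:

```python
import torch
import torch.nn as nn

V, D = 50_000, 512                  # example vocabulary size and embedding dimension
token_emb = nn.Embedding(V, D)      # weight matrix W of shape (V, D)

token_ids = torch.tensor([5, 17, 5, 42])   # a short sequence of token IDs
vectors = token_emb(token_ids)             # each ID selects the corresponding row of W
print(vectors.shape)                       # torch.Size([4, 512])
```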
2) Positional embeddings
Self-attention is permutation-invariant; position must be injected.
Learned positional embeddings:
- Max sequence length Tmax
- Table P of shape (Tmax, D)
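A learned positional table can be sketched the same way, again assuming PyTorch's `nn.Embedding` and example sizes:

```python
import torch
import torch.nn as nn

Tmax, D = 1024, 512                 # example max sequence length and embedding dimension
pos_emb = nn.Embedding(Tmax, D)     # table P of shape (Tmax, D)

T = 4                               # actual sequence length (must be <= Tmax)
positions = torch.arange(T)         # position indices [0, 1, 2, 3]
vectors = pos_emb(positions)        # one position vector per index
print(vectors.shape)                # torch.Size([4, 512])
```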
3) Combine
For each position i:
input[i] = token_emb[i] + pos_emb[i]
Shapes:
- token_emb: (T, D)
- pos_emb: (T, D)
- input: (T, D)
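Putting the two together is an element-wise sum; this sketch assumes the PyTorch modules from the examples above:

```python
import torch
import torch.nn as nn

V, Tmax, D = 50_000, 1024, 512
token_emb = nn.Embedding(V, D)
pos_emb = nn.Embedding(Tmax, D)

token_ids = torch.tensor([5, 17, 5, 42])       # shape (T,) with T = 4
positions = torch.arange(token_ids.shape[0])   # shape (T,)

x = token_emb(token_ids) + pos_emb(positions)  # (T, D): token vector + position vector
print(x.shape)                                 # torch.Size([4, 512])
```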
4) Initialization
Use scaled uniform init for stability:
k = 1 / sqrt(D)
weight ~ U(-k, k)
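One way to apply this init in PyTorch (note that `nn.Embedding` defaults to a normal init, so the scaled uniform is applied explicitly here):

```python
import math
import torch.nn as nn

V, D = 50_000, 512
emb = nn.Embedding(V, D)

# Scaled uniform init: weight ~ U(-k, k) with k = 1 / sqrt(D)
k = 1.0 / math.sqrt(D)
nn.init.uniform_(emb.weight, -k, k)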
5) Sparse updates
Only the embedding rows that were looked up in a batch receive gradients; every other row's gradient stays zero. This is expected.
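A quick way to see this in PyTorch (toy sizes, chosen only for illustration):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
token_ids = torch.tensor([2, 7, 2])     # only rows 2 and 7 are looked up

loss = emb(token_ids).sum()
loss.backward()

# Rows 2 and 7 have nonzero gradients; all other rows are exactly zero.
print(emb.weight.grad.abs().sum(dim=1))
```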
Key takeaways
- Embeddings are just learned lookup tables.
- Positional embeddings inject order.
- Combined embeddings are the standard transformer input.
Next: build a bigram language model using embeddings.