Embeddings and Positional Encoding


Goal

Turn token IDs into continuous vectors and add position information. These vectors are the inputs to both MLP and attention-based language models.


1) Token embeddings

An embedding is a lookup table:

  • Vocabulary size V
  • Embedding dimension D
  • Weight matrix W of shape (V, D)

Looking up a batch of token IDs returns the corresponding rows of W.
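
A minimal sketch in PyTorch (an assumption; the lesson names no framework, and the values of V and D below are illustrative):

  import torch
  import torch.nn as nn

  V, D = 50_000, 64                 # illustrative vocabulary size and embedding dimension
  token_emb = nn.Embedding(V, D)    # weight matrix W of shape (V, D)

  ids = torch.tensor([5, 42, 7])    # a small batch of token IDs
  vectors = token_emb(ids)          # returns the corresponding rows of W
  print(vectors.shape)              # torch.Size([3, 64])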


2) Positional embeddings

Self-attention is permutation-invariant, so position information must be injected explicitly.

Learned positional embeddings:

  • Max sequence length Tmax
  • Table P of shape (Tmax, D)
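
A learned positional table can be implemented as another embedding, indexed by position rather than by token ID (a sketch under the same PyTorch assumption; Tmax and D are illustrative):

  import torch
  import torch.nn as nn

  Tmax, D = 256, 64                 # assumed maximum sequence length and embedding dimension
  pos_emb = nn.Embedding(Tmax, D)   # table P of shape (Tmax, D)

  T = 8                             # length of the current sequence (must be <= Tmax)
  positions = torch.arange(T)       # position indices [0, 1, ..., T-1]
  p = pos_emb(positions)            # shape (T, D): one vector per position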

3) Combine

For each position:

input[i] = token_emb[i] + pos_emb[i]

Shapes:

  • token_emb: (T, D)
  • pos_emb: (T, D)
  • input: (T, D)
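
Putting the two lookups together for a single sequence (a sketch continuing the illustrative shapes above):

  import torch
  import torch.nn as nn

  V, Tmax, D = 50_000, 256, 64
  token_emb = nn.Embedding(V, D)
  pos_emb = nn.Embedding(Tmax, D)

  ids = torch.randint(0, V, (8,))            # token IDs for a sequence of length T = 8
  positions = torch.arange(ids.shape[0])     # [0, 1, ..., 7]

  x = token_emb(ids) + pos_emb(positions)    # elementwise sum, shape (T, D)
  print(x.shape)                             # torch.Size([8, 64])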

4) Initialization

Use scaled uniform init for stability:

k = 1 / sqrt(D)
weight ~ U(-k, k)
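
One way to apply this in PyTorch (a sketch; nn.init.uniform_ overwrites the default normal initialization of nn.Embedding):

  import math
  import torch.nn as nn

  V, D = 50_000, 64
  emb = nn.Embedding(V, D)

  k = 1 / math.sqrt(D)                  # k = 1 / sqrt(D)
  nn.init.uniform_(emb.weight, -k, k)   # weight ~ U(-k, k)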

5) Sparse updates

Only the rows of the embedding table that were looked up in the current batch receive gradients; all other rows are left untouched. This is expected behavior, not a bug.
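
A quick way to see this (a sketch): run a forward and backward pass, then check which rows of the weight gradient are nonzero.

  import torch
  import torch.nn as nn

  emb = nn.Embedding(10, 4)
  ids = torch.tensor([2, 7])        # only rows 2 and 7 are looked up

  emb(ids).sum().backward()         # toy "loss": sum of the looked-up vectors

  nonzero_rows = emb.weight.grad.abs().sum(dim=1).nonzero().flatten()
  print(nonzero_rows)               # tensor([2, 7]): only the looked-up rows got gradients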


Key takeaways

  1. Embeddings are just learned lookup tables.
  2. Positional embeddings inject order.
  3. Combined embeddings are the standard transformer input.

Next: build a bigram language model using embeddings.

