Embeddings and Positional Encoding

Lesson, slides, and applied problem sets.

Lesson

Goal

Transform discrete token IDs into continuous vectors and inject position information so a transformer can reason about order.

Prerequisites: Autograd + NN abstractions.


1) Token embeddings

An embedding is a lookup table:

  • Vocabulary size: V
  • Embedding dimension: D
  • Weight matrix: W with shape (V, D)

Lookup for token IDs [t0, t1, ...] returns rows W[t0], W[t1], ....

import random

class Embedding(Module):
    def __init__(self, num_embeddings, embedding_dim):
        # Scaled uniform init: each entry drawn from [-1/sqrt(D), 1/sqrt(D)]
        k = 1.0 / (embedding_dim ** 0.5)
        self.weight = [[Value(random.uniform(-k, k)) for _ in range(embedding_dim)]
                       for _ in range(num_embeddings)]

    def forward(self, indices):
        # Lookup: return one weight row (a list of D Values) per token ID
        return [self.weight[i] for i in indices]
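
For example, a lookup might look like this (the vocabulary size and embedding dimension are assumed values):

emb = Embedding(num_embeddings=65, embedding_dim=16)
rows = emb.forward([4, 17, 4])   # 3 rows of 16 Values; ID 4 returns the same weight row both times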

2) Sparse gradient updates

Only the rows that were actually looked up receive gradients. This is expected and efficient, as the sketch below illustrates:

  • embedding.weight[42] receives a gradient only if token 42 appears in the input
  • unused rows receive no gradient, so a plain SGD step leaves them unchanged
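
A minimal sketch of this behavior, assuming the lesson's micrograd-style Value supports +, .grad, and .backward():

emb = Embedding(num_embeddings=100, embedding_dim=8)

rows = emb.forward([42, 7, 42])                           # token 42 appears twice, token 7 once
loss = sum((v for row in rows for v in row), Value(0.0))  # toy scalar loss over the looked-up rows
loss.backward()

print(emb.weight[42][0].grad)   # nonzero: row 42 was used (twice, so its gradients accumulate)
print(emb.weight[7][0].grad)    # nonzero: row 7 was used once
print(emb.weight[3][0].grad)    # 0.0: row 3 was never looked up, so it gets no gradient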

3) Positional embeddings

Self-attention is permutation-invariant; we must add position info.

Learned positional embedding:

  • Max sequence length Tmax
  • Table P with shape (Tmax, D)
  • Positions are indices 0..T-1

pos = position_embedding(list(range(T)))
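
The same Embedding class serves as the position table. A minimal sketch, with Tmax, D, and T as assumed hyperparameters, calling forward directly:

Tmax, D = 64, 16                                        # assumed max sequence length and width
position_embedding = Embedding(Tmax, D)                 # learned table P, shape (Tmax, D)

T = 8                                                   # current sequence length, T <= Tmax
pos_emb = position_embedding.forward(list(range(T)))    # T rows, one per position 0..T-1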

4) Combine token + position

For each position:

input[i] = token_emb[i] + pos_emb[i]

Shapes:

  • token_emb: (T, D)
  • pos_emb: (T, D)
  • input: (T, D)
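
A minimal sketch of this sum, where embedding and token_ids are assumed to be the token Embedding and the encoded IDs, and both lookups are (T, D) lists of Value rows:

token_emb = embedding.forward(token_ids)                # (T, D) token vectors
pos_emb = position_embedding.forward(list(range(T)))    # (T, D) position vectors

# Elementwise sum: one D-dimensional input vector per position
inputs = [[t + p for t, p in zip(tok_row, pos_row)]
          for tok_row, pos_row in zip(token_emb, pos_emb)]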

5) Tokenization (this pack)

We use character-level tokenization for clarity:

  • Build char_to_idx from the dataset
  • Encode text -> list of ints
  • Decode ints -> string

The same embedding logic applies to subword tokenization (BPE) used in real GPTs.
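
A minimal character-level tokenizer along these lines, with text as an assumed toy dataset:

text = "hello world"                                  # assumed toy dataset

chars = sorted(set(text))                             # unique characters in the data
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
vocab_size = len(chars)                               # V for the Embedding table

def encode(s):
    # text -> list of ints
    return [char_to_idx[ch] for ch in s]

def decode(ids):
    # list of ints -> text
    return "".join(idx_to_char[i] for i in ids)

ids = encode("hello")          # [3, 2, 4, 4, 5] for this toy text
assert decode(ids) == "hello"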


Key takeaways

  1. Embeddings are just a learned lookup table W.
  2. Positional embeddings inject order information.
  3. The model input is the elementwise sum of token and position vectors.

Next: self-attention, which consumes these (T, D) vectors.

