Embeddings and Positional Encoding

Lesson, slides, and applied problem sets.

Lesson

Goal

Transform discrete token IDs into continuous vectors and inject position information so a transformer can reason about order.

Prerequisites: Autograd + NN abstractions.


1) Token embeddings

An embedding is a lookup table:

  • Vocabulary size: V
  • Embedding dimension: D
  • Weight matrix: W with shape (V, D)

Lookup for token IDs [t0, t1, ...] returns rows W[t0], W[t1], ....

import random

class Embedding(Module):
    def __init__(self, num_embeddings, embedding_dim):
        # Scaled uniform init: each entry drawn from [-1/sqrt(D), 1/sqrt(D)]
        k = 1.0 / (embedding_dim ** 0.5)
        self.weight = [[Value(random.uniform(-k, k)) for _ in range(embedding_dim)]
                       for _ in range(num_embeddings)]

    def forward(self, indices):
        # Lookup: return one weight row (a list of D Values) per token ID
        return [self.weight[i] for i in indices]
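
For example, a lookup might look like this (the vocabulary size and embedding dimension are assumed values):

emb = Embedding(num_embeddings=65, embedding_dim=16)
rows = emb.forward([4, 17, 4])   # 3 rows of 16 Values; ID 4 returns the same weight row both times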

2) Sparse gradient updates

Only the rows that were actually looked up receive gradients. This is expected and efficient, as the sketch below illustrates:

  • embedding.weight[42] receives a gradient only if token 42 appears in the input
  • unused rows receive no gradient, so a plain SGD step leaves them unchanged
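
A minimal sketch of this behavior, assuming the lesson's micrograd-style Value supports +, .grad, and .backward():

emb = Embedding(num_embeddings=100, embedding_dim=8)

rows = emb.forward([42, 7, 42])                           # token 42 appears twice, token 7 once
loss = sum((v for row in rows for v in row), Value(0.0))  # toy scalar loss over the looked-up rows
loss.backward()

print(emb.weight[42][0].grad)   # nonzero: row 42 was used (twice, so its gradients accumulate)
print(emb.weight[7][0].grad)    # nonzero: row 7 was used once
print(emb.weight[3][0].grad)    # 0.0: row 3 was never looked up, so it gets no gradient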

3) Positional embeddings

Self-attention is permutation-invariant; we must add position info.

Learned positional embedding:

  • Max sequence length Tmax
  • Table P with shape (Tmax, D)
  • Positions are indices 0..T-1

pos = position_embedding(list(range(T)))
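
The same Embedding class serves as the position table. A minimal sketch, with Tmax, D, and T as assumed hyperparameters, calling forward directly:

Tmax, D = 64, 16                                        # assumed max sequence length and width
position_embedding = Embedding(Tmax, D)                 # learned table P, shape (Tmax, D)

T = 8                                                   # current sequence length, T <= Tmax
pos_emb = position_embedding.forward(list(range(T)))    # T rows, one per position 0..T-1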

4) Combine token + position

For each position:

input[i] = token_emb[i] + pos_emb[i]

Shapes:

  • token_emb: (T, D)
  • pos_emb: (T, D)
  • input: (T, D)
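
A minimal sketch of this sum, where embedding and token_ids are assumed to be the token Embedding and the encoded IDs, and both lookups are (T, D) lists of Value rows:

token_emb = embedding.forward(token_ids)                # (T, D) token vectors
pos_emb = position_embedding.forward(list(range(T)))    # (T, D) position vectors

# Elementwise sum: one D-dimensional input vector per position
inputs = [[t + p for t, p in zip(tok_row, pos_row)]
          for tok_row, pos_row in zip(token_emb, pos_emb)]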

5) Tokenization (this pack)

We use character-level tokenization for clarity:

  • Build char_to_idx from the dataset
  • Encode text -> list of ints
  • Decode ints -> string

The same embedding logic applies to subword tokenization (BPE) used in real GPTs.
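
A minimal character-level tokenizer along these lines, with text as an assumed toy dataset:

text = "hello world"                                  # assumed toy dataset

chars = sorted(set(text))                             # unique characters in the data
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
vocab_size = len(chars)                               # V for the Embedding table

def encode(s):
    # text -> list of ints
    return [char_to_idx[ch] for ch in s]

def decode(ids):
    # list of ints -> text
    return "".join(idx_to_char[i] for i in ids)

ids = encode("hello")          # [3, 2, 4, 4, 5] for this toy text
assert decode(ids) == "hello"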


Key takeaways

  1. Embeddings are just a learned lookup table W.
  2. Positional embeddings inject order information.
  3. The model input is the elementwise sum of token and position vectors.

Next: self-attention, which consumes these (T, D) vectors.

