Embeddings and Positional Encoding

Token embeddings

  • Vocabulary size V, embedding dim D
  • Weight table W has shape (V, D)
  • Lookup by token IDs returns one D-dimensional vector per token (see the sketch below)
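
A minimal lookup sketch, assuming PyTorch; the sizes (V = 65, D = 32) and the token IDs are placeholders, not values from this pack.

  import torch

  V, D = 65, 32                             # assumed vocabulary size and embedding dim
  emb = torch.nn.Embedding(V, D)            # weight table W of shape (V, D)

  token_ids = torch.tensor([7, 3, 7, 12])   # a sequence of T = 4 token IDs
  tok_emb = emb(token_ids)                  # row lookup: one D-dim vector per ID
  print(tok_emb.shape)                      # torch.Size([4, 32]) -> (T, D)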

Sparse updates

Only the rows indexed by the batch's tokens receive gradients. Rows for unused tokens keep their weights unchanged.
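
A quick way to see this, again assuming PyTorch with placeholder sizes and token IDs:

  import torch

  V, D = 65, 32
  emb = torch.nn.Embedding(V, D)

  token_ids = torch.tensor([7, 3, 7])            # only rows 3 and 7 are looked up
  loss = emb(token_ids).sum()                    # any scalar loss over the looked-up rows
  loss.backward()

  per_row_grad = emb.weight.grad.abs().sum(dim=1)
  print(torch.nonzero(per_row_grad).flatten())   # tensor([3, 7]): all other rows got zero gradient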

Positional embeddings

  • Self-attention has no built-in sense of token order
  • Add learned position vectors P[pos]
  • P shape: (Tmax, D); sketch below
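
A sketch of a learned position table, assuming PyTorch and a placeholder Tmax = 256:

  import torch

  Tmax, D = 256, 32                        # assumed context limit and embedding dim
  pos_table = torch.nn.Embedding(Tmax, D)  # learned table P, shape (Tmax, D)

  T = 4
  positions = torch.arange(T)              # positions 0 .. T-1
  pos_emb = pos_table(positions)           # P[pos] for each position -> shape (T, D)
  print(pos_emb.shape)                     # torch.Size([4, 32])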

Combine

input[i] = token_emb[i] + pos_emb[i]

Shape: (T, D)
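
A combined sketch under the same assumptions (PyTorch, placeholder sizes and token IDs):

  import torch

  V, Tmax, D = 65, 256, 32
  tok_table = torch.nn.Embedding(V, D)
  pos_table = torch.nn.Embedding(Tmax, D)

  token_ids = torch.tensor([7, 3, 7, 12])                 # shape (T,)
  T = token_ids.shape[0]

  x = tok_table(token_ids) + pos_table(torch.arange(T))   # input[i] = token_emb[i] + pos_emb[i]
  print(x.shape)                                          # torch.Size([4, 32]) -> (T, D)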

Tokenization note

We use character-level tokens in this pack for simplicity.
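
A minimal character-level tokenizer sketch; the corpus string and the stoi/itos names are illustrative, not taken from this pack.

  text = "hello world"
  chars = sorted(set(text))                      # vocabulary = the unique characters, so V = len(chars)
  stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token ID
  itos = {i: ch for ch, i in stoi.items()}       # token ID -> char

  ids = [stoi[ch] for ch in "hello"]             # encode
  print(ids)
  print("".join(itos[i] for i in ids))           # decode back to "hello"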
