Embeddings and Positional Encoding
Vocab size V, embedding dim D
W has shape (V, D)
Only rows that were used get gradients. Unused tokens keep their weights unchanged.
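The lookup and its row-wise gradient can be sketched in NumPy. This is a minimal illustration, not the pack's actual implementation: the sizes, the toy `token_ids`, and the all-ones upstream gradient are assumptions made up for the example.

```python
import numpy as np

V, D = 6, 4                      # hypothetical vocab size and embedding dim
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))      # embedding table, shape (V, D)

token_ids = np.array([2, 0, 2])  # toy sequence; token 2 appears twice
token_emb = W[token_ids]         # forward pass: plain row lookup, shape (T, D)

# Backward pass: assume an upstream gradient of all ones, shape (T, D).
grad_out = np.ones_like(token_emb)
grad_W = np.zeros_like(W)
# Scatter-add so repeated ids accumulate instead of overwriting each other.
np.add.at(grad_W, token_ids, grad_out)

# Rows with a nonzero gradient are exactly the tokens that were looked up.
used_rows = np.flatnonzero(np.abs(grad_W).sum(axis=1))
```

Here only rows 0 and 2 of `grad_W` are nonzero, and row 2 receives twice the gradient because it was used twice. Note that unused rows stay unchanged only under a plain gradient step; weight decay or optimizer momentum could still move them.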
P[pos]
P shape: (Tmax, D)
input[i] = token_emb[i] + pos_emb[i]
Shape: (T, D)
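As a rough sketch of how the two tables combine, assuming learned position embeddings stored in a table `P` and made-up sizes (the name `inputs` is used instead of `input` to avoid shadowing the Python builtin):

```python
import numpy as np

V, D, T_max = 6, 4, 8            # hypothetical sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))      # token embedding table, shape (V, D)
P = rng.normal(size=(T_max, D))  # position embedding table, shape (Tmax, D)

token_ids = np.array([2, 0, 5])  # toy sequence of length T = 3
T = len(token_ids)
assert T <= T_max, "sequence is longer than the position table"

token_emb = W[token_ids]         # (T, D): one row per token
pos_emb = P[np.arange(T)]        # (T, D): rows 0..T-1 of P
inputs = token_emb + pos_emb     # (T, D): elementwise sum per position
```

Because both lookups produce a (T, D) array, the sum needs no broadcasting, and the same token at two different positions ends up with two different input vectors.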
We use character-level tokens in this pack for simplicity.
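A character-level tokenizer can be as small as the sketch below. The sample text and the `stoi` / `itos` names are illustrative assumptions, not taken from the pack.

```python
text = "hello world"                      # hypothetical sample text

vocab = sorted(set(text))                 # one token per distinct character
V = len(vocab)                            # vocab size
stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

token_ids = [stoi[ch] for ch in text]     # encode: one id per character
decoded = "".join(itos[i] for i in token_ids)  # decode back to text
```

Encoding and decoding are exact inverses here, and the vocabulary stays tiny, which keeps the embedding table W small.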