Tokenization and Batching
Goal
Build a minimal character-level tokenizer and a deterministic batch sampler for language modeling.
1) Vocabulary
From a training corpus text:
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}
These are your reversible token maps.
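As a quick sketch, here is what the maps look like on a toy corpus (the string below is only a stand-in for your real training text):

text = "hello world"                        # illustrative stand-in corpus
chars = sorted(set(text))                   # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
stoi = {c: i for i, c in enumerate(chars)}  # char -> id, e.g. {' ': 0, 'd': 1, ...}
itos = {i: c for c, i in stoi.items()}      # id -> char (inverse map)
print(len(chars))                           # vocabulary size: 8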
2) Encode / Decode
def encode(s):
    # Map each character to its integer id.
    return [stoi[c] for c in s]

def decode(ids):
    # Map ids back to characters and join them into a string.
    return "".join(itos[i] for i in ids)
decode(encode(s)) should return s.
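A minimal round-trip check, assuming the maps from step 1 are in scope (note that every character of the input must appear in the training corpus, otherwise stoi raises a KeyError):

sample = "tokenize me"          # any string built from corpus characters
ids = encode(sample)
assert decode(ids) == sample    # the round trip must be lossless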
3) Train / Val split
Encode the full corpus once, then use a fixed split ratio so evaluation is deterministic:
data = encode(text)
split = int(len(data) * 0.9)
train = data[:split]
val = data[split:]
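A quick sanity check on the split (the 90/10 numbers below are purely illustrative):

# With len(data) == 1000, the 90/10 split gives split == 900,
# len(train) == 900 and len(val) == 100.
assert len(train) + len(val) == len(data)   # nothing dropped or duplicated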
4) Random batch sampling
For block size T and batch size B:
- sample B random start indices i
- for each i, take x = data[i : i+T] and y = data[i+1 : i+T+1], so each target is the next token
def get_batch(data, block_size, batch_size, rng):
    xs, ys = [], []
    for _ in range(batch_size):
        # randint is inclusive on both ends, so i+block_size+1 <= len(data)
        i = rng.randint(0, len(data) - block_size - 1)
        xs.append(data[i:i+block_size])         # input chunk
        ys.append(data[i+1:i+block_size+1])     # same chunk shifted by one token
    return xs, ys
Pass an RNG so tests can be deterministic.
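A minimal sketch of a determinism test, assuming train is the encoded corpus from step 3; the seed, block size, and batch size are arbitrary:

import random

rng_a = random.Random(42)
rng_b = random.Random(42)
batch_a = get_batch(train, block_size=8, batch_size=4, rng=rng_a)
batch_b = get_batch(train, block_size=8, batch_size=4, rng=rng_b)
assert batch_a == batch_b   # same seed, same batches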
Key takeaways
- Tokenization is just a reversible mapping.
- Batch sampling aligns inputs with next-token targets.
- Deterministic batching makes debugging and tests reliable.
Next: embeddings + positional encodings.