Tokenization and Batching

Lesson, slides, and applied problem sets.

Lesson

Goal

Build a minimal character-level tokenizer and a seeded, reproducible random batch sampler for language modeling.


1) Vocabulary

From a training corpus (a string text), build the character vocabulary:

  • chars = sorted(set(text))
  • stoi = {c: i for i, c in enumerate(chars)}
  • itos = {i: c for c, i in stoi.items()}

These are your reversible token maps.
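
For example, on a toy corpus (the string below is only a stand-in for a real training file):

text = "hello world"                          # toy corpus; normally the full training text
chars = sorted(set(text))                     # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
stoi = {c: i for i, c in enumerate(chars)}    # char -> id
itos = {i: c for c, i in stoi.items()}        # id -> char
vocab_size = len(chars)                       # 8 for this toy corpus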


2) Encode / Decode

def encode(s):
    # string -> list of integer token ids
    return [stoi[c] for c in s]

def decode(ids):
    # list of integer token ids -> string
    return "".join(itos[i] for i in ids)

decode(encode(s)) should return s for any s whose characters all appear in the vocabulary; an unseen character raises a KeyError in encode.
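
A quick round-trip check, reusing the toy vocabulary sketched in section 1:

s = "hello world"
ids = encode(s)                 # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1] with the toy vocabulary
assert decode(ids) == s         # lossless round trip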


3) Train / Val split

Use a fixed split ratio for deterministic evaluation, where data is the encoded corpus (the list of ids from encode(text)):

split = int(len(data) * 0.9)
train = data[:split]
val = data[split:]
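
Putting sections 1-3 together, a minimal end-to-end sketch (input.txt is a placeholder filename; encode is the function from section 2):

with open("input.txt", "r", encoding="utf-8") as f:   # placeholder filename
    text = f.read()

chars = sorted(set(text))                     # vocabulary build from section 1
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

data = encode(text)                           # full corpus as a flat list of token ids
split = int(len(data) * 0.9)                  # fixed 90/10 split: same val tokens every run
train = data[:split]
val = data[split:]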

4) Random batch sampling

For block size T and batch size B:

  • sample B random start indices i
  • inputs:  x = data[i : i+T]
  • targets: y = data[i+1 : i+T+1], i.e. x shifted one position, so y[t] is the token that follows x[t]

def get_batch(data, block_size, batch_size, rng):
    xs, ys = [], []
    for _ in range(batch_size):
        # largest valid start keeps i + block_size + 1 within the data
        i = rng.randint(0, len(data) - block_size - 1)
        xs.append(data[i:i+block_size])          # input block
        ys.append(data[i+1:i+block_size+1])      # target block, shifted by one
    return xs, ys

Pass an RNG so tests can be deterministic.
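
For example, seeding Python's built-in random module (the seed, block size, and batch size below are arbitrary illustration values):

import random

rng = random.Random(0)           # fixed seed -> identical batches on every run
xs, ys = get_batch(train, block_size=8, batch_size=4, rng=rng)

assert len(xs) == 4 and all(len(x) == 8 for x in xs)
assert ys[0][:-1] == xs[0][1:]   # targets are the inputs shifted by one position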


Key takeaways

  1. Tokenization is just a reversible mapping.
  2. Batch sampling aligns inputs with next-token targets.
  3. Deterministic batching makes debugging and tests reliable.

Next: embeddings + positional encodings.

