Tokenization and Batching

Lesson, slides, and applied problem sets.

Lesson

Goal

Build a minimal character-level tokenizer and a seeded, reproducible random batch sampler for language modeling.


1) Vocabulary

From a training corpus (a string text), build the character vocabulary:

  • chars = sorted(set(text))
  • stoi = {c: i for i, c in enumerate(chars)}
  • itos = {i: c for c, i in stoi.items()}

These are your reversible token maps.
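
For example, on a toy corpus (the string below is only a stand-in for a real training file):

text = "hello world"                          # toy corpus; normally the full training text
chars = sorted(set(text))                     # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
stoi = {c: i for i, c in enumerate(chars)}    # char -> id
itos = {i: c for c, i in stoi.items()}        # id -> char
vocab_size = len(chars)                       # 8 for this toy corpus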


2) Encode / Decode

def encode(s):
    # string -> list of integer token ids
    return [stoi[c] for c in s]

def decode(ids):
    # list of integer token ids -> string
    return "".join(itos[i] for i in ids)

decode(encode(s)) should return s for any s whose characters all appear in the vocabulary; an unseen character raises a KeyError in encode.
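
A quick round-trip check, reusing the toy vocabulary sketched in section 1:

s = "hello world"
ids = encode(s)                 # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1] with the toy vocabulary
assert decode(ids) == s         # lossless round trip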


3) Train / Val split

Use a fixed split ratio for deterministic evaluation, where data is the encoded corpus (the list of ids from encode(text)):

split = int(len(data) * 0.9)
train = data[:split]
val = data[split:]
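
Putting sections 1-3 together, a minimal end-to-end sketch (input.txt is a placeholder filename; encode is the function from section 2):

with open("input.txt", "r", encoding="utf-8") as f:   # placeholder filename
    text = f.read()

chars = sorted(set(text))                     # vocabulary build from section 1
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

data = encode(text)                           # full corpus as a flat list of token ids
split = int(len(data) * 0.9)                  # fixed 90/10 split: same val tokens every run
train = data[:split]
val = data[split:]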

4) Random batch sampling

For block size T and batch size B:

  • sample B random start indices i
  • inputs:  x = data[i : i+T]
  • targets: y = data[i+1 : i+T+1], i.e. x shifted one position, so y[t] is the token that follows x[t]

def get_batch(data, block_size, batch_size, rng):
    xs, ys = [], []
    for _ in range(batch_size):
        # largest valid start keeps i + block_size + 1 within the data
        i = rng.randint(0, len(data) - block_size - 1)
        xs.append(data[i:i+block_size])          # input block
        ys.append(data[i+1:i+block_size+1])      # target block, shifted by one
    return xs, ys

Pass an RNG so tests can be deterministic.
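
For example, seeding Python's built-in random module (the seed, block size, and batch size below are arbitrary illustration values):

import random

rng = random.Random(0)           # fixed seed -> identical batches on every run
xs, ys = get_batch(train, block_size=8, batch_size=4, rng=rng)

assert len(xs) == 4 and all(len(x) == 8 for x in xs)
assert ys[0][:-1] == xs[0][1:]   # targets are the inputs shifted by one position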


Key takeaways

  1. Tokenization is just a reversible mapping.
  2. Batch sampling aligns inputs with next-token targets.
  3. Deterministic batching makes debugging and tests reliable.

Next: embeddings + positional encodings.

