Tokenization and Batching

easy · tokenization, nlp, batching

Implement a minimal character-level tokenizer and a deterministic batch sampler.

Functions to implement

1) build_vocab(text: str)

Return (stoi, itos) where:

  • stoi: dict mapping char -> int
  • itos: dict mapping int -> char

Rules:

  • Use sorted(set(text)) to define the vocabulary.
  • itos must be the exact inverse of stoi.
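
A minimal sketch following these rules (one possible implementation, not the only valid one):

def build_vocab(text: str):
    # Vocabulary = sorted set of unique characters in the text.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}  # exact inverse of stoi
    return stoi, itos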

2) encode(text: str, stoi: dict) -> list[int]

Convert a string into a list of token IDs.
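
A straightforward sketch (it assumes every character of text appears in stoi; otherwise a KeyError is raised):

def encode(text: str, stoi: dict) -> list[int]:
    # Look up each character's integer ID.
    return [stoi[ch] for ch in text]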

3) decode(ids: list[int], itos: dict) -> str

Convert a list of token IDs back into a string.
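
A matching sketch:

def decode(ids: list[int], itos: dict) -> str:
    # Map each ID back to its character and join into one string.
    return "".join(itos[i] for i in ids)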

4) train_val_split(data: list[int], split_ratio: float = 0.9)

Split encoded data into train/val lists by index:

split = int(len(data) * split_ratio)
train = data[:split]
val = data[split:]
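
Wrapped as a function, the snippet above is essentially the whole implementation:

def train_val_split(data: list[int], split_ratio: float = 0.9):
    # Deterministic split by index: first part is train, remainder is val.
    split = int(len(data) * split_ratio)
    return data[:split], data[split:]

For example, 10 tokens with the default ratio give split = 9: train gets the first 9 tokens and val gets the last one.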

5) get_batch(data, block_size, batch_size, rng)

Return (xs, ys) where:

  • xs is a list of batch_size sequences, each of length block_size
  • ys is the corresponding list of target sequences, each shifted forward by one token

Use rng.randint to choose the start index i of each sequence. Every i must satisfy:

0 <= i <= len(data) - block_size - 1

The output must satisfy:

ys[k] == xs[k][1:] + [next_token]

where next_token is the element of data that immediately follows the last token of xs[k] (i.e., data[i + block_size] for start index i).
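
A sketch assuming data is a plain Python list and rng behaves like random.Random (whose randint is inclusive of both bounds, matching the required range):

def get_batch(data, block_size, batch_size, rng):
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randint(0, len(data) - block_size - 1)  # inclusive bounds
        xs.append(data[i : i + block_size])
        ys.append(data[i + 1 : i + block_size + 1])  # same window, shifted by one
    return xs, ys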

Notes

  • Pass rng explicitly to make tests deterministic.
  • decode(encode(s, stoi), itos) should return s exactly (lossless round trip).

Example

text = "hello"
stoi, itos = build_vocab(text)
ids = encode(text, stoi)
assert decode(ids, itos) == text
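
A possible continuation of the example; the seed 0, block_size=2, and batch_size=3 are arbitrary choices for illustration:

import random

rng = random.Random(0)
train, val = train_val_split(ids)   # split = int(5 * 0.9) = 4
xs, ys = get_batch(train, block_size=2, batch_size=3, rng=rng)
assert all(y[:-1] == x[1:] for x, y in zip(xs, ys))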