Tokenization and Batching
Implement a minimal character-level tokenizer and a deterministic batch sampler.
Functions to implement
1) build_vocab(text: str)
Return (stoi, itos) where:
- stoi: dict mapping char -> int
- itos: dict mapping int -> char
Rules:
- Use sorted(set(text)) to define the vocabulary.
- itos must be the exact inverse of stoi.
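A minimal sketch following these rules:

def build_vocab(text: str):
    # Sorted unique characters give a deterministic vocabulary order.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}  # exact inverse of stoi
    return stoi, itos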
2) encode(text: str, stoi: dict) -> list[int]
Convert a string into a list of token IDs.
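One straightforward implementation, assuming every character of text appears in stoi:

def encode(text: str, stoi: dict) -> list[int]:
    # Look up each character's integer ID.
    return [stoi[ch] for ch in text]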
3) decode(ids: list[int], itos: dict) -> str
Convert a list of token IDs back into a string.
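The inverse mapping, again as a sketch:

def decode(ids: list[int], itos: dict) -> str:
    # Map each ID back to its character and concatenate.
    return "".join(itos[i] for i in ids)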
4) train_val_split(data: list[int], split_ratio: float = 0.9)
Split encoded data into train/val lists by index:
split = int(len(data) * split_ratio)
train = data[:split]
val = data[split:]
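Wrapped as a function, the split above is simply:

def train_val_split(data: list[int], split_ratio: float = 0.9):
    split = int(len(data) * split_ratio)
    return data[:split], data[split:]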
5) get_batch(data, block_size, batch_size, rng)
Return (xs, ys) where:
- xs is a list of batch_size sequences, each of length block_size
- ys is the same but shifted by one token
Use rng.randint to choose start indices. Required index range:
0 <= i <= len(data) - block_size - 1
The output must satisfy:
ys[k] == xs[k][1:] + [next_token], where next_token is the token in data immediately after xs[k]
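A sketch that satisfies both constraints, assuming rng behaves like random.Random (whose randint is inclusive on both endpoints):

def get_batch(data, block_size, batch_size, rng):
    xs, ys = [], []
    for _ in range(batch_size):
        # Highest valid start keeps i + block_size + 1 within len(data).
        i = rng.randint(0, len(data) - block_size - 1)
        xs.append(data[i : i + block_size])
        ys.append(data[i + 1 : i + block_size + 1])  # shifted by one token
    return xs, ys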
Notes
- Pass rng explicitly to make tests deterministic.
- decode(encode(s)) should return s exactly.
Example
text = "hello"
stoi, itos = build_vocab(text)
ids = encode(text, stoi)
assert decode(ids, itos) == text
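Putting the pieces together (seed, block_size, and batch_size chosen arbitrarily for illustration):

import random

text = "hello world"
stoi, itos = build_vocab(text)
data = encode(text, stoi)
train, val = train_val_split(data)

rng = random.Random(0)  # fixed seed keeps batches reproducible
xs, ys = get_batch(train, block_size=4, batch_size=2, rng=rng)
for x, y in zip(xs, ys):
    assert y[:-1] == x[1:]  # each target is the input shifted by one token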