Tokenization and Batching

easy · tokenization, nlp, batching

Implement a minimal character-level tokenizer and a deterministic batch sampler.

Functions to implement

1) build_vocab(text: str)

Return (stoi, itos) where:

  • stoi: dict mapping char -> int
  • itos: dict mapping int -> char

Rules:

  • Use sorted(set(text)) to define the vocabulary.
  • itos must be the exact inverse of stoi.
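
A minimal sketch following these rules (one possible implementation, not the only valid one):

def build_vocab(text: str):
    # Vocabulary = sorted set of unique characters in the text.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}  # exact inverse of stoi
    return stoi, itos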

2) encode(text: str, stoi: dict) -> list[int]

Convert a string into a list of token IDs.
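
A straightforward sketch (it assumes every character of text appears in stoi; otherwise a KeyError is raised):

def encode(text: str, stoi: dict) -> list[int]:
    # Look up each character's integer ID.
    return [stoi[ch] for ch in text]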

3) decode(ids: list[int], itos: dict) -> str

Convert a list of token IDs back into a string.
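
A matching sketch:

def decode(ids: list[int], itos: dict) -> str:
    # Map each ID back to its character and join into one string.
    return "".join(itos[i] for i in ids)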

4) train_val_split(data: list[int], split_ratio: float = 0.9)

Split encoded data into train/val lists by index:

split = int(len(data) * split_ratio)
train = data[:split]
val = data[split:]
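
Wrapped as a function, the snippet above is essentially the whole implementation:

def train_val_split(data: list[int], split_ratio: float = 0.9):
    # Deterministic split by index: first part is train, remainder is val.
    split = int(len(data) * split_ratio)
    return data[:split], data[split:]

For example, 10 tokens with the default ratio give split = 9: train gets the first 9 tokens and val gets the last one.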

5) get_batch(data, block_size, batch_size, rng)

Return (xs, ys) where:

  • xs is a list of batch_size sequences, each of length block_size
  • ys is the corresponding list of target sequences, each shifted forward by one token

Use rng.randint to choose the start index i of each sequence. Every i must satisfy:

0 <= i <= len(data) - block_size - 1

The output must satisfy:

ys[k] == xs[k][1:] + [next_token]

where next_token is the element of data that immediately follows the last token of xs[k] (i.e., data[i + block_size] for start index i).
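
A sketch assuming data is a plain Python list and rng behaves like random.Random (whose randint is inclusive of both bounds, matching the required range):

def get_batch(data, block_size, batch_size, rng):
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randint(0, len(data) - block_size - 1)  # inclusive bounds
        xs.append(data[i : i + block_size])
        ys.append(data[i + 1 : i + block_size + 1])  # same window, shifted by one
    return xs, ys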

Notes

  • Pass rng explicitly to make tests deterministic.
  • decode(encode(s, stoi), itos) should return s exactly (lossless round trip).

Example

text = "hello"
stoi, itos = build_vocab(text)
ids = encode(text, stoi)
assert decode(ids, itos) == text
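
A possible continuation of the example; the seed 0, block_size=2, and batch_size=3 are arbitrary choices for illustration:

import random

rng = random.Random(0)
train, val = train_val_split(ids)   # split = int(5 * 0.9) = 4
xs, ys = get_batch(train, block_size=2, batch_size=3, rng=rng)
assert all(y[:-1] == x[1:] for x, y in zip(xs, ys))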