Mini-GPT: Building a Character-Level Language Model
Goal
Build a complete GPT-style language model from scratch: embeddings -> transformer blocks -> logits -> autoregressive generation.
Prerequisites: all prior DL modules.
1) Problem setup
We model sequences of token IDs.
- Vocabulary size: V
- Context length: T (must be <= max_seq_len)
- Embedding dim: D
At each position t, the model predicts the next token x[t+1].
2) Model architecture (single-head attention)
Token IDs -> Token Embedding + Position Embedding
-> TransformerBlock x N
-> Final LayerNorm
-> Linear to vocab logits
Forward pass (shapes):
- input IDs: (T,)
- embeddings: (T, D)
- logits: (T, V)
Note: the pack uses single-head SelfAttention for clarity. The num_heads argument is kept for API compatibility and extension.
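To make the wiring concrete, here is a minimal sketch of the model class. It assumes Embedding, TransformerBlock, LayerNorm, and Linear components like those built in the earlier modules, where each position's embedding is a list of D scalar autograd values; the exact class names and constructor signatures in your pack may differ.

class MiniGPT:
    def __init__(self, vocab_size, max_seq_len, embed_dim, num_layers, num_heads=1):
        self.max_seq_len = max_seq_len
        self.token_emb = Embedding(vocab_size, embed_dim)    # assumed lookup: token ID -> D values
        self.pos_emb = Embedding(max_seq_len, embed_dim)     # assumed lookup: position -> D values
        self.blocks = [TransformerBlock(embed_dim, num_heads) for _ in range(num_layers)]
        self.ln_f = LayerNorm(embed_dim)
        self.head = Linear(embed_dim, vocab_size)

    def __call__(self, ids):
        # ids: list of T token IDs -> logits: (T, V) list of lists
        x = []
        for pos, tok in enumerate(ids):
            te, pe = self.token_emb(tok), self.pos_emb(pos)
            x.append([a + b for a, b in zip(te, pe)])        # token + position embedding
        for block in self.blocks:
            x = block(x)                                     # (T, D) -> (T, D)
        x = [self.ln_f(row) for row in x]                    # final LayerNorm per position
        return [self.head(row) for row in x]                 # project to vocab logits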
3) Next-token loss (cross-entropy)
Targets are the input shifted by one:
input = [x0, x1, x2, ... x_{T-1}]
target = [x1, x2, x3, ... x_T]
Loss is the mean negative log-probability of the correct class at each position.
Stable softmax:
m = max(z.data for z in logits_t)          # shift by the max logit for stability
exp = [(z - m).exp() for z in logits_t]
probs = [e / sum(exp) for e in exp]
loss = -probs[target_t].log()              # negative log-prob of the correct token
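Extending the per-position computation to the whole sequence just means averaging over the T positions. A sketch, assuming a micrograd-style Value class with exp(), log(), and the usual operator overloads:

def cross_entropy_loss(logits, targets):
    # logits: (T, V) list of lists of Values; targets: list of T token IDs
    losses = []
    for logits_t, target_t in zip(logits, targets):
        m = max(z.data for z in logits_t)            # shift by the max logit for stability
        exps = [(z - m).exp() for z in logits_t]
        prob_correct = exps[target_t] / sum(exps)    # softmax probability of the true token
        losses.append(-prob_correct.log())
    return sum(losses) / len(losses)                 # mean over positions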
4) Data pipeline (char-level)
import random

chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
def encode(s): return [char_to_idx[c] for c in s]

def get_batch(data, block_size, batch_size):
    # pick batch_size random slices of length block_size:
    # x = data[i : i+block_size], y = data[i+1 : i+block_size+1]
    xs, ys = [], []
    for _ in range(batch_size):
        i = random.randint(0, len(data) - block_size - 1)
        xs.append(data[i : i + block_size])
        ys.append(data[i + 1 : i + block_size + 1])
    return xs, ys
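A quick smoke test on a toy corpus (the corpus string here is just an example):

text = "hello world, hello mini-gpt"
data = encode(text)
x, y = get_batch(data, block_size=8, batch_size=4)
print(len(x), len(x[0]))         # 4 sequences of 8 token IDs each
print(x[0][1:] == y[0][:-1])     # True: each target is its input shifted by one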
5) Training loop
for step in range(steps):
    x, y = sample_batch(...)                 # e.g. get_batch(data, block_size, batch_size)
    logits = model(x)                        # forward pass -> (T, V) logits
    loss = cross_entropy_loss(logits, y)     # next-token cross-entropy
    for p in model.parameters():
        p.grad = 0.0                         # zero gradients before backprop
    loss.backward()
    for p in model.parameters():
        p.data -= lr * p.grad                # plain SGD step
6) Autoregressive generation
At each step:
- Crop the context to the last max_seq_len tokens
- Get logits for the last position
- Apply temperature and softmax
- Sample a token and append
def generate(model, start_ids, max_new_tokens, temperature=1.0):
    tokens = list(start_ids)
    for _ in range(max_new_tokens):
        context = tokens[-model.max_seq_len:]       # crop to the context window
        logits = model(context)
        last = logits[-1]                           # distribution over the next token
        scaled = [z / temperature for z in last]    # <1 sharpens, >1 flattens
        probs = softmax(scaled)
        next_id = random.choices(range(len(probs)), weights=[p.data for p in probs])[0]
        tokens.append(next_id)
    return tokens
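A typical call, reusing encode and idx_to_char from section 4 (the prompt and sampling settings are arbitrary):

start_ids = encode("the ")
out = generate(model, start_ids, max_new_tokens=200, temperature=0.8)
print("".join(idx_to_char[i] for i in out))

Temperatures below 1 sharpen the distribution toward the most likely characters; temperatures above 1 flatten it and produce noisier text.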
7) Debugging checklist
- Loss decreases on a tiny dataset (overfit a few characters)
- len(logits) == T and len(logits[0]) == V
- Causal mask blocks future positions
- Gradients are non-zero and finite
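The shape and gradient checks can be automated in a few lines. A sketch, assuming a vocab_size variable equal to V and the model, data, get_batch, and cross_entropy_loss names used above:

import math

x, y = get_batch(data, block_size=8, batch_size=1)
logits = model(x[0])
assert len(logits) == len(x[0]) and len(logits[0]) == vocab_size   # (T, V) shape check

loss = cross_entropy_loss(logits, y[0])
loss.backward()
grads = [p.grad for p in model.parameters()]
assert all(math.isfinite(g) for g in grads)        # no NaN/inf gradients
assert any(g != 0.0 for g in grads)                # at least some signal flows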
Key takeaways
- GPT is just stacked transformer blocks trained on next-token prediction.
- Cross-entropy over the (T, V) logits is the training objective.
- Generation is greedy or stochastic sampling from the final position.
- A tiny model is enough to understand the full pipeline.
Congratulations, you built a complete GPT pipeline from scratch.