Mini-GPT: End-to-End Language Model
Lesson, slides, and applied problem sets.
Goal
Assemble embeddings, multi-head attention, and transformer blocks into a GPT-style language model with training and generation.
1) Architecture
Token IDs
-> Token Embedding + Position Embedding
-> TransformerBlock x N
-> Final LayerNorm
-> Linear -> vocab logits
Shapes:
- input IDs: (T,)
- embeddings: (T, D)
- logits: (T, V)
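A minimal sketch of this architecture in PyTorch. It assumes a TransformerBlock(d_model, n_heads) module from the earlier attention lesson, and the hyperparameter names (vocab_size, max_seq_len, d_model, n_heads, n_layers) are illustrative, not fixed by the lesson:

import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    # TransformerBlock is assumed to come from the earlier lesson.
    def __init__(self, vocab_size, max_seq_len, d_model, n_heads, n_layers):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # token IDs -> vectors
        self.pos_emb = nn.Embedding(max_seq_len, d_model)   # positions -> vectors
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)                   # final LayerNorm
        self.head = nn.Linear(d_model, vocab_size)          # project to vocab logits

    def forward(self, ids):                                 # ids: (T,)
        T = ids.shape[0]
        pos = torch.arange(T, device=ids.device)
        h = self.tok_emb(ids) + self.pos_emb(pos)           # (T, D)
        for block in self.blocks:
            h = block(h)                                    # (T, D)
        return self.head(self.ln_f(h))                      # (T, V)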
2) Next-token objective
Targets are the input shifted by one:
input:   [x0, x1, x2, ..., x_{T-1}]
targets: [x1, x2, x3, ..., x_T]
Use cross-entropy averaged over positions.
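A short sketch of building the shifted pair and the loss, assuming a 1-D tensor `tokens` of length T+1 and the MiniGPT model above:

import torch.nn.functional as F

x = tokens[:-1]                    # input:   [x0, ..., x_{T-1}], shape (T,)
y = tokens[1:]                     # targets: [x1, ..., x_T],     shape (T,)

logits = model(x)                  # (T, V)
loss = F.cross_entropy(logits, y)  # mean cross-entropy over the T positions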
3) Training loop
logits = model(x)                      # (T, V)
loss = cross_entropy(logits, y)        # next-token loss

for p in model.parameters():           # clear gradients from the previous step
    p.grad = None

loss.backward()                        # backprop through the whole model

for p in model.parameters():           # plain SGD update
    p.data -= lr * p.grad
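In practice the manual zeroing and update are usually delegated to an optimizer. A sketch of the same step with torch.optim (AdamW is a common choice for transformers; plain SGD also works at this scale):

import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

logits = model(x)
loss = F.cross_entropy(logits, y)
optimizer.zero_grad()                  # clear old gradients
loss.backward()                        # backprop
optimizer.step()                       # parameter update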
4) Generation
Autoregressive sampling (sketched in the code below):
- Crop the context to max_seq_len
- Take the logits for the last position
- Apply temperature and softmax
- Sample the next token
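A minimal sampling loop under these steps. The `generate` helper is a hypothetical name, and it assumes the model returns (T, V) logits for a 1-D tensor of token IDs as above:

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, ids, n_new_tokens, max_seq_len, temperature=1.0):
    for _ in range(n_new_tokens):
        ctx = ids[-max_seq_len:]                           # crop context
        logits = model(ctx)[-1]                            # last position, (V,)
        probs = F.softmax(logits / temperature, dim=-1)    # temperature + softmax
        next_id = torch.multinomial(probs, num_samples=1)  # sample next token
        ids = torch.cat([ids, next_id])
    return ids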
5) Debugging checklist
- Loss decreases on a tiny dataset
- Logits have shape (T, V)
- Causal masking blocks future tokens
- Gradients are finite
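A quick sanity check covering most of this list, assuming the tiny (x, y) pair, model, optimizer, and vocab_size from the sketches above: overfit a single example and assert on shapes and gradients.

import torch
import torch.nn.functional as F

for step in range(200):                # loss should drop toward zero on one example
    logits = model(x)
    assert logits.shape == (x.shape[0], vocab_size)        # (T, V)
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    assert all(                                            # gradients are finite
        torch.isfinite(p.grad).all()
        for p in model.parameters() if p.grad is not None
    )
    optimizer.step()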
Key takeaways
- GPT is stacked transformer blocks trained on next-token prediction.
- The full pipeline is just embeddings + attention + MLP + loss.
- A tiny model is enough to understand the whole system.