Mini-GPT: End-to-End Language Model

Lesson, slides, and applied problem sets.

Lesson

Goal

Assemble embeddings, multi-head attention, and transformer blocks into a GPT-style language model with training and generation.


1) Architecture

Token IDs
  -> Token Embedding + Position Embedding
  -> TransformerBlock x N
  -> Final LayerNorm
  -> Linear -> vocab logits

Shapes:

  • input IDs: (T,)
  • embeddings: (T, D)
  • logits: (T, V)
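
A minimal sketch of this stack in PyTorch, assuming a TransformerBlock(d_model, n_heads) module is already defined (its exact constructor signature is an assumption here):

import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        # TransformerBlock is assumed from the earlier lessons
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                                  # ids: (T,)
        T = ids.shape[0]
        pos = torch.arange(T, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)            # (T, D)
        for block in self.blocks:
            x = block(x)                                     # (T, D)
        return self.head(self.ln_f(x))                       # (T, V)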

2) Next-token objective

Targets are the input shifted by one:

input:   [x0, x1, x2, ..., x_{T-1}]
targets: [x1, x2, x3, ..., x_T]

Use cross-entropy averaged over positions.
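
A short sketch of the shift and the loss, assuming a model like the MiniGPT sketch above:

import torch
import torch.nn.functional as F

tokens = torch.tensor([5, 2, 7, 7, 1, 9])  # toy sequence of T+1 token IDs
x = tokens[:-1]                            # inputs:  [x0, ..., x_{T-1}], shape (T,)
y = tokens[1:]                             # targets: [x1, ..., x_T],     shape (T,)
loss = F.cross_entropy(model(x), y)        # mean cross-entropy over the T positions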


3) Training loop

import torch.nn.functional as F

logits = model(x)                    # (T, V)
loss = F.cross_entropy(logits, y)    # y: (T,)

for p in model.parameters():         # clear gradients from the previous step
    p.grad = None
loss.backward()

for p in model.parameters():         # plain SGD update (p.data bypasses autograd)
    p.data -= lr * p.grad
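
In practice the manual SGD update above is usually replaced by an optimizer such as torch.optim.AdamW (zero_grad(), backward(), step()); the hand-written version just makes the mechanics explicit.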

4) Generation

Autoregressive sampling (a code sketch follows the list):

  1. Crop context to max_seq_len
  2. Get logits for last position
  3. Apply temperature and softmax
  4. Sample next token
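
A minimal sampling loop along these lines; the generate name and max_new_tokens argument are just for illustration (greedy decoding would take argmax instead of sampling):

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, ids, max_new_tokens, max_seq_len, temperature=1.0):
    # ids: 1-D LongTensor of token IDs, shape (T,)
    for _ in range(max_new_tokens):
        ctx = ids[-max_seq_len:]                 # 1. crop context to max_seq_len
        logits = model(ctx)                      # (T, V)
        logits = logits[-1] / temperature        # 2.-3. last position, temperature-scaled
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1)    # 4. sample the next token
        ids = torch.cat([ids, next_id])
    return ids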

5) Debugging checklist

  • Loss decreases on a tiny dataset
  • Shapes: (T, V) logits
  • Causal masking blocks future tokens
  • Gradients are finite (the last two checks are sketched in code below)
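
A sketch of these checks, assuming model, x, and loss from the training loop above and that vocab_size is the model's vocabulary size:

import torch

assert torch.isfinite(loss), "loss is NaN/inf"
for p in model.parameters():
    assert p.grad is None or torch.isfinite(p.grad).all(), "non-finite gradient"

# Causal check: changing the last token must not change earlier positions' logits.
model.eval()                             # make the forward pass deterministic
with torch.no_grad():
    a = model(x)
    x2 = x.clone()
    x2[-1] = (x2[-1] + 1) % vocab_size   # vocab_size: assumed model hyperparameter
    b = model(x2)
    assert torch.allclose(a[:-1], b[:-1]), "causal mask leaks future tokens"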

Key takeaways

  1. GPT is stacked transformer blocks trained on next-token prediction.
  2. The full pipeline is just embeddings + attention + MLP + loss.
  3. A tiny model is enough to understand the whole system.
