Mini-GPT: End-to-End Language Model

Lesson, slides, and applied problem sets.

Lesson

Goal

Assemble embeddings, multi-head attention, and transformer blocks into a GPT-style language model with training and generation.


1) Architecture

Token IDs
  -> Token Embedding + Position Embedding
  -> TransformerBlock x N
  -> Final LayerNorm
  -> Linear -> vocab logits

Shapes:

  • input IDs: (T,)
  • embeddings: (T, D)
  • logits: (T, V)
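
A minimal sketch of this stack in PyTorch, assuming a TransformerBlock(d_model, n_heads) module is already defined (its exact constructor signature is an assumption here):

import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        # TransformerBlock is assumed from the earlier lessons
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                                  # ids: (T,)
        T = ids.shape[0]
        pos = torch.arange(T, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)            # (T, D)
        for block in self.blocks:
            x = block(x)                                     # (T, D)
        return self.head(self.ln_f(x))                       # (T, V)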

2) Next-token objective

Targets are the input shifted by one:

input:   [x0, x1, x2, ..., x_{T-1}]
targets: [x1, x2, x3, ..., x_T]

Use cross-entropy averaged over positions.
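
A short sketch of the shift and the loss, assuming a model like the MiniGPT sketch above:

import torch
import torch.nn.functional as F

tokens = torch.tensor([5, 2, 7, 7, 1, 9])  # toy sequence of T+1 token IDs
x = tokens[:-1]                            # inputs:  [x0, ..., x_{T-1}], shape (T,)
y = tokens[1:]                             # targets: [x1, ..., x_T],     shape (T,)
loss = F.cross_entropy(model(x), y)        # mean cross-entropy over the T positions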


3) Training loop

import torch.nn.functional as F

logits = model(x)                    # (T, V)
loss = F.cross_entropy(logits, y)    # y: (T,)

for p in model.parameters():         # clear gradients from the previous step
    p.grad = None
loss.backward()

for p in model.parameters():         # plain SGD update (p.data bypasses autograd)
    p.data -= lr * p.grad
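
In practice the manual SGD update above is usually replaced by an optimizer such as torch.optim.AdamW (zero_grad(), backward(), step()); the hand-written version just makes the mechanics explicit.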

4) Generation

Autoregressive sampling (a code sketch follows the list):

  1. Crop context to max_seq_len
  2. Get logits for last position
  3. Apply temperature and softmax
  4. Sample next token
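
A minimal sampling loop along these lines; the generate name and max_new_tokens argument are just for illustration (greedy decoding would take argmax instead of sampling):

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, ids, max_new_tokens, max_seq_len, temperature=1.0):
    # ids: 1-D LongTensor of token IDs, shape (T,)
    for _ in range(max_new_tokens):
        ctx = ids[-max_seq_len:]                 # 1. crop context to max_seq_len
        logits = model(ctx)                      # (T, V)
        logits = logits[-1] / temperature        # 2.-3. last position, temperature-scaled
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1)    # 4. sample the next token
        ids = torch.cat([ids, next_id])
    return ids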

5) Debugging checklist

  • Loss decreases on a tiny dataset
  • Shapes: (T, V) logits
  • Causal masking blocks future tokens
  • Gradients are finite (the last two checks are sketched in code below)
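
A sketch of these checks, assuming model, x, and loss from the training loop above and that vocab_size is the model's vocabulary size:

import torch

assert torch.isfinite(loss), "loss is NaN/inf"
for p in model.parameters():
    assert p.grad is None or torch.isfinite(p.grad).all(), "non-finite gradient"

# Causal check: changing the last token must not change earlier positions' logits.
model.eval()                             # make the forward pass deterministic
with torch.no_grad():
    a = model(x)
    x2 = x.clone()
    x2[-1] = (x2[-1] + 1) % vocab_size   # vocab_size: assumed model hyperparameter
    b = model(x2)
    assert torch.allclose(a[:-1], b[:-1]), "causal mask leaks future tokens"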

Key takeaways

  1. GPT is stacked transformer blocks trained on next-token prediction.
  2. The full pipeline is just embeddings + attention + MLP + loss.
  3. A tiny model is enough to understand the whole system.
