Mini-GPT: Building a Character-Level Language Model

Lesson, slides, and applied problem sets.


Lesson

Mini-GPT: Character-Level Language Model

Goal

Build a complete GPT-style language model from scratch: embeddings -> transformer blocks -> logits -> autoregressive generation.

Prerequisites: all prior DL modules.


1) Problem setup

We model sequences of token IDs.

  • Vocabulary size: V
  • Context length: T (must be <= max_seq_len)
  • Embedding dim: D

At each position t, the model predicts the next token x[t+1].
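
For a concrete, purely illustrative example (the corpus string and context length below are not from the pack):

text = "hello world"   # 8 unique characters, so V = 8
# With context length T = 4 and input "hell", the targets are "ello":
# position 0 predicts 'e', position 1 predicts 'l',
# position 2 predicts 'l', and position 3 predicts 'o'.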


2) Model architecture (single-head attention)

Token IDs -> Token Embedding + Position Embedding
          -> TransformerBlock x N
          -> Final LayerNorm
          -> Linear to vocab logits

Forward pass (shapes):

  • input IDs: (T,)
  • embeddings: (T, D)
  • logits: (T, V)

Note: the pack uses single-head SelfAttention for clarity. The num_heads argument is kept for API compatibility and extension.
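
As a sketch of how these pieces might compose into a forward pass: the block names (Embedding, TransformerBlock, LayerNorm, Linear) follow the diagram above, but the constructor signatures and the callable convention are assumptions, not necessarily the pack's exact API.

class MiniGPT:
    def __init__(self, vocab_size, d_model, n_layers, max_seq_len):
        self.max_seq_len = max_seq_len
        self.tok_emb = Embedding(vocab_size, d_model)     # token ID  -> (D,) vector
        self.pos_emb = Embedding(max_seq_len, d_model)    # position  -> (D,) vector
        self.blocks = [TransformerBlock(d_model) for _ in range(n_layers)]
        self.ln_f = LayerNorm(d_model)                    # final LayerNorm
        self.head = Linear(d_model, vocab_size)           # (D,) -> (V,) logits

    def __call__(self, ids):
        # ids: list of T token IDs; returns a (T, V) list of logit rows
        x = []
        for t, tok in enumerate(ids):
            e_tok, e_pos = self.tok_emb(tok), self.pos_emb(t)
            x.append([a + b for a, b in zip(e_tok, e_pos)])   # token + position embedding
        for block in self.blocks:
            x = block(x)                                      # causal self-attention + MLP
        x = [self.ln_f(row) for row in x]                     # final LayerNorm
        return [self.head(row) for row in x]                  # (T, V) logits

    def parameters(self):
        mods = [self.tok_emb, self.pos_emb, *self.blocks, self.ln_f, self.head]
        return [p for m in mods for p in m.parameters()]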


3) Next-token loss (cross-entropy)

Targets are the input shifted by one:

input  = [x0, x1, x2, ... x_{T-1}]
target = [x1, x2, x3, ... x_T]

Loss is the mean negative log-probability of the correct class at each position.

Numerically stable softmax at a single position t:

# logits_t: the V logits at position t; subtracting the max keeps exp() from overflowing
m = max(logit.data for logit in logits_t)
exp = [(z - m).exp() for z in logits_t]
probs = [e / sum(exp) for e in exp]
loss_t = -probs[target_t].log()     # negative log-probability of the true next token
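
Putting this together over all positions, a minimal cross_entropy_loss could look like the sketch below. It assumes micrograd-style scalar autograd values (supporting .data, .exp(), .log(), and arithmetic with plain numbers); the exact signature used in the pack may differ.

def cross_entropy_loss(logits, targets):
    # logits: (T, V) list of lists of autograd values; targets: list of T token IDs
    total = None
    for logits_t, target_t in zip(logits, targets):
        m = max(z.data for z in logits_t)           # stability shift
        exp = [(z - m).exp() for z in logits_t]
        denom = sum(exp)
        loss_t = -(exp[target_t] / denom).log()     # NLL of the correct class at this position
        total = loss_t if total is None else total + loss_t
    return total / len(targets)                     # mean over the T positions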

4) Data pipeline (char-level)

import random

chars = sorted(set(text))                              # unique characters = vocabulary
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

def encode(s): return [char_to_idx[c] for c in s]

def get_batch(data, block_size, batch_size):
    # pick random slices of length block_size; targets are the same slices shifted by one
    xs, ys = [], []
    for _ in range(batch_size):
        i = random.randint(0, len(data) - block_size - 1)
        xs.append(data[i : i + block_size])            # x: data[i : i+block_size]
        ys.append(data[i + 1 : i + block_size + 1])    # y: data[i+1 : i+block_size+1]
    return xs, ys
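
A quick usage sketch (the block_size and batch_size values here are arbitrary):

data = encode(text)                                     # whole corpus as token IDs
xb, yb = get_batch(data, block_size=8, batch_size=4)    # 4 input/target pairs of length 8
print(''.join(idx_to_char[i] for i in xb[0]))           # decode one input slice back to text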

5) Training loop

for step in range(steps):
    x, y = sample_batch(...)                 # random input IDs and targets shifted by one
    logits = model(x)                        # (T, V)
    loss = cross_entropy_loss(logits, y)     # mean NLL over positions

    for p in model.parameters():             # zero gradients left over from the previous step
        p.grad = 0.0
    loss.backward()                          # backpropagate through the whole graph

    for p in model.parameters():             # vanilla SGD update
        p.data -= lr * p.grad

6) Autoregressive generation

At each step:

  1. Crop context to max_seq_len
  2. Get logits for the last position
  3. Apply temperature and softmax
  4. Sample a token and append

import random

def generate(model, start_ids, max_new_tokens, temperature=1.0):
    tokens = list(start_ids)
    for _ in range(max_new_tokens):
        context = tokens[-model.max_seq_len:]          # 1. crop context to max_seq_len
        logits = model(context)                        #    (T, V) logit rows
        last = logits[-1]                              # 2. logits for the last position
        scaled = [z / temperature for z in last]       # 3. temperature scaling
        probs = softmax(scaled)                        #    softmax over the scaled logits
        next_id = random.choices(range(len(probs)),    # 4. sample one token ID ...
                                 weights=[p.data for p in probs])[0]
        tokens.append(next_id)                         #    ... and append it to the context
    return tokens
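
A possible way to call it, reusing encode and idx_to_char from the data pipeline (the prompt character, token budget, and temperature are arbitrary; the prompt must appear in the training text):

prompt_ids = encode("T")                                    # seed the context
out_ids = generate(model, prompt_ids, max_new_tokens=200, temperature=0.8)
print(''.join(idx_to_char[i] for i in out_ids))             # decode IDs back to text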

7) Debugging checklist

  • Loss decreases on a tiny dataset (overfit a few characters)
  • len(logits) == T and len(logits[0]) == V
  • Causal mask blocks future positions (a quick check is sketched after this list)
  • Gradients are non-zero and finite
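
A minimal causality check, assuming the interface used above (token IDs in, a (T, V) list of logit rows of autograd values out; the IDs below are arbitrary valid vocabulary indices):

ids_a = [0, 1, 2, 3]
ids_b = [0, 1, 2, 0]                            # differs from ids_a only at the last position
la, lb = model(ids_a), model(ids_b)
for t in range(len(ids_a) - 1):                 # every position before the change
    assert all(abs(a.data - b.data) < 1e-6 for a, b in zip(la[t], lb[t])), \
        "causal mask is leaking future information"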

Key takeaways

  1. GPT is just stacked transformer blocks trained on next-token prediction.
  2. Cross-entropy over (T, V) logits is the training objective.
  3. Generation is greedy or stochastic sampling from the final position.
  4. A tiny model is enough to understand the full pipeline.

Congratulations, you built a complete GPT pipeline from scratch.

