Bigram Language Model

medium · language-models, softmax, training

Build a trainable bigram language model. This is the smallest end-to-end LM: it predicts the next token using only the current token.

Model

A bigram LM stores logits for each token pair:

  • Parameters: W of shape (V, V)
  • Given token t, logits are W[t]

You can implement this with a single embedding table where embedding_dim = vocab_size.
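
For intuition, here is a tiny illustration with plain floats and a made-up vocabulary of V = 3 tokens; the numbers are arbitrary and only show how a row lookup produces next-token logits.

# Hypothetical 3x3 logit table (V = 3); values are arbitrary.
W = [
    [ 0.2, -1.0,  0.5],   # logits for the token after token 0
    [ 1.3,  0.0, -0.7],   # logits for the token after token 1
    [-0.4,  0.9,  0.1],   # logits for the token after token 2
]
t = 1
logits = W[t]             # [1.3, 0.0, -0.7]: unnormalized scores for the next token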

Tasks

1) softmax(scores)

  • Accepts a list of Value objects and returns a list of probabilities (also Value objects)
  • Must be numerically stable (subtract max)
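
A minimal sketch, assuming a micrograd-style Value that supports arithmetic with plain floats, right-addition (so sum() works), and provides .data and .exp():

def softmax(scores):
    # Subtract the max of the raw data (a constant float) before exponentiating;
    # this does not change the result but prevents overflow in exp().
    m = max(s.data for s in scores)
    exps = [(s - m).exp() for s in scores]
    total = sum(exps)
    return [e / total for e in exps]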

2) cross_entropy_loss(logits, targets)

  • logits: a list of length T, where each entry is a list of V Value logits
  • targets: a list of T integer token IDs
  • Return the mean negative log-probability of the correct targets
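
One possible sketch, assuming the Value class also provides a .log() method:

def cross_entropy_loss(logits, targets):
    losses = []
    for row, t in zip(logits, targets):
        probs = softmax(row)               # distribution over the vocabulary
        losses.append(-probs[t].log())     # negative log-likelihood of the target
    return sum(losses) / len(losses)       # mean over the T positions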

3) BigramLM(Module)

class BigramLM(Module):
    def __init__(self, vocab_size: int):
        pass

    def forward(self, token_ids: List[int]) -> List[List[Value]]:
        pass
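
A rough reference shape for the class, assuming the micrograd-style Value/Module API (parameters() returned as a flat list) and small random initialization; treat it as a sketch, not the required implementation.

import random
from typing import List

class BigramLM(Module):
    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size
        # W[i][j] is the logit for "token j follows token i".
        self.W = [[Value(random.uniform(-0.1, 0.1)) for _ in range(vocab_size)]
                  for _ in range(vocab_size)]

    def parameters(self):
        # Flatten the V x V table into a single list of trainable Values.
        return [p for row in self.W for p in row]

    def forward(self, token_ids: List[int]) -> List[List[Value]]:
        # The logits at each position are just the row selected by that token.
        return [self.W[t] for t in token_ids]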

4) train_step(model, x, y, lr)

  • Forward -> loss -> backward -> SGD update
  • Return loss.data
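
A sketch of the step, assuming Module exposes zero_grad() and parameters() and that each Value carries .data and .grad:

def train_step(model, x, y, lr):
    logits = model.forward(x)              # forward pass over the sequence
    loss = cross_entropy_loss(logits, y)
    model.zero_grad()                      # clear stale gradients
    loss.backward()                        # backprop through softmax and the row lookup
    for p in model.parameters():
        p.data -= lr * p.grad              # plain SGD update
    return loss.data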

5) generate(model, start_ids, max_new_tokens, temperature=1.0)

  • Autoregressively sample next tokens
  • Use the logits from the last position; divide them by temperature before applying softmax
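
One way to write it, assuming the functions sketched above; random.choices does the categorical sampling:

import random

def generate(model, start_ids, max_new_tokens, temperature=1.0):
    ids = list(start_ids)                      # keep the prompt in the output
    for _ in range(max_new_tokens):
        logits = model.forward(ids)[-1]        # logits at the last position
        scaled = [l / temperature for l in logits]
        probs = [p.data for p in softmax(scaled)]
        next_id = random.choices(range(len(probs)), weights=probs)[0]
        ids.append(next_id)
    return ids

Since a bigram model conditions only on the current token, forwarding just [ids[-1]] each step would be equivalent and cheaper.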

Notes

  • Use Value operations so gradients flow to parameters.
  • generate should return a list of token IDs including the prompt.