Mini-GPT: Character-Level Language Model
Build a complete GPT-style model from scratch using your autograd and module system. This is the capstone for the pack.
What you are building
1) Embedding(Module)
Lookup table for token or position embeddings.
```python
class Embedding(Module):
    def __init__(self, num_embeddings: int, embedding_dim: int):
        pass

    def forward(self, indices: List[int]) -> List[List[Value]]:
        pass
```
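One possible sketch, assuming the `Value` and `Module` classes you built in the earlier autograd and module exercises; the import paths below are placeholders for wherever yours live, and `parameters()` is assumed to be the way your `Module` exposes its learnable values.

```python
import random
from typing import List

from engine import Value   # placeholder import: your autograd Value class
from nn import Module      # placeholder import: your Module base class

class Embedding(Module):
    """Lookup table mapping an integer index to a learnable vector of length embedding_dim."""
    def __init__(self, num_embeddings: int, embedding_dim: int):
        self.weight = [[Value(random.gauss(0.0, 0.02)) for _ in range(embedding_dim)]
                       for _ in range(num_embeddings)]

    def forward(self, indices: List[int]) -> List[List[Value]]:
        # Plain row lookup; gradients flow only into the rows that were selected.
        return [self.weight[i] for i in indices]

    def parameters(self) -> List[Value]:
        return [p for row in self.weight for p in row]
```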
2) LayerNorm(Module)
Normalize a single vector of length D.
```python
class LayerNorm(Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        pass

    def forward(self, x: List[Value]) -> List[Value]:
        pass
```
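A minimal sketch, assuming `Value` supports `+`, `-`, `*`, and `**` with plain Python numbers (including the integer 0 that `sum()` starts from), so the mean, variance, and inverse square root all stay on the autograd graph.

```python
class LayerNorm(Module):
    """Normalize a length-dim vector to zero mean and unit variance, then scale and shift."""
    def __init__(self, dim: int, eps: float = 1e-5):
        self.eps = eps
        self.gamma = [Value(1.0) for _ in range(dim)]  # learnable scale
        self.beta = [Value(0.0) for _ in range(dim)]   # learnable shift

    def forward(self, x: List[Value]) -> List[Value]:
        n = len(x)
        mean = sum(x) / n
        var = sum((xi - mean) ** 2 for xi in x) / n
        inv_std = (var + self.eps) ** -0.5
        return [self.gamma[i] * (x[i] - mean) * inv_std + self.beta[i] for i in range(n)]

    def parameters(self) -> List[Value]:
        return self.gamma + self.beta
```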
3) FeedForward(Module)
Two linear layers with activation in between.
```python
class FeedForward(Module):
    def __init__(self, embed_dim: int, hidden_dim: int | None = None):
        pass

    def forward(self, x: List[Value]) -> List[Value]:
        pass
```
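A sketch that builds the two linear maps from raw weight lists (rather than assuming a particular `Linear` module API) and assumes `Value` has a `relu()` method for the activation. The hidden width defaults to `4 * embed_dim`, the conventional transformer choice.

```python
class FeedForward(Module):
    """Position-wise MLP: linear -> ReLU -> linear."""
    def __init__(self, embed_dim: int, hidden_dim: int | None = None):
        hidden_dim = hidden_dim or 4 * embed_dim
        def mat(rows, cols):
            return [[Value(random.gauss(0.0, 0.02)) for _ in range(cols)] for _ in range(rows)]
        self.w1, self.b1 = mat(hidden_dim, embed_dim), [Value(0.0) for _ in range(hidden_dim)]
        self.w2, self.b2 = mat(embed_dim, hidden_dim), [Value(0.0) for _ in range(embed_dim)]

    def forward(self, x: List[Value]) -> List[Value]:
        h = [(sum(w * xi for w, xi in zip(row, x)) + b).relu()  # assumes Value.relu()
             for row, b in zip(self.w1, self.b1)]
        return [sum(w * hi for w, hi in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]

    def parameters(self) -> List[Value]:
        return ([p for row in self.w1 for p in row] + self.b1 +
                [p for row in self.w2 for p in row] + self.b2)
```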
4) SelfAttention(Module)
Single-head self-attention.
```python
class SelfAttention(Module):
    def __init__(self, embed_dim: int):
        pass

    def forward(self, x: List[List[Value]], causal: bool = True) -> List[List[Value]]:
        pass
```
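A single-head sketch, assuming `Value` has an `exp()` method and supports division. The softmax here is the naive unshifted form, which is fine at toy scale but can overflow if logits grow large.

```python
class SelfAttention(Module):
    """Single-head scaled dot-product self-attention over a (T, D) sequence."""
    def __init__(self, embed_dim: int):
        self.embed_dim = embed_dim
        def mat():
            return [[Value(random.gauss(0.0, 0.02)) for _ in range(embed_dim)]
                    for _ in range(embed_dim)]
        self.wq, self.wk, self.wv = mat(), mat(), mat()

    def _project(self, x, w):
        # Multiply each (D,) row of x by the (D, D) weight matrix.
        return [[sum(wij * xj for wij, xj in zip(w_row, vec)) for w_row in w] for vec in x]

    def forward(self, x: List[List[Value]], causal: bool = True) -> List[List[Value]]:
        T, D = len(x), self.embed_dim
        q, k, v = (self._project(x, w) for w in (self.wq, self.wk, self.wv))
        scale = D ** -0.5
        out = []
        for i in range(T):
            limit = i + 1 if causal else T          # causal: attend only to positions <= i
            scores = [sum(qd * kd for qd, kd in zip(q[i], k[j])) * scale for j in range(limit)]
            exps = [s.exp() for s in scores]        # assumes Value.exp()
            total = sum(exps)
            weights = [e / total for e in exps]
            out.append([sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(D)])
        return out

    def parameters(self) -> List[Value]:
        return [p for w in (self.wq, self.wk, self.wv) for row in w for p in row]
```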
5) TransformerBlock(Module)
Pre-norm block with residuals:
```python
class TransformerBlock(Module):
    def __init__(self, embed_dim: int, num_heads: int):
        # num_heads is included for API compatibility
        pass

    def forward(self, x: List[List[Value]]) -> List[List[Value]]:
        pass
```
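A sketch of the pre-norm wiring, reusing `LayerNorm`, `SelfAttention`, and `FeedForward` from above; it relies on `Module` instances being callable (the example at the end, `model(x)`, implies `__call__` dispatches to `forward`).

```python
class TransformerBlock(Module):
    """Pre-norm block: x = x + attn(ln1(x)); x = x + ffn(ln2(x))."""
    def __init__(self, embed_dim: int, num_heads: int):
        # num_heads is accepted for API compatibility; this pack is single-head.
        self.ln1 = LayerNorm(embed_dim)
        self.attn = SelfAttention(embed_dim)
        self.ln2 = LayerNorm(embed_dim)
        self.ffn = FeedForward(embed_dim)

    def forward(self, x: List[List[Value]]) -> List[List[Value]]:
        # Attention sub-layer with residual connection.
        attended = self.attn([self.ln1(xi) for xi in x])
        x = [[a + b for a, b in zip(xi, ai)] for xi, ai in zip(x, attended)]
        # Feed-forward sub-layer with residual connection, applied per position.
        return [[a + b for a, b in zip(xi, self.ffn(self.ln2(xi)))] for xi in x]

    def parameters(self) -> List[Value]:
        return (self.ln1.parameters() + self.attn.parameters() +
                self.ln2.parameters() + self.ffn.parameters())
```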
6) MiniGPT(Module)
Complete model: embeddings -> blocks -> final norm -> output projection.
```python
class MiniGPT(Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_seq_len):
        pass

    def forward(self, token_ids: List[int]) -> List[List[Value]]:
        # returns logits of shape (T, vocab_size)
        pass
```
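A sketch of the full stack: token plus learned position embeddings, the transformer blocks, a final LayerNorm, and a raw (V, D) output projection matrix.

```python
class MiniGPT(Module):
    """Embeddings -> transformer blocks -> final LayerNorm -> logits over the vocabulary."""
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_seq_len):
        self.max_seq_len = max_seq_len
        self.tok_emb = Embedding(vocab_size, embed_dim)
        self.pos_emb = Embedding(max_seq_len, embed_dim)
        self.blocks = [TransformerBlock(embed_dim, num_heads) for _ in range(num_layers)]
        self.ln_f = LayerNorm(embed_dim)
        self.head = [[Value(random.gauss(0.0, 0.02)) for _ in range(embed_dim)]
                     for _ in range(vocab_size)]  # output projection, shape (V, D)

    def forward(self, token_ids: List[int]) -> List[List[Value]]:
        T = len(token_ids)
        assert T <= self.max_seq_len, "sequence longer than max_seq_len"
        tok = self.tok_emb(token_ids)
        pos = self.pos_emb(list(range(T)))
        x = [[t + p for t, p in zip(ti, pi)] for ti, pi in zip(tok, pos)]
        for block in self.blocks:
            x = block(x)
        x = [self.ln_f(xi) for xi in x]
        # Project each position's D-vector to V logits.
        return [[sum(w * xd for w, xd in zip(row, xi)) for row in self.head] for xi in x]

    def parameters(self) -> List[Value]:
        params = self.tok_emb.parameters() + self.pos_emb.parameters() + self.ln_f.parameters()
        for block in self.blocks:
            params += block.parameters()
        return params + [p for row in self.head for p in row]
```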
7) Training + generation helpers
```python
def cross_entropy_loss(logits: List[List[Value]], targets: List[int]) -> Value:
    pass

def generate(model, start_ids, max_new_tokens, temperature=1.0) -> List[int]:
    pass
```
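Possible sketches, assuming `Value` exposes the usual arithmetic plus `exp()`, `log()`, negation, and a `.data` float for reading logits back out of the graph during generation (which does not need gradients).

```python
import math

def cross_entropy_loss(logits: List[List[Value]], targets: List[int]) -> Value:
    """Mean negative log-likelihood; softmax is computed internally, averaged over positions."""
    total = Value(0.0)
    for row, target in zip(logits, targets):
        exps = [l.exp() for l in row]
        prob = exps[target] / sum(exps)
        total = total + (-prob.log())
    return total / len(targets)

def generate(model, start_ids, max_new_tokens, temperature=1.0) -> List[int]:
    """Autoregressive sampling: crop the context, scale logits by temperature, sample."""
    ids = list(start_ids)
    for _ in range(max_new_tokens):
        context = ids[-model.max_seq_len:]        # never feed more than max_seq_len tokens
        last_logits = model(context)[-1]          # only the last position predicts the next token
        scaled = [l.data / temperature for l in last_logits]
        m = max(scaled)                           # shift for a numerically stable float softmax
        exps = [math.exp(s - m) for s in scaled]
        probs = [e / sum(exps) for e in exps]
        ids.append(random.choices(range(len(probs)), weights=probs)[0])
    return ids
```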
Requirements and notes
- Shapes: token_ids is (T,), embeddings are (T, D), logits are (T, V). T must not exceed max_seq_len.
- Cross-entropy should compute softmax internally and average over positions.
- generate should crop the context to max_seq_len, apply temperature, and sample.
- This pack uses single-head attention; num_heads can be ignored or used for validation.
Example
```python
text = "hello"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
encode = lambda s: [char_to_idx[c] for c in s]

model = MiniGPT(vocab_size=len(chars), embed_dim=16, num_heads=2, num_layers=2, max_seq_len=8)
x = encode(text)
logits = model(x)
loss = cross_entropy_loss(logits, x[1:] + [x[-1]])
```
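To close the loop, a minimal training step on top of the example above, assuming the micrograd-style conventions that `loss.backward()` populates `.grad` on every parameter and that `.data` / `.grad` are plain floats.

```python
learning_rate = 0.1
for step in range(50):
    logits = model(x)
    loss = cross_entropy_loss(logits, x[1:] + [x[-1]])
    for p in model.parameters():
        p.grad = 0.0                        # reset gradients from the previous step
    loss.backward()
    for p in model.parameters():
        p.data -= learning_rate * p.grad    # plain SGD update
    if step % 10 == 0:
        print(step, loss.data)
```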