Compiler Foundations: Language, Tokens, Grammar
Lesson, slides, and applied problem sets.
View SlidesLesson
Compiler Foundations: Language, Tokens, Grammar
Why this module exists
A compiler is a pipeline that turns text into meaning and then into execution. If the foundation is shaky, everything downstream is harder. This module sets the shared vocabulary we will use in later lessons and quizzes.
We will use a tiny expression language throughout the pack. The details are small on purpose so you can focus on the ideas.
1) The compiler pipeline (big picture)
A typical pipeline looks like this:
- Lexing: text -> tokens
- Parsing: tokens -> AST (Abstract Syntax Tree)
- Semantic analysis: name resolution, type checks, rules
- Lowering / IR: AST -> simpler intermediate form
- Optimization: improve the IR
- Codegen / execution: IR -> bytecode or machine code
Not every compiler uses every stage, but the mental model stays useful.
2) Language design is a set of choices
There is no single "correct" compiler design. You choose:
- What the syntax looks like
- Which errors are recoverable
- How strict or permissive the rules are
- Which IR fits your goals (readable, fast, easy to optimize)
Our pack is open-ended on design, but specific on behavior so tests can be reliable.
3) Tokens vs lexemes
- Lexeme: the exact text (e.g.,
"foo","123","==") - Token: the category + metadata (kind, text, position)
Example source:
let x = 1 + 2;
Example tokens:
- LET (lexeme
let) - IDENT (lexeme
x) - ASSIGN (lexeme
=) - NUMBER (lexeme
1) - PLUS (lexeme
+) - NUMBER (lexeme
2) - SEMICOLON (lexeme
;)
Tokens carry positions so errors can point to a line and column.
4) Grammar and precedence
A grammar describes valid structure. A simple expression grammar might be:
expr -> term ( ("+" | "-") term )*
term -> factor ( ("*" | "/") factor )*
factor -> NUMBER | IDENT | "(" expr ")"
The grammar encodes precedence (multiplication before addition) and associativity (left to right for + and -).
5) ASTs are about meaning, not punctuation
An AST keeps the essential structure and drops syntax noise. For 1 + 2 * 3:
(+)
/ \
1 (*)
/ \
2 3
The AST will drive later stages like lowering or bytecode generation.
6) Errors should be readable
The goal is not "no errors" but good errors:
- Report the location (line, column)
- Say what was expected vs what was found
- Keep messages short and clear
Readability beats cleverness here.