Lexing: From Text to Tokens
Why this module exists
Before a parser can understand structure, we need to turn raw characters into a clean stream of tokens. A good lexer is small, predictable, and easy to debug.
1) The core loop
A lexer usually has one loop:
- Skip whitespace and comments
- Read the next token (identifier, number, operator, punctuation)
- Repeat until EOF
This is simple and reliable.
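To make the shape concrete, here is a minimal sketch of that loop in Python. The token kind names, the `#` line-comment syntax, and the single-character operator fallback are all assumptions for illustration, not a fixed design:

```python
def tokenize(src: str) -> list[tuple[str, str]]:
    """One pass over the source, emitting (kind, lexeme) pairs."""
    tokens: list[tuple[str, str]] = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():                      # skip whitespace
            i += 1
        elif ch == "#":                       # skip a line comment (assumed syntax)
            while i < len(src) and src[i] != "\n":
                i += 1
        elif ch.isdigit():                    # number literal
            start = i
            while i < len(src) and src[i].isdigit():
                i += 1
            tokens.append(("NUMBER", src[start:i]))
        elif ch.isalpha() or ch == "_":       # identifier (keywords handled in section 3)
            start = i
            while i < len(src) and (src[i].isalnum() or src[i] == "_"):
                i += 1
            tokens.append(("IDENT", src[start:i]))
        else:                                 # operator / punctuation fallback
            tokens.append(("OP", ch))
            i += 1
    return tokens
```

Running tokenize("let x = 42") yields IDENT, IDENT, OP, NUMBER tokens; promoting let to a keyword is covered in section 3.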
2) Longest match wins
For operators like = and ==, always check the two-character form first.
Examples:
- == is one token, not two = tokens
- <= is one token, not < and =
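One way to implement this, sketched below with an assumed operator set: peek at two characters, take the longer match if it exists, and fall back to one character otherwise.

```python
# Longest-match operator scanning: try the two-character form first.
# These operator sets are examples, not a complete language definition.
TWO_CHAR = {"==", "!=", "<=", ">="}
ONE_CHAR = {"=", "!", "<", ">", "+", "-"}

def scan_operator(src: str, i: int) -> tuple[str, int]:
    """Return the operator lexeme starting at i and the next index."""
    if src[i:i + 2] in TWO_CHAR:   # check the longer form before the shorter one
        return src[i:i + 2], i + 2
    if src[i] in ONE_CHAR:
        return src[i], i + 1
    raise ValueError(f"unexpected character {src[i]!r} at index {i}")

# e.g. scan_operator("a <= b", 2) -> ("<=", 4)
```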
3) Identifiers and keywords
Identifiers share a rule:
- Start with a letter or _
- Continue with letters, digits, or _
After scanning an identifier, compare its lexeme to the keyword list (let, fn, if, ...). If it matches, emit the keyword token.
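A small sketch of that two-step approach; the keyword set below contains only the examples above and would grow with a real language:

```python
KEYWORDS = {"let", "fn", "if"}

def scan_identifier(src: str, i: int) -> tuple[str, str, int]:
    """Scan an identifier starting at i; return (kind, lexeme, next index)."""
    start = i
    # Scan the full lexeme first: letters, digits, or _ after the first char.
    while i < len(src) and (src[i].isalnum() or src[i] == "_"):
        i += 1
    lexeme = src[start:i]
    # Only then decide whether the lexeme is a keyword.
    kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
    return kind, lexeme, i

# e.g. scan_identifier("let x", 0) -> ("KEYWORD", "let", 3)
```

Scanning first and classifying afterwards keeps the character loop simple and guarantees that identifiers like letter, which merely start with a keyword, are not mis-tokenized.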
4) Positions matter
Track the start position (line + column) of each token. That data makes errors useful later.
Tip: update the line and column as you advance. On a newline, increment the line and reset the column to 1.
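One way to keep that bookkeeping in a single place is a cursor that owns the position update; the class and field names below are illustrative:

```python
class Cursor:
    """Tracks index, line, and column while consuming source text."""

    def __init__(self, src: str):
        self.src = src
        self.i = 0
        self.line = 1
        self.col = 1

    def advance(self) -> str:
        """Consume one character, updating line and column as a side effect."""
        ch = self.src[self.i]
        self.i += 1
        if ch == "\n":
            self.line += 1
            self.col = 1      # reset column at the start of each new line
        else:
            self.col += 1
        return ch

# e.g. consuming "a\nb" moves through line 1 and ends at line 2, column 2
```

Record (line, col) just before reading each token, and attach that pair to the token; every later error message gets its location for free.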
5) Keep it readable
A readable lexer is easier to extend. Straightforward loops beat clever tricks.