Lexing: From Text to Tokens


Why this module exists

Before a parser can understand structure, we need to turn raw characters into a clean stream of tokens. A good lexer is small, predictable, and easy to debug.


1) The core loop

A lexer usually has one loop:

  1. Skip whitespace and comments
  2. Read the next token (identifier, number, operator, punctuation)
  3. Repeat until EOF

This is simple and reliable.
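
To make the loop concrete, here is a minimal sketch in Rust. The Lexer struct, the token names, and the # line-comment syntax are illustrative choices, not part of the lesson; only numbers and punctuation are lexed so the loop itself stays visible.

    #[derive(Debug)]
    enum Token {
        Number(String),
        Punct(char),
        Eof,
    }

    struct Lexer {
        src: Vec<char>,
        pos: usize,
    }

    impl Lexer {
        fn peek(&self) -> Option<char> {
            self.src.get(self.pos).copied()
        }

        // The whole lexer is this one loop: skip trivia, read one token, repeat.
        fn next_token(&mut self) -> Token {
            // 1. Skip whitespace and line comments ('#' is a stand-in syntax).
            loop {
                match self.peek() {
                    Some(c) if c.is_whitespace() => self.pos += 1,
                    Some('#') => {
                        while !matches!(self.peek(), None | Some('\n')) {
                            self.pos += 1;
                        }
                    }
                    _ => break,
                }
            }
            // 3. Stop at end of input.
            let Some(c) = self.peek() else { return Token::Eof };
            // 2. Read the next token (only numbers and punctuation here).
            if c.is_ascii_digit() {
                let start = self.pos;
                while matches!(self.peek(), Some(d) if d.is_ascii_digit()) {
                    self.pos += 1;
                }
                Token::Number(self.src[start..self.pos].iter().collect())
            } else {
                self.pos += 1;
                Token::Punct(c)
            }
        }
    }

    fn main() {
        let mut lx = Lexer { src: "1 + 2  # add".chars().collect(), pos: 0 };
        loop {
            let tok = lx.next_token();
            println!("{:?}", tok);
            if matches!(tok, Token::Eof) { break; }
        }
    }

Running this on "1 + 2  # add" prints Number("1"), Punct('+'), Number("2"), Eof; the comment never reaches the token stream.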


2) Longest match wins

For operators like = and ==, always check the two-character form first; this longest-match rule is sometimes called maximal munch.

Examples:

  • == is one token, not two = tokens
  • <= is one token, not < and =
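
A sketch of that rule, assuming a helper that is handed the character buffer and a cursor; the function name and the exact operator set are illustrative.

    // Try the two-character operator before falling back to one character.
    fn lex_operator(src: &[char], pos: &mut usize) -> Option<String> {
        let first = *src.get(*pos)?;
        let second = src.get(*pos + 1).copied();
        // Two-character candidates are checked first, so "==" and "<="
        // each come out as a single token.
        let two = match (first, second) {
            ('=', Some('=')) => Some("=="),
            ('!', Some('=')) => Some("!="),
            ('<', Some('=')) => Some("<="),
            ('>', Some('=')) => Some(">="),
            _ => None,
        };
        if let Some(op) = two {
            *pos += 2;
            return Some(op.to_string());
        }
        // Fall back to the one-character form.
        if "=<>!+-*/".contains(first) {
            *pos += 1;
            return Some(first.to_string());
        }
        None
    }

    fn main() {
        let src: Vec<char> = "<= == = <".chars().collect();
        let mut pos = 0;
        while pos < src.len() {
            if src[pos].is_whitespace() { pos += 1; continue; }
            println!("{:?}", lex_operator(&src, &mut pos));
        }
    }

On "<= == = <" this yields four tokens, <=, ==, =, and <, which is exactly the behavior the bullets above describe.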

3) Identifiers and keywords

Identifiers and keywords share one scanning rule:

  • Start with a letter or _
  • Continue with letters, digits, or _

After scanning an identifier, compare its lexeme to the keyword list (let, fn, if, ...). If it matches, emit the keyword token; otherwise, emit an identifier token.
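
A sketch of that two-step approach, using the lesson's example keywords; the function and token names are made up for illustration.

    #[derive(Debug)]
    enum Token {
        Keyword(String),
        Ident(String),
    }

    // The caller has already seen a letter or '_', so we scan the whole
    // lexeme first and only then decide whether it is a keyword.
    fn lex_ident(src: &[char], pos: &mut usize) -> Token {
        const KEYWORDS: [&str; 3] = ["let", "fn", "if"];
        let start = *pos;
        // Continue with letters, digits, or '_'.
        while *pos < src.len() && (src[*pos].is_alphanumeric() || src[*pos] == '_') {
            *pos += 1;
        }
        let lexeme: String = src[start..*pos].iter().collect();
        if KEYWORDS.contains(&lexeme.as_str()) {
            Token::Keyword(lexeme)
        } else {
            Token::Ident(lexeme)
        }
    }

    fn main() {
        for word in ["let", "letter", "fn", "_tmp1"] {
            let chars: Vec<char> = word.chars().collect();
            println!("{:?}", lex_ident(&chars, &mut 0));
        }
    }

Scanning the whole lexeme before the lookup is what keeps letter from being misread as the keyword let followed by ter.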


4) Positions matter

Track the start position (line + column) of each token. That data makes errors useful later.

Tip: update the line and column as you advance through the input. On a newline, increment the line and reset the column to 1.
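
One way to do that bookkeeping, with a hypothetical Cursor type; counting one column per character is a simplification (tabs and wide glyphs complicate real column counts).

    struct Cursor {
        line: u32,
        col: u32,
    }

    impl Cursor {
        // Called once per consumed character.
        fn advance(&mut self, c: char) {
            if c == '\n' {
                // Newline: the next character starts a new line at column 1.
                self.line += 1;
                self.col = 1;
            } else {
                self.col += 1;
            }
        }
    }

    fn main() {
        let mut cur = Cursor { line: 1, col: 1 };
        for c in "let x\nfn".chars() {
            println!("{:?} at {}:{}", c, cur.line, cur.col);
            cur.advance(c);
        }
    }

Record cur.line and cur.col before scanning each token so the stored position points at the token's first character, not its end.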


5) Keep it readable

A readable lexer is easier to extend. Straightforward loops beat clever tricks.

