Strings & Automata

Lesson, slides, and applied problem sets.

Lesson

Strings & Automata

Why this module exists

String matching and parsing are everywhere: search engines, compilers, log analyzers, input validators. This module covers three powerful techniques:

KMP: Find a pattern in text in O(n+m) instead of O(n*m)
Regex matching with DP: Handle . and * wildcards
Finite State Machines: Parse complex formats (numbers, dates, protocols)

These problems teach you to think about string processing systematically.

1) Naive String Search and Its Problem

The obvious approach to find pattern P in text T:

func naiveSearch(text, pattern string) int {
    n, m := len(text), len(pattern)
    for i := 0; i <= n-m; i++ {
        match := true
        for j := 0; j < m; j++ {
            if text[i+j] != pattern[j] {
                match = false
                break
            }
        }
        if match {
            return i
        }
    }
    return -1
}

Time complexity: O(n * m) in the worst case.

Worst case example:

Text: "AAAAAAAAAB"
Pattern: "AAAAB"
At each position, we compare almost the entire pattern before failing.

2) KMP: The Key Insight

When a mismatch occurs, we've already matched some characters. Can we use that information instead of starting over?

Consider:

Text:    A B A B A B C
Pattern: A B A B C
              ^ mismatch at position 4

We matched "ABAB". The pattern "ABAB" has a prefix "AB" that's also a suffix. So instead of restarting, we can continue matching from where "AB" ends!

This is the LPS (Longest Prefix Suffix) table.

3) Building the LPS Table

For each position i in the pattern, lps[i] = length of the longest proper prefix that's also a suffix of pattern[0:i+1].

Pattern: A B A B C
Index:   0 1 2 3 4
LPS:     0 0 1 2 0

Explanation:
- lps[0] = 0 (single char, no proper prefix)
- lps[1] = 0 ("AB" - no prefix equals suffix)
- lps[2] = 1 ("ABA" - "A" is both prefix and suffix)
- lps[3] = 2 ("ABAB" - "AB" is both prefix and suffix)
- lps[4] = 0 ("ABABC" - no match)

Building LPS in O(m):

func buildLPS(pattern string) []int {
    m := len(pattern)
    lps := make([]int, m)

    length := 0  // length of previous longest prefix suffix
    i := 1

    for i < m {
        if pattern[i] == pattern[length] {
            length++
            lps[i] = length
            i++
        } else if length > 0 {
            // Don't increment i, try shorter prefix
            length = lps[length-1]
        } else {
            lps[i] = 0
            i++
        }
    }

    return lps
}

4) KMP Search Algorithm

func kmpSearch(text, pattern string) int {
    if len(pattern) == 0 {
        return 0
    }

    n, m := len(text), len(pattern)
    lps := buildLPS(pattern)

    i := 0  // index in text
    j := 0  // index in pattern

    for i < n {
        if text[i] == pattern[j] {
            i++
            j++
            if j == m {
                return i - j  // Found match
            }
        } else if j > 0 {
            // Use LPS to skip
            j = lps[j-1]
        } else {
            i++
        }
    }

    return -1
}

Time complexity: O(n + m) - linear!

5) Regex engines: DP, backtracking, and NFAs

Three classic ways to run regexes:

DP (tabular): correct, predictable, and easy for limited syntax.
Backtracking: simple to implement but can be exponential.
NFA simulation: flexible and avoids catastrophic backtracking.

We start with DP for the regex-lite problem, then build a real engine capable of POSIX-style leftmost-longest matching.

6) Regex DP (regex-lite)

Implement a matcher with:

. matches any single character
* matches zero or more of the preceding element

The match must cover the entire string.

isMatch("aab", "c*a*b") -> true
isMatch("ab", ".*") -> true
isMatch("mississippi", "mis*is*p*.") -> false

Define dp[i][j] = true if text[0:i] matches pattern[0:j].

Transitions:

If pattern[j-1] is a regular char or .:

   dp[i][j] = dp[i-1][j-1] AND (text[i-1] == pattern[j-1] OR pattern[j-1] == '.')

If pattern[j-1] is *:
- Zero occurrences: dp[i][j] = dp[i][j-2]
- One or more: dp[i][j] = dp[i-1][j] AND (text[i-1] matches pattern[j-2])

func isMatch(s, p string) bool {
    m, n := len(s), len(p)
    dp := make([][]bool, m+1)
    for i := range dp {
        dp[i] = make([]bool, n+1)
    }

    dp[0][0] = true

    // Patterns like a*, a*b* can match empty string.
    for j := 2; j <= n; j++ {
        if p[j-1] == '*' {
            dp[0][j] = dp[0][j-2]
        }
    }

    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if p[j-1] == '*' {
                dp[i][j] = dp[i][j-2]
                if matches(s[i-1], p[j-2]) {
                    dp[i][j] = dp[i][j] || dp[i-1][j]
                }
            } else if matches(s[i-1], p[j-1]) {
                dp[i][j] = dp[i-1][j-1]
            }
        }
    }
    return dp[m][n]
}

Time complexity: O(m * n)

7) Backtracking engines and why they blow up

Backtracking evaluates the regex like a depth-first tree search:

Try one path, and if it fails, backtrack and try another.
Easy to implement for rich syntax (groups, alternation, etc.).

But it can be exponential. Example:

pattern: (a+)+b
text:    aaaaaaaaaaaaaaaaaa

The engine keeps re-partitioning the a's before it finally fails on b. This is called catastrophic backtracking.

8) Thompson NFA simulation (the scalable way)

Thompson's construction turns a regex into an NFA with epsilon transitions. You then simulate all active states at once:

Build an NFA from the pattern
Maintain the epsilon-closure of active states
Consume characters and advance the state set

This avoids exponential blow-ups and gives predictable performance.

9) POSIX leftmost-longest matching

POSIX regex engines must return the leftmost match, and among those, the longest match.

That means the engine should ignore alternation order. Example:

text = "aa"
pattern = "a|aa"
POSIX match -> "aa" (longest at leftmost position)

A practical approach:

For each possible start position (leftmost first),
Run the NFA and track the longest accepting end,
Return the first start position with any match.

This is the capstone: a real regex engine with POSIX-style behavior. Include bounded repetition like {m}, {m,}, and {m,n} in the capstone syntax.

10) Finite State Machines for Parsing

An FSM (or DFA) is perfect for validating structured input like numbers, emails, or dates.

States: Where you are in the parsing process Transitions: What input moves you to which state Accept states: Valid ending points

11) Example: Validating Numbers

Valid: "3.14", "-2", "2e10", "+.5", "4." Invalid: "e3", "1e", "--6", "1.2.3"

States:

START
SIGN_SEEN
INTEGER
DOT
DECIMAL
EXP
EXP_SIGN
EXP_INTEGER
INVALID

Key transitions:

START -> SIGN_SEEN (on +/-)
START -> INTEGER (on digit)
START -> DOT (on .)
INTEGER -> DOT (on .)
INTEGER -> EXP (on e/E)
DOT -> DECIMAL (on digit)
etc.

func isValidNumber(s string) bool {
    const (
        START = iota
        SIGN
        INTEGER
        DOT
        DECIMAL
        EXP
        EXP_SIGN
        EXP_INT
    )

    state := START

    for _, c := range s {
        switch state {
        case START:
            if c == '+' || c == '-' {
                state = SIGN
            } else if isDigit(c) {
                state = INTEGER
            } else if c == '.' {
                state = DOT
            } else {
                return false
            }
        case SIGN:
            if isDigit(c) {
                state = INTEGER
            } else if c == '.' {
                state = DOT
            } else {
                return false
            }
        // ... more cases
        }
    }

    // Check if we're in an accepting state
    return state == INTEGER || state == DECIMAL || state == EXP_INT
}

12) FSM Design Tips

Draw the state diagram first - paper or whiteboard
Enumerate all input types - digits, signs, dots, exponent chars, other
Identify accepting states - which states are valid endings?
Handle each transition explicitly - don't rely on "default"
Test edge cases - empty string, single character, leading/trailing special chars

13) Complexity Summary

Algorithm	Time	Space
Naive search	O(n*m)	O(1)
KMP	O(n+m)	O(m) for LPS
Regex DP	O(m*n)	O(m*n)
Thompson NFA (regex)	O(n*states)	O(states)
FSM parsing	O(n)	O(1)

Outcomes

After this module, you should be able to:

Explain why KMP is faster than naive search
Build an LPS table and use it for pattern matching
Implement regex matching with DP for . and *
Explain backtracking pitfalls and why NFAs scale better
Implement POSIX leftmost-longest regex matching on a small ERE subset
Design a finite state machine for parsing structured input
Choose the right approach for different string problems

Practice Set

KMP String Search (Medium) - Implement the full KMP algorithm
Regex Lite (Medium) - Match with . and * using DP
Regex Backtracking Pitfalls (Hard) - Avoid catastrophic backtracking
POSIX Regex Search (Hard) - Build a leftmost-longest regex engine
Valid Number (FSM) (Medium) - Parse decimal numbers with optional exponent

Start with KMP to understand prefix tables, then Regex Lite for DP, then Backtracking Pitfalls to see why naive recursion fails, then POSIX Regex for a real engine, and finally Valid Number to practice FSM design.

Lesson

Strings & Automata

Why this module exists

1) Naive String Search and Its Problem

2) KMP: The Key Insight

3) Building the LPS Table

4) KMP Search Algorithm

5) Regex engines: DP, backtracking, and NFAs

6) Regex DP (regex-lite)

7) Backtracking engines and why they blow up

8) Thompson NFA simulation (the scalable way)

9) POSIX leftmost-longest matching

10) Finite State Machines for Parsing

11) Example: Validating Numbers

12) FSM Design Tips

13) Complexity Summary

Outcomes

Practice Set

Module Items

KMP String Search

Regex Lite

Regex Backtracking Pitfalls

POSIX Regex Search

Valid Number (FSM)

Strings & Automata