Strings & Automata

Lesson, slides, and applied problem sets.

View Slides

Lesson

Strings & Automata

Why this module exists

String matching and parsing are everywhere: search engines, compilers, log analyzers, input validators. This module covers three powerful techniques:

  1. KMP: Find a pattern in text in O(n+m) instead of O(n*m)
  2. Regex matching with DP: Handle . and * wildcards
  3. Finite State Machines: Parse complex formats (numbers, dates, protocols)

These problems teach you to think about string processing systematically.


1) Naive String Search and Its Problem

The obvious approach to find pattern P in text T:

func naiveSearch(text, pattern string) int {
    n, m := len(text), len(pattern)
    for i := 0; i <= n-m; i++ {
        match := true
        for j := 0; j < m; j++ {
            if text[i+j] != pattern[j] {
                match = false
                break
            }
        }
        if match {
            return i
        }
    }
    return -1
}

Time complexity: O(n * m) in the worst case.

Worst case example:

  • Text: "AAAAAAAAAB"
  • Pattern: "AAAAB"
  • At each position, we compare almost the entire pattern before failing.

2) KMP: The Key Insight

When a mismatch occurs, we've already matched some characters. Can we use that information instead of starting over?

Consider:

Text:    A B A B A B C
Pattern: A B A B C
              ^ mismatch at position 4

We matched "ABAB". The pattern "ABAB" has a prefix "AB" that's also a suffix. So instead of restarting, we can continue matching from where "AB" ends!

This is the LPS (Longest Prefix Suffix) table.


3) Building the LPS Table

For each position i in the pattern, lps[i] = length of the longest proper prefix that's also a suffix of pattern[0:i+1].

Pattern: A B A B C
Index:   0 1 2 3 4
LPS:     0 0 1 2 0

Explanation:
- lps[0] = 0 (single char, no proper prefix)
- lps[1] = 0 ("AB" - no prefix equals suffix)
- lps[2] = 1 ("ABA" - "A" is both prefix and suffix)
- lps[3] = 2 ("ABAB" - "AB" is both prefix and suffix)
- lps[4] = 0 ("ABABC" - no match)

Building LPS in O(m):

func buildLPS(pattern string) []int {
    m := len(pattern)
    lps := make([]int, m)

    length := 0  // length of previous longest prefix suffix
    i := 1

    for i < m {
        if pattern[i] == pattern[length] {
            length++
            lps[i] = length
            i++
        } else if length > 0 {
            // Don't increment i, try shorter prefix
            length = lps[length-1]
        } else {
            lps[i] = 0
            i++
        }
    }

    return lps
}

4) KMP Search Algorithm

func kmpSearch(text, pattern string) int {
    if len(pattern) == 0 {
        return 0
    }

    n, m := len(text), len(pattern)
    lps := buildLPS(pattern)

    i := 0  // index in text
    j := 0  // index in pattern

    for i < n {
        if text[i] == pattern[j] {
            i++
            j++
            if j == m {
                return i - j  // Found match
            }
        } else if j > 0 {
            // Use LPS to skip
            j = lps[j-1]
        } else {
            i++
        }
    }

    return -1
}

Time complexity: O(n + m) - linear!


5) Regex engines: DP, backtracking, and NFAs

Three classic ways to run regexes:

  1. DP (tabular): correct, predictable, and easy for limited syntax.
  2. Backtracking: simple to implement but can be exponential.
  3. NFA simulation: flexible and avoids catastrophic backtracking.

We start with DP for the regex-lite problem, then build a real engine capable of POSIX-style leftmost-longest matching.


6) Regex DP (regex-lite)

Implement a matcher with:

  • . matches any single character
  • * matches zero or more of the preceding element

The match must cover the entire string.

isMatch("aab", "c*a*b") -> true
isMatch("ab", ".*") -> true
isMatch("mississippi", "mis*is*p*.") -> false

Define dp[i][j] = true if text[0:i] matches pattern[0:j].

Transitions:

  1. If pattern[j-1] is a regular char or .:
   dp[i][j] = dp[i-1][j-1] AND (text[i-1] == pattern[j-1] OR pattern[j-1] == '.')
  1. If pattern[j-1] is *:
    • Zero occurrences: dp[i][j] = dp[i][j-2]
    • One or more: dp[i][j] = dp[i-1][j] AND (text[i-1] matches pattern[j-2])
func isMatch(s, p string) bool {
    m, n := len(s), len(p)
    dp := make([][]bool, m+1)
    for i := range dp {
        dp[i] = make([]bool, n+1)
    }

    dp[0][0] = true

    // Patterns like a*, a*b* can match empty string.
    for j := 2; j <= n; j++ {
        if p[j-1] == '*' {
            dp[0][j] = dp[0][j-2]
        }
    }

    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if p[j-1] == '*' {
                dp[i][j] = dp[i][j-2]
                if matches(s[i-1], p[j-2]) {
                    dp[i][j] = dp[i][j] || dp[i-1][j]
                }
            } else if matches(s[i-1], p[j-1]) {
                dp[i][j] = dp[i-1][j-1]
            }
        }
    }
    return dp[m][n]
}

Time complexity: O(m * n)


7) Backtracking engines and why they blow up

Backtracking evaluates the regex like a depth-first tree search:

  • Try one path, and if it fails, backtrack and try another.
  • Easy to implement for rich syntax (groups, alternation, etc.).

But it can be exponential. Example:

pattern: (a+)+b
text:    aaaaaaaaaaaaaaaaaa

The engine keeps re-partitioning the a's before it finally fails on b. This is called catastrophic backtracking.


8) Thompson NFA simulation (the scalable way)

Thompson's construction turns a regex into an NFA with epsilon transitions. You then simulate all active states at once:

  • Build an NFA from the pattern
  • Maintain the epsilon-closure of active states
  • Consume characters and advance the state set

This avoids exponential blow-ups and gives predictable performance.


9) POSIX leftmost-longest matching

POSIX regex engines must return the leftmost match, and among those, the longest match.

That means the engine should ignore alternation order. Example:

text = "aa"
pattern = "a|aa"
POSIX match -> "aa" (longest at leftmost position)

A practical approach:

  1. For each possible start position (leftmost first),
  2. Run the NFA and track the longest accepting end,
  3. Return the first start position with any match.

This is the capstone: a real regex engine with POSIX-style behavior. Include bounded repetition like {m}, {m,}, and {m,n} in the capstone syntax.


10) Finite State Machines for Parsing

An FSM (or DFA) is perfect for validating structured input like numbers, emails, or dates.

States: Where you are in the parsing process Transitions: What input moves you to which state Accept states: Valid ending points


11) Example: Validating Numbers

Valid: "3.14", "-2", "2e10", "+.5", "4." Invalid: "e3", "1e", "--6", "1.2.3"

States:

  1. START
  2. SIGN_SEEN
  3. INTEGER
  4. DOT
  5. DECIMAL
  6. EXP
  7. EXP_SIGN
  8. EXP_INTEGER
  9. INVALID

Key transitions:

  • START -> SIGN_SEEN (on +/-)
  • START -> INTEGER (on digit)
  • START -> DOT (on .)
  • INTEGER -> DOT (on .)
  • INTEGER -> EXP (on e/E)
  • DOT -> DECIMAL (on digit)
  • etc.
func isValidNumber(s string) bool {
    const (
        START = iota
        SIGN
        INTEGER
        DOT
        DECIMAL
        EXP
        EXP_SIGN
        EXP_INT
    )

    state := START

    for _, c := range s {
        switch state {
        case START:
            if c == '+' || c == '-' {
                state = SIGN
            } else if isDigit(c) {
                state = INTEGER
            } else if c == '.' {
                state = DOT
            } else {
                return false
            }
        case SIGN:
            if isDigit(c) {
                state = INTEGER
            } else if c == '.' {
                state = DOT
            } else {
                return false
            }
        // ... more cases
        }
    }

    // Check if we're in an accepting state
    return state == INTEGER || state == DECIMAL || state == EXP_INT
}

12) FSM Design Tips

  1. Draw the state diagram first - paper or whiteboard
  2. Enumerate all input types - digits, signs, dots, exponent chars, other
  3. Identify accepting states - which states are valid endings?
  4. Handle each transition explicitly - don't rely on "default"
  5. Test edge cases - empty string, single character, leading/trailing special chars

13) Complexity Summary

AlgorithmTimeSpace
Naive searchO(n*m)O(1)
KMPO(n+m)O(m) for LPS
Regex DPO(m*n)O(m*n)
Thompson NFA (regex)O(n*states)O(states)
FSM parsingO(n)O(1)

Outcomes

After this module, you should be able to:

  • Explain why KMP is faster than naive search
  • Build an LPS table and use it for pattern matching
  • Implement regex matching with DP for . and *
  • Explain backtracking pitfalls and why NFAs scale better
  • Implement POSIX leftmost-longest regex matching on a small ERE subset
  • Design a finite state machine for parsing structured input
  • Choose the right approach for different string problems

Practice Set

  1. KMP String Search (Medium) - Implement the full KMP algorithm
  2. Regex Lite (Medium) - Match with . and * using DP
  3. Regex Backtracking Pitfalls (Hard) - Avoid catastrophic backtracking
  4. POSIX Regex Search (Hard) - Build a leftmost-longest regex engine
  5. Valid Number (FSM) (Medium) - Parse decimal numbers with optional exponent

Start with KMP to understand prefix tables, then Regex Lite for DP, then Backtracking Pitfalls to see why naive recursion fails, then POSIX Regex for a real engine, and finally Valid Number to practice FSM design.


Module Items