Strings & Automata
Lesson, slides, and applied problem sets.
View SlidesLesson
Strings & Automata
Why this module exists
String matching and parsing are everywhere: search engines, compilers, log analyzers, input validators. This module covers three powerful techniques:
- KMP: Find a pattern in text in O(n+m) instead of O(n*m)
- Regex matching with DP: Handle
.and*wildcards - Finite State Machines: Parse complex formats (numbers, dates, protocols)
These problems teach you to think about string processing systematically.
1) Naive String Search and Its Problem
The obvious approach to find pattern P in text T:
func naiveSearch(text, pattern string) int {
n, m := len(text), len(pattern)
for i := 0; i <= n-m; i++ {
match := true
for j := 0; j < m; j++ {
if text[i+j] != pattern[j] {
match = false
break
}
}
if match {
return i
}
}
return -1
}
Time complexity: O(n * m) in the worst case.
Worst case example:
- Text: "AAAAAAAAAB"
- Pattern: "AAAAB"
- At each position, we compare almost the entire pattern before failing.
2) KMP: The Key Insight
When a mismatch occurs, we've already matched some characters. Can we use that information instead of starting over?
Consider:
Text: A B A B A B C
Pattern: A B A B C
^ mismatch at position 4
We matched "ABAB". The pattern "ABAB" has a prefix "AB" that's also a suffix. So instead of restarting, we can continue matching from where "AB" ends!
This is the LPS (Longest Prefix Suffix) table.
3) Building the LPS Table
For each position i in the pattern, lps[i] = length of the longest proper prefix that's also a suffix of pattern[0:i+1].
Pattern: A B A B C
Index: 0 1 2 3 4
LPS: 0 0 1 2 0
Explanation:
- lps[0] = 0 (single char, no proper prefix)
- lps[1] = 0 ("AB" - no prefix equals suffix)
- lps[2] = 1 ("ABA" - "A" is both prefix and suffix)
- lps[3] = 2 ("ABAB" - "AB" is both prefix and suffix)
- lps[4] = 0 ("ABABC" - no match)
Building LPS in O(m):
func buildLPS(pattern string) []int {
m := len(pattern)
lps := make([]int, m)
length := 0 // length of previous longest prefix suffix
i := 1
for i < m {
if pattern[i] == pattern[length] {
length++
lps[i] = length
i++
} else if length > 0 {
// Don't increment i, try shorter prefix
length = lps[length-1]
} else {
lps[i] = 0
i++
}
}
return lps
}
4) KMP Search Algorithm
func kmpSearch(text, pattern string) int {
if len(pattern) == 0 {
return 0
}
n, m := len(text), len(pattern)
lps := buildLPS(pattern)
i := 0 // index in text
j := 0 // index in pattern
for i < n {
if text[i] == pattern[j] {
i++
j++
if j == m {
return i - j // Found match
}
} else if j > 0 {
// Use LPS to skip
j = lps[j-1]
} else {
i++
}
}
return -1
}
Time complexity: O(n + m) - linear!
5) Regex engines: DP, backtracking, and NFAs
Three classic ways to run regexes:
- DP (tabular): correct, predictable, and easy for limited syntax.
- Backtracking: simple to implement but can be exponential.
- NFA simulation: flexible and avoids catastrophic backtracking.
We start with DP for the regex-lite problem, then build a real engine capable of POSIX-style leftmost-longest matching.
6) Regex DP (regex-lite)
Implement a matcher with:
.matches any single character*matches zero or more of the preceding element
The match must cover the entire string.
isMatch("aab", "c*a*b") -> true
isMatch("ab", ".*") -> true
isMatch("mississippi", "mis*is*p*.") -> false
Define dp[i][j] = true if text[0:i] matches pattern[0:j].
Transitions:
- If
pattern[j-1]is a regular char or.:
dp[i][j] = dp[i-1][j-1] AND (text[i-1] == pattern[j-1] OR pattern[j-1] == '.')
- If
pattern[j-1]is*:- Zero occurrences:
dp[i][j] = dp[i][j-2] - One or more:
dp[i][j] = dp[i-1][j] AND (text[i-1] matches pattern[j-2])
- Zero occurrences:
func isMatch(s, p string) bool {
m, n := len(s), len(p)
dp := make([][]bool, m+1)
for i := range dp {
dp[i] = make([]bool, n+1)
}
dp[0][0] = true
// Patterns like a*, a*b* can match empty string.
for j := 2; j <= n; j++ {
if p[j-1] == '*' {
dp[0][j] = dp[0][j-2]
}
}
for i := 1; i <= m; i++ {
for j := 1; j <= n; j++ {
if p[j-1] == '*' {
dp[i][j] = dp[i][j-2]
if matches(s[i-1], p[j-2]) {
dp[i][j] = dp[i][j] || dp[i-1][j]
}
} else if matches(s[i-1], p[j-1]) {
dp[i][j] = dp[i-1][j-1]
}
}
}
return dp[m][n]
}
Time complexity: O(m * n)
7) Backtracking engines and why they blow up
Backtracking evaluates the regex like a depth-first tree search:
- Try one path, and if it fails, backtrack and try another.
- Easy to implement for rich syntax (groups, alternation, etc.).
But it can be exponential. Example:
pattern: (a+)+b
text: aaaaaaaaaaaaaaaaaa
The engine keeps re-partitioning the a's before it finally fails on b. This is called catastrophic backtracking.
8) Thompson NFA simulation (the scalable way)
Thompson's construction turns a regex into an NFA with epsilon transitions. You then simulate all active states at once:
- Build an NFA from the pattern
- Maintain the epsilon-closure of active states
- Consume characters and advance the state set
This avoids exponential blow-ups and gives predictable performance.
9) POSIX leftmost-longest matching
POSIX regex engines must return the leftmost match, and among those, the longest match.
That means the engine should ignore alternation order. Example:
text = "aa"
pattern = "a|aa"
POSIX match -> "aa" (longest at leftmost position)
A practical approach:
- For each possible start position (leftmost first),
- Run the NFA and track the longest accepting end,
- Return the first start position with any match.
This is the capstone: a real regex engine with POSIX-style behavior. Include bounded repetition like {m}, {m,}, and {m,n} in the capstone syntax.
10) Finite State Machines for Parsing
An FSM (or DFA) is perfect for validating structured input like numbers, emails, or dates.
States: Where you are in the parsing process Transitions: What input moves you to which state Accept states: Valid ending points
11) Example: Validating Numbers
Valid: "3.14", "-2", "2e10", "+.5", "4." Invalid: "e3", "1e", "--6", "1.2.3"
States:
- START
- SIGN_SEEN
- INTEGER
- DOT
- DECIMAL
- EXP
- EXP_SIGN
- EXP_INTEGER
- INVALID
Key transitions:
- START -> SIGN_SEEN (on +/-)
- START -> INTEGER (on digit)
- START -> DOT (on .)
- INTEGER -> DOT (on .)
- INTEGER -> EXP (on e/E)
- DOT -> DECIMAL (on digit)
- etc.
func isValidNumber(s string) bool {
const (
START = iota
SIGN
INTEGER
DOT
DECIMAL
EXP
EXP_SIGN
EXP_INT
)
state := START
for _, c := range s {
switch state {
case START:
if c == '+' || c == '-' {
state = SIGN
} else if isDigit(c) {
state = INTEGER
} else if c == '.' {
state = DOT
} else {
return false
}
case SIGN:
if isDigit(c) {
state = INTEGER
} else if c == '.' {
state = DOT
} else {
return false
}
// ... more cases
}
}
// Check if we're in an accepting state
return state == INTEGER || state == DECIMAL || state == EXP_INT
}
12) FSM Design Tips
- Draw the state diagram first - paper or whiteboard
- Enumerate all input types - digits, signs, dots, exponent chars, other
- Identify accepting states - which states are valid endings?
- Handle each transition explicitly - don't rely on "default"
- Test edge cases - empty string, single character, leading/trailing special chars
13) Complexity Summary
| Algorithm | Time | Space |
|---|---|---|
| Naive search | O(n*m) | O(1) |
| KMP | O(n+m) | O(m) for LPS |
| Regex DP | O(m*n) | O(m*n) |
| Thompson NFA (regex) | O(n*states) | O(states) |
| FSM parsing | O(n) | O(1) |
Outcomes
After this module, you should be able to:
- Explain why KMP is faster than naive search
- Build an LPS table and use it for pattern matching
- Implement regex matching with DP for
.and* - Explain backtracking pitfalls and why NFAs scale better
- Implement POSIX leftmost-longest regex matching on a small ERE subset
- Design a finite state machine for parsing structured input
- Choose the right approach for different string problems
Practice Set
- KMP String Search (Medium) - Implement the full KMP algorithm
- Regex Lite (Medium) - Match with
.and*using DP - Regex Backtracking Pitfalls (Hard) - Avoid catastrophic backtracking
- POSIX Regex Search (Hard) - Build a leftmost-longest regex engine
- Valid Number (FSM) (Medium) - Parse decimal numbers with optional exponent
Start with KMP to understand prefix tables, then Regex Lite for DP, then Backtracking Pitfalls to see why naive recursion fails, then POSIX Regex for a real engine, and finally Valid Number to practice FSM design.
Module Items
KMP String Search
Find a pattern using the KMP prefix table.
Regex Lite
Match a pattern with '.' and '*' tokens.
Regex Backtracking Pitfalls
Avoid catastrophic backtracking by memoizing regex matching.
POSIX Regex Search
Build a POSIX leftmost-longest regex engine over a small ERE subset.
Valid Number (FSM)
Validate decimal numbers with optional exponent.
Strings & Automata
KMP, regex engines, and numeric FSMs.