POSIX Regex Search

hard · string, regex, nfa, parsing

POSIX Regex Search (Leftmost-Longest)

Build a small regex engine that returns the leftmost-longest match (POSIX semantics).

Supported syntax (subset of POSIX ERE)

  • Literals: any non-metacharacter
  • . matches any single character
  • * zero or more of the previous atom
  • + one or more of the previous atom
  • ? zero or one of the previous atom
  • {m} exactly m repetitions
  • {m,} at least m repetitions
  • {m,n} between m and n repetitions (inclusive)
  • | alternation
  • ( ... ) grouping
  • Character classes:
    • [abc] any of a, b, c
    • [a-z] range
    • [^abc] negated class
  • Anchors:
    • ^ only if it is the first character in the pattern
    • $ only if it is the last character in the pattern
  • Escapes:
    • \ escapes the next character (treat it literally)

Assume the pattern is valid. Input is ASCII.

Function signature

func RegexPosix(text, pattern string) (int, int)

Return the start index (inclusive) and end index (exclusive) of the leftmost-longest match. If no match exists, return (-1, -1).

Examples

text = "aa"
pattern = "a|aa"
output = (0, 2)  // leftmost-longest chooses "aa"

text = "xxabc"
pattern = "abc$"
output = (2, 5)

text = "bbb"
pattern = "a*"
output = (0, 0)  // empty match at the leftmost position

Constraints

  • 0 <= len(text), len(pattern) <= 2000

Notes

  • Do not use a regex library.
  • A Thompson NFA + epsilon-closure simulation is a good approach.
Run tests to see results
No issues detected