Regex Engine and Lambda Calculus Parser

A Python implementation of a regular expression engine built upon formal automata theory, paired with a lexer and parser for lambda calculus expressions. The project covers the full pipeline from regular expression parsing to deterministic finite automaton minimization, and culminates in a CYK-based parser for context-free grammars.

Overview

This project implements a complete formal language processing pipeline from scratch in Python. It is organized into two primary subsystems:

Regex Engine. A regular expression engine that parses regex syntax into an abstract syntax tree (AST), converts that tree into a Nondeterministic Finite Automaton (NFA) using Thompson's Construction, and then converts the NFA into a Deterministic Finite Automaton (DFA) via Subset Construction. The resulting DFA can optionally be minimized using the table-filling (matrix minimization) algorithm.

Lambda Calculus Parser. A lexer and parser built on top of the regex engine. The lexer tokenizes an input string according to a user-provided specification of (token name, regex) pairs. The parser then applies the Cocke-Younger-Kasami (CYK) algorithm on the token stream using a grammar in Chomsky Normal Form (CNF), producing a parse tree for lambda calculus expressions.

Architecture

The pipeline proceeds in the following stages:

Regex String
     |
     v
  Tokenizer  (RegexParser.py)
     |
     v
Abstract Syntax Tree (Regex subclasses)
     |
     v
Thompson's Construction  ->  NFA
     |
     v
Subset Construction      ->  DFA
     |
     v
Matrix Minimization      ->  Minimal DFA
     |
     v
Lexer  (Lexer.py)
     |
     v
Token Stream
     |
     v
CYK Algorithm  (Grammar.py)
     |
     v
Parse Tree  (ParseTree.py)

Module Reference

Regex Engine

The regex engine is structured around an abstract base class Regex and a set of concrete subclasses, each representing a distinct syntactic construct of regular expressions.

`Regex` (abstract base class)

Defined in Regex.py. Declares the abstract method thompson(), which every subclass must implement. Also exposes the parse_regex(regex: str) -> Regex convenience function, which delegates to RegexParser.

`Character`

Represents a single character literal. Its thompson() method constructs a two-state NFA with a single transition consuming the character from state 0 to state 1.

`Epsilon`

Represents the empty string. Its thompson() method produces a two-state NFA with a single epsilon transition (represented as "") from state 0 to state 1. The alphabet of this NFA is the empty set.

`Union`

Represents the alternation operator (|). Given two child Regex objects, left_regex and right_regex, its thompson() method:

Constructs NFAs for both children, remapping their states to avoid collisions.
Introduces a new initial state with epsilon transitions to both child NFAs' initial states.
Introduces a new final state with epsilon transitions from all final states of both child NFAs.

`Concatenation`

Represents sequential composition of two regular expressions. Its thompson() method:

Constructs NFAs for both children, remapping states to avoid collisions.
Introduces a new initial state with an epsilon transition to the left NFA's initial state.
Adds epsilon transitions from every final state of the left NFA to the initial state of the right NFA.
Adds epsilon transitions from every final state of the right NFA to a new global final state.

`Star`

Represents the Kleene star operator (*). Its thompson() method wraps the child NFA by:

Adding a new initial state with epsilon transitions to both the child NFA's initial state and a new final state (allowing zero repetitions).
Adding epsilon transitions from every final state of the child NFA back to its initial state (allowing repeated matching) and to the new final state.

`Plus`

Represents the one-or-more operator (+). Implemented as syntactic sugar: Plus(r) is equivalent to Concatenation(r, Star(r)) and delegates its thompson() call accordingly.

`Optional`

Represents the zero-or-one operator (?). Implemented as syntactic sugar: Optional(r) is equivalent to Union(r, Epsilon()) and delegates its thompson() call accordingly.

`Range`

Represents a character class range such as [a-z] or [0-9]. Its thompson() method constructs a chain of Union nodes, one per character in the specified range, and calls thompson() on the resulting union tree.

Automata

`NFA`

A generic dataclass defined in NFA.py parameterized by the state type STATE. Fields:

Field	Type	Description
`S`	`set[str]`	The input alphabet.
`K`	`set[STATE]`	The set of states.
`q0`	`STATE`	The initial state.
`d`	`dict[tuple[STATE, str], set[STATE]]`	The transition function. Maps (state, character) to a set of successor states. Epsilon transitions use the empty string as the character.
`F`	`set[STATE]`	The set of accepting states.

Key methods:

epsilon_closure(state) — Returns the frozenset of all states reachable from state via zero or more epsilon transitions, computed via BFS.
character_closure(state, character) — Returns all states reachable by first consuming character from state, then applying epsilon closure to each resulting state.
subset_construction() — Converts the NFA to an equivalent DFA[frozenset[STATE]]. See Subset Construction.
remap_states(f) — Returns a new NFA where every state is replaced by f(state). Used to avoid state collisions when composing NFAs during Thompson's construction.

`DFA`

A generic dataclass defined in DFA.py parameterized by the state type STATE. Fields are analogous to NFA, except the transition function d maps (STATE, str) to a single STATE (deterministic).

Key methods:

accept(word) — Simulates the DFA on word and returns True if the DFA ends in an accepting state.
minimize() — Applies the table-filling minimization algorithm. See DFA Minimization.
remap_states(f) — Returns a new DFA where every state is replaced by f(state).

Lexer

`LexerObject`

An internal helper class. Each instance corresponds to one entry in the lexer specification (a (token_name, regex) pair). On construction it parses the regex, runs Thompson's construction, and remaps NFA states to a globally unique range to prevent conflicts when NFAs are combined.

`Lexer`

Defined in Lexer.py. Accepts a specification of the form list[tuple[str, str]] where each tuple is (token_name, regex_pattern). On construction:

A LexerObject is created for each spec entry.
All individual NFAs are merged into a single NFA (createLexerNFA) with a shared new initial state that has epsilon transitions to each individual NFA's initial state.
The combined NFA is converted to a DFA (createLexerDFA) via Subset Construction.

The lex(word) method performs maximal-munch lexical analysis: it repeatedly calls lexPartialWord to find the longest matching token at the current position, advances past it, and accumulates results. Errors (no viable match, or EOF in an unexpected state) produce a descriptive message including the line number and character offset.

Token priority is determined by the order of entries in the specification: earlier entries take precedence when multiple tokens match the same prefix.

Parser

`Grammar`

Defined in Grammar.py. Represents a context-free grammar in Chomsky Normal Form (CNF). Can be loaded from a file via Grammar.fromFile(file_name), which reads production rules in the format:

NonTerminal: Body1|Body2|...

where each body is either a single terminal/non-terminal or a pair of non-terminals separated by a space.

The cykParse(w) method accepts the token stream produced by the lexer and applies the CYK algorithm. See CYK Parsing.

`Parser`

A thin facade defined in Parser.py that combines a Lexer and a Grammar. Its parse(input) method:

Lexes the input string.
Filters out tokens whose names begin with "SPACE".
Runs cykParse on the filtered token stream.
Returns the parse tree as a formatted string, or an empty string if parsing fails.

`ParseTree`

Defined in ParseTree.py. A recursive tree structure where each node has a name, an optional token (for terminal nodes), and a list of children. The to_string(indent) method performs a pre-order traversal to produce a human-readable, indented representation. Intermediate nodes whose names begin with "int_" are treated as transparent and do not add an indentation level, allowing the grammar author to introduce auxiliary non-terminals without affecting the output format.

Regex Tokenizer and Parser

`RegexToken`

Defined in RegexToken.py. A simple value object with a type (one of "character", "operator", "concatenation", "range", "epsilon", "EOR") and a value. The distinction between "character" and "operator" is essential for correctly handling escaped characters such as \+, which must be treated as a literal rather than an operator.

`RegexParser`

Defined in Regex.py (via RegexParser.py). Implements a recursive descent parser for regular expressions. The grammar it parses is:

expression       ::= union
union            ::= concatenation ( '|' concatenation )*
concatenation    ::= operator_expr ( operator_expr )*
operator_expr    ::= factor ( '*' | '+' | '?' )*
factor           ::= character
                   | range
                   | 'eps'
                   | '(' expression ')'

The tokenizer (tokenize) handles escape sequences (\), range literals ([a-z]), parentheses, and the eps keyword for explicit epsilon before passing tokens to the parser.

Algorithms

Thompson's Construction

Thompson's Construction converts a regular expression into an equivalent NFA. Each Regex subclass implements thompson() following the structural induction on the regular expression. The key invariant maintained throughout is that all NFAs produced have exactly one initial state and exactly one final state, which simplifies composition. States are remapped using remap_states before composition to ensure globally unique state identifiers.

Subset Construction

Defined in NFA.subset_construction(). Converts an NFA to a DFA where each DFA state is a frozenset of NFA states. The algorithm:

Computes the epsilon closure of the NFA's initial state to form the DFA's initial state.
Maintains a worklist of unprocessed DFA states.
For each DFA state and each alphabet symbol, computes the set of NFA states reachable by consuming that symbol (using character_closure on each constituent NFA state), forming the next DFA state.
Marks a DFA state as accepting if it contains any NFA accepting state.
Continues until the worklist is empty.

The resulting DFA may include a sink state (the empty frozenset), representing an implicit dead state for inputs not covered by any transition.

DFA Minimization

Defined in DFA.minimize(). Applies the table-filling (Myhill-Nerode) algorithm:

States are first remapped to strings for uniform comparison.
A triangular matrix is populated: pairs of states are initially marked distinguishable if exactly one is accepting, and indistinguishable otherwise.
Iteratively, any indistinguishable pair (P, Q) is marked distinguishable if there exists an alphabet symbol x such that (delta(P, x), delta(Q, x)) is already marked distinguishable. This continues until no new pairs can be marked.
Indistinguishable pairs are merged into equivalence classes (partitions).
A new minimal DFA is constructed using frozensets of original states as new states, with transitions and accepting sets derived from any representative state within each partition.

CYK Parsing

Defined in Grammar.cykParse(). The Cocke-Younger-Kasami algorithm parses a token sequence of length n in O(n^3 |G|) time where |G| is the grammar size.

A triangular dynamic programming matrix of size n x n is initialized. Each cell matrix[row][col] holds a dictionary mapping non-terminal names to ParseTree nodes.
Base case: terminal rules A -> a are applied for each token in the input, populating the bottom row of the matrix.
Inductive step: for increasing substring lengths, for each substring position, and for each binary rule A -> B C, the algorithm checks whether B derives the left sub-sequence and C derives the right sub-sequence. If so, a new ParseTree node for A is created with those derivations as children.
If the start symbol appears in matrix[0][0], parsing succeeds and the corresponding ParseTree is returned. Otherwise, None is returned.

Project Structure

.
|-- src/
|   |-- Regex.py            # Abstract base class and parse_regex entry point
|   |-- RegexParser.py      # Tokenizer and recursive descent parser for regex syntax
|   |-- RegexToken.py       # Token value object used by the regex tokenizer
|   |-- Character.py        # Regex leaf: single character
|   |-- Epsilon.py          # Regex leaf: empty string
|   |-- Union.py            # Regex node: alternation (|)
|   |-- Concatenation.py    # Regex node: sequential composition
|   |-- Star.py             # Regex node: Kleene star (*)
|   |-- Plus.py             # Regex node: one-or-more (+), sugar over Concatenation+Star
|   |-- Optional.py         # Regex node: zero-or-one (?), sugar over Union+Epsilon
|   |-- Range.py            # Regex node: character class range [a-z], [0-9]
|   |-- NFA.py              # NFA dataclass with epsilon closure, subset construction
|   |-- DFA.py              # DFA dataclass with acceptance check and minimization
|   |-- Lexer.py            # Maximal-munch lexer built on top of the regex engine
|   |-- Grammar.py          # CNF grammar loader and CYK parsing algorithm
|   |-- Parser.py           # Facade combining Lexer and Grammar
|   |-- ParseTree.py        # Recursive parse tree with formatted string output

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src		src
test		test
tests_1		tests_1
tests_2		tests_2
tests_4		tests_4
.gitignore		.gitignore
ID.txt		ID.txt
README.md		README.md
grammar_lambda.txt		grammar_lambda.txt
lexer_spec.json		lexer_spec.json

Folders and files

Latest commit

History

Repository files navigation

Regex Engine and Lambda Calculus Parser

Table of Contents

Overview

Architecture

Module Reference

Regex Engine

Regex (abstract base class)

Character

Epsilon

Union

Concatenation

Star

Plus

Optional

Range

Automata

NFA

DFA

Lexer

LexerObject

Lexer

Parser

Grammar

Parser

ParseTree

Regex Tokenizer and Parser

RegexToken

RegexParser

Algorithms

Thompson's Construction

Subset Construction

DFA Minimization

CYK Parsing

Project Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`Regex` (abstract base class)

`Character`

`Epsilon`

`Union`

`Concatenation`

`Star`

`Plus`

`Optional`

`Range`

`NFA`

`DFA`

`LexerObject`

`Lexer`

`Grammar`

`Parser`

`ParseTree`

`RegexToken`

`RegexParser`

Packages