A lexer for 3.0 #26

Merged
jahav merged 3 commits into develop from pratt-parser on Oct 19, 2025
jahav commented Oct 18, 2025

This is a lexer for version 3.0 that should use Pratt parsing.

The new token structure and Pratt parser should solve some long-term issues that were rather annoying in the original lexer + recursive parser. 2.0 had too much logic in the lexer, and tokens were modeled after the ABNF grammar. Not a great choice. I am not too happy with the recursive parser either. Some cases that caused problems:

  • LOG10 can be a reference to column LOG and row 10, a function call LOG10(14), or a sheet name
  • bang names and references
  • [1] can be part of an external book reference (e.g., [4]Sheet!A1), a structured reference to a column named 1 in a table, or even part of R[1]C[1].

The lexer is handmade, though representable as a DFA if necessary. I'm not a very big fan of DFA generation: the tooling is just not there, the output has to be patched manually, and so on.

After experience with 2.0, it seems that too much was done in the lexer. The new lexer produces far fewer tokens and should be far more "relaxed". That means the parser will have to deal with more relaxed tokens that might not be valid in the context where they are used.
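A rough sketch of what "relaxed" could mean for the LOG10 case. This is not the code in this PR: `NameRole`, `RelaxedNames` and `Classify` are hypothetical names, and the A1 check is deliberately simplified (it ignores the XFD column and row limits). The idea is that the lexer emits `LOG10` as one name-like token and the parser picks its role from what follows.

```csharp
using System;

// Illustrative only: NameRole and RelaxedNames are hypothetical names,
// not types from ClosedXML.Parser.
public enum NameRole { Function, SheetName, CellReference, Name }

public static class RelaxedNames
{
    // The token covers formula[start..end). The lexer did not decide what
    // the token means; the role is picked from the character that follows.
    public static NameRole Classify(string formula, int start, int end)
    {
        char next = end < formula.Length ? formula[end] : '\0';
        if (next == '(') return NameRole.Function;   // LOG10(14)
        if (next == '!') return NameRole.SheetName;  // LOG10!A1
        return LooksLikeA1(formula.AsSpan(start, end - start))
            ? NameRole.CellReference                 // column LOG, row 10
            : NameRole.Name;
    }

    // Simplified assumption: 1 to 3 ASCII letters followed by at least one
    // digit. Real column and row limits are not checked here.
    private static bool LooksLikeA1(ReadOnlySpan<char> text)
    {
        int i = 0;
        while (i < text.Length && IsAsciiLetter(text[i])) i++;
        int letters = i;
        while (i < text.Length && text[i] >= '0' && text[i] <= '9') i++;
        int digits = i - letters;
        return letters >= 1 && letters <= 3 && digits >= 1 && i == text.Length;
    }

    private static bool IsAsciiLetter(char c) =>
        (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
```

With this shape, `RelaxedNames.Classify("LOG10(14)", 0, 5)` yields `Function`, while the same five characters in `LOG10!A1` yield `SheetName`, so the decision lives entirely on the parser side.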

Performance is acceptable (150 ms for tokenization of 22 MB of Enron formulas), but that is because the lexer is rather simple. A lot of logic will be in the Pratt parser.

jahav added this to the 3.0 milestone Oct 18, 2025
Comment thread src/ClosedXML.Parser/Pratt/Token.cs
jahav added 3 commits October 19, 2025 22:30
This is a lexer for version 3.0 that should use Pratt parsing. The lexer
is handmade, though representable as a DFA if necessary.

After experience with 2.0, it seems that too much was done in the lexer.
The new lexer produces far fewer tokens and should be far more
"relaxed". That means the parser will have to deal with more relaxed
tokens that might not be valid in the context where they are used.

Performance is acceptable (150 ms for tokenization of 22 MB of Enron
formulas), but that is because it's rather simple. A lot of logic will
be in the parser.
A token doesn't need to allocate a substring; it can just store indices
and combine them with the input.

Because of the queue, the token can't be turned into a ref struct.
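A minimal sketch of the index-based token idea, under assumptions: the shape below is hypothetical and the real `Token` in src/ClosedXML.Parser/Pratt/Token.cs may differ (the `Kind` field and `TokenDemo` helper are made up for illustration). It shows a token that stores only a start index and a length, and slices the input on demand. It is a plain `readonly struct` because a `ref struct` cannot be used as the element type of a `Queue<T>`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token shape: no substring is allocated when the token is
// created, only two integers and a kind are stored.
public readonly struct Token
{
    public Token(int kind, int start, int length)
    {
        Kind = kind;
        Start = start;
        Length = length;
    }

    public int Kind { get; }
    public int Start { get; }
    public int Length { get; }

    // Combine the stored indices with the original input to get the text.
    public ReadOnlySpan<char> GetText(string input) => input.AsSpan(Start, Length);
}

public static class TokenDemo
{
    // Assumed usage: tokens wait in a queue until the parser consumes them.
    public static string FirstTokenText(string input, int start, int length)
    {
        var queue = new Queue<Token>();
        queue.Enqueue(new Token(kind: 0, start: start, length: length));
        Token token = queue.Dequeue();
        // A string is materialized only here, when the caller asks for one.
        return token.GetText(input).ToString();
    }
}
```

The trade-off is that every consumer of a token also needs access to the input it was lexed from.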
jahav merged commit adfe58c into develop Oct 19, 2025
1 check passed
jahav deleted the pratt-parser branch October 19, 2025 20:36