A lexer for 3.0 by jahav · Pull Request #26 · ClosedXML/ClosedXML.Parser

jahav · 2025-10-18T21:57:40Z

This is a lexer for version 3.0, which should use Pratt parsing.

The new token structure and a Pratt parser should solve some long-standing issues that were rather annoying in the original lexer + recursive parser. 2.0 had too much logic in the lexer, and its tokens were modeled after the ABNF grammar. Not a great choice. I am not too happy with the recursive parser either. Some cases that caused problems:

LOG10 can be a reference to column LOG and row 10, a function call (LOG10(14)), or a sheet name
bang names and references
[1] can be part of an external book reference (e.g., [4]Sheet!A1), a structured reference to a column named 1 in a table, or even part of R[1]C[1].
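
To illustrate the LOG10 case, here is a minimal sketch of how a parser could decide what an ambiguous word means by looking at the token that follows it. It is written in Python for brevity and is not the ClosedXML.Parser code; `classify_word` and the category names are hypothetical, and the A1 pattern is simplified (no `$` markers, no column or row limits).

```python
import re
from typing import Optional

# Simplified A1-style reference: 1-3 column letters followed by a row number.
# Real spreadsheet rules (absolute markers, max column/row) are ignored here.
_A1 = re.compile(r"[A-Za-z]{1,3}[1-9][0-9]*")

def classify_word(word: str, next_text: Optional[str]) -> str:
    """Classify an ambiguous word using one token of lookahead (hypothetical helper)."""
    if next_text == "(":
        return "function"   # LOG10(14)
    if next_text == "!":
        return "sheet"      # LOG10!A1
    if _A1.fullmatch(word):
        return "cell"       # column LOG, row 10
    return "name"           # anything else is treated as a defined name

print(classify_word("LOG10", "("))  # function
print(classify_word("LOG10", "!"))  # sheet
print(classify_word("LOG10", "+"))  # cell
```

The point of the sketch is only that the same text yields different meanings depending on context, so the decision fits more naturally in the parser than in the lexer.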

The lexer is handwritten, though it could be expressed as a DFA if necessary. I am not a big fan of DFA generation: the tooling is just not there, the output has to be patched manually, and so on.

After the experience with 2.0, it seems that too much was done in the lexer. The new lexer produces far fewer tokens and should be far more "relaxed". That means the parser will have to deal with more relaxed tokens that might not be valid in the context where they are used.
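
A rough sketch of what a "relaxed" lexer can look like, in Python rather than the project's C#. The token kinds (`word`, `number`, `symbol`) and the patterns are assumptions made for illustration, not the token set introduced by this PR. The lexer only splits the input into coarse pieces and leaves all interpretation to the parser.

```python
import re
from typing import Iterator, NamedTuple

class Tok(NamedTuple):
    kind: str   # hypothetical kinds: "word", "number", "symbol"
    text: str

# Deliberately coarse patterns: a "word" may later turn out to be a function,
# a cell reference, a sheet name or a defined name. The lexer does not decide.
_PATTERN = re.compile(r"""
    (?P<ws>\s+)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_][A-Za-z0-9_.]*)
  | (?P<symbol>[-+*/^&=<>(),!:\[\]])
""", re.VERBOSE)

def tokenize(formula: str) -> Iterator[Tok]:
    """Yield coarse tokens; whitespace is skipped, unknown characters raise."""
    pos = 0
    while pos < len(formula):
        m = _PATTERN.match(formula, pos)
        if m is None:
            raise ValueError(f"unexpected character {formula[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup != "ws":
            yield Tok(m.lastgroup, m.group())

for tok in tokenize("[4]Sheet!A1"):
    print(tok.kind, tok.text)
```

With this split, `[4]Sheet!A1` and `R[1]C[1]` both produce the same bracket, number, and word tokens, and it is up to the parser to reject combinations that are not valid in a given context.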

Performance is acceptable (150 ms to tokenize 22 MB of Enron formulas), but that is because the lexer is rather simple. A lot of the logic will be in the Pratt parser.


jahav added this to the 3.0 milestone Oct 18, 2025

Pankraty reviewed Oct 19, 2025


Comment thread src/ClosedXML.Parser/Pratt/Token.cs

jahav force-pushed the pratt-parser branch from 36c2f89 to 53fe3d1 October 19, 2025 19:48

jahav added 3 commits October 19, 2025 22:30

Avoid useless allocation

5cb3225

A token doesn't need to allocate a substring; it can just store indices and combine them with the input. Because of the queue, it can't be turned into a ref struct.
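
The idea in this commit message can be sketched as follows. This is an illustrative Python analogue, not the contents of Token.cs; the `Token` fields and the `text` method are hypothetical names. The token stores only a kind and two indices, and the text is produced on demand from the original input.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    kind: str
    start: int  # inclusive index into the source formula
    end: int    # exclusive index into the source formula

    def text(self, source: str) -> str:
        # The token holds no string of its own; it is combined with the input.
        return source[self.start:self.end]

source = "SUM(A1)"
# Tokens are buffered in a queue, as mentioned in the commit message.
queue = deque([
    Token("word", 0, 3),
    Token("symbol", 3, 4),
    Token("word", 4, 6),
    Token("symbol", 6, 7),
])
print([t.text(source) for t in queue])  # ['SUM', '(', 'A1', ')']
```

In Python the slice still creates a new string when `text` is called; in C# the same layout could hand out a span over the input instead. A C# ref struct cannot be stored in a heap collection such as a queue, which matches the limitation noted above.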

Add info to token types

5c22f21

jahav force-pushed the pratt-parser branch from 53fe3d1 to 5c22f21 October 19, 2025 20:32

jahav merged commit adfe58c into develop Oct 19, 2025
1 check passed

jahav deleted the pratt-parser branch October 19, 2025 20:36

jahav merged 3 commits into develop from pratt-parser
