A lexer for 3.0 #26
Merged
Pankraty
reviewed
Oct 19, 2025
This is a lexer for version 3.0 that should use Pratt parsing. The lexer is handmade, though representable as a DFA if necessary. After the experience with 2.0, it seems that too much was done in the lexer; the new lexer produces far fewer tokens and should be far more "relaxed". That means the parser will have to deal with more relaxed tokens that might not be valid in the context where they are used. Performance is acceptable (150 ms to tokenize 22 MB of Enron formulas), but that is because the lexer is rather simple. A lot of logic will be in the parser.
A token doesn't need to allocate a substring; it can just store indices and combine them with the input. Because of the queue, it can't be turned into a ref struct.
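To illustrate the comment above, here is a minimal sketch of a token that stores only indices into the input. It is written in Python for brevity and is not the code from this PR; the names `Token`, `kind`, `start`, `end` and `text` are hypothetical. The queue stands in for whatever token queue the comment refers to: in C#, a ref struct cannot live in a heap-allocated collection, which is presumably why the token stays an ordinary struct.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str   # hypothetical token kind name
    start: int  # index of the first character in the input
    end: int    # index one past the last character

    def text(self, source: str) -> str:
        # The substring is only materialized when somebody asks for it.
        return source[self.start:self.end]


formula = "SUM(A1,10)"

# Tokens are plain values, so they can sit in a queue between lexer and parser.
queue = deque([Token("name", 0, 3), Token("open_paren", 3, 4), Token("name", 4, 6)])

first = queue.popleft()
print(first.text(formula))  # prints "SUM"
```

The token itself is just three small fields; the input string is shared and passed in whenever the text is needed.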
This is a lexer for version 3.0 that should use Pratt parsing.
The new token structure and Pratt parser should solve some long-term issues that were rather annoying in the original lexer + recursive parser. 2.0 had too much logic in the lexer, and tokens were modeled after the ABNF grammar. Not a great choice. I am not too happy with the recursive parser either. Some cases that caused problems:
LOG10 can be a reference to column LOG and row 10, a function LOG10(14), or a sheet name.
[1] can be part of an external book reference (e.g., [4]Sheet!A1), a structured reference to a column named 1 in a table, or even part of R[1]C[1].
The lexer is handmade, though representable as a DFA if necessary. I am not a big fan of DFA generation: the tooling is just not there, the output has to be patched manually, and so on.
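As a rough sketch of what a handmade, "relaxed" lexer could look like for the two cases above: it emits a generic `name` token for LOG10 and separate bracket and number tokens for [1], without deciding what they mean. This is a Python illustration under my own assumptions, not the lexer from this PR; the function name `tokenize` and all token kind names are hypothetical.

```python
def tokenize(source: str) -> list[tuple[str, int, int]]:
    """Split a formula into loosely classified (kind, start, end) tokens."""
    punctuation = {"(": "open_paren", ")": "close_paren", "[": "open_bracket",
                   "]": "close_bracket", "!": "bang", ",": "comma", ":": "colon"}
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        start = i
        if ch.isspace():
            i += 1
            continue
        if ch.isalpha():
            # Letters followed by letters/digits: could be a cell, function or sheet.
            while i < len(source) and source[i].isalnum():
                i += 1
            tokens.append(("name", start, i))
        elif ch.isdigit():
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("number", start, i))
        else:
            # Single-character token; unknown characters become a generic symbol.
            i += 1
            tokens.append((punctuation.get(ch, "symbol"), start, i))
    return tokens


print([kind for kind, _, _ in tokenize("LOG10(14)")])
print([kind for kind, _, _ in tokenize("[4]Sheet!A1")])
```

Both LOG10 and Sheet come out as the same `name` kind, so the decision about what they are is left entirely to the parser.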
After the experience with 2.0, it seems that too much was done in the lexer; the new lexer produces far fewer tokens and should be far more "relaxed". That means the parser will have to deal with more relaxed tokens that might not be valid in the context where they are used.
Performance is acceptable (150 ms to tokenize 22 MB of Enron formulas), but that is because the lexer is rather simple. A lot of logic will be in the Pratt parser.
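One way the parser might resolve a relaxed token is by looking at the token that follows it. The Python sketch below is only an illustration of that idea, not the approach taken in this PR: `classify_name`, the token kinds, the result labels and the simplified cell pattern are all assumptions.

```python
import re

# Simplified A1-style check (assumption): 1-3 letters followed by digits.
CELL_PATTERN = re.compile(r"[A-Za-z]{1,3}[0-9]+")


def classify_name(tokens, index):
    """Decide what a relaxed 'name' token means from the token that follows it."""
    text = tokens[index][1]
    next_kind = tokens[index + 1][0] if index + 1 < len(tokens) else "end"
    if next_kind == "open_paren":
        return "function"        # LOG10(14)
    if next_kind == "bang":
        return "sheet_name"      # LOG10!A1
    if CELL_PATTERN.fullmatch(text):
        return "cell_reference"  # LOG10 as column LOG, row 10
    return "defined_name"        # hypothetical fallback label


call = [("name", "LOG10"), ("open_paren", "("), ("number", "14"), ("close_paren", ")")]
sheet = [("name", "LOG10"), ("bang", "!"), ("name", "A1")]
cell = [("name", "LOG10"), ("symbol", "+"), ("number", "1")]

print(classify_name(call, 0))   # prints "function"
print(classify_name(sheet, 0))  # prints "sheet_name"
print(classify_name(cell, 0))   # prints "cell_reference"
```

The same `name` token yields three different meanings, which is the kind of context-dependent work a relaxed lexer pushes into the parser.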
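For readers unfamiliar with Pratt parsing, here is a minimal generic sketch in Python. It handles only numbers, names, parentheses and four left-associative binary operators, and it is not the parser planned for 3.0; the binding power values and the tuple-based tree are assumptions chosen for illustration.

```python
import re

# (left, right) binding powers; right > left makes an operator left-associative.
BINDING_POWER = {"+": (10, 11), "-": (10, 11), "*": (20, 21), "/": (20, 21)}


def parse(source):
    """Parse a small arithmetic expression into nested (op, left, right) tuples."""
    tokens = re.findall(r"\d+|[A-Za-z]\w*|[-+*/()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expression(min_power):
        token = advance()
        if token == "(":
            left = expression(0)
            advance()  # consume ")"
        else:
            left = token  # number or name
        while True:
            op = peek()
            if op not in BINDING_POWER:
                break
            left_power, right_power = BINDING_POWER[op]
            if left_power < min_power:
                break  # the operator binds too weakly; let the caller take it
            advance()
            right = expression(right_power)
            left = (op, left, right)
        return left

    return expression(0)


print(parse("1+2*3"))  # prints ('+', '1', ('*', '2', '3'))
print(parse("8-3-2"))  # prints ('-', ('-', '8', '3'), '2')
```

Precedence lives in one table of binding powers rather than in one recursive function per grammar level, which is the main structural difference from a recursive descent parser.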