We have a real-world grammar for the Wolfram Language, which is fairly complex by nature. Running it through lezer-generator fails with the error "Goto table too large" (exit code 1). The grammar file is attached. (It relies on a custom tokenizer, but I don't think that's needed to understand or reproduce this issue.)
What I know so far:
- This used to work with lezer-generator v0.8.5 (after processing the grammar for ~90 seconds) but fails now with v0.13.1.
- If I remove the handling of "Unicode operators" (by commenting out | unicodeOperation in the definition of expr), the grammar is generated OK.
Those Unicode operations are rules of the form
Unicode_XXX { expr !precXXX UnicodeOp<unicodeXXX> expr }
with extra "wrappers"
UnicodeOp<token> { Op<token> }
Op<term> { term }
defined for the corresponding unicode tokens (this helps with the "post-processing" of the generated parse tree). Perhaps that just adds too much complexity?
I just wanted to open this issue as discussed previously on lezer-parser/generator#2 (comment). I don't really expect anyone to solve the whole problem for us, but wanted to ask what the most useful next steps would be:
Would it help to narrow down the specific version (> 0.8.5, <= 0.13.1) where this "broke"?
Should we try to simplify the grammar (e.g. trying to remove the nesting of UnicodeOp / Op), assuming this limit is expected?
Or would it make sense to work towards supporting larger "goto tables"? I could try to open a pull request increasing the "pointer size" from 16 to 32 bit, if that makes sense.
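To illustrate what I mean by the pointer size, here is a rough sketch of how I imagine the limit arises. This is not lezer-generator's actual code: the function packTable and its layout (a header of offsets followed by the entries) are made up for illustration, and I'm only assuming that offsets into the table have to fit into 16-bit slots.

```typescript
// Hypothetical illustration only; NOT lezer-generator's real implementation.
// Assumption: the table is one flat typed array whose header holds, for each
// group, the offset at which that group's entries start.
const MAX_U16 = 0xffff;

function packTable(groups: number[][], wide: boolean): Uint16Array | Uint32Array {
  const limit = wide ? 0xffffffff : MAX_U16;
  const header = groups.length;
  const total = header + groups.reduce((n, g) => n + g.length, 0);
  const table = wide ? new Uint32Array(total) : new Uint16Array(total);
  let pos = header;
  groups.forEach((group, i) => {
    // An offset that does not fit into one slot cannot be stored.
    if (pos > limit) throw new Error("Goto table too large");
    table[i] = pos;
    for (const value of group) table[pos++] = value;
  });
  return table;
}
```

With 16-bit slots, any group starting beyond offset 65535 triggers the error. Switching to 32-bit slots removes that limit in this sketch, at the cost of doubling the table size, and the runtime that reads the table would have to change as well.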
Thanks!
wl.txt