refactor: use parser restarting API instead of whitespace padding#750
refactor: use parser restarting API instead of whitespace padding#750ejgallego wants to merge 4 commits intoleanprover:mainfrom
Conversation
698b0d5 to
fa6ff26
Compare
I think for this one we don't have to care as much, because the output of
I think this would be a good PR to upstream. |
| /-- Compute a parsing starting position and `InputContext` from an | ||
| embedded string literal. This is often used to call Lean's parser | ||
| re-entranly. **Precondition**: the string literal must appear in the | ||
| source, otherwise the function may panic. -/ |
There was a problem hiding this comment.
| /-- Compute a parsing starting position and `InputContext` from an | |
| embedded string literal. This is often used to call Lean's parser | |
| re-entranly. **Precondition**: the string literal must appear in the | |
| source, otherwise the function may panic. -/ | |
| /-- | |
| Compute a parsing starting position and `InputContext` from an | |
| embedded string literal. This is often used to call Lean's parser | |
| re-entranly. **Precondition**: the string literal must appear in the | |
| source, otherwise the function may panic. | |
| -/ |
See https://github.com/leanprover/lean4/blob/master/doc/style.md#layout-and-indentation
| public def inputContextFromStrLit [Monad m] [MonadLog m] [MonadFileMap m] (str : StrLit) (versoStyle : Bool := true) (fileName : Option String := none) : m (String.Pos.Raw × InputContext) := do | ||
| -- dbg_trace "{repr str}" | ||
| let filename ← fileName.getDM getFileName | ||
| let source := (← getFileMap).source | ||
| let some (pos, endPos) := str.innerPos? versoStyle | ||
| -- XXX: replace by elaborator exception (throwErrorAt) | ||
| -- XXX: Gonna fail when users write a bad macro | ||
| | panic "invalid string literal on parser resumption (inputContextFromStrLit)" | ||
| if endPos_valid : endPos ≤ source.rawEndPos then | ||
| let iCtx := mkInputContext source filename (endPos := endPos) (endPos_valid := endPos_valid) | ||
| return (pos, iCtx) | ||
| else | ||
| panic "invalid source code slice on parser resumption, slice goes out of bounds" |
There was a problem hiding this comment.
I really think that there should be two separate functions here, because the current one is a footgun when versoStyle := false.
There's two use cases here:
- The Verso parser for something like a code block constructed a string literal node internally for the contents of the code block. The atom contained within it has been passed through
String.quoteto add quotes around it and escape things inside it, because this allows Verso elaborators to put it into aTermeasily, e.g. in a context like`(Doc.Inline.code $s). In this case, when we want to do further parsing of the contents, we should ignore the atom contained inside theStrLitand instead parse its span in the original source code, because this further parsing shouldn't be undoing escapes and the like. - The user wrote some syntax in a Lean string literal that we'd like to parse with a metaprogram. The user's code already has quotes around it and may contain escapes. Parsing this is a bit dodgy - the user could have written
\u0020instead of, for example. We do it in a few places where it would be better to have dedicated syntax, but $REASONS (we can get into it at some point). In this case, we want to decode the atom contained within the string literal node into the string that it denotes, parse that, and then do our best to report errors that may have occurred in the context of the original string literal. This could be done right by having a pass to compute a mapping from string literal source positions to decoded string positions, then go over the message log and syntax objects to back-patch the positions, but we don't do that right now.
So in these two cases, the source positions are used fundamentally differently, and the string contents are used fundamentally differently. I think it really needs to be two separate functions in the API.
This is a step towards removing `parserInputString`. Upstream `runParserCategory` always starts at `pos := 0` Another option would be to update the upstream function to take this into account, but the idea is that the function will evolve to take a string literal so it should be able to compute the position by itself. I made the types different (the Verso one is monadic) as to help avoid confusion, and increase programming ergonomics. Note that we don't change the parts `in VersoManual.Docstring` as these are code strings that come from the `MD4Lean` and they don't have an attached location.
This is possible after leanprover/lean4#10043 improved upstream parsing API. This replaces the current use of `parserInputString`, which pads the input, and seems clearer overall. There are two main cases for string literals for us to process: - string literals coming from Verso: in this case, `.getPos` and `.getTailPos` do provide the correct start / end of the string, without the quotes. - string literals coming from Lean: in this case, we must account for the quotes and fixup the positions. We have handled this by convention, but `VersoUtils.parseString` takes an optional parameter as it is used in both modes. Eventually we'd like to move Verso-style strings to its own type instead of `StrLit`. Notes: - I couldn't port `VersoBlog.leanInit` as `Parser.parseHeader` always parses from `pos := 0`. This is the last blocker to completely remove `parserInputString`. - In general, it seems like most functions using `mkParserState` upstream could benefit from an update to take positions. - Tests in `UsersGuide.Markup` had to be adapted due to use of `contents.getString.trimAsciiEnd.copy`. I did this to pass CI, must implement a better fix before merging. - Code that depends on MD4Lean hasn't been ported as MD4 doesn't seem to provide the right API for us. - TODO: we should test `canonical := true` and quoting properly. Note special cases such as `` `code ``
Only `Command.State` is needed to thread several example blocks. Note that before we used the parsing state resulting from the last code block, so we did resume from an offset that could be potentially much before the current code block. This worked as the string in that case was empty (produced by `parserInputString`), but would have failed once we have reuse of the input source string.
20de2a0 to
c32416e
Compare
|
Preview for this PR is ready! 🎉 |
This is possible after leanprover/lean4#10043
improved upstream parsing API.
This replaces the current use of
parserInputString, which pads the input, and seems clearer overall.There are two main cases for string literals for us to process:
string literals coming from Verso: in this case,
.getPosand.getTailPosdo provide the correct start / end of the string,without the quotes.
string literals coming from Lean: in this case, we must account for
the quotes and fixup the positions.
We have handled this by convention, but
VersoUtils.parseStringtakesan optional parameter as it is used in both modes. Eventually we'd
like to move Verso-style strings to its own type instead of
StrLit.Notes:
I couldn't port
VersoBlog.leanInitasParser.parseHeaderalwaysparses from
pos := 0. This is the last blocker to completelyremove
parserInputString.In general, it seems like most functions using
mkParserStateupstream could benefit from an update to take positions.
Tests in
UsersGuide.Markuphad to be adapted due to use ofcontents.getString.trimAsciiEnd.copy. I did this to pass CI, mustimplement a better fix before merging.
Code that depends on MD4Lean hasn't been ported as MD4 doesn't seem
to provide the right API for us.
TODO: we should test
canonical := trueand quoting properly. Notespecial cases such as
`code