refactor: use parser restarting API instead of whitespace padding by ejgallego · Pull Request #750 · leanprover/verso

ejgallego · 2026-02-09T16:39:21Z

This is possible after leanprover/lean4#10043
improved upstream parsing API.

This replaces the current use of parserInputString, which pads the input, and seems clearer overall.

There are two main cases for string literals for us to process:

string literals coming from Verso: in this case, .getPos and
.getTailPos do provide the correct start / end of the string,
without the quotes.
string literals coming from Lean: in this case, we must account for
the quotes and fixup the positions.

We have handled this by convention, but VersoUtils.parseString takes
an optional parameter as it is used in both modes. Eventually we'd
like to move Verso-style strings to its own type instead of StrLit.

Notes:

I couldn't port VersoBlog.leanInit as Parser.parseHeader always
parses from pos := 0. This is the last blocker to completely
remove parserInputString.
In general, it seems like most functions using mkParserState
upstream could benefit from an update to take positions.
Tests in UsersGuide.Markup had to be adapted due to use of
contents.getString.trimAsciiEnd.copy. I did this to pass CI, must
implement a better fix before merging.
Code that depends on MD4Lean hasn't been ported as MD4 doesn't seem
to provide the right API for us.
TODO: we should test canonical := true and quoting properly. Note
special cases such as `code

david-christiansen · 2026-02-10T10:41:05Z

I couldn't port VersoBlog.leanInit as Parser.parseHeader always
parses from pos := 0. This is the last blocker to completely
remove parserInputString.

I think for this one we don't have to care as much, because the output of parseHeader is never rendered for users and won't have interesting things in the infoview. As long as we can translate error message line numbers, I think it's OK to just parse it without extra fancy things. We can also look into deprecating the functionality, it's not as useful as we initially thought it might be.

In general, it seems like most functions using mkParserState
upstream could benefit from an update to take positions.

I think this would be a good PR to upstream.

david-christiansen · 2026-02-10T10:43:48Z

src/verso/Verso/SyntaxUtils.lean

+/-- Compute a parsing starting position and `InputContext` from an
+   embedded string literal. This is often used to call Lean's parser
+   re-entranly. **Precondition**: the string literal must appear in the
+   source, otherwise the function may panic. -/


Suggested change

/-- Compute a parsing starting position and `InputContext` from an

embedded string literal. This is often used to call Lean's parser

re-entranly. **Precondition**: the string literal must appear in the

source, otherwise the function may panic. -/

/--

Compute a parsing starting position and `InputContext` from an

embedded string literal. This is often used to call Lean's parser

re-entranly. **Precondition**: the string literal must appear in the

source, otherwise the function may panic.

-/

See https://github.com/leanprover/lean4/blob/master/doc/style.md#layout-and-indentation

david-christiansen · 2026-02-10T10:51:08Z

src/verso/Verso/SyntaxUtils.lean

+public def inputContextFromStrLit [Monad m] [MonadLog m] [MonadFileMap m] (str : StrLit) (versoStyle : Bool := true) (fileName : Option String := none) : m (String.Pos.Raw × InputContext) := do
+  -- dbg_trace "{repr str}"
+  let filename ← fileName.getDM getFileName
+  let source := (← getFileMap).source
+  let some (pos, endPos) := str.innerPos? versoStyle
+    -- XXX: replace by elaborator exception (throwErrorAt)
+    -- XXX: Gonna fail when users write a bad macro
+    | panic "invalid string literal on parser resumption (inputContextFromStrLit)"
+  if endPos_valid : endPos ≤ source.rawEndPos then
+    let iCtx := mkInputContext source filename (endPos := endPos) (endPos_valid := endPos_valid)
+    return (pos, iCtx)
+  else
+    panic "invalid source code slice on parser resumption, slice goes out of bounds"


I really think that there should be two separate functions here, because the current one is a footgun when versoStyle := false.

There's two use cases here:

The Verso parser for something like a code block constructed a string literal node internally for the contents of the code block. The atom contained within it has been passed through String.quote to add quotes around it and escape things inside it, because this allows Verso elaborators to put it into a Term easily, e.g. in a context like `(Doc.Inline.code $s). In this case, when we want to do further parsing of the contents, we should ignore the atom contained inside the StrLit and instead parse its span in the original source code, because this further parsing shouldn't be undoing escapes and the like.

The user wrote some syntax in a Lean string literal that we'd like to parse with a metaprogram. The user's code already has quotes around it and may contain escapes. Parsing this is a bit dodgy - the user could have written \u0020 instead of , for example. We do it in a few places where it would be better to have dedicated syntax, but $REASONS (we can get into it at some point). In this case, we want to decode the atom contained within the string literal node into the string that it denotes, parse that, and then do our best to report errors that may have occurred in the context of the original string literal. This could be done right by having a pass to compute a mapping from string literal source positions to decoded string positions, then go over the message log and syntax objects to back-patch the positions, but we don't do that right now.

So in these two cases, the source positions are used fundamentally differently, and the string contents are used fundamentally differently. I think it really needs to be two separate functions in the API.

This is a step towards removing `parserInputString`. Upstream `runParserCategory` always starts at `pos := 0` Another option would be to update the upstream function to take this into account, but the idea is that the function will evolve to take a string literal so it should be able to compute the position by itself. I made the types different (the Verso one is monadic) as to help avoid confusion, and increase programming ergonomics. Note that we don't change the parts `in VersoManual.Docstring` as these are code strings that come from the `MD4Lean` and they don't have an attached location.

This is possible after leanprover/lean4#10043 improved upstream parsing API. This replaces the current use of `parserInputString`, which pads the input, and seems clearer overall. There are two main cases for string literals for us to process: - string literals coming from Verso: in this case, `.getPos` and `.getTailPos` do provide the correct start / end of the string, without the quotes. - string literals coming from Lean: in this case, we must account for the quotes and fixup the positions. We have handled this by convention, but `VersoUtils.parseString` takes an optional parameter as it is used in both modes. Eventually we'd like to move Verso-style strings to its own type instead of `StrLit`. Notes: - I couldn't port `VersoBlog.leanInit` as `Parser.parseHeader` always parses from `pos := 0`. This is the last blocker to completely remove `parserInputString`. - In general, it seems like most functions using `mkParserState` upstream could benefit from an update to take positions. - Tests in `UsersGuide.Markup` had to be adapted due to use of `contents.getString.trimAsciiEnd.copy`. I did this to pass CI, must implement a better fix before merging. - Code that depends on MD4Lean hasn't been ported as MD4 doesn't seem to provide the right API for us. - TODO: we should test `canonical := true` and quoting properly. Note special cases such as `` `code ``

Only `Command.State` is needed to thread several example blocks. Note that before we used the parsing state resulting from the last code block, so we did resume from an offset that could be potentially much before the current code block. This worked as the string in that case was empty (produced by `parserInputString`), but would have failed once we have reuse of the input source string.

github-actions · 2026-02-20T12:33:56Z

Preview for this PR is ready! 🎉

ejgallego changed the title ~~refactor: use parser restarting API instead of whitespace hacks~~ refactor: use parser restarting API instead of whitespace padding Feb 9, 2026

ejgallego force-pushed the parse_start branch from 698b0d5 to fa6ff26 Compare February 9, 2026 16:41

david-christiansen reviewed Feb 10, 2026

View reviewed changes

ejgallego added 4 commits February 20, 2026 13:23

fix for VersoManual.imports

c32416e

ejgallego force-pushed the parse_start branch from 20de2a0 to c32416e Compare February 20, 2026 12:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: use parser restarting API instead of whitespace padding#750

refactor: use parser restarting API instead of whitespace padding#750
ejgallego wants to merge 4 commits intoleanprover:mainfrom
ejgallego:parse_start

ejgallego commented Feb 9, 2026 •

edited

Loading

Uh oh!

david-christiansen commented Feb 10, 2026

Uh oh!

david-christiansen Feb 10, 2026

Uh oh!

david-christiansen Feb 10, 2026

Uh oh!

github-actions bot commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ejgallego commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

david-christiansen commented Feb 10, 2026

Uh oh!

david-christiansen Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

david-christiansen Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ejgallego commented Feb 9, 2026 •

edited

Loading