Skip to content

refactor: use parser restarting API instead of whitespace padding#750

Draft
ejgallego wants to merge 4 commits intoleanprover:mainfrom
ejgallego:parse_start
Draft

refactor: use parser restarting API instead of whitespace padding#750
ejgallego wants to merge 4 commits intoleanprover:mainfrom
ejgallego:parse_start

Conversation

@ejgallego
Copy link
Contributor

@ejgallego ejgallego commented Feb 9, 2026

This is possible after leanprover/lean4#10043
improved upstream parsing API.

This replaces the current use of parserInputString, which pads the input, and seems clearer overall.

There are two main cases for string literals for us to process:

  • string literals coming from Verso: in this case, .getPos and
    .getTailPos do provide the correct start / end of the string,
    without the quotes.

  • string literals coming from Lean: in this case, we must account for
    the quotes and fixup the positions.

We have handled this by convention, but VersoUtils.parseString takes
an optional parameter as it is used in both modes. Eventually we'd
like to move Verso-style strings to its own type instead of StrLit.

Notes:

  • I couldn't port VersoBlog.leanInit as Parser.parseHeader always
    parses from pos := 0. This is the last blocker to completely
    remove parserInputString.

  • In general, it seems like most functions using mkParserState
    upstream could benefit from an update to take positions.

  • Tests in UsersGuide.Markup had to be adapted due to use of
    contents.getString.trimAsciiEnd.copy. I did this to pass CI, must
    implement a better fix before merging.

  • Code that depends on MD4Lean hasn't been ported as MD4 doesn't seem
    to provide the right API for us.

  • TODO: we should test canonical := true and quoting properly. Note
    special cases such as `code

@ejgallego ejgallego changed the title refactor: use parser restarting API instead of whitespace hacks refactor: use parser restarting API instead of whitespace padding Feb 9, 2026
@david-christiansen
Copy link
Collaborator

  • I couldn't port VersoBlog.leanInit as Parser.parseHeader always
    parses from pos := 0. This is the last blocker to completely
    remove parserInputString.

I think for this one we don't have to care as much, because the output of parseHeader is never rendered for users and won't have interesting things in the infoview. As long as we can translate error message line numbers, I think it's OK to just parse it without extra fancy things. We can also look into deprecating the functionality, it's not as useful as we initially thought it might be.

  • In general, it seems like most functions using mkParserState
    upstream could benefit from an update to take positions.

I think this would be a good PR to upstream.

Comment on lines +154 to +157
/-- Compute a parsing starting position and `InputContext` from an
embedded string literal. This is often used to call Lean's parser
re-entranly. **Precondition**: the string literal must appear in the
source, otherwise the function may panic. -/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/-- Compute a parsing starting position and `InputContext` from an
embedded string literal. This is often used to call Lean's parser
re-entranly. **Precondition**: the string literal must appear in the
source, otherwise the function may panic. -/
/--
Compute a parsing starting position and `InputContext` from an
embedded string literal. This is often used to call Lean's parser
re-entranly. **Precondition**: the string literal must appear in the
source, otherwise the function may panic.
-/

See https://github.com/leanprover/lean4/blob/master/doc/style.md#layout-and-indentation

Comment on lines +158 to +170
public def inputContextFromStrLit [Monad m] [MonadLog m] [MonadFileMap m] (str : StrLit) (versoStyle : Bool := true) (fileName : Option String := none) : m (String.Pos.Raw × InputContext) := do
-- dbg_trace "{repr str}"
let filename ← fileName.getDM getFileName
let source := (← getFileMap).source
let some (pos, endPos) := str.innerPos? versoStyle
-- XXX: replace by elaborator exception (throwErrorAt)
-- XXX: Gonna fail when users write a bad macro
| panic "invalid string literal on parser resumption (inputContextFromStrLit)"
if endPos_valid : endPos ≤ source.rawEndPos then
let iCtx := mkInputContext source filename (endPos := endPos) (endPos_valid := endPos_valid)
return (pos, iCtx)
else
panic "invalid source code slice on parser resumption, slice goes out of bounds"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think that there should be two separate functions here, because the current one is a footgun when versoStyle := false.

There's two use cases here:

  1. The Verso parser for something like a code block constructed a string literal node internally for the contents of the code block. The atom contained within it has been passed through String.quote to add quotes around it and escape things inside it, because this allows Verso elaborators to put it into a Term easily, e.g. in a context like `(Doc.Inline.code $s). In this case, when we want to do further parsing of the contents, we should ignore the atom contained inside the StrLit and instead parse its span in the original source code, because this further parsing shouldn't be undoing escapes and the like.
  2. The user wrote some syntax in a Lean string literal that we'd like to parse with a metaprogram. The user's code already has quotes around it and may contain escapes. Parsing this is a bit dodgy - the user could have written \u0020 instead of , for example. We do it in a few places where it would be better to have dedicated syntax, but $REASONS (we can get into it at some point). In this case, we want to decode the atom contained within the string literal node into the string that it denotes, parse that, and then do our best to report errors that may have occurred in the context of the original string literal. This could be done right by having a pass to compute a mapping from string literal source positions to decoded string positions, then go over the message log and syntax objects to back-patch the positions, but we don't do that right now.

So in these two cases, the source positions are used fundamentally differently, and the string contents are used fundamentally differently. I think it really needs to be two separate functions in the API.

This is a step towards removing `parserInputString`. Upstream
`runParserCategory` always starts at `pos := 0`

Another option would be to update the upstream function to take this
into account, but the idea is that the function will evolve to take a
string literal so it should be able to compute the position by itself.

I made the types different (the Verso one is monadic) as to help avoid
confusion, and increase programming ergonomics.

Note that we don't change the parts `in VersoManual.Docstring` as
these are code strings that come from the `MD4Lean` and they don't
have an attached location.
This is possible after leanprover/lean4#10043
improved upstream parsing API.

This replaces the current use of `parserInputString`, which pads the
input, and seems clearer overall.

There are two main cases for string literals for us to process:

- string literals coming from Verso: in this case, `.getPos` and
  `.getTailPos` do provide the correct start / end of the string,
  without the quotes.

- string literals coming from Lean: in this case, we must account for
  the quotes and fixup the positions.

We have handled this by convention, but `VersoUtils.parseString` takes
an optional parameter as it is used in both modes. Eventually we'd
like to move Verso-style strings to its own type instead of `StrLit`.

Notes:

- I couldn't port `VersoBlog.leanInit` as `Parser.parseHeader` always
  parses from `pos := 0`. This is the last blocker to completely
  remove `parserInputString`.

- In general, it seems like most functions using `mkParserState`
  upstream could benefit from an update to take positions.

- Tests in `UsersGuide.Markup` had to be adapted due to use of
  `contents.getString.trimAsciiEnd.copy`. I did this to pass CI, must
  implement a better fix before merging.

- Code that depends on MD4Lean hasn't been ported as MD4 doesn't seem
  to provide the right API for us.

- TODO: we should test `canonical := true` and quoting properly. Note
  special cases such as `` `code ``
Only `Command.State` is needed to thread several example blocks.

Note that before we used the parsing state resulting from the last
code block, so we did resume from an offset that could be potentially
much before the current code block.

This worked as the string in that case was empty (produced by
`parserInputString`), but would have failed once we have reuse of the
input source string.
@github-actions
Copy link
Contributor

Preview for this PR is ready! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants