Skip to content

perf: fuse fvar substitution into instantiateMVars#12233

Draft
nomeata wants to merge 47 commits intomasterfrom
joachim/instantiateMVarsNoUpdate
Draft

perf: fuse fvar substitution into instantiateMVars#12233
nomeata wants to merge 47 commits intomasterfrom
joachim/instantiateMVarsNoUpdate

Conversation

@nomeata
Copy link
Collaborator

@nomeata nomeata commented Jan 29, 2026

This PR replaces the default instantiateMVars implementation with a two-pass variant that fuses fvar substitution into the traversal, avoiding separate replace_fvars calls for delayed-assigned MVars and preserving sharing.

Terminology (used in this PR and in the source)

  • Direct MVar: an MVar that is not delayed-assigned.
  • Pending MVar: the direct MVar stored in a DelayedMetavarAssignment.
  • Assigned MVar: a direct MVar with an assignment, or a delayed-assigned MVar with an assigned pending MVar.
  • MVar DAG: the directed acyclic graph of MVars reachable from the expression.
  • Resolvable MVar: an MVar where all MVars reachable from it (including itself) are assigned.
  • Updateable MVar: an assigned direct MVar, or a delayed-assigned MVar that is resolvable but not reachable from any other resolvable delayed-assigned MVar.

In the MVar DAG, the updateable delayed-assigned MVars form a cut (the updateable-MVar cut) with only assigned MVars behind it and no resolvable delayed-assigned MVars before it.

Two-pass architecture

Pass 1 (instantiate_direct_fn): Traverses all MVars and expressions reachable from the initial expression and instantiates all updateable direct MVars (updating their assignment with the result), instantiates all level MVars, and determines if there are any updateable delayed-assigned MVars.

Pass 2 (instantiate_delayed_fn): Only run if pass 1 found updateable delayed-assigned MVars. Has an outer and an inner mode, depending on whether it has crossed the updateable-MVar cut.

In outer mode (empty fvar substitution), all MVars are either unassigned direct MVars (left alone), non-updateable delayed-assigned MVars (pending MVar traversed in outer mode and updated with the result), or updateable delayed-assigned MVars. When a delayed-assigned MVar is encountered, its MVar DAG is explored (via is_resolvable_pending) to determine if it is resolvable (and thus updateable). Results are cached across invocations.

If it is updateable, the substitution is initialized from its arguments and traversal continues with the value of its pending MVar in inner mode. In inner mode (non-empty substitution), all encountered delayed-assigned MVars are, by construction, resolvable but not updateable. The substitution is carried along and extended as we cross such MVars. Pending MVars of these delayed-assigned MVars are NOT updated with the result (as the result is valid only for this substitution, not in general).

Applying the substitution in one go, rather than instantiating each delayed-assigned MVar on its own from inside out, avoids the quadratic overhead of that approach when there are long chains of delayed-assigned MVars.

Write-back behavior: Pass 2 writes back the normalized pending MVar values of delayed-assigned MVars above the updateable-MVar cut (the non-resolvable ones whose children may have been resolved). This is exactly the right set: these MVars are visited in outer mode, so their normalized values are suitable for storing in the mctx. MVars below the cut are visited in inner mode, so their intermediate values cannot be written back.

Pass 2 scope-tracked caching

A scope_cache data structure ensures that sharing is preserved even across different delayed-assigned MVars (and hence with different substitutions), when possible. Each visit_delayed call pushes a new scope with fresh fvar bindings. The cache correctly handles cross-scope reuse, fvar shadowing, and late-binding via generation counters and scope-level tracking.

Behavioral differences from original instantiateMVars

The implementation matches the original single-pass instantiateMVars behavior with one cosmetic difference: the new implementation substitutes fvars inline during traversal rather than constructing intermediate beta-redexes, producing more beta-reduced terms in some edge cases. This changes the pretty-printed output for two elab tests (1179b, depElim1) but all terms remain definitionally equal.

Files

  • src/kernel/instantiate_mvars_all.cpp — C++ implementation of both passes
  • src/kernel/scope_cache.h — scope-aware cache data structure
  • src/kernel/instantiate_mvars.cpp — redirects lean_instantiate_expr_mvars to the new variant
  • src/Lean/MetavarContext.lean — exported accessors for DelayedMetavarAssignment fields
  • tests/lean/run/instantiateAllMVarsSharing.lean — sharing preservation test
  • tests/lean/run/instantiateAllMVarsShadow.lean — fvar shadowing and late-binding correctness tests
  • tests/elab/1179b.lean.out.expected, tests/elab/depElim1.lean.out.expected — updated expected output

@nomeata nomeata added the changelog-language Language features and metaprograms label Jan 29, 2026
@nomeata
Copy link
Collaborator Author

nomeata commented Jan 29, 2026

!radar

@leanprover-radar
Copy link

leanprover-radar commented Jan 29, 2026

Benchmark results for c83f527 against 9e18eea are in! @nomeata

  • 🟥 build//instructions: +4.7G (+0.0%)

Large changes (1✅, 2🟥)

  • 🟥 big_struct//instructions: +3.0G (+15.0%)
  • big_struct_dep//instructions: -3.8G (-18.4%)
  • 🟥 big_struct_dep1//instructions: +9.3G (+22.7%)

Small changes (7🟥)

  • 🟥 build/module/Init.Grind.Config//instructions: +53.5M (+2.8%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Init.MetaTypes//instructions: +35.6M (+1.6%)
  • 🟥 build/module/Lake.Config.Package//instructions: +12.8M (+0.5%)
  • 🟥 build/module/Lake.Config.PackageConfig//instructions: +36.0M (+0.6%)
  • 🟥 build/module/Lean.Meta.Tactic.Grind.Arith.CommRing.Types//instructions: +25.2M (+0.9%)
  • 🟥 build/module/Lean.Meta.Tactic.Grind.Arith.Cutsat.ToIntInfo//instructions: +16.3M (+0.9%)
  • 🟥 build/module/Lean.Meta.Tactic.Grind.Arith.Linear.Types//instructions: +80.9M (+1.5%)

@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Jan 29, 2026
@leanprover-bot
Copy link
Collaborator

leanprover-bot commented Jan 29, 2026

Reference manual CI status:

  • ❗ Reference manual CI can not be attempted yet, as the nightly-testing-2026-01-28 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-manual, reference manual CI should run now. You can force reference manual CI using the force-manual-ci label. (2026-01-29 17:27:26)
  • ❗ Reference manual CI can not be attempted yet, as the nightly-testing-2026-02-25 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-manual, reference manual CI should run now. You can force reference manual CI using the force-manual-ci label. (2026-02-28 00:29:59)

ghost pushed a commit to leanprover-community/batteries that referenced this pull request Jan 29, 2026
@github-actions github-actions bot added the mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN label Jan 29, 2026
ghost pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Jan 29, 2026
@ghost ghost added the builds-mathlib CI has verified that Mathlib builds against this PR label Jan 29, 2026
@ghost
Copy link

ghost commented Jan 29, 2026

Mathlib CI status (docs):

nomeata and others added 2 commits February 26, 2026 23:23
…date

This adds an experimental `instantiateMVarsNoUpdate2` variant (inline in the
benchmark) that keeps its expression cache across binder depths using
bvar-blind keying, rather than clearing the cache at each depth.

Bench2 shows it avoids the O(depth * bodySize) scaling of the old approach
when the same closed sub-expression appears at many depths (crossover ~depth=50).
However, the ~6ms constant per-node overhead makes it 3-5x slower on the
bench1 workload (the realistic delayed-assignment case), where both variants
are already sub-millisecond. Not worth adopting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR adds a C++ implementation of `instantiateMVarsNoUpdate`, following the
pattern of the existing `instantiate_mvars.cpp`. Key features:

- FVar substitution via hash map instead of `replace_fvars`, avoiding a
  separate traversal pass for delayed assignments
- Depth tracking with lazy lifting of substitution values
- Two-level cache: a depth-dependent cache (cleared under binders) and a
  global cache for fvar-free expressions that persists across binder depths
- When the fvar substitution is empty, updates mvar assignments with their
  normalized values so subsequent `instantiateMVars` calls are free

Benchmark (delayed-assignment workload):
- C++ NoUpdate is ~2x faster than the Lean NoUpdate implementation
- C++ NoUpdate is 10-26x faster than standard instantiateMVars
- The global cache eliminates O(depth) scaling for fvar-free expressions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nomeata nomeata force-pushed the joachim/instantiateMVarsNoUpdate branch from 267f6b4 to 860b8fd Compare February 27, 2026 10:50
This PR makes instantiateMVarsNoUpdate the default implementation behind
lean_instantiate_expr_mvars. The NoUpdate variant carries a free variable
substitution during traversal instead of doing repeated replace_fvars,
avoiding quadratic behavior in terms with many delayed-assigned metavariables.

Key fixes for drop-in compatibility:
- Use name_hash_map for fvar substitution (structural name equality, not
  pointer comparison)
- Always call apply_beta with preserve_data=false for regular mvar
  assignments to strip mdata wrappers like _recApp
- Clear depth-dependent cache when extending fvar substitution in
  visit_delayed
- Handle unassigned mvars with active fvar substitution by creating
  delayed assignments via elimMVarDeps
- Match standard instantiateMVars guards for delayed assignment resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nomeata
Copy link
Collaborator Author

nomeata commented Feb 27, 2026

!radar

@leanprover-radar
Copy link

leanprover-radar commented Feb 27, 2026

Benchmark results for c5d6267 against 9e18eea are in! @nomeata

Warning

These warnings may indicate that the benchmark results are not directly comparable, for example due to changes in the runner configuration or hardware.

  • Bench repo commit hashes for run build differ between commits.
  • Bench repo commit hashes for run other differ between commits.
  • 🟥 build//instructions: +87.5G (+0.67%)

Large changes (7🟥)

  • 🟥 build/module/Std.Data.Iterators.Lemmas.Combinators.Monadic.FilterMap//instructions: +5.1G (+88.93%)
  • 🟥 build/module/Std.Tactic.BVDecide.Bitblast.BVExpr.Circuit.Lemmas.Expr//instructions: +8.6G (+6.71%)
  • 🟥 elab/big_struct//instructions: +10.7G (+53.17%)
  • 🟥 elab/big_struct_dep//instructions: +4.1G (+19.94%)
  • 🟥 elab/big_struct_dep1//instructions: +20.1G (+48.89%)
  • 🟥 elab/grind_bitvec2//instructions: +3.4G (+1.54%)
  • 🟥 elab/omega_stress//instructions: +259.9M (+5.04%)

Medium changes (19🟥)

  • 🟥 build/module/Init.Data.Iterators.Lemmas.Combinators.Monadic.FilterMap//instructions: +4.6G (+9.62%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Init.Data.Range.Polymorphic.RangeIterator//instructions: +3.2G (+13.85%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Init.Data.String.Decode//instructions: +1.1G (+3.20%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Std.Data.DTreeMap.Internal.Balancing//instructions: +2.1G (+2.61%)
  • 🟥 build/module/Std.Data.DTreeMap.Internal.Model//instructions: +1.6G (+2.24%)
  • 🟥 build/module/Std.Data.DTreeMap.Internal.Operations//instructions: +1.5G (+2.96%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Std.Data.DTreeMap.Internal.WF.Lemmas//instructions: +1.4G (+2.93%)
  • 🟥 build/module/Std.Data.Internal.List.Associative//instructions: +1.1G (+1.21%)
  • 🟥 build/module/Std.Data.Iterators.Lemmas.Combinators.Zip//instructions: +4.2G (+38.48%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Std.Tactic.BVDecide.Bitblast.BVExpr.Circuit.Impl.Expr//instructions: +1.8G (+2.43%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Std.Tactic.BVDecide.LRAT.Internal.Formula.RupAddResult//instructions: +2.1G (+3.77%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Std.Tactic.BVDecide.LRAT.Internal.Formula.RupAddSound//instructions: +1.3G (+6.14%) (reduced significance based on absolute threshold)
  • 🟥 elab/big_match_partial//instructions: +137.2M (+0.76%)
  • 🟥 elab/bv_decide_rewriter//instructions: +91.0M (+0.77%)
  • 🟥 elab/grind_list2//instructions: +563.6M (+0.92%)
  • 🟥 elab/mut_rec_wf//instructions: +418.6M (+1.36%)
  • 🟥 misc/import Init.Data.BitVec.Lemmas//instructions: +819.4M (+0.56%)
  • 🟥 misc/import Init.Data.List.Sublist//instructions: +191.2M (+1.66%)
  • 🟥 misc/import Std.Data.Internal.List.Associative//instructions: +1.1G (+1.38%)

Small changes (1✅, 212🟥)

Too many entries to display here. View the full report on radar instead.

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 27, 2026

!radar

@leanprover-radar
Copy link

leanprover-radar commented Feb 27, 2026

Benchmark results for ee068d5 against 87ec768 are in! @nomeata

  • 🟥 build exited with code -1
  • 🟥 other exited with code -1

No significant changes detected.

@nomeata nomeata force-pushed the joachim/instantiateMVarsNoUpdate branch from ee068d5 to ae9230f Compare February 27, 2026 23:17
@nomeata
Copy link
Collaborator Author

nomeata commented Feb 27, 2026

!radar

@leanprover-radar
Copy link

leanprover-radar commented Feb 27, 2026

Benchmark results for ae9230f against 87ec768 are in! @nomeata

  • 🟥 build//instructions: +95.9G (+0.76%)

Large changes (8🟥)

  • 🟥 build/module/Init.Data.Iterators.Lemmas.Combinators.Monadic.FilterMap//instructions: +5.0G (+10.94%)
  • 🟥 build/module/Init.Data.String.Lemmas.Pattern.String.ForwardSearcher//instructions: +7.5G (+38.64%)
  • 🟥 build/module/Std.Data.Iterators.Lemmas.Combinators.Monadic.FilterMap//instructions: +5.2G (+95.78%)
  • 🟥 build/module/Std.Tactic.BVDecide.Bitblast.BVExpr.Circuit.Lemmas.Expr//instructions: +8.5G (+7.46%)
  • 🟥 elab/big_struct_dep//instructions: +4.1G (+19.96%)
  • 🟥 elab/big_struct_dep1//instructions: +872.4M (+13.71%)
  • 🟥 elab/grind_bitvec2//instructions: +3.5G (+1.72%)
  • 🟥 elab/omega_stress//instructions: +258.0M (+5.69%)

Medium changes (21🟥)

Too many entries to display here. View the full report on radar instead.

Small changes (1✅, 207🟥)

Too many entries to display here. View the full report on radar instead.

@nomeata nomeata changed the title perf: a non-updating single-subst instantiateMVar perf: fuse fvar substitution into instantiateMVars Feb 27, 2026
mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request Feb 28, 2026
@mathlib-lean-pr-testing
Copy link

Mathlib CI status (docs):

Pass 2 of instantiateAllMVars does not handle level metavariables (pass 1
already resolves them), so has_mvar checks are unnecessarily broad. Replace
with has_expr_mvar throughout pass 2.

This makes the global cache redundant: expressions with no fvars and no
expr mvars are returned unchanged by the early-return check, so there is
nothing to cache. Remove m_global_cache and simplify the visit logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leanprover leanprover deleted a comment from leanprover-radar Mar 5, 2026
@nomeata
Copy link
Collaborator Author

nomeata commented Mar 5, 2026

!bench

@leanprover-radar
Copy link

Benchmark results for e2bbd74 against cda8470 are in! @nomeata

Warning

These warnings may indicate that the benchmark results are not directly comparable, for example due to changes in the runner configuration or hardware.

  • Bench repo commit hashes for run build differ between commits.
  • Bench repo commit hashes for run other differ between commits.
  • 🟥 build//instructions: +3.8G (+0.03%)

Large changes (3✅)

  • elab/big_struct//instructions: -192.7M (-6.55%)
  • elab/big_struct_dep//instructions: -6.0G (-29.32%)
  • elab/omega_stress//instructions: -210.9M (-4.65%)

Medium changes (3✅, 3🟥)

  • build/module/Std.Data.Iterators.Lemmas.Combinators.Zip//instructions: -1.2G (-10.78%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Std.Tactic.BVDecide.Bitblast.BVExpr.Circuit.Lemmas.Expr//instructions: +1.5G (+1.28%)
  • elab/big_struct_dep1//instructions: -178.8M (-2.82%)
  • 🟥 elab/grind_bitvec2//instructions: +1.5G (+0.73%)
  • 🟥 elab/grind_list2//instructions: +272.1M (+0.47%)
  • misc/import Init.Data.List.Sublist//instructions: -209.7M (-1.88%)

Small changes (64✅, 39🟥)

Too many entries to display here. View the full report on radar instead.

nomeata and others added 3 commits March 5, 2026 16:59
…andling

In pass 2, direct mvar assignments have already been resolved by pass 1.
Assert this invariant in visit_app and for bare mvars instead of handling
the impossible case. Inline the pending value lookup in visit_delayed
instead of going through visit(mk_mvar(mid_pending)). Remove the now
unnecessary visit_mvar and visit_args_and_beta methods from pass 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This moves the resolvability checking from a separate pre-computation
phase into pass 2 itself, where it is performed on-demand when each
delayed mvar is first encountered. Caches are kept across invocations
for efficiency. This eliminates the `resolvability_checker` class and
the `head_to_pending` map from pass 1.

Pass 1 now only tracks a `m_has_delayed` bool, and only sets it when
a delayed mvar's pending value is actually assigned (unassigned pending
values will clearly fail the resolvability check, so pass 2 can be
skipped entirely in that case).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nomeata
Copy link
Collaborator Author

nomeata commented Mar 5, 2026

!radar

@leanprover-radar
Copy link

leanprover-radar commented Mar 5, 2026

Benchmark results for a61d5b6 against cda8470 are in! @nomeata

Warning

These warnings may indicate that the benchmark results are not directly comparable, for example due to changes in the runner configuration or hardware.

  • Bench repo commit hashes for run build differ between commits.
  • Bench repo commit hashes for run other differ between commits.
  • 🟥 build//instructions: +642.8M (+0.01%)

Large changes (4✅)

  • elab/big_struct//instructions: -191.9M (-6.53%)
  • elab/big_struct_dep//instructions: -6.0G (-29.30%)
  • elab/omega_stress//instructions: -215.7M (-4.76%)
  • misc/import Init.Data.List.Sublist//instructions: -212.9M (-1.91%)

Medium changes (2✅, 3🟥)

  • build/module/Std.Data.Iterators.Lemmas.Combinators.Zip//instructions: -1.2G (-10.88%) (reduced significance based on absolute threshold)
  • 🟥 build/module/Std.Tactic.BVDecide.Bitblast.BVExpr.Circuit.Lemmas.Expr//instructions: +1.4G (+1.26%)
  • elab/big_struct_dep1//instructions: -182.1M (-2.87%)
  • 🟥 elab/grind_bitvec2//instructions: +1.4G (+0.70%)
  • 🟥 elab/grind_list2//instructions: +249.5M (+0.43%)

Small changes (72✅, 28🟥)

Too many entries to display here. View the full report on radar instead.

nomeata and others added 6 commits March 5, 2026 17:43
This exports `DelayedMetavarAssignment.fvars` and
`DelayedMetavarAssignment.mvarIdPending` as C functions and uses them
in both `instantiate_mvars.cpp` and `instantiate_mvars_all.cpp`,
replacing raw `cnstr_get` access. Also simplifies `get_assignment`
in pass 2 by removing redundant `has_expr_mvar`/`has_fvar` checks
that `visit` already performs. Pass 1 now only sets `m_has_delayed`
when the pending mvar is actually assigned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Inside a resolvable subtree, all nested delayed mvars are also
resolvable, so the non-resolvable branch is only reachable in outer
mode (empty fvar substitution).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request Mar 5, 2026
mathlib-nightly-testing bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Mar 5, 2026
leanprover-bot added a commit to leanprover/reference-manual that referenced this pull request Mar 5, 2026
@mathlib-lean-pr-testing
Copy link

Mathlib CI status (docs):

This commit introduces consistent terminology throughout
instantiate_mvars_all.cpp: direct MVar, pending MVar, assigned MVar,
MVar DAG, resolvable MVar, updateable MVar, outer/inner mode, and
the updateable-MVar cut. Renames fvar_subst_empty() to in_outer_mode()
and m_has_delayed to m_has_updateable_delayed for alignment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request Mar 5, 2026
mathlib-nightly-testing bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Mar 5, 2026
leanprover-bot added a commit to leanprover/reference-manual that referenced this pull request Mar 5, 2026
@mathlib-lean-pr-testing
Copy link

Mathlib CI status (docs):

This commit replaces the scope_cache comment with a clearer
description of the conceptual model (stack of hashmaps) and how the
implementation inverts this for efficiency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request Mar 6, 2026
mathlib-nightly-testing bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Mar 6, 2026
leanprover-bot added a commit to leanprover/reference-manual that referenced this pull request Mar 6, 2026
@mathlib-lean-pr-testing
Copy link

Mathlib CI status (docs):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

builds-mathlib CI has verified that Mathlib builds against this PR changelog-language Language features and metaprograms mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants