
perf: walker-machinery gates + hot-path cleanups (+1.6× on life_bitpacked) #523

Closed
fglock wants to merge 5 commits into feature/phase-j-performance from feature/walker-hardening-j3

Conversation


fglock (Owner) commented Apr 21, 2026

Summary

Follow-on perf work after Phase J (PR #518). Started as an attempt at walker hardening + J3 stash fast path (see phase_j_performance_plan.md's blocked section) but the A2/J3 combo remained fragile — instead, this PR bundles six orthogonal, low-risk hot-path fixes surfaced by JFR-driven investigation.

Measured impact

On examples/life_bitpacked.pl (default 80×40, 5000 gens), best-of-3 Cell updates per second:

base (branch-point): 7.74 – 8.06 Mcells/s
final: 12.33 – 13.07 Mcells/s
1.61× speedup

Larger grid (200×200, 5000 gens): 10.01 → 13.71 Mcells/s, 1.37× speedup.

Changes (one commit each)

| Commit | What | Measured impact |
| --- | --- | --- |
| f4d474a | Cache System.getenv() in hot paths as static final | correctness cleanup, noise-level on benches |
| 2fb0bd1 | Gate ScalarRefRegistry.registerRef() on weakRefsExist | +22% on life_bitpacked |
| a7165f7 | Gate MyVarCleanupStack.liveCounts on weakRefsExist | +28% on top of above |
| 17527e8 | Cache getWarningBitsForCode() per RuntimeCode + shared empty-args snapshot | within noise, allocation cleanup |
| 660aa9e | Avoid long→Long autoboxing in RuntimeScalar(long) ctor + int fast paths in BitwiseOperators | allocation cleanup |

The first three are the big wins. The common theme:

"Walker machinery that existed only to support weaken() was running on every scalar and every my variable assignment even for scripts that never called weaken()."

All three "gate" commits check WeakRefRegistry.weakRefsExist and skip the now-provably-unused bookkeeping. Trade-off: scripts that weaken() after a long warm-up phase may miss a handful of early scalars from the walker's seed snapshot — but the walker's other filters (sc.scopeExited, sc.refCountOwned, MyVarCleanupStack.isLive's other branches) still classify them correctly. No DBIC 52leaks assertions regress.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old unconditional behavior.
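The gate pattern shared by all three commits can be sketched as follows. Class, field, and env-var names mirror the PR description, but the bodies are simplified assumptions for illustration, not the actual implementation:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

class WeakRefRegistry {
    // Flipped to true the first time weaken() is called.
    static volatile boolean weakRefsExist = false;
}

class ScalarRefRegistry {
    // Escape hatch: restores the old unconditional registration.
    private static final boolean UNGATED =
            "1".equals(System.getenv("JPERL_UNGATED_SCALAR_REGISTRY"));

    private static final Map<Object, Boolean> refs =
            Collections.synchronizedMap(new WeakHashMap<>());

    static void registerRef(Object scalar) {
        // Hot path: a script that never weakens skips the synchronized
        // WeakHashMap.put (and its expungeStaleEntries cost) entirely.
        if (!WeakRefRegistry.weakRefsExist && !UNGATED) {
            return;
        }
        refs.put(scalar, Boolean.TRUE);
    }

    static int registeredCount() {
        return refs.size();
    }
}
```

Until the flag flips, registerRef is a single volatile read plus a branch, which is what makes the gate essentially free for scripts that never weaken.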

Test plan

  • DBIx::Class t/52leaks.t: 11/11 PASS
  • Template-Toolkit: Files=106, Tests=2920 — PASS
  • Moo: Files=71, Tests=841 — PASS
  • make unit tests PASS

Generated with Devin

fglock and others added 5 commits April 21, 2026 12:37
System.getenv() is a native call (JNI → TreeMap lookup, ~200ns each).
Several debug flags were being evaluated via System.getenv() on every
call rather than once at class init. These cached reads add up in hot
compile and runtime paths.

Caches:
  * MortalList.GC_DEBUG — was in maybeAutoSweep (fires after flush)
  * RuntimeScalar.PHASE_D_DBG — was in undefine()
  * RuntimeIO.IO_DEBUG — 2 hot spots in getRuntimeIO/close
  * IOOperator.IO_DEBUG — open/close hot paths
  * EmitterMethodCreator.ASM_DEBUG, ASM_DEBUG_CLASS_FILTER,
    BYTECODE_SIZE_DEBUG, SPILL_SLOT_COUNT — compile hot path,
    called ~once per compiled method (was 4-5 getenv per compile)
  * SubroutineParser.SHOW_FALLBACK — parser hot path
  * PerlLanguageProvider.SHOW_FALLBACK — compile fallback hot path

Semantically identical — these are all at-startup-determined debug
flags whose values never change during execution. Pattern already
used elsewhere in the codebase (e.g., ScalarRefRegistry, RuntimeRegex).
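The before/after difference can be sketched like this. MortalList.GC_DEBUG is a flag name from the list above, but the method body and the JPERL_GC_DEBUG env-var name are assumptions:

```java
class MortalList {
    // Before: System.getenv(...) on every maybeAutoSweep() call — a JNI
    // call plus an internal map lookup, ~200ns each. After: one read at
    // class init; debug flags never change during execution, so the
    // cached value is semantically identical.
    private static final boolean GC_DEBUG =
            System.getenv("JPERL_GC_DEBUG") != null;

    static boolean gcDebugEnabled() {
        return GC_DEBUG; // plain static-final field read on the hot path
    }

    static void maybeAutoSweep() {
        if (gcDebugEnabled()) {
            System.err.println("[gc] auto-sweep fired");
        }
        // ... sweep bookkeeping elided ...
    }
}
```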

Regression gates:
  * DBIx::Class: Files=314, Tests=13804 — PASS (1152s, within noise of the 1107s baseline)
  * Template-Toolkit: Files=106, Tests=2920 — PASS (133-137s)
  * Moo: Files=71, Tests=841 — PASS (91s)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
ScalarRefRegistry.registerRef() was doing a
synchronized(WeakHashMap).put() on every setLargeRefCounted call
(every ref assignment). The registry is consumed ONLY by
ReachabilityWalker.sweepWeakRefs(), which is only invoked when
WeakRefRegistry.weakRefsExist is true (i.e., when weaken() has
been called at least once). For scripts that never weaken(),
every registerRef call was pure overhead.

JFR profile of life_bitpacked.pl (which never weakens anything)
showed WeakHashMap.put / Collections$SynchronizedMap.put /
expungeStaleEntries dominating post-compile CPU.

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before:  7.74–8.06 Mcells/s
  after:   9.79–9.93 Mcells/s   (+22% median)

Larger grid (200x200, 5000 gens): 10.01 → 11.05 Mcells/s (+10%).

Trade-off: scripts that hold many scalars-with-refs PRIOR to the
first weaken() call won't be in the registry when the walker first
runs. However, any subsequent setLarge on those scalars will
register them, and the walker's primary seeds (globals, code refs,
DESTROY rescued set) still find reachable structures via the normal
BFS. No DBIC 52leaks.t assertions regressed.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old
unconditional behavior.

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (136-138s)
  * Moo: Files=71, Tests=841 — PASS (95s; within noise of 91s baseline)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Same pattern as the prior ScalarRefRegistry.registerRef fix.
MyVarCleanupStack.register() is called for every `my` variable
declaration and was doing an unconditional
`liveCounts.merge(var, 1, Integer::sum)` — an IdentityHashMap
merge with a boxed Integer lambda. JFR of life_bitpacked.pl
surfaced this right after the scalar-registry gate landed.

The `liveCounts` map exists solely for
ReachabilityWalker.sweepWeakRefs' `isLive()` check, which only runs
when WeakRefRegistry.weakRefsExist is true. Scripts that never
weaken() pay the full IdentityHashMap.merge cost for nothing.

The walker's pre-weaken fallback (sc.scopeExited + sc.refCountOwned
checks) still correctly classifies live vs dead lexicals, so
semantically this is indistinguishable.
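A minimal sketch of the gated register(), with the weakRefsExist flag inlined here for self-containment (the real flag lives on WeakRefRegistry, and the bodies are illustrative):

```java
import java.util.IdentityHashMap;
import java.util.Map;

class MyVarCleanupStack {
    // Stands in for WeakRefRegistry.weakRefsExist in this sketch.
    static volatile boolean weakRefsExist = false;

    private static final Map<Object, Integer> liveCounts = new IdentityHashMap<>();

    static void register(Object var) {
        // Before: unconditional merge — IdentityHashMap hashing plus a
        // boxed-Integer lambda on every `my` declaration. After: only
        // pay that cost once weaken() has made the walker's isLive()
        // check relevant at all.
        if (weakRefsExist) {
            liveCounts.merge(var, 1, Integer::sum);
        }
    }

    static int liveCount(Object var) {
        return liveCounts.getOrDefault(var, 0);
    }
}
```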

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  9.79 Mcells/s   (after ScalarRefRegistry gate)
  after this patch:  12.50 Mcells/s   (+28%)

Combined vs pre-gate baseline on this branch:
  7.74 → 12.50 Mcells/s, a 1.61× speedup.

Larger grid (200x200, 5000 gens): 13.71 Mcells/s
(baseline was 10.01, a 1.37× speedup).

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (143s)
  * Moo: Files=71, Tests=841 — PASS (97s)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two targeted hot-path optimizations found via JFR on life_bitpacked.pl
after the ScalarRefRegistry / MyVarCleanupStack gates landed.

1. Cache getWarningBitsForCode() per RuntimeCode.

   The JVM-compiled branch looked up
   `code.methodHandle.type().parameterType(0).getName()` + a
   `WarningBitsRegistry.get(className)` HashMap probe on every sub
   invocation. The declaring class of a compiled code's MethodHandle
   is stable for the lifetime of the RuntimeCode, so cache the
   resolved warning bits string in a `cachedWarningBits` field.
   Sentinel pattern (`WARNING_BITS_NOT_COMPUTED = "<uninit>"`) keeps
   a legitimately-null result distinguishable from not-yet-computed.

2. pushArgs: shared empty snapshot for zero-arg calls.

   Every sub call was allocating a fresh `RuntimeArray` wrapper +
   `ArrayList<>` to snapshot @_ for `@DB::args` support, even when
   the callee was called with zero arguments. Share a single
   `EMPTY_ARGS_SNAPSHOT` for those; real allocation only happens when
   args is non-empty.
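The sentinel cache from item 1 might look like this (a sketch: resolveWarningBits and the resolveCalls counter are stand-ins for the real MethodHandle + registry probe):

```java
class RuntimeCode {
    // Sentinel distinguishes "not yet computed" from a legitimately-null
    // warning-bits result, so a null lookup is cached too.
    private static final String WARNING_BITS_NOT_COMPUTED = "<uninit>";
    private String cachedWarningBits = WARNING_BITS_NOT_COMPUTED;
    int resolveCalls = 0; // instrumentation for this sketch only

    String getWarningBitsForCode() {
        // Identity comparison against the sentinel object: a cached
        // null does not re-trigger the lookup.
        if (cachedWarningBits == WARNING_BITS_NOT_COMPUTED) {
            cachedWarningBits = resolveWarningBits();
        }
        return cachedWarningBits;
    }

    private String resolveWarningBits() {
        resolveCalls++;
        // Stand-in for methodHandle.type().parameterType(0).getName()
        // followed by WarningBitsRegistry.get(className); may be null.
        return null;
    }
}
```

The identity check is safe because the sentinel is a private constant that no real lookup result can alias (provided the lookup never returns that exact interned literal).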

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  11.68–12.50 Mcells/s
  after this patch:   12.19–13.07 Mcells/s

The cumulative speedup since the start of this PR (post-Phase-J
baseline on this branch, 7.74 Mcells/s) is **1.65×**.

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS
  * Moo: Files=71, Tests=841 — PASS
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two small hot-path cleanups found via JFR allocation profile on
life_bitpacked.pl:

1. RuntimeScalar.initializeWithLong took `Long` (boxed) rather than
   `long` (primitive). The `RuntimeScalar(long)` constructor
   autoboxed long → Long on every call, then the method unboxed it
   again. Changed the signature to primitive `long` and adapted the
   `RuntimeScalar(Long)` constructor to call `.longValue()` explicitly.

2. BitwiseOperators.bitwiseAnd / bitwiseOr / bitwiseXor fast paths
   computed result as `long` and called `new RuntimeScalar(long)`,
   forcing the initializeWithLong branch cascade. Since int ^ int,
   int & int, int | int all produce int by definition, compute as
   int and call `new RuntimeScalar(int)` directly — skips the
   range-check branches.

Bitwise shift ops are unchanged (still use long) because left-shift
may grow beyond 32 bits and must preserve semantics.
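Both points can be illustrated with a stripped-down scalar type. ScalarSketch and its single long field are hypothetical; the real RuntimeScalar has more value states:

```java
class ScalarSketch {
    final long value;

    // Primitive overload: no boxing anywhere on this path.
    ScalarSketch(long v) {
        value = v;
    }

    // Boxed callers unbox explicitly once, instead of the old
    // long -> Long autobox followed by an unbox inside the init method.
    ScalarSketch(Long v) {
        this(v.longValue());
    }

    // int & int (likewise | and ^) produces int by definition, so the
    // fast path can stay in int and skip the long range-check cascade.
    static ScalarSketch bitwiseAndFast(int a, int b) {
        int result = a & b;             // computed as int
        return new ScalarSketch(result); // widened once, never boxed
    }
}
```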

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * make unit tests — PASS

Measured on examples/life_bitpacked.pl default args: ~12.3–12.6
Mcells/s, unchanged at the mean. These changes eliminate per-op boxing
overhead, but the JIT had already optimized it away in the hot loop;
the signature fix removes an autobox that showed up in JFR's allocation
sample but not in the JIT-compiled code.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock changed the title perf: cache System.getenv() in hot paths perf: walker-machinery gates + hot-path cleanups (+1.6× on life_bitpacked) Apr 21, 2026
fglock added a commit that referenced this pull request Apr 21, 2026
PR #526 shipped with a significant performance regression that was
accepted only because correctness blocks merge:

  ./jperl examples/life_bitpacked.pl -r none -g 500
    Before merge (660aa9e): ~12.5 Mcells/s
    After  merge (PR #526):   ~6.6  Mcells/s
    => running at ~53% of prior perf (~47% slower)

This workload is the benchmark that drove the walker-hardening wins
in PR #523 — losing it silently would undo that work.

Add §0 to dev/design/next_steps.md:
- numbers + measurement command
- ranked hypotheses (pristineArgsStack clone, hasArgsStack TL
  churn, inTailCallTrampoline/tailCallReentry, deep-recursion
  counter, WarningBits / HintHash push/pop)
- action plan: bisect commits, profile, apply weakRefsExist-style
  gating to each new stack if hot
- explicit acceptance criterion: >= 12 Mcells/s with tests green

Also:
- add header warning banner so reviewers notice before scrolling
- mark §3 (new perf work) as blocked on §0
- update progress tracker + next-action pointer

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

fglock commented Apr 27, 2026

Superseded by #566 (feature/dbic-final-integration), which has been merged to master.

This PR's commits were consolidated into #566's branch via rebase, along with the subsequent regression fixes (Steps A-D) that closed all remaining DBIC-final regressions while maintaining 314/314 DBIC parity.

@fglock fglock closed this Apr 27, 2026
