perf: walker-machinery gates + hot-path cleanups (+1.6× on life_bitpacked) #523
Closed
fglock wants to merge 5 commits into feature/phase-j-performance from
Conversation
System.getenv() is a native call (JNI → TreeMap lookup, ~200ns each).
Several debug flags were being evaluated via System.getenv() on every
call rather than once at class init. These repeated uncached reads add
up in hot compile and runtime paths.
Caches:
* MortalList.GC_DEBUG — was in maybeAutoSweep (fires after flush)
* RuntimeScalar.PHASE_D_DBG — was in undefine()
* RuntimeIO.IO_DEBUG — 2 hot spots in getRuntimeIO/close
* IOOperator.IO_DEBUG — open/close hot paths
* EmitterMethodCreator.ASM_DEBUG, ASM_DEBUG_CLASS_FILTER,
BYTECODE_SIZE_DEBUG, SPILL_SLOT_COUNT — compile hot path,
called ~once per compiled method (was 4-5 getenv per compile)
* SubroutineParser.SHOW_FALLBACK — parser hot path
* PerlLanguageProvider.SHOW_FALLBACK — compile fallback hot path
Semantically identical — these are all at-startup-determined debug
flags whose values never change during execution. Pattern already
used elsewhere in the codebase (e.g., ScalarRefRegistry, RuntimeRegex).
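The caching pattern described above can be sketched as follows. This is a minimal illustration, not the actual jperl code — the class and env-var names are hypothetical; the point is that the JNI lookup happens once, at class initialization, instead of on every call.

```java
// Sketch of caching an at-startup-determined debug flag.
// Class and env-var names are illustrative, not from the real codebase.
public class GcDebugFlags {
    // Before: every call paid the JNI -> TreeMap lookup.
    static boolean gcDebugUncached() {
        return System.getenv("JPERL_GC_DEBUG") != null;
    }

    // After: evaluated exactly once when the class is initialized.
    static final boolean GC_DEBUG = System.getenv("JPERL_GC_DEBUG") != null;

    static void maybeAutoSweep() {
        // Hot path now reads a JIT-foldable static final boolean.
        if (GC_DEBUG) {
            System.err.println("[gc] auto-sweep debug trace");
        }
        // ... sweep logic elided ...
    }
}
```

Because the flag is `static final`, the JIT can constant-fold the branch away entirely in compiled code, which is why this is safe only for flags whose values never change during execution.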
Regression gates:
* DBIx::Class: Files=314, Tests=13804 — PASS (1152s, within noise of the 1107s baseline)
* Template-Toolkit: Files=106, Tests=2920 — PASS (133-137s)
* Moo: Files=71, Tests=841 — PASS (91s)
* make unit tests — PASS
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
ScalarRefRegistry.registerRef() was doing a synchronized(WeakHashMap).put() on every setLargeRefCounted call (every ref assignment). The registry is consumed only by ReachabilityWalker.sweepWeakRefs(), which is only invoked when WeakRefRegistry.weakRefsExist is true (i.e., when weaken() has been called at least once). For scripts that never weaken(), every registerRef call was pure overhead.

JFR profile of life_bitpacked.pl (which never weakens anything) showed WeakHashMap.put / Collections$SynchronizedMap.put / expungeStaleEntries dominating post-compile CPU.

Measured on examples/life_bitpacked.pl default args (80×40, 5000 gens), best-of-3 cell updates per second:
* before: 7.74–8.06 Mcells/s
* after: 9.79–9.93 Mcells/s (+22% median)
* larger grid (200×200, 5000 gens): 10.01 → 11.05 Mcells/s (+10%)

Trade-off: scripts that hold many scalars-with-refs prior to the first weaken() call won't be in the registry when the walker first runs. However, any subsequent setLarge on those scalars will register them, and the walker's primary seeds (globals, code refs, DESTROY rescued set) still find reachable structures via the normal BFS. No DBIC 52leaks.t assertions regressed.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old unconditional behavior.

Regression gates:
* DBIx::Class t/52leaks.t: 11/11 PASS
* Template-Toolkit: Files=106, Tests=2920 — PASS (136-138s)
* Moo: Files=71, Tests=841 — PASS (95s; within noise of the 91s baseline)
* make unit tests — PASS
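The gating described in this commit can be sketched roughly as below. This is an illustrative reconstruction, not the actual jperl source — the class shape and field names are assumptions based on the commit message; only `weakRefsExist` and `JPERL_UNGATED_SCALAR_REGISTRY` are named in it.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of gating a synchronized WeakHashMap.put() behind a
// "has weaken() ever been called" flag. Names are illustrative.
public class ScalarRefRegistrySketch {
    // Flipped to true the first time weaken() runs; never reset.
    static volatile boolean weakRefsExist = false;

    // Escape hatch restoring the old unconditional behavior.
    static final boolean UNGATED =
            "1".equals(System.getenv("JPERL_UNGATED_SCALAR_REGISTRY"));

    static final Map<Object, Boolean> registry =
            Collections.synchronizedMap(new WeakHashMap<>());

    static void registerRef(Object scalar) {
        // Fast path: scripts that never weaken() skip the lock,
        // the put, and the expungeStaleEntries churn entirely.
        if (!weakRefsExist && !UNGATED) {
            return;
        }
        registry.put(scalar, Boolean.TRUE);
    }
}
```

The key property making this safe, per the commit message, is that the registry's only consumer (`sweepWeakRefs`) never runs before the flag flips, so entries skipped beforehand were never observed anyway.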
Same pattern as the prior ScalarRefRegistry.registerRef fix. MyVarCleanupStack.register() is called for every `my` variable declaration and was doing an unconditional `liveCounts.merge(var, 1, Integer::sum)` — an IdentityHashMap merge with a boxed Integer lambda. JFR of life_bitpacked.pl surfaced this right after the scalar-registry gate landed.

The `liveCounts` map exists solely for ReachabilityWalker.sweepWeakRefs' `isLive()` check, which only runs when WeakRefRegistry.weakRefsExist is true. Scripts that never weaken() pay the full IdentityHashMap.merge cost for nothing. The walker's pre-weaken fallback (sc.scopeExited + sc.refCountOwned checks) still correctly classifies live vs dead lexicals, so semantically this is indistinguishable.

Measured on examples/life_bitpacked.pl default args (80×40, 5000 gens), best-of-3 cell updates per second:
* before this patch: 9.79 Mcells/s (after the ScalarRefRegistry gate)
* after this patch: 12.50 Mcells/s (+28%)
* combined vs the pre-gate baseline on this branch: 7.74 → 12.50 Mcells/s, a 1.61× speedup
* larger grid (200×200, 5000 gens): 13.71 Mcells/s (baseline was 10.01, a 1.37× speedup)

Regression gates:
* DBIx::Class t/52leaks.t: 11/11 PASS
* Template-Toolkit: Files=106, Tests=2920 — PASS (143s)
* Moo: Files=71, Tests=841 — PASS (97s)
* make unit tests — PASS
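A rough sketch of the same gate applied to the per-declaration bookkeeping, again with illustrative names (only `liveCounts`, `weakRefsExist`, and the `merge` call are taken from the commit message):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch: skip the IdentityHashMap merge (and its boxed-Integer lambda)
// unless a weak reference exists somewhere. Names are illustrative.
public class MyVarCleanupStackSketch {
    static volatile boolean weakRefsExist = false;

    static final Map<Object, Integer> liveCounts = new IdentityHashMap<>();

    static void register(Object var) {
        // Gate: scripts that never weaken() never need isLive() data.
        if (!weakRefsExist) {
            return;
        }
        liveCounts.merge(var, 1, Integer::sum);
    }
}
```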
Two targeted hot-path optimizations found via JFR on life_bitpacked.pl after the ScalarRefRegistry / MyVarCleanupStack gates landed.

1. Cache getWarningBitsForCode() per RuntimeCode. The JVM-compiled branch looked up `code.methodHandle.type().parameterType(0).getName()` + a `WarningBitsRegistry.get(className)` HashMap probe on every sub invocation. The declaring class of a compiled code's MethodHandle is stable for the lifetime of the RuntimeCode, so cache the resolved warning-bits string in a `cachedWarningBits` field. A sentinel (`WARNING_BITS_NOT_COMPUTED = "<uninit>"`) keeps a legitimately-null result distinguishable from not-yet-computed.

2. pushArgs: shared empty snapshot for zero-arg calls. Every sub call was allocating a fresh `RuntimeArray` wrapper + `ArrayList<>` to snapshot @_ for `@DB::args` support, even when the callee was called with zero arguments. Share a single `EMPTY_ARGS_SNAPSHOT` for those; real allocation only happens when args is non-empty.

Measured on examples/life_bitpacked.pl default args (80×40, 5000 gens), best-of-3 cell updates per second:
* before this patch: 11.68–12.50 Mcells/s
* after this patch: 12.19–13.07 Mcells/s

The cumulative speedup since the start of this PR (post-Phase-J baseline on this branch, 7.74 Mcells/s) is **1.65×**.

Regression gates:
* DBIx::Class t/52leaks.t: 11/11 PASS
* Template-Toolkit: Files=106, Tests=2920 — PASS
* Moo: Files=71, Tests=841 — PASS
* make unit tests — PASS
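The sentinel-cache pattern from item 1 can be sketched as below. The surrounding class is hypothetical; only the sentinel idea (`WARNING_BITS_NOT_COMPUTED = "<uninit>"`, compared by reference) comes from the commit message. The sentinel is needed because the expensive lookup may legitimately return null, so "null" cannot double as "not yet computed".

```java
// Sketch of caching a possibly-null result behind a reference-compared
// sentinel. Class shape and lookup body are illustrative.
public class WarningBitsCacheSketch {
    // Distinguished instance: compared with ==, never .equals().
    static final String WARNING_BITS_NOT_COMPUTED = "<uninit>";

    String cachedWarningBits = WARNING_BITS_NOT_COMPUTED;
    int lookups = 0; // instrumentation for this sketch only

    String getWarningBits() {
        if (cachedWarningBits == WARNING_BITS_NOT_COMPUTED) {
            // Runs at most once; a null result is cached like any other.
            cachedWarningBits = expensiveLookup();
        }
        return cachedWarningBits;
    }

    String expensiveLookup() {
        lookups++;
        return null; // stands in for a registry probe finding no entry
    }
}
```

Without the sentinel, a null result would look uncomputed and the expensive lookup would repeat on every call for exactly the subs that have no warning bits.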
Two small hot-path cleanups found via JFR allocation profiling on life_bitpacked.pl:

1. RuntimeScalar.initializeWithLong took `Long` (boxed) rather than `long` (primitive). The `RuntimeScalar(long)` constructor autoboxed long → Long on every call, then the method unboxed it again. Changed the signature to primitive `long` and adapted the `RuntimeScalar(Long)` constructor to call `.longValue()` explicitly.

2. BitwiseOperators.bitwiseAnd / bitwiseOr / bitwiseXor fast paths computed the result as `long` and called `new RuntimeScalar(long)`, forcing the initializeWithLong branch cascade. Since int & int, int | int, and int ^ int all produce int by definition, compute as int and call `new RuntimeScalar(int)` directly — this skips the range-check branches. Bitwise shift ops are unchanged (still long) because left-shift may grow beyond 32 bits and must preserve semantics.

Regression gates:
* DBIx::Class t/52leaks.t: 11/11 PASS
* make unit tests — PASS

Measured on examples/life_bitpacked.pl default args: ~12.3–12.6 Mcells/s, unchanged at the mean — these changes eliminate per-op boxing overhead, but the JIT already optimized it well; the signature fix removes an autobox that showed up in JFR's allocation sample but not in the JIT-compiled hot loop.
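Both cleanups reduce to "keep the hot path on primitives". A minimal sketch, with an illustrative class standing in for the real RuntimeScalar/BitwiseOperators code:

```java
// Sketch: primitive-typed signatures avoid autoboxing on the hot path,
// and int-typed bitwise fast paths skip the long range-check cascade.
// Class and method bodies are illustrative.
public class ScalarInitSketch {
    long value;

    // After the fix: primitive parameter, so calls with a long
    // no longer allocate a java.lang.Long per invocation.
    void initializeWithLong(long v) {
        value = v;
    }

    // int & int / int | int / int ^ int all produce int by definition,
    // so the result can be built as int directly.
    static int bitwiseAndFast(int a, int b) {
        return a & b;
    }

    static int bitwiseXorFast(int a, int b) {
        return a ^ b;
    }
}
```

Shifts are deliberately excluded from this int fast path, matching the commit: `1 << 33` as a long is non-zero, while the same computation forced into int would wrap.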
fglock added a commit that referenced this pull request on Apr 21, 2026:
PR #526 shipped with a significant performance regression that was accepted only because correctness blocks merge:

    ./jperl examples/life_bitpacked.pl -r none -g 500

Before merge (660aa9e): ~12.5 Mcells/s
After merge (PR #526): ~6.6 Mcells/s
=> running at ~53% of prior perf (~47% slower)

This workload is the benchmark that drove the walker-hardening wins in PR #523 — losing it silently would undo that work.

Add §0 to dev/design/next_steps.md:
- numbers + measurement command
- ranked hypotheses (pristineArgsStack clone, hasArgsStack TL churn, inTailCallTrampoline/tailCallReentry, deep-recursion counter, WarningBits / HintHash push/pop)
- action plan: bisect commits, profile, apply weakRefsExist-style gating to each new stack if hot
- explicit acceptance criterion: >= 12 Mcells/s with tests green

Also:
- add a header warning banner so reviewers notice before scrolling
- mark §3 (new perf work) as blocked on §0
- update the progress tracker + next-action pointer
Summary
Follow-on perf work after Phase J (PR #518). This started as an attempt at walker hardening + the J3 stash fast path (see the blocked section of phase_j_performance_plan.md), but the A2/J3 combo remained fragile. Instead, this PR bundles six orthogonal, low-risk hot-path fixes surfaced by JFR-driven investigation.
Measured impact
On examples/life_bitpacked.pl (default 80×40, 5000 gens), best-of-3 cell updates per second:
* base (branch point): 7.74–8.06 Mcells/s
* final: 12.33–13.07 Mcells/s
* speedup: 1.61×
* larger grid (200×200, 5000 gens): 10.01 → 13.71 Mcells/s, a 1.37× speedup
Changes (one commit each)
* f4d474a — cache System.getenv() in hot paths as static final
* 2fb0bd1 — gate ScalarRefRegistry.registerRef() on weakRefsExist
* a7165f7 — gate MyVarCleanupStack.liveCounts on weakRefsExist
* 17527e8 — cache getWarningBitsForCode() per RuntimeCode + shared empty-args snapshot
* 660aa9e — fix long ↔ Long autoboxing in the RuntimeScalar(long) ctor + int fast paths in BitwiseOperators

The first three are the big wins. The common theme:
All three "gate" commits check WeakRefRegistry.weakRefsExist and skip the now-provably-unused bookkeeping. Trade-off: scripts that weaken() after a long warm-up phase may miss a handful of early scalars from the walker's seed snapshot — but the walker's other filters (sc.scopeExited, sc.refCountOwned, MyVarCleanupStack.isLive's other branches) still classify them correctly. No DBIC 52leaks assertions regress.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old unconditional behavior.

Test plan
* make unit tests — PASS

Generated with Devin