
perf: walker-machinery gates + hot-path cleanups (+1.6× on life_bitpacked) #523

Closed
fglock wants to merge 5 commits into feature/phase-j-performance from feature/walker-hardening-j3

Conversation


fglock (Owner) commented Apr 21, 2026

Summary

Follow-on perf work after Phase J (PR #518). Started as an attempt at walker hardening + J3 stash fast path (see phase_j_performance_plan.md's blocked section) but the A2/J3 combo remained fragile — instead, this PR bundles six orthogonal, low-risk hot-path fixes surfaced by JFR-driven investigation.

Measured impact

On examples/life_bitpacked.pl (default 80×40, 5000 gens), best-of-3 Cell updates per second:

base (branch-point): 7.74 – 8.06 Mcells/s
final: 12.33 – 13.07 Mcells/s
1.61× speedup

Larger grid (200×200, 5000 gens): 10.01 → 13.71 Mcells/s, 1.37× speedup.

Changes (one commit each)

| Commit | What | Measured impact |
| --- | --- | --- |
| f4d474a | Cache System.getenv() in hot paths as static final | correctness cleanup, noise-level on benches |
| 2fb0bd1 | Gate ScalarRefRegistry.registerRef() on weakRefsExist | +22% on life_bitpacked |
| a7165f7 | Gate MyVarCleanupStack.liveCounts on weakRefsExist | +28% on top of above |
| 17527e8 | Cache getWarningBitsForCode() per RuntimeCode + shared empty-args snapshot | within noise, allocation cleanup |
| 660aa9e | Avoid long→Long autoboxing in RuntimeScalar(long) ctor + int fast paths in BitwiseOperators | allocation cleanup |

The first three are the big wins. The common theme:

"Walker machinery that existed only to support weaken() was running on every scalar and every my variable assignment even for scripts that never called weaken()."

All three "gate" commits check WeakRefRegistry.weakRefsExist and skip the now-provably-unused bookkeeping. Trade-off: scripts that weaken() after a long warm-up phase may miss a handful of early scalars from the walker's seed snapshot — but the walker's other filters (sc.scopeExited, sc.refCountOwned, MyVarCleanupStack.isLive's other branches) still classify them correctly. No DBIC 52leaks assertions regress.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old unconditional behavior.
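The gate pattern shared by all three commits can be sketched as follows. Class, field, and env-var names mirror the PR description, but the bodies are simplified assumptions for illustration, not the actual implementation:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

class WeakRefRegistry {
    // Flipped to true the first time weaken() is called.
    static volatile boolean weakRefsExist = false;
}

class ScalarRefRegistry {
    // Escape hatch: restores the old unconditional registration.
    private static final boolean UNGATED =
            "1".equals(System.getenv("JPERL_UNGATED_SCALAR_REGISTRY"));

    private static final Map<Object, Boolean> refs =
            Collections.synchronizedMap(new WeakHashMap<>());

    static void registerRef(Object scalar) {
        // Hot path: a script that never weakens skips the synchronized
        // WeakHashMap.put (and its expungeStaleEntries cost) entirely.
        if (!WeakRefRegistry.weakRefsExist && !UNGATED) {
            return;
        }
        refs.put(scalar, Boolean.TRUE);
    }

    static int registeredCount() {
        return refs.size();
    }
}
```

Until the flag flips, registerRef is a single volatile read plus a branch, which is what makes the gate essentially free for scripts that never weaken.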

Test plan

  • DBIx::Class t/52leaks.t: 11/11 PASS
  • Template-Toolkit: Files=106, Tests=2920 — PASS
  • Moo: Files=71, Tests=841 — PASS
  • make unit tests PASS

Generated with Devin

fglock and others added 5 commits April 21, 2026 12:37
System.getenv() is a native call (JNI → TreeMap lookup, ~200ns each).
Several debug flags were being evaluated via System.getenv() on every
call rather than once at class init. These cached reads add up in hot
compile and runtime paths.

Caches:
  * MortalList.GC_DEBUG — was in maybeAutoSweep (fires after flush)
  * RuntimeScalar.PHASE_D_DBG — was in undefine()
  * RuntimeIO.IO_DEBUG — 2 hot spots in getRuntimeIO/close
  * IOOperator.IO_DEBUG — open/close hot paths
  * EmitterMethodCreator.ASM_DEBUG, ASM_DEBUG_CLASS_FILTER,
    BYTECODE_SIZE_DEBUG, SPILL_SLOT_COUNT — compile hot path,
    called ~once per compiled method (was 4-5 getenv per compile)
  * SubroutineParser.SHOW_FALLBACK — parser hot path
  * PerlLanguageProvider.SHOW_FALLBACK — compile fallback hot path

Semantically identical — these are all at-startup-determined debug
flags whose values never change during execution. Pattern already
used elsewhere in the codebase (e.g., ScalarRefRegistry, RuntimeRegex).
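The before/after difference can be sketched like this. MortalList.GC_DEBUG is a flag name from the list above, but the method body and the JPERL_GC_DEBUG env-var name are assumptions:

```java
class MortalList {
    // Before: System.getenv(...) on every maybeAutoSweep() call — a JNI
    // call plus an internal map lookup, ~200ns each. After: one read at
    // class init; debug flags never change during execution, so the
    // cached value is semantically identical.
    private static final boolean GC_DEBUG =
            System.getenv("JPERL_GC_DEBUG") != null;

    static boolean gcDebugEnabled() {
        return GC_DEBUG; // plain static-final field read on the hot path
    }

    static void maybeAutoSweep() {
        if (gcDebugEnabled()) {
            System.err.println("[gc] auto-sweep fired");
        }
        // ... sweep bookkeeping elided ...
    }
}
```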

Regression gates:
  * DBIx::Class: Files=314, Tests=13804 — PASS (1152s, within noise of the 1107s baseline)
  * Template-Toolkit: Files=106, Tests=2920 — PASS (133-137s)
  * Moo: Files=71, Tests=841 — PASS (91s)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
ScalarRefRegistry.registerRef() was doing a
synchronized(WeakHashMap).put() on every setLargeRefCounted call
(every ref assignment). The registry is consumed ONLY by
ReachabilityWalker.sweepWeakRefs(), which is only invoked when
WeakRefRegistry.weakRefsExist is true (i.e., when weaken() has
been called at least once). For scripts that never weaken(),
every registerRef call was pure overhead.

JFR profile of life_bitpacked.pl (which never weakens anything)
showed WeakHashMap.put / Collections$SynchronizedMap.put /
expungeStaleEntries dominating post-compile CPU.

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before:  7.74–8.06 Mcells/s
  after:   9.79–9.93 Mcells/s   (+22% median)

Larger grid (200x200, 5000 gens): 10.01 → 11.05 Mcells/s (+10%).

Trade-off: scripts that hold many scalars-with-refs PRIOR to the
first weaken() call won't be in the registry when the walker first
runs. However, any subsequent setLarge on those scalars will
register them, and the walker's primary seeds (globals, code refs,
DESTROY rescued set) still find reachable structures via the normal
BFS. No DBIC 52leaks.t assertions regressed.

Escape hatch: JPERL_UNGATED_SCALAR_REGISTRY=1 restores the old
unconditional behavior.

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (136-138s)
  * Moo: Files=71, Tests=841 — PASS (95s; within noise of 91s baseline)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Same pattern as the prior ScalarRefRegistry.registerRef fix.
MyVarCleanupStack.register() is called for every `my` variable
declaration and was doing an unconditional
`liveCounts.merge(var, 1, Integer::sum)` — an IdentityHashMap
merge with a boxed Integer lambda. JFR of life_bitpacked.pl
surfaced this right after the scalar-registry gate landed.

The `liveCounts` map exists solely for
ReachabilityWalker.sweepWeakRefs' `isLive()` check, which only runs
when WeakRefRegistry.weakRefsExist is true. Scripts that never
weaken() pay the full IdentityHashMap.merge cost for nothing.

The walker's pre-weaken fallback (sc.scopeExited + sc.refCountOwned
checks) still correctly classifies live vs dead lexicals, so
semantically this is indistinguishable.
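A minimal sketch of the gated register(), with the weakRefsExist flag inlined here for self-containment (the real flag lives on WeakRefRegistry, and the bodies are illustrative):

```java
import java.util.IdentityHashMap;
import java.util.Map;

class MyVarCleanupStack {
    // Stands in for WeakRefRegistry.weakRefsExist in this sketch.
    static volatile boolean weakRefsExist = false;

    private static final Map<Object, Integer> liveCounts = new IdentityHashMap<>();

    static void register(Object var) {
        // Before: unconditional merge — IdentityHashMap hashing plus a
        // boxed-Integer lambda on every `my` declaration. After: only
        // pay that cost once weaken() has made the walker's isLive()
        // check relevant at all.
        if (weakRefsExist) {
            liveCounts.merge(var, 1, Integer::sum);
        }
    }

    static int liveCount(Object var) {
        return liveCounts.getOrDefault(var, 0);
    }
}
```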

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  9.79 Mcells/s   (after ScalarRefRegistry gate)
  after this patch:  12.50 Mcells/s   (+28%)

Combined vs pre-gate baseline on this branch:
  7.74 → 12.50 Mcells/s, a 1.61× speedup.

Larger grid (200x200, 5000 gens): 13.71 Mcells/s
(baseline was 10.01, a 1.37× speedup).

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS (143s)
  * Moo: Files=71, Tests=841 — PASS (97s)
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two targeted hot-path optimizations found via JFR on life_bitpacked.pl
after the ScalarRefRegistry / MyVarCleanupStack gates landed.

1. Cache getWarningBitsForCode() per RuntimeCode.

   The JVM-compiled branch looked up
   `code.methodHandle.type().parameterType(0).getName()` + a
   `WarningBitsRegistry.get(className)` HashMap probe on every sub
   invocation. The declaring class of a compiled code's MethodHandle
   is stable for the lifetime of the RuntimeCode, so cache the
   resolved warning bits string in a `cachedWarningBits` field.
   Sentinel pattern (`WARNING_BITS_NOT_COMPUTED = "<uninit>"`) keeps
   a legitimately-null result distinguishable from not-yet-computed.

2. pushArgs: shared empty snapshot for zero-arg calls.

   Every sub call was allocating a fresh `RuntimeArray` wrapper +
   `ArrayList<>` to snapshot @_ for `@DB::args` support, even when
   the callee was called with zero arguments. Share a single
   `EMPTY_ARGS_SNAPSHOT` for those; real allocation only happens when
   args is non-empty.
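The sentinel cache from item 1 might look like this (a sketch: resolveWarningBits and the resolveCalls counter are stand-ins for the real MethodHandle + registry probe):

```java
class RuntimeCode {
    // Sentinel distinguishes "not yet computed" from a legitimately-null
    // warning-bits result, so a null lookup is cached too.
    private static final String WARNING_BITS_NOT_COMPUTED = "<uninit>";
    private String cachedWarningBits = WARNING_BITS_NOT_COMPUTED;
    int resolveCalls = 0; // instrumentation for this sketch only

    String getWarningBitsForCode() {
        // Identity comparison against the sentinel object: a cached
        // null does not re-trigger the lookup.
        if (cachedWarningBits == WARNING_BITS_NOT_COMPUTED) {
            cachedWarningBits = resolveWarningBits();
        }
        return cachedWarningBits;
    }

    private String resolveWarningBits() {
        resolveCalls++;
        // Stand-in for methodHandle.type().parameterType(0).getName()
        // followed by WarningBitsRegistry.get(className); may be null.
        return null;
    }
}
```

The identity check is safe because the sentinel is a private constant that no real lookup result can alias (provided the lookup never returns that exact interned literal).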

Measured on examples/life_bitpacked.pl default args (80x40, 5000
gens), best-of-3 Cell updates per second:

  before this patch:  11.68–12.50 Mcells/s
  after this patch:   12.19–13.07 Mcells/s

The cumulative speedup since the start of this PR (post-Phase-J
baseline on this branch, 7.74 Mcells/s) is **1.65×**.

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * Template-Toolkit: Files=106, Tests=2920 — PASS
  * Moo: Files=71, Tests=841 — PASS
  * make unit tests — PASS

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two small hot-path cleanups found via JFR allocation profile on
life_bitpacked.pl:

1. RuntimeScalar.initializeWithLong took `Long` (boxed) rather than
   `long` (primitive). The `RuntimeScalar(long)` constructor
   autoboxed long → Long on every call, then the method unboxed it
   again. Changed the signature to primitive `long` and adapted the
   `RuntimeScalar(Long)` constructor to call `.longValue()` explicitly.

2. BitwiseOperators.bitwiseAnd / bitwiseOr / bitwiseXor fast paths
   computed result as `long` and called `new RuntimeScalar(long)`,
   forcing the initializeWithLong branch cascade. Since int ^ int,
   int & int, int | int all produce int by definition, compute as
   int and call `new RuntimeScalar(int)` directly — skips the
   range-check branches.

Bitwise shift ops are unchanged (still use long) because left-shift
may grow beyond 32 bits and must preserve semantics.
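Both points can be illustrated with a stripped-down scalar type. ScalarSketch and its single long field are hypothetical; the real RuntimeScalar has more value states:

```java
class ScalarSketch {
    final long value;

    // Primitive overload: no boxing anywhere on this path.
    ScalarSketch(long v) {
        value = v;
    }

    // Boxed callers unbox explicitly once, instead of the old
    // long -> Long autobox followed by an unbox inside the init method.
    ScalarSketch(Long v) {
        this(v.longValue());
    }

    // int & int (likewise | and ^) produces int by definition, so the
    // fast path can stay in int and skip the long range-check cascade.
    static ScalarSketch bitwiseAndFast(int a, int b) {
        int result = a & b;             // computed as int
        return new ScalarSketch(result); // widened once, never boxed
    }
}
```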

Regression gates:
  * DBIx::Class t/52leaks.t: 11/11 PASS
  * make unit tests — PASS

Measured on examples/life_bitpacked.pl default args: ~12.3–12.6
Mcells/s, unchanged at the mean. These changes eliminate per-op boxing
overhead, but the JIT had already optimized it away in the hot loop;
the signature fix removes an autobox that showed up in JFR's allocation
sample but not in the JIT-compiled code.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock changed the title perf: cache System.getenv() in hot paths perf: walker-machinery gates + hot-path cleanups (+1.6× on life_bitpacked) Apr 21, 2026
fglock added a commit that referenced this pull request Apr 21, 2026
PR #526 shipped with a significant performance regression that was
accepted only because correctness blocks merge:

  ./jperl examples/life_bitpacked.pl -r none -g 500
    Before merge (660aa9e): ~12.5 Mcells/s
    After  merge (PR #526):   ~6.6  Mcells/s
    => running at ~53% of prior perf (~47% slower)

This workload is the benchmark that drove the walker-hardening wins
in PR #523 — losing it silently would undo that work.

Add §0 to dev/design/next_steps.md:
- numbers + measurement command
- ranked hypotheses (pristineArgsStack clone, hasArgsStack TL
  churn, inTailCallTrampoline/tailCallReentry, deep-recursion
  counter, WarningBits / HintHash push/pop)
- action plan: bisect commits, profile, apply weakRefsExist-style
  gating to each new stack if hot
- explicit acceptance criterion: >= 12 Mcells/s with tests green

Also:
- add header warning banner so reviewers notice before scrolling
- mark §3 (new perf work) as blocked on §0
- update progress tracker + next-action pointer

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

fglock commented Apr 27, 2026

Superseded by #566 (feature/dbic-final-integration), which has been merged to master.

This PR's commits were consolidated into #566's branch via rebase, along with the subsequent regression fixes (Steps A-D) that closed all remaining DBIC-final regressions while maintaining 314/314 DBIC parity.

@fglock fglock closed this Apr 27, 2026
