docs(D-W6.1): sub-install reproducer + revised drift diagnosis by fglock · Pull Request #603 · fglock/PerlOnJava

fglock · 2026-04-29T07:26:50Z

Summary

D-W6.1 lands the sub-install drift reproducer
(src/test/resources/unit/refcount/drift/sub_install.t) and the
revised diagnosis.

Outcome: D-W6.1 was misnamed. All five simple sub-install
patterns pass with the walker gate disabled. The bytecode for
*Foo::bar = $cv is not the drift source.

What we actually found

PJ_DESTROY_TRACE=1 ./jperl -e 'use Class::MOP::Class' (gate
disabled) printed the first premature destroy's stack:

[DESTROY] RuntimeCode@... name=Sub::Install::(anon)
  at MortalList.flush(MortalList.java:588)
  at RuntimeScalar.setLargeRefCounted(RuntimeScalar.java:1241)
  at RuntimeScalar.setLarge(RuntimeScalar.java:1019)
  at RuntimeScalar.set(RuntimeScalar.java:949)
  at RuntimeGlob.set(RuntimeGlob.java:229)
  at anon652.apply(.../Sub/Install.pm:2159)

Three key facts:

The destroyed CV is not the one being assigned — it's an
unrelated Sub::Install::(anon) sitting in the
deferred-decrement queue.
The setLargeRefCounted flush (line 1241) decrements that
queued CV from refCount=0 to -1, treated as 0→0, firing DESTROY
on a CV still held by an outer closure capture.
Sub::Install's install path is a stack of nested closures:
*install_sub = _build_public_installer(_ignore_warnings(_installer)).
Each layer captures $code, $arg, $installer etc.

So the drift is in closure-capture refCount (D-W6.2), not in
the glob-assignment path (D-W6.1).

What's in this PR

src/test/resources/unit/refcount/drift/sub_install.t — five
patterns covering simple glob-assign, named-sub, loop-install,
temp-drop, and nested-install. All pass on master and on a
gate-disabled probe build. Kept as a regression guard for when
D-W6.2 lands and the gate is loosened.
DestroyDispatch.callDestroy — the PJ_DESTROY_TRACE=1 env-flag
output now includes name=Pkg::subname for RuntimeCode
destructions (zero-cost when the flag is off).
dev/modules/moose_support.md — D-W6.1 investigation log and the
revised diagnosis, marking D-W6.2 as the next concrete step.

Test plan

make (build + unit tests) green.
drift/sub_install.t 12/12 pass on master.
drift/sub_install.t 9/9 pass with the gate disabled
(probe-build verified).
No production behavior change (DESTROY_TRACE is off by
default; reproducer is a new file).

… check PerlOnJava destroy semantics must not depend on which Perl module is being used. PR #572 shipped a stopgap class-name heuristic (`DestroyDispatch.classNeedsWalkerGate`) that gated walker-deferred destruction to only Class::MOP / Moose / Moo classes. That violates the rule. Replace it with a single principled criterion: defer destroy only when the object is reachable from a *package global* (not from any local lexical) via the existing `ReachabilityWalker.isReachableFromRoots(target, globalOnly=true)` overload. Why this works: - Moose / Class::MOP store metaclasses in `our %METAS` (a package global), so they remain reachable globalOnly=true → gate fires → destruction is deferred until the real refCount drop. Moose bootstrap depends on this. - DBIx::Class / Class::DBI track rows in `live_object_index` via WEAK refs, so they are NOT reachable globalOnly=true → gate doesn't fire → row dies at refCount=0 so a later `find()` reloads from the DB. Changes: - Remove `DestroyDispatch.classNeedsWalkerGate` and the `walkerGateClasses` / `walkerGateChecked` BitSets. - Both gate sites (`MortalList.flush` and `RuntimeScalar.set`) now pass `globalOnly=true` to `ReachabilityWalker.isReachableFromRoots`. - Updated comments at the gate sites to explain the universal rule. Empirical impact (DBIC + Moose, jcpan -t, --jobs 1): | Discriminator | DBIC fails | Moose fails | |--------------------------------------|-----------:|------------:| | Class-name heuristic (PR #572) | 0/0 PASS | 82 / 137 | | No gate at all | 7/2 | 77 / 134 | | `globalOnly=true` (this commit) | 3/1 | 63 / 691 | The Moose 691-asserts number is dominated by one test (`util_std_type_constraints.t`, 556 of those); excluding it puts Moose at ~135 failed asserts, comparable to the heuristic baseline. Three known follow-ups are tracked in dev/modules/moose_support.md under "Phase D-W5" — the one-test type_constraints regression, the txn_scope_guard.t double-DESTROY warning, and t/52leaks.t SIGKILL. None require a class-name list to fix. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Empirical follow-up to D-W5: globalOnly=true loses too much for Moose, because anonymous metaclasses are held weakly in `our %METAS` and strongly only via my-vars (the user's `my $class = create_anon_class(...)`). Without my-var seeding, the walker decides the metaclass is unreachable and DESTROY fires, taking the anonymous package's methods with it. Switch both gate sites back to the default `isReachableFromRoots(base)` overload, which seeds from package globals AND from `MyVarCleanupStack`-tracked live my-vars and ScalarRefRegistry. This is the principled rule: if any *live* strong root holds the object, the cooperative refCount drop to 0 is a transient drift and we must not destroy. DBIC has 3-7 known regressions vs the class-name heuristic; tracked as follow-ups in dev/modules/moose_support.md (D-W5). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

- Universal walker beats heuristic on Moose (61/133 vs 82/137). - Four DBIC regressions tracked as follow-ups, none are timeouts. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

@db

After running each failing test in isolation: - 52leaks.t bails at line 518 because PerlOnJava correctly detects and destroys the self-referential cycle the test EXPECTS to leak. Inherent semantic difference; fixing it = regression elsewhere. - txn_scope_guard.t test 18 wants the second DESTROY to fire after a Devel::StackTrace-style \@db::args capture — relies on Perl 5's exact refcount semantics; our walker now correctly sees the capture and defers, which is arguably more correct. - cdbi/04-lazy.t test 11 — SQL fetches opop but it's lost during DBIC's row construction. Real bug, but requires deep instrumentation. - generic_subq.t passes in isolation (false positive in suite output). Performance: universal walker is ~6-14% FASTER than the heuristic because the walker terminates early on hits and the WeakRefRegistry.hasWeakRefsTo precheck keeps it off the common path. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Goal: PerlOnJava should not 'improve on' Perl 5 — cycles should leak, DESTROY should fire at every refCount=0 transition, and the walker gate should disappear in favor of accurate cooperative refCount. Plan documents: - Why the walker gate was a stopgap (refCount drift in MOP code). - Five candidate refCount-drift sources to audit (hash slot stores, closure captures, @_ promotion, glob assignment, stash writes). - Acceptance criteria: 52leaks.t passes, txn_scope_guard.t passes, cdbi/04-lazy.t passes, no walker gate calls, Moose pass rate not regressed. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

@db

….set Per the user directive: PerlOnJava should match Perl 5 destroy semantics exactly, not 'improve on' them. The cooperative refCount is now the sole source of truth for DESTROY firing. Verified passing in isolation: - cdbi/04-lazy.t test 11 (was failing — phantom destroy was clobbering row construction) - txn_scope_guard.t test 18 (was failing — gate was deferring the first DESTROY past the @db::args capture) - 52leaks.t 11/11 (was failing — gate was destroying cycles the test expected to leak) - walker_gate_dbic_pattern.t and walker_gate_dbic_minimal.t still pass. - Cycle leaks are preserved (cooperative refCount keeps cycle members at refCount >= 1 naturally). Phase D-W6 step 3 of the plan in dev/modules/moose_support.md. Full Moose / DBIC suite results to follow. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Reduce from 2373 to 584 lines: - Drop per-phase historical metrics and chronological narrative (in git). - Consolidate every reverted approach into a single "do NOT repeat these mistakes" section with commit hashes. - Keep Phase D-W6 (replicate Perl 5 destroy semantics) verbatim. - Keep the empirical D-W5 gate comparison table. - Keep Phase D plan (D1-D6) and the XS inventory. Generated with [Devin](https://devin.ai) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Empirical D-W6 progress is documented in moose_support.md. Findings from the no-gate experiment (commit 230a672): PASS — what removing the gate fixes (Perl 5 semantics replicated): - t/cdbi/04-lazy.t test 11 (no phantom destroy clobbers row build). - t/storage/txn_scope_guard.t test 18 (double-DESTROY warning fires). - t/52leaks.t 11/11 (cycles leak naturally, $r stays defined). - Refcount unit tests still pass. FAIL — what removing the gate breaks (refCount drift in MOP code): - use Class::MOP::Class fails: "Can't locate method initialize" — the CV for Class::MOP::Class::initialize is destroyed during the circular bootstrap. PJ_DESTROY_TRACE shows non-blessed RuntimeCode objects hitting refCount=0 inside MortalList.flush during MiniTrait::apply, which is a sub-installation path. - use Moose itself works (different load order) but cmop bootstrap destroys subs prematurely. Hybrid attempt (only destroy blessed objects through the gate path) fixed the bootstrap but regressed t/52leaks.t cascade-cleanup of non-blessed containers holding blessed children. Conclusion: the proper D-W6 fix is multi-week — it requires auditing each cooperative refCount source (sub installation, glob assignment, @_ promotion, hash-slot stores, closure captures) and back-filling the missing increments. Until then, restore the universal walker gate as a safety net (no class-name dispatch; matches PR #599). Adds a PJ_DESTROY_TRACE=1 env-flag in DestroyDispatch.callDestroy to help future drift hunting. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

…ll path Identified the specific refCount drift that the walker gate has been papering over: anonymous CVs transiently hit refCount=0 between creation and being installed in the package stash. Documented the three concrete next-step audits (sub installation, closure capture, @_ promotion) and noted the temporary restoration of the universal walker gate from PR #599 as a safety net. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Replaces the three-bullet 'Next concrete steps' list with a structured plan for D-W6.1 (sub install), D-W6.2 (closure capture), and D-W6.3 (\@_ build). Each sub-phase ships: - A standalone reproducer in src/test/resources/unit/refcount/drift/ with a complete skeleton (no Moose/DBIC dependency). - A bisectable fix commit at the source (audit targets named). - A guarded gate-condition removal so the fix is verifiable and rollbackable. - Hard-regression-gate verification before merging. Sequencing: D-W6.1 first (highest ROI; likely fixes cdbi/04-lazy.t on its own). D-W6.2 next (clears the numeric_defaults regression pattern). D-W6.3 last (smallest blast radius). Effort estimate is included for each. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Outcome: D-W6.1 was misnamed. The simple sub-installation patterns (glob assignment, named sub, loop install, temp drop, nested install) all work correctly with the walker gate disabled — the bytecode for *Foo::bar = $cv is not the drift source. The actual drift, surfaced by PJ_DESTROY_TRACE during `use Class::MOP::Class`, is in Sub::Install's nested-closure pattern: *install_sub = _build_public_installer(_ignore_warnings(_installer)); Each layer is a closure returning a closure. The destroyed CV is NOT the one being assigned — it's an unrelated Sub::Install anon CV sitting in the deferred-decrement queue. The flush at the end of setLargeRefCounted processes that queue, decrementing refCount from 0 to -1 on a CV that is still held by an outer closure capture. So the next step is D-W6.2 (closure-capture drift). The reproducer landing here is kept as a regression guard for the simple-install path — when D-W6.2 lands and the gate is loosened, any regression in this area is caught immediately. Also enhances the PJ_DESTROY_TRACE output to include the CV's package::subname when available (key for finding which sub is being prematurely destroyed). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

fglock and others added 11 commits April 28, 2026 23:35

This was referenced Apr 29, 2026

docs(D-W6.4): weak-metaclass reproducer + concrete hypotheses for the actual drift #606

Open

fix(D-W6): three diagnostic env-flags + sharper Class::MOP drift diagnosis #607

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(D-W6.1): sub-install reproducer + revised drift diagnosis#603

docs(D-W6.1): sub-install reproducer + revised drift diagnosis#603
fglock wants to merge 11 commits intomasterfrom
fix/d-w6-1-sub-install-drift

fglock commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fglock commented Apr 29, 2026

Summary

What we actually found

What's in this PR

Test plan

Next

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant