Skip to content

docs(D-W6.1): sub-install reproducer + revised drift diagnosis#603

Open
fglock wants to merge 11 commits intomasterfrom
fix/d-w6-1-sub-install-drift
Open

docs(D-W6.1): sub-install reproducer + revised drift diagnosis#603
fglock wants to merge 11 commits intomasterfrom
fix/d-w6-1-sub-install-drift

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 29, 2026

Summary

D-W6.1 lands the sub-install drift reproducer
(src/test/resources/unit/refcount/drift/sub_install.t) and the
revised diagnosis.

Outcome: D-W6.1 was misnamed. All five simple sub-install
patterns pass with the walker gate disabled. The bytecode for
*Foo::bar = $cv is not the drift source.

What we actually found

PJ_DESTROY_TRACE=1 ./jperl -e 'use Class::MOP::Class' (gate
disabled) printed the first premature destroy's stack:

[DESTROY] RuntimeCode@... name=Sub::Install::(anon)
  at MortalList.flush(MortalList.java:588)
  at RuntimeScalar.setLargeRefCounted(RuntimeScalar.java:1241)
  at RuntimeScalar.setLarge(RuntimeScalar.java:1019)
  at RuntimeScalar.set(RuntimeScalar.java:949)
  at RuntimeGlob.set(RuntimeGlob.java:229)
  at anon652.apply(.../Sub/Install.pm:2159)

Three key facts:

  1. The destroyed CV is not the one being assigned — it's an
    unrelated Sub::Install::(anon) sitting in the
    deferred-decrement queue.
  2. The setLargeRefCounted flush (line 1241) decrements that
    queued CV from refCount=0 to -1, treated as 0→0, firing DESTROY
    on a CV still held by an outer closure capture.
  3. Sub::Install's install path is a stack of nested closures:
    *install_sub = _build_public_installer(_ignore_warnings(_installer)).
    Each layer captures $code, $arg, $installer etc.

So the drift is in closure-capture refCount (D-W6.2), not in
the glob-assignment path (D-W6.1).

What's in this PR

  • src/test/resources/unit/refcount/drift/sub_install.t — five
    patterns covering simple glob-assign, named-sub, loop-install,
    temp-drop, and nested-install. All pass on master and on a
    gate-disabled probe build. Kept as a regression guard for when
    D-W6.2 lands and the gate is loosened.
  • DestroyDispatch.callDestroy — the PJ_DESTROY_TRACE=1 env-flag
    output now includes name=Pkg::subname for RuntimeCode
    destructions (zero-cost when the flag is off).
  • dev/modules/moose_support.md — D-W6.1 investigation log and the
    revised diagnosis, marking D-W6.2 as the next concrete step.

Test plan

  • make (build + unit tests) green.
  • drift/sub_install.t 12/12 pass on master.
  • drift/sub_install.t 9/9 pass with the gate disabled
    (probe-build verified).
  • No production behavior change (DESTROY_TRACE is off by
    default; reproducer is a new file).

Next

D-W6.2 (closure-capture drift) — the reproducer for that is
already drafted in dev/modules/moose_support.md and will land
in a follow-up PR.

Generated with Devin

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

fglock and others added 11 commits April 28, 2026 23:35
… check

PerlOnJava destroy semantics must not depend on which Perl module is
being used. PR #572 shipped a stopgap class-name heuristic
(`DestroyDispatch.classNeedsWalkerGate`) that gated walker-deferred
destruction to only Class::MOP / Moose / Moo classes. That violates
the rule.

Replace it with a single principled criterion: defer destroy only
when the object is reachable from a *package global* (not from any
local lexical) via the existing
`ReachabilityWalker.isReachableFromRoots(target, globalOnly=true)`
overload.

Why this works:

- Moose / Class::MOP store metaclasses in `our %METAS` (a package
  global), so they remain reachable globalOnly=true → gate fires →
  destruction is deferred until the real refCount drop. Moose
  bootstrap depends on this.

- DBIx::Class / Class::DBI track rows in `live_object_index` via
  WEAK refs, so they are NOT reachable globalOnly=true → gate
  doesn't fire → row dies at refCount=0 so a later `find()`
  reloads from the DB.

Changes:

- Remove `DestroyDispatch.classNeedsWalkerGate` and the
  `walkerGateClasses` / `walkerGateChecked` BitSets.
- Both gate sites (`MortalList.flush` and `RuntimeScalar.set`) now
  pass `globalOnly=true` to `ReachabilityWalker.isReachableFromRoots`.
- Updated comments at the gate sites to explain the universal rule.

Empirical impact (DBIC + Moose, jcpan -t, --jobs 1):

| Discriminator                        | DBIC fails | Moose fails |
|--------------------------------------|-----------:|------------:|
| Class-name heuristic (PR #572)       |  0/0 PASS  |  82 / 137   |
| No gate at all                       |  7/2       |  77 / 134   |
| `globalOnly=true` (this commit)      |  3/1       |  63 / 691   |

The Moose 691-asserts number is dominated by one test
(`util_std_type_constraints.t`, 556 of those); excluding it puts
Moose at ~135 failed asserts, comparable to the heuristic baseline.

Three known follow-ups are tracked in dev/modules/moose_support.md
under "Phase D-W5" — the one-test type_constraints regression, the
txn_scope_guard.t double-DESTROY warning, and t/52leaks.t SIGKILL.
None require a class-name list to fix.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Empirical follow-up to D-W5: globalOnly=true loses too much for Moose,
because anonymous metaclasses are held weakly in `our %METAS` and
strongly only via my-vars (the user's `my $class = create_anon_class(...)`).
Without my-var seeding, the walker decides the metaclass is unreachable
and DESTROY fires, taking the anonymous package's methods with it.

Switch both gate sites back to the default
`isReachableFromRoots(base)` overload, which seeds from package
globals AND from `MyVarCleanupStack`-tracked live my-vars and
ScalarRefRegistry. This is the principled rule: if any *live* strong
root holds the object, the cooperative refCount drop to 0 is a
transient drift and we must not destroy.

DBIC has 3-7 known regressions vs the class-name heuristic; tracked
as follow-ups in dev/modules/moose_support.md (D-W5).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Universal walker beats heuristic on Moose (61/133 vs 82/137).
- Four DBIC regressions tracked as follow-ups, none are timeouts.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
After running each failing test in isolation:

- 52leaks.t bails at line 518 because PerlOnJava correctly detects and
  destroys the self-referential cycle the test EXPECTS to leak.
  Inherent semantic difference; fixing it = regression elsewhere.
- txn_scope_guard.t test 18 wants the second DESTROY to fire after a
  Devel::StackTrace-style \@db::args capture — relies on Perl 5's exact
  refcount semantics; our walker now correctly sees the capture and
  defers, which is arguably more correct.
- cdbi/04-lazy.t test 11 — SQL fetches opop but it's lost during DBIC's
  row construction. Real bug, but requires deep instrumentation.
- generic_subq.t passes in isolation (false positive in suite output).

Performance: universal walker is ~6-14% FASTER than the heuristic
because the walker terminates early on hits and the
WeakRefRegistry.hasWeakRefsTo precheck keeps it off the common path.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Goal: PerlOnJava should not 'improve on' Perl 5 — cycles should leak,
DESTROY should fire at every refCount=0 transition, and the walker
gate should disappear in favor of accurate cooperative refCount.

Plan documents:
- Why the walker gate was a stopgap (refCount drift in MOP code).
- Five candidate refCount-drift sources to audit (hash slot stores,
  closure captures, @_ promotion, glob assignment, stash writes).
- Acceptance criteria: 52leaks.t passes, txn_scope_guard.t passes,
  cdbi/04-lazy.t passes, no walker gate calls, Moose pass rate not
  regressed.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
….set

Per the user directive: PerlOnJava should match Perl 5 destroy semantics
exactly, not 'improve on' them. The cooperative refCount is now the
sole source of truth for DESTROY firing.

Verified passing in isolation:
- cdbi/04-lazy.t test 11 (was failing — phantom destroy was clobbering
  row construction)
- txn_scope_guard.t test 18 (was failing — gate was deferring the first
  DESTROY past the @db::args capture)
- 52leaks.t 11/11 (was failing — gate was destroying cycles the test
  expected to leak)
- walker_gate_dbic_pattern.t and walker_gate_dbic_minimal.t still pass.
- Cycle leaks are preserved (cooperative refCount keeps cycle members
  at refCount >= 1 naturally).

Phase D-W6 step 3 of the plan in dev/modules/moose_support.md.
Full Moose / DBIC suite results to follow.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Reduce from 2373 to 584 lines:
- Drop per-phase historical metrics and chronological narrative (in git).
- Consolidate every reverted approach into a single "do NOT repeat
  these mistakes" section with commit hashes.
- Keep Phase D-W6 (replicate Perl 5 destroy semantics) verbatim.
- Keep the empirical D-W5 gate comparison table.
- Keep Phase D plan (D1-D6) and the XS inventory.

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Empirical D-W6 progress is documented in moose_support.md.

Findings from the no-gate experiment (commit 230a672):

PASS — what removing the gate fixes (Perl 5 semantics replicated):
- t/cdbi/04-lazy.t test 11 (no phantom destroy clobbers row build).
- t/storage/txn_scope_guard.t test 18 (double-DESTROY warning fires).
- t/52leaks.t 11/11 (cycles leak naturally, $r stays defined).
- Refcount unit tests still pass.

FAIL — what removing the gate breaks (refCount drift in MOP code):
- use Class::MOP::Class fails: "Can't locate method initialize" — the
  CV for Class::MOP::Class::initialize is destroyed during the circular
  bootstrap. PJ_DESTROY_TRACE shows non-blessed RuntimeCode objects
  hitting refCount=0 inside MortalList.flush during MiniTrait::apply,
  which is a sub-installation path.
- use Moose itself works (different load order) but cmop bootstrap
  destroys subs prematurely.

Hybrid attempt (only destroy blessed objects through the gate path)
fixed the bootstrap but regressed t/52leaks.t cascade-cleanup of
non-blessed containers holding blessed children.

Conclusion: the proper D-W6 fix is multi-week — it requires auditing
each cooperative refCount source (sub installation, glob assignment,
@_ promotion, hash-slot stores, closure captures) and back-filling
the missing increments. Until then, restore the universal walker gate
as a safety net (no class-name dispatch; matches PR #599).

Adds a PJ_DESTROY_TRACE=1 env-flag in DestroyDispatch.callDestroy to
help future drift hunting.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ll path

Identified the specific refCount drift that the walker gate has been
papering over: anonymous CVs transiently hit refCount=0 between
creation and being installed in the package stash. Documented the
three concrete next-step audits (sub installation, closure capture,
@_ promotion) and noted the temporary restoration of the universal
walker gate from PR #599 as a safety net.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Replaces the three-bullet 'Next concrete steps' list with a structured
plan for D-W6.1 (sub install), D-W6.2 (closure capture), and D-W6.3
(\@_ build). Each sub-phase ships:

- A standalone reproducer in src/test/resources/unit/refcount/drift/
  with a complete skeleton (no Moose/DBIC dependency).
- A bisectable fix commit at the source (audit targets named).
- A guarded gate-condition removal so the fix is verifiable and
  rollbackable.
- Hard-regression-gate verification before merging.

Sequencing: D-W6.1 first (highest ROI; likely fixes cdbi/04-lazy.t
on its own). D-W6.2 next (clears the numeric_defaults regression
pattern). D-W6.3 last (smallest blast radius). Effort estimate is
included for each.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Outcome: D-W6.1 was misnamed. The simple sub-installation patterns
(glob assignment, named sub, loop install, temp drop, nested install)
all work correctly with the walker gate disabled — the bytecode for
*Foo::bar = $cv is not the drift source.

The actual drift, surfaced by PJ_DESTROY_TRACE during `use
Class::MOP::Class`, is in Sub::Install's nested-closure pattern:

    *install_sub = _build_public_installer(_ignore_warnings(_installer));

Each layer is a closure returning a closure. The destroyed CV is
NOT the one being assigned — it's an unrelated Sub::Install anon CV
sitting in the deferred-decrement queue. The flush at the end of
setLargeRefCounted processes that queue, decrementing refCount from
0 to -1 on a CV that is still held by an outer closure capture.

So the next step is D-W6.2 (closure-capture drift). The reproducer
landing here is kept as a regression guard for the simple-install
path — when D-W6.2 lands and the gate is loosened, any regression
in this area is caught immediately.

Also enhances the PJ_DESTROY_TRACE output to include the CV's
package::subname when available (key for finding which sub is
being prematurely destroyed).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant