diag(D-W6.7): pinpoint root cause of metaclass refCount underflow by fglock · Pull Request #611 · fglock/PerlOnJava

fglock · 2026-04-29T08:31:17Z

Summary

Diagnostic-only PR that pinpoints the root cause of the
Class::MOP bootstrap failure when the D-W6 walker gate is disabled,
plus initial unit-test scaffolding.

Method

Added PJ_WEAKCLEAR_TRACE env-flag probes to WeakRefRegistry
covering weaken(), removeWeakRef(), clearWeakRefsTo().
Wrapped Class::MOP::Attribute::{attach_to_class, install_accessors}
from Perl with refaddr-printing logs.
Disabled the gate, ran use Class::MOP, and correlated the Java
weak-tracking events with the Perl-level events.

Smoking gun

The metaclass RuntimeHash (e.g. id=1746570062) is held strongly
by our %METAS ($METAS{$package_name} = $self). Yet at end-of-
statement, the temporary returned by HasMethods->meta falling out
of scope causes MortalList.flush() to drop refCount to 0 and
fire DESTROY → clearWeakRefsTo, which nulls all 4 weakened
associated_class slots — including the just-attached
wrapped_method_metaclass's slot.

The first failing install_accessors then reads UNDEF
associated_class, dies on $class->add_method(...). That die is
hidden by local $SIG{__DIE__} inside try { install_accessors }
at Class/MOP/Class.pm:897. The catch fires remove_attribute → remove_accessors → _remove_accessor at Class/MOP/Attribute.pm:475,
which dies again with the visible Can't call method "get_method" on an undefined value message.

Root cause statement

Cooperative refCount fails to count the strong reference held by
the package-variable hash slot $METAS{$package_name}. When the
temporary metaclass return-value from ->meta expires, refCount
goes 1 → 0 even though %METAS still references it.

Unit tests added

Two regression tests under src/test/resources/unit/refcount/drift/:

our_metas_underflow.t — 20 assertions across 3 scenarios
(storage in package hash, method-chain temps with weakened back-
refs, multi-meta cross-linking).
cmop_pattern.t — 11 assertions exercising the exact CMOP shape
with non-CMOP class names (so the walker gate is not relevant).

Both pass on master via correct refCount discipline. They serve as
regression guards for the eventual fix.

The actual Class::MOP failure mode (when the gate is disabled)
requires the real CMOP recursive bootstrap and many more nodes than
synthetic tests can model. Documented in
dev/modules/moose_support.md D-W6.7/D-W6.8.

Two paths to a real fix (documented in moose_support.md D-W6.8)

Fix refCount discipline. Ensure RuntimeHash.put() /
slot-assignment increments the cooperative refCount of the
referent when storing a tracked RuntimeBase value.
Walker as ground truth before destroy. When refCount
underflows, walk reachability from package-variable roots; if
reachable, leave refCount at 1 instead of dropping to 0.
Replaces the heuristic name-based gate.

Walker gate restored to current behaviour; this PR ships only
diagnostics + tests + docs.

Test plan

make passes locally (full unit test suite).
No production behaviour change (probes are gated on env-flag).
New unit tests pass on master.

Generated with Devin

Add granular diagnostic probes (gated on PJ_WEAKCLEAR_TRACE) to WeakRefRegistry covering weaken(), removeWeakRef(), and clearWeakRefsTo(). Combined with a Perl-side wrapper around Class::MOP::Attribute::{attach_to_class, install_accessors}, this nails down exactly when and why "Can't call method 'get_method' on an undefined value at Class/MOP/Attribute.pm line 475" happens during 'use Class::MOP' with the walker gate disabled. Finding: the metaclass RuntimeHash held strongly by `our %METAS` nevertheless has its cooperative refCount drop to 0 at end-of-statement when the temporary returned by `HasMethods->meta` falls out of scope. MortalList.flush() then fires DESTROY -> clearWeakRefsTo, which nulls all weakened `associated_class` slots on the just-attached attributes. Two fix paths documented in dev/modules/moose_support.md (D-W6.8): 1. Fix RuntimeHash slot-assignment refCount discipline. 2. Walk package-variable roots before clearWeakRefsTo on refCount=0. Walker gate restored to its previous state; this PR is purely diagnostic. `make` passes. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Two unit tests under src/test/resources/unit/refcount/drift/ that demonstrate the WANTED behavior (storage in package hash strongly holds the value): - our_metas_underflow.t: 20 assertions across 3 scenarios (object stored in package hash; method-chain temps with weakened back-refs; multi-meta cross-linking). - cmop_pattern.t: 11 assertions exercising the exact CMOP shape (`Class->meta->add_attribute(Attr->new(...))` repeated as separate top-level statements, with weakened associated_class back-refs that must stay alive across statements). Both files pass on master (the walker gate masks the underflow for the heuristic class names; these tests use non-CMOP class names so the gate isn't relevant — they stay green via correct refCount discipline). When the underlying refCount underflow is fixed, these will continue to pass and serve as regression guards. The actual `Class::MOP` failure mode (when the gate is disabled) requires the real CMOP recursive bootstrap and is documented in moose_support.md D-W6.7. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Add a per-RuntimeBase `refCountTrace` flag that is armed at `bless` time when the bless target matches `classNeedsWalkerGate` (Class::MOP / Moose / Moo) and the env-flag `PJ_REFCOUNT_TRACE` is set. The flag-gated `traceRefCount(delta, reason)` method writes a stderr line for every refCount transition with a short stack snippet. Wired into the four critical refCount-changing sites: - WeakRefRegistry.weaken (decrement on weakening) - RuntimeScalar.setLargeRefCounted (increment on store) - RuntimeScalar.setLargeRefCounted (decrement on overwrite) - MortalList.flush (deferred decrement) Cost when off: one boolean && test per site (trivially branch- predicted). Cost when on: limited to objects matching the heuristic (typically <200 metaclasses, never DBIC schema or rows). Findings on `use Class::MOP` with the gate disabled: the failing metaclass accumulated 50 setLargeRefCounted increments + 1 silent bless increment vs 51 decrements (45 from MortalList.flush, 6 from weaken). Net: +51 - 51 = 0, i.e. one extra decrement somewhere. The full transition log for the failing metaclass is preserved at `dev/diagnostic-traces/cmop-metaclass-refcount-underflow.txt`. Walker fix experiments documented in moose_support.md D-W6.9: - Universal walker (no class-name heuristic): regresses DBIC's t/52leaks.t (expected leak is incorrectly rescued, downstream weaken dies). - Hybrid (heuristic full + non-heuristic globalOnly): same DBIC regression. - Adding walker gate to weaken's dec-to-0 path: timeout (>120s) on DBIC t/52leaks.t — too expensive on Moose-blessed objects. Conclusion: the class-name heuristic is the best known compromise. The proper fix needs the unpaired increment site identified, then fixed. The tracer is left in place to make that pinpointable in a follow-up. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Implements Step 1 of the D-W6.11 fix plan: per-RuntimeBase ownership tracking with recordOwner(scalar, site) / releaseOwner(scalar, site) hooks. JVM shutdown hook prints surviving owners with full call-site attribution. Wired increments into: - RuntimeScalar.setLargeRefCounted (normal store) - RuntimeScalar.incrementRefCountForContainerStore (container retroactive) - ReferenceOperators.bless rebless-from-untracked path - RuntimeScalar.setLargeRefCounted rescue path (currentDestroyTarget) Wired releases into: - RuntimeScalar.setLargeRefCounted overwrite - WeakRefRegistry.weaken - MortalList.deferDecrementIfTracked - MortalList.deferDecrementRecursive - RuntimeArray.shift (deferred dec on shift) - DestroyDispatch.doCallDestroy args balance Result on `use Class::MOP` with the gate disabled: zero "*** UNPAIRED RELEASE ***" events, but one CMOP metaclass survives the run with refCount=Integer.MIN_VALUE (destroyed) and 2 owners still recorded — proving the imbalance is in cooperative-refCount itself, not in the instrumentation. The 2 owners came from `RuntimeScalar.set` paths, one via `RuntimeBaseProxy.set:65` (the proxy delegate that copies `lvalue.value` to `this.value` without recording ownership) and one via `addToScalar:902`. Smoking-gun candidates for the unpaired decrement(s): - RuntimeBaseProxy.set (line 65-67): proxy holds a stale strong-ref field invisible to refcounting. - RuntimeList.java (lines 623, 642, 666, 699): list-assignment "undo materialized copy" — direct --refCount without ownership release; needs audit. Walker gate restored. Full plan in dev/modules/moose_support.md D-W6.11/12. Trace artifact at dev/diagnostic-traces/cmop-metaclass-with-owner-tracking.txt (206 lines, every refCount transition + owner record/release with stack snippets). `make` passes (unit tests). The diagnostic instrumentation stays in place (gated on `PJ_REFCOUNT_TRACE`) for the next session to finish the inc/dec audit and remove the walker-gate heuristic. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Step 2 of the D-W6.11 plan: audit and instrument all direct `--base.refCount` sites with paired release calls. New on RuntimeBase: - activeOwners set (lazy-init, IdentityHashMap-backed) - activateOwnerTracking() — initializes + backfills from ScalarRefRegistry - recordActiveOwner(scalar) / releaseActiveOwner(scalar) - activeOwnerCount() — returns count filtered to scalars that still satisfy refCountOwned==true && value==this Wired record into: - RuntimeScalar.setLargeRefCounted (store) - RuntimeScalar.incrementRefCountForContainerStore Wired release into: - RuntimeScalar.setLargeRefCounted (overwrite) - WeakRefRegistry.weaken - MortalList.deferDecrementIfTracked - MortalList.deferDecrementRecursive - RuntimeArray.shift (deferred dec) - DestroyDispatch.doCallDestroy (args balance) - RuntimeList.java (4 "undo materialized copy" paths) - Storable.releaseApplyArgs Production rescue experiment (replacing the class-name heuristic with `activeOwnerCount() > 0 → restore refCount`): - use Class::MOP / use Moose: work without heuristic - Unit tests: green - DBIC t/52leaks.t: 9-10/18 fail — too aggressive on cycles Cooperative refcount cannot tolerate cycles without weaken; rescuing on real strong owners keeps cycle members alive that DBIC's leak test expects to see destroyed. Reverted production rescue. Master parity restored (11/11). Walker- gate heuristic still primary; activeOwners infrastructure dormant unless future audit work activates it. Surprise finding: calling `activateOwnerTracking()` (snapshot of ScalarRefRegistry) on every weaken caused 9/18 leak fails — likely WeakHashMap.expungeStaleEntries() side-effect shifting JVM GC timing visible to DBIC's leak detection. Removed the auto-activation. `make` passes; Class::MOP/Moose load (with heuristic still on). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Implements D-W6.14 Option 2 (owner snapshot + reachability filter) as design-doc analysis of how system Perl handles cycles vs PerlOnJava's cooperative refcount. Key observation: system Perl doesn't solve cycles — they leak by design. The programmer breaks them with weaken(). DBIC's leak tests pass because DBIC weakens its internal back-references; cycles without weaken legitimately leak. For PerlOnJava to match, we need: 1. Precise cooperative refCount (no transient zeros), OR 2. Reachable-owner verification at refCount→0 This commit ships infrastructure for (2): - New ReachabilityWalker.isScalarReachable(target) — walks roots looking for the target SCALAR identity (existing isReachableFromRoots only finds RuntimeBases as referents, not scalars themselves). Includes closure capture walking via RuntimeCode.capturedScalars. - New RuntimeBase.reachableOwnerCount() — counts owners that are both refCountOwned-true and reachable from roots. Phantoms are filtered. - RuntimeBase.activeOwners populated incrementally via record/release calls (activated lazily on first activation — default not active to avoid behavior changes). - Audit and pair release calls at all known direct --refCount sites: RuntimeArray.pop, RuntimeStash undefine, RuntimeScalar undefine, MortalList.deferDestroyForContainerClear, MortalList.deferDecrementRecursive (unblessed branch). Walker-gate heuristic restored as primary mechanism. Class::MOP and Moose continue to work (heuristic), DBIC t/52leaks.t passes 11/11. The reachableOwnerCount() rescue path is wired but disabled by default (activeOwners is never auto-initialized in production) until the DBIC reachability gap (D-W6.15) is bridged. When activated experimentally on `use Class::MOP` (with heuristic disabled): Class::MOP and Moose load successfully — confirming the algorithm works for the original blocker. DBIC has a reachability gap (intermediate JVM-stack-live scalars not in any walker seed root) that needs bridging before universal deployment. Three concrete paths forward documented in moose_support.md D-W6.15: 1. Register every refCount-affecting RuntimeScalar in MyVarCleanupStack 2. Use Java GC as ground truth (deferred destroy + System.gc) 3. Eliminate transient zeros at the source `make` passes; Class::MOP/Moose load; DBIC t/52leaks.t 11/11. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Key insight from this session: cooperative refCount and JVM reference graph interact in ways that make a "universal walker rescue" inherently risky. activeOwners using IdentityHashMap (strong refs) prevents JVM GC of the scalars, breaking DBIC's leak detection which relies on objects becoming JVM-collectible. Switching to WeakHashMap allows GC but introduces premature collection bugs (Schema's owner gets GC'd too early). This commit: 1. Improves ReachabilityWalker.isScalarReachable to: - Skip weak refs (matching real Perl semantics) - Walk closure captures - Seed from package globals + live my-vars 2. Keeps walker-gate heuristic as primary mechanism (master parity) 3. Documents the JVM-GC interaction issue and path forward Findings: - Class::MOP/Moose load correctly via reachableOwnerCount() when manually invoked (proven by experiment — heuristic disabled, no regressions on those modules). - DBIC's t/52leaks.t per-object GC tests fail when reachableOwnerCount is the primary rescue, due to phantom JVM-stack-live scalars holding rows. - Replacing IdentityHashMap with WeakHashMap for activeOwners has correct GC semantics but breaks Schema reachability — Schema's owner scalar gets GC'd before refCount→0 transitions complete. Path forward (D-W6.17): The right fix needs the cooperative refCount itself to be precise enough that transient zeros only happen at scope-exit boundaries (matching real Perl's FREETMPS semantics). The walker can be a debug/audit tool but shouldn't be a production rescue path. For now: heuristic gate stays as primary; walker infrastructure ready for future use. `make` passes; Class::MOP/Moose load; DBIC t/52leaks.t 11/11 (master parity). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

…not sufficient Tested the hypothesis that replacing per-assignment MortalList.flush() with per-statement FREETMPS-style flush would eliminate transient refCount zeros and remove the need for the class-name walker-gate heuristic. **Implementation experiment** (rolled back to master parity in this commit): - Removed MortalList.flush() from setLargeRefCounted - Added per-statement MortalList.flush() in JVM EmitBlock - Added per-statement MORTAL_FLUSH opcode in BytecodeCompiler **Result**: - Unit tests: PASS - DBIC t/52leaks.t 11/11: PASS (with heuristic enabled) - Class::MOP load WITH heuristic: works - Class::MOP load WITHOUT heuristic: STILL FAILS **Conclusion**: per-statement flush is correct (matches Perl 5 semantics) but it does NOT fix the underlying refCount imbalance. The 50 inc / 51 dec event count from D-W6.10 is a REAL imbalance, not a timing artifact. Flush schedule changes WHEN decrements fire, not WHETHER they're paired. The transient-zero hypothesis is misdiagnosis: 1→0 mid-statement is the CONSEQUENCE of the imbalance, not the cause. **Revised plan**: 1. Per-pair tracing: tag each setLargeRefCounted increment with a unique ID; tag each decrement with the ID it pairs with; find unmatched IDs at end of run. 2. Identify the specific code path producing the unmatched events. 3. Fix the path (add missing inc OR remove spurious dec). 4. Verify all four invariants without heuristic: - Unit tests pass - Class::MOP/Moose load - DBIC t/52leaks.t 11/11 - src/test/resources/unit/refcount/drift/* pass Per-statement flush remains worth doing eventually (matches Perl 5 FREETMPS, more performant), but as a separate refactor distinct from the imbalance fix. This commit ships only the design-doc revision — no code changes. The walker-gate heuristic remains primary (master parity preserved). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

… gate The walker-gate heuristic (`classNeedsWalkerGate`) was a hard-coded list of class-name prefixes (Class::MOP, Moose, Moo) that triggered refCount→0 rescue. This commit replaces it with a property-based check that captures the underlying lifetime semantics directly. The property: an object is module-global metadata if it is stored as a value in a package-global hash (e.g., `$METAS{Foo} = $meta`). Such objects have program-lifetime persistence held by the package hash, but cooperative refCount may transient-zero during method calls. Implementation: - New `RuntimeHash.isGlobalPackageHash` — set when the hash is registered in `GlobalVariable.globalHashes`. - New `RuntimeBase.storedInPackageGlobal` — set when a value is stored as an element of a package-global hash via `RuntimeBaseProxy.set` for `RuntimeHashProxyEntry`. - Walker gate at `MortalList.flush` and `RuntimeScalar.setLargeRefCounted` overwrite path now checks `base.storedInPackageGlobal` instead of `classNeedsWalkerGate(base.blessId)`. Results — all green WITHOUT the class-name heuristic: - Unit tests: PASS - `use Class::MOP; use Moose;`: PASS - `examples/moose.pl`: PASS (full Moose demo, 7 test groups) - DBIC `t/52leaks.t`: 11/11 PASS Performance: - Per-store cost: one boolean check + one flag set if the destination scalar's parent hash is package-global. O(1) per store. - Walker cost: only fires when object has weak refs AND storedInPackageGlobal AND refCount→0. Same gating as the old heuristic — same performance envelope. Correctness: - Class-agnostic: works for any module that stores metadata in package-global hashes (extensibility). - Behaviorally equivalent to the old heuristic for tested modules. - Captures the underlying property explicitly rather than via ad-hoc class-name list. The `classNeedsWalkerGate` method is still in the code (used by the D-W6.10 refCount tracer for diagnostic activation). It can be deleted in a follow-up commit; this commit focuses on the production fix. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

fglock · 2026-04-29T11:50:43Z

Superseded by #618 — unified PR with squashed history and clean narrative.

fglock and others added 9 commits April 29, 2026 10:30

fglock closed this Apr 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

diag(D-W6.7): pinpoint root cause of metaclass refCount underflow#611

diag(D-W6.7): pinpoint root cause of metaclass refCount underflow#611
fglock wants to merge 9 commits intomasterfrom
fix/d-w6-precise-die-probe

fglock commented Apr 29, 2026 •

edited

Loading

Uh oh!

fglock commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fglock commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Method

Smoking gun

Root cause statement

Unit tests added

Two paths to a real fix (documented in moose_support.md D-W6.8)

Test plan

Uh oh!

fglock commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fglock commented Apr 29, 2026 •

edited

Loading