diag(D-W6.7): pinpoint root cause of metaclass refCount underflow#611
Closed
diag(D-W6.7): pinpoint root cause of metaclass refCount underflow#611
Conversation
Add granular diagnostic probes (gated on PJ_WEAKCLEAR_TRACE) to
WeakRefRegistry covering weaken(), removeWeakRef(), and
clearWeakRefsTo(). Combined with a Perl-side wrapper around
Class::MOP::Attribute::{attach_to_class, install_accessors}, this
nails down exactly when and why "Can't call method 'get_method' on an
undefined value at Class/MOP/Attribute.pm line 475" happens during
'use Class::MOP' with the walker gate disabled.
Finding: the metaclass RuntimeHash held strongly by `our %METAS`
nevertheless has its cooperative refCount drop to 0 at end-of-statement
when the temporary returned by `HasMethods->meta` falls out of scope.
MortalList.flush() then fires DESTROY -> clearWeakRefsTo, which nulls
all weakened `associated_class` slots on the just-attached attributes.
Two fix paths documented in dev/modules/moose_support.md (D-W6.8):
1. Fix RuntimeHash slot-assignment refCount discipline.
2. Walk package-variable roots before clearWeakRefsTo on refCount=0.
Walker gate restored to its previous state; this PR is purely
diagnostic. `make` passes.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two unit tests under src/test/resources/unit/refcount/drift/ that demonstrate the WANTED behavior (storage in package hash strongly holds the value): - our_metas_underflow.t: 20 assertions across 3 scenarios (object stored in package hash; method-chain temps with weakened back-refs; multi-meta cross-linking). - cmop_pattern.t: 11 assertions exercising the exact CMOP shape (`Class->meta->add_attribute(Attr->new(...))` repeated as separate top-level statements, with weakened associated_class back-refs that must stay alive across statements). Both files pass on master (the walker gate masks the underflow for the heuristic class names; these tests use non-CMOP class names so the gate isn't relevant — they stay green via correct refCount discipline). When the underlying refCount underflow is fixed, these will continue to pass and serve as regression guards. The actual `Class::MOP` failure mode (when the gate is disabled) requires the real CMOP recursive bootstrap and is documented in moose_support.md D-W6.7. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add a per-RuntimeBase `refCountTrace` flag that is armed at `bless` time when the bless target matches `classNeedsWalkerGate` (Class::MOP / Moose / Moo) and the env-flag `PJ_REFCOUNT_TRACE` is set. The flag-gated `traceRefCount(delta, reason)` method writes a stderr line for every refCount transition with a short stack snippet. Wired into the four critical refCount-changing sites: - WeakRefRegistry.weaken (decrement on weakening) - RuntimeScalar.setLargeRefCounted (increment on store) - RuntimeScalar.setLargeRefCounted (decrement on overwrite) - MortalList.flush (deferred decrement) Cost when off: one boolean && test per site (trivially branch- predicted). Cost when on: limited to objects matching the heuristic (typically <200 metaclasses, never DBIC schema or rows). Findings on `use Class::MOP` with the gate disabled: the failing metaclass accumulated 50 setLargeRefCounted increments + 1 silent bless increment vs 51 decrements (45 from MortalList.flush, 6 from weaken). Net: +51 - 51 = 0, i.e. one extra decrement somewhere. The full transition log for the failing metaclass is preserved at `dev/diagnostic-traces/cmop-metaclass-refcount-underflow.txt`. Walker fix experiments documented in moose_support.md D-W6.9: - Universal walker (no class-name heuristic): regresses DBIC's t/52leaks.t (expected leak is incorrectly rescued, downstream weaken dies). - Hybrid (heuristic full + non-heuristic globalOnly): same DBIC regression. - Adding walker gate to weaken's dec-to-0 path: timeout (>120s) on DBIC t/52leaks.t — too expensive on Moose-blessed objects. Conclusion: the class-name heuristic is the best known compromise. The proper fix needs the unpaired increment site identified, then fixed. The tracer is left in place to make that pinpointable in a follow-up. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Implements Step 1 of the D-W6.11 fix plan: per-RuntimeBase ownership tracking with recordOwner(scalar, site) / releaseOwner(scalar, site) hooks. JVM shutdown hook prints surviving owners with full call-site attribution. Wired increments into: - RuntimeScalar.setLargeRefCounted (normal store) - RuntimeScalar.incrementRefCountForContainerStore (container retroactive) - ReferenceOperators.bless rebless-from-untracked path - RuntimeScalar.setLargeRefCounted rescue path (currentDestroyTarget) Wired releases into: - RuntimeScalar.setLargeRefCounted overwrite - WeakRefRegistry.weaken - MortalList.deferDecrementIfTracked - MortalList.deferDecrementRecursive - RuntimeArray.shift (deferred dec on shift) - DestroyDispatch.doCallDestroy args balance Result on `use Class::MOP` with the gate disabled: zero "*** UNPAIRED RELEASE ***" events, but one CMOP metaclass survives the run with refCount=Integer.MIN_VALUE (destroyed) and 2 owners still recorded — proving the imbalance is in cooperative-refCount itself, not in the instrumentation. The 2 owners came from `RuntimeScalar.set` paths, one via `RuntimeBaseProxy.set:65` (the proxy delegate that copies `lvalue.value` to `this.value` without recording ownership) and one via `addToScalar:902`. Smoking-gun candidates for the unpaired decrement(s): - RuntimeBaseProxy.set (line 65-67): proxy holds a stale strong-ref field invisible to refcounting. - RuntimeList.java (lines 623, 642, 666, 699): list-assignment "undo materialized copy" — direct --refCount without ownership release; needs audit. Walker gate restored. Full plan in dev/modules/moose_support.md D-W6.11/12. Trace artifact at dev/diagnostic-traces/cmop-metaclass-with-owner-tracking.txt (206 lines, every refCount transition + owner record/release with stack snippets). `make` passes (unit tests). The diagnostic instrumentation stays in place (gated on `PJ_REFCOUNT_TRACE`) for the next session to finish the inc/dec audit and remove the walker-gate heuristic. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Step 2 of the D-W6.11 plan: audit and instrument all direct `--base.refCount` sites with paired release calls. New on RuntimeBase: - activeOwners set (lazy-init, IdentityHashMap-backed) - activateOwnerTracking() — initializes + backfills from ScalarRefRegistry - recordActiveOwner(scalar) / releaseActiveOwner(scalar) - activeOwnerCount() — returns count filtered to scalars that still satisfy refCountOwned==true && value==this Wired record into: - RuntimeScalar.setLargeRefCounted (store) - RuntimeScalar.incrementRefCountForContainerStore Wired release into: - RuntimeScalar.setLargeRefCounted (overwrite) - WeakRefRegistry.weaken - MortalList.deferDecrementIfTracked - MortalList.deferDecrementRecursive - RuntimeArray.shift (deferred dec) - DestroyDispatch.doCallDestroy (args balance) - RuntimeList.java (4 "undo materialized copy" paths) - Storable.releaseApplyArgs Production rescue experiment (replacing the class-name heuristic with `activeOwnerCount() > 0 → restore refCount`): - use Class::MOP / use Moose: work without heuristic - Unit tests: green - DBIC t/52leaks.t: 9-10/18 fail — too aggressive on cycles Cooperative refcount cannot tolerate cycles without weaken; rescuing on real strong owners keeps cycle members alive that DBIC's leak test expects to see destroyed. Reverted production rescue. Master parity restored (11/11). Walker- gate heuristic still primary; activeOwners infrastructure dormant unless future audit work activates it. Surprise finding: calling `activateOwnerTracking()` (snapshot of ScalarRefRegistry) on every weaken caused 9/18 leak fails — likely WeakHashMap.expungeStaleEntries() side-effect shifting JVM GC timing visible to DBIC's leak detection. Removed the auto-activation. `make` passes; Class::MOP/Moose load (with heuristic still on). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Implements D-W6.14 Option 2 (owner snapshot + reachability filter) as design-doc analysis of how system Perl handles cycles vs PerlOnJava's cooperative refcount. Key observation: system Perl doesn't solve cycles — they leak by design. The programmer breaks them with weaken(). DBIC's leak tests pass because DBIC weakens its internal back-references; cycles without weaken legitimately leak. For PerlOnJava to match, we need: 1. Precise cooperative refCount (no transient zeros), OR 2. Reachable-owner verification at refCount→0 This commit ships infrastructure for (2): - New ReachabilityWalker.isScalarReachable(target) — walks roots looking for the target SCALAR identity (existing isReachableFromRoots only finds RuntimeBases as referents, not scalars themselves). Includes closure capture walking via RuntimeCode.capturedScalars. - New RuntimeBase.reachableOwnerCount() — counts owners that are both refCountOwned-true and reachable from roots. Phantoms are filtered. - RuntimeBase.activeOwners populated incrementally via record/release calls (activated lazily on first activation — default not active to avoid behavior changes). - Audit and pair release calls at all known direct --refCount sites: RuntimeArray.pop, RuntimeStash undefine, RuntimeScalar undefine, MortalList.deferDestroyForContainerClear, MortalList.deferDecrementRecursive (unblessed branch). Walker-gate heuristic restored as primary mechanism. Class::MOP and Moose continue to work (heuristic), DBIC t/52leaks.t passes 11/11. The reachableOwnerCount() rescue path is wired but disabled by default (activeOwners is never auto-initialized in production) until the DBIC reachability gap (D-W6.15) is bridged. When activated experimentally on `use Class::MOP` (with heuristic disabled): Class::MOP and Moose load successfully — confirming the algorithm works for the original blocker. DBIC has a reachability gap (intermediate JVM-stack-live scalars not in any walker seed root) that needs bridging before universal deployment. Three concrete paths forward documented in moose_support.md D-W6.15: 1. Register every refCount-affecting RuntimeScalar in MyVarCleanupStack 2. Use Java GC as ground truth (deferred destroy + System.gc) 3. Eliminate transient zeros at the source `make` passes; Class::MOP/Moose load; DBIC t/52leaks.t 11/11. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Key insight from this session: cooperative refCount and JVM reference graph interact in ways that make a "universal walker rescue" inherently risky. activeOwners using IdentityHashMap (strong refs) prevents JVM GC of the scalars, breaking DBIC's leak detection which relies on objects becoming JVM-collectible. Switching to WeakHashMap allows GC but introduces premature collection bugs (Schema's owner gets GC'd too early). This commit: 1. Improves ReachabilityWalker.isScalarReachable to: - Skip weak refs (matching real Perl semantics) - Walk closure captures - Seed from package globals + live my-vars 2. Keeps walker-gate heuristic as primary mechanism (master parity) 3. Documents the JVM-GC interaction issue and path forward Findings: - Class::MOP/Moose load correctly via reachableOwnerCount() when manually invoked (proven by experiment — heuristic disabled, no regressions on those modules). - DBIC's t/52leaks.t per-object GC tests fail when reachableOwnerCount is the primary rescue, due to phantom JVM-stack-live scalars holding rows. - Replacing IdentityHashMap with WeakHashMap for activeOwners has correct GC semantics but breaks Schema reachability — Schema's owner scalar gets GC'd before refCount→0 transitions complete. Path forward (D-W6.17): The right fix needs the cooperative refCount itself to be precise enough that transient zeros only happen at scope-exit boundaries (matching real Perl's FREETMPS semantics). The walker can be a debug/audit tool but shouldn't be a production rescue path. For now: heuristic gate stays as primary; walker infrastructure ready for future use. `make` passes; Class::MOP/Moose load; DBIC t/52leaks.t 11/11 (master parity). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…not sufficient Tested the hypothesis that replacing per-assignment MortalList.flush() with per-statement FREETMPS-style flush would eliminate transient refCount zeros and remove the need for the class-name walker-gate heuristic. **Implementation experiment** (rolled back to master parity in this commit): - Removed MortalList.flush() from setLargeRefCounted - Added per-statement MortalList.flush() in JVM EmitBlock - Added per-statement MORTAL_FLUSH opcode in BytecodeCompiler **Result**: - Unit tests: PASS - DBIC t/52leaks.t 11/11: PASS (with heuristic enabled) - Class::MOP load WITH heuristic: works - Class::MOP load WITHOUT heuristic: STILL FAILS **Conclusion**: per-statement flush is correct (matches Perl 5 semantics) but it does NOT fix the underlying refCount imbalance. The 50 inc / 51 dec event count from D-W6.10 is a REAL imbalance, not a timing artifact. Flush schedule changes WHEN decrements fire, not WHETHER they're paired. The transient-zero hypothesis is misdiagnosis: 1→0 mid-statement is the CONSEQUENCE of the imbalance, not the cause. **Revised plan**: 1. Per-pair tracing: tag each setLargeRefCounted increment with a unique ID; tag each decrement with the ID it pairs with; find unmatched IDs at end of run. 2. Identify the specific code path producing the unmatched events. 3. Fix the path (add missing inc OR remove spurious dec). 4. Verify all four invariants without heuristic: - Unit tests pass - Class::MOP/Moose load - DBIC t/52leaks.t 11/11 - src/test/resources/unit/refcount/drift/* pass Per-statement flush remains worth doing eventually (matches Perl 5 FREETMPS, more performant), but as a separate refactor distinct from the imbalance fix. This commit ships only the design-doc revision — no code changes. The walker-gate heuristic remains primary (master parity preserved). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… gate
The walker-gate heuristic (`classNeedsWalkerGate`) was a hard-coded
list of class-name prefixes (Class::MOP, Moose, Moo) that triggered
refCount→0 rescue. This commit replaces it with a property-based
check that captures the underlying lifetime semantics directly.
The property: an object is module-global metadata if it is stored
as a value in a package-global hash (e.g., `$METAS{Foo} = $meta`).
Such objects have program-lifetime persistence held by the package
hash, but cooperative refCount may transient-zero during method
calls.
Implementation:
- New `RuntimeHash.isGlobalPackageHash` — set when the hash is
registered in `GlobalVariable.globalHashes`.
- New `RuntimeBase.storedInPackageGlobal` — set when a value is
stored as an element of a package-global hash via
`RuntimeBaseProxy.set` for `RuntimeHashProxyEntry`.
- Walker gate at `MortalList.flush` and `RuntimeScalar.setLargeRefCounted`
overwrite path now checks `base.storedInPackageGlobal` instead of
`classNeedsWalkerGate(base.blessId)`.
Results — all green WITHOUT the class-name heuristic:
- Unit tests: PASS
- `use Class::MOP; use Moose;`: PASS
- `examples/moose.pl`: PASS (full Moose demo, 7 test groups)
- DBIC `t/52leaks.t`: 11/11 PASS
Performance:
- Per-store cost: one boolean check + one flag set if the destination
scalar's parent hash is package-global. O(1) per store.
- Walker cost: only fires when object has weak refs AND
storedInPackageGlobal AND refCount→0. Same gating as the old
heuristic — same performance envelope.
Correctness:
- Class-agnostic: works for any module that stores metadata in
package-global hashes (extensibility).
- Behaviorally equivalent to the old heuristic for tested modules.
- Captures the underlying property explicitly rather than via
ad-hoc class-name list.
The `classNeedsWalkerGate` method is still in the code (used by
the D-W6.10 refCount tracer for diagnostic activation). It can be
deleted in a follow-up commit; this commit focuses on the
production fix.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Owner
Author
|
Superseded by #618 — unified PR with squashed history and clean narrative. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Diagnostic-only PR that pinpoints the root cause of the
Class::MOPbootstrap failure when the D-W6 walker gate is disabled,plus initial unit-test scaffolding.
Method
PJ_WEAKCLEAR_TRACEenv-flag probes toWeakRefRegistrycovering
weaken(),removeWeakRef(),clearWeakRefsTo().Class::MOP::Attribute::{attach_to_class, install_accessors}from Perl with refaddr-printing logs.
use Class::MOP, and correlated the Javaweak-tracking events with the Perl-level events.
Smoking gun
The metaclass
RuntimeHash(e.g. id=1746570062) is held stronglyby
our %METAS($METAS{$package_name} = $self). Yet at end-of-statement, the temporary returned by
HasMethods->metafalling outof scope causes
MortalList.flush()to drop refCount to 0 andfire DESTROY →
clearWeakRefsTo, which nulls all 4 weakenedassociated_classslots — including the just-attachedwrapped_method_metaclass's slot.The first failing
install_accessorsthen reads UNDEFassociated_class, dies on$class->add_method(...). That die ishidden by
local $SIG{__DIE__}insidetry { install_accessors }at
Class/MOP/Class.pm:897. The catch firesremove_attribute → remove_accessors → _remove_accessoratClass/MOP/Attribute.pm:475,which dies again with the visible
Can't call method "get_method" on an undefined valuemessage.Root cause statement
Cooperative refCount fails to count the strong reference held by
the package-variable hash slot
$METAS{$package_name}. When thetemporary metaclass return-value from
->metaexpires, refCountgoes 1 → 0 even though
%METASstill references it.Unit tests added
Two regression tests under
src/test/resources/unit/refcount/drift/:our_metas_underflow.t— 20 assertions across 3 scenarios(storage in package hash, method-chain temps with weakened back-
refs, multi-meta cross-linking).
cmop_pattern.t— 11 assertions exercising the exact CMOP shapewith non-CMOP class names (so the walker gate is not relevant).
Both pass on master via correct refCount discipline. They serve as
regression guards for the eventual fix.
The actual
Class::MOPfailure mode (when the gate is disabled)requires the real CMOP recursive bootstrap and many more nodes than
synthetic tests can model. Documented in
dev/modules/moose_support.mdD-W6.7/D-W6.8.Two paths to a real fix (documented in moose_support.md D-W6.8)
Fix refCount discipline. Ensure
RuntimeHash.put()/slot-assignment increments the cooperative refCount of the
referent when storing a tracked
RuntimeBasevalue.Walker as ground truth before destroy. When refCount
underflows, walk reachability from package-variable roots; if
reachable, leave refCount at 1 instead of dropping to 0.
Replaces the heuristic name-based gate.
Walker gate restored to current behaviour; this PR ships only
diagnostics + tests + docs.
Test plan
makepasses locally (full unit test suite).Generated with Devin