Skip to content

diag(D-W6.7): pinpoint root cause of metaclass refCount underflow#611

Closed
fglock wants to merge 9 commits intomasterfrom
fix/d-w6-precise-die-probe
Closed

diag(D-W6.7): pinpoint root cause of metaclass refCount underflow#611
fglock wants to merge 9 commits intomasterfrom
fix/d-w6-precise-die-probe

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 29, 2026

Summary

Diagnostic-only PR that pinpoints the root cause of the
Class::MOP bootstrap failure when the D-W6 walker gate is disabled,
plus initial unit-test scaffolding.

Method

  • Added PJ_WEAKCLEAR_TRACE env-flag probes to WeakRefRegistry
    covering weaken(), removeWeakRef(), clearWeakRefsTo().
  • Wrapped Class::MOP::Attribute::{attach_to_class, install_accessors}
    from Perl with refaddr-printing logs.
  • Disabled the gate, ran use Class::MOP, and correlated the Java
    weak-tracking events with the Perl-level events.

Smoking gun

The metaclass RuntimeHash (e.g. id=1746570062) is held strongly
by our %METAS ($METAS{$package_name} = $self). Yet at end-of-
statement, the temporary returned by HasMethods->meta falling out
of scope causes MortalList.flush() to drop refCount to 0 and
fire DESTROY → clearWeakRefsTo, which nulls all 4 weakened
associated_class slots — including the just-attached
wrapped_method_metaclass's slot.

The first failing install_accessors then reads UNDEF
associated_class, dies on $class->add_method(...). That die is
hidden by local $SIG{__DIE__} inside try { install_accessors }
at Class/MOP/Class.pm:897. The catch fires remove_attribute → remove_accessors → _remove_accessor at Class/MOP/Attribute.pm:475,
which dies again with the visible Can't call method "get_method" on an undefined value message.

Root cause statement

Cooperative refCount fails to count the strong reference held by
the package-variable hash slot $METAS{$package_name}. When the
temporary metaclass return-value from ->meta expires, refCount
goes 1 → 0 even though %METAS still references it.

Unit tests added

Two regression tests under src/test/resources/unit/refcount/drift/:

  • our_metas_underflow.t — 20 assertions across 3 scenarios
    (storage in package hash, method-chain temps with weakened back-
    refs, multi-meta cross-linking).
  • cmop_pattern.t — 11 assertions exercising the exact CMOP shape
    with non-CMOP class names (so the walker gate is not relevant).

Both pass on master via correct refCount discipline. They serve as
regression guards for the eventual fix.

The actual Class::MOP failure mode (when the gate is disabled)
requires the real CMOP recursive bootstrap and many more nodes than
synthetic tests can model. Documented in
dev/modules/moose_support.md D-W6.7/D-W6.8.

Two paths to a real fix (documented in moose_support.md D-W6.8)

  1. Fix refCount discipline. Ensure RuntimeHash.put() /
    slot-assignment increments the cooperative refCount of the
    referent when storing a tracked RuntimeBase value.

  2. Walker as ground truth before destroy. When refCount
    underflows, walk reachability from package-variable roots; if
    reachable, leave refCount at 1 instead of dropping to 0.
    Replaces the heuristic name-based gate.

Walker gate restored to current behaviour; this PR ships only
diagnostics + tests + docs.

Test plan

  • make passes locally (full unit test suite).
  • No production behaviour change (probes are gated on env-flag).
  • New unit tests pass on master.

Generated with Devin

fglock and others added 9 commits April 29, 2026 10:30
Add granular diagnostic probes (gated on PJ_WEAKCLEAR_TRACE) to
WeakRefRegistry covering weaken(), removeWeakRef(), and
clearWeakRefsTo(). Combined with a Perl-side wrapper around
Class::MOP::Attribute::{attach_to_class, install_accessors}, this
nails down exactly when and why "Can't call method 'get_method' on an
undefined value at Class/MOP/Attribute.pm line 475" happens during
'use Class::MOP' with the walker gate disabled.

Finding: the metaclass RuntimeHash held strongly by `our %METAS`
nevertheless has its cooperative refCount drop to 0 at end-of-statement
when the temporary returned by `HasMethods->meta` falls out of scope.
MortalList.flush() then fires DESTROY -> clearWeakRefsTo, which nulls
all weakened `associated_class` slots on the just-attached attributes.

Two fix paths documented in dev/modules/moose_support.md (D-W6.8):
1. Fix RuntimeHash slot-assignment refCount discipline.
2. Walk package-variable roots before clearWeakRefsTo on refCount=0.

Walker gate restored to its previous state; this PR is purely
diagnostic. `make` passes.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two unit tests under src/test/resources/unit/refcount/drift/ that
demonstrate the WANTED behavior (storage in package hash strongly
holds the value):

- our_metas_underflow.t: 20 assertions across 3 scenarios
  (object stored in package hash; method-chain temps with weakened
  back-refs; multi-meta cross-linking).

- cmop_pattern.t: 11 assertions exercising the exact CMOP shape
  (`Class->meta->add_attribute(Attr->new(...))` repeated as
  separate top-level statements, with weakened associated_class
  back-refs that must stay alive across statements).

Both files pass on master (the walker gate masks the underflow for
the heuristic class names; these tests use non-CMOP class names so
the gate isn't relevant — they stay green via correct refCount
discipline). When the underlying refCount underflow is fixed, these
will continue to pass and serve as regression guards.

The actual `Class::MOP` failure mode (when the gate is disabled)
requires the real CMOP recursive bootstrap and is documented in
moose_support.md D-W6.7.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add a per-RuntimeBase `refCountTrace` flag that is armed at `bless`
time when the bless target matches `classNeedsWalkerGate`
(Class::MOP / Moose / Moo) and the env-flag `PJ_REFCOUNT_TRACE` is
set. The flag-gated `traceRefCount(delta, reason)` method writes a
stderr line for every refCount transition with a short stack
snippet.

Wired into the four critical refCount-changing sites:

- WeakRefRegistry.weaken (decrement on weakening)
- RuntimeScalar.setLargeRefCounted (increment on store)
- RuntimeScalar.setLargeRefCounted (decrement on overwrite)
- MortalList.flush (deferred decrement)

Cost when off: one boolean && test per site (trivially branch-
predicted). Cost when on: limited to objects matching the heuristic
(typically <200 metaclasses, never DBIC schema or rows).

Findings on `use Class::MOP` with the gate disabled: the failing
metaclass accumulated 50 setLargeRefCounted increments + 1 silent
bless increment vs 51 decrements (45 from MortalList.flush, 6 from
weaken). Net: +51 - 51 = 0, i.e. one extra decrement somewhere.
The full transition log for the failing metaclass is preserved at
`dev/diagnostic-traces/cmop-metaclass-refcount-underflow.txt`.

Walker fix experiments documented in moose_support.md D-W6.9:
- Universal walker (no class-name heuristic): regresses DBIC's
  t/52leaks.t (expected leak is incorrectly rescued, downstream
  weaken dies).
- Hybrid (heuristic full + non-heuristic globalOnly): same DBIC
  regression.
- Adding walker gate to weaken's dec-to-0 path: timeout (>120s) on
  DBIC t/52leaks.t — too expensive on Moose-blessed objects.

Conclusion: the class-name heuristic is the best known compromise.
The proper fix needs the unpaired increment site identified, then
fixed. The tracer is left in place to make that pinpointable in a
follow-up.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Implements Step 1 of the D-W6.11 fix plan: per-RuntimeBase ownership
tracking with recordOwner(scalar, site) / releaseOwner(scalar, site)
hooks. JVM shutdown hook prints surviving owners with full call-site
attribution.

Wired increments into:
- RuntimeScalar.setLargeRefCounted (normal store)
- RuntimeScalar.incrementRefCountForContainerStore (container retroactive)
- ReferenceOperators.bless rebless-from-untracked path
- RuntimeScalar.setLargeRefCounted rescue path (currentDestroyTarget)

Wired releases into:
- RuntimeScalar.setLargeRefCounted overwrite
- WeakRefRegistry.weaken
- MortalList.deferDecrementIfTracked
- MortalList.deferDecrementRecursive
- RuntimeArray.shift (deferred dec on shift)
- DestroyDispatch.doCallDestroy args balance

Result on `use Class::MOP` with the gate disabled: zero
"*** UNPAIRED RELEASE ***" events, but one CMOP metaclass survives
the run with refCount=Integer.MIN_VALUE (destroyed) and 2 owners
still recorded — proving the imbalance is in cooperative-refCount
itself, not in the instrumentation. The 2 owners came from
`RuntimeScalar.set` paths, one via `RuntimeBaseProxy.set:65` (the
proxy delegate that copies `lvalue.value` to `this.value` without
recording ownership) and one via `addToScalar:902`.

Smoking-gun candidates for the unpaired decrement(s):
- RuntimeBaseProxy.set (line 65-67): proxy holds a stale strong-ref
  field invisible to refcounting.
- RuntimeList.java (lines 623, 642, 666, 699): list-assignment
  "undo materialized copy" — direct --refCount without ownership
  release; needs audit.

Walker gate restored. Full plan in
dev/modules/moose_support.md D-W6.11/12. Trace artifact at
dev/diagnostic-traces/cmop-metaclass-with-owner-tracking.txt (206
lines, every refCount transition + owner record/release with
stack snippets).

`make` passes (unit tests). The diagnostic instrumentation stays
in place (gated on `PJ_REFCOUNT_TRACE`) for the next session to
finish the inc/dec audit and remove the walker-gate heuristic.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Step 2 of the D-W6.11 plan: audit and instrument all direct
`--base.refCount` sites with paired release calls.

New on RuntimeBase:
- activeOwners set (lazy-init, IdentityHashMap-backed)
- activateOwnerTracking() — initializes + backfills from
  ScalarRefRegistry
- recordActiveOwner(scalar) / releaseActiveOwner(scalar)
- activeOwnerCount() — returns count filtered to scalars that
  still satisfy refCountOwned==true && value==this

Wired record into:
- RuntimeScalar.setLargeRefCounted (store)
- RuntimeScalar.incrementRefCountForContainerStore

Wired release into:
- RuntimeScalar.setLargeRefCounted (overwrite)
- WeakRefRegistry.weaken
- MortalList.deferDecrementIfTracked
- MortalList.deferDecrementRecursive
- RuntimeArray.shift (deferred dec)
- DestroyDispatch.doCallDestroy (args balance)
- RuntimeList.java (4 "undo materialized copy" paths)
- Storable.releaseApplyArgs

Production rescue experiment (replacing the class-name heuristic
with `activeOwnerCount() > 0 → restore refCount`):
- use Class::MOP / use Moose: work without heuristic
- Unit tests: green
- DBIC t/52leaks.t: 9-10/18 fail — too aggressive on cycles

Cooperative refcount cannot tolerate cycles without weaken; rescuing
on real strong owners keeps cycle members alive that DBIC's leak
test expects to see destroyed.

Reverted production rescue. Master parity restored (11/11). Walker-
gate heuristic still primary; activeOwners infrastructure dormant
unless future audit work activates it.

Surprise finding: calling `activateOwnerTracking()` (snapshot of
ScalarRefRegistry) on every weaken caused 9/18 leak fails — likely
WeakHashMap.expungeStaleEntries() side-effect shifting JVM GC
timing visible to DBIC's leak detection. Removed the auto-activation.

`make` passes; Class::MOP/Moose load (with heuristic still on).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Implements D-W6.14 Option 2 (owner snapshot + reachability filter)
as design-doc analysis of how system Perl handles cycles vs
PerlOnJava's cooperative refcount.

Key observation: system Perl doesn't solve cycles — they leak by
design. The programmer breaks them with weaken(). DBIC's leak tests
pass because DBIC weakens its internal back-references; cycles
without weaken legitimately leak.

For PerlOnJava to match, we need:
1. Precise cooperative refCount (no transient zeros), OR
2. Reachable-owner verification at refCount→0

This commit ships infrastructure for (2):

- New ReachabilityWalker.isScalarReachable(target) — walks roots
  looking for the target SCALAR identity (existing
  isReachableFromRoots only finds RuntimeBases as referents, not
  scalars themselves). Includes closure capture walking via
  RuntimeCode.capturedScalars.

- New RuntimeBase.reachableOwnerCount() — counts owners that are
  both refCountOwned-true and reachable from roots. Phantoms are
  filtered.

- RuntimeBase.activeOwners populated incrementally via
  record/release calls (activated lazily on first activation —
  default not active to avoid behavior changes).

- Audit and pair release calls at all known direct --refCount
  sites: RuntimeArray.pop, RuntimeStash undefine, RuntimeScalar
  undefine, MortalList.deferDestroyForContainerClear,
  MortalList.deferDecrementRecursive (unblessed branch).

Walker-gate heuristic restored as primary mechanism. Class::MOP
and Moose continue to work (heuristic), DBIC t/52leaks.t passes
11/11. The reachableOwnerCount() rescue path is wired but
disabled by default (activeOwners is never auto-initialized in
production) until the DBIC reachability gap (D-W6.15) is bridged.

When activated experimentally on `use Class::MOP` (with heuristic
disabled): Class::MOP and Moose load successfully — confirming
the algorithm works for the original blocker. DBIC has a
reachability gap (intermediate JVM-stack-live scalars not in any
walker seed root) that needs bridging before universal
deployment.

Three concrete paths forward documented in moose_support.md
D-W6.15:
1. Register every refCount-affecting RuntimeScalar in
   MyVarCleanupStack
2. Use Java GC as ground truth (deferred destroy + System.gc)
3. Eliminate transient zeros at the source

`make` passes; Class::MOP/Moose load; DBIC t/52leaks.t 11/11.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Key insight from this session: cooperative refCount and JVM
reference graph interact in ways that make a "universal walker
rescue" inherently risky. activeOwners using IdentityHashMap (strong
refs) prevents JVM GC of the scalars, breaking DBIC's leak detection
which relies on objects becoming JVM-collectible. Switching to
WeakHashMap allows GC but introduces premature collection bugs
(Schema's owner gets GC'd too early).

This commit:
1. Improves ReachabilityWalker.isScalarReachable to:
   - Skip weak refs (matching real Perl semantics)
   - Walk closure captures
   - Seed from package globals + live my-vars
2. Keeps walker-gate heuristic as primary mechanism (master parity)
3. Documents the JVM-GC interaction issue and path forward

Findings:
- Class::MOP/Moose load correctly via reachableOwnerCount() when
  manually invoked (proven by experiment — heuristic disabled, no
  regressions on those modules).
- DBIC's t/52leaks.t per-object GC tests fail when reachableOwnerCount
  is the primary rescue, due to phantom JVM-stack-live scalars
  holding rows.
- Replacing IdentityHashMap with WeakHashMap for activeOwners has
  correct GC semantics but breaks Schema reachability — Schema's
  owner scalar gets GC'd before refCount→0 transitions complete.

Path forward (D-W6.17):
The right fix needs the cooperative refCount itself to be precise
enough that transient zeros only happen at scope-exit boundaries
(matching real Perl's FREETMPS semantics). The walker can be a
debug/audit tool but shouldn't be a production rescue path.

For now: heuristic gate stays as primary; walker infrastructure
ready for future use. `make` passes; Class::MOP/Moose load; DBIC
t/52leaks.t 11/11 (master parity).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…not sufficient

Tested the hypothesis that replacing per-assignment MortalList.flush()
with per-statement FREETMPS-style flush would eliminate transient
refCount zeros and remove the need for the class-name walker-gate
heuristic.

**Implementation experiment** (rolled back to master parity in this
commit):
- Removed MortalList.flush() from setLargeRefCounted
- Added per-statement MortalList.flush() in JVM EmitBlock
- Added per-statement MORTAL_FLUSH opcode in BytecodeCompiler

**Result**:
- Unit tests: PASS
- DBIC t/52leaks.t 11/11: PASS (with heuristic enabled)
- Class::MOP load WITH heuristic: works
- Class::MOP load WITHOUT heuristic: STILL FAILS

**Conclusion**: per-statement flush is correct (matches Perl 5
semantics) but it does NOT fix the underlying refCount imbalance.
The 50 inc / 51 dec event count from D-W6.10 is a REAL imbalance,
not a timing artifact. Flush schedule changes WHEN decrements
fire, not WHETHER they're paired.

The transient-zero hypothesis is misdiagnosis: 1→0 mid-statement
is the CONSEQUENCE of the imbalance, not the cause.

**Revised plan**:
1. Per-pair tracing: tag each setLargeRefCounted increment with a
   unique ID; tag each decrement with the ID it pairs with; find
   unmatched IDs at end of run.
2. Identify the specific code path producing the unmatched events.
3. Fix the path (add missing inc OR remove spurious dec).
4. Verify all four invariants without heuristic:
   - Unit tests pass
   - Class::MOP/Moose load
   - DBIC t/52leaks.t 11/11
   - src/test/resources/unit/refcount/drift/* pass

Per-statement flush remains worth doing eventually (matches Perl 5
FREETMPS, more performant), but as a separate refactor distinct
from the imbalance fix.

This commit ships only the design-doc revision — no code changes.
The walker-gate heuristic remains primary (master parity preserved).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… gate

The walker-gate heuristic (`classNeedsWalkerGate`) was a hard-coded
list of class-name prefixes (Class::MOP, Moose, Moo) that triggered
refCount→0 rescue. This commit replaces it with a property-based
check that captures the underlying lifetime semantics directly.

The property: an object is module-global metadata if it is stored
as a value in a package-global hash (e.g., `$METAS{Foo} = $meta`).
Such objects have program-lifetime persistence held by the package
hash, but cooperative refCount may transient-zero during method
calls.

Implementation:
- New `RuntimeHash.isGlobalPackageHash` — set when the hash is
  registered in `GlobalVariable.globalHashes`.
- New `RuntimeBase.storedInPackageGlobal` — set when a value is
  stored as an element of a package-global hash via
  `RuntimeBaseProxy.set` for `RuntimeHashProxyEntry`.
- Walker gate at `MortalList.flush` and `RuntimeScalar.setLargeRefCounted`
  overwrite path now checks `base.storedInPackageGlobal` instead of
  `classNeedsWalkerGate(base.blessId)`.

Results — all green WITHOUT the class-name heuristic:
- Unit tests: PASS
- `use Class::MOP; use Moose;`: PASS
- `examples/moose.pl`: PASS (full Moose demo, 7 test groups)
- DBIC `t/52leaks.t`: 11/11 PASS

Performance:
- Per-store cost: one boolean check + one flag set if the destination
  scalar's parent hash is package-global. O(1) per store.
- Walker cost: only fires when object has weak refs AND
  storedInPackageGlobal AND refCount→0. Same gating as the old
  heuristic — same performance envelope.

Correctness:
- Class-agnostic: works for any module that stores metadata in
  package-global hashes (extensibility).
- Behaviorally equivalent to the old heuristic for tested modules.
- Captures the underlying property explicitly rather than via
  ad-hoc class-name list.

The `classNeedsWalkerGate` method is still in the code (used by
the D-W6.10 refCount tracer for diagnostic activation). It can be
deleted in a follow-up commit; this commit focuses on the
production fix.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock
Copy link
Copy Markdown
Owner Author

fglock commented Apr 29, 2026

Superseded by #618 — unified PR with squashed history and clean narrative.

@fglock fglock closed this Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant