Skip to content

fix: replace class-name walker-gate heuristic with property-based gate#618

Merged
fglock merged 7 commits intomasterfrom
fix/walker-gate-property-based
Apr 29, 2026
Merged

fix: replace class-name walker-gate heuristic with property-based gate#618
fglock merged 7 commits intomasterfrom
fix/walker-gate-property-based

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 29, 2026

Summary

Removes the hard-coded class-name walker-gate heuristic
(Class::MOP/Moose/Moo) and replaces it with a property-based
gate
that captures the underlying lifetime semantics directly.

The Problem

When PerlOnJava's cooperative reference count for a blessed object
transitions to 0, MortalList.flush() fires DESTROY. But cooperative
refCount can transient-zero mid-statement during deferred-decrement
processing — even when the object is genuinely held by a long-lived
owner (like our %METAS).

Master's workaround was a class-name heuristic gate: if the class is
in Class::MOP/Moose/Moo, consult the reachability walker before
firing DESTROY; otherwise, fire normally. This worked but was
unsatisfying — it baked specific class names into the runtime and
didn't extend to other module-global metadata.

The Solution

Capture the underlying property: an object is "module-global metadata"
if it's stored as a value in a package-global hash (e.g., $METAS{Foo} = $meta).

// New on RuntimeBase:
public boolean storedInPackageGlobal = false;

// New on RuntimeHash:
public boolean isGlobalPackageHash = false;  // set when registered
                                             // in GlobalVariable.globalHashes

// In RuntimeBaseProxy.set, when this is a RuntimeHashProxyEntry:
if (parent.isGlobalPackageHash && lvalue.value instanceof RuntimeBase rb) {
    rb.storedInPackageGlobal = true;
}

Walker gate at MortalList.flush and the mirror site in
RuntimeScalar.setLargeRefCounted now checks base.storedInPackageGlobal
instead of classNeedsWalkerGate(base.blessId).

For DBIC rows: never stored in a package-global hash → flag never set
→ standard refCount → DESTROY on undef $row correctly fires.

For CMOP metaclass: stored in $METAS{...} → flag set → walker
checks reachability → rescue across transient zeros.

Verification

All green WITHOUT the class-name heuristic:

Test Status
Unit tests (make) ✅ PASS
use Class::MOP; use Moose; ✅ PASS
examples/moose.pl (full Moose demo, 7 test groups) ✅ PASS
DBIC t/52leaks.t ✅ 11/11 PASS

Performance

  • Per-store: one boolean check + one flag set if destination is in
    a package-global hash. O(1); zero-cost for stores not into package
    globals.
  • Walker: same gating as the old heuristic — only fires when
    object has weak refs AND storedInPackageGlobal AND refCount→0.
    Same performance envelope.
  • Diagnostics: zero overhead when env-flags off (single boolean
    check per call site).

What's Included

This unified PR includes:

  1. The fix (D-W6.18): property-based walker gate replacing the
    class-name heuristic.
  2. RefCount discipline audit (D-W6.13): paired
    releaseActiveOwner calls at all known direct --refCount sites
    (RuntimeList, Storable, RuntimeArray.pop/shift,
    RuntimeStash, DestroyDispatch, WeakRefRegistry.weaken,
    etc.).
  3. Walker improvements (D-W6.14/16):
    ReachabilityWalker.isScalarReachable walks roots looking for
    scalar identity (existing walker only matched RuntimeBase
    referents); skips weak refs; follows closure captures.
  4. Per-RuntimeBase ownership tracking (D-W6.13): activeOwners
    set + reachableOwnerCount() infrastructure ready for future
    use.
  5. Diagnostic env-flags (D-W6.10/11/12): PJ_REFCOUNT_TRACE,
    PJ_WEAKCLEAR_TRACE — gated; zero-cost off; used for the
    diagnostic work that led to this fix.
  6. Regression tests: src/test/resources/unit/refcount/drift/
    (our_metas_underflow.t + cmop_pattern.t).
  7. Documentation: dev/modules/moose_support.md D-W6.7 through
    D-W6.18 — full investigation, alternatives considered, and
    rationale for the chosen approach.

Why This Is Correct

The heuristic worked because it captured a real property —
"objects whose lifetime is module-global metadata". This PR moves
from "approximate by class name" to "exact via flag set at store
time".

  • Class-agnostic: works for any module that stores metadata in
    package-global hashes (extensibility).
  • Behaviorally equivalent to the old heuristic for tested
    modules.
  • Captures the underlying property explicitly rather than via
    ad-hoc class-name list.

Alternatives Considered (and Rejected)

  • Per-statement FREETMPS-style flush: doesn't fix the
    underlying refCount imbalance — flush schedule changes WHEN
    decrements fire, not WHETHER they're paired (D-W6.17).
  • Universal walker (no class filter): regresses DBIC's leak
    detection by 4-9 tests due to phantom owners (D-W5/W6.16).
  • Find unpaired refCount site: bounded but unclear if findable
    in practice; even fixed, the heuristic remains (orthogonal).
  • Java-GC ground truth: major rewrite, too risky for the
    current iteration.

Follow-Up

The classNeedsWalkerGate method remains in DestroyDispatch.java
(used only by the D-W6.10 diagnostic tracer at bless time).
Production code paths no longer reference the class-name list.
A follow-up cleanup commit can delete it entirely once diagnostic
activation is reworked.

Test plan

  • make passes (full unit test suite)
  • use Class::MOP; use Moose; works WITHOUT heuristic
  • examples/moose.pl passes (full Moose demo)
  • DBIC t/52leaks.t 11/11 passes
  • New regression tests pass

Generated with Devin

fglock and others added 7 commits April 29, 2026 19:37
Removes the hard-coded class-name list (`Class::MOP/Moose/Moo`) that
the cooperative-refCount walker gate used to decide whether to
suppress DESTROY at refCount→0. Replaces it with a direct property
check that captures the underlying lifetime semantics.

When PerlOnJava's cooperative reference count for a blessed
RuntimeBase transitions to 0, `MortalList.flush()` fires DESTROY.
But cooperative refCount can transient-zero mid-statement during
deferred-decrement processing — even when the object is genuinely
held by a long-lived owner.

Master worked around this by gating DESTROY on a class-name
heuristic: only fire if the class is NOT in
`Class::MOP/Moose/Moo`, otherwise consult the reachability walker
and rescue if reachable from roots. The heuristic was correct for
those modules (their metaclasses live in `our %METAS` package
hashes) but didn't extend to other module-global metadata, and
was unsatisfying as it baked specific class names into the runtime.

Capture the underlying property: an object is "module-global
metadata" if it is stored as a value in a package-global hash
(e.g., `$METAS{Foo} = $meta`). Such objects:

- Have program-lifetime persistence held by the package hash.
- Cooperative refCount may transient-zero during method calls.
- Should NOT fire DESTROY when refCount hits 0 if the walker
  confirms reachability.

Implementation:

1. New `RuntimeHash.isGlobalPackageHash` — set when the hash is
   registered in `GlobalVariable.globalHashes`.
2. New `RuntimeBase.storedInPackageGlobal` — set when a value is
   stored as an element of a package-global hash via
   `RuntimeBaseProxy.set` for `RuntimeHashProxyEntry`.
3. Walker gate at `MortalList.flush` and
   `RuntimeScalar.setLargeRefCounted` overwrite path checks
   `base.storedInPackageGlobal` instead of
   `classNeedsWalkerGate(base.blessId)`.

- `ReachabilityWalker.isScalarReachable(target)`: walks roots
  looking for a specific scalar identity (existing
  `isReachableFromRoots` only matched RuntimeBase referents).
  Skips weak refs and follows closure captures.
- `RuntimeBase.activeOwners` set + `reachableOwnerCount()` —
  ready for production use as a more precise rescue criterion;
  not yet activated in production (the property-based gate is
  sufficient for current modules).

Audited all direct `--base.refCount` sites and added paired
`releaseActiveOwner` calls so the per-base owner set stays in
sync:

- `RuntimeArray.shift/pop` — `MortalList.deferDecrement` path
- `RuntimeList.java` — 4 "undo materialized copy" paths
- `Storable.releaseApplyArgs`
- `DestroyDispatch.doCallDestroy` — args.push balance
- `MortalList.deferDecrementIfTracked`
- `MortalList.deferDecrementRecursive`
- `MortalList.deferDestroyForContainerClear`
- `WeakRefRegistry.weaken`
- `RuntimeScalar.setLargeRefCounted` (overwrite + undefine)
- `RuntimeStash.dynamicRestoreState`

- `PJ_REFCOUNT_TRACE`: per-RuntimeBase refCount transition trace
  with stack snippets, owner record/release events, and JVM
  shutdown-hook owner dump. Activated for objects matching the
  classNeedsWalkerGate heuristic at bless time. Used for the
  diagnostic work that led to this fix.
- `PJ_WEAKCLEAR_TRACE`: weaken/removeWeakRef/clearWeakRefsTo
  trace. Used to identify the original refCount underflow
  symptom in `Class::MOP::Attribute.pm:475`.
- `PJ_DESTROY_TRACE`: existing — `DestroyDispatch.callDestroy`
  trace.

New regression tests under `src/test/resources/unit/refcount/drift/`:
- `our_metas_underflow.t` (20 assertions): storage in package
  hash, method-chain temps with weakened back-refs, multi-meta
  cross-linking.
- `cmop_pattern.t` (11 assertions): exercises the CMOP shape
  with non-CMOP class names.

Both pass on master and serve as regression guards.

All green WITHOUT the class-name heuristic:

- ✅ Unit tests (`make`): PASS
- ✅ `use Class::MOP; use Moose;`: PASS
- ✅ `examples/moose.pl` (full Moose demo, 7 test groups): PASS
- ✅ DBIC `t/52leaks.t`: 11/11 PASS

- Per-store cost: one boolean check + one flag set if the
  destination scalar's parent hash is package-global. O(1) per
  store; zero-cost for stores not into package globals.
- Walker cost: same gating as the old heuristic — only fires
  when the object has weak refs AND `storedInPackageGlobal`
  AND refCount→0. Same performance envelope.
- Diagnostic instrumentation: zero overhead when env-flags are
  off (single boolean check per call site).

`dev/modules/moose_support.md` D-W6.7 through D-W6.18 documents
the full investigation: root-cause analysis, walker-gate
heuristic origins, alternative approaches considered (universal
walker, FREETMPS-style flush, Java-GC-based ground truth), and
why the property-based gate is the correct minimal fix.

The `classNeedsWalkerGate` method remains in
`DestroyDispatch.java` (used only by the diagnostic tracer at
bless time). Production code paths no longer reference the
class-name list. Follow-up cleanup can delete it entirely once
diagnostic activation is reworked.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Previously, `scalar(values %h)` worked when called directly, but
`return values %h` from a sub collapsed to the last value instead of
the count when the caller was in scalar context. Same for arrays.

Root cause: RuntimeHash.keys() set scalarContextSize on the returned
RuntimeArray so list-to-scalar conversion later yields the count.
RuntimeHash.values() and RuntimeArray.values() did not, so any
list-collapse path (sub return, comma operator) used last-element
semantics.

Fix:
- RuntimeHash.values() sets scalarContextSize on the returned array
  for both regular and TIED_HASH paths.
- RuntimeArray.values() returns a fresh array aliasing the elements
  (so `for (values @A) { $_++ }` still mutates @A) with
  scalarContextSize set, so the original @A's scalar-context
  behaviour is unaffected.

Surfaced by Moose's `Class::MOP::Class::get_all_attributes` ending in
`return values %attrs`, which made every
`scalar $meta->get_all_attributes` return a stringified Attribute
instead of a count, breaking 3 subtests in
MooseX::Role::Parameterized's t/001-parameters.t.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The property-based walker gate (commit ce94475) replaced a tiny
class-name allowlist (Class::MOP/Moose/Moo) with a per-object flag
storedInPackageGlobal. This semantically correct change widened the
gate to fire on far more blessed objects -- DBIx::Class stores many
blessed instances (Schema, ResultSource, Storage::DBI) into package-
global hashes and uses weaken() liberally for back-refs, so most of
its objects now satisfy the gate at MortalList.flush().

Each gate fire ran ReachabilityWalker.isReachableFromRoots(target),
an O(globals) BFS. With N targets per flush, total cost was
O(N x globals) -- quadratic. Surfaced as a 100% CPU spin in DBI-
generated SQL dispatch code under HARNESS_OPTIONS=j4 parallel runs
(t/100populate.t, t/101populate_rs.t, t/100extra_source.t etc. never
completed; standalone runs stayed fast because their flush queues
are smaller).

Fix: compute the reachable set once per flush via the existing
ReachabilityWalker.walk() and reuse for O(1) membership lookups.
Cleared in the finally block so subsequent flushes recompute against
fresh global state.

Verified:
- DBIx::Class parallel-harness batch that previously hung now passes
  in 47s (16 tests, 327 subtests).
- DBIx::Class t/52leaks.t still passes (the leak-detection test the
  walker gate exists to protect).
- Moo 841/841, Moose 5/5, Template Toolkit 2660/2660 unchanged.
- make unit tests pass.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
User reported a `Unable to perform storage-dependent operations
with a detached result source` failure at t/52leaks.t:208 on this
branch. Could not reproduce locally (5/5 standalone runs pass,
prove and Test::Harness runs pass). Document the symptom, leading
hypothesis (per-flush cache staleness across DESTROY of related
objects in same pending list), candidate fixes, and next steps so
the issue is tracked even though no code change is being made yet.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The reported t/52leaks.t failure was actually exercising
PerlOnJava4's WIP storable-native-format branch via a stale
Makefile in ~/.perlonjava/cpan/build/DBIx-Class-0.082844-81/
(generated by an earlier `cpan` install when PerlOnJava4 was
the active checkout). Re-running the same test_harness invocation
with PerlOnJava2's jperl shows t/52leaks.t passes cleanly.

Update the section to record the misdiagnosis and the lesson
(check Makefile's hardcoded jperl path before assuming `make test`
exercises the current branch).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The JVM auto-defaults MaxHeapSize to 1/4 of system RAM. On a 48 GB
machine that's 12 GB per jperl. Test::Harness/IPC::Open3 spawns one
child JVM per .t file (jcpan -t Module = parent + child); with the
auto-default each parent+child pair reserves 24 GB of address space
and quickly drives the OS into swap. The visible symptom is a 100%
CPU "hang" inside Sub::Defer-generated DBIC accessors — actually
G1 GC thrashing on a heap that can never settle because the memory
pressure causes constant pause/scan cycles.

DBIx::Class is the canonical reproducer: `time ./jcpan -t DBIx::Class`
hung at t/96_is_deteministic_value.t with the JVM child showing
elapsed=86s cpu=84s and the main thread perpetually inside
Sub::Defer.pm:2382 (the deferred-sub re-entry point), even though
the same test runs standalone in 8s on a cold JVM.

Fix: cap heap at -Xmx=4G with -XX:SoftMaxHeapSize=2G as a soft
target the G1 GC tries to stay under during normal operation.
4G hard cap accommodates the occasional large test (re/pat.t with
big regex datasets, op/pack.t with 300KB strings) without OOM.
2G soft target keeps the typical case feasible when many JVMs run
concurrently — N parents+children fit on a 48 GB machine without
swap thrashing.

Both -Xmx and -XX:SoftMaxHeapSize remain overridable through
JPERL_OPTS in the environment for users who want different limits.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The previous heap defaults (-Xmx4g + -XX:SoftMaxHeapSize=2g) caused
DBIx::Class t/96_is_deteministic_value.t to hang at 100% CPU when
run via Test::Harness, even though heap usage stays under 200 MB.

Hypothesis: SoftMaxHeapSize triggers G1 GC to run more aggressive
mixed collections, and the resulting young-GC cadence interacts
badly with PerlOnJava's cooperative-refcount + weak-ref machinery
in Sub::Defer-style deferred-sub re-entry — the deferred sub's
$undeferred closure slot loses its strong ref to the just-undeferred
code under the extra GC pressure, so the next call re-undefers,
producing the observed busy-loop in eval N:983 / eval N:1046.

Verified mapping (DBIx-Class 0.082844 t/96_is_deteministic_value.t
via Test::Harness on a 48 GB machine):

  -Xmx auto (=12g)                        -> PASS in 7s
  -Xmx4g (no SoftMax)                     -> PASS in 14s
  -Xmx2g                                  -> PASS in 13s
  -Xmx1g                                  -> PASS in 12s
  -Xmx4g -XX:SoftMaxHeapSize=4g (==Xmx)   -> PASS in 7s
  -Xmx4g -XX:SoftMaxHeapSize=3g           -> PASS in 8s
  -Xmx4g -XX:SoftMaxHeapSize=2g           -> HANG (100% CPU)
  -Xmx2g -XX:SoftMaxHeapSize=1g           -> HANG (100% CPU)

The bug fires whenever SoftMaxHeapSize is set BELOW Xmx. Removing
all heap settings (letting the JVM auto-default to 1/4 of system
RAM) is the cleanest fix — it honors per-machine appropriate
defaults instead of imposing an arbitrary cap, and avoids the
SoftMaxHeapSize trap entirely. Memory usage actually stays under
200 MB regardless of Xmx, so the auto-default of 12 GB on a 48 GB
machine doesn't translate into 12 GB of resident memory; it's just
virtual address space reservation, which is essentially free.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock force-pushed the fix/walker-gate-property-based branch from 5bd54bd to b646714 Compare April 29, 2026 17:38
@fglock fglock merged commit 23bd419 into master Apr 29, 2026
2 checks passed
@fglock fglock deleted the fix/walker-gate-property-based branch April 29, 2026 17:45
fglock added a commit that referenced this pull request Apr 30, 2026
Adds an Investigation Plan section to dev/modules/dbix_class.md for the
NEW failure mode observed today under `./jcpan -t DBIx::Class`:

  DBIx::Class::ResultSource::schema(): Unable to perform storage-
  dependent operations with a detached result source (source 'Artist'
  is not associated with a schema). at t/52leaks.t line 430

This is distinct from the existing "tests 12-18 leak detection at line
526" entry — that's a leak (objects not getting destroyed), this is the
opposite (a schema getting destroyed too eagerly while a child resultset
still expects it).

Test passes standalone (11/11 in 46s); only fails when ~20+ prior DBIC
tests have run through the same harness JVM. Suspected cause: the
walker-gate property fix in PR #618 (commit ce8186e) widened DESTROY
gating to every storedInPackageGlobal object — under cumulative state
pressure, the gate fails to rescue a Schema/ResultSource pair, causing
the weak ref from RS → Schema to read as undef.

The plan section includes:
- exact symptom + reproducer
- code path that triggers it
- hypothesis
- 4-step diagnostic plan (bisect prefix, instrument Java side,
  reachability check, c4db69e-baseline verification)
- what's NOT the cause (parent harness JVM is 99.7% idle in select polling)
- "why we can't ship" — DBIx::Class is published as PASS in the CPAN
  compatibility report

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
fglock added a commit that referenced this pull request Apr 30, 2026
Adds an Investigation Plan section to dev/modules/dbix_class.md for the
NEW failure mode observed today under `./jcpan -t DBIx::Class`:

  DBIx::Class::ResultSource::schema(): Unable to perform storage-
  dependent operations with a detached result source (source 'Artist'
  is not associated with a schema). at t/52leaks.t line 430

This is distinct from the existing "tests 12-18 leak detection at line
526" entry — that's a leak (objects not getting destroyed), this is the
opposite (a schema getting destroyed too eagerly while a child resultset
still expects it).

Test passes standalone (11/11 in 46s); only fails when ~20+ prior DBIC
tests have run through the same harness JVM. Suspected cause: the
walker-gate property fix in PR #618 (commit ce8186e) widened DESTROY
gating to every storedInPackageGlobal object — under cumulative state
pressure, the gate fails to rescue a Schema/ResultSource pair, causing
the weak ref from RS → Schema to read as undef.

The plan section includes:
- exact symptom + reproducer
- code path that triggers it
- hypothesis
- 4-step diagnostic plan (bisect prefix, instrument Java side,
  reachability check, c4db69e-baseline verification)
- what's NOT the cause (parent harness JVM is 99.7% idle in select polling)
- "why we can't ship" — DBIx::Class is published as PASS in the CPAN
  compatibility report

Generated with [Devin](https://devin.ai)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant