DBIx::Class t/52leaks.t intermittent: `detached result source` from premature weak-ref clear by ReachabilityWalker auto-sweep

## Problem

When running the full DBIx::Class test suite via `./jcpan -t DBIx::Class`, the test file `t/52leaks.t` intermittently fails with:

```
DBIx::Class::ResultSource::schema(): Unable to perform storage-dependent operations
with a detached result source (source 'Artist' is not associated with a schema).
at t/52leaks.t line 430
```

Followed by:

```
# Tests were run but no plan was declared and done_testing() was not seen.
t/52leaks.t .........................................
Dubious, test returned 255 (wstat 65280, 0xff00)
```

`t/52leaks.t` line 430 sits inside a closure-chain over `$phantom`:

```perl
sub { shift->result_source },     # $phantom = ResultSource
sub { shift->resultset },          # ← line 430: calls $self->schema, dies if undef
```

The `ResultSource → Schema` weak ref has been cleared *while the test is still mid-statement* and DBIC subsequently expects to dereference it.

The same intermittent has been observed across runs on different DBIC test files (`t/76joins.t`, `t/96_is_deteministic_value.t`, `t/752sqlite.t`, `t/86might_have.t`, `t/63register_class.t`, `t/46where_attribute.t`, `t/33exception_wrap.t`, `t/101populate_rs.t`) — different victim every run. The harness times those out as "no TAP output for 300s" and SIGKILLs them.

This blocks publishing **DBIx::Class PASS** in `dev/cpan-reports/cpan-compatibility.md` despite all 9 hard-failure regressions from PR #621 having been fixed in PR #644.

## How to reproduce

### Reliable (forced timing)

The bug fires only when `MortalList.maybeAutoSweep()` happens to run at a particular moment relative to DBIC's cumulative `WeakRefRegistry` state. The 5-second throttle on auto-sweep makes natural reproduction stochastic; PR #635 added a debug knob that bypasses both the throttle and the `weakRefsExist` gate so the sweep fires on every Perl statement boundary:

```bash
JPERL_FORCE_SWEEP_EVERY_FLUSH=1 timeout 3000 ./jcpan -t DBIx::Class > /tmp/dbic.log 2>&1
```

Under this knob, ~1–3 DBIC test files crash mid-test on every run with the `detached result source` exception. Standalone (single-test) reproduction via `JPERL_FORCE_SWEEP_EVERY_FLUSH=1 ./jperl t/52leaks.t` or simpler in-isolation reproducers does *not* yet trigger the bug — see "What is NOT the problem" below.

### Workaround (proves the cause)

```bash
JPERL_NO_AUTO_GC=1 timeout 3000 ./jcpan -t DBIx::Class > /tmp/no_gc.log 2>&1
```

Disabling auto-GC makes the `detached result source` crash go away — but exposes the existing `t/52leaks.t` tests 12–18 leak-detection failures (14/23 subtests fail because objects aren't getting their weak refs cleared). So the auto-sweep IS doing necessary work; we just want it to not clear refs to live objects.

## What we think the problem is

`MortalList.maybeAutoSweep()` calls `ReachabilityWalker.sweepWeakRefs(quiet=true)`, which:

1. Forces a JVM GC via `ScalarRefRegistry.forceGcAndSnapshot()`.
2. Snapshots the live set from `ScalarRefRegistry` (a `WeakHashMap`).
3. Walks reachability from globals + lexicals.
4. Clears every weak ref whose referent is *not* in the reachable set.

Under DBIC's cumulative state pressure, **a still-live object is being classified as unreachable** by the walker, and its weak-ref entries get cleared.

The most likely root cause is **a missing seeding gate in `ReachabilityWalker.walk()`** (lines 111–153 of `src/main/java/org/perlonjava/runtime/runtimetypes/ReachabilityWalker.java`). The walker has many guards on what counts as a reachability root:

```java
for (RuntimeScalar sc : ScalarRefRegistry.snapshot()) {
    if (sc.captureCount > 0) continue;                  // (A) skip if captured by a closure
    if (WeakRefRegistry.isweak(sc)) continue;           // (B) skip weak scalars
    if (MortalList.isDeferredCapture(sc)) continue;     // (C) skip pending mortals
    if (!MyVarCleanupStack.isLive(sc)) {
        if (sc.scopeExited) continue;                   // (D) skip exited scopes
        if (!sc.refCountOwned) continue;                // (E) skip non-owning copies
    }
    visitScalar(sc, todo);
}
```

DBIC's pattern (`my $phantom; for ( sub {...}, sub {...}, ... ) { $phantom = $sub->($phantom) }`) involves passing complex blessed refs through `@_` temporaries, accessor-method dispatch, Class::C3::XS next-method, Moo/Sub::Quote-generated accessors, and DBIC's `_result_source` weak-ref accessor magic. One of these creates a reachability path the walker doesn't follow, OR creates a scalar slot that lands in one of the *skip* branches above (A–E) when it shouldn't.

A second, less-likely possibility: there's a race between `forceGcAndSnapshot()` and the snapshot read — the WeakHashMap entry for the schema's lexical gets GC'd between the force-GC and the snapshot, even though the Perl-level lexical is still on the stack. We've ruled this out for simple cases (see below) but not for the full DBIC pattern.

## What is NOT the problem

We've already disconfirmed several plausible-sounding hypotheses:

- ❌ **NOT** "walker doesn't seed `my $scalar = $ref` lexicals at all". A standalone reproducer with a single `my $obj = bless {…}, …` lexical + weakened back-ref + `JPERL_FORCE_SWEEP_EVERY_FLUSH=1` + 20× explicit `Internals::jperl_gc()` PASSES. The walker correctly seeds the lexical via `MyVarCleanupStack` (see `dev/sandbox/walker_blind_spot/lexical_scalar_root_PASSES.t`).
- ❌ **NOT** "walker doesn't trace through globals". A DBIC-shape reproducer with the schema kept alive only via `our %REGISTRY` (mirroring DBIC's package-level storage) PASSES under the same conditions (see `dev/sandbox/walker_blind_spot/dbic_real_pattern_PASSES.t`).
- ❌ **NOT** harness `select()` polling slowness. Investigated separately — the parent harness JVM is 99.7% idle in `IOOperator.selectWithNIO` `Thread.sleep(10)` polling (10 jstack samples over 32 min, 6 s CPU total). When tests time out, it's the *child* JVM that's silent, not the parent.
- ❌ **NOT** orphan-JVM CPU starvation. PR #644's orphan watchdog (gated on `JPERL_ORPHAN_EXIT=1` set by `./jcpan` and `./jprove`) prevents leftover JVMs from contaminating subsequent runs. The schema-detached bug still reproduces with the watchdog active.
- ❌ **NOT** macOS Jetsam SIGKILL. We tested `-Xmx2g` to constrain heap and reduce JVM virtual-memory pressure; the schema-detached bug still appears.
- ❌ **NOT** specific to one test. Across 5 separate full DBIC runs today, the failing test was different each time (8 distinct files have been hit). It's the parent process state at the moment of the auto-sweep that determines who loses.

## How to investigate

The deterministic approach has three steps:

### Step A — Add `JPERL_WALKER_TRACE=1` instrumentation

In `src/main/java/org/perlonjava/runtime/runtimetypes/ReachabilityWalker.java`, inside `sweepWeakRefs(boolean quiet)` (around the loop that clears weak refs to unreachable objects), add an env-gated `System.err.println` for *every* clear that records:

```
WALKER_CLEAR target=<className>@<identityHashCode>
             refCount=<base.refCount>
             refCountOwned=<sc.refCountOwned>
             scopeExited=<sc.scopeExited>
             captureCount=<sc.captureCount>
             storedInPackageGlobal=<base.storedInPackageGlobal>
             path=<findPathTo(target) or "<unreachable>">
             seedStats=globals:<n> myVars:<n> scalarRefReg:<n>
```

The `path=` field is the key piece — the existing `ReachabilityWalker.findPathTo()` (line 202) returns the actual reachability path the walker found (or `<unreachable>` if it didn't find one). The `seedStats` snapshot tells us whether the schema's lexical was *eligible* to be seeded, even if it ended up filtered out.

### Step B — Capture a real DBIC failure

```bash
JPERL_FORCE_SWEEP_EVERY_FLUSH=1 JPERL_WALKER_TRACE=1 \
    timeout 3000 ./jcpan -t DBIx::Class > /tmp/dbic.log 2>/tmp/dbic.trace
```

In the resulting `/tmp/dbic.trace` file, find the first `WALKER_CLEAR` line whose target class is one of:

- `DBIx::Class::Schema` (or any subclass like `DBICTest::Schema`)
- `DBIx::Class::ResultSource::Table`
- `DBIx::Class::ResultSourceHandle`
- `DBIx::Class::Storage::DBI`

That single trace line names the seeding gate that incorrectly excluded a still-live object.

### Step C — Diagnose

The trace line will fall into one of these patterns. Each points to a different fix:

| trace pattern | meaning | fix location |
|---|---|---|
| `path=<reachable via …captures…>` but cleared anyway | walker found a path but cleared regardless — bug in the clear-loop's check | `ReachabilityWalker.sweepWeakRefs` clear logic |
| `path=<unreachable>` AND `refCountOwned=true` AND `captureCount>0` | the (A) skip is over-eager — closures capturing this scalar are themselves not reachable from `globalCodeRefs` | walker Phase 2 (a) — make `captureCount` skip conditional on the closure being walkable |
| `path=<unreachable>` AND `refCountOwned=true` AND `MyVarCleanupStack.isLive=false` AND `scopeExited=false` | walker missed the live lexical because `MyVarCleanupStack` doesn't know about it | bytecode emission in `EmitVariable.java` — register `my $scalar` declarations into MyVarCleanupStack the same way `my @arr` / `my %hash` are |
| `path=<unreachable>` AND `storedInPackageGlobal=true` | the global container isn't being walked or the link from container → element broke | walker Phase 2 — fix the global-hash visit pattern |
| `path=<unreachable>` AND DBIC-specific accessor (Sub::Quote eval, Class::C3::XS) | walker doesn't follow Sub::Quote's eval-string captures or the C3::XS dispatch chain | walker — add a code-walking phase for these constructs |

## How it can be fixed

Once Step C identifies which property/state caused the false-negative seeding, the fix is in **`ReachabilityWalker.walk()`** (Phase 2, lines 111–153). Most likely candidates, in order of expected probability:

1. **Sub::Quote / Class::C3::XS captures.** DBIC heavily uses Sub::Quote-generated accessors (look for the `B::Deparse`-style eval-string subs created by Moo and DBIC accessors). When such a sub captures the schema, the walker's "Phase 1: walk globalCodeRefs WITH captures" should follow the capture — but only if the sub is in `globalCodeRefs`. Sub::Quote-generated subs may not be (they're stored in package stash slots that aren't always walked the same way). **Fix: walk all RuntimeCode instances reachable from anywhere in the symbol table, not just `globalCodeRefs`.**

2. **Captured scalar over-skip.** The `captureCount > 0 → continue` skip at line 113 assumes that captured scalars will be reached via the closure walk. If the closure isn't itself reachable, the captured scalar gets dropped. **Fix: gate the skip on `captureCount > 0 AND we_actually_walked_a_closure_that_captures_it`.**

3. **MyVarCleanupStack scalar registration.** Today `MyVarCleanupStack` is registered on every `my $x` (in `EmitVariable.java` line 1551 — gated by `cleanupNeeded`), but the walker's path (b) handles `RuntimeScalar` only via the `liveVar instanceof RuntimeScalar` branch. The branch logic may have a gap. **Fix: trace what `MyVarCleanupStack.snapshotLiveVars()` returns at sweep time and verify scalars with refCountOwned=true are always seeded.**

4. **Storable seen-table inflation.** DBIC tests use `dclone` extensively. Storable's seen-table holds strong refs internally during the freeze/thaw window — these can leak to other tests via the static encoder/decoder. **Fix: ensure Storable's seen-table is fully cleared at every freeze/thaw boundary** (tracked separately as part of PR #621 follow-up).

A regression test under `src/test/resources/unit/refcount/walker_<specific_gate>.t` should be added once the actual gate is identified (Step C). The two `dev/sandbox/walker_blind_spot/*_PASSES.t` files already exercise the harness setup correctly; whoever fixes this just needs to extend them with the real failing pattern.

## What's already landed (PR #635)

- `JPERL_FORCE_SWEEP_EVERY_FLUSH=1` debug knob in `MortalList.maybeAutoSweep()` — the prerequisite for deterministic Step B.
- Two `*_PASSES.t` reproducers that *don't* fail — they prove what's NOT the bug.
- Updated investigation plan in `dev/modules/dbix_class.md` (sections "Investigation Plan: Schema-Detached Bug in t/52leaks.t (line 430)" and "Next steps (concrete, in order)").
- Sandbox README `dev/sandbox/walker_blind_spot/README.md` summarising the current state for the next investigator.

## Why this matters

`DBIx::Class` is published as `PASS` in `dev/cpan-reports/cpan-compatibility.md`. Users running `jcpan DBIx::Class` see a clean install when run alone (it really does pass standalone), but a failing install under the published smoke-test infrastructure. That's a worse user experience than fully blocking the install upfront — losing the trust signal in the compatibility report. We can't update the report to `FAIL` either, because the standalone install really is functional. The intermittent under harness needs to be fixed, not papered over.

## Acceptance criteria

- [ ] `./jcpan -t DBIx::Class` passes 0/314 failures over 5 consecutive runs on a quiet box.
- [ ] `JPERL_FORCE_SWEEP_EVERY_FLUSH=1 ./jcpan -t DBIx::Class` also passes (deterministic mode is now safe).
- [ ] A unit test under `src/test/resources/unit/refcount/walker_<…>.t` reproduces the original failure pattern and is part of `make`.
- [ ] No regression in the existing leak-detection assertions (`t/52leaks.t` tests 12-18 still pass at the same rate as before).


trace pattern	meaning	fix location
`path=<reachable via …captures…>` but cleared anyway	walker found a path but cleared regardless — bug in the clear-loop's check	`ReachabilityWalker.sweepWeakRefs` clear logic
`path=<unreachable>` AND `refCountOwned=true` AND `captureCount>0`	the (A) skip is over-eager — closures capturing this scalar are themselves not reachable from `globalCodeRefs`	walker Phase 2 (a) — make `captureCount` skip conditional on the closure being walkable
`path=<unreachable>` AND `refCountOwned=true` AND `MyVarCleanupStack.isLive=false` AND `scopeExited=false`	walker missed the live lexical because `MyVarCleanupStack` doesn't know about it	bytecode emission in `EmitVariable.java` — register `my $scalar` declarations into MyVarCleanupStack the same way `my @arr` / `my %hash` are
`path=<unreachable>` AND `storedInPackageGlobal=true`	the global container isn't being walked or the link from container → element broke	walker Phase 2 — fix the global-hash visit pattern
`path=<unreachable>` AND DBIC-specific accessor (Sub::Quote eval, Class::C3::XS)	walker doesn't follow Sub::Quote's eval-string captures or the C3::XS dispatch chain	walker — add a code-walking phase for these constructs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DBIx::Class t/52leaks.t intermittent: `detached result source` from premature weak-ref clear by ReachabilityWalker auto-sweep #646

Problem

How to reproduce

Reliable (forced timing)

Workaround (proves the cause)

What we think the problem is

What is NOT the problem

How to investigate

Step A — Add `JPERL_WALKER_TRACE=1` instrumentation

Step B — Capture a real DBIC failure

Step C — Diagnose

How it can be fixed

What's already landed (PR #635)

Why this matters

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

DBIx::Class t/52leaks.t intermittent: detached result source from premature weak-ref clear by ReachabilityWalker auto-sweep #646

Description

Problem

How to reproduce

Reliable (forced timing)

Workaround (proves the cause)

What we think the problem is

What is NOT the problem

How to investigate

Step A — Add JPERL_WALKER_TRACE=1 instrumentation

Step B — Capture a real DBIC failure

Step C — Diagnose

How it can be fixed

What's already landed (PR #635)

Why this matters

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

DBIx::Class t/52leaks.t intermittent: `detached result source` from premature weak-ref clear by ReachabilityWalker auto-sweep #646

Step A — Add `JPERL_WALKER_TRACE=1` instrumentation