Skip to content

DBIx::Class t/52leaks.t intermittent: detached result source from premature weak-ref clear by ReachabilityWalker auto-sweep #646

@fglock

Description

@fglock

Problem

When running the full DBIx::Class test suite via ./jcpan -t DBIx::Class, the test file t/52leaks.t intermittently fails with:

DBIx::Class::ResultSource::schema(): Unable to perform storage-dependent operations
with a detached result source (source 'Artist' is not associated with a schema).
at t/52leaks.t line 430

Followed by:

# Tests were run but no plan was declared and done_testing() was not seen.
t/52leaks.t .........................................
Dubious, test returned 255 (wstat 65280, 0xff00)

t/52leaks.t line 430 sits inside a closure-chain over $phantom:

sub { shift->result_source },     # $phantom = ResultSource
sub { shift->resultset },          # ← line 430: calls $self->schema, dies if undef

The ResultSource → Schema weak ref has been cleared while the test is still mid-statement and DBIC subsequently expects to dereference it.

The same intermittent has been observed across runs on different DBIC test files (t/76joins.t, t/96_is_deteministic_value.t, t/752sqlite.t, t/86might_have.t, t/63register_class.t, t/46where_attribute.t, t/33exception_wrap.t, t/101populate_rs.t) — different victim every run. The harness times those out as "no TAP output for 300s" and SIGKILLs them.

This blocks publishing DBIx::Class PASS in dev/cpan-reports/cpan-compatibility.md despite all 9 hard-failure regressions from PR #621 having been fixed in PR #644.

How to reproduce

Reliable (forced timing)

The bug fires only when MortalList.maybeAutoSweep() happens to run at a particular moment relative to DBIC's cumulative WeakRefRegistry state. The 5-second throttle on auto-sweep makes natural reproduction stochastic; PR #635 added a debug knob that bypasses both the throttle and the weakRefsExist gate so the sweep fires on every Perl statement boundary:

JPERL_FORCE_SWEEP_EVERY_FLUSH=1 timeout 3000 ./jcpan -t DBIx::Class > /tmp/dbic.log 2>&1

Under this knob, ~1–3 DBIC test files crash mid-test on every run with the detached result source exception. Standalone (single-test) reproduction via JPERL_FORCE_SWEEP_EVERY_FLUSH=1 ./jperl t/52leaks.t or simpler in-isolation reproducers does not yet trigger the bug — see "What is NOT the problem" below.

Workaround (proves the cause)

JPERL_NO_AUTO_GC=1 timeout 3000 ./jcpan -t DBIx::Class > /tmp/no_gc.log 2>&1

Disabling auto-GC makes the detached result source crash go away — but exposes the existing t/52leaks.t tests 12–18 leak-detection failures (14/23 subtests fail because objects aren't getting their weak refs cleared). So the auto-sweep IS doing necessary work; we just want it to not clear refs to live objects.

What we think the problem is

MortalList.maybeAutoSweep() calls ReachabilityWalker.sweepWeakRefs(quiet=true), which:

  1. Forces a JVM GC via ScalarRefRegistry.forceGcAndSnapshot().
  2. Snapshots the live set from ScalarRefRegistry (a WeakHashMap).
  3. Walks reachability from globals + lexicals.
  4. Clears every weak ref whose referent is not in the reachable set.

Under DBIC's cumulative state pressure, a still-live object is being classified as unreachable by the walker, and its weak-ref entries get cleared.

The most likely root cause is a missing seeding gate in ReachabilityWalker.walk() (lines 111–153 of src/main/java/org/perlonjava/runtime/runtimetypes/ReachabilityWalker.java). The walker has many guards on what counts as a reachability root:

for (RuntimeScalar sc : ScalarRefRegistry.snapshot()) {
    if (sc.captureCount > 0) continue;                  // (A) skip if captured by a closure
    if (WeakRefRegistry.isweak(sc)) continue;           // (B) skip weak scalars
    if (MortalList.isDeferredCapture(sc)) continue;     // (C) skip pending mortals
    if (!MyVarCleanupStack.isLive(sc)) {
        if (sc.scopeExited) continue;                   // (D) skip exited scopes
        if (!sc.refCountOwned) continue;                // (E) skip non-owning copies
    }
    visitScalar(sc, todo);
}

DBIC's pattern (my $phantom; for ( sub {...}, sub {...}, ... ) { $phantom = $sub->($phantom) }) involves passing complex blessed refs through @_ temporaries, accessor-method dispatch, Class::C3::XS next-method, Moo/Sub::Quote-generated accessors, and DBIC's _result_source weak-ref accessor magic. One of these creates a reachability path the walker doesn't follow, OR creates a scalar slot that lands in one of the skip branches above (A–E) when it shouldn't.

A second, less-likely possibility: there's a race between forceGcAndSnapshot() and the snapshot read — the WeakHashMap entry for the schema's lexical gets GC'd between the force-GC and the snapshot, even though the Perl-level lexical is still on the stack. We've ruled this out for simple cases (see below) but not for the full DBIC pattern.

What is NOT the problem

We've already disconfirmed several plausible-sounding hypotheses:

  • NOT "walker doesn't seed my $scalar = $ref lexicals at all". A standalone reproducer with a single my $obj = bless {…}, … lexical + weakened back-ref + JPERL_FORCE_SWEEP_EVERY_FLUSH=1 + 20× explicit Internals::jperl_gc() PASSES. The walker correctly seeds the lexical via MyVarCleanupStack (see dev/sandbox/walker_blind_spot/lexical_scalar_root_PASSES.t).
  • NOT "walker doesn't trace through globals". A DBIC-shape reproducer with the schema kept alive only via our %REGISTRY (mirroring DBIC's package-level storage) PASSES under the same conditions (see dev/sandbox/walker_blind_spot/dbic_real_pattern_PASSES.t).
  • NOT harness select() polling slowness. Investigated separately — the parent harness JVM is 99.7% idle in IOOperator.selectWithNIO Thread.sleep(10) polling (10 jstack samples over 32 min, 6 s CPU total). When tests time out, it's the child JVM that's silent, not the parent.
  • NOT orphan-JVM CPU starvation. PR Storable hook cookie + leaf seen-tag fix; AGENTS timeout rule; jperl orphan watchdog #644's orphan watchdog (gated on JPERL_ORPHAN_EXIT=1 set by ./jcpan and ./jprove) prevents leftover JVMs from contaminating subsequent runs. The schema-detached bug still reproduces with the watchdog active.
  • NOT macOS Jetsam SIGKILL. We tested -Xmx2g to constrain heap and reduce JVM virtual-memory pressure; the schema-detached bug still appears.
  • NOT specific to one test. Across 5 separate full DBIC runs today, the failing test was different each time (8 distinct files have been hit). It's the parent process state at the moment of the auto-sweep that determines who loses.

How to investigate

The deterministic approach has three steps:

Step A — Add JPERL_WALKER_TRACE=1 instrumentation

In src/main/java/org/perlonjava/runtime/runtimetypes/ReachabilityWalker.java, inside sweepWeakRefs(boolean quiet) (around the loop that clears weak refs to unreachable objects), add an env-gated System.err.println for every clear that records:

WALKER_CLEAR target=<className>@<identityHashCode>
             refCount=<base.refCount>
             refCountOwned=<sc.refCountOwned>
             scopeExited=<sc.scopeExited>
             captureCount=<sc.captureCount>
             storedInPackageGlobal=<base.storedInPackageGlobal>
             path=<findPathTo(target) or "<unreachable>">
             seedStats=globals:<n> myVars:<n> scalarRefReg:<n>

The path= field is the key piece — the existing ReachabilityWalker.findPathTo() (line 202) returns the actual reachability path the walker found (or <unreachable> if it didn't find one). The seedStats snapshot tells us whether the schema's lexical was eligible to be seeded, even if it ended up filtered out.

Step B — Capture a real DBIC failure

JPERL_FORCE_SWEEP_EVERY_FLUSH=1 JPERL_WALKER_TRACE=1 \
    timeout 3000 ./jcpan -t DBIx::Class > /tmp/dbic.log 2>/tmp/dbic.trace

In the resulting /tmp/dbic.trace file, find the first WALKER_CLEAR line whose target class is one of:

  • DBIx::Class::Schema (or any subclass like DBICTest::Schema)
  • DBIx::Class::ResultSource::Table
  • DBIx::Class::ResultSourceHandle
  • DBIx::Class::Storage::DBI

That single trace line names the seeding gate that incorrectly excluded a still-live object.

Step C — Diagnose

The trace line will fall into one of these patterns. Each points to a different fix:

trace pattern meaning fix location
path=<reachable via …captures…> but cleared anyway walker found a path but cleared regardless — bug in the clear-loop's check ReachabilityWalker.sweepWeakRefs clear logic
path=<unreachable> AND refCountOwned=true AND captureCount>0 the (A) skip is over-eager — closures capturing this scalar are themselves not reachable from globalCodeRefs walker Phase 2 (a) — make captureCount skip conditional on the closure being walkable
path=<unreachable> AND refCountOwned=true AND MyVarCleanupStack.isLive=false AND scopeExited=false walker missed the live lexical because MyVarCleanupStack doesn't know about it bytecode emission in EmitVariable.java — register my $scalar declarations into MyVarCleanupStack the same way my @arr / my %hash are
path=<unreachable> AND storedInPackageGlobal=true the global container isn't being walked or the link from container → element broke walker Phase 2 — fix the global-hash visit pattern
path=<unreachable> AND DBIC-specific accessor (Sub::Quote eval, Class::C3::XS) walker doesn't follow Sub::Quote's eval-string captures or the C3::XS dispatch chain walker — add a code-walking phase for these constructs

How it can be fixed

Once Step C identifies which property/state caused the false-negative seeding, the fix is in ReachabilityWalker.walk() (Phase 2, lines 111–153). Most likely candidates, in order of expected probability:

  1. Sub::Quote / Class::C3::XS captures. DBIC heavily uses Sub::Quote-generated accessors (look for the B::Deparse-style eval-string subs created by Moo and DBIC accessors). When such a sub captures the schema, the walker's "Phase 1: walk globalCodeRefs WITH captures" should follow the capture — but only if the sub is in globalCodeRefs. Sub::Quote-generated subs may not be (they're stored in package stash slots that aren't always walked the same way). Fix: walk all RuntimeCode instances reachable from anywhere in the symbol table, not just globalCodeRefs.

  2. Captured scalar over-skip. The captureCount > 0 → continue skip at line 113 assumes that captured scalars will be reached via the closure walk. If the closure isn't itself reachable, the captured scalar gets dropped. Fix: gate the skip on captureCount > 0 AND we_actually_walked_a_closure_that_captures_it.

  3. MyVarCleanupStack scalar registration. Today MyVarCleanupStack is registered on every my $x (in EmitVariable.java line 1551 — gated by cleanupNeeded), but the walker's path (b) handles RuntimeScalar only via the liveVar instanceof RuntimeScalar branch. The branch logic may have a gap. Fix: trace what MyVarCleanupStack.snapshotLiveVars() returns at sweep time and verify scalars with refCountOwned=true are always seeded.

  4. Storable seen-table inflation. DBIC tests use dclone extensively. Storable's seen-table holds strong refs internally during the freeze/thaw window — these can leak to other tests via the static encoder/decoder. Fix: ensure Storable's seen-table is fully cleared at every freeze/thaw boundary (tracked separately as part of PR fix(IO::Socket::SSL,Storable): SSL_WANT_* constants + clearer retrieve error; plan native Storable #621 follow-up).

A regression test under src/test/resources/unit/refcount/walker_<specific_gate>.t should be added once the actual gate is identified (Step C). The two dev/sandbox/walker_blind_spot/*_PASSES.t files already exercise the harness setup correctly; whoever fixes this just needs to extend them with the real failing pattern.

What's already landed (PR #635)

  • JPERL_FORCE_SWEEP_EVERY_FLUSH=1 debug knob in MortalList.maybeAutoSweep() — the prerequisite for deterministic Step B.
  • Two *_PASSES.t reproducers that don't fail — they prove what's NOT the bug.
  • Updated investigation plan in dev/modules/dbix_class.md (sections "Investigation Plan: Schema-Detached Bug in t/52leaks.t (line 430)" and "Next steps (concrete, in order)").
  • Sandbox README dev/sandbox/walker_blind_spot/README.md summarising the current state for the next investigator.

Why this matters

DBIx::Class is published as PASS in dev/cpan-reports/cpan-compatibility.md. Users running jcpan DBIx::Class see a clean install when run alone (it really does pass standalone), but a failing install under the published smoke-test infrastructure. That's a worse user experience than fully blocking the install upfront — losing the trust signal in the compatibility report. We can't update the report to FAIL either, because the standalone install really is functional. The intermittent under harness needs to be fixed, not papered over.

Acceptance criteria

  • ./jcpan -t DBIx::Class passes 0/314 failures over 5 consecutive runs on a quiet box.
  • JPERL_FORCE_SWEEP_EVERY_FLUSH=1 ./jcpan -t DBIx::Class also passes (deterministic mode is now safe).
  • A unit test under src/test/resources/unit/refcount/walker_<…>.t reproduces the original failure pattern and is part of make.
  • No regression in the existing leak-detection assertions (t/52leaks.t tests 12-18 still pass at the same rate as before).

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions