Problem
When running the full DBIx::Class test suite via ./jcpan -t DBIx::Class, the test file t/52leaks.t intermittently fails with:
DBIx::Class::ResultSource::schema(): Unable to perform storage-dependent operations
with a detached result source (source 'Artist' is not associated with a schema).
at t/52leaks.t line 430
Followed by:
# Tests were run but no plan was declared and done_testing() was not seen.
t/52leaks.t .........................................
Dubious, test returned 255 (wstat 65280, 0xff00)
t/52leaks.t line 430 sits inside a closure-chain over $phantom:
sub { shift->result_source },  # $phantom = ResultSource
sub { shift->resultset },      # ← line 430: calls $self->schema, dies if undef
The ResultSource → Schema weak ref has been cleared while the test is still mid-statement and DBIC subsequently expects to dereference it.
The same intermittent failure has been observed across runs in other DBIC test files (t/76joins.t, t/96_is_deteministic_value.t, t/752sqlite.t, t/86might_have.t, t/63register_class.t, t/46where_attribute.t, t/33exception_wrap.t, t/101populate_rs.t), with a different victim on every run. The harness times those out as "no TAP output for 300s" and SIGKILLs them.
This blocks publishing DBIx::Class PASS in dev/cpan-reports/cpan-compatibility.md despite all 9 hard-failure regressions from PR #621 having been fixed in PR #644.
How to reproduce
Reliable (forced timing)
The bug fires only when MortalList.maybeAutoSweep() happens to run at a particular moment relative to DBIC's cumulative WeakRefRegistry state. The 5-second throttle on auto-sweep makes natural reproduction stochastic; PR #635 added a debug knob that bypasses both the throttle and the weakRefsExist gate so the sweep fires on every Perl statement boundary:
JPERL_FORCE_SWEEP_EVERY_FLUSH=1 ./jcpan -t DBIx::Class
Under this knob, ~1–3 DBIC test files crash mid-test on every run with the detached result source exception. Standalone (single-test) reproduction via JPERL_FORCE_SWEEP_EVERY_FLUSH=1 ./jperl t/52leaks.t or simpler in-isolation reproducers does not yet trigger the bug — see "What is NOT the problem" below.
Workaround (proves the cause)
Disabling auto-GC makes the detached result source crash go away, but it exposes the existing leak-detection failures in t/52leaks.t tests 12–18 (14/23 subtests fail because objects aren't getting their weak refs cleared). So the auto-sweep IS doing necessary work; we just want it to not clear refs to live objects.
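What we think the problem is
MortalList.maybeAutoSweep() calls ReachabilityWalker.sweepWeakRefs(quiet=true), which:
1. Forces a JVM GC via ScalarRefRegistry.forceGcAndSnapshot().
2. Snapshots the live set from ScalarRefRegistry (a WeakHashMap).
3. Walks reachability from globals + lexicals.
4. Clears every weak ref whose referent is not in the reachable set.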
Under DBIC's cumulative state pressure, a still-live object is being classified as unreachable by the walker, and its weak-ref entries get cleared.
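The failure mode described above can be illustrated with a toy mark-and-clear model. This is only a sketch under stated assumptions: `SweepSketch`, `Node`, and `WeakSlot` are hypothetical names that stand in for the real runtime types, and the real walker has many more root sources and guards. It shows how leaving one live root out of the seed set is enough to clear a weak ref to an object the program still holds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class SweepSketch {
    /** Toy heap object (hypothetical): a name plus strong outgoing edges. */
    static final class Node {
        final String name;
        final List<Node> strong = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /** Toy weak reference (hypothetical): "cleared" means target becomes null. */
    static final class WeakSlot {
        Node target;
        WeakSlot(Node target) { this.target = target; }
    }

    /** Mark phase: collect everything reachable from the roots via strong edges. */
    static Set<Node> reachable(Collection<Node> roots) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> todo = new ArrayDeque<>(roots);
        while (!todo.isEmpty()) {
            Node n = todo.pop();
            if (seen.add(n)) {
                todo.addAll(n.strong);
            }
        }
        return seen;
    }

    /** Clear phase: null out every weak slot whose target was not marked; returns the count. */
    static int sweep(Collection<Node> roots, Collection<WeakSlot> weakSlots) {
        Set<Node> live = reachable(roots);
        int cleared = 0;
        for (WeakSlot w : weakSlots) {
            if (w.target != null && !live.contains(w.target)) {
                w.target = null;
                cleared++;
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        Node schema = new Node("schema");
        Node source = new Node("source");
        WeakSlot back = new WeakSlot(schema); // models the weak ResultSource -> Schema link

        // All live roots seeded: nothing is cleared.
        System.out.println(sweep(Arrays.asList(schema, source), Arrays.asList(back)));
        // Schema root missed by the seeding step: the weak link is cleared.
        System.out.println(sweep(Arrays.asList(source), Arrays.asList(back)));
        System.out.println(back.target == null);
    }
}
```

In this model the second sweep clears the back-reference even though `schema` is still held by the caller, which is the shape of the "detached result source" symptom.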
The most likely root cause is a missing seeding gate in ReachabilityWalker.walk() (lines 111–153 of src/main/java/org/perlonjava/runtime/runtimetypes/ReachabilityWalker.java). The walker has many guards on what counts as a reachability root:
for (RuntimeScalar sc : ScalarRefRegistry.snapshot()) {
    if (sc.captureCount > 0) continue;               // (A) skip if captured by a closure
    if (WeakRefRegistry.isweak(sc)) continue;        // (B) skip weak scalars
    if (MortalList.isDeferredCapture(sc)) continue;  // (C) skip pending mortals
    if (!MyVarCleanupStack.isLive(sc)) {
        if (sc.scopeExited) continue;                // (D) skip exited scopes
        if (!sc.refCountOwned) continue;             // (E) skip non-owning copies
    }
    visitScalar(sc, todo);
}
DBIC's pattern (my $phantom; for ( sub {...}, sub {...}, ... ) { $phantom = $sub->($phantom) }) involves passing complex blessed refs through @_ temporaries, accessor-method dispatch, Class::C3::XS next-method, Moo/Sub::Quote-generated accessors, and DBIC's _result_source weak-ref accessor magic. One of these creates a reachability path the walker doesn't follow, OR creates a scalar slot that lands in one of the skip branches above (A–E) when it shouldn't.
A second, less-likely possibility: there's a race between forceGcAndSnapshot() and the snapshot read — the WeakHashMap entry for the schema's lexical gets GC'd between the force-GC and the snapshot, even though the Perl-level lexical is still on the stack. We've ruled this out for simple cases (see below) but not for the full DBIC pattern.
What is NOT the problem
We've already disconfirmed several plausible-sounding hypotheses:
❌ NOT "walker doesn't seed my $scalar = $ref lexicals at all". A standalone reproducer with a single my $obj = bless {…}, … lexical + weakened back-ref + JPERL_FORCE_SWEEP_EVERY_FLUSH=1 + 20× explicit Internals::jperl_gc() PASSES. The walker correctly seeds the lexical via MyVarCleanupStack (see dev/sandbox/walker_blind_spot/lexical_scalar_root_PASSES.t).
❌ NOT "walker doesn't trace through globals". A DBIC-shape reproducer with the schema kept alive only via our %REGISTRY (mirroring DBIC's package-level storage) PASSES under the same conditions (see dev/sandbox/walker_blind_spot/dbic_real_pattern_PASSES.t).
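❌ NOT harness select() polling slowness. Investigated separately: the parent harness JVM is 99.7% idle in IOOperator.selectWithNIOThread.sleep(10) polling (10 jstack samples over 32 min, 6 s CPU total). When tests time out, it's the child JVM that's silent, not the parent.
❌ NOT leftover JVMs from earlier runs. The orphan watchdog (JPERL_ORPHAN_EXIT=1 set by ./jcpan and ./jprove) prevents leftover JVMs from contaminating subsequent runs. The schema-detached bug still reproduces with the watchdog active.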
❌ NOT macOS Jetsam SIGKILL. We tested -Xmx2g to constrain heap and reduce JVM virtual-memory pressure; the schema-detached bug still appears.
❌ NOT specific to one test. Across 5 separate full DBIC runs today, the failing test was different each time (8 distinct files have been hit). It's the parent process state at the moment of the auto-sweep that determines who loses.
How to investigate
The deterministic approach has three steps:
Step A — Add JPERL_WALKER_TRACE=1 instrumentation
In src/main/java/org/perlonjava/runtime/runtimetypes/ReachabilityWalker.java, inside sweepWeakRefs(boolean quiet) (around the loop that clears weak refs to unreachable objects), add an env-gated System.err.println for every clear. Each WALKER_CLEAR line should record at least the target class, a path= field, and a seedStats snapshot.
The path= field is the key piece — the existing ReachabilityWalker.findPathTo() (line 202) returns the actual reachability path the walker found (or <unreachable> if it didn't find one). The seedStats snapshot tells us whether the schema's lexical was eligible to be seeded, even if it ended up filtered out.
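One way the env-gated trace line could look is sketched below. Everything here is an assumption for illustration: `WalkerTraceSketch`, `traceEnabled`, and `formatClear` are hypothetical helpers, and the exact field layout (`target=`, `path=`, `seedStats=`) is a guess built around the `WALKER_CLEAR`, `path=`, and `seedStats` names used in this issue, not the instrumentation that will actually land.

```java
import java.util.Map;

public class WalkerTraceSketch {
    /** True when the proposed JPERL_WALKER_TRACE=1 switch is set in the given environment. */
    static boolean traceEnabled(Map<String, String> env) {
        return "1".equals(env.get("JPERL_WALKER_TRACE"));
    }

    /**
     * Builds one single-line trace record. The field layout is hypothetical.
     * A null path is rendered as <unreachable>, mirroring what findPathTo()
     * is described as reporting when no path exists.
     */
    static String formatClear(String targetClass, String path, String seedStats) {
        return "WALKER_CLEAR target=" + targetClass
                + " path=" + (path == null ? "<unreachable>" : path)
                + " seedStats=" + seedStats;
    }

    public static void main(String[] args) {
        // Only emit when the switch is on, so normal runs stay quiet.
        if (traceEnabled(System.getenv())) {
            System.err.println(formatClear(
                    "DBICTest::Schema", null, "{captureCount=0,refCountOwned=true}"));
        }
    }
}
```

Keeping each record on a single line with `key=value` fields makes the trace easy to filter with grep or a small scanner.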
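Step B — Capture a real DBIC failure
JPERL_FORCE_SWEEP_EVERY_FLUSH=1 JPERL_WALKER_TRACE=1 \
    timeout 3000 ./jcpan -t DBIx::Class > /tmp/dbic.log 2>/tmp/dbic.trace
In the resulting /tmp/dbic.trace file, find the first WALKER_CLEAR line whose target class is one of: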
DBIx::Class::Schema (or any subclass like DBICTest::Schema)
DBIx::Class::ResultSource::Table
DBIx::Class::ResultSourceHandle
DBIx::Class::Storage::DBI
That single trace line names the seeding gate that incorrectly excluded a still-live object.
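Finding that first line can be automated with a small scanner. The sketch below assumes a hypothetical trace format in which each record starts with `WALKER_CLEAR ` and carries a space-separated `target=<class>` field; `TraceScanSketch` and `firstSuspect` are illustrative names. It matches class names exactly, so other schema subclasses would need to be added to the set by hand.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class TraceScanSketch {
    /** Classes of interest from this issue, plus the DBICTest::Schema subclass. */
    static final Set<String> TARGETS = new HashSet<>(Arrays.asList(
            "DBIx::Class::Schema",
            "DBICTest::Schema",
            "DBIx::Class::ResultSource::Table",
            "DBIx::Class::ResultSourceHandle",
            "DBIx::Class::Storage::DBI"));

    /** Returns the first WALKER_CLEAR line whose target= field names a class in TARGETS. */
    static Optional<String> firstSuspect(List<String> lines) {
        for (String line : lines) {
            if (!line.startsWith("WALKER_CLEAR ")) {
                continue;
            }
            for (String field : line.split(" ")) {
                if (field.startsWith("target=")
                        && TARGETS.contains(field.substring("target=".length()))) {
                    return Optional.of(line);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        Path trace = Paths.get(args.length > 0 ? args[0] : "/tmp/dbic.trace");
        if (!Files.exists(trace)) {
            System.out.println("trace file not found: " + trace);
            return;
        }
        System.out.println(firstSuspect(Files.readAllLines(trace))
                .orElse("no matching WALKER_CLEAR line"));
    }
}
```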
Step C — Diagnose
The trace line will fall into one of these patterns. Each points to a different fix:
| Trace pattern | Meaning | Fix location |
| --- | --- | --- |
| `path=<reachable via …captures…>` but cleared anyway | walker found a path but cleared regardless: bug in the clear-loop's check | `ReachabilityWalker.sweepWeakRefs` clear logic |
| `path=<unreachable>` AND `refCountOwned=true` AND `captureCount>0` | the (A) skip is over-eager: closures capturing this scalar are themselves not reachable from `globalCodeRefs` | walker Phase 2 (a): make the `captureCount` skip conditional on the closure being walkable |
| `path=<unreachable>` AND `refCountOwned=true` AND `MyVarCleanupStack.isLive=false` AND `scopeExited=false` | walker missed the live lexical because `MyVarCleanupStack` doesn't know about it | bytecode emission in `EmitVariable.java`: register `my $scalar` declarations into MyVarCleanupStack the same way `my @arr` / `my %hash` are |
| `path=<unreachable>` AND `storedInPackageGlobal=true` | the global container isn't being walked or the link from container → element broke | walker Phase 2: fix the global-hash visit pattern |
| `path=<unreachable>` AND DBIC-specific accessor (Sub::Quote eval, Class::C3::XS) | walker doesn't follow Sub::Quote's eval-string captures or the C3::XS dispatch chain | walker: add a code-walking phase for these constructs |
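The decision table above can also be written as a small classifier, which may help when triaging many trace lines. This is a sketch only: `TraceDiagnosisSketch` and `diagnose` are hypothetical, the boolean inputs assume the trace exposes the fields named in the table, and the order in which overlapping patterns are checked is an assumption rather than something this issue specifies.

```java
public class TraceDiagnosisSketch {
    /**
     * Maps the fields of one WALKER_CLEAR record to the suspected fix location.
     * The check order for overlapping patterns is an assumption.
     */
    static String diagnose(boolean pathReachable, boolean refCountOwned, int captureCount,
                           boolean isLive, boolean scopeExited, boolean storedInPackageGlobal) {
        if (pathReachable) {
            // A path was found, yet the weak ref was cleared anyway.
            return "ReachabilityWalker.sweepWeakRefs clear logic";
        }
        if (storedInPackageGlobal) {
            return "walker Phase 2: global-hash visit pattern";
        }
        if (refCountOwned && captureCount > 0) {
            return "walker Phase 2 (a): captureCount skip";
        }
        if (refCountOwned && !isLive && !scopeExited) {
            return "EmitVariable.java: MyVarCleanupStack registration";
        }
        // None of the field-based patterns matched; look at accessor-specific paths.
        return "walker: Sub::Quote / Class::C3::XS code walking";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(false, true, 1, false, false, false));
    }
}
```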
How it can be fixed
Once Step C identifies which property/state caused the false-negative seeding, the fix is in ReachabilityWalker.walk() (Phase 2, lines 111–153). Most likely candidates, in order of expected probability:
Sub::Quote / Class::C3::XS captures. DBIC heavily uses Sub::Quote-generated accessors (look for the B::Deparse-style eval-string subs created by Moo and DBIC accessors). When such a sub captures the schema, the walker's "Phase 1: walk globalCodeRefs WITH captures" should follow the capture — but only if the sub is in globalCodeRefs. Sub::Quote-generated subs may not be (they're stored in package stash slots that aren't always walked the same way). Fix: walk all RuntimeCode instances reachable from anywhere in the symbol table, not just globalCodeRefs.
Captured scalar over-skip. The captureCount > 0 → continue skip at line 113 assumes that captured scalars will be reached via the closure walk. If the closure isn't itself reachable, the captured scalar gets dropped. Fix: gate the skip on captureCount > 0 AND we_actually_walked_a_closure_that_captures_it.
MyVarCleanupStack scalar registration. Today MyVarCleanupStack is registered on every my $x (in EmitVariable.java line 1551 — gated by cleanupNeeded), but the walker's path (b) handles RuntimeScalar only via the liveVar instanceof RuntimeScalar branch. The branch logic may have a gap. Fix: trace what MyVarCleanupStack.snapshotLiveVars() returns at sweep time and verify scalars with refCountOwned=true are always seeded.
Storable seen-table inflation. DBIC tests use dclone extensively. Storable's seen-table holds strong refs internally during the freeze/thaw window, and these can leak to other tests via the static encoder/decoder. Fix: ensure Storable's seen-table is fully cleared at every freeze/thaw boundary (tracked separately as part of the PR #621 follow-up).
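The "captured scalar over-skip" candidate above can be sketched in isolation. This toy model is an assumption-laden illustration, not the walker's code: `CaptureGateSketch`, `Scalar`, `Closure`, and `seeds` are hypothetical, and only the `captureCount > 0` condition comes from the snippet quoted earlier. With the gate enabled, a captured scalar is skipped only when a closure that was actually walked captures it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class CaptureGateSketch {
    /** Toy scalar slot (hypothetical): a name and how many closures capture it. */
    static final class Scalar {
        final String name;
        final int captureCount;
        Scalar(String name, int captureCount) {
            this.name = name;
            this.captureCount = captureCount;
        }
    }

    /** Toy closure (hypothetical): just the scalars it captures. */
    static final class Closure {
        final List<Scalar> captures;
        Closure(List<Scalar> captures) { this.captures = captures; }
    }

    /**
     * Chooses which candidate scalars become reachability roots.
     * gated=false models the current skip: any captured scalar is dropped.
     * gated=true drops it only if a walked closure really covers it.
     */
    static List<Scalar> seeds(List<Scalar> candidates, List<Closure> walkedClosures, boolean gated) {
        Set<Scalar> covered = Collections.newSetFromMap(new IdentityHashMap<Scalar, Boolean>());
        for (Closure c : walkedClosures) {
            covered.addAll(c.captures);
        }
        List<Scalar> out = new ArrayList<>();
        for (Scalar sc : candidates) {
            boolean skip = sc.captureCount > 0 && (!gated || covered.contains(sc));
            if (!skip) {
                out.add(sc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Scalar phantom = new Scalar("$phantom", 1);
        List<Scalar> candidates = Arrays.asList(phantom);
        List<Closure> noneWalked = Collections.emptyList(); // capturing closure was never reached

        System.out.println(seeds(candidates, noneWalked, false).size()); // ungated: scalar dropped
        System.out.println(seeds(candidates, noneWalked, true).size());  // gated: scalar seeded
    }
}
```

The gated variant costs one extra set lookup per captured scalar, in exchange for never dropping a scalar whose capturing closure was not itself reached.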
A regression test under src/test/resources/unit/refcount/walker_<specific_gate>.t should be added once the actual gate is identified (Step C). The two dev/sandbox/walker_blind_spot/*_PASSES.t files already exercise the harness setup correctly; whoever fixes this just needs to extend them with the real failing pattern.
What's already landed (PR #635)
JPERL_FORCE_SWEEP_EVERY_FLUSH=1 debug knob in MortalList.maybeAutoSweep(), the prerequisite for deterministic Step B.
Two *_PASSES.t reproducers that don't fail — they prove what's NOT the bug.
Updated investigation plan in dev/modules/dbix_class.md (sections "Investigation Plan: Schema-Detached Bug in t/52leaks.t (line 430)" and "Next steps (concrete, in order)").
Sandbox README dev/sandbox/walker_blind_spot/README.md summarising the current state for the next investigator.
Why this matters
DBIx::Class is published as PASS in dev/cpan-reports/cpan-compatibility.md. Users running jcpan DBIx::Class see a clean install when it is run alone (it really does pass standalone), but a failing install under the published smoke-test infrastructure. That is a worse user experience than blocking the install upfront, because it undermines the trust signal in the compatibility report. We can't update the report to FAIL either, because the standalone install really is functional. The intermittent failure under the harness needs to be fixed, not papered over.
Acceptance criteria
./jcpan -t DBIx::Class completes with 0/314 failures over 5 consecutive runs on a quiet box.
JPERL_FORCE_SWEEP_EVERY_FLUSH=1 ./jcpan -t DBIx::Class also passes (deterministic mode is now safe).
A unit test under src/test/resources/unit/refcount/walker_<…>.t reproduces the original failure pattern and is part of make.
No regression in the existing leak-detection assertions (t/52leaks.t tests 12-18 still pass at the same rate as before).