fix(orchestrator): phantom-sibling false positive + stale-file reaper by evannadeau · Pull Request #4 · evannadeau/claude-plugins

evannadeau · 2026-05-31T02:22:14Z

Summary

Two related fixes that share a root cause: on a fresh single-claude launch the MCP server's selfSession.session_id can drift from the harness session_id, AND nothing reaps the per-PID active-session-<pid> files that feed the drift. Tracker: QLN-145.

Bug 1 - phantom-sibling false positive

On a cold start with no /resume, if a stale active-session-<pid> file from a prior session collides on PID with the new claude process (or the legacy single-file fallback wins a race), the MCP self-registers in agent_channel.db under an id8 that is NOT the harness session_id. The every-turn cross-session injection then compares against the harness id only and reports the MCP's own row as "1 sibling session active" for the whole session lifetime - heartbeat keeps advancing so the bug never self-clears.

Root cause: getLiveOtherSessionIds(sessionId) in mcp/engine/live_sessions.ts only filters by the caller (harness) id. When the MCP's selfSession.session_id differs from the caller, the MCP's own row leaks through as a phantom sibling.

Fix: live_sessions.ts gains a process-wide opt-in self-id filter (setSelfSessionForLiveFilter / clearSelfSessionForLiveFilter). server.ts registers the MCP's selfSession id at startAgentChannel time AND on the first explicit session_id observed via resolveSessionId (the explicit value is the strongest signal of the harness's real id). getLiveOtherSessionIds excludes BOTH the caller id and the registered self id, preventing the phantom even when the two disagree.

Bug 2 - stale per-PID `active-session-<pid>` reaper

The per-PID active-session-<pid> scheme (0.30.19+) makes session_id lookup race-free under concurrent sessions, but nothing reaps the files when the owning claude process exits. They accumulate indefinitely - one workspace hit 30 stale files in ~12 days. Worse, PID reuse hands the fallback resolver a session_id that is no longer live, feeding directly into Bug 1.

Root cause: No startup-time sweep. Upstream PR SpawnBox-dev#8 (b7f43b7) proposed the reaper but never merged, so this vendored fork still ships without it.

Fix: extract reapStaleActiveSessionFiles into its own engine module (mcp/engine/startup_hygiene.ts) so it is testable in isolation with an injectable liveness probe, and wire it at MCP server startup. Cheap, idempotent, race-safe via process.kill(pid, 0). Lost races with concurrent sessions tolerated - next startup retries. This is a slimmer port of upstream PR SpawnBox-dev#8 (just the reaper - the linux-only orphan-bun warner from that PR is deferred until needed).

Files changed

plugins/orchestrator/mcp/engine/live_sessions.ts - self-id filter machinery
plugins/orchestrator/mcp/engine/startup_hygiene.ts - new module (reaper)
plugins/orchestrator/mcp/server.ts - wires both fixes at the right lifecycle points
plugins/orchestrator/tests/engine/live_sessions_phantom_self.test.ts - 3 cases
plugins/orchestrator/tests/engine/startup_hygiene.test.ts - 6 cases
plugins/orchestrator/dist/server.js - rebuilt via bun run build
plugins/orchestrator/package.json + .claude-plugin/plugin.json - 0.30.52 -> 0.30.53

Verification

bun test: 599 pass / 0 fail / 42 files / 1527 assertions (was 590 baseline; +9 new)
bun run typecheck (tsc --noEmit): clean
bun run build: dist rebuilt; grep -c "reapStaleActiveSessionFiles\|setSelfSessionForLiveFilter" dist/server.js = 5 (both fixes present)

Test plan

Fresh single-claude launch (no /resume) in a workspace with no stale per-PID files - sibling line should not appear when no real siblings exist.
Fresh single-claude launch in a workspace with stale active-session-<pid> files from prior sessions - verify [orchestrator] startup hygiene: reaped N stale active-session-<pid> file(s) appears in MCP stderr at boot, and the files are gone.
Two real concurrent claude sessions in the same project - each should see the other as a sibling (filter does NOT swallow genuine siblings).
PID-reuse repro: manually create <project>/.orchestrator-state/active-session-<your-current-claude-pid> with a stale session_id, then start a new claude session - the rebind on first tool call must clear the phantom.

… (QLN-145) Two related fixes that share a root cause: a fresh single-claude launch can end up with the MCP server's selfSession.session_id drifted from the harness session_id, and nothing reaps the per-PID active-session-<pid> files that feed the drift. Bug 1 (phantom-sibling false positive): On a cold start with no /resume, if a stale active-session-<pid> file from a prior session collides on PID with the new claude process (or the legacy single-file fallback wins a race), the MCP self-registers in agent_channel.db under an id8 that is NOT the harness session_id. The every-turn cross-session injection then compares against the harness id only and reports the MCP's own row as "1 sibling session active" for the whole session lifetime - heartbeat keeps advancing so the bug never self-clears. Fix: live_sessions.ts gains a process-wide self-id filter. server.ts registers the MCP's selfSession id at startAgentChannel time AND on the first explicit session_id observed via resolveSessionId. getLive- OtherSessionIds excludes BOTH the caller id and the registered self id, preventing the phantom even when the two disagree. Bug 2 (stale per-PID active-session file reaper): The per-PID active-session-<pid> scheme (0.30.19+) makes session_id lookup race-free under concurrent sessions, but nothing reaps the files when the owning claude process exits. They accumulate indefinitely - one project hit 30 stale files in ~12 days. Worse, PID reuse hands the fallback resolver a session_id that is no longer live, feeding directly into Bug 1. Fix: extract reapStaleActiveSessionFiles into its own engine module (testable in isolation with an injectable liveness probe) and wire it at MCP server startup. Cheap, idempotent, race-safe via process.kill(pid, 0). Originally proposed upstream as spawnbox-dev/claude-plugins PR SpawnBox-dev#8; that PR never merged so this fork carries the fix. Tests: - tests/engine/live_sessions_phantom_self.test.ts (3 cases) - covers drift, genuine siblings still surface, idempotent self-id updates. - tests/engine/startup_hygiene.test.ts (6 cases) - covers reap of dead- PID files, non-PID files ignored, missing dir no-op, empty dir no-op, real process.kill probe smoke test, pid<=0 rejection. - Full suite: 599 pass / 0 fail (was 590 baseline). - tsc --noEmit clean. Version: 0.30.52 -> 0.30.53. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

evannadeau

Review verdict: DONE_WITH_CONCERNS

Solid investigative work and clean test infra discipline. One correctness gap and one test-as-spec concern I want surfaced before this is treated as a closed chapter on QLN-145.

What I verified

Branch off single-orchestrator (ADR-0021 D2): merge-base = 79c0908, confirmed.
No PA/SA markdown changes (D3): diff touches only mcp/, tests/, dist/, .claude-plugin/plugin.json, package.json.
No quayline identifier anywhere under plugins/orchestrator/{mcp,tests,.claude-plugin} or package.json (D10): grep clean.
bun test: 599 pass / 0 fail / 1527 assertions, matches coder claim. New live_sessions_phantom_self.test.ts (3) and startup_hygiene.test.ts (6) all green.
bun run typecheck (tsc --noEmit): clean.
dist/server.js rebuilt: grep -c "reapStaleActiveSessionFiles\|setSelfSessionForLiveFilter" = 5.
Upstream PR SpawnBox-dev#8 status (SpawnBox-dev/claude-plugins#8): open, not merged — coder claim verified, QLN-145 issue body was wrong.
Reaper module's process.kill(pid, 0) probe + injectable liveness shape: appropriate testability surface.
Reaper scope (no orphan-bun warner): defensibly slimmer. Linux-only, separable concern, "ship when needed."

Concern 1 — phantom-sibling fix is partial, not a complete close on the race

The fix excludes the MCP's drifted self-row via a process-wide selfSessionIdForFilter set in two places:

startAgentChannel registers selfSession.session_id (drifted value X).
resolveSessionId(explicit) rebinds the filter to the harness id Y on first tool call.

But step 2 only rebinds the filter — it does NOT rebind AgentChannel.selfSession, does NOT call removeSession(X), and does NOT writeSession(Y). The heartbeat timer in agent_channel.ts:251-278 keeps refreshing the row keyed on this.selfSession.session_id = X forever.

Concrete trace on a currently-drifted launch:

Cold start, stale active-session-<pid> matches new PID, getFallbackSessionId() returns X.
startAgentChannel writes self-row under X, filter = X. (Phantom suppressed.)
First hook event fires with session_id = Y (harness id).
resolveSessionId(Y) runs setSelfSessionForLiveFilter(Y). Filter is now Y.
renderSiblingActivity → tracker.getActiveSiblings(Y) → getLiveOtherSessionIds(Y).
Iteration: Y excluded as caller; filter = Y (no help); X (heartbeat-fresh self-row) passes through → phantom resurfaces on the first prompt.

The fix only fully closes the window for the brief moment between startAgentChannel and the first explicit session_id. The reaper (Bug 2 fix) prevents recurrence on the NEXT launch, which is genuinely the durable fix. But on the launch where drift already occurred (the QLN-145 reported symptom), this PR does not actually close the symptom.

Two ways forward:

Smaller: in resolveSessionId on rebind, also call into AgentChannel.rebindSelfSession(Y) which does removeSession(X) + updates this.selfSession.session_id = Y + immediate heartbeat. ~15-20 LOC.
Larger: maintain a self-id exclusion SET, not point-in-time. Trivially correct but leaks unbounded across MCP lifetime (small set in practice).

Concern 2 — test 3 pins the wrong contract

live_sessions_phantom_self.test.ts:184-194:

setSelfSessionForLiveFilter(updatedSelf);
others = getLiveOtherSessionIds(harnessSessionId);
expect(others).not.toContain(updatedSelf);
// ...
// firstSelf reappears as an apparent sibling - acceptable, because in
// production startAgentChannel removes the old row before re-registering
// under the new id.
expect(others).toContain(firstSelf);

The comment's "in production startAgentChannel removes the old row before re-registering" assumption does not match the code I read. startAgentChannel runs exactly once at MCP boot (with a retry loop until session becomes resolvable), never re-runs on resolver rebind, and never removes the prior self-row. The test pins the buggy behavior with a justification that doesn't hold. Either:

the fix should be extended so this assertion can flip to expect(others).not.toContain(firstSelf) (correct contract), or
the comment should be rewritten to honestly state "this is a known partial-fix limitation; the durable fix is Bug 2's reaper preventing the drift from happening at all on the next launch."

Concern 3 (minor) — test infra workaround vs design fix

writeFileSync(join(stateDir, "agent_channel.db"), "") as a marker to satisfy getAgentChannelStateDir's existsSync check is a fine test workaround. Worth a follow-up to consider whether :memory: DB mode should signal "channel state is logically present" some other way (env var? a sentinel?) so tests don't need to forge a zero-byte SQLite file. Not blocking.

Verdict rationale

The reaper (Bug 2) is correct, well-tested, well-scoped, and the durable root-cause fix. Ship.
The phantom-sibling filter (Bug 1) reduces the exposure window but does not actually fix the reported symptom on a currently-drifted launch. The combination of reaper + filter means: drift won't recur after this lands, but the launch that already drifted continues to show the phantom for its lifetime. That's a real improvement, but the PR description and tests overclaim closure.
ADR-0021 invariants all clean. Engine extraction shape is right.
9 new tests for 2 bugs — coverage is adequate for what the fix DOES do; missing coverage for "rebind clears the drifted row from DB" because that path doesn't exist.

Approving with concerns because: (a) reaper alone is worth shipping immediately to stop bleeding, (b) the partial filter is a strict improvement over status quo, (c) the gap can be closed in a follow-up. Strong recommendation: open a follow-up issue for "complete the Bug 1 fix — rebind AgentChannel.selfSession on resolveSessionId rebind" before closing QLN-145.

Plugin-class review precedent (first ever)

Future plugin-class PRs should expect: ADR-0021 invariant spot-check (D2/D3/D10 grep), upstream PR status verification when fix is "ported from upstream", bun test + tsc --noEmit from plugins/orchestrator/, race-condition fix correctness traced end-to-end (not just test-green), dist artifact verified rebuilt, and willingness to push back when tests pin buggy contracts even if green. Plugin code is reviewed adversarially same as product code, with extra weight on race-window analysis and embeddings-sidecar contract preservation.

Reviewer: skip subordinate, dispatched 2026-05-30.

evannadeau commented May 31, 2026

View reviewed changes

evannadeau merged commit cb43f32 into single-orchestrator May 31, 2026

evannadeau deleted the plugin-coder/qln-145-impl branch May 31, 2026 07:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(orchestrator): phantom-sibling false positive + stale-file reaper#4

fix(orchestrator): phantom-sibling false positive + stale-file reaper#4
evannadeau merged 1 commit into
single-orchestratorfrom
plugin-coder/qln-145-impl

evannadeau commented May 31, 2026

Uh oh!

evannadeau left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evannadeau commented May 31, 2026

Summary

Bug 1 - phantom-sibling false positive

Bug 2 - stale per-PID active-session-<pid> reaper

Files changed

Verification

Test plan

Uh oh!

evannadeau left a comment

Choose a reason for hiding this comment

Review verdict: DONE_WITH_CONCERNS

What I verified

Concern 1 — phantom-sibling fix is partial, not a complete close on the race

Concern 2 — test 3 pins the wrong contract

Concern 3 (minor) — test infra workaround vs design fix

Verdict rationale

Plugin-class review precedent (first ever)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 2 - stale per-PID `active-session-<pid>` reaper