Skip to content

fix(orchestrator): phantom-sibling false positive + stale-file reaper#4

Merged
evannadeau merged 1 commit into
single-orchestratorfrom
plugin-coder/qln-145-impl
May 31, 2026
Merged

fix(orchestrator): phantom-sibling false positive + stale-file reaper#4
evannadeau merged 1 commit into
single-orchestratorfrom
plugin-coder/qln-145-impl

Conversation

@evannadeau

Copy link
Copy Markdown
Owner

Summary

Two related fixes that share a root cause: on a fresh single-claude launch the MCP server's selfSession.session_id can drift from the harness session_id, AND nothing reaps the per-PID active-session-<pid> files that feed the drift. Tracker: QLN-145.

Bug 1 - phantom-sibling false positive

On a cold start with no /resume, if a stale active-session-<pid> file from a prior session collides on PID with the new claude process (or the legacy single-file fallback wins a race), the MCP self-registers in agent_channel.db under an id8 that is NOT the harness session_id. The every-turn cross-session injection then compares against the harness id only and reports the MCP's own row as "1 sibling session active" for the whole session lifetime - heartbeat keeps advancing so the bug never self-clears.

Root cause: getLiveOtherSessionIds(sessionId) in mcp/engine/live_sessions.ts only filters by the caller (harness) id. When the MCP's selfSession.session_id differs from the caller, the MCP's own row leaks through as a phantom sibling.

Fix: live_sessions.ts gains a process-wide opt-in self-id filter (setSelfSessionForLiveFilter / clearSelfSessionForLiveFilter). server.ts registers the MCP's selfSession id at startAgentChannel time AND on the first explicit session_id observed via resolveSessionId (the explicit value is the strongest signal of the harness's real id). getLiveOtherSessionIds excludes BOTH the caller id and the registered self id, preventing the phantom even when the two disagree.

Bug 2 - stale per-PID active-session-<pid> reaper

The per-PID active-session-<pid> scheme (0.30.19+) makes session_id lookup race-free under concurrent sessions, but nothing reaps the files when the owning claude process exits. They accumulate indefinitely - one workspace hit 30 stale files in ~12 days. Worse, PID reuse hands the fallback resolver a session_id that is no longer live, feeding directly into Bug 1.

Root cause: No startup-time sweep. Upstream PR SpawnBox-dev#8 (b7f43b7) proposed the reaper but never merged, so this vendored fork still ships without it.

Fix: extract reapStaleActiveSessionFiles into its own engine module (mcp/engine/startup_hygiene.ts) so it is testable in isolation with an injectable liveness probe, and wire it at MCP server startup. Cheap, idempotent, race-safe via process.kill(pid, 0). Lost races with concurrent sessions tolerated - next startup retries. This is a slimmer port of upstream PR SpawnBox-dev#8 (just the reaper - the linux-only orphan-bun warner from that PR is deferred until needed).

Files changed

  • plugins/orchestrator/mcp/engine/live_sessions.ts - self-id filter machinery
  • plugins/orchestrator/mcp/engine/startup_hygiene.ts - new module (reaper)
  • plugins/orchestrator/mcp/server.ts - wires both fixes at the right lifecycle points
  • plugins/orchestrator/tests/engine/live_sessions_phantom_self.test.ts - 3 cases
  • plugins/orchestrator/tests/engine/startup_hygiene.test.ts - 6 cases
  • plugins/orchestrator/dist/server.js - rebuilt via bun run build
  • plugins/orchestrator/package.json + .claude-plugin/plugin.json - 0.30.52 -> 0.30.53

Verification

  • bun test: 599 pass / 0 fail / 42 files / 1527 assertions (was 590 baseline; +9 new)
  • bun run typecheck (tsc --noEmit): clean
  • bun run build: dist rebuilt; grep -c "reapStaleActiveSessionFiles\|setSelfSessionForLiveFilter" dist/server.js = 5 (both fixes present)

Test plan

  • Fresh single-claude launch (no /resume) in a workspace with no stale per-PID files - sibling line should not appear when no real siblings exist.
  • Fresh single-claude launch in a workspace with stale active-session-<pid> files from prior sessions - verify [orchestrator] startup hygiene: reaped N stale active-session-<pid> file(s) appears in MCP stderr at boot, and the files are gone.
  • Two real concurrent claude sessions in the same project - each should see the other as a sibling (filter does NOT swallow genuine siblings).
  • PID-reuse repro: manually create <project>/.orchestrator-state/active-session-<your-current-claude-pid> with a stale session_id, then start a new claude session - the rebind on first tool call must clear the phantom.

… (QLN-145)

Two related fixes that share a root cause: a fresh single-claude launch can
end up with the MCP server's selfSession.session_id drifted from the harness
session_id, and nothing reaps the per-PID active-session-<pid> files that
feed the drift.

Bug 1 (phantom-sibling false positive):

On a cold start with no /resume, if a stale active-session-<pid> file from
a prior session collides on PID with the new claude process (or the legacy
single-file fallback wins a race), the MCP self-registers in
agent_channel.db under an id8 that is NOT the harness session_id. The
every-turn cross-session injection then compares against the harness id
only and reports the MCP's own row as "1 sibling session active" for the
whole session lifetime - heartbeat keeps advancing so the bug never
self-clears.

Fix: live_sessions.ts gains a process-wide self-id filter. server.ts
registers the MCP's selfSession id at startAgentChannel time AND on the
first explicit session_id observed via resolveSessionId. getLive-
OtherSessionIds excludes BOTH the caller id and the registered self id,
preventing the phantom even when the two disagree.

Bug 2 (stale per-PID active-session file reaper):

The per-PID active-session-<pid> scheme (0.30.19+) makes session_id lookup
race-free under concurrent sessions, but nothing reaps the files when the
owning claude process exits. They accumulate indefinitely - one project
hit 30 stale files in ~12 days. Worse, PID reuse hands the fallback
resolver a session_id that is no longer live, feeding directly into Bug 1.

Fix: extract reapStaleActiveSessionFiles into its own engine module
(testable in isolation with an injectable liveness probe) and wire it at
MCP server startup. Cheap, idempotent, race-safe via process.kill(pid, 0).
Originally proposed upstream as spawnbox-dev/claude-plugins PR SpawnBox-dev#8; that PR
never merged so this fork carries the fix.

Tests:
- tests/engine/live_sessions_phantom_self.test.ts (3 cases) - covers
  drift, genuine siblings still surface, idempotent self-id updates.
- tests/engine/startup_hygiene.test.ts (6 cases) - covers reap of dead-
  PID files, non-PID files ignored, missing dir no-op, empty dir no-op,
  real process.kill probe smoke test, pid<=0 rejection.
- Full suite: 599 pass / 0 fail (was 590 baseline).
- tsc --noEmit clean.

Version: 0.30.52 -> 0.30.53.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@evannadeau evannadeau left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review verdict: DONE_WITH_CONCERNS

Solid investigative work and clean test infra discipline. One correctness gap and one test-as-spec concern I want surfaced before this is treated as a closed chapter on QLN-145.

What I verified

  • Branch off single-orchestrator (ADR-0021 D2): merge-base = 79c0908, confirmed.
  • No PA/SA markdown changes (D3): diff touches only mcp/, tests/, dist/, .claude-plugin/plugin.json, package.json.
  • No quayline identifier anywhere under plugins/orchestrator/{mcp,tests,.claude-plugin} or package.json (D10): grep clean.
  • bun test: 599 pass / 0 fail / 1527 assertions, matches coder claim. New live_sessions_phantom_self.test.ts (3) and startup_hygiene.test.ts (6) all green.
  • bun run typecheck (tsc --noEmit): clean.
  • dist/server.js rebuilt: grep -c "reapStaleActiveSessionFiles\|setSelfSessionForLiveFilter" = 5.
  • Upstream PR SpawnBox-dev#8 status (SpawnBox-dev/claude-plugins#8): open, not merged — coder claim verified, QLN-145 issue body was wrong.
  • Reaper module's process.kill(pid, 0) probe + injectable liveness shape: appropriate testability surface.
  • Reaper scope (no orphan-bun warner): defensibly slimmer. Linux-only, separable concern, "ship when needed."

Concern 1 — phantom-sibling fix is partial, not a complete close on the race

The fix excludes the MCP's drifted self-row via a process-wide selfSessionIdForFilter set in two places:

  1. startAgentChannel registers selfSession.session_id (drifted value X).
  2. resolveSessionId(explicit) rebinds the filter to the harness id Y on first tool call.

But step 2 only rebinds the filter — it does NOT rebind AgentChannel.selfSession, does NOT call removeSession(X), and does NOT writeSession(Y). The heartbeat timer in agent_channel.ts:251-278 keeps refreshing the row keyed on this.selfSession.session_id = X forever.

Concrete trace on a currently-drifted launch:

  1. Cold start, stale active-session-<pid> matches new PID, getFallbackSessionId() returns X.
  2. startAgentChannel writes self-row under X, filter = X. (Phantom suppressed.)
  3. First hook event fires with session_id = Y (harness id).
  4. resolveSessionId(Y) runs setSelfSessionForLiveFilter(Y). Filter is now Y.
  5. renderSiblingActivitytracker.getActiveSiblings(Y)getLiveOtherSessionIds(Y).
  6. Iteration: Y excluded as caller; filter = Y (no help); X (heartbeat-fresh self-row) passes through → phantom resurfaces on the first prompt.

The fix only fully closes the window for the brief moment between startAgentChannel and the first explicit session_id. The reaper (Bug 2 fix) prevents recurrence on the NEXT launch, which is genuinely the durable fix. But on the launch where drift already occurred (the QLN-145 reported symptom), this PR does not actually close the symptom.

Two ways forward:

  • Smaller: in resolveSessionId on rebind, also call into AgentChannel.rebindSelfSession(Y) which does removeSession(X) + updates this.selfSession.session_id = Y + immediate heartbeat. ~15-20 LOC.
  • Larger: maintain a self-id exclusion SET, not point-in-time. Trivially correct but leaks unbounded across MCP lifetime (small set in practice).

Concern 2 — test 3 pins the wrong contract

live_sessions_phantom_self.test.ts:184-194:

setSelfSessionForLiveFilter(updatedSelf);
others = getLiveOtherSessionIds(harnessSessionId);
expect(others).not.toContain(updatedSelf);
// ...
// firstSelf reappears as an apparent sibling - acceptable, because in
// production startAgentChannel removes the old row before re-registering
// under the new id.
expect(others).toContain(firstSelf);

The comment's "in production startAgentChannel removes the old row before re-registering" assumption does not match the code I read. startAgentChannel runs exactly once at MCP boot (with a retry loop until session becomes resolvable), never re-runs on resolver rebind, and never removes the prior self-row. The test pins the buggy behavior with a justification that doesn't hold. Either:

  • the fix should be extended so this assertion can flip to expect(others).not.toContain(firstSelf) (correct contract), or
  • the comment should be rewritten to honestly state "this is a known partial-fix limitation; the durable fix is Bug 2's reaper preventing the drift from happening at all on the next launch."

Concern 3 (minor) — test infra workaround vs design fix

writeFileSync(join(stateDir, "agent_channel.db"), "") as a marker to satisfy getAgentChannelStateDir's existsSync check is a fine test workaround. Worth a follow-up to consider whether :memory: DB mode should signal "channel state is logically present" some other way (env var? a sentinel?) so tests don't need to forge a zero-byte SQLite file. Not blocking.

Verdict rationale

  • The reaper (Bug 2) is correct, well-tested, well-scoped, and the durable root-cause fix. Ship.
  • The phantom-sibling filter (Bug 1) reduces the exposure window but does not actually fix the reported symptom on a currently-drifted launch. The combination of reaper + filter means: drift won't recur after this lands, but the launch that already drifted continues to show the phantom for its lifetime. That's a real improvement, but the PR description and tests overclaim closure.
  • ADR-0021 invariants all clean. Engine extraction shape is right.
  • 9 new tests for 2 bugs — coverage is adequate for what the fix DOES do; missing coverage for "rebind clears the drifted row from DB" because that path doesn't exist.

Approving with concerns because: (a) reaper alone is worth shipping immediately to stop bleeding, (b) the partial filter is a strict improvement over status quo, (c) the gap can be closed in a follow-up. Strong recommendation: open a follow-up issue for "complete the Bug 1 fix — rebind AgentChannel.selfSession on resolveSessionId rebind" before closing QLN-145.

Plugin-class review precedent (first ever)

Future plugin-class PRs should expect: ADR-0021 invariant spot-check (D2/D3/D10 grep), upstream PR status verification when fix is "ported from upstream", bun test + tsc --noEmit from plugins/orchestrator/, race-condition fix correctness traced end-to-end (not just test-green), dist artifact verified rebuilt, and willingness to push back when tests pin buggy contracts even if green. Plugin code is reviewed adversarially same as product code, with extra weight on race-window analysis and embeddings-sidecar contract preservation.

Reviewer: skip subordinate, dispatched 2026-05-30.

@evannadeau evannadeau merged commit cb43f32 into single-orchestrator May 31, 2026
@evannadeau evannadeau deleted the plugin-coder/qln-145-impl branch May 31, 2026 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant