Perf WG-4: Relay perf — hook event dedup, cleanup leaks, SSE backpressure #74

@DFearing

Summary

Workflow 4 of 5 from the 2026-05-03 performance review of main. Parent: #46. Sibling: #75 (extension JSONL ingestion). Six findings on scripts/relay.ts and the webview message bus, plus one bundled static-serving fix from app/ (MR-12). Two are correctness bugs (cleanup leak, hook-event duplication); the rest are throughput / lifecycle wins.

Critical

CR-5 · relay.ts dispose() leaks subagent watchers, dir watchers, and permission timers

  • scripts/relay.ts:633-661 vs extension/src/session-watcher.ts:559-587 (the working reference)
  • The relay loops for (const session of sessions.values()) and closes only fileWatcher, pollTimer, and inactivityTimer. NOT closed:
    • session.subagentsDirWatcher (fs.FSWatcher set in subagent-watcher.ts:55)
    • Each session.subagentWatchers[*].watcher (per-subagent-file watchers)
    • session.permissionTimer and per-subagent permission timers
    • parser.clearSessionState(...) is not called
  • The two dispose paths share the same WatchedSession shape but the cleanup logic diverged — the extension's is correct.
  • Also: there is no per-session cleanup hook. When a session ends, watchers and parser state stay alive until full process dispose.
  • Fix: mirror SessionWatcher.dispose() cleanup loop into the relay; add a per-session disposeSession(sessionId) invoked from agent_complete/inactivity paths. Pairs naturally with IR-21.
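A minimal sketch of the proposed disposeSession hook. The field names follow the WatchedSession shape described above, but the actual interface in scripts/relay.ts may differ; clearSessionState is passed in to stand for parser.clearSessionState.

```typescript
interface Closeable { close(): void }

// Assumed shape, mirroring the fields named in this finding.
interface WatchedSession {
  fileWatcher?: Closeable
  pollTimer?: ReturnType<typeof setInterval>
  inactivityTimer?: ReturnType<typeof setTimeout>
  subagentsDirWatcher?: Closeable
  subagentWatchers: Map<string, {
    watcher: Closeable
    permissionTimer?: ReturnType<typeof setTimeout>
  }>
  permissionTimer?: ReturnType<typeof setTimeout>
}

// Per-session cleanup, callable from agent_complete / inactivity paths
// and from the full-process dispose() loop.
function disposeSession(
  sessions: Map<string, WatchedSession>,
  sessionId: string,
  clearSessionState: (id: string) => void,
): void {
  const session = sessions.get(sessionId)
  if (!session) return
  // What the relay already closes today:
  session.fileWatcher?.close()
  if (session.pollTimer) clearInterval(session.pollTimer)
  if (session.inactivityTimer) clearTimeout(session.inactivityTimer)
  // The resources the current dispose() misses:
  session.subagentsDirWatcher?.close()
  for (const sub of session.subagentWatchers.values()) {
    sub.watcher.close()
    if (sub.permissionTimer) clearTimeout(sub.permissionTimer)
  }
  if (session.permissionTimer) clearTimeout(session.permissionTimer)
  clearSessionState(sessionId)
  sessions.delete(sessionId)
}
```

Full-process dispose() then reduces to calling disposeSession for every key, so the two paths cannot diverge again.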

CR-6 · Hook events bypass eventBuffer + agentSnapshots and skip dedup

  • scripts/relay.ts:470-472:
    hookServer.onEvent((event) => {
      broadcast(JSON.stringify({ type: 'agent-event', event }))
    })
  • All other event paths route through broadcastEvent (relay.ts:148-168) which (a) increments sessionEventCount, (b) tracks observedModels, (c) pushes into eventBuffer for replay, (d) updates agentSnapshots for late-connecting clients
  • The hook handler skips all of this. Consequences:
    • Late-connecting SSE clients miss hook-originated events on replay
    • agentSnapshots doesn't capture hook-only spawns
    • sessionEventCount undercounts → telemetry wrong
    • Hook + transcript watcher both fire for the same session in the common case → agent_spawn/agent_complete/subagent_dispatch/subagent_return double-fire to all clients
  • The dedup logic that should be applied was deliberately written for the extension at extension/src/claude-runtime.ts:88-119 — it filters subagent lifecycle events when the watcher is already handling the session — but was not ported to the relay.
  • Fix: route hook events through broadcastEvent AND port the dedup filter (predicate is sessions.has(eventSessionId)).
  • Open: is hook delivery from the relay still useful (standalone-app users without the extension), or vestigial? If vestigial, simpler fix is to remove the entire hookServer block from relay.ts.
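If the hookServer block stays, the filter ported from claude-runtime.ts could look like the sketch below. The event shape and the exact lifecycle-event names are taken from this finding; treat them as assumptions against the real relay types.

```typescript
// Lifecycle events the transcript watcher already emits for sessions
// it tracks (per the double-fire list above).
const WATCHER_HANDLED = new Set([
  'agent_spawn',
  'agent_complete',
  'subagent_dispatch',
  'subagent_return',
])

// Predicate for the hook handler: drop lifecycle events for sessions the
// watcher owns (it will emit them itself); pass everything else through
// so hook-only sessions still work for standalone-app users.
function shouldBroadcastHookEvent(
  event: { type: string; sessionId: string },
  sessions: Map<string, unknown>,
): boolean {
  if (sessions.has(event.sessionId) && WATCHER_HANDLED.has(event.type)) {
    return false
  }
  return true
}
```

Events that pass the predicate should then go through broadcastEvent, not a raw broadcast(JSON.stringify(...)), so eventBuffer, agentSnapshots, sessionEventCount, and observedModels all stay consistent.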

Important

IR-18 · No SSE backpressure handling

  • scripts/relay.ts:73: res.write(...) return value ignored
  • Node's HTTP write() returns false when the kernel buffer is full (backpressure signal). A slow SSE client accumulates events in Node's write buffer unbounded; the only client-removal mechanism is the synchronous throw from res.write on a closed socket.
  • Fix: track return value; on false set a per-client "paused" flag and re-enable on drain. Pairs well with IR-19 if both adopt the existing agent-event-batch envelope (relay.ts:610 already uses it for replay; extending to live with a 16 ms flush window kills two birds).

IR-19 · Webview postMessage is per-event with no batching

  • extension/src/webview-provider.ts:110-121 (postMessage and sendEvent)
  • Called from extension/src/extension.ts:234, session-runtime.ts:77, claude-runtime.ts:103-117
  • Each call is structured-clone over an IPC pipe; in dev mode (Next iframe in webview) there's a triple hop (extension → outer webview → iframe at webview-provider.ts:206-227)
  • Stress scenario at 200 evt/s → 200-600 IPC crossings/s
  • Fix: 16 ms flush window or N-event batch using the same agent-event-batch envelope as IR-18.
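The batching side could be as small as the sketch below; the agent-event-batch envelope name comes from relay.ts:610, but the message shape handed to postMessage is an assumption.

```typescript
// Coalesces per-event postMessage calls into one IPC crossing per
// flush window (default 16 ms, one frame).
class EventBatcher<E> {
  private buffer: E[] = []
  private timer: ReturnType<typeof setTimeout> | undefined

  constructor(
    private post: (msg: { type: 'agent-event-batch'; events: E[] }) => void,
    private flushMs = 16,
  ) {}

  enqueue(event: E): void {
    this.buffer.push(event)
    if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.flushMs)
    }
  }

  flush(): void {
    if (this.timer) {
      clearTimeout(this.timer)
      this.timer = undefined
    }
    if (this.buffer.length === 0) return
    this.post({ type: 'agent-event-batch', events: this.buffer.splice(0) })
  }
}
```

At 200 evt/s this caps crossings at ~60/s regardless of event rate, and the same class works on both the relay SSE side (IR-18) and the extension postMessage side.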

IR-20 · Relay parses full burst inline without yields

  • scripts/relay.ts:329-331 (and the equivalent session-watcher.ts:453-455 belongs to WG-5)
  • for (const line of result.lines) runs JSON.parse + delegate.emit → broadcastEvent → JSON.stringify + res.write to every SSE client per line, all in one tick
  • Compaction event flushing 2 k lines stalls SSE delivery, hook server, and scan interval simultaneously
  • Fix: setImmediate yield every 100 lines, or move parsing to a Worker.
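The setImmediate variant is a few lines; handleLine below stands in for the JSON.parse + emit + broadcast work, and the 100-line chunk size is the figure from this finding, not a measured optimum.

```typescript
const YIELD_EVERY = 100

// Process a burst of JSONL lines, yielding to the event loop every
// YIELD_EVERY lines so SSE writes, the hook server, and the scan
// interval can run between chunks.
async function parseLines(
  lines: string[],
  handleLine: (line: string) => void,
): Promise<void> {
  for (let i = 0; i < lines.length; i++) {
    handleLine(lines[i])
    if ((i + 1) % YIELD_EVERY === 0) {
      // setImmediate drains pending I/O callbacks before the next chunk.
      await new Promise<void>((resolve) => setImmediate(resolve))
    }
  }
}
```

The caller at relay.ts:329-331 becomes async, which the scan interval must await (or skip-if-busy) so bursts do not interleave.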

IR-21 · eventBuffer and agentSnapshots never evict

  • scripts/relay.ts:81-94
  • No eventBuffer.delete or agentSnapshots.delete anywhere. Per-session ring is bounded (5000 events) but the maps themselves grow with distinct session IDs.
  • Long-running relay (days): ~5000 events × ~500 B = ~2.5 MB per session × 1000 sessions = ~2.5 GB.
  • Fix: evict on agent_complete + grace period (1 hour). Pairs with CR-5 in the same per-session-cleanup PR.
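A sketch of the eviction, split so the delete step is callable directly and the grace-period scheduling wraps it; the one-hour constant is from this finding, the function names are illustrative.

```typescript
const EVICT_GRACE_MS = 60 * 60 * 1000 // 1 hour grace period

// Drop the per-session replay ring and snapshot so the maps stop
// growing with distinct session IDs.
function evictSession(
  sessionId: string,
  eventBuffer: Map<string, unknown>,
  agentSnapshots: Map<string, unknown>,
): void {
  eventBuffer.delete(sessionId)
  agentSnapshots.delete(sessionId)
}

// Call from the agent_complete handler; keep the returned timer so it
// can be cancelled if the session resumes within the grace period.
function scheduleEviction(
  sessionId: string,
  eventBuffer: Map<string, unknown>,
  agentSnapshots: Map<string, unknown>,
  graceMs = EVICT_GRACE_MS,
): ReturnType<typeof setTimeout> {
  return setTimeout(
    () => evictSession(sessionId, eventBuffer, agentSnapshots),
    graceMs,
  )
}
```

With CR-5's disposeSession in place, scheduleEviction slots naturally into the same agent_complete path, which is why the two belong in one per-session-cleanup PR.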

Minor

MR-12 · app/src/static.ts:62 re-reads index.js / index.css per request

  • No in-memory cache, no If-None-Match, no compression
  • Cache the bytes at startup; emit ETag / Cache-Control: public, max-age=.... Note: this is in the app/ package, not strictly the relay — bundling here because it's the same standalone-app process.

Parallelism

Independent of WG-1 (#71), WG-2 (#72), WG-3 (#73), and WG-5 (#75). This workflow owns scripts/relay.ts end-to-end, including the relay-side caller of CR-4 (the JSONL tail contract) at relay.ts:326. WG-5 owns the rest of extension/src/ and does NOT touch scripts/relay.ts or extension/src/webview-provider.ts.

Test plan

  • CR-5: under concurrent sim, restart relay 5 times; check FD count via ls /proc/$(pgrep -f dev-relay)/fd | wc -l stays bounded
  • CR-6: connect SSE client mid-session via hook-only event path; verify event appears in replay AND that no duplicate spawn events fire when both hook + transcript handle the same session
  • IR-18: connect a deliberately slow SSE client (sleep on read); check Node memory does not grow unbounded; client should be paused and resumed cleanly
  • IR-19: at stress sim 200 evt/s with VS Code dev mode, check VS Code main thread CPU before/after — IPC crossings should drop
  • IR-21: long-running test (overnight); agentSnapshots.size and eventBuffer.size should plateau, not grow
