
Dev #19

Merged
im4codes merged 91 commits into master from dev on May 14, 2026

Conversation

@im4codes
Owner

No description provided.

IM.codes and others added 30 commits May 10, 2026 10:32
The visual canvas editor was previously gated behind the Participants tab
with no entry point for new users — they would see only the agent grid
and never reach the workflow canvas. Move the canvas, allowed-executables
allowlist, migration banner, future-schema banner, and capability banners
into a dedicated "Advanced Workflow" tab. Auto-bootstrap a starter draft
when a user first enters the tab so the canvas is reachable from a cold
panel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Users can now save multiple named workflows per session and pick which one
P2P invokes. The advanced tab gains a workflow library section with new /
duplicate / delete buttons, an active badge, and a workflow name input
above the canvas. Legacy single-draft configs auto-migrate into a
single-entry library on load, with the legacy `workflowDraft` field kept
in sync as a mirror so older clients mid-rollout continue to launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…scade

Production daemon on a self-hosted server (211, 2026-05-10) was hitting
OOM at the default 4 GB V8 heap every 1–9 hours: 4 ABRT crashes in 24h.
Each restart cost ~30 s of WS downtime, surfacing as the operator-
visible "always offline" symptom.

Diagnosis (in this order):

  1. journalctl confirmed `code=dumped, status=6/ABRT` cycles, not
     systemd lifecycle issues.
  2. /proc/PID/smaps showed 1.47 GB anon (V8) + 1.23 GB [heap] (glibc /
     onnxruntime + better-sqlite3). RSS 2.7 GB.
  3. SIGUSR2 (`--heapsnapshot-signal`) triggered V8 major GC and
     dropped RSS by 779 MB IN A SINGLE CYCLE. Heap snapshot showed
     only 218 MB of *live* objects.

  Conclusion: not a leak. V8's major GC is lazy by design — it lets the
  old generation accumulate garbage until heap pressure forces a sweep.
  With the default 4 GB ceiling, "force" came at ~3.5 GB live + ~3 GB
  pending garbage = OOM whenever a transient spike landed in that
  window. With the 12 GB ceiling we set on 211 as a runtime workaround,
  daemons survive but RSS bloats to many GB and major-GC pauses grow
  multi-second (also looks like "offline" to the UI).

Fix has two parts that ship paired:

  (a) `src/daemon/lifecycle.ts startGcPoller()` — calls `globalThis.gc()`
      every 5 min (tuneable via IMCODES_GC_POLL_MS). Logs only when GC
      freed >50 MB or took >200 ms so quiet daemons don't spam logs.
      Defensive: silent no-op if --expose-gc is not enabled.

  (b) `Environment="NODE_OPTIONS=--expose-gc --max-old-space-size=8192"`
      added to BOTH systemd unit templates (bind-flow.ts +
      setup-flow.ts) AND the macOS launchctl plist (bind-flow.ts).
      Without this, (a) is dead code.

Pinned with a contract test that scans both source-code anchors so a
future refactor can't silently break the pair.
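
A minimal sketch of what such a poller could look like, assuming the shape described above (only startGcPoller, the 5 min default, IMCODES_GC_POLL_MS, and the >50 MB / >200 ms log gates come from this commit; the rest is illustrative):

    // Hedged sketch — not the shipped src/daemon/lifecycle.ts implementation.
    type GcFn = () => void;

    export function startGcPoller(log: (msg: string) => void): NodeJS.Timeout | undefined {
      // Defensive: silent no-op when the process lacks --expose-gc.
      const gc = (globalThis as { gc?: GcFn }).gc;
      if (typeof gc !== 'function') return undefined;

      const intervalMs = Number(process.env.IMCODES_GC_POLL_MS) || 5 * 60_000;
      return setInterval(() => {
        const before = process.memoryUsage().heapUsed;
        const started = performance.now();
        gc();
        const freedMb = (before - process.memoryUsage().heapUsed) / (1024 * 1024);
        const tookMs = performance.now() - started;
        // Log only noteworthy collections so quiet daemons stay quiet.
        if (freedMb > 50 || tookMs > 200) {
          log(`gc poller: freed ${freedMb.toFixed(0)} MB in ${tookMs.toFixed(0)} ms`);
        }
      }, intervalMs);
    }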

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production observation (211, 2026-05-10): the server pushes
`daemon.upgrade` every time it sees a new dev tag on the npm registry.
With CI publishing every ~5 min during active dev work, four daemons
each restart for ~7 s on every tag, and the windows tile so the
operator perceives the fleet as "always offline" — even though each
individual upgrade is fast and correct.

Add a cooldown: handleDaemonUpgrade declines an AUTO upgrade
(no targetVersion specified, or `latest`) when a previous upgrade
completed within IMCODES_UPGRADE_COOLDOWN_MS (default 10 min). The
state is persisted to ~/.imcodes/last-upgrade-at, written by
upgrade.sh after a successful step 5 health check, so the cooldown
survives the very restart it is throttling.

Operator-pinned upgrades (`imcodes upgrade --version X`) bypass the
cooldown — explicit intent always wins. Same for missing /
unreadable / future-dated / NaN sentinel content. The cooldown can
be disabled entirely by setting IMCODES_UPGRADE_COOLDOWN_MS=0.

Logic extracted into `evaluateAutoUpgradeCooldown(input)` — pure
function, IO injected via `readSentinel`. 10/10 tests cover: missing
sentinel, garbage sentinel, in-window block + remainingMs report,
out-of-window pass, undefined / '' / 'latest' all treated as auto,
pinned target bypass, opt-out (cooldownMs<=0), NaN cooldown,
clock-skew future-dated sentinel, whitespace tolerance.
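
Assuming the inputs enumerated above, the pure function could look roughly like this (only evaluateAutoUpgradeCooldown and readSentinel are names from this commit; the field names and NaN-cooldown resolution are assumptions consistent with the listed cases):

    interface CooldownInput {
      targetVersion?: string;            // undefined / '' / 'latest' => auto upgrade
      cooldownMs: number;                // <= 0 (or NaN) disables the cooldown
      now: number;
      readSentinel: () => string | null; // injected IO (~/.imcodes/last-upgrade-at)
    }

    type CooldownVerdict = { allowed: true } | { allowed: false; remainingMs: number };

    export function evaluateAutoUpgradeCooldown(input: CooldownInput): CooldownVerdict {
      const { targetVersion, cooldownMs, now, readSentinel } = input;
      const isAuto = !targetVersion || targetVersion === 'latest';
      if (!isAuto) return { allowed: true };                 // pinned target bypass
      if (!(cooldownMs > 0)) return { allowed: true };       // opt-out + NaN cooldown

      const raw = readSentinel();
      if (raw === null) return { allowed: true };            // missing sentinel
      const lastAt = Number(raw.trim());                     // whitespace tolerance
      if (Number.isNaN(lastAt)) return { allowed: true };    // garbage sentinel
      if (lastAt > now) return { allowed: true };            // clock-skew / future-dated

      const elapsed = now - lastAt;
      if (elapsed >= cooldownMs) return { allowed: true };   // out-of-window pass
      return { allowed: false, remainingMs: cooldownMs - elapsed };
    }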

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The P2P quick-pick dropdown above the chat input now has a tab switcher
between the original combo presets list and the saved advanced workflow
library, so users can launch a saved workflow in one click without
opening Settings. The active tab persists globally across sessions and
reloads via a new userPref.

Also fixes a daemon-upgrade race: the gate previously only counted
'running' as in-progress for process agents, so a turn dispatched a few
hundred ms before a daemon.upgrade broadcast (still in 'queued' state)
would be silently killed by the upgrade restart. The gate now matches
the web client's isRunningSessionState and also blocks on 'queued'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…18n language

Three rounds of UX feedback addressed in one commit so the panel + canvas
+ orchestrator changes ship together:

PR-λ — Split Save into Save (keep open) + Save & Close so users can
persist mid-edit without dismissing the panel. Auto-align permissionScope
+ dispatchStyle when the user picks a new preset (fixes the
"implementation preset = invalid_workflow_graph" trap). Surface a
dispatchStyle dropdown so single_main vs multi_dispatch is editable. Show
the per-preset default prompt as the promptAppend placeholder. Widen the
desktop panel from 780 to 1400 px.

PR-μ — Workflow runs now auto-include a per-round summary for every
preset, matching the legacy combo system. New
P2P_PRESET_DEFAULT_SUMMARY_PROMPT covers all 10 workflow presets with
rich structured prompts; canvas inspector adds a per-node
summaryPromptOverride textarea. single_main rounds (implementation etc.)
that previously skipped the summary phase now also dispatch a summary
hop. Final-run synthesis prefers the round's resolved summary prompt
over the BUILT_IN_MODES fallback.

PR-ν — Replace the 79-char tail-of-prompt English language hint
("Use the user's selected i18n language ...") with a concise locale-native
one-liner sourced from p2p.discussion_language_instruction (e.g.
"请用中文回复。" / "日本語で回答してください。"). The line now sits right
after P2P_BASELINE_PROMPT in both the legacy combo and advanced workflow
prompt builders, and the daemon no longer pollutes user-supplied
extraPrompt with anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ate stale banner

PR-ξ — The allowed-executables UI was always visible after the bootstrap
auto-created an LLM-only draft, suggesting a configuration burden where
none existed. Now hide it entirely when the workflow has no script nodes
and no entries; when surfaced, default-collapse behind a disclosure with
a yellow "Required for script nodes" inline warning. Daemon enforcement
is unchanged — empty allowlist still rejects every script.

PR-ο — After PR-λ widened the panel to 1400 px, the canvas SVG's
width="100%" stretched to fill the parent, scaling every node ~80%
bigger. Cap the SVG at its native viewBox width so nodes render at the
authored 132×62 px size and the extra panel width becomes inspector
breathing room.

PR-π — Canvas now supports zoom via mouse wheel and Mac touchpad pinch
(both delivered as wheel events; pinch gets ctrlKey=true). Added
zoom-out / 100% reset / zoom-in toolbar buttons. Default node size also
shrunk ~21% so out-of-the-box density matches user expectation. Wheel
listener attached non-passively so preventDefault stops page-scroll.
Also rewrote the cryptic "Daemon workflow capability information is
stale." banner — it was hardcoded English in every locale despite
nominally being "translated". New text in all 7 locales explains what still works
(saved configs), what is paused (new advanced launches), and the
typical recovery window (<30s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itive

PR-ρ — Composer attachments now carry a per-composer sequential `seq`
(1, 2, 3, ...). The badge UI surfaces the seq as a `#N` prefix on each
attachment chip, and the send-payload `text` field prepends a
`#${seq}: ${name}` mapping line for every attachment so the LLM sees
both the short reference tag AND the filename in the prompt. The user
can then reference `#1` / `#2` naturally in subsequent text. Counter
resets on send because `clearComposer` wipes the attachments array.
Removing a middle attachment renumbers the survivors consecutively.
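
A sketch of the renumbering rule (the ComposerAttachment shape here is an assumption, not the shipped type):

    interface ComposerAttachment { seq: number; name: string }

    // Removing a middle attachment renumbers the survivors consecutively.
    function removeAttachment(list: ComposerAttachment[], seq: number): ComposerAttachment[] {
      return list.filter((a) => a.seq !== seq).map((a, i) => ({ ...a, seq: i + 1 }));
    }

    // Send-payload mapping lines, one `#${seq}: ${name}` per attachment.
    const mappingLines = (list: ComposerAttachment[]) =>
      list.map((a) => `#${a.seq}: ${a.name}`).join('\n');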

PR-σ canvas — PR-ο capped the SVG at `CANVAS_VIEW_WIDTH` to stop nodes
auto-scaling but the side-effect was a permanent empty gutter to the
right of the canvas at the new 1400 px panel width. Replace the cap
with a `ResizeObserver` that tracks the parent container width;
viewBox extents are derived from the measured width divided by the
zoom level so 1 viewBox unit = 1 screen pixel at zoom=1. Canvas now
fills the full panel width AND nodes stay at their authored 132×62 px.
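
Roughly, the measurement side could look like this Preact hook (a sketch assuming the SVG sits in a measured wrapper div; the names are illustrative):

    import { useEffect, useRef, useState } from 'preact/hooks';

    function useMeasuredWidth(fallback: number) {
      const ref = useRef<HTMLDivElement>(null);
      const [width, setWidth] = useState(fallback);
      useEffect(() => {
        if (!ref.current) return;
        const observer = new ResizeObserver((entries) => {
          setWidth(entries[0].contentRect.width);
        });
        observer.observe(ref.current);
        return () => observer.disconnect();
      }, []);
      return { ref, width };
    }

    // viewBox extents derive from measured width / zoom so that
    // 1 viewBox unit === 1 screen pixel at zoom = 1:
    //   <svg width="100%" viewBox={`0 0 ${width / zoom} ${height / zoom}`}>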

PR-σ bridge — The `capability_stale` banner kept firing as a
false-positive even though the daemon was healthy. Root cause: the
daemon only sends `daemon.hello` on (a) WS connect/reconnect and (b)
capability change, and the server bridge never replayed cached state
to newly-connected browsers. Browsers that opened AFTER the daemon's
most recent hello never received one, so the 30 s `capability_stale`
TTL fired even though the daemon was fine. Fix: bridge
`handleBrowserConnection` now replays the cached
`daemonP2pWorkflowCapabilities` to every newly-connected browser as
part of the opening-state push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…; single_main never)

PR-μ over-generalised the summary contract by gating `synthesisStyle`
on the presence of an effective summary prompt. The correct rule is to
gate it purely on `executionMode`:

- `multi_dispatch` (N parallel workers, each writing to an isolated
  copy of the discussion file) → ALWAYS run an initiator-led synthesis
  hop afterward. Workers cannot see each other within the round; the
  synthesis hop is the ONLY place their outputs converge into one
  authoritative paragraph. Falls back to a generic prompt when no
  override / preset prompt is supplied (closes the previously-broken
  `custom` preset case where SUMMARY_PROMPTS had no entry and the
  round silently lost its synthesis hop).
- `single_main` (1 worker = the initiator itself) → NEVER run a
  synthesis hop. The worker's own output IS the round's authoritative
  segment; asking the same agent to summarise itself is wasteful +
  confusing. The resolved `summaryPrompt` is left populated so the
  FINAL-RUN synthesis (PR-μ chain) can still pick it up when this
  happens to be the last round.

The canvas inspector also hides the per-node summary-prompt textarea
when the node's effective `dispatchStyle` is `single_main` — it was
dead config there (the executor's single_main branch never dispatches
a synthesis hop) and showing it gave users a false signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six P0 fixes for the OOM regression introduced after a368875 (advanced
P2P workflow), and the related slow reconnect after server restart. Per
the round 1-3 audit decision in .imc/discussions/94b9b837-822.md, these
fixes form an unbreakable group: A3 alone is a dead fix because N1
prevents runs from reaching a terminal status, so the cleanup never runs.
Reconnect fixes ship together because reconnect storms are documented in
server-link.ts:111-125 as a secondary RSS pressure source.

A1 (discussion-orchestrator.ts) — discussions Map was append-only;
schedule a 60 s deferred delete on done/failed/stopDiscussion to match
the P2P activeRuns cadence.

A2 (p2p-orchestrator.ts + shared/p2p-workflow-constants.ts) — cap
P2pRun.routingHistory at P2P_ROUTING_HISTORY_RETENTION_COUNT = 500 via a
new pushRoutingHistory helper that mirrors helperDiagnostics' FIFO trim.
Long-running advanced workflows that loop through compiled-edge jumps
were growing routingHistory without bound.
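
The trim helper is presumably a one-liner in spirit (sketch; only the helper name and the 500 constant come from this commit):

    export const P2P_ROUTING_HISTORY_RETENTION_COUNT = 500;

    // FIFO trim mirroring helperDiagnostics: keep only the newest N entries.
    export function pushRoutingHistory<T>(history: T[], entry: T): void {
      history.push(entry);
      if (history.length > P2P_ROUTING_HISTORY_RETENTION_COUNT) {
        history.splice(0, history.length - P2P_ROUTING_HISTORY_RETENTION_COUNT);
      }
    }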

A3 (p2p-orchestrator.ts) — failRun / timed_out paths called
scheduleP2pRunTerminalCleanup but never deleted the P2pRun from
activeRuns; only completed and cancelled paths did. Move the
activeRuns.delete into scheduleP2pRunTerminalCleanup's 60 s timer so
every terminal status (completed/failed/timed_out/cancelled) hits a
single cleanup path, and remove the now-redundant explicit setTimeouts
on the success path.

A4 (p2p-orchestrator.ts) — the writer-queue onWriteFailure /
onSegmentDropped closures captured the full P2pRun, so even after the
60 s activeRuns delete the queue's callback still pinned the run object
in the heap. Stage primitives (runId, contextFilePath, attempt,
initiatorSession) before enqueue and look the run up via getP2pRun(runId)
inside the closure; the queue now retains only strings, and stale runs
swallow gracefully.

N1 (p2p-orchestrator.ts) — runP2pScriptNode was called without an
AbortSignal even though the runner already supports one. A script with
argv ['/bin/sleep','9999'] and no script.timeoutMs would block
executeAdvancedChain forever; ensureRunDeadline never fires because the
loop never advances; the run stays running, A3 cleanup never schedules.
Add an AbortController per dispatch, register an aborter into a
module-level currentScriptAborters map so cancelP2pRun can reach in,
schedule a setTimeout-based abort tied to run.deadlineAt (default 30 min
via shared/p2p-advanced.ts:DEFAULT_ADVANCED_RUN_TIMEOUT_MINUTES), and
clean up the timer + map entry in finally. Runner already escalates
SIGTERM -> 5 s -> SIGKILL via process group.
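
The abort wiring might look like this (hedged sketch; runP2pScriptNode's real signature takes more than a signal, and the dispatch helper name is hypothetical):

    declare function runP2pScriptNode(opts: { signal: AbortSignal }): Promise<void>;

    // Module-level so cancelP2pRun can reach in and abort a live script.
    const currentScriptAborters = new Map<string, AbortController>();

    async function dispatchScriptNode(runId: string, deadlineAt: number): Promise<void> {
      const aborter = new AbortController();
      currentScriptAborters.set(runId, aborter);
      // Fires even when script.timeoutMs is unset, so ['/bin/sleep','9999']
      // can no longer wedge executeAdvancedChain past run.deadlineAt.
      const timer = setTimeout(() => aborter.abort(), Math.max(0, deadlineAt - Date.now()));
      try {
        await runP2pScriptNode({ signal: aborter.signal });
      } finally {
        clearTimeout(timer);                  // clean up timer + map entry
        currentScriptAborters.delete(runId);
      }
    }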

A6 (server-link.ts) — three-pack reconnect fix. INITIAL_BACKOFF_MS
1_000 -> 500, MAX_BACKOFF_MS 60_000 -> 5_000 (server's IP rate limit is
5 attempts per 10 s, so 500 ms initial / 5 s ceiling stays inside
budget). Add an 8 s connect-timeout watchdog per attempt so a hung TCP
SYN cannot wedge the daemon for 75-127 s on Linux/macOS. Apply +/-20%
jitter to scheduleReconnect so multiple daemons behind one NAT don't
trip the rate limiter together.
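
The jittered schedule is small enough to sketch in full (constants from this commit; the helper itself is illustrative):

    const INITIAL_BACKOFF_MS = 500;
    const MAX_BACKOFF_MS = 5_000;

    // Exponential backoff capped at 5 s, with +/-20% jitter so multiple
    // daemons behind one NAT don't trip the rate limiter in lockstep.
    function nextReconnectDelay(attempt: number): number {
      const base = Math.min(INITIAL_BACKOFF_MS * 2 ** attempt, MAX_BACKOFF_MS);
      const jitter = 0.8 + Math.random() * 0.4;
      return Math.round(base * jitter);
    }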

Verification:
- npx tsc --noEmit (daemon), npx tsc -p server/tsconfig.json --noEmit,
  cd web && npx tsc --noEmit — all clean
- npm run test:unit — 315 files / 3356 tests pass
- npm run test:server — 40 files / 505 tests pass
- p2p-workflow-regression spec #59 regex relaxed to accept the A4
  primitive-closure variant (run.contextFilePath | logicContextFilePath
  | contextFilePath); intent of the assertion preserved

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…transport restore

restoreTransportSessions was rebuilding the qwen preset config
(env / settings / availableModels / preferred model) but dropping the
preset's contextWindow and systemPrompt onto the floor. After a daemon
restart, ccPreset='MiniMax' sessions came back with the runtime catalog
context window (and the qwen CLI's built-in identity), not the preset's
declared one — causing usage-pane numbers and the "I am MiniMax" runtime
facts to drift on every reconnect.

Persist presetContextWindow into the upserted record alongside the
preset's other rebuilt fields, and tighten the preferred-model selection
so an explicit user-requested model is preserved unless the preset's
catalog explicitly forbids it. Update the qwen-transport-flow e2e to
assert the new restored fields (presetContextWindow=200000, qwenModel,
qwenAuthType, qwenAvailableModels, systemPrompt containing the runtime
facts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le (P0)

Five P0 fixes for the user-visible breakage in screenshot
c2642dfd955e6f525be4558408f1afb6.png (logic/script nodes both error out,
the "DAEMON 失联" ("daemon lost") banner never clears) plus
91a60a20400daec367041cf480c585cb.png (P2P review content displays
"(加载失败)", i.e. "load failed"). Audit trail in
.imc/discussions/e940d73f-a8e.md (3 rounds, 3 reviewers).

A1 (web/src/components/AdvancedWorkflowCanvasEditor.tsx) — nodeKind
onChange now calls a new `alignNodeForKind` helper that forces
preset='custom' (logic+script), permissionScope='analysis_only'
(logic), and dispatchStyle='single_main'. The forward direction (preset
onChange aligning scope/dispatch) shipped with R3 v2 PR-λ but the
reverse direction was missed, so picking nodeKind=logic on a default
llm+discuss+analysis_only node produced the cryptic
`invalid_workflow_graph (nodes[N])` error in the user screenshot. For
script nodes we deliberately do NOT auto-fill argv[0] — the executable
is a security boundary; let the validator surface a precise required-
field diagnostic instead.

N3 (web/src/components/AdvancedWorkflowCanvasEditor.tsx) — legacy
saved drafts that pre-date A1 still load with logic/script nodes in
violating combinations (the user's screenshot is one such draft).
A new pure helper `normalizeP2pWorkflowDraftForEditing` walks the
incoming draft and returns the repair list; the editor surfaces a
banner with Apply / Dismiss buttons. Per Cx1 R2-Cx1-1's design
constraint, normalize is NEVER triggered as a render side-effect; the
user must explicitly Apply before onChange fires. This preserves the
contract that loading old data does not silently rewrite it.

N5 (shared/p2p-workflow-validators.ts) — refine the
`validateNodeCombination()` diagnostic fieldPath so the inspector can
highlight the exact dropdown that's wrong. logic+non-custom-preset
points at `nodes[N].preset`; logic+non-analysis_only-scope points at
`nodes[N].permissionScope`; openspec_propose missing artifact points
at `nodes[N].artifacts`. Multiple simultaneous violations on a logic
node now produce two distinct diagnostics instead of a single
opaque `nodes[N]` entry.

N4 (web/src/ws-client.ts) — capability snapshot freshness now keys on
a new `daemonLastSeenAt` clock that is bumped only by daemon-originated
messages (whitelist: DAEMON_HELLO, RUN_UPDATE, daemon.stats,
timeline.event, transport.* deltas/status/tools, etc.).
Server-synthesized messages (`pong`, `session.event`) are explicitly
excluded so the UI cannot show "fresh" while the daemon is actually
down. Without this, healthy long-lived browser pages tripped the 30 s
TTL on the one-time `daemon.hello.observedAt` and showed
"DAEMON 失联" forever — the user's second screenshot symptom.

M7 (web/src/app.tsx + src/daemon/command-handler.ts) — DiscussionsPage
now passes `requestScope` derived from the active session, and the
daemon's read_discussion / list_discussions handlers fall back to
(a) active P2P run's contextFilePath → reverse-derive projectDir, and
(b) cross-project file sweep, before returning
`missing_or_invalid_scope`. Multi-project daemons used to fail every
read because the UI didn't pass scope; this surfaced as the
"(加载失败)" body in the second screenshot.

Verification:
- npx tsc --noEmit (daemon), npx tsc -p server/tsconfig.json --noEmit,
  cd web && npx tsc --noEmit — all clean
- npm run test:unit — 316 daemon files / 3365 tests pass
- npm run test:web — 107 files / 1324 tests pass
- npm run test:server — 40 files / 505 tests pass
- 3 new regression test files (15 fresh assertions): nodeKind onChange
  alignment, normalize banner, validator fieldPath specificity,
  daemonLastSeenAt whitelist
- 7 i18n locale files synchronized for the new normalize_banner /
  apply / dismiss keys (MANDATORY per CLAUDE.md)

Per the discussion's final plan (round 3 hop 2 §3, Cu1's grep
evidence), N7 (server bridge `receivedAt` semantic shift) is
intentionally OMITTED — `getDaemonP2pWorkflowCapabilities()` has no
production caller, so N4 UI-only is sufficient to fix the user's
"daemon stale" symptom without touching server-side semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (PR-φ follow-up)

User reported in screenshot 7c2570e96eeca1a9eefa3a92d3c7212e.png that the
"DAEMON 失联" banner still fires on a healthy long-lived browser page,
even after the N4 daemonLastSeenAt whitelist landed in c5dc1b1.

Root cause: my N4 fix updated `WsClient.isDaemonCapabilityStale()` and
the private `daemonLastSeenAt` clock, but `P2pConfigPanel.tsx:612-613`
computed staleness inline from `capabilitySnapshot.observedAt`:

    const capabilityStale = !capabilitySnapshot
      || (Date.now() - capabilitySnapshot.observedAt) > P2P_CAPABILITY_FRESHNESS_TTL_MS;

`observedAt` is set ONLY when `daemon.hello` arrives (WS connect or
capability change). On long-lived browser pages it never refreshed, so
the panel tripped the 30 s TTL and the banner stuck — exactly the
screenshot symptom. The N4 fix on the WS client was correct in its own
right but the panel never consumed it.

Fix is structural — single source of truth for "is the daemon stale":

  - `P2pConfigPanelCapabilitySource` now exposes optional
    `isStale(now?: number): boolean`. The panel's `capabilityStale`
    computation prefers `daemonCapabilitySource.isStale()` and falls
    back to the legacy `observedAt` check only when the source omits
    the method (preserves existing test fixtures that pass plain
    object sources).

  - `SessionControls.tsx` source object now wires
    `isStale: (now) => ws.isDaemonCapabilityStale(now)` so the panel
    and the WS client share one definition of staleness.

  - The panel's freshness re-evaluation switches from a single
    setTimeout pinned to `observedAt` (which never re-armed once
    snapshot stayed constant) to a steady setInterval at TTL/2.
    Worst-case lag between daemon going silent and banner appearing
    is now bounded by `TTL + TTL/2`.
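
In sketch form, the panel-side computation becomes (the TTL value and fallback semantics are from this commit; the helper itself is illustrative):

    const P2P_CAPABILITY_FRESHNESS_TTL_MS = 30_000; // 30 s TTL per the commit

    interface P2pConfigPanelCapabilitySource {
      capabilitySnapshot?: { observedAt: number };
      isStale?(now?: number): boolean; // wired to ws.isDaemonCapabilityStale
    }

    function computeCapabilityStale(
      source: P2pConfigPanelCapabilitySource,
      now = Date.now(),
    ): boolean {
      // Prefer the WS client's single source of truth when wired.
      if (typeof source.isStale === 'function') return source.isStale(now);
      // Legacy fallback for plain-object fixtures without the method.
      const snap = source.capabilitySnapshot;
      return !snap || now - snap.observedAt > P2P_CAPABILITY_FRESHNESS_TTL_MS;
    }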

Tests:

  - `web/test/components/P2pConfigPanel-stale-banner.test.tsx` (4
    tests): panel hides banner when isStale()=false even with ancient
    observedAt; panel shows banner when isStale()=true even with fresh
    observedAt; legacy fixture without isStale falls back to
    observedAt-based check correctly in both directions.

  - `web/test/components/P2pConfigPanel-stale-banner-e2e.test.tsx` (3
    tests): full WsClient ↔ panel chain integration. (a) healthy
    long-lived daemon with periodic daemon.stats keeps banner hidden
    across 90+ s (3× TTL); (b) silent daemon flips to stale past TTL;
    (c) server-only pong stream does NOT keep banner hidden — the key
    reverse assertion that prevents future regressions where someone
    "fixes" staleness by bumping on every WS message.

  - SessionControls test mock updated to include
    `isDaemonCapabilityStale: vi.fn(() => false)` so the panel's new
    `source.isStale()` call doesn't crash unrelated tests.

Verification:
  - npx tsc --noEmit (daemon), npx tsc -p server/tsconfig.json --noEmit,
    cd web && npx tsc --noEmit — all clean
  - npm run test:unit — 316 daemon files / 3365 tests pass
  - npm run test:server — 40 files / 505 tests pass
  - npm run test:web — 109 files / 1331 tests pass (added 7 fresh
    assertions across 2 new test files)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ct storm

Production logs on 116.62.239.78 showed a single daemon authenticating
~5 times per 10 seconds, with the daemon side reporting
`code:4001 reason:auth_required` every cycle. The user-visible symptom
was "server restart → daemon reconnect 极慢" and the persistent
"DAEMON 失联" banner that survived all earlier fixes (A6 reconnect
tuning, N4 daemonLastSeenAt whitelist, PR-φ panel.isStale routing).

Root cause: a race in `WsBridge.handleDaemonConnection`'s async message
handler.

The daemon sends two messages back-to-back on every WS open
(`server-link.ts:201-202`):

    ws.send(JSON.stringify({ type: 'auth', ... }));
    this.sendDaemonHello();   // sends `daemon.hello`

Both messages reach the server before the auth message handler's
`await db.queryOne(...)` settles. While the auth flow is parked at the
DB await, `this.authenticated` is still `false`. The `daemon.hello`
handler runs concurrently, sees `msg.type !== 'auth'`, and trips the
gate at line 894-896:

    if (!this.authenticated) {
      if (msg.type !== 'auth' || ...) {
        ws.close(4001, 'auth_required');
        return;
      }
      ...
    }

The auth handler then completes, flips `authenticated` to true, and
logs "Daemon authenticated" — but the WebSocket is already gone. The
daemon sees the 4001 close, reconnects (fast, thanks to A6's
500 ms initial / 5 s cap), races again, and so on. None of the earlier
fixes could break the cycle because they all live downstream of this
race.

Fix: capture the auth flow in `this.authPromise` and await it from
every subsequent message handler before evaluating
`this.authenticated`. Concurrent `daemon.hello` (or any other
post-auth message) now waits for the DB lookup to settle, then sees
the correct `authenticated === true` and proceeds normally.

Implementation details:

  - New private field `WsBridge.authPromise: Promise<void> | null`.
  - The auth message handler creates the promise, runs the DB lookup,
    and `resolveAuth()`s it on both success and failure paths. The
    promise ALWAYS resolves (never rejects) — failure is signaled via
    `ws.close()` + `this.daemonWs = null`, which awaiting handlers
    detect with their `daemonWs !== ws` bail-out check. Resolving (vs
    rejecting) avoids unhandled-rejection warnings when no concurrent
    handler is currently awaiting.
  - Awaiting handlers also re-check `this.daemonWs === ws` AFTER the
    await; if a different connection has replaced this one (or the
    socket closed during the await), they bail.
  - `authPromise` is reset to `null` on (a) new connection (so the
    next reconnect doesn't await a stale promise from a different
    `ws`), (b) `ws.on('close')`, and (c) `kickDaemon()`.
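
Condensed, the gate reads roughly like this (a sketch with simplified message and close handling; the real WsBridge does much more per message):

    import type WebSocket from 'ws';

    class WsBridgeAuthSketch {
      private authenticated = false;
      private authPromise: Promise<void> | null = null;
      private daemonWs: WebSocket | null = null;

      private async handleMessage(ws: WebSocket, msg: { type: string }): Promise<void> {
        if (msg.type === 'auth') {
          let resolveAuth!: () => void;
          this.authPromise = new Promise<void>((r) => { resolveAuth = r; });
          try {
            this.authenticated = await this.lookupToken(msg); // the DB await the race parked on
            if (!this.authenticated) { ws.close(4001, 'auth_required'); this.daemonWs = null; }
          } finally {
            resolveAuth(); // always resolve — failure is signaled via ws.close()
          }
          return;
        }
        if (this.authPromise) await this.authPromise;  // park until auth settles
        if (this.daemonWs !== ws) return;              // replaced/closed during the await
        if (!this.authenticated) { ws.close(4001, 'auth_required'); return; }
        // ... dispatch msg normally
      }

      private async lookupToken(_msg: unknown): Promise<boolean> { return true; } // db.queryOne stand-in
    }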

Regression test: `server/test/bridge.test.ts` adds
"does NOT 4001-close when auth and daemon.hello arrive back-to-back
during DB lookup". The test uses a deferred DB query so both messages
can land before auth resolves, then asserts:

  1. The socket is NOT closed during the in-flight auth window.
  2. Once the DB query resolves with a valid token, auth completes and
     the socket stays open.

Without this fix, the test fails immediately with `ws.closed === true`
and `closeCode === 4001`. With it, both assertions pass.

Verification:
  - npx tsc -p server/tsconfig.json --noEmit clean
  - npm run test:server — 506 tests pass (up from 505 with the new
    regression case)
  - Production daemon log on the dev box (118 KB before fix) showed
    18+ auth flap cycles per minute. After deploy, expected: a single
    `Daemon authenticated` per real reconnect (1× per server restart
    + 1× per network blip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks the PR-υ auth race fix (`42db4000`) against a real `ws` server +
client stack — the previous bridge.test.ts coverage was a mocked
EventEmitter, which can't reproduce the message-ordering semantics that
caused the production reconnect-storm on 78.

Three scenarios:

  1. **Single back-to-back handshake** (latency=0): daemon sends
     `auth + daemon.hello` synchronously after WS open; assert no
     4001-close and bridge.isAuthenticated flips to true within the
     observe window.

  2. **50ms-DB-latency window**: deferred DB query sleep guarantees
     BOTH messages reach the server before auth's `await db.queryOne`
     resolves. This is the exact production race window. Without the
     fix, hits `ws.close(4001, 'auth_required')` 100% of the time.

  3. **Burst of 10 back-to-back reconnect cycles** (latency=20):
     simulates the production reconnect cascade after a server
     restart. Asserts every single cycle authenticates cleanly with
     no 4001-close. Counting failures (rather than asserting a
     boolean) gives a clearer diagnostic when a flake creeps in.

Test rig:

  - Spins up an in-process `http.Server` + `WebSocketServer` with
    `noServer: true`, mirroring `server/src/index.ts`'s upgrade
    handler.
  - Each test/cycle uses a fresh `serverId` extracted from the URL.
    Reason: `WsBridge.maybeCleanup` deletes from the shared
    `WsBridge.instances` map by serverId, NOT by instance pointer; a
    stale-bridge close handler firing AFTER the next test's
    connection has registered evicts the new bridge from the map. In
    production each serverId hosts exactly one bridge so the path is
    harmless, but rapid-cycling the same id in tests exposes the
    eviction. Per-test serverIds sidestep it.
  - Polls `bridge.isAuthenticated` AFTER the observe window but
    BEFORE the test closes the socket — the bridge's ws.on('close')
    resets the flag, so checking after the local close would always
    observe false. Capture-during-window is the correct contract.

Verification: `npm run test:server` — 41 files / 509 tests pass
(up from 506 with the new 3-test file).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the synthetic auth+daemon.hello scenarios in
`server/test/bridge-auth-race-e2e.test.ts` with a true end-to-end test
that wires the production daemon `ServerLink`
(`src/daemon/server-link.ts`) against the production server `WsBridge`
(`server/src/ws/bridge.ts`) over a real `ws` server. The reason: a
synthesized handshake only covers the messages the test author thought
to send. If a future change adds a new "do X immediately after open
before auth" step on either side, the synthetic test continues to pass
while production breaks. Driving the real `ServerLink` makes the test
follow the daemon's real wire protocol.

Two scenarios:

  1. **Cold start** (50 ms DB latency — the worst-case race window):
     create a `ServerLink`, await auth, then sleep 1 s and assert
     EXACTLY ONE accepted WS connection + EXACTLY ONE successful auth.
     Pre-fix produces ≥2 connections within 1 s because the daemon
     reconnects immediately after every 4001 close.

  2. **Server restart** (20 ms DB latency, simulates
     `docker compose restart server`): connect, auth, then `wss.close`
     + `httpServer.close` (terminating live clients), wait 200 ms,
     `listen` on the same port. Assert the ServerLink reconnects
     cleanly with EXACTLY ONE post-restart auth and ≤2 reconnect
     attempts (the +1 allowance covers an ECONNREFUSED race when the
     port is still TIME_WAIT-free for a moment).

Test rig:

  - In-process `http.Server` + `WebSocketServer({ noServer: true })`
    matching `server/src/index.ts:setupWebSocketUpgrade`. The real
    upgrade path (URL parse → `WsBridge.get(serverId)` →
    `handleDaemonConnection`) runs unmodified.
  - `handleDaemonConnection` is invoked with an `onAuthenticated`
    callback so the rig can count successful auths without intercepting
    the bridge's logger.
  - Server restart uses `wss.clients.forEach(c => c.terminate())` +
    `httpServer.closeAllConnections()` to immediately drop in-flight
    sockets — without this, `wss.close` blocks waiting for clients and
    the test hangs ~30 s. This mirrors the ECONNRESET behaviour
    `docker compose stop` actually produces.
  - Each test uses a fresh `serverId` so stale-bridge cleanup
    (which deletes from the shared `WsBridge.instances` map by
    serverId) cannot evict the current bridge entry.

Verification:
  - `npx tsc --noEmit` clean (daemon + server)
  - `npm run test:server` — 41 files / 509 tests (unchanged; this test
    runs in the e2e workspace, not server)
  - `npx vitest run --project e2e test/e2e/daemon-server-real-handshake.test.ts`
    — 2 tests pass in ~6 s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CLAUDE.md "FORBIDDEN — Never `git add` these directories: openspec/
and docs/ are local-only planning/documentation directories. NEVER
stage, commit, or push any file under openspec/ or docs/ to git."

`.gitignore` already lists `openspec/` (line 56) and `docs/` (line 65),
but 12 openspec files + 1 docs/plan file were committed to git BEFORE
those gitignore rules existed. gitignore does not retroactively untrack
anything, so they continued to be tracked — visible in git status when
edited or deleted locally.

This commit removes them from the index via `git rm -r --cached`.
Local copies on disk are preserved for files the user still has;
already-deleted ones (the `daemon-file-preview-worker/` set the user
manually deleted on mobile) are simply unstaged.

After this commit, `openspec/` and `docs/` are fully out of version
control and stay ignored on every future change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User report: the P2P progress banner appeared in EVERY active session's
sub-session bar, even when the discussion had nothing to do with the
session the user was currently viewing. Cross-session noise.

Root cause: `app.tsx:3733` filtered the bar's discussions list only by
status (`d.state !== 'done'`) — there was no session-identity gate.
Every running P2P run rendered its banner everywhere.

The mapping in `p2p-run-mapping.ts` was also dropping the run's
`main_session` / `initiator_session` / hop participant identities
during `mapP2pRunToDiscussion`, so even if the bar wanted to filter,
it had nothing to filter on.

Two-part fix:

  1. **Preserve session identity in the mapping**
     (`web/src/p2p-run-mapping.ts`). Add `mainSession`,
     `initiatorSession`, and `participantSessions[]` (de-duplicated
     set of initiator + main + current target + every
     `hop_states[].session` + every `all_targets[].session`).
     Empty/missing aggregates degrade to `undefined` so legacy
     server payloads still round-trip cleanly.

  2. **Filter at the bar render** (`web/src/app.tsx:3765`). The
     SubSessionBar's discussions prop is now scoped:
       - `mainSession === activeRootSession` covers the common case
         (user viewing the session that launched the discussion or
         any of its sub-sessions, since `activeRootSession`
         resolves sub→parent).
       - `participantSessions.includes(activeSession || activeRootSession)`
         covers the cross-root case (user navigates into a sub-session
         that's a hop in another root's discussion).
       - Discussions with no scope info (legacy mid-rollout entries)
         fall through and show unscoped — preserves the previous
         behaviour for those rather than hiding them.
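
The predicate in part 2 might be sketched as (field names from this commit; the function itself is illustrative):

    interface ScopedDiscussion {
      state: string;
      mainSession?: string;
      participantSessions?: string[];
    }

    function isVisibleInBar(
      d: ScopedDiscussion,
      activeSession: string,
      activeRootSession: string,
    ): boolean {
      if (d.state === 'done') return false;
      // Legacy mid-rollout entries with no scope info fall through unscoped.
      if (!d.mainSession && !d.participantSessions) return true;
      if (d.mainSession === activeRootSession) return true; // common case
      const p = d.participantSessions ?? [];
      return p.includes(activeSession) || p.includes(activeRootSession); // cross-root hop
    }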

Also expand the local discussions state shape in `app.tsx` to declare
the three new fields so TypeScript pins the contract.

Tests:

  - 3 new cases in `web/test/p2p-run-mapping.test.ts`:
    - Advanced run with full session payload preserves all three fields
      and de-duplicates `participantSessions`.
    - Legacy run without session fields → all three undefined (caller
      treats as "show unscoped" via the legacy fallback).
    - Pre-dispatch run uses `all_targets` when `hop_states` is absent.

Verification:
  - `cd web && npx tsc --noEmit` clean
  - `npm run test:web` — 109 files / 1336 tests pass (added 3)

Note: this only changes the bar's filter — `DiscussionsPage`
(`liveDiscussions={discussions}` at app.tsx:4003) intentionally still
receives the full unfiltered set so the global "View all discussions"
panel keeps showing every run as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ns button

Follow-up to 6977512 (P2P bar scoping). After scoping the bar to the
active session, users lost visibility of P2P runs happening in OTHER
sessions — the bar correctly hid them from the current view but
nothing told the user they exist. Easy to forget about background
runs and switch sessions thinking nothing's going on.

Add a numeric badge to the 📋 View Discussions button that ALWAYS
shows the daemon-wide running discussion count, regardless of which
session the user is currently viewing. Click-through opens the
DiscussionsPage which already shows the unfiltered global list.

Implementation:
  - New `totalRunningDiscussions?: number` prop on SubSessionBar
    (defaults to 0 so existing callers don't break).
  - Absolute-positioned span on the button, blue circle with bold
    white digit. `99+` for runaway counts. Hidden when 0.
  - Tooltip switches to "{count} running discussions — view all"
    when count > 0, falling back to the original "P2P discussions"
    label otherwise.
  - `aria-label` provides screen-reader friendly count.
  - `data-running-discussions` attribute for tooling/test inspection.

Wiring:
  - `app.tsx:3779` passes `discussions.filter((d) => d.state !== 'done').length`
    — the UNFILTERED count, NOT the scoped subset that goes to
    `discussions={...}` above. The two numbers can legitimately
    differ:
      - `totalRunningDiscussions = 3` (daemon-wide)
      - shown banners = 1 (only one P2P run involves the active
        session) → user sees "3" badge AND knows 2 are elsewhere.

i18n — 7 locales (en/zh-CN/zh-TW/es/ru/ja/ko) get:
  - `subsessionBar.p2p_discussions_with_running` (+ `_one`/`_other`)
  - `subsessionBar.p2p_running_count_aria` (+ `_one`/`_other`)

Tests — 4 new cases in SubSessionBar.test.tsx:
  - badge hidden when count is 0
  - badge renders with the count when count >= 1
  - count caps at "99+" for runaway daemons
  - data-running-discussions attribute reflects the count

Verification:
  - `cd web && npx tsc --noEmit` clean
  - `npm run test:web` — 109 files / 1340 tests pass (added 4)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…matching file

User report (screenshot 71e2d014d9cf975f): on the discussions page, the
live P2P progress bar at the top and the discussion file list below
were two unrelated UIs. Clicking the live bar did nothing; users had
to manually find the matching entry in the list by id and click it.

Root cause: the P2pProgressCard rendered in the discussions page's
"live progress strip" was missing an `onClick` handler
(DiscussionsPage.tsx:305-311). The card already supports `onClick`
(used by SubSessionBar in app.tsx) but DiscussionsPage never wired
one. Additionally `P2pProgressDiscussion.fileId` wasn't declared on
the interface even though the run-mapping function populates it.

Fix:

  - Add `fileId?: string` to `P2pProgressDiscussion` so callers can
    rely on it without a `(d as any).fileId` cast. The mapping in
    `p2p-run-mapping.ts` already sets it from the run's
    `discussion_id` field; this just makes the type honest.

  - In DiscussionsPage's live-strip render, pass
    `onClick={d.fileId ? () => selectDiscussion(d.fileId!) : undefined}`.
    `selectDiscussion(fileId)` is the same function the file list
    uses, so:
      1. it sends `p2p.read_discussion` with the fileId,
      2. the daemon returns the file content,
      3. the right-pane (or full-screen on mobile) shows the
         discussion,
      4. the matching list entry gets the `active` class — visual
         link between the bar at top and the highlighted entry below.

  - Runs without a fileId (failed-bind / supervision-internal /
    pre-dispatch) get no onClick — a click is simply a no-op rather
    than crashing on `undefined.fileId`.

Tests — 2 new cases in `web/test/pages/DiscussionsPage.test.tsx`:

  - clicking a live progress card with `fileId` sends
    `p2p.read_discussion` for that fileId AND highlights the
    matching list entry as `.active`.
  - clicking a live progress card WITHOUT fileId is a no-op (no new
    `ws.send` calls).

The P2pProgressCard mock in the test file is upgraded from `() => null`
to a clickable button forwarding the `onClick` prop, so the click
contract can actually be exercised. Production rendering (SVG layout)
is unchanged.

Verification:
  - `cd web && npx tsc --noEmit` clean
  - `npm run test:web` — 109 files / 1342 tests pass (added 2)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…a tight loop

Production symptom (mobile screenshot aab0338a3d2bb6f5708f0ea5f): the
discussions page hung on "加载中…" ("loading…") forever — no live progress bar and
no file list ever appeared. Server logs revealed the cause:

    "p2p per-socket pending cap exceeded — dropped"
    type: "p2p.list_discussions"
    requestId: ...

The bridge enforces a per-socket cap on outstanding p2p workflow
requests. The web page was dispatching `p2p.list_discussions` faster
than the daemon could respond, so the bridge dropped them; with no
response ever returning, `loading` stayed true and nothing rendered.

Two compounding causes:

  1. **Inline `requestScope` literal in `app.tsx:4017`** — every
     parent render of `App` produced a fresh
     `{ sessionName, projectDir }` object with new identity. That made
     `DiscussionsPage`'s `useCallback(loadList, [requestScope])`
     re-identify, fired its mount-time `useEffect([loadList])`, and
     dispatched another list request — once per parent render.

  2. **`RUN_UPDATE` handler called `loadList()` synchronously**
     (DiscussionsPage.tsx:235). When many P2P runs update in quick
     succession (canvas projection at ~5 Hz × N runs), this fired
     several list requests per second on its own.

Three-layer fix:

  A. **`app.tsx`** — wrap the request scope object in `useMemo`
     keyed on `[activeSession, activeSessionInfo?.projectDir]`. Stable
     identity across parent renders.

  B. **`DiscussionsPage.tsx`** — defense-in-depth: even if a future
     refactor reverts the parent's `useMemo` (or a test/caller
     passes an inline literal), normalise `requestScope` internally
     via `useMemo(() => requestScope, [JSON.stringify(...)])` so the
     downstream `loadList` callback's dependency only changes when
     the SCOPE CONTENT changes, not its identity.

  C. **`DiscussionsPage.tsx`** — debounce the `RUN_UPDATE`-driven
     refresh. Bursts of run updates now coalesce into a single
     `loadList()` call after a 250 ms quiet window. Cleanup timer
     is cleared on unmount.
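
Layers A and C condense to roughly this hook shape (a sketch assuming Preact hooks; only the useMemo keying and the 250 ms window come from this commit, the hook name is hypothetical):

    import { useCallback, useEffect, useMemo, useRef } from 'preact/hooks';

    function useDiscussionListRefresh(
      activeSession: string,
      projectDir: string | undefined,
      loadList: (scope: { sessionName: string; projectDir?: string }) => void,
    ) {
      // A: identity stable across parent renders — keyed on content, not a literal.
      const requestScope = useMemo(
        () => ({ sessionName: activeSession, projectDir }),
        [activeSession, projectDir],
      );

      // C: coalesce RUN_UPDATE bursts into one loadList per 250 ms quiet window.
      const timer = useRef<ReturnType<typeof setTimeout> | undefined>(undefined);
      const scheduleRefresh = useCallback(() => {
        clearTimeout(timer.current);
        timer.current = setTimeout(() => loadList(requestScope), 250);
      }, [loadList, requestScope]);
      useEffect(() => () => clearTimeout(timer.current), []); // cleared on unmount

      return { requestScope, scheduleRefresh };
    }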

Tests — 2 new cases in `web/test/pages/DiscussionsPage.test.tsx`:

  - 5 parent rerenders with new-identity-but-content-equal
    `requestScope` literals → at most 2 `p2p.list_discussions`
    dispatches (covers initial mount + one tolerated retry). Pre-fix
    produced 6.

  - 10 rapid `RUN_UPDATE` messages → at most 1 coalesced
    `p2p.list_discussions` dispatch after the debounce window.
    Pre-fix produced 10.

Verification:
  - `cd web && npx tsc --noEmit` clean (the two pre-existing errors
    in `shared/session-group-clone.ts` and
    `web/src/components/CloneSessionGroupDialog.tsx` are from a
    concurrent in-progress feature branch in the working tree, NOT
    from this commit).
  - `npm run test:web` — 109 files / 1344 tests pass (added 2 net new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (screenshot a8495587-...): a logic node could end up with
preset='implementation_audit' + scope='analysis_only' +
dispatch='single_main', producing a cryptic
`invalid_workflow_graph (nodes[N].preset)` diagnostic with no obvious
recovery path from the UI.

Root cause: the canvas editor's preset / permissionScope /
dispatchStyle dropdowns exposed the FULL constant array regardless of
the current nodeKind. The A1 fix (`alignNodeForKind`) covered the
forward path — switching nodeKind to logic auto-set preset=custom —
but a subsequent click on the preset dropdown could re-pick an
LLM-only preset while nodeKind stayed `logic`, putting the node back
into an invalid state with no banner trigger (since the user wasn't
loading legacy data).

Fix: filter each dropdown's option set against the validator's
`validateNodeCombination` rules so the user simply cannot click their
way into a rejected combination. Single-option dropdowns (e.g., logic
preset locked to `custom`) are rendered disabled to make the
constraint explicit; if the current value isn't in the legal subset
(legacy draft), it's preserved as a transient extra option so the
select still reflects what's stored and the normalize banner is the
unambiguous path forward.

Also adds the previously-missing `script.argv` textarea inline for
script nodes: one argv entry per line, first line is the executable.
Blank-line stripping keeps the argv array tight; clearing the
textarea drops `node.script` entirely so the validator surfaces a
clean required-field error instead of an opaque empty-array hit.

Tests: 24 new regression tests in
`AdvancedWorkflowCanvasEditor-dropdown-restrictions.test.tsx` cover
every nodeKind/preset combination + the argv edit/clear paths. All
46 canvas-editor tests + 1381 web tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom (screenshot 7f112b6e...): a script node with no
`script.argv` shows the diagnostic
`A workflow script contract is invalid. (nodes[1].script)`,
and there was no inspector UI to recover from it. Commit
f4e539b added the `script.argv` textarea; this commit closes
the OTHER half — switching nodeKind FROM script back to llm
(or to logic) left the `script` field dangling, and the
validator's `validateNodeDraft` then emits
`invalid_script_contract` on the llm node because it doesn't
allow a `script` field on non-script kinds.

`alignNodeForKind` could not express field deletion through its
`Partial<P2pWorkflowNodeDraft>` return shape, so the cleanup
now lives at the editor's nodeKind onChange call site: after
merging the aligned partial, we explicitly drop `script` when
the next kind is not `script`, and drop `logic` when the next
kind is not `logic`. Same shape used for both kind-specific
fields keeps the rule discoverable.

Regression tests added:
- `script node with no script.argv surfaces the
  nodes[N].script diagnostic` — reproduces the screenshot
  state (script node at index 1) and asserts the fieldPath
  appears in the inline Diagnostics list.
- `after filling argv via the textarea, the nodes[N].script
  diagnostic clears` — full recovery flow: starts broken,
  fills argv via the new textarea, asserts the diagnostic is
  gone AND the validator accepts the draft.
- `switching nodeKind from script to llm drops the lingering
  script field` — pins the new cleanup behaviour.
- `shows "Required for script nodes" warning when workflow
  has a script node but allowedExecutables is empty` — pins
  the P2pConfigPanel badge the user also saw at the bottom of
  the screenshot.
- `does NOT show "Required for script nodes" warning for
  LLM-only workflows` — symmetric guarantee.

All 1386 web tests pass (4 net new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IM.codes and others added 29 commits May 12, 2026 08:12
…y, capability exports, and detail oracle

Fixes from OpenSpec audit against tasks.md:

1. server-link DATA_PLANE_SEND_QUEUE_CAP=256 with overflow telemetry
   - task 4.5: bounded queue capacity + observable overflow
   - Previously unbounded queue; now logs warning when depth exceeds cap

2. daemon hello includes TIMELINE_PROTOCOL_CAPABILITY in base capabilities
   - task 1.6: timeline protocol capability via daemon hello
   - Updated p2p-workflow-runtime.test.ts expectations to include
     timeline.protocol.v1 alongside existing P2P capabilities

3. detail store eventId/fieldPath mismatch returns MISSING (not UNAUTHORIZED)
   - task 2.5 / spec D6: non-enumerating error to avoid detailId oracle
   - Updated command-handler-transport-queue.test.ts assertion to
     expect MISSING instead of UNAUTHORIZED for field mismatch

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Copying assistant messages used to flatten paragraphs and list structure
because `Element.textContent` joins descendant text nodes with no
separator, and `Selection.toString()` is browser-defined at block
boundaries (Safari often collapses). Both copy paths now route through a
new `domNodeToPlainText` DOM walker that emits explicit newlines for
block elements, expands `<br>`, preserves `<pre>` content verbatim, and
prefixes list items / blockquotes — so what the user sees is what they
paste.
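
A walker in that spirit (a sketch; the shipped tag coverage and list/quote prefixes may differ):

    const BLOCK_TAGS = new Set(['P', 'DIV', 'LI', 'UL', 'OL', 'BLOCKQUOTE',
      'H1', 'H2', 'H3', 'H4', 'TABLE', 'TR']);

    export function domNodeToPlainText(node: Node): string {
      if (node.nodeType === Node.TEXT_NODE) return node.textContent ?? '';
      if (!(node instanceof Element)) return '';
      if (node.tagName === 'BR') return '\n';
      if (node.tagName === 'PRE') return node.textContent ?? ''; // verbatim
      const inner = Array.from(node.childNodes).map(domNodeToPlainText).join('');
      const prefixed = node.tagName === 'LI' ? `- ${inner}`
        : node.tagName === 'BLOCKQUOTE' ? `> ${inner}`
        : inner;
      // Explicit newline at block boundaries so structure survives the paste.
      return BLOCK_TAGS.has(node.tagName) ? `${prefixed}\n` : prefixed;
    }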

On touch devices the chat view disables `user-select` so long-press can
fire the custom Copy/Quote menu, which makes native selection of a
specific portion of a message impossible. Double-tapping a chat bubble
now opens a ZoomedTextDialog with selection re-enabled and a Copy-all
button, giving users a place to drag the iOS/Android handles and pick
out exactly the substring they want.

Also extracted shared `copyToClipboard()` so ZoomedTextDialog and the
existing CodeBlock copy button share one implementation of the
non-secure-context fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mobile double-tap-to-zoom detector was pairing taps by HTMLElement
identity, which fails the common case where a streaming assistant block
re-renders between the two taps and Preact replaces the underlying DOM
node — `===` returns false even though the logical bubble is unchanged
and the user feels nothing happens.

Pair by `data-event-id` string instead. AssistantBlock now threads its
merged-block key onto the bubble (user messages already carry their
event id), so the comparator is stable across re-renders. Also widen the
double-tap window to 450ms (450ms reads as "forgiving" on a phone where
fingers are slower than mouse buttons), bump tap-vs-scroll tolerance to
15px, and add `touch-action: manipulation` on chat bubbles so iOS hands
the second touchend to JS without the 300ms double-tap-zoom probe.

Adds two regression tests covering full `.chat-event` extraction for the
right-click context-menu path, which surfaces the same format-preserving
text used by mobile zoom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The touchend-based double-tap pairing didn't fire on real iOS Safari
even after switching to event-id matching: the synthetic touchend after
a short tap is racy on touch devices (subject to scroll resolution and
the system's tap-vs-scroll decision), so the second tap was sometimes
missed entirely.

Move pairing to the synthetic `click` event, which iOS fires reliably
on every short tap once viewport `user-scalable=no` removes the 300 ms
zoom probe. Long-press still suppresses the click via the existing
`cancelEvent` preventDefault on touchend, so the menu and zoom paths
remain mutually exclusive. touchend now only clears the long-press
timer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`'ontouchstart' in window` was returning true on every device that has
any touch capability, which meant Surface-class laptops in mouse mode
landed on the mobile-gesture path (long-press menu, double-tap zoom) and
desktop selection felt broken. Worse, the predicate didn't help diagnose
why double-tap "isn't working" on Android — both iOS and Android Chrome
report touch support, but the chat gestures really need the phone-class
layout, which is more about a narrow viewport + coarse pointer
than about touch capability alone.

Switch to `matchMedia('(pointer: coarse), (max-width: 768px)')` and
react to viewport changes via the same media-query listener. CSS picks
up the matching predicate so the two stay in sync — a narrow desktop
window now also disables native text selection and gets the mobile
gesture set, while a 1080p touchscreen with a mouse falls through to
the desktop path. Threshold bumped to 500ms for extra forgiveness on
slow phones; the click-event detector from the previous commit still
fires for both iOS and Android.
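
The predicate itself is tiny (sketch; the CSS-class wiring below is an assumption):

    // Phone-class layout: coarse pointer OR narrow viewport, kept in sync
    // with the matching CSS media query.
    const phoneClassQuery = matchMedia('(pointer: coarse), (max-width: 768px)');

    export const isPhoneClassLayout = (): boolean => phoneClassQuery.matches;

    phoneClassQuery.addEventListener('change', (e) => {
      document.body.classList.toggle('mobile-gestures', e.matches);
    });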

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gression in 42dfabe

Commit 42dfabe introduced two silent-drop paths on the daemon → server →
browser timeline link that together produced the user-reported symptoms:
"Message updates break off and only resume after a manual page refresh.
The typewriter effect is gone too."

1. Daemon ServerLink:
   - `scheduleDataPlaneFlush` shifted the queue head and then called
     `trySend()` without checking the return value. When the WS link was
     not OPEN (short Wi-Fi handoff, reconnect window) trySend returned
     false and the message was lost. Switched to peek-then-shift: leave
     the item on the queue until trySend confirms it landed, otherwise
     halt the drain and wait for reconnect (see the sketch after this
     list).
   - Added `flushDataPlaneAfterReconnect()` and wired it into the WS
     `open` handler so a queued backlog resumes draining without needing
     a fresh enqueue to kick the scheduler.
   - Bumped DEFAULT_DATA_PLANE_SEND_QUEUE_HARD_CAP 512 → 100_000 and
     DEFAULT_DATA_PLANE_SEND_STALE_MS 30s → 24h. With peek-then-shift the
     stale GC is now purely a memory-protection upper bound, not a
     primary correctness mechanism.

2. `timelineStore.readPreferred` (also from 42dfabe) now throws
   `TimelinePreferredReadError` when the SQLite projection is
   unavailable instead of returning []. Three callers had no per-call
   catch:
   - `lifecycle.ts:599-604` startup backfill loop — one bad session
     could abort the rest. Now per-session try/catch with JSONL fallback.
   - `subsession-manager.ts:474` `readSubSessionResponse` — projection
     blip would reject the RPC. Now falls back to JSONL.
   - `opencode-watcher.ts:115` — outer poll catch swallowed the throw at
     debug level. Now logs warn + falls back to JSONL.

3. Server bridge `timelineDataPlaneErrorResponse` emitted error frames
   without a `recoverable` flag, so the web `useTimeline` hook treated
   any errorReason as terminal via `hasExplicitTimelineOutcome`. Wired
   in `isRecoverableTimelineRequestErrorReason()` so transient reasons
   (queue_full, deadline_exceeded, timeout, unavailable) come back with
   `recoverable: true`. Also bumped the bridge cap 128 → 4096 and the
   job deadline 15s → 60s to match the daemon-side ceilings — the strict
   defaults were tripping on weak links well before any real problem.

4. Web `shouldRetryTimelineHistoryResponse` retries when either:
   - the server sent `recoverable: true`, OR
   - the server omitted `recoverable` AND `errorReason` is in the shared
     allow-list (`isRecoverableTimelineRequestErrorReason`).
   When the server explicitly sets `recoverable: false` we respect that
   positive "don't retry" signal — the allow-list only kicks in when the
   server didn't decide.

5. New shared allow-list `RECOVERABLE_TIMELINE_REQUEST_ERROR_REASONS` +
   `isRecoverableTimelineRequestErrorReason()` in
   `shared/timeline-history-errors.ts` so daemon, bridge, and web all
   agree on which reasons should auto-retry. No string literals for
   recoverable reasons are written outside the shared module.
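
The peek-then-shift drain from item 1 condenses to (sketch; the queue and trySend shapes are assumptions):

    // Leave the head on the queue until trySend confirms it landed.
    function drainDataPlaneQueue(queue: string[], trySend: (frame: string) => boolean): void {
      while (queue.length > 0) {
        const head = queue[0];       // peek — do NOT shift yet
        if (!trySend(head)) return;  // link not OPEN: halt, wait for reconnect
        queue.shift();               // confirmed sent — safe to drop
      }
    }

    // flushDataPlaneAfterReconnect(): wired into the WS `open` handler so a
    // queued backlog resumes draining without a fresh enqueue to kick it.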

Test updates:
- `server/test/bridge.test.ts` pins `deadlineMs: 15_000` on the one
  test that simulates wall-clock deadline expiry; production default
  is now 60s so a hardcoded `now = 16_000` would no longer trip it.

Verified: daemon unit (3641 pass), server (561 pass), web (1530 pass),
all three typechecks clean, web vite build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@im4codes im4codes merged commit 5218117 into master May 14, 2026
40 checks passed