Portable Engine Runtime #46

Draft
yourbuddyconner wants to merge 68 commits into main from portable-runtime-v1-spec

@yourbuddyconner
Owner

Summary

  • Adds the portable runtime engine V1 specification.
  • Documents the required V1 contracts across engine API, data model, decision gates, providers, sandbox RPC, channel transports, API routes, client events, schema, adapter hosting, and observability.
  • Standardizes the spec on DecisionGate terminology for approvals, questions, and credential requests.

Verification

  • rg -n "InteractivePrompt|interactive_prompt|interactive_prompts|interactive prompt" docs/specs/2026-05-02-portable-runtime-engine-design.md
  • rg -n "TODO|TBD|FIXME" docs/specs/2026-05-02-portable-runtime-engine-design.md

Adds packages/engine — a portable agent runtime library that implements
the V1 design from docs/specs/2026-05-02-portable-runtime-engine-design.md.

Built on @mariozechner/pi-ai + @mariozechner/pi-agent-core. The engine
itself has zero platform dependencies; platform adapters (Cloudflare,
K8s) host it.

What's implemented:
- Engine public API (createSession, prompt, resolveDecision, abort,
  pause/resume) + Session/Thread classes.
- Per-thread queue with three modes: followup (FIFO), steer (abort +
  start), collect (buffered window).
- Decision gates with full lifecycle: pending -> resolved/withdrawn/
  expired. Steer withdraws pending gates with reason=steer; abort with
  reason=abort. DecisionGateEntry persists in the DAG.
- Multi-thread sessions: threads run concurrently with isolated
  histories; aborting one doesn't affect siblings.
- Built-in thread_read tool for cross-thread visibility.
- In-memory providers + VirtualSandbox so the full engine runs in
  vitest with no containers.

14 tests (happy path, decision gates, queue modes, multi-thread +
thread_read), all green in <2s.

Spec-vs-reality deltas reconciled by the implementation are documented
in packages/engine/README.md (tool signature wrapping, no native
suspension primitive, message_start vs message_update).

Deferred for follow-up: restart-safe re-entrant gates (the
SuspendedTurnState record is written but Engine.restoreSession is a
stub), compaction, role/skill loading, model failover, ActionSource
bridge, structured-result extraction.
Spec updates (docs/specs/2026-05-02-portable-runtime-engine-design.md):
- Engine.restoreSession now takes { sessionId, options } so the host
  re-supplies tools/sandbox/model — the engine doesn't persist creation
  options across restarts.
- Decision-gate IDs are explicitly derived as
  gate:{sessionId}:{threadId}:{queueItemId}:{resumeKey}, and resumeKey
  is required (not optional) on DecisionGateRequest.
- Restart-safe contract section now explains what "re-entrant up to the
  decision point" means in practice (work before requestDecision runs
  twice; work after runs once on replay), how the engine populates
  ctx.suspendedDecision on restoration, and what events the engine must
  emit during replay.
- New "LLM-faithful entry persistence" rule: assistant tool-call blocks
  must persist in MessageEntry.parts so the rehydrated transcript can
  be sent to LLM providers without producing a malformed
  [user, assistant(text-only), toolResult] sequence.
- ToolContext.requestDecision typed as DecisionGateRequest (not
  DecisionGate); ctx.suspendedDecision documented as engine-only.
- SuspendedTurnState bullets updated to reference the deterministic
  formula and the resumeKey explicitly.
- Adapter Host Contract calls out engine.restoreSession({ sessionId,
  options }) and the queue-item / suspended-turn fields that must
  survive hibernation.
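
A minimal sketch of the deterministic gate ID described above. Only the
gate:{sessionId}:{threadId}:{queueItemId}:{resumeKey} format comes from
the spec; the GateKey type and both function names are hypothetical and
are not the engine's actual API.

```typescript
// Hypothetical shape: the four fields the spec says feed the gate ID.
interface GateKey {
  sessionId: string;
  threadId: string;
  queueItemId: string;
  resumeKey: string; // required (not optional) on DecisionGateRequest
}

// Format taken from the spec update; the function name is an assumption.
function deriveGateId(k: GateKey): string {
  return `gate:${k.sessionId}:${k.threadId}:${k.queueItemId}:${k.resumeKey}`;
}

// Assumed replay check: a re-entered tool call maps to the suspended gate
// only when the recomputed ID equals the persisted one.
function matchesSuspendedGate(
  suspendedGateId: string | undefined,
  k: GateKey,
): boolean {
  return suspendedGateId !== undefined && suspendedGateId === deriveGateId(k);
}

const key: GateKey = {
  sessionId: "s1",
  threadId: "t1",
  queueItemId: "q7",
  resumeKey: "approve-deploy",
};
console.log(deriveGateId(key));
```

Because the ID is a pure function of its inputs, a replayed turn that
reuses the original queueItemId recomputes the same ID, which is what
lets the replay path find the already-open gate instead of creating one.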

Plan (docs/plans/2026-05-05-persistent-store-restart-safe-gates.md):
- 16 tasks across 4 phases (schema, store + contract tests, restart-safe
  primitives, restoreSession + replay), all aligned with the updated
  spec.
- Task 11 reworked from a flaky integration test that raced the agent
  loop into a deterministic unit test of a pure shouldShortCircuit
  predicate.
- Task 12's rehydrateTranscript explicitly reconstructs assistant
  ToolCall blocks from MessageEntry.parts.
Task 15 surfaced three real bugs in the prototype:
- Session.rehydrate's resumeBlockedThreadIfReady call was fire-and-forget,
  racing with resolveDecision callers; awaiting it ensures the gate is
  re-armed before any caller can resolve it.
- During replay, Thread.replayBlocked needs to mirror the original
  queueItemId so the deterministic gate ID matches and the
  short-circuit fires; without this, the tool tries to open a new gate.
- Gate-status persistence (pending -> resolved) lived in the
  requestDecision continuation; the short-circuit path bypassed it.
  Moved to Thread.resolveDecision so both live and replay paths persist
  the resolved status.
bin/repl.ts wires the engine to a real Anthropic model via pi-ai's
getModel('anthropic', ...), with InMemorySessionStore +
InMemoryEventBus + VirtualSandbox. Supports single-shot
(`pnpm repl 'say hi'`) and interactive (`pnpm repl`) modes. Streams
text deltas, tool calls, decision gates, and turn boundaries to stdout.

Defaults to claude-haiku-4-5; override with VALET_MODEL or
VALET_SYSTEM_PROMPT. Reads ANTHROPIC_API_KEY from env via pi-ai's
provider auto-resolution.
LocalSandbox wraps node:fs/promises and node:child_process.spawn to
implement the Sandbox interface against the real host filesystem and
shell. Relative paths resolve against the configured workspace; absolute
paths are honored as-is (no escape prevention — this is a dev/testing
sandbox, security goes into the Docker provider).
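
The path rule above can be sketched as follows. This is an illustration
only: resolveSandboxPath is a hypothetical name, and node:path's posix
variant is used here (an assumption) so the output does not depend on
the host OS.

```typescript
import { posix as path } from "node:path";

// Hypothetical helper mirroring the rule described above:
// relative paths resolve against the workspace, absolute paths pass through.
function resolveSandboxPath(workspace: string, target: string): string {
  if (path.isAbsolute(target)) {
    // Honored as-is: nothing stops this from pointing outside the workspace.
    return path.normalize(target);
  }
  return path.resolve(workspace, target);
}

console.log(resolveSandboxPath("/tmp/ws", "src/a.ts")); // inside the workspace
console.log(resolveSandboxPath("/tmp/ws", "/etc/hosts")); // absolute, untouched
console.log(resolveSandboxPath("/tmp/ws", "../other")); // relative, but escapes
```

The last call shows why this is a dev/testing sandbox only: a relative
path containing ".." resolves outside the workspace as well.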

ExecOpts honored:
- cwd (relative paths resolved against workspace, default = workspace)
- env (merged over process.env)
- timeout (SIGKILL on expiry, timedOut: true on result)
- signal (AbortSignal cancellation)
- stdin (piped to the child)
- maxOutputBytes (truncated: true on result)

19 tests cover FS round-trips, exec lifecycle (timeout, abort, truncation,
stdin, env, cwd), and provider behavior. Total suite: 60 tests, all green.

REPL gains a VALET_SANDBOX=local|virtual switch (default virtual) and
VALET_WORKSPACE for the local workspace path. Smoke-tested against a tmp
scratch dir AND the valet repo itself — engine read its own README and
listed packages/ via real shell.
…tion

The bridge does NOT register one engine-visible tool per plugin action.
That approach would (a) collide with Anthropic's tool-name regex
(`^[a-zA-Z0-9_-]{1,128}$` — action ids like 'github.create_issue' are
rejected), (b) blow past LLM tool-catalog budgets when many plugins are
active, and (c) force every session to pay the prompt cost of every
action even when only a few are relevant.
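
To make point (a) concrete, here is a small check using the regex quoted
above. The helper name isValidToolName is hypothetical; only the pattern
itself comes from the text.

```typescript
// Pattern quoted in the commit message above.
const TOOL_NAME_RE = /^[a-zA-Z0-9_-]{1,128}$/;

// Hypothetical helper: true when a name would pass the provider's check.
function isValidToolName(name: string): boolean {
  return TOOL_NAME_RE.test(name);
}

for (const name of ["github.create_issue", "list_tools", "call_tool"]) {
  console.log(name, isValidToolName(name));
}
```

The dotted action id fails because "." is not in the allowed character
class, while the two bridge tool names pass.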

Instead, actionBridgeTools({ sources }) returns exactly two ToolDefs:

- list_tools({ service?, query?, limit? }): searchable catalog with
  per-action params + risk levels + per-service auth warnings.
- call_tool({ tool_id, params, summary }): dispatches by action id
  (kept untouched, dots and all). Approval gates honor riskLevel via
  ctx.requestDecision; user denial short-circuits without invoking
  the action.

Same pattern OpenCode uses in the existing valet runtime, so plugin
ActionSource shapes port across with no changes.

Spec updated ("Plugin Action Bridge" section): documents the why
(provider regex, catalog budget, prompt cost), the new ActionBridgeOptions
shape, and the list_tools/call_tool semantics.

Engine.thread now also surfaces assistant errorMessage as an 'error'
event and translates stopReason 'error' into a 'turn_end: error' rather
than masking it as 'end_turn' — found while debugging the dogfood pass.

Dogfood: REPL with GITHUB_TOKEN=$(gh auth token) successfully searches
the catalog with list_tools, calls github.get_repository via call_tool,
and reports description/default branch/star count from the live API
response. 9 bridge unit tests + 60 existing tests all green (69 total).
Replaces the hand-wavy compaction section with a concrete two-technique
design informed by OpenCode's compaction module.

- Two triggers: proactive (token threshold post-turn) and reactive
  (overflow error retry via pi-ai's isContextOverflow).
- Tail preservation: keep last N turns clamped to a token budget, with
  mid-turn split when a single turn exceeds the budget.
- Pruning: walk newest-first, mark stale tool outputs as elided after
  pruneProtectTokens of recent output. No LLM call. Skips protected
  tools. Only commits if it'd save >= pruneMinimumTokens.
- Compaction: summarize head into a structured markdown template
  (Goal/Constraints/Progress/Key Decisions/Next Steps/Critical Context/
  Relevant Files). Iterative — subsequent compactions update the
  previous summary rather than write a fresh one.
- LLM-context assembly: convertToLlm drops covered entries and injects
  the summary as a user message; elided tool outputs are replaced with
  a placeholder. Same path used during restoreSession.
- Auto-continue after proactive compaction; reactive compactions
  retry the originating turn instead.
- Concrete configuration table with defaults.
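
The pruning pass above could look roughly like this. It is a sketch
under assumptions: the ToolOutput shape, the function name, and the
exact boundary handling (an output is protected if it starts inside the
protected budget) are illustrative and not taken from the spec.

```typescript
// Hypothetical, simplified view of a persisted tool output.
interface ToolOutput {
  id: string;
  tool: string;
  tokens: number;
  elided?: boolean;
}

interface PruneConfig {
  pruneProtectTokens: number; // recent output that is never elided
  pruneMinimumTokens: number; // minimum savings required to commit
  protectedTools: string[];
}

// Walk newest-first; no LLM call involved.
function planPruneSketch(
  outputs: ToolOutput[], // ordered oldest-first
  cfg: PruneConfig,
): { elideIds: string[]; savedTokens: number } {
  let seen = 0;
  let saved = 0;
  const elideIds: string[] = [];
  for (const o of [...outputs].reverse()) {
    if (o.elided) continue; // already pruned earlier
    const insideProtectedWindow = seen < cfg.pruneProtectTokens;
    seen += o.tokens;
    if (insideProtectedWindow) continue;
    if (cfg.protectedTools.includes(o.tool)) continue;
    elideIds.push(o.id);
    saved += o.tokens;
  }
  // Only commit when the savings clear the threshold.
  if (saved < cfg.pruneMinimumTokens) return { elideIds: [], savedTokens: 0 };
  return { elideIds, savedTokens: saved };
}

const sampleOutputs: ToolOutput[] = [
  { id: "a", tool: "bash", tokens: 500 },
  { id: "b", tool: "read", tokens: 400 },
  { id: "c", tool: "skill", tokens: 300 },
  { id: "d", tool: "bash", tokens: 200 },
  { id: "e", tool: "bash", tokens: 100 },
];
const sampleCfg: PruneConfig = {
  pruneProtectTokens: 250,
  pruneMinimumTokens: 600,
  protectedTools: ["skill"],
};
console.log(planPruneSketch(sampleOutputs, sampleCfg));
```

In the sample, "e" and "d" fall inside the protected window, "c" is
skipped because its tool is protected, and "b" plus "a" are elided for
900 tokens saved.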
Implements the spec'd compaction design (informed by OpenCode):

- src/compaction.ts: pure primitives — usableTokens, tailBudget, turns,
  selectCutPoint (with mid-turn split), planPrune/applyPrune,
  extractFileContext (read vs modified), summarize (one-shot
  completeSimple with the OpenCode-style structured-markdown template),
  iterative anchoring via previousSummary.

- src/types.ts: CompactionConfig (enabled, reserveTokens, tailTurns,
  min/maxPreserveRecentTokens, pruneProtectTokens, pruneMinimumTokens,
  toolOutputMaxChars, summarizerModel, protectedTools), wired into
  CreateSessionOptions. ToolDef.protectedFromPruning. MessagePart
  tool_call.elided flag.

- src/thread.ts:
  - lastAssistantUsage capture in turn_end handler.
  - Thread.compactThread orchestrator: prune (cheap, no LLM), select
    cut point, summarize head into a CompactionEntry, persist, rewrite
    agent.state.messages.
  - Proactive trigger in runItem: post-turn check shouldCompactProactive
    (lastAssistantUsage.total >= usable).
  - Reactive trigger in runAgent: catch isContextOverflow on assistant
    error, compact, retry the same prompt once.
  - rehydrateTranscript now delegates to entriesToAgentMessages, an
    exported pure function that drops covered entries and injects the
    summary as a <previous-context> user message.

- 21 pure compaction tests + 2 integration tests against the faux
  provider. Total suite: 92 tests, all green.

Known limitation: applyPrune mutates entries in memory but the current
SessionStore APIs (appendEntries-only) don't expose an in-place row
update, so pruning persists only to the live agent transcript for now.
Proper persistence requires adding an updateEntry method to
SessionStore — left as a follow-up since it doesn't block compaction
correctness, just observability of pruned state across restarts.
Required so pruning during compaction can persist tool-result elision
back to the DAG. Also clarifies that pruning's persistence is atomic
per entry: updateEntry rewrites the entire MessageEntry row with the
same id, including the mutated tool_call parts. Throws NotFoundError
if no matching entry exists in (sessionId, threadId).
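
A minimal in-memory sketch of the updateEntry contract described above.
The Entry shape and class name are hypothetical, and the methods are
synchronous here for brevity; the real SessionStore interface may well
be async and carry more fields.

```typescript
class NotFoundError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NotFoundError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical, trimmed-down entry shape.
interface Part {
  type: string;
  elided?: boolean;
}
interface Entry {
  id: string;
  sessionId: string;
  threadId: string;
  parts: Part[];
}

class InMemoryStoreSketch {
  private entries: Entry[] = [];

  appendEntries(entries: Entry[]): void {
    this.entries.push(...entries);
  }

  // Rewrites the whole row with the same id, parts included.
  updateEntry(entry: Entry): void {
    const i = this.entries.findIndex(
      (e) =>
        e.sessionId === entry.sessionId &&
        e.threadId === entry.threadId &&
        e.id === entry.id,
    );
    if (i === -1) {
      throw new NotFoundError(
        `no entry ${entry.id} in (${entry.sessionId}, ${entry.threadId})`,
      );
    }
    this.entries[i] = entry;
  }

  readEntries(sessionId: string, threadId: string): Entry[] {
    return this.entries.filter(
      (e) => e.sessionId === sessionId && e.threadId === threadId,
    );
  }
}
```

Replacing the row wholesale keeps the write atomic per entry: a reader
sees either the old parts or the new ones, never a partial mix.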
Two follow-ups to the compaction landing:

1. SessionStore.updateEntry: rewrite an entry in place by id. Throws
   NotFoundError (new errors.ts module) when (sessionId, threadId, id)
   doesn't match. Implemented on InMemorySessionStore (Array.findIndex
   + replace) and SqliteSessionStore (UPDATE with changes=0 check).
   Two contract tests added; they run against both backends.

   Thread.compactThread's prune branch now calls store.updateEntry for
   each elided entry instead of dropping the persistence on the floor.
   Verified by an integration test that pre-populates the DAG with
   bash-output-heavy turns, triggers compaction, and confirms a-1's
   tool_call.elided is set in the DAG after the compaction completes.

2. Auto-continue: after a proactive compaction (cfg.autoContinue !==
   false), inject a synthetic user message —
     'Continue if you have next steps, or stop and ask for clarification...'
   — tagged with metadata.compaction_continue=true so client UIs can
   hide it. Pushed onto the thread's queue, picked up by the next
   tickQueue cycle. Reactive (overflow) compactions never auto-continue.

   Added skipNextProactiveCheck cooldown so the auto-continue turn
   itself doesn't immediately re-trigger compaction (the summary +
   system prompt can still exceed usable on a small-context model;
   without the cooldown we'd loop).

   QueueItem.metadata now flows through to MessageEntry.metadata so the
   compaction_continue tag survives into the DAG and across restarts.

   Two integration tests: on-path (auto-continue runs, response
   recorded), off-path (autoContinue: false suppresses).

Spec updated: SessionStore.updateEntry signature; pruning persistence
clarified; CompactionConfig.autoContinue added to the config table.
99 tests, all green.
Two new env vars:
- VALET_CONTEXT_WINDOW: override the model's local contextWindow
- VALET_MAX_TOKENS: override the model's local maxTokens

Anthropic's API still accepts the model's real (much larger) context
window; the override only affects the engine's 'usable' calculation,
which is what triggers proactive compaction. Useful for dogfooding the
compaction loop with a real LLM at small budgets so we don't have to
generate 100k tokens of context to see it fire.
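
A sketch of how the two overrides could feed the proactive trigger. The
env var names and the "total >= usable" comparison come from the commit
messages above; everything else is an assumption, in particular the
usable formula (contextWindow - maxTokens - reserveTokens) and the
rejection of non-positive or non-integer values.

```typescript
interface ModelLimits {
  contextWindow: number;
  maxTokens: number;
}

// Hypothetical validator: undefined when unset, throws on bad input.
function parsePositiveIntEnv(
  name: string,
  raw: string | undefined,
): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

// Overrides only change the local limits used for the usable calculation.
function applyOverrides(
  model: ModelLimits,
  env: Record<string, string | undefined>,
): ModelLimits {
  return {
    contextWindow:
      parsePositiveIntEnv("VALET_CONTEXT_WINDOW", env["VALET_CONTEXT_WINDOW"]) ??
      model.contextWindow,
    maxTokens:
      parsePositiveIntEnv("VALET_MAX_TOKENS", env["VALET_MAX_TOKENS"]) ??
      model.maxTokens,
  };
}

// ASSUMED formula; the engine's real usableTokens may differ.
function usableTokensSketch(limits: ModelLimits, reserveTokens: number): number {
  return limits.contextWindow - limits.maxTokens - reserveTokens;
}

// Comparison as described for shouldCompactProactive.
function shouldCompactProactiveSketch(totalUsage: number, usable: number): boolean {
  return totalUsage >= usable;
}

const limits = applyOverrides(
  { contextWindow: 200_000, maxTokens: 8_192 },
  { VALET_CONTEXT_WINDOW: "8000", VALET_MAX_TOKENS: "1000" },
);
console.log(limits, usableTokensSketch(limits, 500));
```

With a budget this small, a single turn that reads a few files pushes
total usage past the usable threshold, which is the point of the
override.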

Plus per-event printers for compaction_start / compaction_end so the
REPL output makes it obvious when compaction kicks in.

Verified end-to-end against Claude Haiku 4.5 with VALET_CONTEXT_WINDOW=8000
and VALET_MAX_TOKENS=1000:
- Agent reads 5 files (~60KB tool output)
- Proactive compaction fires after the first turn
- Auto-continue turn runs and references 'the previous context' from
  the injected summary, correctly understanding the task was complete.
… thread.skill)

Implements the spec's roles & skills design — the last engine-internal
piece that hadn't been touched.

src/roles-skills/parser.ts: minimal YAML-frontmatter parser for
markdown artifacts (handles key:value pairs, quoted strings, bools,
numbers, comments). renderTemplate(body, args) substitutes {{var}}
placeholders. Pure functions, hand-rolled to avoid pulling in
gray-matter for shallow needs; if a future role/skill needs nested
YAML, swap it in.
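
For illustration, a {{var}} substitution along these lines is enough for
flat arguments. This is not the code in parser.ts: the name
renderTemplateSketch and the choice to render missing or null values as
an empty string are assumptions.

```typescript
// Hypothetical stand-in for renderTemplate(body, args).
function renderTemplateSketch(
  body: string,
  args: Record<string, unknown>,
): string {
  // Matches {{name}} and {{ name }}; flat identifiers only.
  return body.replace(
    /\{\{\s*([A-Za-z_]\w*)\s*\}\}/g,
    (_match, key: string) => {
      const value = args[key];
      // Assumption: missing and null values render as an empty string.
      return value === undefined || value === null ? "" : String(value);
    },
  );
}

console.log(
  renderTemplateSketch("Review {{ file }} for {{focus}}.", {
    file: "a.ts",
    focus: "bugs",
  }),
);
```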

src/roles-skills/loader.ts: loadRoleFromMarkdown / loadSkillFromMarkdown
build typed RoleSpec / SkillSource from a markdown blob. Skill argsSchema
is supplied separately (markdown frontmatter is the wrong place for it).

src/session.ts: Session now indexes options.roles and options.skills
into Maps for O(1) lookup.

src/thread.ts:
- PromptOptions.role flows through QueueItem.role to runItem.
- Per-turn role overlay: applyRoleForTurn concatenates role.content
  onto agent.state.systemPrompt; if role.model is set, looks it up via
  pi-ai's getModel and overrides agent.state.model. Both restored in
  finally{}, regardless of error or compaction. Unknown role names
  emit an 'error' event with code=role_not_found and run without
  overlay (spec: prompt-level resolution failures fail gracefully).
- Thread.skill(name, opts): looks up skill, validates args against
  argsSchema via TypeBox's compile.Compile, renders {{var}} template,
  submits as a normal prompt with optional model/resultSchema/author/
  channel forwarded. metadata.skill records the skill name and
  metadata.syntheticFrom='skill' tags the synthetic origin.

Tests:
- test/roles-skills-pure.test.ts: 15 unit tests for the parser
  (frontmatter shapes, quoted strings, comments, unclosed fence)
  + the renderer (basic/whitespace/missing/null/nested) + the loaders.
- test/roles-skills.test.ts: 5 integration tests using the faux
  provider with response factories that capture ctx.systemPrompt and
  ctx.messages, verifying:
  * role overlay reaches the LLM via the system prompt
  * unknown role emits role_not_found and runs without overlay
  * thread.skill renders {{var}} placeholders correctly
  * unknown skill name throws
  * argsSchema validation rejects bad input

REPL gains VALET_ROLE_FILE / VALET_ROLE_DEFAULT env vars for
dogfooding. Verified end-to-end against Claude Haiku 4.5 with a
'pirate captain' role markdown — the LLM responds in pirate slang
when the role is applied and reverts to base behavior without it.

119 tests, all green.
@yourbuddyconner changed the title from "Document portable runtime V1 spec" to "Document portable runtime V1" on May 6, 2026
@yourbuddyconner changed the title from "Document portable runtime V1" to "Portable Engine Runtime" on May 6, 2026
Collaborator

@figitaki figitaki left a comment

Strong direction overall — session/thread semantics, deterministic gate replay, and the in-memory/SQLite store contract are all solid. I’m requesting changes before merge for a few high-leverage correctness and rollout edges: gate-resolution durability/persistence consistency, restart behavior for local sandbox restore, env override validation in REPL, and updateEntry overwrite safety. If we tighten these, this lands much more safely.

- GET  /api/sessions/:id/threads  → returns the engine's default thread
- GET  /api/sessions/:id/messages → reads engine entries via thread.readEntries
- POST /api/sessions/:id/messages → engineHost.sessionFor() + session.prompt()

v1 scopes to a single implicit thread per session (engine's web:default).
Multi-thread is a future enhancement; the wire shape carries threadId so the
client doesn't need to change when we add it.

Verified: typecheck clean, routes serve 200/201/404 correctly against a fresh
sqlite db; engine session lazily materialized via EngineHost on first thread/
message access.
GET /api/sessions/:id/ws upgrades via @hono/node-ws. On open:
- verify session ownership; close 4040 if missing
- materialize the engine session + default thread
- send 'init' frame with session detail + recent messages
- subscribe to engine eventBus filtered by sessionId
- map BusEvent → wire events via bridge
- track active assistant messageId per thread for text_delta tagging
- 30s ping for keepalive

main.ts now calls injectWebSocket(server) after serve() to attach the
upgrade handler to the running http server.

Verified: WS connects, init frame arrives with proper shape and seq=1.
scripts/dogfood.ts boots the server in-process, creates a session, opens the
WS, posts a prompt, and asserts the engine actually drove Anthropic, ran bash
in Docker, wrote the file via the bind mount, and streamed 12 wire events in
order. Run with:

  ANTHROPIC_API_KEY=sk-ant-... pnpm --filter @valet/api dogfood

Verified end-to-end: file 'ok' lands at $WORKSPACE/hello.txt; events:
init, status×4, message_start/end, tool_start×2 (bash, read), tool_end×2,
turn_end. The parallel-tool race (read fires alongside bash) is the known
pi-ai `disable_parallel_tool_use` follow-up; bash succeeds and the file is
written, which is what S6 verifies.
Vite + React 19 + TanStack Router (file-based) + TanStack Query + Tailwind 3.
Imports wire types from @valet/api/wire (workspace).

- vite.config.ts: TanStack Router plugin auto-generates routeTree.gen.ts;
  /api proxy points at the server (PORT 8788 by default).
- tailwind.config.ts: design tokens — neutral grayscale (OKLCH-tuned),
  accent/danger/success scales, sans/mono font stacks, radius scale.
- src/styles/globals.css: base layer with light/dark CSS vars driven by
  prefers-color-scheme.
- src/main.tsx: QueryClient + RouterProvider.
- src/routes/{__root.tsx,index.tsx}: stub home page.
- src/lib/cn.ts: tailwind-merge + clsx helper.

Verified: pnpm install resolves, vite boots on :5173, typecheck clean.
Primitives + screens land in W2+.
12 primitives built as thin intentional wrappers over Radix:
- button, input/textarea, label, dialog (+ trigger/close/content/footer),
  dropdown-menu (+ items/separator/sub*), tooltip, card (+ header/title/
  body/footer), avatar, badge, spinner, separator, scroll-area.

Each primitive has its own variant/size API; Radix is implementation detail.
Tokens live in tailwind.config.ts (OKLCH-tuned color scales) and CSS vars in
src/styles/globals.css for light/dark.

tsconfig.json gains 'paths: { ~/*: src/* }' so tsc resolves the same alias
vite already does. /primitives route is a small in-app showcase for eyeballing.

Verified: typecheck clean, both / and /primitives serve 200.
- src/api/client.ts: typed fetch wrapper using @valet/api/wire types.
- src/api/queries.ts: TanStack Query hooks (useMe/useSessions/useSession/
  useThreads/useMessages/useCreateSession/useDeleteSession/useSendPrompt)
  with a query-key factory.
- src/stores/stream.ts: Zustand store keyed by sessionId. Reducer applies
  wire events to the message list — message_start/text_delta/message_update/
  tool_start/tool_end/status/turn_end/error/ping. Drops out-of-order frames
  via lastSeq tracking.
- src/api/ws.ts: useSessionWebSocket(sessionId) hook — opens WS, pipes
  events into the store, reconnects with exponential backoff (500ms→8s),
  cleans up on unmount.
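
Two of the behaviors above, sketched as pure functions. The 500ms floor
and 8s cap come from the text; doubling per attempt is an assumption, as
are the type and function names.

```typescript
// Assumed schedule: double each attempt, starting at 500ms, capped at 8s.
function reconnectDelayMs(attempt: number): number {
  return Math.min(500 * 2 ** attempt, 8000);
}

// Hypothetical, trimmed-down wire event and per-session state.
interface WireEvent {
  seq: number;
  type: string;
}
interface StreamState {
  lastSeq: number;
  applied: WireEvent[];
}

// Frames at or below lastSeq are duplicates or out of order and are dropped.
function applyEvent(state: StreamState, ev: WireEvent): StreamState {
  if (ev.seq <= state.lastSeq) return state;
  return { lastSeq: ev.seq, applied: [...state.applied, ev] };
}

const frames: WireEvent[] = [
  { seq: 1, type: "init" },
  { seq: 2, type: "message_start" },
  { seq: 2, type: "message_start" }, // duplicate
  { seq: 1, type: "init" }, // stale
  { seq: 3, type: "text_delta" },
];
const finalState = frames.reduce(applyEvent, { lastSeq: 0, applied: [] });
console.log(finalState.applied.map((e) => e.seq));
console.log([0, 1, 2, 3, 4, 5].map(reconnectDelayMs));
```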

Adds zustand@5 dep. Typecheck clean. UI components consume these in W4-W7.
…w/ live stream (W4-W7)

UI surface for the agent loop, end-to-end:

- AppShell: two-pane layout (sidebar + main).
- Sidebar: lists sessions via useSessions; nav links via TanStack Router;
  active state highlighting; 'New' button opens NewSessionDialog.
- NewSessionDialog: workspace input + Create; navigates to /sessions/:id.
- /sessions/$sessionId route: SessionHeader (workspace, conn badge, agent
  status badge, delete) + MessageList (auto-stick-to-bottom) + Composer
  (Cmd/Ctrl+Enter to send, disabled while engine mid-turn) + error banner.
- MessageItem: renders text and tool_call parts; tool_call card shows args/
  result with running/completed/error status.

useSessionWebSocket pipes wire events into the streaming store; the store's
reducer applies message_start/text_delta/message_update/tool_start/tool_end/
status/turn_end/error to derive the rendered message list.

Verified: typecheck clean, routeTree picks up /sessions/$sessionId, all
routes serve via vite. Visual + live agent dogfood lands in W8.
- packages/api/README.md: scope, routes table, env vars, dogfood usage,
  what works / doesn't.
- packages/web/README.md: stack, layout, design tokens, what's not built yet.
- Makefile: dev-api-node, dev-web, dev-local, dogfood-api targets.
- CLAUDE.md: project structure + tech stack rows updated to call out the
  new packages and label legacy ones explicitly.
- docs/plans/2026-05-09-greenfield-followups.md: captured what's deferred —
  real auth, CF cutover, multi-thread UI, decision gates in wire,
  pi-ai disable_parallel_tool_use, etc.

Final state verified: api typecheck clean, web typecheck clean, 11 api
tests pass, full stack dogfooded end-to-end through Vite proxy with real
Anthropic + Docker bind-mount file write.
Three resilience fixes after first browser-side dogfood:

1. POST /api/sessions now mkdir -p the workspace if it doesn't exist (and
   rejects when the path exists but isn't a directory, or the path isn't
   absolute). The default-pathway 'pick a workspace, click Create' flow
   now Just Works without the user having to pre-create the dir.

2. WS onOpen wraps engine setup in try/catch. Previously, if
   engine.createSession threw (e.g. workspace ENOENT, Docker unreachable), the
   uncaught rejection killed the server process — taking down every live
   session and triggering vite proxy ECONNREFUSED storms. Now we send a
   wire 'error' frame and close the socket cleanly.

3. Hono onError catch-all returns errors as JSON; main.ts adds
   unhandledRejection/uncaughtException loggers so a bug in one route
   no longer cascades into a process crash. Belt-and-braces — real fixes
   still belong in the handler that swallowed the error.

Also: .gitignore now covers .env.deploy.* (suffixed dev/prod variants).
applyAppMigrations and applyEngineMigrations both ran every SQL file on
every boot — so the second startup hit 'table already exists' and
crashed before the server could even bind a port. Fix:

- Track applied migrations in __valet_app_migrations /
  __valet_engine_migrations (filename + applied_at). Re-runs across
  restarts are no-ops.
- Run each migration in a transaction so partial application leaves the
  tracker untouched.
- One-time backfill: if the schema tables already exist (db pre-dates
  this change) but the tracker is empty, mark all migrations applied
  without re-executing. Lets existing local DBs survive the upgrade
  without manual intervention.

Existing tests (13 in store-sqlite, 11 in api) all pass against the
backfill path.
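
The decision logic behind the fix can be sketched without a database.
planMigrations and its inputs are hypothetical; in particular, ordering
by filename is an assumption. The caller would run each toRun file in
its own transaction and record it in the tracker table, and would only
record (not execute) the toBackfill files.

```typescript
interface MigrationPlan {
  toRun: string[]; // execute, then record in the tracker
  toBackfill: string[]; // record only, never re-execute
}

// Hypothetical planner for the three cases described above.
function planMigrations(
  files: string[],
  applied: ReadonlySet<string>,
  schemaExists: boolean,
): MigrationPlan {
  const sorted = [...files].sort();
  // DB pre-dates the tracker: schema is there but nothing is recorded.
  if (applied.size === 0 && schemaExists) {
    return { toRun: [], toBackfill: sorted };
  }
  // Fresh DB or normal restart: run only what the tracker has not seen.
  return { toRun: sorted.filter((f) => !applied.has(f)), toBackfill: [] };
}

const migrationFiles = ["0002_threads.sql", "0001_init.sql"];
console.log(planMigrations(migrationFiles, new Set(), false)); // fresh DB
console.log(planMigrations(migrationFiles, new Set(), true)); // legacy DB
console.log(planMigrations(migrationFiles, new Set(["0001_init.sql"]), true));
```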
Two compounding bugs caused user prompts to never appear in the UI during
a live session — the screenshot showed two assistant replies with no user
prompts above either one.

Root cause 1: the engine doesn't emit a wire event for the user's own
prompt. message_start fires only for assistant + system roles. The user
MessageEntry is persisted into engine_entries (so it'd reappear on the
next WS init / page reload via readEntries), but during live streaming
nothing pushes it to the client.

Root cause 2: the stream store's message_start reducer collapsed any
non-system role to 'assistant' — defensive code from when wire types
were narrower. Even if a user-role start were synthesized, the reducer
would have rendered it as the assistant.

Fix:

- stream.ts message_start reducer forwards ev.role verbatim. The wire's
  MessageRole is the full union (user/assistant/tool/system) and the
  Message type accepts it.
- new addUserMessage(sessionId, text) action on the stream store —
  appends a synthetic user-role Message with content + a single text
  part. ID prefixed 'user-opt-' so it's distinguishable.
- Composer calls addUserMessage immediately on submit, before the POST.
  User sees their text instantly. On the next WS init the server's
  persisted copy replaces it (different id, same content; brief overlap
  acceptable for v1).
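
A simplified, store-free sketch of the optimistic append. The real
action is addUserMessage(sessionId, text) on the Zustand store; the pure
function below, its localId parameter, and the trimmed Message type are
assumptions made to keep the example self-contained.

```typescript
type MessageRole = "user" | "assistant" | "tool" | "system";

// Hypothetical, trimmed-down message shape.
interface Message {
  id: string;
  role: MessageRole;
  content: string;
  parts: { type: "text"; text: string }[];
}

// Appends a synthetic user message; the id prefix marks it as optimistic.
function appendOptimisticUserMessage(
  messages: Message[],
  text: string,
  localId: string,
): Message[] {
  return [
    ...messages,
    {
      id: `user-opt-${localId}`,
      role: "user",
      content: text,
      parts: [{ type: "text", text }],
    },
  ];
}

function isOptimistic(m: Message): boolean {
  return m.id.startsWith("user-opt-");
}

const afterSubmit = appendOptimisticUserMessage([], "say hi", "1");
console.log(afterSubmit);
```

The prefix lets later code tell local placeholders from server-persisted
rows, for example when the next init frame replaces the list.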

Considered and rejected three alternatives: server-synthesized wire event
(requires extending the engine EngineEvent union for a UI bug); refetch
on turn_end (slow round-trip after every turn, prompt vanishes during
the gap); returning the persisted message from POST (couples route to
engine internals + assumes synchronous persistence).
Composer:
- Enter sends; Shift+Enter inserts a newline (was: Cmd/Ctrl+Enter sends).
- Skips submit while IME composition is active so Enter confirms the
  composition instead of sending a half-finished message.
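
The key handling above reduces to a small predicate. The KeyLike type
and shouldSubmit name are hypothetical; in a React handler the
isComposing flag would typically be read from the native KeyboardEvent.

```typescript
// Minimal subset of KeyboardEvent needed for the decision.
interface KeyLike {
  key: string;
  shiftKey: boolean;
  isComposing: boolean;
}

// Enter sends; Shift+Enter and Enter during IME composition do not.
function shouldSubmit(e: KeyLike): boolean {
  return e.key === "Enter" && !e.shiftKey && !e.isComposing;
}

console.log(shouldSubmit({ key: "Enter", shiftKey: false, isComposing: false }));
console.log(shouldSubmit({ key: "Enter", shiftKey: true, isComposing: false }));
console.log(shouldSubmit({ key: "Enter", shiftKey: false, isComposing: true }));
```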

Markdown rendering for chat text:
- New components/markdown.tsx wraps react-markdown + remark-gfm with
  token-aware styling. Code blocks, inline code, headings, lists, links,
  tables, blockquotes themed against --bg/--fg/accent so dark mode works.
- @tailwindcss/typography added; theme overrides via prose-* utilities.
- TextBlock in MessageItem now renders via Markdown. Tool call args/
  results stay as <pre> (JSON / shell output, not prose).
- Raw HTML disallowed (react-markdown default) so user/assistant text
  is safe to render. Links open in a new tab w/ noopener+noreferrer.
Two compounding bugs caused tool cards to vanish on page refresh while
working fine during a live turn:

1. The WS init frame was sending parts: [] for every persisted message,
   with a comment claiming the client would refetch via REST. That was
   wrong — useMessages's data is never piped into the stream store, so
   parts went nowhere. Init now uses engineToWireParts to forward the
   real parts (text + tool_call) for each message.

2. Engine's thread.handleAgentEvent persisted the assistant entry at
   message_end with all tool_call parts at status='running' (and no
   result yet — the actual execution happens *after* message_end fires).
   Then tool_execution_end mutated the in-memory part objects but never
   re-persisted, leaving sqlite stuck with stale 'running' rows. Fix:
   hold a reference to the entry on message_end and call updateEntry
   after each tool_execution_end. (parts are shared by reference, so
   the mutation flows through.)

Plus: dev-only console.debug of incoming WS frames in useSessionWebSocket
to make future bugs of this shape easier to debug from a browser.

87 engine tests + 11 api tests still pass.
The bug we just fixed (engine persisted tool_call at message_end with
status='running' and never re-persisted) had no test coverage. Existing
happy-path tool test only checked that the file landed and bus events
fired — never read the persisted entry back. Adding three guards:

1. happy-path.test.ts: after the tool turn, read the assistant entry
   from the store and assert tool_call.status === 'completed' with the
   result + correct args. Catches the engine forgetting to updateEntry.

2. store-contract round-trip: persist a tool_call(status: running) with
   nested args, read back, assert all fields preserved. Catches a store
   impl that drops or coerces tool_call fields. Runs against both
   InMemorySessionStore and SqliteSessionStore.

3. store-contract updateEntry transition: appendEntries a running
   tool_call, then updateEntry the same id with status='completed' +
   result, read back, assert the transition stuck. Catches a store impl
   that fails to overwrite parts on update.

89 engine tests pass, 15 store-sqlite tests pass.
Boots createApp(providers) on a real (port=0) http server with an
in-memory sqlite + virtual sandbox + InMemoryEventBus, drives a real
Anthropic-backed turn that calls the write tool, closes the WS, opens
a fresh one (simulating a page reload), and asserts the init frame
contains an assistant message with a completed tool_call part.

Catches the exact pair of bugs we just shipped a fix for:

1. Init frame stripping parts: [] — the assertion finds no tool_call
   anywhere in initFrame.messages.
2. Engine forgetting to updateEntry on tool completion — the assertion
   finds tool_calls but only with status='running', not 'completed'.

Skipped via describe.skip when ANTHROPIC_API_KEY is missing so CI
without a key still passes. Takes ~1.2s wall-clock with claude-haiku-4-5
and costs a fraction of a cent per run.
Restructures the app shell along the user's requested shape:

- New TopNav: brand → session dropdown → '+ New session' button.
  SessionPicker reads sessionId from the URL and shows current title,
  with a Radix dropdown listing all sessions. Switching navigates to
  /sessions/$sessionId.
- New ThreadList replaces the previous sidebar contents. v1 only has
  the engine's web:default thread per session, so the list is short —
  but the structure is in place for multi-thread support (server CRUD
  + thread switching are Phase B).
- AppShell now has three zones (top, sidebar, main) instead of two.
- New-session dialog still triggered from a button (now in the top nav)
  and renders as a Radix Dialog.
- Empty-state copy on / points at the new top-nav button.

The old layout/sidebar.tsx (sessions list in a left pane) is removed.
Typecheck clean; no server-side changes.
Server (api):
- POST /api/sessions/:id/threads creates a new engine thread (key
  generated server-side). Returns the new ThreadSummary.
- GET /api/sessions/:id/threads now lists ALL threads from
  engineSession.listThreads(), not just the default.
- POST/GET /api/sessions/:id/messages accept a threadId (body field /
  query param). Defaults to the session's default thread when omitted,
  so single-thread clients keep working unchanged.
- Helpers: loadEngineSession, resolveThread, threadToSummary.

Wire types:
- Added CreateThreadRequest.
- SendPromptRequest gains optional threadId.

Client:
- api.createThread + listMessages threadId param. useCreateThread
  mutation invalidates the threads query on success. useMessages now
  takes an optional threadId (queryKey is per-thread).
- /sessions/$sessionId route declares a typed search schema with
  ?thread=<id>. ActiveThreadId resolves to the search param if set,
  else the first thread (default).
- ThreadList: '+ New thread' button at the top right; clicking creates
  a thread and navigates to it. Each thread row is a Link that updates
  ?thread=. Default thread renders without a search param to keep URLs
  clean.
- MessageList accepts a threadId prop and filters store messages.
  Optimistic user messages (threadId: null) are always shown so the
  user sees their text immediately; server's persisted copy with the
  real thread id replaces them on the next WS init.
- Composer takes threadId and posts it with each prompt.

Known limitation (documented in thread-list.tsx): WS init still loads
only the default thread's history. Reloading the page on a non-default
thread shows the default thread's old messages but leaves the active
thread blank until new live events arrive. The fix is REST-driven
history loading on thread switch, tracked as a separate follow-up.

12 api tests still pass (incl. the WS reload integration test).

New integration test: `cross-thread.test.ts`. Drives two real Anthropic
turns in the same session against separate threads:

  1. Thread A (web:default): user tells the assistant a unique phrase,
     agent acknowledges. The phrase lives only in A's message history.
  2. Thread B (created via POST /threads): user asks the agent to call
     thread_read against web:default and report the phrase.

Asserts thread B's final assistant message contains the phrase verbatim,
proving:
  - POST /threads creates a fresh engine thread accessible by the
    messages routes via threadId
  - Threads are isolated — B's user messages don't include A's prompt
  - The engine's thread_read builtin works across threads
  - Multi-thread routing via ?threadId=… on /messages is correct

driveTurn (and reload-tool-rendering by extension) now waits for the
agent loop to fully settle, not just the first turn_end. Engine emits
turn_end per LLM round; with tool use the agent does multiple rounds
(tool-use turn → tool exec → follow-up text turn). Returning after the
first turn_end raced the second round and missed the final assistant
message. Now a settle timer is armed on each turn_end (3s default) and
reset on any new agent activity; driveTurn resolves only after a quiet
period.
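
The quiet-period rule described above could be sketched as a small tracker. This is an illustration, not the actual driveTurn helper: createSettleTracker, AgentEvent, and the explicit `now` argument are hypothetical, and only the turn_end event name and the 3s default come from the commit message.

```typescript
// Hypothetical event shape; the real engine/wire event types are not shown here.
type AgentEvent = { type: string };

type SettleTracker = {
  observe(event: AgentEvent, now: number): void;
  isSettled(now: number): boolean;
};

// Tracks whether the agent loop has been quiet for `quietMs` after a turn_end.
// Time is passed in explicitly so the logic can be tested without real timers.
function createSettleTracker(quietMs: number = 3000): SettleTracker {
  let lastTurnEndAt: number | null = null;
  return {
    observe(event, now) {
      // turn_end arms (or re-arms) the timer; any other activity disarms it.
      lastTurnEndAt = event.type === "turn_end" ? now : null;
    },
    isSettled(now) {
      return lastTurnEndAt !== null && now - lastTurnEndAt >= quietMs;
    },
  };
}
```

With this shape, a tool-use round followed by a text round only settles after the second turn_end has been quiet for the full window, which is the race the commit describes.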

Boot harness extracted to _setup.ts and shared helpers to _test-utils.ts
(underscore-prefix so vitest's *.test.ts glob skips them).

Reload test now 5.0s (was 1.2s — extra time is the settle wait); cross-
thread test 11.2s. 13 api tests pass.

Optimistic user messages were tagged 'threadId: null' because addUserMessage
took only sessionId+text. The MessageList filter then accepted both the
active threadId AND null as a fallback, so user bubbles posted in one
thread reappeared in every other thread's view after a switch.

Fix:

- addUserMessage now requires a threadId. The store tags the optimistic
  message with the active thread, so it's correctly scoped.
- Composer takes threadId; submit is gated on it being defined. The
  textarea and Send button are disabled and show 'Loading thread…' while
  the threads query is in flight (typically <100ms). Without this we'd
  lose the guarantee that addUserMessage receives a real threadId.
- MessageList filters strictly: m.threadId === activeThreadId. The null
  fallback is gone. (Loading state — activeThreadId undefined — still
  shows everything so init-frame messages render before the threads
  query resolves; nothing new can be added in that window because
  Composer is disabled.)
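
A minimal sketch of the strict filter described above, assuming a simplified message shape. StoreMessage and visibleMessages are hypothetical names; the real MessageList and store types may differ.

```typescript
// Hypothetical minimal message shape for illustration.
type StoreMessage = { id: string; threadId: string | null; text: string };

function visibleMessages(
  messages: StoreMessage[],
  activeThreadId: string | undefined,
): StoreMessage[] {
  // Loading state: the threads query has not resolved yet, so show everything.
  if (activeThreadId === undefined) return messages;
  // Strict match only: a null threadId is no longer accepted as a fallback.
  return messages.filter((m) => m.threadId === activeThreadId);
}
```

Because the null fallback is gone, a message tagged for one thread can no longer leak into a sibling thread's view.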

Out of scope (separate bug, less common): WS init replaces the message
list with the default thread's history on every reconnect. If the user
is on a non-default thread when the WS reconnects, that thread's live
messages get wiped from the store. The full fix is REST-driven history
load on thread switch + treating WS init as session metadata only.
Documented as a follow-up.

Fixes the documented limitation: reloading the page (or any WS reconnect)
on a non-default thread used to wipe that thread's messages because the
init frame replaced the entire message list with the default thread's
persisted history.

Server (api):
- WS init drops the messages field. It now carries only session metadata.
  The client loads thread history via REST (GET /messages?threadId=…).
- Wire type 'init' updated; ws.ts no longer reads entries on connect
  (still ensures the engine session + default thread are materialized).

Client (web):
- New stream-store action setThreadMessages(sessionId, threadId, msgs)
  replaces the messages for one thread, leaves other threads untouched,
  and preserves still-in-flight optimistic user messages whose content
  hasn't yet appeared in the REST snapshot. Once the server has persisted
  a matching user message, the optimistic copy is dropped to avoid a
  duplicate row.
- Reducer's 'init' case no longer mutates messages — only clears
  transient status/error.
- /sessions/$sessionId route now drives useMessages(sessionId,
  activeThreadId) and pipes the result into setThreadMessages via
  useEffect on (sessionId, activeThreadId, data) changes.
- useMessages disables refetchOnWindowFocus + refetchOnReconnect so
  background refetches can't wipe in-flight live state. Initial load and
  thread-switch fetches still happen because each (sessionId, threadId)
  is its own queryKey.
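
The merge behaviour of setThreadMessages might look roughly like the sketch below. It is an assumption-laden illustration: ChatMessage, the `optimistic` flag, and matching optimistic messages to persisted ones by exact text are all hypothetical choices, not the store's confirmed implementation.

```typescript
// Hypothetical message shape; `optimistic` marks client-side placeholders.
type ChatMessage = {
  id: string;
  threadId: string;
  role: "user" | "assistant";
  text: string;
  optimistic?: boolean;
};

function mergeThreadMessages(
  current: ChatMessage[],
  threadId: string,
  snapshot: ChatMessage[],
): ChatMessage[] {
  // Messages belonging to other threads are left untouched.
  const otherThreads = current.filter((m) => m.threadId !== threadId);
  // User texts the server has already persisted (assumed match key: exact text).
  const persistedUserTexts = new Set(
    snapshot.filter((m) => m.role === "user").map((m) => m.text),
  );
  // Keep optimistic messages that the REST snapshot does not contain yet.
  const stillPending = current.filter(
    (m) =>
      m.threadId === threadId &&
      m.optimistic === true &&
      !persistedUserTexts.has(m.text),
  );
  return [...otherThreads, ...snapshot, ...stillPending];
}
```

Dropping the optimistic copy once a matching persisted message exists is what avoids the duplicate row mentioned above.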

Tests:
- reload-tool-rendering.test.ts switched from captureInitFrame() to
  GET /messages — the actual code path the client now takes after a
  reload. Same regression coverage (init stripping parts becomes 'GET
  /messages dropping parts'; engine forgetting updateEntry on tool
  completion still surfaces the same way).
- captureInitFrame() helper kept (init still fires; could be useful for
  future tests asserting on session metadata).

13 api tests pass. Cross-thread test still green: thread B reads thread
A's persisted history via thread_read.

…er tool

Replaces the old generic ToolCallBlock (icon + tool name + raw JSON args
in <pre> + plain text result) with a registry of per-tool renderers under
src/components/session/tool-renderers/. Each renderer claims one or more
tool names and contributes a custom Body for that tool's args + result.
Unknown plugin tools route to a fallback that auto-extracts a primary
identifier from args and renders typed key/value tables — so plugins
look polished without anyone writing custom code for them.

Common chrome (ToolShell):
- 2px category-colored left strip identifies tool family at a glance
  (shell/read/write/edit/thread/generic).
- Compact mono header: tool name (uppercase, tracked) + target identifier
  + optional summary + status pip.
- While running, a low-opacity scanner-line animation sweeps the header
  in the category color — the visual heartbeat for 'agent is working'.
- Click header to expand/collapse. Default: expanded while running and
  on error, collapsed once completed (keeps the chat dense by default).

Tool-specific renderers:
- bash: terminal aesthetics — black bg, emerald-300 mono, dollar prompt,
  blinking caret while running, exit-code summary in the header.
- read: file-viewer with line numbers, byte+line summary.
- write: additive diff (green '+' lines) of the new file content.
- edit: side-by-side diff lines (red '-' before, green '+' after) +
  surfaces 'no match' failures inline.
- thread_read: parses the engine's markdown dump back into bubbles with
  role-tinted left borders, relative timestamps, recent-N collapse.
- fallback: typed key/value table for args (strings mono, numbers
  tabular-aligned, booleans as pills, objects/arrays as collapsed JSON
  with click-to-expand). Smart 'identifier' extraction prefers
  path/id/name/key/url/command/query.
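
The fallback's identifier extraction could be sketched as below. The function name and the assumption that the listed keys are tried in exactly that priority order are illustrative; the real fallback renderer may apply additional heuristics.

```typescript
// Keys taken from the description above; treating the order as priority is an assumption.
const IDENTIFIER_KEYS = ["path", "id", "name", "key", "url", "command", "query"] as const;

// Returns the first non-empty string found under a preferred key, if any.
function extractIdentifier(args: Record<string, unknown>): string | undefined {
  for (const key of IDENTIFIER_KEYS) {
    const value = args[key];
    if (typeof value === "string" && value.length > 0) return value;
  }
  return undefined;
}
```

Non-string values are skipped in this sketch so the header never shows a stringified object.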

Adding a renderer for a plugin tool: build a ToolRenderer (see types.ts)
and add it to RENDERERS in tool-renderers/index.ts before the fallback.
Same shell, same status semantics, same shape contract.

Typecheck clean. All five new files serve through Vite.

Tool cards on reload were rendering '(empty output)' / '(empty file)'
even after our prior fix made tool_call status persist correctly. Root
cause: pi-agent-core emits AgentToolResult-shaped objects
({ content: [{ type: 'text', text }] }) and the engine was storing
event.result verbatim. Frontend's resultText() only knew about
{ text: string }, so it found nothing and rendered the empty-state.

Fixes:

1. Engine (thread.ts tool_execution_end) now persists a hybrid shape:
   spread the raw result fields, then add a top-level `text` rendered
   from renderToolResult(). Readers that pull `result.text` Just Work;
   anything that wants the raw blocks can still inspect result.content.
2. Frontend resultText() handles all three shapes defensively (string,
   { text }, { content: [...] }) so already-persisted entries from
   before this commit also render correctly.
3. Strengthened tests: happy-path now asserts result.text is a non-empty
   string containing the actual output (previously just .toBeDefined());
   reload integration test does the same end-to-end through GET /messages.
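
A defensive reader for the three shapes listed in fix 2 might look like this. It is a sketch rather than the frontend's actual resultText(): the name resultTextSketch and the choice to join multiple text blocks with newlines are assumptions.

```typescript
// Accepts a string, { text }, or { content: [{ type: 'text', text }, ...] }.
function resultTextSketch(result: unknown): string {
  if (typeof result === "string") return result;
  if (result === null || typeof result !== "object") return "";

  const r = result as { text?: unknown; content?: unknown };
  // Prefer the top-level text field when present (the hybrid shape).
  if (typeof r.text === "string") return r.text;

  // Fall back to concatenating text blocks from the raw content array.
  if (Array.isArray(r.content)) {
    const parts: string[] = [];
    for (const block of r.content as unknown[]) {
      if (typeof block === "object" && block !== null) {
        const b = block as { type?: unknown; text?: unknown };
        if (b.type === "text" && typeof b.text === "string") parts.push(b.text);
      }
    }
    return parts.join("\n");
  }
  return "";
}
```

Returning an empty string for unknown shapes lets the caller decide whether to show an empty-state label.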

Plus a new top section in CLAUDE.md ('Working on this codebase') that
documents the persistence-shape-drift class of bug, the canonical
round-trip path (engine.appendEntries/updateEntry → bridge → REST →
resultText), the test commands to run when touching this code, and a
debug recipe (sqlite3 ~/.valet/app.db) for the next time tool cards
render empty.

89 engine tests + 15 store-sqlite tests + 13 api integration tests pass
(including the 'real Anthropic + Docker, then GET /messages' round-trip
that strictly asserts a readable result.text exists).

…tool

Adds first-class model switching to the engine, plumbed through to a
builtin tool the agent can call mid-turn and to the API surface so the
UI / a future client can drive it directly.

Resolution chain at turn time: thread.modelOverride → session.options.model
(no global fallback in the engine; the API host supplies the global default
when creating a session).

Engine
- SessionData gains 'model: string?' (persisted session default).
- Thread holds a per-thread modelOverride; Thread.setModel(id|null,reason)
  resolves the id, persists via store.saveThread, and emits model_switched.
- Session.setModel(id,reason) does the same at session scope (threadId
  omitted on the bus event to indicate scope).
- runItem reads thread.resolveTurnModel() once at turn start (overlays
  before role overlay) and restores the baseline in finally so the next
  turn picks up any mid-turn change.
- Generalized resolveModelId (was resolveRoleModel) — provider/model form
  or bare ids tried under anthropic/openai/google.
- New ToolContext.setModel({ model, scope? }): allows tools to dispatch
  to either thread or session scope without seeing engine internals.
- New switchModelTool builtin — wraps ctx.setModel; surfaces failures as
  a readable result instead of throwing past the agent.
- model_switched event finally has an emit site (was vestigial).
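
The resolution chain and the two accepted id forms can be illustrated with the sketch below. The function names, the ModelCandidate type, and the behaviour of returning undefined when nothing is configured are assumptions; the real resolveModelId presumably also checks candidates against a model registry, which is not shown.

```typescript
// Providers tried for bare ids, in the order listed above (order is an assumption).
const PROVIDERS = ["anthropic", "openai", "google"] as const;

type ModelCandidate = { provider: string; model: string };

// Thread override wins; otherwise the session default. No global fallback here.
function resolveTurnModelId(
  threadOverride: string | null | undefined,
  sessionModel: string | undefined,
): string | undefined {
  return threadOverride ?? sessionModel;
}

// "provider/model" yields one candidate; a bare id yields one per known provider.
function candidateModels(id: string): ModelCandidate[] {
  const slash = id.indexOf("/");
  if (slash > 0 && slash < id.length - 1) {
    return [{ provider: id.slice(0, slash), model: id.slice(slash + 1) }];
  }
  return PROVIDERS.map((provider) => ({ provider, model: id }));
}
```

Clearing a thread override (setting it to null) falls back to the session default on the next turn under this chain.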

Persistence
- engine_sessions gains a 'model' column. Migration is edited in place
  (0000_lonely_lizard.sql) — pre-1.0 we don't accumulate ALTER TABLEs;
  blow away the local sqlite if you're upgrading. CLAUDE.md gets a new
  section calling this rule out so we don't drift.
- store-sqlite saveSession / getSession / listSessions read+write the
  field. InMemorySessionStore needs no change (round-trips SessionData
  verbatim).

Wire + API
- Wire types: SessionDetail.model + ThreadSummary.model added; new
  PatchSessionRequest / PatchThreadRequest; model_switched added to
  WireEvent (threadId optional — present for thread scope, absent for
  session scope).
- bridge.ts maps model_switched (was dropped). Empty-string threadId
  emitted by session-scope switches is normalized to undefined.
- POST /api/sessions/:id supports body.model at create … (next commit
  for the host-default piece). PATCH /api/sessions/:id sets the session
  default; PATCH /api/sessions/:id/threads/:tid sets the thread override.
  GET /sessions/:id surfaces the current model when the engine session
  is live; GET /threads surfaces each thread's override.

Tests
- 4 new engine tests (model-switching.test.ts): Thread.setModel persists
  + emits, rejects unknown ids, Session.setModel updates default + persists,
  switchModelTool dispatches via ctx.setModel + is registered in
  builtinTools.
- bridge.test.ts has new cases for model_switched (thread + session
  scope) replacing the now-obsolete 'drops model_switched' case.
- 94 engine + 15 store-sqlite + 15 api tests pass.

Adds the user-facing surface for the model-switching engine work:

- src/lib/models.ts: small curated catalog (haiku 4.5, sonnet 4.5/4.6,
  opus 4.7) with tier (fast/balanced/powerful) + descriptions. Engine
  resolves any string id, so this list is just what we surface in the
  picker — adding more is a one-line change.
- src/components/session/model-picker.tsx: Radix dropdown grouping
  models by tier with description text under each. Two variants:
  'compact' for tight chrome (session header) and 'row' for inline
  rows (thread sidebar). Shows a violet dot + label tint when the
  current value is a thread-level override. Optional 'inherit'
  affordance at the bottom — wired by callers that have a clearable
  scope (i.e. threads, which can fall back to the session default).
- session-header.tsx: shows the session-default model picker on the
  right of the header. Selecting fires PATCH /api/sessions/:id.
- thread-list.tsx: each thread row shows its current model under the
  title — muted 'inherits session model' when no override, violet
  '(model name)' when overridden. The picker itself only renders for
  the active thread to keep the sidebar dense; collapsed rows just
  show the label. Selecting fires PATCH on the thread.

API client + hooks:
- api/client.ts gains patchSession + patchThread.
- queries.ts gains useSetSessionModel + useSetThreadModel mutations
  that invalidate the right query keys on success.
- stream store + WS hook handle the new model_switched wire event
  (no-op reduction; logger summarizes scope).

Typecheck clean. All new files serve through vite.

The session default is a user-facing setting that affects every future
thread the user opens — not something an agent should reach for
unilaterally. Drop the `scope` parameter from switchModelTool and from
ctx.setModel; both now mutate only the calling thread's override.

Session-default mutation still exists on Session.setModel itself, but
it's only reachable via PATCH /api/sessions/:id from the UI.

Agents previously had to know thread keys ahead of time to call
thread_read. list_threads pulls from store.listThreads (so paused
threads still surface) and renders key, status, model override, and
summary — including a self-marker for the calling thread.

Also exposes ctx.listThreads() on ToolContext for plugin tools.

Two concurrent requests for a fresh session id both missed the cache,
each created their own Engine + Session, and each called
ensureDefaultThread — persisting two distinct thread rows with the same
key 'web:default'. Subsequent rehydrations loaded both, breaking thread
identity (list_threads returned duplicates, sidebar showed two
"Default thread" rows).

Two-layer fix:
- Single-flight inflight map in EngineHost.sessionFor so concurrent
  callers share one in-progress build.
- UNIQUE INDEX on (session_id, key) in engine_threads as
  defense-in-depth: any future race fails the second insert loudly
  instead of silently corrupting state.
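
The single-flight layer could be sketched as follows. This is a generic illustration of the pattern, not EngineHost.sessionFor itself: createSingleFlight and the `build` callback are hypothetical names.

```typescript
// Wraps an async builder so concurrent callers for the same id share one build.
function createSingleFlight<T>(
  build: (id: string) => Promise<T>,
): (id: string) => Promise<T> {
  const cache = new Map<string, T>();
  const inflight = new Map<string, Promise<T>>();

  return function get(id: string): Promise<T> {
    const cached = cache.get(id);
    if (cached !== undefined) return Promise.resolve(cached);

    // A build is already running for this id: share its promise.
    const pending = inflight.get(id);
    if (pending !== undefined) return pending;

    const promise = build(id)
      .then((value) => {
        cache.set(id, value);
        return value;
      })
      .finally(() => {
        // Clear the slot on success or failure so a failed build can be retried.
        inflight.delete(id);
      });
    inflight.set(id, promise);
    return promise;
  };
}
```

Two callers that both miss the cache now receive the same promise, so the default thread is created once. The unique index remains useful for races this in-process map cannot see, such as a second process.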

Pre-1.0 migration policy: edited 0000_lonely_lizard.sql in place; wipe
the local sqlite to recreate.

The bridge was silently dropping all four decision-gate engine events,
so the UI never saw them. Add wire-side DecisionGate / DecisionAction /
DecisionResolution types, four new WireEvent variants, and small
projectors that strip engine-only fields (context, origin, refs) from
the wire shape.
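
A projector of the kind described above might look like this sketch. Only the stripped field names (context, origin, refs) and the three gate types come from this PR; EngineGate, WireGate, toWireGate, and the `prompt` field are hypothetical stand-ins for the real types.

```typescript
// Hypothetical engine-side gate shape; the real DecisionGate has more fields.
type EngineGate = {
  id: string;
  threadId: string;
  type: "approval" | "question" | "credential_request";
  prompt: string;
  context?: unknown;
  origin?: unknown;
  refs?: unknown;
};

// Wire shape: everything except the engine-only fields.
type WireGate = Omit<EngineGate, "context" | "origin" | "refs">;

function toWireGate(gate: EngineGate): WireGate {
  // Destructure the engine-only fields away and keep the rest.
  const { context: _context, origin: _origin, refs: _refs, ...wire } = gate;
  return wire;
}
```

Keeping the projection in one small function makes it harder for an engine-only field to leak onto the wire when the gate type grows.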

REST endpoints + UI come in follow-up commits.

GET    /api/sessions/:id/decisions               → list pending gates
POST   /api/sessions/:id/decisions/:gateId/resolve   → resolve (actionId or value)
POST   /api/sessions/:id/decisions/:gateId/withdraw  → user-initiated cancel

Both mutating routes confirm the gate is actually pending in the
session before delegating to Session.resolveDecision /
Session.withdrawDecision; otherwise a stale gateId would silently no-op.
Withdraw rejects engine-internal reasons (steer, abort) so audit records
stay honest — clients can only send reason='cancel'.
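
The two guards on the withdraw route could be expressed as a small pure check, sketched below. The function name, the result type, and the specific HTTP status codes (400 and 404) are assumptions for illustration; the commit only states that non-'cancel' reasons are rejected and that the gate must be pending.

```typescript
type WithdrawCheck =
  | { ok: true }
  | { ok: false; status: number; error: string };

// Validates a client withdraw request before delegating to the session.
function checkWithdraw(
  pendingGateIds: ReadonlySet<string>,
  gateId: string,
  reason: string,
): WithdrawCheck {
  // Engine-internal reasons (steer, abort) are not accepted from clients.
  if (reason !== "cancel") {
    return { ok: false, status: 400, error: "reason must be 'cancel'" };
  }
  // A stale or unknown gateId is reported instead of silently ignored.
  if (!pendingGateIds.has(gateId)) {
    return { ok: false, status: 404, error: "gate is not pending in this session" };
  }
  return { ok: true };
}
```

A route handler would call this first and only invoke Session.withdrawDecision when the result is ok.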

When the engine raises a gate via ctx.requestDecision, the bridge now
forwards it on the wire and the web stores it as pending state per
session. The card renders only on the gate's owning thread, so
switching to a sibling thread hides it (the engine still has the
original thread suspended; the user can come back and answer).

Render modes by gate.type:
- approval / credential_request: action buttons (style: primary | danger | secondary)
- question: textarea + Submit

The X button calls POST /decisions/:gateId/withdraw with reason='cancel'.

Bootstrap: GET /sessions/:id/decisions seeds pending gates on mount so
a card raised before the WS opens still shows immediately.