Portable Engine Runtime #46
Draft
yourbuddyconner wants to merge 68 commits into
Conversation
Adds packages/engine — a portable agent runtime library that implements the V1 design from docs/specs/2026-05-02-portable-runtime-engine-design.md. Built on @mariozechner/pi-ai + @mariozechner/pi-agent-core. The engine itself has zero platform dependencies; platform adapters (Cloudflare, K8s) host it.

What's implemented:
- Engine public API (createSession, prompt, resolveDecision, abort, pause/resume) + Session/Thread classes.
- Per-thread queue with three modes: followup (FIFO), steer (abort + start), collect (buffered window).
- Decision gates with full lifecycle: pending -> resolved/withdrawn/expired. Steer withdraws pending gates with reason=steer; abort with reason=abort. DecisionGateEntry persists in the DAG.
- Multi-thread sessions: threads run concurrently with isolated histories; aborting one doesn't affect siblings.
- Built-in thread_read tool for cross-thread visibility.
- In-memory providers + VirtualSandbox so the full engine runs in vitest with no containers. 14 tests (happy path, decision gates, queue modes, multi-thread + thread_read), all green in <2s.

Spec-vs-reality deltas reconciled by the implementation are documented in packages/engine/README.md (tool signature wrapping, no native suspension primitive, message_start vs message_update).

Deferred for follow-up: restart-safe re-entrant gates (the SuspendedTurnState record is written but Engine.restoreSession is a stub), compaction, role/skill loading, model failover, ActionSource bridge, structured-result extraction.
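For illustration, the three queue modes reduce to roughly the following pure decisions (shapes here are hypothetical, not the engine's actual types):

```typescript
// Sketch of the three per-thread queue modes. QueueItem/ThreadQueueState are
// illustrative stand-ins for the engine's real types.
type QueueMode = "followup" | "steer" | "collect";

interface QueueItem { id: string; text: string }

interface ThreadQueueState {
  running: boolean;     // a turn is currently executing
  pending: QueueItem[]; // items waiting behind the running turn
}

// What an incoming prompt does to the thread, per mode.
function enqueue(
  state: ThreadQueueState,
  item: QueueItem,
  mode: QueueMode,
): { abortCurrent: boolean; pending: QueueItem[] } {
  switch (mode) {
    case "followup":
    case "collect":
      // Neither interrupts; the item joins the backlog.
      return { abortCurrent: false, pending: [...state.pending, item] };
    case "steer":
      // Abort the running turn and discard the backlog; the new item takes over.
      return { abortCurrent: state.running, pending: [item] };
  }
}

// How the backlog drains when the thread is free: followup is strict FIFO,
// one item per turn; collect hands the whole buffered window to one turn.
function drain(pending: QueueItem[], mode: QueueMode): { batch: QueueItem[]; rest: QueueItem[] } {
  if (mode === "collect") return { batch: pending, rest: [] };
  return { batch: pending.slice(0, 1), rest: pending.slice(1) };
}
```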
Spec updates (docs/specs/2026-05-02-portable-runtime-engine-design.md):
- Engine.restoreSession now takes { sessionId, options } so the host
re-supplies tools/sandbox/model — the engine doesn't persist creation
options across restarts.
- Decision-gate IDs are explicitly derived as
gate:{sessionId}:{threadId}:{queueItemId}:{resumeKey}, and resumeKey
is required (not optional) on DecisionGateRequest.
- Restart-safe contract section now explains what "re-entrant up to the
decision point" means in practice (work before requestDecision runs
twice; work after runs once on replay), how the engine populates
ctx.suspendedDecision on restoration, and what events the engine must
emit during replay.
- New "LLM-faithful entry persistence" rule: assistant tool-call blocks
must persist in MessageEntry.parts so the rehydrated transcript can
be sent to LLM providers without producing a malformed
[user, assistant(text-only), toolResult] sequence.
- ToolContext.requestDecision typed as DecisionGateRequest (not
DecisionGate); ctx.suspendedDecision documented as engine-only.
- SuspendedTurnState bullets updated to reference the deterministic
formula and the resumeKey explicitly.
- Adapter Host Contract calls out engine.restoreSession({ sessionId,
options }) and the queue-item / suspended-turn fields that must
survive hibernation.
Plan (docs/plans/2026-05-05-persistent-store-restart-safe-gates.md):
- 16 tasks across 4 phases (schema, store + contract tests, restart-safe
primitives, restoreSession + replay), all aligned with the updated
spec.
- Task 11 reworked from a flaky integration test that raced the agent
loop into a deterministic unit test of a pure shouldShortCircuit
predicate.
- Task 12's rehydrateTranscript explicitly reconstructs assistant
ToolCall blocks from MessageEntry.parts.
Task 15 surfaced two real bugs in the prototype:
- Session.rehydrate's resumeBlockedThreadIfReady call was fire-and-forget, racing with resolveDecision callers; awaiting it ensures the gate is re-armed before any caller can resolve it.
- During replay, Thread.replayBlocked needs to mirror the original queueItemId so the deterministic gate ID matches and the short-circuit fires; without this, the tool tries to open a new gate.
- Gate-status persistence (pending -> resolved) lived in the requestDecision continuation; the short-circuit path bypassed it. Moved to Thread.resolveDecision so both live and replay paths persist the resolved status.
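The short-circuit itself can be sketched as a pure predicate, roughly the shape Task 11's deterministic unit test exercises (names and types here are illustrative, not the engine's):

```typescript
// During replay, requestDecision derives its deterministic gate ID first,
// then asks: was this exact gate already resolved? If so, return the stored
// resolution instead of opening a new gate.
type GateStatus = "pending" | "resolved" | "withdrawn" | "expired";

interface PersistedGate { id: string; status: GateStatus; resolution?: unknown }

function shouldShortCircuit(
  derivedId: string,
  persisted: PersistedGate | undefined,
): { shortCircuit: boolean; resolution?: unknown } {
  if (persisted && persisted.id === derivedId && persisted.status === "resolved") {
    return { shortCircuit: true, resolution: persisted.resolution };
  }
  return { shortCircuit: false };
}
```

This is why replayBlocked must mirror the original queueItemId: a different queueItemId changes the derived ID, the lookup misses, and the tool opens a fresh gate instead of short-circuiting.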
bin/repl.ts wires the engine to a real Anthropic model via pi-ai's
getModel('anthropic', ...), with InMemorySessionStore +
InMemoryEventBus + VirtualSandbox. Supports single-shot
(`pnpm repl 'say hi'`) and interactive (`pnpm repl`) modes. Streams
text deltas, tool calls, decision gates, and turn boundaries to stdout.
Defaults to claude-haiku-4-5; override with VALET_MODEL or
VALET_SYSTEM_PROMPT. Reads ANTHROPIC_API_KEY from env via pi-ai's
provider auto-resolution.
LocalSandbox wraps node:fs/promises and node:child_process.spawn to implement the Sandbox interface against the real host filesystem and shell. Relative paths resolve against the configured workspace; absolute paths are honored as-is (no escape prevention — this is a dev/testing sandbox, security goes into the Docker provider).

ExecOpts honored:
- cwd (relative paths resolved against workspace, default = workspace)
- env (merged over process.env)
- timeout (SIGKILL on expiry, timedOut: true on result)
- signal (AbortSignal cancellation)
- stdin (piped to the child)
- maxOutputBytes (truncated: true on result)

19 tests cover FS round-trips, exec lifecycle (timeout, abort, truncation, stdin, env, cwd), and provider behavior. Total suite: 60 tests, all green.

REPL gains a VALET_SANDBOX=local|virtual switch (default virtual) and VALET_WORKSPACE for the local workspace path. Smoke-tested against a tmp scratch dir AND the valet repo itself — engine read its own README and listed packages/ via real shell.
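A minimal sketch of the exec half of such a wrapper, covering just the timeout and maxOutputBytes options (option names mirror the list above; the shapes and the result fields are illustrative, not LocalSandbox's actual API):

```typescript
import { spawn } from "node:child_process";

interface ExecOpts { timeoutMs?: number; maxOutputBytes?: number }
interface ExecResult { stdout: string; timedOut: boolean; truncated: boolean; code: number | null }

function exec(cmd: string, args: string[], opts: ExecOpts = {}): Promise<ExecResult> {
  return new Promise((resolve) => {
    const child = spawn(cmd, args);
    let out = "";
    let timedOut = false;
    let truncated = false;
    // timeout: SIGKILL on expiry; the 'close' handler still fires and reports timedOut.
    const timer = opts.timeoutMs
      ? setTimeout(() => { timedOut = true; child.kill("SIGKILL"); }, opts.timeoutMs)
      : undefined;
    child.stdout.on("data", (chunk: Buffer) => {
      out += chunk.toString();
      // maxOutputBytes: clamp the captured output and flag truncation.
      if (opts.maxOutputBytes && out.length > opts.maxOutputBytes) {
        out = out.slice(0, opts.maxOutputBytes);
        truncated = true;
      }
    });
    child.on("close", (code) => {
      if (timer) clearTimeout(timer);
      resolve({ stdout: out, timedOut, truncated, code });
    });
  });
}
```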
…tion
The bridge does NOT register one engine-visible tool per plugin action.
That approach would (a) collide with Anthropic's tool-name regex
(`^[a-zA-Z0-9_-]{1,128}$` — action ids like 'github.create_issue' are
rejected), (b) blow past LLM tool-catalog budgets when many plugins are
active, and (c) force every session to pay the prompt cost of every
action even when only a few are relevant.
Instead, actionBridgeTools({ sources }) returns exactly two ToolDefs:
- list_tools({ service?, query?, limit? }): searchable catalog with
per-action params + risk levels + per-service auth warnings.
- call_tool({ tool_id, params, summary }): dispatches by action id
(kept untouched, dots and all). Approval gates honor riskLevel via
ctx.requestDecision; user denial short-circuits without invoking
the action.
Same pattern OpenCode uses in the existing valet runtime, so plugin
ActionSource shapes port across with no changes.
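The regex constraint in (a) is easy to check directly; 'github.create_issue' is the example action id used above:

```typescript
// Anthropic's tool-name pattern, as quoted above. Dots are not in the
// character class, so dotted action ids fail validation.
const TOOL_NAME_RE = /^[a-zA-Z0-9_-]{1,128}$/;

function isValidToolName(name: string): boolean {
  return TOOL_NAME_RE.test(name);
}

// isValidToolName("github.create_issue") → false (dot rejected)
// isValidToolName("call_tool")           → true
```

Hence the two fixed tool names (list_tools, call_tool) satisfy the pattern, while action ids stay untouched inside call_tool's tool_id parameter.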
Spec updated ("Plugin Action Bridge" section): documents the why
(provider regex, catalog budget, prompt cost), the new ActionBridgeOptions
shape, and the list_tools/call_tool semantics.
Engine.thread now also surfaces assistant errorMessage as an 'error'
event and translates stopReason 'error' into a 'turn_end: error' rather
than masking it as 'end_turn' — found while debugging the dogfood pass.
Dogfood: REPL with GITHUB_TOKEN=$(gh auth token) successfully searches
the catalog with list_tools, calls github.get_repository via call_tool,
and reports description/default branch/star count from the live API
response. 9 bridge unit tests + 60 existing tests all green (69 total).
Replaces the hand-wavy compaction section with a concrete two-technique design informed by OpenCode's compaction module.
- Two triggers: proactive (token threshold post-turn) and reactive (overflow error retry via pi-ai's isContextOverflow).
- Tail preservation: keep last N turns clamped to a token budget, with mid-turn split when a single turn exceeds the budget.
- Pruning: walk newest-first, mark stale tool outputs as elided after pruneProtectTokens of recent output. No LLM call. Skips protected tools. Only commits if it'd save >= pruneMinimumTokens.
- Compaction: summarize head into a structured markdown template (Goal/Constraints/Progress/Key Decisions/Next Steps/Critical Context/Relevant Files). Iterative — subsequent compactions update the previous summary rather than write a fresh one.
- LLM-context assembly: convertToLlm drops covered entries and injects the summary as a user message; elided tool outputs are replaced with a placeholder. Same path used during restoreSession.
- Auto-continue after proactive compaction; reactive compactions retry the originating turn instead.
- Concrete configuration table with defaults.
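The tail-preservation rule can be sketched as a cut-point selection: walk turns newest-first, keep up to tailTurns turns whose summed tokens fit the budget. (Shapes are hypothetical; the real selectCutPoint also handles the mid-turn split when a single turn exceeds the budget.)

```typescript
interface Turn { tokens: number }

// Returns the index of the first turn to keep; everything before it is
// covered by the summary.
function selectCutPoint(turns: Turn[], tailTurns: number, tokenBudget: number): number {
  let kept = 0;
  let tokens = 0;
  for (let i = turns.length - 1; i >= 0; i--) {
    if (kept >= tailTurns) return i + 1;                 // turn-count clamp
    if (tokens + turns[i].tokens > tokenBudget) return i + 1; // token-budget clamp
    tokens += turns[i].tokens;
    kept++;
  }
  return 0; // everything fits; nothing to compact
}
```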
Implements the spec'd compaction design (informed by OpenCode):
- src/compaction.ts: pure primitives — usableTokens, tailBudget, turns,
selectCutPoint (with mid-turn split), planPrune/applyPrune,
extractFileContext (read vs modified), summarize (one-shot
completeSimple with the OpenCode-style structured-markdown template),
iterative anchoring via previousSummary.
- src/types.ts: CompactionConfig (enabled, reserveTokens, tailTurns,
min/maxPreserveRecentTokens, pruneProtectTokens, pruneMinimumTokens,
toolOutputMaxChars, summarizerModel, protectedTools), wired into
CreateSessionOptions. ToolDef.protectedFromPruning. MessagePart
tool_call.elided flag.
- src/thread.ts:
- lastAssistantUsage capture in turn_end handler.
- Thread.compactThread orchestrator: prune (cheap, no LLM), select
cut point, summarize head into a CompactionEntry, persist, rewrite
agent.state.messages.
- Proactive trigger in runItem: post-turn check shouldCompactProactive
(lastAssistantUsage.total >= usable).
- Reactive trigger in runAgent: catch isContextOverflow on assistant
error, compact, retry the same prompt once.
- rehydrateTranscript now delegates to entriesToAgentMessages, an
exported pure function that drops covered entries and injects the
summary as a <previous-context> user message.
- 21 pure compaction tests + 2 integration tests against the faux
provider. Total suite: 92 tests, all green.
Known limitation: applyPrune mutates entries in memory but the current
SessionStore APIs (appendEntries-only) don't expose an in-place row
update, so pruning persists only to the live agent transcript for now.
Proper persistence requires adding an updateEntry method to
SessionStore — left as a follow-up since it doesn't block compaction
correctness, just observability of pruned state across restarts.
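The proactive trigger described above reduces to a small comparison against the usable window; a sketch with assumed signatures (the names usableTokens and shouldCompactProactive appear above, but these exact parameters are illustrative):

```typescript
// Usable window: what the engine budgets for, context minus a reserve for
// the next response.
function usableTokens(contextWindow: number, reserveTokens: number): number {
  return contextWindow - reserveTokens;
}

// Post-turn check: compact when the last assistant usage total reaches the
// usable window.
function shouldCompactProactive(
  lastUsageTotal: number,
  contextWindow: number,
  reserveTokens: number,
): boolean {
  return lastUsageTotal >= usableTokens(contextWindow, reserveTokens);
}
```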
Required so pruning during compaction can persist tool-result elision back to the DAG. Also clarifies that pruning's persistence is atomic per entry: updateEntry rewrites the entire MessageEntry row with the same id, including the mutated tool_call parts. Throws NotFoundError if no matching entry exists in (sessionId, threadId).
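The in-memory shape of that contract is roughly this (Entry and NotFoundError are illustrative stand-ins for the real types):

```typescript
class NotFoundError extends Error {}

interface Entry { sessionId: string; threadId: string; id: string; parts: unknown[] }

// Whole-row rewrite by (sessionId, threadId, id); atomic per entry.
function updateEntry(entries: Entry[], next: Entry): void {
  const i = entries.findIndex(
    (e) => e.sessionId === next.sessionId && e.threadId === next.threadId && e.id === next.id,
  );
  if (i === -1) {
    throw new NotFoundError(`no entry ${next.id} in (${next.sessionId}, ${next.threadId})`);
  }
  entries[i] = next; // replaces the row, including mutated tool_call parts
}
```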
Two follow-ups to the compaction landing:
1. SessionStore.updateEntry: rewrite an entry in place by id. Throws
NotFoundError (new errors.ts module) when (sessionId, threadId, id)
doesn't match. Implemented on InMemorySessionStore (Array.findIndex
+ replace) and SqliteSessionStore (UPDATE with changes=0 check).
Two contract tests added; runs against both backends.
Thread.compactThread's prune branch now calls store.updateEntry for
each elided entry instead of dropping the persistence on the floor.
Verified by an integration test that pre-populates the DAG with
bash-output-heavy turns, triggers compaction, and confirms a-1's
tool_call.elided is set in the DAG after the compaction completes.
2. Auto-continue: after a proactive compaction (cfg.autoContinue !==
false), inject a synthetic user message —
'Continue if you have next steps, or stop and ask for clarification...'
— tagged with metadata.compaction_continue=true so client UIs can
hide it. Pushed onto the thread's queue, picked up by the next
tickQueue cycle. Reactive (overflow) compactions never auto-continue.
Added skipNextProactiveCheck cooldown so the auto-continue turn
itself doesn't immediately re-trigger compaction (the summary +
system prompt can still exceed usable on a small-context model;
without the cooldown we'd loop).
QueueItem.metadata now flows through to MessageEntry.metadata so the
compaction_continue tag survives into the DAG and across restarts.
Two integration tests: on-path (auto-continue runs, response
recorded), off-path (autoContinue: false suppresses).
Spec updated: SessionStore.updateEntry signature; pruning persistence
clarified; CompactionConfig.autoContinue added to the config table.
99 tests, all green.
Two new env vars:
- VALET_CONTEXT_WINDOW: override the model's local contextWindow
- VALET_MAX_TOKENS: override the model's local maxTokens

Anthropic's API still accepts the model's real (much larger) context window; the override only affects the engine's 'usable' calculation, which is what triggers proactive compaction. Useful for dogfooding the compaction loop with a real LLM at small budgets so we don't have to generate 100k tokens of context to see it fire.

Plus per-event printers for compaction_start / compaction_end so the REPL output makes it obvious when compaction kicks in.

Verified end-to-end against Claude Haiku 4.5 with VALET_CONTEXT_WINDOW=8000 and VALET_MAX_TOKENS=1000:
- Agent reads 5 files (~60KB tool output)
- Proactive compaction fires after the first turn
- Auto-continue turn runs and references 'the previous context' from the injected summary, correctly understanding the task was complete.
… thread.skill)
Implements the spec's roles & skills design — the last engine-internal
piece that hadn't been touched.
src/roles-skills/parser.ts: minimal YAML-frontmatter parser for
markdown artifacts (handles key:value pairs, quoted strings, bools,
numbers, comments). renderTemplate(body, args) substitutes {{var}}
placeholders. Pure functions, hand-rolled to avoid pulling in
gray-matter for shallow needs; if a future role/skill needs nested
YAML, swap it in.
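The renderer half is tiny; a sketch of a {{var}} substitution in the spirit of renderTemplate (illustrative, and deliberately leaving unknown placeholders intact rather than guessing a policy the source doesn't state):

```typescript
// Substitute {{var}} placeholders from args; unknown placeholders are left
// as-is so missing arguments are visible in the output.
function renderTemplate(body: string, args: Record<string, string>): string {
  return body.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name: string) =>
    name in args ? args[name] : match,
  );
}
```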
src/roles-skills/loader.ts: loadRoleFromMarkdown / loadSkillFromMarkdown
build typed RoleSpec / SkillSource from a markdown blob. Skill argsSchema
is supplied separately (markdown frontmatter is the wrong place for it).
src/session.ts: Session now indexes options.roles and options.skills
into Maps for O(1) lookup.
src/thread.ts:
- PromptOptions.role flows through QueueItem.role to runItem.
- Per-turn role overlay: applyRoleForTurn concatenates role.content
onto agent.state.systemPrompt; if role.model is set, looks it up via
pi-ai's getModel and overrides agent.state.model. Both restored in
finally{}, regardless of error or compaction. Unknown role names
emit an 'error' event with code=role_not_found and run without
overlay (spec: prompt-level resolution failures fail gracefully).
- Thread.skill(name, opts): looks up skill, validates args against
argsSchema via TypeBox's compile.Compile, renders {{var}} template,
submits as a normal prompt with optional model/resultSchema/author/
channel forwarded. metadata.skill records the skill name and
metadata.syntheticFrom='skill' tags the synthetic origin.
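The restore-in-finally pattern for the role overlay looks roughly like this (AgentState and withRoleOverlay are hypothetical shapes standing in for the real thread code):

```typescript
interface AgentState { systemPrompt: string; model: string }

// Apply a role for one turn, then restore the prior state in finally{},
// regardless of error or compaction during the turn.
async function withRoleOverlay<T>(
  state: AgentState,
  role: { content: string; model?: string },
  run: () => Promise<T>,
): Promise<T> {
  const saved = { ...state };
  state.systemPrompt = `${state.systemPrompt}\n\n${role.content}`;
  if (role.model) state.model = role.model;
  try {
    return await run();
  } finally {
    state.systemPrompt = saved.systemPrompt; // restored even on error
    state.model = saved.model;
  }
}
```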
Tests:
- test/roles-skills-pure.test.ts: 15 unit tests for the parser
(frontmatter shapes, quoted strings, comments, unclosed fence)
+ the renderer (basic/whitespace/missing/null/nested) + the loaders.
- test/roles-skills.test.ts: 5 integration tests using the faux
provider with response factories that capture ctx.systemPrompt and
ctx.messages, verifying:
* role overlay reaches the LLM via the system prompt
* unknown role emits role_not_found and runs without overlay
* thread.skill renders {{var}} placeholders correctly
* unknown skill name throws
* argsSchema validation rejects bad input
REPL gains VALET_ROLE_FILE / VALET_ROLE_DEFAULT env vars for
dogfooding. Verified end-to-end against Claude Haiku 4.5 with a
'pirate captain' role markdown — the LLM responds in pirate slang
when the role is applied and reverts to base behavior without it.
119 tests, all green.
figitaki (Collaborator) requested changes on May 6, 2026:
Strong direction overall — session/thread semantics, deterministic gate replay, and the in-memory/SQLite store contract are all solid. I’m requesting changes before merge for a few high-leverage correctness and rollout edges: gate-resolution durability/persistence consistency, restart behavior for local sandbox restore, env override validation in REPL, and updateEntry overwrite safety. If we tighten these, this lands much more safely.
- GET /api/sessions/:id/threads → returns the engine's default thread
- GET /api/sessions/:id/messages → reads engine entries via thread.readEntries
- POST /api/sessions/:id/messages → engineHost.sessionFor() + session.prompt()

v1 scopes to a single implicit thread per session (engine's web:default). Multi-thread is a future enhancement; the wire shape carries threadId so the client doesn't need to change when we add it.

Verified: typecheck clean, routes serve 200/201/404 correctly against a fresh sqlite db; engine session lazily materialized via EngineHost on first thread/message access.
GET /api/sessions/:id/ws upgrades via @hono/node-ws. On open:
- verify session ownership; close 4040 if missing
- materialize the engine session + default thread
- send 'init' frame with session detail + recent messages
- subscribe to engine eventBus filtered by sessionId
- map BusEvent → wire events via bridge
- track active assistant messageId per thread for text_delta tagging
- 30s ping for keepalive

main.ts now calls injectWebSocket(server) after serve() to attach the upgrade handler to the running http server.

Verified: WS connects, init frame arrives with proper shape and seq=1.
scripts/dogfood.ts boots the server in-process, creates a session, opens the WS, posts a prompt, and asserts the engine actually drove Anthropic, ran bash in Docker, wrote the file via the bind mount, and streamed 12 wire events in order.

Run with: ANTHROPIC_API_KEY=sk-ant-... pnpm --filter @valet/api dogfood

Verified end-to-end: file 'ok' lands at $WORKSPACE/hello.txt; events: init, status×4, message_start/end, tool_start×2 (bash, read), tool_end×2, turn_end. The parallel-tool race (read fires alongside bash) is the known pi-ai `disable_parallel_tool_use` follow-up; bash succeeds and the file is written, which is what S6 verifies.
Vite + React 19 + TanStack Router (file-based) + TanStack Query + Tailwind 3.
Imports wire types from @valet/api/wire (workspace).
- vite.config.ts: TanStack Router plugin auto-generates routeTree.gen.ts;
/api proxy points at the server (PORT 8788 by default).
- tailwind.config.ts: design tokens — neutral grayscale (OKLCH-tuned),
accent/danger/success scales, sans/mono font stacks, radius scale.
- src/styles/globals.css: base layer with light/dark CSS vars driven by
prefers-color-scheme.
- src/main.tsx: QueryClient + RouterProvider.
- src/routes/{__root.tsx,index.tsx}: stub home page.
- src/lib/cn.ts: tailwind-merge + clsx helper.
Verified: pnpm install resolves, vite boots on :5173, typecheck clean.
Primitives + screens land in W2+.
12 primitives built as thin intentional wrappers over Radix:
- button, input/textarea, label, dialog (+ trigger/close/content/footer),
dropdown-menu (+ items/separator/sub*), tooltip, card (+ header/title/
body/footer), avatar, badge, spinner, separator, scroll-area.
Each primitive has its own variant/size API; Radix is implementation detail.
Tokens live in tailwind.config.ts (OKLCH-tuned color scales) and CSS vars in
src/styles/globals.css for light/dark.
tsconfig.json gains 'paths: { ~/*: src/* }' so tsc resolves the same alias
vite already does. /primitives route is a small in-app showcase for eyeballing.
Verified: typecheck clean, both / and /primitives serve 200.
- src/api/client.ts: typed fetch wrapper using @valet/api/wire types.
- src/api/queries.ts: TanStack Query hooks (useMe/useSessions/useSession/useThreads/useMessages/useCreateSession/useDeleteSession/useSendPrompt) with a query-key factory.
- src/stores/stream.ts: Zustand store keyed by sessionId. Reducer applies wire events to the message list — message_start/text_delta/message_update/tool_start/tool_end/status/turn_end/error/ping. Drops out-of-order frames via lastSeq tracking.
- src/api/ws.ts: useSessionWebSocket(sessionId) hook — opens WS, pipes events into the store, reconnects with exponential backoff (500ms→8s), cleans up on unmount.

Adds zustand@5 dep. Typecheck clean. UI components consume these in W4-W7.
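The lastSeq guard in the stream store's reducer amounts to a strictly-increasing check; a sketch (event and state shapes are illustrative):

```typescript
// Frames carry a monotonically increasing seq; anything not strictly newer
// than the last applied frame is dropped as stale or duplicate.
interface WireEvent { seq: number }
interface StreamState { lastSeq: number }

function shouldApply(state: StreamState, ev: WireEvent): boolean {
  if (ev.seq <= state.lastSeq) return false; // out-of-order or duplicate
  state.lastSeq = ev.seq;
  return true;
}
```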
…w/ live stream (W4-W7)

UI surface for the agent loop, end-to-end:
- AppShell: two-pane layout (sidebar + main).
- Sidebar: lists sessions via useSessions; nav links via TanStack Router; active state highlighting; 'New' button opens NewSessionDialog.
- NewSessionDialog: workspace input + Create; navigates to /sessions/:id.
- /sessions/$sessionId route: SessionHeader (workspace, conn badge, agent status badge, delete) + MessageList (auto-stick-to-bottom) + Composer (Cmd/Ctrl+Enter to send, disabled while engine mid-turn) + error banner.
- MessageItem: renders text and tool_call parts; tool_call card shows args/result with running/completed/error status.

useSessionWebSocket pipes wire events into the streaming store; the store's reducer applies message_start/text_delta/message_update/tool_start/tool_end/status/turn_end/error to derive the rendered message list.

Verified: typecheck clean, routeTree picks up /sessions/$sessionId, all routes serve via vite. Visual + live agent dogfood lands in W8.
- packages/api/README.md: scope, routes table, env vars, dogfood usage, what works / doesn't.
- packages/web/README.md: stack, layout, design tokens, what's not built yet.
- Makefile: dev-api-node, dev-web, dev-local, dogfood-api targets.
- CLAUDE.md: project structure + tech stack rows updated to call out the new packages and label legacy ones explicitly.
- docs/plans/2026-05-09-greenfield-followups.md: captured what's deferred — real auth, CF cutover, multi-thread UI, decision gates in wire, pi-ai disable_parallel_tool_use, etc.

Final state verified: api typecheck clean, web typecheck clean, 11 api tests pass, full stack dogfooded end-to-end through Vite proxy with real Anthropic + Docker bind-mount file write.
Three resilience fixes after the first browser-side dogfood:
1. POST /api/sessions now mkdir -p's the workspace if it doesn't exist (and rejects when the path exists but isn't a directory, or the path isn't absolute). The default-pathway 'pick a workspace, click Create' flow now Just Works without the user having to pre-create the dir.
2. WS onOpen wraps engine setup in try/catch. Previously, if engine.createSession threw (e.g. workspace ENOENT, Docker unreachable), the uncaught rejection killed the server process — taking down every live session and triggering vite proxy ECONNREFUSED storms. Now we send a wire 'error' frame and close the socket cleanly.
3. Hono onError catch-all returns errors as JSON; main.ts adds unhandledRejection/uncaughtException loggers so a bug in one route no longer cascades into a process crash. Belt-and-braces — real fixes still belong in the handler that swallowed the error.

Also: .gitignore now covers .env.deploy.* (suffixed dev/prod variants).
applyAppMigrations and applyEngineMigrations both ran every SQL file on every boot — so the second startup hit 'table already exists' and crashed before the server could even bind a port. Fix:
- Track applied migrations in __valet_app_migrations / __valet_engine_migrations (filename + applied_at). Re-runs across restarts are no-ops.
- Run each migration in a transaction so partial application leaves the tracker untouched.
- One-time backfill: if the schema tables already exist (db pre-dates this change) but the tracker is empty, mark all migrations applied without re-executing. Lets existing local DBs survive the upgrade without manual intervention.

Existing tests (13 in store-sqlite, 11 in api) all pass against the backfill path.
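The no-op re-run guarantee reduces to a set difference over applied filenames; a sketch with the tracker-table storage elided (function name is illustrative):

```typescript
// Given all migration filenames in order and the set already recorded in the
// tracker table, return only the ones that still need to run. A restart with
// nothing new pending yields an empty list, i.e. a no-op boot.
function pendingMigrations(all: string[], applied: Set<string>): string[] {
  return all.filter((f) => !applied.has(f));
}
```

Each returned migration would then run inside its own transaction together with the tracker insert, so a failure leaves neither the schema change nor the tracker row behind.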
Two compounding bugs caused user prompts to never appear in the UI during a live session — the screenshot showed two assistant replies with no user prompts above either one.

Root cause 1: the engine doesn't emit a wire event for the user's own prompt. message_start fires only for assistant + system roles. The user MessageEntry is persisted into engine_entries (so it'd reappear on the next WS init / page reload via readEntries), but during live streaming nothing pushes it to the client.

Root cause 2: the stream store's message_start reducer collapsed any non-system role to 'assistant' — defensive code from when wire types were narrower. Even if a user-role start were synthesized, the reducer would have rendered it as the assistant.

Fix:
- stream.ts message_start reducer forwards ev.role verbatim. The wire's MessageRole is the full union (user/assistant/tool/system) and the Message type accepts it.
- new addUserMessage(sessionId, text) action on the stream store — appends a synthetic user-role Message with content + a single text part. ID prefixed 'user-opt-' so it's distinguishable.
- Composer calls addUserMessage immediately on submit, before the POST. User sees their text instantly. On the next WS init the server's persisted copy replaces it (different id, same content; brief overlap acceptable for v1).

Considered three alternatives and rejected: server-synthesized wire event (requires extending the engine EngineEvent union for a UI bug); refetch on turn_end (slow round-trip after every turn, prompt vanishes during the gap); returning the persisted message from POST (couples route to engine internals + assumes synchronous persistence).
Composer:
- Enter sends; Shift+Enter inserts a newline (was: Cmd/Ctrl+Enter sends).
- Skips submit while IME composition is active so Enter confirms the composition instead of sending a half-finished message.

Markdown rendering for chat text:
- New components/markdown.tsx wraps react-markdown + remark-gfm with token-aware styling. Code blocks, inline code, headings, lists, links, tables, blockquotes themed against --bg/--fg/accent so dark mode works.
- @tailwindcss/typography added; theme overrides via prose-* utilities.
- TextBlock in MessageItem now renders via Markdown. Tool call args/results stay as <pre> (JSON / shell output, not prose).
- Raw HTML disallowed (react-markdown default) so user/assistant text is safe to render. Links open in a new tab w/ noopener+noreferrer.
Two compounding bugs caused tool cards to vanish on page refresh while working fine during a live turn:
1. The WS init frame was sending parts: [] for every persisted message, with a comment claiming the client would refetch via REST. That was wrong — useMessages's data is never piped into the stream store, so parts went nowhere. Init now uses engineToWireParts to forward the real parts (text + tool_call) for each message.
2. Engine's thread.handleAgentEvent persisted the assistant entry at message_end with all tool_call parts at status='running' (and no result yet — the actual execution happens *after* message_end fires). Then tool_execution_end mutated the in-memory part objects but never re-persisted, leaving sqlite stuck with stale 'running' rows. Fix: hold a reference to the entry on message_end and call updateEntry after each tool_execution_end. (Parts are shared by reference, so the mutation flows through.)

Plus: dev-only console.debug of incoming WS frames in useSessionWebSocket to make future bugs of this shape easier to debug from a browser.

87 engine tests + 11 api tests still pass.
The bug we just fixed (engine persisted tool_call at message_end with status='running' and never re-persisted) had no test coverage. The existing happy-path tool test only checked that the file landed and bus events fired — it never read the persisted entry back. Adding three guards:
1. happy-path.test.ts: after the tool turn, read the assistant entry from the store and assert tool_call.status === 'completed' with the result + correct args. Catches the engine forgetting to updateEntry.
2. store-contract round-trip: persist a tool_call(status: running) with nested args, read back, assert all fields preserved. Catches a store impl that drops or coerces tool_call fields. Runs against both InMemorySessionStore and SqliteSessionStore.
3. store-contract updateEntry transition: appendEntries a running tool_call, then updateEntry the same id with status='completed' + result, read back, assert the transition stuck. Catches a store impl that fails to overwrite parts on update.

89 engine tests pass, 15 store-sqlite tests pass.
Boots createApp(providers) on a real (port=0) http server with an in-memory sqlite + virtual sandbox + InMemoryEventBus, drives a real Anthropic-backed turn that calls the write tool, closes the WS, opens a fresh one (simulating a page reload), and asserts the init frame contains an assistant message with a completed tool_call part.

Catches the exact pair of bugs we just shipped a fix for:
1. Init frame stripping parts: [] — the assertion finds no tool_call anywhere in initFrame.messages.
2. Engine forgetting to updateEntry on tool completion — the assertion finds tool_calls but only with status='running', not 'completed'.

Skipped via describe.skip when ANTHROPIC_API_KEY is missing so CI without a key still passes. ~1.2s wall-clock with claude-haiku-4-5, ~fractions of a cent per run.
Restructures the app shell along the user's requested shape:
- New TopNav: brand → session dropdown → '+ New session' button. SessionPicker reads sessionId from the URL and shows the current title, with a Radix dropdown listing all sessions. Switching navigates to /sessions/$sessionId.
- New ThreadList replaces the previous sidebar contents. v1 only has the engine's web:default thread per session, so the list is short — but the structure is in place for multi-thread support (server CRUD + thread switching are Phase B).
- AppShell now has three zones (top, sidebar, main) instead of two.
- New-session dialog still triggered from a button (now in the top nav) and renders as a Radix Dialog.
- Empty-state copy on / points at the new top-nav button.

The old layout/sidebar.tsx (sessions list in a left pane) is removed. Typecheck clean; no server-side changes.
Server (api):
- POST /api/sessions/:id/threads creates a new engine thread (key generated server-side). Returns the new ThreadSummary.
- GET /api/sessions/:id/threads now lists ALL threads from engineSession.listThreads(), not just the default.
- POST/GET /api/sessions/:id/messages accept a threadId (body field / query param). Defaults to the session's default thread when omitted, so single-thread clients keep working unchanged.
- Helpers: loadEngineSession, resolveThread, threadToSummary.

Wire types:
- Added CreateThreadRequest.
- SendPromptRequest gains optional threadId.

Client:
- api.createThread + listMessages threadId param. useCreateThread mutation invalidates the threads query on success. useMessages now takes an optional threadId (queryKey is per-thread).
- /sessions/$sessionId route declares a typed search schema with ?thread=<id>. ActiveThreadId resolves to the search param if set, else the first thread (default).
- ThreadList: '+ New thread' button at the top right; clicking creates a thread and navigates to it. Each thread row is a Link that updates ?thread=. Default thread renders without a search param to keep URLs clean.
- MessageList accepts a threadId prop and filters store messages. Optimistic user messages (threadId: null) are always shown so the user sees their text immediately; the server's persisted copy with the real thread id replaces them on the next WS init.
- Composer takes threadId and posts it with each prompt.

Known limitation (documented in thread-list.tsx): WS init still loads only the default thread's history. Reloading the page on a non-default thread shows old messages on the default but blank on the active until new live events arrive. Fix is REST-driven history loading on thread switch — separate follow-up.

12 api tests still pass (incl. the WS reload integration test).
New integration test: `cross-thread.test.ts`. Drives two real Anthropic
turns in the same session against separate threads:
1. Thread A (web:default): user tells the assistant a unique phrase,
agent acknowledges. The phrase lives only in A's message history.
2. Thread B (created via POST /threads): user asks the agent to call
thread_read against web:default and report the phrase.
Asserts thread B's final assistant message contains the phrase verbatim,
proving:
- POST /threads creates a fresh engine thread accessible by the
messages routes via threadId
- Threads are isolated — B's user messages don't include A's prompt
- The engine's thread_read builtin works across threads
- Multi-thread routing via ?threadId=… on /messages is correct
driveTurn (and reload-tool-rendering by extension) now waits for the
agent loop to fully settle, not just the first turn_end. Engine emits
turn_end per LLM round; with tool use the agent does multiple rounds
(tool-use turn → tool exec → follow-up text turn). Returning after the
first turn_end raced the second round and missed the final assistant
message. Now: armed settle timer on each turn_end (3s default), reset
on any new agent activity. Resolves only after a quiet period.
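The quiet-period logic can be modeled without real timers. A minimal sketch under assumed names (the actual driveTurn helper lives in the api test utils and is not shown in this PR description):

```typescript
// Models the settle wait: each turn_end arms (or re-arms) a quiet-period
// window, and any other agent activity resets it. Pure timestamps are
// used here so the logic is testable without real timers.
class SettleTracker {
  private lastActivity = 0;
  private armed = false;

  constructor(private readonly quietMs = 3000) {}

  /** Called on turn_end: arm (or re-arm) the settle window. */
  onTurnEnd(now: number): void {
    this.armed = true;
    this.lastActivity = now;
  }

  /** Called on any other agent event: reset the window. */
  onActivity(now: number): void {
    this.lastActivity = now;
  }

  /** Settled once a full quiet period elapsed after the last event. */
  isSettled(now: number): boolean {
    return this.armed && now - this.lastActivity >= this.quietMs;
  }
}
```

With tool use this naturally rides out the multi-round shape: the first turn_end arms the window, tool execution resets it, and only the quiet period after the follow-up text turn resolves the wait.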
Boot harness extracted to _setup.ts and shared helpers to _test-utils.ts
(underscore-prefix so vitest's *.test.ts glob skips them).
Reload test now 5.0s (was 1.2s — extra time is the settle wait); cross-
thread test 11.2s. 13 api tests pass.
Optimistic user messages were tagged 'threadId: null' because addUserMessage took only sessionId+text. The MessageList filter then accepted both the active threadId AND null as a fallback, so user bubbles posted in one thread reappeared in every other thread's view after a switch.
Fix:
- addUserMessage now requires a threadId. The store tags the optimistic message with the active thread, so it's correctly scoped.
- Composer takes threadId; submit is gated on it being defined. The textarea + Send button disable to 'Loading thread…' while the threads query is in flight (typically <100ms). Without this we'd lose the guarantee that addUserMessage receives a real threadId.
- MessageList filters strictly: m.threadId === activeThreadId. The null fallback is gone. (Loading state — activeThreadId undefined — still shows everything so init-frame messages render before the threads query resolves; nothing new can be added in that window because Composer is disabled.)
Out of scope (separate bug, less common): WS init replaces the message list with the default thread's history on every reconnect. If the user is on a non-default thread when the WS reconnects, that thread's live messages get wiped from the store. The full fix is REST-driven history load on thread switch + treating WS init as session metadata only. Documented as a follow-up.
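The strict filter plus its loading-state escape hatch is small enough to sketch directly (function and type names are hypothetical, not the web client's actual identifiers):

```typescript
// Strict per-thread filtering: a message renders only when its threadId
// matches the active thread — no null fallback. While the threads query
// is still loading (activeThreadId undefined) everything is shown, which
// is safe because the Composer is disabled in that window.
interface StoreMessage {
  id: string;
  threadId: string | null;
  text: string;
}

function visibleMessages(
  messages: StoreMessage[],
  activeThreadId: string | undefined,
): StoreMessage[] {
  if (activeThreadId === undefined) return messages; // loading state
  return messages.filter((m) => m.threadId === activeThreadId);
}
```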
Fixes the documented limitation: reloading the page (or any WS reconnect) on a non-default thread used to wipe that thread's messages because the init frame replaced the entire message list with the default thread's persisted history.
Server (api):
- WS init drops the messages field. It now carries only session metadata. The client loads thread history via REST (GET /messages?threadId=…).
- Wire type 'init' updated; ws.ts no longer reads entries on connect (still ensures the engine session + default thread are materialized).
Client (web):
- New stream-store action setThreadMessages(sessionId, threadId, msgs) replaces the messages for one thread, leaves other threads untouched, and preserves still-in-flight optimistic user messages whose content hasn't yet appeared in the REST snapshot. Once the server has persisted a matching user message, the optimistic copy is dropped to avoid a duplicate row.
- Reducer's 'init' case no longer mutates messages — only clears transient status/error.
- /sessions/$sessionId route now drives useMessages(sessionId, activeThreadId) and pipes the result into setThreadMessages via useEffect on (sessionId, activeThreadId, data) changes.
- useMessages disables refetchOnWindowFocus + refetchOnReconnect so background refetches can't wipe in-flight live state. Initial load and thread-switch fetches still happen because each (sessionId, threadId) is its own queryKey.
Tests:
- reload-tool-rendering.test.ts switched from captureInitFrame() to GET /messages — the actual code path the client now takes after a reload. Same regression coverage (init stripping parts becomes 'GET /messages dropping parts'; the engine forgetting updateEntry on tool completion still surfaces the same way).
- captureInitFrame() helper kept (init still fires; could be useful for future tests asserting on session metadata).
13 api tests pass. Cross-thread test still green: thread B reads thread A's persisted history via thread_read.
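The merge semantics of setThreadMessages can be sketched as a pure function. Shapes and the text-matching heuristic are assumptions for illustration; the real action lives in the web stream store:

```typescript
// Replace one thread's messages with the REST snapshot, leave other
// threads untouched, and keep optimistic user messages only until a
// persisted user message with the same text appears in the snapshot.
interface Msg {
  id: string;
  threadId: string | null;
  role: "user" | "assistant";
  text: string;
  optimistic?: boolean;
}

function setThreadMessages(all: Msg[], threadId: string, snapshot: Msg[]): Msg[] {
  const others = all.filter((m) => m.threadId !== threadId);
  const pending = all.filter(
    (m) =>
      m.threadId === threadId &&
      m.optimistic === true &&
      !snapshot.some((s) => s.role === "user" && s.text === m.text),
  );
  return [...others, ...snapshot, ...pending];
}
```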
…er tool
Replaces the old generic ToolCallBlock (icon + tool name + raw JSON args in <pre> + plain text result) with a registry of per-tool renderers under src/components/session/tool-renderers/. Each renderer claims one or more tool names and contributes a custom Body for that tool's args + result. Unknown plugin tools route to a fallback that auto-extracts a primary identifier from args and renders typed key/value tables — so plugins look polished without anyone writing custom code for them.
Common chrome (ToolShell):
- 2px category-colored left strip identifies the tool family at a glance (shell/read/write/edit/thread/generic).
- Compact mono header: tool name (uppercase, tracked) + target identifier + optional summary + status pip.
- While running, a low-opacity scanner-line animation sweeps the header in the category color — the visual heartbeat for 'agent is working'.
- Click the header to expand/collapse. Default: expanded while running and on error, collapsed once completed (keeps the chat dense by default).
Tool-specific renderers:
- bash: terminal aesthetics — black bg, emerald-300 mono, dollar prompt, blinking caret while running, exit-code summary in the header.
- read: file viewer with line numbers, byte+line summary.
- write: additive diff (green '+' lines) of the new file content.
- edit: side-by-side diff lines (red '-' before, green '+' after) + surfaces 'no match' failures inline.
- thread_read: parses the engine's markdown dump back into bubbles with role-tinted left borders, relative timestamps, recent-N collapse.
- fallback: typed key/value table for args (strings mono, numbers tabular-aligned, booleans as pills, objects/arrays as collapsed JSON with click-to-expand). Smart 'identifier' extraction prefers path/id/name/key/url/command/query.
Adding a renderer for a plugin tool: build a ToolRenderer (see types.ts) and add it to RENDERERS in tool-renderers/index.ts before the fallback. Same shell, same status semantics, same shape contract.
Typecheck clean. All five new files serve through Vite.
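The registry-with-fallback lookup described above can be sketched as follows. The ToolRenderer interface here is a simplified stand-in for the real one in tool-renderers/types.ts:

```typescript
// Sketch of the per-tool renderer registry: the first renderer that
// claims the tool name wins, with the catch-all fallback renderer last,
// so unknown plugin tools always resolve to something.
interface ToolRenderer {
  claims(toolName: string): boolean;
  category: "shell" | "read" | "write" | "edit" | "thread" | "generic";
}

const bashRenderer: ToolRenderer = {
  claims: (name) => name === "bash",
  category: "shell",
};

const fallbackRenderer: ToolRenderer = {
  claims: () => true, // catches unknown plugin tools
  category: "generic",
};

// Order matters: specific renderers first, fallback last.
const RENDERERS: ToolRenderer[] = [bashRenderer, fallbackRenderer];

function rendererFor(toolName: string): ToolRenderer {
  // fallbackRenderer claims everything, so a match always exists
  return RENDERERS.find((r) => r.claims(toolName))!;
}
```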
Tool cards on reload were rendering '(empty output)' / '(empty file)'
even after our prior fix made tool_call status persist correctly. Root
cause: pi-agent-core emits AgentToolResult-shaped objects
({ content: [{ type: 'text', text }] }) and the engine was storing
event.result verbatim. Frontend's resultText() only knew about
{ text: string }, so it found nothing and rendered the empty-state.
Fixes:
1. Engine (thread.ts tool_execution_end) now persists a hybrid shape:
spread the raw result fields, then add a top-level `text` rendered
from renderToolResult(). Readers that pull `result.text` Just Work;
anything that wants the raw blocks can still inspect result.content.
2. Frontend resultText() handles all three shapes defensively (string,
{ text }, { content: [...] }) so already-persisted entries from
before this commit also render correctly.
3. Strengthened tests: happy-path now asserts result.text is a non-empty
string containing the actual output (previously just .toBeDefined());
reload integration test does the same end-to-end through GET /messages.
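A defensive extractor covering the three shapes named above might look like this (a sketch; the real resultText() lives in the web frontend and its exact handling isn't shown here):

```typescript
// Handles all three persisted result shapes: a plain string, { text },
// and the AgentToolResult-style { content: [{ type: 'text', text }] }.
// Returns undefined when nothing readable is found so callers can render
// an explicit empty state instead of a blank card.
function resultText(result: unknown): string | undefined {
  if (typeof result === "string") return result;
  if (result && typeof result === "object") {
    const r = result as { text?: unknown; content?: unknown };
    if (typeof r.text === "string") return r.text;
    if (Array.isArray(r.content)) {
      const texts = r.content
        .filter(
          (b): b is { type: string; text: string } =>
            !!b &&
            typeof b === "object" &&
            (b as { type?: unknown }).type === "text" &&
            typeof (b as { text?: unknown }).text === "string",
        )
        .map((b) => b.text);
      if (texts.length > 0) return texts.join("\n");
    }
  }
  return undefined;
}
```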
Plus a new top section in CLAUDE.md ('Working on this codebase') that
documents the persistence-shape-drift class of bug, the canonical
round-trip path (engine.appendEntries/updateEntry → bridge → REST →
resultText), the test commands to run when touching this code, and a
debug recipe (sqlite3 ~/.valet/app.db) for the next time tool cards
render empty.
89 engine tests + 15 store-sqlite tests + 13 api integration tests pass
(including the 'real Anthropic + Docker, then GET /messages' round-trip
that strictly asserts a readable result.text exists).
…tool
Adds first-class model switching to the engine, plumbed through to a
builtin tool the agent can call mid-turn and to the API surface so the
UI / a future client can drive it directly.
Resolution chain at turn time: thread.modelOverride → session.options.model
(no global fallback in the engine; the API host supplies the global default
when creating a session).
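The resolution chain above is two lookups and a hard failure. A minimal sketch with hypothetical shapes (the real lookup is Thread.resolveTurnModel() inside the engine):

```typescript
// Turn-time model resolution: the thread's override wins, then the
// session default; the engine has no global fallback, so a session with
// neither configured is a host error.
interface ThreadLike {
  modelOverride: string | null;
}
interface SessionLike {
  options: { model?: string };
}

function resolveTurnModel(thread: ThreadLike, session: SessionLike): string {
  if (thread.modelOverride) return thread.modelOverride;
  if (session.options.model) return session.options.model;
  throw new Error("no model configured: host must supply a session default");
}
```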
Engine
- SessionData gains 'model: string?' (persisted session default).
- Thread holds a per-thread modelOverride; Thread.setModel(id|null,reason)
resolves the id, persists via store.saveThread, and emits model_switched.
- Session.setModel(id,reason) does the same at session scope (threadId
omitted on the bus event to indicate scope).
- runItem reads thread.resolveTurnModel() once at turn start (overlays
before role overlay) and restores the baseline in finally so the next
turn picks up any mid-turn change.
- Generalized resolveModelId (was resolveRoleModel) — provider/model form
or bare ids tried under anthropic/openai/google.
- New ToolContext.setModel({ model, scope? }): allows tools to dispatch
to either thread or session scope without seeing engine internals.
- New switchModelTool builtin — wraps ctx.setModel; surfaces failures as
a readable result instead of throwing past the agent.
- model_switched event finally has an emit site (was vestigial).
Persistence
- engine_sessions gains a 'model' column. Migration is edited in place
(0000_lonely_lizard.sql) — pre-1.0 we don't accumulate ALTER TABLEs;
blow away the local sqlite if you're upgrading. CLAUDE.md gets a new
section calling this rule out so we don't drift.
- store-sqlite saveSession / getSession / listSessions read+write the
field. InMemorySessionStore needs no change (round-trips SessionData
verbatim).
Wire + API
- Wire types: SessionDetail.model + ThreadSummary.model added; new
PatchSessionRequest / PatchThreadRequest; model_switched added to
WireEvent (threadId optional — present for thread scope, absent for
session scope).
- bridge.ts maps model_switched (was dropped). Empty-string threadId
emitted by session-scope switches is normalized to undefined.
- POST /api/sessions/:id supports body.model at create … (next commit
for the host-default piece). PATCH /api/sessions/:id sets the session
default; PATCH /api/sessions/:id/threads/:tid sets the thread override.
GET /sessions/:id surfaces the current model when the engine session
is live; GET /threads surfaces each thread's override.
Tests
- 4 new engine tests (model-switching.test.ts): Thread.setModel persists
+ emits, rejects unknown ids, Session.setModel updates default + persists,
switchModelTool dispatches via ctx.setModel + is registered in
builtinTools.
- bridge.test.ts has new cases for model_switched (thread + session
scope) replacing the now-obsolete 'drops model_switched' case.
- 94 engine + 15 store-sqlite + 15 api tests pass.
Adds the user-facing surface for the model-switching engine work:
- src/lib/models.ts: small curated catalog (haiku 4.5, sonnet 4.5/4.6, opus 4.7) with tier (fast/balanced/powerful) + descriptions. The engine resolves any string id, so this list is just what we surface in the picker — adding more is a one-line change.
- src/components/session/model-picker.tsx: Radix dropdown grouping models by tier with description text under each. Two variants: 'compact' for tight chrome (session header) and 'row' for inline rows (thread sidebar). Shows a violet dot + label tint when the current value is a thread-level override. Optional 'inherit' affordance at the bottom — wired by callers that have a clearable scope (i.e. threads, which can fall back to the session default).
- session-header.tsx: shows the session-default model picker on the right of the header. Selecting fires PATCH /api/sessions/:id.
- thread-list.tsx: each thread row shows its current model under the title — muted 'inherits session model' when no override, violet '(model name)' when overridden. The picker itself only renders for the active thread to keep the sidebar dense; collapsed rows just show the label. Selecting fires PATCH on the thread.
API client + hooks:
- api/client.ts gains patchSession + patchThread.
- queries.ts gains useSetSessionModel + useSetThreadModel mutations that invalidate the right query keys on success.
- The stream store + WS hook handle the new model_switched wire event (no-op reduction; logger summarizes scope).
Typecheck clean. All new files serve through Vite.
The session default is a user-facing setting that affects every future thread the user opens — not something an agent should reach for unilaterally. Drop the `scope` parameter from switchModelTool and from ctx.setModel; both now mutate only the calling thread's override. Session-default mutation still exists on Session.setModel itself, but it's only reachable via PATCH /api/sessions/:id from the UI.
Agents previously had to know thread keys ahead of time to call thread_read. list_threads pulls from store.listThreads (so paused threads still surface) and renders key, status, model override, and summary — including a self-marker for the calling thread. Also exposes ctx.listThreads() on ToolContext for plugin tools.
Two concurrent requests for a fresh session id both missed the cache, each created their own Engine + Session, and each called ensureDefaultThread — persisting two distinct thread rows with the same key 'web:default'. Subsequent rehydrations loaded both, breaking thread identity (list_threads returned duplicates, the sidebar showed two "Default thread" rows).
Two-layer fix:
- Single-flight inflight map in EngineHost.sessionFor so concurrent callers share one in-progress build.
- UNIQUE INDEX on (session_id, key) in engine_threads as defense-in-depth: any future race fails the second insert loudly instead of silently corrupting state.
Pre-1.0 migration policy: edited 0000_lonely_lizard.sql in place; wipe the local sqlite to recreate.
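The single-flight map is the standard pattern of caching the in-progress promise rather than the result. A generic sketch (names hypothetical; the PR's version is inlined in EngineHost.sessionFor):

```typescript
// Concurrent callers for the same key share one in-progress build
// promise, so the builder runs at most once per key at a time. The entry
// is removed on settle so a failed build can be retried.
class SingleFlight<T> {
  private inflight = new Map<string, Promise<T>>();

  run(key: string, build: () => Promise<T>): Promise<T> {
    const existing = this.inflight.get(key);
    if (existing) return existing;
    const p = build().finally(() => this.inflight.delete(key));
    this.inflight.set(key, p);
    return p;
  }
}
```

Note that callers racing within the same tick receive the identical promise object, which is exactly the property that prevents two Engine + Session pairs from being built for one session id.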
The bridge was silently dropping all four decision-gate engine events, so the UI never saw them. Add wire-side DecisionGate / DecisionAction / DecisionResolution types, four new WireEvent variants, and small projectors that strip engine-only fields (context, origin, refs) from the wire shape. REST endpoints + UI come in follow-up commits.
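The projector shape is a rest-spread that omits the engine-only fields. A sketch with an illustrative gate type (not the PR's exact DecisionGate definition):

```typescript
// Wire projector sketch: destructure away the engine-only fields
// (context, origin, refs) so they never reach clients; everything else
// passes through unchanged.
interface EngineDecisionGate {
  gateId: string;
  type: "approval" | "question" | "credential_request";
  prompt: string;
  context?: unknown; // engine-only
  origin?: unknown; // engine-only
  refs?: unknown; // engine-only
}

type WireDecisionGate = Omit<EngineDecisionGate, "context" | "origin" | "refs">;

function projectGate(gate: EngineDecisionGate): WireDecisionGate {
  const { context, origin, refs, ...wire } = gate;
  return wire;
}
```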
GET /api/sessions/:id/decisions → list pending gates
POST /api/sessions/:id/decisions/:gateId/resolve → resolve (actionId or value)
POST /api/sessions/:id/decisions/:gateId/withdraw → user-initiated cancel
Both mutating routes confirm the gate is actually pending in the session before delegating to Session.resolveDecision / Session.withdrawDecision; otherwise a stale gateId would silently no-op. Withdraw rejects engine-internal reasons (steer, abort) so audit records stay honest — clients can only send reason='cancel'.
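The withdraw route's two guards can be sketched as a pure handler. Names, shapes, and status codes are assumptions for illustration, not the PR's actual route code:

```typescript
// Guard order: reject engine-internal reasons first, then confirm the
// gate is still pending before delegating — a stale gateId becomes an
// explicit error instead of a silent no-op.
type WithdrawReason = "cancel" | "steer" | "abort";

interface SessionGates {
  pendingGateIds(): Set<string>;
  withdrawDecision(gateId: string, reason: WithdrawReason): void;
}

function handleWithdraw(
  session: SessionGates,
  gateId: string,
  reason: string,
): { status: number; error?: string } {
  if (reason !== "cancel") {
    return { status: 400, error: "only reason='cancel' is allowed" };
  }
  if (!session.pendingGateIds().has(gateId)) {
    return { status: 404, error: "gate is not pending" };
  }
  session.withdrawDecision(gateId, "cancel");
  return { status: 200 };
}
```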
When the engine raises a gate via ctx.requestDecision, the bridge now forwards it on the wire and the web client stores it as pending state per session. The card renders only on the gate's owning thread, so switching to a sibling thread hides it (the engine still has the original thread suspended; the user can come back and answer).
Render modes by gate.type:
- approval / credential_request: action buttons (style: primary | danger | secondary)
- question: textarea + Submit
The X button calls POST /decisions/:gateId/withdraw with reason='cancel'.
Bootstrap: GET /sessions/:id/decisions seeds pending gates on mount so a card raised before the WS opens still shows immediately.