diff --git a/SPEC.md b/SPEC.md new file mode 100644 index 0000000000..d3263241d4 --- /dev/null +++ b/SPEC.md @@ -0,0 +1,2554 @@ +# sf v3 — Specification + +**Version:** 1.0.0-draft +**Status:** Research / Pre-implementation +**Authors:** singularity-ng +**Implementation target:** the next major version of [`singularity-forge`](https://github.com/sf-build/get-shit-done) (sf, formerly Get-Shit-Done / GSD), built on the existing [pi-mono](https://github.com/badlogic/pi-mono) SDK packages already vendored under `packages/pi-*`. **Not** a fork of [charmbracelet/crush](https://github.com/charmbracelet/crush). + +--- + +The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this document are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119). + +> **Retarget note (v1.0):** earlier draft versions (0.1–0.8) targeted a Go fork of Crush. That direction was reconsidered after recognising: +> - sf already has gen-2 harness control via pi-mono (vs. gen-1 skills which proved insufficient). +> - The cold-start performance argument for Go is largely moot once the daemon (`packages/daemon`) absorbs startup cost. +> - sf already ships an MCP server (`packages/mcp-server`) — meaning other agent CLIs can call sf as a backend, not the inverse. +> - Most of the Crush infrastructure we'd inherit (TUI, agent loop, multi-provider) is duplicated in pi-mono. +> +> The structure of the previous spec — phase machine, schema, hook pipeline, knowledge layer, persistent agents, conformance checklist — survives this retarget. The implementation target changes from Go-on-Crush to TypeScript-on-pi-mono. + +--- + +## Table of Contents + +1. [Overview](#1-overview) +2. [Definitions](#2-definitions) +3. [Data Model](#3-data-model) +4. [Phase State Machine](#4-phase-state-machine) +5. [Orchestration Loop](#5-orchestration-loop) +6. 
[Worker Attempt Lifecycle](#6-worker-attempt-lifecycle) +7. [Prompt Contract](#7-prompt-contract) +8. [Context Budget](#8-context-budget) +9. [Supervision](#9-supervision) +10. [Hook Pipeline](#10-hook-pipeline) +11. [Workspace Management](#11-workspace-management) +12. [Worktree Isolation](#12-worktree-isolation) +13. [Verification Gates](#13-verification-gates) +14. [Configuration](#14-configuration) +15. [Model Routing](#15-model-routing) +16. [Knowledge Layer](#16-knowledge-layer) +17. [Persistent Agents](#17-persistent-agents) +18. [Inter-Agent Messaging](#18-inter-agent-messaging) +19. [Observability](#19-observability) +20. [Failure Taxonomy](#20-failure-taxonomy) +21. [Trust Boundary](#21-trust-boundary) +22. [Distributed Execution](#22-distributed-execution) +23. [Plugin Extension Points](#23-plugin-extension-points) +24. [Secret Management](#24-secret-management) +25. [CLI Commands](#25-cli-commands) +26. [Conformance Checklist](#26-conformance-checklist) + +--- + +## 1. Overview + +sf is an **autopilot for software engineering work** that the user owns end-to-end: the user states a goal (`/sf plan "add OAuth"`) and sf decomposes, plans, executes, verifies, reviews, and merges through a structured phase pipeline without per-unit human intervention. The user watches or steers; the agent executes. 
+ +sf v3 is the next major version of singularity-forge, built directly on the [pi-mono](https://github.com/badlogic/pi-mono) SDK packages already vendored under `packages/pi-*`: + +| Vendored package | Role | +|---|---| +| `@singularity-forge/pi-coding-agent` | Coding agent CLI primitives (vendored from pi-mono) | +| `@singularity-forge/pi-agent-core` | General-purpose agent core | +| `@singularity-forge/pi-ai` | Unified LLM API across 20+ providers | +| `@singularity-forge/pi-tui` | TUI primitives | + +sf adds the autopilot layer on top: phase state machine, persistent agent fleet, knowledge integration with Singularity Memory, gates, hooks, worktree management, blockers and dispatch scheduling. The agent harness itself (tool execution, model calls, hook plumbing) is pi-mono's; the orchestration is sf's. + +### 1.1 Existing infrastructure + +sf already ships: + +- **`packages/daemon`** — long-lived background process that absorbs Node.js cold-start cost. The autopilot loop runs in the daemon; CLI invocations (`/sf status`, `/sf next`) talk to it via local RPC. +- **`packages/mcp-server`** — exposes sf orchestration tools (plan/dispatch/status) via MCP so other agent CLIs (Claude Code, Cursor) can call sf as a backend. +- **`packages/native`** — N-API bindings for performance-critical native (Rust) code where TypeScript would be too slow. +- **`packages/rpc-client`** — standalone RPC client SDK with zero internal dependencies, used by the CLI to talk to the daemon. + +This spec defines the v3 contract that ties these together with the gen-2 harness control pattern that GSD established (drop into pi-mono primitives directly; do NOT layer skills on top and hope the LLM follows them). + +### 1.2 Versioning + +sf follows [SemVer 2.0](https://semver.org/). For this spec: + +- **Patch** (1.x.Y): clarifications, conformance refinements, no behavioural change. +- **Minor** (1.Y.0): additions to the harness API, schema, or CLI that do not break existing implementations. 
+- **Major** (X.0.0): breaking changes to schema, hook contracts, or harness API. + +v1.0.0 (this spec, when finalised) freezes §§3 (Data Model), 4 (Phase State Machine), 6 (Worker Attempt Lifecycle), 10 (Hooks), 14 (Configuration), and 26 (Conformance) — changes to those sections post-v1 require a major bump. sf v3 MUST NOT rebuild what pi-mono already provides: + +- Agent loop via `pi-coding-agent` +- Multi-provider LLM (20+ providers including Anthropic, OpenAI, Gemini, Groq, Bedrock, Azure, Ollama) via `pi-ai` +- MCP client (`modelcontextprotocol/go-sdk`) +- LSP integration +- SQLite state via `ncruces/go-sqlite3` +- TUI primitives via `pi-tui` +- Tool execution (bash, file read/write, grep, web search, sourcegraph) +- Agent Skills open standard (`internal/skills/`) +- Permission service with pubsub, persistent grants, hook pre-approval +- PreToolUse hook system with allow/deny/halt, input rewriting, multi-hook aggregation + +This specification covers only what sf v3 adds on top of pi-mono. Behaviour already provided by the pi-mono SDK packages is inherited. + +**Project-level conformance.** sf MUST enforce JSDoc on every exported function, type, and class in its harness modules via a CI check (`scripts/specs-check.ts` — an AST walk, no external linter dependency). This applies to sf's own development; it is not a runtime gate against user projects. + +--- + +## 2. Definitions + +**Unit** — the atomic unit of work. Has a type (`milestone`, `slice`, `task`), a phase, and an attempt counter. Units are ephemeral — they complete or fail and are archived. + +Unit IDs use the format `{type}/{slug}` where slug is hierarchical: +- Milestone: `milestone/m{n}` (e.g. `milestone/m2`) +- Slice: `slice/m{n}/s{n}` (e.g. `slice/m2/s3`) +- Task: `task/m{n}/s{n}/t{n}` (e.g. `task/m2/s3/t1`) + +The slug encodes the parent hierarchy redundantly with `units.parent_id` to make trace and log lines self-describing without requiring a join. 
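The unit ID format above can be illustrated with a small parser. This is a minimal sketch, not the harness's actual code: the names `parseUnitId`, `ParsedUnitId`, and `UNIT_ID_PATTERNS` are hypothetical, and it assumes `{n}` is always a decimal integer and that a parent ID is derived by dropping the last slug segment.

```typescript
// Hypothetical helper types; the spec only defines the string format.
type UnitType = "milestone" | "slice" | "task";

interface ParsedUnitId {
  type: UnitType;
  milestone: number;
  slice?: number;
  task?: number;
  parentId: string | null; // mirrors units.parent_id; null for milestones
}

// One anchored pattern per unit type, matching `{type}/{slug}` exactly.
const UNIT_ID_PATTERNS: ReadonlyArray<[UnitType, RegExp]> = [
  ["milestone", /^milestone\/m(\d+)$/],
  ["slice", /^slice\/m(\d+)\/s(\d+)$/],
  ["task", /^task\/m(\d+)\/s(\d+)\/t(\d+)$/],
];

function parseUnitId(id: string): ParsedUnitId {
  for (const [type, pattern] of UNIT_ID_PATTERNS) {
    const match = pattern.exec(id);
    if (!match) continue;
    const milestone = Number(match[1]);
    if (type === "milestone") {
      return { type, milestone, parentId: null };
    }
    const slice = Number(match[2]);
    if (type === "slice") {
      return { type, milestone, slice, parentId: `milestone/m${milestone}` };
    }
    const task = Number(match[3]);
    return { type, milestone, slice, task, parentId: `slice/m${milestone}/s${slice}` };
  }
  // A type prefix that disagrees with the slug depth is rejected, e.g. "slice/m2".
  throw new Error(`invalid unit id: ${id}`);
}
```

Because the parent can be recomputed from the slug alone, a log line carrying only `task/m2/s3/t1` already identifies its slice and milestone, which is the self-describing property the paragraph above relies on.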
+ +**Phase** — a named stage of a unit's lifecycle. The harness owns all phase transitions; no other layer may transition a phase directly. + +**Attempt** — one dispatch of a worker for a unit. A unit may accumulate multiple attempts across failures and retries. + +**Turn** — one model call within an attempt. An attempt consists of one or more turns. The first turn receives the full task prompt; subsequent turns receive continuation guidance only. + +**Project** — a directory with `.sf/config.toml`. The project root is the directory containing `.sf/`. Each project has its own SQLite DB at `/.sf/sf.db` — `~/.sf/sf.db` is the cross-project default DB used only when no project-local DB exists. Multiple projects on the same machine MUST use separate `.sf/` directories and therefore separate DBs, locks, and trace files. + +**Session** — a top-level container scoped to one project, with a stable ULID, persisting across process restarts of the same project. A session is created on the first `/sf auto` or `/sf next` invocation in a project and reused on subsequent invocations until explicitly ended (`/sf session end`) or until 30 days of inactivity. The session holds the running state for all units, the context budget, and the supervisor state. + +**Harness** — the layer between pi-coding-agent's agent loop and sf's orchestration logic (milestones, phases, git, worktrees). It owns: context budget, phase transitions, unit lifecycle hooks, session contract, observability, and supervision. The planning and git layers MUST NOT reach past the harness boundary into pi-coding-agent directly. + +**Worker** — the process (local or SSH-remote) that executes one attempt. Spawned by the orchestrator. + +**Orchestrator** — central process. Owns the scheduling loop, in-memory state, and all SQLite writes. Always runs locally even in distributed deployments. + +**Singularity Memory** (`sm`) — the durable knowledge layer. 
An HTTP + MCP server holding memories, learnings, and anti-patterns across sessions, projects, and tools. Originally derived from `vectorize-io/hindsight` (MIT) and assimilated into our codebase under `singularity_memory_server/`; we own the engine. Runs either embedded (in-process for single-user sf) or remote (shared service on tailnet, reachable from sf, Hermes, OpenClaw, Claude Code, Cursor, etc.). Not SQLite — knowledge lives in Singularity Memory; SQLite holds only orchestration state. + +**Skill** — a `SKILL.md` file providing prompt guidance to the agent. Inspirational, not enforced. + +**Workflow template** — a TOML file specifying the exact phase sequence the harness enforces for a class of work. Programmatic, not a suggestion to the agent. + +**Plan** — the local source of truth for work units. Created by the user via `/sf plan "..."` or by editing `.sf/plan.md`. Decomposed by sf into milestones → slices → tasks. There is no external tracker — sf's SQLite DB is authoritative. (External visibility, e.g. mirroring to GitHub Issues for teammates, is achieved via PostUnit hook scripts, not a built-in tracker integration. See § 10.) + +**Claim** — a soft lock recorded on a `units` row indicating the orchestrator is currently dispatching it. Stored as `claim_holder` (worker host or PID) and `claim_until` (UNIX ms expiry). A claim is released on terminal phase, worker exit, or claim expiry. Prevents two workers picking up the same unit simultaneously. The orchestrator MUST sweep expired claims at the start of every poll tick: any row with `claim_until < now()` and `phase_status = 'running'` is reset to `phase_status = 'interrupted'` and `claim_holder = NULL`. + +**Run** — the unifying abstraction for one execution of the worker attempt lifecycle (§ 6). A run is either a **unit attempt** (driven by the phase state machine) or a **persistent agent run** (driven by inbox messages). The `runs` table (§ 3.5) records both, distinguished by `run_kind`. 
Trace, billing, and supervisor monitoring all key on `run_id`. + +--- + +## 3. Data Model + +The orchestrator uses a single SQLite database **per project** at `/.sf/sf.db` (or `~/.sf/sf.db` for non-project sessions) for **orchestration state only**: sessions, units, phase transitions, blockers, gate results, benchmarks, circuit breakers, and persistent agents. **Knowledge** (memories, learnings, anti-patterns, codebase context) lives in Singularity Memory (§ 16), not SQLite. + +All primary keys for runtime-allocated rows (sessions, units, runs, agents, agent_messages, agent_inbox, gate_results, session_blockers, pending_retain) MUST be [ULIDs](https://github.com/ulid/spec) — sortable by creation time without a separate timestamp column. Schema-natural keys (model name, agent name) remain TEXT but are not ULIDs. + +The schema MUST be managed via versioned migrations (Drizzle / Kysely) and MUST use WAL mode: + +```sql +PRAGMA journal_mode=WAL; +PRAGMA synchronous=NORMAL; +``` + +### 3.1 Core tables + +```sql +CREATE TABLE sessions ( + id TEXT PRIMARY KEY, + status TEXT NOT NULL, -- idle | running | paused | interrupted | complete | failed + created_at INTEGER NOT NULL, + updated_at INTEGER NOT NULL +); + +CREATE TABLE units ( + id TEXT PRIMARY KEY, -- format: type/m{n}[/s{n}[/t{n}]] + session_id TEXT NOT NULL REFERENCES sessions(id), + parent_id TEXT REFERENCES units(id), -- NULL for milestone; ≤ 3 levels deep + type TEXT NOT NULL CHECK (type IN ('milestone', 'slice', 'task')), + workflow TEXT NOT NULL, -- workflow template name; pinned at first dispatch + workflow_hash TEXT NOT NULL, -- SHA-256 of pinned template content (FK workflow_pins.hash) + phase TEXT NOT NULL, + phase_status TEXT NOT NULL CHECK (phase_status IN + ('pending', 'running', 'succeeded', 'failed', 'canceled', 'interrupted')), + attempt INTEGER NOT NULL DEFAULT 1, -- 1 = first try, 2 = first retry, ... 
+ claim_holder TEXT, -- format: "{host}#{pid}" or "ssh:{host}#{pid}" + claim_until INTEGER, -- UNIX ms; claim auto-expires at this time + priority INTEGER, -- 1 (urgent) .. 4 (low); NULL sorts last + title TEXT NOT NULL, + description TEXT, + metadata TEXT, -- arbitrary JSON: gh_issue, slack_channel, custom keys + worker_host TEXT, -- "local" | SSH host name; current/last worker + workspace TEXT, -- path of latest workspace (current attempt) + archived_at INTEGER, -- soft-delete; non-NULL = archived/forgotten + created_at INTEGER NOT NULL, + updated_at INTEGER NOT NULL +); + +-- Hierarchy depth is enforced in code (the harness rejects parent_id pointing to a task). +-- It would also be enforceable via a recursive trigger, but that adds write-path overhead +-- for a constraint that the planning layer already validates. + +CREATE TABLE phase_transitions ( + id TEXT PRIMARY KEY, + unit_id TEXT NOT NULL REFERENCES units(id), + from_phase TEXT NOT NULL, + to_phase TEXT NOT NULL, + reason TEXT, + transitioned_at INTEGER NOT NULL +); + +CREATE TABLE task_blockers ( + task_id TEXT NOT NULL REFERENCES units(id) ON DELETE CASCADE, + blocked_by TEXT NOT NULL REFERENCES units(id) ON DELETE CASCADE, + PRIMARY KEY (task_id, blocked_by) +); + +CREATE TABLE gate_results ( + id TEXT PRIMARY KEY, + unit_id TEXT NOT NULL REFERENCES units(id), + gate_name TEXT NOT NULL, + passed INTEGER NOT NULL, + attempt INTEGER NOT NULL, + max_retries INTEGER NOT NULL, + output TEXT, -- truncated at 8KB + duration_ms INTEGER NOT NULL, + recorded_at INTEGER NOT NULL +); + +CREATE TABLE session_blockers ( + id TEXT PRIMARY KEY, -- ULID + session_id TEXT NOT NULL REFERENCES sessions(id), + event TEXT NOT NULL, -- GateBlocked | MergeConflict | Paused | UATPending + unit_id TEXT, + detail TEXT, + created_at INTEGER NOT NULL, + resolved_at INTEGER, -- non-NULL = resolved; see resolution rules below + resolved_by TEXT -- "user" | "auto" | command name (e.g. 
"/sf uat-approve") +); + +-- Resolution rules: +-- GateBlocked : resolved when the gate passes on a subsequent attempt OR the unit +-- transitions to PhaseReassess; resolved_by = "auto" | "/sf force-clear" +-- MergeConflict : resolved on /sf revert, /sf merge-resolve, or git service hook; +-- resolved_by = command name +-- Paused : resolved on /sf resume; resolved_by = "user" +-- UATPending : resolved on /sf uat-approve or /sf uat-reject; resolved_by = command name +-- +-- An unresolved blocker MUST be displayed in /sf status. The TUI also subscribes to +-- the corresponding pubsub event (§ 10.1) for live updates. + +CREATE TABLE benchmark_results ( + id TEXT PRIMARY KEY, + model TEXT NOT NULL, + tier TEXT NOT NULL, + fingerprint TEXT NOT NULL, -- phase+complexity+project hash + quality REAL NOT NULL, -- 0.0 .. 1.0 + latency_p50 INTEGER NOT NULL, -- milliseconds + cost_per_1k_micro_usd INTEGER NOT NULL, -- micro-USD per 1k tokens + sample_count INTEGER NOT NULL DEFAULT 1, + recorded_at INTEGER NOT NULL +); + +CREATE TABLE circuit_breakers ( + model TEXT PRIMARY KEY, + tier TEXT NOT NULL, + tripped_at INTEGER NOT NULL, + resets_at INTEGER NOT NULL, -- UNIX ms; auto-reset deadline + fail_count INTEGER NOT NULL DEFAULT 3, + reason TEXT +); + +CREATE TABLE schema_migrations ( + version INTEGER PRIMARY KEY, + applied_at INTEGER NOT NULL, + description TEXT +); + +CREATE TABLE runs ( + id TEXT PRIMARY KEY, -- ULID + run_kind TEXT NOT NULL CHECK (run_kind IN ('unit_attempt', 'agent_run')), + unit_id TEXT REFERENCES units(id) ON DELETE SET NULL, -- preserve forensics + agent_id TEXT REFERENCES agents(id) ON DELETE SET NULL, -- preserve forensics + unit_id_snap TEXT, -- ID at run start; survives delete + agent_name_snap TEXT, -- name at run start; survives delete + attempt INTEGER, -- only for unit_attempt + worker_host TEXT, + workspace TEXT, -- workspace AT THIS attempt; authoritative for this run + started_at INTEGER NOT NULL, + ended_at INTEGER, + outcome TEXT CHECK 
(outcome IS NULL OR outcome IN + ('success','failure','abandoned','canceled','interrupted', + 'unit_timeout','turn_timeout','stalled')), + error_code TEXT, -- typed error from § 20.1; stores the string + -- value of the const, e.g. "turn_timeout" + input_tokens INTEGER NOT NULL DEFAULT 0, + output_tokens INTEGER NOT NULL DEFAULT 0, + cost_micro_usd INTEGER NOT NULL DEFAULT 0, -- cost in micro-USD (1e-6 USD); avoids float drift + CHECK ( + (run_kind = 'unit_attempt' AND unit_id_snap IS NOT NULL AND agent_name_snap IS NULL AND attempt IS NOT NULL) + OR + (run_kind = 'agent_run' AND agent_name_snap IS NOT NULL AND unit_id_snap IS NULL AND attempt IS NULL) + ) +); + +-- Aggregate token/cost columns are an end-of-run rollup written once on ended_at. +-- Span data in trace.jsonl (§ 19.3) is authoritative; runs columns are the cached +-- summary used by /sf session-report and the HTTP API without re-scanning JSONL. +-- +-- Soft-delete model: units and agents are NEVER hard-deleted by the harness — only +-- marked archived (units.archived_at, agents.archived_at). The snap_ columns ensure +-- run history survives even if a future operator manually drops rows. + +-- Local mirror of selected Singularity Memory entries that the harness needs offline. +-- Limited to anti-patterns by default — small, high-value, MUST surface even +-- if Singularity Memory is unreachable. 
+CREATE TABLE local_anti_patterns ( + id TEXT PRIMARY KEY, + description TEXT NOT NULL, + context TEXT NOT NULL, + correct_path TEXT NOT NULL, + source_unit TEXT, + fingerprint TEXT, -- phase + project hash, for fast filter + created_at INTEGER NOT NULL, + synced_at INTEGER -- last time confirmed against Singularity Memory +); +``` + +### 3.2 Persistent agent tables + +```sql +CREATE TABLE agents ( + id TEXT PRIMARY KEY, + name TEXT NOT NULL UNIQUE, + system TEXT NOT NULL, -- system prompt template + model TEXT NOT NULL, + state TEXT NOT NULL DEFAULT 'idle' CHECK (state IN ('idle','running','waiting','stopped')), + capabilities TEXT, -- JSON array of capability tags; cached in agent_capabilities + max_turns_per_run INTEGER NOT NULL DEFAULT 100, + archived_at INTEGER, -- soft-delete; non-NULL = archived + created_at INTEGER NOT NULL, + last_active INTEGER +); + +-- Indexed lookup table for capability matching (handoff "capability:tag1,tag2"). +-- Maintained in sync with agents.capabilities by the agent CRUD layer. 
+CREATE TABLE agent_capabilities ( + agent_id TEXT NOT NULL REFERENCES agents(id) ON DELETE CASCADE, + capability TEXT NOT NULL, + PRIMARY KEY (agent_id, capability) +); +CREATE INDEX agent_capabilities_by_tag ON agent_capabilities(capability, agent_id); + +CREATE TABLE agent_memory_blocks ( + agent_id TEXT NOT NULL REFERENCES agents(id), + label TEXT NOT NULL, + value TEXT NOT NULL DEFAULT '', + char_limit INTEGER NOT NULL DEFAULT 2000, + read_only INTEGER NOT NULL DEFAULT 0, + updated_at INTEGER NOT NULL, + PRIMARY KEY (agent_id, label) +); + +CREATE TABLE agent_messages ( + id TEXT PRIMARY KEY, + agent_id TEXT NOT NULL REFERENCES agents(id), + seq INTEGER NOT NULL, -- monotonically increasing per agent + role TEXT NOT NULL, -- user | assistant | tool_call | tool_return | system + content TEXT NOT NULL, + tool_name TEXT, + created_at INTEGER NOT NULL +); + +CREATE TABLE agent_inbox ( + id TEXT PRIMARY KEY, + agent_id TEXT NOT NULL REFERENCES agents(id), + from_agent TEXT NOT NULL, + content TEXT NOT NULL, + delivered INTEGER NOT NULL DEFAULT 0, + created_at INTEGER NOT NULL +); +``` + +`agent_inbox` is append-only. Rows MUST NOT be deleted or modified after insert. `delivered` is the only mutable field. + +### 3.3 No external tracker + +sf v3 does **not** integrate with external task trackers (Linear, GitHub Issues, Jira). Work units are entirely local — created by `/sf plan "..."`, edited via `.sf/plan.md`, and stored in `units` (§ 3.1). The local SQLite DB is the only authoritative source of unit state. + +This is a deliberate simplification from the earlier draft. Reasons: + +- sf's gen-2 model is "user states a goal, sf decomposes and executes" — not "team files tickets in Linear, sf picks them up." The autopilot doesn't need an external queue. +- External tracker integration adds network dependency on the orchestrator's critical path (rate limits, outages, GraphQL pagination edge cases). 
+- Symphony-style reconciliation (cancel mid-run when external state changes) doesn't apply when the only source is internal. + +**External visibility is achieved via hooks, not core integration.** A PostUnit hook can call `gh issue comment`, `slack-cli send`, or any other publishing target to broadcast progress. Read-side stays in sf's DB; write-side goes through hooks. See § 10.5.1 for an example GH Issues publishing hook. + +**Sources of truth for unit creation:** + +| Source | When | +|---|---| +| `/sf plan ""` CLI command | Adds a milestone with sf-decomposed slices and tasks | +| `.sf/plan.md` file edit | Declarative; sf re-reads and reconciles on `/sf plan reload` | +| `/sf dispatch ` | One-off task, no enclosing milestone | +| `/sf agent run ""` | Wakes a persistent agent; not a unit (§ 17.1) | + +There is no poll loop against any external API. The orchestrator's poll cycle (§ 5.1) reads only from local SQLite. + +--- + +## 4. Phase State Machine + +### 4.1 Phase enum + +```go +type Phase int + +const ( + PhaseResearch Phase = iota // map the problem, gather context + PhasePlan // decompose into slices and tasks, get sign-off + PhaseExecute // write the code + PhaseTDD // write tests for what was just built; red → green + PhaseVerify // run full test suite + lint + type check; gates pass + PhaseReview // structured self-review: correctness, style, security + PhaseMerge // commit, push, open PR + PhaseComplete // unit done; result recorded; artifact archived + PhaseReassess // re-enter planning with failure context + PhaseUAT // human acceptance; only when workflow has require_uat = true +) +``` + +### 4.2 Standard flow + +`Research → Plan → Execute → TDD → Verify → Review → Merge → Complete` + +Permitted non-standard transitions: + +| Trigger | Transition | +|---|---| +| Gate failure in Verify (attempt < max_retries) | `Verify → Execute` | +| Gate failure in Verify (attempt = max_retries) | `Verify → Reassess` | +| Review finds a real problem | `Review → 
Execute` | +| Merge conflict | `Merge → Reassess` | +| External cancellation | Any → (AttemptCanceled, no phase write) | + +All other transitions are REJECTED at the harness boundary with a typed error. The harness MUST NOT silently allow invalid transitions. + +### 4.3 Attempt state + +Within each phase, individual dispatch attempts move through finer-grained states: + +```go +type AttemptState int + +const ( + AttemptPreparingWorkspace AttemptState = iota + AttemptBuildingPrompt + AttemptLaunchingAgent + AttemptInitializingSession + AttemptStreamingTurn + AttemptFinishing + AttemptSucceeded + AttemptFailed + AttemptTimedOut + AttemptStalled // stall_timeout exceeded since last agent event + AttemptCanceled // issue became non-active mid-run (reconciliation) +) +``` + +`AttemptCanceled` is distinct from `AttemptFailed`. It means the work was valid but the task was externally invalidated (deleted, moved to a terminal state, superseded). The harness MUST NOT retry a canceled attempt — it releases the slot and moves on. + +### 4.4 Turn kind + +```go +type TurnKind int + +const ( + TurnFirst TurnKind = iota // full rendered task prompt + TurnContinuation // short continuation guidance, same thread +) +``` + +Turn 1 of every attempt is always `TurnFirst`. Turns 2+ are `TurnContinuation`. The harness determines `TurnKind`; the agent never does. + +### 4.5 Workflow templates + +A workflow template MUST be a TOML file in `.sf/workflows/.toml`. The harness reads the template, constructs the phase sequence from it, and enforces it programmatically. The agent has no say in phase ordering or skipping. 
+ +```toml +# .sf/workflows/feature.toml +name = "feature" +phases = ["research", "plan", "execute", "tdd", "verify", "review", "merge", "complete"] +require_tdd = true # PhaseTDD is enforced; skipping is a gate violation +require_review = true +require_uat = false # if true, PhaseUAT is inserted before PhaseMerge +max_retries = 3 # per gate in PhaseVerify +max_reassess = 2 + +# .sf/workflows/release.toml — uses UAT +name = "release" +phases = ["research", "plan", "execute", "tdd", "verify", "review", "uat", "merge", "complete"] +require_tdd = true +require_review = true +require_uat = true # halts after UAT enters; only resumes on /sf uat-approve +max_retries = 3 + +# .sf/workflows/spike.toml +name = "spike" +phases = ["research", "plan", "execute", "complete"] +require_tdd = false +require_review = false +max_retries = 0 +``` + +PhaseUAT halts the auto-loop with `SignalPause` and waits for `/sf uat-approve ` (advance to PhaseMerge) or `/sf uat-reject "reason"` (advance to PhaseReassess). The harness MUST fail startup if a configured workflow template references an unknown phase or includes `uat` without `require_uat = true`. + +#### Workflow selection at dispatch + +The workflow used for a given unit is determined in this order: + +1. Explicit unit metadata: `metadata.workflow = ""` set at `/sf plan` time. +2. Project default: `[harness] default_workflow = "feature"` in `.sf/config.toml`. +3. Built-in fallback: `feature` (if available) else the first workflow in `.sf/workflows/`. + +The selected workflow is recorded in `units.workflow` at dispatch time and never re-evaluated for that unit, even on retry — workflow stability across attempts is a hard guarantee. Additionally, the *content* of the chosen template is hashed (SHA-256) and stored in `units.workflow_hash`. If the on-disk template changes mid-session, the harness uses the pinned hash's content (cached in SQLite at `workflow_pins.content`) for that unit; new units pick up the new content. 
This prevents in-flight units from silently changing rules. + +```sql +CREATE TABLE workflow_pins ( + hash TEXT PRIMARY KEY, -- SHA-256 of template content + name TEXT NOT NULL, + content TEXT NOT NULL, -- frozen TOML at first pin + pinned_at INTEGER NOT NULL +); +``` + +### 4.6 PhaseReassess + +`PhaseReassess` is entered when a unit cannot make progress through normal phases (gate failed `max_retries` times, merge conflict, supervisor halt). The Reassess agent is dispatched at the **`reasoning`** tier with `Think: true` and is given: + +- The original task description. +- The full failure trail: gate output, last `max_retries` attempt errors, last commit history. +- The unit's plan (from `.sf/active/{unit-id}/plan.md`). + +The Reassess agent MUST output one of: + +| Outcome | Effect | +|---|---| +| **Re-plan** | Writes a new `plan.md`, transitions back to `PhasePlan`. Counter `max_reassess` decrements. | +| **Abandon** | Writes a `decision.md` explaining why the task cannot succeed; transitions to `PhaseComplete` with verdict `abandoned`. Any registered visibility hook (e.g. GH Issues comment) fires from the standard PostUnit pipeline. | +| **Escalate** | Halts auto-loop with `SignalPause`; writes a `human-question.md` with concrete questions for the operator. Resumes on `/sf reassess-resolve `. | + +If `max_reassess` hits zero on a Re-plan path, the next entry into PhaseReassess MUST be Abandon or Escalate; Re-plan is rejected. + +### 4.7 Phase transition rules + +1. All phase transitions MUST go through a single `Harness.Transition(ctx, from, to, reason)` method. +2. `Transition` MUST persist the `PhaseTransition` record to SQLite BEFORE the new phase begins. A crash mid-phase means on resume the harness re-enters the last committed phase cleanly (see § 4.8). +3. `Transition` MUST emit a pubsub `PhaseChange` event after the SQLite write. The TUI subscribes — it MUST NOT poll phase state directly. +4. 
The harness MUST set `Think: true` on the model config for `Research`, `Plan`, and `Reassess` phases. The agent does not control this. +5. **`PhaseChange` is non-vetoable.** Hook subscribers receive a notification *after* the transition is committed; they cannot block or reject. Hooks that need veto semantics MUST register on `PreDispatch` instead, which fires before the next dispatch and IS vetoable. + +### 4.8 Crash recovery + +In-memory scheduler state is intentionally not persisted (§ 20.2). On restart, the orchestrator MUST follow this exact sequence: + +1. **Acquire project lock** at `/.sf/run.lock` (PID file). Stale lock (PID not in `/proc` on Linux, `kill(pid, 0)` on other Unixes) is cleaned and logged. The lock is per-project; multiple projects can run auto concurrently on the same machine. +2. **Mark interrupted units.** All units with `phase_status = 'running'` are updated to `phase_status = 'interrupted'`. This is the only schema-level recovery action. +3. **Run startup cleanup** (§ 5.6) — move stale active artifacts to archive. +4. **Resume from the last committed phase boundary.** Each `interrupted` unit is treated as eligible for fresh dispatch; the worker re-enters at `unit.phase` with a new attempt number (`unit.attempt + 1`). The agent receives a `last_error` of `"resumed_after_crash"` so the prompt can warn the agent. +5. **Begin polling.** Resume normal poll cycle. Operator-issued `/sf abandon` commands made during the outage are visible via the next poll because they're persisted in `units.phase_status`. + +The harness MUST NOT replay tool calls. It MUST NOT attempt to "resume" a partial agent session. The crash recovery model is **fresh dispatch from the last persisted phase boundary**, not transparent continuation. + +**Side effects are not rolled back.** A crash mid-Merge may have produced a partial commit, push, or PR. The agent on retry sees the existing commits and either continues from there or surfaces a conflict. 
This MUST be documented in the Merge phase prompt: "if you see existing commits from a previous attempt, integrate them; do not start over." + +**Workspace state is preserved.** A crashed worker's workspace remains on disk; the next attempt reuses it (`ensure_workspace` returns `created=false`). The `before_run` hook is responsible for any cleanup (e.g. `git stash`, `git clean`) appropriate for the project. + +--- + +## 5. Orchestration Loop + +### 5.1 Poll cycle + +The orchestrator runs a single scheduler loop that polls on a configurable interval (default 1s). Each tick: + +1. Re-check config stamp (§ 14.3). +2. Fetch eligible units from SQLite. +3. Apply priority sort (§ 5.2). +4. For each eligible unit (up to capacity), dispatch a worker. +5. Check running workers for stalled/timed-out attempts. +6. Write orchestrator snapshot to HTTP API state (§ 19.4). + +The orchestrator MUST be the single authority for all in-memory scheduler state. Nothing outside that loop writes scheduler state. + +### 5.2 Priority ordering + +When multiple units are eligible, the orchestrator sorts them: + +1. **Explicit priority** — `priority` 1 (urgent) before 4 (low); `NULL` sorts last. +2. **Blocker-free first** — units with no non-terminal upstream blockers before blocked units. +3. **Phase order** — earlier phases first (Research before Execute) within the same priority bucket. +4. **Created-at** — oldest first as tie-breaker. +5. **Unit ID lexicographic** — final deterministic tie-breaker. + +This ordering is re-evaluated fresh on every poll tick. + +### 5.3 Blocker-aware dispatch + +A unit MUST NOT be dispatched if any of its upstream dependencies (in `task_blockers`) are in a non-terminal state. + +**Terminal** means `PhaseComplete`, `PhaseReassess` (resolved), or explicitly canceled. **Non-terminal** means any other state, including `PhaseVerify` in progress. + +A dependency that failed and was marked abandoned is terminal and MUST NOT block downstream dispatch. 
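The blocker rule above can be sketched as a pure eligibility check over rows read from `units` and `task_blockers`. This is an illustrative sketch under stated assumptions: `UnitRow`, `BlockerRow`, `isTerminal`, and `isDispatchable` are hypothetical names, the `reassessResolved` flag is an invented stand-in for however the harness records a resolved Reassess, and treating an unknown upstream unit as blocking is a conservative choice the spec does not state.

```typescript
type PhaseName =
  | "research" | "plan" | "execute" | "tdd" | "verify"
  | "review" | "merge" | "complete" | "reassess" | "uat";

type PhaseStatus =
  | "pending" | "running" | "succeeded" | "failed" | "canceled" | "interrupted";

// Hypothetical in-memory view of a `units` row.
interface UnitRow {
  id: string;
  phase: PhaseName;
  phaseStatus: PhaseStatus;
  reassessResolved?: boolean; // assumption: not a column in the spec's schema
}

// Hypothetical in-memory view of a `task_blockers` row.
interface BlockerRow {
  taskId: string;
  blockedBy: string;
}

// Terminal = complete (including abandoned), resolved Reassess, or canceled.
function isTerminal(unit: UnitRow): boolean {
  if (unit.phase === "complete") return true;
  if (unit.phaseStatus === "canceled") return true;
  return unit.phase === "reassess" && unit.reassessResolved === true;
}

// A unit is dispatchable only if every upstream dependency is terminal.
function isDispatchable(
  unitId: string,
  blockers: ReadonlyArray<BlockerRow>,
  units: ReadonlyMap<string, UnitRow>,
): boolean {
  for (const blocker of blockers) {
    if (blocker.taskId !== unitId) continue;
    const upstream = units.get(blocker.blockedBy);
    // Assumption: an upstream row that cannot be found blocks dispatch.
    if (upstream === undefined || !isTerminal(upstream)) return false;
  }
  return true;
}
```

An abandoned dependency ends in `PhaseComplete`, so it counts as terminal here and does not hold back downstream units. Keeping the check side-effect free means it can be re-run on every poll tick.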
+ +Blocked units stay queued and are re-evaluated on the next poll tick. No backoff, no retry counter increment for a blocked wait. + +### 5.3.1 Atomic claim acquisition + +The orchestrator acquires a claim with a single conditional UPDATE: + +```sql +UPDATE units + SET claim_holder = ?, claim_until = ?, phase_status = 'running', updated_at = ? + WHERE id = ? + AND (claim_holder IS NULL OR claim_until < ?); -- ? = now() +``` + +Dispatch proceeds only if `rows_affected = 1`. This makes the claim race-free at the DB level and supports multiple orchestrators against the same `~/.sf/sf.db` even though SF normally runs as a singleton (one process per `~/.sf/run.lock`). The atomic claim is the safety net if the lock fails (e.g. shared NFS, broken filesystem semantics). + +`units.attempt` is the **current** attempt counter (used as the `attempt` prompt template variable). Historical attempts live in `runs` (§ 3.1). Authority: `units.attempt` is incremented exactly when a new `runs` row is inserted; the two are kept in sync inside the same transaction. + +### 5.4 Per-phase concurrency + +The harness MUST NOT exceed `max_agents_by_phase[phase]` concurrent units in any given phase. When a phase slot is full, further dispatches for that phase wait until the next tick. + +```toml +[harness.concurrency] +max_agents = 10 +max_agents_by_phase.execute = 4 +max_agents_by_phase.tdd = 4 +max_agents_by_phase.verify = 10 +``` + +### 5.5 Continuation retry and exponential backoff + +**After a normal (clean) exit** from a worker, the orchestrator MUST schedule a 1-second continuation retry to re-poll eligibility. If the unit is still active, a new session starts. If terminal, the claim is released. This is not a failure retry. + +### 5.4.1 Turn outcome signal + +Between transport-level "turn ran cleanly" and phase-level "gate passed," the harness MUST capture a per-turn semantic signal. 
After every turn, the harness inspects the model output for an explicit terminal marker: + +| Marker (in agent output) | Meaning | Effect | +|---|---|---| +| `complete` | Agent considers this turn's goal achieved | Recorded; allow continuation if `max_turns_per_attempt` not reached | +| `blocked` | Agent is stuck and needs user input or escalation | Triggers `SignalPause` in auto-mode | +| `giving_up` | Agent has decided the task can't be done | Ends attempt; transitions to PhaseReassess | +| (no marker) | Default success | Continue normally | + +The marker is parsed from the last 200 chars of the agent's response. Markers appearing earlier are ignored (prevents partial-quote false positives). This gives the harness a checkpoint *between* turns without waiting for a phase boundary. + +The agent prompt template (`prompts/execute-task.md`) instructs the agent to emit one of these markers at end-of-turn. Compliance is best-effort — absence of a marker is treated as default success. + +**After an abnormal exit**, the orchestrator applies exponential backoff. `attempt` is 1-indexed (first try = 1, first retry = 2, …) and denotes the attempt about to be dispatched: + +``` +delay = min(10s × 2^(attempt - 1), max_retry_backoff) +``` + +| Attempt | Delay before next dispatch | +|---|---| +| 1 (first try) | (no retry yet) | +| 2 (first retry) | 20 s | +| 3 | 40 s | +| 4 | 80 s | +| 5 | 160 s | +| 6+ | capped at `max_retry_backoff` (default 5 min) | + +Configurable: `[harness] max_retry_backoff = "5m"`, `[harness] max_attempts = 6`. + +### 5.6 Startup cleanup + +On startup, the orchestrator MUST: + +1. Scan `.sf/active/` for unit artifacts whose tasks are in terminal states. +2. Move stale active artifacts to `.sf/archive/` atomically (rename, not copy+delete). +3. Mark any running/claimed units as interrupted in SQLite. +4. Release all worker slots. + +--- + +## 6. Worker Attempt Lifecycle + +The exact sequence inside a single worker attempt: + +``` +run_worker_attempt(unit, attempt): + # 1.
Workspace + workspace = create_or_reuse_workspace(unit.id, unit.worker_host) + if workspace failed: + fail_attempt(ErrWorkspaceCreation) + + # 2. Before-run hook (fatal) + result = run_hook("before_run", workspace, unit) + if result failed: + fail_attempt(ErrHookFailed) + + # 3. Session start + session = agent.start_session(cwd=workspace, model=route(unit.phase)) + if session failed: + run_hook_best_effort("after_run", workspace, unit) + fail_attempt(ErrAgentStartup) + + # 4. Turn loop + turn = 1 + loop: + kind = TurnFirst if turn == 1 else TurnContinuation + prompt = build_prompt(unit, attempt, turn, kind) + if prompt failed: + agent.stop_session(session) + run_hook_best_effort("after_run", workspace, unit) + fail_attempt(ErrPromptRender) + + result = agent.run_turn(session, prompt) + if result failed: + agent.stop_session(session) + run_hook_best_effort("after_run", workspace, unit) + fail_attempt(result.error) + + # Re-check unit state between turns (local DB only — no external tracker) + current_state = db.fetch_unit_phase_status(unit.id) + if current_state in ('canceled', 'succeeded'): + break # → AttemptCanceled (e.g. operator ran /sf abandon mid-run) + + if turn >= max_turns_per_attempt: + break + + turn++ + + # 5. Teardown + agent.stop_session(session) + run_hook_best_effort("after_run", workspace, unit) + exit_normal() +``` + +Rules: +- `before_run` hook failure is fatal — the harness MUST fail the attempt without starting the session. +- `after_run` hook is always attempted, even after failure. Its failure is logged but MUST NOT change the attempt outcome. +- The unit state re-check between turns MUST happen before building the next turn prompt. A canceled unit MUST NOT receive another turn. + +--- + +## 7. Prompt Contract + +### 7.1 Template variables + +Every prompt template MUST be rendered with a strict variable checker. An unknown variable in the template MUST cause `loadPrompt` to panic at startup rather than silently render an empty string. 
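One way to get this strict behaviour is sketched below with Go's `text/template` and its `missingkey=error` option. It is an illustration only: the spec's `{{variable}}` syntax and the real `loadPrompt` signature are not defined here, so the `{{.name}}` form, `renderStrict`, and `mustLoadPrompt` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderStrict renders tmpl against vars and fails on any variable that
// vars does not define, instead of emitting an empty string.
func renderStrict(name, tmpl string, vars map[string]any) (string, error) {
	t, err := template.New(name).Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", fmt.Errorf("parse %s: %w", name, err)
	}
	var sb strings.Builder
	if err := t.Execute(&sb, vars); err != nil {
		return "", fmt.Errorf("render %s: %w", name, err)
	}
	return sb.String(), nil
}

// mustLoadPrompt is a hypothetical startup helper: it panics on any
// render error so a bad template is caught before the first dispatch.
func mustLoadPrompt(name, tmpl string, vars map[string]any) string {
	out, err := renderStrict(name, tmpl, vars)
	if err != nil {
		panic(err)
	}
	return out
}

func main() {
	vars := map[string]any{"unit_id": "task/m1/s2/t3", "phase": "execute"}
	fmt.Println(mustLoadPrompt("ok", "Unit {{.unit_id}} in {{.phase}}", vars))

	// An undefined variable is reported as an error rather than rendered empty.
	_, err := renderStrict("bad", "Unit {{.unit_id}} {{.typo}}", vars)
	fmt.Println(err != nil)
}
```

Rendering every template once at startup with placeholder values is what turns a misspelled variable into a startup failure rather than a silent gap in a live prompt.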
+ +Canonical variables for execute-task templates: + +| Variable | Type | Notes | +|---|---|---| +| `unit_id` | string | Stable unit identifier | +| `unit_type` | string | `"milestone"` \| `"slice"` \| `"task"` | +| `phase` | string | Current phase name (`"execute"`, `"tdd"`, etc.) | +| `attempt` | int | 1-indexed (§ 5.5): `1` on first dispatch; integer ≥ 2 on retry | +| `session_id` | string | Stable session UUID | +| `issue` | object | Full issue/task struct as flat map | +| `last_error` | string \| null | Injected automatically when `attempt >= 2` (§ 7.3); `null` otherwise | + +When adding a new `{{variable}}` to any template: (1) pass it in every `loadPrompt` call site, (2) add a placeholder in every test that renders that template, (3) recompile. Skipping any of these steps causes a startup panic. + +### 7.2 Continuation turns + +A `TurnContinuation` MUST receive a short guidance prompt, not the full task prompt. The full prompt is already in the thread history — resending it inflates context and degrades model reasoning. The continuation prompt MUST NOT re-state the task description; it provides only steering context for the current turn. + +### 7.3 Attempt variable semantics + +The `attempt` variable enables prompt templates to give different instructions to retrying agents vs. fresh starts. A retry prompt SHOULD include: `"your previous attempt failed with: {{last_error}} — focus on that specifically."` The harness injects `last_error` automatically on `attempt >= 2`. + +**`last_error` is only injected on `TurnFirst` of attempts ≥ 2.** Continuation turns within the same attempt have already established context and don't need it. A turn failure within an attempt always fails the entire attempt (§ 6); there are no mid-attempt error injections to reason about. + +`last_error` content MUST be capped at 4 KB. Larger payloads (gate output, lint dumps, tracebacks) are truncated head-and-tail: 2 KB from the start, marker `... [truncated, full payload at ] ...`, then 2 KB from the end.
The full payload is written to `.sf/active/{unit-id}/last-error-full.txt` so the agent can `read_file` it if the truncated context isn't enough. + +### 7.4 `turn_input_required` in auto-mode + +When the agent raises `turn_input_required` during auto-mode, the harness MUST respond according to the `turn_input_required` config (default: `"soft"`): + +- **`"soft"`** — inject `"This is a non-interactive session. Operator input is unavailable."` as a `user` role turn and let the session continue. The agent adapts. +- **`"hard"`** — end the attempt immediately, record `ErrTurnInputRequired`, schedule failure retry. + +In interactive/step mode, the harness MUST surface the request to the user via the TUI and MUST NOT auto-respond. It waits up to `unit_timeout` before failing. + +The harness MUST NOT leave a run stalled indefinitely waiting for interactive input in any mode. + +--- + +## 8. Context Budget + +### 8.1 Budget type + +```go +type Budget struct { + MaxTokens int + UsedTokens int + CompactAt float64 // fraction e.g. 0.80 + HardLimitAt float64 // fraction e.g. 0.95 +} + +func (b *Budget) ShouldCompact() bool { + return float64(b.UsedTokens)/float64(b.MaxTokens) >= b.CompactAt +} + +func (b *Budget) AtHardLimit() bool { + return float64(b.UsedTokens)/float64(b.MaxTokens) >= b.HardLimitAt +} +``` + +### 8.2 Rules + +- The harness MUST update `UsedTokens` after every model response. The agent loop MUST NOT manage budget. +- When `ShouldCompact()` is true, the harness MUST trigger compaction before the next turn, not mid-turn. +- When `AtHardLimit()`, the harness MUST halt the current unit, snapshot state, and surface `ErrBudgetExhausted`. It MUST NOT let the agent proceed and hit a provider context error. +- Budget state MUST be persisted to SQLite after every turn so crash recovery can restore it. + +### 8.3 Compaction + +When compaction fires (budget at compact threshold): + +1. Write a `session_summary` entry to Singularity Memory via `retain`. +2. 
Clear the hot cache (in-memory last-N turns). +3. Start the next turn with a fresh context window seeded by a `recall` from Singularity Memory. + +Compaction MUST NOT truncate the window — it MUST replace it with a fresh recall. A truncated window loses structure; a recalled window gains relevance. + +**Agent run compaction preserves the wake context.** For persistent agent runs, the compacted window MUST include verbatim: +- The wake message that started this run. +- The most recent 3 inbox arrivals delivered in this run. +- The agent's full `agent_memory_blocks` (these are durable anyway, but they go above the recall block). + +Compaction without this preservation can drop the originating intent and cause the agent to lose thread continuity mid-run. + +### 8.4 Token accounting precision + +Provider responses arrive as either absolute thread totals or per-turn deltas. The harness MUST prefer absolute totals (`thread/tokenUsage/updated`-style events) and MUST track the last-reported total to compute deltas, preventing double-counting. + +Aggregate totals (input, output, cache-read, cache-write, cost-usd) MUST accumulate in orchestrator state and be included in every runtime snapshot. + +--- + +## 9. Supervision + +### 9.1 Supervisor interface + +The harness MUST run a supervisor goroutine alongside the agent loop. The supervisor communicates exclusively via pubsub — it MUST NOT touch agent state directly. 
+ +```go +type SupervisorCheck interface { + Name() string + Check(ctx context.Context, state SupervisorState) SupervisorSignal +} + +type SupervisorSignal int + +const ( + SignalOK SupervisorSignal = iota + SignalWarn // log, surface in TUI + SignalPause // pause auto-loop, wait for user + SignalAbort // stop unit, mark interrupted +) +``` + +### 9.2 Built-in checks + +| Check | Trigger | Signal | +|---|---|---| +| `StuckLoop` | Same phase for > N turns with no successful tool calls | `SignalPause` | +| `BudgetWarning` | Context approaching compaction threshold | `SignalWarn` | +| `TimeoutCheck` | Unit running longer than `unit_timeout` | `SignalAbort` | +| `AbandonDetect` | Agent producing output with no tool calls | `SignalPause` | +| `GitDivergence` | Working branch diverged from base unexpectedly | `SignalPause` | +| `BlockerCheck` | Upstream dependency moved to non-terminal state mid-run | `SignalPause` | +| `ModelUnavailable` | Provider returns "model not supported / not found" class error | `SignalAbort` immediately (not after timeout) | +| `CircuitBreaker` | Same model fails 3 consecutive times within a session | Trip circuit; `SignalAbort` on next dispatch to tripped model | + +### 9.3 Circuit breaker + +When the circuit trips for a model: + +- Write circuit state to SQLite (`circuit_breakers` table — `model`, `tripped_at`, `resets_at`). +- Subsequent dispatches in that tier MUST skip the tripped model. +- Circuit auto-resets after 24 hours or on explicit `/sf reset-circuits`. +- The circuit state MUST survive a process restart. + +### 9.4 Supervisor constraints + +- The supervisor MUST NOT call `os.Exit` or panic. +- The supervisor MUST NOT write to agent state or SQLite unit state directly. +- The auto-loop acts on `SignalPause` and `SignalAbort`. The TUI shows warnings on `SignalWarn`. + +### 9.5 SignalAbort and in-flight tool calls + +When the harness receives `SignalAbort` while a tool call is in flight (e.g. 
a long-running `bash` subprocess), it MUST follow this sequence: + +1. Cancel the tool call's context (Go `context.CancelFunc`). Cooperative cancellation MUST be honoured by built-in tools. +2. Wait up to `[harness] tool_abort_grace = "5s"` for the tool to exit cleanly. +3. After the grace period, send `SIGTERM` to any tool subprocess. +4. Wait an additional `[harness] tool_abort_kill = "3s"`. +5. If the subprocess is still running, send `SIGKILL`. + +Total worst case: 8 seconds from `SignalAbort` to forcible termination. The harness MUST NOT hang the orchestrator waiting on a non-cooperating tool call. + +After the tool call ends (cleanly or via SIGKILL), the harness records the run as `outcome = canceled` with `error_code = canceled_by_supervisor` and emits the `after_run` hook before releasing the slot. + +--- + +## 10. Hook Pipeline + +### 10.1 Events + +The harness extends pi-coding-agent's hook system with sf-specific events: + +```go +const ( + // Existing pi-coding-agent event + EventPreToolUse = "PreToolUse" + + // Unit lifecycle + EventPreDispatch = "PreDispatch" // before a unit is dispatched; can block + EventPostUnit = "PostUnit" // after a unit completes + EventPhaseChange = "PhaseChange" // on phase transition + + // Auto-loop + EventAutoLoop = "AutoLoop" // each iteration of the auto-loop + + // Worktree + EventWorktreeCreate = "WorktreeCreate" + EventWorktreeDelete = "WorktreeDelete" + EventMergeReady = "MergeReady" + EventMergeConflict = "MergeConflict" + + // Agent fleet + EventAgentWake = "AgentWake" // target agent should start/resume + EventAgentMessage = "AgentMessage" // message routed (TUI + tracing) + EventAgentIdle = "AgentIdle" // agent completed its turn, inbox empty +) +``` + +### 10.2 UnitResult payload + +PostUnit hooks receive: + +```go +type UnitResult struct { + UnitID string + UnitType string // "milestone" | "slice" | "task" + Phase Phase + Verdict string // "success" | "failure" | "abandoned" + Duration time.Duration + 
InputTokens int + OutputTokens int + CacheHits int + CostUSD float64 + Model string + WorkerHost string + Error error + Learnings []string +} +``` + +The payload is serialized to JSON and passed to hook subprocesses via stdin. + +### 10.3 Hook execution rules + +- PostUnit hooks run **sequentially**, not concurrently. The next dispatch MUST NOT begin until all PostUnit hooks have returned. +- A hook subprocess that exits non-zero for `PreDispatch` or `PostUnit` MUST trigger `SignalAbort`. The harness stops the session and marks it `SessionFailed`. +- Hook timeouts are per-hook-type. Defaults: + + | Hook | Default | Rationale | + |---|---|---| + | `before_run` | `120s` | Cleanup, dependency install can take time | + | `after_run` | `30s` | Best-effort teardown | + | `after_create` | `120s` | First-time setup | + | `before_remove` | `30s` | Cleanup | + | `pre_dispatch` | `15s` | Should be a fast check | + | `post_unit` | `60s` | Subprocess work; longer for git push | + | `doc_sync` (built-in) | `5m` | Runs an agent dispatch over the diff | + + All overridable in config via `[harness.hooks.timeouts.] = ""`. A timeout kills the hook and logs. A `PostUnit` hook timeout MUST NOT block the next dispatch. +- The git service subscribes to PostUnit via a hook and handles commits, branch creation, and push. The harness MUST NOT call `git` directly. +- Singularity Memory feedback (retain learnings, mark anti-patterns) is emitted from a built-in PostUnit hook (not a subprocess) — it calls the Singularity Memory client directly. +- PostUnit hook results MUST be written to the trace as child spans of the unit span. 
+ +### 10.4 Tool response contract + +Every tool call — successful or not — MUST return a response in this shape: + +```go +type ToolResponse struct { + Success bool `json:"success"` + Output string `json:"output"` + ContentItems []ContentItem `json:"contentItems"` +} + +type ContentItem struct { + Type string `json:"type"` // always "inputText" for text results + Text string `json:"text"` +} +``` + +For successful calls: `success = true`, `output` = result summary. For unsupported or failed calls: `success = false`, `output` = human-readable error, `contentItems` lists which tools are available in the current context. The shape MUST be consistent — the agent relies on `success` to distinguish real failures from tool-not-found errors. + +If the agent calls a tool that is not registered, the harness MUST return a structured failure response and continue the session. It MUST NOT stall, panic, or exit on an unknown tool name. + +### 10.5.0 SF tool registration + +pi-coding-agent (vendored from pi-mono under `packages/pi-coding-agent/`) provides the agent's tool registry. sf adds new tools (`send_message`, `core_memory_append/replace`, `handoff`, `wait_for_reply`, `chapter_open`, `stop`, `plan_unit`, etc.) by registering them at sf-startup via pi-coding-agent's API. There is NO parallel tool registry — sf tools live in `src/resources/extensions/sf/tools/` and call into pi-coding-agent's registration during module init. + +sf-specific tools MUST: +1. Conform to the response shape of § 10.4 (`{success, output, contentItems}`). +2. Honour pi-coding-agent's `PreToolUse` hook system — they receive the same hook pipeline as built-in tools. +3. Document the auto_approve key they expect (e.g. `agent:send_message`) so projects can list them in `[harness.auto_approve.tools]`. + +This means PreToolUse hooks can deny sf tool calls just like any other; the auto-approve list scopes them; permissions are uniform. 
+ +### 10.5.1 External visibility via PostUnit hooks (recipe) + +If the user wants teammates to see sf's progress in GitHub Issues (or Slack, or any other system), this is done as a PostUnit hook script — **not** a built-in tracker integration. + +Example: `.sf/hooks/post-unit-gh.sh` + +```bash +#!/usr/bin/env bash +# Reads UnitResult JSON from stdin; posts a comment to a GitHub issue +# whose number is stored in the unit's `metadata.gh_issue` field (set at +# plan time via /sf plan --link-issue=42 "..."). + +set -euo pipefail +payload="$(cat)" +issue=$(jq -r '.unit.metadata.gh_issue // empty' <<< "$payload") +verdict=$(jq -r '.verdict' <<< "$payload") +phase=$(jq -r '.phase' <<< "$payload") +[ -z "$issue" ] && exit 0 # not linked, no-op + +gh issue comment "$issue" --body "sf $phase: $verdict" +``` + +Wired in `.sf/config.toml`: + +```toml +[harness.hooks] +post_unit = ["./.sf/hooks/post-unit-gh.sh"] +``` + +The unit's `metadata.gh_issue` field is set at plan time: + +```bash +sf plan --link-issue=42 "implement OAuth" +``` + +This pattern keeps the orchestrator's critical path local (sf's DB) while still giving external visibility where the user wants it. The same pattern works for Slack, Discord, Jira, Linear, in-house dashboards — sf doesn't need to know about any of them. + +### 10.5 Doc sync (sub-step of PhaseMerge or PhaseComplete) + +Doc sync runs as the final sub-step of the **last code-mutating phase** before `PhaseComplete`: + +- For workflows that include `PhaseMerge`: doc sync runs at end of `PhaseMerge`. +- For workflows that omit `PhaseMerge` but include `PhaseExecute` (e.g. `spike`): doc sync runs at end of the last code-mutating phase that ran. If the spike adopted a new dependency, doc sync still gets a chance to update `STACK.md`. + +It is not a separate phase and not a post-merge hook; it is the final sub-step of whichever phase was last to mutate code. + +The doc-sync sub-step: + +1.
Dispatches a `fast`-tier turn against the merged diff with a short prompt asking whether project-level docs (`ARCHITECTURE.md`, `CONVENTIONS.md`, `STACK.md`) need updating. +2. The agent emits a diff (possibly empty) to stdout. +3. If the diff is non-empty, the harness surfaces it to the TUI for user approval. On approval, it is committed as `docs: sync after {unit_id}` on the same branch and the merge hook is re-triggered. +4. On empty diff, the sub-step is a no-op and PhaseMerge proceeds to PhaseComplete. + +Configuration: +- `[harness] doc_sync = false` disables the sub-step entirely. +- `[harness] doc_sync_auto_approve = true` skips the user prompt and commits the diff directly. Off by default. + +--- + +## 11. Workspace Management + +### 11.1 Naming + +Workspace directories are derived from the unit identifier. The identifier MUST be sanitized: replace any character not in `[a-zA-Z0-9._-]` with `_`. This prevents path injection via issue identifiers containing slashes, `..`, or null bytes. + +### 11.2 Symlink-aware path containment + +Workspace path validation MUST use segment-by-segment canonicalization, not `filepath.EvalSymlinks` or `path.Clean` alone. A naive call can be defeated by a symlink that resolves outside the workspace root. + +Algorithm: + +``` +resolveCanonical(path): + segments = split(path) + resolved = root + for segment in segments: + candidate = join(resolved, segment) + stat = lstat(candidate) + if stat == symlink: + target = readlink(candidate) + # expand target relative to current resolved prefix + # restart segment walk from resolved target + elif stat == exists: + resolved = candidate + elif stat == ENOENT: + resolved = join(resolved, remaining segments) # path not yet created; OK + break + else: + return error + return resolved +``` + +After canonicalization, MUST assert `canonical_workspace` has `canonical_root + "/"` as a prefix. If it does not, reject with `ErrWorkspaceSymlinkEscape`. 
+ +For remote workers, the same check MUST be performed via a shell script that resolves each path segment before `mkdir`. + +### 11.3 Workspace lifecycle + +1. `after_create` — runs once when the workspace directory is first created. +2. `before_run` — runs before every attempt. Fatal if it fails. +3. `after_run` — runs after every attempt (success or failure). Best-effort. +4. `before_remove` — runs before the workspace is deleted. + +All hooks run in the workspace directory as the working directory. + +### 11.4 Local workspace creation + +``` +ensure_workspace(workspace): + if directory exists: + return (workspace, created=false) + if file exists at path: + rm -rf path + mkdir -p path + return (workspace, created=true) +``` + +### 11.5 Remote workspace creation + +For SSH workers, the orchestrator runs a shell script on the remote host that atomically creates and resolves the workspace, then echoes a tab-separated marker line: + +``` +printf '%s\t%s\t%s\n' '__SINGULARITY_WORKSPACE__' "$created" "$(pwd -P)" +``` + +The orchestrator parses this line from stdout to confirm the resolved canonical path. + +--- + +## 12. Worktree Isolation + +### 12.1 Modes + +```toml +[harness] +worktree_mode = "branch-per-slice" # or "milestone-per-worktree" +``` + +**`branch-per-slice`** (default): +- Each slice gets its own git branch (`sf/m{n}-s{n}-{slug}`) created from the current base. +- The harness emits `WorktreeCreate` before branch creation; the git service handles the actual `git worktree add`. +- After PostUnit hooks run, the git service merges the branch to the integration branch. The harness waits for the merge hook before marking the slice complete. +- Merge conflicts emit `MergeConflict`, which triggers `SignalPause`. + +**`milestone-per-worktree`**: +- A single worktree created for the entire milestone. +- All slices share that worktree. The git service commits incrementally. +- The worktree is merged at milestone PostUnit time. 
+ +### 12.2 Rules + +- The harness MUST emit `WorktreeCreate` and `WorktreeDelete` events. It MUST NOT call `git` directly. +- `worktree_mode` is session-immutable — changing it requires restart. + +### 12.3 Merge ordering for parallel slices + +When multiple slices in `branch-per-slice` mode complete concurrently, the harness MUST merge them in **dependency-aware** order, not completion order: + +1. A slice marked `code_depends_on: ["m1/s2"]` in unit metadata is held until that upstream slice's branch has merged. +2. With no declared code dependency, slices merge in `created_at` order. +3. The merge gate is serial: only one slice's merge runs at a time per project, even if multiple are eligible. + +This is distinct from `task_blockers` (task-completion dependency). **Code dependency** means slice B's diff cannot merge cleanly before slice A's diff. Without explicit declaration, the harness assumes no code dependency and merges in creation order. This can produce avoidable conflicts, which the next attempt is expected to resolve. + +--- + +## 13. Verification Gates + +### 13.1 Configuration + +```toml +[harness.gates] +post_slice = ["./gates/run-tests.sh", "./gates/lint.sh"] +post_milestone = ["./gates/integration-tests.sh"] +``` + +### 13.2 Execution rules + +- Gates run as subprocesses. The `UnitResult` JSON is passed via stdin. +- Exit 0 = pass. Any other exit code is handled per the gate script protocol in § 13.2.1 (`1` = fail, `2` = block, `3` = skip). +- A failure increments the gate-level retry counter (separate from `units.attempt`). The gate retry counter resets on the next phase transition. +- Default max gate retries: 3. Configurable per gate via `[harness.gates.max_retries.]`. +- On retry, the harness re-dispatches the same unit with gate failure output appended to context. The agent MUST see what failed and why. +- After max retries, the harness transitions to `PhaseReassess` and emits `GateBlocked` on pubsub. +- Gate results MUST be stored in the `gate_results` table and written as span events on the unit span.
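The retry rule can be read as a small decision function. The sketch below assumes that "after max retries" means the initial gate run plus up to `max_retries` re-dispatches; `GateAction` and `nextGateAction` are hypothetical names, and only the pass/fail case is modelled.

```go
package main

import "fmt"

// GateAction is a hypothetical label for what the harness does next.
type GateAction string

const (
	GateAdvance  GateAction = "advance"  // gate passed
	GateRetry    GateAction = "retry"    // re-dispatch with gate output in context
	GateReassess GateAction = "reassess" // retries exhausted: PhaseReassess
)

const defaultMaxGateRetries = 3

// nextGateAction decides what follows one gate run. retry is the gate-level
// retry counter (0 on the first run); it is separate from units.attempt.
func nextGateAction(passed bool, retry, maxRetries int) (GateAction, int) {
	if passed {
		return GateAdvance, retry
	}
	if retry < maxRetries {
		return GateRetry, retry + 1
	}
	return GateReassess, retry
}

func main() {
	// Simulate a gate that always fails.
	retry := 0
	for run := 1; ; run++ {
		action, next := nextGateAction(false, retry, defaultMaxGateRetries)
		fmt.Printf("run %d: %s\n", run, action)
		if action != GateRetry {
			break
		}
		retry = next
	}
}
```

Under this reading, a gate that never passes runs four times with the default setting before the unit moves to `PhaseReassess`.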
+ +### 13.2.1 Gate script protocol + +Every gate script MUST adhere to this contract. Implementations that violate any rule are rejected at startup validation. + +**Environment variables provided:** + +| Variable | Value | +|---|---| +| `SF_PROJECT_ROOT` | Absolute path to project root | +| `SF_HOME` | SF data directory (`~/.sf` or override) | +| `SF_UNIT_ID` | Active unit ID (§ 2 format) | +| `SF_RUN_ID` | Active run ULID | +| `SF_PHASE` | Phase name (e.g. `verify`) | +| `SF_ATTEMPT` | Attempt counter, 1-indexed | +| `SF_GATE_NAME` | This gate's name (script basename without extension) | +| `SF_GATE_RETRY` | Gate retry counter, 0-indexed | +| `SF_WORKSPACE` | Path of the unit's workspace | +| `SF_TRACE_FILE` | Path to current day's trace JSONL | + +**Stdin:** the `UnitResult` JSON struct (§ 10.2). UTF-8, single line, terminated with `\n`. + +**Exit code:** `0` = pass; `1` = fail (retry); `2` = block (do not retry, transition straight to PhaseReassess); `3` = skip (gate is not applicable for this unit). Other codes are treated as `1`. + +**Stdout / stderr:** captured combined, truncated at 8 KB, stored in `gate_results.output`. Multi-line is fine. No structured output is required, but if the first line is valid JSON of the form `{"summary": "...", "issues": [...]}` the harness uses it for richer reporting. + +**Timeout:** default 5 minutes per gate, configurable via `[harness.gates.timeouts.]`. Timeout = SIGTERM, then 10s grace, then SIGKILL; recorded as `error_code = "gate_timeout"`. + +**Cwd:** the workspace directory. Scripts MAY assume `git status` etc. work as expected. + +```go +type GateResult struct { + GateName string + UnitID string + Passed bool + Attempt int + MaxRetries int + Output string // combined stdout+stderr, truncated at 8KB + Duration time.Duration +} +``` + +### 13.3 PhaseReview — chunked review + +Large diffs MUST NOT be reviewed in a single pass. 
The harness MUST split the changed file list into chunks of ≤ 300 lines (`ReviewChunkLines = 300`) before dispatching the review agent. Files larger than `ReviewChunkLines` get their own chunk. + +To prevent context-blind review of cross-file changes, the harness runs three passes: + +1. **Establish-context pass (single dispatch, fast tier).** The agent receives the full diff summary (file list + first/last 20 lines of each) and produces a one-paragraph "what this change does and what to watch for" summary. +2. **Per-chunk review pass (parallel, `standard` tier).** Each chunk receives: the establish-context summary as a system-prompt prefix, then its own files. Reviewer findings are accumulated. Parallelism is bounded by `max_agents_by_phase.review`. +3. **Synthesis pass (single dispatch, `standard` tier).** All chunk findings are merged, deduplicated, and prioritised. The synthesis agent decides whether the review should pass, request changes, or block (security/correctness issue). + +The synthesis verdict is what the harness acts on — chunked passes alone never decide. + +### 13.4 Unit archive + +When a slice or milestone reaches `PhaseComplete`, the harness MUST move its artifact directory from `.sf/active/` to `.sf/archive/{YYYY-MM-DD}-{unit-id}/` atomically (rename, not copy+delete). + +`.sf/active/` holds only in-progress work. `.sf/archive/` is queried by `/sf history`. + +### 13.5 Reserved + +(`specs.check`, godoc enforcement on the harness package, is a sf CI requirement — see § 1 — not a runtime gate against user projects.) + +--- + +## 14. Configuration + +### 14.1 File locations and precedence + +1. `~/.sf/config.toml` — global defaults +2. `.sf/config.toml` — project overrides (takes precedence) + +Both files are TOML. Project overrides global on a per-key basis. 
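Per-key precedence can be illustrated with a recursive overlay of two already-decoded config trees. This is a sketch under assumptions: the TOML decoding step is omitted, `mergeConfig` is a hypothetical helper, and treating nested tables as merged key by key (rather than replaced wholesale) is one interpretation of "per-key basis".

```go
package main

import "fmt"

// mergeConfig returns global overlaid with project. Nested tables are merged
// recursively; any other project value replaces the global one. Inputs are
// not modified, though untouched nested tables are shared, not copied.
func mergeConfig(global, project map[string]any) map[string]any {
	out := make(map[string]any, len(global)+len(project))
	for k, v := range global {
		out[k] = v
	}
	for k, pv := range project {
		gt, gIsTable := out[k].(map[string]any)
		pt, pIsTable := pv.(map[string]any)
		if gIsTable && pIsTable {
			out[k] = mergeConfig(gt, pt)
			continue
		}
		out[k] = pv
	}
	return out
}

func main() {
	global := map[string]any{
		"harness": map[string]any{"unit_timeout": "10m", "doc_sync": true},
	}
	project := map[string]any{
		"harness": map[string]any{"unit_timeout": "20m"},
	}
	merged := mergeConfig(global, project)
	fmt.Println(merged["harness"])
}
```

Here the project file changes only `unit_timeout`; `doc_sync` keeps its global value because the project file never mentions it.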
+ +### 14.2 Canonical schema + +```toml +[harness] +context_compact_at = 0.80 +context_hard_limit = 0.95 +unit_timeout = "10m" # default per-attempt cap; can override per phase +turn_timeout = "5m" # bounds one model turn +stall_timeout = "2m" # AttemptStalled when no agent event for this long +tool_abort_grace = "5s" # cooperative cancel window before SIGTERM +tool_abort_kill = "3s" # SIGTERM-to-SIGKILL window +max_turns_per_attempt = 50 +max_attempts = 6 # exponential backoff before giving up +hot_cache_turns = 10 # in-memory recent-turn buffer +supervisor_interval = "10s" +max_retry_backoff = "5m" +doc_sync = true +turn_input_required = "soft" # or "hard" +worktree_mode = "branch-per-slice" + +[harness.unit_timeout_by_phase] +research = "30m" # AST analysis / spec reading can take real time +plan = "20m" +execute = "15m" +tdd = "10m" +verify = "10m" +review = "15m" +merge = "5m" +reassess = "20m" +uat = "0" # 0 = no timeout (UAT can take days; advance via /sf uat-approve) + +[harness.concurrency.max_agents_by_phase] +execute = 4 +tdd = 4 +verify = 10 # mostly reads — cheap +review = 4 # parallel chunked review (§ 13.3) +merge = 1 # serial per project (§ 12.3) + +[harness.concurrency] +max_agents = 10 # global cap; per-phase caps under [harness.concurrency.max_agents_by_phase] above + +[harness.auto_approve] +tools = ["bash:read", "fs:read", "git:status", "git:diff"] + +[harness.hooks] +pre_dispatch = ["./hooks/pre-dispatch.sh"] +post_unit = ["./hooks/post-unit.sh"] +after_create = "./hooks/after-create.sh" +before_run = "./hooks/before-run.sh" +after_run = "./hooks/after-run.sh" +before_remove = "./hooks/before-remove.sh" + +[harness.hooks.timeouts] # per-hook overrides; defaults in § 10.3 +before_run = "120s" +post_unit = "60s" +doc_sync = "5m" + +[providers] +# pi-ai provider settings live here. pi-ai is the multi-provider client; sf inherits all 20+ providers it supports. +# API keys MUST use vault:// (§ 24); plaintext is rejected at startup. 
+anthropic.api_key = "vault://secret/sf#anthropic_api_key" +openai.api_key = "vault://secret/sf#openai_api_key" + +[harness.gates] +post_slice = ["./gates/run-tests.sh"] +post_milestone = ["./gates/integration-tests.sh"] + +[harness.log] +path = "~/.sf/log/sf.log" +max_size = 10485760 # 10MB +max_files = 5 +stderr = false + +[server] +port = 7842 # 0 = ephemeral (tests) + +[memory] +mode = "embedded" # "embedded" (default) | "remote" +url = "http://memory.tailnet.local:7843" # required when mode = "remote" +api_key = "vault://secret/sf#sm_api_key" # required when mode = "remote" +# Embedded mode runs the singularity_memory_server engine in-process. +# Remote mode shares the server across the fleet (Hermes, OpenClaw, sf, etc.). + +[worker] +ssh_hosts = [] +max_concurrent_agents_per_host = 3 +ssh_auth_method = "agent" # "agent" | "key" | "key+agent" +ssh_identity_file = "~/.ssh/id_ed25519" # used for "key" or "key+agent" +ssh_known_hosts = "~/.ssh/known_hosts" # MUST verify; no auto-trust +ssh_disconnect_timeout = "30s" +host_quarantine = "5m" + +[routing] +research = "reasoning" +plan = "reasoning" +execute = "standard" +tdd = "standard" +verify = "fast" +review = "standard" +merge = "fast" +complete = "fast" +reassess = "reasoning" + +[tiers.fast] +models = ["claude-haiku-4-5", "gemini-flash-2.0"] + +[tiers.standard] +models = ["claude-sonnet-4-6", "gemini-2.0-pro"] + +[tiers.reasoning] +models = ["claude-opus-4-7", "o3"] +``` + +### 14.3 Dynamic reload + +The harness MUST poll `.sf/config.toml` on every orchestrator tick using a `{mtime, size, content_hash}` stamp. `content_hash` is SHA-256 of the file bytes. + +When the stamp changes: +- Re-parse and re-validate. +- On success: apply changes immediately to future dispatch, concurrency limits, and hook lists. In-flight runs are NOT interrupted. +- On failure (parse error, validation error): log error at WARNING level, keep last known good config. MUST NOT crash. 
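
A minimal sketch of the `{mtime, size, content_hash}` stamp described in § 14.3, assuming the hash is stored hex-encoded. `ConfigStamp`, `newStamp`, `stampFile`, and `Changed` are illustrative names, not part of the spec. The hash matters because an edit can keep both the size and (on coarse filesystems) the mtime unchanged.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"time"
)

// ConfigStamp is the triple compared on every orchestrator tick.
type ConfigStamp struct {
	ModTime     time.Time
	Size        int64
	ContentHash string // hex-encoded SHA-256 of the file bytes
}

// newStamp builds a stamp from a modification time and the raw file bytes.
func newStamp(modTime time.Time, data []byte) ConfigStamp {
	sum := sha256.Sum256(data)
	return ConfigStamp{
		ModTime:     modTime,
		Size:        int64(len(data)),
		ContentHash: hex.EncodeToString(sum[:]),
	}
}

// stampFile stats and reads path, then stamps the result.
func stampFile(path string) (ConfigStamp, error) {
	info, err := os.Stat(path)
	if err != nil {
		return ConfigStamp{}, err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return ConfigStamp{}, err
	}
	return newStamp(info.ModTime(), data), nil
}

// Changed reports whether any component of the stamp differs from prev.
func (s ConfigStamp) Changed(prev ConfigStamp) bool {
	return !s.ModTime.Equal(prev.ModTime) || s.Size != prev.Size || s.ContentHash != prev.ContentHash
}

func main() {
	path := ".sf/config.toml"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	stamp, err := stampFile(path)
	if err != nil {
		// Mirrors the failure rule: report and keep the last known good config.
		fmt.Println("stamp failed, keeping last known good config:", err)
		return
	}
	fmt.Printf("size=%d sha256=%s\n", stamp.Size, stamp.ContentHash)
}
```

A reload loop would keep the previous stamp, call `stampFile` on each tick, and re-parse only when `Changed` returns true.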
+ +The following fields are session-immutable even with dynamic reload enabled: +- `worktree_mode` +- `context_compact_at` +- `context_hard_limit` + +Changing session-immutable fields requires restart. **If a dynamic reload detects a changed session-immutable field, the harness MUST**: + +1. Log a warning naming the field, old value, new value. +2. Continue using the in-process value for the current session. +3. Display the change in `/sf status` as "config drift detected — restart to apply: ". +4. NOT crash and NOT auto-restart. + +### 14.4 Startup validation + +The harness MUST validate config at startup and MUST fail fast with a descriptive error on invalid config. It MUST NOT silently ignore unknown keys or bad values. `/sf doctor` MUST run `HarnessConfig.Validate()` as one of its checks. + +### 14.5 Plan.md format + +Every active unit has a `.sf/active/{unit-id}/plan.md` written by `PhasePlan` and consumed by all subsequent phases. The format is: + +```markdown +--- +unit_id: task/m1/s2/t3 +created_at: 2026-04-29T14:22:00Z +written_by: claude-sonnet-4-6 +plan_version: 1 +--- + +# Goal + + + +# Approach + +<2-3 paragraphs: how the agent intends to do it> + +# Deliverables + +- [ ] +- [ ] <…> + +# Verification + +- +- <…> + +# Notes + + +``` + +The frontmatter `plan_version` increments on each PhaseReassess→Re-plan. Subsequent phases parse the frontmatter to detect plan version changes (informational; not load-bearing). + +The harness MUST validate that `plan.md` parses as Markdown with the required frontmatter fields before allowing a transition out of `PhasePlan`. Missing `# Goal` or `# Deliverables` sections fail the phase. 
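
The `plan.md` check in § 14.5 could look roughly like the sketch below. It assumes all four frontmatter keys shown in the template are required (the spec says "required frontmatter fields" without listing them) and only tests for presence, not for valid values. `validatePlan` is a hypothetical name, and the heading scan is naive: it does not skip fenced code blocks.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed to be required; the spec shows these keys but does not enumerate them.
var requiredFields = []string{"unit_id", "created_at", "written_by", "plan_version"}

// The two sections whose absence fails the phase.
var requiredSections = []string{"# Goal", "# Deliverables"}

// validatePlan checks frontmatter keys and required headings in a plan.md body.
func validatePlan(src string) error {
	lines := strings.Split(src, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return fmt.Errorf("plan.md: frontmatter must start on the first line")
	}
	fields := map[string]bool{}
	end := -1
	for i := 1; i < len(lines); i++ {
		if strings.TrimSpace(lines[i]) == "---" {
			end = i
			break
		}
		if key, _, ok := strings.Cut(lines[i], ":"); ok {
			fields[strings.TrimSpace(key)] = true
		}
	}
	if end < 0 {
		return fmt.Errorf("plan.md: frontmatter is not terminated")
	}
	for _, f := range requiredFields {
		if !fields[f] {
			return fmt.Errorf("plan.md: missing frontmatter field %q", f)
		}
	}
	headings := map[string]bool{}
	for _, l := range lines[end+1:] {
		headings[strings.TrimSpace(l)] = true
	}
	for _, s := range requiredSections {
		if !headings[s] {
			return fmt.Errorf("plan.md: missing section %q", s)
		}
	}
	return nil
}

func main() {
	plan := "---\nunit_id: task/m1/s2/t3\ncreated_at: 2026-04-29T14:22:00Z\n" +
		"written_by: claude-sonnet-4-6\nplan_version: 1\n---\n\n# Goal\n\nAdd OAuth.\n"
	// This plan has no "# Deliverables" section, so validation fails.
	fmt.Println(validatePlan(plan))
}
```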
+ +### 14.6 Project directory layout + +Every project has a `.sf/` directory with this canonical layout: + +``` +/ +├── .sf/ +│ ├── config.toml # project config (§ 14.1) +│ ├── workflows/ # workflow templates (§ 4.5) +│ │ ├── feature.toml +│ │ └── spike.toml +│ ├── hooks/ # hook scripts referenced by config +│ ├── gates/ # gate scripts referenced by config +│ ├── sf.db # SQLite orchestration DB +│ ├── run.lock # process lock (§ 4.7) +│ ├── auto.lock # signals auto-mode active (§ 4.7) +│ ├── active/ # in-progress unit artifacts +│ │ └── {unit-id}/ # one directory per active unit +│ │ ├── plan.md # unit's plan/notes +│ │ ├── workspace -> /path # symlink to actual workspace +│ │ └── run-{run-id}.log # per-run log +│ ├── archive/ # completed work + age-rolled artifacts +│ │ ├── {YYYY-MM-DD}-{unit-id}/ # one per completed unit +│ │ ├── agents/ # rolled agent_inbox/messages +│ │ └── lost-learnings.jsonl # pending_retain ages out here (§ 16.1) +│ ├── log/ +│ │ └── sf.log # rolling structured log (§ 19.2) +│ ├── runtime/ +│ │ ├── paused-session.json # written when SessionPaused +│ │ ├── gate-state.json # last gate result per unit +│ │ └── server.port # actual HTTP API port (§ 14.2) +│ └── trace/ +│ ├── trace-{YYYY-MM-DD}.jsonl # daily-rotated spans +│ └── _meta.json # trace schema version, file index +``` + +Layout is stable: `/sf revert`, `/sf history`, archive sweeps, and the HTTP API all assume these exact paths. + +--- + +## 15. Model Routing + +### 15.1 Three tiers + +The tier names are fixed: `fast`, `standard`, `reasoning`. Custom tier names are NOT supported — adding a tier would force changes in routing config, complexity-upgrade logic, and the rate-feedback fingerprint, with little benefit. Each tier holds multiple candidate models in `[tiers.]`. The router picks within the tier; it does not change the tier assignment. + +### 15.2 Phase → tier mapping + +Static, config-driven (see § 14.2 `[routing]` table). 
The harness MUST apply the phase-to-tier mapping before each dispatch. The agent MUST NOT influence this mapping. + +The harness MUST set `Think: true` on the model config for phases mapped to the `reasoning` tier. + +### 15.3 Complexity upgrade + +A classifier runs at dispatch time and derives a complexity score from file count, scope breadth, and cross-cutting changes. If the score crosses a configurable threshold, the tier bumps one level (fast→standard, standard→reasoning). The fingerprint and upgrade decision MUST be stored in SQLite for future routing decisions. + +### 15.4 Within-tier selection + +Within a tier, the router picks the model with the highest benchmark score: + +``` +score = quality * 0.6 + (1 - normalised_latency) * 0.2 + (1 - normalised_cost) * 0.2 +``` + +Weights are configurable. If no benchmark data exists for the current fingerprint, use the tier's first model. + +Models with a tripped circuit breaker (§ 9.3) MUST be skipped. + +### 15.5 `/sf rate` feedback loop + +Two signal sources: + +- **Auto-mode** — the agent self-evaluates at unit close: `over` / `ok` / `under` relative to the phase objective. No human in the loop. +- **Interactive mode** — human signals `over` / `ok` / `under` after reviewing unit output. + +Both write to `benchmark_results`. Human ratings carry higher weight than LLM self-ratings (configurable multiplier, default 3×). + +Score mappings: `over=0.3` (over-resourced), `ok=0.8`, `under=0.0` (blocks model for this fingerprint). + +--- + +## 16. Knowledge Layer + +### 16.1 Architecture + +The knowledge layer is **Singularity Memory** (`sm`) — an HTTP + MCP server we own at [`singularity-ng/singularity-memory`](https://github.com/singularity-ng/singularity-memory). The engine was derived from [`vectorize-io/hindsight`](https://github.com/vectorize-io/hindsight) (MIT) and assimilated into `singularity_memory_server/` under our namespace; from sf's perspective there is no upstream service. 
The same `sm` server is shared across our agent fleet (Hermes, OpenClaw, Claude Code, Cursor, sf), so memories accumulate across tools. + +sf uses [`github.com/singularity-ng/singularity-memory-client-go`](https://github.com/singularity-ng/singularity-memory-client-go), auto-generated from the OpenAPI document published by the running sm server (`/openapi.json`). There is no local vector store, no sqlite-vec table, no FTS5 fallback — all retrieval and persistence go through `sm`. + +**Embedded vs remote deployment.** sm supports both modes: + +| Mode | When | Config | +|---|---|---| +| **Embedded** (default for single-user sf) | sm engine runs in-process; no extra service to operate | `[memory] mode = "embedded"` | +| **Remote** | sm runs as a tailnet service shared across multiple tools/users | `[memory] mode = "remote"`, `[memory] url = "http://memory.tailnet.local:7843"` | + +Embedded mode eliminates the network hop for the common case. Switching to remote shares context across the fleet at the cost of a network round-trip per recall. + +SQLite in sf holds **orchestration state only** (sessions, units, blockers, gates, benchmarks, circuit breakers, agents). Memories, learnings, anti-patterns, and codebase context live in Singularity Memory. + +When `sm` is unreachable, the harness MUST log a warning and dispatch with no recall context (plus the local `local_anti_patterns` mirror, § 3.1). The agent still runs; it just lacks historical memory for that session. The harness MUST NOT block dispatch on memory availability. + +**Retain failures queue locally.** PostUnit retain calls that fail (transport error, 5xx) MUST be enqueued in `pending_retain` and retried with exponential backoff on every poll tick until success. 
This means a unit's learnings are never silently lost to an `sm` outage: + +```sql +CREATE TABLE pending_retain ( + id TEXT PRIMARY KEY, -- ULID + bank TEXT NOT NULL, + payload TEXT NOT NULL, -- serialised retain request + attempts INTEGER NOT NULL DEFAULT 0, + next_retry_at INTEGER NOT NULL, + last_error TEXT, + created_at INTEGER NOT NULL +); +``` + +`pending_retain` rows older than 7 days are flushed to `.sf/archive/lost-learnings.jsonl` and removed; at that point the operator is expected to investigate. + +### 16.1.1 Memory client interface + +The harness uses `github.com/singularity-ng/singularity-memory-client-go` (auto-generated from the sm server's `/openapi.json`) through a thin wrapper that the rest of the codebase depends on. This wrapper is the seam between sf and Singularity Memory; tests substitute a fake. + +```go +type Memory interface { + // Recall fetches top-k entries from a bank for a query. opts.Filter + // may include {"collection": "anti_patterns"} or other tags. + Recall(ctx context.Context, bank string, query string, opts RecallOpts) ([]Entry, error) + + // Retain stores a new entry in a bank. document_id is required for + // upsert-by-content-hash semantics (§ 16.3). + Retain(ctx context.Context, bank string, entry Entry) error + + // Feedback signals helpfulness of an entry recalled in this dispatch. + // signal ∈ {-1, 0, +1}; +1 resets decay timer. + Feedback(ctx context.Context, entryID string, signal int) error + + // Validate marks the entry as still-relevant (resets decay timer). + // Called by PostUnit when a recalled entry directly contributed to success. + Validate(ctx context.Context, entryID string) error + + // Health probe. Used by /sf doctor and the retain queue. 
+ Health(ctx context.Context) error +} + +type RecallOpts struct { + TopK int + Filter map[string]string + RerankQuality string // "fast" | "accurate" +} + +type Entry struct { + DocumentID string // content hash; upsert key + Content string + Tags []string + Metadata map[string]string // includes maturity, decay_factor, etc. + Score float64 // populated on Recall, ignored on Retain +} +``` + +The wrapper is responsible for: +1. Translating sf's `last_error` and gate output into `Entry.Content`. +2. Adding `is_negative` and `collection` tags appropriately. +3. Routing transport errors through `pending_retain` (§ 16.1). +4. Exposing the local `local_anti_patterns` mirror to `Recall` when `sm` is unreachable. + +### 16.2 Memory tiers + +Two tiers prevent token bloat during long-running sessions: + +**Hot cache** — current dispatch's recent turns held in memory (never persisted to SQLite). Configurable size: `[harness] hot_cache_turns = 10`. Cleared on compaction. + +**Singularity Memory store** — durable. PostUnit writes summaries, learnings, and anti-patterns. Pre-dispatch reads top-N most relevant entries. On compaction, the hot cache is summarised and written to Singularity Memory as a `session_summary` entry. + +The harness MUST NOT mix the two tiers. + +### 16.3 Two-bank pattern + +Each session uses two Singularity Memory banks, queried separately and merged before each dispatch: + +```go +projectRecall := sm.Recall("project/"+projectHash, query) +globalRecall := sm.Recall("global/coding", query) +// merge, deduplicate, inject top-N into unit context +``` + +`projectHash` is derived deterministically (so the same project hits the same bank from any machine): + +1. If the project root is a git repository, `projectHash = sha256(canonical_remote_url)[:16]` where canonical_remote_url is the `origin` URL normalised (strip auth, lowercase host, drop trailing `.git`). +2. If no git remote, `projectHash = sha256(absolute_path_with_real_user_home)[:16]`. +3. 
The resolved hash is cached in `.sf/runtime/project-hash.json` to ensure stability if the remote changes (a cleared cache forces re-derivation; a project move under a different remote is a deliberate re-bank). + +This means a developer cloning the repo on a second machine hits the same Singularity Memory bank as their first machine. Different forks of the same project have different remotes and thus different banks — desired, because their context diverges. + +Concurrent `retain` calls from parallel slice workers use `document_id` derived from content hash. Duplicate memories silently overwrite rather than accumulate. + +### 16.4 Anti-pattern library + +Anti-patterns are memories tagged `collection: anti_patterns`, `is_negative: true`. They: +- Are written explicitly when the agent makes a mistake (gate failure or user feedback). +- MUST NOT be subject to normal maturation decay — they persist at full weight until explicitly removed. +- Are retrieved at dispatch time and presented in a dedicated block: `avoid these mistakes...`. +- MUST also be mirrored to the local `local_anti_patterns` SQLite table (§ 3.1) on `retain`. When Singularity Memory is unreachable, the harness still injects local anti-patterns into prompt context. Anti-patterns are small, high-value, and never decay — making them the one knowledge category worth duplicating locally. + +```go +type AntiPattern struct { + ID string + Description string // what went wrong + Context string // when/where this applies + CorrectPath string // what to do instead + SourceUnit string + CreatedAt time.Time +} +``` + +### 16.5 Pattern maturation + +| State | Condition | Retrieval weight | +|---|---|---| +| `candidate` | < 3 observations | 0.5× | +| `established` | ≥ 3 obs, harmful ratio < 30% | 1.0× | +| `proven` | decayed helpful score ≥ 5, harmful ratio < 15% | 1.5× | +| `deprecated` | harmful ratio > 30% | 0× (excluded) | + +After 3 failed uses, content is prefixed `AVOID:` and flagged `is_negative: true`. 
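
The maturation table in § 16.5 can be read as a small classifier. The sketch below is one possible reading, not the spec's definition: the precedence of the checks (candidate first, then deprecated, then proven) and the treatment of a harmful ratio of exactly 30% as `established` are assumptions, since the table leaves both open. `classify` and `weight` are hypothetical names.

```go
package main

import "fmt"

type Maturity string

const (
	Candidate   Maturity = "candidate"
	Established Maturity = "established"
	Proven      Maturity = "proven"
	Deprecated  Maturity = "deprecated"
)

// classify maps observation counts and scores to a maturity state.
// helpfulScore is the decayed helpful score; harmfulRatio is in [0, 1].
func classify(observations int, helpfulScore, harmfulRatio float64) Maturity {
	switch {
	case observations < 3:
		return Candidate // too little evidence to judge either way (assumed precedence)
	case harmfulRatio > 0.30:
		return Deprecated
	case helpfulScore >= 5 && harmfulRatio < 0.15:
		return Proven
	default:
		return Established
	}
}

// weight returns the retrieval multiplier for a state; 0 means excluded.
func weight(m Maturity) float64 {
	switch m {
	case Candidate:
		return 0.5
	case Established:
		return 1.0
	case Proven:
		return 1.5
	default:
		return 0
	}
}

func main() {
	m := classify(12, 6.2, 0.10)
	fmt.Println(m, weight(m))
}
```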
+ +### 16.6 Confidence decay + +``` +halfLife = 90 * (0.5 + confidence) // days; confidence ∈ [0.0, 1.0] +decayFactor = 0.5 ^ (ageInDays / halfLife) +finalScore = similarityScore * decayFactor +``` + +Memory access tiers: **hot** (accessed within 7 days), **warm** (within 30 days), **cold/stale** (older). + +Entries with 10+ accesses gain a 7-day buffer against decay. Calling `validate()` when a memory directly aids task completion resets the decay timer. + +### 16.7 Retrieval pipeline + +Retrieval is delegated to Singularity Memory via `sm.Recall(bank, query, opts)`. Singularity Memory runs its own internal pipeline — fused semantic + lexical retrieval, optional reranking, and decay weighting — and returns ranked entries. The harness does not implement a retrieval pipeline of its own. + +Recall options the harness uses: + +| Option | Use | +|---|---| +| `top_k` | Number of entries to inject into prompt (default 5) | +| `bank` | `project/{hash}` or `global/coding` (§ 16.3) | +| `filter` | Tag filters (e.g. `collection=anti_patterns`) | +| `rerank_quality` | `fast` (routine) or `accurate` (pre-dispatch context injection) | + +The harness applies its own maturity and anti-pattern weighting (§ 16.4, § 16.5) by tagging entries on retain and filtering / re-ordering on recall — Singularity Memory stores the metadata but does not interpret it. + +### 16.8 `sf init` + +Deep analysis is default, not opt-in: + +1. AST-level codebase scan (languages, structure, entry points, dependencies). +2. Git history analysis (active areas, recent changes, contributors). +3. Retain findings into the `project/{hash}` Singularity Memory bank. +4. Establish `.sf/config.toml` with detected stack, workflow templates, model routing hints. + +`--quick` flag skips Singularity Memory indexing for throwaway sessions. + +--- + +## 17. Persistent Agents + +### 17.1 Agent vs unit + +A **unit** is ephemeral work created by `/sf plan` (or `.sf/plan.md`) and driven through the phase state machine (§ 4). 
It is archived on completion. + +A **persistent agent** is a named, long-lived identity: it has its own memory blocks, system prompt, and message history. It sleeps at zero cost when idle and wakes when its inbox receives a message or an explicit `/sf agent run ` is issued. + +**A persistent agent run is NOT a unit.** Specifically: + +| Aspect | Unit | Persistent agent run | +|---|---|---| +| Source of work | User goal via `/sf plan` (§ 3.3) | Inbox message or explicit `/sf agent run` | +| Phase state machine | YES | NO | +| Verification gates | YES | NO | +| Workflow templates | YES | NO | +| PostUnit hooks | YES | NO (replaced by `PostAgentRun`) | +| `before_run` / `after_run` workspace hooks | YES | YES (shared lifecycle) | +| Supervisor checks (StuckLoop, AbandonDetect, BudgetWarning) | YES | YES | +| Crash recovery | re-dispatch from last phase | re-deliver undelivered inbox | +| Budget instance | fresh per attempt | persistent across runs (until reset) | + +What they share: the worker attempt lifecycle (§ 6) — workspace creation, `before_run` hook, agent session, turn loop, `after_run` hook — is identical. The supervisor goroutine monitors agent runs and unit attempts with the same checks. The trace records both as runs with distinct `run_kind` attributes. + +### 17.2 Memory block injection + +At dispatch time, the harness MUST render the agent's memory blocks into the system prompt: + +```xml + + {{value}} + {{value}} + {{value}} + +``` + +### 17.3 Built-in memory tools + +| Tool | Signature | Effect | +|---|---|---| +| `core_memory_append` | `(label string, content string)` | Appends content to block, respects `char_limit` | +| `core_memory_replace` | `(label string, old string, new string)` | Replaces substring in block | + +Both tools MUST write to `agent_memory_blocks` in SQLite before the next turn is dispatched. A crash mid-session MUST preserve the updated block state. 
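
A rough in-memory sketch of the two memory tools in § 17.3. Several behaviours here are assumptions the spec does not fix: an append that would exceed `char_limit` is rejected rather than truncated, appended content is joined with a newline, the limit is counted in runes, and `core_memory_replace` touches only the first occurrence. Persisting to `agent_memory_blocks` before the next turn is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

var (
	ErrCharLimit = errors.New("memory block: char_limit exceeded")
	ErrNotFound  = errors.New("memory block: substring not found")
)

// MemoryBlock is a single labelled block owned by one agent.
type MemoryBlock struct {
	Label     string
	Value     string
	CharLimit int
}

// Append adds content to the block, rejecting writes that would exceed CharLimit.
func (b *MemoryBlock) Append(content string) error {
	next := content
	if b.Value != "" {
		next = b.Value + "\n" + content
	}
	if utf8.RuneCountInString(next) > b.CharLimit {
		return ErrCharLimit
	}
	b.Value = next
	return nil
}

// Replace swaps the first occurrence of old for replacement.
func (b *MemoryBlock) Replace(old, replacement string) error {
	if old == "" || !strings.Contains(b.Value, old) {
		return ErrNotFound
	}
	next := strings.Replace(b.Value, old, replacement, 1)
	if utf8.RuneCountInString(next) > b.CharLimit {
		return ErrCharLimit
	}
	b.Value = next
	return nil
}

func main() {
	b := &MemoryBlock{Label: "persona", CharLimit: 40}
	fmt.Println(b.Append("prefers small diffs"))
	fmt.Println(b.Replace("small", "reviewable"))
	fmt.Println(b.Value)
}
```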
+ +### 17.4 Agent lifecycle + +```go +type AgentState int + +const ( + AgentIdle AgentState = iota // no pending messages, not running + AgentRunning // dispatched, consuming tokens + AgentWaiting // sent a message to another agent, awaiting reply + AgentStopped // explicitly stopped; will not wake automatically +) +``` + +The harness owns all state transitions. The agent loop MUST NOT write `AgentState` directly. + +### 17.5 Agent run termination + +A persistent agent run terminates when ANY of: + +1. **Inbox drained.** The agent's inbox has no `delivered = 0` rows AND the agent's last turn produced no outgoing `send_message` requiring `wait_for_reply`. +2. **Explicit stop.** The agent calls a built-in `stop()` tool, signalling it has no further work. +3. **Budget exhausted.** Per-agent `Budget.AtHardLimit()` fires (§ 8). Compaction does NOT terminate the run; only hard-limit does. +4. **Turn cap.** `max_turns_per_run = 100` (configurable per-agent via `agents.max_turns_per_run` column or `[harness] agent_max_turns_per_run`). Higher than unit cap because agents are long-running. +5. **Supervisor signal.** `SignalAbort` for any reason (StuckLoop, AbandonDetect, ReconciliationCancel does not apply to agents). +6. **Timeout.** A configurable `agent_run_timeout = "30m"` from run start. + +On termination the agent transitions to `AgentIdle` (or `AgentStopped` for case 2). On wake (next inbox message), a NEW run begins — the **agent's hot cache is NOT preserved across runs**; only the durable memory blocks (`agent_memory_blocks`) and message history (`agent_messages`) survive. + +### 17.6 Agent fleet supervision + +Each persistent agent has its own `Budget` instance (§ 8) that persists across runs and is reset only on explicit `/sf agent reset `. Compaction fires per-agent — when one agent's budget hits the compact threshold, only its hot cache is summarised; other agents are unaffected. 
+ +Crash recovery for agents differs from unit recovery (§ 4.7): on restart, each agent's `agent_inbox` is rescanned for `delivered = 0` rows. Any such rows trigger an immediate `AgentWake` — the agent resumes processing the queue. There is no phase to resume; the inbox IS the resumption state. + +The trace records each agent run as a separate root span with `run_kind = "agent"` and `agent_id = `. `/sf session-report` breaks down spend by agent. + +--- + +## 18. Inter-Agent Messaging + +### 18.1 `send_message` tool + +```go +// Tool the agent calls: +// send_message(to: string, message: string) -> void +// +// to: agent name or agent ID +// message: plain text; the receiving agent sees it as a "user" role message +``` + +When called, the harness MUST: +1. Insert a row into `agent_inbox` for the target agent. +2. Emit an `AgentWake` pubsub event for the target agent. +3. Record the message in `agent_messages` for both sender and receiver. + +### 18.2 Wake rules + +- An `AgentIdle` agent that receives `AgentWake` MUST start a new dispatch cycle immediately. +- An `AgentRunning` agent queues the message for its next dispatch cycle. +- Undelivered inbox messages MUST be prepended to the context as `user` role messages in arrival order at the start of each dispatch, then marked `delivered = 1`. + +### 18.3 `wait_for_reply` + +An agent calling `wait_for_reply(ticket_id)` transitions to `AgentWaiting`. The harness suspends its dispatch loop until the target agent sends a reply or a configurable timeout elapses. + +`wait_for_reply` has a mandatory timeout. The harness MUST NOT block indefinitely. + +### 18.4 Agent handoff + +`handoff(to, context)` transfers the active task to a specialist agent. `to` is either an agent name (exact match) or a capability tag string (e.g. `"capability:go"` or `"capability:sql,perf"`): + +1. 
**Resolution.** If `to` starts with `capability:`, the harness queries `agents` for an active agent (`archived_at IS NULL`, `state != 'stopped'`) whose `capabilities` JSON array includes ALL listed tags. If multiple match, the one with the lowest `last_active` wins (round-robin). If none match, `handoff` returns `ErrNoCapableAgent`. +2. **Suspension.** The calling agent's current run is suspended (not completed). +3. **Context delivery.** The target agent receives the full task context (system prompt, memory blocks at handoff time, last N messages) pre-loaded as a snapshot in its inbox. +4. **Wait.** The calling agent transitions to `AgentWaiting` until the specialist replies (subject to `wait_for_reply` timeout). +5. **Fallback.** If the target agent is not found or is `AgentStopped`, `handoff` returns an error and the calling agent continues. + +```go +// Tool the agent calls: +// handoff(to: string, context: string) -> HandoffTicket +// Agent calls wait_for_reply(ticket.id) to block until the specialist responds. +// +// to formats: +// "go-specialist" — exact agent name +// "capability:go" — first eligible agent with capability tag "go" +// "capability:sql,perf" — agent with both "sql" AND "perf" tags +``` + +Capability matching is the recommended form — it lets the agent fleet evolve without changing handoff call sites. + +### 18.5 Append-only inbox log + +`agent_inbox` MUST be append-only. Rows MUST NOT be deleted after insert. `delivered` is the only mutable column. This gives a complete audit trail of all inter-agent communication. + +Inbox and message tables are subject to a periodic GC sweep: rows with `delivered = 1` and `created_at < now() - retain_window` are moved to `.sf/archive/agents/{agent_id}/inbox-{YYYY-MM}.jsonl` and deleted from the live tables. Default `retain_window = 30d`, configurable via `[harness] agent_inbox_retain = "30d"`. The archive is human-readable and queryable by `/sf agent history`. 
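
For reference, the `handoff` target resolution from § 18.4 can be sketched over an in-memory agent list. The real harness would query the `agents` table; `Agent`, `resolveHandoff`, and `ErrAgentUnavailable` are illustrative names (only `ErrNoCapableAgent` appears in the spec), and `LastActive` is assumed to be a Unix timestamp.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var (
	ErrNoCapableAgent   = errors.New("no_capable_agent")
	ErrAgentUnavailable = errors.New("agent_unavailable") // hypothetical code for the name-lookup fallback
)

type Agent struct {
	Name         string
	State        string // "idle" | "running" | "waiting" | "stopped"
	Archived     bool
	Capabilities []string
	LastActive   int64 // assumed Unix seconds
}

// eligible mirrors "archived_at IS NULL, state != 'stopped'".
func (a Agent) eligible() bool { return !a.Archived && a.State != "stopped" }

// hasAll reports whether the agent carries every requested tag.
func (a Agent) hasAll(tags []string) bool {
	have := map[string]bool{}
	for _, c := range a.Capabilities {
		have[c] = true
	}
	for _, t := range tags {
		if !have[t] {
			return false
		}
	}
	return true
}

// resolveHandoff picks the target for an exact name or a "capability:" tag list.
func resolveHandoff(to string, agents []Agent) (Agent, error) {
	if !strings.HasPrefix(to, "capability:") {
		for _, a := range agents {
			if a.Name == to && a.eligible() {
				return a, nil
			}
		}
		return Agent{}, ErrAgentUnavailable
	}
	tags := strings.Split(strings.TrimPrefix(to, "capability:"), ",")
	for i := range tags {
		tags[i] = strings.TrimSpace(tags[i])
	}
	var best *Agent
	for i := range agents {
		a := &agents[i]
		if !a.eligible() || !a.hasAll(tags) {
			continue
		}
		// Lowest last_active wins, so the least recently used agent is chosen.
		if best == nil || a.LastActive < best.LastActive {
			best = a
		}
	}
	if best == nil {
		return Agent{}, ErrNoCapableAgent
	}
	return *best, nil
}

func main() {
	agents := []Agent{
		{Name: "go-specialist", State: "idle", Capabilities: []string{"go"}, LastActive: 50},
		{Name: "db-tuner", State: "idle", Capabilities: []string{"go", "sql", "perf"}, LastActive: 10},
	}
	a, err := resolveHandoff("capability:sql,perf", agents)
	fmt.Println(a.Name, err)
}
```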
+ +### 18.6 Memory block concurrency + +An agent's memory blocks are owned by that agent — they are NEVER shared with other agents (§ 18.7). Within a single agent, a turn's tool calls execute serially (one tool at a time), so two `core_memory_*` writes within a turn cannot race. Across turns, the harness commits the prior turn's writes before dispatching the next turn (§ 17.3). + +`handoff` does NOT share blocks — the receiving agent gets its own blocks. The `context` argument of `handoff` is a snapshot, not a reference. + +### 18.7 What not to build + +- **Shared memory** — agents MUST NOT share memory blocks. If two agents need a common fact, one sends it as a message. +- **Broadcast** — there is no `send_message_all`. Routing MUST be explicit. +- **Synchronous RPC** — `send_message` is fire-and-forget. `wait_for_reply()` is explicit and has a timeout. + +--- + +## 19. Observability + +### 19.1 Structured log format + +All harness log lines MUST use stable `key=value` pairs. Required context fields: + +| Scope | Required fields | +|---|---| +| Any unit-related log | `unit_id=`, `unit_type=` | +| Agent session lifecycle | `session_id=`, `turn_count=` | +| Phase transitions | `from=`, `to=`, `reason=` | +| Gate execution | `gate=`, `attempt=`, `passed=` | + +Include action outcome in the message: `completed`, `failed`, `retrying`, `canceled`. MUST NOT log large raw payloads — truncate hook output at 2 KB and append `(truncated)`. + +### 19.2 Log rotation + +- Max file size: 10 MB. +- Max rotating files: 5. +- Single-line format — no multi-line log entries. +- When file logging is configured, the default stderr handler MUST be removed (logs to file only). +- Default path: `~/.sf/log/sf.log`. 
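
The single-line `key=value` format and the 2 KB truncation rule from § 19.1 and § 19.2 might be implemented along these lines. The `msg=` key, the quoting rule for values containing spaces or newlines, and the names `Field`, `formatLine`, and `truncatePayload` are assumptions of this sketch. Truncation is byte-based, so it can split a multi-byte UTF-8 character.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const maxPayloadBytes = 2048 // 2 KB cap on raw payloads such as hook output

// Field is one key=value pair; a slice keeps the field order stable.
type Field struct{ Key, Value string }

// truncatePayload caps s at 2 KB and marks the cut.
func truncatePayload(s string) string {
	if len(s) <= maxPayloadBytes {
		return s
	}
	return s[:maxPayloadBytes] + " (truncated)"
}

// quoteIfNeeded quotes values that would break key=value parsing or span lines.
func quoteIfNeeded(v string) string {
	if v == "" || strings.ContainsAny(v, " \t\r\n\"=") {
		return strconv.Quote(v) // escapes newlines, keeping the entry on one line
	}
	return v
}

// formatLine renders one log entry as a single line of key=value pairs.
func formatLine(msg string, fields ...Field) string {
	var b strings.Builder
	b.WriteString("msg=" + strconv.Quote(msg))
	for _, f := range fields {
		b.WriteString(" " + f.Key + "=" + quoteIfNeeded(f.Value))
	}
	return b.String()
}

func main() {
	hookOutput := strings.Repeat("FAIL test_login\n", 400)
	fmt.Println(formatLine("gate failed",
		Field{"unit_id", "task/m1/s2/t3"},
		Field{"gate", "run-tests"},
		Field{"attempt", "2"},
		Field{"passed", "false"},
		Field{"output", truncatePayload(hookOutput)},
	))
}
```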
+ +### 19.3 Spans and trace + +```go +type Span struct { + TraceID string + SpanID string + Operation string // "tool_call" | "phase_transition" | "model_request" | "hook" + StartedAt time.Time + Duration time.Duration + Attrs map[string]any + Error error +} +``` + +- Every tool call, phase transition, model request, and hook execution MUST emit a span. +- Spans MUST be written to `/.sf/trace/trace-{YYYY-MM-DD}.jsonl` (rolls at local-midnight on first span emission after midnight). +- Span emission MUST be non-blocking — use a buffered channel with a background writer goroutine. +- MUST NOT drop spans. If the buffer is full, block briefly rather than discard. +- The first line of each daily file MUST be a `_meta` record: + ```json + {"_meta":true,"trace_schema_version":1,"sf_version":"","created_at":""} + ``` + Readers branch on `trace_schema_version`. Future schema changes bump the version; no in-place migration of historical files. + +### 19.3.1 Trace index for forensics + +JSONL is the source of truth for spans, but `/sf forensics` queries demand fast access to specific runs/units/sessions. The harness MUST maintain a small SQL index alongside the JSONL: + +```sql +CREATE TABLE trace_index ( + run_id TEXT NOT NULL, + span_id TEXT NOT NULL, + parent_span_id TEXT, + trace_id TEXT NOT NULL, + operation TEXT NOT NULL, -- "tool_call" | "phase_transition" | "model_request" | "hook" + started_at INTEGER NOT NULL, + duration_ms INTEGER, + file_path TEXT NOT NULL, -- which JSONL file holds the full record + file_offset INTEGER NOT NULL, -- byte offset within the file + PRIMARY KEY (run_id, span_id) +); +CREATE INDEX trace_index_started_at ON trace_index(started_at); +CREATE INDEX trace_index_trace_id ON trace_index(trace_id); +``` + +The index is populated by the trace writer goroutine after a successful flush. `/sf forensics ` queries the index, then seeks into the JSONL files for full payloads. 
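
The index-then-seek lookup can be illustrated with an in-memory stand-in for a trace file. `buildTrace` mimics what a trace writer would record as `file_offset`, and `readSpanAt` performs the seek; both names are hypothetical, and a real reader would open the path from `trace_index.file_path` with `os.Open` instead.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// buildTrace concatenates records into JSONL and returns the byte offset at
// which each record starts, as a writer would store in trace_index.file_offset.
func buildTrace(records ...string) (*bytes.Reader, []int64) {
	var buf bytes.Buffer
	offsets := make([]int64, 0, len(records))
	for _, r := range records {
		offsets = append(offsets, int64(buf.Len()))
		buf.WriteString(r)
		buf.WriteByte('\n')
	}
	return bytes.NewReader(buf.Bytes()), offsets
}

// readSpanAt seeks to offset and returns the single JSON record starting there.
func readSpanAt(r io.ReadSeeker, offset int64) (json.RawMessage, error) {
	if _, err := r.Seek(offset, io.SeekStart); err != nil {
		return nil, err
	}
	line, err := bufio.NewReader(r).ReadBytes('\n')
	if err != nil && err != io.EOF {
		return nil, err
	}
	line = bytes.TrimRight(line, "\r\n")
	if !json.Valid(line) {
		return nil, fmt.Errorf("offset %d does not point at a JSON record", offset)
	}
	return json.RawMessage(line), nil
}

func main() {
	file, offsets := buildTrace(
		`{"_meta":true,"trace_schema_version":1}`,
		`{"span_id":"s1","operation":"tool_call"}`,
		`{"span_id":"s2","operation":"hook"}`,
	)
	span, err := readSpanAt(file, offsets[2])
	fmt.Println(string(span), err)
}
```

Validating the record after the seek gives a cheap check that the index and the file have not drifted apart.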
+ +JSONL files older than 30 days MAY be moved to `/.sf/archive/trace/` by `/sf clean`. The move MUST be a single transaction: +1. Move the JSONL file to `archive/trace/`. +2. UPDATE `trace_index SET file_path = REPLACE(file_path, '.sf/trace/', '.sf/archive/trace/') WHERE file_path = ?`. + +Both steps under a process-level lock so a concurrent forensics query never observes a half-renamed state. If `/sf clean` is interrupted mid-move, on next run it detects the file in archive but index pointing to original path and repairs by re-running the UPDATE. + +### 19.4 Intent chapters + +Spans are grouped into named chapters by intent (not just by phase). + +```go +type Chapter struct { + ID string + UnitID string + Name string // inferred or agent-declared + Intent string // one-sentence summary written at close + OpenedAt time.Time + ClosedAt *time.Time + Outcome string // "success" | "failure" | "pivot" + SpanIDs []string +} +``` + +Chapters serve two purposes: +1. **Context recovery** — on resume after a crash, the harness reconstructs "what the agent was doing and why" from the chapter log. The chapter summary is injected at the top of the restored context. +2. **Singularity Memory recall** — completed chapters are stored as discrete entries. Recall queries match against chapter intent. + +The agent MAY open a chapter explicitly via `chapter_open(name)`. + +### 19.5 HTTP observability API + +The harness MUST expose a lightweight HTTP server on `localhost` when `server.port` is configured. The API is observability-only — orchestrator correctness MUST NOT depend on it. + +**Auth.** The server binds to `127.0.0.1` only. Every request MUST include header `Authorization: Bearer ` where token is read from `/.sf/runtime/api.token` (generated as 32 random bytes hex on first start, mode 0600). Multi-user machines need this — `localhost` alone is insufficient. The actual port and token are written to `/.sf/runtime/server.port` and `api.token` for tools to discover. 
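
A minimal sketch of the bearer-token check, assuming a standard `net/http` handler chain. `newToken`, `authorized`, and `requireToken` are illustrative names. The constant-time comparison via `crypto/subtle` is a choice made here, not something the spec requires, and writing the token to `api.token` with mode 0600 is left out.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newToken returns 32 random bytes, hex-encoded (64 characters).
func newToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// authorized checks an Authorization header value against the expected token.
func authorized(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := header[len(prefix):]
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// requireToken rejects requests without a valid bearer token.
func requireToken(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !authorized(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	token, err := newToken()
	if err != nil {
		panic(err)
	}
	h := requireToken(token, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"ok":true}`)
	}))
	// Exercise the handler without opening a socket.
	for _, hdr := range []string{"", "Bearer wrong", "Bearer " + token} {
		req := httptest.NewRequest(http.MethodGet, "/api/v1/state", nil)
		if hdr != "" {
			req.Header.Set("Authorization", hdr)
		}
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		fmt.Println(rec.Code)
	}
}
```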
+ +**Session filter.** All endpoints accept `?session=` to scope the response to one session. With no parameter, responses include all active sessions in the project DB; the response body has a top-level `sessions: [...]` array with the snapshot per session. + +**`GET /api/v1/state`** — runtime snapshot: + +```json +{ + "generated_at": "2026-04-29T14:22:00Z", + "counts": { "running": 3, "retrying": 1, "queued": 5 }, + "running": [ + { + "unit_id": "execute-task/m1/s2/t3", + "phase": "execute", + "session_id": "sess-abc-turn-4", + "turn_count": 7, + "last_event": "tool_call", + "started_at": "2026-04-29T14:10:00Z", + "tokens": { "input": 18200, "output": 2100, "total": 20300 } + } + ], + "retrying": [ + { + "unit_id": "execute-task/m1/s2/t4", + "attempt": 2, + "due_at": "2026-04-29T14:24:00Z", + "error": "gate: tests failed" + } + ], + "totals": { + "input_tokens": 84000, + "output_tokens": 12000, + "cost_usd": 1.24, + "seconds_running": 4820 + } +} +``` + +**`GET /api/v1/units/`** — per-unit debug detail: recent events, workspace path, retry count, last error, log file path. + +**`POST /api/v1/refresh`** — queue an immediate poll + reconciliation cycle (202 Accepted; best-effort coalescing of rapid requests). + +### 19.6 Rate-limit tracking + +The harness MUST track the latest rate-limit payload from any provider event and surface it in the TUI and HTTP API. Rate-limit data is observability-only — no retry logic is driven by it. + +**Why not actively throttle on rate limits?** Three reasons: (a) rate limit headers vary in format and meaning across providers (Anthropic's `anthropic-ratelimit-tokens-remaining` vs OpenAI's `x-ratelimit-remaining-tokens` differ in semantics — input-only vs total), (b) the model router (§ 15) already moves between providers, so a single provider's pressure does not need to feed back into dispatch, (c) the circuit breaker (§ 9.3) handles repeated provider failures including 429. 
Rate-limit data is for the operator to see what's happening, not for the orchestrator to react to. + +--- + +## 20. Failure Taxonomy + +Every harness failure has a class. The class determines recovery behavior. + +| Class | Examples | Recovery | +|---|---|---| +| `config` | Missing or invalid `.sf/workflows/*.toml`, invalid `.sf/config.toml`, missing API key | Block new dispatches. Keep service alive. Emit operator-visible error. | +| `workspace` | Directory creation failure, hook timeout, invalid path | Fail the current attempt. Orchestrator retries with backoff. | +| `agent_session` | Startup handshake failed, turn timeout, turn cancelled, subprocess exit, stalled session, `turn_input_required` (hard mode) | Fail the current attempt. Orchestrator retries with backoff. | +| `observability` | Snapshot timeout, dashboard render error, log sink failure | Log and ignore. MUST NOT crash the orchestrator over an observability failure. | + +### 20.1 Typed error codes + +```go +const ( + ErrMissingWorkflowFile = "missing_workflow_file" // .sf/workflows/.toml not found + ErrWorkflowParseError = "workflow_parse_error" + ErrWorkspaceCreation = "workspace_creation_failed" + ErrWorkspaceSymlinkEscape = "workspace_symlink_escape" + ErrHookTimeout = "hook_timeout" + ErrHookFailed = "hook_failed" + ErrAgentStartup = "agent_session_startup" + ErrTurnTimeout = "turn_timeout" + ErrTurnFailed = "turn_failed" + ErrTurnInputRequired = "turn_input_required" + ErrPromptRender = "prompt_render_failed" + ErrBudgetExhausted = "budget_exhausted" + ErrStalled = "stalled" + ErrCanceledByOperator = "canceled_by_operator" // user ran /sf abandon + ErrModelUnavailable = "model_unavailable" + ErrCircuitOpen = "circuit_open" + ErrNoCapableAgent = "no_capable_agent" + ErrSshDisconnected = "ssh_disconnected" + ErrCanceledBySupervisor = "canceled_by_supervisor" +) +``` + +Implementations MUST match on typed error codes. Matching on error message strings is PROHIBITED. 
+ +### 20.2 Scheduler state + +Scheduler state is intentionally in-memory. Restart recovery MUST NOT attempt to restore retry timers, live sessions, or in-flight agent state. After restart: startup terminal cleanup → fresh poll → re-dispatch eligible work. This is a design choice, not a limitation. Durable retry state is a future extension. + +--- + +## 21. Trust Boundary + +Every deployment MUST document its trust posture explicitly. There is no universal safe default. + +### 21.1 Default posture (single-user developer machine) + +- Auto-approve tool execution and file changes within the workspace. +- `turn_input_required = "soft"`. +- Workspace isolation enforced (symlink-aware path containment, sanitized names). +- Secrets from Vault only — MUST NOT store secrets in config files in plaintext. + +### 21.2 Hardening measures for less-trusted environments + +- Filter which issues/tasks are eligible for dispatch — untrusted or out-of-scope tasks MUST NOT automatically reach the agent. +- Restrict the `plan_unit` client-side tool to read-only or scope-limited mutations only. +- Run the agent subprocess under a dedicated OS user with no write access outside the workspace root. +- Add container or VM isolation around each workspace (Docker, nsjail, etc.). +- Restrict network access from the workspace. +- Narrow available tools to the minimum needed for the workflow. + +### 21.3 Auto-approval contract + +In auto-mode the harness calls pi-coding-agent's existing permission API ONLY for operations listed in `[harness.auto_approve]`. Sensitive operations (`fs:write-outside-project`, `shell:exec`) MUST always prompt regardless of auto-mode setting. + +**Precedence between PreToolUse hooks and auto-approve.** pi-coding-agent's PreToolUse hook system already runs before any tool call. If a PreToolUse hook returns `deny` or `halt`, the tool call is rejected even if `auto_approve` lists the tool. The order is: + +1. 
PreToolUse hooks run first; their decision is final for `deny`/`halt`. +2. If hooks return `allow` or no decision, the auto-approve list is consulted. +3. If neither approves, the user is prompted (interactive mode) or the call fails (auto-mode for non-allowlisted tools). + +This means: PreToolUse hooks MAY revoke an auto-approval; the auto-approve list MUST NOT override a hook denial. This precedence is critical for security policies that need to override per-session approvals. + +SF-specific permission gates: +- `git:write` — any git operation that mutates state. Requires explicit grant in auto-mode. +- `worktree:create` and `worktree:delete` — worktree lifecycle. +- `fs:write-outside-project` — ALWAYS prompt, NEVER auto-approve. +- `shell:exec` — allowlist specific commands; no blanket approval. + +--- + +## 22. Distributed Execution + +### 22.1 Topology + +The orchestrator ALWAYS runs centrally. Workers MAY execute on remote hosts over SSH. + +```toml +[worker] +ssh_hosts = ["mikki-bunker", "forge-gpu-1"] +max_concurrent_agents_per_host = 3 +``` + +### 22.2 Rules + +- `workspace.root` is resolved on the **remote host**, not the orchestrator. +- The agent subprocess is launched over SSH stdio. The orchestrator owns the session lifecycle. +- Continuation turns within one worker lifetime MUST stay on the same host and workspace. +- If a host is at capacity, dispatch MUST wait rather than silently fall back to local or another host. +- Once a run has produced side effects, moving to another host on retry is treated as a new attempt (not invisible failover). +- The run record MUST include `worker_host` so operators can see where each run executed. +- SSH workspace creation MUST use the same symlink-aware validation as local workspaces, implemented via shell script. + +### 22.3 Disconnect and zombie handling + +When the SSH connection drops mid-turn: + +1.
The orchestrator marks the attempt `failed` with `error_code = "ssh_disconnected"` after `[worker] ssh_disconnect_timeout = "30s"` of no stdio activity. +2. **Before** scheduling a retry, the orchestrator MUST emit a remote-cleanup script over a fresh SSH session: `pgrep -f "" | xargs -r kill -TERM`, wait 10s, then `kill -KILL`. The marker is a unique string injected into the agent process's command line (e.g. `--sf-run-id=`). +3. If the cleanup script fails (host unreachable), the host is marked `unhealthy` for `[worker] host_quarantine = "5m"`. New dispatches skip it; the host re-eligibility check runs each tick. +4. The retry MUST land on a different host if `host_quarantine` is in effect for the original host; otherwise same host with a fresh workspace re-creation (the previous workspace is moved to `~/.sf/orphaned-workspaces/{timestamp}-{run-id}/` for forensics, not deleted). + +Zombies are the dominant failure mode for distributed execution; ignoring them produces double-write corruption. + +--- + +## 23. Plugin Extension Points + +Plugin interfaces are TypeScript classes implementing the listed contracts. sf loads them via dynamic import at boot from `.sf/plugins/`. Each plugin is a Node module exporting a default class with a marker method (e.g. `static readonly kind = "shipper"`). + +### 23.1 Interfaces + +**`SupervisorCheck`** — custom supervisor checks without forking: + +```go +type SupervisorCheck interface { + Name() string + Check(ctx context.Context, state SupervisorState) SupervisorSignal +} +``` + +**`Shipper`** — PR/MR creation. GitHub default; GitLab, Gitea, Forgejo alternatives: + +```go +type Shipper interface { + Ship(ctx context.Context, opts ShipOptions) (ShipResult, error) +} +``` + +**`VCS`** — version control backend. 
`git` default; `jj` (Jujutsu) first alternative: + +```go +type VCS interface { + Commit(ctx context.Context, msg string, files []string) error + Branch(ctx context.Context, name string) error + Push(ctx context.Context, remote, branch string) error +} +``` + +**`Store`** — storage backend. SQLite for personal use; PostgreSQL for team sessions: + +```go +type Store interface { + SaveSession(ctx context.Context, s Session) error + LoadSession(ctx context.Context, id string) (Session, error) + SaveMemory(ctx context.Context, m Memory) error + SearchMemory(ctx context.Context, q MemoryQuery) ([]Memory, error) +} +``` + +**`Notifier`** — notification provider. Slack, Discord, webhook: + +```go +type Notifier interface { + Notify(ctx context.Context, event Event) error +} +``` + +### 23.2 What stays out of plugins + +- Workflow templates — enforced TOML/YAML data +- Skills — `SKILL.md` prompt guidance +- Model routing — config + SQLite + thin Go scorer +- Phase transitions — harness-owned, not extensible + +--- + +## 24. Secret Management + +### 24.1 `vault://` URI scheme + +Secrets MUST NOT be stored in config files in plaintext. The canonical secret reference format is: + +``` +vault://secret/sf#anthropic_api_key +``` + +In config: + +```json +{ + "providers": { + "anthropic": { "api_key": "vault://secret/sf#anthropic_api_key" } + } +} +``` + +### 24.2 VaultResolver + +```go +type VaultResolver struct { + client *vault.Client +} + +func (r *VaultResolver) Resolve(uri string) (string, error) { + // parse vault://path#field + // client.KVv2(mount).Get(ctx, path) → secret.Data["field"] +} +``` + +Auth chain (first that succeeds): +1. `VAULT_TOKEN` env var (CI / ephemeral) +2. `~/.vault-token` file (local dev) +3. AppRole via `VAULT_ROLE_ID` + `VAULT_SECRET_ID` (production) + +Secrets MUST be fetched once at startup and held in memory for the session lifetime. MUST NOT be written to disk or logged. 
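
The parsing step noted in the resolver comment can be sketched with the standard `net/url` package. This is a minimal sketch under one stated assumption: the first segment after `vault://` is treated as the KV mount and the remainder as the secret path, which the spec does not spell out. `SecretRef` and `parseVaultURI` are hypothetical names, and no Vault client call is made here.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// SecretRef is a hypothetical parsed form of a vault:// reference.
type SecretRef struct {
	Mount string // assumed: first segment, parsed by net/url as the host
	Path  string // remaining path segments
	Field string // the part after '#'
}

// parseVaultURI splits vault://mount/path#field into its parts and rejects
// anything else, including plaintext values.
func parseVaultURI(raw string) (SecretRef, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return SecretRef{}, err
	}
	if u.Scheme != "vault" {
		return SecretRef{}, errors.New("not a vault:// reference")
	}
	ref := SecretRef{
		Mount: u.Host,
		Path:  strings.Trim(u.Path, "/"),
		Field: u.Fragment,
	}
	if ref.Mount == "" || ref.Path == "" || ref.Field == "" {
		return SecretRef{}, errors.New("vault reference needs mount, path, and #field")
	}
	return ref, nil
}

func main() {
	ref, err := parseVaultURI("vault://secret/sf#anthropic_api_key")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Only the reference is printed; a resolved secret value must never be logged.
	fmt.Printf("mount=%s path=%s field=%s\n", ref.Mount, ref.Path, ref.Field)
}
```

Rejecting any value without the `vault` scheme also gives startup validation a cheap way to refuse plaintext keys before a resolver is ever consulted.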
+ +### 24.3 Stopgap + +Until the native resolver is built, sf supports the same `$(command)` substitution that pi-mono inherits — embed a shell command: + +```json +{ "api_key": "$(vault kv get -field=anthropic_api_key secret/sf)" } +``` + +--- + +## 25. CLI Commands + +### `/sf plan "" [--workflow=feature] [--link-issue=]` + +Add a milestone to the project's plan. sf decomposes into slices and tasks at runtime (Plan phase) but the milestone row is created immediately so it shows up in `/sf status`. `--link-issue=` writes `metadata.gh_issue` for use by visibility hooks (§ 10.5.1). `--workflow=` overrides the default workflow template. + +### `/sf plan reload` + +Re-read `.sf/plan.md` and reconcile against current `units`. Adds new milestones, surfaces removed ones as `archived`, leaves in-flight units alone. + +### `/sf abandon "reason"` + +Operator override to mark a unit terminal mid-flight. Sets `phase_status = 'canceled'`, records the reason in `runs.error_code = "canceled_by_operator"`. Mid-turn workers detect the change at the next inter-turn check (§ 6) and exit cleanly. + +### `/sf auto` + +Start the autonomous loop. The harness polls `units` for eligible work and dispatches workers until no more eligible units exist or until stopped by `/sf pause`. + +### `/sf next` + +Manual step mode. Dispatch one unit, wait for completion, surface result. Repeat on each invocation. + +### `/sf dispatch ` + +Force-dispatch a specific unit regardless of priority or blocker state. Surfaces a warning if blockers exist. + +### `/sf pause` + +Cleanly pause auto-mode. Writes `SessionPaused` to SQLite. All in-flight units complete their current turn before stopping. 
+ +### `/sf status` + +Structured project health snapshot: + +``` +Project: singularity-foundry +Phase: Execute [m2/s3/t1 — add trace export] +Next: TDD [m2/s3/t1] +Blocker: none + +Milestones: 2 / 5 (40%) +Slices: 7 / 18 (39%) +Tasks: 14 / 42 (33%) + +Session: 4h 12m | $0.83 | claude-sonnet-4-6 +``` + +Blockers surface from the `session_blockers` table. `/sf status` MUST NOT poll pubsub — it reads SQLite directly. + +### `/sf revert ` + +Four-phase git-aware revert protocol: + +1. **Target selection** — accept explicit unit ID, or present the top 3 in-progress + 3 most recent completed units as a numbered menu. +2. **Git reconciliation** — find all commits belonging to the target unit. Handle ghost commits (SHA missing after rebase/squash) by searching by commit message prefix. +3. **Confirmation** — display exact SHA list with descriptions and dates. Warn on merge commits. +4. **Execution** — `git revert --no-edit ` in reverse order (newest first). On conflict: `SignalPause`. + +After all reverts: restore `.sf/active/{unit-id}/` artifacts from archive; mark unit as `[ ]` in the plan. + +### `/sf rate over|ok|under [unit-id]` + +Signal model quality. Without `unit-id`, targets the most recently completed run in the current session — specifically the latest row in `runs` where `outcome IN ('success', 'failure')` and `ended_at IS NOT NULL`, scoped to `session_id`. With `unit-id`, targets the latest run for that unit. + +Writes to `benchmark_results` with the human-rating weight multiplier (default 3×). Cannot be issued against an in-flight run. + +### `/sf benchmark` + +Run on-demand model benchmarks for all tiers against real task samples. Updates `benchmark_results`. 
+ +### `/sf doctor` + +Run health checks: +- `HarnessConfig.Validate()` +- Vault connectivity +- Singularity Memory connectivity +- SQLite schema version +- Lock file state +- Workflow template syntax +- HTTP API token presence + permissions + +Exit code: `0` if all checks pass, `1` if any FAIL or WARN. Useful in CI: `sf doctor || exit 1`. The TUI rendering shows pass/warn/fail per check; the JSON form (`/sf doctor --json`) returns a structured report for automation. + +### `/sf history` + +Query archived units in `.sf/archive/`. Supports filtering by date, phase, model, verdict. + +### `/sf forensics` + +Inspect the trace for a specific unit or session. Shows all spans, tool calls, phase transitions, and gate results in chronological order. + +### `/sf reset-circuits` + +Clear all tripped circuit breakers. Next dispatch uses benchmark scores to select within each tier normally. + +### `/sf reassess-resolve "operator response"` + +Resume a unit that entered `PhaseReassess` with the **Escalate** outcome (§ 4.6). The operator's response is appended as the next attempt's `last_error` so the agent can incorporate it. The unit re-enters `PhasePlan`. + +### `/sf force-clear ` + +Operator override: mark a `session_blockers` row resolved with `resolved_by = "/sf force-clear"`. Used to dismiss stuck `GateBlocked` events that can't auto-resolve (e.g. flaky external test infrastructure). + +### `/sf merge-resolve ` + +Resume a unit halted on `MergeConflict`. Assumes the operator has resolved the conflict in the worktree. Triggers re-emission of `MergeReady`. + +### `/sf uat-approve ` and `/sf uat-reject "reason"` + +Advance a unit out of `PhaseUAT` (§ 4.6). Approve transitions to `PhaseMerge`; reject transitions to `PhaseReassess` with the reason as `last_error`. + +### `/sf agent ` + +Persistent agent management: + +- `/sf agent list` — show all agents with state, last_active, capabilities. 
+- `/sf agent run "message"` — wake an agent with an ad-hoc message (bypasses inbox routing). +- `/sf agent reset ` — clear hot cache and reset Budget; memory blocks and message history preserved. +- `/sf agent delete ` — soft-delete (sets `archived_at`); runs and messages preserved via snap_ columns. +- `/sf agent inspect ` — show memory blocks, recent messages, current state. +- `/sf agent history ` — query archived inbox in `.sf/archive/agents/{id}/`. + +### `/sf history [filters]` + +Query archived units in `.sf/archive/`. Filter syntax: + +``` +/sf history --since 2026-04-01 --phase merge --verdict success +/sf history --workflow spike +/sf history --model claude-sonnet-4-6 --limit 50 +/sf history --json # machine-readable output for automation +``` + +Filters are AND-combined. Without filters, returns the most recent 20 archived units. The query reads from `runs` table joined with archive metadata; full unit artifacts are accessible at `.sf/archive/{date}-{unit-id}/`. + +### `/sf clean [--dry-run]` + +Garbage-collect: rotate trace JSONL older than 30 days to `.sf/archive/trace/`, evict `pending_retain` rows older than 7 days to `lost-learnings.jsonl`, vacuum SQLite. `--dry-run` shows what would be removed. + +--- + +## 26. Conformance Checklist + +Use this checklist as the definition-of-done for each build phase. An implementation is **core-conformant** when all core items pass. **Extension-conformant** when all extension items also pass. + +Each item is tagged: + +- **[REQUIRED]** — MUST be present for conformance at its tier. Absence = non-conformant. +- **[STRONG]** — SHOULD be present; departure requires a written rationale. +- **[OPTIONAL]** — MAY be present; absence is acceptable. + +Default tag is **[REQUIRED]** unless explicitly noted. + +### 26.1 Core (must ship) + +- [ ] **C-01** Workflow template TOML loader with `phases`, `require_tdd`, `require_review`, `max_retries`, `max_reassess` fields; unknown fields rejected. 
+- [ ] **C-02** Phase state machine with all 10 phases; invalid transitions rejected with typed error at harness boundary. +- [ ] **C-03** `Harness.Transition(ctx, from, to, reason)` persists to SQLite before new phase begins; emits pubsub `PhaseChange` after write. +- [ ] **C-04** AttemptState enum (11 states); `AttemptCanceled` distinct from `AttemptFailed`. +- [ ] **C-05** TurnKind enum; continuation turns receive guidance-only prompt, not full task prompt. +- [ ] **C-06** Strict prompt rendering: unknown `{{variable}}` in template → startup panic. +- [ ] **C-07** `attempt` variable: `null` on first dispatch; integer ≥ 1 on retry; `last_error` auto-injected on retry. +- [ ] **C-08** `turn_input_required` configurable `soft` (inject non-interactive message) or `hard` (fail immediately); MUST NOT stall indefinitely. +- [ ] **C-09** Context budget: `ShouldCompact()` triggers compaction before next turn; `AtHardLimit()` halts unit; budget state persisted to SQLite after every turn. +- [ ] **C-10** Budget token accounting prefers absolute totals; prevents double-counting. +- [ ] **C-11** Compaction: write session summary to Singularity Memory, clear hot cache, start next turn with fresh recall. +- [ ] **C-12** Supervisor goroutine: all 9 built-in checks; communicates only via pubsub; MUST NOT call `os.Exit`. +- [ ] **C-13** Circuit breaker: 3 consecutive non-transient failures trips model; state persisted to SQLite; resets after 24h or `/sf reset-circuits`. +- [ ] **C-14** `ModelUnavailable` → `SignalAbort` immediately (not after timeout). +- [ ] **C-15** Hook events: `PreDispatch`, `PostUnit`, `PhaseChange`, `AutoLoop`, `WorktreeCreate`, `WorktreeDelete`, `MergeReady`, `MergeConflict`. +- [ ] **C-16** `UnitResult` struct passed to PostUnit hooks as JSON via stdin. +- [ ] **C-17** PostUnit hooks run sequentially; non-zero exit → `SignalAbort`; timeout → kill, log, continue. 
+- [ ] **C-18** Tool response contract: `{success, output, contentItems}` shape for all tool responses. +- [ ] **C-19** Unknown tool call → structured failure response; session continues. +- [ ] **C-20** Doc sync hook runs after every `PhaseMerge`; MAY be disabled with `doc_sync = false`. +- [ ] **C-21** Workspace name sanitization: `[^a-zA-Z0-9._-]` → `_`. +- [ ] **C-22** Symlink-aware workspace path containment via segment-by-segment `lstat` canonicalization; naive `EvalSymlinks` is insufficient. +- [ ] **C-23** Workspace lifecycle hooks: `after_create`, `before_run`, `after_run`, `before_remove`; `before_run` fatal, `after_run` best-effort. +- [ ] **C-24** Startup cleanup: stale active artifacts moved to archive; running units marked interrupted. +- [ ] **C-25** Dynamic config reload: `{mtime, size, SHA-256}` stamp polled every tick; invalid reload keeps last known good; session-immutable fields unchanged without restart. +- [ ] **C-26** Per-phase concurrency caps (`max_agents_by_phase`). +- [ ] **C-27** Blocker-aware dispatch: non-terminal upstream → skip, re-evaluate next tick; no backoff increment. +- [ ] **C-28** Priority sort: priority asc → blocker-free first → phase order → created_at asc → id lexicographic. +- [ ] **C-29** Continuation retry (1s) after normal worker exit. +- [ ] **C-30** Exponential backoff after abnormal exit; cap configurable (default 5m). +- [ ] **C-31** Structured log format: `key=value` pairs; required context fields per scope; truncate at 2KB. +- [ ] **C-32** Log rotation: 10MB max, 5 files, single-line format, stderr handler removed when file logging active. +- [ ] **C-33** Span-based trace to `~/.sf/trace.jsonl`; non-blocking buffered writer; MUST NOT drop spans. +- [ ] **C-34** Intent chapters: open/close with intent summary; used for crash recovery context and Singularity Memory recall. +- [ ] **C-35** Typed error codes; matching on error strings PROHIBITED. 
+- [ ] **C-36** Scheduler state intentionally in-memory; restart re-dispatches from fresh poll. +- [ ] **C-37** Project CI runs `specs.check`: AST-based godoc enforcement on all exported identifiers in sf's own harness packages. (Not a user-project runtime gate.) +- [ ] **C-38** Vault secret resolution: `vault://path#field` URI scheme; auth chain: `VAULT_TOKEN` → `~/.vault-token` → AppRole; secrets MUST NOT be written to disk or logged. +- [ ] **C-39** PhaseReview chunked at ≤ 300 lines per chunk. +- [ ] **C-40** Unit archive: `.sf/active/` → `.sf/archive/{date}-{unit-id}/` on `PhaseComplete` via atomic rename. +- [ ] **C-41** No external tracker integration. The orchestrator polls only `units` in local SQLite. External visibility (GH Issues, Slack, etc.) is achieved via PostUnit hook scripts, not built-in adapters. +- [ ] **C-42** Unit creation sources: `/sf plan ""` CLI, `.sf/plan.md` reload, `/sf dispatch `. No background poll of any external API. +- [ ] **C-43** Crash recovery: `running` units → `interrupted` on startup; re-dispatch fresh from last persisted phase boundary with `last_error = "resumed_after_crash"`; tool calls NOT replayed; agent sessions NOT resumed. +- [ ] **C-44** Process lock at `~/.sf/run.lock`; stale-lock cleanup via `/proc` PID check. +- [ ] **C-45** Doc-sync runs as a sub-step of `PhaseMerge` (not a separate phase, not a post-merge dispatch); empty diff is a no-op; user approval required unless `doc_sync_auto_approve = true`. +- [ ] **C-46** SQLite is orchestration-only — no `memories` table, no vector index. Knowledge MUST live in Singularity Memory. +- [ ] **C-47** Atomic claim acquisition: single conditional UPDATE pattern; rows_affected = 1 gates dispatch. +- [ ] **C-48** `runs` table: CHECK constraint enforces XOR between unit_attempt and agent_run; aggregate token/cost are end-of-run rollup. +- [ ] **C-49** `units.attempt` is current counter; historical attempts in `runs`; both updated in same transaction. 
+- [ ] **C-50** Mid-run cancellation only via `/sf abandon ` (operator) or supervisor signal; no automated cancellation from external state changes (since there is no external state). +- [ ] **C-51** Singularity Memory retain failures queue in `pending_retain`; flush to `lost-learnings.jsonl` after 7d. +- [ ] **C-52** Workflow selection priority: `metadata.workflow` set at plan time → `default_workflow` config → built-in fallback. Pinned to unit at first dispatch; never re-evaluated. +- [ ] **C-53** PhaseUAT trigger: workflow `require_uat = true`; halts auto-loop with `SignalPause`; resumes via `/sf uat-approve` or `/sf uat-reject`. +- [ ] **C-54** Agent run termination conditions defined (inbox drain, stop tool, hard budget, turn cap, supervisor abort, timeout); hot cache NOT preserved across runs; durable blocks and message history ARE. +- [ ] **C-55** `last_error` injected only on `TurnFirst` of `attempt >= 2`. +- [ ] **C-56** Per-project lock at `/.sf/run.lock`; multiple projects can run auto concurrently. +- [ ] **C-57** Project DB at `/.sf/sf.db`; canonical directory layout (§ 14.5) MUST be honoured for `/sf revert`, `/sf history`, archive sweeps. +- [ ] **C-58** All runtime ULID PKs; soft-delete via `archived_at` for units and agents (no cascade delete of runs). +- [ ] **C-59** `runs` snap_ columns survive entity deletion; FK uses `ON DELETE SET NULL`. +- [ ] **C-60** Per-hook-type timeouts (table in § 10.3); not a single global value. +- [ ] **C-61** PhaseReassess outcomes: Re-plan / Abandon / Escalate; `max_reassess` decrements only on Re-plan; reasoning tier with Think. +- [ ] **C-62** PhaseChange is non-vetoable; veto semantics live on PreDispatch. +- [ ] **C-63** PhaseReview three-pass: establish-context → parallel chunked review → synthesis. +- [ ] **C-64** SSH disconnect: `error_code = "ssh_disconnected"`; remote zombie cleanup via marker pgrep; host quarantine on cleanup failure; orphaned workspace preserved for forensics. 
+- [ ] **C-65** Agent compaction preserves wake message + recent 3 inbox arrivals + full memory blocks. +- [ ] **C-66** PreToolUse hook decisions outrank auto_approve list (deny wins; allow falls through to auto-approve). +- [ ] **C-67** Slice merge ordering: `code_depends_on` honoured; merges serialised per project. +- [ ] **C-68** Doc-sync sub-step runs at end of last code-mutating phase (Merge if present, else Execute). +- [ ] **C-69** Cost stored as `cost_micro_usd` INTEGER (1e-6 USD); float drift avoided. +- [ ] **C-70** `session_blockers.resolved_at` set per resolution-rules table; `resolved_by` records source. +- [ ] **C-71** Workflow content pinning via `workflow_pins(hash, name, content)`; in-flight units use pinned content even if template file changes. +- [ ] **C-72** `projectHash` derivation: git-remote SHA-256 → fallback path SHA-256; cached in `.sf/runtime/project-hash.json`. +- [ ] **C-73** Dynamic reload of session-immutable fields: warn, keep in-process value, surface in `/sf status` as drift; do NOT crash. +- [ ] **C-74** `last_error` capped at 4 KB head-and-tail; full payload at `.sf/active/{unit-id}/last-error-full.txt`. +- [ ] **C-75** SSH auth via agent / explicit key; `ssh_known_hosts` MUST verify; no auto-trust. +- [ ] **C-76** UAT phase has timeout = 0 (infinite); advanced via `/sf uat-approve` or `/sf uat-reject`. +- [ ] **C-77** HTTP API requires `Authorization: Bearer ` from `.sf/runtime/api.token` (mode 0600); `?session=` filter supported. +- [ ] **C-78** `/sf doctor` exit code 0 = all pass, 1 = any FAIL or WARN; `--json` returns structured report. +- [ ] **C-79** Trace JSONL has `_meta` first-line record with `trace_schema_version`; readers branch on version. +- [ ] **C-80** Trace SQL index (`trace_index`) populated by trace writer; `/sf forensics` queries it for fast span lookup. +- [ ] **C-81** Turn outcome marker parsed from last 200 chars: `complete|blocked|giving_up`; blocked → SignalPause, giving_up → PhaseReassess. 
+- [ ] **C-82** Agent handoff supports `capability:tag1,tag2` form; round-robin by `last_active` among matching agents; `ErrNoCapableAgent` if none. +- [ ] **C-83** Provider API keys MUST use `vault://`; plaintext rejected at startup validation. +- [ ] **C-84** Gate script protocol: env vars (SF_PROJECT_ROOT, SF_UNIT_ID, SF_RUN_ID, SF_PHASE, SF_ATTEMPT, SF_GATE_NAME, SF_GATE_RETRY, SF_WORKSPACE, SF_TRACE_FILE), stdin = UnitResult JSON, exit codes 0/1/2/3, output truncated at 8 KB. +- [ ] **C-85** Gate retry counter is separate from `units.attempt`; resets on phase transition. +- [ ] **C-86** `plan.md` frontmatter (unit_id, created_at, written_by, plan_version) + sections (Goal, Approach, Deliverables, Verification, Notes) validated before transition out of PhasePlan. +- [ ] **C-87** `Memory` interface (Recall, Retain, Feedback, Validate, Health) generated from sm's `/openapi.json`; `pending_retain` queue routes failed Retains; `local_anti_patterns` mirror exposed when sm unreachable. +- [ ] **C-88** sf tools registered through pi-coding-agent's tool registry; PreToolUse hooks apply uniformly; auto_approve keys documented per tool. +- [ ] **C-89** All operator commands referenced elsewhere in spec are present in § 25: reassess-resolve, force-clear, merge-resolve, uat-approve, uat-reject, agent {list,run,reset,delete,inspect,history}, history, clean. +- [ ] **C-90** `agent_capabilities` index maintained in sync with `agents.capabilities`; capability lookup is index scan, not full table scan. +- [ ] **C-91** Trace JSONL archive move is transactional with `trace_index.file_path` UPDATE; recoverable if interrupted. +- [ ] **C-92** Versioning policy: SemVer; v1.0 freezes §§3, 4, 6, 10, 14, 26. +- [ ] **C-93** [STRONG] Rate-limit data is observability-only; no orchestrator retry/dispatch logic reads it. +- [ ] **C-94** Singularity Memory is the sole knowledge backend; engine assimilated into `singularity_memory_server/` (MIT-attributed, no upstream runtime dep). 
+- [ ] **C-95** `[memory] mode = "embedded"` is the default for single-user sf; `mode = "remote"` MUST require `url` and `api_key` (vault://). +- [ ] **C-96** Go client `github.com/singularity-ng/singularity-memory-client-go` is generated from sm's `/openapi.json`; sf imports it as a normal Go module dependency. + +### 26.2 Knowledge layer (ship after core) + +- [ ] **K-01** Memory tiers: hot cache (in-memory, last 10 turns); Singularity Memory store (durable, PostUnit writes). +- [ ] **K-02** Two-bank pattern in Singularity Memory: `project/{hash}` + `global/coding`; merged before dispatch. +- [ ] **K-03** Anti-pattern library: `collection: anti_patterns`; never decay; surfaced in dedicated `` block. +- [ ] **K-04** Pattern maturation: 4 states (candidate → established → proven → deprecated); weights as specified. +- [ ] **K-05** Confidence decay: `halfLife = 90 * (0.5 + confidence)` days. +- [ ] **K-06** Singularity Memory is the sole knowledge backend; on sm outage, dispatch proceeds with empty recall (plus local_anti_patterns mirror) and a logged warning. +- [ ] **K-07** `sf init` deep analysis default; `--quick` skips Singularity Memory indexing. + +### 26.3 Model routing (ship after core) + +- [ ] **R-01** Three tiers; phase → tier static mapping from config. +- [ ] **R-02** `Think: true` set for `reasoning` tier phases; agent cannot override. +- [ ] **R-03** Within-tier selection by benchmark score formula. +- [ ] **R-04** Complexity upgrade: classifier at dispatch time; fingerprint stored in SQLite. +- [ ] **R-05** `/sf rate` writes `benchmark_results`; human ratings carry 3× weight. + +### 26.4 Persistent agents (ship after core) + +- [ ] **A-01** `agents`, `agent_memory_blocks`, `agent_messages`, `agent_inbox` SQLite tables. +- [ ] **A-02** Memory block injection as XML into system prompt at dispatch. +- [ ] **A-03** `core_memory_append` and `core_memory_replace` tools write to SQLite before next turn. 
+- [ ] **A-04** `AgentState` enum (4 states); harness owns all transitions. +- [ ] **A-05** `agent_inbox` append-only; `delivered` is the only mutable column. +- [ ] **A-06** `send_message` tool: inserts to inbox, emits `AgentWake`. +- [ ] **A-07** `wait_for_reply` with mandatory timeout; MUST NOT block indefinitely. +- [ ] **A-08** `handoff(to, context)`: suspends calling agent → target receives full context → calling agent transitions to `AgentWaiting`. +- [ ] **A-09** Per-agent budget tracking, supervision, and crash recovery. +- [ ] **A-10** Cost recorded per agent in trace. + +### 26.5 Extensions (ship after core) + +- [ ] **E-01** HTTP observability API: `GET /api/v1/state`, `GET /api/v1/units/`, `POST /api/v1/refresh`. +- [ ] **E-02** SSH worker extension: `worker.ssh_hosts`; remote workspace creation via shell script with symlink-aware validation; per-host concurrency cap. +- [ ] **E-03** Durable retry queue across restarts (SQLite-backed). +- [ ] **E-04** `plan_unit` client-side tool: agent can refine its own plan mid-run (add/split/reorder units). Uses orchestrator auth; subject to PreToolUse hooks. +- [ ] **E-05** Plugin interfaces: `SupervisorCheck`, `Shipper`, `VCS`, `Store`, `Notifier`. (`Tracker` deliberately not in this list — see § 3.3.) + +--- + +*End of SPEC.md v1.0.0-draft* diff --git a/docs/hooks/FUTURE.md b/docs/hooks/FUTURE.md index 0ddbfe2905..334f12579a 100644 --- a/docs/hooks/FUTURE.md +++ b/docs/hooks/FUTURE.md @@ -374,3 +374,65 @@ Cross-platform matrix: - No Crush-as-script-interpreter mode (users can't write `#!/usr/bin/env crush` and have it mean something). If we want that later, it's an additive feature, not a dependency of this work. + +## SF harness hook events + +**Status:** planned, tied to singularity-crush harness build (see `harness.md`). + +Crush currently exposes only `PreToolUse`. The SF harness needs additional +lifecycle events.
These extend the existing hook aggregation system +(`aggregate()` in `hooks.go`) — the same allow/deny/halt semantics apply. + +### New event types + +| Event | When it fires | Payload fields | +|-------|--------------|----------------| +| `PreDispatch` | Before a unit is dispatched to the agent | `unit_id`, `unit_type`, `phase`, `model`, `session_id` | +| `PostUnit` | After a unit completes (success or failure) | Full `UnitResult` JSON — see `harness.md` § 10 | +| `PhaseChange` | On every phase state machine transition | `from`, `to`, `unit_id`, `reason`, `timestamp` | +| `AutoLoop` | Each iteration of auto-mode | `iteration`, `phase`, `budget_used_pct` | +| `AgentWake` | When a persistent agent is woken by an inbox message | `agent_id`, `agent_name`, `from_agent`, `message_count` | +| `AgentIdle` | When a persistent agent finishes its turn with no pending messages | `agent_id`, `agent_name`, `turns_run`, `tokens_used` | +| `AgentMessage` | When `send_message(to, message)` is called between agents | `from_agent`, `to_agent`, `message_preview` | + +### PostUnit payload (abridged) + +```json +{ + "unit_id": "execute-task/m1/s2/t3", + "unit_type": "task", + "phase": "execute", + "verdict": "success", + "duration_ms": 42300, + "input_tokens": 18200, + "output_tokens": 2100, + "cost_usd": 0.043, + "model": "claude-sonnet-4-6", + "learnings": ["use --no-verify when running pre-commit in CI context"] +} +``` + +`PostUnit` hooks that exit non-zero signal `SignalAbort` — the harness stops +the session. Hooks that time out (default 30s) are killed and logged but do +not block the next dispatch. This is the primary hook for: git commit/push, +hermes-memory feedback, test gate execution, custom notifications. + +### AgentWake / AgentIdle hooks + +These fire per persistent agent, not per session. A hook on `AgentWake` can +gate which agents are allowed to start (e.g. enforce a fleet size limit). 
A +hook on `AgentIdle` is the natural place for post-turn git operations scoped +to that agent's workspace. + +`AgentMessage` hooks fire before the message is delivered to the inbox. A +`deny` decision drops the message and returns an error to the calling agent's +`send_message` tool. Use this to enforce routing policy (e.g. an agent cannot +message outside its designated group). + +### Aggregation behaviour for new events + +`PreDispatch` and `AgentWake` follow `PreToolUse` semantics: any `deny` or +`halt` blocks the dispatch/wake. `PostUnit`, `AgentIdle`, and `AutoLoop` are +notification-only — hooks cannot block these events, only observe them. +`PhaseChange` and `AgentMessage` support `deny` to block the transition or +message delivery respectively. diff --git a/harness.md b/harness.md new file mode 100644 index 0000000000..eaebe9d974 --- /dev/null +++ b/harness.md @@ -0,0 +1,890 @@ +# Harness Engineering Practices + +> **Status: superseded research notes.** This document was the working draft that SPEC.md was synthesised from. **For implementation, follow [SPEC.md](./SPEC.md), which is the authoritative specification.** +> +> Specifically out of date in this document: +> - Knowledge layer: this doc still says "hermes-memory" with `hermes_memory_*` tools and a local memory pipeline. **SPEC.md uses Hindsight as the sole knowledge backend** — no local sqlite-vec, no embedding/reranker calls from the harness, no hermes-memory plugin. Hindsight handles retrieval and reranking server-side. +> - Anti-pattern detection details, in-process embedding, and reranker tier discussion in this file no longer apply. +> +> The rest (phase state machine, hooks, supervisor checks, worktree isolation, distributed execution, trust boundary) is broadly aligned with SPEC.md but with less precision. Use SPEC.md for the canonical wording. + +--- + +Engineering practices for the singularity-crush agent harness. 
The goal is to define clean boundaries from day one so practices that were bolted onto SF over time (dispatch-guard, exec-sandbox, auto-recovery, context-budget, auto-supervisor, abandon-detect — all separate files, all added later) are instead structural from the start. + +## What Crush already provides + +Before adding anything, understand what exists: + +- `internal/hooks/` — PreToolUse hook system with allow/deny/halt decisions, input rewriting, multi-hook aggregation. **Use this as-is.** It is well-designed. +- `internal/permission/` — Permission service with pubsub, persistent session grants, hook pre-approval via context key. **Extend, don't replace.** +- `internal/pubsub/` — Generic event broker used across hooks, permissions, notifications. **Use for all SF events.** +- `internal/session/` — Session lifecycle. +- `internal/db/` — SQLite via sqlc. **Extend schema here for SF planning tables.** +- `internal/agent/` — Agent loop, tool execution, fantasy integration. +- `internal/event/` — Event types. + +## Harness boundary + +The harness is the layer between the agent loop (fantasy) and SF's orchestration logic (milestones, phases, git, worktrees). It owns: + +1. Context budget +2. Phase transitions +3. Unit lifecycle hooks +4. Session contract (crash recovery, resume) +5. Observability +6. Supervision (stuck loop, timeout, abandon) + +Nothing in the planning or git layers should reach past the harness boundary into fantasy directly. + +## 1. Context budget + +Context budget is a first-class harness concern, not an application concern. + +Define a `Budget` type owned by the harness: + +```go +type Budget struct { + MaxTokens int + UsedTokens int + CompactAt float64 // fraction, e.g. 0.80 + HardLimitAt float64 // fraction, e.g. 
0.95 +} + +func (b *Budget) ShouldCompact() bool { return float64(b.UsedTokens)/float64(b.MaxTokens) >= b.CompactAt } +func (b *Budget) AtHardLimit() bool { return float64(b.UsedTokens)/float64(b.MaxTokens) >= b.HardLimitAt } +``` + +Rules: +- The harness updates `UsedTokens` after every model response — the agent loop never manages this itself. +- When `ShouldCompact()` is true the harness triggers compaction before the next dispatch, not mid-turn. +- When `AtHardLimit()` the harness halts the current unit, snapshots state, and surfaces an error — it does not let the agent proceed and hit a provider error. +- Budget state is persisted to SQLite after every turn so crash recovery can restore it. + +## 2. Phase transitions + +Phases (research → plan → execute → complete → reassess) are harness-owned state machine transitions, not ad-hoc function calls. + +```go +type Phase int + +const ( + PhaseResearch Phase = iota // map the problem, gather context + PhasePlan // decompose into slices and tasks, get sign-off + PhaseExecute // write the code + PhaseTDD // write tests for what was just built; red → green + PhaseVerify // run full test suite + lint + type check; gates pass + PhaseReview // structured self-review: correctness, style, security + PhaseMerge // commit, push, open PR — git service handles this via PostUnit hook + PhaseComplete // unit done; benchmark result recorded + PhaseReassess // something failed; re-enter planning with failure context + PhaseUAT // human acceptance; only used when uat_dispatch = true +) + +type PhaseTransition struct { + From Phase + To Phase + UnitID string + Timestamp time.Time + Reason string +} +``` + +Standard flow: `Research → Plan → Execute → TDD → Verify → Review → Merge → Complete` + +On gate failure in Verify: `Verify → Execute` (retry, up to `max_retries`). On max retries exceeded: `Verify → Reassess`. On review finding a real problem: `Review → Execute`. On merge conflict: `Merge → Reassess`. 
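The standard flow and its failure paths can be read as an explicit transition table. The following is a minimal sketch under stated assumptions, not the harness implementation: the `allowed` map, `CanTransition`, and `Transition` are hypothetical names, phases are modelled as strings for readable output, and `PhaseUAT` is left out because the flows above do not say where it attaches.

```go
package main

import "fmt"

// Phase is a string-valued stand-in for the harness phase enum (assumption:
// the document's enum is int-based; strings are used here for readability).
type Phase string

const (
	Research Phase = "research"
	Plan     Phase = "plan"
	Execute  Phase = "execute"
	TDD      Phase = "tdd"
	Verify   Phase = "verify"
	Review   Phase = "review"
	Merge    Phase = "merge"
	Complete Phase = "complete"
	Reassess Phase = "reassess"
)

// allowed is a hypothetical transition table derived from the standard flow
// plus the documented failure paths. The real harness may permit more edges.
var allowed = map[Phase][]Phase{
	Research: {Plan},
	Plan:     {Execute},
	Execute:  {TDD},
	TDD:      {Verify},
	Verify:   {Review, Execute, Reassess}, // pass, gate retry, retries exhausted
	Review:   {Merge, Execute},           // clean review, or a real problem found
	Merge:    {Complete, Reassess},       // merged, or merge conflict
	Reassess: {Plan},                     // re-enter planning with failure context
}

// CanTransition reports whether from -> to is an edge in the table.
func CanTransition(from, to Phase) bool {
	for _, p := range allowed[from] {
		if p == to {
			return true
		}
	}
	return false
}

// Transition returns an error for any edge that is not in the table.
func Transition(from, to Phase) error {
	if !CanTransition(from, to) {
		return fmt.Errorf("invalid transition %s -> %s", from, to)
	}
	return nil
}

func main() {
	fmt.Println(CanTransition(Verify, Execute)) // gate failure retry path
	fmt.Println(Transition(Execute, Research))  // not an edge in the table
}
```

Keeping the edges in one table means a rejected transition is a lookup miss rather than a condition scattered across call sites, which fits the idea of harness-owned transitions.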
+ +**Run attempt lifecycle** — within each phase, individual dispatch attempts move through finer-grained states: + +```go +type AttemptState int + +const ( + AttemptPreparingWorkspace AttemptState = iota + AttemptBuildingPrompt + AttemptLaunchingAgent + AttemptInitializingSession + AttemptStreamingTurn + AttemptFinishing + AttemptSucceeded + AttemptFailed + AttemptTimedOut + AttemptStalled // stall_timeout exceeded since last agent event + AttemptCanceled // issue became non-active mid-run (reconciliation) +) +``` + +`AttemptCanceled` is distinct from `AttemptFailed` — it means the work was valid but the task was externally invalidated (deleted, moved to a terminal state, superseded). The harness does not retry a canceled attempt; it releases the slot and moves on. + +**Continuation turns** — after a turn completes, the harness re-checks the unit's eligibility before dispatching the next turn. If still eligible, the continuation turn receives only a short guidance prompt injected into the existing thread context — not the full original task prompt that is already in thread history. This avoids prompt inflation on long multi-turn tasks. + +```go +type TurnKind int + +const ( + TurnFirst TurnKind = iota // full rendered task prompt + TurnContinuation // continuation guidance only, same thread +) +``` + +Workflow templates control which phases are active. A `spike` template might be `Research → Plan → Execute → Complete` with no TDD, Verify, or Merge. A `bugfix` template runs the full sequence. The harness enforces whatever sequence the template defines — it does not infer phases. 
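One way to picture template-driven sequencing is a plain lookup over the template's phase list. This is a rough sketch only: `NextPhase` is a hypothetical helper, and phases are represented by their lowercase names as an assumption about how a template lists them.

```go
package main

import "fmt"

// NextPhase returns the phase that follows current in the template's
// sequence. ok is false when current is the last phase or is not part of the
// template, so the caller never has to guess a successor.
func NextPhase(template []string, current string) (next string, ok bool) {
	for i, p := range template {
		if p == current && i+1 < len(template) {
			return template[i+1], true
		}
	}
	return "", false
}

func main() {
	// The spike sequence described above: no TDD, Verify, or Merge.
	spike := []string{"research", "plan", "execute", "complete"}
	fmt.Println(NextPhase(spike, "execute"))
	fmt.Println(NextPhase(spike, "tdd"))
}
```

Because the successor comes only from the list, a phase absent from the template has no successor and cannot be entered by accident.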
+ +```toml +# .sf/workflows/feature.toml +name = "feature" +phases = ["research", "plan", "execute", "tdd", "verify", "review", "merge", "complete"] +require_tdd = true # PhaseTDD is enforced; skipping it is a gate violation +require_review = true +max_retries = 3 # applies per gate in PhaseVerify +``` + +**Prompt template variable enforcement:** +Prompt templates use `{{variable}}` placeholders. The renderer operates in strict mode: if any `{{variable}}` in the template has no corresponding key in the vars map, `loadPrompt` panics at startup rather than silently rendering an empty string. This catches template/code drift immediately rather than at dispatch time when the missing variable would silently produce a malformed prompt. + +The canonical template variables for execute-task are: + +| Variable | Type | Notes | +|---|---|---| +| `unit_id` | string | Stable identifier for this unit | +| `unit_type` | string | "milestone" \| "slice" \| "task" | +| `phase` | string | Current phase name | +| `attempt` | int \| null | null on first dispatch; integer (≥ 1) on retry | +| `issue` | object | Full issue/task struct as a flat map | + +When adding a new `{{variable}}` to any prompt template: (1) pass it in every `loadPrompt` call site, (2) add a placeholder value in every test that renders that template, (3) recompile. Skipping step (1) or (2) will cause a startup panic in the next run. + +Rules: +- All phase transitions go through a single `Harness.Transition(ctx, from, to, reason)` method. +- Transition emits a pubsub event (extend `internal/pubsub`). The TUI subscribes; it never polls state. +- Transition is persisted to SQLite before the new phase begins. A crash mid-phase means on resume the harness sees the last committed transition and re-enters that phase cleanly. +- Invalid transitions (e.g. `Execute → Research` without a `Reassess`) are rejected at the harness boundary with a typed error — not silently allowed and corrected downstream. 
+- The harness sets `Think: true` on the model config for `Research` and `Plan` phases; off for all others. The agent does not control this. + +## 3. Unit lifecycle hooks + +Extend Crush's existing `internal/hooks/` with SF-specific hook events alongside `PreToolUse`: + +```go +const ( + EventPreToolUse = "PreToolUse" // already in Crush + EventPreDispatch = "PreDispatch" // before a unit is dispatched + EventPostUnit = "PostUnit" // after a unit completes + EventPhaseChange = "PhaseChange" // on phase transition + EventAutoLoop = "AutoLoop" // each auto-mode iteration +) +``` + +The existing hook aggregation logic (`aggregate()` in `hooks.go`) handles allow/deny/halt — reuse it for all new events. + +**Unsupported tool calls — continue, don't stall:** +If the agent calls a tool that is not registered or not supported in the current dispatch context, the harness returns a structured tool failure result to the agent and continues the session. It never stalls or panics on an unknown tool name. The agent sees the failure in its context and can adapt. This applies to both built-in tools and dynamic tools registered via hooks. + +**Client-side tool response contract:** +Every tool call — successful or not — receives a response in this shape: + +```go +type ToolResponse struct { + Success bool `json:"success"` + Output string `json:"output"` + ContentItems []ContentItem `json:"contentItems"` +} + +type ContentItem struct { + Type string `json:"type"` // always "inputText" for text results + Text string `json:"text"` +} +``` + +For successful calls: `success = true`, `output` = result summary, `contentItems` contains the full result text. For unsupported or failed calls: `success = false`, `output` = human-readable error, `contentItems` lists which tools *are* supported in the current context. This shape must be consistent — the agent relies on the `success` field to distinguish real failures from tool-not-found errors. 
Never return a bare string or a different shape for error cases. + +PostUnit hooks receive a `UnitResult` payload: phase, duration, token cost, model used, verdict. This is what enables `/sf rate` (model tier rating) and post-unit git operations without the harness knowing about git. + +**Doc sync hook (built-in, runs after PhaseMerge):** +After a slice or milestone merges, the harness dispatches a lightweight doc-sync unit. The agent checks whether the completed work changed the tech stack, introduced a new pattern, or requires a guideline update, and proposes edits to project-level docs (`ARCHITECTURE.md`, `CONVENTIONS.md`, `STACK.md`). The agent writes proposed changes as a diff to stdout; the harness surfaces them to the TUI for user approval before committing. This keeps project context current without manual maintenance. + +Rules: +- Doc sync runs as a `fast`-tier dispatch (short context, light reasoning needed). +- It only runs after `PhaseMerge`, not after every phase. +- If the agent finds nothing to update it emits an empty diff and the hook is a no-op. +- Doc sync can be disabled per-project: `[harness] doc_sync = false`. + +## 4. Session contract + +A session has a defined contract the harness enforces: + +```go +type SessionState int + +const ( + SessionIdle SessionState = iota + SessionRunning + SessionPaused + SessionInterrupted // abnormal stop, recoverable + SessionComplete + SessionFailed // unrecoverable +) +``` + +Rules: +- Only the harness writes session state to SQLite — never the planning or execution layers. +- On startup, the harness checks for `SessionInterrupted` state. If found it offers resume before accepting new work. +- `SessionPaused` is a clean pause (`/sf pause`) — state is fully committed, resume is safe. +- `SessionInterrupted` means the process died mid-turn. The harness reconstructs from the last committed phase transition and unit state. It does not replay tool calls. 
+- A lock file (`~/.sf/run.lock` containing PID) prevents two SF processes writing to the same session. The harness acquires this on start and releases on clean exit. On startup, a stale lock (PID not running) is cleaned up automatically. + +## 5. Observability + +Every harness boundary crossing is traced. Use structured logging via Crush's existing `internal/log` package and extend with span-style events: + +```go +type Span struct { + TraceID string + SpanID string + Operation string // "tool_call", "phase_transition", "model_request" + StartedAt time.Time + Duration time.Duration + Attrs map[string]any + Error error +} +``` + +**Structured log format:** +All harness log lines use stable `key=value` phrasing. Required context fields: + +| Scope | Required fields | +|-------|----------------| +| Any unit-related log | `unit_id=`, `unit_type=` | +| Agent session lifecycle | `session_id=`, `turn_count=` | +| Phase transitions | `from=`, `to=`, `reason=` | +| Gate execution | `gate=`, `attempt=`, `passed=` | + +Include action outcome in the message: `completed`, `failed`, `retrying`, `canceled`. Never log large raw payloads — truncate at 512 bytes and note `[truncated]`. If a log sink fails, continue running and emit a warning through any remaining sink. + +Rules: +- Every tool call, phase transition, model request, and hook execution emits a span. +- Spans are written to a JSONL file (`~/.sf/trace.jsonl`) that rolls daily. This is what `/sf forensics` and `/sf logs debug` read. +- Span emission is asynchronous: the emitter sends to a buffered channel that a background writer goroutine drains, so the hot path normally does not wait on disk I/O. Never drop a span on the floor; if the buffer is full, block briefly rather than discard. +- Cost tracking (tokens × model price) is computed at the harness layer from model response metadata, not estimated by the agent. `/sf session-report` reads directly from the trace. 
+ +**Log rotation:** +Disk log is bounded: 10MB max file size, 5 rotating files. 
Lines are single-line (no multi-line JSON blobs). When file logging is active, the default stderr/console handler is removed — logs go to file only. Default path: `~/.sf/log/sf.log`. Configurable: + +```toml +[harness.log] +path = "~/.sf/log/sf.log" +max_size = 10_485_760 # 10MB in bytes +max_files = 5 +stderr = false # true = write to both file and stderr (dev only) +``` + +Never write raw multi-line hook output to the structured log — truncate at 2 KB and append `(truncated)` if longer. + +**Token accounting precision:** +Provider responses can arrive as absolute thread totals or as per-turn deltas. The harness always prefers absolute totals when available (`thread/tokenUsage/updated`-style events) and tracks the last-reported total to compute deltas, preventing double-counting if both payload types appear in the same session. Never treat a generic `usage` map as a cumulative total unless the event type explicitly defines it that way. Aggregate totals (input, output, cache-read, cache-write, cost-usd) accumulate in the orchestrator state and are included in every runtime snapshot. + +**Rate-limit tracking:** +The harness tracks the latest rate-limit payload from any provider event and surfaces it in the TUI and HTTP API. No retry logic is driven by this — the circuit breaker and backoff handle that separately. Rate-limit data is observability-only. + +**`turn_input_required` in auto-mode:** +When the agent raises `turn_input_required` during auto-mode, there are two valid responses: + +- **Soft response (preferred):** inject a fixed message — `"This is a non-interactive session. Operator input is unavailable."` — as a `user` role turn and let the session continue. The agent sees the message and can adapt (e.g. make a reasonable default choice and continue). This keeps the session alive and avoids unnecessary retry noise for agents that ask a clarifying question but don't strictly require an answer. 
+- **Hard failure:** end the attempt immediately, record `ErrTurnInputRequired`, schedule a normal failure retry. Appropriate when the project requires zero ambiguous agent choices in auto-mode. + +The default behaviour is configurable per-project: + +```toml +[harness] +turn_input_required = "soft" # or "hard" +``` + +In interactive/step mode, the harness always surfaces the request to the user via the TUI and waits up to `unit_timeout` before failing with `ErrTurnInputRequired`. The harness MUST NOT leave a run stalled indefinitely. + +**Intent chapters:** +Spans are grouped into named chapters — not just by phase, but by intent. A chapter opens when the agent starts pursuing a distinct goal (e.g. "investigate auth bug", "write migration") and closes when that intent resolves (success, failure, or pivot). Chapter boundaries are inferred from phase transitions and tool call patterns; the agent can also explicitly open a chapter via `chapter_open(name)`. + +Chapters serve two purposes: +1. **Context recovery** — on resume after a crash, the harness reconstructs "what the agent was doing and why" from the chapter log rather than replaying raw tool calls. The chapter summary is injected at the top of the restored context. +2. **Hindsight recall** — completed chapters are stored as discrete memory units in Hindsight. Recall queries match against chapter intent, not just content, giving more relevant results for "how did we handle X before?" + +```go +type Chapter struct { + ID string + UnitID string + Name string // inferred or agent-declared + Intent string // one-sentence summary written at close + OpenedAt time.Time + ClosedAt *time.Time + Outcome string // "success" | "failure" | "pivot" + SpanIDs []string // spans that belong to this chapter +} +``` + +## 6. Supervision + +The harness runs a supervisor goroutine alongside the agent loop. It does not touch agent state directly — it communicates via the pubsub broker. 
+ +Supervisor checks (run every N seconds during auto-mode): + +```go +type SupervisorCheck interface { + Name() string + Check(ctx context.Context, state SupervisorState) SupervisorSignal +} + +type SupervisorSignal int +const ( + SignalOK SupervisorSignal = iota + SignalWarn // log, surface in TUI + SignalPause // pause auto-loop, wait for user + SignalAbort // stop unit, mark interrupted +) +``` + +Built-in checks: +- **StuckLoop** — same phase for > N turns with no tool calls completing successfully +- **BudgetWarning** — context approaching compaction threshold +- **TimeoutCheck** — unit running longer than configured max +- **AbandonDetect** — agent producing output with no tool calls (talking without acting) +- **GitDivergence** — working branch has diverged from base in unexpected ways +- **ReconciliationCancel** — the unit's underlying issue/task transitioned to a non-active state (deleted, cancelled, superseded) while the agent was running. The supervisor emits `SignalAbort` with reason `canceled_by_reconciliation`; the harness transitions to `AttemptCanceled` and releases the slot without retry. +- **BlockerCheck** — before each dispatch, the harness checks whether the unit has unresolved blockers (upstream tasks not yet `PhaseComplete`). If blockers exist, dispatch is skipped and the unit stays queued. Checked again on the next poll tick, not retried with backoff. +- **ModelUnavailable** — provider returns a "model not supported / not found" error class (distinct from a transient 429 or 5xx). On detection the supervisor emits `SignalAbort` immediately rather than retrying to the timeout; the harness surfaces a typed `ErrModelUnavailable` so the router can fall back to the next candidate in the tier rather than spinning. +- **CircuitBreaker** — when the same model fails with non-transient errors 3 consecutive times within a session, the supervisor trips a circuit breaker for that model for the remainder of the session. 
Subsequent dispatches in that tier skip the tripped model. The circuit state is written to SQLite so it survives a restart. Resets after 24h or on explicit `/sf reset-circuits`. + +The supervisor signals map to pubsub events the TUI and auto-loop both subscribe to. The auto-loop acts on `SignalPause` and `SignalAbort`; the TUI shows warnings on `SignalWarn`. The supervisor never calls `os.Exit` or panics — it signals and the harness decides. + +## 7. Tool sandboxing + +Extend Crush's existing permission service (`internal/permission/`) rather than building a new sandbox. + +SF-specific permission rules: +- `git:write` — any git operation that mutates state (commit, branch, push). Requires explicit grant in auto-mode. +- `worktree:create` and `worktree:delete` — worktree lifecycle operations. +- `fs:write-outside-project` — writes outside the project root. Always prompt, never auto-approve. +- `shell:exec` — arbitrary shell execution. Allowlist specific commands rather than blanket approval. + +In auto-mode the harness calls `permission.AutoApproveSession()` only for the allowlisted operations defined in the project's `.sf/config.toml`. Sensitive operations always prompt regardless of auto-mode. + +## 8. Configuration contract + +Harness configuration lives in `.sf/config.toml` (project-level) and `~/.sf/config.toml` (global). Project overrides global. + +```toml +[harness] +context_compact_at = 0.80 +context_hard_limit = 0.95 +unit_timeout = "10m" +supervisor_interval = "10s" + +[harness.auto_approve] +tools = ["bash:read", "fs:read", "git:status", "git:diff"] + +[harness.hooks] +pre_dispatch = ["./hooks/pre-dispatch.sh"] +post_unit = ["./hooks/post-unit.sh"] +``` + +Rules: +- The harness validates config on startup and fails fast with a descriptive error — it never silently ignores unknown keys or bad values. +- `/sf doctor` runs `HarnessConfig.Validate()` as one of its checks. 
+- **Dynamic reload** — the harness polls `.sf/config.toml` every second using a `{mtime, size, content_hash}` stamp rather than fsnotify (simpler, more portable across Linux/macOS/WSL). When the stamp changes, it re-parses and re-validates. Valid changes apply immediately to future dispatch, concurrency limits, and hook lists. Invalid reloads log an error and keep the last known good config — never crash. In-flight agent runs are not interrupted by a config reload. +- The following fields are always session-immutable even with dynamic reload enabled: `worktree_mode`, `context_compact_at`, `context_hard_limit`. Changing these requires restart. + +```toml +[harness] +context_compact_at = 0.80 +context_hard_limit = 0.95 +unit_timeout = "10m" +supervisor_interval = "10s" +doc_sync = true # post-merge doc synchronisation hook + +[harness.concurrency] +max_agents = 10 +max_agents_by_phase.execute = 4 # cap concurrent execute-phase units +max_agents_by_phase.tdd = 4 +max_agents_by_phase.verify = 10 # verify is cheap; allow more + +[harness.auto_approve] +tools = ["bash:read", "fs:read", "git:status", "git:diff"] + +[harness.hooks] +pre_dispatch = ["./hooks/pre-dispatch.sh"] +post_unit = ["./hooks/post-unit.sh"] +``` + +`max_agents_by_phase` limits how many units can be in a given phase simultaneously across all running agents. This prevents resource contention — e.g. capping concurrent `execute` phases stops 10 agents from all hammering the same codebase at once while `verify` phases (which only read) can run freely. + +## 9. Knowledge layer + +> **Superseded.** This section's earlier content described "hermes-memory" with `hermes_memory_*` tools, a local in-process embedding/reranker pipeline, and per-event Hermes plugin hooks. 
**That model is obsolete.** SPEC.md § 16 defines the actual knowledge layer: +> +> - **Hindsight is the sole knowledge backend.** The harness calls `hindsight.Recall(...)` / `hindsight.Retain(...)` / `hindsight.Feedback(...)` through a thin client wrapper. +> - **No local sqlite-vec, no embedding endpoint, no reranker tier discussion.** Hindsight handles all of that server-side. +> - **No hermes_memory_* tools.** Anti-patterns, learnings, and recall are managed through the Hindsight client, not exposed as agent-callable tools (except for explicit retain via PostUnit and explicit recall on dispatch). +> - **`local_anti_patterns` SQLite mirror** is the only durable on-disk knowledge state; it preserves anti-patterns when Hindsight is unreachable. +> +> See SPEC.md §§ 16.1, 16.1.1 (Hindsight client interface), 16.4 (anti-patterns), 16.7 (retrieval delegation). + +## 10. Post-unit hook pipeline + +After each unit completes, the harness serializes a `UnitResult` payload and dispatches it through all registered `PostUnit` hooks in order. Hooks are synchronous from the harness's perspective — they run sequentially, not concurrently, and the next dispatch does not begin until all PostUnit hooks have returned. + +```go +type UnitResult struct { + UnitID string + UnitType string // "milestone", "slice", "task" + Phase Phase + Verdict string // "success", "failure", "abandoned" + Duration time.Duration + InputTokens int + OutputTokens int + CacheHits int + CostUSD float64 + Model string + Error error // non-nil on failure + Learnings []string // explicit learnings extracted by the agent +} +``` + +Hook execution rules: +- Each hook runs in a child process (same pattern as `PreDispatch` hooks in config). The `UnitResult` is serialized to JSON and passed via stdin. +- A hook that exits non-zero signals `SignalAbort` — the harness stops the session and marks it `SessionFailed`. A hook that times out (default 30s) is killed and logged but does not block the next dispatch. 
+- The git service subscribes to PostUnit via a hook and handles commits, branch creation, and push. The harness knows nothing about git. +- Hindsight feedback (retain learnings, mark anti-patterns) is emitted from a built-in PostUnit hook, not a subprocess — it calls the Hindsight client directly. See SPEC.md § 16.1.1. +- PostUnit hook results are written to the trace as child spans of the unit span. + +## 11. Worktree isolation + +SF operates in one of two worktree modes, configured per-project in `.sf/config.toml`: + +```toml +[harness] +worktree_mode = "branch-per-slice" # or "milestone-per-worktree" +``` + +**branch-per-slice** (default): +- Each slice gets its own git branch (`sf/m{n}-s{n}-{slug}`) created from the current base. +- The harness emits `worktree:create` permission events before branch creation; a git service hook handles the actual `git worktree add`. +- After a slice's PostUnit hooks run, the git service merges the branch back to the integration branch. The harness waits for the merge hook to return before marking the slice complete. +- Merge conflicts trigger `SignalPause` — the supervisor surfaces the conflict in the TUI and halts auto-mode until the user resolves it. + +**milestone-per-worktree**: +- A single worktree is created for the entire milestone at milestone start. +- All slices in the milestone share that worktree. The git service commits incrementally within it. +- The worktree is merged (or PR'd) at milestone PostUnit time. +- Suitable for milestones where slices are tightly coupled and sequential; branch-per-slice is preferred for parallel-safe work. + +Worktree lifecycle events (extend `internal/hooks/`): + +```go +const ( + EventWorktreeCreate = "WorktreeCreate" + EventWorktreeDelete = "WorktreeDelete" + EventMergeReady = "MergeReady" + EventMergeConflict = "MergeConflict" +) +``` + +The harness emits these events; the git service subscribes. The harness never calls `git` directly. + +## 12. 
Model routing and budget signals + +Model routing is not a harness concern — the harness does not select models. It does, however, emit signals that the router reads: + +- **BudgetWarning** (pubsub) — emitted when `ShouldCompact()` is true. The router may downgrade model tier to extend the budget window. +- **AtHardLimit** (pubsub) — emitted before the harness halts the unit. The router records this as a signal that the current model tier consumed the context. +- Each span in the trace records `model` in its attributes. `/sf session-report` reads the trace to report per-model token costs. + +The router subscribes to these events independently. The harness does not know what decision the router makes — it only acts on the budget state it observes. + +## 13. Verification gates + +Between PostUnit and the next dispatch, the harness runs a verification gate. The gate is a list of checks defined in `.sf/config.toml`: + +```toml +[harness.gates] +post_slice = ["./gates/run-tests.sh", "./gates/lint.sh"] +post_milestone = ["./gates/integration-tests.sh"] +``` + +Gate execution rules: +- Gates run as subprocesses. The gate script receives the `UnitResult` JSON on stdin (same payload as PostUnit hooks). +- A gate exits 0 = pass. Non-zero = fail. Fail increments the retry counter for that unit. +- Default max retries: 3. Configurable per gate type (`max_retries_slice`, `max_retries_milestone`). +- On retry, the harness re-dispatches the same unit with the gate failure appended to the context. The agent sees what failed and why. +- After max retries, the harness transitions to `SessionFailed` and surfaces a `GateBlocked` event on pubsub. The TUI shows which gate failed and what output it produced. +- Gate results are stored in SQLite and written as span events on the unit span so they appear in `/sf forensics`. 
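The retry rules above can be sketched as a small loop. This is an illustrative sketch under assumptions, not the harness code: `GateFunc` and `RunGate` are hypothetical names, the gate script is replaced by an injected function, `max_retries` is treated as the total number of gate runs (the text does not say whether the first run counts), and the re-dispatch of the unit between attempts is omitted.

```go
package main

import "fmt"

// GateFunc is a hypothetical stand-in for running one gate script. It returns
// whether the gate passed and its combined output.
type GateFunc func(attempt int) (passed bool, output string)

// RunGate runs gate until it passes or maxRetries failures have accumulated.
// Each failure's output is collected so a caller could append it to the
// context of the next dispatch. Assumption: maxRetries counts total runs.
func RunGate(gate GateFunc, maxRetries int) (passed bool, failures []string) {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		ok, out := gate(attempt)
		if ok {
			return true, failures
		}
		failures = append(failures, fmt.Sprintf("attempt %d: %s", attempt, out))
	}
	// Retries exhausted: a caller would surface the blocked gate here.
	return false, failures
}

func main() {
	// A fake gate that fails twice and then passes.
	flaky := func(attempt int) (bool, string) {
		if attempt < 3 {
			return false, "2 tests failed"
		}
		return true, ""
	}
	passed, failures := RunGate(flaky, 3)
	fmt.Println(passed, len(failures))
}
```

Injecting the gate as a function keeps the retry accounting testable without spawning subprocesses; a real runner would wrap the script invocation and pass the `UnitResult` JSON on stdin.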
+ +```go +type GateResult struct { + GateName string + UnitID string + Passed bool + Attempt int + MaxRetries int + Output string // combined stdout+stderr, truncated at 8KB + Duration time.Duration +} +``` + +**PhaseReview — chunked review:** +Large diffs must not be reviewed in a single pass. The harness splits the changed file list into chunks of ≤ 300 lines each before dispatching the review agent. Each chunk is reviewed independently; findings are accumulated. The final review pass synthesises all findings across chunks. This keeps each review dispatch within a predictable context budget and prevents the agent from glossing over the end of a large diff. + +```go +const ReviewChunkLines = 300 + +func chunkDiff(files []ChangedFile) [][]ChangedFile { + // bin-pack files into chunks, respecting ReviewChunkLines + // files larger than ReviewChunkLines get their own chunk +} +``` + +**Unit archive after completion:** +When a slice or milestone reaches `PhaseComplete`, the harness moves its artifact directory from `.sf/active/` to `.sf/archive/{date}-{unit-id}/`. Active work stays in `.sf/active/` (small, fast to scan). Archived units are queryable via `/sf history` and readable by the hindsight recall pipeline. The archive move is atomic (rename, not copy+delete). + +**`specs.check` gate (built-in, run at PhaseVerify):** +Every exported Go function, type, and constant in the harness must have a godoc comment. The `specs.check` gate runs an AST-based pass over the harness package and fails if any public identifier is missing documentation. This is a compile-time quality gate, not a style suggestion. + +```go +// Gate: exit 1 if any exported decl lacks a doc comment +// Implemented as a go/ast walk — no external linter dependency. +// Only applies to the harness package itself, not to extension or hook code. +``` + +This keeps the public API self-documenting without relying on a linter that the agent might not have available in the workspace. + +## 14. 
SF environment variables and lock files + +### Environment variables + +| Variable | Purpose | +|----------|---------| +| `SF_PROJECT_ROOT` | Absolute path to the project root. Set by the harness on startup; never read from the environment — the harness always sets it. | +| `SF_HOME` | SF data directory. Defaults to `~/.sf`. Overridable for multi-user systems. | +| `SF_TRACE_ENABLED` | Set to `1` to enable structured trace collection to `~/.sf/trace.jsonl`. Off by default. | +| `SF_MILESTONE_LOCK` | Set to the active milestone ID when a milestone is in progress. Hook scripts can read this to scope their work. | +| `SF_UNIT_ID` | Set to the active unit ID during dispatch and hook execution. | +| `SF_SESSION_ID` | Set to the harness session UUID. Stable across resume. | + +### Lock files + +**`~/.sf/run.lock`** — global process lock. Contains the PID of the running SF process. The harness acquires this on startup (fails if another PID is alive and running). On clean exit, the file is removed. On startup, a stale lock (PID not in `/proc`) is cleaned up automatically and logged. + +**`.sf/auto.lock`** — signals that auto-mode is active for this project. Contains the session ID. Created when auto-mode starts, removed when it stops (cleanly or via interrupt). Hook scripts and external tools can check for this file to gate behavior. + +**`.sf/runtime/paused-session.json`** — written by the harness when `SessionPaused` is committed. Contains: session ID, phase, unit ID, budget state, timestamp. On resume, the harness reads this file, restores state, and deletes the file. If the file exists but the session is not `SessionPaused` (e.g., process was killed), the harness treats it as `SessionInterrupted` and offers recovery. + +**`.sf/runtime/gate-state.json`** — written by the harness after each gate run. Contains the last gate result per unit, retry counts, and blocked state. Persisted so that a crash mid-gate retry does not reset the retry counter. + +## 15. 
Persistent agents + +An agent is a named, persistent identity with its own memory blocks, system prompt, and message history. Unlike a unit (which is ephemeral work within a session), an agent persists indefinitely and can be woken at any time by an incoming message or an explicit dispatch. + +### Schema + +```sql +CREATE TABLE agents ( + id TEXT PRIMARY KEY, -- stable UUID + name TEXT NOT NULL UNIQUE, + system TEXT NOT NULL, -- system prompt template + model TEXT NOT NULL, + created_at INTEGER NOT NULL, + last_active INTEGER +); + +CREATE TABLE agent_memory_blocks ( + agent_id TEXT NOT NULL REFERENCES agents(id), + label TEXT NOT NULL, -- e.g. "persona", "human", "task" + value TEXT NOT NULL DEFAULT '', + char_limit INTEGER NOT NULL DEFAULT 2000, + read_only INTEGER NOT NULL DEFAULT 0, + PRIMARY KEY (agent_id, label) +); + +CREATE TABLE agent_messages ( + id TEXT PRIMARY KEY, + agent_id TEXT NOT NULL REFERENCES agents(id), + seq INTEGER NOT NULL, -- monotonically increasing per agent + role TEXT NOT NULL, -- "user" | "assistant" | "tool_call" | "tool_return" | "system" + content TEXT NOT NULL, + tool_name TEXT, -- set when role = tool_call or tool_return + created_at INTEGER NOT NULL +); +``` + +### Memory block injection + +At dispatch time the harness renders the agent's memory blocks into the system prompt: + +``` + +{{value}} +{{value}} + +``` + +The agent can edit blocks mid-conversation using two built-in tools: + +| Tool | Signature | Effect | +|------|-----------|--------| +| `core_memory_append` | `(label string, content string)` | Appends content to the named block (respects char_limit) | +| `core_memory_replace` | `(label string, old string, new string)` | Replaces a substring within the named block | + +Both tools write directly to `agent_memory_blocks` before the next turn is dispatched, so a crash mid-session preserves the updated block state. 
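As an illustration of how the two memory tools might behave against a single block, here is a minimal in-memory sketch. It makes several assumptions that the text above does not state: `char_limit` is counted in runes, an edit that would exceed the limit is rejected rather than truncated, `read_only` blocks refuse all edits, and only the first occurrence of `old` is replaced. The type and error names are hypothetical, and persistence to `agent_memory_blocks` is omitted.

```go
package main

import (
	"errors"
	"strings"
	"unicode/utf8"
)

// MemoryBlock mirrors one row of agent_memory_blocks (hypothetical Go shape).
type MemoryBlock struct {
	Label     string
	Value     string
	CharLimit int
	ReadOnly  bool
}

var (
	ErrReadOnly  = errors.New("memory block is read-only")
	ErrOverLimit = errors.New("edit would exceed char_limit")
	ErrNotFound  = errors.New("old text not found in block")
)

// set validates the candidate value and stores it if it fits.
func (b *MemoryBlock) set(next string) error {
	if b.ReadOnly {
		return ErrReadOnly
	}
	if utf8.RuneCountInString(next) > b.CharLimit {
		return ErrOverLimit
	}
	b.Value = next
	return nil
}

// Append sketches core_memory_append: concatenate content onto the block.
func (b *MemoryBlock) Append(content string) error {
	return b.set(b.Value + content)
}

// Replace sketches core_memory_replace: swap the first occurrence of old.
func (b *MemoryBlock) Replace(old, replacement string) error {
	if old == "" || !strings.Contains(b.Value, old) {
		return ErrNotFound
	}
	return b.set(strings.Replace(b.Value, old, replacement, 1))
}
```

A rejected edit leaves the block unchanged, so a tool error can be returned to the agent without any partial write reaching SQLite.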
+ +### Agent lifecycle + +```go +type AgentState int + +const ( + AgentIdle AgentState = iota // no pending messages, not running + AgentRunning // dispatched, consuming tokens + AgentWaiting // sent a message to another agent, awaiting reply + AgentStopped // explicitly stopped; will not wake automatically +) +``` + +The harness owns all state transitions. The agent loop never writes `AgentState` directly. + +## 16. Inter-agent messaging + +Agents communicate by calling a `send_message` tool. The harness routes the message to the target agent's inbox, wakes it if it is `AgentIdle`, and records both the outbound and inbound messages in `agent_messages`. + +### Wire format + +```go +// Tool the agent calls: +// send_message(to: string, message: string) -> void +// +// to: agent name or agent ID +// message: plain text; the receiving agent sees it as a "user" role message + +type AgentMessage struct { + ID string + FromAgent string + ToAgent string + Content string + SentAt time.Time +} +``` + +### Inbox and wake + +Each agent has a persistent inbox table: + +```sql +CREATE TABLE agent_inbox ( + id TEXT PRIMARY KEY, + agent_id TEXT NOT NULL REFERENCES agents(id), + from_agent TEXT NOT NULL, + content TEXT NOT NULL, + delivered INTEGER NOT NULL DEFAULT 0, + created_at INTEGER NOT NULL +); +``` + +Wake rules: +- When `send_message` is called, the harness inserts into `agent_inbox` and emits an `AgentWake` pubsub event for the target agent. +- The target agent's run loop checks its inbox at the start of each dispatch. Undelivered messages are prepended to the context as `user` role messages in arrival order, then marked `delivered = 1`. +- An `AgentIdle` agent that receives an `AgentWake` event starts a new dispatch cycle immediately. An `AgentRunning` agent queues the message for its next cycle. +- An agent that sends a message and calls `wait_for_reply()` transitions to `AgentWaiting`. 
The harness suspends its dispatch loop until the target agent sends a reply (via another `send_message`) or a configurable timeout elapses. + +### Pubsub events + +```go +const ( + EventAgentWake = "AgentWake" // target agent should start/resume + EventAgentMessage = "AgentMessage" // message routed (for TUI and tracing) + EventAgentIdle = "AgentIdle" // agent completed its turn, no pending messages +) +``` + +### Harness guarantees for agent fleets + +Every agent run is wrapped by the same harness that wraps single-agent units: +- Budget tracking and compaction fire per-agent, not globally. +- The supervisor goroutine monitors all running agents; `StuckLoop` and `AbandonDetect` checks apply to each independently. +- Crash recovery restores each agent to its last committed message sequence independently — a crash in agent B does not affect agent A's recovery. +- Cost is recorded per agent in the trace. `/sf session-report` breaks down spend by agent ID. + +### Agent handoff + +An agent that determines a task is outside its competence can call `handoff(to, context)` to transfer the active task to a named specialist agent. Handoff is different from `send_message`: + +- The calling agent's current unit is suspended (not completed). +- The target agent receives the full task context (system prompt, memory blocks, last N messages) pre-loaded into its inbox. +- The calling agent transitions to `AgentWaiting` until the specialist replies with a result or escalates back. +- If the target agent is not found or is `AgentStopped`, `handoff` returns an error and the calling agent continues. + +Handoff is the runtime complement to static phase routing — the agent self-nominates when the task complexity exceeds its configured role. A coordinator agent that receives a handoff reply can merge the result and continue its own unit. + +```go +// Tool the agent calls: +// handoff(to: string, context: string) -> HandoffTicket +// +// Returns a ticket ID. 
The agent calls wait_for_reply(ticket_id) to block. +``` + +### Inbox event log (append-only) + +`agent_inbox` is append-only: rows are never deleted and their content is never rewritten; the only permitted mutation is setting `delivered = 1`. This gives the harness a complete audit trail of all inter-agent communication. Conflict resolution for shared state happens at read time (last-writer-wins on `agent_memory_blocks` by `updated_at`), not via locking. The event log is the source of truth for what happened; the current block state is a materialised view of it. + +### What not to build for inter-agent messaging + +- **Shared memory** — agents do not share memory blocks. If two agents need a common fact, one agent sends it as a message. Shared mutable state creates the same race conditions here as anywhere else. +- **Broadcast** — there is no `send_message_all`. A coordinator agent sends individual messages. This keeps routing explicit and traceable. +- **Synchronous RPC** — `send_message` is fire-and-forget from the caller's perspective. `wait_for_reply()` is a separate explicit call, and it has a timeout. Never block the harness loop indefinitely waiting for another agent. + +## 17. Failure taxonomy + +Every harness failure has a class. The class determines recovery behavior — not the error message. + +| Class | Examples | Recovery | +|---|---|---| +| `config` | Missing WORKFLOW.md, invalid TOML, unknown tracker kind, missing API key | Block new dispatches. Keep service alive. Continue reconciliation. Emit operator-visible error. | +| `workspace` | Directory creation failure, hook timeout, invalid path | Fail the current attempt. Orchestrator retries with backoff. | +| `agent_session` | Startup handshake failed, turn timeout, turn cancelled, subprocess exit, stalled session, `turn_input_required` (auto-mode) | Fail the current attempt. Orchestrator retries with backoff. 
| +| `tracker` | API transport error, non-200, GraphQL errors, malformed payload | **Candidate fetch failure**: skip this tick, try next tick. **Reconciliation failure**: keep workers running, retry next tick. **Startup cleanup failure**: log and continue. | +| `observability` | Snapshot timeout, dashboard render error, log sink failure | Log and ignore. Never crash the orchestrator over an observability failure. | + +**Scheduler state is intentionally in-memory.** Restart recovery does not restore retry timers, live sessions, or in-flight agent state. After restart: startup terminal cleanup → fresh poll → re-dispatch eligible work. This is not a limitation — it is a design choice. Durable retry state is on the TODO list for a future conformance extension. + +```go +// Typed error codes — never match on error strings +const ( + ErrMissingWorkflowFile = "missing_workflow_file" + ErrWorkflowParseError = "workflow_parse_error" + ErrUnsupportedTrackerKind = "unsupported_tracker_kind" + ErrMissingTrackerKey = "missing_tracker_api_key" + ErrWorkspaceCreation = "workspace_creation_failed" + ErrHookTimeout = "hook_timeout" + ErrAgentStartup = "agent_session_startup" + ErrTurnTimeout = "turn_timeout" + ErrTurnFailed = "turn_failed" + ErrTurnInputRequired = "turn_input_required" + ErrStalled = "stalled" + ErrCanceledByReconciliation = "canceled_by_reconciliation" + ErrModelUnavailable = "model_unavailable" + ErrCircuitOpen = "circuit_open" +) +``` + +## 18. 
Worker attempt inner loop + +The exact sequence inside a single worker run — how the harness wires workspace, hooks, agent session, and multi-turn continuation: + +``` +run_worker_attempt(unit, attempt): + workspace = create_or_reuse_workspace(unit.id) + if workspace failed → fail_attempt(ErrWorkspaceCreation) + + run_hook("before_run", workspace) → fatal if fails + + session = agent.start_session(cwd=workspace) + if session failed: + run_hook_best_effort("after_run", workspace) + fail_attempt(ErrAgentStartup) + + turn = 1 + loop: + prompt = build_prompt(unit, attempt, turn) + if prompt failed: + agent.stop_session(session) + run_hook_best_effort("after_run", workspace) + fail_attempt(ErrPromptRender) + + result = agent.run_turn(session, prompt) + if result failed: + agent.stop_session(session) + run_hook_best_effort("after_run", workspace) + fail_attempt(result.error) + + // re-check unit state between turns + current_state = tracker.fetch_unit_state(unit.id) + if current_state is non-active → break // AttemptCanceled + if turn >= max_turns → break + + turn++ + // continuation turns get guidance-only prompt, not the full original + + agent.stop_session(session) + run_hook_best_effort("after_run", workspace) + exit_normal() +``` + +**After a normal exit**, the orchestrator schedules a short continuation retry (1 second) to re-poll and decide whether the unit is still eligible. This is not a failure retry — it is deliberate re-evaluation. If the unit is still active, a new session starts; if terminal, the claim is released. + +**After an abnormal exit**, exponential backoff: `delay = min(10s × 2^(attempt-1), max_retry_backoff)`. Default cap: 5 minutes. + +## 19. Distributed execution (SSH worker extension) + +The orchestrator always runs centrally. Workers can execute on remote hosts over SSH — relevant for executing agents on tailnet nodes with more RAM/GPU. 
+ +```toml +[worker] +ssh_hosts = ["mikki-bunker", "forge-gpu-1"] +max_concurrent_agents_per_host = 3 +``` + +Rules: +- `workspace.root` is resolved on the **remote host**, not the orchestrator. +- The agent subprocess is launched over SSH stdio; the orchestrator owns the session lifecycle even though execution is remote. +- Continuation turns within one worker lifetime stay on the same host and workspace. +- If a host is at capacity, dispatch waits rather than silently falling back to local or another host. +- Once a run has produced side effects, moving to another host on retry is treated as a new attempt (not invisible failover). +- Host health: a dead or overloaded host reduces available capacity; it does not cause duplicate execution. +- The run record includes `worker_host` so operators can see where each run executed and where to find its logs. + +This is the model for running singularity-crush against the ace-coder or inference-fabric repos on bunker while keeping the orchestrator on the local machine. + +## 20. Trust boundary and harness hardening + +Every deployment MUST document its trust posture explicitly. There is no universal safe default. + +**Singularity-crush defaults (trusted single-user developer machine):** +- Auto-approve tool execution and file changes within the workspace. +- `turn_input_required = "soft"` (inject non-interactive message, let agent adapt). +- Workspace isolation enforced (path containment, sanitized names). +- Secrets from Vault only — never plaintext in config. + +**Workspace path containment — symlink-aware canonicalization:** +A naive `filepath.EvalSymlinks` or `path.Clean` check can be defeated by a symlink that resolves outside the workspace root. The harness uses segment-by-segment canonicalization instead: + +```go +// resolveCanonical walks each path segment via lstat, +// following symlinks one hop at a time and re-resolving from the new target. 
+// Only fails on permission errors — ENOENT for not-yet-created paths is OK +// (returns the unconsumed tail appended to the resolved prefix). +func resolveCanonical(root string, segments []string) (string, error) { + // for each segment, lstat the candidate path: + // - symlink → read link target, expand relative to current resolved prefix, + // prepend remaining segments, recurse + // - regular → append to resolved prefix, continue + // - ENOENT → append segment + rest as-is (path not yet created) + // - other err → return error +} +``` + +After canonicalization, assert `canonical_workspace` has `canonical_root + "/"` as a prefix. This catches symlinks that escape the root even if `path.Clean` would not. Apply this check in both directions: local filesystem (via `lstat`) and, for remote workers, via a shell script that resolves each segment before `mkdir`. + +The workspace directory name is also sanitized: `[^a-zA-Z0-9._-]` → `_`. This prevents path injection via issue identifiers that contain slashes, `..`, or null bytes. + +**Hardening measures for less-trusted environments:** +- Filter which issues/tasks are eligible for dispatch — untrusted or out-of-scope tasks must not automatically reach the agent. +- Restrict the `linear_graphql` (or equivalent) client-side tool to read-only or project-scoped mutations only. +- Run the agent subprocess under a dedicated OS user with no write access outside the workspace root. +- Add container or VM isolation around each workspace (Docker, nsjail, etc.). +- Restrict network access from the workspace — the agent should not be able to call arbitrary external APIs. +- Narrow available tools to the minimum needed for the workflow. + +Treat harness hardening as part of the core safety model, not an afterthought. Tracker data, repository contents, and tool arguments are not unconditionally trustworthy even when they originate inside a normal workflow. 
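The name sanitization and the prefix assertion described in this section can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, Unix-style paths are assumed, and `isContained` expects both arguments to have already been through the segment-by-segment canonicalization described earlier (it does not resolve symlinks itself).

```go
package main

import (
	"path/filepath"
	"regexp"
	"strings"
)

// unsafeNameChars matches every character outside the allowed set.
var unsafeNameChars = regexp.MustCompile(`[^a-zA-Z0-9._-]`)

// sanitizeWorkspaceName replaces each disallowed character with "_".
// Note that a name consisting only of dots (such as "..") passes through
// unchanged, so the containment check below is still required.
func sanitizeWorkspaceName(id string) string {
	return unsafeNameChars.ReplaceAllString(id, "_")
}

// workspacePath joins the root with the sanitized unit identifier.
func workspacePath(root, unitID string) string {
	return filepath.Join(root, sanitizeWorkspaceName(unitID))
}

// isContained asserts that the canonical workspace path has the canonical
// root plus "/" as a prefix. The trailing slash prevents "/ws-evil" from
// matching a root of "/ws".
func isContained(canonicalRoot, canonicalWorkspace string) bool {
	prefix := strings.TrimSuffix(canonicalRoot, "/") + "/"
	return strings.HasPrefix(canonicalWorkspace, prefix)
}
```

In this sketch an identifier such as `m1/s2` maps to a single directory name, and an identifier of `..` is rejected by `isContained` after the join collapses it to the parent of the root.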
+ +## What not to build in the harness + +- **Git operations** — the harness emits `PostUnit` hooks; a git service subscribes and acts. Git logic is not in the harness. +- **Planning logic** — milestone/slice/task management is above the harness. The harness knows about phases and units, not their content. +- **TUI rendering** — the harness emits pubsub events. The TUI subscribes. The harness never formats output. +- **Model selection** — the harness tracks cost and surfaces budget signals. Model routing is a separate concern that reads those signals. +- **Provider errors** — the harness catches them at the boundary and translates to typed errors. Provider-specific retry logic lives in catwalk, not here. diff --git a/migrate.md b/migrate.md new file mode 100644 index 0000000000..0edef0576c --- /dev/null +++ b/migrate.md @@ -0,0 +1,633 @@ +# singularity-crush — Migration Notes + +> **Status: superseded research notes.** This document was the working draft that SPEC.md was synthesised from. **For implementation, follow [SPEC.md](./SPEC.md), which is the authoritative specification.** +> +> Specifically out of date in this document: +> - **Knowledge layer storage** (the `## Memory and knowledge` section): the long sqlite-vec + FTS5 + RRF + reranker pipeline does not apply. SPEC.md uses Hindsight as the sole knowledge backend; sf does not call any embedding or reranker endpoint directly. The `memories` SQLite schema, `F32_BLOB(2560)`, `vector_top_k()`, `Qwen3-Reranker-0.6B/4B` tier discussion, and the `KNOWLEDGE.md` replacement table are all stale. +> - References to `hermes-memory` and `hermes_memory_*` tools are stale. Use the Hindsight client interface defined in SPEC.md § 16.1.1. +> - The "Implementation conformance checklist" near the bottom predates SPEC.md § 26's 90+ numbered items with [REQUIRED] / [STRONG] / [OPTIONAL] tags. +> +> The rest (Crush feature inventory, what to build vs. 
inherit, plugin extension points, Vault secret management, charmbracelet/x package picks, skills, /sf revert protocol, dispatch scheduling) is broadly aligned with SPEC.md but with less precision. Use SPEC.md for canonical wording. + +--- + +## What this is + +**singularity-crush is Crush on autopilot.** + +Crush is an interactive coding agent — you drive it turn by turn. singularity-crush adds the autopilot layer on top: when pointed at an existing codebase it maps it, builds the harness, and populates the knowledge store — then drives itself through research → plan → execute → verify → complete without human intervention per unit. + +Concretely, `sf init` in an existing project: +1. Indexes the codebase into Hindsight under the project bank (§ 16.3 of SPEC.md) +2. Extracts initial patterns and conventions into the memory store +3. Sets up `.sf/config.toml` with project-specific harness config +4. Establishes the session contract and runs doctor checks + +Then `sf auto` takes over — the harness drives phase transitions, the supervisor watches for problems, the knowledge layer injects relevant context before each unit, and the model router picks the right model for each phase. The human watches or steers; the agent executes. + +This is a fork of [charmbracelet/crush](https://github.com/charmbracelet/crush) as the foundation for porting [singularity-foundry](../singularity-foundry) (SF) from TypeScript to Go. SF is a mature AI coding agent orchestration CLI (v2.75+, ~254k lines of TypeScript). Rather than porting line-by-line, the plan is to build SF's orchestration layer on top of what Crush already provides — Crush is the interactive agent, singularity-crush is the autopilot. 
+ +## Why + +- Node.js startup latency makes SF feel slow — Go binary is near-instant +- Single binary, no `node_modules`, trivially Nix-packageable +- Crush already implements the agent loop, tool execution, multi-provider LLM, MCP — that's the hard part +- Works natively in Termux, any platform Go runs on + +## What Crush already provides (don't rebuild) + +- Agent loop via `charm.land/fantasy` +- Multi-provider LLM: Anthropic, OpenAI, Gemini, Groq, Bedrock, Azure, Ollama, etc. via `charm.land/catwalk` +- MCP client support (`modelcontextprotocol/go-sdk`) +- LSP integration for code intelligence +- SQLite state via `ncruces/go-sqlite3` +- Bubbletea TUI, Lipgloss styling, full charmbracelet ecosystem +- Tool execution (bash, file read/write, grep, web search, sourcegraph) + +## What SF adds that needs to be built + +### Core (build first) + +1. **Planning system** — milestones → slices → tasks hierarchy, stored in SQLite +2. **Phase dispatch** — research → plan → execute → complete → reassess loop (`/sf auto`, `/sf dispatch`) +3. **Git/worktree management** — branch-per-slice, clean PR branches, worktree isolation +4. **Session state** — auto-loop persistence, crash recovery, interrupted session resume +5. **Step mode** — `/sf next` for manual stepping through the execution loop + +### Important (build second) + +6. **Workflow templates** — bugfix, feature, hotfix, spike, refactor, security-audit, dep-upgrade +7. **Parallel orchestration** — parallel slice execution, merge, conflict detection +8. **Knowledge base** — replaced by Hivemind memory layer (see below) +9. ~~**Skill system**~~ — **already done, free from Crush** (see below) +10. **Ship** — PR creation from milestone artifacts + +### Persistent agents + inter-agent messaging (build third) + +11. **Persistent agent identity** — named agents with stable IDs, system prompts, memory blocks, and message history in SQLite. 
An agent wakes when its inbox has undelivered messages, runs until idle, then sleeps at zero cost. See `harness.md` § 15. +12. **Memory blocks** — labeled, character-limited blocks (`persona`, `human`, `task`, custom) rendered as XML into the system prompt at dispatch. Two built-in tools: `core_memory_append(label, content)` and `core_memory_replace(label, old, new)`. Block writes commit to SQLite mid-conversation — crash-safe. See `harness.md` § 15. +13. **Inter-agent messaging** — `send_message(to, message)` tool routes a message to a named agent's inbox. Target agent wakes, processes it as a `user` role message, and can reply. `wait_for_reply()` lets the caller block until the reply arrives (with a timeout). Supervisor monitors all running agents independently. See `harness.md` § 16. + +### Nice-to-have (build later or drop) + +- `/sf visualize` — rebuild lighter with Bubbletea (10-tab TUI is overkill) +- `/sf cmux` — drop entirely (tmux integration, not needed) +- `/sf remote` — Slack/Discord control, handled by Notifier plugin interface +- `/sf migrate` — one-time v1 migration tool, low priority + +## Prompt template contract + +Every dispatch renders the unit's prompt template with a strict variable checker — unknown variables fail rendering immediately (not silently). Template input variables: + +| Variable | Type | Value | +|---|---|---| +| `unit` | object | Full unit record: id, type, phase, title, description, labels, blockers | +| `attempt` | integer or null | `null` on first dispatch; `1+` on retry or continuation | +| `phase` | string | Current phase name (`execute`, `tdd`, etc.) | +| `session_id` | string | Stable session UUID | + +The `attempt` variable is the key one: the prompt template can give different instructions to a retrying agent vs. a fresh start. A retry prompt might say "your previous attempt failed with: {{last_error}} — focus on that specifically." The harness injects `last_error` automatically on `attempt >= 1`. 
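One way to get the strict variable checking described above is Go's `text/template` with the `missingkey=error` option, sketched below. The template engine is an assumption (the contract does not name one), Go's syntax uses a leading dot (`{{.last_error}}` rather than `{{last_error}}`), and `renderPrompt` and `promptVars` are hypothetical helpers that cover only part of the variable table.

```go
package main

import (
	"strings"
	"text/template"
)

// renderPrompt renders tmpl against vars and fails if the template
// references a key that vars does not contain.
func renderPrompt(tmpl string, vars map[string]any) (string, error) {
	t, err := template.New("prompt").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, vars); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// promptVars builds the template input. attempt is nil on first dispatch;
// last_error is only injected when attempt >= 1.
func promptVars(phase, sessionID string, attempt *int, lastError string) map[string]any {
	vars := map[string]any{
		"phase":      phase,
		"session_id": sessionID,
		"attempt":    nil,
	}
	if attempt != nil {
		vars["attempt"] = *attempt
		if *attempt >= 1 {
			vars["last_error"] = lastError
		}
	}
	return vars
}
```

Under these assumptions, a template that mentions `last_error` unconditionally fails to render on a first dispatch, which surfaces the mistake immediately; guarding it with `{{if .attempt}}` lets one template serve both the fresh start and the retry.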
+ +Continuation turns (same thread, subsequent dispatch after a successful turn) receive a short continuation-guidance prompt, not the full task prompt. The full prompt is already in the thread history — resending it inflates context and confuses the model. + +## HTTP observability API + +The harness exposes a lightweight HTTP server on `localhost` when `server.port` is set in `.sf/config.toml`. It is observability-only — orchestrator correctness never depends on it. + +```toml +[server] +port = 7842 # 0 = ephemeral port for tests +``` + +**`GET /api/v1/state`** — runtime snapshot: + +```json +{ + "generated_at": "2026-04-29T14:22:00Z", + "counts": { "running": 3, "retrying": 1, "queued": 5 }, + "running": [ + { + "unit_id": "execute-task/m1/s2/t3", + "phase": "execute", + "session_id": "sess-abc-turn-4", + "turn_count": 7, + "last_event": "tool_call", + "started_at": "2026-04-29T14:10:00Z", + "tokens": { "input": 18200, "output": 2100, "total": 20300 } + } + ], + "retrying": [ + { + "unit_id": "execute-task/m1/s2/t4", + "attempt": 2, + "due_at": "2026-04-29T14:24:00Z", + "error": "gate: tests failed" + } + ], + "totals": { + "input_tokens": 84000, "output_tokens": 12000, + "cost_usd": 1.24, "seconds_running": 4820 + } +} +``` + +**`GET /api/v1/units/`** — per-unit debug detail including recent events, workspace path, retry count, last error, and log file path. + +**`POST /api/v1/refresh`** — queue an immediate poll + reconciliation cycle (202 Accepted, best-effort coalescing of rapid requests). + +Hot-rebind on port change is not required — restart is acceptable for server config changes. + +## `/sf revert` — git-aware revert + +Four-phase protocol replacing the naive current undo: + +**Phase 1 — target selection:** +- If the user provides a target (`/sf revert m1/s2/t3`), go directly to confirmation. +- Otherwise present the top 3 in-progress units + 3 most recently completed units as a numbered menu. User picks one. 
+ +**Phase 2 — git reconciliation:** +- Find all commits belonging to the target unit from the activity log and git history. +- Handle ghost commits: if a commit SHA is missing (rebase/squash rewrote history), search by commit message prefix rather than SHA. Log which SHAs were resolved vs. not found. +- Find plan-update commits (commits that modified `.sf/` artifact files for this unit). +- Compile final ordered SHA list (most recent first). + +**Phase 3 — confirmation:** +- Display the exact SHA list with descriptions and dates. +- Warn if any SHAs are merge commits (requires `--mainline`). +- Require explicit user confirmation before touching git state. + +**Phase 4 — execution:** +- `git revert --no-edit ` in reverse order (newest first). +- On conflict: surface via `SignalPause`, halt, wait for user to resolve. +- After all reverts: restore `.sf/active/{unit-id}/` artifacts from archive, mark unit as `[ ]` in the plan, remove from `completed-units.json`. + +```go +type RevertTarget struct { + UnitID string + UnitType string + SHAs []string // resolved from activity log + git log + Ghosts []string // SHAs that couldn't be found — log but continue + PlanSHAs []string // plan-update commits to also revert +} +``` + +## Dispatch scheduling + +### Priority ordering + +When multiple units are eligible for dispatch, the harness sorts them: + +1. **Explicit priority** — tasks with `priority: 1` (urgent) before `priority: 4` (low); unset sorts last +2. **Blocker-free first** — tasks with no unresolved upstream blockers before blocked tasks +3. **Phase order** — earlier phases first (Research before Execute) within the same priority +4. **Created-at** — oldest tasks first as tie-breaker + +This order is evaluated fresh on each poll tick — a task that was blocked can become unblocked if its upstream completes between ticks. + +### Blocker-aware dispatch + +A task is not dispatched if any of its upstream dependencies are **non-terminal**. 
Terminal means `PhaseComplete`, `PhaseReassess` (resolved), or explicitly cancelled. A dependency stuck in `PhaseVerify` is non-terminal and blocks downstream dispatch. A dependency that failed and was marked abandoned is terminal and does NOT block downstream. + +The harness checks blockers from the plan's `blocked_by` list before each dispatch. Blocked tasks stay queued and are re-evaluated on the next tick — no exponential backoff, no retry counter increment. This means blocker resolution is near-instant when the upstream unit completes. + +```sql +CREATE TABLE task_blockers ( + task_id TEXT NOT NULL, + blocked_by TEXT NOT NULL, -- task_id of the upstream dependency + PRIMARY KEY (task_id, blocked_by) +); +``` + +### Startup cleanup + +On startup, the harness scans `.sf/active/` for unit artifacts whose tasks are now in terminal states (complete, cancelled, superseded). Stale active artifacts are moved to `.sf/archive/` automatically. This prevents workspace accumulation across restarts. + +## `/sf status` — project health snapshot + +Structured output format (TUI panel + `/sf status` CLI output): + +``` +Project: singularity-foundry +Phase: Execute [m2/s3/t1 — add trace export] +Next: TDD [m2/s3/t1] +Blocker: none + +Milestones: 2 / 5 (40%) +Slices: 7 / 18 (39%) +Tasks: 14 / 42 (33%) + +Session: 4h 12m | $0.83 | claude-sonnet-4-6 +``` + +Blockers surface automatically: a `GateBlocked` event, a `MergeConflict` event, or a supervisor `SignalPause` all write to a `session_blockers` SQLite table. `/sf status` reads this table; the TUI subscribes to the pubsub event directly. + +## Skills vs workflow templates — critical distinction + +**Skills are inspirational.** A `SKILL.md` file is prompt guidance injected into the agent's context. The agent reads it, uses it as reference, and may follow it loosely. Good for: coding conventions, debugging approaches, domain knowledge, tool usage patterns. The agent can deviate if it judges differently. 
+ +**Workflow templates are enforced.** A workflow template (bugfix, feature, hotfix, spike, refactor, security-audit, dep-upgrade) defines the exact phase sequence the harness executes. The harness follows it programmatically — it is not a suggestion to the AI. The AI operates within the template's phase structure; it cannot skip phases or reorder them. + +This means workflow templates are **not skills**. They are structured data (TOML/YAML) the harness loads and enforces. The harness reads the template, constructs the phase state machine from it, and drives execution. The agent has no say in whether the template is followed. + +```toml +# .sf/workflows/bugfix.toml +name = "bugfix" +phases = ["research", "plan", "execute", "tdd", "verify", "review", "merge", "complete"] +max_reassess = 2 +require_tdd = true +require_review = true +max_retries = 3 + +# .sf/workflows/spike.toml +name = "spike" +phases = ["research", "plan", "execute", "complete"] +require_tdd = false +require_review = false +max_retries = 0 +``` + +## Go plugin extension points + +When Crush's plugin system stabilises (issue #2038 — Caddy-style compile-time Go modules), singularity-crush should expose these interfaces as plugin boundaries. Each is a clean interface with at least two real implementations that different users genuinely need. + +### Plugin interfaces + +**`SupervisorCheck`** — already defined in `harness.md`. Custom checks without forking: compliance gates, security scanners, cost limits, organisation-specific policies. + +```go +type SupervisorCheck interface { + Name() string + Check(ctx context.Context, state SupervisorState) SupervisorSignal +} +``` + +**`Shipper`** — PR/MR creation. GitHub is the default; GitLab, Bitbucket, Forgejo, Gitea are real needs. + +```go +type Shipper interface { + Ship(ctx context.Context, opts ShipOptions) (ShipResult, error) +} +``` + +**`VCS`** — version control backend. `git` default, `jj` (Jujutsu) as first alternative (ace already prefers jj). 
+ +```go +type VCS interface { + Commit(ctx context.Context, msg string, files []string) error + Branch(ctx context.Context, name string) error + Push(ctx context.Context, remote, branch string) error + // ... +} +``` + +**`Store`** — storage backend. SQLite for personal use, PostgreSQL for team/shared sessions — same binary, different plugin. + +```go +type Store interface { + SaveSession(ctx context.Context, s Session) error + LoadSession(ctx context.Context, id string) (Session, error) + SaveMemory(ctx context.Context, m Memory) error + SearchMemory(ctx context.Context, q MemoryQuery) ([]Memory, error) + // ... +} +``` + +**`Notifier`** — notification provider. Replaces SF's `/sf remote`. Slack, Discord, webhook — each a plugin. + +```go +type Notifier interface { + Notify(ctx context.Context, event Event) error +} +``` + +### What stays out of plugins + +- Workflow templates — enforced TOML/YAML data, not plugin code +- Skills — SKILL.md files, prompt guidance, not Go code +- Model routing — config + thin Go function + SQLite history (see below) +- Phase transitions — harness-owned, not extensible by design + +## Model routing + +Not a plugin — a pure function `(phase, complexity, history) → tier` with a feedback loop. No external dependencies, no algorithm variation needed. + +### Extended thinking — already in Crush + +Crush upstream has full extended thinking support — no cherry-picking needed: +- `Think` / `reasoning_effort` per model in config +- Anthropic extended thinking with budget allocation and beta header +- Gemini thinking levels +- Expandable thinking UI with sidebar +- Streaming thinking deltas + +The `reasoning` tier (research, plan phases) should have `Think: true` in config. The harness sets this via the model routing decision — the agent doesn't control whether thinking is enabled. 
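A minimal sketch of how the harness, rather than the agent, could decide whether thinking is enabled. It assumes a hypothetical phase-to-tier table and hypothetical names (`Tier`, `Decision`, `Route`); none of these are taken from Crush or pi-mono, and complexity and history inputs are left out for brevity.

```go
package main

import "fmt"

// Tier is a hypothetical name for a routing tier.
type Tier string

const (
	TierFast      Tier = "fast"
	TierStandard  Tier = "standard"
	TierReasoning Tier = "reasoning"
)

// Decision is a hypothetical routing result handed to the agent loop.
type Decision struct {
	Tier  Tier
	Think bool // set by the harness, never by the agent
}

// phaseTier is an illustrative phase-to-tier table (an assumption for this
// sketch); a real harness would load it from config.
var phaseTier = map[string]Tier{
	"research": TierReasoning,
	"plan":     TierReasoning,
	"execute":  TierStandard,
	"verify":   TierFast,
}

// Route maps a phase to its tier and enables thinking only for the
// reasoning tier. Unknown phases are an error rather than a silent default.
func Route(phase string) (Decision, error) {
	tier, ok := phaseTier[phase]
	if !ok {
		return Decision{}, fmt.Errorf("unknown phase %q", phase)
	}
	return Decision{Tier: tier, Think: tier == TierReasoning}, nil
}

func main() {
	for _, p := range []string{"research", "execute", "verify"} {
		d, err := Route(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> tier=%s think=%v\n", p, d.Tier, d.Think)
	}
}
```

Because `Think` is derived from the tier inside `Route`, changing which phases think is a config change to the table, not a change to agent prompts.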
+ +### Three tiers, multiple models each + +Each tier holds multiple candidate models — the router picks within the tier using benchmark scores, not static config: + +```toml +[tiers.fast] +models = ["claude-haiku-4-5", "gemini-flash-2.0", "gpt-4o-mini"] + +[tiers.standard] +models = ["claude-sonnet-4-6", "gpt-4o", "gemini-2.0-pro"] + +[tiers.reasoning] +models = ["claude-opus-4-7", "o3", "gemini-ultra-2.0"] +``` + +### Phase → tier (static, config-driven) + +```toml +[routing] +research = "reasoning" # Think: true +plan = "reasoning" # Think: true +execute = "standard" +tdd = "standard" # writing tests: same tier as coding +verify = "fast" # running gates, reading results: fast is fine +review = "standard" # structured self-review: needs judgment +merge = "fast" # commit message + PR description: fast is fine +complete = "fast" +reassess = "reasoning" # failure diagnosis: needs depth +memory = "fast" +``` + +### Benchmarking — model selection within tier + +Within a tier, the router picks the best model based on benchmark scores stored in SQLite. Benchmarks run against real task samples (not synthetic) and record: + +- **Quality score** — outcome rating (pass/fail/partial, user `/sf rate` signal) +- **Latency** — time to first token + completion time +- **Cost** — tokens × model price +- **Task fingerprint** — phase + complexity bucket + project type + +The benchmark system (inspired by SF's `benchmark-selector.ts` and ace's benchmark daemon) runs periodically and on demand (`/sf benchmark`). Results decay over time — model capabilities change with provider updates. + +```go +type BenchmarkResult struct { + Model string + Tier string + Fingerprint string // phase+complexity+project hash + Quality float64 // 0.0-1.0 + LatencyP50 time.Duration + CostPer1k float64 + SampleCount int + RecordedAt time.Time +} +``` + +Router selection within tier: +``` +score = quality * 0.6 + (1 - normalised_latency) * 0.2 + (1 - normalised_cost) * 0.2 +``` +Weights are configurable. 
Fall back to the tier's first model if no benchmark data exists. + +### Complexity upgrade (dynamic) + +A classifier runs at dispatch time and derives a complexity score from file count, scope breadth, and cross-cutting changes. If the score crosses the threshold, the tier bumps one level. The fingerprint and upgrade decision are stored in SQLite for future routing. + +### `/sf rate` feedback loop + +Two sources depending on mode: + +- **Auto-mode** — LLM self-evaluates at unit close: did the output meet the phase objective? Signals `over/ok/under` based on its own assessment of quality vs cost. No human in the loop. +- **Step/interactive mode** — human signals `over/ok/under` after reviewing the unit output. + +Both write the same benchmark result format with a quality score (`over=0.3`, `ok=0.8`, `under=0.0`; a score of `0.0` effectively blocks the model for this fingerprint). Human ratings carry higher weight than LLM self-ratings, via a configurable multiplier. + +### Why not a plugin + +The routing decision is `(phase, complexity, benchmark_history) → model`. Pure function, no external dependencies, no algorithm variation needed. Config + SQLite + a thin Go scorer. If a user needs different weights or different tiers, they change TOML — they do not recompile. + +## Memory and knowledge — Hindsight (superseded section) + +> **Superseded.** Earlier drafts of this section described a layered local pipeline (sqlite-vec + FTS5 + RRF + cross-encoder reranker via `llm-gateway.centralcloud.com/v1/rerank`, with `Qwen3-Reranker-0.6B` for routine queries and `Qwen3-Reranker-4B` for pre-dispatch context). **That model was dropped during SPEC.md v0.2.** The current architecture: +> +> - **Hindsight is the sole knowledge backend.** sf does not call any embedding endpoint, any reranker endpoint, or run any local vector index. Hindsight handles all retrieval and reranking server-side. +> - **No `memories` SQLite table, no `F32_BLOB`, no `libsql_vector_idx`, no `vector_top_k()`.** SQLite in sf is orchestration-only.
+> - The `local_anti_patterns` mirror is the one exception: a small SQLite table that survives Hindsight outage so anti-patterns still inject into context. +> - On Hindsight unreachable: log warning, dispatch with empty recall (plus local anti-patterns). Do NOT fall back to FTS5 — there is no FTS5. +> +> See SPEC.md §§ 16.1, 16.1.1 (Hindsight client interface — Recall / Retain / Feedback / Validate / Health), 16.4 (anti-patterns), 16.7 (retrieval delegation to Hindsight). +> +> What's still useful from this section's research: anti-pattern semantics, confidence decay formula, two-bank pattern (project/global). Those concepts survived; the local-pipeline implementation didn't. + +## SSH / Wishlist integration (bonus) + +Since this is Go and we're on Headscale: + +- Wrap with `tailscale.com/tsnet` to expose singularity-crush as a tailnet node +- Use `github.com/charmbracelet/wish` to serve it as an SSH app +- Use `github.com/charmbracelet/wishlist` for cross-node session directory + +This gives browser + Termux + ET access to the agent from anywhere on the tailnet without any extra ingress. + +## Secret management — Vault instead of env vars + +Crush stores API keys as plaintext in `~/.local/share/crush/crush.json` or resolves `$ENV_VAR`. Both are unacceptable. singularity-crush will use Vault (already running at `vault.hugo.dk` in k3s) as the sole source of secrets. 
+ +### Config syntax + +Replace `api_key: "$ENV_VAR"` with a `vault://` URI scheme: + +```json +{ + "providers": { + "anthropic": { "api_key": "vault://secret/singularity-crush#anthropic_api_key" }, + "openai": { "api_key": "vault://secret/singularity-crush#openai_api_key" } + } +} +``` + +### Implementation + +Extend `internal/config/resolve.go` — add a `VaultResolver` alongside the existing `ShellVariableResolver`: + +```go +type VaultResolver struct { + client *vault.Client // github.com/hashicorp/vault/api +} + +func (r *VaultResolver) Resolve(uri string) (string, error) { + // parse vault://path#field + // client.KVv2(mount).Get(ctx, path) → secret.Data["field"] +} +``` + +Auth chain (first that succeeds): +1. `VAULT_TOKEN` env var (CI/ephemeral) +2. `~/.vault-token` file (local dev, written by `vault login`) +3. AppRole via `VAULT_ROLE_ID` + `VAULT_SECRET_ID` (production/automated) + +Secrets are fetched **once at startup** and held in memory for the session lifetime — never written to disk, never logged (Crush's existing redaction in `crush_logs.go` already covers `api_key` fields). + +### Stopgap (works today without code changes) + +Crush already supports `$(command)` substitution: +```json +{ "api_key": "$(vault kv get -field=anthropic_api_key secret/singularity-crush)" } +``` +Use this until the native resolver is built. + +### What not to do + +- Do not use External Secrets → k8s Secret → env var. That's three hops back to env vars. +- Do not use Vault Agent sidecar for a CLI tool — that's k8s infrastructure, not a local binary. +- Do not store any secret in `crush.json` in plaintext, even temporarily. + +## Module rename + +Change `github.com/charmbracelet/crush` → `github.com/singularity-ng/singularity-crush` in go.mod and all imports. + +## Skill system — already done + +Crush implements the [Agent Skills open standard](https://agentskills.io) in `internal/skills/`. Nothing to build. 
+ +**Built-in skills Crush ships:** +- `crush-config` — configuration help (crush.json, providers, LSP, MCP, hooks) +- `crush-hooks` — hook authoring (PreToolUse, matchers, allow/deny/halt, input rewriting) +- `jq` — built-in JSON processor via gojq + +**Auto-discovered skill paths** (no config needed): +- `.agents/skills/` — **ace-coder's skills already live here, fully compatible** +- `.crush/skills/` +- `.claude/skills/` +- `.cursor/skills/` + +Each skill is a directory with a `SKILL.md` file containing YAML frontmatter (`name`, `description`, `license`, `compatibility`) and markdown instructions. SF's TypeScript skill catalog is replaced entirely by dropping `SKILL.md` files into `.agents/skills/`. + +**SF skills to port as SKILL.md files** (drop into `.agents/skills/singularity-crush/`): +- `sf-auto` — autonomous mode instructions +- `sf-plan` — planning and phase dispatch +- `sf-git` — worktree and branch conventions +- `sf-knowledge` — when and how to store memories +- `sf-ship` — PR creation conventions + +These replace the TypeScript skill catalog, marketplace discovery, and skill manifest system entirely. + +## charmbracelet/x packages to use + +Experimental packages from [charmbracelet/x](https://github.com/charmbracelet/x) — no backward compat guarantees, but worth pulling for singularity-crush. 
+ +**Directly needed:** + +| Package | Use | +|---|---| +| `x/exp/teatest` | Testing Bubbletea TUI components without a real terminal | +| `x/vcr` | HTTP recording/playback — test LLM API calls without hitting real providers | +| `x/xpty` | Cross-platform PTY — worktree shell execution, running commands in isolation | +| `x/gitignore` | Gitignore pattern matching — codebase indexer respects `.gitignore` | +| `x/sshkey` | SSH key parsing — Wish/tsnet SSH integration | + +**Worth evaluating before committing to raw Bubbletea:** + +| Package | Use | +|---|---| +| `x/pony` | Declarative terminal UI markup language — could replace a lot of Bubbletea boilerplate for `/sf visualize` and the status dashboard. Check maturity before deciding. | +| `x/vt` | Virtual terminal emulator — relevant if session replay or terminal recording is wanted | + +`teatest` + `vcr` together form the core test harness: reproducible TUI tests and reproducible LLM interaction tests with no real API or terminal needed. + +## What to keep from Crush unchanged + +- `internal/lsp/` — LSP client, keep as-is +- `internal/db/` — SQLite schema, extend for SF planning tables +- `internal/agent/tools/` — all tools, add SF-specific ones +- `charm.land/fantasy` and `charm.land/catwalk` — do not replace + +## Implementation conformance checklist + +Derived from Symphony SPEC.md § 18.1. Use as definition-of-done for each build phase. 
+ +**Core (must ship):** +- [ ] Workflow path: explicit arg + cwd default (`WORKFLOW.md`) +- [ ] TOML/YAML config loader with typed defaults and `$VAR` resolution +- [ ] Dynamic config watch/reload/re-apply without restart (fsnotify) +- [ ] Polling orchestrator with single-authority in-memory state +- [ ] Issue/task tracker client: candidate fetch + state refresh + terminal fetch +- [ ] Workspace manager: sanitized per-unit paths, creation + reuse +- [ ] Workspace lifecycle hooks: `after_create`, `before_run`, `after_run`, `before_remove` +- [ ] Hook timeout config (default 60s), fatal vs. best-effort semantics per hook type +- [ ] Agent subprocess client: JSON line protocol, session start, turn loop +- [ ] Strict prompt rendering: `unit`, `attempt`, `phase` variables; unknown vars = error +- [ ] Continuation turns: guidance-only prompt on turn > 1, same thread +- [ ] Issue state re-check between turns: break if non-active +- [ ] Exponential retry queue with continuation retry (1s) after normal exit +- [ ] Configurable retry backoff cap (default 5m) +- [ ] Reconciliation: stop runs on terminal/non-active state transitions +- [ ] Workspace cleanup on terminal state (startup sweep + active transition) +- [ ] Blocker-aware dispatch: non-terminal upstream blocks dispatch +- [ ] Priority sort: priority asc → created_at asc → id lexicographic +- [ ] Per-phase concurrency caps (`max_agents_by_phase`) +- [ ] `turn_input_required` → hard failure in auto-mode +- [ ] Unsupported tool call → structured failure result, continue session +- [ ] Structured logs with `unit_id`, `session_id`, `turn_count` context fields +- [ ] Typed error codes (never match on error strings) +- [ ] Startup terminal artifact cleanup + +**Extensions (ship after core):** +- [ ] HTTP observability API (`/api/v1/state`, `/api/v1/units/`, `POST /api/v1/refresh`) +- [ ] SSH worker extension (`worker.ssh_hosts`, remote workspace, per-host concurrency) +- [ ] Durable retry queue across restarts 
(SQLite-backed) +- [ ] `tracker_query` client-side tool (agent reads/updates task state via orchestrator auth) +- [ ] Pluggable tracker adapters beyond Linear (GitHub Issues, Jira, plain SQLite) + +## Effort estimate + +- Foundation (module rename, SF commands scaffolded): 1-2 days +- Core planning + dispatch + git + auto-loop: 3-5 weeks +- Workflow templates + parallel + ship command: 2-3 weeks +- Hindsight knowledge layer (sqlite-vec + FTS5/BM25 + RRF fusion + maturation): 2-3 weeks +- SF skills as SKILL.md files: 3-5 days +- Persistent agents + memory blocks: 1-2 weeks +- Inter-agent messaging (inbox, wake/sleep, wait_for_reply): 1-2 weeks +- Polish, /sf visualize rebuild: 1-2 weeks + +**Total for a working SF-equivalent with persistent agents: ~11-14 weeks** — skill system and agent loop are free from Crush; knowledge layer and agent fleet are net-new capability SF never had. + +## Open design decisions + +**Hindsight deployment** — must be running and accessible on the tailnet before the memory layer works. Deploy in k3s alongside other services before starting singularity-crush development. + +**Two-bank pattern per session** — Hindsight supports arbitrary bank IDs. Two banks, queried separately and merged before each dispatch: + +```go +projectRecall := hindsight.Recall("project/"+projectHash, query) +globalRecall := hindsight.Recall("global/coding", query) +// merge, deduplicate, inject into unit context +``` + +Project bank holds codebase-specific knowledge. Global bank holds cross-project patterns. No native cross-bank federated search in Hindsight — two calls merged in Go is trivial. + +**Compaction = retain + fresh context** — when context hits 80%, dump the session summary into Hindsight (`retain`) and start a fresh context window with a `recall` of the most relevant memories. Better than lossy summarisation — session history becomes searchable memory rather than a truncated digest. 
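The two-bank recall and the retain-then-recall compaction described above can be sketched together. This is an illustration under stated assumptions: the `Hindsight` interface, the `Memory` type, the in-memory `memBank` stand-in, and the `Compact` and `MergeRecall` helpers are all hypothetical and do not reflect the real Hindsight client API. The 80% threshold and the content-hash document ID follow the text.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Memory is a hypothetical recall result; ID stands in for a document_id.
type Memory struct {
	ID   string
	Text string
}

// Hindsight is a hypothetical client interface, not the real Hindsight API.
type Hindsight interface {
	Retain(bank string, m Memory) error
	Recall(bank, query string) ([]Memory, error)
}

// memBank is an in-memory stand-in used only to make the sketch runnable.
type memBank map[string][]Memory

func (b memBank) Retain(bank string, m Memory) error {
	b[bank] = append(b[bank], m)
	return nil
}

func (b memBank) Recall(bank, query string) ([]Memory, error) {
	return b[bank], nil // a real backend would rank results against query
}

// MergeRecall keeps project results ahead of global ones and drops any
// memory whose ID has already been seen.
func MergeRecall(project, global []Memory) []Memory {
	seen := make(map[string]bool)
	var out []Memory
	for _, list := range [][]Memory{project, global} {
		for _, m := range list {
			if !seen[m.ID] {
				seen[m.ID] = true
				out = append(out, m)
			}
		}
	}
	return out
}

// Compact retains the session summary once usage reaches 80% of the window
// and returns merged memories for a fresh context. Below the threshold it
// does nothing and reports false.
func Compact(h Hindsight, projectBank, summary, query string, used, limit int) ([]Memory, bool, error) {
	if used*100 < limit*80 {
		return nil, false, nil
	}
	sum := sha256.Sum256([]byte(summary)) // content hash as the upsert key
	if err := h.Retain(projectBank, Memory{ID: hex.EncodeToString(sum[:]), Text: summary}); err != nil {
		return nil, false, err
	}
	project, err := h.Recall(projectBank, query)
	if err != nil {
		return nil, false, err
	}
	global, err := h.Recall("global/coding", query)
	if err != nil {
		return nil, false, err
	}
	return MergeRecall(project, global), true, nil
}

func main() {
	h := memBank{"global/coding": {{ID: "g1", Text: "prefer table-driven tests"}}}
	mem, compacted, err := Compact(h, "project/abc", "session summary", "oauth", 85, 100)
	fmt.Println(len(mem), compacted, err)
}
```

Deduplicating by ID in the caller mirrors the design choice in the text: with no federated cross-bank search, two calls plus a small merge keeps the client simple.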
+ +**`sf init` — deep analysis is default, not opt-in.** Shallow init defeats the autopilot. Default `sf init` runs full deep analysis: +1. AST-level codebase scan (languages, structure, entry points, dependencies) +2. Git history analysis (active areas, recent changes, contributors) +3. Retain findings into `project/{hash}` Hindsight bank +4. Establish `.sf/config.toml` with detected stack, workflow templates, model routing hints + +`--quick` flag skips the Hindsight indexing for throwaway sessions. Deep is always the default. + +**Parallel workers + Hindsight** — concurrent `retain` calls from parallel slice workers use `document_id` (Hindsight's upsert key) derived from content hash. Duplicate memories from parallel workers silently overwrite rather than accumulate. + +**Nix flake from day one** — `flake.nix` in the repo root from the first commit. Single Go binary + Nix = installable everywhere on the infrastructure in one line, including k3s nodes via the tailnet. + +## Product management via skills (phase 2) + +singularity-crush is a coding autopilot first. PM capabilities come later via the skills ecosystem — no code needed, just the right skill packs dropped into `.agents/skills/`. 
+ +**Ready-made PM skill packs:** +- [anthropics/knowledge-work-plugins](https://github.com/anthropics/knowledge-work-plugins) — official Anthropic PM skills: `competitive-brief`, `metrics-review`, `product-brainstorming`, `roadmap-update`, `sprint-planning`, `stakeholder-update`, `synthesize-research`, `write-spec` +- [deanpeters/Product-Manager-Skills](https://github.com/deanpeters/Product-Manager-Skills) — 47 PM skills, full workflow from discovery to delivery +- [product-on-purpose/pm-skills](https://github.com/product-on-purpose/pm-skills) — 38 skills covering Triple Diamond (discover → define → develop → deliver) + +**Large curated collections:** +- [VoltAgent/awesome-agent-skills](https://github.com/VoltAgent/awesome-agent-skills) — 1000+ skills +- [SkillsMP](https://skillsmp.com) — 66,500+ skills +- [LobeHub](https://lobehub.com/skills) — 15,000+ skills + +**How it works:** PM skills are inspirational guidance — the agent reads them and uses PM frameworks (roadmaps, sprint planning, specs) on demand. Enforced PM workflows (vision → feature tree → HTDAG) are ace's domain, reached via MCP. singularity-crush sits between: richer than plain Crush with PM skills loaded, lighter than ace for day-to-day coding work. + +**Connection to ace:** When ace is connected via MCP, it replaces the PM skill layer entirely — ace's PM agent owns vision decomposition and task routing, and singularity-crush executes. PM skills are for standalone use without ace. diff --git a/mise.toml b/mise.toml new file mode 100644 index 0000000000..97894c9d99 --- /dev/null +++ b/mise.toml @@ -0,0 +1,2 @@ +[env] +"aqua:openai/codex" = "latest"