diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..6d75a5a --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +.sea/ +.worktrees/ +.DS_Store +**/.DS_Store diff --git a/CHANGELOG.md b/CHANGELOG.md index 7212bbc..71a085f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,47 @@ All notable changes to `software-engineer-agent` are documented here. This project follows [Keep a Changelog](https://keepachangelog.com/) and [Semantic Versioning](https://semver.org/). +## [Unreleased] — v2.1.0 + +### Added +- `_common.md` Rule 7 (Evidence-Bearing Exit Reports): every agent's exit report + must include actual command output, not a paraphrase. +- Step 0 (Demonstrate Comprehension) in `researcher.md`, `planner.md`, `executor.md`: + agents state task understanding in structured `UNDERSTOOD:` format before any tool call. +- `evals/suites/agents/prompt-quality.sh`: structural regression protection for both + additions (Rule 7 presence, Step 0 presence, verifier exclusion). +- Per-task `Allowed paths` / `Forbidden paths` fields in `planner.md` Mode B plan schema. +- Pre-commit scope check (Step 5.5) in `executor.md`: detects out-of-scope files before + committing; emits `STATUS: blocked` with scope-violation reason. +- `evals/fixtures/plans/sample-plan-with-scope.md`: fixture plan demonstrating scope bounds. +- `evals/suites/agents/scope-creep-detection.sh`: structural simulation of scope-violation + detection logic. +- `evals/suites/agents/prompt-quality.sh` extended with scope-bound assertions. +- Per-plan `risk_gates` section in `planner.md` Mode B plan schema with + gate-kind taxonomy (`destructive-git`, `filesystem-destruction`, + `dependency-removal`, `schema-migration`, `unsafe-shell`, + `network-state-mutation`). +- Gate-pause protocol in `executor.md`: new `STATUS: gate` exit, writes + `.sea/phases/phase-N/gate-pending.json`, marks task status `gated` in + `progress.json`, and resumes via "gate resumed" context on re-launch. +- Step 4.5 "Risk gate inspection" and "Resume after gate" branch in + `skills/sea-go/SKILL.md`: surfaces gates for explicit user confirmation + before executor launch and on each `STATUS: gate` return. +- `docs/STATE.md` documents the new `.sea/phases/phase-N/gate-pending.json` + marker (writer, readers, format, invariants). +- `evals/fixtures/plans/sample-plan-with-gates.md`: fixture plan with one + task per gate kind. +- `evals/suites/agents/risk-gate-flow.sh`: structural simulation of the + gate-pending marker round-trip; does not run a real executor. +- `evals/suites/agents/prompt-quality.sh` extended with risk-gate + assertions (planner, executor, sea-go). + +### Pending (Iter 3) +- **Live end-to-end validation required before merge.** Iteration 3 may + not ship without a successful `claude --plugin-dir` run against a + throwaway repo containing one risk gate, confirming executor pauses, + sea-go surfaces the prompt, and the resume path works end-to-end. + ## [2.0.0] — 2026-04-15 v2.0.0 is a disciplined scope cut and state-model consolidation driven diff --git a/agents/_common.md b/agents/_common.md index 6ed3816..80de732 100644 --- a/agents/_common.md +++ b/agents/_common.md @@ -98,3 +98,30 @@ should enforce it at the task boundary too. re-stage, create a new commit. - **Never commit secrets.** If a diff contains an API key, token, credential, or `.env` value, stop and report. + +## 7. Evidence-Bearing Exit Reports + +When you report `STATUS: done`, `STATUS: blocked`, or any claim of +the form "I verified X" / "X works" / "X passes", include the actual +command(s) run and their output, not a paraphrase. + +**Bad:** "Tests pass." +**Good:** `pytest tests/ -v → 47 passed in 2.1s` + +**Bad:** "Build succeeded." +**Good:** `npm run build → Compiled in 3.2s, bundle 142 KiB` + +**Bad:** "Reviewed for security." +**Good:** `grep -rn 'eval\|exec\|innerHTML' src/ → no matches` + +**Bad:** "The migration worked." +**Good:** `cat .sea/state.json | jq .schema_version → 2` + +A claim without the command and its output is an **assertion**; a +claim with them is **verifiable**. The verifier agent treats +unverifiable claims as failures and returns `{ok: false, reason: +"exit report contained claims without evidence: "}`. + +This rule does not replace the Prove-It pattern (`executor.md:73-98`) +for bug fixes. Prove-It is the stricter rule for its specific +trigger; Rule 7 is the baseline rule for every other claim. diff --git a/agents/executor.md b/agents/executor.md index f7f6bcb..5140ec4 100644 --- a/agents/executor.md +++ b/agents/executor.md @@ -24,6 +24,27 @@ color: green You are an execution agent. You receive a plan file and implement it task by task. You are the only agent in this plugin allowed to write code. +## Step 0: Demonstrate Comprehension + +Before your first tool call on this invocation, state what you +understand the task to require. Use this exact format: + +``` +UNDERSTOOD: + - Task: + - Inputs: + - Outputs: + - Boundary: +ASSUMPTIONS: + - + - +``` + +If any element is unclear after re-reading the plan, **STOP** and +surface the specific ambiguity (Rule 2 in `_common.md`). Do not +guess and proceed. This step comes **before** any memory check, file +read, or tool call. + ## Start Here: Check Memory Every invocation, review your own `MEMORY.md` first. Which conventions does this project use? What naming style? Which helper modules exist so you don't duplicate them? Where have you stumbled before? Load that context before touching any file. @@ -34,7 +55,9 @@ Every invocation, review your own `MEMORY.md` first. Which conventions does this 2. **Check progress** — read `.sea/phases/phase-N/progress.json` if it exists. Skip tasks already in `completed_tasks[]` and resume at `current_task`. If absent, start at task 1. 3. **Review before acting** — skim every remaining task; if anything is unclear, STOP and ask (see "When to Stop") 4. **Work one task at a time** — never start task N+1 before task N is committed +4.5. **Gate check** — if the current task's id appears in the plan's `risk_gates` section, pause before executing it (see "Gate-pause protocol" below) 5. **Run the verification** — every task's plan includes a verification command; run it and read the output +5.5. **Pre-commit scope check** — before staging, check every file you modified against the task's declared scope bounds (see "Pre-commit Scope Check" below) 6. **Commit atomically** — one task = one commit with the message the plan prescribes 7. **Persist progress** — after each successful commit, update `.sea/phases/phase-N/progress.json` (see "Progress File") 8. **Update memory** — at the end, record anything that will help future you @@ -65,6 +88,78 @@ jq -n --argjson p "$N" --argjson next "$NEXT" --argjson done "$DONE_JSON_ARRAY" When the phase is fully done, delete the progress.json — the summary.md takes over as the historical record. +## Pre-commit Scope Check + +After completing a task's changes but **before staging and committing**, check your +diff against the task's scope bounds from the plan: + +```bash +CHANGED=$(git diff --name-only HEAD) +``` + +For each file in `CHANGED`: +- It must match at least one glob in the task's `Allowed paths`. +- It must NOT match any glob in the task's `Forbidden paths`. + +If any file fails either check, **STOP** (Rule 5 "Stop-the-Line"). Do not commit. +Emit: + +``` +STATUS: blocked +TASK: +REASON: scope violation — is not in allowed_paths / is in forbidden_paths +TRIED: +NEEDED: either (a) user confirms scope expansion, or (b) revert the out-of-scope + change and continue with only in-scope work +``` + +Do not silently adjust the scope by editing the plan. Scope expansions require +user acknowledgment. + +**Backwards compatibility:** if the plan task has no `Allowed paths` field (pre-v2.1.0 +plan or user-authored plan), emit a one-line warning and skip the check: +`WARNING: plan task N has no allowed_paths — scope check skipped` + +## Gate-pause protocol + +Before starting any task whose id appears in the plan's `risk_gates` section, +**pause** before executing it: + +1. Write `.sea/phases/phase-N/gate-pending.json`: + + ```json + { + "phase": , + "task": , + "kind": "", + "confirmation_prompt": "", + "created": "" + } + ``` + +2. Update `progress.json` to mark task status `gated` (not `completed`, + not `in-progress`). +3. Exit with: + + ``` + STATUS: gate + TASK: + KIND: + PROMPT: + ``` + +4. Do NOT proceed to the next task. Do NOT emit a commit for the gate + task. + +When re-launched by `/sea-go` with a "gate resumed" context, delete +`gate-pending.json`, read `progress.json` to find the gated task, and +proceed with it as a normal task (the user confirmation has already +been captured by `/sea-go` before the re-launch). + +**Backwards compatibility:** if the plan has no `risk_gates` section +(pre-v2.1.0 plan), emit a one-line warning and skip gate checks: +`WARNING: plan has no risk_gates section — gate checks skipped` + ## Commit Format ``` diff --git a/agents/planner.md b/agents/planner.md index 1bfff01..69cec12 100644 --- a/agents/planner.md +++ b/agents/planner.md @@ -23,6 +23,27 @@ color: blue You are a planning agent. Your job is to produce clear, atomic, verifiable plans. You do not write code — you define *what* gets done, *in what order*, and *how it will be verified*. +## Step 0: Demonstrate Comprehension + +Before your first tool call on this invocation, state what you +understand the task to require. Use this exact format: + +``` +UNDERSTOOD: + - Task: + - Inputs: + - Outputs: + - Boundary: +ASSUMPTIONS: + - + - +``` + +If any element is unclear after re-reading the brief, **STOP** and +surface the specific ambiguity (Rule 2 in `_common.md`). Do not +guess and proceed. This step comes **before** any memory check, file +read, or tool call. + ## Start Here: Check Memory Every invocation, read your own `MEMORY.md` first. What phase sizes worked on this project? Where did executor get stuck last time? Which plan patterns the user accepted, which they pushed back on? Past experience shapes the current plan. @@ -87,10 +108,68 @@ trivial | medium | complex 3. ... - **Verification:** - **Commit:** `type(scope): message` +- **Allowed paths:** glob1, glob2 *(files executor may create/edit/delete)* +- **Forbidden paths:** glob3, glob4 *(files executor must NOT touch in this task)* ### Task 2: ... ``` +### Per-task scope bounds + +Every task must declare its filesystem scope explicitly. + +**Allowed paths** are a positive scope: globs the executor may create, edit, or +delete files within. If scope is truly the whole repo (e.g., a lint sweep), write +`**` and document why in the Verification field. + +**Forbidden paths** are explicit guards: globs the executor must NOT touch even +if a task "naturally leads" there. They catch the most common scope-creep +direction for this specific task. + +- Empty `Forbidden paths` is allowed and means "no explicit guards"; prefer listing + at least one high-risk neighbor. +- If a task has no `Allowed paths` entry (pre-v2.1.0 plan), the executor treats + it as unrestricted with a one-line warning. + +### Per-plan risk gates + +Every plan.md must include a `risk_gates` section at the top of the file, +even if empty. A task is a **risk gate** if it contains any of: + +- **Destructive git ops:** `reset --hard`, `branch -D`, `push --force`, + `clean -fd`, tag deletion. +- **Filesystem destruction:** `rm -rf`, `truncate`, or file deletion from + a directory with > 10 commits of history. +- **Dependency removal or major-version downgrade.** +- **Schema migration** (state, database, config file format). +- **Shell commands** that run untrusted input through `eval`, `exec`, or a + subshell. +- **Network operations that modify external state:** API POST/DELETE, + `npm publish`, `gh release create`, `docker push`. + +Emit as: + +```yaml +risk_gates: + - task: 5 + kind: "dependency-removal" + reason: "Removes @legacy/auth; may break any import we haven't caught" + confirmation: "Confirm removal of @legacy/auth. Last used in commit abc123; grep found 3 import sites, all migrated in task 4. Proceed?" + - task: 7 + kind: "schema-migration" + reason: "Runs .sea/state.json migration from v1 to v2" + confirmation: "Confirm state migration. Back up .sea/ first? Migration is one-way." +``` + +Empty gates → write `risk_gates: []`. Empty is an **assertion** that no +gate-triggering task exists in this phase, not an omission. The planner +must read every task's verification and rationale before deciding the +list is empty. + +**Gate kinds (taxonomy):** `destructive-git`, `filesystem-destruction`, +`dependency-removal`, `schema-migration`, `unsafe-shell`, +`network-state-mutation`. + ## Rules - **Atomicity:** each task = **one** commit. If a task won't fit in a single commit, split it. diff --git a/agents/researcher.md b/agents/researcher.md index e09bd97..937e421 100644 --- a/agents/researcher.md +++ b/agents/researcher.md @@ -25,6 +25,28 @@ color: cyan You are a research agent. Your job is to analyze a codebase (or a topic) deeply and report the findings in a concise, actionable form. **You never modify files** — you read, search, and report. +## Step 0: Demonstrate Comprehension + +Before your first tool call on this invocation, state what you +understand the task to require. Use this exact format: + +``` +UNDERSTOOD: + - Task: + - Inputs: + - Outputs: +ASSUMPTIONS: + - + - +``` + +(Researcher is read-only — no Boundary field needed.) + +If any element is unclear after re-reading the brief, **STOP** and +surface the specific ambiguity (Rule 2 in `_common.md`). Do not +guess and proceed. This step comes **before** any memory check, file +read, or tool call. + ## Start Here: Check Memory Every invocation, start by reviewing your own `MEMORY.md`. Read the patterns, tech stack notes, and known gaps you've already recorded for this project. Avoid re-discovering what you already know — focus your report on what's new or changed. diff --git a/docs/STATE.md b/docs/STATE.md index 4d4aebb..5826e3d 100644 --- a/docs/STATE.md +++ b/docs/STATE.md @@ -137,6 +137,25 @@ The inventory table above is the index. The per-file sections below answer four - **Missing:** normal for a fresh phase (nothing started yet) or a completed phase (deleted at phase end). `/sea-go` interprets "missing" as "start at task 1". - **Corrupted:** `/sea-go` and the executor refuse to parse non-JSON and fall back to "start at task 1". Silent data loss risk: if `completed_tasks[]` is lost, the executor re-runs tasks — benign because each task is an atomic, idempotent commit-or-skip, but worth flagging. +### `phases/phase-N/gate-pending.json` (new in v2.1.0) + +- **Writer(s):** `agents/executor.md` when a task whose id appears in the plan's `risk_gates` section is reached. Executor writes this file, marks the task `gated` in `progress.json`, and exits with `STATUS: gate`. +- **Reader(s):** `skills/sea-go/SKILL.md` (Step 5 "Resume after gate" branch) reads the marker to surface the confirmation prompt to the user. Deleted by the executor on the next invocation once "gate resumed" context is passed in. +- **Format:** + ```json + { + "phase": , + "task": , + "kind": "", + "confirmation_prompt": "", + "created": "" + } + ``` +- **Required fields:** all of the above. Missing `kind` or `confirmation_prompt` is treated as a corrupt marker by `/sea-go`, which falls back to re-reading the plan's `risk_gates` section. +- **Invariants:** exists **iff** the executor exited with `STATUS: gate` and has not yet been re-launched with a resume context. Clearing on resume is the executor's responsibility; manual deletion unblocks the phase at the user's risk. +- **Missing:** normal in every phase where no gate has been hit. A missing marker after a `STATUS: gate` exit is an anomaly — `/sea-go` re-reads the plan and re-surfaces the gate from `risk_gates` directly. +- **Corrupted:** `/sea-go` refuses to auto-confirm; surfaces the plan's `risk_gates` entry and asks the user to confirm from the plan text instead of the marker. + ### `phases/phase-N/summary.md` - **Writer(s):** `skills/sea-go/SKILL.md:110` writes the summary when the phase completes successfully. diff --git a/docs/specs/2026-04-15-prompt-quality-patterns.md b/docs/specs/2026-04-15-prompt-quality-patterns.md new file mode 100644 index 0000000..7c78882 --- /dev/null +++ b/docs/specs/2026-04-15-prompt-quality-patterns.md @@ -0,0 +1,802 @@ + + +# Prompt quality patterns — v2.1.0 + +**Status:** Planned (gated on v2.0.0 completion) +**Date:** 2026-04-15 +**Author:** demwick (plan produced in collaboration with Claude Opus 4.6, 1M context) +**Target version:** v2.1.0 (minor bump: feature-additive, backwards compatible) +**Estimated duration:** 2–4 working sessions +**Companion spec:** `docs/specs/2026-04-15-scope-and-state-refactor.md` (v2.0.0 refactor — must complete first) + +--- + +## How to read this spec + +This spec practices the patterns it advocates for. If you are the implementing Claude session, before you touch code: + +1. Read this file end-to-end (not just the iteration you plan to start). +2. State back in 3–4 sentences what each of the four iterations adds to SEA and which files each touches. This is the comprehension check. +3. Do not start any iteration until the user confirms your summary is correct. + +This is not ceremony — it is the same **Demonstrate Comprehension** step that Iteration 1 installs into SEA's own agents. Applying the rule to the spec that introduces the rule is the point. + +--- + +## Context + +This spec is the **second** of two planned refactors for `software-engineer-agent`. The first (`2026-04-15-scope-and-state-refactor.md`) cuts the command surface from 11 to 6 and consolidates the state model. This spec (v2.1.0) is feature-additive: it hardens the prompt-quality layer that governs how SEA's subagents behave. + +### Why this is a separate spec + +The v2.0.0 refactor is **subtractive** (deletes commands, consolidates state). This spec is **additive** (adds rules to `_common.md`, adds schema fields to plan output, adds gate inspection to `/sea-go`). Bundling them would triple the review surface and obscure which changes introduced which behavior. Ship v2.0.0 first. Live in it for a few days. Then decide whether v2.1.0 is still the right priority. + +### Motivation: the gap between taste and mechanism + +A structural review of SEA (conducted 2026-04-15) identified that SEA already has most of the prompt-quality patterns a senior engineer would want: + +- **`agents/_common.md`** is a numbered operating constitution — referenceable rules that override task-specific instructions. +- **`executor.md:73-98`** Prove-It pattern names a specific failure mode (`"obvious bug, skip the repro"`) and mechanically prevents it. +- **`hooks/auto-qa`** is structural enforcement — the test runs whether the model wants it to or not. +- **`disable-model-invocation: true`** on side-effect commands is a consent gate by construction. + +The same review also identified four specific gaps where the plugin's own outputs **could be held to the same discipline it imposes on users**: + +1. **Comprehension is not demonstrated before execution.** When `/sea-go` launches the `executor` agent with a plan, the executor starts editing. There is no step where the executor proves it read the plan correctly before touching any file. Plan misreads are caught a commit later, when the damage is already done. + +2. **Exit reports are claim-format, not evidence-format.** `executor.md:143-149` specifies `STATUS: done / VERIFIED: yes` but does not require the actual command and output. `"tests pass"` is unverifiable. The Prove-It pattern enforces this for bug fixes only; every other task type gets claim-only reports. + +3. **Risk is not asymmetric to scope.** `sea-go/SKILL.md:46-54` picks a pipeline based on task complexity (trivial/medium/complex) but does not pick a gate density based on **risk**. A trivial phase that runs `rm -rf build/` gets the same treatment as a trivial phase that appends a line to a README. Destructive operations need extra confirmation even when they are "small". + +4. **Negative scope bounds are absent from plans.** Plans list what to do; they do not list what **not** to touch. Scope creep is silent because no assertion catches it. An executor that edits a file outside the intended area succeeds as long as its commit passes tests, even though it violated the plan's implicit boundary. + +All four gaps have the same root cause: **prompt-quality discipline lives in SEA's advice to users, not in SEA's internal protocols.** This spec closes that asymmetry. After v2.1.0, SEA eats its own dogfood. + +--- + +## Goals + +1. **Comprehension-before-action.** Every write-critical subagent (`planner`, `researcher`, `executor`) must state its understanding of the task in a structured format before taking its first tool call. Misunderstandings are caught at the cheapest possible point. + +2. **Evidence-bearing exit reports.** Every agent's `STATUS: done` / `STATUS: blocked` report must include the actual command(s) run and their output, not a paraphrase. Unverifiable claims are rejected by the verifier. + +3. **Risk-asymmetric gating.** The planner must tag tasks that trigger destructive operations, schema migrations, or dependency changes as **risk gates**. `/sea-go` inspects gates before execution and surfaces them for user confirmation. Executor pauses at gate tasks and waits for explicit acknowledgment. + +4. **Negative scope bounds.** Every plan task declares `allowed_paths` and `forbidden_paths`. Executor checks its staged diff against those globs before committing. Scope creep becomes a detectable, rejectable event. + +5. **Self-verifying structure.** Every addition in this spec is covered by an eval suite (`evals/suites/agents/prompt-quality.sh`) that structurally asserts its presence. Regression protection is not optional. + +## Non-goals + +- **No change to `_common.md` Rules 1–6.** The existing operating constitution is load-bearing. This spec **adds** Rule 7; it does not rewrite any existing rule. +- **No change to the Prove-It pattern in `executor.md`.** It is correct as-is and covers the bug-fix case better than the generic evidence rule could. Prove-It remains the stricter rule for its specific trigger. +- **No change to `hooks/auto-qa` or any other hook.** Hook layer is stable. This spec operates on the agent prompt layer only. +- **No change to the state schema.** v2.0.0 handled state; v2.1.0 does not touch `state.json`, `roadmap.md`, `progress.json`, or any other state artifact. +- **No change to the subagent → tool allowlist split.** `researcher` and `verifier` stay read-only; `executor` stays the only Write-permission agent. +- **No change to the side-effect command disable.** `init`, `go`, `quick` remain user-invocable-only. +- **No new commands, no new agents.** Scope discipline from v2.0.0 is honored. v2.1.0 adds **rules and schema**, not surfaces. +- **No change to the license or dependency set.** Still AGPL-3.0-or-later, still only bash + jq + git. +- **Iteration 4 (auto-inject `_common.md`) is gated on Claude Code platform API support.** If the API does not support pre-subagent-launch injection, Iteration 4 is dropped from v2.1.0 and re-evaluated for a later version. + +## Constraints + +- Every iteration ends with `bash evals/run.sh` green **and** `bash tests/run-tests.sh` green. +- Every iteration is one PR targeting `main`. Do not chain iterations. +- Every commit is atomic and follows conventional-commit format (`CLAUDE.md:85`). +- AGPL-3.0 header on every new `.md` or `.sh` file (`CLAUDE.md:36`). +- No `--no-verify`, no `rm -rf`, no `git push --force`, no `git reset --hard` with uncommitted work (`_common.md:93-94`). +- v2.0.0 must be merged and tagged before Iteration 1 begins. If v2.0.0 has not shipped yet, **stop and finish it first** — this spec builds on the post-refactor filesystem state (4 agents, 6 skills). +- Each iteration's PR description must include the spec section it implements (link to file + line anchor). + +--- + +## Iteration breakdown + +Four iterations, each a standalone PR. Iterations are **ordered by cost/value ratio**, not dependency — you can technically ship Iteration 1 alone and stop. Later iterations build on earlier ones (Iteration 3 assumes Iteration 2's plan schema fields exist) so out-of-order shipping requires schema coordination. + +### Iteration 1 — Comprehension + Evidence (cheapest, highest leverage) + +**Branch:** `feat/prompt-quality-comprehension-evidence` + +**Goal:** Install **Demonstrate Comprehension** as a pre-action step in every write-critical agent, and install **Evidence-Bearing Exit Reports** as a universal rule in `_common.md`. + +**Files touched:** +- `agents/_common.md` (add Rule 7) +- `agents/researcher.md` (add Step 0) +- `agents/planner.md` (add Step 0) +- `agents/executor.md` (add Step 0 before existing Step 1) +- `evals/suites/agents/prompt-quality.sh` (new file) +- `CHANGELOG.md` (unreleased section) + +**Files NOT touched:** +- `agents/verifier.md` — already narrow enough; adding ceremony here is overkill. +- `agents/_common.md` Rules 1–6 — unchanged. +- Any skill file. +- Any hook. + +**Steps:** + +1. **Add Rule 7 to `_common.md`.** Insert after Rule 6 (Commit Discipline): + + ```markdown + ## 7. Evidence-Bearing Exit Reports + + When you report `STATUS: done`, `STATUS: blocked`, or any claim of + the form "I verified X" / "X works" / "X passes", include the actual + command(s) run and their output, not a paraphrase. + + **Bad:** "Tests pass." + **Good:** `pytest tests/ -v → 47 passed in 2.1s` + + **Bad:** "Build succeeded." + **Good:** `npm run build → Compiled in 3.2s, bundle 142 KiB` + + **Bad:** "Reviewed for security." + **Good:** `grep -rn 'eval\|exec\|innerHTML' src/ → no matches` + + **Bad:** "The migration worked." + **Good:** `cat .sea/state.json | jq .schema_version → 2` + + A claim without the command and its output is an **assertion**; a + claim with them is **verifiable**. The verifier agent treats + unverifiable claims as failures and returns `{ok: false, reason: + "exit report contained claims without evidence: "}`. + + This rule does not replace the Prove-It pattern (`executor.md:73-98`) + for bug fixes. Prove-It is the stricter rule for its specific + trigger; Rule 7 is the baseline rule for every other claim. + ``` + +2. **Add Step 0 to `researcher.md`, `planner.md`, `executor.md`.** Insert as the first step of each agent's workflow section: + + ```markdown + ## Step 0: Demonstrate Comprehension + + Before your first tool call on this invocation, state what you + understand the task to require. Use this exact format so the + calling skill (and the verifier) can parse it: + + ``` + UNDERSTOOD: + - Task: + - Inputs: + - Outputs: + - Boundary: + ASSUMPTIONS: + - + - + ``` + + If any element is unclear after re-reading the plan / brief, + **STOP** and surface the specific ambiguity (see Rule 2 "Manage + Confusion Actively" in `_common.md`). Do not guess and proceed. + + This step comes **before** any memory check, file read, or tool + call. A 30-second comprehension statement catches 80% of plan + misreads at the cheapest point in the flow. + ``` + + The exact wording may vary per agent (researcher has no `Boundary` + field because it is read-only; adapt the block for each agent). + +3. **Adjust existing Step 1 labels** in each agent to `Step 1` if they + are currently "Start here" or equivalent. Maintain numbered order. + +4. **Create `evals/suites/agents/prompt-quality.sh`** — a new eval + suite with structural assertions: + + ```bash + #!/usr/bin/env bash + # Asserts prompt-quality patterns are installed in agent files. + # SPDX-License-Identifier: AGPL-3.0-or-later + set -euo pipefail + + REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" + cd "$REPO_ROOT" + + fail() { printf 'FAIL: %s\n' "$1" >&2; exit 1; } + + # Rule 7 present in _common.md + grep -q 'Evidence-Bearing Exit Reports' agents/_common.md \ + || fail "_common.md missing Rule 7 (Evidence-Bearing Exit Reports)" + + # Each write-critical agent has Step 0 comprehension check + for agent in researcher planner executor; do + grep -q 'Demonstrate Comprehension' "agents/${agent}.md" \ + || fail "${agent}.md missing Step 0 (Demonstrate Comprehension)" + grep -q 'UNDERSTOOD:' "agents/${agent}.md" \ + || fail "${agent}.md missing UNDERSTOOD: output format" + done + + # verifier.md intentionally skipped — document the exclusion + if grep -q 'Demonstrate Comprehension' agents/verifier.md; then + fail "verifier.md should NOT have Step 0 — document the exclusion" + fi + + echo "prompt-quality.sh: all checks passed" + ``` + +5. **Update `CHANGELOG.md`** (`Unreleased` section): + + ```markdown + ## Unreleased + + ### Added + - `_common.md` Rule 7 (Evidence-Bearing Exit Reports): every agent's + exit report must include actual command output, not paraphrase. + - Step 0 (Demonstrate Comprehension) in `researcher.md`, `planner.md`, + `executor.md`: agents state task understanding before any tool call. + - `evals/suites/agents/prompt-quality.sh`: structural regression + protection for both additions. + ``` + +**Commit plan:** +- `feat(agents): add Rule 7 (Evidence-Bearing Exit Reports) to _common.md` +- `feat(agents): add Step 0 (Demonstrate Comprehension) to researcher` +- `feat(agents): add Step 0 (Demonstrate Comprehension) to planner` +- `feat(agents): add Step 0 (Demonstrate Comprehension) to executor` +- `test(evals): add prompt-quality.sh structural assertions` +- `docs(changelog): log v2.1.0 Iteration 1 additions` + +**Exit criteria (each must be demonstrated with a command and its output):** +- `grep -c 'Evidence-Bearing Exit Reports' agents/_common.md` → `1` +- `grep -c 'Demonstrate Comprehension' agents/{researcher,planner,executor}.md` → `3` +- `bash evals/run.sh` → all suites pass (new suite included) +- `bash tests/run-tests.sh` → all pass +- A dry-run `/sea-quick` against a throwaway repo produces an executor run whose first output contains the `UNDERSTOOD:` block (live validation — skip if you can't easily set up a throwaway repo, but log the skip in the PR description) + +**PR title:** `feat(agents): comprehension checks and evidence-bearing exits (v2.1.0 Iteration 1)` + +**Rollback:** `git revert` the merge commit. No schema changes, no state changes — rollback is trivial. + +--- + +### Iteration 2 — Negative scope bounds + +**Branch:** `feat/prompt-quality-scope-bounds` + +**Goal:** Every plan task declares `allowed_paths` and `forbidden_paths`. Executor checks its staged diff against these globs before each commit. Scope creep becomes detectable. + +**Preconditions:** Iteration 1 merged. + +**Files touched:** +- `agents/planner.md` (plan output schema extended) +- `agents/executor.md` (pre-commit scope check added to workflow) +- `docs/specs/2026-04-14-evaluation-layer-design.md` (only if the plan schema is referenced there — check first) +- `evals/fixtures/plans/` (add a fixture with explicit scope bounds) +- `evals/suites/agents/prompt-quality.sh` (extend with scope-bound assertions) +- `evals/suites/agents/scope-creep-detection.sh` (new suite, simulates scope creep) +- `CHANGELOG.md` (extend Unreleased) + +**Steps:** + +1. **Extend `planner.md` plan output schema.** Add a new subsection under the existing task output format: + + ```markdown + ### Per-task scope bounds + + Each task in the plan must declare its filesystem scope explicitly. + + ```yaml + tasks: + - id: 1 + title: "Add input validation to /api/users POST handler" + verification: "curl -X POST /api/users -d '{}' → 400 with error" + allowed_paths: + - "src/api/users/**" + - "tests/api/users/**" + forbidden_paths: + - "src/api/auth/**" # unrelated area, guard against drift + - "src/database/migrations/**" # no schema changes in this task + ``` + + **Allowed paths** are a positive scope: globs the executor may + create, edit, or delete files within. + + **Forbidden paths** are explicit guards: globs the executor must + NOT touch even if a task "naturally leads" there. They catch the + most common scope-creep direction for this specific task. + + - If a task's scope is truly the whole repo (rare — usually a + lint sweep or a rename), write `allowed_paths: ["**"]` and + document why in the task's verification field. + - Empty `forbidden_paths` is allowed and means "no explicit guards"; + prefer listing at least one high-risk neighbor even so. + ``` + +2. **Extend `executor.md` workflow** with a pre-commit scope check + between "Implement" and "Verify": + + ```markdown + ## Pre-commit scope check (Step 5.5) + + After making changes for a task but **before running the verification + command**, check your staged diff against the task's scope bounds: + + ```bash + STAGED=$(git diff --cached --name-only) + ``` + + For each file in `STAGED`: + - It must match at least one glob in `allowed_paths`. + - It must NOT match any glob in `forbidden_paths`. + + If any file fails either check, **STOP** (`_common.md` Rule 5 + "Stop-the-Line"). Do not commit. Emit: + + ``` + STATUS: blocked + TASK: + REASON: scope violation — is not in allowed_paths / is in forbidden_paths + TRIED: + NEEDED: either (a) user confirms scope expansion, or (b) revert the + out-of-scope change and continue with only in-scope work + ``` + + Do not silently adjust the scope by editing the plan. Scope + expansions require user acknowledgment. This is how SEA prevents + the classic "while I'm already in there" failure mode named in + `_common.md` Rule 4 and in `executor.md` Rules (line 103). + ``` + +3. **Add a fixture plan** under + `evals/fixtures/plans/sample-plan-with-scope.md` demonstrating the + new schema with at least 3 tasks and a realistic mix of allowed / + forbidden paths. + +4. **Add a new eval suite** + `evals/suites/agents/scope-creep-detection.sh` that: + - Creates a temp project with a fake plan (using the fixture). + - Simulates an executor run that edits a file outside `allowed_paths`. + - Asserts that the scope check detects the violation. + - Asserts that the output matches the expected `STATUS: blocked` format. + + This is a **structural simulation**, not a full executor run — it + tests the check logic, not the LLM. Use a bash helper to compute + the diff vs. globs. + +5. **Extend `evals/suites/agents/prompt-quality.sh`:** + + ```bash + # Planner schema includes scope bounds + grep -q 'allowed_paths' agents/planner.md \ + || fail "planner.md missing allowed_paths in plan schema" + grep -q 'forbidden_paths' agents/planner.md \ + || fail "planner.md missing forbidden_paths in plan schema" + + # Executor has pre-commit scope check + grep -q 'Pre-commit scope check' agents/executor.md \ + || fail "executor.md missing pre-commit scope check" + grep -q 'scope violation' agents/executor.md \ + || fail "executor.md missing scope-violation STATUS format" + ``` + +6. **Update `CHANGELOG.md`** Unreleased section. + +**Commit plan:** +- `feat(planner): add allowed_paths / forbidden_paths to plan schema` +- `feat(executor): add pre-commit scope check with scope-violation status` +- `test(evals): add scope-creep-detection suite and sample-plan fixture` +- `test(evals): extend prompt-quality.sh with scope-bound assertions` +- `docs(changelog): log Iteration 2 additions` + +**Exit criteria (with command + output evidence):** +- `grep -c 'allowed_paths' agents/planner.md` → `≥ 2` +- `grep -c 'Pre-commit scope check' agents/executor.md` → `1` +- `bash evals/suites/agents/scope-creep-detection.sh` → passes +- `bash evals/run.sh` → all green +- `bash tests/run-tests.sh` → all green + +**PR title:** `feat(agents): negative scope bounds with pre-commit scope check (v2.1.0 Iteration 2)` + +**Rollback:** `git revert`. No schema migration because plans are per-project and per-phase artifacts; pre-existing plans without `allowed_paths` continue to work (the executor treats missing `allowed_paths` as "all paths allowed" with a warning). + +**Backwards compatibility note:** the executor must gracefully handle plans that do not have `allowed_paths` (pre-v2.1.0 plans and user-authored plans). Emit a one-line warning in that case: `WARNING: plan task N has no allowed_paths — scope check skipped`. Do not fail. + +--- + +### Iteration 3 — Risk gates + +**Branch:** `feat/prompt-quality-risk-gates` + +**Goal:** The planner tags tasks that trigger destructive operations, schema migrations, or dependency changes as **risk gates**. `/sea-go` inspects gates before launching the executor and surfaces them for user confirmation. Executor pauses at gate tasks. + +**This is the most invasive iteration in this spec.** It touches three files across three layers (planner output schema, sea-go orchestration, executor pause protocol) and introduces a new agent exit status (`STATUS: gate`). Before starting: + +- **Read this entire iteration.** +- **Confirm the implementation approach** with the user (the marker-file approach is the default; see "Implementation note" below). +- **Sign off required** before Iteration 3 starts. This is the Iteration 3 version of the Phase 6 sign-off gate in the v2.0.0 spec. + +**Preconditions:** Iterations 1 and 2 merged. + +**Files touched:** +- `agents/planner.md` (plan schema extended with `risk_gates`) +- `agents/executor.md` (new `STATUS: gate` exit + resume-from-gate logic) +- `skills/sea-go/SKILL.md` (new Step 4.5 "Risk gate inspection") +- `evals/fixtures/plans/sample-plan-with-gates.md` (new fixture) +- `evals/suites/agents/risk-gate-flow.sh` (new suite) +- `evals/suites/agents/prompt-quality.sh` (extend) +- `docs/STATE.md` (add `.sea/phases/phase-N/gate-pending.json` to the state inventory — Phase 2 of v2.0.0 created this file; now it has a new schema addition) +- `CHANGELOG.md` + +**Implementation note — how executor "pauses"** + +Claude Code subagents do not support mid-run user interaction. An executor that needs to pause must exit and be re-launched by the calling skill. Two candidate approaches: + +**Approach A: split-phase launch.** `/sea-go` launches executor for tasks before the gate, waits for completion, surfaces the gate, waits for user confirmation, launches a second executor for tasks after the gate. + +**Approach B: gate-pending marker.** Executor runs normally. When it reaches a gate task, it writes `.sea/phases/phase-N/gate-pending.json` with the task ID and the confirmation prompt, then exits with `STATUS: gate`. `/sea-go` reads the marker, surfaces the prompt, waits for user confirmation, and re-launches executor with a "resume from gate" context that the executor reads via the existing `progress.json` logic. + +**Default recommendation: Approach B.** It matches the existing resume-from-interruption pattern in `executor.md` (progress.json), so the executor's state machine already knows how to resume from a stopped point. Approach A duplicates state across two executor invocations. + +**User sign-off required on approach choice** before implementation begins. Log the choice in the journal. + +**Steps (assuming Approach B):** + +1. **Extend `planner.md`** with the `risk_gates` schema: + + ```markdown + ### Per-plan risk gates + + Every plan.md must include a `risk_gates` section at the top of the + file, even if empty. + + A task is a risk gate if it contains any of: + - Destructive git ops: `reset --hard`, `branch -D`, `push --force`, + `clean -fd`, tag deletion. + - Filesystem destruction: `rm -rf`, `truncate`, file deletion from + a directory with > 10 commits of history. + - Dependency removal or major-version downgrade. + - Schema migration (state, database, config file format). + - Shell commands that run untrusted input through `eval`, `exec`, + or a subshell. + - Network operations that modify external state: API POST/DELETE, + `npm publish`, `gh release create`, `docker push`. + + Emit as: + + ```yaml + risk_gates: + - task: 5 + kind: "dependency-removal" + reason: "Removes @legacy/auth; may break any import we haven't caught" + confirmation: "Confirm removal of @legacy/auth. Last used in commit abc123; grep found 3 import sites, all migrated in task 4. Proceed?" + - task: 7 + kind: "schema-migration" + reason: "Runs .sea/state.json migration from v1 to v2" + confirmation: "Confirm state migration. Back up .sea/ first? Migration is one-way." + ``` + + Empty gates → write `risk_gates: []`. Empty is an **assertion** that + no gate-triggering task exists in this phase, not an omission. The + planner must read every task's verification and rationale before + deciding the list is empty. + ``` + +2. **Extend `executor.md`** with new status and pause protocol: + + ```markdown + ## Gate-pause protocol + + Before starting any task whose id appears in the plan's `risk_gates` + section, pause before executing it: + + 1. Write `.sea/phases/phase-N/gate-pending.json`: + ```json + { + "phase": , + "task": , + "kind": "", + "confirmation_prompt": "", + "created": "" + } + ``` + 2. Update `progress.json` to mark task status `gated` (not + `completed`, not `in-progress`). + 3. Exit with: + ``` + STATUS: gate + TASK: + KIND: + PROMPT: + ``` + 4. Do NOT proceed to the next task. Do NOT emit a commit for the + gate task. + + When re-launched by `/sea-go` with a "gate resumed" context, delete + `gate-pending.json`, read `progress.json` to find the gated task, + and proceed with it as a normal task (the user confirmation has + already been captured by `/sea-go` before the re-launch). + ``` + +3. **Extend `sea-go/SKILL.md`** with Step 4.5: + + ```markdown + ## Step 4.5: Risk gate inspection + + After reading the plan (Step 4 complexity pipeline selection) and + BEFORE launching the executor (Step 5): + + 1. Read the plan's `risk_gates` section. + 2. If `risk_gates` is missing, the plan is pre-v2.1.0. Emit a + warning: "WARNING: plan has no risk_gates section (pre-v2.1.0); + running without gate checks." Proceed to Step 5. + 3. If `risk_gates: []`, the planner asserted no gates. Proceed. + 4. If `risk_gates` contains entries, surface each to the user: + + ``` + Phase N contains risk gate(s): + Gate 1 — task (): + Confirmation prompt: + Gate 2 — task (): + Confirmation prompt: + + Review the gates. Confirm to proceed, or cancel to revise the + plan. The executor will pause at each gate task and request + individual confirmation before running it. + ``` + + 5. Wait for explicit user confirmation. Do not accept ambiguous + responses ("sounds good", "ok"). Require "confirm" or + equivalent specific acknowledgment. On any non-confirmation, + surface the plan path and stop. + + ## Step 6.5: Resume after gate + + If the executor returns `STATUS: gate`: + + 1. Read `.sea/phases/phase-N/gate-pending.json`. + 2. Surface the `confirmation_prompt` to the user, with the gate + kind and reason. + 3. Wait for specific confirmation ("confirm" or explicit task + acknowledgment — no ambiguous responses). + 4. On confirmation: re-launch the executor with resume context + ("Resume from gate at task . User confirmed."). The executor + deletes the marker and proceeds. + 5. On non-confirmation: mark the phase `blocked` in state, leave + the marker in place, and tell the user: "Phase N paused at + task . Delete .sea/phases/phase-N/gate-pending.json to + unblock, or run `/sea-quick 'modify task '` to revise." + + Do not auto-confirm gates. Do not skip the re-launch. The human + loop is the point. + ``` + +4. **Update `docs/STATE.md`** to add `gate-pending.json`: + - Path, writer (executor), readers (sea-go, verifier), format, + required fields, invariants (exists iff executor exited with + `STATUS: gate`; cleared on resume or manual deletion). + +5. **Add fixture plan** `evals/fixtures/plans/sample-plan-with-gates.md` + with at least one task of each gate kind. + +6. **Add eval suite** `evals/suites/agents/risk-gate-flow.sh`: + - Parses the fixture plan's `risk_gates` section. + - Asserts each gate task has a non-empty `confirmation_prompt`. + - Simulates a gate-pending marker and asserts the resume path + reads it correctly. + - Does not run a full executor — tests the state machine only. + +7. **Extend `prompt-quality.sh`** with `risk_gates` presence checks in + `planner.md` and `Step 4.5` presence in `sea-go/SKILL.md`. + +8. **Update `CHANGELOG.md`.** + +**Commit plan:** +- `feat(planner): add risk_gates schema with gate-kind taxonomy` +- `feat(executor): add gate-pause protocol with STATUS: gate and gate-pending marker` +- `feat(sea-go): add Step 4.5 risk gate inspection and Step 6.5 resume-after-gate` +- `docs(state): document .sea/phases/phase-N/gate-pending.json` +- `test(evals): add risk-gate-flow suite and sample-plan-with-gates fixture` +- `test(evals): extend prompt-quality.sh with risk-gate assertions` +- `docs(changelog): log Iteration 3 additions` + +**Exit criteria (command + output evidence):** +- `grep -c 'risk_gates' agents/planner.md` → `≥ 3` +- `grep -c 'STATUS: gate' agents/executor.md` → `≥ 2` +- `grep -c 'Risk gate inspection' skills/sea-go/SKILL.md` → `1` +- `bash evals/suites/agents/risk-gate-flow.sh` → passes +- `bash evals/run.sh` → all green +- `bash tests/run-tests.sh` → all green +- **Live validation (required for Iteration 3):** on a throwaway repo, + run `/sea-go` against a plan containing one risk gate. Observe that + the executor pauses, sea-go surfaces the prompt, and the resume + path works end-to-end. Log the run in the PR description with the + actual output. + +**PR title:** `feat(agents,skills): risk gates with pause-and-confirm protocol (v2.1.0 Iteration 3)` + +**Rollback:** `git revert`. Gate-pending markers in user projects become orphaned; add a note to `docs/migration/v1-to-v2.1.md` about manual cleanup if rollback happens. State schema is unchanged — this iteration adds a new file under `.sea/phases/phase-N/` but does not modify existing files. + +**Live-test gate:** Iteration 3 may NOT ship without a successful live end-to-end run. Structural evals catch 80% of regressions; the remaining 20% (planner emits correct gates, executor actually pauses, sea-go actually resumes) can only be validated against a real Claude Code session. If live validation fails, do not merge — debug and re-run. + +--- + +### Iteration 4 — Auto-inject `_common.md` (exploratory) + +**Branch:** `feat/prompt-quality-auto-inject-common` (only if feasible) + +**Goal:** Remove the manual `"Read _common.md first"` instruction at the top of every agent file and replace it with **runtime injection** — the plugin automatically prepends `_common.md` content to every subagent's prompt at launch time. This upgrades `_common.md` from a referenced document to a structural guarantee. + +**This iteration is exploratory and conditional.** If Claude Code's plugin API does not support pre-subagent-launch content injection, this iteration is **dropped** from v2.1.0 and filed as an upstream feature request. Do not fake it with a hook that rewrites agent files — that creates drift between what the file says and what the runtime loads. + +**Preconditions:** Iterations 1–3 merged. Research on Claude Code plugin API completed (see Step 1). + +**Steps:** + +1. **Research phase — BEFORE any code changes:** + - Read Claude Code's plugin API documentation: what hooks exist + for subagent launch? `PreSubagentLaunch` or similar? Is there + a per-agent prelude mechanism in frontmatter? + - Read `hooks/run-hook.cmd` and `hooks.json` — does the existing + hook infrastructure support a new event type? + - If a native mechanism exists: proceed to Step 2. + - If no native mechanism exists: **stop**, write a short research + note at `docs/specs/2026-04-15-prompt-quality-patterns-research.md` + documenting the API gap, and mark Iteration 4 as "deferred to + upstream API support". File a Claude Code issue or plugin-API + feature request if appropriate. + +2. **If feasible — implementation:** + - Use the supported mechanism to make `_common.md` content load + automatically before each subagent's prompt. + - Remove the `"Read agents/_common.md first"` instruction from + each agent file (researcher, planner, executor, verifier). + - Add a one-line note at the top of each agent file: ``. + - Update `CLAUDE.md:32-34` gotchas if the mechanism has quirks. + +3. **Eval:** + - Extend `prompt-quality.sh` with a check that the manual `"Read + _common.md first"` line is REMOVED from agent files (the new + mechanism replaces it). + - Add a hook smoke test that verifies `_common.md` is loaded on + agent launch (depends on the mechanism used). + +4. **Update `CHANGELOG.md`.** + +**Commit plan (if feasible):** +- `feat(runtime): auto-inject _common.md at subagent launch` +- `refactor(agents): remove manual "Read _common.md first" instruction` +- `test(evals): verify _common.md auto-injection` +- `docs(changelog): log Iteration 4` + +**If infeasible:** +- `docs(specs): add prompt-quality-patterns-research note on plugin API gap` +- No other commits. Iteration 4 becomes a future version's problem. + +**Exit criteria (if feasible):** +- `_common.md` content is observed in a subagent's context without + the agent file manually loading it. +- `grep -rn 'Read agents/_common.md' agents/` → no results. +- `bash evals/run.sh` → all green. + +**PR title (if feasible):** `feat(runtime): auto-inject _common.md at subagent launch (v2.1.0 Iteration 4)` +**PR title (if infeasible):** `docs(specs): research note on prompt-quality Iteration 4 API gap` + +--- + +## Decision points requiring user input + +### 1. Ship v2.1.0 at all, or defer indefinitely? + +**Recommendation: wait 7 days after v2.0.0 tags, then decide.** + +After v2.0.0 ships, use SEA on 2–3 real tasks with just the v2.0.0 feature set. If the comprehension and evidence gaps visibly hurt (e.g. an executor edits the wrong file silently; an exit report claims "tests pass" without output and the user discovers they didn't), v2.1.0 is high priority. If v2.0.0 alone feels sufficient, v2.1.0 can wait a month or a quarter without cost. Feature debt is only debt if it hurts. + +### 2. Which iterations to ship? + +**Recommendation: ship Iterations 1 and 2 as v2.1.0. Defer Iteration 3 to v2.2.0. Treat Iteration 4 as exploratory.** + +Iteration 1 is the cheapest, highest-leverage change — it installs the discipline with minimal mechanism. Iteration 2 is also relatively cheap and catches a common failure (scope creep). Iteration 3 is more invasive and introduces a new agent exit status plus a new state file; shipping it in a separate release lets the Iteration 1+2 patterns stabilize first. Iteration 4 depends on platform API research. + +Alternative: all four iterations in v2.1.0 if the author has a long weekend. Acceptable but riskier. + +### 3. Approach A or Approach B for executor pause (Iteration 3)? + +**Recommendation: Approach B (gate-pending marker).** + +Matches the existing resume-from-`progress.json` pattern; executor state machine already supports resumption. Approach A duplicates state across two launches and creates a cross-invocation consistency problem. + +Override only if the research in Iteration 3's planning surfaces a specific reason Approach B cannot work (e.g., marker files don't survive the Stop hook). + +### 4. When live validation is required + +**Required:** Iteration 3 live end-to-end run on a throwaway repo. + +**Optional but recommended:** Iteration 1 live run showing the `UNDERSTOOD:` block appears in executor output; Iteration 2 live run showing scope violation detection. + +**Not required:** Iteration 4 (it's exploratory; if implemented, live validation is whatever the chosen mechanism provides). + +### 5. Eval vs. live test boundary + +**Rule of thumb:** structural evals (bash grep / state machine simulation) catch regression; live tests catch LLM behavior. Both matter. For v2.1.0: + +- Iterations 1, 2 can rely primarily on structural evals — the LLM behavior change is "follow one more step in the prompt", which is high-confidence. +- Iteration 3 requires live validation because the pause-and-resume flow involves three components (planner, sea-go, executor) and multiple invocations. +- Iteration 4 validation depends on the chosen mechanism. + +--- + +## Rollback strategy + +- Every iteration is a separate PR. Reverting any iteration is `git revert `. +- No state schema changes in Iterations 1 or 2 — rollback is trivial. +- Iteration 3 adds a new file (`gate-pending.json`) but does not modify existing state. Rollback leaves orphan markers in any project that ran an iteration-3 phase; document the manual cleanup in `docs/migration/v2-to-v2.1.md` if it exists, or a short section in the iteration 3 PR description. +- Iteration 4 is exploratory and gated on research. Its rollback is "revert and re-file the research note as the deliverable". + +--- + +## Success criteria (entire v2.1.0 release) + +- [ ] `_common.md` contains Rule 7 (Evidence-Bearing Exit Reports). +- [ ] `researcher.md`, `planner.md`, `executor.md` contain Step 0 (Demonstrate Comprehension) with the `UNDERSTOOD:` output format. +- [ ] `verifier.md` intentionally does NOT contain Step 0 (documented exclusion). +- [ ] `planner.md` plan schema includes `allowed_paths` and `forbidden_paths` per task. +- [ ] `executor.md` workflow includes Pre-commit scope check with scope-violation `STATUS: blocked` format. +- [ ] `evals/suites/agents/prompt-quality.sh` exists and is green. +- [ ] `evals/suites/agents/scope-creep-detection.sh` exists and is green. +- [ ] If Iteration 3 shipped: `planner.md` has `risk_gates` schema; `executor.md` has gate-pause protocol; `sea-go/SKILL.md` has Step 4.5 + Step 6.5; `evals/suites/agents/risk-gate-flow.sh` exists and is green; live end-to-end test logged in the Iteration 3 PR. +- [ ] If Iteration 4 shipped: `_common.md` is auto-injected at subagent launch; manual `"Read _common.md first"` lines removed. +- [ ] If Iteration 4 deferred: research note at `docs/specs/2026-04-15-prompt-quality-patterns-research.md` exists. +- [ ] `.claude-plugin/plugin.json` version bumped to `2.1.0`. +- [ ] `CHANGELOG.md` has v2.1.0 section listing every iteration that shipped. +- [ ] `bash evals/run.sh` green. +- [ ] `bash tests/run-tests.sh` green. +- [ ] `git tag v2.1.0` pushed to origin. +- [ ] GitHub release published with notes. + +--- + +## Starting instructions for the implementing Claude session + +Open a fresh Claude Code session in this repository root (or reuse the session that completed v2.0.0 — your call, but re-read this spec either way): + +```bash +cd /Users/demirel/Projects/software-engineer-agent +claude +``` + +First message: + +> *Türkçe yanıt ver. `docs/specs/2026-04-15-prompt-quality-patterns.md` dosyasını baştan sona oku. Bu repo için v2.1.0 implementation planım — v2.0.0 refactor'ının üzerine kurulur. Başlamadan önce şunu ispatla: dört iterasyonun her birinin ne eklediğini ve hangi dosyalara dokunduğunu 4-6 cümlede özetle. Ayrıca: v2.0.0 henüz merge edilmemişse bunu söyle, bu spec ona gated.* + +Confirm the summary matches the spec. Specifically check that the implementing session correctly identifies: +1. Iteration 1 adds Rule 7 + Step 0 (touching `_common.md`, `researcher.md`, `planner.md`, `executor.md`). +2. Iteration 2 adds scope bounds (touching `planner.md`, `executor.md`, eval fixtures). +3. Iteration 3 adds risk gates (touching `planner.md`, `executor.md`, `sea-go/SKILL.md`, a new state file, evals). +4. Iteration 4 is exploratory and gated on API research. + +If the summary is correct, decide on Decision Point 2 (which iterations to ship in v2.1.0) before starting. Then proceed iteration by iteration with the same discipline as the v2.0.0 refactor: one iteration = one PR = one review. + +**Apply Rule 7 to your own implementation.** When you report completion of any iteration, include the actual command output (`grep -c ...`, `bash evals/...`, `git log --oneline`) not a paraphrase. Eat your own dogfood starting with this spec's execution. + +--- + +## Journal update (for the existing v2.0.0 journal) + +Append a new section to `docs/specs/2026-04-15-scope-and-state-refactor-journal.md` after v2.0.0 ships: + +```markdown +## v2.1.0 prompt-quality patterns (YYYY-MM-DD) + +- Spec: docs/specs/2026-04-15-prompt-quality-patterns.md +- Iterations shipped: +- Iterations deferred: +- Decision Point 2 resolution: +- Decision Point 3 resolution: +- Iteration 4 API research outcome: +- v2.1.0 release URL: +- Live-test observations: +``` + +This journal remains the permanent record across both refactors. + +--- + +*End of spec.* diff --git a/docs/superpowers/plans/2026-04-15-prompt-quality-v2.1.0-iter1-2.md b/docs/superpowers/plans/2026-04-15-prompt-quality-v2.1.0-iter1-2.md new file mode 100644 index 0000000..382b71f --- /dev/null +++ b/docs/superpowers/plans/2026-04-15-prompt-quality-v2.1.0-iter1-2.md @@ -0,0 +1,769 @@ +# Prompt Quality Patterns v2.1.0 — Iterations 1 & 2 + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Install Demonstrate Comprehension (Step 0) into write-critical agents, add Evidence-Bearing Exit Reports (Rule 7) to `_common.md`, and add negative scope bounds (`allowed_paths`/`forbidden_paths`) to the planner + executor. + +**Architecture:** Pure agent-prompt edits — no hooks, no state schema, no new commands. Iteration 1 adds Rule 7 to `_common.md` and Step 0 to three agents. Iteration 2 extends the planner's plan output schema and executor's pre-commit workflow. Both iterations are covered by a new `evals/suites/agents/` eval suite. + +**Tech Stack:** Bash eval suites, Markdown agent files, no build step. Validation: `bash evals/run.sh`. Every file touched has an AGPL-3.0-or-later header. + +--- + +## File Map + +**Iteration 1 — creates/modifies:** +- `agents/_common.md` — add Rule 7 after existing Rule 6 +- `agents/researcher.md` — add Step 0 before "## Start Here: Check Memory" +- `agents/planner.md` — add Step 0 before "## Start Here: Check Memory" +- `agents/executor.md` — add Step 0 before "## Start Here: Check Memory" (before existing Step 1 in Workflow) +- `evals/suites/agents/prompt-quality.sh` — new file (create directory) +- `CHANGELOG.md` — add Unreleased section + +**Iteration 2 — creates/modifies (on top of Iteration 1):** +- `agents/planner.md` — extend Mode B plan output schema with per-task scope bounds +- `agents/executor.md` — add Step 5.5 (Pre-commit scope check) to Workflow +- `evals/fixtures/plans/sample-plan-with-scope.md` — new fixture +- `evals/suites/agents/scope-creep-detection.sh` — new eval suite +- `evals/suites/agents/prompt-quality.sh` — extend with scope-bound assertions +- `CHANGELOG.md` — extend Unreleased section + +--- + +## ITERATION 1 + +### Task 1: Add Rule 7 to `_common.md` + +**Files:** +- Modify: `agents/_common.md` (after line 101 — end of Rule 6) + +- [ ] **Step 1: Read current end of `_common.md` to confirm insertion point** + + Run: `tail -20 agents/_common.md` + Expected: Rule 6 (Commit Discipline) is the last rule, file ends after the `--no-verify` lines. + +- [ ] **Step 2: Append Rule 7 after Rule 6** + + Add this block at the end of `agents/_common.md` (after the last line): + + ```markdown + + ## 7. Evidence-Bearing Exit Reports + + When you report `STATUS: done`, `STATUS: blocked`, or any claim of + the form "I verified X" / "X works" / "X passes", include the actual + command(s) run and their output, not a paraphrase. + + **Bad:** "Tests pass." + **Good:** `pytest tests/ -v → 47 passed in 2.1s` + + **Bad:** "Build succeeded." + **Good:** `npm run build → Compiled in 3.2s, bundle 142 KiB` + + **Bad:** "Reviewed for security." + **Good:** `grep -rn 'eval\|exec\|innerHTML' src/ → no matches` + + **Bad:** "The migration worked." + **Good:** `cat .sea/state.json | jq .schema_version → 2` + + A claim without the command and its output is an **assertion**; a + claim with them is **verifiable**. The verifier agent treats + unverifiable claims as failures and returns `{ok: false, reason: + "exit report contained claims without evidence: "}`. + + This rule does not replace the Prove-It pattern (`executor.md:73-98`) + for bug fixes. Prove-It is the stricter rule for its specific + trigger; Rule 7 is the baseline rule for every other claim. + ``` + +- [ ] **Step 3: Verify Rule 7 is present** + + Run: `grep -c 'Evidence-Bearing Exit Reports' agents/_common.md` + Expected: `1` + +- [ ] **Step 4: Commit** + + ```bash + git add agents/_common.md + git commit -m "feat(agents): add Rule 7 (Evidence-Bearing Exit Reports) to _common.md" + ``` + +--- + +### Task 2: Add Step 0 to `researcher.md` + +**Files:** +- Modify: `agents/researcher.md` (insert before "## Start Here: Check Memory") + +- [ ] **Step 1: Confirm insertion point** + + Run: `grep -n 'Start Here' agents/researcher.md` + Expected: line ~29 — "## Start Here: Check Memory" + +- [ ] **Step 2: Insert Step 0 block** + + Insert this block BEFORE the line `## Start Here: Check Memory` in `agents/researcher.md`: + + ```markdown + ## Step 0: Demonstrate Comprehension + + Before your first tool call on this invocation, state what you + understand the task to require. Use this exact format: + + ``` + UNDERSTOOD: + - Task: + - Inputs: + - Outputs: + ASSUMPTIONS: + - + - + ``` + + (Researcher is read-only — no Boundary field needed.) + + If any element is unclear after re-reading the brief, **STOP** and + surface the specific ambiguity (Rule 2 in `_common.md`). Do not + guess and proceed. This step comes **before** any memory check, file + read, or tool call. + + ``` + +- [ ] **Step 3: Verify** + + Run: `grep -c 'Demonstrate Comprehension' agents/researcher.md && grep -c 'UNDERSTOOD:' agents/researcher.md` + Expected: `1` and `1` + +- [ ] **Step 4: Commit** + + ```bash + git add agents/researcher.md + git commit -m "feat(agents): add Step 0 (Demonstrate Comprehension) to researcher" + ``` + +--- + +### Task 3: Add Step 0 to `planner.md` + +**Files:** +- Modify: `agents/planner.md` (insert before "## Start Here: Check Memory") + +- [ ] **Step 1: Confirm insertion point** + + Run: `grep -n 'Start Here' agents/planner.md` + Expected: line ~29 — "## Start Here: Check Memory" + +- [ ] **Step 2: Insert Step 0 block** + + Insert this block BEFORE the line `## Start Here: Check Memory` in `agents/planner.md`: + + ```markdown + ## Step 0: Demonstrate Comprehension + + Before your first tool call on this invocation, state what you + understand the task to require. Use this exact format: + + ``` + UNDERSTOOD: + - Task: + - Inputs: + - Outputs: + - Boundary: + ASSUMPTIONS: + - + - + ``` + + If any element is unclear after re-reading the brief, **STOP** and + surface the specific ambiguity (Rule 2 in `_common.md`). Do not + guess and proceed. This step comes **before** any memory check, file + read, or tool call. + + ``` + +- [ ] **Step 3: Verify** + + Run: `grep -c 'Demonstrate Comprehension' agents/planner.md && grep -c 'UNDERSTOOD:' agents/planner.md` + Expected: `1` and `1` + +- [ ] **Step 4: Commit** + + ```bash + git add agents/planner.md + git commit -m "feat(agents): add Step 0 (Demonstrate Comprehension) to planner" + ``` + +--- + +### Task 4: Add Step 0 to `executor.md` + +**Files:** +- Modify: `agents/executor.md` (insert before "## Start Here: Check Memory") + +- [ ] **Step 1: Confirm insertion point** + + Run: `grep -n 'Start Here\|## Workflow' agents/executor.md` + Expected: "## Start Here: Check Memory" around line 29, "## Workflow" shortly after + +- [ ] **Step 2: Insert Step 0 block** + + Insert this block BEFORE the line `## Start Here: Check Memory` in `agents/executor.md`: + + ```markdown + ## Step 0: Demonstrate Comprehension + + Before your first tool call on this invocation, state what you + understand the task to require. Use this exact format: + + ``` + UNDERSTOOD: + - Task: + - Inputs: + - Outputs: + - Boundary: + ASSUMPTIONS: + - + - + ``` + + If any element is unclear after re-reading the plan, **STOP** and + surface the specific ambiguity (Rule 2 in `_common.md`). Do not + guess and proceed. This step comes **before** any memory check, file + read, or tool call. + + ``` + +- [ ] **Step 3: Verify** + + Run: `grep -c 'Demonstrate Comprehension' agents/executor.md && grep -c 'UNDERSTOOD:' agents/executor.md` + Expected: `1` and `1` + +- [ ] **Step 4: Commit** + + ```bash + git add agents/executor.md + git commit -m "feat(agents): add Step 0 (Demonstrate Comprehension) to executor" + ``` + +--- + +### Task 5: Create `evals/suites/agents/prompt-quality.sh` + +**Files:** +- Create: `evals/suites/agents/prompt-quality.sh` (new directory + file) + +- [ ] **Step 1: Create directory** + + ```bash + mkdir -p evals/suites/agents + ``` + +- [ ] **Step 2: Write the eval script** + + Create `evals/suites/agents/prompt-quality.sh` with this exact content: + + ```bash + #!/usr/bin/env bash + # Asserts prompt-quality patterns are installed in agent files. + # SPDX-License-Identifier: AGPL-3.0-or-later + set -euo pipefail + + SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" + cd "$REPO_ROOT" + + fail() { printf 'FAIL: %s\n' "$1" >&2; exit 1; } + + # Rule 7 present in _common.md + grep -q 'Evidence-Bearing Exit Reports' agents/_common.md \ + || fail "_common.md missing Rule 7 (Evidence-Bearing Exit Reports)" + + # Each write-critical agent has Step 0 comprehension check + for agent in researcher planner executor; do + grep -q 'Demonstrate Comprehension' "agents/${agent}.md" \ + || fail "${agent}.md missing Step 0 (Demonstrate Comprehension)" + grep -q 'UNDERSTOOD:' "agents/${agent}.md" \ + || fail "${agent}.md missing UNDERSTOOD: output format" + done + + # verifier.md intentionally skipped — no Step 0 by design + if grep -q 'Demonstrate Comprehension' agents/verifier.md 2>/dev/null; then + fail "verifier.md should NOT have Step 0 — it is intentionally excluded" + fi + + echo "prompt-quality.sh: all checks passed" + ``` + +- [ ] **Step 3: Make executable** + + ```bash + chmod +x evals/suites/agents/prompt-quality.sh + ``` + +- [ ] **Step 4: Run the eval to confirm green** + + Run: `bash evals/suites/agents/prompt-quality.sh` + Expected: `prompt-quality.sh: all checks passed` + +- [ ] **Step 5: Run full eval suite** + + Run: `bash evals/run.sh` + Expected: all suites pass (new suite included), `0 failed` + +- [ ] **Step 6: Commit** + + ```bash + git add evals/suites/agents/prompt-quality.sh + git commit -m "test(evals): add prompt-quality.sh structural assertions" + ``` + +--- + +### Task 6: Update `CHANGELOG.md` for Iteration 1 + +**Files:** +- Modify: `CHANGELOG.md` (add Unreleased section before [2.0.0]) + +- [ ] **Step 1: Insert Unreleased block** + + Add after the `# Changelog` header block (before `## [2.0.0]`): + + ```markdown + ## [Unreleased] — v2.1.0 + + ### Added + - `_common.md` Rule 7 (Evidence-Bearing Exit Reports): every agent's exit report + must include actual command output, not a paraphrase. + - Step 0 (Demonstrate Comprehension) in `researcher.md`, `planner.md`, `executor.md`: + agents state task understanding in structured `UNDERSTOOD:` format before any tool call. + - `evals/suites/agents/prompt-quality.sh`: structural regression protection for both + additions (Rule 7 presence, Step 0 presence, verifier exclusion). + + ``` + +- [ ] **Step 2: Verify** + + Run: `grep -c 'Unreleased' CHANGELOG.md` + Expected: `1` + +- [ ] **Step 3: Commit** + + ```bash + git add CHANGELOG.md + git commit -m "docs(changelog): log v2.1.0 Iteration 1 additions" + ``` + +--- + +## ITERATION 2 + +### Task 7: Extend `planner.md` with per-task scope bounds schema + +**Files:** +- Modify: `agents/planner.md` (extend Mode B plan output schema) + +- [ ] **Step 1: Confirm insertion point in Mode B output** + + Run: `grep -n 'Verification\|Commit:' agents/planner.md | head -5` + Expected: lines inside the Mode B plan template showing the task schema ending with `**Commit:**` + +- [ ] **Step 2: Add scope bounds to the plan task template** + + In the Mode B plan output format in `agents/planner.md`, the existing task template ends with: + ``` + - **Commit:** `type(scope): message` + ``` + + Add after `**Commit:**` line in the task template (still inside the fenced code block): + ``` + - **Allowed paths:** glob1, glob2 *(files executor may create/edit/delete)* + - **Forbidden paths:** glob3, glob4 *(files executor must NOT touch in this task)* + ``` + + Then add this explanatory section AFTER the Mode B code block (before the `## Rules` section): + + ```markdown + ### Per-task scope bounds + + Every task must declare its filesystem scope explicitly. + + **Allowed paths** are a positive scope: globs the executor may create, edit, or + delete files within. If scope is truly the whole repo (e.g., a lint sweep), write + `**` and document why in the Verification field. + + **Forbidden paths** are explicit guards: globs the executor must NOT touch even + if a task "naturally leads" there. They catch the most common scope-creep + direction for this specific task. + + - Empty `Forbidden paths` is allowed and means "no explicit guards"; prefer listing + at least one high-risk neighbor. + - If a task has no `Allowed paths` entry (pre-v2.1.0 plan), the executor treats + it as unrestricted with a one-line warning. + ``` + +- [ ] **Step 3: Verify** + + Run: `grep -c 'allowed_paths\|Allowed paths' agents/planner.md` + Expected: `≥ 2` + + Run: `grep -c 'forbidden_paths\|Forbidden paths' agents/planner.md` + Expected: `≥ 2` + +- [ ] **Step 4: Commit** + + ```bash + git add agents/planner.md + git commit -m "feat(planner): add allowed_paths / forbidden_paths to plan schema" + ``` + +--- + +### Task 8: Add pre-commit scope check (Step 5.5) to `executor.md` + +**Files:** +- Modify: `agents/executor.md` (add Step 5.5 between Step 5 "Run the verification" and Step 6 "Commit atomically") + +- [ ] **Step 1: Confirm current workflow numbering** + + Run: `grep -n '^\d\.' agents/executor.md | head -10` + If the workflow steps aren't numbered in that format, run: + Run: `grep -n 'Run the verification\|Commit atomically' agents/executor.md` + Expected: find the two steps the new step should be inserted between. + +- [ ] **Step 2: Insert Step 5.5 in Workflow** + + In the `## Workflow` section, between step 5 ("Run the verification") and step 6 ("Commit atomically"), add: + + After the line `5. **Run the verification** — every task's plan includes a verification command; run it and read the output`, insert: + + ```markdown + 5.5. **Pre-commit scope check** — before staging, check every file you modified against the task's declared scope bounds (see "Pre-commit Scope Check" below) + ``` + + Then add this new section AFTER the `## Progress File` section and BEFORE the `## Commit Format` section: + + ```markdown + ## Pre-commit Scope Check + + After completing a task's changes but **before staging and committing**, check your + diff against the task's scope bounds from the plan: + + ```bash + CHANGED=$(git diff --name-only HEAD) + ``` + + For each file in `CHANGED`: + - It must match at least one glob in the task's `Allowed paths`. + - It must NOT match any glob in the task's `Forbidden paths`. + + If any file fails either check, **STOP** (Rule 5 "Stop-the-Line"). Do not commit. + Emit: + + ``` + STATUS: blocked + TASK: + REASON: scope violation — is not in allowed_paths / is in forbidden_paths + TRIED: + NEEDED: either (a) user confirms scope expansion, or (b) revert the out-of-scope + change and continue with only in-scope work + ``` + + Do not silently adjust the scope by editing the plan. Scope expansions require + user acknowledgment. + + **Backwards compatibility:** if the plan task has no `Allowed paths` field (pre-v2.1.0 + plan or user-authored plan), emit a one-line warning and skip the check: + `WARNING: plan task N has no allowed_paths — scope check skipped` + ``` + +- [ ] **Step 3: Verify** + + Run: `grep -c 'Pre-commit scope check\|Pre-commit Scope Check' agents/executor.md` + Expected: `≥ 2` + + Run: `grep -c 'scope violation' agents/executor.md` + Expected: `≥ 1` + +- [ ] **Step 4: Commit** + + ```bash + git add agents/executor.md + git commit -m "feat(executor): add pre-commit scope check with scope-violation status" + ``` + +--- + +### Task 9: Add fixture plan with scope bounds + +**Files:** +- Create: `evals/fixtures/plans/sample-plan-with-scope.md` + +- [ ] **Step 1: Write the fixture** + + Create `evals/fixtures/plans/sample-plan-with-scope.md`: + + ```markdown + # Phase 2 Plan: input validation + + ## Context + Sample plan demonstrating per-task scope bounds for eval fixtures. + Extends the trivial greet fixture with an API validation layer. + + ## Complexity + medium + + ## Pipeline + - medium → executor + verifier + + ## Tasks + + ### Task 1: add validation to greet endpoint + - **What:** validate non-empty name parameter in greet function + - **Files:** src/greet.py (modified) + - **Steps:** + 1. add guard clause for empty string + 2. raise ValueError with message + - **Verification:** `python -c "from greet import greet; greet('')"` → raises ValueError + - **Commit:** `feat(greet): add input validation` + - **Allowed paths:** src/greet.py, tests/test_greet.py + - **Forbidden paths:** src/auth/**, src/database/** + + ### Task 2: add test for validation + - **What:** write test covering the empty-name error path + - **Files:** tests/test_greet.py (new) + - **Steps:** + 1. write pytest test asserting ValueError on empty string + - **Verification:** `pytest tests/test_greet.py -v` → 1 passed + - **Commit:** `test(greet): add validation test` + - **Allowed paths:** tests/test_greet.py + - **Forbidden paths:** src/**, config/** + + ### Task 3: update README + - **What:** document the new validation behaviour + - **Files:** README.md (modified) + - **Steps:** + 1. add one paragraph under Usage + - **Verification:** `grep -c 'ValueError' README.md` → 1 + - **Commit:** `docs(readme): document greet validation` + - **Allowed paths:** README.md + - **Forbidden paths:** src/**, tests/** + ``` + +- [ ] **Step 2: Verify file exists and has required fields** + + Run: `grep -c 'Allowed paths\|Forbidden paths' evals/fixtures/plans/sample-plan-with-scope.md` + Expected: `6` (2 per task × 3 tasks) + +- [ ] **Step 3: Commit** + + ```bash + git add evals/fixtures/plans/sample-plan-with-scope.md + git commit -m "test(evals): add sample-plan-with-scope fixture" + ``` + +--- + +### Task 10: Create `evals/suites/agents/scope-creep-detection.sh` + +**Files:** +- Create: `evals/suites/agents/scope-creep-detection.sh` + +- [ ] **Step 1: Write the eval script** + + Create `evals/suites/agents/scope-creep-detection.sh`: + + ```bash + #!/usr/bin/env bash + # Structural simulation: verifies scope-violation detection logic. + # Tests the check logic, not an LLM run. + # SPDX-License-Identifier: AGPL-3.0-or-later + set -euo pipefail + + SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" + cd "$REPO_ROOT" + + fail() { printf 'FAIL: %s\n' "$1" >&2; exit 1; } + + # Helper: glob match (bash extglob-free, uses case) + glob_match() { + local file="$1" pattern="$2" + case "$file" in + $pattern) return 0;; + *) return 1;; + esac + } + + # Helper: check a file against allowed_paths list + check_allowed() { + local file="$1" + shift + local patterns=("$@") + for p in "${patterns[@]}"; do + glob_match "$file" "$p" && return 0 + done + return 1 + } + + # Helper: check a file against forbidden_paths list + check_forbidden() { + local file="$1" + shift + local patterns=("$@") + for p in "${patterns[@]}"; do + glob_match "$file" "$p" && return 1 + done + return 0 + } + + # --- Test 1: in-scope file passes --- + allowed=("src/greet.py" "tests/test_greet.py") + forbidden=("src/auth/*" "src/database/*") + + check_allowed "src/greet.py" "${allowed[@]}" \ + || fail "Test 1: src/greet.py should be in allowed_paths" + check_forbidden "src/greet.py" "${forbidden[@]}" \ + || fail "Test 1: src/greet.py should not match forbidden_paths" + + # --- Test 2: forbidden file is detected --- + check_forbidden "src/auth/session.py" "${forbidden[@]}" \ + && fail "Test 2: src/auth/session.py should be detected as forbidden" + + # --- Test 3: out-of-scope file (not in allowed) is detected --- + check_allowed "src/config/settings.py" "${allowed[@]}" \ + && fail "Test 3: src/config/settings.py should NOT be in allowed_paths" + + # --- Test 4: fixture plan has scope fields --- + fixture="$REPO_ROOT/evals/fixtures/plans/sample-plan-with-scope.md" + [[ -f "$fixture" ]] || fail "Test 4: fixture plan missing at $fixture" + grep -q 'Allowed paths' "$fixture" \ + || fail "Test 4: fixture plan missing Allowed paths" + grep -q 'Forbidden paths' "$fixture" \ + || fail "Test 4: fixture plan missing Forbidden paths" + + echo "scope-creep-detection.sh: all checks passed" + ``` + +- [ ] **Step 2: Make executable** + + ```bash + chmod +x evals/suites/agents/scope-creep-detection.sh + ``` + +- [ ] **Step 3: Run the eval** + + Run: `bash evals/suites/agents/scope-creep-detection.sh` + Expected: `scope-creep-detection.sh: all checks passed` + +- [ ] **Step 4: Commit** + + ```bash + git add evals/suites/agents/scope-creep-detection.sh + git commit -m "test(evals): add scope-creep-detection suite" + ``` + +--- + +### Task 11: Extend `prompt-quality.sh` with scope-bound assertions + +**Files:** +- Modify: `evals/suites/agents/prompt-quality.sh` (append before final echo) + +- [ ] **Step 1: Add scope-bound checks** + + In `evals/suites/agents/prompt-quality.sh`, replace the final `echo` line with: + + ```bash + # Planner schema includes scope bounds + grep -q 'Allowed paths\|allowed_paths' agents/planner.md \ + || fail "planner.md missing allowed_paths / Allowed paths in plan schema" + grep -q 'Forbidden paths\|forbidden_paths' agents/planner.md \ + || fail "planner.md missing forbidden_paths / Forbidden paths in plan schema" + + # Executor has pre-commit scope check + grep -q 'Pre-commit Scope Check\|Pre-commit scope check' agents/executor.md \ + || fail "executor.md missing pre-commit scope check" + grep -q 'scope violation' agents/executor.md \ + || fail "executor.md missing scope-violation STATUS format" + + echo "prompt-quality.sh: all checks passed" + ``` + +- [ ] **Step 2: Run extended eval** + + Run: `bash evals/suites/agents/prompt-quality.sh` + Expected: `prompt-quality.sh: all checks passed` + +- [ ] **Step 3: Run full suite** + + Run: `bash evals/run.sh` + Expected: all pass, `0 failed` + +- [ ] **Step 4: Commit** + + ```bash + git add evals/suites/agents/prompt-quality.sh + git commit -m "test(evals): extend prompt-quality.sh with scope-bound assertions" + ``` + +--- + +### Task 12: Update `CHANGELOG.md` for Iteration 2 + +**Files:** +- Modify: `CHANGELOG.md` (extend Unreleased section) + +- [ ] **Step 1: Extend Unreleased section** + + In `CHANGELOG.md`, append to the `## [Unreleased] — v2.1.0` Added block: + + ```markdown + - Per-task `Allowed paths` / `Forbidden paths` fields in `planner.md` Mode B plan schema. + - Pre-commit scope check (Step 5.5) in `executor.md`: detects out-of-scope files before + committing; emits `STATUS: blocked` with scope-violation reason. + - `evals/fixtures/plans/sample-plan-with-scope.md`: fixture plan demonstrating scope bounds. + - `evals/suites/agents/scope-creep-detection.sh`: structural simulation of scope-violation + detection logic. + - `evals/suites/agents/prompt-quality.sh` extended with scope-bound assertions. + ``` + +- [ ] **Step 2: Verify** + + Run: `grep -c 'scope' CHANGELOG.md` + Expected: `≥ 3` + +- [ ] **Step 3: Commit** + + ```bash + git add CHANGELOG.md + git commit -m "docs(changelog): log v2.1.0 Iteration 2 additions" + ``` + +--- + +## Exit Criteria (full Iteration 1 + 2) + +Run each command and verify output: + +```bash +grep -c 'Evidence-Bearing Exit Reports' agents/_common.md +# → 1 + +grep -c 'Demonstrate Comprehension' agents/researcher.md agents/planner.md agents/executor.md +# → 1 per file (3 total) + +grep -c 'allowed_paths\|Allowed paths' agents/planner.md +# → ≥ 2 + +grep -c 'Pre-commit Scope Check\|Pre-commit scope check' agents/executor.md +# → ≥ 2 + +bash evals/suites/agents/prompt-quality.sh +# → prompt-quality.sh: all checks passed + +bash evals/suites/agents/scope-creep-detection.sh +# → scope-creep-detection.sh: all checks passed + +bash evals/run.sh +# → N passed, 0 failed +``` diff --git a/evals/fixtures/plans/sample-plan-with-gates.md b/evals/fixtures/plans/sample-plan-with-gates.md new file mode 100644 index 0000000..0d13170 --- /dev/null +++ b/evals/fixtures/plans/sample-plan-with-gates.md @@ -0,0 +1,104 @@ +# Phase 3 Plan: risky migration + dep cleanup + +## Context +Sample plan exercising the v2.1.0 risk_gates taxonomy. Covers one task +per gate kind for eval fixtures. + +## Complexity +complex + +## Pipeline +- complex → researcher + executor + verifier + +## Requirements + +### R1: migration +- **R1.1** state schema migrates v1 → v2 +- **R1.2** legacy auth package removed + +## risk_gates + +```yaml +risk_gates: + - task: 2 + kind: "schema-migration" + reason: "Runs .sea/state.json migration from v1 to v2" + confirmation: "Confirm state migration. Back up .sea/ first? Migration is one-way." + - task: 3 + kind: "dependency-removal" + reason: "Removes @legacy/auth; may break any import we haven't caught" + confirmation: "Confirm removal of @legacy/auth. grep found 0 remaining imports. Proceed?" + - task: 4 + kind: "destructive-git" + reason: "Force-pushes the rebased branch to origin" + confirmation: "Confirm force-push to origin/feat-migration. Ensure no collaborator has pulled." + - task: 5 + kind: "network-state-mutation" + reason: "Publishes the package to the internal npm registry" + confirmation: "Confirm npm publish. Version bump is final and visible to all consumers." +``` + +## Tasks + +### Task 1: add migration script +- **What:** write scripts/migrate-v1-to-v2.sh +- **Covers:** R1.1 +- **Files:** scripts/migrate-v1-to-v2.sh (new) +- **Steps:** + 1. write migration logic +- **Verification:** `bash -n scripts/migrate-v1-to-v2.sh` +- **Commit:** `feat(scripts): add v1→v2 migration` +- **Allowed paths:** scripts/migrate-v1-to-v2.sh +- **Forbidden paths:** .sea/** + +### Task 2: run migration on sample state +- **What:** execute migration against .sea/state.json +- **Covers:** R1.1 +- **Files:** .sea/state.json (modified) +- **Steps:** + 1. back up state.json + 2. run migration +- **Verification:** `jq .schema_version .sea/state.json` → `2` +- **Commit:** `chore(state): migrate schema v1→v2` +- **Allowed paths:** .sea/state.json +- **Forbidden paths:** src/** + +### Task 3: remove @legacy/auth +- **What:** uninstall legacy auth package +- **Covers:** R1.2 +- **Files:** package.json, package-lock.json (modified) +- **Steps:** + 1. npm uninstall @legacy/auth +- **Verification:** `jq '.dependencies["@legacy/auth"]' package.json` → `null` +- **Commit:** `chore(deps): remove @legacy/auth` +- **Allowed paths:** package.json, package-lock.json +- **Forbidden paths:** src/auth/** + +### Task 4: force-push rebased branch +- **What:** force-push the cleaned branch +- **Covers:** R1.2 +- **Files:** (none — git op only) +- **Steps:** + 1. git push --force-with-lease origin feat-migration +- **Verification:** `git ls-remote origin feat-migration` +- **Commit:** (no commit — push only) +- **Allowed paths:** ** +- **Forbidden paths:** .sea/** + +### Task 5: publish package +- **What:** publish to internal npm registry +- **Covers:** R1.2 +- **Files:** (none — network op) +- **Steps:** + 1. npm publish +- **Verification:** `npm view @sea/plugin version` +- **Commit:** (no commit — publish only) +- **Allowed paths:** ** +- **Forbidden paths:** .sea/** + +## Coverage Matrix + +| Criterion | Task(s) | Check | +|---|---|---| +| R1.1 | T1, T2 | task-coverage | +| R1.2 | T3, T4, T5 | task-coverage | diff --git a/evals/fixtures/plans/sample-plan-with-scope.md b/evals/fixtures/plans/sample-plan-with-scope.md new file mode 100644 index 0000000..0f7c608 --- /dev/null +++ b/evals/fixtures/plans/sample-plan-with-scope.md @@ -0,0 +1,44 @@ +# Phase 2 Plan: input validation + +## Context +Sample plan demonstrating per-task scope bounds for eval fixtures. +Extends the trivial greet fixture with an API validation layer. + +## Complexity +medium + +## Pipeline +- medium → executor + verifier + +## Tasks + +### Task 1: add validation to greet endpoint +- **What:** validate non-empty name parameter in greet function +- **Files:** src/greet.py (modified) +- **Steps:** + 1. add guard clause for empty string + 2. raise ValueError with message +- **Verification:** `python -c "from greet import greet; greet('')"` → raises ValueError +- **Commit:** `feat(greet): add input validation` +- **Allowed paths:** src/greet.py, tests/test_greet.py +- **Forbidden paths:** src/auth/**, src/database/** + +### Task 2: add test for validation +- **What:** write test covering the empty-name error path +- **Files:** tests/test_greet.py (new) +- **Steps:** + 1. write pytest test asserting ValueError on empty string +- **Verification:** `pytest tests/test_greet.py -v` → 1 passed +- **Commit:** `test(greet): add validation test` +- **Allowed paths:** tests/test_greet.py +- **Forbidden paths:** src/**, config/** + +### Task 3: update README +- **What:** document the new validation behaviour +- **Files:** README.md (modified) +- **Steps:** + 1. add one paragraph under Usage +- **Verification:** `grep -c 'ValueError' README.md` → 1 +- **Commit:** `docs(readme): document greet validation` +- **Allowed paths:** README.md +- **Forbidden paths:** src/**, tests/** diff --git a/evals/fixtures/plans/trivial.md b/evals/fixtures/plans/trivial.md new file mode 100644 index 0000000..dadff87 --- /dev/null +++ b/evals/fixtures/plans/trivial.md @@ -0,0 +1,36 @@ +# Phase 1 Plan: trivial example + +## Context +Sample trivial plan for eval fixtures. + +## Complexity +trivial + +## Pipeline +- trivial → executor + +## Requirements + +### R1: greet +Function returns a greeting string. + +- **R1.1** function exists named `greet` +- **R1.2** returns "hello world" + +## Tasks + +### Task 1: implement greet +- **What:** add greet function +- **Covers:** R1.1, R1.2 +- **Files:** src/greet.py (new) +- **Steps:** + 1. write function +- **Verification:** `python -c "from greet import greet; assert greet()=='hello world'"` +- **Commit:** `feat(greet): add greet function` + +## Coverage Matrix + +| Criterion | Task(s) | Check | +|---|---|---| +| R1.1 | T1 | task-coverage | +| R1.2 | T1 | task-coverage | diff --git a/evals/suites/agents/prompt-quality.sh b/evals/suites/agents/prompt-quality.sh new file mode 100755 index 0000000..b98d466 --- /dev/null +++ b/evals/suites/agents/prompt-quality.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash +# Asserts prompt-quality patterns are installed in agent files. +# SPDX-License-Identifier: AGPL-3.0-or-later +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" +cd "$REPO_ROOT" + +fail() { printf 'FAIL: %s\n' "$1" >&2; exit 1; } + +# Rule 7 present in _common.md +grep -q 'Evidence-Bearing Exit Reports' agents/_common.md \ + || fail "_common.md missing Rule 7 (Evidence-Bearing Exit Reports)" + +# Each write-critical agent has Step 0 comprehension check +for agent in researcher planner executor; do + grep -q 'Demonstrate Comprehension' "agents/${agent}.md" \ + || fail "${agent}.md missing Step 0 (Demonstrate Comprehension)" + grep -q 'UNDERSTOOD:' "agents/${agent}.md" \ + || fail "${agent}.md missing UNDERSTOOD: output format" +done + +# verifier.md intentionally skipped — no Step 0 by design +if grep -q 'Demonstrate Comprehension' agents/verifier.md 2>/dev/null; then + fail "verifier.md should NOT have Step 0 — it is intentionally excluded" +fi + +# Planner schema includes scope bounds +grep -q 'Allowed paths\|allowed_paths' agents/planner.md \ + || fail "planner.md missing allowed_paths / Allowed paths in plan schema" +grep -q 'Forbidden paths\|forbidden_paths' agents/planner.md \ + || fail "planner.md missing forbidden_paths / Forbidden paths in plan schema" + +# Executor has pre-commit scope check +grep -q 'Pre-commit Scope Check\|Pre-commit scope check' agents/executor.md \ + || fail "executor.md missing pre-commit scope check" +grep -q 'scope violation' agents/executor.md \ + || fail "executor.md missing scope-violation STATUS format" + +# Planner schema includes risk_gates section (v2.1.0 Iter 3) +grep -q 'risk_gates' agents/planner.md \ + || fail "planner.md missing risk_gates section in plan schema" + +# Executor has gate-pause protocol and STATUS: gate exit (v2.1.0 Iter 3) +grep -q 'Gate-pause protocol' agents/executor.md \ + || fail "executor.md missing Gate-pause protocol section" +grep -q 'STATUS: gate' agents/executor.md \ + || fail "executor.md missing STATUS: gate exit format" + +# sea-go has Step 4.5 risk gate inspection (v2.1.0 Iter 3) +grep -q 'Risk gate inspection' skills/sea-go/SKILL.md \ + || fail "sea-go/SKILL.md missing Step 4.5 'Risk gate inspection'" +grep -q 'Resume after gate' skills/sea-go/SKILL.md \ + || fail "sea-go/SKILL.md missing 'Resume after gate' section" + +echo "prompt-quality.sh: all checks passed" diff --git a/evals/suites/agents/risk-gate-flow.sh b/evals/suites/agents/risk-gate-flow.sh new file mode 100755 index 0000000..f7a73e2 --- /dev/null +++ b/evals/suites/agents/risk-gate-flow.sh @@ -0,0 +1,78 @@ +#!/usr/bin/env bash +# +# software-engineer-agent +# Copyright (C) 2026 demwick +# Licensed under the GNU Affero General Public License v3.0 or later. +# See LICENSE in the repository root for the full license text. +# +# risk-gate-flow.sh — structural simulation of the risk-gate state machine. +# Parses the fixture plan's risk_gates section, asserts each gate carries a +# non-empty confirmation, and simulates the gate-pending marker resume path. +# Does not run a real executor. +# +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" +cd "$REPO_ROOT" + +fail() { printf 'FAIL: %s\n' "$1" >&2; exit 1; } + +FIXTURE="evals/fixtures/plans/sample-plan-with-gates.md" +[[ -f "$FIXTURE" ]] || fail "fixture missing: $FIXTURE" + +# --- Test 1: fixture has a risk_gates section --- +grep -q '^## risk_gates' "$FIXTURE" \ + || fail "fixture missing '## risk_gates' section" + +# --- Test 2: extract the yaml block and assert shape --- +YAML_BLOCK="$(awk '/^```yaml$/{flag=1;next}/^```$/{flag=0}flag' "$FIXTURE")" +[[ -n "$YAML_BLOCK" ]] || fail "fixture risk_gates yaml block is empty" + +# --- Test 3: at least one gate of each expected kind is present --- +for kind in schema-migration dependency-removal destructive-git network-state-mutation; do + grep -q "kind: \"${kind}\"" <<<"$YAML_BLOCK" \ + || fail "fixture missing gate kind: ${kind}" +done + +# --- Test 4: every gate has a non-empty confirmation line --- +while IFS= read -r line; do + # match lines like: confirmation: "..." + value="${line#*confirmation: \"}" + value="${value%\"}" + [[ -n "$value" ]] || fail "gate has empty confirmation string" +done < <(grep -E '^[[:space:]]+confirmation:' <<<"$YAML_BLOCK") + +# --- Test 5: simulate gate-pending marker round-trip --- +TMPDIR="$(mktemp -d)" +trap 'rm -rf "$TMPDIR"' EXIT + +MARKER="$TMPDIR/gate-pending.json" +cat > "$MARKER" <<'JSON' +{ + "phase": 3, + "task": 2, + "kind": "schema-migration", + "confirmation_prompt": "Confirm state migration. Back up .sea/ first?", + "created": "2026-04-15T00:00:00Z" +} +JSON + +# sea-go's resume branch would parse these four fields +PHASE="$(jq -r .phase "$MARKER")" +TASK="$(jq -r .task "$MARKER")" +KIND="$(jq -r .kind "$MARKER")" +PROMPT="$(jq -r .confirmation_prompt "$MARKER")" + +[[ "$PHASE" == "3" ]] || fail "marker phase mismatch" +[[ "$TASK" == "2" ]] || fail "marker task mismatch" +[[ "$KIND" == "schema-migration" ]] || fail "marker kind mismatch" +[[ -n "$PROMPT" ]] || fail "marker confirmation_prompt empty" + +# --- Test 6: corrupt marker (missing confirmation_prompt) is detectable --- +CORRUPT="$TMPDIR/corrupt.json" +echo '{"phase":3,"task":2,"kind":"schema-migration"}' > "$CORRUPT" +BAD_PROMPT="$(jq -r '.confirmation_prompt // empty' "$CORRUPT")" +[[ -z "$BAD_PROMPT" ]] || fail "corrupt marker should surface empty confirmation_prompt" + +echo "risk-gate-flow.sh: all checks passed" diff --git a/evals/suites/agents/scope-creep-detection.sh b/evals/suites/agents/scope-creep-detection.sh new file mode 100755 index 0000000..4f6cabb --- /dev/null +++ b/evals/suites/agents/scope-creep-detection.sh @@ -0,0 +1,69 @@ +#!/usr/bin/env bash +# Structural simulation: verifies scope-violation detection logic. +# Tests the check logic, not an LLM run. +# SPDX-License-Identifier: AGPL-3.0-or-later +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" +cd "$REPO_ROOT" + +fail() { printf 'FAIL: %s\n' "$1" >&2; exit 1; } + +# Helper: glob match using case +glob_match() { + local file="$1" pattern="$2" + case "$file" in + $pattern) return 0;; + *) return 1;; + esac +} + +# Helper: check a file against allowed_paths list +check_allowed() { + local file="$1" + shift + local patterns=("$@") + for p in "${patterns[@]}"; do + glob_match "$file" "$p" && return 0 + done + return 1 +} + +# Helper: check a file against forbidden_paths list (returns 0 if safe, 1 if forbidden) +check_forbidden() { + local file="$1" + shift + local patterns=("$@") + for p in "${patterns[@]}"; do + glob_match "$file" "$p" && return 1 + done + return 0 +} + +# --- Test 1: in-scope file passes --- +allowed=("src/greet.py" "tests/test_greet.py") +forbidden=("src/auth/*" "src/database/*") + +check_allowed "src/greet.py" "${allowed[@]}" \ + || fail "Test 1: src/greet.py should be in allowed_paths" +check_forbidden "src/greet.py" "${forbidden[@]}" \ + || fail "Test 1: src/greet.py should not match forbidden_paths" + +# --- Test 2: forbidden file is detected --- +check_forbidden "src/auth/session.py" "${forbidden[@]}" \ + && fail "Test 2: src/auth/session.py should be detected as forbidden" + +# --- Test 3: out-of-scope file (not in allowed) is detected --- +check_allowed "src/config/settings.py" "${allowed[@]}" \ + && fail "Test 3: src/config/settings.py should NOT be in allowed_paths" + +# --- Test 4: fixture plan has scope fields --- +fixture="$REPO_ROOT/evals/fixtures/plans/sample-plan-with-scope.md" +[[ -f "$fixture" ]] || fail "Test 4: fixture plan missing at $fixture" +grep -q 'Allowed paths' "$fixture" \ + || fail "Test 4: fixture plan missing Allowed paths" +grep -q 'Forbidden paths' "$fixture" \ + || fail "Test 4: fixture plan missing Forbidden paths" + +echo "scope-creep-detection.sh: all checks passed" diff --git a/evals/suites/plan/coverage-matrix-shape.sh b/evals/suites/plan/coverage-matrix-shape.sh new file mode 100644 index 0000000..8b1d9a1 --- /dev/null +++ b/evals/suites/plan/coverage-matrix-shape.sh @@ -0,0 +1,30 @@ +#!/usr/bin/env bash +# Verify check-coverage.sh parses a well-formed plan + progress and emits JSON. +# SPDX-License-Identifier: AGPL-3.0-or-later +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)" +source "$REPO_ROOT/evals/lib/assert.sh" + +PLAN="$REPO_ROOT/evals/fixtures/plans/trivial.md" +PROGRESS=$(mktemp) +trap 'rm -f "$PROGRESS"' EXIT + +cat > "$PROGRESS" <<'JSON' +{ + "phase": 1, + "current_task": 2, + "completed_tasks": [ + {"id": "T1", "commit": "abc1234", "covered": ["R1.1", "R1.2"]} + ], + "last_commit": "abc1234", + "updated": "2026-04-15T00:00:00Z" +} +JSON + +OUT="$(bash "$REPO_ROOT/scripts/check-coverage.sh" "$PLAN" "$PROGRESS")" + +assert_jq "$OUT" '.covered | length' '== 2' "two criteria covered" +assert_jq "$OUT" '.uncovered | length' '== 0' "no uncovered criteria" +assert_jq "$OUT" '.errors | length' '== 0' "no errors" diff --git a/scripts/check-coverage.sh b/scripts/check-coverage.sh new file mode 100755 index 0000000..702970a --- /dev/null +++ b/scripts/check-coverage.sh @@ -0,0 +1,46 @@ +#!/usr/bin/env bash +# +# software-engineer-agent +# Copyright (C) 2026 demwick +# Licensed under the GNU Affero General Public License v3.0 or later. +# See LICENSE in the repository root for the full license text. +# +# check-coverage.sh — compare plan requirements against progress coverage. +# +# Usage: check-coverage.sh +# +# Parses criterion ids (Rx.y) from the plan, reads .completed_tasks[].covered[] +# from progress.json, and emits JSON: +# +# {"covered": [...], "uncovered": [...], "errors": [...]} +# +# - covered: criteria present in both plan and progress +# - uncovered: plan criteria not yet covered by any completed task +# - errors: progress references unknown criteria (not defined in plan) +# +set -euo pipefail + +PLAN="${1:?plan file required}" +PROGRESS="${2:?progress file required}" + +if [[ ! -f "$PLAN" ]]; then + jq -n '{covered: [], uncovered: [], errors: ["plan not found"]}' + exit 0 +fi + +if [[ ! -f "$PROGRESS" ]]; then + jq -n '{covered: [], uncovered: [], errors: ["progress not found"]}' + exit 0 +fi + +ALL_JSON=$(grep -oE 'R[0-9]+\.[0-9]+' "$PLAN" | sort -u | jq -Rn '[inputs]') +COVERED_JSON=$(jq '[.completed_tasks[]?.covered[]?] | unique' "$PROGRESS") + +jq -n \ + --argjson all "$ALL_JSON" \ + --argjson covered "$COVERED_JSON" \ + '{ + covered: ($covered - ($covered - $all)), + uncovered: ($all - $covered), + errors: (($covered - $all) | map("unknown criterion: " + .)) + }' diff --git a/skills/sea-go/SKILL.md b/skills/sea-go/SKILL.md index 201b60c..3109a3f 100644 --- a/skills/sea-go/SKILL.md +++ b/skills/sea-go/SKILL.md @@ -54,6 +54,34 @@ The plan's "Complexity" line tells you which pipeline to run: For **complex** only: before calling executor, launch the `researcher` agent with the plan's context section to gather any deep knowledge the executor will need. Read the report yourself — do not dump it to the user. +## Step 4.5: Risk gate inspection + +After choosing the pipeline and BEFORE launching the executor: + +1. Read the plan's `risk_gates` section. +2. If `risk_gates` is **missing**, the plan is pre-v2.1.0. Emit a warning: + *"WARNING: plan has no risk_gates section (pre-v2.1.0); running without + gate checks."* Proceed to Step 5. +3. If `risk_gates: []`, the planner asserted no gates. Proceed. +4. If `risk_gates` contains entries, surface each to the user: + + ``` + Phase N contains risk gate(s): + Gate 1 — task (): + Confirmation prompt: + Gate 2 — task (): + Confirmation prompt: + + Review the gates. Confirm to proceed, or cancel to revise the plan. + The executor will pause at each gate task and request individual + confirmation before running it. + ``` + +5. Wait for **explicit** user confirmation. Do not accept ambiguous + responses ("sounds good", "ok"). Require "confirm" or an equivalent + specific acknowledgment. On any non-confirmation, surface the plan + path and stop. + ## Step 5: Execute Before launching, check `.sea/phases/phase-N/progress.json`: @@ -67,11 +95,32 @@ Launch the `executor` agent. Pass it: - Any researcher findings (for complex phases) - **Resume context** if `progress.json` existed: *"Skip tasks ; resume at task ."* -Wait for executor to finish. It returns `STATUS: done` or `STATUS: blocked`. +Wait for executor to finish. It returns `STATUS: done`, `STATUS: blocked`, or `STATUS: gate`. - **blocked** → surface executor's report to the user verbatim, mark the phase `in-progress` in state.json, and stop. If `obra/superpowers:debugging` or `addyosmani/agent-skills:debugging` is installed, recommend invoking it for structured triage. Otherwise surface the executor's blocked report as-is and let the user decide. SEA does not own debug methodology in v2.0.0 — compose with specialized plugins. +- **gate** → run "Resume after gate" (below), then re-launch executor. - **done** → arm auto-QA (next step). +### Resume after gate + +If the executor returns `STATUS: gate`: + +1. Read `.sea/phases/phase-N/gate-pending.json`. +2. Surface the `confirmation_prompt` to the user with the gate `kind` + and `reason`. +3. Wait for **specific** confirmation ("confirm" or explicit task + acknowledgment — no ambiguous responses). +4. **On confirmation:** re-launch the executor with resume context + *"Resume from gate at task . User confirmed."* The executor + deletes the marker and proceeds. +5. **On non-confirmation:** mark the phase `blocked` in state, leave + the marker in place, and tell the user: + *"Phase N paused at task . Delete `.sea/phases/phase-N/gate-pending.json` + to unblock, or run `/sea-quick 'modify task '` to revise."* + +Do not auto-confirm gates. Do not skip the re-launch. The human loop +is the point. + ## Step 6: Arm the Auto-QA Hook Before your response ends, arm the auto-QA marker. In v2.0.0 the marker is **existence-only** — content is ignored. The retry counter lives in a sibling file `.sea/.verify-attempts` that `hooks/auto-qa` writes atomically.