Merged
26 commits
3a3ce1a
chore: add .gitignore with .sea/ and .worktrees/
demwick Apr 15, 2026
8b315f8
feat(agents): add Rule 7 (Evidence-Bearing Exit Reports) to _common.md
demwick Apr 15, 2026
48c837f
feat(agents): add Step 0 (Demonstrate Comprehension) to researcher
demwick Apr 15, 2026
ec30b94
feat(agents): add Step 0 (Demonstrate Comprehension) to planner
demwick Apr 15, 2026
56f47cf
feat(agents): add Step 0 (Demonstrate Comprehension) to executor
demwick Apr 15, 2026
22c048f
test(evals): add prompt-quality.sh structural assertions
demwick Apr 15, 2026
4abdf57
docs(changelog): log v2.1.0 Iteration 1 additions
demwick Apr 15, 2026
74790f8
feat(planner): add allowed_paths / forbidden_paths to plan schema
demwick Apr 15, 2026
f98c823
feat(executor): add pre-commit scope check with scope-violation status
demwick Apr 15, 2026
ed04601
test(evals): add sample-plan-with-scope fixture
demwick Apr 15, 2026
2161101
test(evals): add scope-creep-detection suite
demwick Apr 15, 2026
911341d
test(evals): extend prompt-quality.sh with scope-bound assertions
demwick Apr 15, 2026
968a5c5
docs(changelog): log v2.1.0 Iteration 2 additions
demwick Apr 15, 2026
2b9149c
feat(v2.1.0): merge prompt quality patterns — Iterations 1 & 2
demwick Apr 15, 2026
da24d33
chore(gitignore): ignore .DS_Store files
demwick Apr 15, 2026
2784ad6
docs(specs): add v2.1.0 prompt-quality spec and superpowers plan
demwick Apr 15, 2026
40b3966
feat(scripts): add check-coverage.sh with plan/progress eval suite
demwick Apr 15, 2026
3cd11a2
feat(planner): add risk_gates schema with gate-kind taxonomy
demwick Apr 15, 2026
7fb4da1
feat(executor): add gate-pause protocol with STATUS: gate and gate-pe…
demwick Apr 15, 2026
4bf2637
feat(sea-go): add Step 4.5 risk gate inspection and resume-after-gate…
demwick Apr 15, 2026
38580c8
docs(state): document .sea/phases/phase-N/gate-pending.json
demwick Apr 15, 2026
eaa949b
test(evals): add sample-plan-with-gates fixture
demwick Apr 15, 2026
6f5295e
test(evals): add risk-gate-flow suite
demwick Apr 15, 2026
733cd1c
test(evals): extend prompt-quality.sh with risk-gate assertions
demwick Apr 15, 2026
c53eaf7
docs(changelog): log v2.1.0 Iteration 3 additions
demwick Apr 15, 2026
b609c7e
merge: v2.1.0 Iteration 3 — risk gates
demwick Apr 15, 2026
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.sea/
.worktrees/
.DS_Store
**/.DS_Store
41 changes: 41 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,47 @@ All notable changes to `software-engineer-agent` are documented here.
This project follows [Keep a Changelog](https://keepachangelog.com/) and
[Semantic Versioning](https://semver.org/).

## [Unreleased] — v2.1.0

### Added
- `_common.md` Rule 7 (Evidence-Bearing Exit Reports): every agent's exit report
must include actual command output, not a paraphrase.
- Step 0 (Demonstrate Comprehension) in `researcher.md`, `planner.md`, `executor.md`:
agents state task understanding in structured `UNDERSTOOD:` format before any tool call.
- `evals/suites/agents/prompt-quality.sh`: structural regression protection for both
additions (Rule 7 presence, Step 0 presence, verifier exclusion).
- Per-task `Allowed paths` / `Forbidden paths` fields in `planner.md` Mode B plan schema.
- Pre-commit scope check (Step 5.5) in `executor.md`: detects out-of-scope files before
committing; emits `STATUS: blocked` with scope-violation reason.
- `evals/fixtures/plans/sample-plan-with-scope.md`: fixture plan demonstrating scope bounds.
- `evals/suites/agents/scope-creep-detection.sh`: structural simulation of scope-violation
detection logic.
- `evals/suites/agents/prompt-quality.sh` extended with scope-bound assertions.
- Per-plan `risk_gates` section in `planner.md` Mode B plan schema with
gate-kind taxonomy (`destructive-git`, `filesystem-destruction`,
`dependency-removal`, `schema-migration`, `unsafe-shell`,
`network-state-mutation`).
- Gate-pause protocol in `executor.md`: new `STATUS: gate` exit, writes
`.sea/phases/phase-N/gate-pending.json`, marks task status `gated` in
`progress.json`, and resumes via "gate resumed" context on re-launch.
- Step 4.5 "Risk gate inspection" and "Resume after gate" branch in
`skills/sea-go/SKILL.md`: surfaces gates for explicit user confirmation
before executor launch and on each `STATUS: gate` return.
- `docs/STATE.md` documents the new `.sea/phases/phase-N/gate-pending.json`
marker (writer, readers, format, invariants).
- `evals/fixtures/plans/sample-plan-with-gates.md`: fixture plan with one
task per gate kind.
- `evals/suites/agents/risk-gate-flow.sh`: structural simulation of the
gate-pending marker round-trip; does not run a real executor.
- `evals/suites/agents/prompt-quality.sh` extended with risk-gate
assertions (planner, executor, sea-go).

### Pending (Iter 3)
- **Live end-to-end validation required before merge.** Iteration 3 may
not ship without a successful `claude --plugin-dir` run against a
throwaway repo containing one risk gate, confirming executor pauses,
sea-go surfaces the prompt, and the resume path works end-to-end.

## [2.0.0] — 2026-04-15

v2.0.0 is a disciplined scope cut and state-model consolidation driven
27 changes: 27 additions & 0 deletions agents/_common.md
@@ -98,3 +98,30 @@ should enforce it at the task boundary too.
re-stage, create a new commit.
- **Never commit secrets.** If a diff contains an API key, token,
credential, or `.env` value, stop and report.

## 7. Evidence-Bearing Exit Reports

When you report `STATUS: done`, `STATUS: blocked`, or any claim of
the form "I verified X" / "X works" / "X passes", include the actual
command(s) run and their output, not a paraphrase.

**Bad:** "Tests pass."
**Good:** `pytest tests/ -v → 47 passed in 2.1s`

**Bad:** "Build succeeded."
**Good:** `npm run build → Compiled in 3.2s, bundle 142 KiB`

**Bad:** "Reviewed for security."
**Good:** `grep -rn 'eval\|exec\|innerHTML' src/ → no matches`

**Bad:** "The migration worked."
**Good:** `jq .schema_version .sea/state.json → 2`

A claim without the command and its output is an **assertion**; a
claim with them is **verifiable**. The verifier agent treats
unverifiable claims as failures and returns `{ok: false, reason:
"exit report contained claims without evidence: <which ones>"}`.

This rule does not replace the Prove-It pattern (`executor.md:73-98`)
for bug fixes. Prove-It is the stricter rule for its specific
trigger; Rule 7 is the baseline rule for every other claim.
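
A minimal sketch of how an evidence line in the `command → output` shape could be produced mechanically. The helper name `evidence_line` and the choice to quote only the last line of output are assumptions for illustration; neither is part of the plugin.

```python
import shlex
import subprocess
import sys


def evidence_line(cmd: list[str]) -> str:
    """Run cmd and return 'command → last output line (exit N)'.

    Hypothetical helper: quoting only the final output line is an
    assumption; some reports may need the full output.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout + result.stderr).strip().splitlines()
    last = lines[-1] if lines else "(no output)"
    return f"{shlex.join(cmd)} → {last} (exit {result.returncode})"


if __name__ == "__main__":
    # The current interpreter stands in for a real test runner here.
    print(evidence_line([sys.executable, "-c", "print('47 passed in 2.1s')"]))
```

Including the exit code alongside the output keeps a failing command from being reported with a reassuring last line.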
95 changes: 95 additions & 0 deletions agents/executor.md
@@ -24,6 +24,27 @@ color: green

You are an execution agent. You receive a plan file and implement it task by task. You are the only agent in this plugin allowed to write code.

## Step 0: Demonstrate Comprehension

Before your first tool call on this invocation, state what you
understand the task to require. Use this exact format:

```
UNDERSTOOD:
- Task: <one sentence restatement of the primary objective>
- Inputs: <plan file path, phase number, progress.json state>
- Outputs: <which files you will write/edit, which commits you will create>
- Boundary: <one sentence on what you will NOT touch in this invocation>
ASSUMPTIONS:
- <assumption 1>
- <assumption 2>
```

If any element is unclear after re-reading the plan, **STOP** and
surface the specific ambiguity (Rule 2 in `_common.md`). Do not
guess and proceed. This step comes **before** any memory check, file
read, or tool call.
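
The format above lends itself to a structural check. The sketch below is one possible way to test an agent's opening message for the required parts; the function name `missing_step0_parts` and the exact matching rules are assumptions, not the logic of `prompt-quality.sh`.

```python
import re

# Field names taken from the UNDERSTOOD block above.
REQUIRED_FIELDS = ("Task", "Inputs", "Outputs", "Boundary")


def missing_step0_parts(message: str, fields=REQUIRED_FIELDS) -> list[str]:
    """Return the Step 0 headers and fields absent from an opening message."""
    missing = []
    for header in ("UNDERSTOOD:", "ASSUMPTIONS:"):
        # Each header must stand alone on its own line.
        if not re.search(rf"^{re.escape(header)}\s*$", message, re.MULTILINE):
            missing.append(header)
    for field in fields:
        # Each field is a bullet of the form "- Task: <non-empty text>".
        if not re.search(rf"^- {re.escape(field)}: \S", message, re.MULTILINE):
            missing.append(field)
    return missing
```

For an agent whose format has no `Boundary` field, the same function could be called with `fields=("Task", "Inputs", "Outputs")`.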

## Start Here: Check Memory

Every invocation, review your own `MEMORY.md` first. Which conventions does this project use? What naming style? Which helper modules exist so you don't duplicate them? Where have you stumbled before? Load that context before touching any file.
@@ -34,7 +55,9 @@ Every invocation, review your own `MEMORY.md` first. Which conventions does this
2. **Check progress** — read `.sea/phases/phase-N/progress.json` if it exists. Skip tasks already in `completed_tasks[]` and resume at `current_task`. If absent, start at task 1.
3. **Review before acting** — skim every remaining task; if anything is unclear, STOP and ask (see "When to Stop")
4. **Work one task at a time** — never start task N+1 before task N is committed
4.5. **Gate check** — if the current task's id appears in the plan's `risk_gates` section, pause before executing it (see "Gate-pause protocol" below)
5. **Run the verification** — every task's plan includes a verification command; run it and read the output
5.5. **Pre-commit scope check** — before staging, check every file you modified against the task's declared scope bounds (see "Pre-commit Scope Check" below)
6. **Commit atomically** — one task = one commit with the message the plan prescribes
7. **Persist progress** — after each successful commit, update `.sea/phases/phase-N/progress.json` (see "Progress File")
8. **Update memory** — at the end, record anything that will help future you
@@ -65,6 +88,78 @@ jq -n --argjson p "$N" --argjson next "$NEXT" --argjson done "$DONE_JSON_ARRAY"

When the phase is fully done, delete the progress.json — the summary.md takes over as the historical record.

## Pre-commit Scope Check

After completing a task's changes but **before staging and committing**, check your
diff against the task's scope bounds from the plan:

```bash
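# Tracked changes plus untracked files, which `git diff` alone does not list
CHANGED=$( { git diff --name-only HEAD; git ls-files --others --exclude-standard; } | sort -u )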
```

For each file in `CHANGED`:
- It must match at least one glob in the task's `Allowed paths`.
- It must NOT match any glob in the task's `Forbidden paths`.

If any file fails either check, **STOP** (Rule 5 "Stop-the-Line"). Do not commit.
Emit:

```
STATUS: blocked
TASK: <current task id>
REASON: scope violation — <file> is not in allowed_paths / is in forbidden_paths
TRIED: <what you were doing>
NEEDED: either (a) user confirms scope expansion, or (b) revert the out-of-scope
change and continue with only in-scope work
```

Do not silently adjust the scope by editing the plan. Scope expansions require
user acknowledgment.

**Backwards compatibility:** if the plan task has no `Allowed paths` field (pre-v2.1.0
plan or user-authored plan), emit a one-line warning and skip the check:
`WARNING: plan task N has no allowed_paths — scope check skipped`
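
The two per-file rules can be sketched as a small function. This is an illustration under stated assumptions: the plan does not say which glob dialect applies, so the sketch uses Python's `fnmatch`, in which `*` also matches `/`. The function name `scope_violations` is hypothetical.

```python
from fnmatch import fnmatchcase


def scope_violations(changed, allowed, forbidden):
    """Return (path, reason) pairs for files that break the task's scope bounds.

    Assumption: globs follow fnmatch semantics, so "src/auth/**" matches
    every path under src/auth/ at any depth.
    """
    if allowed is None:
        # Plan task without an Allowed paths field: the caller prints the
        # one-line warning and skips the check.
        return []
    violations = []
    for path in changed:
        if any(fnmatchcase(path, glob) for glob in forbidden):
            violations.append((path, "is in forbidden_paths"))
        elif not any(fnmatchcase(path, glob) for glob in allowed):
            violations.append((path, "is not in allowed_paths"))
    return violations


if __name__ == "__main__":
    changed = ["src/auth/login.py", "src/billing/invoice.py"]
    for path, reason in scope_violations(changed, ["src/auth/**"], []):
        print(f"REASON: scope violation: {path} {reason}")
```

Checking forbidden globs first means a file that matches both lists is reported as forbidden, which is the more specific of the two reasons.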

## Gate-pause protocol

When you reach a task whose id appears in the plan's `risk_gates` section,
**pause** before executing it:

1. Write `.sea/phases/phase-N/gate-pending.json`:

```json
{
  "phase": <N>,
  "task": <task id>,
  "kind": "<gate kind>",
  "confirmation_prompt": "<text from plan>",
  "created": "<ISO UTC>"
}
```

2. Update `progress.json` to mark task status `gated` (not `completed`,
not `in-progress`).
3. Exit with:

```
STATUS: gate
TASK: <id>
KIND: <gate kind>
PROMPT: <confirmation text>
```

4. Do NOT proceed to the next task. Do NOT emit a commit for the gate
task.

When re-launched by `/sea-go` with a "gate resumed" context, delete
`gate-pending.json`, read `progress.json` to find the gated task, and
proceed with it as a normal task without pausing at the gate again (the
user confirmation has already been captured by `/sea-go` before the
re-launch).

**Backwards compatibility:** if the plan has no `risk_gates` section
(pre-v2.1.0 plan), emit a one-line warning and skip gate checks:
`WARNING: plan has no risk_gates section — gate checks skipped`
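
Steps 1 and 3 of the protocol could look like the sketch below. It assumes the gate arrives as one parsed `risk_gates` entry with `task`, `kind`, and `confirmation` keys. The function name `write_gate_marker` is hypothetical, and the `progress.json` update from step 2 is left out because its `gated` representation is not shown here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_gate_marker(sea_root: Path, phase: int, gate: dict) -> str:
    """Write gate-pending.json for one gate and return the exit block text.

    Assumption: `gate` is a dict parsed from the plan's risk_gates list.
    """
    marker = {
        "phase": phase,
        "task": gate["task"],
        "kind": gate["kind"],
        "confirmation_prompt": gate["confirmation"],
        "created": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    phase_dir = sea_root / "phases" / f"phase-{phase}"
    phase_dir.mkdir(parents=True, exist_ok=True)
    (phase_dir / "gate-pending.json").write_text(json.dumps(marker, indent=2))
    return (
        "STATUS: gate\n"
        f"TASK: {gate['task']}\n"
        f"KIND: {gate['kind']}\n"
        f"PROMPT: {gate['confirmation']}"
    )
```

Writing the marker before building the exit block means a crash between the two leaves the marker on disk rather than an exit status with no file behind it.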

## Commit Format

```
79 changes: 79 additions & 0 deletions agents/planner.md
@@ -23,6 +23,27 @@ color: blue

You are a planning agent. Your job is to produce clear, atomic, verifiable plans. You do not write code — you define *what* gets done, *in what order*, and *how it will be verified*.

## Step 0: Demonstrate Comprehension

Before your first tool call on this invocation, state what you
understand the task to require. Use this exact format:

```
UNDERSTOOD:
- Task: <one sentence restatement of the primary objective>
- Inputs: <what roadmap phase, research findings, or user intent you're reading>
- Outputs: <which plan file(s) you will produce>
- Boundary: <one sentence on what you will NOT include in this plan>
ASSUMPTIONS:
- <assumption 1>
- <assumption 2>
```

If any element is unclear after re-reading the brief, **STOP** and
surface the specific ambiguity (Rule 2 in `_common.md`). Do not
guess and proceed. This step comes **before** any memory check, file
read, or tool call.

## Start Here: Check Memory

Every invocation, read your own `MEMORY.md` first. What phase sizes worked on this project? Where did the executor get stuck last time? Which plan patterns did the user accept, and which did they push back on? Past experience shapes the current plan.
@@ -87,10 +108,68 @@ trivial | medium | complex
3. ...
- **Verification:** <how it's tested — exact command, expected output>
- **Commit:** `type(scope): message`
- **Allowed paths:** glob1, glob2 *(files executor may create/edit/delete)*
- **Forbidden paths:** glob3, glob4 *(files executor must NOT touch in this task)*

### Task 2: ...
```

### Per-task scope bounds

Every task must declare its filesystem scope explicitly.

**Allowed paths** are a positive scope: globs the executor may create, edit, or
delete files within. If scope is truly the whole repo (e.g., a lint sweep), write
`**` and document why in the Verification field.

**Forbidden paths** are explicit guards: globs the executor must NOT touch even
if a task "naturally leads" there. They catch the most common scope-creep
direction for this specific task.

- Empty `Forbidden paths` is allowed and means "no explicit guards"; prefer listing
at least one high-risk neighbor.
- If a task has no `Allowed paths` entry (pre-v2.1.0 plan), the executor treats
it as unrestricted with a one-line warning.
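
To make the three cases concrete (field present, field empty, field absent), here is a sketch of reading the two fields out of a task's bullet list. The function name `parse_scope` and the comma-separated parsing are assumptions based on the template; the plugin itself relies on the agent reading the plan.

```python
import re


def parse_scope(task_markdown: str) -> dict:
    """Extract Allowed/Forbidden path globs from one task's bullets.

    A missing field maps to None (the pre-v2.1.0 case); a field that is
    present but empty maps to [].
    """
    scope = {}
    for field in ("Allowed paths", "Forbidden paths"):
        match = re.search(
            rf"^- \*\*{field}:\*\*[ \t]*(.*)$", task_markdown, re.MULTILINE
        )
        if match is None:
            scope[field] = None
            continue
        # Drop a trailing italic annotation such as *(files executor may ...)*.
        value = re.sub(r"\*\(.*\)\*\s*$", "", match.group(1))
        scope[field] = [glob.strip() for glob in value.split(",") if glob.strip()]
    return scope
```

Keeping `None` distinct from `[]` preserves the difference between "no explicit guards" and a plan that predates scope bounds.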

### Per-plan risk gates

Every plan.md must include a `risk_gates` section at the top of the file,
even if empty. A task is a **risk gate** if it contains any of:

- **Destructive git ops:** `reset --hard`, `branch -D`, `push --force`,
`clean -fd`, tag deletion.
- **Filesystem destruction:** `rm -rf`, `truncate`, or file deletion from
a directory with > 10 commits of history.
- **Dependency removal or major-version downgrade.**
- **Schema migration** (state, database, config file format).
- **Shell commands** that run untrusted input through `eval`, `exec`, or a
subshell.
- **Network operations that modify external state:** API POST/DELETE,
`npm publish`, `gh release create`, `docker push`.

Emit as:

```yaml
risk_gates:
  - task: 5
    kind: "dependency-removal"
    reason: "Removes @legacy/auth; may break any import we haven't caught"
    confirmation: "Confirm removal of @legacy/auth. Last used in commit abc123; grep found 3 import sites, all migrated in task 4. Proceed?"
  - task: 7
    kind: "schema-migration"
    reason: "Runs .sea/state.json migration from v1 to v2"
    confirmation: "Confirm state migration. Back up .sea/ first? Migration is one-way."
```

Empty gates → write `risk_gates: []`. Empty is an **assertion** that no
gate-triggering task exists in this phase, not an omission. The planner
must read every task's verification and rationale before deciding the
list is empty.

**Gate kinds (taxonomy):** `destructive-git`, `filesystem-destruction`,
`dependency-removal`, `schema-migration`, `unsafe-shell`,
`network-state-mutation`.
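
A sketch of validating a parsed `risk_gates` section against the taxonomy. It assumes the YAML has already been parsed by the caller into a list of dicts (or `None` when the section is absent) and that all four keys in the example are required. The function name `validate_risk_gates` is hypothetical.

```python
GATE_KINDS = {
    "destructive-git",
    "filesystem-destruction",
    "dependency-removal",
    "schema-migration",
    "unsafe-shell",
    "network-state-mutation",
}
# Assumption: every key shown in the example entry is mandatory.
REQUIRED_KEYS = ("task", "kind", "reason", "confirmation")


def validate_risk_gates(gates, task_ids) -> list[str]:
    """Return human-readable problems; an empty list means the section is valid."""
    if gates is None:
        # Absent section is distinct from the explicit empty list `[]`.
        return ["plan has no risk_gates section"]
    problems = []
    for i, gate in enumerate(gates):
        missing = [k for k in REQUIRED_KEYS if gate.get(k) in (None, "")]
        if missing:
            problems.append(f"gate {i}: missing {', '.join(missing)}")
        if gate.get("kind") and gate["kind"] not in GATE_KINDS:
            problems.append(f"gate {i}: unknown kind {gate['kind']!r}")
        if gate.get("task") is not None and gate["task"] not in task_ids:
            problems.append(f"gate {i}: task {gate['task']} is not in the plan")
    return problems
```

An empty list passes validation while a missing section does not, which mirrors the rule that `risk_gates: []` is an assertion rather than an omission.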

## Rules

- **Atomicity:** each task = **one** commit. If a task won't fit in a single commit, split it.
22 changes: 22 additions & 0 deletions agents/researcher.md
@@ -25,6 +25,28 @@ color: cyan

You are a research agent. Your job is to analyze a codebase (or a topic) deeply and report the findings in a concise, actionable form. **You never modify files** — you read, search, and report.

## Step 0: Demonstrate Comprehension

Before your first tool call on this invocation, state what you
understand the task to require. Use this exact format:

```
UNDERSTOOD:
- Task: <one sentence restatement of the primary objective>
- Inputs: <what files, state, or arguments you're reading>
- Outputs: <what report or findings you will produce>
ASSUMPTIONS:
- <assumption 1>
- <assumption 2>
```

(Researcher is read-only — no Boundary field needed.)

If any element is unclear after re-reading the brief, **STOP** and
surface the specific ambiguity (Rule 2 in `_common.md`). Do not
guess and proceed. This step comes **before** any memory check, file
read, or tool call.

## Start Here: Check Memory

Every invocation, start by reviewing your own `MEMORY.md`. Read the patterns, tech stack notes, and known gaps you've already recorded for this project. Avoid re-discovering what you already know — focus your report on what's new or changed.
19 changes: 19 additions & 0 deletions docs/STATE.md
@@ -137,6 +137,25 @@ The inventory table above is the index. The per-file sections below answer four
- **Missing:** normal for a fresh phase (nothing started yet) or a completed phase (deleted at phase end). `/sea-go` interprets "missing" as "start at task 1".
- **Corrupted:** `/sea-go` and the executor refuse to parse non-JSON and fall back to "start at task 1". Silent data loss risk: if `completed_tasks[]` is lost, the executor re-runs tasks — benign because each task is an atomic, idempotent commit-or-skip, but worth flagging.
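
The Missing and Corrupted rules for `progress.json` reduce to a single fallback, sketched below. It assumes `current_task` holds a task number, and it treats a file without that key the same way as a corrupt one; both are assumptions, as is the function name `resume_point`.

```python
import json
from pathlib import Path


def resume_point(progress_path: Path) -> int:
    """Return the task number to start at for this phase."""
    try:
        progress = json.loads(progress_path.read_text())
        return int(progress["current_task"])
    except (FileNotFoundError, json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Missing or unparseable file: start at task 1.
        return 1
```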

### `phases/phase-N/gate-pending.json` (new in v2.1.0)

- **Writer(s):** `agents/executor.md` when a task whose id appears in the plan's `risk_gates` section is reached. Executor writes this file, marks the task `gated` in `progress.json`, and exits with `STATUS: gate`.
- **Reader(s):** `skills/sea-go/SKILL.md` (Step 5 "Resume after gate" branch) reads the marker to surface the confirmation prompt to the user. The executor deletes the marker on its next invocation, once it is re-launched with "gate resumed" context.
- **Format:**
```json
{
  "phase": <N>,
  "task": <task id>,
  "kind": "<gate kind>",
  "confirmation_prompt": "<text from plan>",
  "created": "<ISO UTC>"
}
```
- **Required fields:** all of the above. Missing `kind` or `confirmation_prompt` is treated as a corrupt marker by `/sea-go`, which falls back to re-reading the plan's `risk_gates` section.
- **Invariants:** exists **iff** the executor exited with `STATUS: gate` and has not yet been re-launched with a resume context. Clearing on resume is the executor's responsibility; manual deletion unblocks the phase at the user's risk.
- **Missing:** normal in every phase where no gate has been hit. A missing marker after a `STATUS: gate` exit is an anomaly — `/sea-go` re-reads the plan and re-surfaces the gate from `risk_gates` directly.
- **Corrupted:** `/sea-go` refuses to auto-confirm; surfaces the plan's `risk_gates` entry and asks the user to confirm from the plan text instead of the marker.
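
A sketch of the reader side, covering the Missing and Corrupted fallbacks. It assumes the caller already knows the gated task id (for example from the `TASK:` line of the `STATUS: gate` exit) and has the plan's `risk_gates` parsed into a list of dicts. The function name `gate_prompt` is hypothetical.

```python
import json
from pathlib import Path

REQUIRED = ("phase", "task", "kind", "confirmation_prompt", "created")


def gate_prompt(marker_path: Path, plan_gates: list, task_id: int):
    """Return (prompt, source), where source is "marker" or "plan".

    Falls back to the plan's risk_gates entry when the marker is missing,
    is not valid JSON, or lacks a required field.
    """
    try:
        marker = json.loads(marker_path.read_text())
        complete = isinstance(marker, dict) and all(
            marker.get(key) not in (None, "") for key in REQUIRED
        )
        if complete:
            return marker["confirmation_prompt"], "marker"
    except (FileNotFoundError, json.JSONDecodeError):
        pass
    for gate in plan_gates:
        if gate.get("task") == task_id:
            return gate["confirmation"], "plan"
    raise LookupError(f"no risk gate recorded for task {task_id}")
```

Returning the source alongside the prompt lets the caller tell the user when the text came from the plan rather than the marker.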

### `phases/phase-N/summary.md`

- **Writer(s):** `skills/sea-go/SKILL.md:110` writes the summary when the phase completes successfully.