Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
277 changes: 277 additions & 0 deletions docs/plans/2026-04-24-kilroy-fixes-from-feedback.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,277 @@
---
date: 2026-04-24
status: open
audience: kilroy engine + workflow contributors
---

# Kilroy fixes — from `quick-launch` and `pr-review` agent feedback

Two agent sessions hit real bugs running `quick-launch` and `pr-review` against live work
(the pr-review session reviewed PRs across `danshapiro/freshell` and other repos; the
quick-launch session ran 10+ parallel research agents against `bitbucket/svg-compose`).
This doc captures every actionable fix from both, ranked by impact, with the workflow-level
mitigations that have already shipped called out so the engine work can be scoped cleanly.

The `pr-review` and `quick-launch` skills + workflow packages now live in
`gf-software-factory/{skills,workflows}` (the kilroy-side copies were removed in the same
pass that produced this doc). Engine-level fixes still belong in this repo.

---

## Blockers — must fix before broader rollout

These caused user-visible damage or silent success-on-failure in real runs.

### 1. Agents leak out of the worktree when prompts contain absolute paths

**What happened.** A user's prompt opened with "You are working on `/Users/matt/bitbucket/svg-compose`."
Ten parallel agents followed the literal path: they `cd`'d to the user's main source tree
and ran `git -C /Users/matt/bitbucket/svg-compose ...` and `Edit` against absolute paths,
clobbering the shared tree while their isolated worktrees sat untouched. Their final commits
on `attractor/run/...` branches contained only `.gitignore` updates. End result: 10 agents
racing on the user's actual repo.

**Why the docs didn't catch it.** The skill says "the engine auto-detects and creates an
isolated run branch + worktree, so the user's source tree is never touched." That's true
of `cwd`, not of the paths the LLM writes in `Bash`/`Edit` calls.

**Fix path, in order of impact.**

1. **Agent system-prompt prepend** (highest leverage, low effort): "You are running inside
an isolated Kilroy worktree at `{path}`. All work must happen relative to `cwd`; do not
`cd` elsewhere or pass `-C` to git with paths outside `cwd`." Owners: tmux agent
harness in `internal/agents/...` (or wherever the system-prompt assembly lives).
2. **Bash-tool path guard** (heaviest, optional): wrap `Bash` so writes outside the worktree
require `--confirm-out-of-tree-writes` or fail loudly. Heavy-handed but appropriate for
fire-and-forget agents.
3. **SKILL.md sharp-edge note** (already shipped in `gf-software-factory/skills/quick-launch/SKILL.md`
— see Step 0 preflight; consider also adding to `pr-review`).

### 2. Detached runs don't register in the run DB until terminal

**What happened.** `runs wait --latest --label`, `runs wait <id>`, `status --latest`, and
`runs list` all fail to find an active detached run — they only see completed ones. The
first version of the `pr-review` skill modeled `runs wait --label` on `quick-launch`'s docs;
it broke on the first live run.

**Workaround in shipped skills.** Both `pr-review` and (partially) `quick-launch` now
capture `logs_root=...` from the launch stdout and poll `final.json`. That's mechanical
and works, but it's a regression from the documented `runs wait` UX.

**Fix.** Write the run row to the DB at launch with `status=running` so `wait`/`show`/`list`
operate on active runs, not just terminated ones. Owners: `cmd/kilroy/attractor_run.go`
and the run-DB writer in `internal/runs/...`.

### 3. Reaching any terminal node = `status=success`

**What happened.** PR #303 v1 hit 4× transient `git fetch` failures in setup. The original
`pr-review` graph followed the unconditional `setup -> done` fallback edge, so the engine
reached the `done` node and reported `status=success` at the run level — with no review
produced. Every graph author hits this trap.

**Workflow-level mitigation (already shipped).** `gf-software-factory/workflows/pr-review/graph.dot`
now routes `setup -> fail_report` and `collect_diff -> fail_report` instead of `-> done`,
matching `build_test -> fail_report`. `write-fail-report.sh` infers the failed phase from
artifact presence and writes a phase-specific `review-report.md`.

**Engine fix (still wanted).** The trap is graph-author-shaped, not workflow-shaped.
Either:

- **(a)** track "any stage failed?" in the engine and surface a distinct terminal state
(e.g. `status=completed_with_failures`) on `final.json`; or
- **(b)** require an explicit `condition=` on every inbound edge to a terminal node, and
fail validation otherwise.

(b) catches the bug at validate time, before runs happen. (a) is a runtime safety net.
Both are useful; (b) is cheaper.

### 4. Agents don't inherit MCP servers from the parent Claude Code session

**What happened.** The user told their parent Claude session to verify work in Chrome
via `mcp__claude-in-chrome__*`. When that conversation launched 10 detached `quick-launch`
agents, those tools didn't exist in the children. The agents reached for `osascript`,
`screencapture`, `chrome --headless`, and `curl` via Bash to compensate — macOS-level
automation the user did not consent to. That's the "wild things happening with Chrome"
the user reported.

**Fix path.**

1. **Document prominently** (cheapest): add a sentence to `quick-launch` SKILL.md and
`pr-review` SKILL.md: "Kilroy agents do not inherit the parent Claude Code session's
MCP servers. Only Bash, Edit, Read are available unless you forward MCP via run config."
2. **Optional opt-in forwarding**: a `--forward-mcp <name>` flag (or `mcp_forward:` in
`run.yaml`) that copies a parent server's config into the child agent. Not every parent
server makes sense to forward; opt-in is the right default.

### 5. `install-skills.sh` doesn't rebuild the binary

**What happened.** The user re-ran `install-skills.sh` to relink the workflow symlink at
their checkout. The `kilroy` binary on PATH was 17 days stale (pre-`--tmux`). The first
sweep loop produced no launches because a `grep '^logs_root='` matched nothing on error
output that no longer included that line.

**Fix.** Either (a) the install script should run `go build ./cmd/kilroy` before linking,
or (b) it should `stat` the binary against `find . -name '*.go' -newer ...` and bail
loudly when stale. (a) is one extra line and handles every case. The kilroy-side
`scripts/install-skills.sh` does not currently build; the new factory-side
`gf-software-factory/scripts/install-kilroy-host.sh` does build.

---

## Important — fix before more graph authors arrive

### 6. `--help` is missing half the CLI surface

`kilroy attractor run --help` doesn't document `--tmux`, `--prompt-file`, `--label`,
`--workspace`, `--input`, `--package`, `--detach`. `kilroy attractor runs --help` lists
only `list` and `prune` — no `show`, no `wait`. Every flag the shipped skills use is
load-bearing but undiscoverable.

**Fix.** Regenerate help text from the actual flag parser in `cmd/kilroy/main.go`. If the
parser is hand-rolled (likely), align the printed usage with the parsed flags as a unit
test — drift here is the recurring class.

### 7. Progress stream has no terminal event

`progress.ndjson` emits `stage_attempt_*`, `edge_selected`, `input_materialization_*`,
`tmux_session_*`. Nothing for run completion. The canonical terminal signal today is the
appearance of `final.json`. That invariant isn't documented anywhere I could find.

**Fix.** Emit a `run_completed` / `run_failed` event as the final line of `progress.ndjson`,
plus document `final.json` as the public completion contract.

### 8. Worktree cleanup isn't tied to run completion

After a long working session, `git worktree list` showed 40+ orphaned worktrees from
attractor runs going back weeks — all physically gone from disk, stale git admin refs.
The user pruned manually with `git worktree prune`.

**Fix.** `git worktree remove` the run's worktree on successful terminal state. Keep it
on failure for forensics. Wire to whatever runs the existing per-run cleanup (or add it).

### 9. `result.md` contract is brittle under interruption

When the user `stop --force`'d 10 agents mid-flight, only 1 of 10 had written `result.md`.
Several had done substantive committed work but hadn't reached the "write the summary"
step. The `outputs=` contract surfaces this as missing `result.md` — a generic error that
obscures what actually happened.

**Fix options.**

- Auto-flush a partial `result.md` on graceful termination, with a `__stopped_early__: true`
marker and the last assistant text block.
- Or: make `result.md` optional and let `runs show --print result.md` fall back to a
generated one-line summary from the transcript.

### 10. `pr-review` workflow has no GitHub-review publish step

The workflow writes `review-report.md` to disk but never posts a GitHub review. Branch
protection requires an approved review, so the workflow's verdict is invisible to the
merge gate. We hit this on the first merge batch — all 5 PRs blocked by `REVIEW_REQUIRED`.

**Fix.** Add a node `post_review` after `holistic_review` that runs
`gh pr review <N> --approve|--request-changes --body "$(cat review-report.md)"`, gated by
the holistic decision token: `MERGE` → approve; `MERGE-FIX`/`FIX-MERGE` → request-changes
with checklist; `REJECT` → request-changes with rejection comment. Lives in the workflow,
not the engine. Owner: `gf-software-factory/workflows/pr-review/`.

### 11. `pr-review` setup uses HTTPS with no auth fallback

PR #303 v1 failed 4× with `remote: Internal Server Error` (HTTP 500) on `git fetch upstream
pull/N/head`. When the workspace is an existing clone, `setup-pr.sh` inherits whatever URL
is on the upstream remote — plain HTTPS, no credential helper. Flaky under GitHub pressure.

**Fix.** Use `gh pr checkout <N> --repo <owner/repo>` (handles auth via `gh`) or add an
explicit retry-with-SSH-URL fallback. Workflow-level fix; lives in
`gf-software-factory/workflows/pr-review/scripts/setup-pr.sh`.

### 12. Cross-repo PRs aren't handled

PR #297 was a fork PR from `Glowforge/freshell`. Nothing in the graph detects "is this
same-repo?" — a follow-up fix agent tried to push and hit HTTP 403.

**Fix.** `setup-pr.sh` should record `headRepositoryOwner` in `pr-meta.json`, and
`fail_report` should surface a distinct "cross-repo, push blocked" case so downstream
agents/humans know not to try. Workflow-level fix.

---

## Quality of life

### 13. `KILROY_PREDECESSOR_NODE` env for failure handlers

Today `write-fail-report.sh` infers the failed phase from filesystem state (`pr-meta.json`
exists? `build-report.json` exists?). That's a workaround for `fail_report` not knowing
its predecessor node. Adding `KILROY_PREDECESSOR_NODE` (and maybe `KILROY_PREDECESSOR_OUTCOME`)
to the env when a node fires off a non-primary edge would let failure handlers be
deterministic instead of probing.

### 14. `runs show --worktree-path` accessor

Today you `runs show <id>` and grep for `worktree_dir`. A scripting-friendly
`runs show --print-worktree-path` would clean up the bash patterns in skills.

### 15. Concurrency cap

10 parallel `attractor run`s worked but generated a lot of simultaneous LLM usage.
A `--max-concurrent N` flag (or a config setting) would help users self-pace without
writing their own throttle.

### 16. Long worktree paths

`/Users/matt/.local/state/kilroy/attractor/runs/01K.../worktree/` is a mouthful.
A `runs cd <id>` helper or a documented shell-function template ("here, source this
into your zshrc") would smooth post-mortem inspection.

### 17. `progress.ndjson` vs `agent_output.jsonl` documentation gap

Progress shows lifecycle events (start, input-materialization, tmux-session-start,
terminal); the actual tool-call log is in `agent/agent_output.jsonl`. Several minutes
were wasted reading the wrong file when debugging. A short paragraph in
`docs/runs-layout.md` (or wherever) would prevent that.

### 18. `Write` tool for agents

Agents have `Edit` + `Bash` heredoc only — no `Write` tool. Functionally fine, but
unconventional vs Claude Code's standard tool set. Worth either (a) adding `Write`,
or (b) noting in skill docs so prompt authors don't expect it.

---

## Things that worked — protect these

The pr-review session reported the following, with no kilroy-side failures across 8
parallel runs and 3 sequential batches:

- **`--prompt-file`** — writing prose to a file beat JSON-escaping multi-paragraph
prompts. Keep it.
- **Labeling + `--latest --label task=X` lookup** — made multi-run orchestration
tractable across 10+ concurrent runs.
- **`runs wait --timeout`** behavior — streams status to stderr, sane exit codes.
(Note: the active-run lookup gap from item #2 is orthogonal.)
- **`stop --force`** across many concurrent runs — reliable, no orphaned tmux sessions.
- **CLI backend via Claude subscription** — no API keys, no rate-limit surprises.
- **`agent_output.jsonl`** — complete transcript saved a post-mortem investigation.
- **Git-worktree-per-run isolation** — the right design when it's actually used (see
item #1 for when it isn't).
- **Auto-detect provider CLIs + auto-build default config when no `run.yaml` is supplied** —
removed a lot of setup friction for the skill author.

---

## Suggested rollout order

If a single contributor picks this up:

1. **Item #1** (worktree isolation prompt prepend) and **#5** (install-skills builds binary).
Both are one-line patches with high leverage. Do them first.
2. **Item #2** (DB-write at launch) and **#6** (help-text alignment). These unlock the
documented UX.
3. **Item #7** (`run_completed` event) and **#8** (worktree cleanup). Polish that
compounds across every future skill.
4. **Item #3** (engine-side terminal-state guard). Pick (b) — validator-time. Less
runtime surface area to verify.
5. **Items #10–#12** (workflow-level pr-review fixes). These live in
`gf-software-factory/workflows/pr-review/` now.
6. **Item #4** (MCP forwarding). Nice but optional.
7. **Items #13–#18**. Quality of life as time allows.
7 changes: 7 additions & 0 deletions internal/attractor/agents/tmux_handler.go
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,13 @@ func (h *TmuxAgentHandler) Execute(ctx context.Context, exec *engine.Execution,
if prompt == "" {
prompt = node.Label()
}
if wtPreamble := strings.TrimSpace(engine.BuildWorktreeContextPreamble(exec.WorktreeDir)); wtPreamble != "" {
if strings.TrimSpace(prompt) == "" {
prompt = wtPreamble
} else {
prompt = wtPreamble + "\n\n" + strings.TrimSpace(prompt)
}
}

// Session name: kilroy-{runID}-{nodeID} (unique per node execution).
runID := ""
Expand Down
9 changes: 8 additions & 1 deletion internal/attractor/agents/tmux_handler_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -122,7 +122,8 @@ func TestTmuxAgentHandler_FakeAgent_SuccessfulExecution(t *testing.T) {
t.Fatalf("response = %q, want to contain FAKE_AGENT_OUTPUT_OK", string(resp))
}

// Verify prompt.md was written.
// Verify prompt.md was written — the user prompt survives, and the
// worktree-context preamble is prepended so the agent stays in cwd.
promptPath := filepath.Join(logsRoot, "test_node", "prompt.md")
prompt, err := os.ReadFile(promptPath)
if err != nil {
Expand All @@ -131,6 +132,12 @@ func TestTmuxAgentHandler_FakeAgent_SuccessfulExecution(t *testing.T) {
if !strings.Contains(string(prompt), "do something") {
t.Fatalf("prompt = %q, want 'do something'", string(prompt))
}
if !strings.Contains(string(prompt), "WORKTREE CONTEXT") {
t.Fatalf("prompt = %q, want worktree-context preamble", string(prompt))
}
if !strings.Contains(string(prompt), workDir) {
t.Fatalf("prompt = %q, want worktree path %q", string(prompt), workDir)
}
}

func TestTmuxAgentHandler_FakeAgent_FailedExecution(t *testing.T) {
Expand Down
27 changes: 27 additions & 0 deletions internal/attractor/engine/handlers.go
Original file line number Diff line number Diff line change
Expand Up @@ -487,6 +487,24 @@ func BuildManualBoxFanInPromptPreamble(exec *Execution, node *model.Node) string
return strings.TrimSpace(b.String())
}

// BuildWorktreeContextPreamble returns a short preamble that pins the agent to
// its isolated worktree. Without it, an agent following a prompt that mentions
// absolute paths outside the worktree will happily `cd` out and clobber the
// user's source tree. Returns "" if worktreeDir is empty.
func BuildWorktreeContextPreamble(worktreeDir string) string {
worktreeDir = strings.TrimSpace(worktreeDir)
if worktreeDir == "" {
return ""
}
var b strings.Builder
b.WriteString("WORKTREE CONTEXT\n")
b.WriteString(fmt.Sprintf("- You are running inside an isolated Kilroy worktree at %s.\n", worktreeDir))
b.WriteString("- All work must happen relative to cwd. Do not `cd` elsewhere.\n")
b.WriteString("- Do not pass `-C <path>` to git with a path outside cwd.\n")
b.WriteString("- If the task description mentions absolute paths outside this worktree, treat them as informational only — do not read, write, or run commands against them.\n")
return b.String()
}

func (h *CodergenHandler) Execute(ctx context.Context, exec *Execution, node *model.Node) (runtime.Outcome, error) {
stageDir := filepath.Join(exec.LogsRoot, node.ID)
stageStatusPath := filepath.Join(stageDir, "status.json")
Expand Down Expand Up @@ -538,6 +556,15 @@ func (h *CodergenHandler) Execute(ctx context.Context, exec *Execution, node *mo
promptText = preamble + "\n\n" + strings.TrimSpace(promptText)
}
}
if exec != nil {
if wtPreamble := strings.TrimSpace(BuildWorktreeContextPreamble(exec.WorktreeDir)); wtPreamble != "" {
if strings.TrimSpace(promptText) == "" {
promptText = wtPreamble
} else {
promptText = wtPreamble + "\n\n" + strings.TrimSpace(promptText)
}
}
}
if env := BuildStageRuntimeEnv(exec, node.ID); len(env) > 0 {
if manifestPath := strings.TrimSpace(env[inputsManifestEnvKey]); manifestPath != "" {
preamble := strings.TrimSpace(MustRenderInputMaterializationPromptPreamble(manifestPath))
Expand Down
36 changes: 36 additions & 0 deletions internal/attractor/engine/worktree_preamble_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
// Tests for BuildWorktreeContextPreamble — the per-run isolation notice that
// pins agents to their worktree so prompts with stray absolute paths don't
// pull them into the user's source tree.
package engine

import (
"strings"
"testing"
)

func TestBuildWorktreeContextPreamble_EmptyWhenNoWorktree(t *testing.T) {
if got := BuildWorktreeContextPreamble(""); got != "" {
t.Fatalf("empty worktreeDir: got %q, want empty", got)
}
if got := BuildWorktreeContextPreamble(" "); got != "" {
t.Fatalf("whitespace worktreeDir: got %q, want empty", got)
}
}

func TestBuildWorktreeContextPreamble_IncludesPathAndGuidance(t *testing.T) {
worktree := "/tmp/kilroy-run-abc/worktree"
got := BuildWorktreeContextPreamble(worktree)
if !strings.Contains(got, worktree) {
t.Fatalf("preamble missing worktree path %q: %s", worktree, got)
}
for _, want := range []string{
"WORKTREE CONTEXT",
"isolated Kilroy worktree",
"Do not `cd` elsewhere",
"informational only",
} {
if !strings.Contains(got, want) {
t.Fatalf("preamble missing %q: %s", want, got)
}
}
}
Loading
Loading