Skip to content

fix(polecat): pre-spawn duplicate-branch check in mol-polecat-work#1

Open
vbtcl wants to merge 148 commits intomainfrom
fix/polecat-dup-branch-check
Open

fix(polecat): pre-spawn duplicate-branch check in mol-polecat-work#1
vbtcl wants to merge 148 commits intomainfrom
fix/polecat-dup-branch-check

Conversation

@vbtcl
Copy link
Copy Markdown

@vbtcl vbtcl commented Apr 29, 2026

Summary

Polecats were creating duplicate PRs against the same bead because re-spawn didn't check whether another polecat alias already had an open branch/PR for that bead. Triage of bead pa-9zyx4 found four such double-PR cases (pa-v4oi6.7, .19, .23, .24) where polecat/capable/* and polecat/rictus/* simultaneously targeted the same issue ID.

This adds a check in mol-polecat-work.workspace-setup that runs before fresh branch creation. When metadata.branch is empty:

  • Scan origin for any polecat/* branch carrying this bead ID, matched precisely (anchored to / before and end/@/- after — so pa-wgb8g does not match pa-wgb8g2 or pa-wgb8g.2).
  • Same-alias match → adopt as ours by setting metadata.branch; the existing rejection-aware-resume path below checks it out.
  • Different-alias match → don't open a duplicate. Mail the witness with both branches, drop the claim back to the pool, drain-ack and exit.
  • Both → adopt our own; nudge witness about the other-alias branches.

Bumps formula version 8 → 9.

Source bead

Filed (in partcl rig's beads database, since that's where the symptom was observed) as pa-wgb8g. The fix has to land here in gascity because the formula and dispatcher live in this repo, not in partcl.

Test plan

  • TOML parses (python3 -c 'import tomllib; tomllib.load(open(...))')
  • Rendered shell snippet passes bash -n syntax check
  • Regex /<bead>([@-]|$) matches polecat/<alias>/<bead>-... and polecat/<alias>/<bead>@..., does NOT match polecat/<alias>/<bead>2-... or polecat/<alias>/<bead>.2-...
  • Decision logic verified manually for SAME-only, OTHER-only, BOTH, NONE cases
  • End-to-end: contrived re-spawn against a bead with an existing same-alias branch — should resume (not implemented as automated test; would need the dispatcher in the loop)
  • End-to-end: contrived re-spawn against a bead with an existing different-alias branch — should release claim, mail witness, exit

The last two require driving a real polecat through the formula and aren't easily automated. They are the right follow-up if a test harness for formula execution exists.

Notes for reviewers

  • I left the dispatcher-side guard (Go code) for a separate change. The bead description says both layers should ideally guard; this PR is the formula half.
  • The check uses git ls-remote (the equivalent path the bead description offered) rather than gh pr list. git ls-remote doesn't require GitHub auth and runs in any rig regardless of forge.
  • metadata.duplicate_detected is set in the OTHER-only path so the next polecat that picks the bead can see what happened on the previous attempt.

🤖 Generated with Claude Code

Jim Wordelman and others added 30 commits April 22, 2026 14:05
…all#799)

Pool-agent sessions resumed through the control-dispatcher path dropped
--dangerously-skip-permissions, --settings, and other schema-default
launch flags. The previous shouldPreserveStoredRuntimeCommand treated a
stored command that exactly equalled the bare provider binary ("claude")
as complete and preserved it, skipping BuildProviderLaunchCommand. The
resulting claude --resume <uuid> ran without the unrestricted
permission mode flag and wedged on interactive tool-call prompts.

Rebuild when the stored command is only the bare binary while still
preserving longer stored commands that already carry flags (so existing
paths that persist fully-resolved command strings continue to win). The
API path (internal/api/session_runtime.go) and the CLI worker-boundary
path (cmd/gc/worker_handle.go) carry parallel copies of the helper —
update both and add regression tests covering each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pool-managed and ephemeral cursor-agent sessions were timing out during
create because the cursor profile had no ReadyPromptPrefix or
ReadyDelayMs, so the reconciler could not detect when the TUI was ready
to accept piped-in prompts. Named sessions worked because their prompt
is delivered as an exec arg, bypassing readiness detection.

cursor-agent 2026.04+ renders its composer input with a U+2192 (right
arrow) prefix after roughly 5-10s of startup (workspace-trust gate +
statsig + model load). Match the pattern used by the claude/codex/
gemini/copilot profiles:

  ReadyPromptPrefix: "\u2192 "
  ReadyDelayMs:      10000

Verified on macOS arm64 with cursor-agent 2026.04.17: ephemeral cursor
workers now claim beads, execute mol-do-work, write the requested files,
and run gc runtime drain-ack cleanly instead of cycling
stopped -> creating -> stopped against a deadline_exceeded outcome.

Made-with: Cursor
…nhall#1196)

## Summary

- Change `dolt-gc-nudge` interval from **6h → 1h**
- Change default `GC_DOLT_GC_THRESHOLD_BYTES` from **2 GiB → 0** (run
unconditionally; knob preserved as optional escape hatch)
- Update runbook + script comments with empirical evidence

## Why

A beads perf harness run (2026-04) shows Dolt's auto-GC **does not fire
under bd's fork-per-op CLI workload** even with
`auto_gc_behavior.enable: true` and the managed `config.yaml` we ship.
Unbounded commit-graph growth causes two prod symptoms:

- Disk bloat — ~120 GB after a few days of agent work
- Tail latency — at 143 MB of accumulated history: **p99 = 16.9s, max =
18.6s**; cost scales roughly linearly with history size, extrapolating
to multi-minute ops at GB scale

Manual `CALL DOLT_GC()` on a 148 MB store reclaims 43% in 4.6s, so a
periodic nudge is both necessary and cheap.

The pre-tune defaults (2 GiB threshold + 6h cooldown) allow bloat to
climb well into the tail-latency danger zone between nudges. At 1h
unconditional GC, the commit graph stays bounded and p99 stays low.

## Test plan

- [x] `go test ./cmd/gc/ -run 'DoltGCNudge|Order'` passes locally
(existing tests already set `GC_DOLT_GC_THRESHOLD_BYTES=0` explicitly,
so no test changes needed)
- [ ] CI green
- [ ] Prod observation post-merge: `gc doctor` reports clean
`dolt-noms-size` on cities that previously bloated

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atomic-materialize

fix: atomically materialize system pack files
…nv-vars

fix: persist supervisor provider env vars
The zombie scan flagged every dolt sql-server PID except the main city
server as a zombie. Rig-local Dolt servers configured via dolt.port in
config.yaml are legitimate — the scan now reads rig configs from the
metadata cache and excludes PIDs listening on known rig ports.

Closes gastownhall#1217

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When dolt-provider-state.json has a stale PID (dolt was restarted but
state not yet refreshed) or is absent entirely (e.g. after a crash),
publishManagedDoltRuntimeState now falls back to inspecting the actual
running process via repairedManagedDoltRuntimeState, which probes the
port holder and verifies process ownership. On success it repairs the
provider state file atomically before writing dolt-state.json, so all
subsequent readers see a consistent view.

Previously the function returned an immediate error in both cases,
causing gc doctor and gc beads health to fail permanently until
gc start was run — even though dolt was running and healthy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r-disable

managed dolt: disable upstream load-avg auto-GC scheduler (workaround dolt#10944)
…me-under-2m-complete

Reduce cmd/gc test runtime under two minutes
Add a regression test for the builtin cursor provider so future cleanup cannot silently drop its readiness prefix or readiness delay and recreate the startup deadline_exceeded loop.
…astownhall#1222)

## Summary

- Fix `gc-nudge` DOLT_GC() failure: replace `--database` (invalid in
dolt 1.86.3) with `--use-db` flag
- Fix darwin build: cast `stat.Dev` from `int32` to `uint64` in
`fsys.readRegularFileSnapshot`

## Changes

- `examples/dolt/commands/gc-nudge/run.sh` — use `--use-db` instead of
`--database` when calling `dolt sql`
- `internal/fsys/read_regular_unix.go` — cast `stat.Dev` to `uint64` for
darwin compatibility
- `cmd/gc/dolt_gc_nudge_script_test.go` — update test assertions to
match new flag

## Test plan

- [x] All gc-nudge script tests updated and pass
- [x] `stat.Dev` cast verified on darwin

Closes ga-66a

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…atcher-pool-resume-flags

fix: rebuild bare resume command to include schema defaults (gastownhall#799)
…readiness-detection

fix(profiles): add readiness tuning to cursor provider
…de-rig-locals

fix(dolt-health): exclude rig-local Dolt servers from zombie scan
…all#1208)

## Summary

Fixes a build break on darwin introduced in c3e6f17. `unix.Stat_t.Dev`
is `int32` on darwin / BSDs and `uint64` on Linux; the new
`readRegularFileSnapshot` (`internal/fsys/read_regular_unix.go`)
assigned it directly to `fileIdentity.dev` (`uint64`), so the build
fails on darwin with:

```
cannot use stat.Dev (variable of type int32) as uint64 value in struct literal
```

This keeps the direct cast in the unix snapshot path, aligns the
reflective identity path with the same signed-device bit preservation,
adds in-package coverage for both synthetic signed-device handling and
the real snapshot-to-Lstat identity flow, and wires a Darwin compile
guard into `make test` / `make test-cover`.

Closes gastownhall#1207

## Testing

- [x] `go test ./internal/fsys`
- [x] `make test-fsys-darwin-compile`
- [x] `make check` (maintainer follow-up; local pre-commit hook)
- [ ] `make check-docs` — N/A, no docs touched
- [ ] `make test-integration` — N/A, no runtime/controller/workflow
change

## Checklist

- [x] Linked an issue (gastownhall#1207)
- [x] Added targeted `internal/fsys` regression coverage
- [x] Added a default-path Darwin compile guard
- [x] No user-facing surface — internal helper
- [x] No breaking change

Co-authored-by: Julian Knutsen <julianknutsen@users.noreply.github.com>
Providers like OpenCode need different commands for ACP vs tmux
transport (e.g. 'opencode acp' for ACP, bare 'opencode' for tmux).
The existing Args field applies to both transports with no override.

Add ACPCommand and ACPArgs fields to ProviderSpec and ResolvedProvider.
When transport is 'acp', BuildProviderLaunchCommand uses
ACPCommandString() which falls back field-by-field to Command/Args,
allowing partial overrides.
julianknutsen and others added 28 commits April 28, 2026 08:18
…udge-complete

fix(mail): nudge reply recipients from human
## Summary
- run the triage-label workflow for reopened issues/PRs and PRs marked
ready for review
- skip draft PRs so they do not enter triage until ready
- fail loudly if the workflow cannot resolve an issue/PR number

## Tests
- git diff --check
…session-id

fix(mail): use session bead id for senders
Three interacting bugs orphaned in_progress beads when pool sessions
churned, leaving them invisible to every work_query tier (no assignee
match, status not "ready"):

1. reapStaleSessionBeads closed any session whose tmux probe failed,
   including ones past startup that held active claims. Restrict to
   sessions stuck in the "creating" state (or with pending_create_claim
   set) — by design those cannot have claimed work yet, since claim is
   the worker's first post-startup action. Sessions past creating with a
   dead tmux are left for the lifecycle reconciler to restart so the
   original assignee resumes the work.

2. unclaimWorkAssignedToRetiredSessionBead and the default
   EffectiveOnDeath/EffectiveOnBoot shell hooks all cleared the assignee
   on in_progress beads but never reset status. Reset to "open" so a
   fresh worker can re-claim via Tier 3 of the work_query
   (gc.routed_to + --unassigned).

3. Belt-and-suspenders against any future close path that bypasses (1):
   closeBead now refuses to close a session bead while non-session work
   is still assigned to it. The reconciler relies on the assignee link
   to wake the session and resume claims; closing under live claims
   would strand the work. Callers that legitimately need to retire an
   active session must drain or unclaim first.

Evidence on a live city: 17 codex-max session beads cycled in ~1 hour
(9 "stale-session", plus drained/orphaned/duplicate). Four PR-review
finalize beads ended up status=in_progress + assignee="" + routed_to=
gascity/workflows.codex-max, invisible to all four work_query tiers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up for closed PR gastownhall#1340.\n\nPreserves Julian Knutsen's original order-check cache work and includes the reviewed cold-cache fallback fix plus generated API/dashboard artifacts.
…er-orphan-claims-complete-20260428

fix(session): preserve in_progress claims across worker churn
…ync-session-starts-complete

perf(sessions): enqueue reconciler starts asynchronously
…tions-permissions

chore: reduce workflow token permissions
…lease-security

chore: harden release security docs and publishing
…hell-trust-boundaries

Harden controller shell trust boundaries
…-codeql

Add fork-safe advanced CodeQL workflow
Polecats were creating duplicate PRs against the same bead because
re-spawn didn't check whether another polecat alias already had an
open branch/PR for that bead. Triage of pa-9zyx4 found four such
double-PR cases (pa-v4oi6.7, .19, .23, .24) where polecat/capable/*
and polecat/rictus/* simultaneously targeted the same issue ID.

Add a check in mol-polecat-work.workspace-setup that runs before fresh
branch creation. When metadata.branch is empty:

- Scan origin for any polecat/* branch carrying this bead ID, matched
  precisely (anchored to '/' before and end/'@'/'-' after, so pa-wgb8g
  does not match pa-wgb8g2 or pa-wgb8g.2).
- Same-alias match → adopt as ours by setting metadata.branch; the
  existing rejection-aware-resume path checks it out below.
- Different-alias match → don't open a duplicate. Mail the witness with
  both branches, drop the claim back to the pool, drain-ack and exit.
- Both → adopt our own; nudge witness about the other-alias branches.

Bumps formula version 8 → 9.
vbtcl pushed a commit that referenced this pull request May 6, 2026
…gastownhall#1203)

## Summary

The session-first migration (`dd90ac0a`, Mar 8 2026) deleted the Agent
Protocol primitive (`internal/agent/agent.go` and the `Agent`/`Handle`
interfaces) and moved its responsibilities into `internal/session/`
(lifecycle) and `internal/runtime/` (providers). Commit `be8debd8` then
renamed `agent.*` events to `session.*`. The primitive list in AGENTS.md
and the architecture docs in `engdocs/architecture/` continued to
describe the deleted abstraction — for ~50 days.

This PR aligns the docs with HEAD. **No code changes.**

## Changes

- **Renamed** `engdocs/architecture/agent-protocol.md` → `session.md`
(88% similarity preserved by `git mv`). Title and summary rewritten to
frame as primitive #1 (Session) with a History note pointing at
`dd90ac0a`.
- **AGENTS.md**:
- Renamed primitive #1 from "Agent Protocol" to "Session"; reference
`agent.SessionNameFor` (in the `agent` helper package, not `session`)
- Fixed Messaging derivation: `Session.Nudge()` (delegating to
`runtime.Provider.Nudge()`) — `SendPrompt()` doesn't exist
  - Fixed Health Patrol derivation language
  - Updated progressive capability table (Level 0-1: Agent → Session)
- Replaced `agent` with `worker` in the canonical-domain list; footnoted
`internal/agent/` as a small helper package
  - Added new **"Active migrations"** subsection documenting:
- Worker boundary migration (started `12a0a848`, in progress) — names
the CI test `TestGCNonTestFilesStayOnWorkerBoundary` that enforces
non-test `cmd/gc/` files route through `worker.Handle`
    - Session-first migration (completed `dd90ac0a`)
- **`engdocs/architecture/`**: updated `nine-concepts.md`, `index.md`,
`glossary.md`, `messaging.md`, `dispatch.md`, `controller.md`,
`prompt-templates.md` — Agent Protocol references replaced with Session,
broken links fixed.
- **`event-bus.md`**: rewrote the event-type constants table from
`Agent*`/`agent.*` to `Session*`/`session.*` (matching
`internal/events/events.go:20-38` exactly, including the new
`SessionUpdated` constant the old table was missing). Renamed
`AutomationFired`/`Completed`/`Failed` to
`OrderFired`/`Completed`/`Failed`. Updated the storage-format example.
- **`health-patrol.md`**: fixed six event names renamed by `be8debd8`
(`agent.*` → `session.*`); replaced obsolete description of the removed
`internal/agent.Agent` interface with current helper-package
responsibilities; disambiguated the Erlang/OTP table's "Worker" row from
gascity's `internal/worker/` package.
- **`test/integration/E2E-PROVIDER-GAPS.md`**: three references to
`agent.started` → `session.woke`.
-
**`examples/gastown/packs/gastown/formulas/mol-digest-generate.toml`**:
runnable formula's `gc events --type=...` filter updated from broken
`agent.*` event names to current `session.*` names.
- Refreshed "Last verified against code" stamps on all touched docs to
2026-04-25.

## Why

The spec maintenance rule (originally in `specs/architecture.md`, now
distributed across the per-doc "Last verified" stamps) requires updating
docs in the same commit as referenced symbol renames. That discipline
wasn't honored on `dd90ac0a` or `be8debd8`, leaving the doc set
describing a deleted primitive. A new contributor reading AGENTS.md was
being directed to look for an abstraction that doesn't exist.

## Test plan

- [x] `make build` clean
- [x] `make check` clean (full `./...` suite green)
- [x] `make check-docs` clean
- [x] Repo-wide grep for `agent-protocol.md` returns 0 hits (no broken
cross-doc links)
- [x] All `Agent Protocol` references that remain are in intentional
History/migration footnotes (4 files)
- [x] All `session.*` event constants in docs match
`internal/events/events.go:20-38` exactly
- [x] `agent.*` event references remain only in `engdocs/archive/` (3
files, deliberately preserved historical analysis)

## Pre-flight review

Multi-agent review pipeline (3 Anthropic reviewers + Gas City
contributor checker) iterated 2 rounds before convergence. Iteration 1
caught a blocker (the worker-boundary claim was factually wrong) plus 5
majors; iteration 2 verified all fixes and approved with zero remaining
findings. Codex and Copilot CLI were unresponsive in this session and
skipped per the review skill's documented fallback policy.

---------

Co-authored-by: sjarmak <sjarmak@users.noreply.github.com>
Co-authored-by: Julian Knutsen <julianknutsen@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.