fix: comprehensive state-backend fixes (bundles #1192 + Bug C/F/WI-1/upgrade flow)#1200
Conversation
Adds effectiveSquadDir() and resolveStateDir() helpers that check .squad/config.json for stateLocation: 'external' and redirect state file reads to the external directory when applicable. Updates loop, watch, plugin, doctor commands and shell (lifecycle, coordinator, index) to use the new resolver for reading team.md, routing.md, agents/, plugins/ and other externalized state. Closes bradygaster#949 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…bradygaster#1185, bradygaster#1190, bradygaster#1191, bradygaster#1194) Bug A (P0): permission contract mismatch - return { kind: 'approve-once' } in approveAllPermissions handler; update types.ts union and client.ts hint. Bug B (P1): remove hard-throw in resolveStateBackend() when explicit backend fails - always warn + fall back to local so Squad can still start. Bug C (P1): silent git-notes→two-layer migration now emits console.warn() directing operators to update config.json. Bug F (P3): toRelative() on Windows uses path.resolve() + case-insensitive drive-letter comparison to prevent corrupt git-notes keys. Also fix post-cherry-pick merge artifact in plugin.ts (squadDirInfo was referenced after effectiveSquadDir() destructuring only bound stateDir). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oks, ESM roots, coordinator template - Fix 1 (test/state-backend.test.ts): rename 'fails closed' test to 'soft-falls-back to local'; replace toThrow() assertion with not.toThrow() + backend.name === 'local' to match the soft-fallback semantics introduced in the prior commit (bug B fix) - Fix 2 (doctor.ts + test/cli/doctor.test.ts): add checkGitSyncHooks() — detects missing pre-push/post-merge/post-rewrite/post-checkout hooks with squad-sync-hook marker when stateBackend is 'two-layer' or 'orphan'; wire into runDoctor; add 6 new tests covering absent/local/two-layer/orphan cases - Fix 3 (patch-esm-imports.mjs): add process.cwd()/node_modules to SEARCH_ROOTS with deduplication filter so the patcher resolves imports from consumer project roots, not just squad-cli-relative paths - Fix 4 (.github/agents/squad.agent.md): update STATE_BACKEND valid values from stale 'worktree'/'git-notes' aliases to canonical 'local'/'orphan'/'two-layer'; remove stale 'git-notes' reference in runtime-state fallback paragraph Resolves worf-state-backend-upgrade-reject.md blockers B2, B3, B4, B5. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…emplate source
HEAD had stale backend values ('local', 'orphan', 'two-layer') and sentinel
version 0.0.0-source. The canonical source in .squad-templates/squad.agent.md
was updated in 2d9f0b4 with new backend wording ('worktree', 'git-notes',
'orphan', 'two-layer') but the committed live file was never synced.
template-sync.test.ts runs scripts/sync-templates.mjs in beforeAll(), which
copies the canonical source directly into .github/agents/squad.agent.md. Once
HEAD matches the template, the sync produces no diff and git status stays clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document canonical stateBackend values: - "local" is the default (was incorrectly "worktree") - "orphan" and "two-layer" are valid values - "worktree" is a legacy alias that maps to "local" - "git-notes" is deprecated and maps to "two-layer" with a warning All template sync targets updated via sync-templates.mjs. All 194 template-sync tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens Squad’s state-backend upgrade path by relaxing backend initialization failures, aligning permission approval with the newer Copilot CLI contract, improving externalized-state path resolution, and updating related diagnostics/templates.
Changes:
- Updates state backend normalization/fallback behavior and Windows-safe relative path handling.
- Adds externalized state directory resolution across selected CLI runtime paths and doctor checks.
- Updates permission result typing/hints, template backend wording, version metadata, changesets, and tests.
Reviewed changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
packages/squad-sdk/src/state-backend.ts |
Adjusts backend fallback, git-notes migration warning, and path normalization. |
packages/squad-sdk/src/adapter/types.ts |
Adds approve-once to permission result kinds. |
packages/squad-sdk/src/adapter/client.ts |
Updates permission handler error guidance. |
packages/squad-cli/src/cli/core/effective-squad-dir.ts |
Adds helper for local vs external state resolution. |
packages/squad-cli/src/cli/commands/loop.ts |
Reads team.md from effective state dir. |
packages/squad-cli/src/cli/commands/watch/index.ts |
Reads team/routing files from effective state dir. |
packages/squad-cli/src/cli/commands/plugin.ts |
Resolves plugin storage through effective state dir. |
packages/squad-cli/src/cli/commands/doctor.ts |
Uses effective state dir for file checks and adds git sync hook diagnostics. |
packages/squad-cli/src/cli/shell/lifecycle.ts |
Resolves shell startup/welcome state files externally when configured. |
packages/squad-cli/src/cli/shell/index.ts |
Updates permission approval and partial externalized shell state lookup. |
packages/squad-cli/src/cli/shell/coordinator.ts |
Loads team/routing prompt inputs from external state when configured. |
packages/squad-cli/scripts/patch-esm-imports.mjs |
Adds repo-local node_modules to patch search roots. |
test/state-backend.test.ts |
Updates backend failure expectation to soft fallback. |
test/effective-squad-dir.test.ts |
Adds coverage for effective state dir resolution. |
test/cli/doctor.test.ts |
Adds doctor coverage for git sync hook checks. |
.squad-templates/squad.agent.md |
Updates canonical backend wording. |
.github/agents/squad.agent.md |
Syncs active agent backend wording. |
templates/squad.agent.md.template |
Syncs template backend wording. |
packages/squad-cli/templates/squad.agent.md.template |
Syncs CLI template backend wording. |
packages/squad-sdk/templates/squad.agent.md.template |
Syncs SDK template backend wording. |
package.json |
Bumps root package version. |
packages/squad-cli/package.json |
Bumps CLI package version. |
packages/squad-sdk/package.json |
Bumps SDK package version. |
package-lock.json |
Updates lockfile package metadata. |
.changeset/fix-state-backend-upgrade-regressions.md |
Documents CLI/SDK patch fixes. |
.changeset/fix-externalized-state-paths.md |
Documents CLI externalized-state path fix. |
| /** Outcome of the permission request */ | ||
| kind: | ||
| | "approve-once" | ||
| | "approved" |
There was a problem hiding this comment.
Fixed in 55e843c — 'approved'\ is now marked @deprecated\ in \ ypes.ts\ and \client.ts\ normalizes { kind: 'approved' }\ → { kind: 'approve-once' }\ at the adapter boundary. \samples/knock-knock\ updated to use 'approve-once'\ directly.
| const hasTeam = storage.existsSync(join(stateDir, 'team.md')); | ||
| const isFirstRun = storage.existsSync(join(stateDir, '.first-run')); | ||
| let persistedSession: SessionData = createSession(); | ||
| const recentSession = (hasTeam && !isFirstRun) ? loadLatestSession(teamRoot) : null; |
There was a problem hiding this comment.
Fixed in 8f3208a — added optional \stateDir?\ param to all 5 session-store functions and threaded it through load/save call-sites in \shell/index.ts. Regression tests in \ est/session-store.test.ts.
| const result = resolveStateDir(squadDir); | ||
| const globalDir = resolveGlobalSquadPath(); | ||
| const expected = join(globalDir, 'projects', projectKey); |
There was a problem hiding this comment.
Fixed in 3a02478 — added top-level \�eforeEach/\�fterEach\ in \�ffective-squad-dir.test.ts\ that stubs \APPDATA/\XDG_CONFIG_HOME\ to a unique temp dir, so
esolveGlobalSquadPath()\ never touches the real user config directory.
| const { local, stateDir } = effectiveSquadDir(TMP); | ||
| expect(local.path).toBe(squadDir); | ||
|
|
||
| const globalDir = resolveGlobalSquadPath(); | ||
| expect(stateDir).toBe(join(globalDir, 'projects', projectKey)); |
There was a problem hiding this comment.
Fixed in the same commit 3a02478 — the env-var stubbing covers both \APPDATA\ (Windows) and \XDG_CONFIG_HOME\ (Linux/macOS) so the test is isolated on all platforms.
| // Resolve the git hooks directory (respects core.hooksPath when configured) | ||
| let hooksDir: string; | ||
| try { | ||
| const customPath = execFileSync('git', ['config', '--get', 'core.hooksPath'], { | ||
| cwd, | ||
| encoding: 'utf-8', | ||
| stdio: ['pipe', 'pipe', 'pipe'], | ||
| }).trim(); | ||
| hooksDir = customPath | ||
| ? (path.isAbsolute(customPath) ? customPath : path.resolve(cwd, customPath)) | ||
| : path.join(cwd, '.git', 'hooks'); | ||
| } catch { | ||
| hooksDir = path.join(cwd, '.git', 'hooks'); |
There was a problem hiding this comment.
Fixed in dab1d9e — \checkGitSyncHooks()\ now uses the same three-step resolution as \install-hooks: \git config --get core.hooksPath\ → \git rev-parse --git-dir\ → fallback to .git/hooks. Regression tests (including worktree isolation) in \ est/cli/doctor.test.ts.
…to prevent stale registry shadow - Bump all package versions from 0.9.6-build.2 to 0.9.6-preview to satisfy the CI version gate (regex: /^\d+\.\d+\.\d+-preview$/) - Change packages/squad-cli package.json SDK dep from '>=0.9.0-0' to 'file:../squad-sdk' so npm resolves the local workspace package instead of installing the latest clean registry release (v0.9.4) nested under packages/squad-cli/node_modules, which shadowed the workspace version and caused ~30 TypeScript build errors and the approve-once type mismatch - Regenerate package-lock.json: lockfileVersion 3, no nested packages/squad-cli/node_modules/@bradygaster/squad-sdk@0.9.4 entry Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace file:../squad-sdk workaround with the repo's canonical semver-range convention. npm workspaces resolve this to the local packages/squad-sdk (0.9.6-preview) during development; published packages resolve via the registry without a broken file: path. Fixes: CLI packaging smoke test 'squad-cli has no file: dependencies' Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…space package The CLI's dependency was '>=0.9.0-0' but the workspace SDK is '0.9.6-preview'. Node-semver's pre-release exception requires matching [major,minor,patch] tuples: '0.9.6-preview' has tuple [0,9,6] but '>=0.9.0-0' has comparator tuple [0,9,0], so satisfies() returned false. npm fell through to the registry and installed the stale 0.9.4 release as a nested package under packages/squad-cli/node_modules/, causing the Workspace Integrity policy gate to fail. Fix: bump range to '>=0.9.6-preview' so the workspace SDK satisfies it. Also deleted the stale nested node_modules directory that npm left behind, which was shadowing the workspace SDK's updated types and causing TS build errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add one-shot _warnedGitNotesMigration module-level flag to normalizeBackendType() so the deprecation warning fires exactly once per process even when resolveStateBackend() is called repeatedly. Exports _resetGitNotesMigrationWarnForTesting() for test isolation. Adds test asserting warn fires exactly once across 3 calls. Fixes Bug C gap in PR bradygaster#1200. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Windows Change toRelative() fallback from silently returning absolute paths (which would corrupt git-notes key namespace) to throw a clear error for absolute paths outside squadDir. Adds two tests: backslash normalisation (cross-platform) and outside-squadDir throw (platform-branching). Fixes Bug F gap in PR bradygaster#1200. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Copilot CLI post-v1.0.54 changed the permission handler contract to expect 'approve-once' instead of 'approved'. Update the handler, type definition, and error hint to match the new contract. Closes bradygaster#1191 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…exit 0 Fixes UPGRADE-EPERM-FALSE-SUCCESS — baseline observed both '⚠️ Upgrade failed' and '✅ Upgraded' printed back-to-back with exit 0 when the npm install -g hit EPERM. selfUpgradeCli now throws on package-manager failure (and detects EPERM/EACCES/EBUSY with tailored hints); cli-entry wraps it in try/catch and exits 1, skipping the success log. Evidence: .squad/files/validation/TWOLAYER-BASELINE-INSIDER3-CONSOLIDATED.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…an backends (WI-1) Fresh init on two-layer was installing the four sync hooks (pre-push, post-merge, post-rewrite, post-checkout) but NOT pre-commit / post-commit. Without the commit-side hooks, working-tree commits never trigger orphan-branch sync and there is no guard against accidentally committing mutable .squad/ state into the working tree. Changes: - HOOK_TEMPLATES gains pre-commit (guards against committing decisions.md / agents/*/history.md / casting/ / routing/ into the working tree) and post-commit (best-effort 'squad sync' after each commit). - ensureHooksForBackend now checks every required hook (not just pre-push) so insider.3-era repos get the missing commit hooks installed on next upgrade. Evidence: .squad/files/validation/TWOLAYER-BASELINE-INSIDER3-CONSOLIDATED.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…o orphan branch Fixes UPGRADE-FLAG-IGNORED and UPGRADE-NO-MIGRATION. Previously, migrateStateBackend rejected any current backend other than 'local' so the flag was silently a no-op for the more common worktree default; and even when it ran, it never carried existing .squad/decisions.md or agents/*/history.md onto the squad-state orphan branch, leaving post-upgrade agents blind to pre-upgrade team memory. Changes: - Allow worktree/local → orphan/two-layer (and inter-orphan no-op safety). - Collect decisions.md + agents/<name>/history.md from the working tree and write them onto squad-state via git plumbing (read-tree → hash-object → update-index → write-tree → commit-tree → update-ref) using a temporary GIT_INDEX_FILE so the user's normal index is never touched. - JSON-aware config rewrite (eliminates the duplicate-key risk that Bug E would otherwise re-introduce here). - Re-install hooks with force=true so the new pre-commit/post-commit land. - Idempotent: re-running with the same target ensures hooks/config without duplicating state. Evidence: .squad/files/validation/TWOLAYER-BASELINE-INSIDER3-CONSOLIDATED.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oesn't resolve to stale 'latest' dist-tag When the npm 'latest' dist-tag points at an older release that predates the state-mcp command (currently 0.9.4 vs insider 0.9.6-insider.3+), 'npx -y @bradygaster/squad-cli state-mcp' silently launches the wrong CLI and registers zero squad_state_* tools at runtime. Copilot agents then have no way to read/write durable state via MCP. Pins the package spec to the running CLI version at both init time (squad-sdk/src/config/init.ts -> buildMcpServerSpecs) and upgrade time (squad-cli/src/cli/core/upgrade.ts -> mirror function + new ensureSquadStateMcpPinned helper invoked from runEnsureChecks). Fixes MCP-BRIDGE-BROKEN. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/two-layer init Fresh 'squad init --state-backend orphan|two-layer' previously left decisions.md and each agent's history.md in the working tree because the SDK init step had no knowledge of the backend choice. Those leaked files then shadowed the orphan branch and bypassed the runtime state bridge. Exports a new helper liftInitMutableStateOntoOrphan from migrate-backend.ts that reuses the existing collectWorktreeState + writeFilesToOrphanBranch plumbing, then unlinks the working-tree copies. Static files (team.md, charters, ceremonies.md, casting/*, templates/*) are preserved on disk per the source-of-truth hierarchy. CLI init.ts invokes the helper immediately after installGitHooks; failures are warn-only and never abort init. Fixes INSIDER3-INIT-LEAK. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n absent Iteration 3 fixes for two-layer smoke gaps (see TARBALL-SMOKE-* reports): Gap 1: post-commit / pre-push hooks invoked 'squad sync --quiet' but the subcommand was never registered in cli-entry.ts. The hook's '|| true' swallowed the failure, so state never propagated to the squad-state orphan branch even though hooks were installed correctly. Wire the existing runSync implementation into cli-entry with --push / --pull / --remote / --quiet flags, and document it in 'squad --help'. Gap 2: ensureSquadStateMcpPinned only PINNED an existing squad_state entry; if the entry was absent (common on repos that already had a .copilot/mcp-config.json from non-squad Copilot use), the retrofit was a no-op and the MCP bridge stayed broken. Now ALWAYS insert the expected pinned spec when the entry is missing, wrong-pinned, or unpinned, while preserving any other configured MCP servers and any user-edited fields. Two new regression tests cover the insert path. Test results: 19/19 targeted tests pass (mcp-bridge 8, sync 3, init-leak 3, install-hooks 5). Lint + build clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Discovered during iteration-3 re-smoke: my GAP-2 fix to ensureSquadStateMcpPinned made the helper insert/update correctly, but on 'squad init' the helper was never called when .copilot/mcp-config.json already existed. The SDK's buildMcpConfigJson path only writes the file if it's absent (writeIfNotExists semantics), so partially-squadified repos with a pre-existing mcp-config.json remained without a squad_state entry → bridge unwired → Scribe refuses to persist (same end-user symptom as iteration-2 smoke). Fix: call ensureSquadStateMcpPinned from init.ts immediately after liftInitMutableStateOntoOrphan in the orphan/two-layer branch, mirroring the existing upgrade-time call site. The helper is idempotent and preserves other configured MCP servers, so this is safe on greenfield repos too. Validated via iter-3 re-smoke: seeded both travel-assistant + multiplayer-sudoku duplicates with a stale .copilot/mcp-config.json lacking squad_state, ran squad init --state-backend two-layer, and confirmed the squad_state entry was inserted with @bradygaster/squad-cli@<version> pinning while EXAMPLE-github was preserved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rade
This iteration takes the combined fix branch from "build-time correct" to
"end-to-end demonstrably working" against Copilot CLI 1.0.58.
What changed (6 fixes):
1. **Inject `--additional-mcp-config @<path>` on every `copilot` spawn**
Copilot CLI 1.0.58 silently ignores project-level
`.copilot/mcp-config.json` — only `~/.copilot/mcp-config.json` is
auto-loaded. Every site where squad spawns `copilot` now passes
`--additional-mcp-config @<teamRoot>/.copilot/mcp-config.json` so the
squad_state MCP server actually loads. New central helper
`packages/squad-cli/src/cli/core/copilot-invocation.ts` exposes
`buildAdditionalMcpConfigArgs` and `withAdditionalMcpConfig`; all 10
spawn sites (watch/* capabilities, watch/index, loop, copilot-bridge
start, start.ts PTY) call through it. Injection is gated on
`cmd === 'copilot'` AND the config file existing — user-supplied
`--agent-cmd` overrides are not wrapped.
2. **npm-registry HEAD-check fallback for state-mcp pinning**
`ensureSquadStateMcpPinned` previously pinned to a CLI version that
may not be on the public registry yet (preview / unpublished builds),
causing `npx` to fail. New `npm-registry.ts` does a 2-second HEAD
check (with per-process cache) against the public registry; when the
version isn't published, `resolveSquadStateMcpSpec` falls back to
`@bradygaster/squad-cli@insider` so the launch spec is always
resolvable. `runEnsureChecks` is now async and both call sites await.
3. **EPERM during `--self` no longer aborts `--state-backend` migration**
In `cli-entry.ts`, when both `--self` and `--state-backend` are
passed and self-upgrade hits EPERM (very common on Windows), we now
log the failure, set `selfUpgradeFailed`, and continue with the
state-backend migration. Final exit code is non-zero if any step
failed.
4. **Template filename `{timestamp}` colon-sanitization instructions**
Scribe and after-agent guidance now explicitly tells the agent to
replace `:` with `-` in `{timestamp}` filename portions
(e.g. `2026-06-02T21-15-30Z`), so emitted filenames are valid on
Windows / NTFS. Patched in `.squad-templates/` (source of truth);
build sync mirrors to `packages/squad-{sdk,cli}/templates/` and
top-level `templates/`.
5. **CI test repair (3 tests)**
- `test/cli-command-wiring.test.ts`: removed `'sync'` from
`KNOWN_UNWIRED` — it has been wired since iter-3.
- `test/speed-gates.test.ts`: bumped the CLI help line-count cap from
130 → 150 to accommodate new flags / subcommand entries.
- `test/cli/init.test.ts:113`: relaxed assertion to accept either
pinned or unpinned `args:` shape, since iter-2 made pinning
dynamic.
6. **New unit tests** for the helpers added in this iteration:
- `copilot-invocation-mcp-wrap.test.ts` — 6 cases covering the
injection helper's gating logic.
- `npm-registry-fallback.test.ts` — covers `_resetNpmRegistryCache`,
timeout behavior, and `resolveSquadStateMcpSpec` fallback.
- `upgrade-eperm-state-backend-continues.test.ts` — static + smoke
check that the EPERM control-flow refactor is in place.
Ships as `0.9.6-preview.8` (auto-bumped by build from preview.6).
Twin tarballs at:
- `C:\Users\tamirdresher\squad-validation\bradygaster-squad-sdk-combined-fixes.tgz`
- `C:\Users\tamirdresher\squad-validation\bradygaster-squad-cli-combined-fixes.tgz`
All targeted vitest suites green: 83 + 15 tests pass locally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…onal-mcp-config @.mcp.json in all copilot spawns - copilot-invocation.ts: fix path from .copilot/mcp-config.json → .mcp.json (regression introduced in iter-7; iter-8 pivoted to repo-root .mcp.json but this file was not updated) - copilot-invocation.ts: prepend --yolo to suppress per-tool consent prompts in non-interactive mode (without it, copilot -p hangs) - copilot-invocation.ts: add fallback warning when .mcp.json is absent - copilot-invocation.ts: add --yolo deduplication guard - init.ts: add squad:copilot script tip to post-init output - docs: new copilot-mcp-trust.md explaining trust gate, test matrix, workaround - squad.agent.md + templates: document auto-injection and trust gate - ralph-reference.md: note MCP trust gate in watch mode section - bump cli to 0.9.6-preview.15 - add changeset: iter9-non-interactive-mcp-load.md Empirical test matrix (Copilot CLI 1.0.59): copilot -p '...' -> workspace MCP NOT loaded copilot --yolo -p '...' -> workspace MCP NOT loaded copilot --additional-mcp-config @.mcp.json --yolo -> workspace MCP LOADED ✓ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eference - ralph.md: note that --execute spawns auto-inject --yolo --additional-mcp-config - loop.md: note in Prerequisites that squad loop auto-injects MCP flags - cli.md: add MCP auto-injection note in squad loop section Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n guard Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n guard Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…on case
All four tests in upgrade-state-backend.test.ts were timing out at the
default 5 s Vitest limit because git plumbing ops (orphan-branch creation,
hook installation) legitimately take longer on some machines.
Changes:
- Add { timeout: 30_000 } to all five tests so git-heavy tests don't
flake under load.
- Add 'UPGRADE-FLAG-IGNORED (clean target)' test: verifies that when
config.json has NO stateBackend field at all (original bug condition),
migrateStateBackend still writes the field without corrupting other
fields like teamRoot.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… (NEW-4) When the MCP payload omits 'content', args.content is undefined at runtime despite TypeScript typing. parseObject() in state-mcp.ts returns Record<string,unknown> with no validation, so content can be undefined. Passing undefined to execFileSync's input option causes git-hash-object to hash empty stdin, producing the empty blob SHA e69de29 instead of the intended content. Add runtime guards in stateWrite and stateAppend handlers that return a structured failure result when content is missing or not a string. Three regression tests cover: undefined content returns failure (no empty-blob write), valid content round-trips correctly, and append with undefined content does not corrupt existing state. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ack expectations to match iter-9 .mcp.json path + object return shape
- copilot-invocation-mcp-wrap: tests were writing to .copilot/mcp-config.json
and expecting ['--additional-mcp-config', '@<path>']. iter-9 pivot moved the
file to repo-root .mcp.json and added --yolo to the returned args. Updated
two tests to write .mcp.json at workdir root and expect
['--yolo', '--additional-mcp-config', '@<path>'].
- npm-registry-fallback: resolveSquadStateMcpSpec now returns a SquadStateMcpSpec
object { command, args, source } (iter-9 mcp-spec.ts refactor). Updated both
@insider fallback assertions from .toBe(string) to .toEqual({ command: 'npx',
args: ['-y', '@bradygaster/squad-cli@insider', 'state-mcp'], source: 'insider' }).
No production code changed. All 4 previously-failing tests now pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aster#1200 review) When SQUAD_STATE_DIR env var is set, session files should be read from and written to the override directory, not the default .squad/sessions path. Adds optional stateDir parameter to sessionsDir(), saveSession(), listSessions(), loadLatestSession(), loadSessionById() and threads it through shell/index.ts load/save call-sites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…radygaster#1200 review) checkGitSyncHooks() now uses the same hook-path resolution as install-hooks: 1. git config --get core.hooksPath (custom hooks path) 2. git rev-parse --git-dir (worktree-aware .git resolution) 3. Fallback to .git/hooks relative to cwd This fixes false PASS results on git worktrees where .git is a file pointer and the actual hooks live in a separate directory. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er#1200 review) Mark the 'approved' PermissionKind value as @deprecated in types.ts. Add a normalization wrapper in client.ts createSession() that translates { kind: 'approved' } -> { kind: 'approve-once' } for backward compat. Update samples/knock-knock to use 'approve-once' directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…olluting user dir (PR bradygaster#1200 review) Set APPDATA/XDG_CONFIG_HOME to a temp path in top-level beforeEach/afterEach so resolveGlobalSquadPath() never touches the real user config directory. Remove manual rmSync cleanup calls that relied on hard-coded paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-dir fixes session-store.test.ts: 3 new tests covering external stateDir parameter - saveSession/listSessions/loadLatestSession all use the override directory. doctor.test.ts: Refactor 4 hook tests to use git init + checkGitSyncHooks directly (avoids scaffold() timeout issues and outer-repo .git bleed-through). Add 2 new git rev-parse --git-dir regression tests in isolated repo dirs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…erification (cherry-pick from 1f3f7e0) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The backticks in template literal expressions were stripped during the PowerShell heredoc append (heredoc treats backtick as escape). The local fix was in working tree but not staged before the prior commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gitExecWithRetry and gitExecMaybeMissing applied .trim() unconditionally to all git output. This was correct for SHA-returning commands but stripped meaningful trailing newlines from blob content reads (git show). Add trimOutput=true parameter to both helpers; OrphanBranchBackend.read() passes false so blob content is returned verbatim. All SHA/hash call sites keep the default true and are unaffected. Fixes regression introduced in hardening cherry-pick: 8 tests in state-backend.test.ts expected exact content preservation including trailing newlines. Also fixes the append separator bug where read() was returning trimmed content causing entries to run together. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ch stability - loadBlob() and saveBlob() now use the root commit SHA instead of HEAD - Added rootCommit() method calling 'git rev-list --max-parents=0 HEAD' - Added _rootCommit instance cache to avoid repeated git operations that caused lock contention and test timeouts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review: will the two-layer state backend work for Squad?Reviewed the description, changesets, the 🔴 Top concern: no atomicity / lost updates under concurrent writers
gitExecOrThrow(`update-ref refs/heads/${this.branch} ${newCommit}`, this.cwd);
Why this bites Squad specifically: the whole model is parallel agents (Ralph fleet, parallel spawns), and the cleanup ceremony in The retry/circuit-breaker does not help here: those handle transient 🟡 Other concerns
🟢 What's good
Did we forget anything?
Bottom line: two-layer is fine for a single coordinator writing sequentially (the common case). For Squad's parallel/fleet usage it will eventually lose state under contention. Suggest merging the hardening/bug-fixes but opening a tracked issue for ref-level CAS (or a write lock) before making two-layer the default for fleet workloads — and reverting the CI line-ending churn before merge. |
Concern G from PR bradygaster#1200 review (issue comment 4621356216): - Reverted .github/workflows/squad-ci.yml to merge-base c4f9d58 - Re-applied the 6 real semantic lines from commit 5bef8f2 (allow -preview.N and -insider.N version suffixes in policy regex, plus updated comment and console.error guidance) - Diff vs merge-base shrinks from 1240 lines of CRLF/LF churn to 6 real lines - Added .gitattributes pin *.yml/*.yaml text eol=lf so the flip cannot recur - Added .pr-body-new.md and .followup-issue-body.md to .gitignore (local Picard artifacts for coordinator hand-off, not for commit) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r two-layer backend Addresses PR bradygaster#1200 review concerns B, C, D: B. Add TwoLayerBackend.promoteNotes(ref) and readNote(ref, sha) to walk per-commit notes on a ref, copy archive_on_close notes to orphan archive/, move promote_to_permanent notes to orphan promoted/ and delete source. Only commits reachable from HEAD are processed. Adds PromoteNotesResult interface and ref/sha safety helpers. C. Replace 3 silent catch blocks in write/append/delete with console.warn calls including op name, key, and error message so notes-layer failures surface in logs. D. Extend verifyStateBackend() to probe the notes layer independently for TwoLayerBackend, returning 'notes layer unhealthy: <msg>' on failure. Notes layer subfields (notes, orphan) and repoRoot are now public readonly to enable both promote/verify access and test spying via vi.spyOn. Tests: 7 new tests in test/state-backend.test.ts covering promote/archive/skip flows, readNote null+parsed paths, verify failure path, and console.warn observability. All pass on Windows. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/state-backend-upgrade-fixes
Both gitExecWithRetry and gitExecWithInputAndRetry called execFileSync without setting maxBuffer, leaving Node's default 1 MiB cap in place. Large ls-tree listings and notes-show payloads tripped ENOBUFS during state-backend benches (B1, B2). Raise the ceiling to 256 MiB via a shared GIT_MAX_BUFFER constant so every git exec path (current and future) inherits the same headroom. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both backends previously used 'git notes add -f' / unconditional 'git update-ref' to publish state, which silently overwrote concurrent writers. Worf's bench measured 50-86% data loss under parallel writers. Replace with optimistic compare-and-swap: read current ref SHA, mutate, then 'git update-ref <ref> <new> <expected-old>'. On CAS conflict (ref moved between read and write), retry the whole read-mutate-write with jittered backoff (50/100/200/400/800 ms) up to 5 attempts. On exhaustion, throw a new typed StateBackendConcurrencyError so callers can surface, requeue, or retry at a higher level. GitNotesBackend builds the notes commit via plumbing (hash-object + mktree + commit-tree) instead of 'notes add -f' so the entire write is one CAS-eligible ref move. OrphanBranchBackend.ensureBranch also uses CAS for ref creation (expected-old = 40 zeros). Adds: StateBackendConcurrencyError (exported); _tryUpdateRefForTesting + _setCasInjectorForTesting (test hooks). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gitExecMaybeMissing and gitExecOrThrow accepted a space-separated string and called args.split(' '), which silently mangled any argument containing a space. State keys are validated to forbid newline/tab/null but spaces are legal, so paths like 'agents/data picard.md' would split into multiple git args and either fail or operate on the wrong target.
Change both helpers to accept string[] directly and convert all 25 internal call sites. Error messages now reconstruct the command via args.join(' ') for display.
Adds regression tests covering write/read/exists/list/delete on state keys containing spaces for both OrphanBranchBackend and GitNotesBackend.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wires the SDK's TwoLayerBackend.promoteNotes API (commit aaec183) into a real production code path. Round 4 audit found promoteNotes had ZERO callers — the API was exported but never invoked. This is integration point A of three (A: CLI, B: Ralph heartbeat capability, C: workflow). Usage: squad notes promote [--ref <ref>] [--all] [--dry-run] Default behaviour enumerates refs/notes/squad/* and promotes each via TwoLayerBackend.promoteNotes. Notes flagged 'promote_to_permanent' move to orphan under promoted/<ref>/<sha>.json and are removed from source. Notes flagged 'archive_on_close' are copied to archive/<ref>/<sha>.json and the source is preserved. Other notes are skipped. --ref restricts to a single squad/<name> ref. --dry-run reports what would be written without touching state. Output is a human-readable summary table per ref plus a TOTAL row. Exit code is 0 on success, non-zero if any per-ref promotion errored. Command no-ops cleanly when stateBackend is not 'two-layer'. Also re-exports TwoLayerBackend + PromoteNotesResult from the SDK index so the CLI (and other consumers) can instanceof-narrow the resolveStateBackend() return type. Tests: test/cli/notes-promote.test.ts (8 cases — no-op, no-refs, promote, archive, idempotent, --ref, --dry-run, direct SDK smoke). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds NotesPromoteCapability to the watch capability registry so Ralph's persistent / heartbeat loop becomes a continuous production caller of TwoLayerBackend.promoteNotes, not just the one-shot CLI. Design choice (per Round 5 spec, recommendation B): run promoteNotes UNCONDITIONALLY for two-layer backends — no PR-merge detection. The operation is idempotent (promoted notes are removed from source, so subsequent rounds find nothing to do), and it's far simpler than polling gh for newly-merged PRs or chasing reflog state. To keep the heartbeat report tidy, the capability throttles to everyNRounds=5 by default; round 1 always runs for immediate feedback. Phase: housekeeping (same lane as CleanupCapability). Preflight rejects non-two-layer repos cleanly so it self-disables on local / worktree / orphan setups. resolveStateBackend + instanceof narrowing is used to access promoteNotes safely. Tests: test/watch-notes-promote.test.ts (5 cases — metadata, preflight on/off, execute promotes + idempotent, everyNRounds throttle). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Round 4 B'Elanna A1 PASS + F1 side-finding: 'squad upgrade --state-backend two-layer' migrated decisions.md and agents/<n>/history.md ONTO the squad-state orphan branch but LEFT the working-tree copies in place. Post-upgrade those files are now stale (the orphan branch is the source of truth) and they polluted 'git status --porcelain' with untracked or now-unrelated content. Mirrors the cleanup behaviour that liftInitMutableStateOntoOrphan already performs for fresh 'squad init'. Only files that were JUST successfully migrated to orphan are removed — config.json, charter.md, team.md, casting/, templates/ are never touched. Also rmdirs now-empty .squad/agents/<name>/ directories to avoid zero-content folder leaks. Failure to delete (e.g. permission error) is non-fatal: a warning is printed and the file is left for the user to resolve. Orphan write already succeeded, so the file is authoritative on the branch either way. Bundled with the P0.3 wiring commits in this PR because the same migration path (worktree -> two-layer) is what makes promoteNotes useful in the first place — an upgrade that leaves stale state behind would shadow whatever the runtime bridge reads from orphan. Test: test/upgrade-state-backend.test.ts adds 'F1 (Round 5): migrated working-tree state files are removed after upgrade' — asserts decisions.md / agent history.md are gone, empty agent dir is gone, charter.md and config.json remain. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
State-backend upgrade: OrphanBranch / TwoLayer / GitNotes hardening
HEAD:
aaec183f| Base:c4f9d58f(upstream/dev)CI status: all 6 jobs green (changes / docs-quality / lint+build / unit-tests / smoke / integration)
This PR delivers a fully hardened state-backend stack across all three concrete
backends (
WorktreeBackend,GitNotesBackend,OrphanBranchBackend) plus theTwoLayerBackendfacade. It is the result of two review rounds againstPR #1200 (see review comment 4621356216
for the original 9 concerns A–I, and the in-PR fixes below for the resolved
subset).
What changed
Core hardening
gitExecWithRetry,gitExecWithInputAndRetry,CircuitBreaker) — transient lock/pack errorsare retried with exponential back-off; persistent failures trip the breaker.
verifyStateBackendstartup probe — surfaces backend misconfigurationeagerly instead of silently returning
undefinedat runtime.StateBackendStorageAdapterthin adapter bridging the backend interfaceto the MCP tool layer.
squad_state_writeandsquad_state_appendnow rejectundefinedcontent with a typed error beforeit ever reaches a backend.
gitExecWithRetrynow acceptstrimOutput;OrphanBranchBackend.read()passesfalseso blob content isreturned verbatim, fixing 7 regressions in
state-backend.test.ts.GitNotesBackendnow anchors notes on therepository root commit rather than
HEAD, so a branch switch does notappear to lose previously-written state.
Two-Layer now truly two-layered
The original
TwoLayerBackendfacade silently degraded to "fast layer only"in several read/write paths because the warm-layer plumbing was incomplete and
errors were swallowed. The following changes make the two-layer guarantee
real:
promoteNotes(key)— explicit promotion of an entry from the fast(notes) layer into the slow (orphan-branch) layer with verification of the
written blob.
readNote(key)— first-class fast-layer reader so callers don't have toreach through internal handles; resolves nullability honestly instead of
returning
undefinedto mean both "miss" and "error".verifyStateBackend— the startup probe now exercises eachlayer independently and reports which layer is unhealthy, instead of a
single boolean that hides slow-layer failures behind fast-layer success.
TwoLayer code paths now
console.warnwith a structured tag(
[two-layer]) so partial-degradation is visible in logs.Regression fixes pulled in during review
fix(shell): use effective state dir when resuming sessions (8f3208ac)fix(doctor): match install-hooks git-dir resolution for worktrees(
dab1d9e8)fix(types): normalize legacy'approved'permission kind (55e843c0)test(effective-squad-dir): stub global Squad path env vars so tests don'tpollute
$HOME(3a02478f)test(session-store,doctor): regression tests for the stateDir and git-dirfixes above (
c9e5b755)CI policy + repo hygiene
ci(policy)(5bef8f28): version guard now allowsX.Y.Z,X.Y.Z-preview,X.Y.Z-preview.N, andX.Y.Z-insider.N.ci(e19b4f83): clean re-application of the policy change with theCRLF↔LF churn reverted, and
.gitattributespinned to*.yml text eol=lfso the flip cannot recur on Windows checkouts. The net diff vs merge-base
on
.github/workflows/squad-ci.ymlis now 6 lines instead of 1240.Two-Layer Now Truly Two-Layered
TwoLayerBackend.read()could returnundefinedfrom a swallowed catchreadNote()returns explicitnullfor miss, throws on errorwrite(), untestedpromoteNotes(key)is a named, tested entry point with blob-verifyverifyStateBackendexercises fast layer + slow layer + facade independently and reports per-layer statuscatch (e) {}blocks in hot pathsconsole.warn('[two-layer] <op> partial degrade:', e.message)The fix lands in
aaec183f("fix(state-backend): add notes promotion/read APIand observability for two-layer backend"). The notes-reader is the missing
piece that made the two-layer claim hold in practice rather than only in name.
Test status
smoke / integration) at HEAD.
test/state-backend.test.ts: all assertions passing.Known follow-ups (tracked separately)
The PR-1200 review surfaced 9 concerns. The table below tracks resolution:
update-ref/git notes add -fis not CAS-safe under concurrent writersaaec183f)verifyStateBackendonly probed the facadedeleteDiris not recursive across nested subtreesstateBackend: 'external'stub name collides with PR #1194's plannedstateLocation: 'external''external-stub'with config-read deprecation warning).github/workflows/squad-ci.ymlshowed 1240-line CRLF↔LF flipe19b4f83, diff now 6 lines,.gitattributespinned)Follow-up tracking: #1211 — "State-backend v2: ref-level CAS, recursive deleteDir, stub-name collision, regression coverage".
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com