goal-flight turns a Claude Code session into the orchestrator of a long-running software project. You bring a goal. The orchestrator collects the requirements worth your attention, settles the architecture with you before code is written, then dispatches the work to a range of coding agents: codex, grok, cursor, claude, running locally or on remote nodes over SSH. Each worker iterates its own plan/act/test/review loop, and every commit is gated behind independent review. The run's state lives in durable project files (plan, queue, worker status, review evidence, resume notes) that survive compaction, restarts, and unattended overnight hours. A multi-hour run lands as reviewed, one-commit-per-chunk work on your branch; nothing is pushed without your permission.
Hosts and workers. Claude Code is the supported orchestrator. Codex is the standard worker; cursor, grok, claude-cli, and other worker adapters also serve. Unsupported orchestrator ports for Codex, Cursor, and OpenCode ship in-repo.
What the orchestrator session is for: requirements gathering, architecture decisions, question escalation, and monitoring — not execution. The orchestrator holds enough context about your project's goal, scenery (constraints, architecture, prior decisions, failure modes), and intent to exercise discretion and recommend the next move — then dispatches the work to workers running iterative code-review loops. Your task shifts from supervising coding to steering design: you check in, ratify or redirect, and answer the questions the run escalates.
Quickstart • What it gets you • Architecture • Commands
Install once:
# Claude Code (supported orchestrator):
git clone https://github.com/simonrowland/goal-flight.git ~/.claude/skills/goal-flight(Orchestrator ports for Codex / Cursor / OpenCode exist but are unsupported — install commands are under Host install notes.)
Primary platform is macOS. Linux hosts work too (docs/hosts/linux.md); the macOS-only OS-sandbox backstop is absent there, so workers rely on their own sandbox and approval policies. Full Windows support requires WSL — see Windows.
Restart the host, then open your project repo in an orchestrator session. From here the flow is four moves:
1. Init the project and meet your workers.
/goal-flight init <topic>
Audits the repo, scaffolds the run's durable state (docs-private/, a goal statement, an
operator-local AGENTS.md), probes which worker CLIs are installed and healthy, registers
codex trust, measures machine capacity, records environment caveats workers will need, and
optionally builds a RAG corpus so workers stop re-reading the same project landscape. You
leave init knowing which agents are available and healthy.
(/goal-flight doctor re-checks all of it any time.)
2. Build the requirements document.
/goal-flight ask-questions
Anticipatory subagents read your repo and goal statement, then interview you on the product choices and trade-offs the code can't answer — not trivia it can. Your answers land in the goal file as it grows into the project's requirements document: the north star (the durable statement of what the project must do), the validated premises, and the open questions. Every later architecture call and review is tested against this file, and it survives compaction.
3. Build the architecture document.
/goal-flight decompose-plan [<plan>]
Bring whatever signal you have — a written plan, an architecture doc, or the conversation
so far — and work it into an architecture the run can execute: closed chunks (SCOPE /
CHECKLIST / ACCEPTANCE / FORBIDDEN) with parallelism flagged, plus a reviewer pass over
the decomposition itself, so the design gets adversarial scrutiny before any code is
written. Disagree with a chunk? Say so in chat; mid-session steering is requirements
input, and the plan revises. (With gstack installed,
/office-hours and /plan-eng-review add structured interrogation at this step.)
4. Dispatch the work.
/goal-flight execute [--parallel <N>]
The orchestrator dispatches each chunk to the best-fit worker, watches status files rather than babysitting terminals, routes around rate-limited providers, and gates every chunk behind executor self-review plus an independent reviewer before it lands. The run stops and asks for permission gates, destructive choices, auth or capacity hard stops, failed review or test gates, and product choices the plan can't infer; it does not push without your permission. Come back hours later to reviewed, one-commit-per-chunk work on your branch, with review evidence and resume notes on disk.
Session management. Compaction is managed by a handoff file the controller keeps
updated — repo state, in-flight dispatches, the next intended action. Before the session
compacts, ask the controller to update its handoff, run /compact, then run
/goal-flight resume. The same resume command picks the project up after a restart or
days away; nothing depends on chat memory.
Working signal, not rigid gates: the skill pins a
goal-<topic>-<date>.mdfile at init for compaction-survival, but it's an anchor — not a contract.decompose-planproceeds on whatever signal exists (the goal-statement when present, or the plan source, architecture doc, and in-session conversation), surfacing any inferred assumptions as inline-office-hours backlog items the user can validate during the run. Show up with "here's my architecture doc plus ten minutes of context-setting chat" anddecompose-planproceeds from there. The premises file accumulates validated answers as the run progresses. DRAFT goal-statement is fine —decompose-planproceeds anyway; sharpen any time by editingdocs-private/goal-<topic>-<date>.mddirectly.
/goal-flight with no args prints SKILL.md — the full pattern reference.
Terms you'll see: a chunk is one closed unit of work (scope, checklist, acceptance criteria, forbidden paths); goal-mode is a worker loop that iterates plan/act/test/review until its chunk converges; ACP (Agent Client Protocol) gives the orchestrator structured worker events, with bash-tail (a watched log file) as the fallback; the RAG corpus is an optional set of curated project notes workers read instead of rediscovering the codebase; walkback is the rate-pressure response that reroutes work away from a limited provider.
- A range of agents, local or remote. Workers include codex, grok, cursor, and claude-cli over ACP or bash-tail; the fleet layer runs the same dispatch on remote SSH nodes. The orchestrator picks executor and iteration pattern per chunk (one-shot, goal-mode loop to convergence, or controller-direct for trivial edits) and divides usage and rate limits across vendors, routing around a pressured provider instead of stalling the run.
- Independent review before commit. Every
/goalchunk runs a 7-category adversarial self-review to convergence inside the worker loop; commit-worthy chunks then get an independent reviewer pass (via gstack's/reviewwhen installed), and milestone reviews sweep at configured cadence with concern-diverse lenses. Each new bug shape caught is minted as a durable bug-class predicate; your already-reviewed code and the saved review archive are swept for further instances, and the predicate joins the standing lenses for every later review (protocols/review-mining.md). - A curated project corpus (RAG). Optionally at init — or via
build-corpuslater — goal-flight distills your repo and docs into curated corpus slices underdocs-private/rag/. Dispatch briefs anchor to those files instead of re-pasting project context into every worker prompt, so the read cost lands on workers and the corpus is written once. - Verification-first dispatch. Wrappers point at files for the agent to investigate, not pre-pasted "facts" that go stale on the timescale of minutes. Frontier models trust orchestrator-text uncritically; pointers force them to re-verify against live disk and surface drift.
- Multi-hour unattended runs with light supervision. Check in periodically or respond to decision notifications. The orchestrator's context primarily holds architecture, plan, and metadata (queue state, recent commits, in-flight dispatch headers); real work happens in subagent context windows, and the orchestrator escalates to you only when a decision is genuinely yours.
- Deterministic ops scripts. Capacity leases, dispatch ledgers, compact status,
structured ACP monitoring, rate-pressure detection, and file-backed review jobs live in
scripts/goalflight_*.py; the model spends its judgment on what the numbers mean. Doctor checks runtime readiness and worker-CLI currency, so you know when to run/goal-flight update.
- vs. running one coding session to context exhaustion — the orchestrator doesn't itself write code. It dispatches and verifies, holding requirements and architecture while workers burn their own context windows; the run outlives any single session.
- vs. agent frameworks with built-in long-horizon memory — durable cross-session memory (file-backed project state, the RAG corpus, the resume routine) arrives inside your existing Claude Code session, not in a separate runtime you'd have to adopt.
- vs. cloud agent swarms or editor agents — workers run on your machine (or your SSH nodes) with your CLIs and provider subscriptions; the orchestrator coordinates them from your host session.
- vs. writing prompts manually — make the plan, not the code. The skill asks a frontier
model to decompose your plan into chunks, flagging what can run in parallel and what can
have a
/goalpattern. Every/goalreviews to convergence by default; the 7-category adversarial self-review runs inside the loop until reviews pass, so the orchestrator never sees a non-converged result.
| Command | What it does |
|---|---|
/goal-flight init <topic> |
Tool check, repo audit, scaffold, codex-trust registration |
/goal-flight decompose-plan [<plan>] |
Break a plan into /goal chunks with parallel reviewer pass |
/goal-flight ask-questions [<scope>] |
Anticipatory subagents; surface clarifying questions |
/goal-flight execute [--parallel <N>] |
Per-chunk loop; sequential default, parallel-safe opt-in |
/goal-flight doctor |
Read-only health check for plugin/package/runtime readiness, model currency, rate-pressure |
/goal-flight update |
Pull latest goal-flight from origin + run each worker CLI's self-update |
/goal-flight build-corpus [<flags>] |
Extend / rebuild the optional RAG corpus |
/goal-flight resume |
Rebuild RESUME-NOTES from current git state |
/goal-flight goal <SLUG> |
Append one goal to the queue |
/goal-flight register-codex [<path>] |
Register a project as codex-trusted |
/goal-flight validate-dispatch [<slug>] |
Render a chunk's dispatch wrapper without dispatching |
/goal-flight validate-queue [<path>] |
Schema-check the goal-queue |
Plus an opt-in self-delegation pattern via /fork — orchestrator writes a marker contract;
forked session detects via env var and follows the contract; orchestrator monitors compact
status. See protocols/self-delegation.md; fork instructions are not always-loaded.
Detailed operating procedures are split into load-on-demand files under
protocols/. The always-loaded SKILL.md is intentionally small.
You normally don't need this section unless you're tuning worker selection.
Iteration pattern — how many turns the chunk needs:
| Pattern | When | Cost |
|---|---|---|
controller-direct |
Trivially small (single-file, < ~30 LoC), OR orchestrator already has the session-loaded context a fresh subagent would have to re-discover | Inline; no subagent |
| one-shot subagent | Default for most chunks. Frontier model picks the executor target based on chunk shape | One subagent dispatch per chunk |
| goal-mode loop | Multi-step refactor, code migration, prototype implementation, converge code to ground-truth — anything that benefits from plan/act/test/review-to-convergence | Multi-hour autonomous session (codex /goal natively, or orchestrator-driven iteration loop) |
Comms shape (orthogonal axis) — how the orchestrator observes the worker. Goal-flight
uses the Agent Client Protocol wherever the worker has
an adapter (codex / cursor / claude / grok all do today); bash-tail with a tail -f-style
marker-grep watcher is the cold-storage fallback. ACP composes with goal-mode for any
worker; goal-mode + bash-tail composes only with codex /goal today (codex emits a
Final-response marker the watcher detects; other workers' headless modes don't).
The orchestrator picks executor + comms per chunk based on chunk shape, available adapters,
and the rate-pressure walkback's recent observations. The shipped routing defaults lean
toward sub-billed workers (codex / cursor / grok) for code-writing — calibrated against the
maintainer's current vendor plans, not a project-wide prescription. Adjust to your
environment by editing the routing table in SKILL.md "Worker Routing"; the walkback
adapts dynamically when any one provider gets pressured.
For remote workers over SSH, bootstrap a fleet store and use goalflight_fleet.py
dispatch / watch / reconcile. OpenCode, Cursor, Codex, and Claude ACP workers
can run on registered nodes while the orchestrator stays local. See
docs/fleet.md for the operator guide; live smoke:
GOALFLIGHT_LIVE_SSH=1 ./tests/manual/test_fleet_live_smoke.sh.
Unified CLI: bin/goalflight <domain> <resource> <verb> (action router over
config/actions/).
- One-off scripts or quick fixes. Overhead unjustified for <8 chunks or <2000 LoC delta. Pre-flight gates auto-skip the RAG corpus for projects this small.
- Pair-programming sessions. Designed for unattended runs. If you're steering every turn, the wrapper overhead just slows you down.
- Sensitive operations the orchestrator shouldn't autonomously trigger. Production deploys, prod data writes, credential rotations — human-in-the-loop is the point.
- When speed matters more than rigor. The review-to-convergence design targets reference-quality code — its home domain is scientific programming. For a small one-shot task it is slower than just writing the script.
Review skills (recommended — the review gates lean on them):
- gstack — Garry Tan's skill pack provides
/review,/office-hours,/plan-eng-review,/cso,/investigatefor both Claude Code and codex. Goal-flight invokes/reviewas the default independent reviewer for chunk-level pre-commit review (protocols/chunk-review.md) and for milestone reviews (protocols/milestone-review.md, gstack + concern-diverse sweep);/office-hourscovers fuzzy-goal interrogation at init. Optional — without gstack, goal-flight falls back to local prompts atprompts/gstack-claude-review.md+prompts/gstack-codex-challenge.md(executor self-review still runs its seven-category pass). With gstack installed, you get consistent severity-ranking framing across both review lenses on long runs./goal-flight initoffers to install just the subset goal-flight uses —/review,/plan-eng-review,/office-hours, plus the community skillsgrill-meandthermo-nuclear-code-quality-reviewdownloaded from their upstream repos — or the full pack, or neither. - autoreview — Complementary diff-local pre-commit pass (
protocols/chunk-review.md,./scripts/autoreview.sh). Runs in parallel with gstack at chunk level when the orchestrator chooses; does not replace gstack as the default review path. Catches diff-local issues (API footguns, missing tests on touched paths, regression invariants) that a structural reviewer may not prioritize. Requires upstream autoreview (typically the Cursor autoreview skill orAUTOREVIEW_HELPER); doctor reports WARN when absent.
Also recommended:
- context-mode — MCP plugin that offloads large command outputs (diffs, integration test runs, codex tail files, large greps) to an FTS5 sandbox the orchestrator queries by pattern. On long runs, raw tool output fills the orchestrator's context and triggers early compaction; context-mode keeps that output out of the session.
- codedb — code-intelligence MCP (tree, outline, symbol, search, deps) the orchestrator uses to anchor dispatch briefs to exact files and lines, and to spot-check worker findings, at a fraction of the context cost of grep-and-read. Optional; use it when indexed lookup beats a broad grep — most useful on large or unfamiliar codebases.
Orchestrator ports for Codex, Cursor, and OpenCode are implemented and unsupported; the Claude Code wrapper is the maintained path. To install a port anyway:
git clone https://github.com/simonrowland/goal-flight.git ~/.goal-flight && cd ~/.goal-flight
./install.sh cursor /path/to/your/project
./install.sh opencode /path/to/your/project
./install.sh codexAfter source SKILL.md, commands/, protocols/, templates/, or adapters/
changes, copied host installs must be resynced from the source repo with
./install.sh <host> unless the host skill is symlinked to the source. The
doctor's installed_skill_drift probe hashes only the installed SKILL.md
file (per host); a hash divergence WARN there means the wrapper is stale,
which usually correlates with the other directories being stale too. Run
the resync command in the probe's resync_command field; text mode prints
installed_skill_md_hash WARNs.
Same flags via setup.sh: --cursor-install, --opencode-install, and
--codex-install (each implies --apply --yes). Dry-run, link-to-Claude, and
agents-standard paths are in docs/hosts/cursor.md and
docs/hosts/opencode.md.
After any install, run doctor:
python3 scripts/goalflight_doctor.py --project-root /path/to/your/project.
Windows runs goal-flight through WSL. Ask your agent to set it up — starting with:
wsl --installthen install goal-flight inside the distro as on Linux. Native Windows (no WSL) is a read/plan control plane only: doctor, status, and ledger reads work; dispatch honestly refuses. Full details — WSL baseline, capability matrix, two-install procedure, launcher and CRLF caveats — are in docs/hosts/windows.md.
Default ./tests/run.sh stays hermetic and cheap. Set
GOALFLIGHT_AUTOREVIEW=1 to include tests/bash/test-autoreview-smoke.sh,
which runs scripts/autoreview.sh --engine claude against a known-good fixture
commit through scripts/autoreview_claude_acp. Each invocation consumes one
Claude ACP-sub-billed autoreview pass.
Set GOALFLIGHT_ACP_LIVE=1 to include the real codex-acp dispatch smoke.
This skill ships tuned for high-accuracy scientific programming but the patterns generalize. Workflow: clone the repo, open it in your orchestrator host (Claude Code wrapper today), and ask the host to adapt the skill for a domain project with a north star, verification command, invariants, and any domain-specific self-review categories. A subagent can read the whole thing, propose a diff, and apply it in one pass.
Main tuning knobs:
- North star + asking discipline + token bias —
SKILL.mdhard-conventions section. - Self-review categories —
prompts/executor-self-review.mdlines 14–35. Seven abstract categories; add domain-specific ones (e.g. SCHEMA GAP for ETL, A11Y GAP for frontend). - Milestone review cadence — milestone review flights run at configured cadence or on
[milestone]-marked chunks (commands/execute.md,protocols/milestone-review.md); mark chunks in the goal queue or adjust the cadence there. - RAG corpus slice mix + word budgets —
templates/rag-corpus-schema.md.tpl. /goalmode prompt shape —templates/codex-goal-prompt.md.tpl(Objective / Workspace / Rules / Acceptance / Test gates / Final response schema).
MIT — see LICENSE.