Ralph the Riveter

A focused reimplementation of the Ralph autonomous coding loop in Rust. Slimmer than the original bash version, with pluggable agents (codex or claude), per-iteration jj commits, and in-place execution (no branch switching).

The binary is named riveter (the tool); the methodology it implements is Ralph (one focused story per fresh agent context, persistent state on disk, jj as the review surface).


1. Goals

  • Single binary riveter written in Rust.
  • Pluggable agents: codex, claude, or mock (built-in test agent) selected via flag/config.
  • Pluggable model + reasoning: --model, --thinking <low|med|high>.
  • Commits via jj (jujutsu), not git. One jj commit per loop iteration after the agent finishes.
  • In-place operation: never switch branches/bookmarks; work on @ directly.
  • Strict separation of concerns: a skill (run in an agent session) converts a spec into a PRD and writes a fresh run folder with a unique runId. The CLI knows nothing about specs — it only loads an existing run folder and executes the loop.
  • PRD as TOML (prd.toml) — human-editable, comment-friendly.
  • Integration-testable: a built-in mock agent with hardcoded deterministic behavior so end-to-end tests don't need an external binary or script.

2. Non-Goals

  • No archiving / .last-branch bookkeeping.
  • No multi-repo / multi-worktree orchestration.
  • No web UI or daemon mode.
  • No retry/backoff sophistication. Agent failures stop the loop.

3. Project Layout

ralph-the-riveter/
├── Cargo.toml                  # workspace
├── .gitignore
├── .jjignore
├── README.md
├── rust-toolchain.toml
├── crates/
│   └── riveter/                # main binary (cli name: `riveter`)
│       ├── Cargo.toml
│       ├── src/
│       │   ├── main.rs
│       │   ├── cli.rs          # clap definitions + Agent/Model/Thinking enums
│       │   ├── agent.rs        # trait Agent + Codex/Claude/Mock impls
│       │   ├── jj.rs           # thin wrapper over `jj` CLI
│       │   ├── loop_.rs        # iteration loop + termination check
│       │   ├── run_dir.rs      # run folder layout + runId resolution
│       │   ├── prompt.rs       # iteration prompt template (include_str! + replace)
│       │   └── prd.rs          # load/parse/write prd.toml
│       └── tests/
│           └── integration.rs  # uses --agent mock + tempdir + jj
├── skills/
│   └── prd/SKILL.md            # spec -> PRD + run folder (non-interactive)

4. CLI

The CLI consumes pre-built run folders. It has no --spec flag and no PRD-generation logic — that's the skill's job.

riveter run      -r <RUN_ID> [OPTIONS]   # execute the loop
riveter validate -r <RUN_ID>             # check a run folder for correctness

For inspecting / listing runs, the user uses cat, ls, and jj log — no dedicated subcommands.

riveter validate

Checks that a run folder is well-formed and ready to execute. Designed to be invoked by skills (and CI) so authoring mistakes are caught before any agent iteration is wasted.

riveter validate -r <RUN_ID>

What it checks:

  • Filesystem layout: prd.toml and spec.md exist.
  • prd.toml schema: parseable; all required fields (description, createdAt, stories) present and correctly typed.
  • Stories: at least one story; id is a positive integer, unique, sequential 1..N with no gaps; title ≤ 80 chars (keeps the commit subject short, although with the [RIVETER(...)] prefix the full subject can still exceed 72 columns); acceptanceCriteria non-empty; passes is a bool.

Output:

$ riveter validate -r task-priority-a1b2c3d4
✓ filesystem layout
✗ prd.toml
  - stories[1].id: expected 2, found 5 (ids must be sequential 1..N)
  - stories[2].acceptanceCriteria: empty
2 errors

Exit codes: 0 ok, 30 validation errors, 31 filesystem missing, 32 I/O error.
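The story rules above can be sketched as a pure function. This is a minimal illustration, not the actual validator: the Story type and the validate_stories name are hypothetical, TOML parsing is assumed to have already happened, and the message wording simply mirrors the sample output.

```rust
/// Minimal in-memory view of one `[[stories]]` entry (hypothetical type).
pub struct Story {
    pub id: i64,
    pub title: String,
    pub passes: bool,
    pub acceptance_criteria: Vec<String>,
}

/// Returns one message per violated rule, in the style of the sample output.
pub fn validate_stories(stories: &[Story]) -> Vec<String> {
    let mut errors = Vec::new();
    if stories.is_empty() {
        errors.push("stories: at least one story is required".to_string());
    }
    for (index, story) in stories.iter().enumerate() {
        // Requiring ids to be exactly 1..N in array order also guarantees
        // that they are positive and unique.
        let expected = index as i64 + 1;
        if story.id != expected {
            errors.push(format!(
                "stories[{index}].id: expected {expected}, found {} (ids must be sequential 1..N)",
                story.id
            ));
        }
        if story.title.chars().count() > 80 {
            errors.push(format!("stories[{index}].title: longer than 80 chars"));
        }
        if story.acceptance_criteria.is_empty() {
            errors.push(format!("stories[{index}].acceptanceCriteria: empty"));
        }
    }
    errors
}
```

An empty result would map to exit code 0 and a non-empty one to exit code 30; the type check on passes is handled by the parser in this sketch, since the field is already a bool.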

Skill integration. The prd skill runs riveter validate -r <runId> after generating the folder, parses the human output (one error per - line), fixes each error (re-writes prd.toml), and re-runs until exit 0 — up to 3 attempts. If validation still fails, the skill surfaces the remaining errors to the user and refuses to print the "ready to run" message.

riveter run

riveter run -r <RUN_ID> [OPTIONS]

Required:
  -r, --run <RUN_ID>              run id (or full path to a run folder)

Options:
  -a, --agent <codex|claude|mock> [default: codex]
  -m, --model <STRING>            [default: gpt-5]
  -t, --thinking <low|med|high>   [default: high]
  -n, --max-iterations <N>        [default: 10]
  -v, --verbose
  -h, --help
  -V, --version

<RUN_ID> is resolved as: (1) absolute/relative path if it contains /, (2) <state-dir>/runs/<RUN_ID> otherwise. Errors out if the folder is missing or prd.toml doesn't exist.
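A minimal sketch of that resolution rule, assuming a hypothetical resolve_run_dir helper. The existence checks for the folder and prd.toml are left to the caller.

```rust
use std::path::{Path, PathBuf};

/// Resolves a `-r` argument to a run folder path (hypothetical helper).
/// Anything containing '/' is taken as a path; everything else is looked
/// up under `<state-dir>/runs/`.
pub fn resolve_run_dir(run_id: &str, state_dir: &Path) -> PathBuf {
    if run_id.contains('/') {
        PathBuf::from(run_id)
    } else {
        state_dir.join("runs").join(run_id)
    }
}
```

One consequence of rule (1) is that a run folder in the current directory has to be written as ./my-run rather than my-run, otherwise it is looked up in the state dir.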

All flags are also readable from env (RIVETER_AGENT, RIVETER_MODEL, RIVETER_THINKING, RIVETER_MAX_ITERATIONS). Env-only knobs: RIVETER_STATE_DIR (override the state dir, used by tests) and RIVETER_AGENT_BIN (override the agent executable, used by tests for codex/claude shims).
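The text does not state a precedence between flag and env var. The sketch below assumes the usual order (flag, then env var, then default), which is also how clap's env support behaves. The resolve_thinking function is hypothetical and uses only the standard library.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Thinking {
    Low,
    Med,
    High,
}

impl FromStr for Thinking {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "low" => Ok(Thinking::Low),
            "med" => Ok(Thinking::Med),
            "high" => Ok(Thinking::High),
            other => Err(format!("invalid thinking level: {other}")),
        }
    }
}

/// Assumed precedence: explicit flag, then env var, then the default "high".
pub fn resolve_thinking(flag: Option<&str>, env_value: Option<&str>) -> Result<Thinking, String> {
    flag.or(env_value).unwrap_or("high").parse()
}
```

A caller could pass std::env::var("RIVETER_THINKING").ok().as_deref() as the second argument.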

Typical workflow

# 1. In an agent session, invoke the `prd` skill with your spec.
#    The skill generates the PRD, creates the run folder, validates it,
#    and tells you the runId.

# 2. (Optional) inspect / edit the generated PRD.
$EDITOR ~/.local/state/riveter/runs/task-priority-a1b2c3d4/prd.toml

# 3. Run the loop in-place against the current jj repo.
riveter run -r task-priority-a1b2c3d4 -a codex -m gpt-5 -t high

# 4. Resume later (same command - skips stories already passing).
riveter run -r task-priority-a1b2c3d4

5. Agent Abstraction

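use std::path::Path;

use anyhow::Result; // anyhow is listed in the dependencies (§9)

use crate::cli::Thinking; // Low | Med | High, defined in cli.rs (§3)

/// A coding agent that can execute one rendered iteration prompt.
pub trait Agent {
    /// Short identifier such as "codex", "claude", or "mock".
    fn name(&self) -> &'static str;
    /// Runs the agent once with `prompt` and returns its captured output.
    fn run(&self, prompt: &str, opts: &RunOpts) -> Result<AgentOutput>;
}

/// Per-invocation settings passed to the agent.
pub struct RunOpts<'a> {
    pub model: Option<&'a str>,
    pub thinking: Thinking,   // Low | Med | High
    pub cwd: &'a Path,        // working directory for the agent process
}

/// Captured result of one agent process.
pub struct AgentOutput {
    pub stdout: String,
    pub stderr: String,
    pub exit_code: i32,
}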

Codex impl invokes codex exec --model <m> --reasoning <thinking> … (exact flags pinned in code with comments). Claude impl invokes claude --dangerously-skip-permissions --print --model <m> …. Both stream output live to stderr (via tee-style passthrough) while capturing stdout for logging.

Mock impl is hardcoded in the binary for integration tests. On each call it:

  1. Reads prd.toml from the run folder.
  2. Picks the first story with passes = false.
  3. Touches riveter-mock-<storyId>.txt in cwd (so jj st has changes).
  4. Rewrites prd.toml setting that story's passes = true.
  5. Exits 0.

After enough invocations the mock walks the entire PRD and the loop terminates naturally (all passes = true). No env vars, no scripts.
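One mock invocation can be sketched as follows. This is a simplified illustration: the PRD is passed in memory instead of being read from and written back to prd.toml, and the Story type and mock_step name are hypothetical.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Reduced story record: only the fields the mock needs (hypothetical type).
pub struct Story {
    pub id: u32,
    pub passes: bool,
}

/// Performs one mock "iteration": picks the first pending story, creates a
/// marker file in `cwd` so the working copy is dirty, and marks the story
/// as passing. Returns the id of the story handled, or None if nothing was
/// pending.
pub fn mock_step(stories: &mut [Story], cwd: &Path) -> io::Result<Option<u32>> {
    let Some(story) = stories.iter_mut().find(|s| !s.passes) else {
        return Ok(None);
    };
    File::create(cwd.join(format!("riveter-mock-{}.txt", story.id)))?;
    story.passes = true;
    Ok(Some(story.id))
}
```

Calling mock_step once per story walks the whole list, which is what lets the loop terminate without any external binary.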


6. Loop Behavior

Per iteration:

  1. Load PRD. Read <run-dir>/prd.toml.

  2. Terminate? If every story already has passes = true, exit 0 (status=ok).

  3. Pick story. Riveter selects the first story (array order) with passes = false and injects it into the prompt. The agent only sees one story per iteration — keeping its context tiny. The prompt instructs the agent to set that story's passes = true in prd.toml when its acceptance criteria are met.

  4. Run agent. Pipe the rendered prompt to the selected agent; tee live output to console (line-prefixed with "│ ") and to iterations/NNN/stdout.log.

  5. Commit. If jj st shows changes, Riveter runs jj describe -m <msg> then jj new. No bookmark moves, no branch switching. Message format:

    [RIVETER(<runId>,#<storyId>,<model>)] chore: <story title>
    

    Example:

    [RIVETER(task-priority-a1b2c3d4,#1,gpt-5)] chore: Add priority field to tasks table
    

    Always chore: <story title>; no parsing of agent output, no fallback. Reviewers can jj describe to rewrite the subject if they prefer a different conventional-commit type. Grep with jj log -r 'description(glob:"\\[RIVETER*")' (the bracket is escaped because [ opens a character class in glob patterns).

  6. Empty work-copy = failure. If jj st is clean after the agent exits 0, treat it as clean_workcopy and stop. The agent should have produced changes; if it didn't, something's wrong and we don't want to silently re-prompt.

  7. Loop back to step 1, up to --max-iterations.
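The commit step (step 5) can be sketched as two small helpers. The function names are hypothetical. The jj invocations are only built as argument lists here, not executed, so the sketch runs without jj installed.

```rust
/// Builds the commit subject in the format described in step 5.
pub fn commit_message(run_id: &str, story_id: u32, model: &str, title: &str) -> String {
    format!("[RIVETER({run_id},#{story_id},{model})] chore: {title}")
}

/// Returns the argv of the two jj invocations from step 5, in order:
/// `jj describe -m <msg>` followed by `jj new`.
pub fn commit_commands(message: &str) -> Vec<Vec<String>> {
    vec![
        vec!["jj".into(), "describe".into(), "-m".into(), message.into()],
        vec!["jj".into(), "new".into()],
    ]
}
```

Each inner vector could be handed to std::process::Command (first element as the program, the rest as arguments) with the current directory set to the jj repo.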

After --max-iterations without all stories passing → exit 20.

Termination is driven solely by prd.toml state: when every story has passes = true, the loop exits ok. There is no special "COMPLETE" signal in agent output. Story selection happens in Riveter, not the agent.
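The control flow of the loop can be sketched with the agent abstracted as a closure. All names here are hypothetical, the PRD lives in memory, and the jj commit is only marked by a comment. The sketch also assumes that a story completed on the final allowed iteration counts as success, which the text does not spell out.

```rust
/// Reduced story record (hypothetical type).
pub struct Story {
    pub id: u32,
    pub passes: bool,
}

/// What one agent invocation reported back (hypothetical type).
pub struct AgentResult {
    pub exit_code: i32,
    pub workcopy_dirty: bool, // stands in for "jj st shows changes"
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    AllPassing,
    AgentFailed,
    CleanWorkcopy,
    MaxIterations,
}

pub fn run_loop<F>(stories: &mut [Story], max_iterations: u32, mut agent: F) -> Outcome
where
    F: FnMut(&mut Story) -> AgentResult,
{
    for _ in 0..max_iterations {
        // Steps 1-3: inspect current state, stop if done, else pick the
        // first pending story in array order.
        let Some(story) = stories.iter_mut().find(|s| !s.passes) else {
            return Outcome::AllPassing;
        };
        // Step 4: run the agent on exactly one story.
        let result = agent(story);
        if result.exit_code != 0 {
            return Outcome::AgentFailed;
        }
        // Step 6: a clean working copy after a successful exit is a failure.
        if !result.workcopy_dirty {
            return Outcome::CleanWorkcopy;
        }
        // Step 5: the jj commit would happen here.
    }
    // Assumption: finishing the last story on the last iteration is success.
    if stories.iter().all(|s| s.passes) {
        Outcome::AllPassing
    } else {
        Outcome::MaxIterations
    }
}
```

Because the agent is a closure, the same function can be driven by a mock in tests and by a real process wrapper in production.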

No special resume handling. If a run is interrupted mid-iteration (Ctrl-C, crash), the partial commit and partially-updated prd.toml are left on disk. Re-running riveter run -r <id> continues but does not auto-clean: it re-reads prd.toml as-is and proceeds from the first passes = false story. If the user wants a clean retry, they should jj abandon <commit> and edit prd.toml themselves.


7. Skills

A single skill, plain markdown so any agent (codex/claude/amp) can load it. The skill is the spec→PRD pipeline. The CLI never sees a spec.

skills/prd/SKILL.md

Triggered by "create a prd", "spec out", "plan this feature", "convert this spec".

The skill is non-interactive: it does not ask the user clarifying questions. It converts whatever spec it is given into a structured PRD. The user can always edit prd.toml afterwards.

Steps:

  1. Take the user's spec (free-form description or full PRD markdown).
  2. Compute a runId = <kebab-of-feature-title>-<random-8-alphanumeric> (lowercase [a-z0-9-]+, ≤64 chars). Random suffix so re-running the skill on the same spec creates a new run rather than colliding.
  3. Create the run folder at <state-dir>/runs/<runId>/. State dir convention: $XDG_STATE_HOME/riveter (Linux), ~/Library/Application Support/riveter (macOS), %LOCALAPPDATA%\riveter (Windows).
  4. Write:
    • spec.md — the user's original input, verbatim
    • prd.toml — the structured PRD Riveter consumes (schema in §7b)
  5. Validate. Run riveter validate -r <runId>. If errors are reported, patch the offending files (typically prd.toml) and re-validate. Retry up to 3 times. If validation still fails, show the errors to the user and stop — do not print the "ready to run" message.
  6. Print to the user (only after validation passes):

    Created run task-priority-a1b2c3d4 at ~/.local/state/riveter/runs/task-priority-a1b2c3d4. Review prd.toml, then run: riveter run -r task-priority-a1b2c3d4

The skill is markdown executed by an LLM, so it's free to compute the runId, call mkdir, and write TOML directly. It does not need a Rust helper. The CLI's only contract with skills is: read prd.toml from a run folder.

The skill handles both free-form specs and pre-written PRDs uniformly. If the input already contains ## User Stories, the skill preserves them; otherwise it generates them from the description.
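The runId rule from step 2 can be illustrated in Rust, even though the skill itself computes it without a helper. This is a sketch under assumptions: make_run_id is a hypothetical name, the random suffix is passed in rather than generated, and a title with no ASCII letters or digits (which would yield an empty slug) is not handled.

```rust
/// Builds `<kebab-of-feature-title>-<suffix>`, truncating the slug so the
/// whole id stays within 64 characters.
pub fn make_run_id(feature_title: &str, suffix: &str) -> String {
    // Kebab-case: keep ASCII alphanumerics (lowercased); collapse every run
    // of other characters into a single '-'.
    let mut slug = String::new();
    for c in feature_title.chars() {
        if c.is_ascii_alphanumeric() {
            slug.push(c.to_ascii_lowercase());
        } else if !slug.is_empty() && !slug.ends_with('-') {
            slug.push('-');
        }
    }
    let slug = slug.trim_end_matches('-');
    // Reserve room for '-' plus the suffix. The slug is pure ASCII, so
    // slicing by byte index is safe.
    let max_slug = 64usize.saturating_sub(suffix.len() + 1);
    let slug = &slug[..slug.len().min(max_slug)];
    let slug = slug.trim_end_matches('-');
    format!("{slug}-{suffix}")
}
```

For example, make_run_id("Task Priority", "a1b2c3d4") produces task-priority-a1b2c3d4, the id used throughout this document.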


7b. prd.toml Format

description = "Add high/medium/low priority to tasks (default medium); show colored badges; filter by priority."
createdAt   = "2026-05-24T11:42:07Z"

[[stories]]
id          = 1
title       = "Add priority field to tasks table"
passes      = false
acceptanceCriteria = [
  "Add priority column: 'high' | 'medium' | 'low' (default 'medium')",
  "Generate and run migration successfully",
  "Typecheck passes",
]

[[stories]]
id          = 2
title       = "Display priority badge on task cards"
passes      = false
acceptanceCriteria = [
  "Each task card shows colored badge (red=high, yellow=medium, gray=low)",
  "Typecheck passes",
]

description is the top-level context the agent sees on every iteration (alongside the picked story): it tells the agent what the project is and what is being built. createdAt is recorded for traceability. The folder name is the runId, so no runId field is needed inside the file.

  • Story execution order is the array order. id is a stable integer (1-indexed, sequential, no gaps).
  • The only per-story state Riveter mutates is passes. Reviewers can flip passes = false to re-run a story.
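  • TOML chosen over JSON because: TOML allows comments, multiline arrays read well for criteria, and humans actually edit this file between iterations. Note that comments survive a rewrite only if the writer preserves formatting (for example the toml_edit crate); deserializing with serde + toml and re-serializing drops them.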

7c. Run Layout & Output Files

The run folder is the run identity. It lives outside the project under test so agent transcripts never pollute the working repo.

$XDG_STATE_HOME/riveter/runs/<runId>/
  ├── spec.md               # snapshot of the original spec input (verbatim)
  ├── prd.toml              # structured PRD — the single source of truth Riveter reads
  └── iterations/
      ├── 001/
      │   ├── prompt.txt    # exact bytes piped to the agent
      │   ├── stdout.log
      │   ├── stderr.log
      │   └── exit.txt      # agent exit code
      ├── 002/
      └── ...
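  • Default state location: dirs::state_dir() on Linux. That function returns None on macOS and Windows, so those platforms need a fallback such as dirs::data_local_dir(), which yields the base directories listed in §7, step 3. Override with RIVETER_STATE_DIR.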
  • riveter run prints the run dir path on stderr first so users can tail -f iterations/.../stdout.log immediately.
  • Iteration dirs are append-only across re-runs (numbering continues 003, 004, …); they never overwrite a prior iteration.
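The append-only numbering can be sketched as a pure function over the names already present in iterations/. The next_iteration_dir name is hypothetical, and listing the directory (for example with std::fs::read_dir) is left to the caller.

```rust
/// Given the existing entry names under `iterations/`, returns the name of
/// the next iteration directory, zero-padded to three digits. Names that
/// are not plain numbers are ignored.
pub fn next_iteration_dir<'a, I>(existing: I) -> String
where
    I: IntoIterator<Item = &'a str>,
{
    let highest = existing
        .into_iter()
        .filter_map(|name| name.parse::<u32>().ok())
        .max()
        .unwrap_or(0);
    format!("{:03}", highest + 1)
}
```

Taking the maximum rather than the count means a manually deleted iteration folder never causes a later one to be overwritten.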

To list runs: ls $XDG_STATE_HOME/riveter/runs/. To see PRD progress: cat $XDG_STATE_HOME/riveter/runs/<runId>/prd.toml (grep for passes =). To find the most recent run: ls -t $XDG_STATE_HOME/riveter/runs/ | head -1. If XDG_STATE_HOME is unset, substitute its default, ~/.local/state.


7d. Console Output (what the user sees)

Default (-v adds debug lines, -vv adds every shell command):

riveter 0.1.0
run:     ~/.local/state/riveter/runs/task-priority-a1b2c3d4
agent:   codex   model: gpt-5   thinking: high
cwd:     /home/me/proj   (jj repo @ kwptlqqv)
prd:     4 stories, 1 passing, 3 remaining

============================================================
  iteration 3/10  ·  #2 "Display priority badge on task cards"
============================================================
[agent] streaming to iterations/003/stdout.log ...
│ Reading prd.toml...
│ Implementing priority badge component...
│ (live agent output prefixed with "│ ")
[agent] exit 0 in 47.2s
[jj] 3 files changed -> commit zptlqqvr
     [RIVETER(task-priority-a1b2c3d4,#2,gpt-5)] chore: Display priority badge on task cards
[prd] story #2 marked passes=true (2/4)

============================================================
  iteration 4/10  ·  #3 "Add priority selector to task edit"
============================================================
...

[done] all stories passing after 4 iterations (3m12s)
       run dir: ~/.local/state/riveter/runs/task-priority-a1b2c3d4
       review:  jj log -r 'description(glob:"\\[RIVETER*")'

Live agent output is line-prefixed ("│ "). With --quiet, only iteration headers, errors, and the final summary are printed; full transcripts always land in the run dir.


7e. Failure Handling

Trigger                                           Exit
All stories passes = true                         0
Agent exits non-zero                              10
Agent exits 0 but jj st is clean                  12
Any jj subcommand fails                           13
prd.toml missing/invalid after iteration          14
Hit --max-iterations with stories still pending   20
SIGINT/SIGTERM                                    130
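One way to keep these codes in a single place is an enum with an exit_code method. This is a sketch: the RunOutcome type and its variant names are hypothetical, and only the numeric values come from the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunOutcome {
    AllStoriesPassing,
    AgentFailed,
    CleanWorkcopy,
    JjFailed,
    PrdInvalid,
    MaxIterations,
    Interrupted,
}

impl RunOutcome {
    /// Maps each outcome to the process exit code from the table above.
    pub fn exit_code(self) -> i32 {
        match self {
            RunOutcome::AllStoriesPassing => 0,
            RunOutcome::AgentFailed => 10,
            RunOutcome::CleanWorkcopy => 12,
            RunOutcome::JjFailed => 13,
            RunOutcome::PrdInvalid => 14,
            RunOutcome::MaxIterations => 20,
            RunOutcome::Interrupted => 130,
        }
    }
}
```

main could then end with std::process::exit(outcome.exit_code()). The exhaustive match makes the compiler flag any new outcome that lacks a code.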

Rules:

  1. No silent rollback of jj commits. Bad commits are left in place; the user can jj abandon <id> to discard them.
  2. No resume cleanup. Re-running riveter run -r <runId> continues from the current prd.toml state but does not clean up partial work from a prior interrupted run. Iteration numbering continues from the highest existing NNN.
  3. Signal handling. SIGINT during an agent call: wait up to 10s for graceful shutdown, then SIGKILL the process group; exit 130.

8. Built-in mock Agent for Integration Tests

The mock agent is selected with --agent mock. Its behavior is hardcoded in the binary (no env vars, no scripts, no separate crate). On each invocation it:

  1. Reads prd.toml from the run folder.
  2. Picks the first story with passes = false.
  3. Touches riveter-mock-<storyId>.txt in the working copy (so jj st sees changes).
  4. Rewrites prd.toml to set that story's passes = true.
  5. Exits 0.

After N iterations (where N = number of stories), every story is passing and the loop terminates with exit 0. This is enough to test the loop, the jj integration, the iteration numbering, and the per-iteration log files.

Sample integration test

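use tempfile::tempdir; // dev-dependency listed in §9

// init_jj, seed_run, run_riveter, jj_log_count, all_stories_passing and
// demo_prd_with_3_stories are test helpers assumed to be defined in this file.
#[test]
fn loop_walks_prd_to_completion() {
    let tmp = tempdir().unwrap();
    init_jj(tmp.path());
    let run_id = seed_run(tmp.path(), &demo_prd_with_3_stories());

    let exit = run_riveter(&[
        "run", "-r", &run_id, "--agent", "mock", "--max-iterations", "10",
    ], tmp.path());

    assert_eq!(exit, 0);
    assert_eq!(jj_log_count(tmp.path()), 3);   // one commit per story
    assert!(all_stories_passing(tmp.path(), &run_id));
}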

9. Fresh Project Setup

  • cargo new --bin crates/riveter, then add a workspace root Cargo.toml.
  • rust-toolchain.toml pinning a recent stable (e.g. 1.85.0).
  • .gitignore:
    /target
    /scratch/
    *.tmp
    **/*.log
    .DS_Store
    
  • .jjignore mirrors the above for jj.
  • Dependencies (intentionally minimal):
    • clap (derive) — CLI
    • serde + toml: prd.toml (read/write)
    • dirs — cross-platform state dir
    • rand — runId random suffix
    • anyhow — errors
    • tracing + tracing-subscriber — logs
    • tempfile, assert_cmd, predicates — dev-only, for integration tests

The iteration prompt template is embedded via include_str!("../templates/prompt.md") and rendered with str::replace — no templating engine needed.
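A minimal sketch of that rendering approach. The placeholder names ({{description}}, {{story_id}}, {{story_title}}) and the template text are assumptions for illustration. In the real binary the template would come from include_str! rather than an inline constant.

```rust
/// Stand-in for the embedded template; the placeholders are hypothetical.
pub const TEMPLATE: &str = "Project: {{description}}\nStory #{{story_id}}: {{story_title}}\n";

/// Renders the prompt by plain string substitution, with no templating
/// engine involved.
pub fn render_prompt(template: &str, description: &str, story_id: u32, story_title: &str) -> String {
    template
        .replace("{{description}}", description)
        .replace("{{story_id}}", &story_id.to_string())
        .replace("{{story_title}}", story_title)
}
```

One caveat of chained str::replace is that substituted text is scanned by the later replacements, so a description that itself contains {{story_id}} would be rewritten too. For human-authored PRDs this is unlikely to matter.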


9a. README Contents

The top-level README.md must include a "Why Ralph the Riveter" section explaining the rationale, because the design only makes sense once you understand the trade. Outline:

Why Ralph the Riveter

  • Fresh context per story. Each iteration spawns a new agent process with no memory of prior iterations. The agent's working set is exactly one user story + the PRD description + the codebase, never the full transcript of past attempts. This keeps the context window small and focused, which is the single biggest lever on agent quality and cost.
  • Persistent state lives on disk, not in-context. prd.toml tracks which stories are done — the agent reads it (plus its assigned story) at the start of every run and flips passes = true when the story's acceptance criteria are met.
  • jj is the review surface. Because every iteration is one jj commit on @, the user reviews Riveter's work the same way they review their own: jj log, jj diff -r <id>, jj abandon <id> to reject, jj squash/jj split to reshape. No branch dance, no PR ceremony, no merge conflicts with your in-flight work.
  • Pluggable agents. Codex and Claude are interchangeable — pick the model that's good at the kind of work the spec needs. Same loop, same artifacts, same review flow.

Also include:

  • Quick start — two phases: (1) run the prd skill in an agent session against your spec → it prints a runId; (2) riveter run -r <runId> from the project's jj repo.
  • Reviewing a run (jj log -r 'description(glob:"\\[RIVETER*")')
  • Rejecting bad iterations (jj abandon <commit>, optionally edit prd.toml to flip passes back to false)
  • Tuning for cost vs. quality (the --thinking/--model matrix, and a note that defaults are deliberately on the expensive end)
  • --agent mock usage (deterministic agent for hacking on Riveter itself)

10. Milestones

  1. M1 — Scaffold. Workspace, .gitignore, .jjignore, clap CLI parsing, no-op loop.
  2. M2 — riveter validate. Schema + filesystem checks; unit tests against good/bad fixtures.
  3. M3 — Agent trait + Claude impl. Real claude invocation, output tee'd to per-iteration log files.
  4. M4 — jj integration. Commit per iteration, in-place.
  5. M5 — Codex impl + model/thinking flags.
  6. M6 — Built-in mock agent + first integration test.
  7. M7 — prd skill authored — non-interactive spec→PRD conversion + validate auto-fix loop.
  8. M8 — Docs (README) and a small example PRD end-to-end run.

11. Open Questions

  1. Codex flag mapping. Confirm exact codex exec flags for --thinking and --model for the target version.
  2. Repo / workspace folder name. Working tree is currently rivet-ralph/ on disk. Rename to ralph-the-riveter/ to match the project name, or leave it?