A CLI tool that wraps LLM API calls in iterative refinement loops to produce higher-quality outputs for complex cognitive tasks.
Instead of a single prompt-and-response, refine decomposes a task, generates an initial response, evaluates it against task-specific quality criteria, and refines through targeted iteration — stopping when quality thresholds are met or the cost budget is exhausted.
Setup (decompose) → Loop (generate → evaluate → refine → audit) × N → Teardown (synthesise)
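The flow above can be pictured as a short loop. This is an illustrative sketch only; the stand-in functions (decompose, generate, evaluate, synthesise) and the toy scoring are hypothetical, not the tool's actual API:

```python
def run(prompt, max_iterations=3, target=0.9):
    # Hypothetical stand-ins for the real engine components (decomposer,
    # generator, evaluator, synthesiser); the actual implementation differs.
    decompose = lambda p: [p]                              # Setup: trivial sub-task list
    generate = lambda tasks, prev: (prev or "") + "draft " # generate, or refine previous
    evaluate = lambda draft: min(1.0, len(draft) / 18)     # toy quality score in [0, 1]
    synthesise = lambda draft: draft.strip()               # Teardown: final assembly

    subtasks = decompose(prompt)
    draft, score = None, 0.0
    for _ in range(max_iterations):                        # Loop × N
        draft = generate(subtasks, draft)
        score = evaluate(draft)
        if score >= target:                                # audit: quality target reached
            break
    return synthesise(draft), score
```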
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Copy the example environment file and add your keys:
cp .env.example .env

# Required: Anthropic API key
ANTHROPIC_API_KEY=sk-ant-...
# Optional: Pinboard token for research phase (see below)
PINBOARD_TOKEN=username:TOKEN

refine resolves Anthropic credentials using a three-tier cascade:
| Priority | Source | How it's used |
|---|---|---|
| 1 | ANTHROPIC_API_KEY in .env or environment | Standard PydanticAI agent |
| 2 | ANTHROPIC_AUTH_TOKEN | PydanticAI via AsyncAnthropic(auth_token=...) |
| 3 | claude CLI on PATH | Shells out to claude -p (uses Claude Code's own auth) |
Inside Claude Code / Claude Desktop: auth is automatic — the tool detects the CLI and uses it as a fallback when no API key is set.
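The cascade can be sketched as follows. This is a simplified illustration, not the tool's actual resolution code; it assumes the CLI check is a simple PATH lookup:

```python
import os
import shutil

def resolve_auth(env=os.environ):
    """Return (method, credential) following the three-tier cascade (sketch)."""
    if env.get("ANTHROPIC_API_KEY"):
        return ("api_key", env["ANTHROPIC_API_KEY"])         # Tier 1
    if env.get("ANTHROPIC_AUTH_TOKEN"):
        return ("auth_token", env["ANTHROPIC_AUTH_TOKEN"])   # Tier 2
    if shutil.which("claude"):                               # Tier 3: CLI fallback
        return ("claude_cli", "claude -p")
    raise RuntimeError("No Anthropic credentials found")
```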
refine [PROMPT] [OPTIONS]
# Research task with default settings
refine "Analyse the competitive landscape for AI code editors"
# Quick draft — fewer iterations, lower cost
refine "Pros and cons of Rust vs Go for CLI tools" --tier quick
# Thorough analysis — more iterations, higher quality targets
refine "Develop a market entry strategy for UK fintech" --tier thorough
# Save to file
refine "State of WebAssembly adoption in 2026" -o output/report.md

# Elevator pitch
refine "Write a 100-word elevator pitch for a B2B SaaS product" --profile copywriting
# Multiple variants
refine "Write 3 tagline variants for a fintech app targeting Gen Z: one punchy, one warm, one investor-facing" --profile copywriting
# Quick first draft — generate + evaluate, no refinement
refine "Write a launch email subject line for our new API product" --profile copywriting --tier quick
# High-stakes copy with more refinement passes
refine "Write the hero copy for our Series B fundraise landing page" --profile copywriting --tier thorough -o output/hero-copy.md

# Refine an existing draft
refine --input draft.md
# Analyse a document with a specific question
refine "Evaluate this against current market conditions" --input business-plan.md

# Set a strict budget
refine "..." --max-budget-usd 1.00
# Disable budget limit entirely — run until quality or iteration cap
refine "..." --no-budget-limit
# Limit to 1 iteration (generate + evaluate only, no refinement)
refine "..." --max-iterations 1
# Use a cheaper model for all phases
refine "..." --model claude-haiku-4-5

Budget behaviour: When the budget is about to be exceeded, the tool prompts you to continue or stop. If you continue, the budget ceiling is raised to cover the next operation. This avoids losing work mid-run while keeping cost visibility. Use --no-budget-limit to skip the prompt entirely.
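The budget gate behaves roughly like this sketch (function name and shape are illustrative, not the actual code; confirm stands in for the interactive prompt):

```python
def check_budget(spent, ceiling, next_cost, confirm):
    """Sketch of the budget gate: prompt before exceeding; raise ceiling on 'continue'."""
    if spent + next_cost <= ceiling:
        return ceiling, True                 # within budget, proceed silently
    if confirm():                            # user chose to continue
        return spent + next_cost, True       # raise ceiling to cover the next operation
    return ceiling, False                    # stop, keeping work done so far
```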
| Flag | Short | Description |
|---|---|---|
| --profile | -p | Refinement profile name (default: deep-research) |
| --tier | -t | Quality tier: quick, standard, thorough (default: standard) |
| --input | -i | Input file path (markdown, text) |
| --output | -o | Output file path (default: stdout) |
| --model | -m | Override all models (e.g. claude-haiku-4-5) |
| --max-budget-usd | | Override per-loop cost budget |
| --no-budget-limit | | Disable budget cap (run until quality or iteration cap) |
| --max-iterations | | Override maximum iteration count |
| --dry-run | | Show decomposition and cost estimate without executing |
| --verbose | -v | Show evaluation details and debug logging |
| --quiet | -q | Suppress progress, output result only |
| --force | | Skip pre-flight checks (input size warnings) |
Before running refine on a complex topic, you can optionally prime the prompt with curated research from your Pinboard bookmarks. This requires the Pinboard MCP server and a valid token.
pip install pinboard-mcp-server

Add your Pinboard token to .env:
PINBOARD_TOKEN=username:TOKEN

Add the MCP server to your Claude Code config (.claude/settings.local.json):
{
"mcpServers": {
"pinboard": {
"command": "pinboard-mcp-server",
"env": {
"PINBOARD_TOKEN": "${PINBOARD_TOKEN}"
}
}
}
}

The research phase is a manual step run inside Claude Code before invoking refine:
- Map tags — Call listTags to find tags relevant to your topic
- Query bookmarks — Call listBookmarksByTags for each relevant tag
- Search keywords — Call searchBookmarks for key phrases the tag approach might miss
- Curate — Deduplicate, rank by relevance, and save a manifest
- Extract — Fetch each URL, extract key ideas, and write a synthesis
The curated research can then be passed to refine as input:
refine "Design a monetisation strategy for a two-sided marketplace" \
--input output/research-synthesis.md \
--tier thorough \
-o output/strategy.md

See docs/research-manifest.md and docs/research-synthesis.md for examples of this workflow's output.
Profiles define what the tool evaluates, how it decomposes tasks, and what "good" looks like. Each profile ships with its own axes, prompt templates, and budget defaults.
Multi-source research synthesis with evidence grading. Decomposes via question tree, refines additively, produces structured reports.
| Axis | Weight | Hard-fail | What it measures |
|---|---|---|---|
| Coverage | 1.0 | No | Has the question space been fully explored? |
| Evidence Quality | 1.2 | Yes | Are claims supported by credible sources? |
| Coherence | 1.0 | No | Is the argument internally consistent? |
| Depth | 0.8 | No | Does the analysis go beyond surface-level? |
| Actionability | 0.8 | No | Can the reader act on the findings? |
Short-form persuasive writing: pitches, taglines, positioning, brand copy. Decomposes via constraint mapping (extracts rules from the brief, identifies variants), refines eliminatively (cuts rather than adds), produces copy directly with no strategic scaffolding.
| Axis | Weight | Hard-fail | What it measures |
|---|---|---|---|
| Precision | 1.2 | No | Every sentence earns its place — no padding, no filler |
| Constraint Compliance | 1.0 | Yes | Follows every hard rule in the brief (naming, claims, format) |
| Audience Fit | 1.0 | No | The target listener recognises their reality in this |
| Distinctiveness | 0.8 | No | Variants sound meaningfully different, not just reworded |
| Coherence | 0.8 | No | Tone is consistent, claims don't contradict each other |
Hard-fail axes must pass before the loop can exit, regardless of other scores.
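One way to picture how hard-fail gating and weighted axes combine, as a sketch only: the axis data shape and the pass-ratio rule are assumptions, not the actual evaluator:

```python
def can_exit(axes, pass_ratio=1.0):
    """Sketch: hard-fail axes must pass outright; the rest count toward a
    weighted pass ratio (pass_ratio=1.0 means every axis must hit its target)."""
    hard = [a for a in axes if a["hard_fail"]]
    soft = [a for a in axes if not a["hard_fail"]]
    if any(a["score"] < a["target"] for a in hard):
        return False                          # a hard-fail axis blocks exit outright
    total = sum(a["weight"] for a in soft) or 1.0
    passed = sum(a["weight"] for a in soft if a["score"] >= a["target"])
    return passed / total >= pass_ratio
```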
Tiers adjust targets, iteration caps, and budgets relative to the profile's base values:
| Tier | Targets | Max iterations | Budget | Use when |
|---|---|---|---|---|
| quick | Default - 1 (floor: 2) | 2 | 50% of base | Speed matters more than polish |
| standard | As defined in profile | base | 100% of base | Normal use |
| thorough | Default + 1 (cap: 4) | base × 1.5 | 200% of base | High-stakes output worth the spend |
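The tier adjustments in the table amount to simple arithmetic on the profile's base values. A minimal sketch, assuming this is how the multipliers apply (the function and its signature are illustrative):

```python
def apply_tier(base_target, base_iters, base_budget, tier):
    """Return (target, max_iterations, budget) adjusted per tier (sketch)."""
    if tier == "quick":
        return max(base_target - 1, 2), 2, base_budget * 0.5        # floor: 2
    if tier == "thorough":
        return min(base_target + 1, 4), round(base_iters * 1.5), base_budget * 2.0  # cap: 4
    return base_target, base_iters, base_budget                     # standard
```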
Each run follows three phases:
1. Decomposition — The prompt is broken into sub-tasks using a task-specific strategy (question tree for research, constraint mapping for copywriting).
2. Iterative loop — For each iteration:
- Generate content for each sub-task (or refine from previous iteration)
- Evaluate against multi-axis quality criteria
- Audit — deterministic check: all axes pass? budget left? diminishing returns? Continue or stop.
3. Synthesis — Sub-task outputs are assembled into a coherent final document.
The loop terminates when:
- All quality axes meet their targets
- Cost budget is exhausted (you'll be prompted to continue or stop)
- Improvement plateaus for 2 consecutive iterations
- Maximum iterations reached
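The stop conditions above lend themselves to a deterministic check. A sketch with illustrative names and a simple plateau rule (the real auditor's logic may differ):

```python
def should_stop(scores, targets, spent, budget, history, i, max_iters):
    """Return a termination reason, or None to keep iterating (sketch)."""
    if all(scores[a] >= targets[a] for a in targets):
        return "quality_threshold_reached"
    if spent >= budget:
        return "budget_exhausted"            # in practice, prompts to continue or stop
    if len(history) >= 3 and history[-1] <= history[-3]:
        return "plateau"                     # no net gain over 2 consecutive iterations
    if i + 1 >= max_iters:
        return "max_iterations"
    return None
```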
All generated content should be saved to the output/ directory (gitignored by default).
The final output is markdown with a metadata block (HTML comment, non-rendering):
<!-- refine-metadata
profile: deep-research
tier: standard
iterations: 2
termination: quality_threshold_reached
cost_usd: 0.78
duration_seconds: 8.1
axes:
coverage: 3/3 ✓
evidence_quality: 3/3 ✓
-->

Every run also writes a full JSON audit trail to .refine/logs/.
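Because the metadata block is a plain HTML comment, downstream scripts can read it back out. A sketch assuming the block format shown above (it skips the nested axes entries for simplicity; read_metadata is a hypothetical helper, not part of the tool):

```python
import re

def read_metadata(markdown: str) -> dict:
    """Extract top-level key: value pairs from the refine-metadata comment (sketch)."""
    m = re.search(r"<!-- refine-metadata\n(.*?)-->", markdown, re.DOTALL)
    if not m:
        return {}
    meta = {}
    for line in m.group(1).splitlines():
        if line.strip() == "axes:":          # stop before the per-axis sub-entries
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta
```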
pytest # run all tests
pytest -v # verbose output
pytest tests/test_auditor.py # just the convergence detector tests
├── refine/
│ ├── __init__.py
│ ├── cli.py # CLI entry point (Typer + Rich)
│ ├── config.py # Settings, auth resolution, env loading
│ ├── errors.py # Error types
│ ├── engine/
│ │ ├── loop.py # Core refinement loop orchestrator
│ │ ├── decomposer.py # Task decomposition (setup)
│ │ ├── generator.py # Generation + refinement
│ │ ├── evaluator.py # Multi-axis evaluation
│ │ ├── auditor.py # Convergence detection (deterministic)
│ │ ├── synthesiser.py # Final assembly (teardown)
│ │ ├── context.py # Context window management
│ │ ├── llm.py # LLM abstraction (PydanticAI + claude -p fallback)
│ │ └── templates.py # Jinja2 template rendering
│ ├── models/
│ │ ├── evaluation.py # EvaluationResult, AxisEvaluation
│ │ ├── decomposition.py # DecompositionResult, SubTask
│ │ ├── profile.py # Profile config + loading
│ │ ├── trace.py # Audit trail models
│ │ └── cost.py # Cost tracking + pricing
│ └── profiles/
│ ├── deep-research/ # Research synthesis profile
│ │ ├── profile.yaml
│ │ └── *.md # 5 prompt templates
│ └── copywriting/ # Short-form persuasive writing profile
│ ├── profile.yaml
│ └── *.md # 5 prompt templates
├── tests/
├── docs/
│ ├── spec-v2.md # Design specification
│ ├── research-manifest.md # Curated bookmark research
│ └── research-synthesis.md # Research extraction + synthesis
├── output/ # Generated content (gitignored)
├── .env.example # Environment template
├── pyproject.toml
├── LICENSE
└── README.md
MIT — see LICENSE.