# refine

A CLI tool that wraps LLM API calls in iterative refinement loops to produce higher-quality outputs for complex cognitive tasks.

Instead of a single prompt-and-response, refine decomposes a task, generates an initial response, evaluates it against task-specific quality criteria, and refines through targeted iteration — stopping when quality thresholds are met or the cost budget is exhausted.

```
Setup (decompose) → Loop (generate → evaluate → refine → audit) × N → Teardown (synthesise)
```

## Install

```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

## Configuration

Copy the example environment file and add your keys:

```shell
cp .env.example .env

# Required: Anthropic API key
ANTHROPIC_API_KEY=sk-ant-...

# Optional: Pinboard token for research phase (see below)
PINBOARD_TOKEN=username:TOKEN
```

## Authentication

refine resolves Anthropic credentials using a three-tier cascade:

| Priority | Source | How it's used |
|---|---|---|
| 1 | `ANTHROPIC_API_KEY` in `.env` or environment | Standard PydanticAI agent |
| 2 | `ANTHROPIC_AUTH_TOKEN` | PydanticAI via `AsyncAnthropic(auth_token=...)` |
| 3 | `claude` CLI on PATH | Shells out to `claude -p` (uses Claude Code's own auth) |

Inside Claude Code / Claude Desktop, auth is automatic: the tool detects the CLI and uses it as a fallback when no API key is set.

## Usage

```shell
refine [PROMPT] [OPTIONS]
```

### Research (deep-research profile, default)

```shell
# Research task with default settings
refine "Analyse the competitive landscape for AI code editors"

# Quick draft — fewer iterations, lower cost
refine "Pros and cons of Rust vs Go for CLI tools" --tier quick

# Thorough analysis — more iterations, higher quality targets
refine "Develop a market entry strategy for UK fintech" --tier thorough

# Save to file
refine "State of WebAssembly adoption in 2026" -o output/report.md
```

### Copywriting (copywriting profile)

```shell
# Elevator pitch
refine "Write a 100-word elevator pitch for a B2B SaaS product" --profile copywriting

# Multiple variants
refine "Write 3 tagline variants for a fintech app targeting Gen Z: one punchy, one warm, one investor-facing" --profile copywriting

# Quick first draft — generate + evaluate, no refinement
refine "Write a launch email subject line for our new API product" --profile copywriting --tier quick

# High-stakes copy with more refinement passes
refine "Write the hero copy for our Series B fundraise landing page" --profile copywriting --tier thorough -o output/hero-copy.md
```

### Working with files

```shell
# Refine an existing draft
refine --input draft.md

# Analyse a document with a specific question
refine "Evaluate this against current market conditions" --input business-plan.md
```

### Controlling cost and iterations

```shell
# Set a strict budget
refine "..." --max-budget-usd 1.00

# Disable budget limit entirely — run until quality or iteration cap
refine "..." --no-budget-limit

# Limit to 1 iteration (generate + evaluate only, no refinement)
refine "..." --max-iterations 1

# Use a cheaper model for all phases
refine "..." --model claude-haiku-4-5
```

**Budget behaviour:** When the budget is about to be exceeded, the tool prompts you to continue or stop. If you continue, the budget ceiling is raised to cover the next operation. This avoids losing work mid-run while keeping cost visibility. Use `--no-budget-limit` to skip the prompt entirely.

## Options

| Flag | Short | Description |
|---|---|---|
| `--profile` | `-p` | Refinement profile name (default: `deep-research`) |
| `--tier` | `-t` | Quality tier: `quick`, `standard`, `thorough` (default: `standard`) |
| `--input` | `-i` | Input file path (markdown, text) |
| `--output` | `-o` | Output file path (default: stdout) |
| `--model` | `-m` | Override all models (e.g. `claude-haiku-4-5`) |
| `--max-budget-usd` | | Override per-loop cost budget |
| `--no-budget-limit` | | Disable budget cap (run until quality or iteration cap) |
| `--max-iterations` | | Override maximum iteration count |
| `--dry-run` | | Show decomposition and cost estimate without executing |
| `--verbose` | `-v` | Show evaluation details and debug logging |
| `--quiet` | `-q` | Suppress progress, output result only |
| `--force` | | Skip pre-flight checks (input size warnings) |

## Research phase (optional)

Before running refine on a complex topic, you can optionally prime the prompt with curated research from your Pinboard bookmarks. This requires the Pinboard MCP server and a valid token.

### Setup

```shell
pip install pinboard-mcp-server
```

Add your Pinboard token to `.env`:

```shell
PINBOARD_TOKEN=username:TOKEN
```

Add the MCP server to your Claude Code config (`.claude/settings.local.json`):

```json
{
  "mcpServers": {
    "pinboard": {
      "command": "pinboard-mcp-server",
      "env": {
        "PINBOARD_TOKEN": "${PINBOARD_TOKEN}"
      }
    }
  }
}
```

### Workflow

The research phase is a manual step run inside Claude Code before invoking refine:

1. **Map tags** — Call `listTags` to find tags relevant to your topic
2. **Query bookmarks** — Call `listBookmarksByTags` for each relevant tag
3. **Search keywords** — Call `searchBookmarks` for key phrases the tag approach might miss
4. **Curate** — Deduplicate, rank by relevance, and save a manifest
5. **Extract** — Fetch each URL, extract key ideas, and write a synthesis
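The curate step (4) can be sketched as a small helper. The `url` and `relevance` field names here are assumptions for illustration, not part of the Pinboard MCP schema:

```python
def curate(bookmarks: list[dict], max_items: int = 20) -> list[dict]:
    """Illustrative curation: dedupe by URL (first occurrence wins),
    then rank by a pre-computed relevance score, descending."""
    seen: set[str] = set()
    unique: list[dict] = []
    for b in bookmarks:
        if b["url"] not in seen:
            seen.add(b["url"])
            unique.append(b)
    unique.sort(key=lambda b: b["relevance"], reverse=True)
    return unique[:max_items]
```

The surviving list is what you would write into the manifest before the extract step.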

The curated research can then be passed to refine as input:

```shell
refine "Design a monetisation strategy for a two-sided marketplace" \
  --input output/research-synthesis.md \
  --tier thorough \
  -o output/strategy.md
```

See `docs/research-manifest.md` and `docs/research-synthesis.md` for examples of this workflow's output.

## Profiles

Profiles define what the tool evaluates, how it decomposes tasks, and what "good" looks like. Each profile ships with its own axes, prompt templates, and budget defaults.

### deep-research (default)

Multi-source research synthesis with evidence grading. Decomposes via question tree, refines additively, produces structured reports.

| Axis | Weight | Hard-fail | What it measures |
|---|---|---|---|
| Coverage | 1.0 | No | Has the question space been fully explored? |
| Evidence Quality | 1.2 | Yes | Are claims supported by credible sources? |
| Coherence | 1.0 | No | Is the argument internally consistent? |
| Depth | 0.8 | No | Does the analysis go beyond the surface level? |
| Actionability | 0.8 | No | Can the reader act on the findings? |

### copywriting

Short-form persuasive writing: pitches, taglines, positioning, brand copy. Decomposes via constraint mapping (extracts rules from the brief, identifies variants), refines eliminatively (cuts rather than adds), produces copy directly with no strategic scaffolding.

| Axis | Weight | Hard-fail | What it measures |
|---|---|---|---|
| Precision | 1.2 | No | Every sentence earns its place: no padding, no filler |
| Constraint Compliance | 1.0 | Yes | Follows every hard rule in the brief (naming, claims, format) |
| Audience Fit | 1.0 | No | The target listener recognises their reality in this |
| Distinctiveness | 0.8 | No | Variants sound meaningfully different, not just reworded |
| Coherence | 0.8 | No | Tone is consistent, claims don't contradict each other |

Hard-fail axes must pass before the loop can exit, regardless of other scores.
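That exit rule can be sketched as a small predicate (all names here are illustrative, not refine's internals):

```python
def can_exit(scores: dict[str, float], targets: dict[str, float],
             hard_fail_axes: set[str],
             other_stop_reason: bool = False) -> bool:
    """Illustrative exit gate: hard-fail axes must hit their targets
    before the loop may exit, even if another stop condition fired;
    soft axes may fall short only when some other stop reason applies."""
    hard_ok = all(scores[a] >= targets[a] for a in hard_fail_axes)
    all_ok = all(scores[a] >= targets[a] for a in scores)
    return hard_ok and (all_ok or other_stop_reason)
```

So a copywriting run that aces every soft axis but violates a brief constraint (the hard-fail axis) keeps iterating.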

## Quality tiers

Tiers adjust targets, iteration caps, and budgets relative to the profile's base values:

| Tier | Targets | Max iterations | Budget | Use when |
|---|---|---|---|---|
| quick | Default − 1 (floor: 2) | 2 | 50% of base | Speed matters more than polish |
| standard | As defined in profile | base | 100% of base | Normal use |
| thorough | Default + 1 (cap: 4) | base × 1.5 | 200% of base | High-stakes output worth the spend |
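The adjustments in the table can be sketched as below; the clamping and rounding details (how the floor/cap interact, rounding base × 1.5 up) are assumptions:

```python
import math

# Per-tier adjustments, mirroring the table above.
TIERS = {
    "quick":    {"target_delta": -1, "iters": lambda n: 2,                  "budget": 0.5},
    "standard": {"target_delta": 0,  "iters": lambda n: n,                  "budget": 1.0},
    "thorough": {"target_delta": 1,  "iters": lambda n: math.ceil(n * 1.5), "budget": 2.0},
}


def apply_tier(tier: str, base_targets: dict[str, int],
               base_iterations: int, base_budget_usd: float):
    """Illustrative tier adjustment: shift per-axis targets (clamped
    to the 2..4 range), cap iterations, and scale the budget."""
    t = TIERS[tier]
    targets = {axis: min(4, max(2, v + t["target_delta"]))
               for axis, v in base_targets.items()}
    return targets, t["iters"](base_iterations), base_budget_usd * t["budget"]
```

For example, a `thorough` run of a profile with target 3 and 2 base iterations would aim for 4/4 on each axis over up to 3 iterations, with double the base budget.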

## How it works

Each run follows three phases:

1. **Decomposition** — The prompt is broken into sub-tasks using a task-specific strategy (question tree for research, constraint mapping for copywriting).
2. **Iterative loop** — For each iteration:
   - **Generate** content for each sub-task (or refine the previous iteration's output)
   - **Evaluate** against multi-axis quality criteria
   - **Audit** — a deterministic check: do all axes pass? Is there budget left? Are returns diminishing? Continue or stop.
3. **Synthesis** — Sub-task outputs are assembled into a coherent final document.

The loop terminates when:

1. All quality axes meet their targets
2. The cost budget is exhausted (you'll be prompted to continue or stop)
3. Improvement plateaus for 2 consecutive iterations
4. The maximum iteration count is reached
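The loop and its four termination conditions can be sketched as below. All parameter names are assumptions for illustration; the real orchestrator lives in `refine/engine/loop.py`:

```python
def refinement_loop(generate, evaluate, max_iterations: int, budget_ok,
                    plateau_window: int = 2, epsilon: float = 0.01):
    """Illustrative control flow: generate/refine, evaluate, audit.

    `generate(draft)` returns a new draft (draft is None on iteration 1);
    `evaluate(draft)` returns {'score': float, 'all_axes_pass': bool};
    `budget_ok()` reports whether there is budget left.
    """
    draft, history = None, []
    for _ in range(max_iterations):
        draft = generate(draft)                 # generate or refine
        result = evaluate(draft)                # multi-axis evaluation
        history.append(result["score"])
        if result["all_axes_pass"]:             # condition 1: quality met
            return draft, "quality_threshold_reached"
        if not budget_ok():                     # condition 2: out of budget
            return draft, "budget_exhausted"
        if (len(history) > plateau_window and   # condition 3: plateau
                history[-1] - history[-1 - plateau_window] < epsilon):
            return draft, "plateau"
    return draft, "max_iterations"              # condition 4: iteration cap
```

The audit step is deliberately deterministic (plain comparisons, no LLM call), so a run's stopping behaviour is reproducible given the same scores.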

## Output

All generated content should be saved to the `output/` directory (gitignored by default).

The final output is markdown with a metadata block (an HTML comment, so it doesn't render):

```html
<!-- refine-metadata
profile: deep-research
tier: standard
iterations: 2
termination: quality_threshold_reached
cost_usd: 0.78
duration_seconds: 8.1
axes:
  coverage: 3/3 ✓
  evidence_quality: 3/3 ✓
-->
```

Every run also writes a full JSON audit trail to `.refine/logs/`.
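If you want the run stats programmatically, the metadata comment can be pulled out of a saved report with a few lines. This is a sketch (`read_metadata` is a hypothetical helper, and nested `axes` entries come back as flat strings):

```python
import re


def read_metadata(markdown: str) -> dict[str, str]:
    """Extract the refine-metadata comment's 'key: value' lines into
    a flat dict of strings (no YAML parsing, just line splitting)."""
    m = re.search(r"<!-- refine-metadata\n(.*?)-->", markdown, re.DOTALL)
    if not m:
        return {}
    meta: dict[str, str] = {}
    for line in m.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta
```

For anything beyond a quick script, prefer the JSON audit trail in `.refine/logs/`, which is structured data rather than a comment convention.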

## Tests

```shell
pytest                         # run all tests
pytest -v                      # verbose output
pytest tests/test_auditor.py   # just the convergence detector tests
```

## Project structure

```
.
├── refine/
│   ├── __init__.py
│   ├── cli.py                    # CLI entry point (Typer + Rich)
│   ├── config.py                 # Settings, auth resolution, env loading
│   ├── errors.py                 # Error types
│   ├── engine/
│   │   ├── loop.py               # Core refinement loop orchestrator
│   │   ├── decomposer.py         # Task decomposition (setup)
│   │   ├── generator.py          # Generation + refinement
│   │   ├── evaluator.py          # Multi-axis evaluation
│   │   ├── auditor.py            # Convergence detection (deterministic)
│   │   ├── synthesiser.py        # Final assembly (teardown)
│   │   ├── context.py            # Context window management
│   │   ├── llm.py                # LLM abstraction (PydanticAI + claude -p fallback)
│   │   └── templates.py          # Jinja2 template rendering
│   ├── models/
│   │   ├── evaluation.py         # EvaluationResult, AxisEvaluation
│   │   ├── decomposition.py      # DecompositionResult, SubTask
│   │   ├── profile.py            # Profile config + loading
│   │   ├── trace.py              # Audit trail models
│   │   └── cost.py               # Cost tracking + pricing
│   └── profiles/
│       ├── deep-research/        # Research synthesis profile
│       │   ├── profile.yaml
│       │   └── *.md              # 5 prompt templates
│       └── copywriting/          # Short-form persuasive writing profile
│           ├── profile.yaml
│           └── *.md              # 5 prompt templates
├── tests/
├── docs/
│   ├── spec-v2.md                # Design specification
│   ├── research-manifest.md      # Curated bookmark research
│   └── research-synthesis.md     # Research extraction + synthesis
├── output/                       # Generated content (gitignored)
├── .env.example                  # Environment template
├── pyproject.toml
├── LICENSE
└── README.md
```

## Licence

MIT — see LICENSE.
