Inspired by Karpathy's autoresearch, which showed that an AI agent can run experiments autonomously overnight with nothing but a markdown file and a training script, research-loop tries to add guardrails to that loop in as lightweight a way as possible: it uses protomcp workflow primitives to enforce experiment discipline through tool visibility rather than prompt instructions.
- protomcp → workflow/tool/resource/prompt primitives
- research-loop → experiment discipline (this plugin)
- your domain → what "run" means, what "evaluate" returns
The plugin registers a protomcp workflow with visibility-controlled steps:
propose → setup_and_run → conclude → [promote | discard]
The agent only sees the next valid step. It can't run without a hypothesis. It can't start a new experiment without concluding the last one. It can't skip evaluation. The tools enforce this — no prompt engineering required.
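The gating idea reduces to a small state machine in which each phase exposes only the tools that are valid next. The sketch below is illustrative of the concept, not the plugin's actual internals (`callTool` and the phase names are hypothetical):

```typescript
// Each phase exposes exactly the tools that are valid next.
type Phase = 'idle' | 'proposed' | 'ran' | 'concluded';

const visibleTools: Record<Phase, string[]> = {
  idle: ['propose'],
  proposed: ['setup_and_run'],
  ran: ['conclude'],
  concluded: ['promote', 'discard'],
};

let phase: Phase = 'idle';

function callTool(name: string): void {
  if (!visibleTools[phase].includes(name)) {
    throw new Error(`"${name}" is not available in phase "${phase}"`);
  }
  // Advance the phase; promote and discard both return to idle.
  const next: Record<string, Phase> = {
    propose: 'proposed',
    setup_and_run: 'ran',
    conclude: 'concluded',
    promote: 'idle',
    discard: 'idle',
  };
  phase = next[name];
}
```

Because an out-of-order call simply is not visible (here, it throws), the discipline needs no prompt-side reinforcement.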
```sh
# Prerequisites: node 20+, protomcp (pmcp CLI)
git clone https://github.com/msilverblatt/research-loop
cd research-loop

# protomcp must be available as a local dependency:
# clone it alongside or adjust package.json
npm install
npx tsc

# Run the iris example
./bin/run-loop.sh examples/iris 5 "Maximize accuracy on Iris classification"
```

A domain server registers researchLoop() with hooks that define what experimentation means in your domain.
Minimal setup: the agent edits code directly, and a single hook executes the run.
```ts
import { run } from 'protomcp';
import { researchLoop } from 'research-loop';
import { execSync } from 'node:child_process';

researchLoop({
  name: 'experiment',
  hooks: {
    onRun: () => {
      const output = execSync('python3 train.py', { encoding: 'utf-8' });
      return JSON.parse(output.trim());
    },
  },
});

run();
```

Adding more hooks tightens the loop with structural guarantees:

```ts
researchLoop({
  name: 'experiment',
  hooks: {
    onRun: () => execTrainingScript(),
    detectChange: () => fileChanged('train.py'),
    onEvaluate: (result) => compareToBaseline(result),
    onPromote: (exp) => updateBaseline(exp),
    onLog: (exp) => appendToJournal(exp),
    getBaseline: () => loadBaseline(),
  },
});
```

Domain-extended schemas add custom fields, and custom tools remain visible during the workflow:
```ts
import { z } from 'zod';

researchLoop({
  name: 'ml-experiment',
  runSchema: z.object({
    overlay: z.record(z.any()),
    primary_metric: z.string().default('brier'),
  }),
  proposeSchema: z.object({
    phase: z.enum(['eda', 'feature_eng', 'tuning']).optional(),
  }),
  concludeSchema: z.object({
    next_steps: z.array(z.string()).optional(),
  }),
  hooks: {
    onRun: (args) => pipeline.backtest(args.overlay),
    detectChange: () => overlay.hasChanges(),
    onEvaluate: (result) => metrics.compare(result, baseline),
    onCompare: (baseline, experiment) => metrics.format(baseline, experiment),
    onPromote: (exp) => config.mergeOverlay(exp),
    onDiscard: () => config.resetOverlay(),
    onLog: (exp) => journal.append(exp),
    getBaseline: () => loadBaselineMetrics(),
  },
  allowDuring: ['pipeline.*', 'data.*', 'models.*'],
});
```

| Hook | Required | Default | Purpose |
|---|---|---|---|
| `onRun` | yes | — | Execute the experiment |
| `detectChange` | no | `true` | Is something different from baseline? |
| `onPropose` | no | no-op | Validate or enrich the hypothesis |
| `onEvaluate` | no | no-op | Structured evaluation of results |
| `onCompare` | no | no-op | Format baseline vs experiment comparison |
| `onPromote` | no | no-op | What happens when changes are adopted |
| `onDiscard` | no | no-op | Cleanup on discard |
| `onLog` | no | no-op | Persist the experiment record |
| `getBaseline` | no | static | Resolve current baseline |
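Since only `onRun` is required, the other hooks collapse to safe defaults. A minimal sketch of how that resolution can work (the `withDefaults` helper and this trimmed `Hooks` shape are illustrative, not the plugin's API):

```typescript
// A trimmed-down hooks interface for illustration.
interface Hooks {
  onRun: () => unknown;                       // required
  detectChange?: () => boolean;               // default: always true
  onEvaluate?: (result: unknown) => unknown;  // default: no-op
  onLog?: (exp: unknown) => void;             // default: no-op
}

// Fill in defaults, letting caller-supplied hooks win via spread.
function withDefaults(hooks: Hooks): Required<Hooks> {
  return {
    detectChange: () => true,
    onEvaluate: () => undefined,
    onLog: () => undefined,
    ...hooks,
  };
}
```

This is the "progressive rigor" shape: omitting a hook means trusting the agent, while supplying one adds a structural guarantee.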
The plugin auto-registers MCP resources the agent can read:
| URI | Description |
|---|---|
| `experiment://current` | Active experiment state |
| `experiment://history` | Previous experiments (this session) |
| `experiment://baseline` | Current baseline (if `getBaseline` provided) |
| `experiment://comparison` | Baseline vs results (if `onCompare` provided) |
Domain servers can register additional resources under their own namespace.
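For a rough sense of what a read of `experiment://current` might return, here is a hypothetical record shape (field names are assumptions for illustration; the real state lives in `src/state.ts` and may differ):

```typescript
// Hypothetical shape of the active-experiment record.
interface ExperimentRecord {
  hypothesis: string;
  status: 'proposed' | 'ran' | 'concluded';
  result?: Record<string, unknown>;   // whatever onRun returned
  verdict?: 'keep' | 'discard';       // agent-provided at conclude time
}

// Example state mid-loop: the run has finished, no verdict yet.
const current: ExperimentRecord = {
  hypothesis: 'Increasing LR to 0.04 will reduce val_bpb',
  status: 'ran',
  result: { accuracy: 0.9733, f1: 0.9733 },
};
```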
`bin/run-loop.sh` automates experimentation via `claude -p`:

```sh
./bin/run-loop.sh <directory> [num_experiments] [goal]
```

- Launches claude with the MCP server
- Agent runs experiments until killed
- Script watches the experiment log and kills claude when the target is reached
- If claude dies early, relaunches it
- Prints a summary table at the end
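The watch-and-relaunch pattern the script implements can be sketched like this (the log name, target, and the stand-in "agent" are placeholders, not the real script's details; `bin/run-loop.sh` is the actual implementation):

```shell
#!/usr/bin/env sh
# Sketch of a supervision loop: launch the agent, watch a log file,
# relaunch on early exit, stop once the target count is reached.
# The stand-in agent appends one line per second and exits early,
# which forces a relaunch.
LOG="experiments.log"
TARGET=3
: > "$LOG"

run_agent() {
  ( for i in 1 2; do echo "experiment $i" >> "$LOG"; sleep 1; done ) &
  pid=$!
}

run_agent
while sleep 1; do
  count=$(wc -l < "$LOG")
  if [ "$count" -ge "$TARGET" ]; then
    kill "$pid" 2>/dev/null   # target reached: stop the agent
    break
  fi
  kill -0 "$pid" 2>/dev/null || run_agent   # relaunch if it died early
done
echo "reached $count experiment(s)"
```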
Live output streams from the server via stderr:
```
> experiment.propose(hypothesis="Increasing LR to 0.04...")
HYPOTHESIS: Increasing LR to 0.04 will reduce val_bpb...
> experiment.setup_and_run()
> experiment.conclude(verdict="keep")
RESULT: accuracy=0.9733 f1=0.9733 model=Pipeline
VERDICT: keep
> experiment.promote()
```
- The plugin knows the scientific method; the domain knows the science. It enforces hypothesis → run → conclude and never interprets what any of those mean.
- Progressive rigor without progressive complexity: defaults trust the agent, overrides add structural guarantees.
- The agent only sees MCP tools. The workflow controls visibility, so no prompt engineering is needed.
- Failed experiments are successful tasks: the verdict is agent-provided, and a negative result with insight is valuable.
- The plugin doesn't own persistence: `onLog` is a hook, and the domain decides storage.
```
src/
  index.ts     — Public API (researchLoop, types)
  types.ts     — Interfaces and constants
  schema.ts    — Schema merging with collision detection
  state.ts     — In-memory experiment state manager
  gates.ts     — Discipline gate enforcement
  resources.ts — Plugin-managed MCP resources
  prompts.ts   — Plugin-provided MCP prompts
  loop.ts      — researchLoop() composition root
tests/         — Unit and e2e tests (vitest)
bin/           — Runner script
examples/iris/ — Minimal working example
```