Inspired by Karpathy's autoresearch, which showed that an AI agent can run experiments autonomously overnight with nothing but a markdown file and a training script, research-loop tries to add guardrails to that loop in as lightweight a way as possible: it uses protomcp workflow primitives to enforce experiment discipline through tool visibility rather than prompt instructions.
- protomcp → workflow/tool/resource/prompt primitives
- research-loop → experiment discipline (this plugin)
- your domain → what "run" means, what "evaluate" returns
The plugin registers a protomcp workflow with visibility-controlled steps:
propose → setup_and_run → conclude → [promote | discard]
The agent only sees the next valid step. It can't run without a hypothesis. It can't start a new experiment without concluding the last one. It can't skip evaluation. The tools enforce this — no prompt engineering required.
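The gating idea reduces to a small state machine in which each phase exposes only the tools that are valid next. The sketch below is illustrative of the concept, not the plugin's actual internals (`callTool` and the phase names are hypothetical):

```typescript
// Each phase exposes exactly the tools that are valid next.
type Phase = 'idle' | 'proposed' | 'ran' | 'concluded';

const visibleTools: Record<Phase, string[]> = {
  idle: ['propose'],
  proposed: ['setup_and_run'],
  ran: ['conclude'],
  concluded: ['promote', 'discard'],
};

let phase: Phase = 'idle';

function callTool(name: string): void {
  if (!visibleTools[phase].includes(name)) {
    throw new Error(`"${name}" is not available in phase "${phase}"`);
  }
  // Advance the phase; promote and discard both return to idle.
  const next: Record<string, Phase> = {
    propose: 'proposed',
    setup_and_run: 'ran',
    conclude: 'concluded',
    promote: 'idle',
    discard: 'idle',
  };
  phase = next[name];
}
```

Because an out-of-order call simply is not visible (here, it throws), the discipline needs no prompt-side reinforcement.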
```sh
# Prerequisites: node 20+, protomcp (pmcp CLI)
git clone https://github.com/msilverblatt/research-loop
cd research-loop

# protomcp must be available as a local dependency:
# clone it alongside or adjust package.json
npm install
npx tsc

# Run the iris example
./bin/run-loop.sh examples/iris 5 "Maximize accuracy on Iris classification"
```

A domain server registers researchLoop() with hooks that define what experimentation means in your domain.
Minimal setup: the agent edits code directly, and a single hook executes the run.
```ts
import { run } from 'protomcp';
import { researchLoop } from 'research-loop';
import { execSync } from 'node:child_process';

researchLoop({
  name: 'experiment',
  hooks: {
    onRun: () => {
      const output = execSync('python3 train.py', { encoding: 'utf-8' });
      return JSON.parse(output.trim());
    },
  },
});

run();
```

Adding more hooks tightens the loop with structural guarantees:

```ts
researchLoop({
  name: 'experiment',
  hooks: {
    onRun: () => execTrainingScript(),
    detectChange: () => fileChanged('train.py'),
    onEvaluate: (result) => compareToBaseline(result),
    onPromote: (exp) => updateBaseline(exp),
    onLog: (exp) => appendToJournal(exp),
    getBaseline: () => loadBaseline(),
  },
});
```

Domain-extended schemas add custom fields, and custom tools remain visible during the workflow:
```ts
import { z } from 'zod';

researchLoop({
  name: 'ml-experiment',
  runSchema: z.object({
    overlay: z.record(z.any()),
    primary_metric: z.string().default('brier'),
  }),
  proposeSchema: z.object({
    phase: z.enum(['eda', 'feature_eng', 'tuning']).optional(),
  }),
  concludeSchema: z.object({
    next_steps: z.array(z.string()).optional(),
  }),
  hooks: {
    onRun: (args) => pipeline.backtest(args.overlay),
    detectChange: () => overlay.hasChanges(),
    onEvaluate: (result) => metrics.compare(result, baseline),
    onCompare: (baseline, experiment) => metrics.format(baseline, experiment),
    onPromote: (exp) => config.mergeOverlay(exp),
    onDiscard: () => config.resetOverlay(),
    onLog: (exp) => journal.append(exp),
    getBaseline: () => loadBaselineMetrics(),
  },
  allowDuring: ['pipeline.*', 'data.*', 'models.*'],
});
```

| Hook | Required | Default | Purpose |
|---|---|---|---|
| `onRun` | yes | — | Execute the experiment |
| `detectChange` | no | `true` | Is something different from baseline? |
| `onPropose` | no | no-op | Validate or enrich the hypothesis |
| `onEvaluate` | no | no-op | Structured evaluation of results |
| `onCompare` | no | no-op | Format baseline vs experiment comparison |
| `onPromote` | no | no-op | What happens when changes are adopted |
| `onDiscard` | no | no-op | Cleanup on discard |
| `onLog` | no | no-op | Persist the experiment record |
| `getBaseline` | no | static | Resolve current baseline |
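Since only `onRun` is required, the other hooks collapse to safe defaults. A minimal sketch of how that resolution can work (the `withDefaults` helper and this trimmed `Hooks` shape are illustrative, not the plugin's API):

```typescript
// A trimmed-down hooks interface for illustration.
interface Hooks {
  onRun: () => unknown;                       // required
  detectChange?: () => boolean;               // default: always true
  onEvaluate?: (result: unknown) => unknown;  // default: no-op
  onLog?: (exp: unknown) => void;             // default: no-op
}

// Fill in defaults, letting caller-supplied hooks win via spread.
function withDefaults(hooks: Hooks): Required<Hooks> {
  return {
    detectChange: () => true,
    onEvaluate: () => undefined,
    onLog: () => undefined,
    ...hooks,
  };
}
```

This is the "progressive rigor" shape: omitting a hook means trusting the agent, while supplying one adds a structural guarantee.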
The plugin auto-registers MCP resources the agent can read:
| URI | Description |
|---|---|
| `experiment://current` | Active experiment state |
| `experiment://history` | Previous experiments (this session) |
| `experiment://baseline` | Current baseline (if `getBaseline` provided) |
| `experiment://comparison` | Baseline vs results (if `onCompare` provided) |
Domain servers can register additional resources under their own namespace.
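For a rough sense of what a read of `experiment://current` might return, here is a hypothetical record shape (field names are assumptions for illustration; the real state lives in `src/state.ts` and may differ):

```typescript
// Hypothetical shape of the active-experiment record.
interface ExperimentRecord {
  hypothesis: string;
  status: 'proposed' | 'ran' | 'concluded';
  result?: Record<string, unknown>;   // whatever onRun returned
  verdict?: 'keep' | 'discard';       // agent-provided at conclude time
}

// Example state mid-loop: the run has finished, no verdict yet.
const current: ExperimentRecord = {
  hypothesis: 'Increasing LR to 0.04 will reduce val_bpb',
  status: 'ran',
  result: { accuracy: 0.9733, f1: 0.9733 },
};
```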
`bin/run-loop.sh` automates experimentation via `claude -p`:

```sh
./bin/run-loop.sh <directory> [num_experiments] [goal]
```

- Launches claude with the MCP server
- Agent runs experiments until killed
- Script watches the experiment log and kills claude when the target is reached
- If claude dies early, relaunches it
- Prints a summary table at the end
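The watch-and-relaunch pattern the script implements can be sketched like this (the log name, target, and the stand-in "agent" are placeholders, not the real script's details; `bin/run-loop.sh` is the actual implementation):

```shell
#!/usr/bin/env sh
# Sketch of a supervision loop: launch the agent, watch a log file,
# relaunch on early exit, stop once the target count is reached.
# The stand-in agent appends one line per second and exits early,
# which forces a relaunch.
LOG="experiments.log"
TARGET=3
: > "$LOG"

run_agent() {
  ( for i in 1 2; do echo "experiment $i" >> "$LOG"; sleep 1; done ) &
  pid=$!
}

run_agent
while sleep 1; do
  count=$(wc -l < "$LOG")
  if [ "$count" -ge "$TARGET" ]; then
    kill "$pid" 2>/dev/null   # target reached: stop the agent
    break
  fi
  kill -0 "$pid" 2>/dev/null || run_agent   # relaunch if it died early
done
echo "reached $count experiment(s)"
```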
Live output streams from the server via stderr:
```
> experiment.propose(hypothesis="Increasing LR to 0.04...")
HYPOTHESIS: Increasing LR to 0.04 will reduce val_bpb...
> experiment.setup_and_run()
> experiment.conclude(verdict="keep")
RESULT: accuracy=0.9733 f1=0.9733 model=Pipeline
VERDICT: keep
> experiment.promote()
```
- The plugin knows the scientific method; the domain knows the science. It enforces hypothesis → run → conclude and never interprets what any of those mean.
- Progressive rigor without progressive complexity: defaults trust the agent, overrides add structural guarantees.
- The agent only sees MCP tools. The workflow controls visibility, so no prompt engineering is needed.
- Failed experiments are successful tasks: the verdict is agent-provided, and a negative result with insight is valuable.
- The plugin doesn't own persistence: `onLog` is a hook, and the domain decides storage.
```
src/
  index.ts     — Public API (researchLoop, types)
  types.ts     — Interfaces and constants
  schema.ts    — Schema merging with collision detection
  state.ts     — In-memory experiment state manager
  gates.ts     — Discipline gate enforcement
  resources.ts — Plugin-managed MCP resources
  prompts.ts   — Plugin-provided MCP prompts
  loop.ts      — researchLoop() composition root
tests/         — Unit and e2e tests (vitest)
bin/           — Runner script
examples/iris/ — Minimal working example
```