# research-loop

Inspired by Karpathy's autoresearch, which showed that an AI agent can run experiments autonomously overnight with nothing more than a markdown file and a training script. research-loop adds lightweight guardrails to that loop, using protomcp workflow primitives to enforce experiment discipline through tool visibility rather than prompt instructions.

```
protomcp          → workflow/tool/resource/prompt primitives
research-loop     → experiment discipline (this plugin)
your domain       → what "run" means, what "evaluate" returns
```

## What it does

The plugin registers a protomcp workflow with visibility-controlled steps:

```
propose → setup_and_run → conclude → [promote | discard]
```

The agent only sees the next valid step. It can't run without a hypothesis. It can't start a new experiment without concluding the last one. It can't skip evaluation. The tools enforce this; no prompt engineering is required.
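The gating above can be sketched as a small state machine in which each state exposes exactly one set of next steps. This is illustrative only; `visibleTools`, the state names, and the table are not part of the plugin's actual API, which drives this through protomcp workflow steps:

```ts
// Illustrative sketch: each workflow state exposes only the valid next tools.
type LoopState = 'idle' | 'proposed' | 'ran' | 'concluded';

const NEXT: Record<LoopState, string[]> = {
  idle: ['propose'],                 // must state a hypothesis first
  proposed: ['setup_and_run'],       // can't propose again before running
  ran: ['conclude'],                 // evaluation cannot be skipped
  concluded: ['promote', 'discard'], // close out before a new proposal
};

function visibleTools(state: LoopState): string[] {
  return NEXT[state];
}
```

Because the agent discovers tools dynamically over MCP, hiding a tool is equivalent to forbidding the action, which is why no prompt instructions are needed.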

## Quick start

```sh
# Prerequisites: node 20+, protomcp (pmcp CLI)
git clone https://github.com/msilverblatt/research-loop
cd research-loop

# protomcp must be available as a local dependency:
# clone it alongside this repo or adjust package.json
npm install
npx tsc

# Run the iris example
./bin/run-loop.sh examples/iris 5 "Maximize accuracy on Iris classification"
```

## Building a domain server

A domain server registers researchLoop() with hooks that define what experimentation means in your domain.

### Minimum viable (Karpathy-style)

The agent edits code directly. One hook.

```ts
import { run } from 'protomcp';
import { researchLoop } from 'research-loop';
import { execSync } from 'node:child_process';

researchLoop({
  name: 'experiment',
  hooks: {
    onRun: () => {
      const output = execSync('python3 train.py', { encoding: 'utf-8' });
      return JSON.parse(output.trim());
    },
  },
});

run();
```
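Note that `execSync` throws on a non-zero exit, and `JSON.parse` will throw if `train.py` prints log lines before its JSON result. A slightly hardened variant of the hook takes only the final non-empty stdout line; the last-line convention and the helper name are assumptions for this sketch, not part of the plugin:

```ts
import { execSync } from 'node:child_process';

// Pull the final non-empty line out of stdout and parse it as JSON,
// tolerating training logs printed before the result line.
function parseLastJsonLine(stdout: string): Record<string, unknown> {
  const lines = stdout.trim().split('\n').filter((l) => l.trim() !== '');
  const last = lines[lines.length - 1];
  try {
    return JSON.parse(last);
  } catch {
    throw new Error(`expected final line to be JSON, got: ${last}`);
  }
}

// The hardened onRun hook.
const onRun = () =>
  parseLastJsonLine(execSync('python3 train.py', { encoding: 'utf-8' }));
```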

### With change detection and evaluation

```ts
researchLoop({
  name: 'experiment',
  hooks: {
    onRun: () => execTrainingScript(),
    detectChange: () => fileChanged('train.py'),
    onEvaluate: (result) => compareToBaseline(result),
    onPromote: (exp) => updateBaseline(exp),
    onLog: (exp) => appendToJournal(exp),
    getBaseline: () => loadBaseline(),
  },
});
```

### Full setup (harness-ml style)

Domain-extended schemas, custom tools visible during the workflow.

```ts
import { z } from 'zod';

researchLoop({
  name: 'ml-experiment',
  runSchema: z.object({
    overlay: z.record(z.any()),
    primary_metric: z.string().default('brier'),
  }),
  proposeSchema: z.object({
    phase: z.enum(['eda', 'feature_eng', 'tuning']).optional(),
  }),
  concludeSchema: z.object({
    next_steps: z.array(z.string()).optional(),
  }),
  hooks: {
    onRun: (args) => pipeline.backtest(args.overlay),
    detectChange: () => overlay.hasChanges(),
    onEvaluate: (result) => metrics.compare(result, baseline),
    onCompare: (baseline, experiment) => metrics.format(baseline, experiment),
    onPromote: (exp) => config.mergeOverlay(exp),
    onDiscard: () => config.resetOverlay(),
    onLog: (exp) => journal.append(exp),
    getBaseline: () => loadBaselineMetrics(),
  },
  allowDuring: ['pipeline.*', 'data.*', 'models.*'],
});
```
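`allowDuring` takes glob-style tool-name patterns like `pipeline.*`. The plugin's actual matcher isn't shown in this README; a prefix-wildcard matcher of the kind the patterns imply might look like this (illustrative only, handling just a trailing `*`):

```ts
// Match a tool name like 'pipeline.backtest' against a pattern like 'pipeline.*'.
// Only a trailing '.*' wildcard is handled here; a sketch, not the real matcher.
function matchesPattern(toolName: string, pattern: string): boolean {
  if (pattern.endsWith('.*')) {
    return toolName.startsWith(pattern.slice(0, -1)); // keep the trailing dot
  }
  return toolName === pattern;
}

function isAllowedDuring(toolName: string, allowDuring: string[]): boolean {
  return allowDuring.some((p) => matchesPattern(toolName, p));
}
```

Tools that don't match any pattern stay hidden while an experiment is in flight, so the agent can explore data or models mid-experiment without being able to, say, rewrite configuration out of band.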

## The hook contract

| Hook | Required | Default | Purpose |
|------|----------|---------|---------|
| `onRun` | yes | | Execute the experiment |
| `detectChange` | no | `true` | Is something different from baseline? |
| `onPropose` | no | no-op | Validate or enrich the hypothesis |
| `onEvaluate` | no | no-op | Structured evaluation of results |
| `onCompare` | no | no-op | Format baseline vs. experiment comparison |
| `onPromote` | no | no-op | What happens when changes are adopted |
| `onDiscard` | no | no-op | Cleanup on discard |
| `onLog` | no | no-op | Persist the experiment record |
| `getBaseline` | no | static | Resolve the current baseline |
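Restated as a TypeScript interface, the contract might look as follows. This is a sketch derived from the table; the parameter and return types are assumptions, and the actual types exported from `src/types.ts` may differ:

```ts
// Sketch of the hook contract. Field names follow the table above;
// the generic parameters and signatures are illustrative assumptions.
interface ResearchLoopHooks<Result = unknown, Experiment = unknown> {
  onRun: (args?: Record<string, unknown>) => Result;        // required
  detectChange?: () => boolean;                             // default: true
  onPropose?: (hypothesis: string) => void;                 // default: no-op
  onEvaluate?: (result: Result) => unknown;                 // default: no-op
  onCompare?: (baseline: unknown, experiment: Result) => string;
  onPromote?: (exp: Experiment) => void;                    // default: no-op
  onDiscard?: () => void;                                   // default: no-op
  onLog?: (exp: Experiment) => void;                        // default: no-op
  getBaseline?: () => unknown;                              // default: static
}
```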

## Resources

The plugin auto-registers MCP resources the agent can read:

| URI | Description |
|-----|-------------|
| `experiment://current` | Active experiment state |
| `experiment://history` | Previous experiments (this session) |
| `experiment://baseline` | Current baseline (if `getBaseline` is provided) |
| `experiment://comparison` | Baseline vs. results (if `onCompare` is provided) |

Domain servers can register additional resources under their own namespace.

## The runner script

`bin/run-loop.sh` automates experimentation via `claude -p`:

```sh
./bin/run-loop.sh <directory> [num_experiments] [goal]
```

- Launches `claude` with the MCP server
- The agent runs experiments until killed
- The script watches the experiment log and kills `claude` when the target is reached
- If `claude` dies early, it is relaunched
- Prints a summary table at the end

Live output streams from the server via stderr:

```
> experiment.propose(hypothesis="Increasing LR to 0.04...")
HYPOTHESIS: Increasing LR to 0.04 will reduce val_bpb...
> experiment.setup_and_run()
> experiment.conclude(verdict="keep")
RESULT:  accuracy=0.9733  f1=0.9733  model=Pipeline
VERDICT: keep
> experiment.promote()
```

## Design principles

  1. The plugin knows the scientific method, the domain knows the science. The plugin enforces hypothesis → run → conclude. It never interprets what any of those mean.

  2. Progressive rigor without progressive complexity. Defaults trust the agent. Overrides add structural guarantees.

  3. The agent only sees MCP tools. The workflow controls visibility. No prompt engineering.

  4. Failed experiments are successful tasks. The verdict is agent-provided. A negative result with insight is valuable.

  5. The plugin doesn't own persistence. onLog is a hook. The domain decides storage.

## Project structure

```
src/
  index.ts      — Public API (researchLoop, types)
  types.ts      — Interfaces and constants
  schema.ts     — Schema merging with collision detection
  state.ts      — In-memory experiment state manager
  gates.ts      — Discipline gate enforcement
  resources.ts  — Plugin-managed MCP resources
  prompts.ts    — Plugin-provided MCP prompts
  loop.ts       — researchLoop() composition root

tests/          — Unit and e2e tests (vitest)
bin/            — Runner script
examples/iris/  — Minimal working example
```

