
IdeaForge

License: MIT · Python 3.10+ · Claude Code

AI-powered research idea generation through adversarial multi-agent debate. IdeaForge crawls papers from top ML conferences (ICLR, NeurIPS, ICML) via OpenReview and ships a pre-trained AI reviewer calibrated on 50K+ real peer reviews through prompt evolution (GEPA). A 3-agent adversarial debate loop (Critic-Proposer-Judge) then iteratively develops and stress-tests research ideas until the calibrated judge scores them at publication-quality levels.

Unlike code-generation tools or literature review assistants, IdeaForge focuses on the hardest part of research: coming up with novel, publishable ideas and stress-testing them against realistic peer review before you invest months of execution time.

Key capabilities:

  • Automated idea generation — generates research ideas from scratch or from seed papers in any ML domain
  • Multi-agent adversarial refinement — 3 stateful Claude agents (Critic, Proposer, Judge) debate for up to 40 rounds
  • Calibrated AI judge — GEPA-optimized prompt achieves 91.65% accuracy against real reviewer scores (within ±1.5, verified)
  • OpenReview integration — crawls papers and reviews from ICLR, NeurIPS, ICML via the OpenReview API
  • FAISS-based paper retrieval — pre-built 50K-paper index ships with repo (via Git LFS), works immediately
  • Full reproducibility — one-command setup, pre-trained judge + FAISS index included, all skill files and training artifacts in repo

Disclaimer: All research ideas generated by this system are AI-generated and have not been vetted by human researchers. They are intended as starting points for exploration, not as validated research plans. The scores reflect an AI judge's assessment calibrated against real reviews, but do not guarantee publishability or correctness.

System Overview

                    ┌──────────────┐
                    │ Paper Crawl  │  ICLR / NeurIPS / ICML
                    │ + Reviews    │  ~1K default, ~50K full
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Judge Train  │  GEPA optimization
                    │ (91.65% acc) │  689 evals, 2 stages
                    └──────┬───────┘
                           │
           ┌───────────────▼────────────────┐
           │    Adversarial Idea Refiner    │
           │                                │
           │  Critic ──► Proposer ──► Judge │
           │    │                      │    │
           │    └─── rotate ◄─── score ≥8   │
           └───────────────┬────────────────┘
                           │
                    ┌──────▼───────┐
                    │  Refined     │
                    │  Research    │  Scored 4-9.5/10
                    │  Ideas       │
                    └──────────────┘

Components

1. Conference Paper Crawlers

Crawl papers and reviews from top ML venues using the OpenReview API.

# Crawl ICLR papers
python data_pipeline/openreview_crawler.py --year 2025

# Crawl ICML papers
python crawl_icml.py

# Crawl NeurIPS papers
python crawl_neurips.py
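
Under the hood these scripts query the OpenReview API. As a rough illustration (not the repo's crawler; the venue ID, field access, and CSV output here are assumptions), the equivalent query with the openreview-py v2 client looks like this:

# Illustrative OpenReview query, independent of the repo's crawlers.
# Requires: pip install openreview-py
import csv
import openreview

# Anonymous access to the v2 API is enough for public submissions.
client = openreview.api.OpenReviewClient(baseurl="https://api2.openreview.net")

venue_id = "ICLR.cc/2025/Conference"  # assumed venue ID; adjust per year/venue
notes = client.get_all_notes(content={"venueid": venue_id})

with open("iclr2025_papers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "title", "abstract"])
    for note in notes:
        writer.writerow([
            note.id,
            note.content["title"]["value"],
            note.content.get("abstract", {}).get("value", ""),
        ])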

2. Judge Training (GEPA-Optimized)

Trains a calibrated paper reviewer using Guided Evolution with Prompt Ancestry (GEPA) — a 2-stage optimization process that evolves a judge prompt against real conference reviews.

Training pipeline:

crawl_all_metadata.py  →  Crawl NeurIPS/ICML paper metadata from OpenReview
crawl_all_reviews.py   →  Bulk-crawl full review text for all papers
data_pipeline.py       →  Parse 50K reviews into train/test
generate_skills.py     →  Generate 26 domain-specific skill files (two-step: stats + LLM synthesis)
embedding_index.py     →  Build FAISS index for similar-paper retrieval
eval_harness.py        →  Evaluate judge predictions vs real scores
optimize.py            →  GEPA optimize judge prompt (Stage 1: Sonnet, Stage 2: Opus)
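
eval_harness.py compares the judge's predicted scores with the real reviewer scores. A minimal sketch of a within-±1.5 accuracy check of that kind (the JSONL layout and field names below are assumptions, not the harness's actual format):

# Sketch of a "within ±1.5" calibration check (illustrative layout,
# not the actual eval_harness.py interface).
import json


def within_tolerance_accuracy(pairs, tol=1.5):
    """Fraction of (predicted, real) score pairs within ±tol of each other."""
    hits = sum(1 for pred, real in pairs if abs(pred - real) <= tol)
    return hits / len(pairs)


if __name__ == "__main__":
    # Assumed JSONL layout: one {"predicted": float, "real_mean": float} per line.
    pairs = []
    with open("judge_predictions.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            pairs.append((rec["predicted"], rec["real_mean"]))
    print(f"accuracy within ±1.5: {within_tolerance_accuracy(pairs):.2%}")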

Results (fully reproducible; all logs and artifacts are in the repo): the GEPA-optimized judge reaches 91.65% accuracy against real reviewer scores (within ±1.5). See Reproducibility & Verification below for the supporting artifacts.

3-layer knowledge architecture (a rough assembly sketch follows this list):

  1. Core prompt (GEPA-optimized, ~4K tokens) — learned evaluation heuristics
  2. Skill library (26 files, dynamically loaded) — topic/dimension/calibration knowledge
  3. Retrieved context (FAISS search) — 5-10 most similar published papers with their actual scores
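
A rough sketch of how these layers could be stitched together at evaluation time (the file paths follow the repository structure below, but the index.json schema and helper names are assumptions):

# Sketch of assembling the judge's context from the three layers.
# File layout follows the repo structure; the index.json schema is assumed.
import json
from pathlib import Path

SKILLS_DIR = Path("judge_training/skills")
CORE_PROMPT = Path("judge_training/output/best_judge_prompt.md").read_text()


def select_skills(idea_text: str, max_skills: int = 4) -> list[str]:
    """Pick skill files whose index keywords appear in the idea text."""
    # Assumed schema: {"keyword": "relative/path/to/skill.md", ...}
    index = json.loads((SKILLS_DIR / "index.json").read_text())
    text = idea_text.lower()
    matched = [path for keyword, path in index.items() if keyword.lower() in text]
    return [(SKILLS_DIR / p).read_text() for p in matched[:max_skills]]


def build_judge_context(idea_text: str, retrieved_papers: list[str]) -> str:
    """Layer 1: core prompt; layer 2: matched skills; layer 3: FAISS-retrieved papers."""
    parts = [CORE_PROMPT, *select_skills(idea_text), "\n\n".join(retrieved_papers), idea_text]
    return "\n\n---\n\n".join(parts)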

Skill library breakdown (all included in repo at judge_training/skills/):

  • 16 topic skills — domain-specific evaluation criteria (video generation, language models, diffusion, etc.)
  • 6 dimension skills — per-axis rubrics (novelty, soundness, experiments, clarity, significance, reproducibility)
  • 4 calibration skills — what papers at each score tier look like (2-3, 4-5, 6-7, 8-10)
  • Each skill includes raw statistics (*_stats.json) extracted from the review corpus, the Claude synthesis trace (_gen_*/), and the final skill file; the stats-extraction step is sketched below
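
A rough sketch of that first, purely statistical step (the record fields and the statistics computed here are assumptions; generate_skills.py does more than this):

# Sketch of the stats-extraction step behind *_stats.json
# (field names and the exact statistics are assumptions).
import json
from collections import defaultdict
from statistics import mean, pstdev


def topic_score_stats(reviews, topic_keywords):
    """Per-topic score distribution from a list of review records."""
    buckets = defaultdict(list)
    for r in reviews:  # assumed record: {"title": ..., "abstract": ..., "score": float}
        text = (r["title"] + " " + r["abstract"]).lower()
        for topic, keywords in topic_keywords.items():
            if any(k in text for k in keywords):
                buckets[topic].append(r["score"])
    return {
        topic: {"n": len(s), "mean": round(mean(s), 2), "std": round(pstdev(s), 2)}
        for topic, s in buckets.items() if s
    }


if __name__ == "__main__":
    reviews = [json.loads(line) for line in open("resources/data/train.jsonl")]
    stats = topic_score_stats(reviews, {"video_generation": ["video generation", "text-to-video"]})
    print(json.dumps(stats, indent=2))  # would be written to e.g. video_generation_stats.json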

3. Adversarial Idea Refiner

A 3-agent debate system where a Critic attacks research ideas, a Proposer defends and improves them, and a Judge (GEPA-trained) scores each round with independent literature verification.

# Generate and refine an idea from scratch
python idea_refiner/adversarial_refiner.py \
  --from-scratch \
  --domain "ML systems for efficient training" \
  --target-venues "ICML,NeurIPS,ICLR" \
  --rounds 40 \
  --use-trained-judge \
  --min-critics 2

# Refine from a seed paper
python idea_refiner/adversarial_refiner.py \
  --seed-paper "2401.12345" \
  --domain "video generation" \
  --rounds 20 \
  --use-trained-judge

# Resume a previous session
python idea_refiner/adversarial_refiner.py \
  --resume resources/refinements/exp_.../session.pkl \
  --rounds 40

Key features:

  • Stateful Claude Code sessions (each agent maintains context across rounds)
  • Critic rotation (a fresh critic is swapped in once the score exceeds a threshold, keeping validation independent)
  • Judge does independent web searches to verify novelty claims
  • Proposer can DEFEND, PIVOT, or REPROPOSE based on critique severity
  • Experiment tracking with per-round snapshots

Typical score trajectory: Ideas start at 4-5/10, climb to 7-8 within 3-4 rounds, then oscillate as fresh critics find new issues. The highest-scoring ideas (9+) tend to emerge through natural pivots from methods to benchmarks/measurement studies.

Results

Across all experiments with the GEPA-trained judge:

Idea                        Domain              Rounds  Peak Score
PhysDPO Oracle Study        Physics video DPO   20      9.5/10
PhysCounterfact Benchmark   Video physics       20      9.0/10
TemporalAttrBind            Video generation    20      8.5/10
DAS-3D Sparse Attention     ML systems          40      8.0/10

See examples/ for the full refined proposals from the top-scoring ideas, plus debate trace excerpts showing how the adversarial loop works in practice:

  • debate_trace_boundcache.md — Full 10-round trajectory from 6 to 9.5, showing how critic pressure drives theoretical depth
  • debate_trace_collisioncue.md — Rapid pivots: critic finds prior art twice in 2 rounds, forcing progressively narrower but more defensible contributions

See optimization_summary.md for detailed analysis of what worked and what didn't across optimization runs.

Key Findings

  1. Patient refinement > clever tricks. 40 rounds of debate on a single idea outperformed tournament selection, early-kill, and forced reproposal combined.
  2. Benchmark/measurement papers score highest. The judge (correctly) rates them 9+ because they're inherently novel and hard to scoop, while method papers in crowded areas cap at 7-8.
  3. The GEPA judge is genuinely rigorous. It does independent literature searches, finds concurrent papers, and holds ideas to real novelty standards. 91.65% calibration accuracy against actual reviewer scores.
  4. Score oscillation is a feature. Fresh critics finding new issues (8→6→8) simulates reviewer diversity. The peaks represent "at least one reviewer would accept."

Reproducibility & Verification

Every claim in this repo is backed by artifacts you can inspect and reproduce:

Claim                    Evidence                           How to verify
91.65% judge accuracy    training_log.json                  Re-run python judge_training/optimize.py --seed 42 — same GEPA stages, same eval harness
50K+ paper corpus        judge_training/data/summary.json   50,185 papers (37K ICLR + 13K NeurIPS) with reviews parsed into train/test splits
689 evaluation rounds    judge_training/output/logs/        Per-paper scoring for every eval in both Stage 1 (500) and Stage 2 (189)
26 skill files           judge_training/skills/             16 topic + 6 dimension + 4 calibration, each with source *_stats.json and generation trace
9.5/10 peak idea score   examples/                          Full refined proposals with debate traces showing score trajectories

The repo ships a pre-built FAISS index (50K papers, via Git LFS) so the system works immediately — no crawling required to start generating ideas. Default mode crawls ~1,000 representative papers (~15-20 min) for additional training data. Use --full for the complete ~50K paper crawl, or python judge_training/optimize.py to retrain from scratch.

Repository Structure

ideaforge/
├── README.md
├── LICENSE
├── requirements.txt
├── ideaforge.py                     # One-command setup (crawl + build + verify)
├── optimization_summary.md          # Detailed analysis of all runs
│
├── crawl_icml.py                    # ICML paper crawler
├── crawl_neurips.py                 # NeurIPS paper crawler
├── data_pipeline/                   # OpenReview crawling pipeline
│   ├── openreview_crawler.py        # ICLR paper + review crawler
│   ├── crawl_with_reviews.py        # Crawl reviews for existing paper CSVs
│   ├── filter_video_papers.py       # Filter papers by video-related keywords
│   └── test_crawler.py              # Crawler integration test
│
├── judge_training/                  # GEPA judge training
│   ├── PLAN.md                      # Training plan & architecture
│   ├── data_pipeline.py             # Review data processing (→ train/test JSONL)
│   ├── generate_skills.py           # Skill library generation (stats → LLM synthesis)
│   ├── embedding_index.py           # FAISS index builder
│   ├── eval_harness.py              # Evaluation against real reviews
│   ├── optimize.py                  # GEPA prompt optimization
│   ├── claude_utils.py              # Claude API utilities
│   ├── crawl_all_metadata.py        # Bulk NeurIPS/ICML metadata crawler
│   ├── crawl_all_reviews.py         # Bulk review text crawler
│   ├── embeddings/                  # Pre-built FAISS index (Git LFS, ~140MB)
│   │   ├── paper_embeddings.faiss   # 50K paper vectors (all-MiniLM-L6-v2)
│   │   ├── embedding_metadata.jsonl # Paper metadata for retrieval
│   │   └── config.json              # Model + index config
│   ├── output/
│   │   ├── best_judge_prompt.md     # Final GEPA-optimized prompt (4K tokens)
│   │   ├── stage1_best_prompt.md    # Stage 1 (Sonnet) best
│   │   └── training_log.json        # Training run metadata
│   └── skills/                      # 26 skill files + stats + generation traces
│       ├── index.json               # Keyword → skill file mapping
│       ├── topics/                  # 16 topic skills (video_generation, etc.)
│       │   ├── *.md                 # Final skill files
│       │   ├── *_stats.json         # Raw statistical extracts from corpus
│       │   └── _gen_*/              # Claude synthesis traces
│       ├── dimensions/              # 6 dimension skills (novelty, soundness, etc.)
│       └── calibration/             # 4 score-tier calibration skills
│
├── idea_refiner/                    # 3-agent adversarial debate
│   ├── adversarial_refiner.py       # Main refiner (Critic-Proposer-Judge)
│   └── custom_agents.py             # Claude session management
│
├── examples/                        # Top refined ideas (AI-generated)
│   ├── physdpo_oracle_study_9.5.md
│   ├── physcounterfact_benchmark_9.0.md
│   ├── das3d_sparse_attention_8.0.md
│   ├── debate_trace_boundcache.md   # Example debate showing 6→9.5 trajectory
│   └── debate_trace_collisioncue.md # Example showing rapid pivots under criticism
│
└── resources/                       # All generated data (gitignored)
    ├── research_data/               # Crawled papers + reviews
    ├── data/                        # train.jsonl, test.jsonl
    ├── embeddings/                  # FAISS index
    ├── skills/                      # Skill library (copied from judge_training/)
    ├── output/                      # Judge prompt (copied from judge_training/)
    ├── refinements/                 # Experiment sessions + checkpoints
    └── transcripts/                 # Full debate transcripts

Not included in repo (generated at setup time, all in resources/):

  • resources/research_data/ — Crawled papers and reviews
  • resources/refinements/ — Full experiment data (hundreds of round snapshots)
  • resources/transcripts/ — Full debate transcripts (100KB-700KB each)

Try It Now

The repo ships with a pre-trained judge, 26 skill files, and a 50K-paper FAISS index — everything you need to generate ideas immediately, no crawling or training required.

git clone https://github.com/makemebitter/ideaforge.git
cd ideaforge
pip install -r requirements.txt

# Verify everything is ready (checks judge, skills, FAISS index, Claude CLI)
python ideaforge.py --check

# Generate a research idea
python ideaforge.py --run \
  --domain "your research area here" \
  --target-venues "ICML,NeurIPS,ICLR" \
  --rounds 10

That's it — ideaforge.py is the single entry point. --check verifies the setup, --run launches the adversarial refiner with the shipped FAISS index and GEPA-trained judge.

Prerequisites: Python 3.10+, Claude Code CLI, and Anthropic API access. The FAISS index ships via Git LFS (~140MB). If you cloned without LFS, run git lfs pull first.

Setup & Crawling (Optional)

You can optionally crawl your own papers to expand the index or retrain the judge:

pip install sentence-transformers faiss-cpu numpy

# Normal mode: crawl ~1,000 representative papers (~15-20 min)
python ideaforge.py

# Full mode: crawl ALL ~50K papers (2-3 hours, use at your own risk)
python ideaforge.py --full

# Test mode: synthetic data, verifies downstream pipeline works
python ideaforge.py --test

# Just check what's ready
python ideaforge.py --check

All pipeline outputs go to a resources/ folder (configurable via --resources-dir) so the setup doesn't interfere with any existing local data.

End-to-End Reproduction (Manual)

The full pipeline has 4 stages. You can skip stages 1-3 if you just want to use the pre-trained judge (already included in the repo).

Stage 1: Crawl papers + reviews

# Crawl ICLR papers (uses OpenReview public API)
python data_pipeline/openreview_crawler.py --year 2025

# Crawl ICML and NeurIPS
python crawl_icml.py
python crawl_neurips.py

All crawlers write to resources/research_data/ by default. Set IDEAFORGE_RESOURCES_DIR to redirect outputs elsewhere.

Stage 2: Build judge training data

cd judge_training

# Parse crawled reviews into train/test JSONL
python data_pipeline.py

# Generate the 26 skill files
python generate_skills.py

# Build FAISS embedding index for similar-paper retrieval
python embedding_index.py

All scripts default to resources/ at the repo root. Set IDEAFORGE_RESOURCES_DIR to override.

Stage 3: Train the judge (optional — pre-trained prompt included)

# GEPA optimization: ~13 hours, ~$200 in API costs
# Stage 1: 500 evals with Sonnet
python optimize.py --stage 1 --evals 500

# Stage 2: 150 evals with Opus (refines Stage 1 winner)
python optimize.py --stage 2 --evals 150

The pre-trained judge prompt is already at judge_training/output/best_judge_prompt.md — you can skip this stage entirely.
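
GEPA's actual selection and mutation are more sophisticated, but the evolve-and-select skeleton both stages follow can be caricatured as below (mutate_prompt stands in for an LLM call, and the interfaces are purely illustrative):

# Caricature of the evolve-and-select loop behind prompt optimization.
# GEPA's real selection and mutation are richer; mutate_prompt is a placeholder
# for an LLM call that rewrites the prompt based on recent scoring errors.
def mutate_prompt(prompt: str, error_notes: list[str]) -> str:
    """Placeholder: in practice an LLM rewrites the prompt given failure cases."""
    return prompt + "\n# revised in light of: " + "; ".join(error_notes[:3])


def evolve(seed_prompt, evaluate, n_evals=500, pool_size=4):
    """evaluate(prompt) -> (accuracy, error_notes), scored against real reviews."""
    pool = [(seed_prompt, *evaluate(seed_prompt))]
    for _ in range(n_evals - 1):
        best_prompt, _, best_errors = max(pool, key=lambda c: c[1])
        child = mutate_prompt(best_prompt, best_errors)
        pool.append((child, *evaluate(child)))
        pool = sorted(pool, key=lambda c: c[1], reverse=True)[:pool_size]
    return max(pool, key=lambda c: c[1])[0]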

Stage 4: Generate and refine ideas

# Make sure Claude Code CLI is installed and authenticated
claude --version

# Run the adversarial refiner (this is the main event)
python idea_refiner/adversarial_refiner.py \
  --from-scratch \
  --domain "your research domain here" \
  --target-venues "ICML,NeurIPS,ICLR" \
  --rounds 40 \
  --use-trained-judge \
  --min-critics 2

# Results saved to resources/refinements/exp_<timestamp>/

Recommended settings (based on our optimization experiments):

  • --rounds 40 — more rounds > clever tricks
  • --max-reproposals 0 — let ideas refine naturally, don't force restarts
  • --early-kill-threshold 0 — disable early kill
  • --critic-threshold 8.0 — rotate critics when score hits 8
  • --min-critics 2 — require at least 2 independent critics
  • Run one experiment at a time to avoid API rate limits

Cost Estimates

Stage                          Time         API Cost
Crawling                       ~2 hours     Free (OpenReview API)
FAISS index                    ~10 min      Free (local)
Judge training                 ~13 hours    ~$200 (Claude API)
Idea refinement (40 rounds)    ~4-8 hours   ~$30-50 per run

How It Works

The Adversarial Loop

  1. Idea Generation: Claude generates a research idea from scratch (or from a seed paper) in a specified domain
  2. Baseline Scoring: The GEPA-trained judge scores the raw idea (typically 4-5/10) with independent literature checks
  3. Critic Phase: A fresh Claude session attacks the idea — finds prior work, identifies logical gaps, challenges feasibility
  4. Proposer Phase: Another Claude session defends the idea — addresses critiques, pivots if needed, strengthens weak points
  5. Judge Phase: The trained judge re-scores with independent web searches, provides guidance to both sides
  6. Repeat: Steps 3-5 repeat for N rounds. When score exceeds threshold, the critic is rotated for independent validation
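
Stripped of the Claude Code plumbing, the loop described above looks roughly like the following (the agent callables are stand-ins for stateful sessions; names and the rotation condition are illustrative, not the refiner's exact interface):

# Rough control flow of one refinement run. The agent callables stand in for
# stateful Claude Code sessions; interfaces and names are illustrative only.
from typing import Callable


def refine(idea: str,
           critic: Callable[[str], str],
           proposer: Callable[[str, str], str],
           judge: Callable[[str], float],
           rotate_critic: Callable[[], None],
           rounds: int = 40,
           critic_threshold: float = 8.0,
           min_critics: int = 2) -> tuple[str, float]:
    peak = judge(idea)                         # baseline scoring
    critics_used = 1
    for _ in range(rounds):
        critique = critic(idea)                # attack: prior work, gaps, feasibility
        idea = proposer(idea, critique)        # DEFEND / PIVOT / REPROPOSE
        score = judge(idea)                    # re-score with independent checks
        peak = max(peak, score)
        if score >= critic_threshold and critics_used < min_critics:
            rotate_critic()                    # fresh critic for independent validation
            critics_used += 1
    return idea, peak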

GEPA Judge Training

Standard prompt engineering can't calibrate a reviewer against 50K real reviews. GEPA solves this:

  1. Stage 1 (Sonnet, 500 evals): Evolves the judge prompt through mutations and crossovers, selecting for accuracy against real reviewer scores. Each eval scores 10 papers and compares to ground truth.
  2. Stage 2 (Opus, 150 evals): Takes the Stage 1 winner and refines it with a more capable model, focusing on edge cases and calibration.
  3. Skill Library: 26 auto-generated files covering topic-specific evaluation criteria, score calibration data, and dimension-specific rubrics. Each skill is built in two steps: (1) statistical extraction from the review corpus (Python, no LLM) producing *_stats.json files, then (2) agentic synthesis (Claude Code session) that reads the stats and produces structured Markdown with real reviewer patterns, required baselines, and score-level calibration data.
  4. Retrieval: At eval time, FAISS retrieves the 5-10 most similar published papers with their actual scores, grounding predictions in real data.
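
The retrieval layer is a standard sentence-transformers plus FAISS setup. A sketch using the shipped artifact names (metadata handling, k, and any query normalization depend on how the index was built and are assumptions here):

# Sketch of the similar-paper retrieval layer. Index and model names match the
# shipped artifacts; metadata handling and k are illustrative.
# Requires: pip install sentence-transformers faiss-cpu
import json

import faiss
from sentence_transformers import SentenceTransformer

index = faiss.read_index("judge_training/embeddings/paper_embeddings.faiss")
with open("judge_training/embeddings/embedding_metadata.jsonl") as f:
    metadata = [json.loads(line) for line in f]

model = SentenceTransformer("all-MiniLM-L6-v2")


def similar_papers(idea_text: str, k: int = 10):
    """Return metadata for the k most similar indexed papers."""
    query = model.encode([idea_text])        # float32, shape (1, 384) for MiniLM
    # If the index stores normalized vectors, the query should be normalized too.
    distances, ids = index.search(query, k)
    return [metadata[i] for i in ids[0] if i != -1]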

Limitations

  • The system requires Claude Code CLI and significant API costs for long runs
  • Judge scores are AI-predicted, not actual peer review — they approximate but don't replace human evaluation
  • Ideas in crowded ML subfields (attention, quantization, KV cache) reliably plateau at 7-8/10, reflecting genuine publication difficulty
  • The system works best for generating and refining ideas, not for validating experimental results
  • Example ideas in examples/ are AI-generated and not human-verified — use as inspiration, not as validated research plans

Citation

If you use IdeaForge in your research, please cite:

@software{ideaforge2025,
  title={IdeaForge: AI-Powered Research Idea Generation via Adversarial Multi-Agent Debate},
  author={Yuhao Zhang},
  url={https://github.com/makemebitter/ideaforge},
  year={2025}
}

License

MIT License


Keywords: research idea generation, automated research, AI researcher, multi-agent debate, adversarial refinement, peer review simulation, paper reviewer AI, GEPA prompt optimization, OpenReview, ICLR, NeurIPS, ICML, Claude Code, LLM agents, FAISS retrieval, research automation, scientific discovery, machine learning research
