
IdeaForge

License: MIT · Python 3.10+ · Claude Code

AI-powered research idea generation through adversarial multi-agent debate. IdeaForge crawls papers from top ML conferences (ICLR, NeurIPS, ICML) via OpenReview and ships a pre-trained AI reviewer calibrated on 50K+ real peer reviews through prompt evolution (GEPA). A 3-agent adversarial debate loop (Critic-Proposer-Judge) then iteratively develops and stress-tests research ideas until the calibrated judge scores them at publication-quality levels.

Unlike code-generation tools or literature review assistants, IdeaForge focuses on the hardest part of research: coming up with novel, publishable ideas and stress-testing them against realistic peer review before you invest months of execution time.

Key capabilities:

  • Automated idea generation — generates research ideas from scratch or from seed papers in any ML domain
  • Multi-agent adversarial refinement — 3 stateful Claude agents (Critic, Proposer, Judge) debate for up to 40 rounds
  • Calibrated AI judge — GEPA-optimized prompt achieves 91.65% accuracy against real reviewer scores (within ±1.5, verified)
  • OpenReview integration — crawls papers and reviews from ICLR, NeurIPS, ICML via the OpenReview API
  • FAISS-based paper retrieval — pre-built 50K-paper index ships with repo (via Git LFS), works immediately
  • Full reproducibility — one-command setup, pre-trained judge + FAISS index included, all skill files and training artifacts in repo

Disclaimer: All research ideas generated by this system are AI-generated and have not been vetted by human researchers. They are intended as starting points for exploration, not as validated research plans. The scores reflect an AI judge's assessment calibrated against real reviews, but do not guarantee publishability or correctness.

System Overview

                    ┌──────────────┐
                    │ Paper Crawl  │  ICLR / NeurIPS / ICML
                    │ + Reviews    │  ~1K default, ~50K full
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Judge Train  │  GEPA optimization
                    │ (91.65% acc) │  689 evals, 2 stages
                    └──────┬───────┘
                           │
           ┌───────────────▼────────────────┐
           │    Adversarial Idea Refiner    │
           │                                │
           │  Critic ──► Proposer ──► Judge │
           │    │                      │    │
           │    └─── rotate ◄─── score ≥8   │
           └───────────────┬────────────────┘
                           │
                    ┌──────▼───────┐
                    │  Refined     │
                    │  Research    │  Scored 4-9.5/10
                    │  Ideas       │
                    └──────────────┘

Components

1. Conference Paper Crawlers

Crawl papers and reviews from top ML venues using the OpenReview API.

# Crawl ICLR papers
python data_pipeline/openreview_crawler.py --year 2025

# Crawl ICML papers
python crawl_icml.py

# Crawl NeurIPS papers
python crawl_neurips.py
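
Under the hood these scripts query the OpenReview API. As a rough illustration (not the repo's crawler; the venue ID, field access, and CSV output here are assumptions), the equivalent query with the openreview-py v2 client looks like this:

# Illustrative OpenReview query, independent of the repo's crawlers.
# Requires: pip install openreview-py
import csv
import openreview

# Anonymous access to the v2 API is enough for public submissions.
client = openreview.api.OpenReviewClient(baseurl="https://api2.openreview.net")

venue_id = "ICLR.cc/2025/Conference"  # assumed venue ID; adjust per year/venue
notes = client.get_all_notes(content={"venueid": venue_id})

with open("iclr2025_papers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "title", "abstract"])
    for note in notes:
        writer.writerow([
            note.id,
            note.content["title"]["value"],
            note.content.get("abstract", {}).get("value", ""),
        ])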

2. Judge Training (GEPA-Optimized)

Trains a calibrated paper reviewer using Guided Evolution with Prompt Ancestry (GEPA) — a 2-stage optimization process that evolves a judge prompt against real conference reviews.

Training pipeline:

crawl_all_metadata.py  →  Crawl NeurIPS/ICML paper metadata from OpenReview
crawl_all_reviews.py   →  Bulk-crawl full review text for all papers
data_pipeline.py       →  Parse 50K reviews into train/test
generate_skills.py     →  Generate 26 domain-specific skill files (two-step: stats + LLM synthesis)
embedding_index.py     →  Build FAISS index for similar-paper retrieval
eval_harness.py        →  Evaluate judge predictions vs real scores
optimize.py            →  GEPA optimize judge prompt (Stage 1: Sonnet, Stage 2: Opus)
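
eval_harness.py compares the judge's predicted scores with the real reviewer scores. A minimal sketch of a within-±1.5 accuracy check of that kind (the JSONL layout and field names below are assumptions, not the harness's actual format):

# Sketch of a "within ±1.5" calibration check (illustrative layout,
# not the actual eval_harness.py interface).
import json


def within_tolerance_accuracy(pairs, tol=1.5):
    """Fraction of (predicted, real) score pairs within ±tol of each other."""
    hits = sum(1 for pred, real in pairs if abs(pred - real) <= tol)
    return hits / len(pairs)


if __name__ == "__main__":
    # Assumed JSONL layout: one {"predicted": float, "real_mean": float} per line.
    pairs = []
    with open("judge_predictions.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            pairs.append((rec["predicted"], rec["real_mean"]))
    print(f"accuracy within ±1.5: {within_tolerance_accuracy(pairs):.2%}")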

Results (fully reproducible; all logs and artifacts are in the repo): the GEPA-optimized judge reaches 91.65% accuracy against real reviewer scores (within ±1.5). See Reproducibility & Verification below for the supporting artifacts.

3-layer knowledge architecture (a rough assembly sketch follows this list):

  1. Core prompt (GEPA-optimized, ~4K tokens) — learned evaluation heuristics
  2. Skill library (26 files, dynamically loaded) — topic/dimension/calibration knowledge
  3. Retrieved context (FAISS search) — 5-10 most similar published papers with their actual scores
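
A rough sketch of how these layers could be stitched together at evaluation time (the file paths follow the repository structure below, but the index.json schema and helper names are assumptions):

# Sketch of assembling the judge's context from the three layers.
# File layout follows the repo structure; the index.json schema is assumed.
import json
from pathlib import Path

SKILLS_DIR = Path("judge_training/skills")
CORE_PROMPT = Path("judge_training/output/best_judge_prompt.md").read_text()


def select_skills(idea_text: str, max_skills: int = 4) -> list[str]:
    """Pick skill files whose index keywords appear in the idea text."""
    # Assumed schema: {"keyword": "relative/path/to/skill.md", ...}
    index = json.loads((SKILLS_DIR / "index.json").read_text())
    text = idea_text.lower()
    matched = [path for keyword, path in index.items() if keyword.lower() in text]
    return [(SKILLS_DIR / p).read_text() for p in matched[:max_skills]]


def build_judge_context(idea_text: str, retrieved_papers: list[str]) -> str:
    """Layer 1: core prompt; layer 2: matched skills; layer 3: FAISS-retrieved papers."""
    parts = [CORE_PROMPT, *select_skills(idea_text), "\n\n".join(retrieved_papers), idea_text]
    return "\n\n---\n\n".join(parts)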

Skill library breakdown (all included in repo at judge_training/skills/):

  • 16 topic skills — domain-specific evaluation criteria (video generation, language models, diffusion, etc.)
  • 6 dimension skills — per-axis rubrics (novelty, soundness, experiments, clarity, significance, reproducibility)
  • 4 calibration skills — what papers at each score tier look like (2-3, 4-5, 6-7, 8-10)
  • Each skill includes raw statistics (*_stats.json) extracted from the review corpus, the Claude synthesis trace (_gen_*/), and the final skill file; the stats-extraction step is sketched below
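
A rough sketch of that first, purely statistical step (the record fields and the statistics computed here are assumptions; generate_skills.py does more than this):

# Sketch of the stats-extraction step behind *_stats.json
# (field names and the exact statistics are assumptions).
import json
from collections import defaultdict
from statistics import mean, pstdev


def topic_score_stats(reviews, topic_keywords):
    """Per-topic score distribution from a list of review records."""
    buckets = defaultdict(list)
    for r in reviews:  # assumed record: {"title": ..., "abstract": ..., "score": float}
        text = (r["title"] + " " + r["abstract"]).lower()
        for topic, keywords in topic_keywords.items():
            if any(k in text for k in keywords):
                buckets[topic].append(r["score"])
    return {
        topic: {"n": len(s), "mean": round(mean(s), 2), "std": round(pstdev(s), 2)}
        for topic, s in buckets.items() if s
    }


if __name__ == "__main__":
    reviews = [json.loads(line) for line in open("resources/data/train.jsonl")]
    stats = topic_score_stats(reviews, {"video_generation": ["video generation", "text-to-video"]})
    print(json.dumps(stats, indent=2))  # would be written to e.g. video_generation_stats.json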

3. Adversarial Idea Refiner

A 3-agent debate system where a Critic attacks research ideas, a Proposer defends and improves them, and a Judge (GEPA-trained) scores each round with independent literature verification.

# Generate and refine an idea from scratch
python idea_refiner/adversarial_refiner.py \
  --from-scratch \
  --domain "ML systems for efficient training" \
  --target-venues "ICML,NeurIPS,ICLR" \
  --rounds 40 \
  --use-trained-judge \
  --min-critics 2

# Refine from a seed paper
python idea_refiner/adversarial_refiner.py \
  --seed-paper "2401.12345" \
  --domain "video generation" \
  --rounds 20 \
  --use-trained-judge

# Resume a previous session
python idea_refiner/adversarial_refiner.py \
  --resume resources/refinements/exp_.../session.pkl \
  --rounds 40

Key features:

  • Stateful Claude Code sessions (each agent maintains context across rounds)
  • Critic rotation (a fresh critic is swapped in once the score exceeds a threshold, keeping validation independent)
  • Judge does independent web searches to verify novelty claims
  • Proposer can DEFEND, PIVOT, or REPROPOSE based on critique severity
  • Experiment tracking with per-round snapshots

Typical score trajectory: Ideas start at 4-5/10, climb to 7-8 within 3-4 rounds, then oscillate as fresh critics find new issues. The highest-scoring ideas (9+) tend to emerge through natural pivots from methods to benchmarks/measurement studies.

Results

Across all experiments with the GEPA-trained judge:

Idea                        Domain              Rounds  Peak Score
PhysDPO Oracle Study        Physics video DPO   20      9.5/10
PhysCounterfact Benchmark   Video physics       20      9.0/10
TemporalAttrBind            Video generation    20      8.5/10
DAS-3D Sparse Attention     ML systems          40      8.0/10

See examples/ for the full refined proposals from the top-scoring ideas, plus debate trace excerpts showing how the adversarial loop works in practice:

  • debate_trace_boundcache.md — Full 10-round trajectory from 6 to 9.5, showing how critic pressure drives theoretical depth
  • debate_trace_collisioncue.md — Rapid pivots: critic finds prior art twice in 2 rounds, forcing progressively narrower but more defensible contributions

See optimization_summary.md for detailed analysis of what worked and what didn't across optimization runs.

Key Findings

  1. Patient refinement > clever tricks. 40 rounds of debate on a single idea outperformed tournament selection, early-kill, and forced reproposal combined.
  2. Benchmark/measurement papers score highest. The judge (correctly) rates them 9+ because they're inherently novel and hard to scoop, while method papers in crowded areas cap at 7-8.
  3. The GEPA judge is genuinely rigorous. It does independent literature searches, finds concurrent papers, and holds ideas to real novelty standards. 91.65% calibration accuracy against actual reviewer scores.
  4. Score oscillation is a feature. Fresh critics finding new issues (8→6→8) simulates reviewer diversity. The peaks represent "at least one reviewer would accept."

Reproducibility & Verification

Every claim in this repo is backed by artifacts you can inspect and reproduce:

Claim                    Evidence                           How to verify
91.65% judge accuracy    training_log.json                  Re-run python judge_training/optimize.py --seed 42 — same GEPA stages, same eval harness
50K+ paper corpus        judge_training/data/summary.json   50,185 papers (37K ICLR + 13K NeurIPS) with reviews parsed into train/test splits
689 evaluation rounds    judge_training/output/logs/        Per-paper scoring for every eval in both Stage 1 (500) and Stage 2 (189)
26 skill files           judge_training/skills/             16 topic + 6 dimension + 4 calibration, each with source *_stats.json and generation trace
9.5/10 peak idea score   examples/                          Full refined proposals with debate traces showing score trajectories

The repo ships a pre-built FAISS index (50K papers, via Git LFS) so the system works immediately — no crawling required to start generating ideas. Default mode crawls ~1,000 representative papers (~15-20 min) for additional training data. Use --full for the complete ~50K paper crawl, or python judge_training/optimize.py to retrain from scratch.

Repository Structure

ideaforge/
├── README.md
├── LICENSE
├── requirements.txt
├── ideaforge.py                     # One-command setup (crawl + build + verify)
├── optimization_summary.md          # Detailed analysis of all runs
│
├── crawl_icml.py                    # ICML paper crawler
├── crawl_neurips.py                 # NeurIPS paper crawler
├── data_pipeline/                   # OpenReview crawling pipeline
│   ├── openreview_crawler.py        # ICLR paper + review crawler
│   ├── crawl_with_reviews.py        # Crawl reviews for existing paper CSVs
│   ├── filter_video_papers.py       # Filter papers by video-related keywords
│   └── test_crawler.py              # Crawler integration test
│
├── judge_training/                  # GEPA judge training
│   ├── PLAN.md                      # Training plan & architecture
│   ├── data_pipeline.py             # Review data processing (→ train/test JSONL)
│   ├── generate_skills.py           # Skill library generation (stats → LLM synthesis)
│   ├── embedding_index.py           # FAISS index builder
│   ├── eval_harness.py              # Evaluation against real reviews
│   ├── optimize.py                  # GEPA prompt optimization
│   ├── claude_utils.py              # Claude API utilities
│   ├── crawl_all_metadata.py        # Bulk NeurIPS/ICML metadata crawler
│   ├── crawl_all_reviews.py         # Bulk review text crawler
│   ├── embeddings/                  # Pre-built FAISS index (Git LFS, ~140MB)
│   │   ├── paper_embeddings.faiss   # 50K paper vectors (all-MiniLM-L6-v2)
│   │   ├── embedding_metadata.jsonl # Paper metadata for retrieval
│   │   └── config.json              # Model + index config
│   ├── output/
│   │   ├── best_judge_prompt.md     # Final GEPA-optimized prompt (4K tokens)
│   │   ├── stage1_best_prompt.md    # Stage 1 (Sonnet) best
│   │   └── training_log.json        # Training run metadata
│   └── skills/                      # 26 skill files + stats + generation traces
│       ├── index.json               # Keyword → skill file mapping
│       ├── topics/                  # 16 topic skills (video_generation, etc.)
│       │   ├── *.md                 # Final skill files
│       │   ├── *_stats.json         # Raw statistical extracts from corpus
│       │   └── _gen_*/              # Claude synthesis traces
│       ├── dimensions/              # 6 dimension skills (novelty, soundness, etc.)
│       └── calibration/             # 4 score-tier calibration skills
│
├── idea_refiner/                    # 3-agent adversarial debate
│   ├── adversarial_refiner.py       # Main refiner (Critic-Proposer-Judge)
│   └── custom_agents.py             # Claude session management
│
├── examples/                        # Top refined ideas (AI-generated)
│   ├── physdpo_oracle_study_9.5.md
│   ├── physcounterfact_benchmark_9.0.md
│   ├── das3d_sparse_attention_8.0.md
│   ├── debate_trace_boundcache.md   # Example debate showing 6→9.5 trajectory
│   └── debate_trace_collisioncue.md # Example showing rapid pivots under criticism
│
└── resources/                       # All generated data (gitignored)
    ├── research_data/               # Crawled papers + reviews
    ├── data/                        # train.jsonl, test.jsonl
    ├── embeddings/                  # FAISS index
    ├── skills/                      # Skill library (copied from judge_training/)
    ├── output/                      # Judge prompt (copied from judge_training/)
    ├── refinements/                 # Experiment sessions + checkpoints
    └── transcripts/                 # Full debate transcripts

Not included in repo (generated at setup time, all in resources/):

  • resources/research_data/ — Crawled papers and reviews
  • resources/refinements/ — Full experiment data (hundreds of round snapshots)
  • resources/transcripts/ — Full debate transcripts (100KB-700KB each)

Try It Now

The repo ships with a pre-trained judge, 26 skill files, and a 50K-paper FAISS index — everything you need to generate ideas immediately, no crawling or training required.

git clone https://github.com/makemebitter/ideaforge.git
cd ideaforge
pip install -r requirements.txt

# Verify everything is ready (checks judge, skills, FAISS index, Claude CLI)
python ideaforge.py --check

# Generate a research idea
python ideaforge.py --run \
  --domain "your research area here" \
  --target-venues "ICML,NeurIPS,ICLR" \
  --rounds 10

That's it — ideaforge.py is the single entry point. --check verifies the setup, --run launches the adversarial refiner with the shipped FAISS index and GEPA-trained judge.

Prerequisites: Python 3.10+, Claude Code CLI, and Anthropic API access. The FAISS index ships via Git LFS (~140MB). If you cloned without LFS, run git lfs pull first.

Setup & Crawling (Optional)

You can optionally crawl your own papers to expand the index or retrain the judge:

pip install sentence-transformers faiss-cpu numpy

# Normal mode: crawl ~1,000 representative papers (~15-20 min)
python ideaforge.py

# Full mode: crawl ALL ~50K papers (2-3 hours, use at your own risk)
python ideaforge.py --full

# Test mode: synthetic data, verifies downstream pipeline works
python ideaforge.py --test

# Just check what's ready
python ideaforge.py --check

All pipeline outputs go to a resources/ folder (configurable via --resources-dir) so the setup doesn't interfere with any existing local data.

End-to-End Reproduction (Manual)

The full pipeline has 4 stages. You can skip stages 1-3 if you just want to use the pre-trained judge (already included in the repo).

Stage 1: Crawl papers + reviews

# Crawl ICLR papers (uses OpenReview public API)
python data_pipeline/openreview_crawler.py --year 2025

# Crawl ICML and NeurIPS
python crawl_icml.py
python crawl_neurips.py

All crawlers write to resources/research_data/ by default. Set IDEAFORGE_RESOURCES_DIR to redirect outputs elsewhere.

Stage 2: Build judge training data

cd judge_training

# Parse crawled reviews into train/test JSONL
python data_pipeline.py

# Generate the 26 skill files
python generate_skills.py

# Build FAISS embedding index for similar-paper retrieval
python embedding_index.py

All scripts default to resources/ at the repo root. Set IDEAFORGE_RESOURCES_DIR to override.

Stage 3: Train the judge (optional — pre-trained prompt included)

# GEPA optimization: ~13 hours, ~$200 in API costs
# Stage 1: 500 evals with Sonnet
python optimize.py --stage 1 --evals 500

# Stage 2: 150 evals with Opus (refines Stage 1 winner)
python optimize.py --stage 2 --evals 150

The pre-trained judge prompt is already at judge_training/output/best_judge_prompt.md — you can skip this stage entirely.
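
GEPA's actual selection and mutation are more sophisticated, but the evolve-and-select skeleton both stages follow can be caricatured as below (mutate_prompt stands in for an LLM call, and the interfaces are purely illustrative):

# Caricature of the evolve-and-select loop behind prompt optimization.
# GEPA's real selection and mutation are richer; mutate_prompt is a placeholder
# for an LLM call that rewrites the prompt based on recent scoring errors.
def mutate_prompt(prompt: str, error_notes: list[str]) -> str:
    """Placeholder: in practice an LLM rewrites the prompt given failure cases."""
    return prompt + "\n# revised in light of: " + "; ".join(error_notes[:3])


def evolve(seed_prompt, evaluate, n_evals=500, pool_size=4):
    """evaluate(prompt) -> (accuracy, error_notes), scored against real reviews."""
    pool = [(seed_prompt, *evaluate(seed_prompt))]
    for _ in range(n_evals - 1):
        best_prompt, _, best_errors = max(pool, key=lambda c: c[1])
        child = mutate_prompt(best_prompt, best_errors)
        pool.append((child, *evaluate(child)))
        pool = sorted(pool, key=lambda c: c[1], reverse=True)[:pool_size]
    return max(pool, key=lambda c: c[1])[0]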

Stage 4: Generate and refine ideas

# Make sure Claude Code CLI is installed and authenticated
claude --version

# Run the adversarial refiner (this is the main event)
python idea_refiner/adversarial_refiner.py \
  --from-scratch \
  --domain "your research domain here" \
  --target-venues "ICML,NeurIPS,ICLR" \
  --rounds 40 \
  --use-trained-judge \
  --min-critics 2

# Results saved to resources/refinements/exp_<timestamp>/

Recommended settings (based on our optimization experiments):

  • --rounds 40 — more rounds > clever tricks
  • --max-reproposals 0 — let ideas refine naturally, don't force restarts
  • --early-kill-threshold 0 — disable early kill
  • --critic-threshold 8.0 — rotate critics when score hits 8
  • --min-critics 2 — require at least 2 independent critics
  • Run one experiment at a time to avoid API rate limits

Cost Estimates

Stage                          Time         API Cost
Crawling                       ~2 hours     Free (OpenReview API)
FAISS index                    ~10 min      Free (local)
Judge training                 ~13 hours    ~$200 (Claude API)
Idea refinement (40 rounds)    ~4-8 hours   ~$30-50 per run

How It Works

The Adversarial Loop

  1. Idea Generation: Claude generates a research idea from scratch (or from a seed paper) in a specified domain
  2. Baseline Scoring: The GEPA-trained judge scores the raw idea (typically 4-5/10) with independent literature checks
  3. Critic Phase: A fresh Claude session attacks the idea — finds prior work, identifies logical gaps, challenges feasibility
  4. Proposer Phase: Another Claude session defends the idea — addresses critiques, pivots if needed, strengthens weak points
  5. Judge Phase: The trained judge re-scores with independent web searches, provides guidance to both sides
  6. Repeat: Steps 3-5 repeat for N rounds. When score exceeds threshold, the critic is rotated for independent validation
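
Stripped of the Claude Code plumbing, the loop described above looks roughly like the following (the agent callables are stand-ins for stateful sessions; names and the rotation condition are illustrative, not the refiner's exact interface):

# Rough control flow of one refinement run. The agent callables stand in for
# stateful Claude Code sessions; interfaces and names are illustrative only.
from typing import Callable


def refine(idea: str,
           critic: Callable[[str], str],
           proposer: Callable[[str, str], str],
           judge: Callable[[str], float],
           rotate_critic: Callable[[], None],
           rounds: int = 40,
           critic_threshold: float = 8.0,
           min_critics: int = 2) -> tuple[str, float]:
    peak = judge(idea)                         # baseline scoring
    critics_used = 1
    for _ in range(rounds):
        critique = critic(idea)                # attack: prior work, gaps, feasibility
        idea = proposer(idea, critique)        # DEFEND / PIVOT / REPROPOSE
        score = judge(idea)                    # re-score with independent checks
        peak = max(peak, score)
        if score >= critic_threshold and critics_used < min_critics:
            rotate_critic()                    # fresh critic for independent validation
            critics_used += 1
    return idea, peak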

GEPA Judge Training

Standard prompt engineering can't calibrate a reviewer against 50K real reviews. GEPA solves this:

  1. Stage 1 (Sonnet, 500 evals): Evolves the judge prompt through mutations and crossovers, selecting for accuracy against real reviewer scores. Each eval scores 10 papers and compares to ground truth.
  2. Stage 2 (Opus, 150 evals): Takes the Stage 1 winner and refines it with a more capable model, focusing on edge cases and calibration.
  3. Skill Library: 26 auto-generated files covering topic-specific evaluation criteria, score calibration data, and dimension-specific rubrics. Each skill is built in two steps: (1) statistical extraction from the review corpus (Python, no LLM) producing *_stats.json files, then (2) agentic synthesis (Claude Code session) that reads the stats and produces structured Markdown with real reviewer patterns, required baselines, and score-level calibration data.
  4. Retrieval: At eval time, FAISS retrieves the 5-10 most similar published papers with their actual scores, grounding predictions in real data.
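
The retrieval layer is a standard sentence-transformers plus FAISS setup. A sketch using the shipped artifact names (metadata handling, k, and any query normalization depend on how the index was built and are assumptions here):

# Sketch of the similar-paper retrieval layer. Index and model names match the
# shipped artifacts; metadata handling and k are illustrative.
# Requires: pip install sentence-transformers faiss-cpu
import json

import faiss
from sentence_transformers import SentenceTransformer

index = faiss.read_index("judge_training/embeddings/paper_embeddings.faiss")
with open("judge_training/embeddings/embedding_metadata.jsonl") as f:
    metadata = [json.loads(line) for line in f]

model = SentenceTransformer("all-MiniLM-L6-v2")


def similar_papers(idea_text: str, k: int = 10):
    """Return metadata for the k most similar indexed papers."""
    query = model.encode([idea_text])        # float32, shape (1, 384) for MiniLM
    # If the index stores normalized vectors, the query should be normalized too.
    distances, ids = index.search(query, k)
    return [metadata[i] for i in ids[0] if i != -1]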

Limitations

  • The system requires Claude Code CLI and significant API costs for long runs
  • Judge scores are AI-predicted, not actual peer review — they approximate but don't replace human evaluation
  • Ideas in crowded ML subfields (attention, quantization, KV cache) reliably plateau at 7-8/10, reflecting genuine publication difficulty
  • The system works best for generating and refining ideas, not for validating experimental results
  • Example ideas in examples/ are AI-generated and not human-verified — use as inspiration, not as validated research plans

Citation

If you use IdeaForge in your research, please cite:

@software{ideaforge2025,
  title={IdeaForge: AI-Powered Research Idea Generation via Adversarial Multi-Agent Debate},
  author={Yuhao Zhang},
  url={https://github.com/makemebitter/ideaforge},
  year={2025}
}

License

MIT License


Keywords: research idea generation, automated research, AI researcher, multi-agent debate, adversarial refinement, peer review simulation, paper reviewer AI, GEPA prompt optimization, OpenReview, ICLR, NeurIPS, ICML, Claude Code, LLM agents, FAISS retrieval, research automation, scientific discovery, machine learning research
