Talk to Claude Code. Get a research paper.
InfiEpisteme is an automated research pipeline that turns a research idea into a peer-review-quality paper, clean code, and organized results, all through natural language conversation with Claude Code.
No frameworks to learn. No scripts to memorize. Just describe what you want to study.
```
You: I want to study Kimi's Attention Residuals paper that just came out.
     Validate it on small LLMs and propose improvements. Target NeurIPS 2026.

CC:  [searches web, finds paper, identifies 5 competing methods]
     [sets up server, writes proposal, starts 9-stage pipeline]
     [pauses for your review at key decisions]
     [monitors, catches bugs, fixes them, keeps going]

     ...6 hours later...

     Pipeline complete. Paper: 12 pages, 36 citations, 6 experiments.
     → DELIVERY/paper.pdf
```
| Traditional | InfiEpisteme |
|---|---|
| Edit YAML configs | Just talk |
| Read docs to learn the framework | CC already knows how |
| SSH in to debug when things break | CC diagnoses and fixes automatically |
| Monitor scripts manually | CC watches for risks proactively |
| Restart from scratch on failure | CC retries, hotfixes, resumes |
Key advantage: You command one local Claude Code, which commands multiple Claude Code instances on the server, and it actively monitors, diagnoses, and hotfixes along the way.
InfiEpisteme improves itself through natural language: no RL, no retraining, just updated instructions.
```
Run pipeline
     │
     ▼
Observe problems ◄─── "S1 only found 21 citations, need 30"
     │
     ▼
Discuss with CC  ◄─── "The regex doesn't match 'et al.' format"
     │
     ▼
CC updates .md   ◄─── S1_literature.md: "Web Search is Source 1, not fallback"
     │           ◄─── state_guard.py: fix regex
     ▼           ◄─── S2_ideation.md: "speculative hypotheses must be questions"
Next run is better
     │
     └──► repeat
```
Every skill file is a natural language instruction. When something goes wrong, you tell CC what happened, and it rewrites the instruction to prevent the same mistake. Real examples from our first run:
| You said | CC improved |
|---|---|
| "Why only 21 citations?" | S1: Web Search promoted to primary source; added "run second pass if < 30" |
| "This hypothesis has no evidence" | S2: Added evidence_basis: speculative label requirement |
| "The node packed 6 training runs" | S2: Added "one node = one training run" rule |
| "500M tokens is too much for preliminary" | S4: Added "S4.1 uses 100-200M tokens" limit |
| "experiment_tree.json never updates" | S4: Added "set status to running BEFORE training" step |
| "Only one architecture isn't convincing" | Added Pythia arch support; redesigned S4.1 as cross-architecture screening |
| "Training restarted and lost 2h of logs" | S4/common: Added "never overwrite results, back up with timestamp" rule |
The pipeline's instructions are its weights. Natural language is the gradient. Conversation is the training loop.
```
You: "Study Attention Residuals, target NeurIPS 2026"
 │
 ▼
Local Claude Code (mission control – the only CC you talk to)
 │
 ├─ Web search: finds paper, competing methods
 ├─ SSH: configures server, writes config.yaml
 │
 ├─ Per-stage execution loop ─────────────────────────────────────────
 │
 │    ssh: claude -p "skills/S{N}.md"       ───► skill execution
 │    ssh: state_guard.py verify            ───► deterministic check
 │    ssh: claude -p "skills/memory_sync"   ───► knowledge consolidation
 │    ssh: claude -p "skills/judge.md"      ───► 3-layer quality gate
 │         Layer 1: deterministic checks
 │         Layer 2: content quality
 │         Layer 3: first-principles reasoning
 │    ssh: state_guard.py advance           ───► next stage or retry
 │
 │    [CHECKPOINT at P0, S2, S3?]           ───► pause for human review
 │         Fixed checklist (targets known failure modes)
 │         + LLM adversarial brief (supplements)
 │         + Raw file access (fallback)
 │         ───► ./run.sh approve to continue
 │
 │ ───────────────────────────────────────────────────────────────────
 │
 ├─ Diagnoses failures, hotfixes, retries automatically
 ├─ Reports to you in plain language
 └─ Repeats for all stages ◄── intervene anytime ◄── You
```
GPU Server receives SSH commands. Each "claude -p" spawns an independent
CC instance that reads skill instructions, does the work, writes files.
Local CC checks results, decides next action.
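The per-stage loop that local CC drives over SSH can be sketched as follows. This is a minimal illustration, not the pipeline's actual code; the host, working directory, and exact commands are assumptions based on the examples above, and the executor is injectable so the loop can be exercised without a server.

```python
import subprocess

SERVER = "ubuntu@gpu-box"      # hypothetical host, matching the examples above
WORKDIR = "~/InfiEpisteme"

def ssh_command(remote_cmd: str) -> list[str]:
    """Build the argv for running one command on the GPU server over SSH."""
    return ["ssh", SERVER, f"cd {WORKDIR} && {remote_cmd}"]

def run_stage(stage: str, execute=subprocess.run) -> bool:
    """Run one stage via a fresh `claude -p` instance, then verify it."""
    # Each `claude -p` call spawns a stateless CC instance that reads the
    # stage's skill file, does the work, and writes its outputs to disk.
    execute(ssh_command(f'claude -p "$(cat skills/{stage}.md)"'))
    # Deterministic verification: state_guard.py checks the stage's outputs.
    result = execute(ssh_command(f"python state_guard.py verify {stage}"))
    return getattr(result, "returncode", 1) == 0
```

Local CC would wrap this loop with diagnosis and retries; the sketch only shows the execute-then-verify skeleton.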
run.sh is optional. It automates the loop for unattended runs, but the real orchestrator is your local Claude Code. In practice, local CC provides better results because it can:
- Predict risks before they happen ("this paper is too new for Semantic Scholar")
- Diagnose root causes when checks fail ("regex doesn't match this citation format")
- Hotfix and retry without waiting for 3 automated failures
- Search the web for context that server-side CC can't access
- Report to you in plain language and ask for judgment calls
The chain of command: You give one instruction in natural language → your local CC breaks it into actions → each action spawns a dedicated CC instance on the server → they coordinate through shared files → local CC monitors results and intervenes when needed.
All research logic lives in skills/*.md files: structured markdown that Claude Code executes natively. No framework overhead, no hidden state. Every decision traces back to a readable .md file.
Each claude -p call is stateless. Memory is reconstructed from files:
- Layer 1 – State: registry.yaml, experiment_tree.json
- Layer 2 – Knowledge: .ai/core/ (research context, methodology, literature) and .ai/evolution/ (decisions, negative results, experiment log)
- Layer 3 – Context: .ai/context_chain.md, the "why" thread across stages
A dedicated Memory Synthesizer runs after every stage to consolidate knowledge. Skills never write to .ai/ directly.
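Reconstructing memory from these three layers might look like the sketch below. The file names follow the layout described above, but the loader itself is an assumption, not the pipeline's actual code.

```python
import json
from pathlib import Path

def reconstruct_memory(root: Path) -> dict:
    """Rebuild a stage's working memory from the three on-disk layers."""
    memory = {}
    # Layer 1: state. registry.yaml is kept as raw text here; the real
    # pipeline would parse it with pyyaml (listed in the prerequisites).
    state = root / "registry.yaml"
    memory["state"] = state.read_text() if state.exists() else ""
    tree = root / "experiment_tree.json"
    memory["experiments"] = json.loads(tree.read_text()) if tree.exists() else {}
    # Layer 2: knowledge, consolidated by the Memory Synthesizer after each stage.
    core = root / ".ai" / "core"
    memory["knowledge"] = (
        {p.name: p.read_text() for p in core.glob("*.md")} if core.is_dir() else {}
    )
    # Layer 3: context, the "why" thread carried across stages.
    chain = root / ".ai" / "context_chain.md"
    memory["context"] = chain.read_text() if chain.exists() else ""
    return memory
```

Because everything a stage needs lives in files, any `claude -p` instance can be killed and respawned without losing state.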
- Local machine: Claude Code installed
- GPU server: SSH access to a machine with a GPU (e.g., `ubuntu@gpu-box`), with Claude Code installed on the server
- Python 3.10+ with `pyyaml`, `requests`
- OpenAI API key on the server (for cross-model review in S7): `echo 'export OPENAI_API_KEY="sk-..."' >> ~/.bashrc && source ~/.bashrc`
Clone and configure on the server:
```shell
ssh user@gpu-box
git clone https://github.com/JayCheng113/InfiEpisteme.git
cd InfiEpisteme
pip install pyyaml requests
```

Then edit config.yaml on the server:
```yaml
research_direction: |
  Your research idea here. Be specific: what problem, what method,
  what comparison, what scale. The more detail, the better the pipeline
  performs. Example:
  "Validate Kimi's Attention Residuals on dense sub-2B LLMs.
  Compare against DCA, MUDDFormer, DenseFormer on LLaMA and Pythia
  at 0.5B scale. Target NeurIPS 2026."

target_venue: "NeurIPS 2026"   # or "ICML 2026", "arxiv", etc.

compute:
  gpu_hours: 80       # your available GPU budget
  gpu_type: "A100"    # your GPU type
  parallel_jobs: 1    # how many experiments to run at once
```

Option A: Interactive (recommended) – You talk to your local Claude Code, and it SSHs to the server and runs stages one by one. You can monitor, intervene, and guide at any point.
You: SSH to ubuntu@gpu-box, go to ~/InfiEpisteme, and start the research pipeline.
Your local CC will read config.yaml, run each stage via claude -p, check results, and report back. At checkpoints (P0, S2, and conditionally S3 for novel methods), it pauses for your review.
Option B: Unattended – Run directly on the server:

```shell
./run.sh start                        # runs pipeline, pauses at checkpoints
./run.sh approve                      # approve checkpoint and continue
./run.sh approve --with 'add Pythia'  # approve with modifications
./run.sh status                       # check progress
```

The most important input is config.yaml's research_direction. Write it yourself, or tell your local CC what you want and ask it to write config.yaml for you:
```
You: I want to study Kimi's Attention Residuals paper (arxiv:2603.15031).
     Validate it on small dense LLMs, compare against other depth-aggregation
     methods, and propose improvements. Target NeurIPS 2026. I have an A100-40GB
     with 80 GPU-hours budget.

CC:  Got it. I'll write config.yaml with this direction and set up the pipeline.
     [SSHs to server, writes config.yaml, starts P0]

── CHECKPOINT: P0 (research direction) ──

CC:  Here's the formalized proposal. Please verify:
     1. Is the hypothesis correctly framed?
     2. Is the scope appropriate for NeurIPS?
     3. Is the compute budget feasible?

You: Looks good, but add DHC as a baseline too.
CC:  [updates proposal, continues to S1-S2]

── CHECKPOINT: S2 (experiment design) ──

CC:  Experiment design ready. 14 nodes, 2 architectures. Check:
     1. Each node tests exactly one variable? ✓
     2. Budget: 14 × 2.5h = ~35h screening. Leaves 45h for tuning. ✓
     3. Any missing baselines?

You: Go.
CC:  [runs S3, implements all methods]

── CHECKPOINT: S3 (only if you proposed a novel method) ──

CC:  Implementation complete. Your DPE-AttnRes method:
     - depth_scale initialized at 0.01 (learnable, ReZero-style)
     - Invariant verified: reduces to vanilla AttnRes when scale=0 ✓
     - 341.9M params, 14.5 GB peak memory
     Approve to start training?

You: Go.
CC:  [runs S4-S8 fully automated, reports when done]

You: What's the status?
CC:  S4.2 running. 4/6 experiments complete. Best perplexity: 48.4.

You: Change H2's direction to [new idea].
CC:  [updates experiment tree, re-runs affected sub-stage]
```
```
CC:  S1 failed: only 21 citations detected (need 30).
     Root cause: regex doesn't match "et al." format.
     [fixes regex, re-verifies]
     Fixed. 36 citations now. Advancing.
```
```
P0 → [CHECKPOINT] → S0 → S1 → S2 → [CHECKPOINT] → S3 → [CHECKPOINT?] → S4 → S5 → S6 → S7 → S8
│                             │                                        │                    │
direction                 experiment                            fully automated         delivery
(human reviews)           design                                (code, train,
                          (human reviews)                       analyze, write,
                                                                review, package)
```
| Stage | What Happens | Output |
|---|---|---|
| P0 | You + CC brainstorm direction, novelty check | RESEARCH_PROPOSAL.md |
| [CHECKPOINT] – human reviews direction, checks factual claims | | |
| S0 | Hardware detection, init .ai/ knowledge base | hardware_profile.json, .ai/ |
| S1 | Literature survey (30+ papers, multi-source) | RELATED_WORK.md, BASELINES.md, bibliography.bib |
| S2 | 6-perspective hypothesis debate, experiment design | EXPERIMENT_PLAN.md, experiment_tree.json |
| [CHECKPOINT] – human reviews design, verifies budget, checks baselines | | |
| S3 | Implement methods + baselines | src/, requirements.txt |
| [CHECKPOINT?] – triggers if novel methods exist; human reviews implementation details | | |
| S4 | Progressive tree search (4 sub-stages, 6+ experiments) | RESULTS_SUMMARY.md, results/ |
| S5 | Statistical analysis, 6-perspective interpretation | ANALYSIS.md, tables/, figures/ |
| S6 | LaTeX paper with 5-step citation verification | paper.pdf |
| S7 | Cross-model adversarial review (Claude writes, GPT reviews) | reviews/ |
| S8 | Package deliverables, venue-specific checklist | DELIVERY/ |
10 stages (P0 + S0-S8), 2-3 need human input (P0, S2, and S3 if you proposed a novel method). The rest run fully automated.
| Mechanism | What It Does |
|---|---|
| Human checkpoints | Pipeline pauses after P0, S2, and conditionally S3 (novel methods) for human review: fixed checklist + LLM adversarial brief + raw file access |
| 3-Layer Judge | Layer 1: deterministic checks. Layer 2: content quality. Layer 3: first-principles reasoning that challenges assumptions, catches factual errors, and flags insufficient experimental design |
| Circuit breaker | Same failure 3× → pauses for human input |
| State Guard | Deterministic Python checks after every stage |
| Citation verification | 5-step protocol: every citation verified in 2+ sources |
| Cross-model review | External model (GPT) reviews the paper Claude wrote |
| Hardware detection | Prevents experiments exceeding GPU/RAM capacity |
| Git pre-registration | Experiment design committed before execution |
| Venue checklists | NeurIPS/ICML/ICLR/ACL-specific checks before delivery |
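The circuit-breaker row above can be sketched in a few lines. This is a minimal illustration of the idea; the real pipeline keeps its failure counts in state files rather than in memory, and the class name is an assumption.

```python
from collections import Counter

class CircuitBreaker:
    """Pause for human input once the same failure repeats enough times."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = Counter()   # failure signature -> occurrence count

    def record_failure(self, signature: str) -> str:
        """Record one failure; return the next action to take."""
        self.failures[signature] += 1
        if self.failures[signature] >= self.threshold:
            return "pause_for_human"
        return "retry"
```

Keying on a failure signature (rather than a global count) lets the pipeline keep retrying distinct problems while escalating only the one that repeats.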
Every stage passes through a judge before advancing. The judge doesn't just check boxes: it reasons from first principles.
```
S2 just completed. Judge evaluating...

Layer 1 (deterministic):
  ✓ EXPERIMENT_PLAN.md exists
  ✓ experiment_tree.json has 8 nodes
  ✓ Each node has id, approach, success_metric

Layer 2 (content quality):
  ✓ Hypotheses are genuinely different
  ✓ Multi-perspective debate includes real criticism
  ✗ Baseline list missing DenseFormer – important competitor

Layer 3 (first-principles):
  ✗ "All experiments use LLaMA only. A benchmark claim at NeurIPS
     requires ≥2 architectures to demonstrate generalizability."
  ✗ "H3 states 'AttnRes underperforms on dense models' as fact,
     but evidence_basis is speculative. Reframe as question."
  ✗ "Budget: 16 nodes × 500M tokens = 43 GPU-hours for screening
     alone. 200M tokens would suffice and leave room for ablations."

Result: RETRY
Guidance: "Add second architecture (e.g., Pythia). Reframe H3 as
question. Reduce screening to 200M tokens."
```
Without Layer 3, this stage would have passed: the files exist, the counts are right. But the research design had fundamental problems that would have wasted 43 GPU-hours and produced a paper no reviewer would accept.
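The Layer 1 pass in the example above amounts to deterministic file and structure checks. A minimal sketch, with the JSON field names assumed from the example rather than taken from the pipeline's actual schema:

```python
import json
from pathlib import Path

def layer1_checks(stage_dir: Path) -> list[str]:
    """Deterministic Layer-1 checks for S2 outputs; returns problems found."""
    problems = []
    if not (stage_dir / "EXPERIMENT_PLAN.md").exists():
        problems.append("EXPERIMENT_PLAN.md missing")
    tree_path = stage_dir / "experiment_tree.json"
    if not tree_path.exists():
        problems.append("experiment_tree.json missing")
        return problems
    tree = json.loads(tree_path.read_text())
    # Every experiment node must declare what it tests and how success is judged.
    for node in tree.get("nodes", []):
        for field in ("id", "approach", "success_metric"):
            if field not in node:
                problems.append(f"node {node.get('id', '?')} missing '{field}'")
    return problems
```

Layers 2 and 3 are not mechanical like this; they are LLM judgments, which is exactly why the deterministic layer runs first and cheaply.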
When working on cutting-edge topics, CC proactively watches for:
| Risk | CC's Response |
|---|---|
| Paper too new for Semantic Scholar | Falls back to web search, broadens to related fields |
| Not enough papers for 30-citation threshold | Supplements with adjacent literature |
| Experiments exceed GPU budget | Adjusts batch size, reduces model scale, or stops early |
| Citation format mismatch | Fixes validation regex, re-verifies |
| API errors mid-pipeline | Retries with exponential backoff, or resumes from checkpoint |
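The retry-with-backoff response in the last row can be sketched as a generic helper; this is an illustration of the pattern, not the pipeline's actual code, and the parameter names are assumptions.

```python
import time

def retry_with_backoff(call, max_attempts: int = 4, base_delay: float = 1.0,
                       sleep=time.sleep):
    """Retry a flaky call with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: resume from the last checkpoint instead
            sleep(base_delay * 2 ** attempt)
```

The injectable `sleep` keeps the helper testable; in the pipeline, exhausting retries would hand control back to local CC to resume from a checkpoint.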
Optional MCP servers for enhanced paper search:
```shell
claude mcp add semantic-scholar -- npx -y <semantic-scholar-mcp-package>
```

Final deliverables:

```
DELIVERY/
  paper.pdf              # Peer-review-quality paper
  code/src/              # Clean implementation
  code/requirements.txt  # Dependencies
  results/               # Experiment tree, metrics, figures
  reviews/               # All reviewer feedback
  DELIVERY.md            # Summary for human review
```
Design informed by:
- ARIS – cross-model review, state-persistent loops
- Sibyl – multi-perspective debate, self-healing
- GPT-Researcher – multi-source aggregation
- AgentInfra – .ai/ persistent knowledge layer
- Orchestra Research – citation verification, venue checklists
- LaTeX Document Skill – LaTeX debugging patterns, long-form best practices
- Academic Writing Skills – academic style guide, common writing errors
- Antigravity Awesome Skills – incremental build-verify, coding practices, TDD principles
MIT