
Commit 1b5fce4

Document complete auto research run
1 parent a8e80d4 commit 1b5fce4

2 files changed

Lines changed: 200 additions & 11 deletions

File tree

README.md

Lines changed: 152 additions & 6 deletions
@@ -138,23 +138,110 @@ Useful options:
138138

139139
```bash
140140
python examples/auto-research/run.py --approval-mode validation-only "..."
141-
python examples/auto-research/run.py --resume-run examples/auto-research/runs/<run-id> --approval-mode validation-only
141+
python examples/auto-research/run.py --resume-run /tmp/agentworld-auto-research-runs/<run-id> --approval-mode validation-only
142142
python examples/auto-research/run.py --quiet "..."
143143
python examples/auto-research/run.py --timeout 7200 --max-attempts 2 "..."
144144
```
145145

146-
Run an AutoR-style workflow with the real Claude Code controller:
146+
Run a complete AutoR-style workflow with the real Claude Code controller:
147+
148+
```bash
149+
python - <<'PY'
150+
import importlib.util
151+
152+
for package in ["sklearn", "numpy", "matplotlib"]:
153+
    if importlib.util.find_spec(package) is None:
154+
        raise SystemExit(f"Missing required package: {package}")
155+
print("Python dependencies are available.")
156+
PY
157+
158+
claude --version
159+
```
147160

148161
```bash
149162
python examples/auto-research/run.py \
163+
--runs-dir /tmp/agentworld-auto-research-runs \
150164
--approval-mode validation-only \
151-
--permission-mode acceptEdits \
165+
--permission-mode bypassPermissions \
152166
--max-attempts 2 \
153-
"Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train at least two classical machine learning baselines, compare their accuracy and confusion matrices, save reusable Python code, produce one summary figure, and write a concise research-style report with methods, results, limitations, and reproducibility notes."
167+
--timeout 7200 \
168+
"Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train SVM-RBF, RandomForest, kNN, LogisticRegression, and DecisionTree baselines. The experiment stage must actually execute the Python training script and produce real cross-validation results, held-out test results, confusion matrices, hypothesis verdicts, figures, and a paper-style report. Do not use predicted or literature-only results as a substitute for execution."
154169
```
155170

156171
This example uses the real Claude Code controller by default and requires a working authenticated `claude` CLI. The CLI prints live run, stage, Claude session, tool, validation, repair, and review progress. Use `--quiet` if you only want the final JSON result.
157172

173+
`bypassPermissions` is used here because the experiment stage must run Python training code. Safer edit-only modes such as `acceptEdits` can allow file creation but block `python workspace/code/implementation.py`, which produces an invalid run whose report is based on predictions rather than executed results.
174+
175+
After the run finishes, validate the produced artifacts:
176+
177+
```bash
178+
RUN_ROOT="$(ls -td /tmp/agentworld-auto-research-runs/* | head -1)"
179+
180+
python - <<PY
181+
import json
182+
from pathlib import Path
183+
184+
root = Path("$RUN_ROOT")
185+
manifest = json.loads((root / "run_manifest.json").read_text())
186+
results = json.loads((root / "workspace/results/results.json").read_text())
187+
188+
required = [
189+
"workspace/results/results.json",
190+
"workspace/results/cv_results.json",
191+
"workspace/results/test_results.json",
192+
"workspace/results/ablation_results.json",
193+
"workspace/results/hypothesis_verdicts.json",
194+
"workspace/results/confusion_matrices.npz",
195+
"workspace/results/experiment_manifest.json",
196+
"workspace/figures/accuracy_comparison.png",
197+
"workspace/figures/confusion_matrices.png",
198+
"workspace/figures/summary.svg",
199+
"workspace/writing/main.tex",
200+
"workspace/writing/references.bib",
201+
"workspace/artifacts/paper.pdf",
202+
"workspace/artifacts/build_log.txt",
203+
"workspace/artifacts/citation_verification.json",
204+
"workspace/artifacts/self_review.json",
205+
]
206+
207+
missing = [path for path in required if not (root / path).exists()]
208+
approved = sum(1 for stage in manifest["stages"] if stage["approved"])
209+
blocked = bool(results.get("execution_blocker")) or results.get("experiments_executed") is False
210+
exit_code = results.get("exit_code")
211+
212+
print("run_root:", root)
213+
print("run_status:", manifest["run_status"])
214+
print("approved:", approved, "/", len(manifest["stages"]))
215+
print("exit_code:", exit_code)
216+
print("missing:", missing or "none")
217+
218+
if manifest["run_status"] != "completed":
219+
    raise SystemExit("Run did not complete.")
220+
if approved != len(manifest["stages"]):
221+
    raise SystemExit("Not all stages were approved.")
222+
if blocked:
223+
    raise SystemExit("Experiment execution was blocked.")
224+
if exit_code not in (0, None):
225+
    raise SystemExit(f"Experiment script failed with exit_code={exit_code}.")
226+
if missing:
227+
    raise SystemExit("Required artifacts are missing.")
228+
229+
cv = results.get("cv_results", {})
230+
print("cv_accuracy:")
231+
for model, payload in cv.items():
232+
    if isinstance(payload, dict) and "mean_cv_accuracy" in payload:
233+
        print(f" {model}: {payload['mean_cv_accuracy']:.4f}")
234+
PY
235+
```
236+
237+
Expected checks:
238+
239+
- `run_status` is `completed`
240+
- every stage is approved
241+
- `workspace/results/results.json` records `exit_code: 0` or equivalent successful execution metadata (the script above also accepts a missing `exit_code`)
242+
- no `execution_blocker` is present and `experiments_executed` is not `false`
243+
- result JSON files, confusion matrices, figures, manuscript source, PDF, citation verification, and self-review artifacts exist
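
The validation script above exits at the first failed check. As a hedged alternative, the sketch below gathers every failed check into one list so a broken run can be diagnosed in a single pass. The function name `collect_failures` and the sample dictionaries are hypothetical; only the keys (`run_status`, `stages`, `approved`, `execution_blocker`, `experiments_executed`, `exit_code`) come from the script above.

```python
from typing import Any, Dict, List


def collect_failures(
    manifest: Dict[str, Any],
    results: Dict[str, Any],
    missing: List[str],
) -> List[str]:
    """Return a message for every failed check instead of stopping at the first."""
    failures: List[str] = []
    stages = manifest.get("stages", [])
    approved = sum(1 for stage in stages if stage.get("approved"))

    if manifest.get("run_status") != "completed":
        failures.append("run_status is not 'completed'")
    if approved != len(stages):
        failures.append(f"only {approved} of {len(stages)} stages approved")
    if results.get("execution_blocker") or results.get("experiments_executed") is False:
        failures.append("experiment execution was blocked")
    # Mirrors the script above: a missing exit_code is tolerated, a non-zero one is not.
    if results.get("exit_code") not in (0, None):
        failures.append(f"exit_code={results.get('exit_code')}")
    if missing:
        failures.append("missing artifacts: " + ", ".join(missing))
    return failures


# Hypothetical in-memory data for illustration, not output from a real run.
manifest = {"run_status": "completed", "stages": [{"approved": True}, {"approved": False}]}
results = {"exit_code": 0, "experiments_executed": True}
print(collect_failures(manifest, results, missing=[]))
```

An empty list means every check passed; any other result can be printed or turned into a non-zero exit status by the caller.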
244+
158245
Run the real Claude Code smoke graph:
159246

160247
```bash
@@ -203,8 +290,8 @@ result = run_auto_research(
203290
"concise research-style report."
204291
),
205292
runs_dir=Path("runs"),
206-
approval_mode="manual",
207-
permission_mode="default",
293+
approval_mode="validation-only",
294+
permission_mode="bypassPermissions",
208295
)
209296

210297
print(result.success)
@@ -226,6 +313,8 @@ result = run_auto_research(
226313

227314
`validation-only` still uses the real controller. It only replaces the manual approval prompt with validation-based approval.
228315

316+
For experiment-heavy workflows, use a permission mode that allows the strong agent to execute local commands inside the run workspace. If command execution is blocked, AgentWorld validation rejects result files that explicitly report blocked, skipped, or unexecuted experiments.
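
To make the "blocked, skipped, or unexecuted" condition concrete, here is a minimal sketch of such a check. It is not AgentWorld's actual validator, whose implementation is not shown here; the function name `is_execution_blocked` is hypothetical, and the two keys are the ones the README's validation script reads from `results.json`.

```python
from typing import Any, Dict


def is_execution_blocked(results: Dict[str, Any]) -> bool:
    """Return True when a results payload explicitly reports no real execution."""
    # Any truthy blocker (for example a reason string) counts as blocked.
    if results.get("execution_blocker"):
        return True
    # Only an explicit False counts; a missing key is treated as "not reported".
    return results.get("experiments_executed") is False


# Hypothetical payloads for illustration only.
print(is_execution_blocked({"experiments_executed": True}))
print(is_execution_blocked({"execution_blocker": "python command denied"}))
```

Using `is False` rather than a plain falsy test keeps a result file that omits `experiments_executed` from being rejected on that key alone.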
317+
229318
### Build A Skill-Aware Graph
230319

231320
```python
@@ -328,6 +417,63 @@ run_root/
328417

329418
The important files are intentionally plain files. Strong agents can inspect and update the workspace directly, while AgentWorld maintains the stage manifest, artifact index, approved memory, and handoffs.
330419

420+
### Complete Auto-Research Case
421+
422+
The recommended smoke case is a small, fully executable scientific workflow:
423+
424+
1. build a literature-backed plan for the scikit-learn digits dataset
425+
2. generate typed hypotheses
426+
3. design the study
427+
4. write reusable training code
428+
5. execute the experiment
429+
6. analyze measured results
430+
7. write a paper-style report
431+
8. prepare release artifacts
432+
433+
Run it with Claude Code:
434+
435+
```bash
436+
python examples/auto-research/run.py \
437+
--runs-dir /tmp/agentworld-auto-research-runs \
438+
--approval-mode validation-only \
439+
--permission-mode bypassPermissions \
440+
--max-attempts 2 \
441+
--timeout 7200 \
442+
"Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train SVM-RBF, RandomForest, kNN, LogisticRegression, and DecisionTree baselines. The experiment stage must actually execute the Python training script and produce real cross-validation results, held-out test results, confusion matrices, hypothesis verdicts, figures, and a paper-style report. Do not use predicted or literature-only results as a substitute for execution."
443+
```
444+
445+
A valid completed run should include at least:
446+
447+
```text
448+
workspace/results/results.json
449+
workspace/results/cv_results.json
450+
workspace/results/test_results.json
451+
workspace/results/ablation_results.json
452+
workspace/results/hypothesis_verdicts.json
453+
workspace/results/confusion_matrices.npz
454+
workspace/figures/accuracy_comparison.png
455+
workspace/figures/confusion_matrices.png
456+
workspace/figures/summary.svg
457+
workspace/writing/main.tex
458+
workspace/writing/references.bib
459+
workspace/artifacts/paper.pdf
460+
workspace/artifacts/build_log.txt
461+
workspace/artifacts/citation_verification.json
462+
workspace/artifacts/self_review.json
463+
```
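
The list above can be checked mechanically. The sketch below is one way to do it, assuming a hypothetical helper `missing_artifacts`. It uses `Path.is_file()`, which is slightly stricter than an existence check because a directory at the same path would still count as missing. The demo builds a throwaway directory rather than reading a real run.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_artifacts(root: Path, required: List[str]) -> List[str]:
    """Return the relative paths from `required` that are not regular files under `root`."""
    return [rel for rel in required if not (root / rel).is_file()]


# Demo against a temporary directory that stands in for a run root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    present = root / "workspace/results/results.json"
    present.parent.mkdir(parents=True)
    present.write_text("{}")
    demo_missing = missing_artifacts(
        root,
        ["workspace/results/results.json", "workspace/artifacts/paper.pdf"],
    )

print(demo_missing)
```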
464+
465+
If a run fails or is interrupted, resume it from the run root:
466+
467+
```bash
468+
python examples/auto-research/run.py \
469+
--resume-run /tmp/agentworld-auto-research-runs/<run-id> \
470+
--approval-mode validation-only \
471+
--permission-mode bypassPermissions \
472+
--max-attempts 2
473+
```
474+
475+
Do not treat a run as successful just because `run_manifest.json` reports `completed`. The experiment artifacts must also show successful execution: `workspace/results/results.json` must contain no `execution_blocker` and must not set `experiments_executed` to `false`.
476+
331477
<a id="how-it-works"></a>
332478
## ⚙️ How It Works
333479

examples/auto-research/README.md

Lines changed: 48 additions & 5 deletions
@@ -10,10 +10,12 @@ This example intentionally does not vendor or modify AutoR. It implements an Aut
1010

1111
```bash
1212
python examples/auto-research/run.py \
13+
--runs-dir /tmp/agentworld-auto-research-runs \
1314
--approval-mode validation-only \
14-
--permission-mode acceptEdits \
15+
--permission-mode bypassPermissions \
1516
--max-attempts 2 \
16-
"Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train at least two classical machine learning baselines, compare their accuracy and confusion matrices, save reusable Python code, produce one summary figure, and write a concise research-style report with methods, results, limitations, and reproducibility notes."
17+
--timeout 7200 \
18+
"Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train SVM-RBF, RandomForest, kNN, LogisticRegression, and DecisionTree baselines. The experiment stage must actually execute the Python training script and produce real cross-validation results, held-out test results, confusion matrices, hypothesis verdicts, figures, and a paper-style report. Do not use predicted or literature-only results as a substitute for execution."
1719
```
1820

1921
By default this uses Claude Code through `ClaudeCodeController`, so it requires:
@@ -22,6 +24,8 @@ By default this uses Claude Code through `ClaudeCodeController`, so it requires:
2224
- an authenticated Claude Code environment
2325
- tool permissions sufficient to read, write, edit, and run local commands inside the run workspace
2426

27+
`bypassPermissions` is recommended for this case because the experiment stage must execute Python training code. If the local Claude Code policy blocks Python execution, AgentWorld rejects the stage instead of allowing a prediction-only report.
28+
2529
The CLI prints live progress while the strong agent is running:
2630

2731
- run root and selected stages
@@ -36,7 +40,46 @@ Use `--quiet` to suppress progress lines and only print the final JSON result.
3640
The generated run is written under:
3741

3842
```text
39-
examples/auto-research/runs/
43+
/tmp/agentworld-auto-research-runs/
44+
```
45+
46+
Validate the latest run:
47+
48+
```bash
49+
RUN_ROOT="$(ls -td /tmp/agentworld-auto-research-runs/* | head -1)"
50+
51+
python - <<PY
52+
import json
53+
from pathlib import Path
54+
55+
root = Path("$RUN_ROOT")
56+
manifest = json.loads((root / "run_manifest.json").read_text())
57+
results = json.loads((root / "workspace/results/results.json").read_text())
58+
required = [
59+
"workspace/results/cv_results.json",
60+
"workspace/results/test_results.json",
61+
"workspace/results/ablation_results.json",
62+
"workspace/results/hypothesis_verdicts.json",
63+
"workspace/results/confusion_matrices.npz",
64+
"workspace/figures/accuracy_comparison.png",
65+
"workspace/figures/confusion_matrices.png",
66+
"workspace/figures/summary.svg",
67+
"workspace/artifacts/paper.pdf",
68+
]
69+
missing = [path for path in required if not (root / path).exists()]
70+
approved = sum(1 for stage in manifest["stages"] if stage["approved"])
71+
print("run_root:", root)
72+
print("run_status:", manifest["run_status"])
73+
print("approved:", approved, "/", len(manifest["stages"]))
74+
print("exit_code:", results.get("exit_code"))
75+
print("missing:", missing or "none")
76+
if manifest["run_status"] != "completed" or approved != len(manifest["stages"]):
77+
    raise SystemExit("Run did not complete cleanly.")
78+
if results.get("execution_blocker") or results.get("experiments_executed") is False:
79+
    raise SystemExit("Experiment was blocked or not executed.")
80+
if missing:
81+
    raise SystemExit("Required artifacts are missing.")
82+
PY
4083
```
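
The `ls -td ... | head -1` step picks the most recently modified entry and leaves `RUN_ROOT` empty when the runs directory has no entries. If that selection is done in Python instead, a sketch along these lines may be easier to guard. The helper name `latest_run_root` is hypothetical, and unlike `ls` it considers directories only.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_run_root(runs_dir: Path) -> Optional[Path]:
    """Return the most recently modified subdirectory of `runs_dir`, or None if there is none."""
    candidates = [path for path in runs_dir.iterdir() if path.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Demo with two fake run directories whose modification times are set explicitly.
with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp)
    for name, mtime in [("run-old", 1_000), ("run-new", 2_000)]:
        (runs / name).mkdir()
        os.utime(runs / name, (mtime, mtime))
    newest = latest_run_root(runs)

print(newest.name if newest else "no runs found")
```

Returning `None` lets the caller stop with a clear message before trying to open `run_manifest.json` under a path that does not exist.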
4184

4285
## What It Demonstrates
@@ -65,8 +108,8 @@ To continue a failed or interrupted run, point `--resume-run` at the run root:
65108

66109
```bash
67110
python examples/auto-research/run.py \
68-
--resume-run examples/auto-research/runs/<run-id> \
111+
--resume-run /tmp/agentworld-auto-research-runs/<run-id> \
69112
--approval-mode validation-only \
70-
--permission-mode acceptEdits \
113+
--permission-mode bypassPermissions \
71114
--max-attempts 2
72115
```
