"Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train at least two classical machine learning baselines, compare their accuracy and confusion matrices, save reusable Python code, produce one summary figure, and write a concise research-style report with methods, results, limitations, and reproducibility notes."
167
+
--timeout 7200 \
168
+
"Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train SVM-RBF, RandomForest, kNN, LogisticRegression, and DecisionTree baselines. The experiment stage must actually execute the Python training script and produce real cross-validation results, held-out test results, confusion matrices, hypothesis verdicts, figures, and a paper-style report. Do not use predicted or literature-only results as a substitute for execution."
154
169
```
155
170
156
171
This example uses the real Claude Code controller by default and requires a working authenticated `claude` CLI. The CLI prints live run, stage, Claude session, tool, validation, repair, and review progress. Use `--quiet` if you only want the final JSON result.

`bypassPermissions` is used here because the experimentation stage must run Python training code. Safer edit-only modes can allow file creation but block `python workspace/code/implementation.py`, producing an invalid run where the report is based on predictions rather than executed results.
After the run finishes, validate the produced artifacts:

```bash
RUN_ROOT="$(ls -td /tmp/agentworld-auto-research-runs/* | head -1)"
```

A valid run satisfies at least the following checks (a scripted version is sketched below):

- `workspace/results/results.json` records `exit_code: 0` or equivalent successful execution metadata
- no `execution_blocker` is present
- result JSON files, confusion matrices, figures, manuscript source, PDF, citation verification, and self-review artifacts exist
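These checks can be scripted. The sketch below assumes `results.json` is a flat JSON object carrying the `exit_code` and optional `execution_blocker` fields described above, and that the workspace layout matches the artifact list later in this guide; treat both as assumptions to adapt to your run.

```python
# Sketch of the post-run checks listed above. exit_code and execution_blocker come
# from this guide; the exact results.json schema and the artifact paths should be
# adjusted to whatever your run actually produced.
import json
import sys
from pathlib import Path

run_root = Path(sys.argv[1])  # e.g. the RUN_ROOT computed above
results = json.loads((run_root / "workspace/results/results.json").read_text())

assert results.get("exit_code") == 0, "training script did not exit cleanly"
assert not results.get("execution_blocker"), f"blocked: {results['execution_blocker']}"

expected = [
    "workspace/results/cv_results.json",
    "workspace/results/test_results.json",
    "workspace/results/confusion_matrices.npz",
    "workspace/figures/accuracy_comparison.png",
    "workspace/writing/main.tex",
    "workspace/artifacts/paper.pdf",
    "workspace/artifacts/citation_verification.json",
    "workspace/artifacts/self_review.json",
]
missing = [p for p in expected if not (run_root / p).exists()]
assert not missing, f"missing artifacts: {missing}"
print("run looks valid")
```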
Run the real Claude Code smoke graph:

```bash
...
```

The same settings apply when calling `run_auto_research` from Python:

```python
result = run_auto_research(
    ...
        "concise research-style report."
    ),
    runs_dir=Path("runs"),
    approval_mode="validation-only",
    permission_mode="bypassPermissions",
)

print(result.success)
```
`validation-only` still uses the real controller. It only replaces the manual approval prompt with validation-based approval.

For experiment-heavy workflows, use a permission mode that allows the strong agent to execute local commands inside the run workspace. If command execution is blocked, AgentWorld validation rejects result files that explicitly report blocked, skipped, or unexecuted experiments.
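As a rough illustration of what that rejection keys on, the sketch below treats the `execution_blocker` and `experiments_executed=false` markers mentioned in this guide as disqualifying. The helper function and the sample dict are made up for illustration and are not AgentWorld's actual validator.

```python
# Illustrative only: execution_blocker, experiments_executed, and exit_code come
# from this guide; the helper itself is a sketch, not the real validation logic.
def experiments_really_ran(results: dict) -> bool:
    if results.get("execution_blocker"):
        return False
    if results.get("experiments_executed") is False:
        return False
    return results.get("exit_code") == 0

# A blocked run leaves markers like these and gets rejected instead of being
# turned into a prediction-only report.
print(experiments_really_ran({
    "experiments_executed": False,
    "execution_blocker": "python workspace/code/implementation.py was not permitted",
}))  # False
```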
### Build A Skill-Aware Graph

```python
...
```

The important files are intentionally plain files. Strong agents can inspect and update the workspace directly, while AgentWorld maintains the stage manifest, artifact index, approved memory, and handoffs.
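A quick way to poke at those plain files is a small listing script. This is only a sketch: it assumes `run_manifest.json` sits at the run root with a `status` field, and that the workspace follows the layout shown in the complete case below; adjust both to what your run actually contains.

```python
# Sketch: list the plain files of the most recent run. The run_manifest.json
# location and its "status" field are assumptions, not a documented schema.
import json
from pathlib import Path

runs_dir = Path("/tmp/agentworld-auto-research-runs")
run_root = max(runs_dir.iterdir(), key=lambda p: p.stat().st_mtime)

manifest_path = run_root / "run_manifest.json"
if manifest_path.exists():
    manifest = json.loads(manifest_path.read_text())
    print("manifest status:", manifest.get("status"))

for path in sorted((run_root / "workspace").rglob("*")):
    if path.is_file():
        print(path.relative_to(run_root))
```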
### Complete Auto-Research Case

The recommended smoke case is a small, fully executable scientific workflow:

1. build a literature-backed plan for the scikit-learn digits dataset
2. generate typed hypotheses
3. design the study
4. write reusable training code
5. execute the experiment
6. analyze measured results
7. write a paper-style report
8. prepare release artifacts

Run it with Claude Code:
```bash
python examples/auto-research/run.py \
  --runs-dir /tmp/agentworld-auto-research-runs \
  --approval-mode validation-only \
  --permission-mode bypassPermissions \
  --max-attempts 2 \
  --timeout 7200 \
  "Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train SVM-RBF, RandomForest, kNN, LogisticRegression, and DecisionTree baselines. The experiment stage must actually execute the Python training script and produce real cross-validation results, held-out test results, confusion matrices, hypothesis verdicts, figures, and a paper-style report. Do not use predicted or literature-only results as a substitute for execution."
```
A valid completed run should include at least:

```text
workspace/results/results.json
workspace/results/cv_results.json
workspace/results/test_results.json
workspace/results/ablation_results.json
workspace/results/hypothesis_verdicts.json
workspace/results/confusion_matrices.npz
workspace/figures/accuracy_comparison.png
workspace/figures/confusion_matrices.png
workspace/figures/summary.svg
workspace/writing/main.tex
workspace/writing/references.bib
workspace/artifacts/paper.pdf
workspace/artifacts/build_log.txt
workspace/artifacts/citation_verification.json
workspace/artifacts/self_review.json
```
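For a quick look at the measured numbers, a few lines of Python can rank the baselines by held-out accuracy. The shape assumed below (a model name mapped to a metrics dict with an `accuracy` key) is a guess at what the generated training script writes, not a documented format.

```python
# Assumes test_results.json maps model names (SVM-RBF, RandomForest, kNN, ...) to a
# metrics dict with an "accuracy" entry; that shape is an assumption, not a guarantee.
import json
from pathlib import Path

run_root = Path("/tmp/agentworld-auto-research-runs") / "<run-id>"  # hypothetical run id
test_results = json.loads((run_root / "workspace/results/test_results.json").read_text())

for model, metrics in sorted(test_results.items(),
                             key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{model:20s} accuracy={metrics['accuracy']:.4f}")
```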
If a run fails or is interrupted, it can be resumed from the run root.

Do not treat a run as successful just because `run_manifest.json` says `completed`. The experiment artifacts must also show successful execution, with no `execution_blocker` and no `experiments_executed=false` marker.
### examples/auto-research/README.md
This example intentionally does not vendor or modify AutoR.

```bash
python examples/auto-research/run.py \
  --runs-dir /tmp/agentworld-auto-research-runs \
  --approval-mode validation-only \
  --permission-mode bypassPermissions \
  --max-attempts 2 \
  --timeout 7200 \
  "Build and evaluate a handwritten digit classification model on the scikit-learn digits dataset. Train SVM-RBF, RandomForest, kNN, LogisticRegression, and DecisionTree baselines. The experiment stage must actually execute the Python training script and produce real cross-validation results, held-out test results, confusion matrices, hypothesis verdicts, figures, and a paper-style report. Do not use predicted or literature-only results as a substitute for execution."
```
By default this uses Claude Code through `ClaudeCodeController`, so it requires:

- an authenticated Claude Code environment
- tool permissions sufficient to read, write, edit, and run local commands inside the run workspace

`bypassPermissions` is recommended for this case because the experiment stage must execute Python training code. If the local Claude Code policy blocks Python execution, AgentWorld rejects the stage instead of allowing a prediction-only report.

The CLI prints live progress while the strong agent is running:

- run root and selected stages

Use `--quiet` to suppress progress lines and only print the final JSON result.

The generated run is written under:

```text
/tmp/agentworld-auto-research-runs/
```

Validate the latest run:

```bash
RUN_ROOT="$(ls -td /tmp/agentworld-auto-research-runs/* | head -1)"
```