kengz · Mar 6, 2026
diff --git a/.claude/skills/benchmark/SKILL.md b/.claude/skills/benchmark/SKILL.md
@@ -14,25 +14,105 @@ description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract result
 5. **Runs must complete in <6h** (dstack max_duration)
 6. **Max 10 concurrent dstack runs** — launch in batches of 10, wait for capacity/completion before launching more. Never submit all runs at once; dstack capacity is limited and mass submissions cause "no offers" failures
 
+## Frame Budget — MANDATORY CALCULATION (do this BEFORE every submission)
+
+**dstack kills jobs at 6h with ZERO data** — no trial_metrics, no HF upload, nothing. A run killed at the wall = complete waste.
+
+**Rule: max_frame = observed_fps × 5.5h × 3600** (5.5h, not 6h — leaves 30min margin)
+
+**ALWAYS check FPS after 5-10 min of a new run before committing to the frame budget:**
+```bash
+dstack logs NAME --since 10m 2>&1 | grep "trial_metrics" | tail -3
+# fps = frames_so_far / elapsed_seconds
+```
+If projected wall clock > 5.5h at observed fps → **stop immediately and relaunch with reduced max_frame**.
+
+**Known fps at 64 envs (ppo_playground):**
+| Env category | fps | Safe max_frame (5.5h) |
+|---|---|---|
+| CartpoleBalance, CheetahRun, WalkerWalk | ~450-1800 | 8M–10M |
+| WalkerStand, HopperStand | ~270 | 5M |
+| HumanoidStand | ~200 | 4M |
+| HumanoidWalk | ~290 | 5M |
+| Rough terrain loco (G1Rough, T1Rough, Go1Getup) | ~60-65 | 1M |
+| BerkeleyHumanoidRough | ~36 | 700K |
+
+**For unknown envs:** Submit with a conservative max_frame of 2M, check fps after 5 min, then stop and relaunch with the correct budget if the projected wall clock exceeds 5.5h.
+
+## GPU Utilization Check — MANDATORY for Phase 5 / MJWarp runs
+
+**MJWarp must run on GPU. Always verify GPU is actually utilized after a new run starts.**
+
+```bash
+# Option 1: dstack metrics (easiest — shows live GPU %)
+dstack metrics NAME
+
+# Option 2: SSH in and run nvidia-smi
+dstack ssh NAME
+# inside the instance:
+nvidia-smi
+watch -n 2 nvidia-smi
+```
+
+**Thresholds:**
+- GPU util >80% → MJWarp GPU acceleration working correctly ✅
+- GPU util <20% → GPU not utilized — CPU fallback or JAX not using CUDA ❌ Stop run, investigate
+
+**What to check:**
+- GPU utilization % (should be high)
+- GPU memory used (1024 envs on A5000 24GB — expect 8–16GB used)
+- Confirm logs show: `Playground device: GPU (cuda) — DLPack zero-copy` and `impl=warp`
+
+**FPS sanity check for MJWarp at high num_envs (A5000):**
+- 64 envs → ~450fps (confirmed baseline)
+- 1024 envs → ~5000–7000fps expected (near-linear GPU scaling; strictly linear from the 64-env baseline would be ~7200fps)
+- 512 envs → ~2500–3500fps expected
+- If fps < 1000 at 1024 envs → MJWarp not GPU-accelerated, stop and investigate before launching more runs
+
+**Phase 5 Playground spec selection:**
+- DM Control (5.1): `ppo_playground` (1024 envs), `sac_playground` (256 envs), `crossq_playground` (16 envs)
+- Locomotion (5.2) / Manipulation (5.3): `ppo_playground_loco` (512 envs), same SAC/CrossQ specs
+- DM Control with NaN rewards: override with `-s normalize_obs=false`
+- Run order: PPO first (fastest), then SAC, then CrossQ
+
 ## Per-Run Intake Checklist
 
 **Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.**
 
 When a run completes (`dstack ps` shows `exited (0)`):
 
-1. **Extract score**: `dstack logs NAME | grep "trial_metrics"` → get `total_reward_ma`
+1. **Extract score + stats** from logs:
+   ```bash
+   dstack logs NAME 2>&1 | grep "trial_metrics"   # → total_reward_ma, frame
+   dstack logs NAME 2>&1 | grep "fps:" | tail -5   # → fps (take last stable value)
+   dstack logs NAME 2>&1 | grep "wall_t:" | tail -1 # → wall_t in seconds → convert to Xh Ym
+   ```
+   - **MA** = `total_reward_ma` from trial_metrics
+   - **Frames** = `frame:` from trial_metrics (e.g. `1.00e+08`)
+   - **FPS** = last fps value from step logs (e.g. `12500`)
+   - **Wall Clock** = `wall_t` seconds → format as `Xh Ym` (e.g. `2h 18m`)
 2. **Find HF folder name**: `dstack logs NAME 2>&1 | grep "Uploading data/"` → extract folder name from the upload log line
-3. **Update table score** in BENCHMARKS.md
+3. **Update table** in BENCHMARKS.md: fill ALL columns — MA, HF Data, FPS, Frames, Wall Clock
 4. **Update table HF link**: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
 5. **Pull HF data locally**: `source .env && huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
-6. **Generate plot**: List ALL data folders for that env (`ls data/benchmark-dev/data/ | grep -i envname`), then generate with ONLY the folders matching BENCHMARKS.md entries:
+6. **Generate plot** (MANDATORY — do NOT skip):
    ```bash
    uv run slm-lab plot -t "EnvName" -d data/benchmark-dev/data -f FOLDER1,FOLDER2,...
    ```
-   NOTE: `-d` sets the base data dir, `-f` takes folder names (NOT full paths).
-   If some folders are in `data/` (local runs) and some in `data/benchmark-dev/data/`, use `data/` as base (it has the `info/` subfolder needed for metrics).
-7. **Verify plot exists** in `docs/plots/`
-8. **Commit** score + link + plot together
+   CRITICAL RULES for plot generation:
+   - Use ONLY the exact folder(s) from the HF Data column of the BENCHMARKS.md table — NEVER grep or ls to find folders
+   - Multiple folders in data/benchmark-dev/data/ may exist for the same env (old failed runs + new good runs). Only use the canonical folder from the table.
+   - Include ALL algorithms that have entries in the table for that env (e.g., both PPO and SAC folders if both have scores)
+   - If the canonical folder is in local `data/` (not in `data/benchmark-dev/data/`), use `-d data` instead
+   - `-d` sets the base data dir, `-f` takes folder names (NOT full paths)
+7. **Display plot** (MANDATORY — call the Read tool on the image file, no exceptions):
+   ```
+   Read: docs/plots/EnvName_multi_trial_graph_mean_returns_ma_vs_frames.png
+   ```
+   This MUST happen in your agent turn — call Read, see the image, THEN send your completion message.
+   Team-lead must also call Read to display it in the main conversation.
+8. **Embed plot in BENCHMARKS.md** — for Phase 5 playground envs, ensure the plot is in the DM Control plot grid (search for the existing grid in the Phase 5 section). If the env is already in the grid, no action needed. If missing, add it.
+9. **Commit** score + link + plot together
 
 A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
 
@@ -136,18 +216,65 @@ source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAM
 
 Budget: ~3-4 trials per dimension. After search: update spec with best params, run `train`, use that result.
 
-## Autonomous Execution
+## Agent Team Workflow (MANDATORY for team lead)
+
+**You are the team lead. Never work solo on benchmarks — always spawn an agent team.**
+
+### Team Roles
+
+**launcher**: Reads BENCHMARKS.md, identifies missing entries, launches up to 10 dstack runs. Checks FPS after ~5min and stops slow runs (>5.5h projected, per the frame budget rule). Reports run names + envs to team lead.
+
+**monitor**: Polls `dstack ps` every 5min (`sleep 300 && dstack ps`). Detects completions and failures. When runs complete, assigns intake tasks. When runs fail, reports to team lead immediately. Runs continuously until all runs are done.
+
+**intake-A / intake-B / intake-C**: Each owns a batch of 3-4 completed runs. Executes the full intake checklist (score → HF folder → pull data → plot → BENCHMARKS.md update). Does NOT commit; team lead commits.
+
+### Spawn Pattern
+
+```
+TeamCreate → TaskCreate (one per batch of runs) →
+  Agent(launcher) + Agent(monitor) + Agent(intake-A) + Agent(intake-B) + ...
+```
+
+Spawn all agents in parallel. Intake agents start idle and pick up work as monitor assigns completed runs.
+
+### Team Lead Responsibilities
+
+1. **On spawn**: Brief each agent with full context (run names, env names, BENCHMARKS.md format, intake checklist)
+2. **On intake completion**: Read each plot image (Read tool), verify BENCHMARKS.md edits, then commit
+3. **On monitor report**: If runs fail, relaunch immediately; if fps too slow, stop + reduce frames
+4. **Commit cadence**: Batch-commit after each intake wave (score + HF link + plot per commit)
+5. **Shutdown team**: Once every run has been through intake and committed, send `shutdown_request` to all teammates
+
+### Monitor Agent Instructions Template
+
+```
+You are monitor on team TEAM_NAME. Poll dstack ps every 5min.
+Active runs: [LIST OF RUN NAMES]
+When a run shows exited (0): send message to team-lead with run name and env name.
+When a run shows exited (1) or failed: send message to team-lead immediately.
+Use: while true; do dstack ps; sleep 300; done
+Stop when team-lead sends shutdown_request.
+```
+
+### Intake Agent Instructions Template
+
+```
+You are intake-agent-X on team TEAM_NAME. Intake these completed runs: [LIST]
+For each run, follow the full intake checklist in the benchmark skill.
+Working dir: /Users/keng/projects/SLM-Lab
+Do NOT commit — team lead commits.
+After all runs done: send results summary to team-lead (scores, HF folders, any issues).
+```
 
-Work continuously when benchmarking. Use `sleep 300 && dstack ps` to actively wait (5 min intervals) — never delegate monitoring to background processes or scripts. Stay engaged in the conversation.
+### Autonomous Execution
 
-**Workflow loop** (repeat every 5-10 minutes):
-1. **Check status**: `dstack ps` — identify completed/failed/running
-2. **Intake completed runs**: For EACH completed run, do the full intake checklist above (score → HF link → pull → plot → table update)
-3. **Launch next batch**: Up to 10 concurrent. Check capacity before launching more
-4. **Iterate on failures**: Relaunch or adjust config immediately
-5. **Commit progress**: Regular commits of score + link + plot updates
+**Workflow loop** (team lead orchestrates, agents execute):
+1. **launcher**: Identifies gaps in BENCHMARKS.md → launches up to 10 runs → reports to team lead
+2. **monitor**: Watches for completions → notifies team lead → assigns intake work
+3. **intake agents**: Execute full checklist per run → report to team lead
+4. **team lead**: Reviews plots, commits, relaunches failures, spawns next batch
 
-**Key principle**: Work continuously, check in regularly, iterate immediately on failures. Never idle. Keep reminding yourself to continue without pausing — check on tasks, update, plan, and pick up the next task immediately until all tasks are completed.
+**Key principle**: Keep agents working in parallel. Never idle as team lead while GPU runs are active — spawn a monitor agent. Commit after each intake wave. Shut down team cleanly when done.
 
 ## Troubleshooting
 
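Review note on the MJWarp FPS sanity check in the SKILL.md hunk above: the expected ranges for 512 and 1024 envs can be compared against what strictly linear scaling from the confirmed 64-env baseline (~450 fps) would give. The sketch below is only that arithmetic, in Python; the function and constant names are hypothetical and not part of SLM-Lab, and the baseline and stop threshold are copied from SKILL.md.

```python
# Hypothetical helper, not part of SLM-Lab: arithmetic behind the FPS sanity check.
BASELINE_ENVS = 64        # confirmed baseline in SKILL.md
BASELINE_FPS = 450.0      # ~450 fps at 64 envs on an A5000 (SKILL.md)
STOP_FPS_AT_1024 = 1000   # SKILL.md: below this at 1024 envs, stop and investigate


def linear_fps(num_envs: int) -> float:
    """fps if throughput scaled exactly linearly with num_envs (an upper bound)."""
    return BASELINE_FPS * num_envs / BASELINE_ENVS


def should_stop_1024(observed_fps: float) -> bool:
    """Apply the SKILL.md rule for 1024-env runs: fps < 1000 means stop."""
    return observed_fps < STOP_FPS_AT_1024


if __name__ == "__main__":
    for n in (64, 512, 1024):
        print(n, linear_fps(n))
```

Strict linearity gives 3600 fps at 512 envs and 7200 fps at 1024 envs, so the quoted 2500–3500 and 5000–7000 ranges sit slightly below the linear bound.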

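Review note on intake steps 1 and 4 in SKILL.md: the conversions from raw log values to table cells (`wall_t` seconds to `Xh Ym`, frame count to the `1.00e+08` style, folder name to the HF link) are mechanical, so a small helper could reduce formatting drift between intake agents. A minimal Python sketch, assuming the values have already been extracted from the logs; all function names are hypothetical, and only the URL pattern is taken from the checklist.

```python
# Hypothetical helpers, not part of SLM-Lab: format intake values for BENCHMARKS.md.
HF_TREE = "https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data"


def format_wall_clock(wall_t: float) -> str:
    """Convert wall_t in seconds to the 'Xh Ym' form used in the table."""
    hours, rem = divmod(int(wall_t), 3600)
    return f"{hours}h {rem // 60}m"


def format_frames(frame: float) -> str:
    """Render a frame count in the scientific style shown in trial_metrics."""
    return f"{frame:.2e}"


def hf_link(folder: str) -> str:
    """Build the Markdown link for the HF Data column from a folder name."""
    return f"[{folder}]({HF_TREE}/{folder})"


if __name__ == "__main__":
    print(format_wall_clock(8280), format_frames(1e8), hf_link("FOLDER"))
```

For example, `format_wall_clock(8280)` yields `2h 18m`, matching the example in the checklist.
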
diff --git a/.dstack/run-gpu-search.yml b/.dstack/run-gpu-search.yml
@@ -16,10 +16,12 @@ env:
   - PROFILE
   - PROF_SKIP
   - PROF_ACTIVE
+  - UV_HTTP_TIMEOUT=300
 
 commands:
   - apt-get update && apt-get install -y swig libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev libgomp1
-  - cd /workflow && uv sync
+  - cd /workflow && uv sync --group playground
+  - cd /workflow && uv run python -c "from mujoco_playground._src.mjx_env import ensure_menagerie_exists; ensure_menagerie_exists()"
   - cd /workflow && uv run slm-lab run ${SPEC_VARS} ${SPEC_FILE} ${SPEC_NAME} ${LAB_MODE} --upload-hf
 
 resources:

diff --git a/.dstack/run-gpu-train.yml b/.dstack/run-gpu-train.yml
@@ -16,10 +16,13 @@ env:
   - PROFILE
   - PROF_SKIP
   - PROF_ACTIVE
+  - XLA_PYTHON_CLIENT_PREALLOCATE=false
+  - UV_HTTP_TIMEOUT=300
 
 commands:
   - apt-get update && apt-get install -y swig libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev libgomp1
-  - cd /workflow && uv sync
+  - cd /workflow && uv sync --group playground
+  - cd /workflow && uv run python -c "from mujoco_playground._src.mjx_env import ensure_menagerie_exists; ensure_menagerie_exists()"
   - cd /workflow && uv run slm-lab run ${SPEC_VARS} ${SPEC_FILE} ${SPEC_NAME} ${LAB_MODE} --upload-hf
 
 resources:
@@ -29,7 +32,7 @@ resources:
   memory: 32GB..
 
 spot_policy: auto
-max_duration: 8h
+max_duration: 6h
 max_price: 0.50
 retry:
   on_events: [no-capacity]
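
Review note: with `max_duration: 6h` now set in run-gpu-train.yml, the SKILL.md frame budget rule (max_frame = observed_fps × 5.5h × 3600) is what keeps a run inside that wall. The Python sketch below works through the rule and the relaunch decision; the function names are hypothetical and not part of SLM-Lab, and `observed_fps` is assumed to have been read from the logs by hand.

```python
# Hypothetical helpers, not part of SLM-Lab: the SKILL.md frame budget rule.
MAX_DURATION_H = 6.0  # dstack max_duration from run-gpu-train.yml
SAFE_HOURS = 5.5      # SKILL.md budget, leaves a 30 min margin


def safe_max_frame(observed_fps: float, hours: float = SAFE_HOURS) -> int:
    """Largest max_frame that finishes within `hours` at the observed fps."""
    return int(observed_fps * hours * 3600)


def projected_hours(max_frame: float, observed_fps: float) -> float:
    """Wall clock in hours needed to reach max_frame at the observed fps."""
    return max_frame / observed_fps / 3600


def should_relaunch(max_frame: float, observed_fps: float) -> bool:
    """True when the projection exceeds the 5.5h budget (stop and relaunch)."""
    return projected_hours(max_frame, observed_fps) > SAFE_HOURS


if __name__ == "__main__":
    # Unknown env submitted at the conservative 2M, observed at 36 fps.
    print(should_relaunch(2_000_000, 36), safe_max_frame(36))
```

At 36 fps the budget comes out at 712,800 frames, consistent with the 700K listed for BerkeleyHumanoidRough. At exactly 200 fps, 4M frames projects to roughly 5.6h, slightly over the 5.5h budget but still under the 6h limit, so the HumanoidStand row leaves less margin than the others.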