Merged (38 commits)
23dca2a refactor: framework improvements — PPO, nets, memory, plotting, utils (kengz, Feb 28, 2026)
945625b feat: CrossQ algorithm — SAC without target networks + BatchRenorm (kengz, Feb 28, 2026)
37a25ea feat: CrossQ benchmark specs for all environments (kengz, Feb 28, 2026)
90adefe docs: CrossQ benchmark results, plots, and documentation (kengz, Feb 28, 2026)
ca91e73 fix: update dstack configs for 0.20.x compatibility (kengz, Feb 28, 2026)
82cf6c4 feat: CrossQ improvement specs for ⚠️ envs (InvPend, InvDblPend, Hopper) (kengz, Feb 28, 2026)
1905f37 docs: update CrossQ classic/box2d benchmark scores and plots (kengz, Feb 28, 2026)
267a696 docs: update CrossQ MuJoCo benchmark scores and plots (kengz, Mar 1, 2026)
7f82b2e docs: update CrossQ HumanoidStandup and Reacher HF links to clean-nam… (kengz, Mar 1, 2026)
92803f7 fix: CrossQ InvDblPend critic [1024]→[512], Humanoid 3.5M→4M frames (kengz, Mar 1, 2026)
1aae308 fix: cap CrossQ max_frame at SAC levels for fair comparison (kengz, Mar 1, 2026)
f251cba fix: adjust CrossQ frames from learning curves — Ant 3M, Humanoid 2M,… (kengz, Mar 1, 2026)
cb74867 docs: regenerate 5 MuJoCo plots with clean legend entries (kengz, Mar 1, 2026)
63eefba docs: update CrossQ Ant, HalfCheetah, InvDoublePendulum scores and pl… (kengz, Mar 1, 2026)
5cafac8 docs: update CrossQ LunarLanderContinuous score 249.85→268.91 from rerun (kengz, Mar 1, 2026)
f0f1dc0 fix: CartPole 200K→300K frames, Humanoid iter=2→4 for higher UTD (kengz, Mar 1, 2026)
bda2611 fix: CartPole revert to 200K frames, add training_iter=2 for more gra… (kengz, Mar 1, 2026)
f57d843 fix: CartPole training_iter=4, BRN warmup=2000 for more gradients (kengz, Mar 1, 2026)
3e6102d fix: CartPole training_start_step=5000 for better initial buffer dive… (kengz, Mar 2, 2026)
93804b7 docs: update CrossQ Humanoid score 1850→1102 and plot (kengz, Mar 2, 2026)
d8695ec fix: revert CartPole spec to match arc run that scored 405 (kengz, Mar 2, 2026)
1323766 fix: CartPole training_iter=2 for moderate UTD bump (kengz, Mar 2, 2026)
a3eeb72 docs: update CrossQ CartPole score 405.88→324.10 from non-arc rerun (… (kengz, Mar 2, 2026)
7c093de docs: graduate CrossQ CartPole HF link to public benchmark (kengz, Mar 2, 2026)
ba0d861 docs: regenerate all CrossQ benchmark plots (kengz, Mar 2, 2026)
13a32b7 docs: use 3M Hopper run (within env settings) instead of 6M (kengz, Mar 2, 2026)
61b5c97 docs: fix CrossQ Atari HF links to standard-named folders on public repo (kengz, Mar 2, 2026)
a4d3026 fix: align CrossQ reproduce table max_frame with actual run data (kengz, Mar 2, 2026)
2cf11b0 fix: align CrossQ MuJoCo specs with actual benchmark runs for reprodu… (kengz, Mar 2, 2026)
2bdd565 feat: integrate optimization branch — pinned memory, profiler, PPO mi… (kengz, Mar 2, 2026)
1114fb5 fix: canonical numeric substitution and unsubstituted var validation … (kengz, Mar 3, 2026)
d8f5078 fix: restore PPO minibatch_size=64 and hardcoded Atari max_frame (kengz, Mar 3, 2026)
21beecb fix: revert pinned memory to diagnose consistent score regression (kengz, Mar 3, 2026)
51a633f docs: sync CLAUDE.md with updated good-code skill template (kengz, Mar 3, 2026)
7cfc6c5 fix: correct 4 BENCHMARKS.md discrepancies found in audit (kengz, Mar 3, 2026)
6e0f68b feat: bump version to 5.2.0 — CrossQ algorithm (kengz, Mar 3, 2026)
1b62a17 chore: remove CrossQ tracker doc (superseded by BENCHMARKS.md) (kengz, Mar 3, 2026)
e936e75 chore: remove improvements roadmap doc (work completed) (kengz, Mar 3, 2026)
52 changes: 38 additions & 14 deletions .claude/skills/benchmark/SKILL.md
@@ -24,13 +24,35 @@ When a run completes (`dstack ps` shows `exited (0)`):
2. **Find HF folder name**: `dstack logs NAME 2>&1 | grep "Uploading data/"` → extract folder name from the upload log line
3. **Update table score** in BENCHMARKS.md
4. **Update table HF link**: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
5. **Pull HF data locally**: `source .env && hf download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
6. **Generate plot**: `uv run slm-lab plot -t "EnvName" -f data/benchmark-dev/data/FOLDER1,data/benchmark-dev/data/FOLDER2`
5. **Pull HF data locally**: `source .env && huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
6. **Generate plot**: List ALL data folders for that env (`ls data/benchmark-dev/data/ | grep -i envname`), then generate with ONLY the folders matching BENCHMARKS.md entries:
```bash
uv run slm-lab plot -t "EnvName" -d data/benchmark-dev/data -f FOLDER1,FOLDER2,...
```
NOTE: `-d` sets the base data dir, `-f` takes folder names (NOT full paths).
If some folders are in `data/` (local runs) and some in `data/benchmark-dev/data/`, use `data/` as base (it has the `info/` subfolder needed for metrics).
7. **Verify plot exists** in `docs/plots/`
8. **Commit** score + link + plot together

A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
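The intake steps above can be sketched as one dry-run script. This is a sketch under assumptions: `FOLDER` and the plot title are hypothetical placeholder values, and the `run` wrapper prints each command instead of executing it, so nothing is downloaded or plotted until the wrapper is removed.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the intake checklist; FOLDER and the title are assumed placeholders.
set -eu

run() { printf '+ %s\n' "$*"; }   # print instead of executing (remove to run for real)

FOLDER="crossq_pong_example"      # from step 2: the folder name in the upload log (assumed)
TITLE="Pong-v5"                   # plot title for step 6 (assumed)

# 5. Pull HF data locally
run huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev \
    --repo-type dataset --include "data/$FOLDER/*"

# 6. Generate plot: -d sets the base data dir, -f takes folder names only
run uv run slm-lab plot -t "$TITLE" -d data/benchmark-dev/data -f "$FOLDER"

# 7. Verify the plot exists
run ls docs/plots/
```

The wrapper makes the sequence reviewable before anything touches the HF repo; swapping `run cmd …` for `cmd …` executes the real flow.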

## Per-Run Graduation Checklist

**After intake, graduate each finalized run to public HF benchmark:**

1. **Upload folder to public HF**:
```bash
source .env && huggingface-cli upload SLM-Lab/benchmark data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset
```
2. **Update BENCHMARKS.md link**: Change `SLM-Lab/benchmark-dev` → `SLM-Lab/benchmark` for that entry
3. **Upload docs/ to public HF** (updated plots + BENCHMARKS.md):
```bash
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
```
4. **Commit** link update
5. **Push** to origin
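The five graduation steps can likewise be sketched as a dry-run script (`FOLDER` is a hypothetical placeholder; the `run` wrapper prints each command, so no upload happens until it is removed; step 2 stays a manual edit because only the one entry's link should change):

```shell
#!/usr/bin/env bash
# Dry-run sketch of the graduation checklist; FOLDER is an assumed placeholder.
set -eu

run() { printf '+ %s\n' "$*"; }   # print instead of executing

FOLDER="crossq_cartpole_example"  # finalized run folder (assumed name)

# 1. Upload the run folder to the public HF benchmark repo
run huggingface-cli upload SLM-Lab/benchmark \
    "data/benchmark-dev/data/$FOLDER" "data/$FOLDER" --repo-type dataset

# 2. Edit BENCHMARKS.md by hand: benchmark-dev -> benchmark, for this entry only

# 3. Upload docs/ (updated plots + BENCHMARKS.md) and README
run huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
run huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset

# 4-5. Commit the link update, then push
run git commit -am "docs: graduate $FOLDER to public benchmark"
run git push origin
```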

## Launch

```bash
@@ -75,26 +97,28 @@ source .env && hf download SLM-Lab/benchmark-dev \
### Generate Plots

```bash
# Find folders for a game
# Find folders for a game (check both local data/ and benchmark-dev)
ls data/ | grep -i pong
ls data/benchmark-dev/data/ | grep -i pong

# Generate comparison plot (include all algorithms available)
uv run slm-lab plot -t "Pong" \
-f data/benchmark-dev/data/ppo_folder,data/benchmark-dev/data/sac_folder
# Generate comparison plot — use -d for base dir, -f for folder names only
# Use data/ as base (has info/ subfolder with trial_metrics)
uv run slm-lab plot -t "Pong-v5" -f ppo_pong_folder,sac_pong_folder,crossq_pong_folder
```

### Graduate to Public HF

When benchmarks are finalized, publish from `benchmark-dev` → `benchmark`:
When a run is finalized, graduate individually from `benchmark-dev` → `benchmark`:

```bash
source .env && hf upload SLM-Lab/benchmark \
data/benchmark-dev/data data --repo-type dataset

# Update BENCHMARKS.md links: benchmark-dev → benchmark
# Upload docs and README
source .env && hf upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && hf upload SLM-Lab/benchmark README.md README.md --repo-type dataset
# Upload individual folder
source .env && huggingface-cli upload SLM-Lab/benchmark \
data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset

# Update BENCHMARKS.md link for that entry: benchmark-dev → benchmark
# Then upload docs/ (includes updated plots + BENCHMARKS.md)
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
```

| Repo | Purpose |
7 changes: 5 additions & 2 deletions .dstack/run-cpu-search.yml
@@ -3,8 +3,8 @@ name: slm-lab

python: 3.12

files:
- ..:/workflow
repos:
- "..:/workflow"

env:
- HF_TOKEN
@@ -13,6 +13,9 @@ env:
- SPEC_NAME
- LAB_MODE
- SPEC_VARS # --set overrides, e.g. "-s env=ALE/Breakout-v5"
- PROFILE
- PROF_SKIP
- PROF_ACTIVE

commands:
- apt-get update && apt-get install -y swig libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev libgomp1
7 changes: 5 additions & 2 deletions .dstack/run-cpu-train.yml
@@ -3,8 +3,8 @@ name: slm-lab

python: 3.12

files:
- ..:/workflow
repos:
- "..:/workflow"

env:
- HF_TOKEN
@@ -13,6 +13,9 @@ env:
- SPEC_NAME
- LAB_MODE
- SPEC_VARS # --set overrides, e.g. "-s env=ALE/Breakout-v5"
- PROFILE
- PROF_SKIP
- PROF_ACTIVE

commands:
- apt-get update && apt-get install -y swig libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev libgomp1
7 changes: 5 additions & 2 deletions .dstack/run-gpu-search.yml
@@ -3,8 +3,8 @@ name: slm-lab

python: 3.12

files:
- ..:/workflow
repos:
- "..:/workflow"

env:
- HF_TOKEN
@@ -13,6 +13,9 @@ env:
- SPEC_NAME
- LAB_MODE
- SPEC_VARS # --set overrides, e.g. "-s env=ALE/Breakout-v5"
- PROFILE
- PROF_SKIP
- PROF_ACTIVE

commands:
- apt-get update && apt-get install -y swig libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev libgomp1
11 changes: 7 additions & 4 deletions .dstack/run-gpu-train.yml
@@ -3,8 +3,8 @@ name: slm-lab

python: 3.12

files:
- ..:/workflow
repos:
- "..:/workflow"

env:
- HF_TOKEN
@@ -13,6 +13,9 @@ env:
- SPEC_NAME
- LAB_MODE
- SPEC_VARS # --set overrides, e.g. "-s env=ALE/Breakout-v5"
- PROFILE
- PROF_SKIP
- PROF_ACTIVE

commands:
- apt-get update && apt-get install -y swig libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev libgomp1
@@ -21,12 +24,12 @@ commands:

resources:
gpu:
name: [RTX3090]
memory: 20GB..
count: 1
memory: 32GB..

spot_policy: auto
max_duration: 6h
max_duration: 8h
max_price: 0.50
retry:
on_events: [no-capacity]
51 changes: 51 additions & 0 deletions .gitattributes
@@ -0,0 +1,51 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.lz4 filter=lfs diff=lfs merge=lfs -text
*.mds filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
# Audio files - uncompressed
*.pcm filter=lfs diff=lfs merge=lfs -text
*.sam filter=lfs diff=lfs merge=lfs -text
*.raw filter=lfs diff=lfs merge=lfs -text
# Audio files - compressed
*.aac filter=lfs diff=lfs merge=lfs -text
*.flac filter=lfs diff=lfs merge=lfs -text
*.mp3 filter=lfs diff=lfs merge=lfs -text
*.ogg filter=lfs diff=lfs merge=lfs -text
*.wav filter=lfs diff=lfs merge=lfs -text
# Image files - small plot PNGs tracked as regular git objects (no LFS needed)
# Video files - compressed
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.webm filter=lfs diff=lfs merge=lfs -text
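The attribute rules above can be spot-checked with stock `git` before pushing. A minimal sketch, using a throwaway repo and a two-line subset of the rules (the file names are illustrative only):

```shell
# Verify that LFS filter rules resolve as intended, in a throwaway repo.
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main .
printf '%s\n' \
  '*.safetensors filter=lfs diff=lfs merge=lfs -text' \
  '*.mp4 filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# `git check-attr` resolves attributes for paths without touching the index
git check-attr filter -- model.safetensors plot.png demo.mp4
# model.safetensors: filter: lfs
# plot.png: filter: unspecified
# demo.mp4: filter: lfs
```

The `plot.png: filter: unspecified` line matches the comment in the diff: plot PNGs are deliberately left as regular git objects, not LFS.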
22 changes: 16 additions & 6 deletions CLAUDE.md
@@ -4,6 +4,7 @@

You are a seasoned software engineer with the following traits:

- **Supervisor-first**: Delegate implementation to agent teams — your role is to orchestrate, review, and commit, not to implement directly
- **Quality-driven**: Code quality is non-negotiable - clean, idiomatic, maintainable code every time
- **Autonomous**: Make informed technical decisions independently - only ask when requirements are genuinely unclear
- **Pragmatic**: Balance perfect with practical - ship working solutions, iterate when needed
@@ -22,11 +23,17 @@ You are a seasoned software engineer with the following traits:
Apply these six principles to every decision.

1. **Consistent** — Design from first principles — unified naming, patterns, and conventions throughout.
Establish naming conventions and structural patterns first. When the same concept uses the same name everywhere, the codebase becomes searchable, replaceable, and predictable.
2. **Correct** — Constructed from known truths, not debugged into shape.
Build upward from solid foundations — each layer verified before the next is added. Correctness is built from the start, not tested into existence.
3. **Clear** — Code does what it says — intent is obvious from naming and logic alone.
A lot of coding is naming. If you need a comment to explain what code does, the code is not clear enough.
4. **Concise** — Simplified to the essence — nothing left to remove.
Brevity is about fewer concepts to hold in your head, not fewer characters. Eliminate duplication, remove dead code, strip unnecessary abstraction.
5. **Simple** — Few moving parts, easy to explain, cheap to maintain — complexity is not sophistication.
A complex architecture with dozens of tangled dependencies is not intelligence — it is poor design. Reduce to the fewest moving parts while losing nothing essential.
6. **Salient** — Essential enough to be used widely, fundamental enough to last.
Code that follows the preceding principles naturally endures — used broadly, needed deeply, lasting because it was built right.

## Style Guide

@@ -60,14 +67,17 @@ Apply these six principles to every decision.

## Agent Teams

**For any non-trivial task, deploy agent teams.** This is the standard operating mode — do not default to working solo. The lead orchestrates (breaks down work, assigns tasks, reviews outputs, commits) — it should never get buried in implementation. Delegation keeps the lead strategic, enables parallel execution, and protects context window from long-running tasks.
**You are the lead. You do not implement — you delegate, supervise, and review.**

**Guidelines:**
1. **Give enough context in spawn prompts** - teammates don't inherit conversation history, only CLAUDE.md and project context
2. **Size tasks appropriately** - self-contained units with clear deliverables, ~5-6 per teammate
3. **Avoid file conflicts** - each teammate owns different files
For any non-trivial task, use TeamCreate with multiple teammates (not single-Agent subagents). Teammates share a task list, claim work, and message each other directly. Solo work is only acceptable for trivial, single-file changes.

> Work autonomously: run things in parallel, continue without pausing, pick up the next task immediately. For long-running tasks, use `sleep N` to actively wait and check in — do NOT delegate to background processes. Stay engaged in the conversation.
**Do NOT:** use subagents as a substitute for teams, implement tasks yourself (spawn new teammates instead), or start implementing while teammates are still working.

**Workflow:** Break into parallel units → TeamCreate → TaskCreate per unit → spawn 3-5 teammates with full context (they only inherit CLAUDE.md, not conversation history) → require plan approval for risky tasks → supervise and review → commit final result yourself.

**Sizing:** ~5-6 tasks per teammate, self-contained units, each teammate owns different files.

**Panel of agents:** For design decisions or ambiguous requirements, spawn 3+ teammates with different perspectives. Have them debate and challenge each other — adversarial review beats independent comparison. Converge on the approach that survives scrutiny.


## Documentation