
fix(wave34): OPTIMIZER env alias + strict optimizer dispatch (no silent AdamW fallback) #135

Open

gHashTag wants to merge 1 commit into main from fix/wave34-optimizer-alias-and-strict-dispatch

Conversation

@gHashTag (Owner)

Wave-34 RCA fix — OPTIMIZER env alias + strict optimizer dispatch

Summary

Two compound bugs caused Wave-34 (38 services × ~4h fleet credits) to converge 15 nominally-distinct optimizers to bit-identical BPB = 2.6814258098602295 on seed=123. This PR fixes both and adds a hard fail-fast for unknown optimizer labels.

Bugs

Bug #1: src/bin/entrypoint.rs:26

let optimizer = env_or("TRIOS_OPTIMIZER", "adamw");

This call does not honor the un-prefixed OPTIMIZER alias. PR #130 (the Wave-33 hotfix) added resolve_env_alias for STEPS/LR/HIDDEN/SEED but missed the optimizer knob. Wave-34 set OPTIMIZER=lion (and 14 other labels) across 38 services, and every one silently defaulted to "adamw".
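For reference, env_or presumably reads only the single prefixed variable; this is an assumed body, not the repo's actual helper:

// Assumed shape of the buggy helper: one variable, one default,
// no knowledge of the un-prefixed OPTIMIZER alias.
fn env_or(key: &str, default: &str) -> String {
    std::env::var(key).unwrap_or_else(|_| default.to_string())
}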

Bug #2: src/bin/trios-train.rs:298

let outcome = match cli.optimizer.as_str() {
    "muon" => train_loop::run_single_muon(&args, false)?,
    "muon-cwd" => train_loop::run_single_muon(&args, true)?,
    _ => train_loop::run_single(&args)?,   // ← silent fallback to AdamW
};

The wildcard arm silently routes any unsupported optimizer label (lion / lamb / soap / tiger / sgdm / prodigy / adafactor / shampoo / yogi / ranger / radam / adabelief / adamax) to AdamW. Even if Bug #1 were fixed, this fallback alone would have produced identical BPB across all of them.

Fix

  1. entrypoint.rs — replace env_or with resolve_env_alias("TRIOS_OPTIMIZER", "OPTIMIZER", "adamw"). Precedence matches the PR #130 contract: TRIOS_OPTIMIZER > OPTIMIZER > default "adamw". Also adds the optimizer source to the [entrypoint-trace] line so operators can grep one keyword to verify the override reached the trainer.

  2. trios-train.rs — an explicit "adamw" => … arm plus other => anyhow::bail!("Unsupported optimizer: …"). Unknown labels now fail fast with a clear message listing the supported set (a combined sketch of both changes follows).
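A minimal sketch of both changes, with assumed bodies: resolve_env_alias landed in PR #130 and may differ in detail, and cli / train_loop / args are the trainer's existing items from the snippet above. The error wording here is approximate.

// entrypoint.rs (sketch) — assumed body for the PR #130 helper:
// prefixed variable wins, then the un-prefixed alias, then the default.
fn resolve_env_alias(prefixed: &str, alias: &str, default: &str) -> String {
    std::env::var(prefixed)
        .or_else(|_| std::env::var(alias))
        .unwrap_or_else(|_| default.to_string())
}

let optimizer = resolve_env_alias("TRIOS_OPTIMIZER", "OPTIMIZER", "adamw");

// trios-train.rs (sketch) — strict dispatch: explicit arm per supported
// label, hard bail for everything else instead of the silent AdamW fallback.
let outcome = match cli.optimizer.as_str() {
    "adamw"    => train_loop::run_single(&args)?,
    "muon"     => train_loop::run_single_muon(&args, false)?,
    "muon-cwd" => train_loop::run_single_muon(&args, true)?,
    other => anyhow::bail!(
        "Unsupported optimizer: {:?} (supported: adamw, muon, muon-cwd)",
        other
    ),
};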

Honest probe — reproduction & verification

Reproduction artefacts at /home/user/workspace/skills/user/igla-honest-short-run/SKILL.md (new user skill, sibling of tri-gardener-runbook v2.3).

| Test | Before fix | After fix |
| --- | --- | --- |
| adamw, 100 steps, seed=123 | BPB=6.4700 | BPB=6.4700 ✅ |
| muon, 100 steps, seed=123 | BPB=6.4409 | BPB=6.4409 ✅ (≠ adamw) |
| lion (fake) | silent → 6.4700 (same as adamw) ❌ | bail!("Unsupported optimizer: \"lion\"…") |
| OPTIMIZER=muon alias | silent → adamw default ❌ | trace: opt=(muon, src=alias) |
| TRIOS_OPTIMIZER=muon-cwd + OPTIMIZER=lion | TRIOS_* wins (already worked) | opt=(muon-cwd, src=TRIOS_*) |

Why this took 38 services × 4h to spot

The Wave-34 dashboard showed 15 distinct optimizer labels; the BPB values printed in trace logs looked distinct because rounding hid the byte-identity; and the canonical-name index pivoted on (canon, seed), making it look as if 15 architectures each had their own BPB. Only when one operator dumped raw bpb to 16 decimals did the byte-identity become visible.
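A toy snippet showing the trap; the constant is the Wave-34 value from the summary, everything else is illustrative:

fn main() {
    // Two nominally different optimizers, same final metric.
    let bpb_adamw: f64 = 2.6814258098602295;
    let bpb_lion: f64 = 2.6814258098602295; // silent AdamW fallback

    // 4-decimal trace output: nothing obviously wrong per service.
    println!("adamw bpb={:.4}  lion bpb={:.4}", bpb_adamw, bpb_lion);

    // 16-decimal dump + bit comparison: byte-identity is undeniable.
    println!("adamw bpb={:.16}  lion bpb={:.16}", bpb_adamw, bpb_lion);
    println!("bit-identical: {}", bpb_adamw.to_bits() == bpb_lion.to_bits());
}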

Operational mitigation

The tri-gardener-runbook skill is bumped to v2.3 with a new mandatory section, "Pre-flight gate — igla-honest-short-run". Three gate criteria must pass before any Wave-N Railway deploy (a sketch of the checks follows the list):

  1. Seed variance real: 4 seeds × adamw → BPB spread ≥ 0.005 (catches dead SEED alias)
  2. Optimizer variance real: seed=123 × {adamw, muon, muon-cwd} → at least one pair differs by ≥ 0.001 BPB (catches dead OPTIMIZER alias)
  3. Fake optimizer trap: seed=123 × {adamw, lion, lamb, soap} — if all four byte-identical, the wildcard fallback is still present → HARD BLOCK
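A minimal sketch of the three numeric checks, assuming the short-run BPB values have already been collected; the thresholds are the ones listed above, and the function names are illustrative:

// Gate 1 — seed variance: 4 adamw seeds must spread by >= 0.005 BPB.
fn seed_variance_ok(adamw_bpbs: &[f64]) -> bool {
    let min = adamw_bpbs.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = adamw_bpbs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    max - min >= 0.005
}

// Gate 2 — optimizer variance: some pair among {adamw, muon, muon-cwd}
// at seed=123 must differ by >= 0.001 BPB.
fn optimizer_variance_ok(real_opt_bpbs: &[f64]) -> bool {
    real_opt_bpbs.iter().enumerate().any(|(i, a)| {
        real_opt_bpbs[i + 1..].iter().any(|b| (a - b).abs() >= 0.001)
    })
}

// Gate 3 — fake-optimizer trap: if adamw, lion, lamb, soap are all
// bit-identical, the wildcard fallback is still present: HARD BLOCK.
fn fake_trap_ok(adamw_and_fakes: &[f64]) -> bool {
    !adamw_and_fakes.windows(2).all(|w| w[0].to_bits() == w[1].to_bits())
}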

Total wall-clock cost: < 5 min on an operator's CPU box, versus the Wave-34 cost of 38 × 4h cloud = 152 service-hours.

RCA artefacts

RCA and probe table: trios#143, comments 4427543239 / 4427583321.

Checklist

  • Build: cargo build --release --bin trios-train --bin entrypoint → green
  • Smoke: fake optimizer fails fast with clear error
  • Smoke: real optimizers produce distinct BPB values
  • Smoke: OPTIMIZER alias resolves through entrypoint
  • Smoke: TRIOS_OPTIMIZER precedence preserved
  • No new dependencies (uses existing anyhow + resolve_env_alias)
  • Backward-compatible: existing TRIOS_OPTIMIZER deployments unaffected

🌻 phi² + phi⁻² = 3 · TRINITY · NEVER STOP · DOI 10.5281/zenodo.19227877

Commit message

Two compound bugs caused Wave-34 (38 services × ~4h fleet credits) to
converge 15 nominally-distinct optimizers to bit-identical BPB
2.6814258098602295 on seed=123:

1. src/bin/entrypoint.rs:26 — `let optimizer = env_or("TRIOS_OPTIMIZER",
   "adamw")` did NOT honor the un-prefixed `OPTIMIZER` alias. PR #130
   added alias resolution for STEPS/LR/HIDDEN/SEED but forgot the
   optimizer knob. Wave-34 set `OPTIMIZER=lion` (etc.) on 38 services
   and every one silently defaulted to "adamw".

2. src/bin/trios-train.rs:298 — the dispatch `match cli.optimizer.as_str()`
   had a wildcard `_ => run_single(...)` arm. Result: even if the alias
   had been honored, any unsupported optimizer label (lion / lamb / soap
   / tiger / sgdm / prodigy / adafactor / shampoo / yogi / ranger /
   radam / adabelief / adamax) would have silently routed to AdamW
   anyway.

This PR:

- Replaces `env_or` with `resolve_env_alias("TRIOS_OPTIMIZER", "OPTIMIZER", "adamw")`
  so `OPTIMIZER=muon` (etc.) reaches `trios-train`. Precedence:
  TRIOS_OPTIMIZER > OPTIMIZER > default "adamw" (matches PR #130 contract).
- Adds the optimizer source to the entrypoint-trace line so operators
  can verify with one grep that the override reached the trainer.
- Replaces the wildcard arm with explicit `"adamw" => …` plus
  `other => anyhow::bail!("Unsupported optimizer: …")`. Unsupported
  labels now FAIL FAST with a clear error citing the supported set.

Reproduced honest probe (4 seeds × adamw + 3 real opts × seed=123 +
4 fake opts × seed=123) at /home/user/workspace/skills/user/
igla-honest-short-run/SKILL.md. RCA + table: trios#143
comment 4427543239 / 4427583321. Pre-flight gate now required before
any Wave-N deploy: skill 'tri-gardener-runbook' v2.3 §'Pre-flight gate'.

Verified locally:
- adamw 100 steps seed=123 → BPB=6.4700
- muon  100 steps seed=123 → BPB=6.4409 (≠ adamw, real)
- lion  100 steps seed=123 → bail!("Unsupported optimizer: \"lion\"…")
- OPTIMIZER=muon (alias)        → trace: opt=(muon, src=alias)
- TRIOS_OPTIMIZER=muon-cwd + OPTIMIZER=lion → opt=(muon-cwd, src=TRIOS_*)

Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP · DOI 10.5281/zenodo.19227877
