fix(wave34): OPTIMIZER env alias + strict optimizer dispatch (no silent AdamW fallback) #135
Open
gHashTag wants to merge 1 commit into
Two compound bugs caused Wave-34 (38 services × ~4h fleet credits) to
converge 15 nominally-distinct optimizers to bit-identical BPB
2.6814258098602295 on seed=123:
1. src/bin/entrypoint.rs:26 — `let optimizer = env_or("TRIOS_OPTIMIZER",
"adamw")` did NOT honor the un-prefixed `OPTIMIZER` alias. PR #130
added alias resolution for STEPS/LR/HIDDEN/SEED but forgot the
optimizer knob. Wave-34 set `OPTIMIZER=lion` (etc.) on 38 services
and every one silently defaulted to "adamw".
2. src/bin/trios-train.rs:298 — the dispatch `match cli.optimizer.as_str()`
had a wildcard `_ => run_single(...)` arm. Result: even if the alias
had been honored, any unsupported optimizer label (lion / lamb / soap
/ tiger / sgdm / prodigy / adafactor / shampoo / yogi / ranger /
radam / adabelief / adamax) would have silently routed to AdamW
anyway.
This PR:
- Replaces `env_or` with `resolve_env_alias("TRIOS_OPTIMIZER", "OPTIMIZER", "adamw")`
so `OPTIMIZER=muon` (etc.) reaches `trios-train`. Precedence:
TRIOS_OPTIMIZER > OPTIMIZER > default "adamw" (matches PR #130 contract).
- Adds the optimizer source to the entrypoint-trace line so operators
can verify with one grep that the override reached the trainer.
- Replaces the wildcard arm with explicit `"adamw" => …` plus
`other => anyhow::bail!("Unsupported optimizer: …")`. Unsupported
labels now FAIL FAST with a clear error citing the supported set.
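The precedence rule above (TRIOS_OPTIMIZER > OPTIMIZER > default) can be sketched as follows. This is a minimal illustration, not the repository's code: the real `resolve_env_alias` signature and return type are not shown in this PR, so `resolve_with`, `resolve_from_env` and the `Source` enum are hypothetical names. The lookup is passed in as a closure so the precedence logic can be checked without mutating the process environment.

```rust
use std::collections::HashMap;

/// Where the resolved value came from. Hypothetical type, loosely mirroring
/// the `src=` tag in the entrypoint trace line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Prefixed, // the TRIOS_* variable was set
    Alias,    // only the un-prefixed alias was set
    Default,  // neither variable was set
}

/// Hypothetical stand-in for `resolve_env_alias`. `lookup` abstracts
/// `std::env::var`, so tests can supply their own key/value source.
fn resolve_with<F>(lookup: F, prefixed: &str, alias: &str, default: &str) -> (String, Source)
where
    F: Fn(&str) -> Option<String>,
{
    // The prefixed name wins over the alias, which wins over the default.
    if let Some(v) = lookup(prefixed) {
        return (v, Source::Prefixed);
    }
    if let Some(v) = lookup(alias) {
        return (v, Source::Alias);
    }
    (default.to_string(), Source::Default)
}

/// Same logic, reading the real process environment.
fn resolve_from_env(prefixed: &str, alias: &str, default: &str) -> (String, Source) {
    resolve_with(|k| std::env::var(k).ok(), prefixed, alias, default)
}

fn main() {
    // Simulated environment: both names set, so the prefixed one must win.
    let mut env = HashMap::new();
    env.insert("OPTIMIZER", "lion");
    env.insert("TRIOS_OPTIMIZER", "muon-cwd");
    let lookup = |k: &str| env.get(k).map(|v| v.to_string());

    let (opt, src) = resolve_with(lookup, "TRIOS_OPTIMIZER", "OPTIMIZER", "adamw");
    println!("simulated: opt=({opt}, src={src:?})");

    // Result depends on whatever the current shell exports.
    let (opt, src) = resolve_from_env("TRIOS_OPTIMIZER", "OPTIMIZER", "adamw");
    println!("process env: opt=({opt}, src={src:?})");
}
```

Returning the source alongside the value is what makes a single-grep verification possible: the trace line can report not just which optimizer was chosen but why. The sketch treats an empty string as a set value; whether the real helper does the same is not stated here.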
Honest probe reproduced (4 seeds × adamw + 3 real opts × seed=123 +
4 fake opts × seed=123); artefacts at
/home/user/workspace/skills/user/igla-honest-short-run/SKILL.md.
RCA + table: trios#143 comment 4427543239 / 4427583321.
A pre-flight gate is now required before any Wave-N deploy:
skill 'tri-gardener-runbook' v2.3 §'Pre-flight gate'.
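One check a short pre-flight run could include is a bit-level comparison of the BPB values reported for nominally different optimizers. The sketch below is an assumption about what such a check might look like, not the runbook's actual gate criteria (those are not reproduced in this text). `all_bit_identical` and `check_distinct_runs` are hypothetical names.

```rust
/// True when every value has exactly the same IEEE-754 bit pattern.
/// An empty or single-element slice is trivially identical.
fn all_bit_identical(values: &[f64]) -> bool {
    values.windows(2).all(|w| w[0].to_bits() == w[1].to_bits())
}

/// Hypothetical gate check: reject a batch of runs that were meant to use
/// different optimizers but report the same BPB down to the last bit.
fn check_distinct_runs(runs: &[(&str, f64)]) -> Result<(), String> {
    let bpbs: Vec<f64> = runs.iter().map(|(_, bpb)| *bpb).collect();
    if runs.len() > 1 && all_bit_identical(&bpbs) {
        let labels: Vec<&str> = runs.iter().map(|(label, _)| *label).collect();
        return Err(format!(
            "optimizers {labels:?} produced bit-identical BPB {:.16}; \
             the labels probably never reached the trainer",
            bpbs[0]
        ));
    }
    Ok(())
}

fn main() {
    // BPB figures quoted in this commit message.
    let healthy = [("adamw", 6.4700), ("muon", 6.4409)];
    let wave34 = [("lion", 2.6814258098602295), ("lamb", 2.6814258098602295)];

    println!("{:?}", check_distinct_runs(&healthy));
    println!("{:?}", check_distinct_runs(&wave34));
}
```

Comparing `to_bits()` rather than formatted strings sidesteps any display rounding. This version only fires when all runs agree; a stricter variant could flag any identical pair.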
Verified locally:
- adamw 100 steps seed=123 → BPB=6.4700
- muon 100 steps seed=123 → BPB=6.4409 (≠ adamw, real)
- lion 100 steps seed=123 → bail!("Unsupported optimizer: \"lion\"…")
- OPTIMIZER=muon (alias) → trace: opt=(muon, src=alias)
- TRIOS_OPTIMIZER=muon-cwd + OPTIMIZER=lion → opt=(muon-cwd, src=TRIOS_*)
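The fail-fast behaviour seen for `lion` above comes from replacing a wildcard fallback with a binding arm that returns an error. A dependency-free sketch of that pattern follows. It is illustrative only: the PR uses `anyhow::bail!`, whereas this version returns `Result<_, String>` to stay within the standard library. The `Optimizer` enum, `parse_optimizer`, and the `SUPPORTED` list are hypothetical, and the real supported set is not enumerated in this text (`muon-cwd` appears only in a trace line, so its support is assumed).

```rust
/// ASSUMPTION: labels treated as supported in this sketch. The authoritative
/// list lives in `trios-train.rs` and is not reproduced in the PR text.
const SUPPORTED: &[&str] = &["adamw", "muon", "muon-cwd"];

#[derive(Debug, PartialEq, Eq)]
enum Optimizer {
    AdamW,
    Muon,
    MuonCwd,
}

/// Strict dispatch: every accepted label has its own arm, and anything else
/// is an error instead of a silent fallback to AdamW.
fn parse_optimizer(label: &str) -> Result<Optimizer, String> {
    match label {
        "adamw" => Ok(Optimizer::AdamW),
        "muon" => Ok(Optimizer::Muon),
        "muon-cwd" => Ok(Optimizer::MuonCwd),
        // `other` binds the rejected label so the message can name it.
        other => Err(format!(
            "Unsupported optimizer: {other:?} (supported: {})",
            SUPPORTED.join(", ")
        )),
    }
}

fn main() {
    for label in ["adamw", "muon", "lion"] {
        match parse_optimizer(label) {
            Ok(opt) => println!("{label} -> {opt:?}"),
            Err(msg) => println!("{label} -> error: {msg}"),
        }
    }
}
```

Binding the unmatched value (`other`) instead of discarding it with `_` keeps the arm exhaustive while making the rejected label available for the error message.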
Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP · DOI 10.5281/zenodo.19227877
Wave-34 RCA fix — OPTIMIZER env alias + strict optimizer dispatch
Summary
Two compound bugs caused Wave-34 (38 services × ~4h fleet credits) to converge 15 nominally-distinct optimizers to bit-identical BPB = 2.6814258098602295 on seed=123. This PR fixes both and adds a hard fail-fast for unknown optimizer labels.
Bugs
Bug #1: `src/bin/entrypoint.rs:26`
Does not honor the un-prefixed `OPTIMIZER` alias. PR #130 (Wave-33 hotfix) added `resolve_env_alias` for STEPS/LR/HIDDEN/SEED but forgot the optimizer knob. Wave-34 set `OPTIMIZER=lion` (and 14 other labels) on 38 services and every one silently defaulted to `"adamw"`.
Bug #2: `src/bin/trios-train.rs:298`
The wildcard arm silently routes any unsupported optimizer label (`lion` / `lamb` / `soap` / `tiger` / `sgdm` / `prodigy` / `adafactor` / `shampoo` / `yogi` / `ranger` / `radam` / `adabelief` / `adamax`) to AdamW. Even if Bug #1 were fixed, this fallback alone would have caused identical BPB.
Fix
- `entrypoint.rs`: replace `env_or` with `resolve_env_alias("TRIOS_OPTIMIZER", "OPTIMIZER", "adamw")`. Precedence (matches the contract of PR #130, "fix(wave33): entrypoint env-alias hotfix", the root-cause fix for the Wave-29 STEPS=200000 silent drop): `TRIOS_OPTIMIZER` > `OPTIMIZER` > default `"adamw"`. Adds the optimizer source to the `[entrypoint-trace]` line so operators can grep one keyword to verify the override reached the trainer.
- `trios-train.rs`: explicit `"adamw" => …` arm plus `other => anyhow::bail!("Unsupported optimizer: …")`. Fail-fast with a clear message listing the supported set.
Honest probe: reproduction & verification
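Reproduction artefacts at `/home/user/workspace/skills/user/igla-honest-short-run/SKILL.md` (new user skill, sibling of `tri-gardener-runbook` v2.3).
Verification results:
- lion, 100 steps, seed=123 → `bail!("Unsupported optimizer: \"lion\"…")` ✅
- `OPTIMIZER=muon` (alias) → `opt=(muon, src=alias)` ✅
- `TRIOS_OPTIMIZER=muon-cwd` + `OPTIMIZER=lion` → `opt=(muon-cwd, src=TRIOS_*)` ✅
Why this took 38 services × 4h to spot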
Three things masked the problem. The Wave-34 dashboard showed 15 distinct optimizer LABELS. BPB values printed in trace logs LOOKED distinct because rounding hid the byte-identity. And the canonical-name index pivoted on (canon, seed), making it look like 15 architectures each had their own BPB. Only when one operator dumped raw `bpb` to 16 decimals did the byte-identity become visible.
Operational mitigation
The `tri-gardener-runbook` skill is bumped to v2.3 with a new mandatory section "Pre-flight gate: `igla-honest-short-run`". Three gate criteria must pass before ANY Wave-N Railway deploy.
Total wall-clock cost: < 5 min on the operator's CPU box. Compare with the Wave-34 cost: 38 × 4h cloud = 152 service-hours.
RCA artefacts
- `igla-honest-short-run` (scope=user)
- `tri-gardener-runbook` v2.3 (scope=user, §"Pre-flight gate (MANDATORY)")
Checklist
- `cargo build --release --bin trios-train --bin entrypoint` → green
- `OPTIMIZER` alias resolves through entrypoint
- `TRIOS_OPTIMIZER` precedence preserved
- (… `anyhow` + `resolve_env_alias`)
- … `TRIOS_OPTIMIZER` deployments unaffected
🌻 phi² + phi⁻² = 3 · TRINITY · NEVER STOP · DOI 10.5281/zenodo.19227877