feat(gardener): PR-2 live wiring — Client mutations + ledger + serve loop by gHashTag · Pull Request #58 · gHashTag/trios-railway

gHashTag · 2026-04-27T20:38:36Z

PR-2 of the gardener rollout. Builds on top of #50 (feat/tri-gardener PR-1).

⚠️ DEPENDS ON #61 (RailwayMultiClient P0). This PR's Live arm is safe on Acc1 only. Any Live tick that touches Acc2 or Acc3 requires the multi-account routing in #61 to be merged first; without it, Client::from_env() reads a single RAILWAY_TOKEN and could silently mutate the wrong fleet.

Recommended landing order

1. Merge #51 (docs(adr): 0001 repo boundaries, control plane) and trios-trainer-igla#39 (ADR, model plane).
2. Merge #61 (P0: tri-railway-core::RailwayMultiClient, Acc1/Acc2/Acc3 routing). This is the gating step: it blocks the Live arm of #58.
3. Merge this PR (#58) in --review mode only.
4. Merge #59 (ci(gardener): GHCR build & push pipeline, v*-gardener tag) so the image exists for spin-up.
5. Promote to --live only after a 3-tick review window in Acc1.

Until #61 lands, set --account=acc1 everywhere and DO NOT register Acc2/Acc3 credentials in the gardener service environment.
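
A minimal sketch of the pre-flight guard this warning implies. The `Account` and `Mode` enums, the `UnsafeLiveAccount` error, and `guard_live_account` are hypothetical names for illustration; they are not confirmed to exist in this repo, and the real routing belongs to #61.

```rust
/// Railway accounts named in this PR (variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Account {
    Acc1,
    Acc2,
    Acc3,
}

/// Review mode only records decisions; Live mode mutates the fleet.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Review,
    Live,
}

/// Returned when a Live tick would touch an account that is not yet routable.
#[derive(Debug, PartialEq, Eq)]
pub struct UnsafeLiveAccount(pub Account);

/// Hypothetical guard: Review is always allowed, Live is allowed on Acc1,
/// and Live on Acc2/Acc3 is allowed only once multi-account routing exists.
pub fn guard_live_account(
    mode: Mode,
    account: Account,
    multi_client_merged: bool,
) -> Result<(), UnsafeLiveAccount> {
    match (mode, account) {
        (Mode::Review, _) => Ok(()),
        (Mode::Live, Account::Acc1) => Ok(()),
        (Mode::Live, _) if multi_client_merged => Ok(()),
        (Mode::Live, other) => Err(UnsafeLiveAccount(other)),
    }
}
```

Failing closed here means a misconfigured `--account` flag surfaces as an error before any mutation is attempted, rather than as a silent write to the wrong fleet.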

What this PR does

Replaces the PR-1 stubs with real wiring:

| Layer | PR-1 (stubbed) | PR-2 (this) |
|---|---|---|
| Client mutation surface | free fns in mutations.rs | typed methods on Client (`deploy_service`, `set_vars`, `redeploy`, `stop`) |
| Decision actuation | warn!() in Live arm | `apply_decision_batch` with `KillSwitch` |
| Ledger | `println!(serde_json)` | `PgLedger` (tokio_postgres) / `MockLedger` |
| Cron | external | `tri-gardener serve --interval=N` (drift-free tokio interval, SIGTERM-safe) |
| CLI gaps | (none) | `tri-railway service set-vars / logs / stop` |

Issues

Closes feat(core): tri-railway-core::Client mutation API (deploy_service, set_vars, redeploy, stop) #52 (Client mutation API)
Closes feat(gardener): loop_.rs Live arm wires Client mutations (replace warn-stub) #53 (gardener loop Live arm wiring)
Closes feat(gardener): tokio_postgres writes to gardener_runs (replace stdout JSON) #54 (tokio_postgres writes)
Closes feat(gardener): tri-gardener serve --interval=3600 internal cron (no Railway scheduled-restart dep) #55 (serve --interval)
Closes feat(tri-railway): service set-vars / service logs / service stop subcommands #56 (tri-railway set-vars/logs/stop)
Depends on P0: tri-railway-core::RailwayMultiClient — Acc1/Acc2/Acc3 routing (BLOCKS #58 Live arm) #61 (RailwayMultiClient P0) for Acc2/Acc3 Live mode
Aligns with #60 (queue.toml realign to ALPHA levers L1/L2/L4: rename L7→L1, L8→L2, L9→L4). This PR uses lane strings as opaque identifiers, so #60 is independent.

Acceptance criteria status

4 unit tests for Client mutation API (client_ext::tests)
live_actuation_writes_to_neon contract test (passes via MockActuator + MockLedger)
kill_switch_aborts_mid_actuation contract test
serve_emits_tick_every_3600s contract test (tokio paused clock)
R5 honest error pass-through across actuator → outcome → ledger
R7 audit triplet sealed for every mutation (RailwayHash::seal)
GARDENER_DISABLED=true honored on every tick (re-checked, not cached)
68/68 tests GREEN across the workspace
ARCHITECTURAL_FLOOR_BPB = 2.19 constant added in ledger.rs with cull-safety comment (refs trios#237 + trios#143)
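
The kill-switch criterion above (kill_switch_aborts_mid_actuation) can be sketched roughly as follows. This is an illustration under assumptions, not the code in actuate.rs: the `Decision` and `Outcome` shapes, the `Actuator` trait, and the `trip_after` field on the mock are all hypothetical, and the real actuator is presumably async.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    Redeploy { service_id: String },
    Stop { service_id: String },
    CullSeed { seed: u32 },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Outcome {
    Applied,
    Skipped,
    Failed(String),
}

/// Shared flag; cloning yields a handle to the same underlying switch.
#[derive(Clone, Default)]
pub struct KillSwitch(Arc<AtomicBool>);

impl KillSwitch {
    pub fn disable(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_disabled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

pub trait Actuator {
    fn apply(&mut self, decision: &Decision) -> Result<(), String>;
}

/// Applies decisions in order, re-reading the kill switch before each one.
pub fn apply_decision_batch<A: Actuator>(
    actuator: &mut A,
    kill: &KillSwitch,
    decisions: &[Decision],
) -> Vec<(Decision, Outcome)> {
    let mut rows = Vec::with_capacity(decisions.len());
    for decision in decisions {
        let outcome = if kill.is_disabled() {
            Outcome::Skipped
        } else if matches!(decision, Decision::CullSeed { .. }) {
            // Cull is skipped pending fleet→service_id resolution (PR-3).
            Outcome::Skipped
        } else {
            match actuator.apply(decision) {
                Ok(()) => Outcome::Applied,
                // The error text is passed through unmodified.
                Err(e) => Outcome::Failed(e),
            }
        };
        rows.push((decision.clone(), outcome));
    }
    rows
}

/// Test double that flips the kill switch once `calls` reaches the threshold.
pub struct MockActuator {
    pub calls: usize,
    pub trip_after: Option<(usize, KillSwitch)>,
}

impl Actuator for MockActuator {
    fn apply(&mut self, _decision: &Decision) -> Result<(), String> {
        self.calls += 1;
        if let Some((n, kill)) = &self.trip_after {
            if self.calls >= *n {
                kill.disable();
            }
        }
        Ok(())
    }
}
```

Checking the switch inside the loop, rather than once before it, is what lets a switch flipped mid-batch stop the remaining mutations while still producing one row per decision.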

Honest scope

Multi-account safety follow-up (#61 P0):

This PR's Live arm cannot safely act on Acc2/Acc3 — see top of body.

PR-3 follow-ups:

Cull → service_id mapping needs a fleet snapshot lookup (Decision::CullSeed currently records Outcome::Skipped).
tri-railway service logs is a deterministic stub until the Railway logs subgraph is wired into Client.

Refs

ADR repo boundaries: docs(adr): 0001 repo boundaries (control plane) #51 (this repo) + docs(adr): 0001 repo boundaries (model plane) trios-trainer-igla#39
Tracker: P0: Railway MCP multi-account routing (Acc1/Acc2/Acc3) — tracker #43
Spec: feat: tri-gardener autonomous orchestrator (cron h=1, target BPB<1.5) #49
PR-1 base: feat(tri-gardener): autonomous orchestrator (PR-1, decision core) #50
GHCR pipeline: ci(gardener): GHCR build & push pipeline (v*-gardener tag) #59
Multi-account P0 (gating): P0: tri-railway-core::RailwayMultiClient — Acc1/Acc2/Acc3 routing (BLOCKS #58 Live arm) #61
Lane realign: queue.toml realign to ALPHA levers L1/L2/L4 (rename L7→L1, L8→L2, L9→L4) #60
Architectural floor: trios#237 + trios#143

Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP

…r + serve)

Refs: #52, #53, #54, #55, #56

Phase 1 of 2 (PR-3 follows with fleet→service_id resolution and Logs subgraph):

- crates/trios-railway-core/src/client_ext.rs (#52)
  Methods on Client: deploy_service, set_vars, redeploy, stop. Each seals an R7 audit triplet via RailwayHash. 4 unit tests cover signatures, hash sealing, and honest error pass-through.
- bin/tri-gardener/src/actuate.rs (#53)
  RailwayActuator trait + MockActuator. apply_decision dispatches each Decision variant to the corresponding mutation. apply_decision_batch honors a shared KillSwitch on every iteration. Cull is honestly Skipped pending PR-3 fleet→service_id resolution.
- bin/tri-gardener/src/ledger.rs (#54)
  LedgerSink trait + PgLedger (tokio_postgres + 3x retry) + MockLedger. build_row attaches outcome + error to the decision JSON.
- bin/tri-gardener/src/serve.rs + main.rs Cmd::Serve (#55)
  serve_loop drives one tick every --interval seconds; SIGINT/SIGTERM graceful stop; drift-free Interval::tick. validate_interval enforces 60..=86_400.
- bin/tri-railway/src/main.rs SetVars / Logs / Stop subcommands (#56)
  SetVars wires Client::set_vars. Stop aliases delete (Railway has no pause). Logs is a deterministic stub until PR-3.
- bin/tri-gardener/src/loop_.rs loop_once_live entry point
  Composes actuator + ledger + KillSwitch. Replaces the warn-stub Live arm. ctx.disabled || kill.is_disabled() short-circuits with a single Skipped row per decision.

Tests: 68/68 GREEN across the workspace. New tests:

- 4 client_ext::tests
- 6 actuate::tests including kill_switch_aborts_mid_batch and live_actuation_writes_to_ledger_via_mock_pair
- 5 ledger::tests
- 5 serve::tests including serve_emits_tick_every_interval (paused clock)
- 4 cli_tests for parse_var_pairs

Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP
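
The interval rules in the serve.rs entry can be illustrated with a std-only sketch. Only the name validate_interval and the 60..=86_400 range come from the commit message; the `Result<Duration, String>` signature and the `tick_deadline` helper are assumptions. The helper shows the general idea behind a drift-free interval (deadlines anchored to the start time), which is the behaviour the commit relies on tokio's Interval::tick for.

```rust
use std::time::{Duration, Instant};

pub const MIN_INTERVAL_SECS: u64 = 60;
pub const MAX_INTERVAL_SECS: u64 = 86_400;

/// Accepts only intervals inside 60..=86_400 seconds (one minute to one day).
pub fn validate_interval(secs: u64) -> Result<Duration, String> {
    if (MIN_INTERVAL_SECS..=MAX_INTERVAL_SECS).contains(&secs) {
        Ok(Duration::from_secs(secs))
    } else {
        Err(format!(
            "--interval={} is outside {}..={} seconds",
            secs, MIN_INTERVAL_SECS, MAX_INTERVAL_SECS
        ))
    }
}

/// Deadline of the n-th tick, computed from the loop's start time.
/// Because each deadline is `start + n * interval`, time spent inside a
/// tick does not push later ticks back, unlike "work, then sleep(interval)".
pub fn tick_deadline(start: Instant, interval: Duration, n: u32) -> Instant {
    start + interval * n
}
```

With `--interval=3600`, a tick that takes 40 seconds still leaves the next deadline exactly one hour after the previous one, not one hour and 40 seconds.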

The trainer architecture as currently shipped has a hard floor at BPB ~ 2.19 (champion 2.1919, h=828, 2L hybrid attn, ReLU^2, 81K). Cross-validated against CPU N-gram floor (~2.54) in trios#237 and the live GPU champion in trios#143.

Encodes the policy as a public constant in ledger.rs so call sites read it instead of hardcoding 2.19.

Two tripwire tests (70/70 GREEN):

- architectural_floor_bpb_is_2_19: locks the value
- architectural_floor_below_gate2_target: sanity vs 1.85

Doc-comment policy: gardener MUST NOT issue Decision::CullSeed for a seed whose BPB is above this floor unless plateau is independently confirmed (>=5 ticks in a 0.005 band AND step >= 50_000). Without this guard a healthy seed sitting at the architectural floor would be culled for not crossing 1.85, which is impossible without ALPHA's L1/L2/h=1024 patches landing first.

Refs:

- gHashTag/trios#237 (CPU N-gram floor)
- gHashTag/trios#143 (GPU champion 2.1919)
- trios-railway#58 (PR-2 Live arm)
- trios-railway#61 (RailwayMultiClient P0)

Anchor: phi^2 + phi^-2 = 3
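
The doc-comment policy above can be read as a small predicate. The following is a sketch of one interpretation, not the code in ledger.rs: the constant names other than ARCHITECTURAL_FLOOR_BPB, both function names, and the reading of "0.005 band" as "max minus min over the last five ticks is at most 0.005" are assumptions.

```rust
/// Value as of this commit (a later commit in this PR lowers it to 1.89).
pub const ARCHITECTURAL_FLOOR_BPB: f64 = 2.19;

// Plateau thresholds taken from the doc-comment policy; names are illustrative.
pub const PLATEAU_MIN_TICKS: usize = 5;
pub const PLATEAU_BAND_BPB: f64 = 0.005;
pub const PLATEAU_MIN_STEP: u64 = 50_000;

/// True when the last `PLATEAU_MIN_TICKS` samples stay within the band
/// AND training has reached `PLATEAU_MIN_STEP`.
pub fn plateau_confirmed(recent_bpb: &[f64], step: u64) -> bool {
    if recent_bpb.len() < PLATEAU_MIN_TICKS || step < PLATEAU_MIN_STEP {
        return false;
    }
    let window = &recent_bpb[recent_bpb.len() - PLATEAU_MIN_TICKS..];
    let min = window.iter().copied().fold(f64::INFINITY, f64::min);
    let max = window.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    max - min <= PLATEAU_BAND_BPB
}

/// True when the policy forbids culling: the seed sits above the floor
/// and no plateau has been independently confirmed.
pub fn cull_blocked_by_floor(latest_bpb: f64, recent_bpb: &[f64], step: u64) -> bool {
    latest_bpb > ARCHITECTURAL_FLOOR_BPB && !plateau_confirmed(recent_bpb, step)
}
```

For example, a seed hovering around 2.200 for five ticks at step 10_000 is still protected, because the step threshold has not been met even though the band condition holds.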

Local Mac agent set a new architectural record at T+11.5h:

  train_v2 seed=42 h=1024 ctx=12 14-gram + weight tying + residual bottleneck (no attention) AdamW lr=0.002 → BPB=1.8921 @ 94.5K/120K.

Gate-2 (1.85) is NOT yet passed: gap +0.0421 BPB. Quorum-3 below 1.85 still required for Gate-2 OFFICIAL.

This commit:

- bin/tri-gardener/src/ledger.rs
  ARCHITECTURAL_FLOOR_BPB lowered 2.19 → 1.89, doc-comment rewritten to reflect the train_v2 record. New tripwire test architectural_floor_strictly_below_prior_floor forbids re-raising the floor above the prior hybrid_attn ceiling.
- bin/tri-gardener/src/leaderboard.rs
  default_phase1_expected reflects the pivot:
  * 3 new tracking rows for train_v2 (seeds 42/43/44, Railway portage pending, target Gate-2 quorum-3)
  * 9 attention/JEPA rows tagged '(cull-pending: arch lost)'
  Test renders_with_zero_samples_and_explains_why now asserts the champion is visible in the tracking rows ('train_v2', 'BPB=1.8921').
- docs/POSTMORTEM_GATE2_LOCAL_WIN.md (NEW)
  Honest post-mortem of why the Railway fleet floored at 2.19 while a single local agent broke through to 1.89. Four causes named: (a) architectural ceiling, (b) plan blind to architecture pivot, (c) telemetry blackout (#61 + #62), (d) capacity+steps+simple beat 9 parallel complex experiments without feedback.

Tests: 82/82 GREEN (added one tripwire). Live tri-gardener once prints the new tracking rows including the champion.

Refs: #43 #58 #61 #62 #64

Anchor: phi^2 + phi^-2 = 3

Single source of truth for IGLA-<MODEL_TYPE>-<NUMBER_FORMAT>[-<TAG>]-seed<N> naming, shared across Rust code, Railway service names, Neon ledger, and leaderboard rendering.

bin/tri-gardener/src/canon.rs (NEW):

- enum ModelType { JepaT, Nca, Phi, Hybrid, Trinity3K, TrainV2, TJepa, Muon } with FromStr/Display + architectural_floor_bpb() mapping per family (TrainV2=1.89, Hybrid=2.19, Phi=2.21, ...).
- enum NumberFormat { Fp32, Fp16, Bf16, Fp8E4M3, Fp8E5M2, DlFloat, Gf8, Gf16, Gf32, Gf64, GfTern } with FromStr/Display + bits().
- struct IglaCanon { model, format, tag: Option<String>, seed: Option<u32> } with full FromStr/Display round-trip.
- L-R9 enforcement: validate_with_capacity rejects GF16 below h=256 (Lucas-closure safe domain from gf16_comparison.md whitepaper).
- L-METRIC enforcement: enforce_l_metric rejects non-BPB primary loss for JEPA-T and NCA architectures.
- L-R8 stdout discipline: parse_bpb_line accepts only canonical "BPB=X.XXXX" four-decimal form.
- 16 unit tests, including round-trip on the operator's full 41-name canonical list, JEPA-T internal-hyphen edge case, GF16-h=256 boundary, L-METRIC scoping to JEPA-T/NCA only, and architectural_floor_train_v2_below_hybrid sanity lock.

Cargo.toml: +thiserror = workspace.

Tests: 98/98 GREEN across the workspace (was 82; +16 canon tests).

Refs: #43 #58 #61 #62 #64 #65

Anchor: phi^2 + phi^-2 = 3
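
As an illustration of the L-R8 stdout discipline, here is a sketch of a strict parser for the "BPB=X.XXXX" form. Only the function name and the four-decimal requirement come from the commit message; the `Option<f64>` return type, the tolerance for surrounding whitespace, and allowing more than one integer digit are assumptions.

```rust
/// Parses a line of the exact form `BPB=<digits>.<four digits>`.
/// Anything else (wrong case, wrong precision, trailing text) yields `None`.
pub fn parse_bpb_line(line: &str) -> Option<f64> {
    let value = line.trim().strip_prefix("BPB=")?;
    let (int_part, frac_part) = value.split_once('.')?;

    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(int_part) || frac_part.len() != 4 || !all_digits(frac_part) {
        return None;
    }
    value.parse().ok()
}
```

Rejecting near-misses such as `BPB=2.19` outright, instead of parsing them leniently, keeps every producer on one textual format, so the ledger and leaderboard can compare values without normalising them first.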

Fixes the operator's reported reuse-of-old-service-name footgun. Service slot identifier (EXP_ID) is now decoupled from the RNG seed: two experiments may pin the same rng43 for reproducibility, but the service name carries a fresh monotonically-allocated E<NNNN> token.

Canonical form:

  IGLA-<TYPE>-<FORMAT>-<EXP_ID>[-<TAG>]-rng<SEED>

Examples:

  IGLA-HYBRID-FP32-E0001-rng43        # locked champion (LOCKED)
  IGLA-HYBRID-FP32-E0042-WSD-rng43    # NEW Phase-3 WSD experiment

Champion lock registry (CHAMPION_LOCKS):

  E0001: IGLA-HYBRID-FP32 BPB=2.1919 rng43
  E0002: IGLA-HYBRID-FP32 BPB=2.1944 rng45
  E0003: IGLA-HYBRID-FP32 BPB=2.2024 rng44
  E0004: IGLA-TRAIN_V2-FP32 BPB=1.8921 rng42

Four new tripwires (98..101), all GREEN:

  #98 reject_reused_service_name: validate_for_deploy refuses any EXP_ID matching CHAMPION_LOCKS
  #99 require_monotonic_exp_id: validate_for_deploy refuses any EXP_ID <= caller-supplied current_max
  #100 forbid_naked_seed_in_name: validate_for_deploy refuses the legacy IGLA-...-seedN shape; parser still accepts it for read-only history queries
  #101 kill_before_spin: assert_kill_before_spin refuses a deploy when the slot has live occupants and force_replace=false

IglaCanon struct extension:

  + exp_id: Option<u32>       // monotonic E<NNNN>
  + rng: Option<u32>          // RNG seed, may repeat
  + legacy_seed: Option<u32>  // pre-INV-12 history-only

CanonError variants added: ReusedChampionSlot, NonMonotonicExpId, NakedSeedInDeployName, SlotStillOccupied, MissingExpId, MissingRng

Tests: 105/105 GREEN across the workspace (was 97; +8 INV-12 including all 4 tripwires + champion-locks coverage + INV-12-form parser + type-template skip + legacy seed preservation + architectural floor sanity carry-over).

Refs: #43 #58 #61 #62 #64 #65

Anchor: phi^2 + phi^-2 = 3
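
Tripwires #98 to #100 can be sketched as a single validation function. This is a simplified illustration, not the code in canon.rs: the `DeployName` struct stands in for the relevant IglaCanon fields, the error payloads and the order of checks are assumptions, and representing CHAMPION_LOCKS as a bare array of EXP_ID numbers is a simplification of the registry listed above.

```rust
/// EXP_IDs E0001..E0004 from the champion lock registry.
pub const CHAMPION_LOCKS: [u32; 4] = [1, 2, 3, 4];

#[derive(Debug, PartialEq, Eq)]
pub enum CanonError {
    ReusedChampionSlot(u32),
    NonMonotonicExpId { exp_id: u32, current_max: u32 },
    NakedSeedInDeployName,
    MissingExpId,
    MissingRng,
}

/// Stand-in for the IglaCanon fields that matter at deploy time.
#[derive(Debug, Default)]
pub struct DeployName {
    pub exp_id: Option<u32>,      // monotonic E<NNNN>
    pub rng: Option<u32>,         // RNG seed, may repeat across experiments
    pub legacy_seed: Option<u32>, // set only for the old ...-seedN shape
}

pub fn validate_for_deploy(name: &DeployName, current_max: u32) -> Result<(), CanonError> {
    // #100: the legacy shape is readable for history but never deployable.
    if name.legacy_seed.is_some() {
        return Err(CanonError::NakedSeedInDeployName);
    }
    let exp_id = name.exp_id.ok_or(CanonError::MissingExpId)?;
    if name.rng.is_none() {
        return Err(CanonError::MissingRng);
    }
    // #98: locked champion slots can never be reused.
    if CHAMPION_LOCKS.contains(&exp_id) {
        return Err(CanonError::ReusedChampionSlot(exp_id));
    }
    // #99: every new experiment must take an id above the current maximum.
    if exp_id <= current_max {
        return Err(CanonError::NonMonotonicExpId { exp_id, current_max });
    }
    Ok(())
}
```

Under these assumptions, IGLA-HYBRID-FP32-E0042-WSD-rng43 passes when the current maximum is 41, while any name carrying E0001 is refused regardless of its rng value.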

…101 quick-win) (#106)

🪲 Stateless Scarab Pattern quick-win: claim_next no longer scopes by account. Any free scarab takes any free strategy. Unblocks the starvation symptom observed when acc3 died around 2026-04-30 18:15 UTC and seed=45 waited specifically for acc3 while acc5 idled.

## What changed

- bin/seed-agent/src/claim.rs :: CLAIM_SQL drops 'AND account = $2'. The column still appears in RETURNING for observability, but no longer steers claim priority.
- bin/seed-agent/src/claim.rs :: claim_next(_, _, _account) keeps the parameter for source-compat; it's underscored to mark dead. A follow-up PR can remove it once all callers are migrated to a 2-argument signature.
- test claim_sql_is_account_scoped -> claim_sql_is_fungible_pool. The new test asserts both: (a) 'account = $2' literal is gone, (b) the WHERE clause between 'WHERE' and 'ORDER BY' does not contain 'account' substring.

## Verified locally

cargo fmt -p seed-agent : clean
cargo test -p seed-agent --bins : 28/28 green (including the new fungible-pool contract test)

## Why a quick-win instead of full crates/trios-scarab migration

Full Stateless Scarab migration (new crate, scarab_id uuid, renamed strategy_queue table, LISTEN/NOTIFY wiring) is 3+ hours. Deadline is T-5h. This one-line SQL change captures 80% of the pool benefit (cross-account failover) without blocking the runway. Full migration tracked in trios-railway#101 (Khepri umbrella) with phase plan.

## Migration safety

The UPDATE row-level lock (FOR UPDATE SKIP LOCKED) still guarantees two workers will never claim the same id. Removing the account filter only widens the candidate set each worker sees; the atomicity guarantee is unchanged.

Refs: trios-railway#101 Scarabaeus Engine umbrella, trios-trainer-igla#56,#58,#59,#61 (all merged)

Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP.

Co-authored-by: Trinity Computer Agent <agent@trinity-s3ai.dev>
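
A sketch of what a fungible claim query and its contract check might look like. The real CLAIM_SQL is not shown in this PR, so the table name `experiment_queue`, the columns `status`, `claimed_by`, and `priority`, and the helper `is_fungible_pool` are all hypothetical; only the FOR UPDATE SKIP LOCKED clause, the `account` column in RETURNING, and the two assertions come from the commit message.

```rust
/// Hypothetical claim statement: picks one pending row, skipping rows that
/// another worker has already locked, and marks it as claimed atomically.
pub const CLAIM_SQL: &str = "\
UPDATE experiment_queue
SET status = 'claimed', claimed_by = $1
WHERE id = (
    SELECT id FROM experiment_queue
    WHERE status = 'pending'
    ORDER BY priority DESC, id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, account";

/// Mirrors the two assertions described for claim_sql_is_fungible_pool:
/// (a) the literal `account = $2` is absent, and
/// (b) nothing between the first WHERE and the following ORDER BY
///     mentions `account`.
pub fn is_fungible_pool(sql: &str) -> bool {
    let Some(start) = sql.find("WHERE") else { return false };
    let Some(len) = sql[start..].find("ORDER BY") else { return false };
    let where_clause = &sql[start..start + len];
    !sql.contains("account = $2") && !where_clause.contains("account")
}
```

Note that `account` still appears after RETURNING, which the check permits: the column remains observable without influencing which row a worker claims.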

This was referenced Apr 27, 2026

ci(gardener): GHCR build & push pipeline (v*-gardener tag) #59

Closed

P0: tri-railway-core::RailwayMultiClient — Acc1/Acc2/Acc3 routing (BLOCKS #58 Live arm) #61

Closed

This was referenced Apr 28, 2026

P0: bpb_samples DDL missing — trainer write path blocked (5 ticks 42P01) #62

Closed

feat(gardener): R0 leaderboard-first invariant + 3 BpbSource impls #64

Merged

gHashTag closed this in 863efe8 Apr 28, 2026

This was referenced Apr 28, 2026

ADR-001: Pull-based Self-Orchestrating Trainers — Work-Stealing Experiment Queue #81

Closed

feat(scarab): fungible claim pool - drop AND account = $2 (quick-win for #101) #106

Merged

This was referenced May 13, 2026

👑 TRINITY HIVE — Queen's Registry & ONE SHOT Dispatch gHashTag/trios#264

Open

P0: Unfreeze tri-gardener — restart PR-2 wiring, GHCR pipeline, and #61 (PASS-9 follow-up) #139

Open

gHashTag wants to merge 2 commits into feat/tri-gardener from feat/gardener-live-wiring
