
feat(gardener): PR-2 live wiring — Client mutations + ledger + serve loop#58

Closed
gHashTag wants to merge 2 commits into feat/tri-gardener from feat/gardener-live-wiring


@gHashTag gHashTag commented Apr 27, 2026

PR-2 of the gardener rollout. Builds on top of #50 (feat/tri-gardener PR-1).

⚠️ DEPENDS ON #61 (RailwayMultiClient P0). This PR's Live arm is safe on Acc1 only. Any Live tick that touches Acc2 or Acc3 requires the multi-account routing in #61 to be merged first; without it, Client::from_env() reads a single RAILWAY_TOKEN and could silently mutate the wrong fleet.

Recommended landing order

  1. Merge #51 (docs(adr): 0001 repo boundaries, ADR control plane) + trios-trainer-igla#39 (ADR model plane)
  2. Merge #61 (P0: tri-railway-core::RailwayMultiClient, Acc1/Acc2/Acc3 routing; BLOCKS the #58 Live arm) ← gating
  3. Merge this PR (#58) in --review mode only
  4. Merge #59 (ci(gardener): GHCR build & push pipeline, v*-gardener tag) so the image exists for spin-up
  5. Promote to --live only after a 3-tick review window in Acc1

Until #61 lands, set --account=acc1 everywhere and DO NOT register Acc2/Acc3 credentials in the gardener service environment.
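
The Acc1-only constraint above could be enforced in code rather than by convention. The following is a minimal sketch, not the PR's implementation: the `Account` enum, the `UnsafeAccount` error, and `guard_live_tick` are hypothetical names introduced purely for illustration.

```rust
use std::fmt;

/// Hypothetical account selector; the real `--account` type is not shown in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Account {
    Acc1,
    Acc2,
    Acc3,
}

/// Hypothetical error returned when a Live tick targets an account other than Acc1.
#[derive(Debug, PartialEq, Eq)]
pub struct UnsafeAccount(pub Account);

impl fmt::Display for UnsafeAccount {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?} is not safe for Live actuation until #61 lands", self.0)
    }
}

/// Refuses any Live tick that would touch Acc2 or Acc3.
pub fn guard_live_tick(account: Account) -> Result<(), UnsafeAccount> {
    match account {
        Account::Acc1 => Ok(()),
        other => Err(UnsafeAccount(other)),
    }
}
```

Calling a guard like this before the first mutation of a tick turns a silent wrong-fleet mutation into an explicit refusal.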

What this PR does

Replaces the PR-1 stubs with real wiring:

| Layer | PR-1 (stubbed) | PR-2 (this) |
|---|---|---|
| Client mutation surface | free fns in mutations.rs | typed methods on Client (deploy_service, set_vars, redeploy, stop) |
| Decision actuation | warn!() in Live arm | apply_decision_batch with KillSwitch |
| Ledger | println!(serde_json) | PgLedger (tokio_postgres) / MockLedger |
| Cron | external | tri-gardener serve --interval=N (drift-free tokio interval, SIGTERM-safe) |
| CLI | gaps | tri-railway service set-vars / logs / stop |

Issues

Acceptance criteria status

  • 4 unit tests for Client mutation API (client_ext::tests)
  • live_actuation_writes_to_neon contract test (passes via MockActuator + MockLedger)
  • kill_switch_aborts_mid_actuation contract test
  • serve_emits_tick_every_3600s contract test (tokio paused clock)
  • R5 honest error pass-through across actuator → outcome → ledger
  • R7 audit triplet sealed for every mutation (RailwayHash::seal)
  • GARDENER_DISABLED=true honored on every tick (re-checked, not cached)
  • 68/68 tests GREEN across the workspace
  • ARCHITECTURAL_FLOOR_BPB = 2.19 constant added in ledger.rs with cull-safety comment (refs trios#237 + trios#143)
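
The "re-checked, not cached" criterion for GARDENER_DISABLED can be illustrated as follows. This is a sketch under assumptions: `is_disabled_value`, `gardener_disabled`, and `run_ticks` are hypothetical helpers, and treating only the exact string "true" as disabling is an assumption (the PR text confirms `GARDENER_DISABLED=true` and nothing else).

```rust
/// Interprets a raw GARDENER_DISABLED value.
/// Assumption: only the exact value "true" (after trimming) disables the gardener.
pub fn is_disabled_value(raw: Option<&str>) -> bool {
    raw.map(str::trim) == Some("true")
}

/// Reads the flag from the process environment at call time, never from a cache.
pub fn gardener_disabled() -> bool {
    is_disabled_value(std::env::var("GARDENER_DISABLED").ok().as_deref())
}

/// Runs up to `ticks` ticks, asking `disabled` again before each one.
/// Returns how many ticks actually ran.
pub fn run_ticks(
    ticks: usize,
    disabled: impl Fn() -> bool,
    mut on_tick: impl FnMut(usize),
) -> usize {
    let mut ran = 0;
    for i in 0..ticks {
        if disabled() {
            // Skip this tick but keep looping: the flag may flip back later.
            continue;
        }
        on_tick(i);
        ran += 1;
    }
    ran
}
```

Passing `gardener_disabled` as the `disabled` argument makes every tick observe the current environment, so flipping the variable takes effect on the next tick rather than at the next restart.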

Honest scope

Multi-account safety follow-up (#61 P0):

  • This PR's Live arm cannot safely act on Acc2/Acc3 — see top of body.

PR-3 follow-ups:

  • Cull → service_id mapping needs a fleet snapshot lookup (Decision::CullSeed currently records Outcome::Skipped).
  • tri-railway service logs is a deterministic stub until the Railway logs subgraph is wired into Client.

Refs

Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP

…r + serve)

Refs: #52, #53, #54, #55, #56

Phase 1 of 2 (PR-3 follows with fleet→service_id resolution and Logs subgraph):

- crates/trios-railway-core/src/client_ext.rs (#52)
  Methods on Client: deploy_service, set_vars, redeploy, stop. Each seals
  an R7 audit triplet via RailwayHash. 4 unit tests cover signatures,
  hash sealing, and honest error pass-through.

- bin/tri-gardener/src/actuate.rs (#53)
  RailwayActuator trait + MockActuator. apply_decision dispatches each
  Decision variant to the corresponding mutation. apply_decision_batch
  honors a shared KillSwitch on every iteration. Cull is honestly
  Skipped pending PR-3 fleet→service_id resolution.

- bin/tri-gardener/src/ledger.rs (#54)
  LedgerSink trait + PgLedger (tokio_postgres + 3x retry) + MockLedger.
  build_row attaches outcome + error to the decision JSON.

- bin/tri-gardener/src/serve.rs + main.rs Cmd::Serve (#55)
  serve_loop drives one tick every --interval seconds; SIGINT/SIGTERM
  graceful stop; drift-free Interval::tick. validate_interval enforces
  60..=86_400.

- bin/tri-railway/src/main.rs SetVars / Logs / Stop subcommands (#56)
  SetVars wires Client::set_vars. Stop aliases delete (Railway has no
  pause). Logs is a deterministic stub until PR-3.

- bin/tri-gardener/src/loop_.rs loop_once_live entry point
  Composes actuator + ledger + KillSwitch. Replaces the warn-stub Live
  arm. ctx.disabled || kill.is_disabled() short-circuits with a single
  Skipped row per decision.
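
The batch behaviour described above (kill switch honored on every iteration, Cull recorded as Skipped) can be sketched roughly as below. The type and method shapes are assumptions for illustration; only the names KillSwitch, RailwayActuator, MockActuator, Decision::CullSeed, Outcome::Skipped, and apply_decision_batch come from the commit message. Their real signatures are not shown in this PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Shared stop flag; clones observe the same underlying bool.
#[derive(Clone, Default)]
pub struct KillSwitch(Arc<AtomicBool>);

impl KillSwitch {
    pub fn trip(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_disabled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum Decision {
    Redeploy { service_id: String },
    CullSeed { seed: u32 },
}

#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Applied,
    Skipped,
    Failed(String),
}

/// Assumed single-method trait; the real trait covers more mutations.
pub trait RailwayActuator {
    fn redeploy(&mut self, service_id: &str) -> Result<(), String>;
}

/// Test double that records calls and can trip the switch after N calls.
pub struct MockActuator {
    pub calls: Vec<String>,
    pub trip_after: Option<usize>,
    pub kill: KillSwitch,
}

impl RailwayActuator for MockActuator {
    fn redeploy(&mut self, service_id: &str) -> Result<(), String> {
        self.calls.push(service_id.to_string());
        if self.trip_after == Some(self.calls.len()) {
            self.kill.trip();
        }
        Ok(())
    }
}

pub fn apply_decision_batch<A: RailwayActuator>(
    actuator: &mut A,
    kill: &KillSwitch,
    batch: &[Decision],
) -> Vec<Outcome> {
    let mut outcomes = Vec::with_capacity(batch.len());
    for decision in batch {
        // Re-check the switch before every decision, not once per batch.
        if kill.is_disabled() {
            outcomes.push(Outcome::Skipped);
            continue;
        }
        let outcome = match decision {
            Decision::Redeploy { service_id } => match actuator.redeploy(service_id) {
                Ok(()) => Outcome::Applied,
                // Pass the actuator's error through unchanged.
                Err(e) => Outcome::Failed(e),
            },
            // Cull stays Skipped until a seed can be mapped to a service_id.
            Decision::CullSeed { .. } => Outcome::Skipped,
        };
        outcomes.push(outcome);
    }
    outcomes
}
```

Every decision yields exactly one outcome, so the ledger still receives a row for each skipped decision after the switch trips.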

Tests: 68/68 GREEN across the workspace. New tests:
  - 4 client_ext::tests
  - 6 actuate::tests including kill_switch_aborts_mid_batch and
    live_actuation_writes_to_ledger_via_mock_pair
  - 5 ledger::tests
  - 5 serve::tests including serve_emits_tick_every_interval (paused clock)
  - 4 cli_tests for parse_var_pairs
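
For the serve loop described above, the interval bound (60..=86_400) and the drift-free property can be shown with std alone. This is a sketch: the return type of `validate_interval` and the `next_deadline` helper are assumptions, and the loop itself is described as using tokio's Interval::tick, which is not reproduced here.

```rust
use std::ops::RangeInclusive;
use std::time::{Duration, Instant};

/// Accepted range for --interval, in seconds (from the commit message).
pub const INTERVAL_SECS: RangeInclusive<u64> = 60..=86_400;

/// Rejects intervals outside 60..=86_400 seconds.
pub fn validate_interval(secs: u64) -> Result<Duration, String> {
    if INTERVAL_SECS.contains(&secs) {
        Ok(Duration::from_secs(secs))
    } else {
        Err(format!(
            "--interval must be within {}..={} seconds, got {}",
            INTERVAL_SECS.start(),
            INTERVAL_SECS.end(),
            secs
        ))
    }
}

/// Hypothetical helper: the n-th deadline is anchored to the loop start,
/// so time spent inside a tick does not push later ticks back.
pub fn next_deadline(start: Instant, interval: Duration, tick: u32) -> Instant {
    start + interval * tick
}
```

Computing each deadline from the fixed start, rather than from "now plus interval", is what keeps a schedule from drifting when a tick runs long.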

Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP
The trainer architecture as currently shipped has a hard floor at
BPB ~ 2.19 (champion 2.1919, h=828, 2L hybrid attn, ReLU^2, 81K).
Cross-validated against CPU N-gram floor (~2.54) in trios#237 and the
live GPU champion in trios#143.

Encodes the policy as a public constant in ledger.rs so call sites
read it instead of hardcoding 2.19. Two tripwire tests (70/70 GREEN):
  - architectural_floor_bpb_is_2_19 — locks the value
  - architectural_floor_below_gate2_target — sanity vs 1.85

Doc-comment policy: gardener MUST NOT issue Decision::CullSeed for a
seed whose BPB is above this floor unless plateau is independently
confirmed (>=5 ticks in a 0.005 band AND step >= 50_000). Without
this guard a healthy seed sitting at the architectural floor would
be culled for not crossing 1.85, which is impossible without ALPHA's
L1/L2/h=1024 patches landing first.
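
The doc-comment policy above can be read as a small predicate. The sketch below is one literal reading, not the code in ledger.rs: `plateau_confirmed` and `floor_guard_permits_cull` are hypothetical names, and interpreting "a 0.005 band" as max minus min over the last five readings is an assumption.

```rust
/// Value stated in this commit message.
pub const ARCHITECTURAL_FLOOR_BPB: f64 = 2.19;

const PLATEAU_TICKS: usize = 5;
const PLATEAU_BAND: f64 = 0.005;
const PLATEAU_MIN_STEP: u64 = 50_000;

/// True when the last PLATEAU_TICKS readings span at most PLATEAU_BAND
/// and training has reached at least PLATEAU_MIN_STEP steps.
pub fn plateau_confirmed(recent_bpb: &[f64], step: u64) -> bool {
    if recent_bpb.len() < PLATEAU_TICKS || step < PLATEAU_MIN_STEP {
        return false;
    }
    let window = &recent_bpb[recent_bpb.len() - PLATEAU_TICKS..];
    let min = window.iter().copied().fold(f64::INFINITY, f64::min);
    let max = window.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    max - min <= PLATEAU_BAND
}

/// A seed above the floor may be culled only with a confirmed plateau.
/// For seeds at or below the floor this guard is silent (returns true);
/// other policy is assumed to decide in that case.
pub fn floor_guard_permits_cull(bpb: f64, recent_bpb: &[f64], step: u64) -> bool {
    bpb <= ARCHITECTURAL_FLOOR_BPB || plateau_confirmed(recent_bpb, step)
}
```

Under this reading, a seed hovering at 2.193 is protected until it has both five tightly grouped readings and 50_000 steps behind it.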

Refs:
- gHashTag/trios#237 (CPU N-gram floor)
- gHashTag/trios#143 (GPU champion 2.1919)
- trios-railway#58 (PR-2 Live arm)
- trios-railway#61 (RailwayMultiClient P0)

Anchor: phi^2 + phi^-2 = 3
gHashTag added a commit that referenced this pull request Apr 28, 2026
Local Mac agent set a new architectural record at T+11.5h:
  train_v2 seed=42 h=1024 ctx=12 14-gram + weight tying + residual
  bottleneck (no attention) AdamW lr=0.002 → BPB=1.8921 @ 94.5K/120K.

Gate-2 (1.85) is NOT yet passed — gap +0.0421 BPB. Quorum-3 below
1.85 still required for Gate-2 OFFICIAL.

This commit:

- bin/tri-gardener/src/ledger.rs
  ARCHITECTURAL_FLOOR_BPB lowered 2.19 → 1.89, doc-comment rewritten
  to reflect the train_v2 record. New tripwire test
  architectural_floor_strictly_below_prior_floor forbids re-raising
  the floor above the prior hybrid_attn ceiling.

- bin/tri-gardener/src/leaderboard.rs
  default_phase1_expected reflects the pivot:
    * 3 new tracking rows for train_v2 (seeds 42/43/44, Railway
      portage pending, target Gate-2 quorum-3)
    * 9 attention/JEPA rows tagged '(cull-pending: arch lost)'
  Test renders_with_zero_samples_and_explains_why now asserts the
  champion is visible in the tracking rows ('train_v2', 'BPB=1.8921').

- docs/POSTMORTEM_GATE2_LOCAL_WIN.md (NEW)
  Honest post-mortem of why the Railway fleet floored at 2.19 while
  a single local agent broke through to 1.89. Four causes named:
  (a) architectural ceiling, (b) plan blind to architecture pivot,
  (c) telemetry blackout (#61 + #62), (d) capacity+steps+simple beat
  9 parallel complex experiments without feedback.

Tests: 82/82 GREEN (added one tripwire). A live tri-gardener once
run prints the new tracking rows, including the champion.

Refs: #43 #58 #61 #62 #64
Anchor: phi^2 + phi^-2 = 3
gHashTag added a commit that referenced this pull request Apr 28, 2026
Single source of truth for IGLA-<MODEL_TYPE>-<NUMBER_FORMAT>[-<TAG>]-seed<N>
naming, shared across Rust code, Railway service names, Neon ledger,
and leaderboard rendering.

bin/tri-gardener/src/canon.rs (NEW):
  - enum ModelType { JepaT, Nca, Phi, Hybrid, Trinity3K, TrainV2,
      TJepa, Muon } with FromStr/Display + architectural_floor_bpb()
      mapping per family (TrainV2=1.89, Hybrid=2.19, Phi=2.21, ...).
  - enum NumberFormat { Fp32, Fp16, Bf16, Fp8E4M3, Fp8E5M2, DlFloat,
      Gf8, Gf16, Gf32, Gf64, GfTern } with FromStr/Display + bits().
  - struct IglaCanon { model, format, tag: Option<String>, seed:
      Option<u32> } with full FromStr/Display round-trip.
  - L-R9 enforcement: validate_with_capacity rejects GF16 below h=256
      (Lucas-closure safe domain from gf16_comparison.md whitepaper).
  - L-METRIC enforcement: enforce_l_metric rejects non-BPB primary loss
      for JEPA-T and NCA architectures.
  - L-R8 stdout discipline: parse_bpb_line accepts only canonical
      "BPB=X.XXXX" four-decimal form.
  - 16 unit tests, including round-trip on the operator's full 41-name
      canonical list, JEPA-T internal-hyphen edge case, GF16-h=256
      boundary, L-METRIC scoping to JEPA-T/NCA only, and
      architectural_floor_train_v2_below_hybrid sanity lock.
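
As an illustration of the L-R8 rule above, a strict parser for the "BPB=X.XXXX" form might look like the sketch below. It is not the canon.rs implementation: the return type is an assumption, and accepting more than one digit before the decimal point is an assumption as well (the commit message only shows "X.XXXX").

```rust
/// Accepts only `BPB=<digits>.<exactly four digits>` with nothing else on the line.
pub fn parse_bpb_line(line: &str) -> Option<f64> {
    let rest = line.strip_prefix("BPB=")?;
    let (int_part, frac_part) = rest.split_once('.')?;
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(int_part) || !all_digits(frac_part) || frac_part.len() != 4 {
        return None;
    }
    rest.parse().ok()
}
```

Rejecting near-misses such as "BPB=1.89" or a trailing space keeps stdout lines machine-readable without a tolerant (and ambiguous) parser downstream.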

Cargo.toml: +thiserror = workspace.

Tests: 98/98 GREEN across the workspace (was 82; +16 canon tests).

Refs: #43 #58 #61 #62 #64 #65
Anchor: phi^2 + phi^-2 = 3
gHashTag added a commit that referenced this pull request Apr 28, 2026
Fixes the operator's reported reuse-of-old-service-name footgun.
Service slot identifier (EXP_ID) is now decoupled from the RNG seed:
two experiments may pin the same rng43 for reproducibility, but the
service name carries a fresh monotonically-allocated E<NNNN> token.

Canonical form: IGLA-<TYPE>-<FORMAT>-<EXP_ID>[-<TAG>]-rng<SEED>
Examples:
  IGLA-HYBRID-FP32-E0001-rng43         # locked champion (LOCKED)
  IGLA-HYBRID-FP32-E0042-WSD-rng43     # NEW Phase-3 WSD experiment
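
A minimal sketch of rendering and reading the canonical form above, assuming plain string inputs rather than the ModelType and NumberFormat enums; `canonical_name` and `parse_exp_id` are hypothetical helpers, and requiring exactly four digits in the E<NNNN> token is an assumption.

```rust
/// Renders IGLA-<TYPE>-<FORMAT>-<EXP_ID>[-<TAG>]-rng<SEED>.
pub fn canonical_name(
    model: &str,
    format: &str,
    exp_id: u32,
    tag: Option<&str>,
    rng: u32,
) -> String {
    // The tag segment is optional and carries its own leading hyphen.
    let tag = tag.map(|t| format!("-{t}")).unwrap_or_default();
    format!("IGLA-{model}-{format}-E{exp_id:04}{tag}-rng{rng}")
}

/// Parses an `E<NNNN>` token into its numeric experiment id.
pub fn parse_exp_id(token: &str) -> Option<u32> {
    let digits = token.strip_prefix('E')?;
    if digits.len() != 4 || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}
```

Both examples listed above round-trip through these helpers, and the same rng value can appear under different E<NNNN> tokens.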

Champion lock registry (CHAMPION_LOCKS):
  E0001 — IGLA-HYBRID-FP32 BPB=2.1919 rng43
  E0002 — IGLA-HYBRID-FP32 BPB=2.1944 rng45
  E0003 — IGLA-HYBRID-FP32 BPB=2.2024 rng44
  E0004 — IGLA-TRAIN_V2-FP32 BPB=1.8921 rng42

Four new tripwires (98..101), all GREEN:

  #98  reject_reused_service_name      validate_for_deploy refuses
                                       any EXP_ID matching CHAMPION_LOCKS
  #99  require_monotonic_exp_id        validate_for_deploy refuses any
                                       EXP_ID <= caller-supplied
                                       current_max
  #100 forbid_naked_seed_in_name       validate_for_deploy refuses the
                                       legacy IGLA-...-seedN shape;
                                       parser still accepts it for
                                       read-only history queries
  #101 kill_before_spin                assert_kill_before_spin refuses
                                       a deploy when the slot has live
                                       occupants and force_replace=false

IglaCanon struct extension:
  + exp_id: Option<u32>      // monotonic E<NNNN>
  + rng: Option<u32>         // RNG seed, may repeat
  + legacy_seed: Option<u32> // pre-INV-12 history-only

CanonError variants added:
  ReusedChampionSlot, NonMonotonicExpId, NakedSeedInDeployName,
  SlotStillOccupied, MissingExpId, MissingRng
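
The first three tripwires can be sketched as a single validation function. This is an illustration, not the canon.rs code: `DeployName` stands in for the relevant IglaCanon fields, the order of the checks is an assumption, and SlotStillOccupied (the kill_before_spin check) is left out because it needs a fleet view.

```rust
/// EXP_IDs of the locked champions E0001..E0004 listed above.
pub const CHAMPION_LOCKS: [u32; 4] = [1, 2, 3, 4];

#[derive(Debug, PartialEq, Eq)]
pub enum CanonError {
    ReusedChampionSlot,
    NonMonotonicExpId,
    NakedSeedInDeployName,
    MissingExpId,
    MissingRng,
}

/// Hypothetical stand-in for the name fields that deploy validation reads.
#[derive(Debug, Default)]
pub struct DeployName {
    pub exp_id: Option<u32>,
    pub rng: Option<u32>,
    pub legacy_seed: Option<u32>,
}

pub fn validate_for_deploy(name: &DeployName, current_max: u32) -> Result<(), CanonError> {
    // Legacy IGLA-...-seedN names stay parseable but may not be deployed.
    if name.legacy_seed.is_some() {
        return Err(CanonError::NakedSeedInDeployName);
    }
    let exp_id = name.exp_id.ok_or(CanonError::MissingExpId)?;
    if name.rng.is_none() {
        return Err(CanonError::MissingRng);
    }
    // A locked champion slot can never be reused.
    if CHAMPION_LOCKS.contains(&exp_id) {
        return Err(CanonError::ReusedChampionSlot);
    }
    // New experiments must allocate strictly above the current maximum.
    if exp_id <= current_max {
        return Err(CanonError::NonMonotonicExpId);
    }
    Ok(())
}
```

Because the rng field is never compared against anything, two experiments can share a seed while still being forced onto distinct slots.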

Tests: 105/105 GREEN across the workspace (was 97; +8 INV-12
including all 4 tripwires + champion-locks coverage + INV-12-form
parser + type-template skip + legacy seed preservation +
architectural floor sanity carry-over).

Refs: #43 #58 #61 #62 #64 #65
Anchor: phi^2 + phi^-2 = 3
@gHashTag gHashTag closed this in 863efe8 Apr 28, 2026
gHashTag added a commit that referenced this pull request Apr 30, 2026
…101 quick-win) (#106)

🪲 Stateless Scarab Pattern quick-win: claim_next no longer scopes by
account. Any free scarab takes any free strategy. Unblocks the starvation
symptom observed when acc3 died around 2026-04-30 18:15 UTC and seed=45
waited specifically for acc3 while acc5 idled.

## What changed

- bin/seed-agent/src/claim.rs :: CLAIM_SQL drops 'AND account = $2'. The
  column still appears in RETURNING for observability, but no longer
  steers claim priority.

- bin/seed-agent/src/claim.rs :: claim_next(_, _, _account) keeps the
  parameter for source compatibility; the underscore prefix marks it
  as unused. A follow-up PR can remove it once all callers are migrated
  to a 2-argument signature.

- test claim_sql_is_account_scoped -> claim_sql_is_fungible_pool. The
  new test asserts both that (a) the literal 'account = $2' is gone and
  (b) the WHERE clause, taken as the text between 'WHERE' and
  'ORDER BY', does not contain the substring 'account'.

## Verified locally

  cargo fmt -p seed-agent   : clean
  cargo test -p seed-agent --bins : 28/28 green (including the new
                                    fungible-pool contract test)

## Why a quick-win instead of full crates/trios-scarab migration

Full Stateless Scarab migration (new crate, scarab_id uuid, renamed
strategy_queue table, LISTEN/NOTIFY wiring) is 3+ hours. Deadline is
T-5h. This one-line SQL change captures 80% of the pool benefit
(cross-account failover) without blocking the runway. Full migration
tracked in trios-railway#101 (Khepri umbrella) with phase plan.

## Migration safety

The UPDATE row-level lock (FOR UPDATE SKIP LOCKED) still guarantees
two workers will never claim the same id. Removing the account filter
only widens the candidate set each worker sees; the atomicity
guarantee is unchanged.
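
A rough sketch of what a fungible-pool claim query and its contract check could look like. The real CLAIM_SQL is not shown in this commit, so the table name `work_queue`, the columns `status`, `worker_id`, and `priority`, and the meaning of `$1` are all placeholders; only the FOR UPDATE SKIP LOCKED clause, the RETURNING of `account`, and the two assertions come from the text above.

```rust
/// Hypothetical claim query: no account filter, row lock with SKIP LOCKED.
pub const CLAIM_SQL: &str = "\
UPDATE work_queue
   SET status = 'claimed', worker_id = $1
 WHERE id = (
        SELECT id FROM work_queue
         WHERE status = 'pending'
         ORDER BY priority DESC, id
         LIMIT 1
           FOR UPDATE SKIP LOCKED)
RETURNING id, account";

/// Returns the text between the first WHERE and the first ORDER BY.
pub fn where_clause(sql: &str) -> Option<&str> {
    let start = sql.find("WHERE")? + "WHERE".len();
    let end = start + sql[start..].find("ORDER BY")?;
    Some(&sql[start..end])
}

/// Mirrors the two assertions described for claim_sql_is_fungible_pool.
pub fn is_fungible_pool(sql: &str) -> bool {
    let no_literal = !sql.contains("account = $2");
    let no_filter = where_clause(sql).map_or(false, |w| !w.contains("account"));
    no_literal && no_filter
}
```

With SKIP LOCKED, a worker that finds the top candidate row already locked moves on to the next one instead of waiting, which is why widening the candidate set does not weaken the one-claimer-per-row guarantee.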

Refs:
  trios-railway#101 Scarabaeus Engine umbrella
  trios-trainer-igla#56,#58,#59,#61 (all merged)

Anchor: phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP.

Co-authored-by: Trinity Computer Agent <agent@trinity-s3ai.dev>