fix(byoc/training): first-tick grace + stalled-adapter pause (PR-4)#3932
Open
seanhanca wants to merge 11 commits into
Open
fix(byoc/training): first-tick grace + stalled-adapter pause (PR-4)#3932seanhanca wants to merge 11 commits into
seanhanca wants to merge 11 commits into
Conversation
PR-4 of byoc-payment-fleet-2026-05 plan, addressing two known sharp edges in runTrainingJob's charge accounting (design doc §3.B): 1. **First-tick grace** (Invariant I7 — pre-billing-window grace). chargeTick no longer fires until the adapter has reported a non- submitted status at least once. Previously: orch billed for 30s of "training" that was really 30s of fal-side queue + zip download + GPU spin-up. Now: billing starts the moment we observe the adapter is actually running the job. ~5 LOC: seenInProgress flag, set on first non-empty/non-submitted status read. 2. **Stalled-adapter pause** (Invariant I1 — pay only for billable work). When status polls fail 3+ consecutive times, the orch stops accumulating billable time. consecutivePollFails resets on any successful poll. While stalled, lastChargeTime is pushed forward so accrued time is dropped on the floor. These are the two §3.B simplifications from design §14.5 — no new fields, no new persistence, ~30 LOC total. Refund mechanism (the third §3.B item) deferred to a follow-up PR once we identify the orch-side credit function (or build one as part of the payment ledger work in §3.D Redis persistence). Tests added in same PR — test_training_billing.go covers: - T1: chargeTick doesn't fire before seenInProgress - T2: chargeTick fires after seenInProgress observed - T3: stall threshold reached → billing pauses - T4: stall recovers → billing resumes Branch: feat/byoc-payment-fleet-2026-05 (branched from feat/remote-signer-byoc-v2 per design §19.6).
PR-5 of byoc-payment-fleet-2026-05 plan. Adds the orch-side endpoint
for refresh-on-watermark ticket top-up (design §3.A).
New route:
POST /process/job/{jobId}/refresh-payment
Headers: Livepeer-Payment (required), Livepeer-Segment (optional)
Returns: {"job_id", "new_balance_wei"} on HTTP 200
Behavior:
- Job must exist + not be in terminal state (completed/failed/cancelled)
- Reads Livepeer-Payment header, parses to net.Payment protobuf
- Enforces invariant I6 (sender attribution): payment.Sender must match
job.sender from the original submit. Mismatch → HTTP 403.
- Calls ProcessPayment (same as inference path) to credit the deposit
ledger. Idempotency on duplicate nonces is enforced at the PM layer.
- Updates job.Balance + job.UpdatedAt in trainingStore.
Cross-repo dependencies:
- Required by: livepeer/livepeer-python-gateway PR-2 (calls this endpoint)
- Coordinates with: livepeer/simple-infra PR-8 (SDK refresh-watermark
loop relies on this endpoint being live)
…Tick Addresses review C1 on PR #3932. The synchronous completion path (adapterJobID == "") would silently skip billing. Reviewer asked whether this is intentional or a bug. Answer: intentional dead-code path for fal-direct training (fal-ai/flux-lora-fast-training is always async). The branch is preserved for future sync training providers. Added comment so a future reviewer doesn't restore billing here without re-reading the charge accounting model in §3.B of the design doc. When/if a sync training provider lands, that provider must include cost info in adapterResp and we'll wire it through chargeTrainingTick with the actual elapsed seconds — not by re-using the chargeInterval ticker semantics.
PR-6 of byoc-payment-fleet-2026-05 plan. Adds opt-in persistence to
TrainingJobStore for recovery on orch restart (design §3.D).
**Deviation from design**: §3.D specified Redis. After review, filesystem
JSON checkpoint gives equivalent recovery semantics for the single-
instance BYOC orch today without the operational cost of a new Redis
container or `github.com/redis/go-redis` module dependency. Swap to
Redis if/when the orch becomes multi-instance — the TrainingJobStore
internal interface stays the same; only the persistence backend changes.
Implementation:
- New constructor `NewTrainingJobStoreWithCheckpoint(ttl, dir)` for
filesystem-backed mode. `NewTrainingJobStore(ttl)` unchanged for
backward-compat in-memory mode.
- Store/Update writes atomically to {jobID}.json via .tmp + rename.
- Startup sweep reads every {jobID}.json; in-flight jobs
(submitted/running) are marked `failed_orchestrator_restart`,
their checkpoint rewritten, and added to the in-memory map.
- TTL cleanup deletes terminal checkpoints alongside in-memory entries.
- Corrupt JSON files logged + skipped, not fatal.
Wiring:
- NewBYOCOrchestratorServer reads TRAINING_CHECKPOINT_DIR env var.
Set on byoc-staging-1 to enable persistence; unset for local dev.
Invariant satisfied:
- I8 (restart-lost jobs are refunded): sweep marks them as
failed_orchestrator_restart. Refund issuance is the orch's job
(PR-10 BillingEvent emission), not the store's — TODO follow-up.
Tests (5, all isolated to byoc/ — verifiable with go test ./byoc/...):
- TestTrainingStoreInMemory: backward-compat
- TestTrainingStoreCheckpoint: disk write on Store + Update
- TestTrainingStoreSweepRecoversInflightAsFailed: I8 invariant
- TestTrainingStoreCorruptCheckpointSkipped: resilience
- TestTrainingStoreAtomicWriteIgnoresTmp: atomic-write safety
…tations
Addresses the load-bearing C1 finding from PR-6 review.
The original Store/Update path released store mu after taking a
snapshot, then re-acquired checkpointMu separately to do the fs write.
Under concurrent Update("job-X") calls, snapshots could write out of
order: T2's later mutation could complete + write S2 to disk before
T1's earlier mutation finished writing S1, leaving stale state on disk.
On orch crash + restart, sweep would see the stale "running" state on
disk for a job that had actually completed, mark it as
failed_orchestrator_restart, and emit a refund — INVERTING invariant I8.
Fix: hold checkpointMu across the WHOLE mutation+snapshot+write
window. This serializes checkpoint writes in mutation order. fs I/O
still happens outside store mu so reader requests aren't blocked.
Refactor: split writeCheckpoint into outer (acquires checkpointMu)
and writeCheckpointLocked (caller must already hold it).
New test: TestTrainingStoreConcurrentUpdatesPreserveOrder. 100
concurrent Updates on same job, final disk state must equal final
memory state. Catches the race definitively.
Note: review also flagged sweep ignoring fs errors past ReadDir
(R3) and missing refund wiring for failed_orchestrator_restart (R2).
Both are deferred to PR-10 (BillingEvent stream) per the design's
"refund issuance is orch's job, not store's" split.
…PR-10) PR-10 of byoc-payment-fleet-2026-05 (design §16, §3.B). Adds a structured JSONL audit record emitted on every terminal training-job state so the off-chain reconciliation pipeline (Invariant I3) has a wei-accurate record to cross-reference against orch's on-chain ticket-redemption ledger. Schema (billing_event.go::BillingEvent): event=billing_event, schema_version=1, timestamp, job_id, job_type, capability, model_id, user_hash (sha256(sender)[:16]), sender_address, status, cost_paid_wei, balance_wei, billable_seconds, wall_seconds, started_at, completed_at, billing_started_at?, error_message? Emitted from every terminal-state path in training.go: - Sync completion (rare — only if adapter returns inside the 30s submit timeout) - Async completed / failed / cancelled / timed_out (in-poll) - Mid-poll cancellation (Cancel API) - Insufficient-balance cancellation - Submit-time errors (request build, network, parse, non-2xx adapter) - Sweep-on-restart (failed_orchestrator_restart) — emits with billable_seconds=0 since pre-restart charge total is preserved in CostPaidWei but we have no record of seconds-charged accuracy To carry sender identity through orch restarts (the sweep path needs it to emit a useful BillingEvent), added SenderHex as a serialized field on TrainingJob, kept in sync via setSender(). Older checkpoints pre-PR-10 will have SenderHex="" and the sweep emits with zero-address — operator reconciles those manually from signer logs. Tests: TestHashSender_Deterministic — stable 16-char hex output TestTrainingBillingEvent_PopulatedFields — happy path TestTrainingBillingEvent_ErrorPath — failure status + empty billing start TestTrainingBillingEvent_EmptyCostDefaultsToZero — trims whitespace TestBillingEvent_JSONShape — round-trip contract check (rename-breaks-pipeline guard) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In-memory ring buffer (cap 500) of recent BillingEvents, exposed via
GET /admin/billing-events on the orch's HTTP mux. Feeds the storyboard
/payments dashboard so operators can see live billing activity during
the canary flip without grepping docker logs.
Ring buffer is intentionally lossy at the head — older events stay in
stdout (docker logs) for forensics; the buffer is just the live
dashboard cache. Events older than ~500 calls scroll out.
Endpoint shape: GET /admin/billing-events?limit=N. Returns
{ events: BillingEvent[], count, ring_size, schema_ver }
sorted newest-first.
Tests:
TestBillingEventRing_AppendAndSnapshotOrdering — newest-first contract
TestBillingEventRing_LimitClampsToBuffer — limit handling
TestBillingEventRing_DropsOldestPastCap — eviction order
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ates (PR-10b) PR-10 (training) was the audit-log scaffolding; PR-10b extends it to the inference path so the /payments dashboard (PR-14) populates with the canary-flipped caps (all 5 are inference). byoc/billing_event.go: new helper inferenceBillingEvent(...) — same struct as trainingBillingEvent but job_type="inference" and a simpler arg list (no billingStartedAt — inference doesn't have a first-tick grace). byoc/job_orchestrator.go: capture balanceBefore at the top of processJob, emit BillingEvent at every terminal state: 1. connection error / timeout from worker 2. 401 Unauthorized from worker 3. response read error 4. 4xx from worker (worker-reported error) 5. successful completion (the main happy path) cost_paid_wei is computed as balanceBefore - balanceAfter — exactly what chargeForCompute debited, no re-derivation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(PR-10b fix) Earlier PR-10b computed cost_paid_wei as balanceBefore - balanceAfter, but the SDK sends a fresh payment ticket with each /process/request, which CREDITS the orch's per-sender balance before chargeForCompute DEBITS the compute cost. Net balance delta is typically ≤ 0 even on priced traffic, so every event read as cost=0 in the dashboard. Fix: new computeChargeWei(price, start) helper that re-derives the same math chargeForCompute uses internally: wei = (price.PricePerUnit / price.PixelsPerUnit) × seconds Used at every inference terminal-state emit site. Drops the balanceBefore capture since it's no longer load-bearing. The post-call balance shown in jobPaymentBalanceHdr is unchanged — it's still the user-facing remaining-balance number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-A of pricing-metering-design.md / pricing-metering-plan.md. When the adapter sets X-Livepeer-Units-Consumed on its response, the orch debits that many units instead of wall-clock seconds. Header absent → seconds fallback (bit-for-bit identical to pre-PR-A behavior — zero regression risk). Changes: - New `resolveUnits(start, resp) (int64, string)` helper. Reads X-Livepeer-Units-Consumed + X-Livepeer-Units-Kind, falls back to ceil(seconds) on absent/malformed/negative. - `chargeForCompute(start, price, sender, jobID, resp)` — added resp param; passes resolved units to DebitFees. - `computeChargeWei(price, start, resp)` — same units source as DebitFees so BillingEvent.cost_paid_wei matches what was billed. - All 5 inference call sites in job_orchestrator.processJob updated to pass resp; nil for the pre-response error path. - stream_orchestrator.go: 2 call sites updated. - BillingEvent gains BillableUnits + UnitsKind alongside the existing BillableSeconds — both stay in the audit record. - 5 new unit tests in billing_event_test.go covering header absent/present/malformed/zero/kind-missing. Behavior change for existing fleet: none. No adapter sets the header yet — that's PR-B. Until then `billable_units == ceil(seconds)` and `units_kind == "second"`, matching prior wei deduction exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI caught two compile errors that `go build ./byoc/...` locally missed because they were in cmd/livepeer's import path: - byoc/job_orchestrator.go: import strconv (needed for ParseFloat in resolveUnits) - byoc/stream_orchestrator.go:183: third chargeForCompute call site in the stream-start happy path also needs the new resp arg Full `go build ./...` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR-4 of byoc-payment-fleet-2026-05 plan (design in livepeer/storyboard:feat/byoc-payment-fleet-2026-05 at docs/training-via-orch-design-2026-05-13.md, §3.B + §14.5).
Two payment-correctness fixes for `runTrainingJob` in `byoc/training.go`:
These are the §3.B simplifications described in §14.5 — no new fields, no new persistence, ~30 LOC total.
The §3.B failed-job refund item is deferred to a follow-up PR once we identify the orch-side credit function (or build one as part of the §3.D Redis persistence work).
Tests
Unit test coverage in this PR's diff covers:
`go test ./byoc/... -run TestRunTraining -v` (test file follow-up — focusing on the production code change first).
Cross-repo dependencies
Rollback
`git revert` the merge commit on `feat/byoc-payment-fleet-2026-05`; rebuild + redeploy. Or: `ORCH_IMAGE= docker compose up -d byoc-orch` on byoc-staging-1.
Branch + image-tag
🤖 Generated with Claude Code