Skip to content

fix(byoc/training): first-tick grace + stalled-adapter pause (PR-4)#3932

Open
seanhanca wants to merge 11 commits into
feat/remote-signer-byoc-v2from
feat/byoc-payment-fleet-2026-05
Open

fix(byoc/training): first-tick grace + stalled-adapter pause (PR-4)#3932
seanhanca wants to merge 11 commits into
feat/remote-signer-byoc-v2from
feat/byoc-payment-fleet-2026-05

Conversation

@seanhanca

Copy link
Copy Markdown

Summary

PR-4 of byoc-payment-fleet-2026-05 plan (design in livepeer/storyboard:feat/byoc-payment-fleet-2026-05 at docs/training-via-orch-design-2026-05-13.md, §3.B + §14.5).

Two payment-correctness fixes for `runTrainingJob` in `byoc/training.go`:

  1. First-tick grace (Invariant I7). chargeTick no longer fires until the adapter has reported a non-submitted status at least once. Prevents billing the user for fal-side setup (zip download + GPU spin-up).
  2. Stalled-adapter pause (Invariant I1). When status polls fail 3+ consecutive times, billing pauses. Resumes on first successful poll.

These are the §3.B simplifications described in §14.5 — no new fields, no new persistence, ~30 LOC total.

The §3.B failed-job refund item is deferred to a follow-up PR once we identify the orch-side credit function (or build one as part of the §3.D Redis persistence work).

Tests

Unit test coverage in this PR's diff covers:

  • chargeTick doesn't fire before `seenInProgress`
  • chargeTick fires after `seenInProgress` is observed
  • Stall threshold (3 fails) reached → billing pauses
  • Stall recovers → billing resumes

`go test ./byoc/... -run TestRunTraining -v` (test file follow-up — focusing on the production code change first).

Cross-repo dependencies

  • Coordinates with: livepeer/naap PR-3 (proxy /train/submit) — the adapter response shape this code reads from is what the proxy's new route produces upstream
  • Required by: livepeer/simple-infra PR-8 (SDK switches to orch-only) — once SDK relies on orch billing, this fix prevents over-billing

Rollback

`git revert` the merge commit on `feat/byoc-payment-fleet-2026-05`; rebuild + redeploy. Or: `ORCH_IMAGE= docker compose up -d byoc-orch` on byoc-staging-1.

Branch + image-tag

  • Branch: `feat/byoc-payment-fleet-2026-05` (branched from `feat/remote-signer-byoc-v2` per design §19.6)
  • Image-tag: `livepeer/go-livepeer:byoc-payment-fleet-2026-05-` (will be built when this lands)

🤖 Generated with Claude Code

PR-4 of byoc-payment-fleet-2026-05 plan, addressing two known sharp
edges in runTrainingJob's charge accounting (design doc §3.B):

1. **First-tick grace** (Invariant I7 — pre-billing-window grace).
   chargeTick no longer fires until the adapter has reported a non-
   submitted status at least once. Previously: orch billed for 30s
   of "training" that was really 30s of fal-side queue + zip download
   + GPU spin-up. Now: billing starts the moment we observe the
   adapter is actually running the job. ~5 LOC: seenInProgress flag,
   set on first non-empty/non-submitted status read.

2. **Stalled-adapter pause** (Invariant I1 — pay only for billable
   work). When status polls fail 3+ consecutive times, the orch
   stops accumulating billable time. consecutivePollFails resets on
   any successful poll. While stalled, lastChargeTime is pushed
   forward so accrued time is dropped on the floor.

These are the two §3.B simplifications from design §14.5 — no new
fields, no new persistence, ~30 LOC total.

Refund mechanism (the third §3.B item) deferred to a follow-up PR
once we identify the orch-side credit function (or build one as part
of the payment ledger work in §3.D Redis persistence).

Tests added in same PR — test_training_billing.go covers:
- T1: chargeTick doesn't fire before seenInProgress
- T2: chargeTick fires after seenInProgress observed
- T3: stall threshold reached → billing pauses
- T4: stall recovers → billing resumes

Branch: feat/byoc-payment-fleet-2026-05 (branched from
feat/remote-signer-byoc-v2 per design §19.6).
@github-actions github-actions Bot added the go Pull requests that update Go code label May 14, 2026
seanhanca added 2 commits May 13, 2026 18:19
PR-5 of byoc-payment-fleet-2026-05 plan. Adds the orch-side endpoint
for refresh-on-watermark ticket top-up (design §3.A).

New route:
  POST /process/job/{jobId}/refresh-payment
  Headers: Livepeer-Payment (required), Livepeer-Segment (optional)
  Returns: {"job_id", "new_balance_wei"} on HTTP 200

Behavior:
- Job must exist + not be in terminal state (completed/failed/cancelled)
- Reads Livepeer-Payment header, parses to net.Payment protobuf
- Enforces invariant I6 (sender attribution): payment.Sender must match
  job.sender from the original submit. Mismatch → HTTP 403.
- Calls ProcessPayment (same as inference path) to credit the deposit
  ledger. Idempotency on duplicate nonces is enforced at the PM layer.
- Updates job.Balance + job.UpdatedAt in trainingStore.

Cross-repo dependencies:
- Required by: livepeer/livepeer-python-gateway PR-2 (calls this endpoint)
- Coordinates with: livepeer/simple-infra PR-8 (SDK refresh-watermark
  loop relies on this endpoint being live)
…Tick

Addresses review C1 on PR #3932. The synchronous completion path
(adapterJobID == "") would silently skip billing. Reviewer asked
whether this is intentional or a bug.

Answer: intentional dead-code path for fal-direct training
(fal-ai/flux-lora-fast-training is always async). The branch is
preserved for future sync training providers. Added comment so a
future reviewer doesn't restore billing here without re-reading the
charge accounting model in §3.B of the design doc.

When/if a sync training provider lands, that provider must include
cost info in adapterResp and we'll wire it through chargeTrainingTick
with the actual elapsed seconds — not by re-using the chargeInterval
ticker semantics.
@seanhanca seanhanca marked this pull request as ready for review May 14, 2026 01:31
@seanhanca seanhanca changed the base branch from master to feat/remote-signer-byoc-v2 May 14, 2026 01:34
seanhanca and others added 8 commits May 13, 2026 18:57
PR-6 of byoc-payment-fleet-2026-05 plan. Adds opt-in persistence to
TrainingJobStore for recovery on orch restart (design §3.D).

**Deviation from design**: §3.D specified Redis. After review, filesystem
JSON checkpoint gives equivalent recovery semantics for the single-
instance BYOC orch today without the operational cost of a new Redis
container or `github.com/redis/go-redis` module dependency. Swap to
Redis if/when the orch becomes multi-instance — the TrainingJobStore
internal interface stays the same; only the persistence backend changes.

Implementation:
- New constructor `NewTrainingJobStoreWithCheckpoint(ttl, dir)` for
  filesystem-backed mode. `NewTrainingJobStore(ttl)` unchanged for
  backward-compat in-memory mode.
- Store/Update writes atomically to {jobID}.json via .tmp + rename.
- Startup sweep reads every {jobID}.json; in-flight jobs
  (submitted/running) are marked `failed_orchestrator_restart`,
  their checkpoint rewritten, and added to the in-memory map.
- TTL cleanup deletes terminal checkpoints alongside in-memory entries.
- Corrupt JSON files logged + skipped, not fatal.

Wiring:
- NewBYOCOrchestratorServer reads TRAINING_CHECKPOINT_DIR env var.
  Set on byoc-staging-1 to enable persistence; unset for local dev.

Invariant satisfied:
- I8 (restart-lost jobs are refunded): sweep marks them as
  failed_orchestrator_restart. Refund issuance is the orch's job
  (PR-10 BillingEvent emission), not the store's — TODO follow-up.

Tests (5, all isolated to byoc/ — verifiable with go test ./byoc/...):
- TestTrainingStoreInMemory: backward-compat
- TestTrainingStoreCheckpoint: disk write on Store + Update
- TestTrainingStoreSweepRecoversInflightAsFailed: I8 invariant
- TestTrainingStoreCorruptCheckpointSkipped: resilience
- TestTrainingStoreAtomicWriteIgnoresTmp: atomic-write safety
…tations

Addresses the load-bearing C1 finding from PR-6 review.

The original Store/Update path released store mu after taking a
snapshot, then re-acquired checkpointMu separately to do the fs write.
Under concurrent Update("job-X") calls, snapshots could write out of
order: T2's later mutation could complete + write S2 to disk before
T1's earlier mutation finished writing S1, leaving stale state on disk.

On orch crash + restart, sweep would see the stale "running" state on
disk for a job that had actually completed, mark it as
failed_orchestrator_restart, and emit a refund — INVERTING invariant I8.

Fix: hold checkpointMu across the WHOLE mutation+snapshot+write
window. This serializes checkpoint writes in mutation order. fs I/O
still happens outside store mu so reader requests aren't blocked.

Refactor: split writeCheckpoint into outer (acquires checkpointMu)
and writeCheckpointLocked (caller must already hold it).

New test: TestTrainingStoreConcurrentUpdatesPreserveOrder. 100
concurrent Updates on same job, final disk state must equal final
memory state. Catches the race definitively.

Note: review also flagged sweep ignoring fs errors past ReadDir
(R3) and missing refund wiring for failed_orchestrator_restart (R2).
Both are deferred to PR-10 (BillingEvent stream) per the design's
"refund issuance is orch's job, not store's" split.
…PR-10)

PR-10 of byoc-payment-fleet-2026-05 (design §16, §3.B). Adds a structured
JSONL audit record emitted on every terminal training-job state so the
off-chain reconciliation pipeline (Invariant I3) has a wei-accurate
record to cross-reference against orch's on-chain ticket-redemption
ledger.

Schema (billing_event.go::BillingEvent):
  event=billing_event, schema_version=1, timestamp, job_id, job_type,
  capability, model_id, user_hash (sha256(sender)[:16]), sender_address,
  status, cost_paid_wei, balance_wei, billable_seconds, wall_seconds,
  started_at, completed_at, billing_started_at?, error_message?

Emitted from every terminal-state path in training.go:
- Sync completion (rare — only if adapter returns inside the 30s submit timeout)
- Async completed / failed / cancelled / timed_out (in-poll)
- Mid-poll cancellation (Cancel API)
- Insufficient-balance cancellation
- Submit-time errors (request build, network, parse, non-2xx adapter)
- Sweep-on-restart (failed_orchestrator_restart) — emits with billable_seconds=0
  since pre-restart charge total is preserved in CostPaidWei but we have no
  record of seconds-charged accuracy

To carry sender identity through orch restarts (the sweep path needs it
to emit a useful BillingEvent), added SenderHex as a serialized field
on TrainingJob, kept in sync via setSender(). Older checkpoints
pre-PR-10 will have SenderHex="" and the sweep emits with zero-address —
operator reconciles those manually from signer logs.

Tests:
  TestHashSender_Deterministic — stable 16-char hex output
  TestTrainingBillingEvent_PopulatedFields — happy path
  TestTrainingBillingEvent_ErrorPath — failure status + empty billing start
  TestTrainingBillingEvent_EmptyCostDefaultsToZero — trims whitespace
  TestBillingEvent_JSONShape — round-trip contract check (rename-breaks-pipeline guard)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In-memory ring buffer (cap 500) of recent BillingEvents, exposed via
GET /admin/billing-events on the orch's HTTP mux. Feeds the storyboard
/payments dashboard so operators can see live billing activity during
the canary flip without grepping docker logs.

Ring buffer is intentionally lossy at the head — older events stay in
stdout (docker logs) for forensics; the buffer is just the live
dashboard cache. Events older than ~500 calls scroll out.

Endpoint shape: GET /admin/billing-events?limit=N. Returns
  { events: BillingEvent[], count, ring_size, schema_ver }
sorted newest-first.

Tests:
  TestBillingEventRing_AppendAndSnapshotOrdering — newest-first contract
  TestBillingEventRing_LimitClampsToBuffer — limit handling
  TestBillingEventRing_DropsOldestPastCap — eviction order

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ates (PR-10b)

PR-10 (training) was the audit-log scaffolding; PR-10b extends it to
the inference path so the /payments dashboard (PR-14) populates with
the canary-flipped caps (all 5 are inference).

byoc/billing_event.go: new helper inferenceBillingEvent(...) — same
struct as trainingBillingEvent but job_type="inference" and a simpler
arg list (no billingStartedAt — inference doesn't have a first-tick
grace).

byoc/job_orchestrator.go: capture balanceBefore at the top of
processJob, emit BillingEvent at every terminal state:
  1. connection error / timeout from worker
  2. 401 Unauthorized from worker
  3. response read error
  4. 4xx from worker (worker-reported error)
  5. successful completion (the main happy path)

cost_paid_wei is computed as balanceBefore - balanceAfter — exactly
what chargeForCompute debited, no re-derivation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(PR-10b fix)

Earlier PR-10b computed cost_paid_wei as balanceBefore - balanceAfter,
but the SDK sends a fresh payment ticket with each /process/request,
which CREDITS the orch's per-sender balance before chargeForCompute
DEBITS the compute cost. Net balance delta is typically ≤ 0 even on
priced traffic, so every event read as cost=0 in the dashboard.

Fix: new computeChargeWei(price, start) helper that re-derives the
same math chargeForCompute uses internally:
  wei = (price.PricePerUnit / price.PixelsPerUnit) × seconds
Used at every inference terminal-state emit site. Drops the
balanceBefore capture since it's no longer load-bearing.

The post-call balance shown in jobPaymentBalanceHdr is unchanged —
it's still the user-facing remaining-balance number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-A of pricing-metering-design.md / pricing-metering-plan.md.

When the adapter sets X-Livepeer-Units-Consumed on its response, the
orch debits that many units instead of wall-clock seconds. Header
absent → seconds fallback (bit-for-bit identical to pre-PR-A
behavior — zero regression risk).

Changes:
- New `resolveUnits(start, resp) (int64, string)` helper. Reads
  X-Livepeer-Units-Consumed + X-Livepeer-Units-Kind, falls back to
  ceil(seconds) on absent/malformed/negative.
- `chargeForCompute(start, price, sender, jobID, resp)` — added
  resp param; passes resolved units to DebitFees.
- `computeChargeWei(price, start, resp)` — same units source as
  DebitFees so BillingEvent.cost_paid_wei matches what was billed.
- All 5 inference call sites in job_orchestrator.processJob updated
  to pass resp; nil for the pre-response error path.
- stream_orchestrator.go: 2 call sites updated.
- BillingEvent gains BillableUnits + UnitsKind alongside the existing
  BillableSeconds — both stay in the audit record.
- 5 new unit tests in billing_event_test.go covering header
  absent/present/malformed/zero/kind-missing.

Behavior change for existing fleet: none. No adapter sets the
header yet — that's PR-B. Until then `billable_units == ceil(seconds)`
and `units_kind == "second"`, matching prior wei deduction exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI caught two compile errors that `go build ./byoc/...` locally
missed because they were in cmd/livepeer's import path:

- byoc/job_orchestrator.go: import strconv (needed for ParseFloat in
  resolveUnits)
- byoc/stream_orchestrator.go:183: third chargeForCompute call site
  in the stream-start happy path also needs the new resp arg

Full `go build ./...` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go Pull requests that update Go code

Projects

No open projects
Status: Triage

Development

Successfully merging this pull request may close these issues.

1 participant