feat: provider liveness analytics (DAR-61)#201
Conversation
|
@M3kko is attempting to deploy a commit to the EigenLabs Team on Vercel. A member of the Team first needs to authorize it. |
|
@hankbobtheresearchoor can you do a review on this PR? I'd like for you to identify any performance or scalability bottlenecks in this code. iiuc this is introducing additive complexity onto the coordinator with stateful in-memory data structures which could result in connection or performance contention |
hankbobtheresearchoor
left a comment
There was a problem hiding this comment.
Reviewed — see inline comments for details.
| id, ok := t.open[providerID] | ||
| if ok { | ||
| delete(t.open, providerID) | ||
| } |
There was a problem hiding this comment.
🟡 Should fix: delete(t.open, providerID) removes the in-memory entry BEFORE the DB write at L91 (CloseSession). If the DB write fails (transient Postgres blip), the session row stays OPEN in Postgres but the tracker has no record of it — a subsequent Close() call silently returns via the idempotent guard (L76: if !ok return).
The orphan sweep on next coordinator boot catches it, but if the coordinator runs for weeks without restarting, stale open rows accumulate. Consider exposing liveness_session_close_failed_total as a counter at L96 so operators can detect this gap.
| `CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_at ON provider_heartbeats(at DESC)`, | ||
|
|
||
| `CREATE TABLE IF NOT EXISTS provider_heartbeats_hourly ( | ||
| provider_id TEXT NOT NULL, |
There was a problem hiding this comment.
🟡 Should fix: provider_heartbeats_hourly is created in the migration at L503 but nothing populates it in this PR. The writer only calls AppendHeartbeats → raw provider_heartbeats. The features rollup reads from provider_sessions, not heartbeats. The retention pruner deletes from raw heartbeats only.
At 1,000 providers × 10s heartbeat interval = 8.6M raw rows/day, this hourly table should either (a) get a periodic rollup goroutine (INSERT INTO ... SELECT from the raw table), or (b) be removed from the migration until it has a writer. Dead schema misleads operators into thinking hourly data exists.
| )`, | ||
| `CREATE INDEX IF NOT EXISTS idx_provider_sessions_provider_connected ON provider_sessions(provider_id, connected_at DESC)`, | ||
| `CREATE INDEX IF NOT EXISTS idx_provider_sessions_open ON provider_sessions(provider_id) WHERE disconnected_at IS NULL`, | ||
|
|
There was a problem hiding this comment.
🟡 Should fix — missing index for features rollup query. ListSessionsSince(since) in the store interface is called every 5 minutes by the features rollup. For the postgres backend this becomes:
SELECT * FROM provider_sessions
WHERE connected_at >= $1 OR disconnected_at IS NULLThe existing indexes (L526 idx_provider_sessions_provider_connected, L527 idx_provider_sessions_open) don't cover this query — neither leads with connected_at alone. At fleet scale with millions of session rows, this will seq-scan every 5 minutes.
Recommend adding after L527:
CREATE INDEX IF NOT EXISTS idx_provider_sessions_connected_at
ON provider_sessions(connected_at DESC);Low urgency now but will matter as the fleet grows past hundreds of providers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… policy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ys)) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cycle Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ RecordDisconnect Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rror) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…an-close on boot Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rser Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… logic Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pers) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…se_failed_total counter When the DB write in SessionTracker.Close fails, the tracker has already deleted its in-memory entry; the row stays OPEN in Postgres and the orphan sweep only catches it at the next coordinator boot. Add a nil-safe CounterFn to SessionTracker so operators can alarm on the gap without waiting for a reboot. Threads `nil` through the construction site in cmd/coordinator/main.go and the existing test call sites. Addresses hankbobtheresearchoor's review comment on coordinator/liveness/session.go:79 (PR Layr-Labs#201). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The provider_heartbeats_hourly migration ships in this branch but nothing was writing to it — at 1k providers × 10s heartbeats = 8.6M raw rows/day, dashboards that expected pre-aggregated hourly data silently returned empty. Add Store.RollupHeartbeatsHourly on both backends (Postgres uses INSERT…SELECT…GROUP BY…ON CONFLICT; memory mirrors it in Go for parity) and a HeartbeatHourly loop modeled on the existing features rollup (5m cadence, 2h lookback, nil-safe CounterFn instrumentation). Thermal severity is ranked nominal<fair<serious<critical so MAX-by-severity is correct (plain text MAX is alphabetical). Addresses hankbobtheresearchoor's review comment on coordinator/store/postgres.go:503 (PR Layr-Labs#201). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sSince ListSessionsSince runs every 5 minutes via the features rollup with the predicate `WHERE disconnected_at IS NULL OR disconnected_at >= $1`. The IS NULL half is already covered by idx_provider_sessions_open (partial, WHERE disconnected_at IS NULL). The >= $1 half had no covering index — seq scan at fleet scale. Add a partial index on disconnected_at DESC (WHERE disconnected_at IS NOT NULL) so the planner can BitmapOr the two halves. NB: hankbobtheresearchoor's original comment named connected_at; the actual query predicates on disconnected_at, so this fixes the spirit of the concern against the real query shape. Addresses review comment on coordinator/store/postgres.go:527 (PR Layr-Labs#201). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dpoints) Master deleted the standalone analytics/ service in Layr-Labs#212 with the rationale "duplicate of coordinator endpoints, never deployed." Bring the liveness endpoints back as a single coordinator/api/liveness.go file, admin-gated via the existing requireAdminKey pattern. Reuses the coordinator's store.Store directly — no service layer, no separate types package, no pseudonym aliasing (matches existing /v1/stats which exposes raw provider IDs). Five endpoints: GET /v1/providers/{id}/liveness reliability summary GET /v1/providers/{id}/sessions recent session intervals GET /v1/providers/{id}/heartbeats recent heartbeat samples GET /v1/providers/reliability fleet shortlist by uptime bar GET /v1/network/availability distributional fleet stats Also repairs a stale interface docstring: GetReliabilityFeatures returns (nil, nil) on miss, not ErrNotFound. Both backends already returned (nil, nil); only the comment claimed otherwise. Includes httptest coverage for the admin gate, happy path, not-found, filter behavior, and small-N percentile correctness (math.Round so p90 separates from p50 at N=3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dpoints) Master deleted analytics/ in Layr-Labs#212; this branch had been adding to it. Now that the liveness read endpoints live in coordinator/api/liveness.go, remove the duplicate tree (~1500 LOC across 17 files including the separate go.mod). Post-rebase, this conflict resolves cleanly because both sides delete the same files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rebase) PR Layr-Labs#187 changed store.NewMemory to take store.Config{} instead of a raw adminKey string, and api.NewServer to require a ServerConfig arg. Updates the test call sites added in this branch to match. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6a22c00 to
04829ba
Compare
Collapses the per-provider reliability features into a single number in
[0, 1] so consumers (dashboards, future schedulers) can rank providers
without knowing the formula. Higher = more reliable for long-running
jobs.
Formula:
score = 0.35·uptime_pct + 0.25·p_stays_4h + 0.15·p_stays_8h
+ 0.25·sigmoid((mtbf_seconds - 4h) / 1h)
- 0.10·recency_penalty(last_disconnect_at)
recency_penalty decays linearly from 1.0 at t=0 to 0 at 24h.
Score clamped to [0, 1].
- Adds liveness_score REAL column to provider_reliability_features
(additive migration via ADD COLUMN IF NOT EXISTS) + DESC index for
the new ORDER BY.
- Computed in coordinator/liveness/features.go alongside the other
features and persisted with each rollup tick (5-min cadence over a
14-day window).
- Adds MinScore/MaxScore to store.ReliabilityFilter. ListReliabilityFeatures
now orders by liveness_score DESC, uptime_pct DESC, provider_id ASC.
- /v1/providers/reliability accepts min_score=&max_score= and echoes
them in the response.
Test coverage: scoring math at empty / perfect / flaky / recency-decay
/ clamping edge cases, plus an httptest case that confirms the
endpoint filters AND orders by score.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
provider_heartbeatsinto hourly buckets so dashboards scan 168 rows/week per provider instead of ~60k raw heartbeats.liveness_scorein [0, 1] so consumers can rank providers without knowing the formula. Filterable + sortable via/v1/providers/reliability.Foundation for scheduling long-running scientific workloads (DFT, protein folding) — gives us the data to decide which providers can reliably hold a multi-hour job, and to detect risk before it materializes. Closes DAR-61.
What changes
provider_heartbeats,provider_heartbeats_hourly,provider_sessions,provider_reliability_features. 6 new indexes (see schema details below). 1ADD COLUMN IF NOT EXISTSonprovider_reliability_featuresforliveness_score(the column was added in the same PR that created the table, so this ALTER only matters if a deploy lands the migration in two steps). NoALTERon pre-existing tables. No protocol changes.coordinator/liveness/package: bounded-bufferWriter(non-blocking emit, batched flush, drop policy, error backoff, graceful drain),SessionTracker, retention prune (28d default), 5-min features rollup, 5-min hourly heartbeat rollup.registry.Heartbeatemits to writer;Registeropens a session;evictStaleand the WS read-loop callRecordDisconnectwith classified reason;main.goruns orphan-close on boot and drains on shutdown.coordinator/api/liveness.go, admin-gated via the existingrequireAdminKeypattern:GET /v1/providers/{id}/{liveness,sessions,heartbeats},GET /v1/providers/reliability,GET /v1/network/availability. Provider IDs returned raw, matching the existing/v1/statsconvention.Schema details
provider_heartbeats(raw heartbeat history)provider_heartbeats_hourly(5-min rollup of the above)provider_sessions(per-session intervals)provider_reliability_features(per-provider scalar rollup, recomputed every 5 min)Liveness score
The features rollup writes a scalar
liveness_scorein[0, 1]alongside the per-provider features. Higher = more reliable for long-running jobs.Weights/sigmoid/recency-window constants live in
coordinator/liveness/features.go(scoreWeight*,scoreMTBFTargetSec,scoreMTBFScaleSec,scoreRecencyWindowSec). They are intentionallyconstrather than env-tuned — getting them right takes weeks of soak data, then a code change.Edge-case behavior locked in by tests:
Endpoint usage
Results are ordered
liveness_score DESC, uptime_pct DESC, provider_id ASC.max_score=0is treated as unbounded so callers don't have to pass1.0for the common case.Safeguards
liveness_dropped_total.CloseOrphanSessionsis coordinator_id-scoped → blue-green deploys won't stomp the sibling coordinator's live sessions.date_trunc('hour', $1)); upserts viaON CONFLICT DO UPDATEso re-running is idempotent.provider_reliability_features,provider_heartbeats_hourly) or indexed lookups — no seq scans at request time.ListSessionsSinceis covered byidx_provider_sessions_open(partial,WHERE disconnected_at IS NULL) for the open half andidx_provider_sessions_disconnected_at(partial,WHERE disconnected_at IS NOT NULL) for the closed half; planner uses BitmapOr.idx_provider_reliability_features_score (liveness_score DESC)→min_score/max_scorequeries are index-only.Addressing review comments
coordinator/liveness/session.goclose-failure visibility →SessionTrackernow bumpsliveness_session_close_failed_totalwhen the store'sCloseSessionerrors. Nil-safe CounterFn matches the Writer/Features/Retention pattern. (Commitc7133109.)provider_heartbeats_hourlytable → populated by a new 5-minHeartbeatHourlyrollup loop modeled on the existing Features rollup. (Commitb54c2667.)ListSessionsSince→ added partialidx_provider_sessions_disconnected_at. NB: the original review comment namedconnected_at; the actual query predicates ondisconnected_at, so this fixes the spirit of the concern against the real query. (Commitc3f6c242.)Test plan
cd coordinator && go build ./...— cleancd coordinator && go test ./... -count=1 -timeout 180s— all packages green (api integration suite is the slow one, ~83s)DATABASE_URL=postgres://… go test ./coordinator/store/ -run TestPostgresLiveness -count=1curl -H "Authorization: Bearer $ADMIN_KEY" https://api.dev.darkbloom.xyz/v1/network/availability curl -H "Authorization: Bearer $ADMIN_KEY" "https://api.dev.darkbloom.xyz/v1/providers/reliability?min_score=0.8&limit=10"liveness_dropped_total == 0, flush p99 < 100ms, no connection-pool exhaustion.os.Hostname()is stable across EigenCloud redeploys (blue-green correctness).liveness_scoredistribution on dev after one rollup tick — sanity-check it matches operator intuition before tuning the weights.Deploy follow-ups (not in this PR)
CounterFncallbacks (liveness_dropped_total,liveness_flush_failed_total,liveness_features_runs_total,liveness_hourly_runs_total,liveness_session_close_failed_total, etc.) into the existing metrics pipeline so they're visible on dashboards.liveness_scoreas a registry signal so the routing scorer can multiply by it for long-running jobs (per CLAUDE.md, current registry scoring isdecode TPS × trust × reputation × warm model bonus × health factor—liveness_scorewould be a complementary historical factor).Out of scope for v1
🤖 Generated with Claude Code