Skip to content

feat: provider liveness analytics (DAR-61)#201

Open
M3kko wants to merge 32 commits into
Layr-Labs:masterfrom
M3kko:c-brody/dar-61-figure-out-liveness-profiling-per-provider
Open

feat: provider liveness analytics (DAR-61)#201
M3kko wants to merge 32 commits into
Layr-Labs:masterfrom
M3kko:c-brody/dar-61-figure-out-liveness-profiling-per-provider

Conversation

@M3kko

@M3kko M3kko commented May 22, 2026

Copy link
Copy Markdown

Summary

  • Captures every provider heartbeat + connect/disconnect transition to Postgres without blocking the heartbeat hot path.
  • Adds a per-provider reliability rollup (uptime, MTBF, P10/P50/P90 session length, P(stays ≥ 4h / 8h), 7×24 hourly availability matrix, disconnect-reason histogram) recomputed every 5 minutes.
  • Adds a 5-minute heartbeat rollup that pre-aggregates provider_heartbeats into hourly buckets so dashboards scan 168 rows/week per provider instead of ~60k raw heartbeats.
  • Collapses the per-provider reliability features into a scalar liveness_score in [0, 1] so consumers can rank providers without knowing the formula. Filterable + sortable via /v1/providers/reliability.
  • Exposes five read-only analytics endpoints inside the coordinator (no separate analytics service) behind admin auth, for internal dashboards and future job-aware scheduling.

Foundation for scheduling long-running scientific workloads (DFT, protein folding) — gives us the data to decide which providers can reliably hold a multi-hour job, and to detect risk before it materializes. Closes DAR-61.

What changes

Area Change
Schema (additive only) 4 new tables: provider_heartbeats, provider_heartbeats_hourly, provider_sessions, provider_reliability_features. 6 new indexes (see schema details below). 1 ADD COLUMN IF NOT EXISTS on provider_reliability_features for liveness_score (the column was added in the same PR that created the table, so this ALTER only matters if a deploy lands the migration in two steps). No ALTER on pre-existing tables. No protocol changes.
Coordinator write path New coordinator/liveness/ package: bounded-buffer Writer (non-blocking emit, batched flush, drop policy, error backoff, graceful drain), SessionTracker, retention prune (28d default), 5-min features rollup, 5-min hourly heartbeat rollup.
Coordinator hooks registry.Heartbeat emits to writer; Register opens a session; evictStale and the WS read-loop call RecordDisconnect with classified reason; main.go runs orphan-close on boot and drains on shutdown.
Read endpoints 5 endpoints live in coordinator/api/liveness.go, admin-gated via the existing requireAdminKey pattern: GET /v1/providers/{id}/{liveness,sessions,heartbeats}, GET /v1/providers/reliability, GET /v1/network/availability. Provider IDs returned raw, matching the existing /v1/stats convention.

Schema details

provider_heartbeats (raw heartbeat history)

CREATE TABLE IF NOT EXISTS provider_heartbeats (
    id BIGSERIAL PRIMARY KEY,
    provider_id TEXT NOT NULL,
    at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    status TEXT NOT NULL DEFAULT '',
    memory_pressure REAL NOT NULL DEFAULT 0,
    cpu_usage REAL NOT NULL DEFAULT 0,
    thermal_state TEXT NOT NULL DEFAULT '',
    memory_available_gb REAL,
    gpu_memory_active_gb REAL,
    gpu_memory_cache_gb REAL,
    backend_state TEXT NOT NULL DEFAULT '',
    active_model TEXT NOT NULL DEFAULT '',
    capacity JSONB
);
CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_provider_at
    ON provider_heartbeats(provider_id, at DESC);
CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_at
    ON provider_heartbeats(at DESC);

provider_heartbeats_hourly (5-min rollup of the above)

CREATE TABLE IF NOT EXISTS provider_heartbeats_hourly (
    provider_id TEXT NOT NULL,
    hour TIMESTAMPTZ NOT NULL,
    heartbeat_count INT NOT NULL DEFAULT 0,
    avg_memory_pressure REAL NOT NULL DEFAULT 0,
    avg_cpu_usage REAL NOT NULL DEFAULT 0,
    max_thermal_state TEXT NOT NULL DEFAULT '',
    avg_memory_available_gb REAL,
    PRIMARY KEY (provider_id, hour)
);
CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_hourly_hour
    ON provider_heartbeats_hourly(hour DESC);

provider_sessions (per-session intervals)

CREATE TABLE IF NOT EXISTS provider_sessions (
    id BIGSERIAL PRIMARY KEY,
    provider_id TEXT NOT NULL,
    connected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    disconnected_at TIMESTAMPTZ,
    disconnect_reason TEXT NOT NULL DEFAULT '',
    last_heartbeat_at TIMESTAMPTZ,
    requests_served BIGINT NOT NULL DEFAULT 0,
    tokens_generated BIGINT NOT NULL DEFAULT 0,
    coordinator_id TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_provider_sessions_provider_connected
    ON provider_sessions(provider_id, connected_at DESC);
CREATE INDEX IF NOT EXISTS idx_provider_sessions_open
    ON provider_sessions(provider_id) WHERE disconnected_at IS NULL;
CREATE INDEX IF NOT EXISTS idx_provider_sessions_disconnected_at
    ON provider_sessions(disconnected_at DESC) WHERE disconnected_at IS NOT NULL;

provider_reliability_features (per-provider scalar rollup, recomputed every 5 min)

CREATE TABLE IF NOT EXISTS provider_reliability_features (
    provider_id TEXT PRIMARY KEY,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    window_days INT NOT NULL DEFAULT 14,
    uptime_pct REAL NOT NULL DEFAULT 0,
    sessions_count INT NOT NULL DEFAULT 0,
    mtbf_seconds BIGINT NOT NULL DEFAULT 0,
    median_session_seconds BIGINT NOT NULL DEFAULT 0,
    p10_session_seconds BIGINT NOT NULL DEFAULT 0,
    p90_session_seconds BIGINT NOT NULL DEFAULT 0,
    hourly_availability JSONB NOT NULL DEFAULT '{}'::jsonb,
    disconnect_reasons JSONB NOT NULL DEFAULT '{}'::jsonb,
    p_stays_4h REAL NOT NULL DEFAULT 0,
    p_stays_8h REAL NOT NULL DEFAULT 0,
    last_disconnect_at TIMESTAMPTZ,
    last_session_duration_seconds BIGINT NOT NULL DEFAULT 0
);
-- Additive: scalar liveness score in [0, 1]. ADD COLUMN IF NOT EXISTS so
-- the migration is safe to re-run, and safe if a deploy lands the CREATE
-- TABLE and the ADD COLUMN in two steps.
ALTER TABLE provider_reliability_features
    ADD COLUMN IF NOT EXISTS liveness_score REAL NOT NULL DEFAULT 0;
CREATE INDEX IF NOT EXISTS idx_provider_reliability_features_score
    ON provider_reliability_features(liveness_score DESC);

Liveness score

The features rollup writes a scalar liveness_score in [0, 1] alongside the per-provider features. Higher = more reliable for long-running jobs.

score = 0.35·uptime_pct
      + 0.25·p_stays_4h
      + 0.15·p_stays_8h
      + 0.25·sigmoid((mtbf_seconds − 4h) / 1h)
      − 0.10·recency_penalty(last_disconnect_at)

where  recency_penalty decays linearly from 1.0 at t=0 to 0 at 24h
       score is clamped to [0, 1]

Weights/sigmoid/recency-window constants live in coordinator/liveness/features.go (scoreWeight*, scoreMTBFTargetSec, scoreMTBFScaleSec, scoreRecencyWindowSec). They are intentionally const rather than env-tuned — getting them right takes weeks of soak data, then a code change.

Edge-case behavior locked in by tests:

Provider profile Score
Brand new (no sessions) ≈ 0
Perfect (100% uptime, 8h+ sessions, big MTBF, no recent disconnect) 1.0
99% uptime but flaky (lots of short sessions, recent disconnect) ≈ 0.28 (not 0.99)
Same flaky provider 24h later recovers as recency penalty decays

Endpoint usage

GET /v1/providers/reliability
    ?min_uptime=0.9
    &min_stays_4h=0.7
    &min_score=0.8
    &max_score=1
    &limit=50

Results are ordered liveness_score DESC, uptime_pct DESC, provider_id ASC. max_score=0 is treated as unbounded so callers don't have to pass 1.0 for the common case.

Safeguards

  • Bounded channel + non-blocking send → no OOM on Postgres outage; drops counted via liveness_dropped_total.
  • 5s statement timeout per flush + error backoff → bad flushes can't stall the writer.
  • Drain on shutdown forces backoff = 0 → no silent data loss when graceful-stopping after a transient error.
  • CloseOrphanSessions is coordinator_id-scoped → blue-green deploys won't stomp the sibling coordinator's live sessions.
  • Retention deletes in batches of 10k via subquery + IN → bounded WAL pressure, autovacuum keeps pace, partial progress survives cancellation.
  • Hourly rollup window is aligned to a clean hour boundary in SQL (date_trunc('hour', $1)); upserts via ON CONFLICT DO UPDATE so re-running is idempotent.
  • All endpoint queries are pre-aggregated (provider_reliability_features, provider_heartbeats_hourly) or indexed lookups — no seq scans at request time.
  • ListSessionsSince is covered by idx_provider_sessions_open (partial, WHERE disconnected_at IS NULL) for the open half and idx_provider_sessions_disconnected_at (partial, WHERE disconnected_at IS NOT NULL) for the closed half; planner uses BitmapOr.
  • Liveness-score filter uses idx_provider_reliability_features_score (liveness_score DESC)min_score/max_score queries are index-only.

Addressing review comments

  • coordinator/liveness/session.go close-failure visibilitySessionTracker now bumps liveness_session_close_failed_total when the store's CloseSession errors. Nil-safe CounterFn matches the Writer/Features/Retention pattern. (Commit c7133109.)
  • Dead provider_heartbeats_hourly table → populated by a new 5-min HeartbeatHourly rollup loop modeled on the existing Features rollup. (Commit b54c2667.)
  • Missing index for ListSessionsSince → added partial idx_provider_sessions_disconnected_at. NB: the original review comment named connected_at; the actual query predicates on disconnected_at, so this fixes the spirit of the concern against the real query. (Commit c3f6c242.)

Test plan

  • cd coordinator && go build ./... — clean
  • cd coordinator && go test ./... -count=1 -timeout 180s — all packages green (api integration suite is the slow one, ~83s)
  • Postgres-backed store tests (not yet run; needs a real DB):
    DATABASE_URL=postgres://… go test ./coordinator/store/ -run TestPostgresLiveness -count=1
  • Hit each new endpoint via curl against dev coordinator and verify response shape:
    curl -H "Authorization: Bearer $ADMIN_KEY" https://api.dev.darkbloom.xyz/v1/network/availability curl -H "Authorization: Bearer $ADMIN_KEY" "https://api.dev.darkbloom.xyz/v1/providers/reliability?min_score=0.8&limit=10"
  • 24h soak on dev coordinator: verify liveness_dropped_total == 0, flush p99 < 100ms, no connection-pool exhaustion.
  • Confirm os.Hostname() is stable across EigenCloud redeploys (blue-green correctness).
  • Inspect liveness_score distribution on dev after one rollup tick — sanity-check it matches operator intuition before tuning the weights.

Deploy follow-ups (not in this PR)

  1. Wire the CounterFn callbacks (liveness_dropped_total, liveness_flush_failed_total, liveness_features_runs_total, liveness_hourly_runs_total, liveness_session_close_failed_total, etc.) into the existing metrics pipeline so they're visible on dashboards.
  2. Consider exposing liveness_score as a registry signal so the routing scorer can multiply by it for long-running jobs (per CLAUDE.md, current registry scoring is decode TPS × trust × reputation × warm model bonus × health factorliveness_score would be a complementary historical factor).

Out of scope for v1

  • Real-time risk-signal stream (deferred; the captured data is the training set).
  • Job-aware / sticky scheduling (separate workstream — analytics tells you when, but the platform still needs a checkpoint protocol to act).
  • Postgres read replica (analytics queries are bounded by pre-aggregated features + indexed lookups; revisit if you measure operational impact).
  • Env-tunable score weights (intentional — wait for soak data before exposing).

🤖 Generated with Claude Code

@vercel

vercel Bot commented May 22, 2026

Copy link
Copy Markdown

@M3kko is attempting to deploy a commit to the EigenLabs Team on Vercel.

A member of the Team first needs to authorize it.

@ethenotethan

Copy link
Copy Markdown
Contributor

@hankbobtheresearchoor can you do a review on this PR? I'd like for you to identify any performance or scalability bottlenecks in this code. iiuc this is introducing additive complexity onto the coordinator with stateful in-memory data structures which could result in connection or performance contention

@hankbobtheresearchoor hankbobtheresearchoor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed — see inline comments for details.

id, ok := t.open[providerID]
if ok {
delete(t.open, providerID)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Should fix: delete(t.open, providerID) removes the in-memory entry BEFORE the DB write at L91 (CloseSession). If the DB write fails (transient Postgres blip), the session row stays OPEN in Postgres but the tracker has no record of it — a subsequent Close() call silently returns via the idempotent guard (L76: if !ok return).

The orphan sweep on next coordinator boot catches it, but if the coordinator runs for weeks without restarting, stale open rows accumulate. Consider exposing liveness_session_close_failed_total as a counter at L96 so operators can detect this gap.

`CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_at ON provider_heartbeats(at DESC)`,

`CREATE TABLE IF NOT EXISTS provider_heartbeats_hourly (
provider_id TEXT NOT NULL,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Should fix: provider_heartbeats_hourly is created in the migration at L503 but nothing populates it in this PR. The writer only calls AppendHeartbeats → raw provider_heartbeats. The features rollup reads from provider_sessions, not heartbeats. The retention pruner deletes from raw heartbeats only.

At 1,000 providers × 10s heartbeat interval = 8.6M raw rows/day, this hourly table should either (a) get a periodic rollup goroutine (INSERT INTO ... SELECT from the raw table), or (b) be removed from the migration until it has a writer. Dead schema misleads operators into thinking hourly data exists.

)`,
`CREATE INDEX IF NOT EXISTS idx_provider_sessions_provider_connected ON provider_sessions(provider_id, connected_at DESC)`,
`CREATE INDEX IF NOT EXISTS idx_provider_sessions_open ON provider_sessions(provider_id) WHERE disconnected_at IS NULL`,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Should fix — missing index for features rollup query. ListSessionsSince(since) in the store interface is called every 5 minutes by the features rollup. For the postgres backend this becomes:

SELECT * FROM provider_sessions
WHERE connected_at >= $1 OR disconnected_at IS NULL

The existing indexes (L526 idx_provider_sessions_provider_connected, L527 idx_provider_sessions_open) don't cover this query — neither leads with connected_at alone. At fleet scale with millions of session rows, this will seq-scan every 5 minutes.

Recommend adding after L527:

CREATE INDEX IF NOT EXISTS idx_provider_sessions_connected_at
  ON provider_sessions(connected_at DESC);

Low urgency now but will matter as the fleet grows past hundreds of providers.

M3kko and others added 24 commits May 28, 2026 12:32
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… policy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ys))

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ RecordDisconnect

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rror)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…an-close on boot

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rser

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… logic

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pers)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
M3kko and others added 6 commits May 28, 2026 12:37
…se_failed_total counter

When the DB write in SessionTracker.Close fails, the tracker has already
deleted its in-memory entry; the row stays OPEN in Postgres and the
orphan sweep only catches it at the next coordinator boot. Add a
nil-safe CounterFn to SessionTracker so operators can alarm on the gap
without waiting for a reboot. Threads `nil` through the construction
site in cmd/coordinator/main.go and the existing test call sites.

Addresses hankbobtheresearchoor's review comment on
coordinator/liveness/session.go:79 (PR Layr-Labs#201).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The provider_heartbeats_hourly migration ships in this branch but
nothing was writing to it — at 1k providers × 10s heartbeats = 8.6M
raw rows/day, dashboards that expected pre-aggregated hourly data
silently returned empty. Add Store.RollupHeartbeatsHourly on both
backends (Postgres uses INSERT…SELECT…GROUP BY…ON CONFLICT; memory
mirrors it in Go for parity) and a HeartbeatHourly loop modeled on
the existing features rollup (5m cadence, 2h lookback, nil-safe
CounterFn instrumentation). Thermal severity is ranked
nominal<fair<serious<critical so MAX-by-severity is correct (plain
text MAX is alphabetical).

Addresses hankbobtheresearchoor's review comment on
coordinator/store/postgres.go:503 (PR Layr-Labs#201).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sSince

ListSessionsSince runs every 5 minutes via the features rollup with the
predicate `WHERE disconnected_at IS NULL OR disconnected_at >= $1`. The
IS NULL half is already covered by idx_provider_sessions_open (partial,
WHERE disconnected_at IS NULL). The >= $1 half had no covering index —
seq scan at fleet scale. Add a partial index on disconnected_at DESC
(WHERE disconnected_at IS NOT NULL) so the planner can BitmapOr the two
halves.

NB: hankbobtheresearchoor's original comment named connected_at; the
actual query predicates on disconnected_at, so this fixes the spirit of
the concern against the real query shape. Addresses review comment on
coordinator/store/postgres.go:527 (PR Layr-Labs#201).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dpoints)

Master deleted the standalone analytics/ service in Layr-Labs#212 with the
rationale "duplicate of coordinator endpoints, never deployed." Bring
the liveness endpoints back as a single coordinator/api/liveness.go
file, admin-gated via the existing requireAdminKey pattern. Reuses the
coordinator's store.Store directly — no service layer, no separate
types package, no pseudonym aliasing (matches existing /v1/stats which
exposes raw provider IDs). Five endpoints:

  GET /v1/providers/{id}/liveness     reliability summary
  GET /v1/providers/{id}/sessions     recent session intervals
  GET /v1/providers/{id}/heartbeats   recent heartbeat samples
  GET /v1/providers/reliability       fleet shortlist by uptime bar
  GET /v1/network/availability        distributional fleet stats

Also repairs a stale interface docstring: GetReliabilityFeatures
returns (nil, nil) on miss, not ErrNotFound. Both backends already
returned (nil, nil); only the comment claimed otherwise.

Includes httptest coverage for the admin gate, happy path, not-found,
filter behavior, and small-N percentile correctness (math.Round so
p90 separates from p50 at N=3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dpoints)

Master deleted analytics/ in Layr-Labs#212; this branch had been adding to it.
Now that the liveness read endpoints live in coordinator/api/liveness.go,
remove the duplicate tree (~1500 LOC across 17 files including the
separate go.mod). Post-rebase, this conflict resolves cleanly because
both sides delete the same files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rebase)

PR Layr-Labs#187 changed store.NewMemory to take store.Config{} instead of a raw
adminKey string, and api.NewServer to require a ServerConfig arg.
Updates the test call sites added in this branch to match. No behavior
change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@M3kko M3kko force-pushed the c-brody/dar-61-figure-out-liveness-profiling-per-provider branch from 6a22c00 to 04829ba Compare May 28, 2026 17:40
M3kko and others added 2 commits May 28, 2026 13:08
Collapses the per-provider reliability features into a single number in
[0, 1] so consumers (dashboards, future schedulers) can rank providers
without knowing the formula. Higher = more reliable for long-running
jobs.

Formula:
  score = 0.35·uptime_pct + 0.25·p_stays_4h + 0.15·p_stays_8h
        + 0.25·sigmoid((mtbf_seconds - 4h) / 1h)
        - 0.10·recency_penalty(last_disconnect_at)

  recency_penalty decays linearly from 1.0 at t=0 to 0 at 24h.
  Score clamped to [0, 1].

- Adds liveness_score REAL column to provider_reliability_features
  (additive migration via ADD COLUMN IF NOT EXISTS) + DESC index for
  the new ORDER BY.
- Computed in coordinator/liveness/features.go alongside the other
  features and persisted with each rollup tick (5-min cadence over a
  14-day window).
- Adds MinScore/MaxScore to store.ReliabilityFilter. ListReliabilityFeatures
  now orders by liveness_score DESC, uptime_pct DESC, provider_id ASC.
- /v1/providers/reliability accepts min_score=&max_score= and echoes
  them in the response.

Test coverage: scoring math at empty / perfect / flaky / recency-decay
/ clamping edge cases, plus an httptest case that confirms the
endpoint filters AND orders by score.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants