feat: provider liveness analytics (DAR-61) by M3kko · Pull Request #201 · Layr-Labs/d-inference

M3kko · 2026-05-22T18:48:26Z

Summary

Captures every provider heartbeat + connect/disconnect transition to Postgres without blocking the heartbeat hot path.
Adds a per-provider reliability rollup (uptime, MTBF, P10/P50/P90 session length, P(stays ≥ 4h / 8h), 7×24 hourly availability matrix, disconnect-reason histogram) recomputed every 5 minutes.
Adds a 5-minute heartbeat rollup that pre-aggregates provider_heartbeats into hourly buckets so dashboards scan 168 rows/week per provider instead of ~60k raw heartbeats.
Collapses the per-provider reliability features into a scalar liveness_score in [0, 1] so consumers can rank providers without knowing the formula. Filterable + sortable via /v1/providers/reliability.
Exposes five read-only analytics endpoints inside the coordinator (no separate analytics service) behind admin auth, for internal dashboards and future job-aware scheduling.

Foundation for scheduling long-running scientific workloads (DFT, protein folding) — gives us the data to decide which providers can reliably hold a multi-hour job, and to detect risk before it materializes. Closes DAR-61.

What changes

Area	Change
Schema (additive only)	4 new tables: `provider_heartbeats`, `provider_heartbeats_hourly`, `provider_sessions`, `provider_reliability_features`. 6 new indexes (see schema details below). 1 `ADD COLUMN IF NOT EXISTS` on `provider_reliability_features` for `liveness_score` (the column was added in the same PR that created the table, so this ALTER only matters if a deploy lands the migration in two steps). No `ALTER` on pre-existing tables. No protocol changes.
Coordinator write path	New `coordinator/liveness/` package: bounded-buffer `Writer` (non-blocking emit, batched flush, drop policy, error backoff, graceful drain), `SessionTracker`, retention prune (28d default), 5-min features rollup, 5-min hourly heartbeat rollup.
Coordinator hooks	`registry.Heartbeat` emits to writer; `Register` opens a session; `evictStale` and the WS read-loop call `RecordDisconnect` with classified reason; `main.go` runs orphan-close on boot and drains on shutdown.
Read endpoints	5 endpoints live in `coordinator/api/liveness.go`, admin-gated via the existing `requireAdminKey` pattern: `GET /v1/providers/{id}/{liveness,sessions,heartbeats}`, `GET /v1/providers/reliability`, `GET /v1/network/availability`. Provider IDs returned raw, matching the existing `/v1/stats` convention.

Schema details

`provider_heartbeats` (raw heartbeat history)

CREATE TABLE IF NOT EXISTS provider_heartbeats (
    id BIGSERIAL PRIMARY KEY,
    provider_id TEXT NOT NULL,
    at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    status TEXT NOT NULL DEFAULT '',
    memory_pressure REAL NOT NULL DEFAULT 0,
    cpu_usage REAL NOT NULL DEFAULT 0,
    thermal_state TEXT NOT NULL DEFAULT '',
    memory_available_gb REAL,
    gpu_memory_active_gb REAL,
    gpu_memory_cache_gb REAL,
    backend_state TEXT NOT NULL DEFAULT '',
    active_model TEXT NOT NULL DEFAULT '',
    capacity JSONB
);
CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_provider_at
    ON provider_heartbeats(provider_id, at DESC);
CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_at
    ON provider_heartbeats(at DESC);

`provider_heartbeats_hourly` (5-min rollup of the above)

CREATE TABLE IF NOT EXISTS provider_heartbeats_hourly (
    provider_id TEXT NOT NULL,
    hour TIMESTAMPTZ NOT NULL,
    heartbeat_count INT NOT NULL DEFAULT 0,
    avg_memory_pressure REAL NOT NULL DEFAULT 0,
    avg_cpu_usage REAL NOT NULL DEFAULT 0,
    max_thermal_state TEXT NOT NULL DEFAULT '',
    avg_memory_available_gb REAL,
    PRIMARY KEY (provider_id, hour)
);
CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_hourly_hour
    ON provider_heartbeats_hourly(hour DESC);

`provider_sessions` (per-session intervals)

CREATE TABLE IF NOT EXISTS provider_sessions (
    id BIGSERIAL PRIMARY KEY,
    provider_id TEXT NOT NULL,
    connected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    disconnected_at TIMESTAMPTZ,
    disconnect_reason TEXT NOT NULL DEFAULT '',
    last_heartbeat_at TIMESTAMPTZ,
    requests_served BIGINT NOT NULL DEFAULT 0,
    tokens_generated BIGINT NOT NULL DEFAULT 0,
    coordinator_id TEXT NOT NULL DEFAULT ''
);
CREATE INDEX IF NOT EXISTS idx_provider_sessions_provider_connected
    ON provider_sessions(provider_id, connected_at DESC);
CREATE INDEX IF NOT EXISTS idx_provider_sessions_open
    ON provider_sessions(provider_id) WHERE disconnected_at IS NULL;
CREATE INDEX IF NOT EXISTS idx_provider_sessions_disconnected_at
    ON provider_sessions(disconnected_at DESC) WHERE disconnected_at IS NOT NULL;

`provider_reliability_features` (per-provider scalar rollup, recomputed every 5 min)

CREATE TABLE IF NOT EXISTS provider_reliability_features (
    provider_id TEXT PRIMARY KEY,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    window_days INT NOT NULL DEFAULT 14,
    uptime_pct REAL NOT NULL DEFAULT 0,
    sessions_count INT NOT NULL DEFAULT 0,
    mtbf_seconds BIGINT NOT NULL DEFAULT 0,
    median_session_seconds BIGINT NOT NULL DEFAULT 0,
    p10_session_seconds BIGINT NOT NULL DEFAULT 0,
    p90_session_seconds BIGINT NOT NULL DEFAULT 0,
    hourly_availability JSONB NOT NULL DEFAULT '{}'::jsonb,
    disconnect_reasons JSONB NOT NULL DEFAULT '{}'::jsonb,
    p_stays_4h REAL NOT NULL DEFAULT 0,
    p_stays_8h REAL NOT NULL DEFAULT 0,
    last_disconnect_at TIMESTAMPTZ,
    last_session_duration_seconds BIGINT NOT NULL DEFAULT 0
);
-- Additive: scalar liveness score in [0, 1]. ADD COLUMN IF NOT EXISTS so
-- the migration is safe to re-run, and safe if a deploy lands the CREATE
-- TABLE and the ADD COLUMN in two steps.
ALTER TABLE provider_reliability_features
    ADD COLUMN IF NOT EXISTS liveness_score REAL NOT NULL DEFAULT 0;
CREATE INDEX IF NOT EXISTS idx_provider_reliability_features_score
    ON provider_reliability_features(liveness_score DESC);

Liveness score

The features rollup writes a scalar liveness_score in [0, 1] alongside the per-provider features. Higher = more reliable for long-running jobs.

score = 0.35·uptime_pct
      + 0.25·p_stays_4h
      + 0.15·p_stays_8h
      + 0.25·sigmoid((mtbf_seconds − 4h) / 1h)
      − 0.10·recency_penalty(last_disconnect_at)

where  recency_penalty decays linearly from 1.0 at t=0 to 0 at 24h
       score is clamped to [0, 1]

Weights/sigmoid/recency-window constants live in coordinator/liveness/features.go (scoreWeight*, scoreMTBFTargetSec, scoreMTBFScaleSec, scoreRecencyWindowSec). They are intentionally const rather than env-tuned — getting them right takes weeks of soak data, then a code change.

Edge-case behavior locked in by tests:

Provider profile	Score
Brand new (no sessions)	≈ 0
Perfect (100% uptime, 8h+ sessions, big MTBF, no recent disconnect)	1.0
99% uptime but flaky (lots of short sessions, recent disconnect)	≈ 0.28 (not 0.99)
Same flaky provider 24h later	recovers as recency penalty decays

Endpoint usage

GET /v1/providers/reliability
    ?min_uptime=0.9
    &min_stays_4h=0.7
    &min_score=0.8
    &max_score=1
    &limit=50

Results are ordered liveness_score DESC, uptime_pct DESC, provider_id ASC. max_score=0 is treated as unbounded so callers don't have to pass 1.0 for the common case.

Safeguards

Bounded channel + non-blocking send → no OOM on Postgres outage; drops counted via liveness_dropped_total.
5s statement timeout per flush + error backoff → bad flushes can't stall the writer.
Drain on shutdown forces backoff = 0 → no silent data loss when graceful-stopping after a transient error.
CloseOrphanSessions is coordinator_id-scoped → blue-green deploys won't stomp the sibling coordinator's live sessions.
Retention deletes in batches of 10k via subquery + IN → bounded WAL pressure, autovacuum keeps pace, partial progress survives cancellation.
Hourly rollup window is aligned to a clean hour boundary in SQL (date_trunc('hour', $1)); upserts via ON CONFLICT DO UPDATE so re-running is idempotent.
All endpoint queries are pre-aggregated (provider_reliability_features, provider_heartbeats_hourly) or indexed lookups — no seq scans at request time.
ListSessionsSince is covered by idx_provider_sessions_open (partial, WHERE disconnected_at IS NULL) for the open half and idx_provider_sessions_disconnected_at (partial, WHERE disconnected_at IS NOT NULL) for the closed half; planner uses BitmapOr.
Liveness-score filter uses idx_provider_reliability_features_score (liveness_score DESC) → min_score/max_score queries are index-only.

Addressing review comments

coordinator/liveness/session.go close-failure visibility → SessionTracker now bumps liveness_session_close_failed_total when the store's CloseSession errors. Nil-safe CounterFn matches the Writer/Features/Retention pattern. (Commit c7133109.)
Dead provider_heartbeats_hourly table → populated by a new 5-min HeartbeatHourly rollup loop modeled on the existing Features rollup. (Commit b54c2667.)
Missing index for ListSessionsSince → added partial idx_provider_sessions_disconnected_at. NB: the original review comment named connected_at; the actual query predicates on disconnected_at, so this fixes the spirit of the concern against the real query. (Commit c3f6c242.)

Test plan

cd coordinator && go build ./... — clean
cd coordinator && go test ./... -count=1 -timeout 180s — all packages green (api integration suite is the slow one, ~83s)
Postgres-backed store tests (not yet run; needs a real DB):
DATABASE_URL=postgres://… go test ./coordinator/store/ -run TestPostgresLiveness -count=1
Hit each new endpoint via curl against dev coordinator and verify response shape:
curl -H "Authorization: Bearer $ADMIN_KEY" https://api.dev.darkbloom.xyz/v1/network/availability curl -H "Authorization: Bearer $ADMIN_KEY" "https://api.dev.darkbloom.xyz/v1/providers/reliability?min_score=0.8&limit=10"
24h soak on dev coordinator: verify liveness_dropped_total == 0, flush p99 < 100ms, no connection-pool exhaustion.
Confirm os.Hostname() is stable across EigenCloud redeploys (blue-green correctness).
Inspect liveness_score distribution on dev after one rollup tick — sanity-check it matches operator intuition before tuning the weights.

Deploy follow-ups (not in this PR)

Wire the CounterFn callbacks (liveness_dropped_total, liveness_flush_failed_total, liveness_features_runs_total, liveness_hourly_runs_total, liveness_session_close_failed_total, etc.) into the existing metrics pipeline so they're visible on dashboards.
Consider exposing liveness_score as a registry signal so the routing scorer can multiply by it for long-running jobs (per CLAUDE.md, current registry scoring is decode TPS × trust × reputation × warm model bonus × health factor — liveness_score would be a complementary historical factor).

Out of scope for v1

Real-time risk-signal stream (deferred; the captured data is the training set).
Job-aware / sticky scheduling (separate workstream — analytics tells you when, but the platform still needs a checkpoint protocol to act).
Postgres read replica (analytics queries are bounded by pre-aggregated features + indexed lookups; revisit if you measure operational impact).
Env-tunable score weights (intentional — wait for soak data before exposing).

🤖 Generated with Claude Code

vercel · 2026-05-22T18:48:31Z

@M3kko is attempting to deploy a commit to the EigenLabs Team on Vercel.

A member of the Team first needs to authorize it.

ethenotethan · 2026-05-22T22:59:58Z

@hankbobtheresearchoor can you do a review on this PR? I'd like for you to identify any performance or scalability bottlenecks in this code. iiuc this is introducing additive complexity onto the coordinator with stateful in-memory data structures which could result in connection or performance contention

hankbobtheresearchoor

Reviewed — see inline comments for details.

hankbobtheresearchoor · 2026-05-26T16:32:32Z

+	id, ok := t.open[providerID]
+	if ok {
+		delete(t.open, providerID)
+	}


🟡 Should fix: delete(t.open, providerID) removes the in-memory entry BEFORE the DB write at L91 (CloseSession). If the DB write fails (transient Postgres blip), the session row stays OPEN in Postgres but the tracker has no record of it — a subsequent Close() call silently returns via the idempotent guard (L76: if !ok return).

The orphan sweep on next coordinator boot catches it, but if the coordinator runs for weeks without restarting, stale open rows accumulate. Consider exposing liveness_session_close_failed_total as a counter at L96 so operators can detect this gap.

hankbobtheresearchoor · 2026-05-26T16:32:34Z

+		`CREATE INDEX IF NOT EXISTS idx_provider_heartbeats_at ON provider_heartbeats(at DESC)`,
+
+		`CREATE TABLE IF NOT EXISTS provider_heartbeats_hourly (
+			provider_id TEXT NOT NULL,


🟡 Should fix: provider_heartbeats_hourly is created in the migration at L503 but nothing populates it in this PR. The writer only calls AppendHeartbeats → raw provider_heartbeats. The features rollup reads from provider_sessions, not heartbeats. The retention pruner deletes from raw heartbeats only.

At 1,000 providers × 10s heartbeat interval = 8.6M raw rows/day, this hourly table should either (a) get a periodic rollup goroutine (INSERT INTO ... SELECT from the raw table), or (b) be removed from the migration until it has a writer. Dead schema misleads operators into thinking hourly data exists.

hankbobtheresearchoor · 2026-05-26T16:32:36Z

+		)`,
+		`CREATE INDEX IF NOT EXISTS idx_provider_sessions_provider_connected ON provider_sessions(provider_id, connected_at DESC)`,
+		`CREATE INDEX IF NOT EXISTS idx_provider_sessions_open ON provider_sessions(provider_id) WHERE disconnected_at IS NULL`,
+


🟡 Should fix — missing index for features rollup query. ListSessionsSince(since) in the store interface is called every 5 minutes by the features rollup. For the postgres backend this becomes:

SELECT * FROM provider_sessions WHERE connected_at >= $1 OR disconnected_at IS NULL

The existing indexes (L526 idx_provider_sessions_provider_connected, L527 idx_provider_sessions_open) don't cover this query — neither leads with connected_at alone. At fleet scale with millions of session rows, this will seq-scan every 5 minutes.

Recommend adding after L527:

CREATE INDEX IF NOT EXISTS idx_provider_sessions_connected_at ON provider_sessions(connected_at DESC);

Low urgency now but will matter as the fleet grows past hundreds of providers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… policy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ys)) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… cycle Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+ RecordDisconnect Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rror) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…an-close on boot Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rser Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… logic Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pers) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…se_failed_total counter When the DB write in SessionTracker.Close fails, the tracker has already deleted its in-memory entry; the row stays OPEN in Postgres and the orphan sweep only catches it at the next coordinator boot. Add a nil-safe CounterFn to SessionTracker so operators can alarm on the gap without waiting for a reboot. Threads `nil` through the construction site in cmd/coordinator/main.go and the existing test call sites. Addresses hankbobtheresearchoor's review comment on coordinator/liveness/session.go:79 (PR Layr-Labs#201). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The provider_heartbeats_hourly migration ships in this branch but nothing was writing to it — at 1k providers × 10s heartbeats = 8.6M raw rows/day, dashboards that expected pre-aggregated hourly data silently returned empty. Add Store.RollupHeartbeatsHourly on both backends (Postgres uses INSERT…SELECT…GROUP BY…ON CONFLICT; memory mirrors it in Go for parity) and a HeartbeatHourly loop modeled on the existing features rollup (5m cadence, 2h lookback, nil-safe CounterFn instrumentation). Thermal severity is ranked nominal<fair<serious<critical so MAX-by-severity is correct (plain text MAX is alphabetical). Addresses hankbobtheresearchoor's review comment on coordinator/store/postgres.go:503 (PR Layr-Labs#201). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sSince ListSessionsSince runs every 5 minutes via the features rollup with the predicate `WHERE disconnected_at IS NULL OR disconnected_at >= $1`. The IS NULL half is already covered by idx_provider_sessions_open (partial, WHERE disconnected_at IS NULL). The >= $1 half had no covering index — seq scan at fleet scale. Add a partial index on disconnected_at DESC (WHERE disconnected_at IS NOT NULL) so the planner can BitmapOr the two halves. NB: hankbobtheresearchoor's original comment named connected_at; the actual query predicates on disconnected_at, so this fixes the spirit of the concern against the real query shape. Addresses review comment on coordinator/store/postgres.go:527 (PR Layr-Labs#201). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dpoints) Master deleted the standalone analytics/ service in Layr-Labs#212 with the rationale "duplicate of coordinator endpoints, never deployed." Bring the liveness endpoints back as a single coordinator/api/liveness.go file, admin-gated via the existing requireAdminKey pattern. Reuses the coordinator's store.Store directly — no service layer, no separate types package, no pseudonym aliasing (matches existing /v1/stats which exposes raw provider IDs). Five endpoints: GET /v1/providers/{id}/liveness reliability summary GET /v1/providers/{id}/sessions recent session intervals GET /v1/providers/{id}/heartbeats recent heartbeat samples GET /v1/providers/reliability fleet shortlist by uptime bar GET /v1/network/availability distributional fleet stats Also repairs a stale interface docstring: GetReliabilityFeatures returns (nil, nil) on miss, not ErrNotFound. Both backends already returned (nil, nil); only the comment claimed otherwise. Includes httptest coverage for the admin gate, happy path, not-found, filter behavior, and small-N percentile correctness (math.Round so p90 separates from p50 at N=3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dpoints) Master deleted analytics/ in Layr-Labs#212; this branch had been adding to it. Now that the liveness read endpoints live in coordinator/api/liveness.go, remove the duplicate tree (~1500 LOC across 17 files including the separate go.mod). Post-rebase, this conflict resolves cleanly because both sides delete the same files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rebase) PR Layr-Labs#187 changed store.NewMemory to take store.Config{} instead of a raw adminKey string, and api.NewServer to require a ServerConfig arg. Updates the test call sites added in this branch to match. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Collapses the per-provider reliability features into a single number in [0, 1] so consumers (dashboards, future schedulers) can rank providers without knowing the formula. Higher = more reliable for long-running jobs. Formula: score = 0.35·uptime_pct + 0.25·p_stays_4h + 0.15·p_stays_8h + 0.25·sigmoid((mtbf_seconds - 4h) / 1h) - 0.10·recency_penalty(last_disconnect_at) recency_penalty decays linearly from 1.0 at t=0 to 0 at 24h. Score clamped to [0, 1]. - Adds liveness_score REAL column to provider_reliability_features (additive migration via ADD COLUMN IF NOT EXISTS) + DESC index for the new ORDER BY. - Computed in coordinator/liveness/features.go alongside the other features and persisted with each rollup tick (5-min cadence over a 14-day window). - Adds MinScore/MaxScore to store.ReliabilityFilter. ListReliabilityFeatures now orders by liveness_score DESC, uptime_pct DESC, provider_id ASC. - /v1/providers/reliability accepts min_score=&max_score= and echoes them in the response. Test coverage: scoring math at empty / perfect / flaky / recency-decay / clamping edge cases, plus an httptest case that confirms the endpoint filters AND orders by score. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hankbobtheresearchoor approved these changes May 26, 2026

View reviewed changes

hankbobtheresearchoor reviewed May 26, 2026

View reviewed changes

M3kko and others added 24 commits May 28, 2026 12:32

feat(store): add Store interface methods + types for provider liveness

a44708a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(store): postgres impls for liveness + 4 new table migrations

a5543a3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(store): memory backend impls for liveness methods

ab743b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(store): include new liveness tables in TRUNCATE list

739bb1e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(store): shared liveness test suite for memory + postgres backends

93a4a58

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(liveness): bounded-buffer async Writer with batched flush + drop…

6a0654e

… policy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(liveness): writer drop, flush, retain, drain, counter behavior

3a0bf16

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(liveness): SessionTracker for provider connect/disconnect lifecycle

4f4f1b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(liveness): SessionTracker open/close + close-without-open coverage

2781ebd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(liveness): Sink adapter implementing registry.LivenessSink

9d1053e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(liveness): hourly batched DELETE for provider_heartbeats retention

21b3f1c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(liveness): retention drain + error handling + boot/tick lifecycle

7254b83

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(liveness): 5min reliability features rollup (uptime, MTBF, P(sta…

d07ba29

…ys)) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(liveness): features math + rollup loop + error path

888add3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(liveness): end-to-end register/heartbeat/disconnect/orphan-close…

e6be4ae

… cycle Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(registry): wire LivenessSink into Heartbeat/Register/evictStale …

d4d9c0c

…+ RecordDisconnect Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(api): classify provider disconnect reason (clean_close vs read_e…

3f78d90

…rror) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(coordinator): wire liveness writer + retention + features + orph…

0dcd281

…an-close on boot Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(analytics): liveness read interface + response types + window pa…

8487569

…rser Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(analytics): liveness Service with aliasing + percentile + filter…

0dc2ca9

… logic Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(analytics): memory backend for liveness reads (dev/test seed hel…

119023d

…pers) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(analytics): postgres backend for liveness reads via capped pgxpool

2f23d3d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(analytics): liveness service alias + filter + percentile coverage

ea7dbfb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(analytics): httptest coverage for all new liveness endpoints

7924ee0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

M3kko and others added 6 commits May 28, 2026 12:37

M3kko force-pushed the c-brody/dar-61-figure-out-liveness-profiling-per-provider branch from 6a22c00 to 04829ba Compare May 28, 2026 17:40

M3kko and others added 2 commits May 28, 2026 13:08

style: gofmt liveness tests + postgres.go

dfe54a7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: provider liveness analytics (DAR-61)#201

feat: provider liveness analytics (DAR-61)#201
M3kko wants to merge 32 commits into
Layr-Labs:masterfrom
M3kko:c-brody/dar-61-figure-out-liveness-profiling-per-provider

M3kko commented May 22, 2026 •

edited

Loading

Uh oh!

vercel Bot commented May 22, 2026

Uh oh!

ethenotethan commented May 22, 2026

Uh oh!

hankbobtheresearchoor left a comment

Uh oh!

hankbobtheresearchoor May 26, 2026

Uh oh!

hankbobtheresearchoor May 26, 2026

Uh oh!

hankbobtheresearchoor May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

M3kko commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changes

Schema details

provider_heartbeats (raw heartbeat history)

provider_heartbeats_hourly (5-min rollup of the above)

provider_sessions (per-session intervals)

provider_reliability_features (per-provider scalar rollup, recomputed every 5 min)

Liveness score

Endpoint usage

Safeguards

Addressing review comments

Test plan

Deploy follow-ups (not in this PR)

Out of scope for v1

Uh oh!

vercel Bot commented May 22, 2026

Uh oh!

ethenotethan commented May 22, 2026

Uh oh!

hankbobtheresearchoor left a comment

Choose a reason for hiding this comment

Uh oh!

hankbobtheresearchoor May 26, 2026

Choose a reason for hiding this comment

Uh oh!

hankbobtheresearchoor May 26, 2026

Choose a reason for hiding this comment

Uh oh!

hankbobtheresearchoor May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

M3kko commented May 22, 2026 •

edited

Loading

`provider_heartbeats` (raw heartbeat history)

`provider_heartbeats_hourly` (5-min rollup of the above)

`provider_sessions` (per-session intervals)

`provider_reliability_features` (per-provider scalar rollup, recomputed every 5 min)