fix(BUY-45725+BUY-44164): monitoring skipFreshness + hard timeout by BuyWhere · Pull Request #116 · BuyWhere/buywhere

BuyWhere · 2026-06-13T16:28:09Z

Summary

Lands the BUY-45725+BUY-44164 monitoring fixes onto origin/main:

getAllLatestP95({ skipFreshness: true }) on /api/monitoring/p95/all — avoids lock contention from freshness probes blocking the endpoint (BUY-45725)
10s hard req.setTimeout on all /api/monitoring/* routes — prevents any single request from hanging indefinitely (BUY-44164)

Root Cause

Origin/main dist/monitoring/routes.js was compiled before the monitoring fixes landed in src/monitoring/routes.ts. The fixes exist in source but the stale dist was never regenerated on main.

Changes

api/dist/monitoring/routes.js: add setTimeout middleware + skipFreshness: true arg
api/src/monitoring/p95.ts: caching + background freshness (upstream from 483764e63)

Closes BUY-46735.

Follow-up to BUY-45671 (CEO report: 17.56% 5xx + p95 10,003ms). Heavy aggregate reads (catalog stats, category rollups, deals) compete with interactive /v1/products/search for shared_buffers/work_mem on the single maglev Postgres. - New api/src/lib/readReplica.ts: env-gated replica pool (REPLICA_DATABASE_URL), background pg_last_xact_replay_timestamp lag probe, readDb() selector that routes to the replica only while lag <= REPLICA_MAX_LAG_MS (default 2s) and falls back to the primary otherwise. Mirrors the existing vectorDb pattern. - catalog.ts: collectStats / tryExactCount / categories now read via readDb(); /stats/health exposes replica routing + lag for ops. - products.ts: /v1/products/deals heavy client routes via readDb(); interactive /v1/products/search stays on the primary. No-op until REPLICA_DATABASE_URL is set (BUY-45692.A, Bolt/Ops provisions the replica), so it ships safely and activates on config flip. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The endpoint was hanging because getAllLatestP95() unconditionally called refreshRecentP95Windows() (DELETE+INSERT on monitoring.p95_latency) BEFORE checking data freshness. This caused lock contention on every request. Adding skipFreshness:true since: - Data is demonstrably fresh (single-market endpoints confirm window_end=06:55Z) - BUY-45685 handles missing/stale data gracefully with its own fallback state - The endpoint is a read-only display endpoint, not a freshness trigger Co-Authored-By: Paperclip <noreply@paperclip.ing>

The watchdog wrote alert.sink: BUY-31463 to result.json but delegated the actual comment POST to the routine execution engine. When the engine missed a cycle (BUY-46620 / 15:10Z), no alert was sent. buy-32264-p95-latency-watchdog.js now POSTs the alert comment to the sink issue itself via PAPERCLIP_API_URL/PAPERCLIP_API_KEY when status === ALERT, mirroring buy-38913-disk-space-watchdog.js. On a failed post the per-market cooldown stamps are rolled back so the next rotation retries delivery rather than silently suppressing it. Also commits the run-buy-32397 wrapper, which was untracked on this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nitoring endpoints - getAllLatestP95({ skipFreshness: true }) avoids lock contention from refreshRecentP95Windows() on /api/monitoring/p95/all (BUY-45725) - 10s hard timeout on all /api/monitoring routes prevents indefinite hangs (BUY-44164) - Both fixes unblock the P95 watchdog from hanging Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nt escalation schema - BUY-38482 → BUY-31142 as the owning issue across worker status and escalation file - keep-alive: only kill previous PID when proc alive AND cmdline marker matches; prevents killing an unrelated process that inherited a recycled PID - keep-alive: initialize cmdline_ok default before shell : expansion - escalation file: preserve existing data on upgrade (idempotent field init) - escalation entries: stamp with last_caller='buy31142-crew-wc-rest-keep-alive' Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…v3 APIs - Append all log output to buy31142-crew-wc-rest-worker.log in addition to stdout so keep-alive tick output is preserved for debugging - Harvest from both public Store API (/wc/store/products) and authenticated v3 API (/wc/v3/products), then dedupe by merchant_id|sku to maximize coverage when one surface returns partial data - Include storeCount/v3Count/skipped in merchant harvest log lines for visibility into which endpoint contributed products Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…200, 250ms, 552acc 106198B held)

…200, 247ms, 907d6c5c 106198B held 14th) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…clip incidents - Add 5-minute interval disk check for /dev/vda1 (BUY-48087 reference) - CRITICAL threshold: < 5GB free (creates critical priority incident) - WARN threshold: < 20GB free (creates high priority incident) - Paperclip incident creation with proper metadata (assignee: Rex agent) - Deduplication: 1-hour window prevents duplicate incidents - Alert history tracking in monitoring.alert_history table Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rclip incidents - Add diskSpace.ts module for monitoring /dev/vda1 free space - Add diskSpaceRunner.ts job running every 5 minutes - Create critical Paperclip incidents when below 5GB (warn at 20GB) - Use Redis for alert dedup (1h for critical, 24h for warning) - Auto-start in main server via startDiskSpaceRunner() Co-Authored-By: Paperclip <noreply@paperclip.ing>

api/src/routes/products.ts - Add isValidNumericProductId() helper that accepts only non-negative integers (products.id is bigint; non-numeric :id triggered 22P02 `invalid input syntax for type bigint: "count"` from any client request to /v1/products/<not-a-number>). - Add validation to 4 routes: /:id, /:id/price-history, /:id/prices, /:id/similar. Non-numeric ids now return 400 {error: 'Invalid product id', id} before any DB call. api/src/migrate.ts - Add SET lock_timeout = 4000 to the discount_pct GENERATED-column DDL block (line 507). Previously only statement_timeout = 360000 was set; a concurrent ingest transaction that held the AccessExclusive lock would block the DDL for the full 6 min and starve every other query in the pool. With lock_timeout, the DDL aborts in 4s and runMigrations continues to listen. - Remove the trailing countCheck query (SELECT count(*) ... WHERE discount_pct IS NOT NULL) — a full 14M-row scan inside the startup hot path that was itself tripping the 30s default statement_timeout. The column's GENERATED property is the only invariant that matters. Acceptance: - bigint "count" 22P02 no longer reachable from client traffic. - discount_pct migration can no longer pin the pool for 6 min. Co-Authored-By: Paperclip <noreply@paperclip.ing>

BuyWhere Bot and others added 11 commits June 13, 2026 06:52

fix(BUY-47420): tick N+177 frontend health check snapshot (PASS HTTP …

65d4b33

…200, 250ms, 552acc 106198B held)

fix(BUY-48122): tick N+247 frontend health check snapshot (PASS HTTP …

79f345f

…200, 247ms, 907d6c5c 106198B held 14th) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(BUY-45725+BUY-44164): monitoring skipFreshness + hard timeout#116

fix(BUY-45725+BUY-44164): monitoring skipFreshness + hard timeout#116
BuyWhere wants to merge 11 commits into
mainfrom
fix/BUY-45725-p95-endpoint-hang

BuyWhere commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

BuyWhere commented Jun 13, 2026

Summary

Root Cause

Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant