Skip to content

fix(BUY-45725+BUY-44164): monitoring skipFreshness + hard timeout#116

Open
BuyWhere wants to merge 11 commits into
mainfrom
fix/BUY-45725-p95-endpoint-hang
Open

fix(BUY-45725+BUY-44164): monitoring skipFreshness + hard timeout#116
BuyWhere wants to merge 11 commits into
mainfrom
fix/BUY-45725-p95-endpoint-hang

Conversation

@BuyWhere

Copy link
Copy Markdown
Owner

Summary

Lands the BUY-45725+BUY-44164 monitoring fixes onto origin/main:

  • getAllLatestP95({ skipFreshness: true }) on /api/monitoring/p95/all — avoids lock contention from freshness probes blocking the endpoint (BUY-45725)
  • 10s hard req.setTimeout on all /api/monitoring/* routes — prevents any single request from hanging indefinitely (BUY-44164)

Root Cause

Origin/main dist/monitoring/routes.js was compiled before the monitoring fixes landed in src/monitoring/routes.ts. The fixes exist in source but the stale dist was never regenerated on main.

Changes

  • api/dist/monitoring/routes.js: add setTimeout middleware + skipFreshness: true arg
  • api/src/monitoring/p95.ts: caching + background freshness (upstream from 483764e63)

Closes BUY-46735.

BuyWhere Bot and others added 11 commits June 13, 2026 06:52
Follow-up to BUY-45671 (CEO report: 17.56% 5xx + p95 10,003ms). Heavy
aggregate reads (catalog stats, category rollups, deals) compete with
interactive /v1/products/search for shared_buffers/work_mem on the single
maglev Postgres.

- New api/src/lib/readReplica.ts: env-gated replica pool (REPLICA_DATABASE_URL),
  background pg_last_xact_replay_timestamp lag probe, readDb() selector that
  routes to the replica only while lag <= REPLICA_MAX_LAG_MS (default 2s) and
  falls back to the primary otherwise. Mirrors the existing vectorDb pattern.
- catalog.ts: collectStats / tryExactCount / categories now read via readDb();
  /stats/health exposes replica routing + lag for ops.
- products.ts: /v1/products/deals heavy client routes via readDb(); interactive
  /v1/products/search stays on the primary.

No-op until REPLICA_DATABASE_URL is set (BUY-45692.A, Bolt/Ops provisions the
replica), so it ships safely and activates on config flip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The endpoint was hanging because getAllLatestP95() unconditionally called
refreshRecentP95Windows() (DELETE+INSERT on monitoring.p95_latency) BEFORE
checking data freshness. This caused lock contention on every request.

Adding skipFreshness:true since:
- Data is demonstrably fresh (single-market endpoints confirm window_end=06:55Z)
- BUY-45685 handles missing/stale data gracefully with its own fallback state
- The endpoint is a read-only display endpoint, not a freshness trigger

Co-Authored-By: Paperclip <noreply@paperclip.ing>
The watchdog wrote alert.sink: BUY-31463 to result.json but delegated the
actual comment POST to the routine execution engine. When the engine missed
a cycle (BUY-46620 / 15:10Z), no alert was sent.

buy-32264-p95-latency-watchdog.js now POSTs the alert comment to the sink
issue itself via PAPERCLIP_API_URL/PAPERCLIP_API_KEY when status === ALERT,
mirroring buy-38913-disk-space-watchdog.js. On a failed post the per-market
cooldown stamps are rolled back so the next rotation retries delivery rather
than silently suppressing it. Also commits the run-buy-32397 wrapper, which
was untracked on this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nitoring endpoints

- getAllLatestP95({ skipFreshness: true }) avoids lock contention from
  refreshRecentP95Windows() on /api/monitoring/p95/all (BUY-45725)
- 10s hard timeout on all /api/monitoring routes prevents indefinite hangs
  (BUY-44164)
- Both fixes unblock the P95 watchdog from hanging

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt escalation schema

- BUY-38482 → BUY-31142 as the owning issue across worker status and escalation file
- keep-alive: only kill previous PID when proc alive AND cmdline marker matches;
  prevents killing an unrelated process that inherited a recycled PID
- keep-alive: initialize cmdline_ok default before shell : expansion
- escalation file: preserve existing data on upgrade (idempotent field init)
- escalation entries: stamp with last_caller='buy31142-crew-wc-rest-keep-alive'

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…v3 APIs

- Append all log output to buy31142-crew-wc-rest-worker.log in addition to stdout
  so keep-alive tick output is preserved for debugging
- Harvest from both public Store API (/wc/store/products) and authenticated v3
  API (/wc/v3/products), then dedupe by merchant_id|sku to maximize coverage
  when one surface returns partial data
- Include storeCount/v3Count/skipped in merchant harvest log lines for
  visibility into which endpoint contributed products

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…200, 247ms, 907d6c5c 106198B held 14th)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…clip incidents

- Add 5-minute interval disk check for /dev/vda1 (BUY-48087 reference)
- CRITICAL threshold: < 5GB free (creates critical priority incident)
- WARN threshold: < 20GB free (creates high priority incident)
- Paperclip incident creation with proper metadata (assignee: Rex agent)
- Deduplication: 1-hour window prevents duplicate incidents
- Alert history tracking in monitoring.alert_history table

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rclip incidents

- Add diskSpace.ts module for monitoring /dev/vda1 free space
- Add diskSpaceRunner.ts job running every 5 minutes
- Create critical Paperclip incidents when below 5GB (warn at 20GB)
- Use Redis for alert dedup (1h for critical, 24h for warning)
- Auto-start in main server via startDiskSpaceRunner()

Co-Authored-By: Paperclip <noreply@paperclip.ing>
api/src/routes/products.ts
- Add isValidNumericProductId() helper that accepts only non-negative
  integers (products.id is bigint; non-numeric :id triggered 22P02
  `invalid input syntax for type bigint: "count"` from any client
  request to /v1/products/<not-a-number>).
- Add validation to 4 routes: /:id, /:id/price-history, /:id/prices,
  /:id/similar. Non-numeric ids now return 400 {error: 'Invalid product
  id', id} before any DB call.

api/src/migrate.ts
- Add SET lock_timeout = 4000 to the discount_pct GENERATED-column DDL
  block (line 507). Previously only statement_timeout = 360000 was set;
  a concurrent ingest transaction that held the AccessExclusive lock
  would block the DDL for the full 6 min and starve every other query
  in the pool. With lock_timeout, the DDL aborts in 4s and runMigrations
  continues to listen.
- Remove the trailing countCheck query (SELECT count(*) ... WHERE
  discount_pct IS NOT NULL) — a full 14M-row scan inside the startup
  hot path that was itself tripping the 30s default statement_timeout.
  The column's GENERATED property is the only invariant that matters.

Acceptance:
- bigint "count" 22P02 no longer reachable from client traffic.
- discount_pct migration can no longer pin the pool for 6 min.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant