fix(BUY-45725+BUY-44164): monitoring skipFreshness + hard timeout#116
Open
BuyWhere wants to merge 11 commits into
Open
fix(BUY-45725+BUY-44164): monitoring skipFreshness + hard timeout#116BuyWhere wants to merge 11 commits into
BuyWhere wants to merge 11 commits into
Conversation
Follow-up to BUY-45671 (CEO report: 17.56% 5xx + p95 10,003ms). Heavy aggregate reads (catalog stats, category rollups, deals) compete with interactive /v1/products/search for shared_buffers/work_mem on the single maglev Postgres. - New api/src/lib/readReplica.ts: env-gated replica pool (REPLICA_DATABASE_URL), background pg_last_xact_replay_timestamp lag probe, readDb() selector that routes to the replica only while lag <= REPLICA_MAX_LAG_MS (default 2s) and falls back to the primary otherwise. Mirrors the existing vectorDb pattern. - catalog.ts: collectStats / tryExactCount / categories now read via readDb(); /stats/health exposes replica routing + lag for ops. - products.ts: /v1/products/deals heavy client routes via readDb(); interactive /v1/products/search stays on the primary. No-op until REPLICA_DATABASE_URL is set (BUY-45692.A, Bolt/Ops provisions the replica), so it ships safely and activates on config flip. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The endpoint was hanging because getAllLatestP95() unconditionally called refreshRecentP95Windows() (DELETE+INSERT on monitoring.p95_latency) BEFORE checking data freshness. This caused lock contention on every request. Adding skipFreshness:true since: - Data is demonstrably fresh (single-market endpoints confirm window_end=06:55Z) - BUY-45685 handles missing/stale data gracefully with its own fallback state - The endpoint is a read-only display endpoint, not a freshness trigger Co-Authored-By: Paperclip <noreply@paperclip.ing>
The watchdog wrote alert.sink: BUY-31463 to result.json but delegated the actual comment POST to the routine execution engine. When the engine missed a cycle (BUY-46620 / 15:10Z), no alert was sent. buy-32264-p95-latency-watchdog.js now POSTs the alert comment to the sink issue itself via PAPERCLIP_API_URL/PAPERCLIP_API_KEY when status === ALERT, mirroring buy-38913-disk-space-watchdog.js. On a failed post the per-market cooldown stamps are rolled back so the next rotation retries delivery rather than silently suppressing it. Also commits the run-buy-32397 wrapper, which was untracked on this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nitoring endpoints
- getAllLatestP95({ skipFreshness: true }) avoids lock contention from
refreshRecentP95Windows() on /api/monitoring/p95/all (BUY-45725)
- 10s hard timeout on all /api/monitoring routes prevents indefinite hangs
(BUY-44164)
- Both fixes unblock the P95 watchdog from hanging
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt escalation schema - BUY-38482 → BUY-31142 as the owning issue across worker status and escalation file - keep-alive: only kill previous PID when proc alive AND cmdline marker matches; prevents killing an unrelated process that inherited a recycled PID - keep-alive: initialize cmdline_ok default before shell : expansion - escalation file: preserve existing data on upgrade (idempotent field init) - escalation entries: stamp with last_caller='buy31142-crew-wc-rest-keep-alive' Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…v3 APIs - Append all log output to buy31142-crew-wc-rest-worker.log in addition to stdout so keep-alive tick output is preserved for debugging - Harvest from both public Store API (/wc/store/products) and authenticated v3 API (/wc/v3/products), then dedupe by merchant_id|sku to maximize coverage when one surface returns partial data - Include storeCount/v3Count/skipped in merchant harvest log lines for visibility into which endpoint contributed products Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…200, 250ms, 552acc 106198B held)
…200, 247ms, 907d6c5c 106198B held 14th) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…clip incidents - Add 5-minute interval disk check for /dev/vda1 (BUY-48087 reference) - CRITICAL threshold: < 5GB free (creates critical priority incident) - WARN threshold: < 20GB free (creates high priority incident) - Paperclip incident creation with proper metadata (assignee: Rex agent) - Deduplication: 1-hour window prevents duplicate incidents - Alert history tracking in monitoring.alert_history table Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rclip incidents - Add diskSpace.ts module for monitoring /dev/vda1 free space - Add diskSpaceRunner.ts job running every 5 minutes - Create critical Paperclip incidents when below 5GB (warn at 20GB) - Use Redis for alert dedup (1h for critical, 24h for warning) - Auto-start in main server via startDiskSpaceRunner() Co-Authored-By: Paperclip <noreply@paperclip.ing>
api/src/routes/products.ts
- Add isValidNumericProductId() helper that accepts only non-negative
integers (products.id is bigint; non-numeric :id triggered 22P02
`invalid input syntax for type bigint: "count"` from any client
request to /v1/products/<not-a-number>).
- Add validation to 4 routes: /:id, /:id/price-history, /:id/prices,
/:id/similar. Non-numeric ids now return 400 {error: 'Invalid product
id', id} before any DB call.
api/src/migrate.ts
- Add SET lock_timeout = 4000 to the discount_pct GENERATED-column DDL
block (line 507). Previously only statement_timeout = 360000 was set;
a concurrent ingest transaction that held the AccessExclusive lock
would block the DDL for the full 6 min and starve every other query
in the pool. With lock_timeout, the DDL aborts in 4s and runMigrations
continues to listen.
- Remove the trailing countCheck query (SELECT count(*) ... WHERE
discount_pct IS NOT NULL) — a full 14M-row scan inside the startup
hot path that was itself tripping the 30s default statement_timeout.
The column's GENERATED property is the only invariant that matters.
Acceptance:
- bigint "count" 22P02 no longer reachable from client traffic.
- discount_pct migration can no longer pin the pool for 6 min.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lands the BUY-45725+BUY-44164 monitoring fixes onto
origin/main:getAllLatestP95({ skipFreshness: true })on/api/monitoring/p95/all— avoids lock contention from freshness probes blocking the endpoint (BUY-45725)req.setTimeouton all/api/monitoring/*routes — prevents any single request from hanging indefinitely (BUY-44164)Root Cause
Origin/main
dist/monitoring/routes.jswas compiled before the monitoring fixes landed insrc/monitoring/routes.ts. The fixes exist in source but the stale dist was never regenerated on main.Changes
api/dist/monitoring/routes.js: addsetTimeoutmiddleware +skipFreshness: trueargapi/src/monitoring/p95.ts: caching + background freshness (upstream from483764e63)Closes BUY-46735.