Skip to content

COW-1013: orderbook 429/backoff resilience + non-root container#103

Merged
jeffersonBastos merged 2 commits into
developfrom
jefferson/cow-1013-orderbook-api-resilience-429backoff-run-container-as-non
Jun 17, 2026
Merged

COW-1013: orderbook 429/backoff resilience + non-root container#103
jeffersonBastos merged 2 commits into
developfrom
jefferson/cow-1013-orderbook-api-resilience-429backoff-run-container-as-non

Conversation

@jeffersonBastos

@jeffersonBastos jeffersonBastos commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Closes COW-1013.

Two production-hardening items surfaced in the COW-1012 QA review (PR #97), each with tests.

1. Orderbook API resilience (429 / 5xx)

Problem. orderbookClient.ts treated every non-OK response identically: log("warn", ...) then break/continue, returning a partial/empty result. That empty result is indistinguishable downstream from "order genuinely absent" (blockHandler.ts: if (!orderbookEntry) continue; // not on API yet). So a 429/5xx caused candidate orders to silently never promote, with only warn logs — looks healthy from the status enums.

Change.

  • New fetchOrderbook() wrapping fetchWithTimeout with bounded retry/backoff: honors Retry-After (delta-seconds and HTTP-date, capped at ORDERBOOK_RETRY_MAX_DELAY_MS) on 429, exponential backoff on 5xx. Stops at ORDERBOOK_MAX_RETRIES or when the next sleep would exceed ORDERBOOK_RETRY_BUDGET_MS, then throws — never holds a block-handler DB transaction open past the budget (the key constraint per withTimeout.ts).
  • New OrderbookUnavailableError + distinct log codes:
    • ob:unavailable (level error, carries status) — API refused to answer. Alarm on this.
    • ob:retry (level warn) — a retry is happening.
    • A genuinely absent order (empty 2xx) stays the benign missing-UID path — no new alarm.
  • Wired into fetchAccountOrders and fetchOrdersByUids. Caller control flow and return shapes are unchanged — promotion still safely defers on unavailability (retries on the next poll ~ORDERBOOK_POLL_INTERVAL blocks later), it's just now observable.

Constants (src/constants.ts): ORDERBOOK_MAX_RETRIES=2, ORDERBOOK_RETRY_BASE_MS=250, ORDERBOOK_RETRY_MAX_DELAY_MS=2000, ORDERBOOK_RETRY_BUDGET_MS=4000.

Tests (tests/helpers/orderbookClient.test.ts): two focused cases — 429-then-success (Retry-After honored) and persistent-429 → ob:unavailable after bounded retries. 5xx is already exercised by the pre-existing 500 tests (they now route through the retry/classify path), and empty-200 by the existing empty-array test.

2. Run the container as non-root

Dockerfile never set USER (ran as uid 0). Now chown -R node:node /usr/src/app /pnpm after install, then USER node before HEALTHCHECK/CMD (node:22-alpine ships node, uid 1000).

Verified: image builds; docker runuid=1000(node); workdir owned by node and writable (Ponder cache). Full docker compose --profile deploy up end-to-end was not run here (needs RPC secrets/network) — worth a sanity check in deploy.

Live-probe finding — rate limiting may surface as 403, not 429

A controlled probe against the real api.cow.fi (826 requests to a live mainnet order, bursts up to 300 concurrent) returned all 200s, no 429 — the endpoint is CDN/WAF-fronted and absorbs single-IP bursts. Separately, restricted endpoints returned 403 from the edge nginx. Implication: real-world throttling may arrive as a WAF 403 rather than an app-level 429. Our code handles this correctly — a non-429/non-5xx response is a non-retryable OrderbookUnavailableError → logged ob:unavailable (so it is never mistaken for "order absent"), with no retry (a 403 is not transient). No Retry-After is sent on success; the retry path is defensive (honor it if present, else backoff).

Scope decisions (per discussion)

  • Skipped the optional client-side token-bucket throttle — avoiding over-engineering; the bounded retry already absorbs transient 429s.
  • Out of scope (noted): the cancel-cascade preflight (blockHandler.ts) falls back to writing cancelled for orphans the API doesn't return — so an unavailable API (429/5xx, or a timeout) can mislabel a filled order as cancelled. Pre-existing behavior (already happens on timeout); the retry here reduces its likelihood. Could be a follow-up: skip the cancelled write specifically on OrderbookUnavailableError.

Checks

  • pnpm test → 108 passed · pnpm lint clean · tsc --noEmit clean

Orderbook API resilience (429/5xx):
- Add fetchOrderbook(): bounded retry/backoff around fetchWithTimeout that
  honors Retry-After (capped) on 429 and exponential backoff on 5xx, failing
  fast within a wall-clock budget so it never holds a block-handler TX open.
- Add OrderbookUnavailableError + ob:unavailable / ob:retry log codes so a
  rate-limited / down API is distinguishable from "order not on API yet"
  (previously both surfaced as a silent missing UID).
- Wire into fetchAccountOrders and fetchOrdersByUids; caller control flow and
  return shapes are unchanged (promotion still safely defers, now observable).

Container hardening:
- Run as the non-root `node` user (uid 1000) with chown of the workdir and
  /pnpm store. Verified: image builds, runs as uid 1000, workdir writable.

Tests: 429-then-success, persistent 429 → ob:unavailable (bounded retries),
5xx retry, empty-200 stays "absent", HTTP-date Retry-After parsing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jun 17, 2026

Copy link
Copy Markdown

COW-1013

Keep retry-then-succeed (Retry-After honored) and persistent-429 →
ob:unavailable (bounded). Drop the 5xx, empty-200, and HTTP-date cases:
5xx already flows through the pre-existing 500 tests, and empty-200 is
covered by the existing empty-array test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeffersonBastos jeffersonBastos marked this pull request as ready for review June 17, 2026 13:23
@jeffersonBastos jeffersonBastos merged commit 1d83c3c into develop Jun 17, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant