Validate URLs and bound response sizes to block SSRF by galuis116 · Pull Request #41 · aglover1221/product-data-extractor

galuis116 · 2026-05-26T21:39:40Z

What Changed

Closes #40.

The source-acquisition pipeline takes URLs returned by the web-search LLM in lib/pipeline/sources.ts:findSources and passes them straight to fetch via lib/pipeline/pdf-validation.ts:rawFetch. There is no scheme check, no host allow-list, no IP-range deny-list, and no body-size cap on await res.arrayBuffer(). A poisoned web-search result can therefore drive the worker into requesting http://169.254.169.254/latest/meta-data/... (AWS IMDS), http://[fd00:ec2::254]/... (IPv6 IMDS), http://localhost:5432/, and similar internal targets, and an oversized response can OOM the worker. The same pattern repeats in lib/integrations/reducto.ts at downloadImage (line ~266) and in the presigned-result fetch in fetchJobResult (line ~212).

This PR adds defense in depth:

New module lib/safe-url.ts, the single source of truth for URL safety:
- validateUrlShape(url, options): synchronous validator. It applies a scheme allowlist (default http: / https:), unwraps IPv6 brackets, and checks IP literals against a deny-list covering RFC 1918 (10/8, 172.16/12, 192.168/16), 127/8 loopback, 169.254/16 link-local (AWS IMDS), 100.64/10 CGNAT, 0.0.0.0/8, IETF TEST-NET-1/2/3, multicast 224/4, reserved 240/4, broadcast, IPv6 ::1, fe80::/10, fc00::/7 ULA, ff00::/8 multicast, and IPv4-mapped ::ffff:… addresses, which are re-checked against the v4 rules. It also rejects a list of localhost-alias names (localhost, localhost.localdomain, ip6-localhost, ip6-loopback, metadata.google.internal, metadata).
- validateUrlAsync(url, options): runs the shape validator, then calls dns.lookup(host, { all: true }) and re-validates every returned A/AAAA address against the IP deny-list.
- boundedFetch(url, { maxBytes, timeoutMs, headers, redirect, allowedSchemes }): fetch wrapper that runs validateUrlAsync before the request, streams the response body via res.body.getReader() and aborts once it exceeds maxBytes, and re-validates the final URL after redirects (so a 302 from a public host to 169.254.169.254 is rejected even though the original URL passed).
lib/pipeline/pdf-validation.ts: rawFetch now delegates to boundedFetch with maxBytes = 100 MiB (well above any real vendor data sheet). All existing callers of fetchAndValidatePdf, five call sites across four files (option-matrices.ts, sources.ts × 2, program-docs.ts, app/api/pipeline/sources/[id]/screenshot/route.ts), inherit the validation transparently, with no signature change.
lib/integrations/reducto.ts: downloadImage (LLM/Reducto-supplied image URLs) and fetchJobResult (presigned result follow) both route through boundedFetch with maxBytes = 50 MiB. The Reducto API endpoints (platform.reducto.ai/upload, /parse_async, /job/{id}) are intentionally NOT routed through boundedFetch: they are trusted infrastructure with bearer auth, and the extra validation would only add latency.
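
The size cap described for boundedFetch can be sketched as follows. This is a minimal illustration, not the code in lib/safe-url.ts: readBounded is a hypothetical helper name, the error text is only modelled on the message quoted in this description, and the sketch assumes a runtime with the WHATWG Response and ReadableStream globals (for example Node 18+).

```typescript
// Hypothetical helper, not the PR's boundedFetch: read a response body
// chunk by chunk and stop as soon as the running total passes maxBytes.
async function readBounded(res: Response, maxBytes: number): Promise<Uint8Array> {
  if (!res.body) return new Uint8Array(0);
  const reader = res.body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      // Cancel the stream so the rest of the body is not buffered.
      await reader.cancel();
      throw new Error(`body exceeded ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  // Concatenate the collected chunks into one buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Counting bytes while streaming, rather than trusting Content-Length, matters because that header can be absent or wrong.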

Scope Details

No public API shape changes. rawFetch, fetchAndValidatePdf, downloadImage, and fetchJobResult keep their existing signatures. Callers see exactly the same { ok, status, buf, contentType, error } / PdfValidationResult / DownloadImageResult shapes. The only visible difference is that error paths now surface a specific message ("host X is a private/reserved IP", "scheme X not allowed", or "body exceeded N bytes") instead of a generic network error.

No dependency change, no schema change, no migration. node:dns/promises and node:net are stdlib. Existing npm run test (4 files / 35 tests) passes unchanged.

Known limitation — DNS rebinding. Between validateUrlAsync and the actual fetch, a malicious DNS server can serve a different answer. True defense requires resolving once, then pinning the request to that IP via an undici Agent / dispatcher. Documented in the module header as a deliberate follow-up.
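
To illustrate the gap, here is a small simulation. Everything in it is an assumption made for illustration: Resolver, resolveAndPin, and rebindingResolver are hypothetical names, isDenied is a deliberately tiny stand-in for the real deny-list, and the resolver is injected so the example runs without network access (in the real module it would presumably wrap dns.lookup(host, { all: true })).

```typescript
type Resolved = { address: string; family: number };
type Resolver = (host: string) => Promise<Resolved[]>;

// Tiny stand-in for the real deny-list: only a few IPv4 prefixes.
const isDenied = (ip: string): boolean =>
  /^(10\.|127\.|169\.254\.|192\.168\.)/.test(ip);

// Resolve once, validate every record, and return the address to pin.
async function resolveAndPin(host: string, resolve: Resolver): Promise<string> {
  const records = await resolve(host);
  const first = records[0];
  if (first === undefined) throw new Error(`host ${host} did not resolve`);
  const bad = records.find((r) => isDenied(r.address));
  if (bad) throw new Error(`host ${host} resolves to private/reserved IP ${bad.address}`);
  return first.address;
}

// Simulated malicious DNS: the first answer is public, later answers are IMDS.
function rebindingResolver(): Resolver {
  let calls = 0;
  return async () => {
    calls += 1;
    const address = calls === 1 ? "93.184.216.34" : "169.254.169.254";
    return [{ address, family: 4 }];
  };
}
```

With this fake resolver, validation succeeds on the first lookup, but any later lookup of the same hostname returns the IMDS address. Connecting to the address returned by resolveAndPin, instead of resolving the name again, is the pinning idea; the undici wiring itself is not shown here.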

Trusted endpoints kept out of the deny-list. Calls to platform.reducto.ai/{upload,parse_async,job} in reducto.ts:106/145/166 continue to use raw fetch — they're trusted infrastructure, bearer-authenticated, hardcoded host, no LLM input. Routing them through boundedFetch would add a DNS round-trip per call with no security benefit.

Test Plan

npx tsc --noEmit — clean.
npm run test — 4 test files / 35 tests pass; no regressions.

Live deny-list repro against the new helpers — 13 attack URLs all rejected at the validateUrlShape step (no network IO):

AWS IMDS v4         → "host 169.254.169.254 is a private/reserved IP"
AWS IMDS v6         → "host fd00:ec2::254 is a private/reserved IP"
loopback v4 / v6    → blocked
localhost name      → "host localhost is a localhost alias"
metadata.google.internal → blocked
RFC 1918 10/8, 172.16, 192.168 → blocked
::ffff:127.0.0.1    → "host ::ffff:7f00:1 is a private/reserved IP"  (IPv4-mapped unwrap works)
file://, gopher://, javascript: → "scheme X: not allowed"
https://www.hpe.com/... (legitimate vendor PDF) → ok=true
https://8.8.8.8/    (legitimate public IPv4)    → ok=true
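
The IPv4 and IPv4-mapped rows of this repro can be approximated in a few lines of dependency-free TypeScript. This is a sketch under assumptions, not the PR's validateUrlShape: all names are hypothetical, the CIDR list is transcribed from the ranges named in the description, and native IPv6 ranges, hostname aliases, and scheme checks are left out (so an address such as fd00:ec2::254 is not covered here).

```typescript
// Hypothetical sketch; names and structure are assumptions, not lib/safe-url.ts.
const DENIED_V4_CIDRS: readonly string[] = [
  "0.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8",
  "169.254.0.0/16", "172.16.0.0/12", "192.0.2.0/24", "192.168.0.0/16",
  "198.51.100.0/24", "203.0.113.0/24", "224.0.0.0/4", "240.0.0.0/4",
];

// Parse a dotted quad into an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    n = n * 256 + octet;
  }
  return n;
}

// True when ip falls inside cidr (for example "10.0.0.0/8").
function inCidr(ip: number, cidr: string): boolean {
  const [base = "", bits = "32"] = cidr.split("/");
  const baseInt = ipv4ToInt(base);
  if (baseInt === null) return false;
  const size = 2 ** (32 - Number(bits));
  return Math.floor(ip / size) === Math.floor(baseInt / size);
}

// Rewrite ::ffff:a.b.c.d or ::ffff:hhhh:hhhh to dotted-quad form.
function unwrapMappedV4(host: string): string {
  const dotted = /^::ffff:(\d+\.\d+\.\d+\.\d+)$/i.exec(host);
  if (dotted) return dotted[1] ?? host;
  const hex = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/i.exec(host);
  if (!hex) return host;
  const hi = parseInt(hex[1] ?? "0", 16);
  const lo = parseInt(hex[2] ?? "0", 16);
  return `${hi >> 8}.${hi & 255}.${lo >> 8}.${lo & 255}`;
}

function isDeniedIpLiteral(host: string): boolean {
  const bare = host.replace(/^\[|\]$/g, ""); // strip URL-style IPv6 brackets
  const n = ipv4ToInt(unwrapMappedV4(bare));
  return n !== null && DENIED_V4_CIDRS.some((cidr) => inCidr(n, cidr));
}
```

Under these assumptions, isDeniedIpLiteral("169.254.169.254") and isDeniedIpLiteral("[::ffff:7f00:1]") return true, while isDeniedIpLiteral("8.8.8.8") returns false.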

boundedFetch against IMDS literal returns { ok: false, error: "host 169.254.169.254 is a private/reserved IP" } without making any network request.
Follow-up (out of scope for this PR): vitest coverage for each deny-list category + integration tests against a mock HTTP server for the maxBytes cap and redirect-to-private-IP rejection — sketched in the linked issue's "Tests" §4.
Follow-up: DNS-rebinding hardening via an undici Agent that pins the request to the IP resolved at validation time.

Validate URLs and bound response sizes to block SSRF

b282498

galuis116 wants to merge 1 commit into aglover1221:main from galuis116:fix/ssrf-url-validation
