Validate URLs and bound response sizes to block SSRF #41

Open
galuis116 wants to merge 1 commit into aglover1221:main from galuis116:fix/ssrf-url-validation

@galuis116

What Changed

Closes #40.

The source-acquisition pipeline takes URLs returned by the web-search LLM in lib/pipeline/sources.ts:findSources and feeds them directly to fetch via lib/pipeline/pdf-validation.ts:rawFetch. There is no scheme check, no host allow-list, no IP-range deny-list, and no body-size cap on await res.arrayBuffer(). A poisoned web-search result can drive the worker into requesting http://169.254.169.254/latest/meta-data/... (AWS IMDS), http://[fd00:ec2::254]/... (IPv6 IMDS), http://localhost:5432/, etc. — and an oversized response OOMs the worker. The same shape repeats in lib/integrations/reducto.ts at downloadImage (line ~266) and the presigned-result fetch in fetchJobResult (line ~212).

This PR adds defense in depth:

  1. New module lib/safe-url.ts — single source of truth for URL safety:

    • validateUrlShape(url, options) — synchronous validator. Scheme allowlist (default http: / https:), IPv6 bracket-unwrap, IP-literal deny-list covering RFC 1918 (10/8, 172.16/12, 192.168/16), 127/8 loopback, 169.254/16 link-local (AWS IMDS), 100.64/10 CGNAT, 0.0.0.0/8, IETF TEST-NET-1/2/3, multicast 224/4, reserved 240/4, broadcast, IPv6 ::1, fe80::/10, fc00::/7 ULA, ff00::/8 multicast, IPv4-mapped ::ffff:… re-checked against v4 rules. Plus a localhost-alias name list (localhost, localhost.localdomain, ip6-localhost, ip6-loopback, metadata.google.internal, metadata).
    • validateUrlAsync(url, options) — runs the shape validator, then dns.lookup(host, { all: true }) and re-validates every A/AAAA record against the IP deny-list.
    • boundedFetch(url, { maxBytes, timeoutMs, headers, redirect, allowedSchemes }): fetch wrapper that runs validateUrlAsync before the request, streams the response body via res.body.getReader() and aborts once it exceeds maxBytes, and re-validates the final URL after redirects (so a 302 from a public host to 169.254.169.254 is rejected even though the original URL passed).
  2. lib/pipeline/pdf-validation.ts: rawFetch now delegates to boundedFetch with maxBytes = 100 MiB (well above any real vendor data sheet). All existing callers of fetchAndValidatePdf (option-matrices.ts, sources.ts × 2, program-docs.ts, app/api/pipeline/sources/[id]/screenshot/route.ts; five call sites across four files) inherit the validation transparently, with no signature change.

  3. lib/integrations/reducto.ts: downloadImage (LLM/Reducto-supplied image URLs) and fetchJobResult (presigned result follow) both route through boundedFetch with maxBytes = 50 MiB. The Reducto API endpoints (platform.reducto.ai/upload, /parse_async, /job/{id}) are intentionally NOT routed through boundedFetch: they are trusted infrastructure with bearer auth, and passing them through the deny-list would only add latency.
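
As an illustration of the IPv4 part of the deny-list described in item 1, here is a minimal sketch of a CIDR membership check. It is not the code in lib/safe-url.ts; the names DENIED_V4, ipv4ToInt, inCidr, and isDeniedIPv4 are hypothetical, and the range table simply mirrors the ranges listed above (240.0.0.0/4 also covers the 255.255.255.255 broadcast address).

```typescript
// Hypothetical sketch of an IPv4 deny-list check; names are illustrative only.
type Cidr = readonly [base: string, prefix: number];

const DENIED_V4: readonly Cidr[] = [
  ["0.0.0.0", 8],       // "this network"
  ["10.0.0.0", 8],      // RFC 1918
  ["100.64.0.0", 10],   // CGNAT
  ["127.0.0.0", 8],     // loopback
  ["169.254.0.0", 16],  // link-local (AWS IMDS)
  ["172.16.0.0", 12],   // RFC 1918
  ["192.0.2.0", 24],    // TEST-NET-1
  ["192.168.0.0", 16],  // RFC 1918
  ["198.51.100.0", 24], // TEST-NET-2
  ["203.0.113.0", 24],  // TEST-NET-3
  ["224.0.0.0", 4],     // multicast
  ["240.0.0.0", 4],     // reserved, includes 255.255.255.255
];

// Parse a dotted quad into an unsigned 32-bit value, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p)) return null;
    const octet = Number(p);
    if (octet > 255) return null;
    n = n * 256 + octet; // plain arithmetic avoids signed 32-bit bit-shift pitfalls
  }
  return n;
}

// Two addresses share a /prefix network when they agree after dropping the host bits.
function inCidr(ip: number, [base, prefix]: Cidr): boolean {
  const baseInt = ipv4ToInt(base)!;
  const size = 2 ** (32 - prefix);
  return Math.floor(ip / size) === Math.floor(baseInt / size);
}

function isDeniedIPv4(ip: string): boolean {
  const n = ipv4ToInt(ip);
  return n !== null && DENIED_V4.some((c) => inCidr(n, c));
}
```

In this sketch a malformed string returns false, so a caller would still need to decide separately whether the host is an IP literal at all before relying on the result.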

Scope Details

No public API shape changes. rawFetch, fetchAndValidatePdf, downloadImage, and fetchJobResult keep their existing signatures. Callers see exactly the same { ok, status, buf, contentType, error } / PdfValidationResult / DownloadImageResult shapes — error paths now also surface a clear "host X is a private/reserved IP" / "scheme X not allowed" / "body exceeded N bytes" message instead of a generic network error.
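
The "body exceeded N bytes" path can be pictured with the following sketch of a size-capped stream read. It is an assumption about the general technique, not the PR's implementation: ChunkReader, BoundedResult, and readBounded are hypothetical names, and ChunkReader is a minimal interface chosen so that the reader returned by res.body.getReader() would satisfy it structurally.

```typescript
// Hypothetical sketch: read a chunked body but stop as soon as maxBytes is exceeded.
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
  cancel(): Promise<void>;
}

type BoundedResult =
  | { ok: true; buf: Uint8Array }
  | { ok: false; error: string };

async function readBounded(reader: ChunkReader, maxBytes: number): Promise<BoundedResult> {
  const chunks: Uint8Array[] = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (!value) continue;
    total += value.byteLength;
    if (total > maxBytes) {
      // Stop pulling data; nothing past the cap is ever buffered.
      await reader.cancel();
      return { ok: false, error: `body exceeded ${maxBytes} bytes` };
    }
    chunks.push(value);
  }
  // Concatenate the accepted chunks into one contiguous buffer.
  const buf = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    buf.set(c, offset);
    offset += c.byteLength;
  }
  return { ok: true, buf };
}
```

Compared with await res.arrayBuffer(), peak memory in this approach is bounded by roughly maxBytes plus one chunk, regardless of what Content-Length the server advertises.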

No dependency change, no schema change, no migration. node:dns/promises and node:net are stdlib. Existing npm run test (4 files / 35 tests) passes unchanged.

Known limitation — DNS rebinding. Between validateUrlAsync and the actual fetch, a malicious DNS server can serve a different answer. True defense requires resolving once, then pinning the request to that IP via an undici Agent / dispatcher. Documented in the module header as a deliberate follow-up.
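
To make the "resolve once, then pin" idea concrete, here is a hedged sketch of the resolution half only. The names resolveAndPin, LookupAll, and LookupAddress are hypothetical, the resolver and the deny-list predicate are injected so the example stays self-contained, and the dispatcher wiring that would force the connection to the returned address is deliberately left out.

```typescript
// Hypothetical sketch: resolve a host exactly once, validate every record,
// and return the single address the later connection should be pinned to.
type LookupAddress = { address: string; family: number };
type LookupAll = (host: string) => Promise<LookupAddress[]>;

async function resolveAndPin(
  host: string,
  lookupAll: LookupAll,
  isDenied: (addr: LookupAddress) => boolean,
): Promise<LookupAddress> {
  const records = await lookupAll(host);
  if (records.length === 0) {
    throw new Error(`host ${host} did not resolve`);
  }
  // Reject if ANY record is private/reserved, not just the first one.
  const bad = records.find(isDenied);
  if (bad) {
    throw new Error(`host ${host} resolves to private/reserved IP ${bad.address}`);
  }
  // The caller connects to this address and never asks DNS again.
  return records[0];
}
```

In a real worker, lookupAll could plausibly be (h) => lookup(h, { all: true }) using lookup from node:dns/promises, which resolves to an array of { address, family } records. The returned address would then be the only one the connection layer is allowed to use, while the original hostname is kept for the Host header and TLS server name.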

Trusted endpoints kept out of the deny-list. Calls to platform.reducto.ai/{upload,parse_async,job} in reducto.ts:106/145/166 continue to use raw fetch — they're trusted infrastructure, bearer-authenticated, hardcoded host, no LLM input. Routing them through boundedFetch would add a DNS round-trip per call with no security benefit.

Test Plan

  • npx tsc --noEmit — clean.
  • npm run test — 4 test files / 35 tests pass; no regressions.
  • Live deny-list repro against the new helpers — 13 attack URLs all rejected at the validateUrlShape step (no network IO):
    AWS IMDS v4         → "host 169.254.169.254 is a private/reserved IP"
    AWS IMDS v6         → "host fd00:ec2::254 is a private/reserved IP"
    loopback v4 / v6    → blocked
    localhost name      → "host localhost is a localhost alias"
    metadata.google.internal → blocked
    RFC 1918 10/8, 172.16, 192.168 → blocked
    ::ffff:127.0.0.1    → "host ::ffff:7f00:1 is a private/reserved IP"  (IPv4-mapped unwrap works)
    file://, gopher://, javascript: → "scheme X: not allowed"
    https://www.hpe.com/... (legitimate vendor PDF) → ok=true
    https://8.8.8.8/    (legitimate public IPv4)    → ok=true
    
  • boundedFetch against IMDS literal returns { ok: false, error: "host 169.254.169.254 is a private/reserved IP" } without making any network request.
  • Follow-up (out of scope for this PR): vitest coverage for each deny-list category + integration tests against a mock HTTP server for the maxBytes cap and redirect-to-private-IP rejection — sketched in the linked issue's "Tests" §4.
  • Follow-up: DNS-rebinding hardening via an undici Agent that pins the request to the IP resolved at validation time.
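
The scheme, localhost-alias, and IPv4-mapped cases in the repro above can be reproduced with a small sketch like the one below. It assumes the WHATWG URL parser built into Node, which keeps brackets around IPv6 literals in hostname and serializes ::ffff:127.0.0.1 as ::ffff:7f00:1. The names checkShape and unwrapMappedIPv4 and the "invalid URL" message are hypothetical; IP-range checks would still have to run on the returned host.

```typescript
// Hypothetical sketch of the scheme / alias / IPv4-mapped steps of a shape validator.
const ALLOWED_SCHEMES = new Set(["http:", "https:"]);
const LOCALHOST_ALIASES = new Set([
  "localhost", "localhost.localdomain", "ip6-localhost", "ip6-loopback",
  "metadata.google.internal", "metadata",
]);

type ShapeResult = { ok: true; host: string } | { ok: false; error: string };

// Turn an IPv4-mapped IPv6 literal into a dotted quad; null for anything else.
function unwrapMappedIPv4(host: string): string | null {
  const m = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/i.exec(host);
  if (!m) return null;
  const hi = parseInt(m[1], 16);
  const lo = parseInt(m[2], 16);
  return [hi >> 8, hi & 0xff, lo >> 8, lo & 0xff].join(".");
}

function checkShape(raw: string): ShapeResult {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return { ok: false, error: "invalid URL" };
  }
  if (!ALLOWED_SCHEMES.has(url.protocol)) {
    return { ok: false, error: `scheme ${url.protocol} not allowed` };
  }
  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const bare = url.hostname.replace(/^\[|\]$/g, "").toLowerCase();
  if (LOCALHOST_ALIASES.has(bare)) {
    return { ok: false, error: `host ${bare} is a localhost alias` };
  }
  // Hand back the v4 form so the same IPv4 deny-list can be applied to it.
  return { ok: true, host: unwrapMappedIPv4(bare) ?? bare };
}
```

An ok result here only means the URL passed the scheme and alias steps; for example, http://[::ffff:127.0.0.1]/ comes back with host 127.0.0.1, which a subsequent IPv4 range check would then reject as loopback.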

Development

Successfully merging this pull request may close these issues.

[bug] SSRF via LLM-controlled URLs in source acquisition + unbounded response buffering across pdf-validation and Reducto download paths
