Validate URLs and bound response sizes to block SSRF#41
Open
galuis116 wants to merge 1 commit into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What Changed
Closes #40.
The source-acquisition pipeline takes URLs returned by the web-search LLM in
lib/pipeline/sources.ts:findSourcesand feeds them directly tofetchvialib/pipeline/pdf-validation.ts:rawFetch. There is no scheme check, no host allow-list, no IP-range deny-list, and no body-size cap onawait res.arrayBuffer(). A poisoned web-search result can drive the worker into requestinghttp://169.254.169.254/latest/meta-data/...(AWS IMDS),http://[fd00:ec2::254]/...(IPv6 IMDS),http://localhost:5432/, etc. — and an oversized response OOMs the worker. The same shape repeats inlib/integrations/reducto.tsatdownloadImage(line ~266) and the presigned-result fetch infetchJobResult(line ~212).This PR adds defense in depth:
New module
lib/safe-url.ts— single source of truth for URL safety:validateUrlShape(url, options)— synchronous validator. Scheme allowlist (defaulthttp:/https:), IPv6 bracket-unwrap, IP-literal deny-list covering RFC 1918 (10/8, 172.16/12, 192.168/16), 127/8 loopback, 169.254/16 link-local (AWS IMDS), 100.64/10 CGNAT, 0.0.0.0/8, IETF TEST-NET-1/2/3, multicast 224/4, reserved 240/4, broadcast, IPv6::1,fe80::/10,fc00::/7ULA,ff00::/8multicast, IPv4-mapped::ffff:…re-checked against v4 rules. Plus a localhost-alias name list (localhost,localhost.localdomain,ip6-localhost,ip6-loopback,metadata.google.internal,metadata).validateUrlAsync(url, options)— runs the shape validator, thendns.lookup(host, { all: true })and re-validates every A/AAAA record against the IP deny-list.boundedFetch(url, { maxBytes, timeoutMs, headers, redirect, allowedSchemes })— fetch wrapper that runsvalidateUrlAsyncbefore the request, streams the response body viares.body.getReader()and aborts pastmaxBytes, re-validates the FINAL URL after redirects (so a 302 from a public host to169.254.169.254is rejected even though the original URL passed).lib/pipeline/pdf-validation.ts—rawFetchnow delegates toboundedFetchwithmaxBytes = 100 MiB(well above any real vendor data-sheet). All four existing callers offetchAndValidatePdf(option-matrices.ts,sources.ts× 2,program-docs.ts,app/api/pipeline/sources/[id]/screenshot/route.ts) inherit the validation transparently — no signature change.lib/integrations/reducto.ts—downloadImage(LLM/Reducto-supplied image URLs) andfetchJobResult(presigned result follow) both route throughboundedFetchwithmaxBytes = 50 MiB. The trusted Reducto API endpoints (platform.reducto.ai/upload,/parse_async,/job/{id}) are intentionally NOT routed through — they're trusted infrastructure with bearer auth, and going through the deny-list would just add latency.Scope Details
No public API shape changes.
rawFetch,fetchAndValidatePdf,downloadImage, andfetchJobResultkeep their existing signatures. Callers see exactly the same{ ok, status, buf, contentType, error }/PdfValidationResult/DownloadImageResultshapes — error paths now also surface a clear"host X is a private/reserved IP"/"scheme X not allowed"/"body exceeded N bytes"message instead of a generic network error.No dependency change, no schema change, no migration.
node:dns/promisesandnode:netare stdlib. Existingnpm run test(4 files / 35 tests) passes unchanged.Known limitation — DNS rebinding. Between
validateUrlAsyncand the actualfetch, a malicious DNS server can serve a different answer. True defense requires resolving once, then pinning the request to that IP via an undici Agent / dispatcher. Documented in the module header as a deliberate follow-up.Trusted endpoints kept out of the deny-list. Calls to
platform.reducto.ai/{upload,parse_async,job}inreducto.ts:106/145/166continue to use rawfetch— they're trusted infrastructure, bearer-authenticated, hardcoded host, no LLM input. Routing them throughboundedFetchwould add a DNS round-trip per call with no security benefit.Test Plan
npx tsc --noEmit— clean.npm run test— 4 test files / 35 tests pass; no regressions.validateUrlShapestep (no network IO):boundedFetchagainst IMDS literal returns{ ok: false, error: "host 169.254.169.254 is a private/reserved IP" }without making any network request.maxBytescap and redirect-to-private-IP rejection — sketched in the linked issue's "Tests" §4.