From a7c02c31f8530de3fd27172a3a02d1046d02fef6 Mon Sep 17 00:00:00 2001
From: chad-loder <26261238+chad-loder@users.noreply.github.com>
Date: Tue, 12 May 2026 20:58:57 -0700
Subject: [PATCH] docs(examples): add regex-allowlist credential-leak example + index-page link
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a new worked example explaining why hand-rolled regexes are the
wrong tool for hostname allowlists, why URLPattern is the textbook fix,
and how to spell the common "host or any subdomain" policy as a
component-aware pattern rather than a regex tweak.

The canonical case is invoke-ai/InvokeAI#7518: a configuration field
where each trusted upstream gets a regex paired with a credential. The
naive entry ``url_regex: 'private.example'`` leaks the secret when the
client visits either of two attacker-controlled URL shapes:

- https://malicious.example/private.example/theft.safetensors
  (path-segment fallthrough; re.search finds the literal anywhere)
- https://private.example.malicious.example/theft.safetensors
  (subdomain shadowing; the legitimate label sits inside the
  attacker's host)

A component-aware URLPattern matches the hostname *as the hostname*; it
cannot be tricked into accepting a path segment or a
label-of-attacker's-host that happens to spell the same text.

Wires the new page into:

- docs/examples/index.md under a new "Security" section
- properdocs.yml nav (alongside the webhook-shape validator)
- docs/index.md home page — a sly inline note that hand-rolled regex
  URL allowlists are routinely error-prone, linking to the seed
  InvokeAI issue. The differentiator paragraph now closes with one
  extra clause that points security-curious readers at the example.

No code changes; docs only. ``just docs`` builds clean in strict mode.
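
For reviewers who want to reproduce the two URL shapes above without
installing anything, here is a minimal standard-library sketch. It assumes
the client applies each configured ``url_regex`` with ``re.search`` against
the full URL string, as described above. ``REMOTE_API_TOKENS``,
``token_for`` and ``hostname_of`` are hypothetical names used only for
illustration; they are not InvokeAI's or this project's actual code.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit

# Hypothetical config mirroring the naive entry: (regex source, secret).
REMOTE_API_TOKENS = [("private.example", "secret")]


def token_for(url: str) -> str | None:
    """Naive lookup: re.search accepts a match anywhere in the URL string."""
    for url_regex, token in REMOTE_API_TOKENS:
        if re.search(url_regex, url):
            return token
    return None


def hostname_of(url: str) -> str | None:
    """Parse the URL first, then look only at the hostname component."""
    return urlsplit(url).hostname


urls = [
    # Intended traffic
    "https://private.example/models/sd-xl.safetensors",
    # Path-segment fallthrough
    "https://malicious.example/private.example/theft.safetensors",
    # Subdomain shadowing
    "https://private.example.malicious.example/theft.safetensors",
]

for url in urls:
    regex_attaches_token = token_for(url) is not None
    hostname_is_trusted = hostname_of(url) == "private.example"
    print(url)
    print(f"  regex attaches token: {regex_attaches_token}")
    print(f"  parsed hostname is private.example: {hostname_is_trusted}")
```

The regex lookup returns the token for all three URLs, while only the
first one has ``private.example`` as its parsed hostname.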
---
 .../avoid-regex-hostname-allowlist-vulns.md   | 151 ++++++++++++++++++
 docs/examples/index.md                        |   4 +
 docs/index.md                                 |   5 +-
 properdocs.yml                                |   1 +
 4 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100644 docs/examples/avoid-regex-hostname-allowlist-vulns.md

diff --git a/docs/examples/avoid-regex-hostname-allowlist-vulns.md b/docs/examples/avoid-regex-hostname-allowlist-vulns.md
new file mode 100644
index 0000000..ab994f9
--- /dev/null
+++ b/docs/examples/avoid-regex-hostname-allowlist-vulns.md
@@ -0,0 +1,151 @@
+# Avoid regex hostname-allowlist credential leaks
+
+A common pattern: keep a list of "trusted" hosts and attach a credential
+(API token, cookie, signed header) when a request URL matches one of them.
+The obvious implementation — pick a regex per trusted host — has a security
+pitfall that's easy to miss in code review and even easier to exploit.
+
+## The vulnerability
+
+The seed case is
+[`invoke-ai/InvokeAI#7518`](https://github.com/invoke-ai/InvokeAI/issues/7518):
+a configuration field where users register one regex per trusted upstream,
+each paired with the credential the client should send when the URL matches.
+
+```yaml
+remote_api_tokens:
+  - url_regex: 'private.example'
+    token: 'secret'
+```
+
+The author's intent reads cleanly: *"when the request URL is for `private.example`,
+attach `secret`."* But Python's `re.search` looks for a substring match anywhere
+in the URL string, and a regex source is not a hostname — it's a flat character
+sequence with a different grammar. Two URL shapes that an attacker controls also
+match this regex:
+
+1. **Path-segment fallthrough.** A `re.search` on the URL string finds
+   `private.example` inside `https://malicious.example/private.example/theft.safetensors`.
+   The path contains the literal regex text, so the regex matches, the
+   credential is attached to the outbound request, and the secret lands on the
+   attacker's server.
+2.
**Subdomain shadowing.** The same regex matches
+   `https://private.example.malicious.example/theft.safetensors`. The attacker
+   simply registers a subdomain whose label is the legitimate host's name; the
+   regex sees `private.example` as a substring and attaches the credential.
+
+It is *possible* to write a regex that resists both — something like
+`^https://([^[@/:?#\\]+\.)?private\.example/` — but the difference between the
+naive version and the correct one is not visually obvious, and there's no
+compiler warning when a user gets it wrong. Every shipped configuration
+becomes a separate audit problem.
+
+## With URLPattern
+
+URLPattern matches on parsed URL *components*, not flat strings. A pattern
+constrained on the `hostname` component is structurally incapable of matching
+a path segment that happens to spell the same text, and a `hostname` literal
+matches the whole component — not a substring within it.
+
+```python
+from yarlpattern import URLPattern
+
+TRUSTED = URLPattern({
+    "protocol": "https",
+    "hostname": "private.example",
+})
+
+# Intended traffic
+TRUSTED.test("https://private.example/models/sd-xl.safetensors")  # True
+
+# The two attacks from above
+TRUSTED.test("https://malicious.example/private.example/theft.safetensors")  # False
+TRUSTED.test("https://private.example.malicious.example/theft.safetensors")  # False
+
+# Cleartext is rejected at the pattern level
+TRUSTED.test("http://private.example/models/sd-xl.safetensors")  # False
+```
+
+The first negative case fails because `private.example` (a path segment) is
+not the *hostname* — URLPattern parsed the URL first, then asked "does the
+hostname literal match?" The second fails because `private.example.malicious.example`
+is the full hostname, and `private.example` (the pattern) does not equal it.
+The third fails because `protocol: "https"` is in the pattern; there is no
+separate "and also require HTTPS" check to forget elsewhere.
+
+## Allowing legitimate subdomains
+
+If the desired policy is *"`private.example` itself or any of its subdomains,"*
+spell that as a component-aware pattern — not a regex tweak:
+
+```python
+TRUSTED = URLPattern({
+    "protocol": "https",
+    "hostname": "{:subdomain.}*private.example",
+})
+
+TRUSTED.test("https://private.example/models/sd-xl.safetensors")  # True
+TRUSTED.test("https://eu.private.example/models/sd-xl.safetensors")  # True
+
+# Still rejected — the attacker cannot prepend the legit host as a label
+TRUSTED.test("https://private.example.malicious.example/theft.safetensors")  # False
+```
+
+The `{:subdomain.}*` part matches zero or more dot-separated labels *before*
+the suffix `private.example`. It is parsed against the hostname component,
+so a host like `private.example.malicious.example`, which ends in
+`malicious.example`, cannot satisfy the suffix constraint.
+
+## Multi-host allowlist
+
+A list of trusted hosts is one URLPattern per host, kept next to the credential.
+The pattern table is a security-review artifact: a reviewer can read the
+allowlist directly without auditing imperative control flow.
+
+```python
+TRUSTED_UPSTREAMS = [
+    (URLPattern({"protocol": "https", "hostname": "private.example"}),
+     "secret-private-example"),
+    (URLPattern({"protocol": "https", "hostname": "{:subdomain.}*models.acme.example"}),
+     "secret-acme-models"),
+    (URLPattern({"protocol": "https", "hostname": "huggingface.co"}),
+     "secret-hf"),
+]
+
+def credential_for(url: str) -> str | None:
+    for pattern, token in TRUSTED_UPSTREAMS:
+        if pattern.test(url):
+            return token
+    return None
+```
+
+If `credential_for` returns `None`, the client sends the request unauthenticated.
+There is no way for an attacker-controlled URL to "almost match" a trusted entry.
+
+## What you get for free
+
+- **Component-aware matching by construction.** A hostname pattern matches
+  the hostname; a pathname pattern matches the pathname.
The grammar of the
+  matcher mirrors the structure of the URL, so substring-fallthrough attacks
+  cannot reach the wrong field.
+- **WHATWG URL parsing under the hood.** Inputs are parsed via
+  [`yarl`](https://github.com/aio-libs/yarl) — the same WHATWG-flavoured rules
+  browsers apply — *before* the pattern is asked anything. Userinfo, ports,
+  trailing dots, and IDN labels are normalised up front, which defuses the
+  parser-ambiguity tricks an attacker would otherwise rely on.
+- **No manual scheme check.** `protocol: "https"` lives in the pattern;
+  cleartext HTTP cannot match. One fewer thing to forget in the call site.
+- **Auditable allowlist.** The list of trusted-host patterns is the
+  allowlist. Reviewers don't have to trace imperative control flow.
+
+## Further reading
+
+- The seed issue:
+  [`invoke-ai/InvokeAI#7518` — "remote_api_tokens should use URL Patterns instead of regular expressions"](https://github.com/invoke-ai/InvokeAI/issues/7518)
+- The class of bug: substring-vs-component confusion in URL-allowlist regexes
+  has produced public CVEs in routing and reverse-proxy code repeatedly;
+  searching CVE databases for "URL allowlist bypass" or "host header bypass via regex" turns
+  up real examples in projects far larger than InvokeAI.
+- Background on why the URLPattern spec exists in the first place — service
+  workers needed component-aware scope matching for exactly this reason:
+  see [**Overview → What is URLPattern?**](../overview/what-is-urlpattern.md).
diff --git a/docs/examples/index.md b/docs/examples/index.md
index b7c42a9..d8d369a 100644
--- a/docs/examples/index.md
+++ b/docs/examples/index.md
@@ -11,6 +11,10 @@ been verified against the test suite.
 - [Add subdomain routing to FastAPI](add-subdomain-routing-to-fastapi.md)
 - [Validate inbound webhooks by URL shape](validate-inbound-webhooks-by-url-shape.md)
 
+## Security
+
+- [Avoid regex hostname-allowlist credential leaks](avoid-regex-hostname-allowlist-vulns.md)
+
 ## AI / model serving
 
 - [Match the KServe `/v2/models` inference path](match-the-kserve-v2-inference-path.md)
diff --git a/docs/index.md b/docs/index.md
index 94f80d2..1e5721e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -44,7 +44,10 @@ result.pathname["groups"]["0"] # 'users/42'
 That's the URLPattern differentiator: matching *across* protocol,
 hostname, port, path, and search at once, returning structured
 named-group results per component. Flask / FastAPI / Starlette `:id`
-routers only match the path.
+routers only match the path; [hand-rolled regexes for URL allowlists
+are routinely
+error-prone](https://github.com/invoke-ai/InvokeAI/issues/7518)
+because a regex source is a flat character sequence and a URL is not.
 
 ## Where to go next
 
diff --git a/properdocs.yml b/properdocs.yml
index dce9060..28eff65 100644
--- a/properdocs.yml
+++ b/properdocs.yml
@@ -94,6 +94,7 @@ nav:
     - Add subdomain routing to aiohttp: examples/add-subdomain-routing-to-aiohttp.md
     - Add subdomain routing to FastAPI: examples/add-subdomain-routing-to-fastapi.md
     - Validate inbound webhooks by URL shape: examples/validate-inbound-webhooks-by-url-shape.md
+    - Avoid regex hostname-allowlist credential leaks: examples/avoid-regex-hostname-allowlist-vulns.md
     - Match the KServe /v2/models inference path: examples/match-the-kserve-v2-inference-path.md
     - Pick an LLM backend by model name: examples/pick-an-llm-backend-by-model-name.md
     - Replace MCP resource URI templates: examples/replace-mcp-resource-uri-templates.md