From a7c02c31f8530de3fd27172a3a02d1046d02fef6 Mon Sep 17 00:00:00 2001
From: chad-loder <26261238+chad-loder@users.noreply.github.com>
Date: Tue, 12 May 2026 20:58:57 -0700
Subject: [PATCH] docs(examples): add regex-allowlist credential-leak example +
 index-page link
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a new worked example explaining why hand-rolled regexes are the wrong
tool for hostname allowlists, why URLPattern is the textbook fix, and how
to spell the common "host or any subdomain" policy as a component-aware
pattern rather than a regex tweak.

The canonical case is invoke-ai/InvokeAI#7518: a configuration field
where each trusted upstream gets a regex paired with a credential. The
naive entry ``url_regex: 'private.example'`` leaks the secret when the
client visits either of two attacker-controlled URL shapes:

  - https://malicious.example/private.example/theft.safetensors
    (path-segment fallthrough; re.search finds the literal anywhere)
  - https://private.example.malicious.example/theft.safetensors
    (subdomain shadowing; the legitimate label sits inside the attacker's host)

A component-aware URLPattern matches the hostname *as the hostname*; it
cannot be tricked into accepting a path segment, or the leading labels of an
attacker's host, that happen to spell the same text.

Wires the new page into:

  - docs/examples/index.md under a new "Security" section
  - properdocs.yml nav (alongside the webhook-shape validator)
  - docs/index.md home page — a sly inline note that hand-rolled regex
    URL allowlists are routinely error-prone, linking to the seed
    InvokeAI issue. The differentiator paragraph now closes with one
    extra clause that points security-curious readers at the example.

No code changes; docs only. ``just docs`` builds clean in strict mode.
---
 .../avoid-regex-hostname-allowlist-vulns.md   | 151 ++++++++++++++++++
 docs/examples/index.md                        |   4 +
 docs/index.md                                 |   5 +-
 properdocs.yml                                |   1 +
 4 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100644 docs/examples/avoid-regex-hostname-allowlist-vulns.md

diff --git a/docs/examples/avoid-regex-hostname-allowlist-vulns.md b/docs/examples/avoid-regex-hostname-allowlist-vulns.md
new file mode 100644
index 0000000..ab994f9
--- /dev/null
+++ b/docs/examples/avoid-regex-hostname-allowlist-vulns.md
@@ -0,0 +1,151 @@
+# Avoid regex hostname-allowlist credential leaks
+
+A common pattern: keep a list of "trusted" hosts and attach a credential
+(API token, cookie, signed header) when a request URL matches one of them.
+The obvious implementation — pick a regex per trusted host — has a security
+pitfall that's easy to miss in code review and even easier to exploit.
+
+## The vulnerability
+
+The seed case is
+[`invoke-ai/InvokeAI#7518`](https://github.com/invoke-ai/InvokeAI/issues/7518):
+a configuration field where users register one regex per trusted upstream,
+each paired with the credential the client should send when the URL matches.
+
+```yaml
+remote_api_tokens:
+  - url_regex: 'private.example'
+    token: 'secret'
+```
+
+The author's intent reads cleanly: *"when the request URL is for `private.example`,
+attach `secret`."* But Python's `re.search` looks for a substring match anywhere
+in the URL string, and a regex source is not a hostname — it's a flat character
+sequence with a different grammar. Two URL shapes that an attacker controls also
+match this regex:
+
+1. **Path-segment fallthrough.** A `re.search` on the URL string finds
+   `private.example` inside `https://malicious.example/private.example/theft.safetensors`.
+   The path contains the literal regex text, so the regex matches, the
+   credential is attached to the outbound request, and the secret lands on the
+   attacker's server.
+2. **Subdomain shadowing.** The same regex matches
+   `https://private.example.malicious.example/theft.safetensors`. The attacker
+   creates a subdomain of their own domain whose leading labels spell the
+   legitimate host's name; `re.search` finds `private.example` as a substring and the credential is attached.
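+
+A minimal sketch reproducing both matches with the standard library. It assumes
+the client applies the configured regex to the full URL string with `re.search`,
+as described above; the variable names are illustrative only.
+
+```python
+import re
+
+# The naive config entry, compiled as-is: unanchored, and the dot is unescaped.
+NAIVE = re.compile(r"private.example")
+
+URLS = {
+    "intended": "https://private.example/models/sd-xl.safetensors",
+    "path_fallthrough": "https://malicious.example/private.example/theft.safetensors",
+    "subdomain_shadowing": "https://private.example.malicious.example/theft.safetensors",
+}
+
+# re.search scans the whole string, so every URL counts as "trusted".
+naive_hits = {name: bool(NAIVE.search(url)) for name, url in URLS.items()}
+print(naive_hits)
+```
+
+All three entries come out `True`: the regex cannot tell the intended host
+from either attacker-controlled URL.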
+
+It is *possible* to write a regex that resists both, something like
+`^https://([^\[@/:?#\\]+\.)?private\.example/`, but a hardened regex is not
+visually distinguishable from a subtly broken one: drop `?` from the character
+class and `https://malicious.example?.private.example/` matches again. Nothing
+warns a user who gets it wrong, so every shipped configuration becomes a separate audit problem.
+
+## With URLPattern
+
+URLPattern matches on parsed URL *components*, not flat strings. A pattern
+constrained on the `hostname` component is structurally incapable of matching
+a path segment that happens to spell the same text, and a `hostname` literal
+matches the whole component — not a substring within it.
+
+```python
+from yarlpattern import URLPattern
+
+TRUSTED = URLPattern({
+    "protocol": "https",
+    "hostname": "private.example",
+})
+
+# Intended traffic
+TRUSTED.test("https://private.example/models/sd-xl.safetensors")          # True
+
+# The two attacks from above
+TRUSTED.test("https://malicious.example/private.example/theft.safetensors")    # False
+TRUSTED.test("https://private.example.malicious.example/theft.safetensors")    # False
+
+# Cleartext is rejected at the pattern level
+TRUSTED.test("http://private.example/models/sd-xl.safetensors")           # False
+```
+
+The first negative case fails because `private.example` (a path segment) is
+not the *hostname* — URLPattern parsed the URL first, then asked "does the
+hostname literal match?" The second fails because `private.example.malicious.example`
+is the full hostname, and `private.example` (the pattern) does not equal it.
+The third fails because `protocol: "https"` is in the pattern; there is no
+separate "and also require HTTPS" check to forget elsewhere.
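+
+The "parse first, then compare components" idea can be sketched with the
+standard library alone. This illustrates the principle and is not how
+URLPattern is implemented; `is_trusted` is a hypothetical helper.
+
+```python
+from urllib.parse import urlsplit
+
+def is_trusted(url: str, host: str = "private.example") -> bool:
+    """Compare parsed components instead of searching the raw string."""
+    parts = urlsplit(url)
+    # Scheme and hostname must each equal the expected value in full.
+    return parts.scheme == "https" and parts.hostname == host
+
+checks = [
+    is_trusted("https://private.example/models/sd-xl.safetensors"),
+    is_trusted("https://malicious.example/private.example/theft.safetensors"),
+    is_trusted("https://private.example.malicious.example/theft.safetensors"),
+    is_trusted("http://private.example/models/sd-xl.safetensors"),
+]
+print(checks)
+```
+
+Only the first URL passes. The other three fail on the hostname or scheme
+comparison, the same outcomes as the URLPattern calls above.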
+
+## Allowing legitimate subdomains
+
+If the desired policy is *"`private.example` itself or any of its subdomains,"*
+spell that as a component-aware pattern — not a regex tweak:
+
+```python
+TRUSTED = URLPattern({
+    "protocol": "https",
+    "hostname": "{:subdomain.}*private.example",
+})
+
+TRUSTED.test("https://private.example/models/sd-xl.safetensors")        # True
+TRUSTED.test("https://eu.private.example/models/sd-xl.safetensors")     # True
+
+# Still rejected — the attacker cannot prepend the legit host as a label
+TRUSTED.test("https://private.example.malicious.example/theft.safetensors")   # False
+```
+
+The `{:subdomain.}*` part matches zero or more dot-separated labels *before*
+the suffix `private.example`. It is parsed against the hostname component,
+so a host like `private.example.malicious.example`, which ends in
+`malicious.example`, cannot satisfy the suffix constraint.
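+
+For comparison, a rough standard-library sketch of the same "host or any
+subdomain" policy. `host_or_subdomain` is a hypothetical helper and only
+approximates the pattern above; it is not the library's implementation.
+
+```python
+from urllib.parse import urlsplit
+
+def host_or_subdomain(url: str, base: str = "private.example") -> bool:
+    """True for https URLs whose hostname is `base` or a subdomain of it."""
+    parts = urlsplit(url)
+    host = parts.hostname or ""
+    # The leading "." forces the suffix match to start on a label boundary,
+    # so "notprivate.example" does not count as a subdomain.
+    return parts.scheme == "https" and (host == base or host.endswith("." + base))
+
+results = {
+    "apex": host_or_subdomain("https://private.example/models/sd-xl.safetensors"),
+    "subdomain": host_or_subdomain("https://eu.private.example/models/sd-xl.safetensors"),
+    "shadowing": host_or_subdomain("https://private.example.malicious.example/theft.safetensors"),
+    "lookalike": host_or_subdomain("https://notprivate.example/theft.safetensors"),
+}
+print(results)
+```
+
+The apex and the real subdomain are accepted; the shadowing host and the
+lookalike are rejected because the match is anchored to the end of the parsed
+hostname and to a label boundary.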
+
+## Multi-host allowlist
+
+A list of trusted hosts is one URLPattern per host, kept next to the credential.
+The pattern table is a security-review artifact: a reviewer can read the
+allowlist directly without auditing imperative control flow.
+
+```python
+TRUSTED_UPSTREAMS = [
+    (URLPattern({"protocol": "https", "hostname": "private.example"}),
+     "secret-private-example"),
+    (URLPattern({"protocol": "https", "hostname": "{:subdomain.}*models.acme.example"}),
+     "secret-acme-models"),
+    (URLPattern({"protocol": "https", "hostname": "huggingface.co"}),
+     "secret-hf"),
+]
+
+def credential_for(url: str) -> str | None:
+    for pattern, token in TRUSTED_UPSTREAMS:
+        if pattern.test(url):
+            return token
+    return None
+```
+
+If `credential_for` returns `None`, the client sends the request unauthenticated.
+There is no way for an attacker-controlled URL to "almost match" a trusted entry.
+
+## What you get for free
+
+- **Component-aware matching by construction.** A hostname pattern matches
+  the hostname; a pathname pattern matches the pathname. The grammar of the
+  matcher mirrors the structure of the URL, so substring-fallthrough attacks
+  cannot reach the wrong field.
+- **WHATWG URL parsing under the hood.** Inputs are parsed via
+  [`yarl`](https://github.com/aio-libs/yarl), following the same WHATWG-flavoured
+  rules browsers apply, *before* the pattern is asked anything. Userinfo, ports,
+  trailing dots, and IDN labels are normalised first, so ambiguity tricks that
+  rely on the matcher seeing the raw string do not work.
+- **No manual scheme check.** `protocol: "https"` lives in the pattern;
+  cleartext HTTP cannot match. One fewer thing to forget in the call site.
+- **Auditable allowlist.** The list of trusted-host patterns is the
+  allowlist. Reviewers don't have to trace imperative control flow.
+
+## Further reading
+
+- The seed issue:
+  [`invoke-ai/InvokeAI#7518` — "remote_api_tokens should use URL Patterns instead of regular expressions"](https://github.com/invoke-ai/InvokeAI/issues/7518)
+- The class of bug: substring-vs-component confusion in URL-allowlist regexes
+  has repeatedly produced public CVEs in routing and reverse-proxy code;
+  searching CVE databases for "allowlist bypass" or "host header bypass via
+  regex" turns up real examples in projects far larger than InvokeAI.
+- Background on why the URLPattern spec exists in the first place — service
+  workers needed component-aware scope matching for exactly this reason:
+  see [**Overview → What is URLPattern?**](../overview/what-is-urlpattern.md).
diff --git a/docs/examples/index.md b/docs/examples/index.md
index b7c42a9..d8d369a 100644
--- a/docs/examples/index.md
+++ b/docs/examples/index.md
@@ -11,6 +11,10 @@ been verified against the test suite.
 - [Add subdomain routing to FastAPI](add-subdomain-routing-to-fastapi.md)
 - [Validate inbound webhooks by URL shape](validate-inbound-webhooks-by-url-shape.md)
 
+## Security
+
+- [Avoid regex hostname-allowlist credential leaks](avoid-regex-hostname-allowlist-vulns.md)
+
 ## AI / model serving
 
 - [Match the KServe `/v2/models` inference path](match-the-kserve-v2-inference-path.md)
diff --git a/docs/index.md b/docs/index.md
index 94f80d2..1e5721e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -44,7 +44,10 @@ result.pathname["groups"]["0"]         # 'users/42'
 That's the URLPattern differentiator: matching *across* protocol,
 hostname, port, path, and search at once, returning structured
 named-group results per component. Flask / FastAPI / Starlette `:id`
-routers only match the path.
+routers only match the path; [hand-rolled regexes for URL allowlists
+are routinely
+error-prone](https://github.com/invoke-ai/InvokeAI/issues/7518)
+because a regex source is a flat character sequence and a URL is not.
 
 ## Where to go next
 
diff --git a/properdocs.yml b/properdocs.yml
index dce9060..28eff65 100644
--- a/properdocs.yml
+++ b/properdocs.yml
@@ -94,6 +94,7 @@ nav:
       - Add subdomain routing to aiohttp: examples/add-subdomain-routing-to-aiohttp.md
       - Add subdomain routing to FastAPI: examples/add-subdomain-routing-to-fastapi.md
       - Validate inbound webhooks by URL shape: examples/validate-inbound-webhooks-by-url-shape.md
+      - Avoid regex hostname-allowlist credential leaks: examples/avoid-regex-hostname-allowlist-vulns.md
       - Match the KServe /v2/models inference path: examples/match-the-kserve-v2-inference-path.md
       - Pick an LLM backend by model name: examples/pick-an-llm-backend-by-model-name.md
       - Replace MCP resource URI templates: examples/replace-mcp-resource-uri-templates.md