Skip to content

fix: validate URL origin in fetch_hf_docs to prevent SSRF#83

Open
JasonOA888 wants to merge 2 commits intohuggingface:mainfrom
JasonOA888:fix/ssrf-fetch-hf-docs
Open

fix: validate URL origin in fetch_hf_docs to prevent SSRF#83
JasonOA888 wants to merge 2 commits intohuggingface:mainfrom
JasonOA888:fix/ssrf-fetch-hf-docs

Conversation

@JasonOA888
Copy link
Copy Markdown

Summary

fetch_hf_docs accepted arbitrary URLs from LLM-generated arguments without origin validation. Since the request carries the user's HF Bearer token (Authorization: Bearer {hf_token}) and uses follow_redirects=True, a crafted prompt could trick the LLM into sending the token to an attacker-controlled server.

Fix

Add _is_allowed_doc_url() which restricts requests to huggingface.co, hf.co, and gradio.app — the three origins this tool is actually designed to fetch from.

The check runs before the HTTP request is made, so no auth headers ever leave allowed origins.

Before

url = arguments.get("url", "")
if not url:
    return "Error: No URL provided", False
# ... directly fetches url with Bearer token

After

url = arguments.get("url", "")
if not url:
    return "Error: No URL provided", False
if not _is_allowed_doc_url(url):
    return ("Error: URL not allowed. ...", False)
# ... only reaches here for allowed origins

Testing

  • Syntax validated (ast.parse clean)
  • Allowlist covers all existing doc endpoints (huggingface.co/docs/*, gradio.app/llms.txt)
  • Subdomains supported via host.endswith() check

The fetch_hf_docs tool accepted arbitrary URLs from LLM-generated
arguments without origin validation. Since the request carries the
user's HF Bearer token and follows redirects, a crafted prompt could
trick the LLM into exfiltrating the token to an attacker-controlled
server.

Add _is_allowed_doc_url() which restricts fetch requests to
huggingface.co, hf.co, and gradio.app origins (matching the tool's
intended purpose of fetching HF/Gradio documentation).
23 test cases covering:
- Allowed origins (exact + subdomain)
- Blocked schemes (HTTP, FTP)
- Disallowed hosts (evil.com, metadata endpoint, prefix attacks)
- SSRF payloads (127.0.0.1, 0.0.0.0, ::1, localhost)
- Edge cases (empty, garbage, port numbers)
@akseljoonas
Copy link
Copy Markdown
Collaborator

@claude please review this PR. Focus on correctness, security, LiteLLM/Bedrock routing impact, and whether it's safe to merge against current main. Keep it concise and prefer inline comments where it matters.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Claude finished @akseljoonas's task in 1m 37s —— View job


PR Review: fix SSRF in fetch_hf_docs

  • Read changed files (docs_tools.py, test_docs_tools_ssrf.py)
  • Diff against origin/main
  • Review: correctness, security, routing impact, merge safety

Verdict: ✅ Safe to merge — real fix, tight scope, good tests, zero routing impact.


Security correctness

The _is_allowed_doc_url guard is well-implemented:

  • parsed.hostname not netloc — port numbers (huggingface.co:8080) don't bypass the check. ✓
  • HTTPS-only — rejects HTTP/FTP. ✓
  • Subdomain via endswith(f".{d}") — the leading dot prevents evil-huggingface.co from matching. ✓
  • @-trick (https://evil.com@huggingface.co/) — safe: urlparse returns hostname = "huggingface.co"; httpx routes to that host. ✓

Redirect concern is a non-issue. httpx strips the Authorization header on cross-origin redirects (RFC-compliant behavior), so even with follow_redirects=True the Bearer token cannot leak to an external origin via a redirect chain. The allowlist check + httpx's behavior provide defence in depth.

One thing worth noting

docs_tools.py:424-425 rewrites the URL after origin validation:

if not url.endswith(".md"):
    url = f"{url}.md"

Appending .md to the path doesn't affect the hostname, so this isn't a bypass — but it means the validated URL and the fetched URL are different. No action needed, just worth knowing if anyone debugs a 404 later.

Out-of-scope paths (no regression introduced)

_fetch_endpoint_docs (line 100) constructs URLs from a hardcoded https://huggingface.co/docs/{endpoint} prefix — origin is fixed, not LLM-controlled. Not affected by this PR and not a new risk.

LiteLLM / Bedrock routing

Zero impact. No model routing, client configuration, or LiteLLM/Bedrock code is touched. hf_docs_fetch_handler's signature is unchanged.

Tests

Coverage is solid: allowed origins, subdomains, wrong schemes, disallowed hosts, internal SSRF payloads, edge cases. The importlib + manual stub approach is a bit fragile if docs_tools.py gains new top-level imports, but it works today.

One minor dead-code note: urlparse in CPython never raises, so the try/except Exception: return False at docs_tools.py:394-396 is unreachable. Not a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants