fix: validate URL origin in fetch_hf_docs to prevent SSRF by JasonOA888 · Pull Request #83 · huggingface/ml-intern

JasonOA888 · 2026-04-23T17:18:29Z

Summary

fetch_hf_docs accepted arbitrary URLs from LLM-generated arguments without origin validation. Since the request carries the user's HF Bearer token (Authorization: Bearer {hf_token}) and uses follow_redirects=True, a crafted prompt could trick the LLM into sending the token to an attacker-controlled server.

Fix

Add _is_allowed_doc_url() which restricts requests to huggingface.co, hf.co, and gradio.app — the three origins this tool is actually designed to fetch from.

The check runs before the HTTP request is made, so no auth headers ever leave allowed origins.

Before

url = arguments.get("url", "")
if not url:
    return "Error: No URL provided", False
# ... directly fetches url with Bearer token

After

url = arguments.get("url", "")
if not url:
    return "Error: No URL provided", False
if not _is_allowed_doc_url(url):
    return ("Error: URL not allowed. ...", False)
# ... only reaches here for allowed origins

Testing

Syntax validated (ast.parse clean)
Allowlist covers all existing doc endpoints (huggingface.co/docs/*, gradio.app/llms.txt)
Subdomains supported via host.endswith() check

The fetch_hf_docs tool accepted arbitrary URLs from LLM-generated arguments without origin validation. Since the request carries the user's HF Bearer token and follows redirects, a crafted prompt could trick the LLM into exfiltrating the token to an attacker-controlled server. Add _is_allowed_doc_url() which restricts fetch requests to huggingface.co, hf.co, and gradio.app origins (matching the tool's intended purpose of fetching HF/Gradio documentation).

23 test cases covering: - Allowed origins (exact + subdomain) - Blocked schemes (HTTP, FTP) - Disallowed hosts (evil.com, metadata endpoint, prefix attacks) - SSRF payloads (127.0.0.1, 0.0.0.0, ::1, localhost) - Edge cases (empty, garbage, port numbers)

akseljoonas · 2026-04-24T14:08:07Z

@claude please review this PR. Focus on correctness, security, LiteLLM/Bedrock routing impact, and whether it's safe to merge against current main. Keep it concise and prefer inline comments where it matters.

claude · 2026-04-24T14:08:24Z

Claude finished @akseljoonas's task in 1m 37s —— View job

PR Review: fix SSRF in `fetch_hf_docs`

Read changed files (docs_tools.py, test_docs_tools_ssrf.py)
Diff against origin/main
Review: correctness, security, routing impact, merge safety

Verdict: ✅ Safe to merge — real fix, tight scope, good tests, zero routing impact.

Security correctness

The _is_allowed_doc_url guard is well-implemented:

parsed.hostname not netloc — port numbers (huggingface.co:8080) don't bypass the check. ✓
HTTPS-only — rejects HTTP/FTP. ✓
Subdomain via endswith(f".{d}") — the leading dot prevents evil-huggingface.co from matching. ✓
@-trick (https://evil.com@huggingface.co/) — safe: urlparse returns hostname = "huggingface.co"; httpx routes to that host. ✓

Redirect concern is a non-issue. httpx strips the Authorization header on cross-origin redirects (RFC-compliant behavior), so even with follow_redirects=True the Bearer token cannot leak to an external origin via a redirect chain. The allowlist check + httpx's behavior provide defence in depth.

One thing worth noting

docs_tools.py:424-425 rewrites the URL after origin validation:

if not url.endswith(".md"):
    url = f"{url}.md"

Appending .md to the path doesn't affect the hostname, so this isn't a bypass — but it means the validated URL and the fetched URL are different. No action needed, just worth knowing if anyone debugs a 404 later.

Out-of-scope paths (no regression introduced)

_fetch_endpoint_docs (line 100) constructs URLs from a hardcoded https://huggingface.co/docs/{endpoint} prefix — origin is fixed, not LLM-controlled. Not affected by this PR and not a new risk.

LiteLLM / Bedrock routing

Zero impact. No model routing, client configuration, or LiteLLM/Bedrock code is touched. hf_docs_fetch_handler's signature is unchanged.

Tests

Coverage is solid: allowed origins, subdomains, wrong schemes, disallowed hosts, internal SSRF payloads, edge cases. The importlib + manual stub approach is a bit fragile if docs_tools.py gains new top-level imports, but it works today.

One minor dead-code note: urlparse in CPython never raises, so the try/except Exception: return False at docs_tools.py:394-396 is unreachable. Not a bug.

JasonOA888 added 2 commits April 24, 2026 01:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: validate URL origin in fetch_hf_docs to prevent SSRF#83

fix: validate URL origin in fetch_hf_docs to prevent SSRF#83
JasonOA888 wants to merge 2 commits intohuggingface:mainfrom
JasonOA888:fix/ssrf-fetch-hf-docs

JasonOA888 commented Apr 23, 2026

Uh oh!

akseljoonas commented Apr 24, 2026

Uh oh!

claude Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JasonOA888 commented Apr 23, 2026

Summary

Fix

Before

After

Testing

Uh oh!

akseljoonas commented Apr 24, 2026

Uh oh!

claude Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix SSRF in fetch_hf_docs

Security correctness

One thing worth noting

Out-of-scope paths (no regression introduced)

LiteLLM / Bedrock routing

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Apr 24, 2026 •

edited

Loading

PR Review: fix SSRF in `fetch_hf_docs`