🛡️ Sentinel: [HIGH] Fix Server-Side Request Forgery in JobParser #224
Conversation
Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
Reviewer's Guide

Adds SSRF protection to JobParser URL fetching by introducing a safe URL fetch helper that validates scheme and resolved IP ranges, updates error handling accordingly, documents the incident in sentinel.md, and includes ad-hoc scripts for manually testing blocked metadata and 0.0.0.0 endpoints.

Sequence diagram for JobParser.parse_from_url with SSRF-safe fetching:

```mermaid
sequenceDiagram
    actor Client
    participant JobParser
    participant _fetch_url_safe
    participant DNSResolver
    participant IPAddressValidator
    participant HttpClient
    Client->>JobParser: parse_from_url(url)
    JobParser->>_fetch_url_safe: _fetch_url_safe(url)
    _fetch_url_safe->>_fetch_url_safe: parse URL
    alt invalid_scheme
        _fetch_url_safe-->>JobParser: ValueError (invalid scheme)
        JobParser-->>Client: RuntimeError Failed to fetch URL
    end
    _fetch_url_safe->>DNSResolver: getaddrinfo(hostname)
    DNSResolver-->>_fetch_url_safe: addrinfo or socket.gaierror
    alt resolution_failed
        _fetch_url_safe-->>JobParser: ValueError could not resolve hostname
        JobParser-->>Client: RuntimeError Failed to fetch URL
    else resolution_succeeded
        loop each_resolved_address
            _fetch_url_safe->>IPAddressValidator: ip_address(ip_addr)
            IPAddressValidator-->>_fetch_url_safe: ip object or ValueError
            alt invalid_ip
                _fetch_url_safe-->>JobParser: ValueError invalid IP address
                JobParser-->>Client: RuntimeError Failed to fetch URL
            else internal_or_disallowed_range
                _fetch_url_safe-->>JobParser: RuntimeError internal IP not allowed
                JobParser-->>Client: RuntimeError Failed to fetch URL
            end
        end
        _fetch_url_safe->>HttpClient: requests.get(url, headers, timeout=30)
        HttpClient-->>_fetch_url_safe: Response or RequestException
        alt http_error
            _fetch_url_safe-->>JobParser: requests.RequestException
            JobParser-->>Client: RuntimeError Failed to fetch URL
        else http_success
            _fetch_url_safe-->>JobParser: Response
            JobParser->>JobParser: response.raise_for_status()
            JobParser->>JobParser: _parse_html(response.text)
            JobParser-->>Client: JobDetails
        end
    end
```
Updated class diagram for JobParser with SSRF-safe URL fetching:

```mermaid
classDiagram
    class JobParser {
        +parse_from_file(file_path: Path, url: Optional[str]) JobDetails
        +parse_from_url(url: str) JobDetails
        -_fetch_url_safe(url: str) requests.Response
        -_parse_html(html: str) JobDetails
    }
    class Requests {
        +get(url: str, headers: dict, timeout: int) Response
    }
    class Response {
        +text: str
        +raise_for_status() None
    }
    JobParser ..> Requests: uses
    JobParser ..> Response: returns_and_consumes
```
Flow diagram for _fetch_url_safe SSRF validation logic:

```mermaid
flowchart TD
    A_start["Start _fetch_url_safe(url)"] --> B_parse_url["Parse URL with urlparse"]
    B_parse_url --> C_check_scheme{"Is scheme http or https?"}
    C_check_scheme -- No --> Z1_invalid_scheme["Raise ValueError Invalid URL scheme"]
    C_check_scheme -- Yes --> D_check_hostname{"Hostname present?"}
    D_check_hostname -- No --> Z2_no_hostname["Raise ValueError Invalid URL no hostname"]
    D_check_hostname -- Yes --> E_resolve_dns["Call socket.getaddrinfo(hostname, None)"]
    E_resolve_dns --> F_dns_result{"DNS resolution successful?"}
    F_dns_result -- "No (socket.gaierror)" --> Z3_dns_error["Raise ValueError Could not resolve hostname"]
    F_dns_result -- Yes --> G_loop_addresses["For each addrinfo result"]
    G_loop_addresses --> H_extract_ip["Extract ip_addr from addrinfo"]
    H_extract_ip --> I_parse_ip["Parse with ipaddress.ip_address(ip_addr)"]
    I_parse_ip --> J_ip_parse_ok{"IP parsed successfully?"}
    J_ip_parse_ok -- "No (ValueError)" --> Z4_invalid_ip["Raise ValueError Invalid IP address resolved"]
    J_ip_parse_ok -- Yes --> K_check_range{"IP is private, loopback, link-local, reserved, or unspecified?"}
    K_check_range -- Yes --> Z5_internal_ip["Raise RuntimeError Access to internal IP address not allowed"]
    K_check_range -- No --> L_next_or_continue["Next resolved address or proceed"]
    L_next_or_continue --> M_all_checked{"All resolved addresses checked?"}
    M_all_checked -- No --> G_loop_addresses
    M_all_checked -- Yes --> N_make_request["Call requests.get(url, headers, timeout=30)"]
    N_make_request --> O_return_response["Return Response"]
    O_return_response --> P_end["End _fetch_url_safe"]
```
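The validation steps in the diagrams above can be sketched as a stdlib-only helper. This is a hedged sketch: the function name and error messages are assumptions drawn from the diagrams, and the real `_fetch_url_safe` goes on to call `requests.get(url, headers, timeout=30)` and return the response, which is elided here.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Address classes the helper refuses to contact (per the flowchart above)
BLOCKED_FLAGS = ("is_private", "is_loopback", "is_link_local",
                 "is_reserved", "is_unspecified")

def validate_outbound_url(url: str) -> str:
    """Validate scheme and every resolved IP for an outbound fetch.

    Returns the hostname on success; raises ValueError for malformed or
    unresolvable URLs and RuntimeError for internal/disallowed targets.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Invalid URL scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("Invalid URL: no hostname")
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror as e:
        raise ValueError(f"Could not resolve hostname: {e}")
    for info in infos:
        ip_str = info[4][0]  # sockaddr tuple; first element is the address
        try:
            ip = ipaddress.ip_address(ip_str)
        except ValueError:
            raise ValueError(f"Invalid IP address resolved: {ip_str!r}")
        if any(getattr(ip, flag) for flag in BLOCKED_FLAGS):
            raise RuntimeError(
                f"Access to internal IP address not allowed: {ip_str}")
    return parsed.hostname
```

Because `169.254.169.254` and `0.0.0.0` are IP literals, `getaddrinfo` resolves them without a network round trip, so the guard rejects them even offline.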
File-Level Changes
Hey - I've found 2 issues, and left some high level feedback:
- The current SSRF mitigation still has a TOCTOU window for DNS rebinding because you validate the resolved IP but then call `requests.get` with the hostname; consider either constructing the request using the validated IP (and explicitly setting the `Host` header) or otherwise ensuring the actual connection target cannot change after validation.
- The new `_fetch_url_safe` helper imports `urlparse`, `socket`, and `ipaddress` on every call; moving these imports to module scope would reduce per-call overhead and make the function easier to read.
- The standalone `test_ssrf*.py` scripts at the repository root look like ad-hoc helpers; consider moving them into an appropriate tests or tools directory (or gating them behind a CLI entry point) so they don't clutter the main package surface.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The current SSRF mitigation still has a TOCTOU window for DNS rebinding because you validate the resolved IP but then call `requests.get` with the hostname; consider either constructing the request using the validated IP (and explicitly setting the `Host` header) or otherwise ensuring the actual connection target cannot change after validation.
- The new `_fetch_url_safe` helper imports `urlparse`, `socket`, and `ipaddress` on every call; moving these imports to module scope would reduce per-call overhead and make the function easier to read.
- The standalone `test_ssrf*.py` scripts at the repository root look like ad-hoc helpers; consider moving them into an appropriate tests or tools directory (or gating them behind a CLI entry point) so they don't clutter the main package surface.
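One way to close the DNS-rebinding window described in the first comment is to resolve once, validate, and then connect to that exact IP while sending the original hostname in the `Host` header. The sketch below is illustrative only: `fetch_pinned` is an invented name, it uses the stdlib `http.client` rather than the PR's `requests` call, and it covers plain HTTP only, since HTTPS additionally requires SNI/hostname pinning (e.g. via a custom transport adapter).

```python
import http.client
import ipaddress
import socket
from urllib.parse import urlparse

def fetch_pinned(url: str, timeout: int = 10) -> bytes:
    """Resolve the hostname once, validate the IP, then connect to that
    pinned IP so no second (rebindable) DNS lookup can occur."""
    parsed = urlparse(url)
    if parsed.scheme != "http":
        # HTTPS needs SNI/certificate hostname handling as well,
        # which this sketch deliberately does not cover.
        raise ValueError("this sketch pins plain-HTTP requests only")
    host, port = parsed.hostname, parsed.port or 80
    if not host:
        raise ValueError("Invalid URL: no hostname")
    ip_str = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4][0]
    ip = ipaddress.ip_address(ip_str)
    if (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_unspecified):
        raise RuntimeError(f"Access to internal IP address not allowed: {ip_str}")
    path = parsed.path or "/"
    if parsed.query:
        path = f"{path}?{parsed.query}"
    # Connect to the validated IP, not the hostname; keep routing correct
    # on virtual-hosted servers by sending the original Host header.
    conn = http.client.HTTPConnection(ip_str, port, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": host})
        return conn.getresponse().read()
    finally:
        conn.close()
```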
## Individual Comments
### Comment 1
<location path="cli/integrations/job_parser.py" line_range="287" />
<code_context>
"URL fetching requires 'requests' library. Install with: pip install requests"
)
- except requests.RequestException as e:
+ except (requests.RequestException, ValueError, RuntimeError) as e:
raise RuntimeError(f"Failed to fetch URL: {e}")
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Catching broad `ValueError`/`RuntimeError` here may hide programming errors unrelated to URL fetching.
This `except (requests.RequestException, ValueError, RuntimeError)` will also catch `ValueError`/`RuntimeError` from `_parse_html` or other later code, incorrectly rewrapping them as fetch failures. Please restrict this handler to the concrete exceptions raised by `_fetch_url_safe`/`requests`, or split the `try` so parsing errors are not masked as network/URL-fetch errors.
Suggested implementation:
```python
try:
    response = self._fetch_url_safe(url)
    response.raise_for_status()
except (requests.RequestException, ValueError, RuntimeError) as e:
    # Limit this handler strictly to URL fetching / validation failures
    raise RuntimeError(f"Failed to fetch URL: {e}")

# Parsing errors now propagate on their own instead of being rewrapped
# as fetch failures.
job_details = self._parse_html(response.text)
```
If there are other calls to `_parse_html` wrapped in the same broad `except (requests.RequestException, ValueError, RuntimeError)` pattern elsewhere in this file, they should be refactored similarly: keep the exception handler around only the fetch/validation logic, and let parsing errors propagate normally.
</issue_to_address>
### Comment 2
<location path="test_ssrf.py" line_range="1-10" />
<code_context>
+import os
+import sys
+
+# Make sure we can import from the cli package
+sys.path.insert(0, os.path.abspath('.'))
+
+from cli.integrations.job_parser import JobParser
+
+parser = JobParser()
+try:
+ print("Testing JobParser SSRF with metadata endpoint...")
+ parser.parse_from_url("http://169.254.169.254/latest/meta-data/")
+except Exception as e:
+ print(f"Error: {e}")
+ import traceback
+ traceback.print_exc()
</code_context>
<issue_to_address>
**suggestion (testing):** Convert this script into an automated test (e.g., pytest/unittest) with concrete assertions instead of a manual runner with print/traceback output.
Right now this behaves as a manual debug script (prints output, catches all exceptions, prints tracebacks), so it wonβt be picked up or enforced by the test runner/CI. Either convert it into a real test (e.g., `test_*.py` with pytest assertions / `with pytest.raises(...)` on the expected exception type/message) so SSRF regressions fail CI, or move it to a `scripts/` or `manual_tests/` directory and keep automated coverage in a dedicated test module.
</issue_to_address>
🚨 Severity: HIGH

🛡️ Vulnerability: Server-Side Request Forgery (SSRF) was possible because `parse_from_url` passed arbitrary user-supplied URLs to `requests.get` without prior validation.

🎯 Impact: An attacker could exploit this to force the application server to make unauthorized requests to internal services, metadata endpoints (e.g., AWS IMDS), or the local loopback interface.

🔧 Fix: Implemented a new `_fetch_url_safe` method that parses the URL scheme, resolves the hostname to its IP address, and verifies that the destination IP does not belong to private, loopback, link-local, reserved, or unspecified (0.0.0.0) ranges before passing the request along. The method fails securely on resolution errors.

✅ Verification: Verified by attempting to access `169.254.169.254` and `0.0.0.0` programmatically with `JobParser` and confirming the operations threw exceptions, along with successful execution of the full test suite.

PR created automatically by Jules for task 13007455400231532099 started by @anchapin
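For reference, the two endpoints used in verification fall squarely into ranges Python's stdlib `ipaddress` module classifies as non-routable, which is what the new range checks key off:

```python
import ipaddress

# The rejected endpoints map onto stdlib classifications:
# 169.254.0.0/16 is link-local (where cloud metadata services live),
# 0.0.0.0 is the unspecified address, and 127.0.0.0/8 is loopback.
metadata = ipaddress.ip_address("169.254.169.254")
unspecified = ipaddress.ip_address("0.0.0.0")
loopback = ipaddress.ip_address("127.0.0.1")

print(metadata.is_link_local)      # True
print(unspecified.is_unspecified)  # True
print(loopback.is_loopback)        # True
```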
Summary by Sourcery
Harden JobParser URL fetching to prevent server-side request forgery when parsing jobs from remote URLs.
New Features:
Bug Fixes:
Documentation:
Tests: