🛡️ Sentinel: [HIGH] Fix SSRF vulnerability in JobParser#214
Conversation
Added IP address validation in `cli/integrations/job_parser.py` before fetching URLs to prevent Server-Side Request Forgery (SSRF) attacks against internal services or cloud metadata endpoints.

🚨 Severity: HIGH
💡 Vulnerability: SSRF when parsing job posting from URL
🎯 Impact: An attacker could force the CLI or API to fetch internal resources (e.g., http://localhost/ or http://169.254.169.254/) and access sensitive data.
🔧 Fix: Implemented `_fetch_url_safe` that uses `socket.getaddrinfo` and `ipaddress` to verify the resolved IP is not private, loopback, link-local, or reserved before calling `requests.get()`. Fails closed if resolution fails or returns an invalid format.
✅ Verification: Ran `pytest tests/test_job_parser_integration.py` to ensure normal URL fetching still works.

Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
Reviewer's Guide

Adds a safe URL fetcher to JobParser that validates resolved IP addresses to mitigate SSRF, wires it into parse_from_url error handling, and documents the vulnerability and mitigation in the Sentinel security log.

Sequence diagram for safe URL fetching in JobParser.parse_from_url

sequenceDiagram
actor User
participant JobParser
participant urllib
participant socket
participant ipaddress
participant requests
User->>JobParser: parse_from_url(url)
JobParser->>JobParser: _fetch_url_safe(url)
JobParser->>urllib: urlparse(url)
urllib-->>JobParser: parsed_url
JobParser->>JobParser: validate scheme and hostname
JobParser->>socket: getaddrinfo(hostname, None)
alt hostname resolution fails
socket-->>JobParser: socket.gaierror
JobParser-->>JobParser: raise ValueError
JobParser-->>User: RuntimeError(unsafe or invalid URL)
else hostname resolves
socket-->>JobParser: addr_info
JobParser->>ipaddress: ip_address(ip)
ipaddress-->>JobParser: ip_obj
alt ip is private or loopback or link_local or reserved or multicast
JobParser-->>JobParser: raise ValueError
JobParser-->>User: RuntimeError(unsafe or invalid URL)
else ip allowed
JobParser->>requests: get(url, headers, timeout)
alt network error
requests-->>JobParser: RequestException
JobParser-->>User: RuntimeError(failed to fetch URL)
else success
requests-->>JobParser: response
JobParser->>JobParser: response.raise_for_status()
JobParser-->>JobParser: html_content
JobParser->>JobParser: _parse_html(html_content)
JobParser-->>JobParser: job_details
JobParser->>JobParser: job_details.url = url
JobParser->>JobParser: _save_to_cache(cache_key, job_details)
JobParser-->>User: job_details
end
end
end
Class diagram for updated JobParser URL parsing and fetching

classDiagram
class JobParser {
+parse_from_url(url: str) JobDetails
-_parse_html(html: str) JobDetails
-_fetch_url_safe(url: str) str
-_is_linkedin(html: str) bool
-_parse_generic(html: str) JobDetails
-_save_to_cache(cache_key: str, job_details: JobDetails) None
}
class JobDetails {
+url: str
+title: str
+company: str
+location: str
+description: str
}
JobParser --> JobDetails : parses
Flowchart for SSRF-safe _fetch_url_safe URL validation

flowchart TD
A["Start _fetch_url_safe(url)"] --> B{"requests library available?"}
B -- "No" --> C["Raise NotImplementedError: install requests"]
B -- "Yes" --> D["parsed_url = urllib.parse.urlparse(url)"]
D --> E{"Scheme in http or https?"}
E -- "No" --> F["Raise ValueError: invalid scheme"]
E -- "Yes" --> G{"Hostname present?"}
G -- "No" --> H["Raise ValueError: missing hostname"]
G -- "Yes" --> I["hostname = hostname.strip('[]')"]
I --> J["addr_info = socket.getaddrinfo(hostname, None)"]
J --> K{"socket.gaierror?"}
K -- "Yes" --> L["Raise ValueError: failed to resolve hostname"]
K -- "No" --> M["ip = addr_info[0][4][0]"]
M --> N["ip_obj = ipaddress.ip_address(ip)"]
N --> O{"ip is private or loopback or link_local or reserved or multicast?"}
O -- "Yes" --> P["Raise ValueError: restricted IP"]
O -- "No" --> Q["response = requests.get(url, headers, timeout=30)"]
Q --> R["response.raise_for_status()"]
R --> S["Return response.text"]
N --> T{"ValueError from ipaddress or previous checks?"}
T -- "Yes" --> U["Raise ValueError: invalid IP address format or restricted IP"]
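The flowchart above maps to a short, standard-library-only validator. The sketch below is an illustrative reconstruction rather than the PR's actual `_fetch_url_safe` (the name `validate_url_target` is hypothetical), and it already incorporates the reviewer's later suggestion of checking every address returned by `socket.getaddrinfo` instead of only the first:

```python
import ipaddress
import socket
from urllib.parse import urlparse


def validate_url_target(url: str) -> None:
    """Raise ValueError if the URL points at a restricted network target."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Invalid scheme: {parsed.scheme!r}")
    # urlparse's .hostname already lowercases and strips IPv6 brackets.
    hostname = parsed.hostname
    if not hostname:
        raise ValueError("URL has no hostname")
    try:
        addr_info = socket.getaddrinfo(hostname, None)
    except socket.gaierror as exc:
        # Fail closed when resolution fails.
        raise ValueError(f"Failed to resolve hostname: {exc}")
    # Check every resolved address, not just addr_info[0].
    for *_rest, sockaddr in addr_info:
        # ip_address() raises ValueError on a malformed address,
        # which also fails closed.
        ip_obj = ipaddress.ip_address(sockaddr[0])
        if (ip_obj.is_private or ip_obj.is_loopback or ip_obj.is_link_local
                or ip_obj.is_reserved or ip_obj.is_multicast):
            raise ValueError(f"Restricted IP address: {ip_obj}")
```

A caller would run this immediately before `requests.get()`; any `ValueError` means the fetch must be refused.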
Hey - I've found 3 issues and left some high-level feedback:
- In `_fetch_url_safe`, you only validate the first IP returned by `socket.getaddrinfo`; consider iterating over and validating all resolved addresses and rejecting the URL if any of them are in a restricted range to better guard against multi-A records and DNS-based tricks.
- The current SSRF check runs only before the initial `requests.get`; if redirects are followed, responses could be fetched from internal addresses; consider disabling redirects or re-validating the resolved IP of each redirect target.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_fetch_url_safe`, you only validate the first IP returned by `socket.getaddrinfo`; consider iterating over and validating all resolved addresses and rejecting the URL if any of them are in a restricted range to better guard against multi-A records and DNS-based tricks.
- The current SSRF check runs only before the initial `requests.get`; if redirects are followed, responses could be fetched from internal addresses; consider disabling redirects or re-validating the resolved IP of each redirect target.
## Individual Comments
### Comment 1
<location path="cli/integrations/job_parser.py" line_range="228-237" />
<code_context>
# Fetch and parse
try:
-
- headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
</code_context>
<issue_to_address>
**π¨ suggestion (security):** Consider validating all resolved IPs instead of only the first from getaddrinfo.
Relying on `addr_info[0]` assumes the first DNS record is representative. If an attacker controls DNS, they could put a public IP first and a private/internal IP later; depending on the resolver, the client might still connect to the unvalidated address. Instead, iterate over all results from `socket.getaddrinfo`, require that every resolved IP is allowed, and reject the request if any entry is private, loopback, link-local, reserved, or multicast.
Suggested implementation:
```python
addr_info = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
# Validate all resolved IPs; reject if any of them is disallowed.
# This prevents an attacker-controlled DNS server from hiding a
# private/internal IP behind a seemingly safe first record.
for _family, _socktype, _proto, _canonname, sockaddr in addr_info:
ip = sockaddr[0]
self._validate_ip(ip)
```
Because I can't see the full implementation of `_fetch_url_safe` and `_validate_ip`, you'll likely need to:
1. Adjust the `SEARCH` block to match the actual code around your current `socket.getaddrinfo` call and the place where `addr_info[0]` is used.
2. Ensure `_validate_ip(ip: str)` rejects all of the following:
- `ipaddress.ip_address(ip).is_private`
- `.is_loopback`
- `.is_link_local`
- `.is_reserved`
- `.is_multicast`
and raises an appropriate exception if any of these are true.
3. Confirm that `hostname` and `port` are the same variables currently passed to `socket.getaddrinfo` in `_fetch_url_safe`, or adjust the names in the replacement accordingly.
</issue_to_address>
### Comment 2
<location path="cli/integrations/job_parser.py" line_range="226" />
<code_context>
try:
-
- headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
- response = requests.get(url, headers=headers, timeout=30)
- response.raise_for_status()
-
</code_context>
<issue_to_address>
**π¨ issue (security):** Redirect handling can bypass the SSRF protection by redirecting to internal/private IPs.
The SSRF validation only runs on the initial URL, but `requests.get` follows redirects, so an attacker could use a public URL that redirects to an internal/loopback IP and bypass the check. To keep the protection effective, either disable redirects (`allow_redirects=False` and treat 3xx as an error), or handle redirects yourself and re-validate each new URL/IP with the same `_fetch_url_safe` logic.
</issue_to_address>
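One way to implement the per-hop re-validation suggested above is to disable automatic redirects and follow `Location` headers manually. This is a hedged sketch, not code from the PR: `fetch_with_safe_redirects` is a hypothetical helper, and the `fetch` and `validate` callables are injected (e.g. `fetch` could wrap `requests.get(url, allow_redirects=False)` and `validate` could be the SSRF check) so each hop is checked before any request is made:

```python
from urllib.parse import urljoin


def fetch_with_safe_redirects(url, fetch, validate, max_redirects=5):
    """Follow redirects manually so every hop is re-validated.

    `fetch(url)` must perform a single request WITHOUT following
    redirects; `validate(url)` raises ValueError for restricted targets.
    """
    for _ in range(max_redirects + 1):
        validate(url)  # SSRF check applies to every hop, not just the first
        response = fetch(url)
        if response.status_code in (301, 302, 303, 307, 308):
            # Resolve relative Location headers against the current URL.
            url = urljoin(url, response.headers["Location"])
            continue
        return response
    raise RuntimeError("Too many redirects")
```

With this shape, a public URL that 302-redirects to `http://127.0.0.1/` is rejected by `validate` on the second hop instead of being silently fetched.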
### Comment 3
<location path="cli/integrations/job_parser.py" line_range="237-238" />
<code_context>
- raise NotImplementedError(
- "URL fetching requires 'requests' library. Install with: pip install requests"
- )
+ except ValueError as e:
+ raise RuntimeError(f"Unsafe or invalid URL provided: {e}")
except requests.RequestException as e:
raise RuntimeError(f"Failed to fetch URL: {e}")
</code_context>
<issue_to_address>
**suggestion:** Preserve the original traceback when wrapping ValueError into RuntimeError.
The re-raised `RuntimeError` currently loses the original traceback, which complicates debugging. Please chain the exception (`raise RuntimeError("Unsafe or invalid URL provided") from e`) so callers retain the original stack from `_fetch_url_safe` along with the higher-level error.
```suggestion
except ValueError as e:
raise RuntimeError("Unsafe or invalid URL provided") from e
```
</issue_to_address>
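The exception-chaining suggestion in Comment 3 can be demonstrated with a minimal, self-contained snippet (the function name is illustrative); `raise ... from e` records the original error as `__cause__`, so the full traceback from the validation layer survives the wrap:

```python
def wrap_validation_error():
    try:
        raise ValueError("restricted IP address")
    except ValueError as e:
        # "from e" links the original error as __cause__,
        # preserving its traceback for callers and debuggers.
        raise RuntimeError("Unsafe or invalid URL provided") from e


try:
    wrap_validation_error()
except RuntimeError as err:
    # The original ValueError is still reachable on the chained error.
    assert isinstance(err.__cause__, ValueError)
```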
A reconstruction of the relevant hunk of `parse_from_url` (the split-diff view did not survive extraction; removed/added lines are inferred from the code contexts quoted in the review comments above):

```diff
@@ -221,23 +226,16 @@ def parse_from_url(self, url: str) -> JobDetails:
         try:
-
-            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
-            response = requests.get(url, headers=headers, timeout=30)
-            response.raise_for_status()
-
-            job_details = self._parse_html(response.text)
+            html_content = self._fetch_url_safe(url)
+            job_details = self._parse_html(html_content)
             job_details.url = url

             # Save to cache
             self._save_to_cache(cache_key, job_details)

             return job_details

-        except ImportError:
-            raise NotImplementedError(
-                "URL fetching requires 'requests' library. Install with: pip install requests"
-            )
+        except ValueError as e:
+            raise RuntimeError(f"Unsafe or invalid URL provided: {e}")
         except requests.RequestException as e:
             raise RuntimeError(f"Failed to fetch URL: {e}")
```
🛡️ Sentinel: [HIGH] Fix SSRF vulnerability in JobParser

Added IP address validation in `cli/integrations/job_parser.py` before fetching URLs to prevent Server-Side Request Forgery (SSRF) attacks against internal services or cloud metadata endpoints.

🚨 Severity: HIGH
💡 Vulnerability: SSRF when parsing job posting from URL
🎯 Impact: An attacker could force the CLI or API to fetch internal resources (e.g., http://localhost/ or http://169.254.169.254/) and access sensitive data.
🔧 Fix: Implemented `_fetch_url_safe` that uses `socket.getaddrinfo` and `ipaddress` to verify the resolved IP is not private, loopback, link-local, or reserved before calling `requests.get()`. Fails closed if resolution fails or returns an invalid format.
✅ Verification: Ran `pytest tests/test_job_parser_integration.py` to ensure normal URL fetching still works.

PR created automatically by Jules for task 3169344061033975925 started by @anchapin
Summary by Sourcery
Harden JobParser URL fetching to prevent SSRF by validating target IPs before making HTTP requests and updating security documentation accordingly.
Bug Fixes:
Enhancements:
Documentation: