
πŸ›‘οΈ Sentinel: [HIGH] Fix SSRF vulnerability in JobParser#214

Open
anchapin wants to merge 1 commit into main from sentinel-ssrf-job-parser-3169344061033975925

Conversation

@anchapin (Owner) commented Mar 27, 2026

πŸ›‘οΈ Sentinel: [HIGH] Fix SSRF vulnerability in JobParser

Added IP address validation in cli/integrations/job_parser.py before fetching URLs to prevent Server-Side Request Forgery (SSRF) attacks against internal services or cloud metadata endpoints.

🚨 Severity: HIGH
πŸ’‘ Vulnerability: SSRF when parsing job posting from URL
🎯 Impact: An attacker could force the CLI or API to fetch internal resources (e.g., http://localhost/ or http://169.254.169.254/) and access sensitive data.
πŸ”§ Fix: Implemented _fetch_url_safe that uses socket.getaddrinfo and ipaddress to verify the resolved IP is not private, loopback, link-local, or reserved before calling requests.get(). Fails closed if resolution fails or returns an invalid format.
βœ… Verification: Ran pytest tests/test_job_parser_integration.py to ensure normal URL fetching still works.
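As an illustration, the validation described above might look like the sketch below. This is a reconstruction from the description, not the actual code in `cli/integrations/job_parser.py`; names and details may differ (the helper is shown here as a module-level function rather than a method):

```python
import ipaddress
import socket
import urllib.parse


def fetch_url_safe(url: str, timeout: int = 30) -> str:
    """Fetch a URL only after checking that it does not resolve to a
    private, loopback, link-local, reserved, or multicast address."""
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
    hostname = parsed.hostname  # .hostname already strips IPv6 brackets
    if not hostname:
        raise ValueError("URL has no hostname")

    # Fail closed: resolution errors and unparsable addresses raise ValueError.
    try:
        addr_info = socket.getaddrinfo(hostname, None)
        ip_obj = ipaddress.ip_address(addr_info[0][4][0])
    except socket.gaierror as e:
        raise ValueError(f"Failed to resolve hostname: {hostname}") from e

    if (ip_obj.is_private or ip_obj.is_loopback or ip_obj.is_link_local
            or ip_obj.is_reserved or ip_obj.is_multicast):
        raise ValueError(f"URL resolves to a restricted IP address: {ip_obj}")

    # The requests dependency is only needed once validation has passed.
    try:
        import requests
    except ImportError:
        raise NotImplementedError(
            "URL fetching requires 'requests' library. Install with: pip install requests"
        )
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text
```

With this shape, `fetch_url_safe("http://127.0.0.1/")` and `fetch_url_safe("http://169.254.169.254/")` both fail before any HTTP request is issued.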


PR created automatically by Jules for task 3169344061033975925 started by @anchapin

Summary by Sourcery

Harden JobParser URL fetching to prevent SSRF by validating target IPs before making HTTP requests and updating security documentation accordingly.

Bug Fixes:

  • Prevent SSRF in JobParser by resolving hostnames and blocking URLs that map to private, loopback, link-local, reserved, or multicast IP addresses.
  • Treat missing or invalid hostnames and unsupported URL schemes as errors, failing closed before any HTTP request is issued.

Enhancements:

  • Refactor JobParser to centralize safe URL retrieval via a dedicated helper used by URL-based parsing.

Documentation:

  • Extend Sentinel security report with a new entry documenting the JobParser SSRF vulnerability, its impact, and mitigations.


Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
@google-labs-jules (Contributor) commented:

πŸ‘‹ Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a πŸ‘€ emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

sourcery-ai bot commented Mar 27, 2026

Reviewer's Guide

Adds a safe URL fetcher to JobParser that validates resolved IP addresses to mitigate SSRF, wires it into parse_from_url error handling, and documents the vulnerability and mitigation in the Sentinel security log.

Sequence diagram for safe URL fetching in JobParser.parse_from_url

```mermaid
sequenceDiagram
    actor User
    participant JobParser
    participant urllib
    participant socket
    participant ipaddress
    participant requests

    User->>JobParser: parse_from_url(url)
    JobParser->>JobParser: _fetch_url_safe(url)

    JobParser->>urllib: urlparse(url)
    urllib-->>JobParser: parsed_url
    JobParser->>JobParser: validate scheme and hostname

    JobParser->>socket: getaddrinfo(hostname, None)
    alt hostname resolution fails
        socket-->>JobParser: socket.gaierror
        JobParser-->>JobParser: raise ValueError
        JobParser-->>User: RuntimeError(unsafe or invalid URL)
    else hostname resolves
        socket-->>JobParser: addr_info
        JobParser->>ipaddress: ip_address(ip)
        ipaddress-->>JobParser: ip_obj
        alt ip is private or loopback or link_local or reserved or multicast
            JobParser-->>JobParser: raise ValueError
            JobParser-->>User: RuntimeError(unsafe or invalid URL)
        else ip allowed
            JobParser->>requests: get(url, headers, timeout)
            alt network error
                requests-->>JobParser: RequestException
                JobParser-->>User: RuntimeError(failed to fetch URL)
            else success
                requests-->>JobParser: response
                JobParser->>JobParser: response.raise_for_status()
                JobParser-->>JobParser: html_content
                JobParser->>JobParser: _parse_html(html_content)
                JobParser-->>JobParser: job_details
                JobParser->>JobParser: job_details.url = url
                JobParser->>JobParser: _save_to_cache(cache_key, job_details)
                JobParser-->>User: job_details
            end
        end
    end
```

Class diagram for updated JobParser URL parsing and fetching

```mermaid
classDiagram
    class JobParser {
        +parse_from_url(url: str) JobDetails
        -_parse_html(html: str) JobDetails
        -_fetch_url_safe(url: str) str
        -_is_linkedin(html: str) bool
        -_parse_generic(html: str) JobDetails
        -_save_to_cache(cache_key: str, job_details: JobDetails) None
    }

    class JobDetails {
        +url: str
        +title: str
        +company: str
        +location: str
        +description: str
    }

    JobParser --> JobDetails : parses
```

Flowchart for SSRF-safe _fetch_url_safe URL validation

```mermaid
flowchart TD
    A["Start _fetch_url_safe(url)"] --> B{"requests library available?"}
    B -- "No" --> C["Raise NotImplementedError: install requests"]
    B -- "Yes" --> D["parsed_url = urllib.parse.urlparse(url)"]
    D --> E{"Scheme in http or https?"}
    E -- "No" --> F["Raise ValueError: invalid scheme"]
    E -- "Yes" --> G{"Hostname present?"}
    G -- "No" --> H["Raise ValueError: missing hostname"]
    G -- "Yes" --> I["hostname = hostname.strip('[]')"]
    I --> J["addr_info = socket.getaddrinfo(hostname, None)"]

    J --> K{"socket.gaierror?"}
    K -- "Yes" --> L["Raise ValueError: failed to resolve hostname"]
    K -- "No" --> M["ip = addr_info[0][4][0]"]
    M --> N["ip_obj = ipaddress.ip_address(ip)"]

    N --> O{"ip is private or loopback or link_local or reserved or multicast?"}
    O -- "Yes" --> P["Raise ValueError: restricted IP"]
    O -- "No" --> Q["response = requests.get(url, headers, timeout=30)"]
    Q --> R["response.raise_for_status()"]
    R --> S["Return response.text"]

    N --> T{"ValueError from ipaddress or previous checks?"}
    T -- "Yes" --> U["Raise ValueError: invalid IP address format or restricted IP"]
```

File-Level Changes

Route JobParser URL parsing through a new SSRF-safe fetch helper instead of calling requests.get directly.
  • Replace direct requests.get call in parse_from_url with _fetch_url_safe and pass returned HTML into _parse_html
  • Set job_details.url and cache behavior unchanged after safe fetch
  • Change exception handling in parse_from_url to convert ValueError from URL validation into a RuntimeError with a clear unsafe/invalid URL message
  • Remove the ImportError special-case in parse_from_url; requests-availability check is now centralized in the helper
cli/integrations/job_parser.py
Introduce _fetch_url_safe to enforce scheme and IP-based restrictions before performing HTTP requests.
  • Add _fetch_url_safe method that validates scheme is http/https using urllib.parse.urlparse
  • Validate hostname presence and normalize IPv6 literals by stripping brackets
  • Resolve hostname via socket.getaddrinfo, pick the first IP, and parse it with ipaddress.ip_address
  • Block requests whose IP is private, loopback, link-local, reserved, or multicast, failing closed on DNS resolution errors or invalid IP formats via ValueError
  • Retain and reuse the existing User-Agent header and timeout when issuing requests.get, raising for HTTP errors before returning response.text
  • Move the requests presence check into _fetch_url_safe, raising NotImplementedError with installation guidance if requests is missing
cli/integrations/job_parser.py
Document the discovered SSRF vulnerability in JobParser and its mitigation in the Sentinel security knowledge file.
  • Add a new dated section describing the high-severity SSRF in JobParser URL fetching, its impact, and the risks of using requests.get on untrusted URLs
  • Record the IP-based pre-fetch validation using socket.getaddrinfo and ipaddress as the primary mitigation and note limitations regarding DNS rebinding and redirects
.jules/sentinel.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.


sourcery-ai bot left a comment:


Hey - I've found 3 issues and left some high-level feedback:

  • In _fetch_url_safe, you only validate the first IP returned by socket.getaddrinfo; consider iterating over and validating all resolved addresses and rejecting the URL if any of them are in a restricted range to better guard against multi-A records and DNS-based tricks.
  • The current SSRF check runs only before the initial requests.get; if redirects are followed, responses could be fetched from internal addressesβ€”consider disabling redirects or re-validating the resolved IP of each redirect target.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_fetch_url_safe`, you only validate the first IP returned by `socket.getaddrinfo`; consider iterating over and validating all resolved addresses and rejecting the URL if any of them are in a restricted range to better guard against multi-A records and DNS-based tricks.
- The current SSRF check runs only before the initial `requests.get`; if redirects are followed, responses could be fetched from internal addressesβ€”consider disabling redirects or re-validating the resolved IP of each redirect target.

## Individual Comments

### Comment 1
<location path="cli/integrations/job_parser.py" line_range="228-237" />
<code_context>

         # Fetch and parse
         try:
-
-            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Consider validating all resolved IPs instead of only the first from getaddrinfo.

Relying on `addr_info[0]` assumes the first DNS record is representative. If an attacker controls DNS, they could put a public IP first and a private/internal IP later; depending on the resolver, the client might still connect to the unvalidated address. Instead, iterate over all results from `socket.getaddrinfo`, require that every resolved IP is allowed, and reject the request if any entry is private, loopback, link-local, reserved, or multicast.

Suggested implementation:

```python
        addr_info = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)

        # Validate all resolved IPs; reject if any of them is disallowed.
        # This prevents an attacker-controlled DNS server from hiding a
        # private/internal IP behind a seemingly safe first record.
        for _family, _socktype, _proto, _canonname, sockaddr in addr_info:
            ip = sockaddr[0]
            self._validate_ip(ip)

```

Because I can’t see the full implementation of `_fetch_url_safe` and `_validate_ip`, you’ll likely need to:

1. Adjust the `SEARCH` block to match the actual code around your current `socket.getaddrinfo` call and the place where `addr_info[0]` is used.
2. Ensure `_validate_ip(ip: str)` rejects all of the following:
   - `ipaddress.ip_address(ip).is_private`
   - `.is_loopback`
   - `.is_link_local`
   - `.is_reserved`
   - `.is_multicast`
   and raises an appropriate exception if any of these are true.
3. Confirm that `hostname` and `port` are the same variables currently passed to `socket.getaddrinfo` in `_fetch_url_safe`, or adjust the names in the replacement accordingly.
</issue_to_address>

### Comment 2
<location path="cli/integrations/job_parser.py" line_range="226" />
<code_context>
         try:
-
-            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
-            response = requests.get(url, headers=headers, timeout=30)
-            response.raise_for_status()
-
</code_context>
<issue_to_address>
**🚨 issue (security):** Redirect handling can bypass the SSRF protection by redirecting to internal/private IPs.

The SSRF validation only runs on the initial URL, but `requests.get` follows redirects, so an attacker could use a public URL that redirects to an internal/loopback IP and bypass the check. To keep the protection effective, either disable redirects (`allow_redirects=False` and treat 3xx as an error), or handle redirects yourself and re-validate each new URL/IP with the same `_fetch_url_safe` logic.
</issue_to_address>

### Comment 3
<location path="cli/integrations/job_parser.py" line_range="237-238" />
<code_context>
-            raise NotImplementedError(
-                "URL fetching requires 'requests' library. Install with: pip install requests"
-            )
+        except ValueError as e:
+            raise RuntimeError(f"Unsafe or invalid URL provided: {e}")
         except requests.RequestException as e:
             raise RuntimeError(f"Failed to fetch URL: {e}")
</code_context>
<issue_to_address>
**suggestion:** Preserve the original traceback when wrapping ValueError into RuntimeError.

The re-raised `RuntimeError` currently loses the original traceback, which complicates debugging. Please chain the exception (`raise RuntimeError("Unsafe or invalid URL provided") from e`) so callers retain the original stack from `_fetch_url_safe` along with the higher-level error.

```suggestion
        except ValueError as e:
            raise RuntimeError("Unsafe or invalid URL provided") from e
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click πŸ‘ or πŸ‘Ž on each comment and I'll use the feedback to improve your reviews.

Comment on lines 228 to +237:

```diff
         try:
-            headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
-            response = requests.get(url, headers=headers, timeout=30)
-            response.raise_for_status()
-
-            job_details = self._parse_html(response.text)
+            html_content = self._fetch_url_safe(url)
+            job_details = self._parse_html(html_content)
             job_details.url = url

             # Save to cache
             self._save_to_cache(cache_key, job_details)

             return job_details

-        except ImportError:
-            raise NotImplementedError(
-                "URL fetching requires 'requests' library. Install with: pip install requests"
-            )
+        except ValueError as e:
```
