abhimehro · google-labs-jules · Jan 10, 2026
diff --git a/.jules/sentinel.md b/.jules/sentinel.md
@@ -1,41 +1,6 @@
-## 2024-05-23 - [Input Validation and Syntax Fix]
-**Vulnerability:** The `create_folder` function contained a syntax error (positional arg after keyword arg) preventing execution. Additionally, `folder_url` and `profile_id` lacked validation, potentially allowing SSRF (via non-HTTPS URLs) or path traversal/injection (via crafted profile IDs).
-**Learning:** Even simple scripts need robust input validation, especially when inputs are used to construct URLs or file paths. A syntax error can mask security issues by preventing the code from running in the first place.
-**Prevention:**
-1. Always validate external inputs against a strict allowlist (e.g., regex for IDs, protocol check for URLs).
-2. Use linters/static analysis to catch syntax errors before runtime.
+# Sentinel's Journal
 
-## 2024-12-13 - [Sensitive Data Exposure in Logs]
-**Vulnerability:** The application was logging full HTTP response bodies at ERROR level when requests failed. This could expose sensitive information (secrets, PII) returned by the API during failure states.
-**Learning:** Defaulting to verbose logging in error handlers (e.g., `log.error(e.response.text)`) is risky because API error responses often contain context that should not be persisted in production logs.
-**Prevention:**
-1. Log sensitive data (like full request/response bodies) only at DEBUG level.
-2. Sanitize or truncate log messages if they must be logged at higher levels.
-## 2024-12-15 - [Sensitive Data Exposure in Logs]
-**Vulnerability:** The application was logging full HTTP error response bodies at `ERROR` level. API error responses can often contain sensitive data like tokens, PII, or internal debug info.
-**Learning:** Default logging configurations can lead to data leaks if raw response bodies are logged without sanitization or level checks.
-**Prevention:**
-1. Log potentially sensitive data (like raw HTTP bodies) only at `DEBUG` level.
-2. At `INFO`/`ERROR` levels, log only safe summaries or status codes.
-
-## 2024-12-16 - [DoS via Unbounded Response Size]
-**Vulnerability:** The `_gh_get` function downloaded external JSON resources without any size limit. A malicious URL or compromised server could serve a massive file (e.g., 10GB), causing the application to consume all available memory (RAM) and crash (Denial of Service).
-**Learning:** When fetching data from external sources, never assume the response size is safe. `httpx.get()` (and `requests.get`) reads the entire body into memory by default.
-**Prevention:**
-1. Use streaming responses (`client.stream("GET", ...)`) when fetching external resources.
-2. Inspect `Content-Length` headers if available.
-3. Enforce a hard limit on the number of bytes read during the stream loop.
-
-## 2024-12-22 - [Sensitive Data Exposure in Logs (Headers)]
-**Vulnerability:** The application's `sanitize_for_log` function was insufficient, only escaping characters but not redacting secrets. If an exception occurred that included headers (e.g. `Authorization`), the `TOKEN` could be exposed in logs.
-**Learning:** Generic sanitization (like `repr()`) is not enough for secrets. Explicit redaction of known secrets is required.
-**Prevention:**
-1. Maintain a list of sensitive values (tokens, keys).
-2. Ensure logging utilities check against this list and mask values before outputting.
-
-## 2025-01-21 - [SSRF Protection and Input Limits]
-**Vulnerability:** The `folder_url` validation checked for HTTPS but allowed internal IP addresses (e.g., `127.0.0.1`, `10.0.0.0/8`). This could theoretically allow Server-Side Request Forgery (SSRF) if the script is run in an environment with access to sensitive internal services. Additionally, `profile_id` had no length limit.
-**Learning:** HTTPS validation alone is insufficient to prevent SSRF against internal services that might support HTTPS or use self-signed certs (if verification was disabled or bypassed). Explicitly blocking private IP ranges provides necessary defense-in-depth.
-**Prevention:**
-1. Parse URLs and check hostnames against `localhost` and private IP ranges using `ipaddress` module.
-2. Enforce strict length limits on user inputs (e.g., profile IDs) to prevent resource exhaustion or buffer abuse.
+## 2025-01-20 - SSRF Vulnerability in URL Validation
+**Vulnerability:** The `validate_folder_url` function checked for explicit localhost strings and IP literals but failed to resolve domain names. This allowed attackers to bypass SSRF protection by using a domain name that resolves to a private IP (e.g., `local.example.com` -> `127.0.0.1`).
+**Learning:** Checking hostnames against a blocklist is insufficient because DNS resolution decouples the name from the IP. `ipaddress` library only validates literals.
+**Prevention:** Always resolve the hostname to an IP address and check the resolved IP against private ranges (`is_private`, `is_loopback`) before making a request. Be aware of TOCTOU (Time-of-Check Time-of-Use) issues like DNS rebinding, though basic resolution is a good first line of defense.
diff --git a/main.py b/main.py
@@ -23,6 +23,7 @@
 import concurrent.futures
 import threading
 import ipaddress
+import socket
 from urllib.parse import urlparse
 from typing import Dict, List, Optional, Any, Set, Sequence
 
@@ -200,7 +201,7 @@

        # Check for potentially malicious hostnames
        if hostname.lower() in ('localhost', '127.0.0.1', '::1'):
             log.warning(f"Skipping unsafe URL (localhost detected): {sanitize_for_log(url)}")
             return False

        try:
@@ -210,7 +211,23 @@
                 return False
         except ValueError:
             # Not an IP literal, it's a domain.
-            pass
+            try:
+                # Resolve domain to IP to check for private addresses (SSRF protection)
+                resolved_ip_str = socket.gethostbyname(hostname)
+                resolved_ip = ipaddress.ip_address(resolved_ip_str)
+                if resolved_ip.is_private or resolved_ip.is_loopback:
+                    log.warning(f"Skipping unsafe URL (domain resolves to private IP {resolved_ip_str}): {sanitize_for_log(url)}")
+                    return False
+            except socket.gaierror:
+                 # DNS resolution failed, which might happen if the domain is invalid
+                 # We'll allow it to proceed to HTTP request which will likely fail,
+                 # or we could be strict and reject it.
+                 # Being strict is safer for security.
+                 log.warning(f"Skipping unsafe URL (DNS resolution failed): {sanitize_for_log(url)}")
+                 return False
+            except Exception as e:
+                log.warning(f"Failed to resolve domain for URL {sanitize_for_log(url)}: {e}")
+                return False
 
     except Exception as e:
         log.warning(f"Failed to validate URL {sanitize_for_log(url)}: {e}")
@@ -373,7 +390,7 @@
        log.error(f"Failed to get existing rules: {sanitize_for_log(e)}")
        return set()

 def fetch_folder_data(url: str) -> Dict[str, Any]:
    js = _gh_get(url)
    if not validate_folder_data(js, url):
        raise KeyError(f"Invalid folder data from {sanitize_for_log(url)}")
@@ -662,7 +679,7 @@

            with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                future_to_folder = {
                    executor.submit(
                        _process_single_folder,
                        folder_data,
                        profile_id,