Merged
4 changes: 4 additions & 0 deletions .jules/bolt.md
@@ -35,3 +35,7 @@
## 2024-05-24 - Parallelize DNS Validation
**Learning:** DNS lookups (`socket.getaddrinfo`) are blocking I/O operations. Performing them sequentially in a list comprehension (e.g., to filter URLs) can be a major bottleneck. Parallelizing them alongside the fetch operation can significantly reduce startup time.
**Action:** Move validation logic that involves network I/O into the parallel worker thread instead of pre-filtering sequentially.

## 2026-01-27 - Redundant Validation for Cached Data
**Learning:** Re-validating resource properties (like DNS/IP) when using *cached content* is pure overhead. If the content is served from memory (proven safe at fetch time), checking the *current* state of the source is disconnected from the data being used.
**Action:** When using a multi-stage pipeline (Warmup -> Process), ensure validation state persists alongside the data cache. Avoid clearing validation caches between stages if the data cache is not also cleared.
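One way to realize this (a sketch, not code from this PR) is to keep the payload and its validation state in the same structure, so clearing one necessarily clears the other:

```python
import threading
from typing import Any, Dict, Optional


class ValidatedCache:
    """Cache whose entries imply their source URL was validated at fetch time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, Any] = {}

    def put(self, url: str, data: Any) -> None:
        # Called only after validation succeeded at fetch time (warm-up stage).
        with self._lock:
            self._entries[url] = data

    def get(self, url: str) -> Optional[Any]:
        # A hit implies the content was validated when fetched; the process
        # stage need not re-check the source's *current* DNS/IP state.
        with self._lock:
            return self._entries.get(url)

    def clear(self) -> None:
        # Data and validation state live together, so they stay in sync.
        with self._lock:
            self._entries.clear()
```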
99 changes: 61 additions & 38 deletions main.py
@@ -24,6 +24,7 @@
import socket
import stat
import sys
import threading
import time
from functools import lru_cache
from typing import Any, Callable, Dict, List, Optional, Sequence, Set
@@ -315,13 +316,16 @@
# 3. Helpers
# --------------------------------------------------------------------------- #
_cache: Dict[str, Dict] = {}
# Use RLock (reentrant lock) to allow nested acquisitions by the same thread
# This prevents deadlocks when _fetch_if_valid calls fetch_folder_data which calls _gh_get
_cache_lock = threading.RLock()

@lru_cache(maxsize=128)
def validate_folder_url(url: str) -> bool:
"""
Validates a folder URL.
Cached to avoid repeated DNS lookups (socket.getaddrinfo) for the same URL
during warm-up and sync phases.
"""
if not url.startswith("https://"):
@@ -485,7 +489,7 @@
def _api_get(client: httpx.Client, url: str) -> httpx.Response:
return _retry_request(lambda: client.get(url))


def _api_delete(client: httpx.Client, url: str) -> httpx.Response:
def _api_delete(client: httpx.Client, url: str) -> httpx.Response:
return _retry_request(lambda: client.delete(url))

@@ -521,51 +525,62 @@
)
time.sleep(wait_time)


def _gh_get(url: str) -> Dict:
    # First check: Quick check without holding lock for long
    with _cache_lock:
        if url in _cache:
            return _cache[url]

    # Fetch data if not cached
    # Explicitly let HTTPError propagate (no need to catch just to re-raise)
    with _gh.stream("GET", url) as r:
        r.raise_for_status()

        # 1. Check Content-Length header if present
        cl = r.headers.get("Content-Length")
        if cl:
            try:
                if int(cl) > MAX_RESPONSE_SIZE:
                    raise ValueError(
                        f"Response too large from {sanitize_for_log(url)} "
                        f"({int(cl) / (1024 * 1024):.2f} MB)"
                    )
            except ValueError as e:
                # Only catch the conversion error, let the size error propagate
                if "Response too large" in str(e):
                    raise e
                log.warning(
                    f"Malformed Content-Length header from {sanitize_for_log(url)}: {cl!r}. "
                    "Falling back to streaming size check."
                )

        # 2. Stream and check actual size
        chunks = []
        current_size = 0
        for chunk in r.iter_bytes():
            current_size += len(chunk)
            if current_size > MAX_RESPONSE_SIZE:
                raise ValueError(
                    f"Response too large from {sanitize_for_log(url)} "
                    f"(> {MAX_RESPONSE_SIZE / (1024 * 1024):.2f} MB)"
                )
            chunks.append(chunk)

    try:
        data = json.loads(b"".join(chunks))
    except json.JSONDecodeError as e:
        raise ValueError(
            f"Invalid JSON response from {sanitize_for_log(url)}"
        ) from e

    # Double-checked locking: Check again after fetch to avoid duplicate fetches.
    # If another thread already cached it while we were fetching, use theirs
    # for consistency (return _cache[url] instead of data to ensure a single source of truth)
    with _cache_lock:
        if url not in _cache:
            _cache[url] = data
        return _cache[url]


def check_api_access(client: httpx.Client, profile_id: str) -> bool:
@@ -693,7 +708,8 @@

def warm_up_cache(urls: Sequence[str]) -> None:
urls = list(set(urls))
urls_to_process = [u for u in urls if u not in _cache]
with _cache_lock:
urls_to_process = [u for u in urls if u not in _cache]
if not urls_to_process:
return

@@ -719,7 +735,7 @@
for future in concurrent.futures.as_completed(futures):
completed += 1
render_progress_bar(completed, total, "Warming up cache", prefix="⏳")
try:
future.result()
except Exception as e:
if USE_COLORS:
@@ -1038,6 +1054,13 @@
# OPTIMIZATION: Move validation inside the thread pool to parallelize DNS lookups.
# Previously, sequential validation blocked the main thread.
def _fetch_if_valid(url: str):
# Optimization: If we already have the content in cache, return it directly.
# The content was validated at the time of fetch (warm_up_cache).
# Read directly from cache to avoid calling fetch_folder_data while holding lock.
with _cache_lock:
if url in _cache:
return _cache[url]
Comment on lines +1057 to +1062

Copilot AI Jan 27, 2026

Critical security vulnerability: This optimization bypasses validation that was explicitly cleared for security reasons. At line 1048, `validate_folder_url.cache_clear()` is called to prevent TOCTOU issues. However, this optimization returns cached content without re-validating the URL, defeating the security measure.

The comment claims "The content was validated at the time of fetch (`warm_up_cache`)", but that validation was already cleared by line 1048. This means:

1. If a domain's DNS changes between `warm_up_cache` and `sync_profile`, it won't be detected
2. If multiple profiles are synced, the second and later profiles will use cached content without any validation at all
3. An attacker could exploit this by timing DNS changes to bypass validation

The validation cache clear at line 1048 exists specifically to ensure fresh validation for each sync run. This optimization must either:

• Not skip validation when the validation cache has been cleared, OR
• Not clear the validation cache if using this optimization (which would require careful security analysis)

Simply removing the `cache_clear()` is not safe without understanding why it was added for TOCTOU prevention.


if validate_folder_url(url):
return fetch_folder_data(url)
return None