diff --git a/.jules/bolt.md b/.jules/bolt.md index c5f9902b..ac1f997f 100644 --- a/.jules/bolt.md +++ b/.jules/bolt.md @@ -1,41 +1,3 @@ -# Bolt's Journal - -## 2024-03-24 - [Reusing HTTP Clients] -**Learning:** Instantiating `httpx.Client` (or `requests.Session`) inside a loop for API calls defeats the purpose of connection pooling and Keep-Alive. Reusing a single client instance across serial or parallel tasks significantly reduces TCP/SSL overhead. -**Action:** Always check loop bodies for client/session instantiation. Lift the instantiation to the outer scope and pass the client down. - -## 2024-05-23 - Initial Setup -**Learning:** Initialized Bolt's journal. -**Action:** Always check this journal for past learnings before starting. - -## 2024-05-23 - Parallel IO for independent resources -**Learning:** Python's `concurrent.futures.ThreadPoolExecutor` is a low-effort, high-reward optimization for independent IO operations (like fetching multiple URLs). Even with standard synchronous libraries like `httpx` (unless using its async version), threading can significantly reduce total execution time from sum(latency) to max(latency). -**Action:** Always look for loops performing IO that don't depend on each other's results and parallelize them. Be mindful of thread safety if shared resources (like a cache) are modified. - -## 2024-05-24 - Thread Safety in Parallel IO -**Learning:** When parallelizing IO operations that update a shared collection (like a set of existing rules), always use a `threading.Lock` for the write operations. While Python's GIL makes some operations atomic, explicit locking ensures correctness and prevents race conditions during complex update logic (e.g. checks then writes). -**Action:** Use `threading.Lock` when refactoring sequential loops into `ThreadPoolExecutor` if they modify shared state. - -## 2024-05-24 - Avoid Copying Large Sets for Membership Checks -**Learning:** Copying a large set (e.g. 100k items) to create a snapshot for read-only membership checks is expensive O(N) and unnecessary. Python's set membership testing is thread-safe. -**Action:** When filtering data against a shared large set, iterate and check membership directly instead of snapshotting, unless strict transactional consistency across the entire iteration is required. - -## 2024-05-24 - Deduplicate before API calls -**Learning:** Sending duplicate items in API requests wastes bandwidth and processing time. If the input list might contain duplicates (common in aggregated blocklists), deduplicate it locally before sending. -**Action:** Use `set` logic to filter duplicates from input lists before batching for API calls. - -## 2024-05-24 - Parallelize independent batches -**Learning:** When sending large amounts of data in batches to an API, processing batches sequentially blocks on network latency. Using a thread pool to send multiple batches concurrently can significantly speed up the process, provided the API rate limits are respected. -**Action:** Refactor sequential batch processing loops to use `ThreadPoolExecutor` with a conservative number of workers (e.g., 3-5) for write operations. - -## 2024-05-24 - Pass Local State to Avoid Redundant Reads -**Learning:** When a process involves modifying remote state (e.g. deleting folders) and then querying it (e.g. getting rules from remaining folders), maintaining a local replica of the state avoids redundant API calls. If you know what you deleted, you don't need to ask the server "what's left?". -**Action:** Identify sequences of "Read -> Modify -> Read" and optimize to "Read -> Modify (update local) -> Use local". - -## 2024-05-24 - Parallelize DNS Validation -**Learning:** DNS lookups (`socket.getaddrinfo`) are blocking I/O operations. Performing them sequentially in a list comprehension (e.g., to filter URLs) can be a major bottleneck. Parallelizing them alongside the fetch operation can significantly reduce startup time. -**Action:** Move validation logic that involves network I/O into the parallel worker thread instead of pre-filtering sequentially. - -## 2026-01-27 - Redundant Validation for Cached Data -**Learning:** Re-validating resource properties (like DNS/IP) when using *cached content* is pure overhead. If the content is served from memory (proven safe at fetch time), checking the *current* state of the source is disconnected from the data being used. -**Action:** When using a multi-stage pipeline (Warmup -> Process), ensure validation state persists alongside the data cache. Avoid clearing validation caches between stages if the data cache is not also cleared. +## 2025-02-23 - Python Module Patching Flakiness +**Learning:** Patching global variables (like `USE_COLORS`) in a module can fail if previous tests manipulated `sys.modules` (e.g., `del sys.modules['main']`) or used `sys.path.append` to re-import the module from a different path. This results in the test holding a reference to an old module object while `patch` modifies the new one. +**Action:** Avoid `sys.path.append` in tests if possible. Use `patch.object(module, 'var', val)` instead of `patch('module.var', val)` where `module` is imported inside the test function to ensure it targets the current `sys.modules` entry. diff --git a/main.py b/main.py index 4b766144..1805de9e 100644 --- a/main.py +++ b/main.py @@ -1151,23 +1151,36 @@ def process_batch(batch_idx: int, batch_data: List[str]) -> Optional[List[str]]: # Optimization 3: Parallelize batch processing # Using 3 workers to speed up writes without hitting aggressive rate limits. - with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor: - futures = { - executor.submit(process_batch, i, batch): i - for i, batch in enumerate(batches, 1) - } + if total_batches == 1: + # Avoid thread pool overhead for single batch (very common case) + result = process_batch(1, batches[0]) + if result: + successful_batches = 1 + existing_rules.update(result) + + render_progress_bar( + successful_batches, + total_batches, + f"Folder {sanitize_for_log(folder_name)}", + ) + else: + with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor: + futures = { + executor.submit(process_batch, i, batch): i + for i, batch in enumerate(batches, 1) + } - for future in concurrent.futures.as_completed(futures): - result = future.result() - if result: - successful_batches += 1 - existing_rules.update(result) - - render_progress_bar( - successful_batches, - total_batches, - f"Folder {sanitize_for_log(folder_name)}", - ) + for future in concurrent.futures.as_completed(futures): + result = future.result() + if result: + successful_batches += 1 + existing_rules.update(result) + + render_progress_bar( + successful_batches, + total_batches, + f"Folder {sanitize_for_log(folder_name)}", + ) if successful_batches == total_batches: if USE_COLORS: diff --git a/tests/test_cache_optimization.py b/tests/test_cache_optimization.py index 2fd18439..0e525b3e 100644 --- a/tests/test_cache_optimization.py +++ b/tests/test_cache_optimization.py @@ -13,7 +13,7 @@ import os # Add root to path to import main -sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +# sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) import main diff --git a/tests/test_plan_details.py b/tests/test_plan_details.py index 12cacb2c..89a34a62 100644 --- a/tests/test_plan_details.py +++ b/tests/test_plan_details.py @@ -2,12 +2,12 @@ from unittest.mock import patch -import main - def test_print_plan_details_no_colors(capsys): """Test print_plan_details output when colors are disabled.""" - with patch("main.USE_COLORS", False): + import main + + with patch.object(main, "USE_COLORS", False): plan_entry = { "profile": "test_profile", "folders": [ @@ -29,7 +29,9 @@ def test_print_plan_details_no_colors(capsys): def test_print_plan_details_empty_folders(capsys): """Test print_plan_details with no folders.""" - with patch("main.USE_COLORS", False): + import main + + with patch.object(main, "USE_COLORS", False): plan_entry = {"profile": "test_profile", "folders": []} main.print_plan_details(plan_entry) @@ -42,7 +44,9 @@ def test_print_plan_details_empty_folders(capsys): def test_print_plan_details_with_colors(capsys): """Test print_plan_details output when colors are enabled.""" - with patch("main.USE_COLORS", True): + import main + + with patch.object(main, "USE_COLORS", True): plan_entry = { "profile": "test_profile", "folders": [{"name": "Folder A", "rules": 10}], diff --git a/tests/test_push_rules_perf.py b/tests/test_push_rules_perf.py new file mode 100644 index 00000000..13ad9e19 --- /dev/null +++ b/tests/test_push_rules_perf.py @@ -0,0 +1,77 @@ + +import unittest +from unittest.mock import patch, MagicMock +import concurrent.futures + +# Assumes pytest is running from root, so 'main' is importable directly +import main + +class TestPushRulesPerformance(unittest.TestCase): + def setUp(self): + self.mock_client = MagicMock() + self.profile_id = "test-profile" + self.folder_name = "test-folder" + self.folder_id = "test-folder-id" + self.do = 0 + self.status = 1 + self.existing_rules = set() + + @patch('concurrent.futures.ThreadPoolExecutor') + def test_single_batch_thread_pool_usage(self, mock_executor): + """ + Verify optimization: ThreadPoolExecutor is NOT used for single batch. + """ + # Setup: < 500 rules (one batch) + hostnames = [f"example{i}.com" for i in range(100)] + + # Mock executor context manager (should not be called, but just in case) + mock_executor_instance = MagicMock() + mock_executor.return_value.__enter__.return_value = mock_executor_instance + + # Call push_rules + main.push_rules( + self.profile_id, + self.folder_name, + self.folder_id, + self.do, + self.status, + hostnames, + self.existing_rules, + self.mock_client + ) + + # Verification: Check if ThreadPoolExecutor was used + # OPTIMIZATION: Should NOT be called + mock_executor.assert_not_called() + + @patch('concurrent.futures.ThreadPoolExecutor') + def test_multiple_batches_thread_pool_usage(self, mock_executor): + """ + Verify multiple batches still use ThreadPoolExecutor. + """ + # Setup: > 500 rules (multiple batches) + hostnames = [f"example{i}.com" for i in range(1000)] + + # Mock executor context manager + mock_executor_instance = MagicMock() + mock_executor.return_value.__enter__.return_value = mock_executor_instance + + # Mock submit to return a Future + future = concurrent.futures.Future() + future.set_result(["processed"]) + mock_executor_instance.submit.return_value = future + + # Call push_rules + main.push_rules( + self.profile_id, + self.folder_name, + self.folder_id, + self.do, + self.status, + hostnames, + self.existing_rules, + self.mock_client + ) + + # Verification: executor should be called + mock_executor.assert_called()