⚡ Bolt: [performance improvement] Optimize CivicRAG retrieval with pre-tokenization #699

RohanExploit wants to merge 1 commit into `main` from `…e-tokenization`.
Conversation
💡 **What:** Implemented pre-tokenization and regex pre-compilation in the CivicRAG service.

- Pre-compiled the tokenization regular expression.
- Pre-calculated token sets for all civic policies during service initialization.
- Refactored the `retrieve` method to use these cached token sets for Jaccard similarity and title boost calculations.

🎯 **Why:** The previous implementation performed $O(N)$ tokenization operations (regex matching and set creation) on every retrieval call, where $N$ is the number of policies. This resulted in redundant CPU overhead and increased latency for every issue submission that used RAG.

📊 **Impact:** Reduces retrieval latency by approximately 4.8x.

- Baseline: ~0.0957 ms per retrieval.
- Optimized: ~0.0198 ms per retrieval.

🔬 **Measurement:** Verified using `benchmark_rag.py` (5000 iterations over the standard policy corpus). Ensured logic correctness by running `backend/tests/test_rag_service.py` and the full backend test suite (107 tests passed).
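For readers skimming the thread, here is a minimal, self-contained sketch of the pattern this PR describes (pre-compiled regex, token sets cached at init, Jaccard scoring in `retrieve`). The class shape, regex pattern, field names, and the title-boost weight are illustrative assumptions, not the actual FixMyBharat code:

```python
import re

# Pre-compiled once instead of compiling/matching on every call (pattern is an assumption).
_TOKEN_RE = re.compile(r"\b\w+\b")

def _tokenize(text: str) -> set[str]:
    return set(_TOKEN_RE.findall(text.lower()))

class CivicRAGSketch:
    def __init__(self, policies: list[dict]):
        self.policies = policies
        # Pre-tokenize every policy once so retrieve() does no per-policy regex work.
        self.pretokenized_policies = [
            {
                "content_tokens": _tokenize(f"{p.get('title', '')} {p.get('text', '')}"),
                "title_tokens": _tokenize(p.get('title', '')),
            }
            for p in policies
        ]

    def retrieve(self, query: str, threshold: float = 0.05):
        query_tokens = _tokenize(query)
        best, best_score = None, 0.0
        for policy, pre in zip(self.policies, self.pretokenized_policies):
            tokens = pre["content_tokens"]
            union = query_tokens | tokens
            # Jaccard similarity over the cached token sets.
            score = len(query_tokens & tokens) / len(union) if union else 0.0
            if query_tokens & pre["title_tokens"]:
                score += 0.1  # title boost; the real weight is not shown in this thread
            if score >= threshold and score > best_score:
                best, best_score = policy, score
        return best.get("text") if best else None

# Tiny usage example with invented data:
rag = CivicRAGSketch([{"title": "Potholes", "text": "Report road damage to the municipal corporation."}])
print(rag.retrieve("pothole on my road"))  # returns the policy text when enough tokens overlap
```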
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
✅ Deploy Preview for fixmybharat canceled.
🙏 Thank you for your contribution, @RohanExploit!

**Note:** The maintainers will monitor code quality and ensure the overall project flow isn't broken.
**📝 Walkthrough**

Documentation and implementation of a RAG performance optimization that pre-tokenizes the policy corpus and pre-compiles regex patterns during initialization, eliminating redundant tokenization operations in the retrieve loop to reduce latency.
**Estimated code review effort:** 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Pull request overview
Optimizes CivicRAG retrieval latency by reducing per-query tokenization work through regex pre-compilation and policy corpus pre-tokenization at initialization time.
Changes:
- Pre-compiles the tokenization regex and reuses it for all tokenization calls.
- Pre-tokenizes policy title/content once during service initialization and reuses token sets during retrieval.
- Adds a Bolt learning note documenting the RAG pre-tokenization optimization.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `backend/rag_service.py` | Pre-compiles the token regex and pre-tokenizes policies to avoid repeated work in `retrieve()`. |
| `.jules/bolt.md` | Documents the optimization as an engineering learning/action item. |
```python
# Performance Boost: Pre-tokenize all policies during initialization
# to avoid redundant O(N) processing on every retrieve call.
for policy in self.policies:
    content = f"{policy.get('title', '')} {policy.get('text', '')}"
    self.pretokenized_policies.append({
        "content_tokens": self._tokenize(content),
        "title_tokens": self._tokenize(policy.get('title', ''))
    })
```
The pre-tokenization loop assumes every policy is a dict (uses .get). If the JSON contains a non-dict entry, this will raise and be caught by the broad except, leaving self.policies populated but self.pretokenized_policies only partially built. That can lead to silent retrieval gaps later. Consider validating each item (e.g., skip/normalize non-dicts) and ensuring pretokenized_policies stays aligned with policies even if one entry is malformed (or fall back to on-the-fly tokenization for that entry).
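One hedged way to act on that suggestion is to normalize malformed entries so the two lists stay aligned. This is a sketch only; it reuses `self._tokenize` and the field names from the snippet above, not code from the actual PR:

```python
for policy in self.policies:
    if not isinstance(policy, dict):
        # Keep list lengths aligned even for malformed entries; empty sets simply never match.
        self.pretokenized_policies.append({"content_tokens": set(), "title_tokens": set()})
        continue
    content = f"{policy.get('title', '')} {policy.get('text', '')}"
    self.pretokenized_policies.append({
        "content_tokens": self._tokenize(content),
        "title_tokens": self._tokenize(policy.get('title', '')),
    })
```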
```python
for policy, pretokenized in zip(self.policies, self.pretokenized_policies):
    # Performance Boost: Use pre-calculated token sets
    policy_tokens = pretokenized["content_tokens"]
```
zip(self.policies, self.pretokenized_policies) will silently drop any trailing policies if pretokenized_policies is shorter (e.g., due to a partial initialization failure). Using an index-based loop with a length check (or iterating self.policies and tokenizing on-demand when pretokenized data is missing) avoids silently skipping documents.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/rag_service.py (1)
**30-47: ⚠️ Potential issue | 🟡 Minor**

Keep `self.policies` and `self.pretokenized_policies` consistent on failure.

The current try/except catches any exception during load or pre-tokenization and logs it, but leaves whatever partial state was built on `self`. If pre-tokenization fails after `self.policies` has been assigned from `json.load` (Line 33) but mid-way through the loop at Lines 38–43, `self.policies` will have N entries while `self.pretokenized_policies` has fewer. Downstream `retrieve` then silently operates on a truncated corpus (amplified by the non-strict `zip` on Line 71).

🛡️ Proposed defensive fix
```diff
 try:
     if os.path.exists(policies_path):
         with open(policies_path, 'r') as f:
-            self.policies = json.load(f)
-        logger.info(f"Loaded {len(self.policies)} civic policies for RAG.")
-
-        # Performance Boost: Pre-tokenize all policies during initialization
-        # to avoid redundant O(N) processing on every retrieve call.
-        for policy in self.policies:
-            content = f"{policy.get('title', '')} {policy.get('text', '')}"
-            self.pretokenized_policies.append({
-                "content_tokens": self._tokenize(content),
-                "title_tokens": self._tokenize(policy.get('title', ''))
-            })
+            policies = json.load(f)
+        logger.info(f"Loaded {len(policies)} civic policies for RAG.")
+
+        # Performance Boost: Pre-tokenize all policies during initialization
+        # to avoid redundant O(N) processing on every retrieve call.
+        pretokenized = [
+            {
+                "content_tokens": self._tokenize(f"{p.get('title', '')} {p.get('text', '')}"),
+                "title_tokens": self._tokenize(p.get('title', '')),
+            }
+            for p in policies
+        ]
+        self.policies = policies
+        self.pretokenized_policies = pretokenized
     else:
         logger.warning(f"Civic policies file not found at {policies_path}")
 except Exception as e:
     logger.error(f"Error loading policies: {e}")
+    self.policies = []
+    self.pretokenized_policies = []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/rag_service.py` around lines 30 - 47, Pre-tokenization can fail mid-loop leaving self.policies and self.pretokenized_policies inconsistent, so load and preprocess into local variables first and only assign to self.policies and self.pretokenized_policies after all policies are successfully pre-tokenized; alternatively, on exception ensure you rollback/clear both attributes before re-raising/logging. Specifically, use a local list (e.g., temp_policies and temp_pretokenized) while calling json.load and self._tokenize, and only set self.policies = temp_policies and self.pretokenized_policies = temp_pretokenized after the loop completes; also ensure the except block clears both self.policies and self.pretokenized_policies to avoid partial state affecting retrieve (which uses zip).
🧹 Nitpick comments (1)
backend/rag_service.py (1)
**71-71:** Use `zip(..., strict=True)` to guard against policy/token list desync.

If `self.pretokenized_policies` ever diverges in length from `self.policies` (e.g., pre-tokenization partially fails inside the `try/except` at Lines 30–47 and subsequent policies are skipped), `zip()` will silently truncate and `retrieve` will quietly ignore the tail of the corpus. Since Python 3.10+ supports `strict=True`, enforcing it converts this silent data loss into a loud error and also satisfies Ruff B905.

♻️ Proposed diff
```diff
-for policy, pretokenized in zip(self.policies, self.pretokenized_policies):
+for policy, pretokenized in zip(self.policies, self.pretokenized_policies, strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/rag_service.py` at line 71, Replace the plain zip used when iterating over self.policies and self.pretokenized_policies with zip(..., strict=True) so mismatched lengths raise immediately; specifically, change the loop that currently reads "for policy, pretokenized in zip(self.policies, self.pretokenized_policies):" to "for policy, pretokenized in zip(self.policies, self.pretokenized_policies, strict=True):" (this enforces that self.policies and self.pretokenized_policies stay in sync, surfaces any pretokenization failures as an error, and satisfies Ruff B905).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.jules/bolt.md:
- Line 85: The playbook entry header "## 2026-05-16 - RAG Pre-tokenization
Bottleneck" is dated in the future; update that header to the actual
authoring/PR date (e.g., "## 2026-04-24 - RAG Pre-tokenization Bottleneck") so
the chronology matches other entries and retains the exact title and body of the
entry.
---
Outside diff comments:
In `@backend/rag_service.py`:
- Around line 30-47: Pre-tokenization can fail mid-loop leaving self.policies
and self.pretokenized_policies inconsistent, so load and preprocess into local
variables first and only assign to self.policies and self.pretokenized_policies
after all policies are successfully pre-tokenized; alternatively, on exception
ensure you rollback/clear both attributes before re-raising/logging.
Specifically, use a local list (e.g., temp_policies and temp_pretokenized) while
calling json.load and self._tokenize, and only set self.policies = temp_policies
and self.pretokenized_policies = temp_pretokenized after the loop completes;
also ensure the except block clears both self.policies and
self.pretokenized_policies to avoid partial state affecting retrieve (which uses
zip).
---
Nitpick comments:
In `@backend/rag_service.py`:
- Line 71: Replace the plain zip used when iterating over self.policies and
self.pretokenized_policies with zip(..., strict=True) so mismatched lengths
raise immediately; specifically, change the loop that currently reads "for
policy, pretokenized in zip(self.policies, self.pretokenized_policies):" to "for
policy, pretokenized in zip(self.policies, self.pretokenized_policies,
strict=True):" (this enforces that self.policies and self.pretokenized_policies
stay in sync, surfaces any pretokenization failures as an error, and satisfies
Ruff B905).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6a24e1db-0145-4047-9736-fb018b3e6bd5
📒 Files selected for processing (2)
- `.jules/bolt.md`
- `backend/rag_service.py`
```markdown
**Learning:** Caching raw Python objects (like SQLAlchemy models or Pydantic instances) in a high-traffic API still incurs significant overhead because FastAPI/Pydantic must re-serialize the data on every request.

**Action:** Serialize data to a JSON string using `json.dumps()` BEFORE caching. On cache hits, return a raw `fastapi.Response(content=..., media_type="application/json")`. This bypasses the validation and serialization layer, resulting in significant performance gains (up to 50x in benchmarks).

## 2026-05-16 - RAG Pre-tokenization Bottleneck
```
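As an aside, the earlier playbook entry quoted above (serialize before caching, return a raw response on hits) can be sketched roughly as follows. This is a minimal illustration with an in-memory dict standing in for the real cache; the route and data are invented for the example:

```python
import json
from fastapi import FastAPI, Response

app = FastAPI()
_cache: dict[str, str] = {}  # stand-in for Redis or another shared cache

@app.get("/policies")
async def list_policies() -> Response:
    cached = _cache.get("policies")
    if cached is not None:
        # Cache hit: return the pre-serialized JSON string directly,
        # skipping Pydantic validation/serialization on this request.
        return Response(content=cached, media_type="application/json")
    data = [{"title": "Sample policy", "text": "..."}]  # normally loaded from the database
    body = json.dumps(data)  # serialize once, BEFORE caching
    _cache["policies"] = body
    return Response(content=body, media_type="application/json")
```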
Playbook entry dated in the future.
The new entry is dated 2026-05-16, but this PR was opened on 2026-04-24. Other entries use the date the learning was added, so consider correcting this to the actual authoring date to keep the playbook's chronology accurate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.jules/bolt.md at line 85, The playbook entry header "## 2026-05-16 - RAG
Pre-tokenization Bottleneck" is dated in the future; update that header to the
actual authoring/PR date (e.g., "## 2026-04-24 - RAG Pre-tokenization
Bottleneck") so the chronology matches other entries and retains the exact title
and body of the entry.
1 issue found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/rag_service.py">
<violation number="1" location="backend/rag_service.py:71">
P2: Using `zip(self.policies, self.pretokenized_policies)` can silently skip valid policies when pretokenization list length is shorter than policies.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```python
for policy, pretokenized in zip(self.policies, self.pretokenized_policies):
    # Performance Boost: Use pre-calculated token sets
    policy_tokens = pretokenized["content_tokens"]
```
P2: Using zip(self.policies, self.pretokenized_policies) can silently skip valid policies when pretokenization list length is shorter than policies.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/rag_service.py, line 71:
<comment>Using `zip(self.policies, self.pretokenized_policies)` can silently skip valid policies when pretokenization list length is shorter than policies.</comment>
```
<file context>
@@ -54,10 +68,9 @@ def retrieve(self, query: str, threshold: float = 0.05) -> Optional[str]:
- # combine title and text for matching
- policy_content = f"{policy.get('title', '')} {policy.get('text', '')}"
- policy_tokens = self._tokenize(policy_content)
+ for policy, pretokenized in zip(self.policies, self.pretokenized_policies):
+     # Performance Boost: Use pre-calculated token sets
+     policy_tokens = pretokenized["content_tokens"]
</file context>
```
Suggested change:

```diff
-for policy, pretokenized in zip(self.policies, self.pretokenized_policies):
-    # Performance Boost: Use pre-calculated token sets
-    policy_tokens = pretokenized["content_tokens"]
+for idx, policy in enumerate(self.policies):
+    pretokenized = self.pretokenized_policies[idx] if idx < len(self.pretokenized_policies) else {
+        "content_tokens": self._tokenize(f"{policy.get('title', '')} {policy.get('text', '')}"),
+        "title_tokens": self._tokenize(policy.get('title', ''))
+    }
+    # Performance Boost: Use pre-calculated token sets
+    policy_tokens = pretokenized["content_tokens"]
```
This PR implements a performance optimization for the `CivicRAG` service by pre-tokenizing the policy corpus and pre-compiling the tokenizer's regular expression. These changes significantly reduce the computational overhead of the retrieval process, resulting in a ~4.8x speedup in retrieval latency as measured by benchmarks. All existing RAG tests and the full backend test suite pass successfully.

PR created automatically by Jules for task 6287993921712223033 started by @RohanExploit
**Summary by cubic**

Optimized `CivicRAG` retrieval by pre-tokenizing all policies and pre-compiling the tokenizer regex, removing per-request tokenization. Retrieval latency drops ~4.8x; `retrieve` now uses cached token sets computed at initialization.

Written for commit dc32172. Summary will update on new commits.
**Summary by CodeRabbit**

- Performance Improvements
- Documentation