8 changes: 8 additions & 0 deletions .jules/bolt.md
@@ -85,3 +85,11 @@
## 2026-05-16 - Pre-processing for RAG Retrieval
**Learning:** In RAG (Retrieval-Augmented Generation) systems with static or semi-static policy datasets, performing tokenization, regex substitution, and string formatting inside the retrieval loop is a significant bottleneck that scales with the number of policies.
**Action:** Move all deterministic operations (tokenization, formatting, regex matching prep) to a one-time initialization step to ensure the retrieval hot-path only performs necessary set intersections and similarity calculations.
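
A minimal sketch of the pattern this entry describes, assuming a hypothetical `PolicyIndex` class and `tokenize` helper (neither name comes from the repository). Tokenization and string formatting happen once in the constructor, so the per-query loop is left with set operations only:

```python
import re
from typing import Dict, List, Set

# Compiled once at import time rather than on every call.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(_TOKEN_RE.findall(text.lower()))


class PolicyIndex:
    """Hypothetical index that does all per-policy work once, at construction."""

    def __init__(self, policies: List[Dict[str, str]]) -> None:
        self._prepared = [
            {
                "tokens": tokenize(f"{p['title']} {p['text']}"),
                "formatted": f"**{p['title']}**: {p['text']}",
            }
            for p in policies
        ]

    def candidates(self, query: str) -> List[str]:
        """Hot path: tokenize the query once, then do set checks only."""
        query_tokens = tokenize(query)
        return [
            p["formatted"]
            for p in self._prepared
            if query_tokens & p["tokens"]
        ]


index = PolicyIndex([
    {"title": "Refunds", "text": "Refunds are issued within 30 days."},
    {"title": "Shipping", "text": "Orders ship in two business days."},
])
print(index.candidates("How do refunds work?"))
```

The trade-off is memory for time: the prepared token sets stay resident, which is reasonable only while the policy dataset is static or changes rarely.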

## 2025-05-18 - Optimized Jaccard Similarity for RAG
Contributor

@cubic-dev-ai cubic-dev-ai Bot Apr 28, 2026

P3: Duplicate section: there are two identical `## 2025-05-18 - Optimized Jaccard Similarity for RAG` headings. This first copy also has missing inline code (`Combining this with  for early exits`, where the gap should be `isdisjoint()`). Remove this broken duplicate and keep the complete entry below.

**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A ∪ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with for early exits significantly reduces CPU cycles for non-matching documents.
**Action:** Use mathematical union length and for set similarity comparisons in high-frequency retrieval paths.
Comment on lines +90 to +91
Copilot AI Apr 28, 2026

This entry has missing inline code: "Combining this with for early exits" / "Use ... and for ...". It looks like isdisjoint() was intended here; please fill in the missing method name (and wrap it in backticks for consistency).

Suggested change
-**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A ∪ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with for early exits significantly reduces CPU cycles for non-matching documents.
-**Action:** Use mathematical union length and for set similarity comparisons in high-frequency retrieval paths.
+**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A ∪ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with `isdisjoint()` for early exits significantly reduces CPU cycles for non-matching documents.
+**Action:** Use mathematical union length and `isdisjoint()` for set similarity comparisons in high-frequency retrieval paths.

## 2025-05-18 - Optimized Jaccard Similarity for RAG
Comment on lines +90 to +93
Copilot AI Apr 28, 2026

This section is duplicated (two identical "## 2025-05-18 - Optimized Jaccard Similarity for RAG" entries). Please remove one to avoid conflicting guidance / unnecessary repetition in the Bolt notes.

Suggested change
**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A ∪ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with for early exits significantly reduces CPU cycles for non-matching documents.
**Action:** Use mathematical union length and for set similarity comparisons in high-frequency retrieval paths.
## 2025-05-18 - Optimized Jaccard Similarity for RAG

**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A ∪ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with `isdisjoint()` for early exits significantly reduces CPU cycles for non-matching documents.
**Action:** Use mathematical union length and `isdisjoint()` for set similarity comparisons in high-frequency retrieval paths.
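
The entry above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the code in `backend/rag_service.py`: the `jaccard` name and the `doc_len` parameter are hypothetical, and `doc_len` is assumed to be `len(doc_tokens)` computed ahead of time.

```python
from typing import Set


def jaccard(query_tokens: Set[str], doc_tokens: Set[str], doc_len: int) -> float:
    """Jaccard similarity without materializing the union set.

    doc_len is assumed to equal len(doc_tokens), computed once in advance.
    """
    # Early exit: with no shared token the score is 0, so skip the arithmetic.
    if query_tokens.isdisjoint(doc_tokens):
        return 0.0
    intersection_len = len(query_tokens & doc_tokens)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
    union_len = len(query_tokens) + doc_len - intersection_len
    return intersection_len / union_len


query = {"refund", "policy", "days"}
doc = {"refund", "days", "issued", "within"}
score = jaccard(query, doc, len(doc))
print(score)  # intersection of 2 over a union of 3 + 4 - 2 = 5
```

The intersection set is still built here; only the union set is avoided, since its size follows from the three lengths.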
Comment on lines +89 to +95

⚠️ Potential issue | 🟡 Minor

Remove the duplicate/broken RAG optimization entry.

Line 89 duplicates the heading from Line 93, and the first copy has incomplete inline references (with ... / and ...). Keep only one corrected section to avoid MD024 and unclear guidance.

🧹 Suggested cleanup
-## 2025-05-18 - Optimized Jaccard Similarity for RAG
-**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A ∪ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with  for early exits significantly reduces CPU cycles for non-matching documents.
-**Action:** Use mathematical union length and  for set similarity comparisons in high-frequency retrieval paths.
-
 ## 2025-05-18 - Optimized Jaccard Similarity for RAG
 **Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A ∪ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with `isdisjoint()` for early exits significantly reduces CPU cycles for non-matching documents.
 **Action:** Use mathematical union length and `isdisjoint()` for set similarity comparisons in high-frequency retrieval paths.
πŸ“ Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 2025-05-18 - Optimized Jaccard Similarity for RAG
**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A βˆͺ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with for early exits significantly reduces CPU cycles for non-matching documents.
**Action:** Use mathematical union length and for set similarity comparisons in high-frequency retrieval paths.
## 2025-05-18 - Optimized Jaccard Similarity for RAG
**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A βˆͺ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with `isdisjoint()` for early exits significantly reduces CPU cycles for non-matching documents.
**Action:** Use mathematical union length and `isdisjoint()` for set similarity comparisons in high-frequency retrieval paths.
## 2025-05-18 - Optimized Jaccard Similarity for RAG
**Learning:** Calculating Jaccard similarity in a hot loop can be optimized by using the inclusion-exclusion principle (|A βˆͺ B| = |A| + |B| - |A ∩ B|) to avoid the overhead of set union construction. Combining this with `isdisjoint()` for early exits significantly reduces CPU cycles for non-matching documents.
**Action:** Use mathematical union length and `isdisjoint()` for set similarity comparisons in high-frequency retrieval paths.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 93-93: Multiple headings with the same content

(MD024, no-duplicate-heading)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.jules/bolt.md around lines 89 - 95, Remove the duplicate/broken "Optimized
Jaccard Similarity for RAG" entry and keep the corrected version: locate the two
identical headings "Optimized Jaccard Similarity for RAG", delete the first
block that contains incomplete inline references ("with  ..." / "and  ..."), and
ensure only the second, corrected paragraph (mentioning inclusion-exclusion
principle and `isdisjoint()` with the action to use mathematical union length
and `isdisjoint()`) remains as the single entry.

25 changes: 15 additions & 10 deletions backend/rag_service.py
@@ -46,10 +46,12 @@ def _prepare_policies(self):
             source = policy.get('source', 'Unknown')
 
             content = f"{title} {text}"
+            content_tokens = self._tokenize(content)
 
             self._prepared_policies.append({
                 'title_tokens': self._tokenize(title),
-                'content_tokens': self._tokenize(content),
+                'content_tokens': content_tokens,
+                'content_tokens_len': len(content_tokens),
                 'formatted': f"**{title}**: {text} (Source: {source})",
                 'original': policy
             })
@@ -73,30 +75,33 @@ def retrieve(self, query: str, threshold: float = 0.05) -> Optional[str]:
         if not query_tokens:
             return None
 
+        query_tokens_len = len(query_tokens)
         best_score = 0.0
         best_formatted = None
 
         for prepared in self._prepared_policies:
             policy_tokens = prepared['content_tokens']
 
-            if not policy_tokens:
+            # Optimization: Use isdisjoint() for fast early exit
+            if query_tokens.isdisjoint(policy_tokens):
                 continue
 
             # Jaccard Similarity
+            # Optimization: Mathematical union length |A union B| = |A| + |B| - |A intersection B|
+            # This avoids the overhead of building a new set with .union()
             intersection = query_tokens.intersection(policy_tokens)
-            # Use pre-calculated set for union if possible?
-            # Union depends on query_tokens, so must be calculated.
-            union = query_tokens.union(policy_tokens)
+            intersection_len = len(intersection)
 
-            if not union:
+            union_len = query_tokens_len + prepared['content_tokens_len'] - intersection_len
+
+            if union_len == 0:
Contributor

@cubic-dev-ai cubic-dev-ai Bot Apr 28, 2026

P3: This `if union_len == 0: continue` branch is unreachable. After the `isdisjoint()` early-exit above, both sets are guaranteed non-empty and share at least one token, so `union_len` (= `query_tokens_len + content_tokens_len - intersection_len`) is always ≥ 1. Remove the dead branch to simplify the hot path.

                 continue

Comment on lines +96 to 99
Copilot AI Apr 28, 2026

After the isdisjoint() early-exit (and with query_tokens already verified non-empty), union_len cannot be 0 here because the sets must have at least one shared token. This if union_len == 0: continue branch is therefore unreachable and can be removed to simplify the hot path.

Suggested change
if union_len == 0:
continue
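
The reachability argument in this comment can be checked by brute force over a small universe. The sketch below is illustrative only: the three-token universe and the `union_len` helper are assumptions, not code from the PR. It enumerates every pair of subsets that would get past the `isdisjoint()` guard and confirms the computed union length never reaches 0.

```python
from itertools import combinations

universe = ["a", "b", "c"]
# All 8 subsets of the universe, including the empty set.
subsets = [set(c) for r in range(len(universe) + 1) for c in combinations(universe, r)]


def union_len(a: set, b: set) -> int:
    """Union size via inclusion-exclusion, mirroring the expression in the diff."""
    return len(a) + len(b) - len(a & b)


# Pairs that would NOT be skipped by `if a.isdisjoint(b): continue`.
reachable = [(a, b) for a in subsets for b in subsets if not a.isdisjoint(b)]
min_union = min(union_len(a, b) for a, b in reachable)
print(min_union)
```

An empty set is disjoint from every set, so the `isdisjoint()` guard also covers the empty-policy case that the removed `if not policy_tokens:` check used to handle.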

-            score = len(intersection) / len(union)
+            score = intersection_len / union_len
 
             # Boost score if title words match (weighted)
-            title_tokens = prepared['title_tokens']
-            title_match = len(query_tokens.intersection(title_tokens))
-            if title_match > 0:
+            # Optimization: Use isdisjoint() for faster boolean check
+            if not query_tokens.isdisjoint(prepared['title_tokens']):
                 score += 0.2  # Bonus for title match
 
             if score > best_score:
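
The title-boost change swaps a counted intersection for a boolean check. A short sketch of why the two forms agree, with hypothetical function names and sample tokens:

```python
from typing import Set


def has_title_match_old(query_tokens: Set[str], title_tokens: Set[str]) -> bool:
    # Builds the full intersection set just to test whether it is non-empty.
    return len(query_tokens.intersection(title_tokens)) > 0


def has_title_match_new(query_tokens: Set[str], title_tokens: Set[str]) -> bool:
    # Only asks whether any common element exists; no intermediate set is built.
    return not query_tokens.isdisjoint(title_tokens)


cases = [
    ({"refund", "policy"}, {"refund", "rules"}),  # one shared token
    ({"refund", "policy"}, {"shipping"}),         # nothing shared
    ({"refund"}, set()),                          # empty title
]
results = [(has_title_match_old(q, t), has_title_match_new(q, t)) for q, t in cases]
print(results)
```

Because the bonus is a flat 0.2 regardless of how many title tokens match, the count itself is never needed, which is what makes the boolean form a safe replacement here.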