Skip to content

Improve retrieval path quality and metadata-aware expansion#7

Draft
dtype2100 wants to merge 1 commit into
cursor/advanced-rag-refactor-24cafrom
cursor/rag-retrieval-path-improve-24ca
Draft

Improve retrieval path quality and metadata-aware expansion#7
dtype2100 wants to merge 1 commit into
cursor/advanced-rag-refactor-24cafrom
cursor/rag-retrieval-path-improve-24ca

Conversation

@dtype2100
Copy link
Copy Markdown
Owner

Summary

  • improve CRAG retrieval path to fuse multi-query results with RRF for higher recall before reranking
  • preserve ingest metadata during recursive chunking and store chunk/doc provenance for downstream expansion quality
  • fix expansion policy to measure actual child text length for dict-shaped retrieval results
  • harden Qdrant point ID generation to include provenance metadata and reduce cross-document overwrite risk
  • add/extend unit tests for expansion policy dict inputs and RRF consensus ranking behavior

Testing

  • export PATH="$HOME/.local/bin:$PATH" && make lint
  • python3 -m pytest tests/unit/test_expansion_policy.py tests/unit/test_retrievers.py -q

Notes

  • pytest emitted one config warning (Unknown config option: asyncio_mode) but tests passed.
  • to run lint/tests in fresh environments, user-local binaries (~/.local/bin) may need to be added to PATH.
Open in Web Open in Cursor 

Co-authored-by: jwlee <dtype2100@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 20, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3eee1fe8-b2a8-4031-9249-0c9e0cb1ae5b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cursor/rag-retrieval-path-improve-24ca

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the RAG pipeline by implementing multi-query RRF fusion in the retrieval stage and improving the context expansion logic to support parent-child chunk relationships. It also updates the chunking process to preserve metadata and modifies the vector store ID generation to be more robust by including provenance fields. Review feedback highlights potential TypeError risks when handling null metadata, suggests a safer serialization method for deterministic IDs to avoid collisions, and recommends returning a filtered list in the expansion node to ensure data consistency.

Comment on lines +72 to 81
key_parts = [
text,
str(metadata.get("source", "")),
str(metadata.get("doc_id", "")),
str(metadata.get("parent_id", "")),
str(metadata.get("chunk_index", "")),
]
key = "|".join(key_parts)
return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The use of a simple pipe separator | to join metadata fields for ID generation is prone to collisions if the chunk text or metadata values (like source or doc_id) also contain pipes. This could lead to unintended document overwrites in Qdrant. Using a more robust serialization like json.dumps is safer for generating deterministic keys from multiple fields.

Suggested change
key_parts = [
text,
str(metadata.get("source", "")),
str(metadata.get("doc_id", "")),
str(metadata.get("parent_id", "")),
str(metadata.get("chunk_index", "")),
]
key = "|".join(key_parts)
return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))
"""Generate a stable UUID-v5 using text + provenance metadata."""
import json
# Use JSON to safely serialize key parts and avoid delimiter collisions
key = json.dumps([
text,
metadata.get("source"),
metadata.get("doc_id"),
metadata.get("parent_id"),
metadata.get("chunk_index"),
])
return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))

Comment thread app/graphs/crag/nodes.py
}
)

return {"expanded_contexts": expanded or children}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback expanded or children can lead to inconsistent results. If expanded is empty because all retrieved children were filtered out (e.g., due to being empty strings or duplicates), returning the original children list re-introduces those invalid items and potentially breaks the expected dictionary structure of the context list. It is better to return the expanded list directly, even if empty.

Suggested change
return {"expanded_contexts": expanded or children}
return {"expanded_contexts": expanded}

Comment thread app/graphs/crag/nodes.py
expanded.append({"text": text, "score": 0.0, "metadata": {}})
continue

metadata = dict(child.get("metadata", {}))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling dict() on child.get("metadata", {}) will raise a TypeError if the "metadata" key exists in the dictionary but its value is explicitly set to null (None). Using or {} ensures a dictionary is always passed to the constructor.

Suggested change
metadata = dict(child.get("metadata", {}))
metadata = dict(child.get("metadata") or {})

for doc_idx, item in enumerate(docs):
if isinstance(item, dict):
text = str(item.get("text", ""))
base_meta = dict(item.get("metadata", {}))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the issue in the CRAG nodes, item.get("metadata", {}) will return None if the key is present but null, causing dict(None) to raise a TypeError. Using or {} provides a safer fallback for metadata extraction.

Suggested change
base_meta = dict(item.get("metadata", {}))
base_meta = dict(item.get("metadata") or {})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants