Improve retrieval path quality and metadata-aware expansion by dtype2100 · Pull Request #7 · dtype2100/Advanced-RAG

dtype2100 · 2026-04-20T15:36:49Z

Summary

improve CRAG retrieval path to fuse multi-query results with RRF for higher recall before reranking
preserve ingest metadata during recursive chunking and store chunk/doc provenance for downstream expansion quality
fix expansion policy to measure actual child text length for dict-shaped retrieval results
harden Qdrant point ID generation to include provenance metadata and reduce cross-document overwrite risk
add/extend unit tests for expansion policy dict inputs and RRF consensus ranking behavior

Testing

export PATH="$HOME/.local/bin:$PATH" && make lint
python3 -m pytest tests/unit/test_expansion_policy.py tests/unit/test_retrievers.py -q

Notes

pytest emitted one config warning (Unknown config option: asyncio_mode) but tests passed.
to run lint/tests in fresh environments, user-local binaries (~/.local/bin) may need to be added to PATH.

Co-authored-by: jwlee <dtype2100@users.noreply.github.com>

coderabbitai · 2026-04-20T15:36:56Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3eee1fe8-b2a8-4031-9249-0c9e0cb1ae5b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch cursor/rag-retrieval-path-improve-24ca

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request enhances the RAG pipeline by implementing multi-query RRF fusion in the retrieval stage and improving the context expansion logic to support parent-child chunk relationships. It also updates the chunking process to preserve metadata and modifies the vector store ID generation to be more robust by including provenance fields. Review feedback highlights potential TypeError risks when handling null metadata, suggests a safer serialization method for deterministic IDs to avoid collisions, and recommends returning a filtered list in the expansion node to ensure data consistency.

gemini-code-assist · 2026-04-20T15:38:18Z

+        key_parts = [
+            text,
+            str(metadata.get("source", "")),
+            str(metadata.get("doc_id", "")),
+            str(metadata.get("parent_id", "")),
+            str(metadata.get("chunk_index", "")),
+        ]
+        key = "|".join(key_parts)
+        return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))



The use of a simple pipe separator | to join metadata fields for ID generation is prone to collisions if the chunk text or metadata values (like source or doc_id) also contain pipes. This could lead to unintended document overwrites in Qdrant. Using a more robust serialization like json.dumps is safer for generating deterministic keys from multiple fields.

Suggested change

key_parts = [

text,

str(metadata.get("source", "")),

str(metadata.get("doc_id", "")),

str(metadata.get("parent_id", "")),

str(metadata.get("chunk_index", "")),

]

key = "|".join(key_parts)

return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))

"""Generate a stable UUID-v5 using text + provenance metadata."""

import json

# Use JSON to safely serialize key parts and avoid delimiter collisions

key = json.dumps([

text,

metadata.get("source"),

metadata.get("doc_id"),

metadata.get("parent_id"),

metadata.get("chunk_index"),

])

return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))

gemini-code-assist · 2026-04-20T15:38:19Z

+            }
+        )
+
+    return {"expanded_contexts": expanded or children}


The fallback expanded or children can lead to inconsistent results. If expanded is empty because all retrieved children were filtered out (e.g., due to being empty strings or duplicates), returning the original children list re-introduces those invalid items and potentially breaks the expected dictionary structure of the context list. It is better to return the expanded list directly, even if empty.

Suggested change

return {"expanded_contexts": expanded or children}

return {"expanded_contexts": expanded}

gemini-code-assist · 2026-04-20T15:38:19Z

+                expanded.append({"text": text, "score": 0.0, "metadata": {}})
+            continue
+
+        metadata = dict(child.get("metadata", {}))


Calling dict() on child.get("metadata", {}) will raise a TypeError if the "metadata" key exists in the dictionary but its value is explicitly set to null (None). Using or {} ensures a dictionary is always passed to the constructor.

Suggested change

metadata = dict(child.get("metadata", {}))

metadata = dict(child.get("metadata") or {})

gemini-code-assist · 2026-04-20T15:38:19Z

+    for doc_idx, item in enumerate(docs):
+        if isinstance(item, dict):
+            text = str(item.get("text", ""))
+            base_meta = dict(item.get("metadata", {}))


Similar to the issue in the CRAG nodes, item.get("metadata", {}) will return None if the key is present but null, causing dict(None) to raise a TypeError. Using or {} provides a safer fallback for metadata extraction.

Suggested change

base_meta = dict(item.get("metadata", {}))

base_meta = dict(item.get("metadata") or {})

Improve RAG retrieval path and metadata-aware expansion

1d8a40d

Co-authored-by: jwlee <dtype2100@users.noreply.github.com>

gemini-code-assist Bot reviewed Apr 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve retrieval path quality and metadata-aware expansion#7

Improve retrieval path quality and metadata-aware expansion#7
dtype2100 wants to merge 1 commit into
cursor/advanced-rag-refactor-24cafrom
cursor/rag-retrieval-path-improve-24ca

dtype2100 commented Apr 20, 2026

Uh oh!

coderabbitai Bot commented Apr 20, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	return {"expanded_contexts": expanded or children}
	return {"expanded_contexts": expanded}

	metadata = dict(child.get("metadata", {}))
	metadata = dict(child.get("metadata") or {})

	base_meta = dict(item.get("metadata", {}))
	base_meta = dict(item.get("metadata") or {})

Conversation

dtype2100 commented Apr 20, 2026

Summary

Testing

Notes

Uh oh!

coderabbitai Bot commented Apr 20, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants