Improve retrieval path quality and metadata-aware expansion#7
Conversation
Co-authored-by: jwlee <dtype2100@users.noreply.github.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request enhances the RAG pipeline by implementing multi-query RRF fusion in the retrieval stage and improving the context expansion logic to support parent-child chunk relationships. It also updates the chunking process to preserve metadata and modifies the vector store ID generation to be more robust by including provenance fields. Review feedback highlights potential TypeError risks when handling null metadata, suggests a safer serialization method for deterministic IDs to avoid collisions, and recommends returning a filtered list in the expansion node to ensure data consistency.
| key_parts = [ | ||
| text, | ||
| str(metadata.get("source", "")), | ||
| str(metadata.get("doc_id", "")), | ||
| str(metadata.get("parent_id", "")), | ||
| str(metadata.get("chunk_index", "")), | ||
| ] | ||
| key = "|".join(key_parts) | ||
| return str(uuid.uuid5(uuid.NAMESPACE_DNS, key)) | ||
|
|
There was a problem hiding this comment.
The use of a simple pipe separator | to join metadata fields for ID generation is prone to collisions if the chunk text or metadata values (like source or doc_id) also contain pipes. This could lead to unintended document overwrites in Qdrant. Using a more robust serialization like json.dumps is safer for generating deterministic keys from multiple fields.
| key_parts = [ | |
| text, | |
| str(metadata.get("source", "")), | |
| str(metadata.get("doc_id", "")), | |
| str(metadata.get("parent_id", "")), | |
| str(metadata.get("chunk_index", "")), | |
| ] | |
| key = "|".join(key_parts) | |
| return str(uuid.uuid5(uuid.NAMESPACE_DNS, key)) | |
| """Generate a stable UUID-v5 using text + provenance metadata.""" | |
| import json | |
| # Use JSON to safely serialize key parts and avoid delimiter collisions | |
| key = json.dumps([ | |
| text, | |
| metadata.get("source"), | |
| metadata.get("doc_id"), | |
| metadata.get("parent_id"), | |
| metadata.get("chunk_index"), | |
| ]) | |
| return str(uuid.uuid5(uuid.NAMESPACE_DNS, key)) |
| } | ||
| ) | ||
|
|
||
| return {"expanded_contexts": expanded or children} |
There was a problem hiding this comment.
The fallback expanded or children can lead to inconsistent results. If expanded is empty because all retrieved children were filtered out (e.g., due to being empty strings or duplicates), returning the original children list re-introduces those invalid items and potentially breaks the expected dictionary structure of the context list. It is better to return the expanded list directly, even if empty.
| return {"expanded_contexts": expanded or children} | |
| return {"expanded_contexts": expanded} |
| expanded.append({"text": text, "score": 0.0, "metadata": {}}) | ||
| continue | ||
|
|
||
| metadata = dict(child.get("metadata", {})) |
There was a problem hiding this comment.
Calling dict() on child.get("metadata", {}) will raise a TypeError if the "metadata" key exists in the dictionary but its value is explicitly set to null (None). Using or {} ensures a dictionary is always passed to the constructor.
| metadata = dict(child.get("metadata", {})) | |
| metadata = dict(child.get("metadata") or {}) |
| for doc_idx, item in enumerate(docs): | ||
| if isinstance(item, dict): | ||
| text = str(item.get("text", "")) | ||
| base_meta = dict(item.get("metadata", {})) |
There was a problem hiding this comment.
Similar to the issue in the CRAG nodes, item.get("metadata", {}) will return None if the key is present but null, causing dict(None) to raise a TypeError. Using or {} provides a safer fallback for metadata extraction.
| base_meta = dict(item.get("metadata", {})) | |
| base_meta = dict(item.get("metadata") or {}) |
Summary
Testing
export PATH="$HOME/.local/bin:$PATH" && make lintpython3 -m pytest tests/unit/test_expansion_policy.py tests/unit/test_retrievers.py -qNotes
Unknown config option: asyncio_mode) but tests passed.~/.local/bin) may need to be added to PATH.