Rebuild and validate article search indexes#233
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
user1303836
left a comment
There was a problem hiding this comment.
Review
Overall
Good defensive feature -- startup health checks, automatic rebuild from SQLite, and dimension/model validation for zvec collections. Well-structured with solid test coverage.
Blocking
- mypy type error.
vector_store.py:94--_read_collection_metadatareturnsjson.loads(...)which isAny, but the return type is declared asdict[str, Any] | None. CI confirms this fails type checking. Fix with an explicit cast or intermediate variable, e.g.:data = json.loads(path.read_text()) assert isinstance(data, dict) return data
Non-blocking
-
Redundant dimension checking in
_collection_needs_recreate. It checksactual_dimension != self._dimensionsfrom the schema, then also checksmetadata.get("dimensions") != self._dimensionsfrom the JSON file. These should always agree after the first write. The second check is only useful if someone manually edits the metadata file. Not harmful, just redundant. -
/indexalways recreates the collection now. The old behavior was additive (upsert). The new behavior callsrecreate_articles_collection()which destroys and rebuilds everything. This means running/indexalways re-embeds all content, even if 99% is already correct. For large datasets this could be expensive in embedding API calls. Was additive-with-cleanup insufficient? -
Health check false positive risk. The probe embeds a sample item and searches for it in the top 10 results. If the item exists but isn't in the top 10 (e.g., many similar items), the check reports unhealthy and triggers a full rebuild.
HEALTH_CHECK_TOPK = 10is reasonable but not bulletproof. -
rendered_resultstracking is a good catch. Previously, if vector search returned IDs that no longer exist in SQLite (orphaned references), the embed would show 0 fields with a misleading "N results" footer. The new code handles this correctly. -
Merge conflict with PR #234. Both PRs modify
vector_store.pyandrepository.pyfrom the same base commit. This PR should merge first (it's more foundational), then #234 should rebase.
CI
Lint failure is pre-existing (7 unrelated files). Type check failure is real and caused by this PR -- must fix before merging.
Verdict
Fix the mypy error, then good to merge. Should go in before PR #234.
a310b73 to
0315d84
Compare
Summary
/indexperform a full rebuild so it can repair drift instead of only upsertingTesting
PYTHONPATH=/Users/user1303836/Development/intelstream-codex-search-index-recovery/src /Users/user1303836/Development/intelstream/.venv/bin/pytest tests/test_vector_store.py tests/test_discord/test_search.py tests/test_database.py -q/Users/user1303836/Development/intelstream/.venv/bin/ruff check src/intelstream/database/vector_store.py src/intelstream/database/repository.py src/intelstream/discord/cogs/search.py src/intelstream/bot.py tests/test_vector_store.py tests/test_discord/test_search.py/Users/user1303836/Development/intelstream/.venv/bin/ruff format --check src/intelstream/database/vector_store.py src/intelstream/database/repository.py src/intelstream/discord/cogs/search.py src/intelstream/bot.py tests/test_vector_store.py tests/test_discord/test_search.pyCloses #228
Closes #229