Skip to content

Add SQLite-backed registry with JSON migration support#15

Open
kdush wants to merge 5 commits into
VectifyAI:mainfrom
kdush:feat/sqlite-storage-backend
Open

Add SQLite-backed registry with JSON migration support#15
kdush wants to merge 5 commits into
VectifyAI:mainfrom
kdush:feat/sqlite-storage-backend

Conversation

@kdush
Copy link
Copy Markdown

@kdush kdush commented Apr 11, 2026

Summary

Add SQLite-backed registry as the default storage backend, with automatic JSON migration support.

Changes

  • add DbRegistry class with SQLite backend, WAL mode, and JSON migration
  • add storage_backend config option (sqlite | json)
  • update CLI commands to use get_registry() factory
  • document storage backend options and migration behavior
  • add tests for SQLite registry, migration, and backend selection

Testing

  • 25 new tests passed for SQLite backend and migration
  • all existing tests pass
  • end-to-end verified: init, add, query, status, list
  • confirmed SQLite creates hashes.db, hashes.db-wal, hashes.db-shm

Backward Compatibility

  • storage_backend: json still works for existing setups
  • automatic migration from hashes.json to hashes.db when switching to SQLite
  • hashes.json preserved after migration for safety

@kdush kdush force-pushed the feat/sqlite-storage-backend branch from a56ee15 to 9436ad6 Compare April 11, 2026 04:50
@kdush kdush force-pushed the feat/sqlite-storage-backend branch 2 times, most recently from 71f9b4f to 9124f51 Compare April 12, 2026 13:54
@KylinMountain
Copy link
Copy Markdown
Collaborator

Code review

Found 2 issues:

  1. run_lint was not migrated to get_registry() and still reads hashes.json directly. With SQLite as the new default backend, hashes.json is never created, so hashes is always {} and openkb lint exits with "Nothing to lint — no documents indexed yet" regardless of how many documents are actually indexed. print_list (line 602) and print_status (line 689) were migrated correctly — only this call site was missed.

OpenKB/openkb/cli.py

Lines 541 to 551 in 9124f51

# Skip lint entirely when the KB has no indexed documents
hashes_file = openkb_dir / "hashes.json"
if hashes_file.exists():
hashes = json.loads(hashes_file.read_text(encoding="utf-8"))
else:
hashes = {}
if not hashes:
click.echo("Nothing to lint — no documents indexed yet. Run `openkb add` first.")
return

  1. JSON-to-SQLite migration is not atomic. _init_db() creates and commits the empty schema file before _migrate_from_json() runs. If the process is killed (SIGKILL, OOM, power loss, disk full) between those two steps, the DB file exists on disk but is empty. On the next startup, should_migrate = migrate_from is not None and not path.exists() evaluates to False, so migration is never retried — every entry in hashes.json is silently abandoned and openkb thinks the KB has never been ingested. The JSON file is preserved on disk but never read again. Fix: do _init_db + insert in a single transaction, or use a sentinel (schema_version row, completion marker) instead of file existence to gate migration.

OpenKB/openkb/state.py

Lines 82 to 94 in 9124f51

def __init__(self, path: Path, migrate_from: Path | None = None) -> None:
"""Initialize DbRegistry.
Args:
path: Path to SQLite database file.
migrate_from: Optional path to JSON file to migrate from.
Migration only happens if DB doesn't exist yet.
"""
self._path = path
should_migrate = migrate_from is not None and not path.exists()
self._init_db()
if should_migrate:
self._migrate_from_json(migrate_from)

Coordination note (not an issue, but worth flagging): PR #30 (still open) adds get_by_path and remove_by_doc_name to HashRegistry and calls them from cli.py / converter.py. DbRegistry here does not implement either method, so whichever PR lands second will need to add them to the other backend or users on storage_backend: sqlite will hit AttributeError on re-add.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

kdush added 5 commits May 13, 2026 11:14
review 反馈:当 storage_backend 为 sqlite(默认)时,hashes.json 不存在,
导致 lint 总是误判为无文档。对齐 print_list 和 print_status 的已有实现。
- 用 schema_meta 完成标记替代 DB 文件存在性判断,避免中断后永不重试
- get_registry() 只要 hashes.json 存在即传递 migrate_from,让 DbRegistry 自主判断
- 新增 get_by_path / remove_by_doc_name 到 HashRegistry 与 DbRegistry,
  防止后续 PR VectifyAI#30 合并时 SQLite 后端出现 AttributeError
@kdush kdush force-pushed the feat/sqlite-storage-backend branch from 9124f51 to 8cb4522 Compare May 13, 2026 03:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants