Skip to content

feat: add recall repair for data and index maintenance#62

Merged
edheltzel merged 2 commits into
mainfrom
feat/46-repair
Jun 11, 2026
Merged

feat: add recall repair for data and index maintenance#62
edheltzel merged 2 commits into
mainfrom
feat/46-repair

Conversation

@edheltzel

Copy link
Copy Markdown
Owner

Closes #46

What

Adds recall repair — explicit data/index maintenance, deliberately separate from recall doctor --fix (which stays symlinks-only and never runs data repair).

recall repair             # dry-run (default) — plan only, writes nothing
recall repair --execute   # apply repairs
recall repair -t messages # scope to one table
recall repair --no-embed  # skip the embedding pass

Repair scope (per the issue's acceptance criteria)

  • FTS5 rebuild for messages, decisions, learnings, breadcrumbs, loa_entries, telos, documents. Detection per table: indexed-row count via the _docsize shadow table (a plain COUNT(*) on an external-content FTS table reads through to the content table and proves nothing), sync-trigger presence, and the FTS5 integrity-check command. Missing indexes/triggers are recreated from canonical per-table schema DDL and rebuilt — this directly covers the un-migrated-DB → silently-empty-search gap flagged in the PR feat: add recall dedup — non-destructive dedup with provenance-aware survivor selection #60 review.
  • Re-embedding of rows missing embeddings (loa_entries, decisions, learnings, assistant messages) when Ollama is available and the row has enough source text. Service unavailable → reported, exit stays 0 unless another requested repair failed. Partial results report embedded/skipped/failed counts — failures are never hidden. Marked duplicates (dedup Add non-destructive dedup with provenance-aware survivor selection #45) are excluded.
  • Report-only invariants (never auto-repaired): orphaned embeddings, unknown embedding source tables, dedup lineage pointing at missing rows, messages without a session, broken LoA message ranges/parent links, pending migrations (recommends recall init).

Safety

  • Dry-run by default; --execute prints a recall export --backup recommendation.
  • Never hard-deletes rows; never changes Record Provenance (no source-table column is written) — both covered by tests.
  • recall doctor gains a read-only FTS health check that recommends recall repair; it carries no repair closure, so the --fix loop structurally cannot run data repair.

Implementation notes

  • schema.ts FTS DDL restructured into a per-table FTS_SCHEMA map; the legacy CREATE_FTS/CREATE_FTS_TRIGGERS strings are derived from it (single source of truth, enables per-table recreation).
  • embed.ts's per-table content composition now delegates to the canonical embeddingTextFor in src/lib/repair.ts (DRY).
  • All repair SQL binds a fixed parameter count; the embed-candidate walk uses keyset pagination per the src/lib/chunk.ts convention.
  • Embedding/service clients are injectable in runRepair so tests run deterministically offline.

Testing

  • 41 new tests (tests/lib/repair.test.ts, tests/commands/repair.test.ts): dry-run vs execute, drift/missing-index/missing-trigger rebuild, service-unavailable success path, partial embed success, no-hard-delete, provenance no-touch, orphan report-only, --table scoping, pending-migration report, doctor separation.
  • Full suite: 616 pass / 0 fail; bun run lint clean.
  • Live CLI smoke test against a scratch DB: drift detected in dry-run with no writes, rebuilt on --execute, search returns results again.

…t-only integrity checks

Explicit data/index maintenance separate from doctor --fix (issue #46).
Dry-run by default; --execute applies. Rebuilds drifted FTS5 indexes from
source tables, recreates missing indexes/sync triggers from canonical
per-table schema DDL (the un-migrated-DB silently-empty-search gap),
re-embeds rows missing embeddings when Ollama is available, and reports
orphan/invariant problems without mutating them. Never hard-deletes rows,
never touches Record Provenance. Doctor gains a read-only FTS health check
that recommends repair; --fix remains symlinks-only.

Closes #46
@edheltzel

Copy link
Copy Markdown
Owner Author

Review — recall repair (feat/46-repair @ 779be25)

Reviewed via multi-agent pass (general code review, test-coverage analysis, silent-failure audit) plus independent hands-on verification: live CLI exercise against scratch DBs, byte-level schema DDL comparison, and CI failure root-cause analysis. Every claim below was verified against the code or reproduced empirically, not taken from the PR description.

Verified ✅ (the core design claims hold)

  • _docsize drift detection is correct SQLite behavior — empirically demonstrated. On a scratch DB I removed one row from the index only (direct shadow-table writes are blocked by SQLite defensive mode, so I used the FTS5 'delete' special command): SELECT COUNT(*) FROM breadcrumbs_fts still returned 3 (external-content FTS reads through to the content table), while breadcrumbs_fts_docsize returned the true indexed count of 2. A naive COUNT would prove nothing; the _docsize count caught the desync exactly. Dry-run reported drift (3 source row(s)) — index has 2 row(s), source has 3 [would rebuild] and wrote nothing (search stayed broken); --execute rebuilt, search returned the row again, _docsize back to 3, and an immediate re-run reported "Nothing to repair." Exit 0 on both.
  • Repair is genuinely separate from doctor --fix — structural claim confirmed. checkFtsIndexes() (src/commands/doctor.ts) returns a plain CheckResult with no repair closure; the --fix loop applies repairs only from buildSymlinkProbes().map(probeSymlink), so data repair via doctor is structurally impossible. The FTS5 integrity-check special command the check relies on was verified non-mutating (byte-identical DB image before/after).
  • Dry-run writes nothing — verified by byte-level DB comparison and behaviorally (drift persists, search still broken after dry-run).
  • Report-only invariants stay report-only; no hard-deletes; provenance untouched. No DELETE statement exists anywhere in src/lib/repair.ts; writes go only to FTS shadow tables (rebuild) and the embeddings table. Covered by the 9-table row-count snapshot and provenance-snapshot tests.
  • Keyset pagination matches the chunk.ts conventiont.id > ? ORDER BY t.id LIMIT ? with fixed binds (src/lib/repair.ts:248-261); batches are materialized via stmt.all() before yielding, so inserting embeddings mid-walk can't corrupt the cursor, and the forward-only keyset can't loop on failed rows.
  • Schema restructure introduced zero semantic drift. I imported main's literal CREATE_FTS/CREATE_FTS_TRIGGERS and the PR's FTS_SCHEMA-derived strings side by side: byte-identical (1316 and 5867 chars respectively). All 7 _ad triggers intact; table/trigger order, tokenizers, and trigger bodies unchanged. The embed.ts delegation to embeddingTextFor is behavior-preserving (the loa alias maps to loa_entries).
  • 616/616 tests pass locally, tsc --noEmit clean. SQL injection surface clean (--table is allowlist-validated, plus an independent throw in checkFts; all dynamic identifiers come from hardcoded maps).

Blocker 1 — CI is red on the head commit, and it's not a flake (root-caused, one-line fix)

The macOS primary job failed on both attempt 1 and my rerun (attempt 2) with the same paradox: 616 pass / 0 fail, then bun test exits 1. I reproduced it locally and isolated the mechanism:

  1. invalid --table is rejected (tests/commands/repair.test.ts:311-314) calls runRepair({ table: 'sessions' }), which sets process.exitCode = 1 (src/commands/repair.ts:61) — and asserts it. So far so good.
  2. The afterEach restore (process.exitCode = originalExitCode, line 46) restores to undefined — and Bun silently ignores assigning undefined to process.exitCode. Probe: bun -e 'process.exitCode = 1; process.exitCode = undefined' exits 1. The poison is never cleared.
  3. Whether the suite then exits 0 or 1 depends on which test files happen to run afterward and whether their cleanup clears it — environment-dependent global state. On a dev machine with Ollama running the suite happens to end clean; in CI (no Ollama) it ends poisoned. Deterministic repro: OLLAMA_URL=http://127.0.0.1:9 bun test → 616 pass, exit 1. Even alone, bun test tests/commands/repair.test.ts -t "invalid" exits 1 with all tests passing.

Fix (verified): the repo already has the correct convention — tests/commands/search.test.ts:10 and tests/commands/provenance.test.ts:153 restore with process.exitCode = originalExitCode ?? 0;. Applying that one line to tests/commands/repair.test.ts:46 makes the no-Ollama full suite exit 0 (verified locally). Note tests/commands/dedup.test.ts:45 carries the same latent landmine (pre-existing, main is green only by file-order luck) — worth a follow-up issue rather than a change here.

Blocker 2 — default recall repair crashes on exactly the legacy DBs it's marketed for (reproduced)

A database created before the dedup migration has no dedup_lineage table (its DDL ships via migration). embedGapQuery bakes notMarkedDuplicateSql(...) — which references dedup_lineage — into every gap query, and planRepair guards tableExists for the source table and embeddings (src/lib/repair.ts:458-464) but not dedup_lineage. Result, reproduced on a scratch DB with the table dropped:

$ recall repair          # plain dry-run
SQLiteError: no such table: dedup_lineage
      at iterateEmbedGapRows (src/lib/repair.ts:253)
      at countEmbedGaps (src/lib/repair.ts:280)

Raw stack trace, exit 1, zero plan output — the crash happens in planRepair before the dry-run banner, so the migration report that would have told the user the actual fix (run 'recall init') never prints, and closeDb() never runs. docs/troubleshooting.md explicitly markets repair for "a database created by an older version" — the headline use case. (--no-embed works and degrades gracefully, which shows the intended pattern: checkOrphans catches per-check and reports check failed — no such table: dedup_lineage.)

The existing test (tests/commands/repair.test.ts:286) fakes legacy state with PRAGMA user_version = 5 on a fully-migrated current schema, which is why this never surfaced. Fix: guard the embed pass on tableExists(db, 'dedup_lineage') (drop the exclusion clause or report the pass as skipped pending migration), and add a test that actually drops the table.

Should fix (not blocking individually, but #3 contradicts the PR's stated policy)

  1. Per-row embedding failures never set a nonzero exit code — src/commands/repair.ts:130-137 prints FAILED table#id lines but never touches process.exitCode (contrast FTS failures at :97-100, which correctly set it). With the service up but every row failing, output reads Embedded 0 row(s), … N failure(s) and the process exits 0. The PR's own policy says "exit stays 0 unless another requested repair failed" — a row that failed to embed with the service up is a failed requested repair. On the policy itself, my judgment: the design is right — service-unavailable is an environment state, and exit 0 + clear reporting is the correct diagnostic-path behavior (and checkService is correctly only probed when there's something to embed, so a healthy DB never touches the network). The implementation just misses the failed-rows clause: if (embedResult.failed.length > 0) process.exitCode = 1; plus a test. Neither exit-code path (FTS failure → 1, embed failure → 1) currently has any test.
  2. ftsIndexConsistent swallows every non-corruption error as "index is consistent" (src/lib/repair.ts:150-158) — including database is locked, which is a normal operating condition here (MCP server, Stop/PreCompact hooks, and the batch-extract cron write concurrently, and bun:sqlite sets no busy timeout). Under lock contention both recall repair and the new doctor check report "in sync" when nothing was verified — and this check is the only detector for stale-content drift (counts equal, content differs). The doc comment justifies the swallow only for the unsupported-command case on older SQLite. Suggest a three-state result (consistent / corrupt / check-unavailable) with the third surfaced in detail, or whitelist only the unsupported-command error.

Minor (follow-ups, no change required here)

  • runDoctor({ fix: true }) against drifted data is never behaviorally tested — the doctor-separation guarantee currently rests on code shape (the test calls checkFtsIndexes() directly). A future refactor generalizing the fix loop would violate the PR's central safety promise with zero failing tests.
  • Keyset pagination never crosses a batch boundary in tests (SQLITE_SAFE_CHUNK_SIZE = 500; largest fixture is 3 rows).
  • countIndexedRows → null silently downgrades to "could not compare", which can render as ok/"in sync" — "could not verify" shouldn't read as "verified" (minor: defensive mode makes this hard to reach).
  • On a DB where report-only checks themselves error, dry-run still concludes "Nothing to repair." — conflates "could not verify invariants" with "verified clean".
  • Doctor's WARN recommends recall repair for missing-source, but repair treats that as report-only; the real fix is recall init.
  • recall embed backfill doesn't apply the marked-duplicates exclusion that repair's gap query does — the two now disagree on gap counts; worth aligning in a follow-up.
  • onProgress (src/lib/repair.ts:531) is never wired by the CLI nor tested — dead parameter.
  • --table loa (the alias recall embed accepts) is rejected by repair, which wants loa_entries — minor UX inconsistency.

Summary

The core of this PR is genuinely solid — the drift-detection mechanism is correct and empirically validated, the safety model (dry-run default, report-only invariants, no deletes, no provenance writes) holds under both byte-level and behavioral verification, and the schema restructure is provably semantics-preserving. But CI is red on the head commit (root-caused above, one-line fix), and the headline legacy-DB use case crashes before producing any output (reproduced, small fix + honest test). Both must land before merge; #3 should ride along since it contradicts the PR's own stated exit-code contract.

VERDICT: REQUEST CHANGES

… legacy DB crash

Blocker 1: restore process.exitCode with the repo's ?? 0 convention in
repair.test.ts — Bun silently ignores assigning undefined, so the invalid
--table test's exit code poisoned the suite (CI-red with 0 failures).

Blocker 2: drop the marked-duplicates exclusion clause from the embed gap
query when dedup_lineage does not exist. A pre-dedup legacy DB (the
command's headline use case) crashed with a raw SQLiteError inside
planRepair before any output; it now completes the plan, still reports
embed gaps, and prints the 'recall init' recommendation. The legacy-DB
test now builds a genuine pre-dedup schema (full DDL, dedup_lineage
dropped, user_version 9) instead of faking it with PRAGMA on a current
schema.
@edheltzel

Copy link
Copy Markdown
Owner Author

Re-review — fix delta 779be254e65c6e

Delta verified surgical: only src/lib/repair.ts (embed-gap query) and tests/commands/repair.test.ts changed. Both blockers re-tested against the same reproductions from the original review:

  1. Exit-code poisoning (Blocker 1): fixed. afterEach now restores with originalExitCode ?? 0 per the repo convention, with a comment explaining the Bun undefined no-op. My reproduction OLLAMA_URL=http://127.0.0.1:9 bun test now exits 0 (616 pass / 0 fail). CI is green on head 4e65c6e — macOS primary and Ubuntu smoke both pass.
  2. Pre-dedup legacy DB crash (Blocker 2): fixed. embedGapQuery drops the marked-duplicates clause when dedup_lineage doesn't exist (correct semantics — nothing can be marked on that schema). Re-ran the real scenario (full schema, DROP TABLE dedup_lineage, user_version 9): plain recall repair now completes the entire plan — FTS section, embed gaps counted (decisions: 1 missing), lineage checks degrade per-check via the existing checkOrphans pattern, and the recall init recommendation actually prints (1 migration(s) pending (version 9, target 10)). Exit 0. The rewritten test now builds a genuinely old schema instead of faking PRAGMA user_version, and asserts the plan completes, gaps are counted, and the exit code stays clean — exactly the honest test the original review asked for.

Deferred items confirmed untouched (per agreed scope): per-row embed-failure exit code, ftsIndexConsistent lock-error swallowing, and the minor follow-ups. The pre-existing dedup.test.ts:45 instance of the same restore bug is correctly left alone and flagged for a follow-up issue.

Both blockers genuinely resolved, CI green, no new issues introduced.

VERDICT: APPROVE

@edheltzel edheltzel merged commit 362909b into main Jun 11, 2026
2 checks passed
@edheltzel edheltzel deleted the feat/46-repair branch June 11, 2026 20:06
edheltzel added a commit that referenced this pull request Jun 11, 2026
Roadmap complete (#40-#48 all closed, PRs #54-#62, #64 merged). Tracker
and handoff moved to .agents/atlas/plans/archive/ per the completed-plans
rule; archive/ is gitignored (local artifacts), full content remains in
git history.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add mem repair for data and index maintenance

1 participant