fix: reindex stability — sentinel resume, donor safety, worktree subdirs, progress bar #67

Merged
aeneasr merged 22 commits into main from reindex-fixes
Mar 27, 2026

aeneasr commented Mar 27, 2026

Summary

  • Sentinel file resume: Indexing interrupted mid-run (e.g. Ollama timeout) leaves files with hash="". Previously the next session would see a matching root hash and return early, leaving those files permanently unembedded. Now HasSentinelFiles() is checked before the early-return; any sentinel triggers incremental indexing to complete the run.
  • Metadata persistence on partial failure: saveMeta() closure saves root_hash + timestamps on both success and mid-batch embedding failures, so the next session can match the hash and skip already-complete files.
  • Donor safety gate: SeedFromDonor now verifies the donor has a non-empty root_hash before copying. Previously, seeding from an incomplete donor propagated its partially indexed state to the new worktree; SeedFromDonor now returns (false, nil) in that case and the worktree starts fresh.
  • Worktree subdirectory donor discovery: FindDonorIndexBase previously compared worktree root paths, missing cases where the effective project root is a subdirectory (e.g. monorepo/backoffice). The fix computes the relative suffix inside the worktree and looks for a DB at <sibling_worktree>/<relSuffix>.
  • Progress bar duplication: Long file paths caused pterm line-wrapping, breaking cursor positioning and leaving duplicate output. Titles are now truncated to (terminal_width - 45) chars before UpdateTitle().

Test plan

  • go test ./... — all packages pass (verified locally)
  • TestSeedFromDonor_IncompleteDonor — new test covering donor safety gate
  • Manual: index a large repo, kill mid-run, restart — confirm remaining files are embedded rather than skipped
  • Manual: open worktree in a monorepo subdirectory — confirm donor seeding finds the sibling worktree's DB
  • Manual: index with long file paths — confirm no duplicate progress bar output

🤖 Generated with Claude Code

aeneasr and others added 22 commits March 23, 2026 18:07
When reindexing takes longer than 15s, semantic_search returns stale
results with a warning instead of blocking the agent indefinitely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Buffered done channel (cap 1) to prevent goroutine leak on timeout
- Goroutine calls touchChecked on success for correct TTL behavior
- Nil progress func in goroutine (request ctx may be gone)
- Log errors from background EnsureFresh at Warn level
- sync.WaitGroup for graceful shutdown in Close()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7-task plan with TDD approach: struct changes, WaitGroup, timeout
goroutine, formatSearchResults, and tests including a test hook
(ensureFreshFunc) to exercise the 15s timeout path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… reindex

EnsureFresh now runs in a goroutine. If it completes within 15s, results
are returned normally. If it exceeds the timeout, stale results are
returned immediately with a StaleWarning while reindexing continues in
the background (up to 10min). The goroutine acquires an exclusive flock
to avoid concurrent writes.
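
The timeout behavior described above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the `cache` type, `freshOutcome`, and `ensureWithTimeout` are hypothetical names, and the flock acquisition and the 10-minute background limit are left out. It shows the buffered done channel (capacity 1) that lets the goroutine finish after a timeout without leaking, and a `sync.WaitGroup` that `Close()` waits on.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// freshOutcome is a hypothetical result type for a freshness check.
type freshOutcome struct {
	reindexed bool
	err       error
}

// cache is a hypothetical stand-in for the component that owns background work.
type cache struct {
	wg sync.WaitGroup
}

// ensureWithTimeout runs ensure in a tracked goroutine and waits at most
// timeout for it. stale is true when the deadline passes first; the goroutine
// keeps running and later drops its result into the buffered channel.
func (c *cache) ensureWithTimeout(timeout time.Duration, ensure func() (bool, error)) (out freshOutcome, stale bool) {
	done := make(chan freshOutcome, 1) // capacity 1: the send never blocks, so no leak on timeout
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		reindexed, err := ensure()
		done <- freshOutcome{reindexed: reindexed, err: err}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case out = <-done:
		return out, false
	case <-timer.C:
		return freshOutcome{}, true
	}
}

// Close waits for any background work that is still running.
func (c *cache) Close() { c.wg.Wait() }

func main() {
	c := &cache{}
	slow := func() (bool, error) {
		time.Sleep(50 * time.Millisecond)
		return true, nil
	}
	_, stale := c.ensureWithTimeout(10*time.Millisecond, slow)
	fmt.Println("stale:", stale)
	c.Close()
}
```

A caller would presumably attach the stale warning to the search response when `stale` is true and serve results from the existing index.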

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Go 1.25+ provides wg.Go() which simplifies goroutine tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…exed

Add ensureFreshFunc test hook to indexerCache (follows existing
findDonorFunc/seedFunc pattern) and three new tests:

- TestEnsureIndexed_TimeoutReturnsStaleWarning: injects a slow
  EnsureFresh that exceeds the 15s timeout, verifies StaleWarning
  is returned and Reindexed=false.
- TestEnsureIndexed_FastEnsureFreshNoWarning: injects an instant
  EnsureFresh, verifies no warning and correct stats propagation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…or vec_chunks

Handles slow embedding batches and retries on SQLite contention
without timing out. INSERT OR REPLACE prevents duplicate key errors
when re-embedding chunks that already exist in the vector table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three root causes fixed:

1. SessionStart double-spawn and no freshness gate (hook.go):
   - Remove unconditional spawnBackgroundIndexer from runHookSessionStart;
     generateSessionContextInternal now owns all spawn decisions
   - After opening the DB for stats, check last_indexed_at: skip spawn
     when indexed within backgroundIndexStaleness (5 min), spawn when
     stale or never completed. Prevents every new terminal from triggering
     a full merkle walk.

2. Goroutine zero-result treated as "fresh" (stdio.go):
   - Add skipped bool to freshResult. When TryAcquire returns nil (TOCTOU
     race — another process grabbed the lock) or errors, send
     freshResult{skipped: true}. Main select now returns StaleWarning for
     skipped results, consistent with the IsHeld fast-path. Previously the
     zero result looked like "index is fresh", silently skipping
     touchChecked and causing the next search to immediately re-spawn.

3. Redundant merkle walk after lumen index finishes (stdio.go):
   - In the goroutine, after acquiring the flock, check idx.LastIndexedAt().
     If within freshnessTTL, call touchChecked() and return without calling
     EnsureFresh. Uses the DB timestamp as a shared cross-process freshness
     signal so the MCP server doesn't duplicate the walk just completed by
     the background indexer.

Also fix pre-existing errcheck lint in tui/progress.go.
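
As a rough illustration of root cause 2, the sketch below shows how a `skipped` flag separates "the lock was not acquired" from "the index is fresh". Only `freshResult` and its `skipped` field come from the commit message; `runFreshCheck`, `needsStaleWarning`, and the callback signatures are assumptions made for this example.

```go
package main

import "fmt"

// freshResult mirrors the struct named in the commit message; only the
// skipped field is confirmed there, the other fields are assumed.
type freshResult struct {
	reindexed bool
	skipped   bool // true when the lock could not be acquired
	err       error
}

// runFreshCheck is a hypothetical helper. tryAcquire stands in for a
// non-blocking lock attempt that returns a nil release func when another
// process holds the lock.
func runFreshCheck(tryAcquire func() (release func(), err error), ensure func() (bool, error)) freshResult {
	release, err := tryAcquire()
	if err != nil || release == nil {
		// Lost the race or failed to lock: report "skipped", not "fresh".
		return freshResult{skipped: true}
	}
	defer release()
	reindexed, err := ensure()
	return freshResult{reindexed: reindexed, err: err}
}

// needsStaleWarning reports whether the caller should attach a stale warning.
func needsStaleWarning(r freshResult) bool { return r.skipped }

func main() {
	lockHeldElsewhere := func() (func(), error) { return nil, nil }
	r := runFreshCheck(lockHeldElsewhere, func() (bool, error) { return true, nil })
	fmt.Println("skipped:", r.skipped, "warn:", needsStaleWarning(r))
}
```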

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nge breakdown

- index.go: add newDebugLogger() for background path; log start, skip,
  cancel, error, and completion with full Stats fields; pass logger to
  Indexer via SetLogger() so indexWithTree can log the indexing plan
- index/index.go: add FilesAdded/FilesModified/FilesRemoved/Reason/
  OldRootHash/NewRootHash to Stats; populate them in Index, EnsureFresh,
  and indexWithTree; add SetLogger/logger field to Indexer
- hook_spawn_unix.go: discard stderr of background indexer (slog writes
  to debug.log; piping stderr would mix pterm progress into the log)
- search.go: pass nil logger to setupIndexer (interactive command)
- CLAUDE.md: document interactive vs background output strategy

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The force_reindex parameter on semantic_search is removed. Reindexing is
exclusively triggered by the SessionStart hook and by the background
goroutine inside ensureIndexed.

Progress notifications are restored and now flow through the background
goroutine path so the Claude Code status indicator animates during indexing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Enrich the "indexing plan" slog entry with:
- old_root_hash: stored merkle root before this run
- new_root_hash: computed merkle root from current filesystem
- main_worktree: main git repo root (only when projectDir is a worktree)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When SQLite reports "database disk image is malformed" or "disk I/O
error", the index is permanently broken until manually purged. Every
subsequent semantic_search call would fail with the same error because
touchChecked is never set and each retry hits the same corrupted file.

This change adds automatic recovery at two layers:

- store.New: if open/schema-setup fails with a corruption error, delete
  the DB file and its WAL/SHM sidecars and retry once from a clean state.
  In-memory databases are never deleted.

- Indexer.EnsureFresh / Index: if indexWithTree returns a corruption
  error mid-operation, log ERROR "corrupted database detected, rebuilding",
  call rebuildStore() (close → delete files → reopen), then retry with an
  empty stored hash so the fresh DB receives a full index pass.

Adds IsCorruptionErr(err) to the store package as the single source of
truth for what constitutes a SQLite corruption error.
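
A minimal sketch of the open-time recovery layer, assuming corruption is detected by matching the two SQLite error messages quoted above. The function names, the callback-based `openWithRecovery`, and the string-matching approach are illustrative assumptions; the actual `IsCorruptionErr` may inspect error codes instead, and the in-memory exemption is omitted here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// corruptionMarkers lists the two messages quoted in the commit message,
// lowercased for case-insensitive matching.
var corruptionMarkers = []string{
	"database disk image is malformed",
	"disk i/o error",
}

// errMalformed is a sample error used for demonstration only.
var errMalformed = errors.New("database disk image is malformed")

// isCorruptionErr reports whether err looks like a SQLite corruption error.
func isCorruptionErr(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	for _, marker := range corruptionMarkers {
		if strings.Contains(msg, marker) {
			return true
		}
	}
	return false
}

// openWithRecovery calls open; on a corruption error it removes the database
// files via remove and retries exactly once from a clean state.
func openWithRecovery(open func() error, remove func() error) error {
	err := open()
	if !isCorruptionErr(err) {
		return err // nil or an unrelated error: do not delete anything
	}
	if rmErr := remove(); rmErr != nil {
		return fmt.Errorf("remove corrupted database: %w", rmErr)
	}
	return open()
}

func main() {
	attempts := 0
	open := func() error {
		attempts++
		if attempts == 1 {
			return errMalformed
		}
		return nil
	}
	err := openWithRecovery(open, func() error { return nil })
	fmt.Println("attempts:", attempts, "err:", err)
}
```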

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pterm's cursor positioning assumes bar title fits on one line. Long
paths cause wrapping which shifts the cursor, leaving duplicated output
on each redraw. Truncate to (terminal_width - 45) chars, reserving
space for the bar chrome, appending an ellipsis when truncated.
Also benefits terminal resize: pterm.GetTerminalWidth() is called live
on every Update(), so the budget adjusts automatically.
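
The truncation rule can be sketched as below. This is a hedged example: `truncateTitle` and `barChrome` are hypothetical names, the terminal width is passed in rather than read from pterm, and the sketch counts runes, which only approximates display width for wide characters.

```go
package main

import "fmt"

// barChrome is the number of columns reserved for the bar itself,
// taken from the (terminal_width - 45) budget in the commit message.
const barChrome = 45

// truncateTitle shortens title so it fits in termWidth-barChrome columns,
// ending with an ellipsis when anything was cut.
func truncateTitle(title string, termWidth int) string {
	budget := termWidth - barChrome
	if budget < 1 {
		budget = 1 // very narrow terminal: keep at least the ellipsis
	}
	runes := []rune(title)
	if len(runes) <= budget {
		return title
	}
	return string(runes[:budget-1]) + "…"
}

func main() {
	path := "internal/some/deeply/nested/package/with/a/very/long/file_name_test.go"
	fmt.Println(truncateTitle(path, 80))
}
```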

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Files registered mid-run have hash="" (sentinel). Previously, if the
root hash hadn't changed between sessions the indexer returned early,
leaving those files permanently unembedded.

- Add HasSentinelFiles() to store: EXISTS query on files WHERE hash=''
- In Index() and EnsureFresh(), check sentinels before the early-return:
  if any exist, fall through to incremental indexing regardless of hash
- Replace four separate SetMeta calls at end-of-run with a saveMeta()
  closure; call it on mid-batch embedding failures too so progress is
  persisted even when Ollama times out partway through a large repo
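
The revised early-return check can be illustrated with an in-memory stand-in for the store. In this sketch, `fileRecord` and `needsIndexing` are hypothetical, and `hasSentinelFiles` scans a slice where the real `HasSentinelFiles()` is described as an EXISTS query.

```go
package main

import "fmt"

// fileRecord is a hypothetical row of the files table; Hash == "" marks a
// sentinel, meaning the file was registered but never fully embedded.
type fileRecord struct {
	Path string
	Hash string
}

// hasSentinelFiles reports whether any file still carries the empty-hash sentinel.
func hasSentinelFiles(files []fileRecord) bool {
	for _, f := range files {
		if f.Hash == "" {
			return true
		}
	}
	return false
}

// needsIndexing decides whether to skip the early return: either the root
// hash changed, or a previous run left sentinel files behind.
func needsIndexing(storedRoot, currentRoot string, files []fileRecord) bool {
	if storedRoot != currentRoot {
		return true
	}
	return hasSentinelFiles(files)
}

func main() {
	files := []fileRecord{{Path: "a.go", Hash: "h1"}, {Path: "b.go", Hash: ""}}
	fmt.Println(needsIndexing("root", "root", files))
}
```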

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A donor whose first indexing pass was interrupted has no root_hash in
project_meta. Seeding from such a donor propagates partially-indexed
state to the new worktree, causing it to believe it is current when
it is not.

Guard: open the donor read-only, query root_hash before the WAL
checkpoint, and bail out (return false, nil) if the value is missing or
empty. The new TestSeedFromDonor_IncompleteDonor test covers this path.
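
In outline, the guard might look like the following. This sketch models project_meta as a plain map and the copy step as a callback; `donorIsComplete` and the lowercase `seedFromDonor` signature are assumptions for illustration, and the read-only open and WAL checkpoint are not shown.

```go
package main

import "fmt"

// meta is a hypothetical in-memory view of the donor's project_meta table.
type meta map[string]string

// donorIsComplete reports whether the donor finished at least one full
// indexing pass, signalled by a non-empty root_hash.
func donorIsComplete(m meta) bool {
	return m["root_hash"] != ""
}

// seedFromDonor copies the donor database only when the donor is complete.
// It returns (false, nil) for an incomplete donor so the caller starts fresh.
func seedFromDonor(donor meta, copyDB func() error) (bool, error) {
	if !donorIsComplete(donor) {
		return false, nil
	}
	if err := copyDB(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	seeded, err := seedFromDonor(meta{}, func() error { return nil })
	fmt.Println("seeded:", seeded, "err:", err)
}
```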

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
FindDonorIndexBase compared worktree root paths directly. When the
effective project root is a subdirectory (e.g. monorepo/backoffice),
the DB path is derived from that subdirectory, not from the worktree
root — so no sibling worktrees were ever found.

Fix: identify which worktree contains the project, compute the relative
suffix (e.g. "backoffice"), then look for a DB at
<sibling_worktree>/<relSuffix> in each sibling. This correctly resolves
donor indexes regardless of how deep the effective root sits inside the
worktree. Symlinks are resolved at every comparison point.
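
The relative-suffix lookup can be sketched with `path/filepath`. This is an assumption-laden example: `siblingCandidates` is a hypothetical name, the worktree list is passed in rather than discovered from git, symlink resolution is omitted, and the function returns candidate directories instead of checking for a DB.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// siblingCandidates finds the worktree that contains projectDir, computes the
// relative suffix inside it, and returns the matching directory in every
// other worktree.
func siblingCandidates(projectDir string, worktrees []string) ([]string, error) {
	owner := ""
	for _, wt := range worktrees {
		rel, err := filepath.Rel(wt, projectDir)
		if err != nil {
			continue
		}
		if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
			continue // projectDir is outside this worktree
		}
		if len(wt) > len(owner) {
			owner = wt // prefer the deepest containing worktree
		}
	}
	if owner == "" {
		return nil, fmt.Errorf("no worktree contains %s", projectDir)
	}
	relSuffix, err := filepath.Rel(owner, projectDir)
	if err != nil {
		return nil, err
	}
	var candidates []string
	for _, wt := range worktrees {
		if wt == owner {
			continue
		}
		candidates = append(candidates, filepath.Join(wt, relSuffix))
	}
	return candidates, nil
}

func main() {
	worktrees := []string{"/repos/monorepo", "/repos/monorepo-feature"}
	got, err := siblingCandidates("/repos/monorepo/backoffice", worktrees)
	fmt.Println(got, err)
}
```

When the project root is the worktree root itself, the suffix is "." and each candidate is simply the sibling worktree root.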

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
aeneasr enabled auto-merge (rebase) March 27, 2026 15:26
auto-merge was automatically disabled March 27, 2026 15:43

Rebase failed

aeneasr merged commit 07782d6 into main Mar 27, 2026
4 checks passed