fix(watcher): index files up to 2MB; surface oversized skips (#635) #636
watcher.parseInitialScanEntry hard-skips files > 512KB (src/watcher.zig:451) even though the trigram threshold (line 462) was deliberately raised to 1MB so large code files stay searchable. A 600KB file gets no outline/symbol/word/search at all, silently.

The test scans a dir with a 600KB file and asserts its unique token is searchable; it fails on main. No fix in this commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The scan hard-dropped any file > 512KB (5 sites) even though the trigram threshold was deliberately 1MB, so 512KB-1MB source files were silently invisible to search/symbol/outline/word (reachable only via codedb_read's disk fallback).

Hoist the cap to a named constant max_indexed_file_bytes = 2MB used at every gate, so files up to 2MB are indexed (>1MB still skip trigram via effective_skip_trigram) and a 600KB source file is searchable. Log a warning when a file exceeds the cap so the skip is no longer silent.

Closes #635.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
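The size gating described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/watcher.zig: max_indexed_file_bytes is the constant named above, while trigram_max_bytes, IndexDecision, and classifyBySize are hypothetical names introduced only for this example, and "1MB" is assumed to mean 1024 * 1024 bytes.

```zig
const std = @import("std");

/// Cap named in the commit message: files larger than this are not indexed.
pub const max_indexed_file_bytes: u64 = 2 * 1024 * 1024;

/// Hypothetical name for the trigram threshold (assumed to be 1024 * 1024 bytes).
pub const trigram_max_bytes: u64 = 1024 * 1024;

/// Hypothetical result type, for illustration only.
pub const IndexDecision = enum {
    index_with_trigram,
    index_skip_trigram,
    skip_oversized,
};

/// Hypothetical helper: decide what happens to a file of `size` bytes.
/// Both comparisons are strict, mirroring the `>` checks quoted in this PR.
pub fn classifyBySize(size: u64) IndexDecision {
    if (size > max_indexed_file_bytes) return .skip_oversized;
    if (size > trigram_max_bytes) return .index_skip_trigram;
    return .index_with_trigram;
}

test "600KB is fully indexed, 1.5MB skips trigram, 3MB is skipped" {
    try std.testing.expectEqual(IndexDecision.index_with_trigram, classifyBySize(600 * 1024));
    try std.testing.expectEqual(IndexDecision.index_skip_trigram, classifyBySize(1536 * 1024));
    try std.testing.expectEqual(IndexDecision.skip_oversized, classifyBySize(3 * 1024 * 1024));
}
```

Keeping both limits as named constants in one place is what prevents the mismatch this PR fixes, where a hard-coded 512KB gate sat below the 1MB trigram threshold.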
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5201a0ba7e
- if (stat.size > 512 * 1024) return null;
- const c = dir.readFileAlloc(io, entry.path, arena_alloc, .limited(512 * 1024)) catch return null;
+ if (stat.size > max_indexed_file_bytes) return null;
+ const c = dir.readFileAlloc(io, entry.path, arena_alloc, .limited(max_indexed_file_bytes)) catch return null;
Avoid reading files the trigram worker discards
In the cold codedb search path with search_skips_outlines=true, this helper is used by trigramExtractWorker, which immediately skips any r.content.len > 1024 * 1024. Raising this read limit to 2 MiB means every 1–2 MiB file is now read into the worker arena and then discarded, so repos with many large source files pay extra I/O and memory in the benchmark-critical cold search/index path without gaining trigram coverage. Gate this helper at the 1 MiB trigram limit or avoid reading those files in the trigram-only path.
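One way to read this suggestion is to make the read limit depend on the caller, so the trigram-only path never reads a file it would immediately discard. The sketch below is an assumption-laden illustration, not the project's code: readLimit, shouldRead, and trigram_max_bytes are hypothetical names, and only max_indexed_file_bytes and the 1024 * 1024 trigram limit come from the PR.

```zig
const std = @import("std");

/// Cap introduced by this PR for indexed files.
pub const max_indexed_file_bytes: u64 = 2 * 1024 * 1024;

/// Hypothetical name for the limit the trigram worker applies
/// (it skips content longer than 1024 * 1024 bytes).
pub const trigram_max_bytes: u64 = 1024 * 1024;

/// Hypothetical helper: the byte limit a reader should use.
/// A trigram-only caller gets the tighter limit so 1-2 MiB files are never read.
pub fn readLimit(trigram_only: bool) u64 {
    return if (trigram_only) trigram_max_bytes else max_indexed_file_bytes;
}

/// Hypothetical helper: true when a file of `size` bytes is worth reading.
pub fn shouldRead(size: u64, trigram_only: bool) bool {
    return size <= readLimit(trigram_only);
}

test "a 1.5 MiB file is read for full indexing but not for trigram-only work" {
    const size: u64 = 1536 * 1024;
    try std.testing.expect(shouldRead(size, false));
    try std.testing.expect(!shouldRead(size, true));
}
```

The alternative the reviewer mentions, skipping those files before the helper is called in the trigram-only path, would have the same effect on I/O; passing the limit in simply keeps the size check in one place.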
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Problem
watcher hard-dropped any file > 512 KB at five scan sites, even though the trigram threshold was deliberately raised to 1 MB. So 512 KB–1 MB source files (parsers, big modules, generated schemas) were silently invisible to search/symbol/outline/word/callers/deps (reachable only via codedb_read's disk fallback), and status gave no "skipped" signal.

Closes #635.
Fix
- Hoist the cap to a named constant max_indexed_file_bytes = 2 * 1024 * 1024 used at every gate (parseInitialScanEntry, readFileEntry, indexFileOutline, hashFile, indexFileContent).
- Files over 1 MB are still indexed but skip trigram via the effective_skip_trigram path; past 2 MB the file is skipped.
- parseInitialScanEntry now logs a std.log.warn when a file exceeds the cap (stderr for both CLI and daemon) instead of dropping it silently.

Tests
test "issue-635: files between 512KB and 1MB are silently dropped from the index" (src/test_index.zig): fails on main, passes here.

Follow-up
The status-line "skipped files" count is flagged as a separate focused follow-up; this PR covers indexing + the scan-time warning.

🤖 Generated with Claude Code