fix(watcher): index files up to 2MB; surface oversized skips (#635) #636

Merged
justrach merged 2 commits into release/0.2.5826 from issue-635-large-file-skip
Jun 22, 2026

@justrach

Owner

Problem

The watcher hard-dropped any file larger than 512 KB at five scan sites, even though the trigram threshold had been deliberately raised to 1 MB. As a result, 512 KB–1 MB source files (parsers, big modules, generated schemas) were silently invisible to search/symbol/outline/word/callers/deps, reachable only via codedb_read's disk fallback, and status gave no "skipped" signal.

Closes #635.

Fix

  • Hoist the cap to a named constant max_indexed_file_bytes = 2 * 1024 * 1024 used at every gate (parseInitialScanEntry, readFileEntry, indexFileOutline, hashFile, indexFileContent).
  • Files up to 1 MB still get full trigram coverage; 1 MB–2 MB get outline+word but skip trigram via the existing effective_skip_trigram path; past 2 MB the file is skipped.
  • Surface the skip: parseInitialScanEntry now logs a std.log.warn when a file exceeds the cap (stderr for both CLI and daemon) instead of dropping it silently.
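
The three size tiers described above can be sketched as a single classification function. This is a minimal illustration, not the code in src/watcher.zig: `max_indexed_file_bytes` is the constant named in this PR, while `trigram_limit_bytes`, `IndexTier`, and `classifyBySize` are hypothetical names introduced here. The boundaries assume the `>` comparisons quoted in the review thread, so a file of exactly 1 MiB still gets trigrams and a file of exactly 2 MiB is still indexed.

```zig
const std = @import("std");

// Cap named in the PR description.
const max_indexed_file_bytes: u64 = 2 * 1024 * 1024;
// Hypothetical name for the existing 1 MiB trigram threshold.
const trigram_limit_bytes: u64 = 1024 * 1024;

// Hypothetical enum describing what the index does with a file.
const IndexTier = enum { full, outline_and_word_only, skipped };

// Map a file size in bytes to its indexing tier.
fn classifyBySize(size: u64) IndexTier {
    if (size > max_indexed_file_bytes) return .skipped; // past 2 MiB: not indexed
    if (size > trigram_limit_bytes) return .outline_and_word_only; // 1-2 MiB: no trigrams
    return .full; // up to 1 MiB: full trigram coverage
}
```

Under these assumptions, the 600 KB file from the regression test falls in the `full` tier, which is why its unique token becomes searchable.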

Tests

  • test "issue-635: files between 512KB and 1MB are silently dropped from the index" (src/test_index.zig) — fails on main, passes here.
  • Full suite green (the cap change touches indexing broadly).

Follow-up

The status-line "skipped files" count is flagged as a separate focused follow-up; this PR covers indexing + the scan-time warning.

🤖 Generated with Claude Code

justrach and others added 2 commits June 21, 2026 20:03
watcher.parseInitialScanEntry hard-skips files > 512KB (src/watcher.zig:451)
even though the trigram threshold (line 462) was deliberately raised to 1MB so
large code files stay searchable. A 600KB file gets no outline/symbol/word/
search at all, silently. Test scans a dir with a 600KB file and asserts its
unique token is searchable; fails on main. No fix in this commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The scan hard-dropped any file > 512KB (5 sites) even though the trigram
threshold was deliberately 1MB, so 512KB-1MB source files were silently
invisible to search/symbol/outline/word (reachable only via codedb_read's disk
fallback). Hoist the cap to a named constant max_indexed_file_bytes = 2MB used
at every gate, so files up to 2MB are indexed (>1MB still skip trigram via
effective_skip_trigram) and a 600KB source file is searchable. Log a warning
when a file exceeds the cap so the skip is no longer silent.

Closes #635.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5201a0ba7e

Comment thread src/watcher.zig
- if (stat.size > 512 * 1024) return null;
- const c = dir.readFileAlloc(io, entry.path, arena_alloc, .limited(512 * 1024)) catch return null;
+ if (stat.size > max_indexed_file_bytes) return null;
+ const c = dir.readFileAlloc(io, entry.path, arena_alloc, .limited(max_indexed_file_bytes)) catch return null;

[P2] Avoid reading files the trigram worker discards

In the cold codedb search path with search_skips_outlines=true, this helper is used by trigramExtractWorker, which immediately skips any r.content.len > 1024 * 1024. Raising this read limit to 2 MiB means every 1–2 MiB file is now read into the worker arena and then discarded, so repos with many large source files pay extra I/O and memory in the benchmark-critical cold search/index path without gaining trigram coverage. Gate this helper at the 1 MiB trigram limit or avoid reading those files in the trigram-only path.
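
One way to picture the suggestion is to pick the read limit from the caller's purpose, so the trigram-only path never reads a file it would discard. This is a rough sketch under stated assumptions, not the project's code: `trigram_limit_bytes`, `readLimitFor`, and `shouldRead` are hypothetical names, and the real helper's signature is not shown in this thread.

```zig
const std = @import("std");

// Cap named in the PR description.
const max_indexed_file_bytes: u64 = 2 * 1024 * 1024;
// Hypothetical name for the 1 MiB limit the trigram worker applies.
const trigram_limit_bytes: u64 = 1024 * 1024;

// Hypothetical helper: the trigram-only path reads at most 1 MiB,
// every other path reads up to the general indexing cap.
fn readLimitFor(trigram_only: bool) u64 {
    return if (trigram_only) trigram_limit_bytes else max_indexed_file_bytes;
}

// Decide from the stat size alone whether the file is worth reading.
fn shouldRead(size: u64, trigram_only: bool) bool {
    return size <= readLimitFor(trigram_only);
}
```

With a gate like this, a 1.5 MiB file would still be read for outline and word indexing but would be rejected before any I/O on the trigram-only path.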

@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
|---|---|---|---|---|---|
| codedb_bundle | 72515 | 72508 | -0.01% | -7 | OK |
| codedb_changes | 11726 | 11368 | -3.05% | -358 | OK |
| codedb_context | 768983 | 769470 | +0.06% | +487 | OK |
| codedb_deps | 358 | 353 | -1.40% | -5 | OK |
| codedb_edit | 48958 | 45921 | -6.20% | -3037 | OK |
| codedb_find | 2931 | 3103 | +5.87% | +172 | OK |
| codedb_hot | 27414 | 26488 | -3.38% | -926 | OK |
| codedb_outline | 16827 | 16535 | -1.74% | -292 | OK |
| codedb_read | 14604 | 13889 | -4.90% | -715 | OK |
| codedb_search | 66939 | 67545 | +0.91% | +606 | OK |
| codedb_snapshot | 90678 | 71928 | -20.68% | -18750 | OK |
| codedb_status | 10883 | 10348 | -4.92% | -535 | OK |
| codedb_symbol | 59150 | 56603 | -4.31% | -2547 | OK |
| codedb_tree | 24090 | 25539 | +6.01% | +1449 | OK |
| codedb_word | 17438 | 12090 | -30.67% | -5348 | OK |
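
The status column can be read as a two-threshold rule. The sketch below is an assumption-laden illustration, not the CI script: it assumes only slowdowns (positive deltas) count, which matches the large negative deltas above being reported as OK, and the `regression` status name, `Status` enum, and `classify` function are hypothetical.

```zig
const std = @import("std");

// Thresholds stated in the report.
const pct_threshold: f64 = 10.0;
const abs_threshold_ns: i64 = 50_000;

// Hypothetical status names; the report itself only shows OK and NOISE.
const Status = enum { ok, noise, regression };

// Classify one benchmark row from its base and head timings in nanoseconds.
fn classify(base_ns: i64, head_ns: i64) Status {
    const delta = head_ns - base_ns;
    // Assumption: speedups and unchanged timings are always OK.
    if (delta <= 0 or base_ns <= 0) return .ok;
    const pct = @as(f64, @floatFromInt(delta)) / @as(f64, @floatFromInt(base_ns)) * 100.0;
    if (pct <= pct_threshold) return .ok;
    // Percentage threshold exceeded: fail only if the absolute delta is also large.
    return if (delta > abs_threshold_ns) .regression else .noise;
}
```

For example, codedb_tree slows by 1449 ns (+6.01%), which stays under the percentage threshold and is therefore OK under this reading.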

@justrach justrach merged commit 5201a0b into release/0.2.5826 Jun 22, 2026
2 checks passed