fix(beir-bench): streaming row-bounded npy loader (--max-docs honored, ~2x less peak RAM) #258
…, ~2x less peak RAM)

load_npy_f32 had two issues that made large scaling sweeps painful:

- ~2x memory peak: std::fs::read pulled the whole file into a Vec<u8>, then a second full Vec<f32> was allocated while the bytes were still alive (~72GB peak on an 8.8M x 1024 f32 corpus).
- --max-docs was ignored at load time: the full corpus was read off disk and only sliced afterward, so every sub-sampled point in a scaling sweep paid the full read + single-threaded parse.

New load_npy_f32_rows(path, max_rows) seeks past the header and reads ONLY the kept rows, parses the payload in parallel (rayon) directly into the output Vec<f32> with no intermediate full-size copy. Added npy_row_count (header-only) for the corpus_ids length assertion. load_npy_f32 keeps its signature (max_rows=None).

Verified: builds clean; --max-docs 5000 loads exactly 5000 rows; full-corpus path unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Todd Baur <todd@baursoftware.com>
35d8acc into Project-Navi:main
What
load_npy_f32 in benchmarks/beir-bench had two issues that make large scaling sweeps slow and memory-hungry:

- ~2× memory peak: std::fs::read loads the whole file into a Vec<u8>, then a second full Vec<f32> is allocated while the bytes are still alive. On an 8.8M × 1024 f32 corpus that's ~72 GB peak.
- --max-docs ignored at load: the full corpus was read off disk and only sliced afterward, so every sub-sampled point in a --max-docs scaling sweep paid the full read + single-threaded byte-by-byte parse.

Change

- load_npy_f32_rows(path, max_rows): seeks past the header and reads only the kept rows, then parses the payload in parallel (rayon) directly into the output Vec<f32>, with no intermediate full-size buffer. Peak ≈ 1× kept data instead of 2× whole file.
- load_npy_f32 keeps its signature (max_rows = None); the corpus loader passes Some(n_docs) so --max-docs bounds the disk read.
- Added npy_row_count (header-only read) for the corpus_ids length assertion.

Single-file change, no API/behavior change for the full-corpus path.
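For readers who want to see the shape of a row-bounded load, here is a minimal std-only sketch. It is not the code in this PR: the names read_npy_header, load_rows and row_count are hypothetical, it assumes a 2-D, C-order, little-endian f32 ('<f4') array, and it converts bytes serially through a small fixed buffer where the PR parses in parallel with rayon.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical helper: parse the .npy header only.
/// Returns (rows, cols, byte offset of the payload).
/// Assumption: 2-D, C-order, little-endian f32 ('<f4') array.
fn read_npy_header(f: &mut File) -> io::Result<(usize, usize, u64)> {
    let bad = |m: &str| io::Error::new(io::ErrorKind::InvalidData, m.to_string());
    let mut pre = [0u8; 8];
    f.read_exact(&mut pre)?;
    if pre[..6] != b"\x93NUMPY"[..] {
        return Err(bad("missing npy magic"));
    }
    // Format v1 stores the header length in 2 bytes, v2 and v3 in 4 bytes.
    let (hlen, prefix) = if pre[6] == 1 {
        let mut b = [0u8; 2];
        f.read_exact(&mut b)?;
        (u16::from_le_bytes(b) as usize, 10u64)
    } else {
        let mut b = [0u8; 4];
        f.read_exact(&mut b)?;
        (u32::from_le_bytes(b) as usize, 12u64)
    };
    let mut raw = vec![0u8; hlen];
    f.read_exact(&mut raw)?;
    let h = String::from_utf8_lossy(&raw).into_owned();
    if !h.contains("'<f4'") || h.contains("'fortran_order': True") {
        return Err(bad("expected C-order little-endian f32"));
    }
    // Pull the integers out of "'shape': (rows, cols)".
    let key = h.find("'shape':").ok_or_else(|| bad("no shape key"))?;
    let open = key + h[key..].find('(').ok_or_else(|| bad("no shape tuple"))? + 1;
    let close = open + h[open..].find(')').ok_or_else(|| bad("unterminated shape"))?;
    let dims: Vec<usize> = h[open..close]
        .split(',')
        .filter_map(|s| s.trim().parse().ok())
        .collect();
    if dims.len() != 2 {
        return Err(bad("expected a 2-D array"));
    }
    Ok((dims[0], dims[1], prefix + hlen as u64))
}

/// Hypothetical header-only row count (nothing from the payload is read).
fn row_count(path: &Path) -> io::Result<usize> {
    Ok(read_npy_header(&mut File::open(path)?)?.0)
}

/// Hypothetical row-bounded loader: reads at most `max_rows` rows.
/// Returns (row-major data, rows kept, cols).
fn load_rows(path: &Path, max_rows: Option<usize>) -> io::Result<(Vec<f32>, usize, usize)> {
    let mut f = File::open(path)?;
    let (rows, cols, offset) = read_npy_header(&mut f)?;
    let keep = max_rows.map_or(rows, |m| m.min(rows));
    f.seek(SeekFrom::Start(offset))?; // explicit seek past the header
    let mut out = Vec::with_capacity(keep * cols);
    // 1 MiB staging buffer (a multiple of 4 bytes), so the only
    // full-size allocation is the output Vec<f32>.
    let mut buf = vec![0u8; 1 << 20];
    let mut remaining = keep * cols * 4;
    while remaining > 0 {
        let n = remaining.min(buf.len());
        f.read_exact(&mut buf[..n])?; // fails with UnexpectedEof on a short file
        out.extend(
            buf[..n]
                .chunks_exact(4)
                .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]])),
        );
        remaining -= n;
    }
    Ok((out, keep, cols))
}
```

Calling load_rows(path, Some(5000)) on a larger file touches only the header and the first 5000 rows on disk; passing None reads every row. The staging buffer is one way to avoid holding a second full-size copy; a parallel variant would split the conversion across threads instead.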
Verification
- cargo build --release -p beir-bench clean.
- --max-docs 5000 now loads exactly 5000 rows (n_docs=5000 (sub-sampled)); full-corpus run unchanged.

🤖 Generated with Claude Code