feat(beir-bench): --ef-search flag for HNSW (was hardcoded const)#265
Conversation
HNSW ef_search was a compile-time const (128), so the harness ran HNSW at one fixed recall/latency operating point. Expose it as --ef-search so HNSW's own recall-vs-latency Pareto can be swept and compared fairly against ordvec's candidate-count sweep. Default unchanged (128). No other behavior change. Verified: ef=16 -> p50 0.32ms / nDCG 0.456; ef=256 -> 2.43ms / 0.470 on fiqa (the knob traces HNSW's frontier as intended). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Todd Baur <todd@baursoftware.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc4c26f539
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
Two issues raised on the --ef-search flag:
1. ef-sweep emitted all points under the fixed 'hnsw' slug with no ef field, so
topk/summary/timing rows collided (full-corpus overwrite; timing dedup by
method/n_docs) and the recall/latency frontier couldn't be reconstructed.
Fix: slug is now 'hnsw_ef{N}' so every operating point is recorded distinctly.
2. hnsw_rs clamps ef=max(ef,knbn) internally, so --ef-search below --top-k would be
silently bumped, flattening the low end of a sweep. Fix: clamp ef_search up to
top_k explicitly with a warning, so the recorded ef matches what actually ran.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Todd Baur <todd@baursoftware.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc7753a757
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste — Clean, focused change that exactly solves the stated problem.
The ef-sweep validation (lines 147-159) correctly prevents silent clamping by hnsw_rs, with a clear warning. Threading cfg.ef_search into both search paths (serial and parallel) is correct. The slug format hnsw_ef{value} ensures distinct records per operating point.
One note for maintainers: benchmarks/beir/beir_plot.py hardcodes "hnsw" in METHOD_STYLE and will only plot records matching that exact slug. When sweeping --ef-search, results will be recorded as hnsw_ef128, hnsw_ef256, etc., which the current plot script won't recognize. This is out-of-scope for this single-file PR but worth tracking for the sweep workflow.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/Project-Navi/ordvec/actions/runs/27853997793
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
ae7cf65
into
Project-Navi:main
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d00313be6b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".
| eprintln!(" building HNSW M={HNSW_M} ef_c={HNSW_EF_CONSTRUCTION} ({n_docs} docs) ..."); | ||
| // ef in the slug so an ef-sweep does not overwrite topk/summary/timing rows | ||
| // (each operating point on the recall/latency frontier is recorded distinctly). | ||
| let slug = format!("hnsw_ef{}", cfg.ef_search); |
There was a problem hiding this comment.
Remove stale default HNSW artifacts
When rerunning the standard Makefile targets in a results directory that already contains the old default hnsw.topk.jsonl/hnsw.summary.json, this new default slug writes hnsw_ef128.* instead of overwriting those files. beir_eval.discover_runs() globs every *.topk.jsonl, and bench-beir-quality/bench-beir-scaling only remove timing.jsonl, so the regenerated report includes a stale hnsw row alongside the new hnsw_ef128 row. Please remove/migrate the legacy HNSW artifacts when emitting the ef-qualified slug, or keep the default output name and record ef separately.
Useful? React with 👍 / 👎.
| slugs = {r["method"] for r in records if "method" in r} | ||
| return sorted(slugs, key=_method_order_key) |
There was a problem hiding this comment.
Filter probe methods out of default plots
Because this now returns every method slug present in an append-only timing.jsonl, plot_scaling and plot_bars no longer restrict the README figures to METHOD_STYLE/BENCH_METHODS. If the file contains a prior probe run such as sign-rq2-threaded (which the Makefile explicitly excludes from committed figures until reviewed), the plotter will render that stale/probe method with the gray fallback instead of only adding ef-qualified HNSW variants. Please filter to the supported/default families while normalizing hnsw_ef* to the HNSW family.
Useful? React with 👍 / 👎.
What
HNSW
ef_searchwas a compile-time const (128), so the harness only ever measured HNSW at one fixed recall/latency operating point. This exposes it as a--ef-searchCLI flag (default unchanged at 128).Why
To compare ordvec vs HNSW fairly, you need HNSW's own recall-vs-latency Pareto, not a single point.
--ef-searchlets you sweep it alongside ordvec's--candidatessweep. On fiqa (57k, single-thread): ef=16 → p50 0.32ms / nDCG@10 0.456; ef=256 → 2.43ms / 0.470 — the knob traces HNSW's frontier as intended, and the two methods land co-Pareto (ordvec ahead at the fast-cheap end).Scope
Single file, +13/−4. No behavior change at the default. Removes the now-unused
HNSW_EF_SEARCHconst; threadscfg.ef_searchinto both the serial and batched search calls; reports it in the build log.🤖 Generated with Claude Code