
perf(search): master speed — keep #2/#4 wins, gate #1, beat Everything across the matrix #375

Merged: githubrobbi merged 5 commits into main from perf/master-speed-20260609 on Jun 9, 2026
Summary

The synthesis build from the performance regression root-cause analysis: keep the two genuine winners (#2 trigram-prefix, #4 parallel-resolve), gate the loser (#1 unlimited value-sort), and drop the un-gated #3. Verified on a fresh-daemon full-matrix Windows benchmark.

Commits

  • #1 skip value-sort for unlimited match-all: sort_and_localise early-returns (MFT-locality sort only) when limit >= candidates.len(), eliminating a redundant full sort of millions of tuples on * full-scans.
  • #2+#4 trigram prefix fast-path + size-gated parallel resolve — restores search_compact_drive_prefix (trigram-accelerated win*), wires is_prefix through both MultiDriveBackend::search and search_index, and gates indices_to_rows parallelism at PARALLEL_RESOLVE_THRESHOLD (50K) so tiny exact sets stay sequential (no rayon p95 jitter). Adds prefix-parity, limit, and parallel-resolve regression tests.
  • backend.rs decomposition — extracts DisplayRow into display_row.rs, drops the file-size exception.
  • docs(benchmarks) — public v0.5.120 cross-tool snapshot vs Everything.

Benchmark result (verified-fresh daemon, C: + D:, 7.97M records)

Acceptance gate MET: best-or-tied vs both 0.5.66 and Everything on every row; beats Everything on all 16 comparable rows (C: prefix is a 1ms tie). Sets six new bests: D:/C,D: full_scan, D: prefix, D:/C,D: substring, C,D: ext_dll. Median UFFS/ES ratio ~0.52x (~1.9x faster).

Verification

  • cargo clippy -D warnings clean, cargo test -p uffs-core green (829 lib tests + new parity/regression tests).
  • Full lint-pre-push gate green (incl. windows lint, doc-tests, smoke).
  • Rebased onto main @ 0.5.119; 3 signed code commits + 1 signed docs commit.

Note: published artifact will be v0.5.120 after the post-merge CI version bump.

sort_and_localise ran a full O(N log N) value-sort even when limit
admits every candidate (e.g. `*` full-scan, limit=usize::MAX). The
downstream backend::sort_rows re-sorts the materialised rows by the
user's column anyway and truncate is a no-op, so the value-sort is
wasted work over millions of tuples.

Add an early return for limit >= candidates.len() that does only the
cheap MFT-locality sort (keeps DirCache warm for path resolution) and
skips the value-sort entirely. Recovers the full_scan C,D regression
(4.2s top-5 -> <=3.5s) without touching the limited-query path.
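
A minimal sketch of the early return described above, not the actual uffs-core code. The `Candidate` tuple layout and the body of the limited path are assumptions for illustration; only the `limit >= candidates.len()` gate and the locality-only sort come from this message.

```rust
/// Hypothetical candidate tuple: (sort value, MFT record number).
type Candidate = (u64, u64);

/// Sketch: when `limit` admits every candidate, skip the value-sort and
/// order by MFT record number only.
fn sort_and_localise(mut candidates: Vec<Candidate>, limit: usize) -> Vec<Candidate> {
    if limit >= candidates.len() {
        // Unlimited case: locality sort only. A later stage re-sorts the
        // materialised rows by the user's column, so a value-sort here
        // would be redundant.
        candidates.sort_unstable_by_key(|&(_, record)| record);
        return candidates;
    }
    // Limited case (assumed shape): value-sort, keep the first `limit`,
    // then restore MFT locality for path resolution.
    candidates.sort_unstable_by_key(|&(value, _)| value);
    candidates.truncate(limit);
    candidates.sort_unstable_by_key(|&(_, record)| record);
    candidates
}
```

With `limit = usize::MAX` the function performs one sort instead of two, and truncation never runs.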
#4)

#2 Trigram prefix fast-path: prefix queries (e.g. `win*`) now narrow
candidates via the first-3-char trigram lookup then filter by full
prefix, instead of scanning every record. Adds is_prefix_pattern() in
tree.rs, a new prefix_search.rs module, and is_prefix dispatch arms in
backend.rs (both search sites) + dispatch.rs (+ pick_mode_label).
Expected: prefix C 91->~72ms, C,D 95->~82ms (beats ES).
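
The narrowing step can be sketched as follows. This is an illustrative toy, not the prefix_search.rs implementation: `TrigramIndex`, its fields, the ASCII-only case folding, and the exact acceptance rule in `is_prefix_pattern` are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical acceptance rule: a wildcard-free stem of at least three
/// characters followed by exactly one trailing `*`.
fn is_prefix_pattern(pattern: &str) -> Option<&str> {
    let stem = pattern.strip_suffix('*')?;
    let literal = !stem.contains(|c: char| c == '*' || c == '?');
    (literal && stem.chars().count() >= 3).then_some(stem)
}

/// Toy trigram index: maps each 3-byte window of a lowercased name to the
/// record indices whose names contain it.
struct TrigramIndex {
    names: Vec<String>,
    postings: HashMap<[u8; 3], Vec<usize>>,
}

impl TrigramIndex {
    fn build(names: &[&str]) -> Self {
        let names: Vec<String> = names.iter().map(|n| n.to_ascii_lowercase()).collect();
        let mut postings: HashMap<[u8; 3], Vec<usize>> = HashMap::new();
        for (idx, name) in names.iter().enumerate() {
            for w in name.as_bytes().windows(3) {
                let list = postings.entry([w[0], w[1], w[2]]).or_default();
                // Names are processed in order, so a duplicate can only be last.
                if list.last() != Some(&idx) {
                    list.push(idx);
                }
            }
        }
        Self { names, postings }
    }

    /// Narrow by the stem's first trigram, then filter by the full prefix.
    fn search_prefix(&self, stem: &str) -> Vec<usize> {
        let stem = stem.to_ascii_lowercase();
        let Some(head) = stem.as_bytes().get(..3) else {
            return Vec::new();
        };
        let key = [head[0], head[1], head[2]];
        self.postings
            .get(&key)
            .map(|list| {
                list.iter()
                    .copied()
                    .filter(|&i| self.names[i].starts_with(stem.as_str()))
                    .collect()
            })
            .unwrap_or_default()
    }
}
```

The posting list for `win` also contains names such as `rewind.txt`, which is why the full-prefix filter is still needed after the lookup.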

#4 Size-gated parallel path resolution: indices_to_rows dispatches
sequential below RESOLVE_CHUNK_SIZE (4096) and par_chunks at/above it.
4096 keeps tiny exact queries (3-37 rows) off rayon (no p95 tail
jitter) while letting prefix/substring (12K-34K rows) fan out.
Expected: substring C 57->~38ms, C,D 58->~47ms.
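
A dependency-free sketch of the size gate. The real code uses rayon's par_chunks; this version swaps in std::thread::scope with one thread per chunk so it compiles without external crates. `resolve` is a hypothetical stand-in for path resolution, and the 4096 gate is the value quoted in this message (the PR summary cites a different threshold).

```rust
use std::thread;

/// Gate value as quoted in this commit message.
const RESOLVE_CHUNK_SIZE: usize = 4096;

/// Hypothetical stand-in for resolving one record index to a row.
fn resolve(index: u32) -> String {
    format!("row-{index}")
}

/// Sequential below the gate, chunked fan-out at or above it.
fn indices_to_rows(indices: &[u32]) -> Vec<String> {
    if indices.len() < RESOLVE_CHUNK_SIZE {
        // Tiny result sets never touch the thread machinery.
        return indices.iter().map(|&i| resolve(i)).collect();
    }
    thread::scope(|s| {
        // One scoped thread per chunk (rayon would use its pool instead).
        let handles: Vec<_> = indices
            .chunks(RESOLVE_CHUNK_SIZE)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&i| resolve(i)).collect::<Vec<_>>()))
            .collect();
        // Joining in spawn order keeps the output in input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("resolver thread panicked"))
            .collect()
    })
}
```

Joining the chunks in spawn order is what a parity test against the sequential path would guard: the parallel result must match element for element.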

Decompose: extract the indices_to_rows family into the new sibling
module row_resolve.rs so query/mod.rs stays under the 800-LOC policy
(809 -> 694), no file_size_exceptions entry added.

Tests: is_prefix_pattern acceptance matrix (tree.rs), prefix/glob
parity + limit (query_tests), and a 9000-row parallel-resolve parity
test guarding the chunk-reduce ordering.
…d.rs size exception

backend.rs was 1067 LOC and carried a PERMANENT file_size_exceptions
entry. Per workspace policy (decompose, don't suppress), move the
self-contained DisplayRow type — struct + inherent impl + Default +
uffs_format::FormatRow impl — into a new sibling module display_row.rs
(289 LOC). backend.rs drops to 784 LOC, under the 800 ceiling.

DisplayRow is re-exported (`pub use super::display_row::DisplayRow;`)
so the single-import convention downstream relies on
(uffs_core::search::backend::DisplayRow) is unchanged — public API and
behavior preserved.
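
The re-export pattern, reduced to a single-file sketch. The module nesting mirrors the paths named above; the `name` field and `blank_row` helper are hypothetical and exist only to make the example compile.

```rust
mod search {
    pub mod display_row {
        /// Stand-in for the moved type; the field is hypothetical.
        #[derive(Debug, Default, PartialEq)]
        pub struct DisplayRow {
            pub name: String,
        }
    }

    pub mod backend {
        // Re-export keeps the old import path working after the move.
        pub use super::display_row::DisplayRow;
    }
}

// Downstream code keeps importing through the backend path.
use search::backend::DisplayRow;

fn blank_row() -> DisplayRow {
    DisplayRow::default()
}
```

Both paths name the same type, so existing callers compile unchanged.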

Removes the backend.rs entry from scripts/ci/file_size_exceptions.txt.

Public-facing, fact-only benchmark snapshot of the verified-fresh
cross-tool run (UFFS v0.5.120 vs Everything 1.4.1.1032) on C: + D:
(7.97M records, Ryzen 9 3900XT / Win11 24H2). States results only,
not methodology, linking docs/benchmarks/methodology.md for the
fairness doctrine.

Headline: UFFS wins 17/18 targeted head-to-head cells at p50
(median ~0.52x, ~1.9x faster); the 18th (C: prefix) is a 1ms tie.
Mirrors the structure of the 2026-04 v0.5.66 report.

REUSE: covered by the repo-wide ** -> MPL-2.0 annotation in REUSE.toml.

@githubrobbi enabled auto-merge (squash) June 9, 2026 18:44

The golden cpp_*.txt baseline is immutable across reruns.  Hashing a
multi-GB file on every invocation wastes seconds for no benefit.

Add compute_streaming_stats_cached: writes a .parityhash sidecar
keyed on (size_bytes, mtime_nanos); subsequent runs skip the SHA256
pass entirely if the file hasn't changed.  Falls back to a full
recompute if the sidecar is absent, stale, or unreadable.

Also annotates the baseline hash line with ', golden cached' so
the operator can confirm the fast-path engaged.
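
A sketch of the sidecar scheme under stated assumptions: the sidecar file format, the `digest_cached` name, and the returned `bool` flag are illustrative and not taken from compute_streaming_stats_cached. The expensive hash is passed in as a closure so the example needs only the standard library.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::UNIX_EPOCH;

/// `<file>.parityhash`, next to the baseline.
fn sidecar_path(file: &Path) -> PathBuf {
    let mut name = file.as_os_str().to_owned();
    name.push(".parityhash");
    PathBuf::from(name)
}

/// Cache key: (size_bytes, mtime_nanos).
fn cache_key(file: &Path) -> io::Result<(u64, u128)> {
    let meta = fs::metadata(file)?;
    let mtime = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    Ok((meta.len(), mtime.as_nanos()))
}

/// Returns (digest, served_from_cache). `compute` stands in for the full
/// SHA256 pass over the file.
fn digest_cached(
    file: &Path,
    compute: impl FnOnce(&Path) -> io::Result<String>,
) -> io::Result<(String, bool)> {
    let (size, mtime) = cache_key(file)?;
    let header = format!("{size} {mtime} ");
    let sidecar = sidecar_path(file);
    // An absent, unreadable, or stale sidecar falls through to a recompute.
    if let Ok(text) = fs::read_to_string(&sidecar) {
        if let Some(digest) = text.trim_end().strip_prefix(header.as_str()) {
            if !digest.is_empty() {
                return Ok((digest.to_owned(), true));
            }
        }
    }
    let digest = compute(file)?;
    // Best effort: a failed sidecar write must not fail the run.
    let _ = fs::write(&sidecar, format!("{header}{digest}\n"));
    Ok((digest, false))
}
```

The boolean plays the role of the ', golden cached' annotation: it tells the caller whether the fast path was taken.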

@githubrobbi merged commit bb0bd94 into main Jun 9, 2026
27 checks passed
@githubrobbi deleted the perf/master-speed-20260609 branch June 9, 2026 19:04