From 08f177dde7acab4c3a9ffb895aceee8a494910ea Mon Sep 17 00:00:00 2001 From: Nelson Spence Date: Wed, 27 May 2026 09:18:39 -0500 Subject: [PATCH 1/3] docs: attribute Lean formalisations to their repos and correct framing The README and docs/RANK_MODES.md claimed the rank-transform monotone- reparametrisation invariance was machine-checked in Lean without linking an artifact, and RANK_MODES still said the Bayes-optimality theorem was not claimed. Both are now public, sorry-free Lean 4 proofs: - ordinal (rank) invariance under strictly monotone reparametrisation -> Project-Navi/takens-formalization (isOrdinalPatternOf_comp_strictMono) - in-model Bayes-threshold optimality of the popcount overlap cutoff + exact hypergeometric null -> Fieldnote-Echo/ordvec-formalization (exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail) Link both and keep the in-model scoping explicit (optimal under the modeled MLR/FNCH assumptions, not a claim that any corpus obeys them). Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Nelson Spence --- README.md | 24 +++++++++++++++++++----- docs/RANK_MODES.md | 26 +++++++++++++++++--------- 2 files changed, 36 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 096ba353..11340bbb 100644 --- a/README.md +++ b/README.md @@ -78,11 +78,25 @@ independent uniform top-bucket sets — overlap **hypergeometrically**, unrelated document clears a given overlap threshold — is closed-form and data-independent, not a tuned cutoff. (Whether *true* neighbours clear the bar is empirical; this is an exact candidate-generation null, not a -retrieval-optimality theorem.) The invariance underneath it — that the rank -transform is unchanged by any strictly monotone reparametrisation of the -coordinates — is separately machine-checked in Lean, with the formalisation -accompanying the paper. Details in -[`docs/RANK_MODES.md`](docs/RANK_MODES.md). +retrieval-optimality theorem.) Two pieces of this are separately machine-checked in Lean 4, both `sorry`-free +on Lean's standard axiom base (`propext`, `Classical.choice`, `Quot.sound`): + +- the **ordinal invariance** the rank transform rests on — that a vector's + sorting permutation is unchanged by any strictly monotone reparametrisation + of its coordinates — in + [`takens-formalization`](https://github.com/Project-Navi/takens-formalization) + (theorem `isOrdinalPatternOf_comp_strictMono`); and +- the **shape of the bitmap candidate filter** — that under a finite + monotone-likelihood-ratio overlap-tilt model, an overlap-count *threshold* + (the popcount cutoff) is the Bayes-optimal deterministic admission rule, and + the uniform constant-weight null gives that threshold exactly the + hypergeometric upper tail — in + [`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization) + (theorem `exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`). + This is an *in-model* result: it fixes the optimal rule's shape under the + stated assumptions, not a claim that any given corpus obeys the model. + +Details in [`docs/RANK_MODES.md`](docs/RANK_MODES.md). ## Quickstart diff --git a/docs/RANK_MODES.md b/docs/RANK_MODES.md index 1f88caed..8eecf3d7 100644 --- a/docs/RANK_MODES.md +++ b/docs/RANK_MODES.md @@ -128,15 +128,23 @@ don't have a hypergeometric null because they don't have fixed bucket cardinalities — their score distribution depends on the unknown embedding distribution. -**A research program this suggests.** The chain — representation → -statistic → retrieval theorem → systems implementation — has a -plausible formal target. Under a shared-latent-support model where -relevant documents have elevated coordinates on a query-specific -support set `S_q`, the top-bucket overlap statistic is monotone in -the likelihood ratio for relevance, suggesting that bitmap probing -may approach Bayes-optimality under that model. We do not claim -that theorem here; this section flags it as the natural -mathematical direction for the empirical results below. +**A research program this suggests — now partly machine-checked.** The +chain — representation → statistic → retrieval theorem → systems +implementation — now has a checked link in the middle. Under a finite +monotone-likelihood-ratio overlap-tilt model (the formal cousin of a +shared-latent-support model, where relevant documents have elevated +coordinates on a query-specific support set `S_q`), the top-bucket +overlap statistic is monotone in the likelihood ratio for relevance, +and an overlap-count threshold — the popcount cutoff — is the +Bayes-optimal deterministic admission rule, with the uniform +constant-weight null assigning that threshold exactly the +hypergeometric upper tail. That is proved `sorry`-free in +[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization) +(`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`). +It is an *in-model* result: it fixes the shape of the optimal rule +under the stated assumptions, not a claim that any given corpus obeys +the model — whether real neighbours clear the bar stays empirical, +which is what the bench and the paper measure. The systems consequence is what the bench measures: at a moderate M the bitmap probe captures most of exact RankQuant's top-10 neighbours, From 82801039a584d410cc8079ce63891e45089765d4 Mon Sep 17 00:00:00 2001 From: Nelson Spence Date: Wed, 27 May 2026 09:18:39 -0500 Subject: [PATCH 2/3] feat: mark SearchResults and candidate-gen methods #[must_use] Discarding a search result is a silent no-op footgun: e.g. `index.search_asymmetric(&queries, 10);` does the work and drops it. Annotate SearchResults (covers every search* method that returns it) and the pure top_m_candidates* query methods on Bitmap/SignBitmap so the compiler warns on an unused result. Compile-only metadata; no runtime change. Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Nelson Spence --- src/bitmap.rs | 3 +++ src/lib.rs | 1 + src/sign_bitmap.rs | 2 ++ 3 files changed, 6 insertions(+) diff --git a/src/bitmap.rs b/src/bitmap.rs index 319ec9b3..3f48b678 100644 --- a/src/bitmap.rs +++ b/src/bitmap.rs @@ -180,6 +180,7 @@ impl Bitmap { /// `Vec` of all N scores and `select_nth_unstable` the /// top-`m`: O(N + m log m). The ~808 KiB temp at N=207k is /// cheap relative to the cost it saves at M ≥ 1000. + #[must_use] pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec { assert_all_finite(q); let m_eff = m.min(self.n_vectors); @@ -224,6 +225,7 @@ impl Bitmap { /// `queries` is a flat `batch * dim` f32 slice. Returns /// `Vec>` with one top-`m` set per query, sorted by /// overlap descending. + #[must_use] pub fn top_m_candidates_batched(&self, queries: &[f32], m: usize) -> Vec> { let dim = self.dim; let batch = queries.len() / dim; @@ -294,6 +296,7 @@ impl Bitmap { /// [`Self::top_m_candidates_batched`] in parallel via rayon. Use /// when the full query workload is larger than one batch fits /// efficiently in L2/L3. + #[must_use] pub fn top_m_candidates_batched_chunked( &self, queries: &[f32], diff --git a/src/lib.rs b/src/lib.rs index 2182437e..7fb43af0 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -109,6 +109,7 @@ pub type RankQuantFastscanIndex = RankQuantFastscan; /// type. Prefer the slice accessors above for read-only per-query access — /// exposing the flat buffers as the stable representation is the trade-off for /// that zero-copy interop. +#[must_use = "search produces results by value; dropping them does no work"] pub struct SearchResults { pub scores: Vec, pub indices: Vec, diff --git a/src/sign_bitmap.rs b/src/sign_bitmap.rs index dd6066e6..94c3a5fe 100644 --- a/src/sign_bitmap.rs +++ b/src/sign_bitmap.rs @@ -125,6 +125,7 @@ impl SignBitmap { /// `m_eff` produce a deterministic survivor set across runs and /// SIMD dispatch paths — same audit discipline as /// [`crate::Bitmap::top_m_candidates`]. + #[must_use] pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec { let m_eff = m.min(self.n_vectors); if m_eff == 0 { @@ -157,6 +158,7 @@ impl SignBitmap { /// top-`m` candidate sets for `batch` queries in parallel. Mirrors /// [`crate::Bitmap::top_m_candidates_batched`] in kernel /// shape (CHUNK=8 hot + tail) and tie-break semantics. + #[must_use] pub fn top_m_candidates_batched(&self, queries: &[f32], m: usize) -> Vec> { let dim = self.dim; let batch = queries.len() / dim; From 1d81187ca07c62f35cba28bf30aa37ea993ab811 Mon Sep 17 00:00:00 2001 From: Nelson Spence Date: Wed, 27 May 2026 09:43:07 -0500 Subject: [PATCH 3/3] =?UTF-8?q?fix:=20address=20PR=20#89=20review=20?= =?UTF-8?q?=E2=80=94=20must=5Fuse=20messages=20+=20README=20grammar?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - SearchResults #[must_use]: reword (Copilot) — the scan already ran, so a dropped result discards work; it is not a no-op. - top_m_candidates* (Bitmap/SignBitmap): add descriptive #[must_use] messages for consistency (gemini). - README: 'on which the rank transform rests' grammar (gemini). Declined: RANK_MODES 'rather than claiming that' reword — kept 'not a claim that' parallel with the README. Codex's theorem-name finding is a false alarm (it read a pre-sweep snapshot of ordvec-formalization; the headline theorem is real at HEAD) — name-stability guard tracked in ordvec-formalization#5. Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: Nelson Spence --- README.md | 2 +- src/bitmap.rs | 6 +++--- src/lib.rs | 2 +- src/sign_bitmap.rs | 4 ++-- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 11340bbb..ab838a50 100644 --- a/README.md +++ b/README.md @@ -81,7 +81,7 @@ is empirical; this is an exact candidate-generation null, not a retrieval-optimality theorem.) Two pieces of this are separately machine-checked in Lean 4, both `sorry`-free on Lean's standard axiom base (`propext`, `Classical.choice`, `Quot.sound`): -- the **ordinal invariance** the rank transform rests on — that a vector's +- the **ordinal invariance** on which the rank transform rests — that a vector's sorting permutation is unchanged by any strictly monotone reparametrisation of its coordinates — in [`takens-formalization`](https://github.com/Project-Navi/takens-formalization) diff --git a/src/bitmap.rs b/src/bitmap.rs index 3f48b678..6b9cf3c8 100644 --- a/src/bitmap.rs +++ b/src/bitmap.rs @@ -180,7 +180,7 @@ impl Bitmap { /// `Vec` of all N scores and `select_nth_unstable` the /// top-`m`: O(N + m log m). The ~808 KiB temp at N=207k is /// cheap relative to the cost it saves at M ≥ 1000. - #[must_use] + #[must_use = "this scans the corpus to generate candidates; dropping the result discards that work"] pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec { assert_all_finite(q); let m_eff = m.min(self.n_vectors); @@ -225,7 +225,7 @@ impl Bitmap { /// `queries` is a flat `batch * dim` f32 slice. Returns /// `Vec>` with one top-`m` set per query, sorted by /// overlap descending. - #[must_use] + #[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"] pub fn top_m_candidates_batched(&self, queries: &[f32], m: usize) -> Vec> { let dim = self.dim; let batch = queries.len() / dim; @@ -296,7 +296,7 @@ impl Bitmap { /// [`Self::top_m_candidates_batched`] in parallel via rayon. Use /// when the full query workload is larger than one batch fits /// efficiently in L2/L3. - #[must_use] + #[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"] pub fn top_m_candidates_batched_chunked( &self, queries: &[f32], diff --git a/src/lib.rs b/src/lib.rs index 7fb43af0..a2134e0c 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -109,7 +109,7 @@ pub type RankQuantFastscanIndex = RankQuantFastscan; /// type. Prefer the slice accessors above for read-only per-query access — /// exposing the flat buffers as the stable representation is the trade-off for /// that zero-copy interop. -#[must_use = "search produces results by value; dropping them does no work"] +#[must_use = "search runs the full scan to produce these results; dropping them discards that work"] pub struct SearchResults { pub scores: Vec, pub indices: Vec, diff --git a/src/sign_bitmap.rs b/src/sign_bitmap.rs index 94c3a5fe..650d54b5 100644 --- a/src/sign_bitmap.rs +++ b/src/sign_bitmap.rs @@ -125,7 +125,7 @@ impl SignBitmap { /// `m_eff` produce a deterministic survivor set across runs and /// SIMD dispatch paths — same audit discipline as /// [`crate::Bitmap::top_m_candidates`]. - #[must_use] + #[must_use = "this scans the corpus to generate candidates; dropping the result discards that work"] pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec { let m_eff = m.min(self.n_vectors); if m_eff == 0 { @@ -158,7 +158,7 @@ impl SignBitmap { /// top-`m` candidate sets for `batch` queries in parallel. Mirrors /// [`crate::Bitmap::top_m_candidates_batched`] in kernel /// shape (CHUNK=8 hot + tail) and tie-break semantics. - #[must_use] + #[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"] pub fn top_m_candidates_batched(&self, queries: &[f32], m: usize) -> Vec> { let dim = self.dim; let batch = queries.len() / dim;