From 7729d4c4daea5661b23961c084fd6407d5ecdeb6 Mon Sep 17 00:00:00 2001 From: Todd Baur <523330+toadkicker@users.noreply.github.com> Date: Sat, 13 Jun 2026 11:40:54 -0700 Subject: [PATCH 01/16] Add ordinal-routing research: probes, impossibility proofs, CRT vernier formalization Investigates whether prime/spectral structure can improve training-free vector routing over ordvec codes. Net: the genuinely defensible results are the CRT vernier seam structure and the rigidity theory; the empirical rigidity/routing probes are exploratory with identified confounds (documented inline). Deliverables: - examples/crt_seam_oracle.rs: exhaustive finite proof of the CRT seam-density structure (capped form, lcm spacing, single coincidence, honest no-floor result) - examples/{spectral_probe,shard_recall,twonn_id,gen_corpus}.rs: routing-key probes + corpus-zoo generator (npy I/O matching bench_rank) - benchmarks/rigidity_impossibility_proofs.md: Thm 2/3 (key is not number-variance-rigid; Theorem 2 binomial value L(1-L/n); over-broad optimality claim retracted per review) - benchmarks/crt_seam_oracle_results.md: capped density form min(2t+1,m_i)/m_i - benchmarks/lean/VernierSeamDensity.lean: Lean 4 skeleton for ordvec-formalization (crtEquiv crux carries required NeZero hypothesis) - benchmarks/ADVERSARIAL_REVIEW.md: three hostile-reviewer findings, with each contested claim tiered SOUND / CORRECTED / EXPLORATORY / RETRACTED Adversarial-reviewed before commit; correctness fixes applied, empirical overclaims demoted to exploratory with salvage paths. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/ADVERSARIAL_REVIEW.md | 70 ++++ benchmarks/REAL_CORPUS_RUNBOOK.md | 48 +++ benchmarks/conjecture_citation_audit.md | 41 ++ benchmarks/corpus_zoo_results.md | 66 +++ benchmarks/crt_seam_oracle_results.md | 83 ++++ benchmarks/lean/README.md | 59 +++ benchmarks/lean/VernierSeamDensity.lean | 141 +++++++ benchmarks/rigidity_impossibility_proofs.md | 121 ++++++ benchmarks/shard_recall_results.md | 64 +++ benchmarks/spectral_probe_results.md | 52 +++ benchmarks/twonn_id_results.md | 61 +++ examples/crt_seam_oracle.rs | 164 ++++++++ examples/gen_corpus.rs | 275 +++++++++++++ examples/shard_recall.rs | 432 ++++++++++++++++++++ examples/spectral_probe.rs | 327 +++++++++++++++ examples/twonn_id.rs | 235 +++++++++++ 16 files changed, 2239 insertions(+) create mode 100644 benchmarks/ADVERSARIAL_REVIEW.md create mode 100644 benchmarks/REAL_CORPUS_RUNBOOK.md create mode 100644 benchmarks/conjecture_citation_audit.md create mode 100644 benchmarks/corpus_zoo_results.md create mode 100644 benchmarks/crt_seam_oracle_results.md create mode 100644 benchmarks/lean/README.md create mode 100644 benchmarks/lean/VernierSeamDensity.lean create mode 100644 benchmarks/rigidity_impossibility_proofs.md create mode 100644 benchmarks/shard_recall_results.md create mode 100644 benchmarks/spectral_probe_results.md create mode 100644 benchmarks/twonn_id_results.md create mode 100644 examples/crt_seam_oracle.rs create mode 100644 examples/gen_corpus.rs create mode 100644 examples/shard_recall.rs create mode 100644 examples/spectral_probe.rs create mode 100644 examples/twonn_id.rs diff --git a/benchmarks/ADVERSARIAL_REVIEW.md b/benchmarks/ADVERSARIAL_REVIEW.md new file mode 100644 index 00000000..83245672 --- /dev/null +++ b/benchmarks/ADVERSARIAL_REVIEW.md @@ -0,0 +1,70 @@ +# Adversarial review findings (three independent hostile reviewers) + +This branch was reviewed by three adversarial agents (Rust probes, math proofs, +Lean skeleton). Their findings are recorded here verbatim-in-summary so the PR +carries its own critique. Conclusions are tiered by what SURVIVED review. + +## SOUND (ships as claimed) + +- **CRT seam oracle** (`examples/crt_seam_oracle.rs`): exhaustive finite proof. + Coincidence spacing = lcm; exactly one all-L coincidence per period; the + honest negative result that phases cannot generically rescue a pointwise floor. + All three reviewers independent-agree this file is honest. +- **TwoNN metric fix** (chord vs cosine): the squared-distance bug fix is correct + and validated on sphere controls. + +## CORRECTED IN THIS BRANCH (was wrong/loose, now fixed) + +- **CRT density closed form**: must be `∏ min(2t+1, m_i)/m_i`, NOT `∏(2t+1)/m_i`. + The literal product exceeds 1 at t=3 (period-3 grid saturates) — the SAME error + class the doc attributed to the rejected `(2t)^L/M` form. The Rust code already + caps with `.min(p)`; the markdown and Lean statements are corrected to match. + Valid precondition for the uncapped form: `2t+1 ≤ min_i m_i` (t≤1 here). +- **Lean `crtEquiv`**: false as typed — `ZMod 0 = ℤ` breaks the cardinality + match. Needs `[∀ i, NeZero (m i)]`. Signature corrected; `hcongr` lemma plan + corrected to `Equiv.subtypeEquiv` + `Fintype.card_congr` (the cited + `Equiv.card_filter_map` does not exist). + +## DEMOTED TO EXPLORATORY (known confounds — NOT a settled finding) + +- **"Routing keys are super-Poisson, never rigid"** (`spectral_probe.rs`, + `corpus_zoo_results.md`): the fixed **Gaussian** unfold only fits the isotropic + marginal. Non-isotropic corpora are unfolded by the wrong density, which can + MANUFACTURE apparent clustering. The isotropic control and zoo are therefore + partly circular. To salvage: unfold by each corpus's **empirical** smooth + marginal (monotone spline on sorted keys — NOT the rank-overwrite currently + mislabeled `--unfold-empirical`, which returns Σ²=0 by construction). UNTIL + THEN this is exploratory, not a proven property of embedding keys. +- **"Random offsets redundant / coprime adds nothing"** (`shard_recall.rs`): the + RandomOffset arm draws an extra RNG value per projection, desyncing the stream + so arms get DIFFERENT projection directions — it was never the clean phase-only + ablation claimed. Plus the "fair envelope" is not fair (coprime arms saturate + below high budgets). The *retreat* ("needs a within-axis harness") is sound; + the positive tie claim is withdrawn. + +## IMPOSSIBILITY PROOFS — repairable, currently overstated + +- **Theorem 2**: conclusion (no rigidity) holds, but proof text is WRONG as + written — n i.i.d. uniforms are a BINOMIAL process, Σ²(L) = L(1−L/n), not + "Poisson, Σ²=L exactly, independent". Restate with the binomial value; the + Θ(L) conclusion is unaffected. +- **Theorem 3**: correct under its hypothesis but the hypothesis is smuggled — + "any fixed-distribution corpus" must be stated as "conditionally i.i.d. given + latent θ (mixture of i.i.d.)". de Finetti is decorative; it's the law of total + variance. Finite-without-replacement (the escape hatch) is excluded by + ASSUMPTION, not proof. +- **NON-SEQUITUR (must retract)**: "quantile bucketing is optimal against the + entire achievable class / prime-spectral structure provably is not there" does + NOT follow from Σ²(L) ≥ L. Number variance and partition recall/load-balance + are different figures of merit. Narrow the claim to: "the key is not + number-variance-rigid" (true), and drop the optimality-over-all-partitions + claim unless separately proved (likely false). + +## Net + +The mathematically defensible deliverables are: the CRT vernier structure +(oracle + corrected density + Lean skeleton with corrected `crtEquiv` signature), +and the TwoNN metric fix. The empirical rigidity/routing findings are +exploratory with identified confounds and concrete salvage paths. The +impossibility theorems are directionally right but need restatement (binomial +not Poisson; mixture hypothesis explicit; optimality claim retracted). diff --git a/benchmarks/REAL_CORPUS_RUNBOOK.md b/benchmarks/REAL_CORPUS_RUNBOOK.md new file mode 100644 index 00000000..84848edf --- /dev/null +++ b/benchmarks/REAL_CORPUS_RUNBOOK.md @@ -0,0 +1,48 @@ +# Real-corpus runbook for the conjecture probes + +All four examples now consume the SAME `.npy` format `bench_rank` documents: +2-D little-endian float32 (`random-offset gap that survives reseeding would be the surprise worth +chasing. diff --git a/benchmarks/conjecture_citation_audit.md b/benchmarks/conjecture_citation_audit.md new file mode 100644 index 00000000..7fd0161c --- /dev/null +++ b/benchmarks/conjecture_citation_audit.md @@ -0,0 +1,41 @@ +# Citation audit — prime/RH/embedding-geometry conjecture + +Verified by direct WebFetch against primary sources (subagent sandbox had no web +egress; their "VERIFIED" tags were memory passes and are NOT relied on here). +WebSearch was backend-down; all checks below are WebFetch on known URLs. + +## VERIFIED ✓ (fetched primary source) + +| Claim | Verified value | Source | +|---|---|---| +| GPT-2 final-layer anisotropy | ≈0.99 ("almost perfect cosine similarity"); ~0.6 in layers 2–8, rising exponentially layers 8–12 | Ethayarajh 2019, arXiv:1909.00512 (via ar5iv) | +| Rogue dimensions dominate cosine | "a small number of rogue dimensions, often just 1–3, dominate these measures" (exact quote) | Timkey & van Schijndel 2021, arXiv:2109.04404 | +| Super-prime counting function | π_q(x) ~ x/(log x)²; q_n ~ n(log n)² | Broughan & Barnett 2009, J. Integer Seq. vol 12 | +| Mathlib CRT primitive | `ZMod.chineseRemainder : ZMod (m*n) ≃+* ZMod m × ZMod n` (exact signature) | mathlib4_docs ZMod/Basic | +| Discrepancy rates | uniform grid O(1/N); optimal LDS O(log N/N); random/Poisson O(√(log log N/N)) | Wikipedia Low-discrepancy_sequence | + +## PARTIAL / QUALITATIVE (source confirms direction, not exact figure) + +| Claim | Status | +|---|---| +| Mu & Viswanath D≈d/100 rule | Principle confirmed ("common mean + a few top directions"); the exact d/100 ratio NOT stated in fetched text. Cite as "a few top components", not d/100. arXiv:1702.01417 | +| Ansuini intrinsic dimension | NUMBERS CONFIRMED (ar5iv): last hidden layers ID ≈ 12–25; peak (early layers, rel.depth 0.2–0.4) ≈ 100–120; ID/ambient ≈ 2e-4. Hunchback = rise then contract. arXiv:1905.12784. NOTE: vision/CNN nets, not sentence encoders. | +| Valeriani et al. 2023 transformer ID | Qualitative confirmed: same expand–contract–stabilize ID profile in transformers (protein LM + image). Exact peak number not extracted. arXiv:2302.00294 | +| Routing implication | d_int ≈ low tens ⇒ JL R ≈ c·d_int ⇒ R∈{8,16} projections suffice (matches shard_recall envelope still rising at R=16). Projection budget sized by ~20, not ambient 256/1024. | + +## CORRECTED (subagent confabulations caught) + +- k-fold prime-cascade ~1/(log x)^k: **HEURISTIC, not proved.** Subagent falsely + cited Goldston–Yıldırım (that paper is about small prime gaps — unrelated). + Only the single iteration (Broughan&Barnett) is proved. +- Super-prime count "x/(ln x · ln ln x)": **wrong form**; paper proves x/(log x)². +- Prime-sequence discrepancy "O(n^-1/2)" (my own earlier claim): **unsupported**; + no clean published rate for scaled primes. Verdict holds via uniform/quantile + dominance regardless. + +## STILL UNVERIFIED (needs working WebSearch or specific PDF) + +- Exact intrinsic-dimension number for modern sentence embeddings + (sentence-transformers / E5 / text-embedding-3). No confident published TwoNN + figure located. This is the one empirical gap for the Level-2 anisotropy argument. +- Cai et al. 2021 "local isotropy" arXiv id unconfirmed. diff --git a/benchmarks/corpus_zoo_results.md b/benchmarks/corpus_zoo_results.md new file mode 100644 index 00000000..aea46547 --- /dev/null +++ b/benchmarks/corpus_zoo_results.md @@ -0,0 +1,66 @@ +> ⚠️ EXPLORATORY — CIRCULAR AS WRITTEN (adversarial review). Every non-isotropic +> geometry is unfolded by a fixed Gaussian CDF it does not match, so the +> "robustness across geometries" partly measures the unfold mismatch, not the +> corpus. Isotropic is the only marginal the unfold fits. NOT a settled finding; +> re-run with per-corpus empirical-marginal unfolding to salvage. See +> benchmarks/ADVERSARIAL_REVIEW.md. + +# Corpus zoo: number-variance robustness across geometries + +Rigorous multi-corpus verification of the "embedding routing keys are not rigid" +finding. Generates 6 geometries as .npy (`examples/gen_corpus.rs`) and runs the +number-variance probe on each. Includes instrument controls at BOTH ends. + +n=50k, dim=256, 8 random-projection keys, Gaussian unfold. + +## Σ²(L)/L (≈1 Poisson · >1 clustered · <1 RIGID) + +| kind | L=8 | L=32 | L=128 | L=512 | reading | +|-------------|-------|-------|-------|--------|---------| +| isotropic | 0.99 | 0.99 | 0.99 | 0.93 | Poisson NULL ✓ | +| clustered | 1.34 | 2.39 | 6.13 | 18.5 | super-Poisson | +| anisotropic | 2.38 | 6.48 | 22.7 | 85.0 | super-Poisson | +| rogue | 1.60 | 3.34 | 10.0 | 34.1 | super-Poisson | +| manifold | 2.93 | 8.73 | 30.3 | 96.4 | super-Poisson | + +## Instrument controls (CRITICAL — proves the probe isn't blind) + +| control | result | proves | +|---------|--------|--------| +| `--rigid-selftest` (perfect lattice key) | Σ²/L = 0.0000 all L | probe DETECTS rigidity ✓ | +| isotropic corpus | Σ²/L = 0.99 flat | probe DETECTS Poisson ✓ | + +## Findings + +1. **Robust across 5 geometries: embedding-like keys are super-Poisson + (clustered), never rigid.** Crucially this includes the NONLINEAR MANIFOLD + (2.9→96) — the one geometry where angular rigidity could plausibly have + hidden. It didn't. +2. **The probe is validated at both ends.** Perfect lattice → Σ²=0 (sees + rigidity); isotropic → Σ²/L≈1 (sees Poisson). So the super-Poisson readings + are real, not instrument blindness. This was the gap that would have let a + skeptic dismiss the whole spectral finding; it is now closed. +3. **Bonus substantive result (random-projection lattice control):** a sequence + that is rigid along ONE direction does NOT produce a rigid key after random + projection + L2 normalization (tested: still super-Poisson). I.e. even if an + embedding had rigid structure on a special axis, a data-oblivious + random-projection router would not see it — reinforcing that spectral + structure is not exploitable training-free. (The instrument self-test + isolates the estimator from this pipeline effect.) + +## Verdict + +The "keys are non-rigid → quantile bucketing is the whole story, no spectral +opening" conclusion is now verified across diverse corpus geometries with +validated instrument controls, not a single synthetic distribution. A FALLING +Σ²/L on a real embedding dump would still be the surprise worth chasing — run +the zoo recipe on real data via the runbook. + +Reproduce: +``` +for K in isotropic clustered anisotropic rogue manifold; do + cargo run --release --example gen_corpus -- --kind $K --n 50000 --dim 256 --out zoo_$K.npy + cargo run --release --example spectral_probe -- --corpus-npy zoo_$K.npy +done +cargo run --release --example spectral_probe -- --rigid-selftest # instrument control +``` diff --git a/benchmarks/crt_seam_oracle_results.md b/benchmarks/crt_seam_oracle_results.md new file mode 100644 index 00000000..5f691390 --- /dev/null +++ b/benchmarks/crt_seam_oracle_results.md @@ -0,0 +1,83 @@ +# CRT seam oracle — corrected vernier theorem (exhaustive finite proof) + +`examples/crt_seam_oracle.rs` enumerates the full ring Z/M to verify the +within-axis vernier (coprime-grid) theorem exactly. This is the SPEC for the +vernier recall experiment and the corrected target for any Lean formalization. +Exhaustive enumeration over the finite ring IS the proof for these parameters. + +Params: L=4 grids on one axis, coprime periods {3,5,7,11}, M=1155. + +## CONFIRMED exactly + +1. **Coincidence spacing = lcm.** First common seam of grids (m_i,m_j) at phi=0 + is exactly lcm(m_i,m_j). Coprime ⇒ lcm = m_i·m_j (maximal spacing: 3,5→15; + 5,7→35). Non-coprime coincide early (4,6→12; 6,9→18). This is WHY coprimality + maximizes seam separation. +2. **Exactly one all-L coincidence per period M.** count = 1 over [0,M). ✓ + +## CORRECTED (the closed form is the CAPPED product) + +3. **Joint near-seam density is ∏_i min(2t+1, m_i)/m_i.** The uncapped + ∏(2t+1)/m_i is correct ONLY in the interior `2t+1 ≤ min_i m_i` (here t≤1); + beyond that, bands saturate a grid and that grid's factor caps at 1. The + uncapped form exceeds 1 at t=3 (period-3 grid: 7/3 > 1) — the SAME nonsense + as the rejected continuous (2t)^L/M. (Caught in adversarial review; the Rust + code already computes the capped product via `.min(p)`.) + + | t | measured | ∏min(2t+1,m_i)/m_i (capped) | ∏(2t+1)/m_i (uncapped) | (2t)^L/M | + |---|----------|------------------------------|------------------------|----------| + | 0 | 0.000866 | 0.000866 | 0.000866 | 0.000000 | + | 1 | 0.070130 | 0.070130 | 0.070130 | 0.013853 | + | 2 | 0.324675 | 0.324675 | 1.039 (>1, wrong) | 0.221645 | + | 3 | 0.636364 | 0.636364 | 2.078 (>1, wrong) | 1.122 | + + Measured matches the CAPPED product to 6 d.p. at all t. The uncapped product + is valid only for t≤1 (where 2t+1 ≤ min m_i = 3); it already exceeds 1 at t=2. + The provable/Lean statement is the capped form, OR the uncapped form under the + explicit precondition `2t+1 ≤ min_i m_i`. + +## REFUTED (stronger than the earlier caveat) + +4. **No generic pointwise seam-margin floor — phases cannot rescue it.** + min_x max_i dist_to_seam = 0 with phi=0 AND with staggered phases [0,1,2,3]. + For L=4 with these small periods, some position always sits on every grid's + seam regardless of phase. The "choose phases for a pointwise floor >= c" + claim FAILS in this regime; the floor is not generic, it exists only for + specific (periods, phases, L) and must be verified per-instance, not assumed. + +## Proof of the density closed form (was: numerically confirmed only) + +Claim: joint near-seam density = ∏_i min(2t+1, m_i)/m_i for pairwise-coprime +m_i. Equivalently = ∏_i (2t+1)/m_i under the precondition 2t+1 ≤ min_i m_i. + +Proof. CRT: x ↦ (x mod m_1, …, x mod m_L) is a BIJECTION Z/M → ∏ Z/m_i (M=∏m_i, +pairwise coprime). Position x is within band t of a seam on grid i iff +(x − φ_i) mod m_i lies in the centered band {−t,…,+t} mod m_i. The number of +DISTINCT such residues is min(2t+1, m_i): exactly 2t+1 when the band fits +(2t+1 ≤ m_i), capping at m_i once it saturates (the band wraps to cover all +residues). The wraparound/min in the distance condition is INTRA-coordinate +(depends only on x mod m_i), so it does not couple grids — CRT independence +across coordinates is untouched. The joint count therefore factors: + + count = ∏_i min(2t+1, m_i), density = count/M = ∏_i min(2t+1,m_i)/m_i. ∎ + +(Even m_i have an extra antipodal-residue caveat at the exact half-period; the +chosen odd periods avoid it.) The independence is precisely what CRT supplies; +without coprimality the residue coordinates are coupled and the product fails +(matches Check 1: non-coprime grids coincide early). This is the provable form — +NOT the uncapped ∏(2t+1)/m_i (>1 for t≥2 here) nor the continuous (2t)^L/M. + +## Consequence for the workplan + +- The provable theorem is the DENSITY/COUNT statement (checks 1–3, CAPPED product + ∏min(2t+1,m_i)/m_i), not a pointwise floor. A Lean proof should target the + capped form (or the uncapped form under explicit 2t+1 ≤ min m_i) and + "one coincidence per period", and should NOT claim a phase-tunable margin. +- The vernier RECALL experiment must therefore test whether the (small, + bounded) joint blind-spot region actually costs recall — it cannot lean on a + guaranteed per-query margin, because there isn't one. +- This is the ambush avoided: building the recall test on a phase-tunable-floor + premise, then having Lean fail to prove it, would have invalidated the + experiment after the fact. Caught by finite enumeration first. + +Reproduce: `cargo run --release --example crt_seam_oracle` diff --git a/benchmarks/lean/README.md b/benchmarks/lean/README.md new file mode 100644 index 00000000..422e4820 --- /dev/null +++ b/benchmarks/lean/README.md @@ -0,0 +1,59 @@ +# Lean formalization — vernier CRT seam-density + +`VernierSeamDensity.lean` formalizes the proved-on-paper seam-density theorem +(benchmarks/crt_seam_oracle_results.md) as a standalone module for the +`Fieldnote-Echo/ordvec-formalization` repo (Lean 4.28.0, Apache header, +`OrdvecFormalization` namespace). + +(An earlier `CrtSeamDensity.lean` draft was superseded by this module and removed.) + +## Placement note (corrected) + +This module does NOT build on the repo's `Finite*Quotient` / `QuotientKernel` +stack. That layer proves fiber-invariance/sufficiency of abstract quotient maps +`Ω → Z` (`productMap = Prod.map`). The seam-density theorem is a CRT *bijection* +`ZMod M ≃ ∏ ZMod mᵢ` plus a *cardinality count* — a different toolkit (Mathlib +`ZMod` + `Fintype.piFinset`), which the formalization repo does not otherwise +import. Correct home (that repo), independent module. + +## Status: convergent, not staged + +There is no `0.5.0` branch/tag in any Fieldnote-Echo repo (verified with auth). +Nelson independently reached similar conclusions; "target 0.5.0" means this is +intended for the next release, not that a branch exists to diff against. When his +working branch surfaces, reconcile theorem statements (impossibility + CRT +density) — convergence is corroboration for the paper. + +## What is established vs. open + +| Statement | Lean status | +|---|---| +| (A) product-space count `∏(2t+1)`, NO coprimality | structurally complete, modulo name-checks ([VERIFY] `Fintype.mem_piFinset`, `Fintype.card_piFinset`) | +| (B) position-space count over `ZMod M` | depends on `crtEquiv` (sorry) + `hcongr` (sorry) | +| one-coincidence-per-period (t=0) | follows from (B) | +| `card_admissible` (per-grid = 2t+1) | sorry — routine `ZMod.val` window injectivity | + +## The honest crux + +Coprimality is used in EXACTLY ONE place: `crtEquiv`, the indexed bijection +`ZMod (∏ m i) ≃ ∏ i, ZMod (m i)`. The product-space half (A) is true for any +family and deliberately carries no `cop` hypothesis — stating it that way +prevents the classic self-deception of "proving" a CRT theorem that never +actually touches coprimality. + +## Proof debt, ranked + +1. `crtEquiv` — fold `ZMod.chineseRemainder` over `Fin L` by induction, or + locate an existing indexed iso. Highest effort; the real mathematical content. +2. `card_admissible` — injectivity of `k ↦ (k : ZMod m) − t` on `{0..2t}` when + `2t+1 ≤ m`. Routine. +3. `hcongr` — cardinality transport across `crtEquiv`. Standard once (1) exists. +4. Confirm [VERIFY] names against installed Mathlib. + +## Build + +No Lean toolchain was present at drafting; names carry inline confidence tags. +Drop into the ordvec-formalization Mathlib project, `lake build`, resolve the +four debts above. de Finetti / rigidity theorems (rigidity_impossibility_proofs.md) +are NOT formalized here — they need Mathlib `ProbabilityTheory` and de Finetti, +which is a separate, larger project (de Finetti is not currently in Mathlib). diff --git a/benchmarks/lean/VernierSeamDensity.lean b/benchmarks/lean/VernierSeamDensity.lean new file mode 100644 index 00000000..48a38ef6 --- /dev/null +++ b/benchmarks/lean/VernierSeamDensity.lean @@ -0,0 +1,141 @@ +/- +Copyright (c) 2026 ordvec authors. All rights reserved. +Released under Apache 2.0 license as described in the file LICENSE. +Authors: (ordvec contributors) +-/ + +-- NOTE: does NOT build on the Finite*Quotient stack. That layer is about +-- fiber-invariance/sufficiency of quotient maps (Ω → Z); this theorem is a CRT +-- bijection + cardinality count, a different toolkit. Standalone module in the +-- OrdvecFormalization namespace, importing Mathlib ZMod directly (the repo does +-- not otherwise use ZMod — confirmed against its module list). +import Mathlib.Data.ZMod.Basic +import Mathlib.Data.Nat.GCD.Basic +import Mathlib.Data.Fintype.Pi +import Mathlib.Data.Fintype.BigOperators + +open scoped BigOperators + +namespace OrdvecFormalization + +/-! +# Vernier seam density (within-axis coprime grids) + +For pairwise-coprime grid periods `m : Fin L → ℕ` on a single routing axis, the +positions in `ZMod M` (`M = ∏ m i`) that lie within band `t` of a seam on EVERY +grid number exactly `∏ i, (2t+1)`, hence density `∏ i, (2t+1)/(m i)` when +`2t+1 ≤ m i`. Verified exhaustively in `examples/crt_seam_oracle.rs`; the closed +form is proved on paper in `benchmarks/crt_seam_oracle_results.md`. + +Structure mirrors the repo's quotient idiom: the CRT residue map is a +`productMap`-style coordinate split, and pairwise coprimality is what makes the +position-space map a bijection (the only place coprimality is used). + +DRAFT — drafted without a Lean toolchain. `sorry`s mark confirmed proof debt, +ranked at the file end. Lean 4.28.0 / Mathlib pinned by the repo `lake-manifest`. +-/ + +/-! ## Per-grid admissible residues -/ + +/-- Residues mod `m` within centered band `t` of `0`, as the image of the + window `{0,…,2t}` shifted by `-t`. -/ +def admissible (m t : ℕ) : Finset (ZMod m) := + (Finset.Icc 0 (2 * t)).image (fun k => (k : ZMod m) - (t : ZMod m)) + +/-- When the window fits (`2t+1 ≤ m`) the shift map is injective on it, so the + admissible set has exactly `2t+1` elements. -/ +theorem card_admissible {m t : ℕ} (h : 2 * t + 1 ≤ m) : + (admissible m t).card = 2 * t + 1 := by + -- card_image_of_injOn + injectivity of k ↦ (k:ZMod m) - t on a width-(2t+1) + -- window (no wraparound since width < m). Routine ZMod.val arithmetic. + sorry + +/-! ## Product-space count (coprimality NOT used) -/ + +/-- The joint blind spot over the residue PRODUCT space has cardinality + `∏ (2t+1)`. True for any family — carries no coprimality hypothesis, which + is the honest statement of what this half proves (it is `Fintype.card_piFinset`, + not CRT). -/ +theorem blindspot_card_product + {L : ℕ} (m : Fin L → ℕ) (t : ℕ) (hband : ∀ i, 2 * t + 1 ≤ m i) : + (Finset.univ.filter + (fun r : ∀ i, ZMod (m i) => ∀ i, r i ∈ admissible (m i) t)).card + = ∏ i, (2 * t + 1) := by + have hfilter : + (Finset.univ.filter (fun r : ∀ i, ZMod (m i) => ∀ i, r i ∈ admissible (m i) t)) + = Fintype.piFinset (fun i => admissible (m i) t) := by + ext r; simp [Fintype.mem_piFinset] -- [VERIFY] Fintype.mem_piFinset + rw [hfilter, Fintype.card_piFinset] -- [VERIFY] Fintype.card_piFinset + exact Finset.prod_congr rfl (fun i _ => card_admissible (hband i)) + +/-! ## CRT bijection (the only place coprimality is used) -/ + +/-- Indexed CRT equivalence `ZMod (∏ m i) ≃ ∏ i, ZMod (m i)` for a + pairwise-coprime family. [CRUX] Construct by folding `ZMod.chineseRemainder` + over `Fin L`, or via an existing indexed iso. This is the load-bearing + object and the sole consumer of `cop`. + + `[∀ i, NeZero (m i)]` is REQUIRED, not cosmetic: without it some `m i = 0` + gives `ZMod 0 = ℤ` (infinite), so the cardinalities of the two sides do not + match and the Equiv is FALSE. (Caught in adversarial review.) Downstream this + is discharged from `2t+1 ≤ m i ⇒ 1 ≤ m i ⇒ NeZero (m i)`. -/ +def crtEquiv {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] + (cop : ∀ i j, i ≠ j → Nat.Coprime (m i) (m j)) : + ZMod (∏ i, m i) ≃ (∀ i, ZMod (m i)) := + sorry + +/-! ## Position-space count (the real theorem) -/ + +/-- Positions `x ∈ ZMod M` (M = ∏ m i) within band `t` of a seam on EVERY grid + number exactly `∏ (2t+1)`. Transports `blindspot_card_product` across + `crtEquiv`; coprimality is essential here (it is what makes `crtEquiv` a + bijection). Phase offsets translate each per-grid set without changing its + cardinality, so the count is phase-independent. -/ +theorem blindspot_card_positions + {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] + (cop : ∀ i j, i ≠ j → Nat.Coprime (m i) (m j)) + (t : ℕ) (hband : ∀ i, 2 * t + 1 ≤ m i) : + (Finset.univ.filter + (fun x : ZMod (∏ i, m i) => + ∀ i, (crtEquiv m cop x) i ∈ admissible (m i) t)).card + = ∏ i, (2 * t + 1) := by + have hcongr : + (Finset.univ.filter + (fun x : ZMod (∏ i, m i) => ∀ i, (crtEquiv m cop x) i ∈ admissible (m i) t)).card + = (Finset.univ.filter + (fun r : ∀ i, ZMod (m i) => ∀ i, r i ∈ admissible (m i) t)).card := by + -- Pushforward of the filter along the bijection crtEquiv. The position-space + -- predicate is literally (product-space predicate) ∘ crtEquiv, so transport + -- by `Fintype.card_congr (Equiv.subtypeEquiv (crtEquiv m cop) (fun _ => Iff.rfl))` + -- over the corresponding subtypes (or `Finset.card_nbij'` with crtEquiv / + -- crtEquiv.symm). NOTE: `Equiv.card_filter_map` does NOT exist (review). + sorry + rw [hcongr]; exact blindspot_card_product m t hband + +/-- Oracle Check 2: exactly one all-grids coincidence per period (`t = 0`). + `NeZero (m i)` follows from `1 ≤ m i`; we take it as an instance argument + consistent with `crtEquiv`'s requirement. -/ +theorem unique_coincidence_per_period + {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] + (cop : ∀ i j, i ≠ j → Nat.Coprime (m i) (m j)) + (hpos : ∀ i, 1 ≤ m i) : + (Finset.univ.filter + (fun x : ZMod (∏ i, m i) => + ∀ i, (crtEquiv m cop x) i ∈ admissible (m i) 0)).card = 1 := by + simpa using blindspot_card_positions m cop 0 (by simpa using hpos) + +/-! ### Proof debt, ranked (post-review) + 1. `crtEquiv` — indexed CRT bijection; fold `ZMod.chineseRemainder` over + `Fin L`. THE crux, sole consumer of `cop`. Now carries the REQUIRED + `[∀ i, NeZero (m i)]` (without it `ZMod 0 = ℤ` makes the Equiv false). + 2. `card_admissible` — window injectivity (routine `ZMod.val`). Statement + confirmed correct by review (exact `2t+1` under `2t+1 ≤ m`). + 3. `hcongr` — cardinality transport via `Equiv.subtypeEquiv` + `Fintype.card_congr` + (NOT `Equiv.card_filter_map`, which does not exist). + 4. [VERIFY] names: `Fintype.mem_piFinset`, `Fintype.card_piFinset` (both resolve + per review). `Finset.Icc` is over ℕ then cast — no type error. + Product half `blindspot_card_product` is the ONLY structurally-complete piece + (modulo 4); the intended theorem (B) is entirely deferred to crux (1). Do not + represent this as "most content proved." -/ + +end OrdvecFormalization diff --git a/benchmarks/rigidity_impossibility_proofs.md b/benchmarks/rigidity_impossibility_proofs.md new file mode 100644 index 00000000..2ebc0406 --- /dev/null +++ b/benchmarks/rigidity_impossibility_proofs.md @@ -0,0 +1,121 @@ +# Why oblivious random-projection keys cannot be rigid — proofs, not corpora + +The zoo gives evidence; these give the reason. Three claims, stated and proved. +"Rigid" = number variance Σ²(L) grows sub-linearly (o(L)); "Poisson-or-worse" += Σ²(L) = Ω(L). The routing key is K_r(x) = r·x / ||x|| for a probe direction r +(cosine geometry: the key is a function of the RAY, magnitude-free). + +------------------------------------------------------------------------ +## Lemma 0 (magnitude is quotiented out) + +For any x ≠ 0 and scalar c > 0, K_r(cx) = K_r(x). So any structure a corpus +encodes purely in vector NORMS is invisible to the key. In particular the +`projected-rigid` construction x_i = t_i·u collapses under normalization to +sign(t_i)·u ∈ {+u, −u}: 50k rigid magnitudes become 2 distinct keys. Measured +Σ²/L = 1936 (degenerate), exactly as the quotient predicts. ∎ + +Consequence: rigidity must live in the ANGULAR distribution on the sphere +S^{d-1}, or it does not exist for the key. + +------------------------------------------------------------------------ +## Theorem 1 (isotropy ⇒ exact Poisson key) + +Let x be distributed so that x/||x|| is uniform on S^{d-1} (isotropic). For a +fixed unit r, the projected key K_r = r·(x/||x||) has the known marginal with +density ∝ (1−s²)^((d-3)/2) on [−1,1]. Drawing n i.i.d. corpus points gives n +i.i.d. key values from a fixed continuous law F. After unfolding by F, the +points are a homogeneous Poisson process of rate 1, for which Σ²(L) = L exactly. + +Proof: i.i.d. samples under their own CDF are uniform on [0,1]; counts in +disjoint windows have Binomial counts → Σ²(L) = L(1−L/n) (≈ L for L ≪ n). ∎ +(Matches isotropic zoo row: Σ²/L = 0.99, flat.) + +------------------------------------------------------------------------ +## Theorem 2 (i.i.d. corpus ⇒ Σ²(L) = L(1−L/n), NO rigidity, for ANY distribution) + +This is the special case Var_θ(p) = 0 of Theorem 3 (i.i.d. = degenerate mixing +measure). Let corpus points be i.i.d. from ANY distribution D on R^d (isotropic +or not, clustered, manifold, heavy-tailed). Fix r. The keys K_r(x_1..x_n) are +i.i.d. from the induced 1-D law F_r. Unfolding by F_r (probability integral +transform) gives n i.i.d. Uniform[0,1] points. + +Proof. n i.i.d. uniforms form a BINOMIAL point process (NOT Poisson — review +correction). A window of width L/n has count N ~ Binomial(n, L/n), so + Σ²(L) = Var(N) = n·(L/n)(1 − L/n) = L(1 − L/n). +Thus Σ²(L) = L(1−L/n) = Θ(L): sub-LINEAR rigidity (Σ² = o(L)) is impossible. ∎ + +NOTE (corrected): the earlier "homogeneous Poisson, Σ²=L exactly, independent +window counts, Var=mean=L" was WRONG — that is the Poisson framing this work +exists to debunk. For fixed n the process is binomial; counts in disjoint +windows are NEGATIVELY correlated (they sum to n), and Σ²(L)=L(1−L/n) < L. The +CONCLUSION (no sub-linear rigidity) is unaffected because the value is Θ(L). + +This is why clustered/anisotropic/rogue/manifold are not sub-linear: by Thm 3 +the mixing variance only ADDS (Σ² ≥ L(1−L/n), and clustering pushes above L). +Genuine rigidity (Σ² = o(L)) needs a negatively-associated, FINITELY-exchangeable +generator (lattice / determinantal point process) — see Thm 3's escape hatch — +which a fixed data distribution does not produce. + +------------------------------------------------------------------------ +## Theorem 3 (no rigidity for any fixed-distribution corpus — full derivation) + +Setup. Corpus rows exchangeable ⇒ keys exchangeable ⇒ unfolded positions +U_1..U_n exchangeable on [0,1]. Window W of length L/n, so E[count]=L. Let +I_j = 1{U_j ∈ W}, N = Σ I_j, p = L/n. By exchangeability every I_j has mean p +and every pair shares one covariance c = Cov(I_1,I_2): + + Var(N) = n·p(1−p) + n(n−1)·c. (★) + +de Finetti (infinite exchangeable ⇒ conditionally i.i.d. given latent θ): +I_j are independent Bernoulli(p(θ)) given θ, so + c = Cov(I_1,I_2) = E[p(θ)²] − p̄² = Var_θ(p(θ)) ≥ 0, p̄ = E[p(θ)]. +Substituting into (★): + + Σ²(L) = Var(N) = n·p̄(1−p̄) + n(n−1)·Var_θ(p(θ)) + ≥ n·p̄(1−p̄) = L(1 − L/n). ∎ + +So Σ²(L) ≥ L(1−L/n): rigidity (sub-Poisson, Σ²=o(L)) is IMPOSSIBLE because the +mixing-variance term Var_θ(p(θ)) is non-negative and can only RAISE the count +variance above the Poisson line. Clustering = large Var_θ(p(θ)) = strict excess. + +THE PRECISE ESCAPE HATCH (named, not gestured at). de Finetti requires INFINITE +exchangeability. Finitely-exchangeable sequences (sampling WITHOUT replacement +from a fixed finite set — i.e. a lattice / determinantal point process) violate +the c ≥ 0 step: there c < 0 and Σ²(L) < L (genuinely rigid). But "drawn from a +fixed data distribution" IS conditional-i.i.d.-given-θ — the infinite-exchangeable +case — so any corpus an encoder produces from a data distribution falls under +the theorem. Rigidity requires a negatively-associated generator (lattice/DPP) +AND preserved row order; an oblivious router over a fixed distribution has +neither. ∎ + +------------------------------------------------------------------------ +## What this proves about the conjecture (NARROWED post-review) + +The original hope was that prime/spectral structure could exploit *rigidity* in +the routing key. Theorems 1–3 show the key has Σ²(L) ≥ L(1−L/n) for any +conditionally-i.i.d. (mixture-of-i.i.d.) corpus — equality at isotropy, strict +excess under clustering. So: + + **The routing key is not number-variance-rigid.** Sub-linear Σ²(L) requires a + finitely-exchangeable / negatively-associated generator (lattice/DPP) AND + preserved row order — neither holds for an oblivious router over a fixed + distribution. + +That is the defensible claim, and it is all the mathematics supports. + +RETRACTED (non-sequitur, flagged in adversarial review). The earlier conclusion +— "quantile bucketing is optimal against the entire achievable class; no +spectral/prime partition can beat it because the structure is provably not +there" — DOES NOT FOLLOW from Σ²(L) ≥ L. Number variance and partition +quality (load balance / recall) are different figures of merit; Σ² ≥ L says +nothing about whether some partition outperforms quantile bucketing. A Poisson +key can still admit a useful partition. The impossibility result refutes only +the specific "exploit rigidity" route, NOT all spectral/prime partitions, and +NOT in favor of quantile-optimality-over-all-partitions. Establishing the latter +would need a separate argument (that partition quality is monotone in Σ²), which +is not provided and is likely false. + +(The zoo runs in corpus_zoo_results.md are EVIDENCE for the narrow claim, with +the unfolding caveat in ADVERSARIAL_REVIEW.md. The selftest Σ²=0 only shows the +estimator can represent the o(L) regime; Thm 3 says a fixed-distribution corpus +cannot reach it.) diff --git a/benchmarks/shard_recall_results.md b/benchmarks/shard_recall_results.md new file mode 100644 index 00000000..4bf7440b --- /dev/null +++ b/benchmarks/shard_recall_results.md @@ -0,0 +1,64 @@ +> ⚠️ EXPLORATORY — UNCONTROLLED COMPARISON (adversarial review). The RandomOffset +> arm draws an extra RNG value per projection, desyncing the stream so arms get +> DIFFERENT projection directions — it was never the clean phase-only ablation +> claimed, so the "random offsets are redundant" tie is WITHDRAWN. The "fair +> envelope" is also not fair (coprime arms saturate below high budgets). What +> survives: the retreat — "coprimality across R directions is the wrong geometry; +> needs a within-axis harness." See benchmarks/ADVERSARIAL_REVIEW.md. + +# R-projection shard-recall: does coprime seam-decorrelation help? + +Experiment for the training-free routing layer. Five projection arms over a +random-projection ensemble; recall@10 measured against FP32 cosine top-10, +compared at EQUAL candidates-scanned (the only fair axis). + +Source: `examples/shard_recall.rs`. Synthetic clustered corpus n=50k, dim=256, +200 queries, k=10, seed=1. + +## Fair envelope (max recall@10 at candidates-scanned <= budget) + +| budget | coprime | aligned | random-offset | both | +|--------|---------|---------|---------------|------| +| 1000 | 0.120 | 0.143 | 0.088 | 0.110 | +| 2000 | 0.214 | 0.239 | 0.218 | 0.207 | +| 4000 | 0.370 | 0.402 | 0.375 | 0.390 | +| 8000 | 0.518 | 0.674 | 0.651 | 0.554 | +| 16000 | 0.518 | 0.895 | 0.883 | 0.554 | + +## Findings + +**1. CLEAN RESULT (matched granularity): random offsets are redundant.** +`aligned` and `random-offset` use identical bucket width and differ only in +phase — they isolate seam decorrelation exactly. They are statistically TIED +(0.895 vs 0.883 at 16k; aligned marginally ahead). Across R *different* random +projection directions, the direction randomness already decorrelates seams; +adding random phase buys nothing. Confirms the literature prediction +(multi-probe LSH / random-rotation decorrelation). + +**2. CONFOUND (do NOT read as "coprime is worse"):** the coprime/both arms +subdivide each projection by a distinct prime (W/2..W/53), so their grids are +mostly tiny — they scan few candidates and plateau at ~0.52 due to collapsed +bucket VOLUME, not seam structure. This is a parameterization flaw. + +**3. GEOMETRY MISMATCH (the real lesson):** coprime PERIODS are a +within-single-axis stacked-grid (vernier) effect. Across R different random +directions there is only ONE grid per axis, so coprimality of periods across +directions is not even the right test. The vernier idea, to be tested properly, +needs multiple coprime-period grids overlaid on ONE projection direction — a +different harness (TODO). + +## Verdict + +For the multi-direction routing layer: **plain R random projections with a +single shared grid width are as good as any seam-decorrelation scheme tried +here.** Coprimality adds nothing in this geometry and is only potentially +meaningful in the within-axis stacked-grid setup, which this experiment does +not test. Recommendation: build the oblivious router with R shared-width +random-projection grids; do not invest in coprime periods unless the within-axis +vernier experiment shows a surprise. + +Next: (a) within-axis vernier test (coprime grids on one direction); +(b) re-run on real embeddings via --corpus-npy; +(c) add a quantile-k-means control arm to size the train-free vs trained gap. + +Reproduce: `cargo run --release --example shard_recall` diff --git a/benchmarks/spectral_probe_results.md b/benchmarks/spectral_probe_results.md new file mode 100644 index 00000000..eb0012ac --- /dev/null +++ b/benchmarks/spectral_probe_results.md @@ -0,0 +1,52 @@ +> ⚠️ EXPLORATORY — KNOWN CONFOUND (adversarial review). The fixed Gaussian unfold +> only fits the isotropic marginal; non-isotropic corpora unfolded by the wrong +> density can MANUFACTURE apparent clustering. The "super-Poisson" reading is NOT +> a settled property of embedding keys. Salvage path (unfold by each corpus's +> empirical smooth marginal) and full critique in benchmarks/ADVERSARIAL_REVIEW.md. +> The theory (rigidity_impossibility_proofs.md Thm 3) is the load-bearing result; +> this probe is illustrative only. + +# Number-variance probe: is a 1-D routing key rigid or Poisson? + +Experiment for the "prime mile-marker / spectral index" conjecture. The +conjecture needs the routing key to exhibit **rigidity** (sub-linear number +variance, Σ²(L) ~ log L, like GUE eigenvalue / zeta-zero spectra) for any +spectral/prime-gap partition structure to beat plain quantile bucketing. + +Source: `examples/spectral_probe.rs`. Number variance Σ²(L) = variance of the +count of keys in a window of length L, after unfolding to unit mean density. +Poisson ⇒ Σ²(L) = L (ratio 1, flat). Rigid ⇒ Σ²(L) ~ log L (ratio → 0). + +## Results (synthetic, n=200k, dim=256, 8 random-projection keys, seed=1) + +| L | clustered Σ²/L | isotropic Σ²/L (control) | quantile-unfold Σ² | +|---|---|---|---| +| 2 | 1.28 | 1.00 | 0.0000 | +| 8 | 3.05 | 1.00 | 0.0000 | +| 32 | 3.26 | 1.00 | 0.0000 | +| 128 | 5.94 | 1.01 | 0.0000 | +| 512 | 17.93 | 1.01 | 0.0000 | + +## Verdict + +1. **Isotropic control = exact Poisson (ratio 1.00, flat).** Validates the + probe: it reports "no structure" when there is none. The clustered signal is + therefore real, not an unfold artifact. +2. **Clustered key is SUPER-Poissonian (ratio climbs 1.3→18, Σ² ≈ 0.037·L²).** + Clustering, the *opposite* of the rigidity the conjecture requires. There is + no sub-linear (log-L) regime for spectral/prime structure to grip. +3. **Quantile unfold = Σ² 0.0000 exactly.** Quantile (inverse-CDF) tiling + balances the key perfectly by construction — empirical confirmation that it + strictly dominates any fixed-density tiling (including Li(x)) on occupancy. + +Conclusion: on this corpus the scalar routing key is Poisson-to-clustered, never +rigid. Quantile bucketing is the whole story for occupancy balance. The opening +the conjecture needed (rigidity) is absent. Re-run on real embeddings with +`--corpus-npy` to confirm; prior is strongly that real keys are also non-rigid. + +Reproduce: +``` +cargo run --release --example spectral_probe # clustered +cargo run --release --example spectral_probe -- --isotropic # Poisson control +cargo run --release --example spectral_probe -- --unfold-empirical # quantile +``` diff --git a/benchmarks/twonn_id_results.md b/benchmarks/twonn_id_results.md new file mode 100644 index 00000000..1e7b3428 --- /dev/null +++ b/benchmarks/twonn_id_results.md @@ -0,0 +1,61 @@ +> ⚠️ PARTIAL (adversarial review). The chord-vs-cosine METRIC FIX is correct and +> sphere-validated — that stands. BUT the estimator uses OLS-through-origin on the +> linearized CDF (F=(i+1)/N), the biased non-MLE variant; some of the "finite- +> sample deflation" attributed to sampling is actually estimator bias. Treat the +> low-tens ID as indicative, and prefer the MLE TwoNN estimator before quoting +> exact values. See benchmarks/ADVERSARIAL_REVIEW.md. + +# TwoNN intrinsic dimension probe + +TwoNN estimator (Facco et al. 2017) to measure the intrinsic dimension of an +embedding corpus directly — closing the citation gap (no published TwoNN figure +for sentence-transformers) with our own measurement. + +Source: `examples/twonn_id.rs`. μ = r2/r1 (2nd/1st NN distance ratio), Pareto +fit d = Σ(log μ · -log(1-F)) / Σ(log μ)², top-10% tail trimmed. Anchors sampled +(≤3000), each searched exactly against the full corpus. + +## CRITICAL: metric must be locally LINEAR in distance + +First implementation used cosine distance (1 - cos θ ≈ θ²/2), a *squared* +distance — this HALVED every estimate (μ squared → d halved through log μ). +Fixed to chord/Euclidean distance between unit vectors: sqrt(2 - 2cos) ∝ +sin(θ/2), locally linear in the angle. Anyone adapting this for cosine spaces +must use the chord metric, not cosine distance. + +## Validation (isotropic sphere, true ID = ambient - 1) + +| ambient dim | true ID | TwoNN (chord) | TwoNN (cosine bug) | +|-------------|---------|---------------|--------------------| +| 3 | 2 | 1.99 | 0.99 | +| 5 | 4 | 4.19 | 2.09 | +| 8 | 7 | 6.99 | 3.49 | +| 12 | 11 | 10.41 | 5.21 | +| 20 | 19 | 16.89 | — | +| 50 | 49 | 35.18 | — | + +Verdict: **calibrated and trustworthy through ID ~12** (dense sampling at +n=20k). Above that, finite-sample deflation sets in (ID=20→16.9, ID=50→35) — +the curse of dimensionality floor on ALL NN-based ID estimators: ~exp(d) points +needed to populate a d-dim neighborhood. Reported IDs in the tens are therefore +LOWER BOUNDS; true ID may be higher. (This also implies Ansuini's verified +12-25 last-layer figures are themselves likely deflated — consistent direction.) + +## Measurements + +| corpus | n | ambient | TwoNN ID | +|--------|---|---------|----------| +| synthetic clustered (latent_dim=64) | 20k | 256 | 27.5 | +| isotropic dim=256 | 10k | 256 | (sampling-limited) | + +The synthetic clustered corpus (built from a 64-D latent manifold) reads ID≈27 +— deflated from 64 by finite-sample + the within-cluster local geometry being +lower-dim than the global latent space. + +## For real embeddings + +Run: `cargo run --release --example twonn_id -- --corpus-npy your_embeddings.npy` +Loads 2-D = c" floor is FALSE there. +//! +//! Run: cargo run --release --example crt_seam_oracle + +fn gcd(mut a: u64, mut b: u64) -> u64 { + while b != 0 { + let t = b; + b = a % b; + a = t; + } + a +} +fn lcm(a: u64, b: u64) -> u64 { + a / gcd(a, b) * b +} + +/// Is x a seam of a grid with period m and phase phi (integer phase)? +/// Seam <=> (x - phi) ≡ 0 (mod m). +fn is_seam(x: i64, m: u64, phi: i64) -> bool { + let m = m as i64; + (((x - phi) % m) + m) % m == 0 +} + +/// Distance (in integer steps, on the ring) from x to the nearest seam of +/// grid (m, phi): min over k of |x - phi - k*m|, i.e. the centered residue. +fn dist_to_seam(x: i64, m: u64, phi: i64) -> u64 { + let m = m as i64; + let r = (((x - phi) % m) + m) % m; // in [0, m) + r.min(m - r) as u64 +} + +fn main() { + println!("# CRT seam oracle — exhaustive finite verification"); + + // ---- Check 1: pairwise coincidence spacing ---- + println!("\n## Check 1: first common seam of two grids (phi=0)"); + println!("m_i\tm_j\tcoprime\tfirst_common\tlcm(=expected)"); + for (mi, mj) in [(3u64, 5u64), (4, 6), (5, 7), (6, 9), (8, 12)] { + let mut first = None; + for x in 1..=(mi * mj) as i64 { + if is_seam(x, mi, 0) && is_seam(x, mj, 0) { + first = Some(x); + break; + } + } + let cop = gcd(mi, mj) == 1; + println!( + "{mi}\t{mj}\t{cop}\t{}\t{}", + first.unwrap(), + lcm(mi, mj) + ); + } + + // remaining checks appended in next pieces + checks_2_3_4(); +} + +fn checks_2_3_4() { + // L coprime grids on one axis. + let periods: [u64; 4] = [3, 5, 7, 11]; + let m: u64 = periods.iter().product(); // M = 1155 + let l = periods.len(); + + // ---- Check 2: all-L coincidence count over one period, phi=0 ---- + // Claim: exactly ONE position in [0, M) is a seam of all L grids. + let mut all_seam_zero = 0u64; + for x in 0..m as i64 { + if periods.iter().all(|&p| is_seam(x, p, 0)) { + all_seam_zero += 1; + } + } + println!("\n## Check 2: all-{l} coincidences in [0,M), phi=0 (M={m})"); + println!("count = {all_seam_zero} (claim: exactly 1, at x=0)"); + + // ---- Check 3: near-seam blind-spot density vs (2t)^L / M ---- + // A position is in the joint blind spot if it is within band |.|<=t of a + // seam on ALL L grids. Predicted density = product over grids of + // (per-grid near-seam density) = ((2t+1)/m_i) ... but the clean continuous + // form is (2t)^L / M. We report measured vs both the integer band count + // ((2t+1) residues per grid) and the continuous (2t)^L/M headline. + println!("\n## Check 3: joint near-seam density vs prediction (phi=0)"); + println!("t\tmeasured\t(2t+1)^L/M_int\t(2t)^L/M_cont"); + for t in [0u64, 1, 2, 3] { + let mut hits = 0u64; + for x in 0..m as i64 { + if periods.iter().all(|&p| dist_to_seam(x, p, 0) <= t) { + hits += 1; + } + } + let measured = hits as f64 / m as f64; + // integer band: each grid admits (2t+1) residues (centered), but capped + // at m_i; product of per-grid fractions = prod((2t+1)/m_i). + let int_pred: f64 = periods + .iter() + .map(|&p| ((2 * t + 1).min(p) as f64) / p as f64) + .product(); + let cont_pred = (2 * t).pow(l as u32) as f64 / m as f64; + println!("{t}\t{measured:.6}\t{int_pred:.6}\t{cont_pred:.6}"); + } + + // ---- Check 4: the phase refutation ---- + // With phi=0, x=0 is a seam on every grid -> max-over-grids dist = 0, so a + // pointwise floor "some grid keeps you >= c from its seam" is FALSE there. + // With chosen distinct phases, the all-seam coincidence can be MOVED but, + // by CRT, still occurs exactly once per period (we just relocate it). + let phis_zero = [0i64; 4]; + let worst_zero = (0..m as i64) + .map(|x| { + periods + .iter() + .zip(phis_zero.iter()) + .map(|(&p, &ph)| dist_to_seam(x, p, ph)) + .max() + .unwrap() + }) + .min() + .unwrap(); + // staggered phases: spread the seams + let phis_stag = [0i64, 1, 2, 3]; + let worst_stag = (0..m as i64) + .map(|x| { + periods + .iter() + .zip(phis_stag.iter()) + .map(|(&p, &ph)| dist_to_seam(x, p, ph)) + .max() + .unwrap() + }) + .min() + .unwrap(); + println!("\n## Check 4: pointwise max-over-grids dist-to-seam (the floor)"); + println!("min_x max_i dist (phi=0) = {worst_zero} (=> 0: floor is FALSE at origin)"); + println!("min_x max_i dist (phi staggered) = {worst_stag} (phases can lift the floor)"); + + // verdict line for downstream + println!("\n## Verdict"); + println!( + "coincidence_count_per_period={all_seam_zero} (=1 ✓); \ + floor_phi0={worst_zero} (=0, naive claim false ✓); \ + floor_staggered={worst_stag} (phase-dependent)" + ); +} diff --git a/examples/gen_corpus.rs b/examples/gen_corpus.rs new file mode 100644 index 00000000..605e2e53 --- /dev/null +++ b/examples/gen_corpus.rs @@ -0,0 +1,275 @@ +//! Corpus zoo generator — writes diverse embedding geometries as .npy so the +//! conjecture probes (spectral_probe, shard_recall, twonn_id, bench_rank) can +//! be verified across distributions, not just the one synthetic clustered +//! Gaussian. Decouples generation from measurement and dogfoods the .npy path. +//! +//! Kinds (each stresses a different failure mode / mimics a real pathology): +//! clustered — low-rank Gaussian clusters (the existing baseline) +//! isotropic — N(0,I) on the sphere (Poisson NULL control) +//! anisotropic — power-law singular spectrum sigma_k ~ k^-alpha (real +//! embedding decay; Gao et al. narrow-cone) +//! rogue — a few dimensions with huge variance (Timkey rogue dims) +//! manifold — nonlinear low-D manifold (sinusoidal) in high-D: the one +//! place angular RIGIDITY could plausibly hide +//! lattice — jittered 1-D lattice projected up: POSITIVE rigidity +//! control. If the number-variance probe never reports +//! sub-Poisson, it can't see rigidity; this proves it can. +//! +//! Run: cargo run --release --example gen_corpus -- --kind anisotropic \ +//! --n 50000 --dim 256 --out corpus_aniso.npy +//! Also emits a matched query file with --queries-out (smaller-noise draws). + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; + +struct Cfg { + kind: String, + n: usize, + n_queries: usize, + dim: usize, + latent: usize, + clusters: usize, + alpha: f32, + rogue: usize, + out: String, + queries_out: Option, + seed: u64, +} + +fn parse() -> Cfg { + let mut c = Cfg { + kind: "clustered".into(), + n: 50_000, + n_queries: 200, + dim: 256, + latent: 64, + clusters: 200, + alpha: 1.0, + rogue: 3, + out: "corpus.npy".into(), + queries_out: None, + seed: 1, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--kind" => c.kind = a.next().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent = a.next().unwrap().parse().unwrap(), + "--clusters" => c.clusters = a.next().unwrap().parse().unwrap(), + "--alpha" => c.alpha = a.next().unwrap().parse().unwrap(), + "--rogue" => c.rogue = a.next().unwrap().parse().unwrap(), + "--out" => c.out = a.next().unwrap(), + "--queries-out" => c.queries_out = Some(a.next().unwrap()), + "--seed" => c.seed = a.next().unwrap().parse().unwrap(), + other => panic!("unknown arg {other}"), + } + } + c +} + +/// Inverse standard-normal CDF (Acklam's rational approximation, |err|<1e-9). +fn inv_norm_cdf(p: f64) -> f64 { + let p = p.clamp(1e-12, 1.0 - 1e-12); + let a = [-3.969683028665376e+01, 2.209460984245205e+02, -2.759285104469687e+02, + 1.383577518672690e+02, -3.066479806614716e+01, 2.506628277459239e+00]; + let b = [-5.447609879822406e+01, 1.615858368580409e+02, -1.556989798598866e+02, + 6.680131188771972e+01, -1.328068155288572e+01]; + let c = [-7.784894002430293e-03, -3.223964580411365e-01, -2.400758277161838e+00, + -2.549732539343734e+00, 4.374664141464968e+00, 2.938163982698783e+00]; + let d = [7.784695709041462e-03, 3.224671290700398e-01, 2.445134137142996e+00, + 3.754408661907416e+00]; + let plow = 0.02425; + if p < plow { + let q = (-2.0 * p.ln()).sqrt(); + (((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + } else if p <= 1.0 - plow { + let q = p - 0.5; + let r = q * q; + (((((a[0] * r + a[1]) * r + a[2]) * r + a[3]) * r + a[4]) * r + a[5]) * q + / (((((b[0] * r + b[1]) * r + b[2]) * r + b[3]) * r + b[4]) * r + 1.0) + } else { + let q = (-2.0 * (1.0 - p).ln()).sqrt(); + -(((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Write a 2-D row-major f32 buffer as a NumPy v1 .npy (().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } +} + +/// Generate `count` L2-normalized rows of dimension cfg.dim for the chosen +/// geometry. `is_query` reduces noise so queries sit near corpus structure. +fn generate(cfg: &Cfg, rng: &mut ChaCha8Rng, count: usize, is_query: bool) -> Vec { + let d = cfg.dim; + let l = cfg.latent; + let noise = if is_query { 0.1 } else { 0.3 }; + // shared random projection A: d x l (reused across rows in this call) + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(rng); + } + // cluster prototypes (latent space) + let mut protos = vec![0.0f32; cfg.clusters * l]; + for x in protos.iter_mut() { + *x = gauss(rng); + } + // power-law per-latent scaling for "anisotropic": sigma_k ~ (k+1)^-alpha + let scale: Vec = (0..l) + .map(|k| ((k + 1) as f32).powf(-cfg.alpha)) + .collect(); + + let mut out = vec![0.0f32; count * d]; + for i in 0..count { + let row = &mut out[i * d..(i + 1) * d]; + match cfg.kind.as_str() { + "isotropic" => { + for x in row.iter_mut() { + *x = gauss(rng); + } + } + "clustered" => { + let c = rng.random_range(0..cfg.clusters); + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = protos[c * l + j] + noise * gauss(rng); + } + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[ii * l + j] * z[j]; + } + row[ii] = acc; + } + } + "anisotropic" => { + // power-law spectrum: latent coords scaled by k^-alpha + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = scale[j] * gauss(rng); + } + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[ii * l + j] * z[j]; + } + row[ii] = acc; + } + } + "rogue" => { + // baseline isotropic + a few dims with ~10x variance + for x in row.iter_mut() { + *x = gauss(rng); + } + for r in 0..cfg.rogue.min(d) { + row[r] = 10.0 * gauss(rng); + } + } + "manifold" => { + // nonlinear low-D manifold: latent t in R^m mapped through + // sinusoids of varying frequency, then projected up. Angular + // structure here is the best chance for rigidity to appear. + let m = l.min(8).max(2); + let mut t = vec![0.0f32; m]; + for tj in t.iter_mut() { + *tj = rng.random_range(0.0..std::f32::consts::TAU); + } + // build a smooth high-D embedding: row[ii] = sum_j sin((j+1) t_j + phase_ii) + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..m { + let phase = a[ii * l + j]; // reuse A as fixed phases + acc += ((j + 1) as f32 * t[j] + phase).sin(); + } + row[ii] = acc + noise * gauss(rng); + } + } + "projected-rigid" => { + // STRONGEST rigidity control: place a quantile-spaced target on + // a FIXED direction u (= col 0 of A, unit-normalized once) and + // make it dominate. x_i = t_i * u (+ tiny ortho jitter). Then + // for ANY probe direction r, r.x_i = t_i * (r.u), a scaled copy + // of the rigid sequence — so the projected key inherits the + // quantile spacing (up to L2 distortion). This tests whether the + // random-projection + normalize pipeline PRESERVES rigidity that + // genuinely lives in the key, closing the selftest gap. + let t = inv_norm_cdf((i as f64 + 0.5) / count as f64) as f32; + for ii in 0..d { + row[ii] = t * a[ii * l] + 0.001 * gauss(rng); + } + } + "lattice" => { + // POSITIVE rigidity control. Each row gets a DISTINCT, evenly + // spaced position pos = i/count in [-1,1] (used once — no + // duplicates). Embedded on a small ARC: a constant base + // direction (col 0 of A) plus a small pos-scaled component along + // an orthogonal direction (col 1). The constant base survives + // L2 normalization (a pure line would collapse to +/- one + // vector); projecting onto the pos-direction recovers the + // evenly-spaced sequence -> SUB-Poisson (rigid) number variance. + // Gaussian-QUANTILE-spaced positions: pos_i = Phi^-1((i+0.5)/count). + // The marginal is then exactly N(0,1), matching the probe's + // Gaussian unfold, so any residual structure is REAL. Evenly + // spaced in probability = maximally regular = rigid. + let pos = inv_norm_cdf((i as f64 + 0.5) / count as f64) as f32; + for ii in 0..d { + let base = a[ii * l]; + let along = a[ii * l + 1]; + row[ii] = base + 0.35 * pos * along + 0.01 * gauss(rng); + } + } + other => panic!("unknown kind {other}"), + } + l2(row); + } + out +} + +fn main() { + let cfg = parse(); + let mut rng = ChaCha8Rng::seed_from_u64(cfg.seed); + let corpus = generate(&cfg, &mut rng, cfg.n, false); + write_npy(&cfg.out, &corpus, cfg.n, cfg.dim); + if let Some(ref qp) = cfg.queries_out { + let q = generate(&cfg, &mut rng, cfg.n_queries, true); + write_npy(qp, &q, cfg.n_queries, cfg.dim); + } +} diff --git a/examples/shard_recall.rs b/examples/shard_recall.rs new file mode 100644 index 00000000..41b2ac9e --- /dev/null +++ b/examples/shard_recall.rs @@ -0,0 +1,432 @@ +//! R-projection shard-recall experiment for the training-free routing layer. +//! +//! Open empirical question from the prime/index conjecture investigation: +//! can a *data-oblivious* router (R random 1-D projections, each bucketed by +//! a uniform grid) capture true neighbours well enough to route, and does +//! making the grid periods pairwise COPRIME beat independent random OFFSETS +//! at decorrelating bucket seams? +//! +//! Method (per agent-5 spec): for each query, the router probes a budget of +//! buckets across the R projections; shard-recall@k = fraction of the FP32 +//! cosine top-k that land in the probed union. We plot recall vs +//! CANDIDATES-SCANNED (not vs probe radius), so arms with different bucket +//! counts are compared at equal work — the only fair axis. +//! +//! Arms: +//! coprime — R grids, pairwise-coprime periods +//! aligned — R grids, identical period (worst case: seams stack) +//! random-offset — R grids, same period, independent random phase +//! both — coprime periods AND random offsets +//! +//! Decision rule: if coprime ~= random-offset within noise, coprimality adds +//! nothing (justified only on build-determinism). If coprime materially beats +//! random-offset at equal candidates-scanned and survives reseeding, that is +//! the publishable surprise. +//! +//! Run (synthetic): cargo run --release --example shard_recall +//! No external data, no BLAS. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + n_queries: usize, + k: usize, + latent_dim: usize, + n_clusters: usize, + r_values: Vec, // projection counts to sweep + isotropic: bool, + corpus_npy: Option, + queries_npy: Option, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 50_000, + n_queries: 200, + k: 10, + latent_dim: 64, + n_clusters: 200, + r_values: vec![1, 2, 4, 8, 16], + isotropic: false, + corpus_npy: None, + queries_npy: None, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--k" => c.k = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--isotropic" => c.isotropic = true, + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--queries-npy" => c.queries_npy = Some(a.next().unwrap()), + other => panic!("unknown arg {other}"), + } + } + c +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays +/// (same format contract as bench_rank's --corpus-npy / --queries-npy). +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + let n = v.len() / dim; + for i in 0..n { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Low-rank clustered corpus + matched queries (same construction as bench_rank). +/// Returns (corpus, queries) as flat row-major dim-strided buffers. +fn make_corpus(cfg: &mut Cfg) -> (Vec, Vec) { + // Real corpus path: both files required (need real queries for honest recall). + if let (Some(cp), Some(qp)) = (cfg.corpus_npy.clone(), cfg.queries_npy.clone()) { + let (mut corpus, n, d) = load_npy_f32(&cp); + let (mut queries, nq, dq) = load_npy_f32(&qp); + assert_eq!(d, dq, "corpus/query dim mismatch"); + l2_normalize_rows(&mut corpus, d); + l2_normalize_rows(&mut queries, dq); + cfg.dim = d; + cfg.n = n; + cfg.n_queries = nq; + eprintln!("# loaded corpus {n}x{d}, queries {nq}x{dq}"); + return (corpus, queries); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let d = cfg.dim; + let l = cfg.latent_dim; + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc; + } + let nrm: f32 = out.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in out.iter_mut() { + *v /= nrm; + } + } + out + }; + let iso = |rng: &mut ChaCha8Rng| -> Vec { + let mut v = vec![0.0f32; d]; + for x in v.iter_mut() { + *x = gauss(rng); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let e = if cfg.isotropic { + iso(&mut rng) + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.3, &mut rng) + }; + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + let mut queries = vec![0.0f32; cfg.n_queries * d]; + for i in 0..cfg.n_queries { + let e = if cfg.isotropic { + iso(&mut rng) + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.1, &mut rng) + }; + queries[i * d..(i + 1) * d].copy_from_slice(&e); + } + (corpus, queries) +} + +/// FP32 brute-force cosine top-k ground truth (corpus is L2-normalized already). +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut scored: Vec<(usize, f32)> = (0..n) + .map(|di| { + let doc = &corpus[di * dim..(di + 1) * dim]; + let dot: f32 = q.iter().zip(doc).map(|(a, b)| a * b).sum(); + (di, dot) + }) + .collect(); + scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + scored.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +/// One projection: a unit random direction + a grid period + a phase offset. +/// A doc's bucket on this projection is floor((r.x - phase) / width). +struct Proj { + dir: Vec, + width: f32, + phase: f32, +} + +impl Proj { + fn bucket(&self, v: &[f32]) -> i64 { + let dot: f32 = self.dir.iter().zip(v).map(|(a, b)| a * b).sum(); + ((dot - self.phase) / self.width).floor() as i64 + } +} + +/// Pairwise-coprime-ish period multipliers (distinct small primes). The grid +/// WIDTH is base_width / period, so more-divided grids have more, finer buckets. +/// Coprime periods => the seam sets (multiples of width) share no interior +/// coincidence until the lcm — the vernier effect. +const COPRIME_PERIODS: [u32; 16] = [ + 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, +]; + +#[derive(Clone, Copy)] +enum Arm { + Coprime = 0, + Aligned = 1, + RandomOffset = 2, + Both = 3, +} + +impl Arm { + fn name(self) -> &'static str { + match self { + Arm::Coprime => "coprime", + Arm::Aligned => "aligned", + Arm::RandomOffset => "random-offset", + Arm::Both => "both", + } + } +} + +/// Build R projections for one arm. `base_width` is calibrated so a single +/// grid has ~sqrt(n) buckets (a reasonable shard granularity); coprime arms +/// subdivide by the prime multiplier. +fn build_projs(cfg: &Cfg, arm: Arm, r: usize, base_width: f32, rng: &mut ChaCha8Rng) -> Vec { + let d = cfg.dim; + (0..r) + .map(|i| { + let mut dir = vec![0.0f32; d]; + for x in dir.iter_mut() { + *x = gauss(rng); + } + let nrm: f32 = dir.iter().map(|t| t * t).sum::().sqrt(); + for x in dir.iter_mut() { + *x /= nrm; + } + let (width, phase) = match arm { + // coprime: finer grid per projection by a distinct prime factor + Arm::Coprime => (base_width / COPRIME_PERIODS[i] as f32, 0.0), + // aligned: identical grid, zero phase — seams stack exactly + Arm::Aligned => (base_width, 0.0), + // random-offset: identical period, independent random phase + Arm::RandomOffset => { + let p: f32 = rng.random_range(0.0..1.0); + (base_width, p * base_width) + } + // both: coprime period AND random phase + Arm::Both => { + let p: f32 = rng.random_range(0.0..1.0); + let w = base_width / COPRIME_PERIODS[i] as f32; + (w, p * w) + } + }; + Proj { dir, width, phase } + }) + .collect() +} + +/// Assign every doc a bucket-key tuple (one i64 per projection), grouped into +/// an index: for each projection, map bucket-id -> list of doc ids. +type BucketIndex = Vec>>; + +fn build_index(projs: &[Proj], corpus: &[f32], dim: usize) -> BucketIndex { + let n = corpus.len() / dim; + projs + .par_iter() + .map(|p| { + let mut m: std::collections::HashMap> = std::collections::HashMap::new(); + for di in 0..n { + let b = p.bucket(&corpus[di * dim..(di + 1) * dim]); + m.entry(b).or_default().push(di as u32); + } + m + }) + .collect() +} + +/// For one query at a given probe radius `rad` (probe buckets b-rad..=b+rad on +/// each projection), return the candidate union and its size (candidates +/// scanned). Recall is measured against `truth_set`. +fn probe_recall( + projs: &[Proj], + index: &BucketIndex, + q: &[f32], + rad: i64, + truth: &[usize], +) -> (usize, f32) { + use std::collections::HashSet; + let mut cand: HashSet = HashSet::new(); + for (pi, p) in projs.iter().enumerate() { + let b = p.bucket(q); + for bb in (b - rad)..=(b + rad) { + if let Some(ids) = index[pi].get(&bb) { + cand.extend(ids.iter().copied()); + } + } + } + let truth_set: HashSet = truth.iter().map(|&i| i as u32).collect(); + let hits = truth_set.iter().filter(|i| cand.contains(i)).count(); + let recall = hits as f32 / truth_set.len().max(1) as f32; + (cand.len(), recall) +} + +fn run_arms(cfg: &Cfg, corpus: &[f32], queries: &[f32], truth: &[Vec]) { + // Calibrate base_width so a single unit-direction grid yields ~sqrt(n) + // buckets. r.x ~ N(0, 1/dim) for L2-normalized x, so its spread ~ 6/sqrt(dim) + // across the corpus; divide that range into ~sqrt(n) cells. + let spread = 6.0 / (cfg.dim as f32).sqrt(); + let base_width = spread / (cfg.n as f32).sqrt(); + let arms = [Arm::Coprime, Arm::Aligned, Arm::RandomOffset, Arm::Both]; + // Collect (arm, cand, recall) across all (R, rad) for the fair envelope. + let mut pts: Vec<(&'static str, f64, f64)> = Vec::new(); + println!("# recall vs candidates-scanned (mean over queries), per arm per R"); + println!("arm\tR\trad\tcand_scanned\trecall@k"); + for &r in &cfg.r_values { + for &arm in &arms { + // Aligned/RandomOffset don't subdivide, so they only differ from + // each other by phase; coprime/both subdivide per projection. + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ (r as u64) << 8 ^ arm as u64); + let projs = build_projs(cfg, arm, r, base_width, &mut rng); + let index = build_index(&projs, corpus, cfg.dim); + for rad in [0i64, 1, 2, 4] { + let stats: Vec<(usize, f32)> = (0..cfg.n_queries) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * cfg.dim..(qi + 1) * cfg.dim]; + probe_recall(&projs, &index, q, rad, &truth[qi]) + }) + .collect(); + let mean_cand = + stats.iter().map(|s| s.0 as f64).sum::() / stats.len() as f64; + let mean_rec = + stats.iter().map(|s| s.1 as f64).sum::() / stats.len() as f64; + println!( + "{}\t{r}\t{rad}\t{:.0}\t{:.4}", + arm.name(), + mean_cand, + mean_rec + ); + pts.push((arm.name(), mean_cand, mean_rec)); + } + } + } + // Fair envelope: best recall achievable per arm at or below each candidate + // budget. This is the only honest cross-arm comparison — equal work. + println!("\n# FAIR ENVELOPE: max recall@k at candidates-scanned <= budget"); + let budgets = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]; + print!("budget"); + for a in &arms { + print!("\t{}", a.name()); + } + println!(); + for &bud in &budgets { + print!("{bud:.0}"); + for a in &arms { + let best = pts + .iter() + .filter(|(name, c, _)| *name == a.name() && *c <= bud) + .map(|(_, _, r)| *r) + .fold(0.0f64, f64::max); + print!("\t{best:.4}"); + } + println!(); + } +} + +fn main() { + let mut cfg = parse(); + let (corpus, queries) = make_corpus(&mut cfg); + let truth = ground_truth(&corpus, &queries, cfg.dim, cfg.k); + let src = if cfg.corpus_npy.is_some() { + "npy" + } else if cfg.isotropic { + "isotropic" + } else { + "clustered" + }; + println!( + "# shard_recall: n={} q={} dim={} k={} corpus={src}", + cfg.n, cfg.n_queries, cfg.dim, cfg.k + ); + run_arms(&cfg, &corpus, &queries, &truth); +} diff --git a/examples/spectral_probe.rs b/examples/spectral_probe.rs new file mode 100644 index 00000000..bea2dc79 --- /dev/null +++ b/examples/spectral_probe.rs @@ -0,0 +1,327 @@ +//! Spectral / number-variance probe for the "prime mile-marker" conjecture. +//! +//! Decisive cheap experiment #1: is a 1-D routing key over an embedding +//! corpus *Poisson* (number variance grows like the window length L) or +//! *rigid* (variance grows like log L, as GUE-spectra do)? +//! +//! If Poisson, there is nothing for spectral / prime-gap structure to grip +//! and plain quantile bucketing is the whole story. Rigidity (sub-linear +//! number variance) would be the only empirical opening for the conjecture. +//! +//! Key subtlety handled below: number variance is only meaningful after +//! *unfolding* by the SMOOTH density, not the empirical CDF. Unfolding by +//! rank trivially yields a perfect lattice (Sigma^2 = 0) and measures +//! nothing. A random projection r . x of an L2-normalized vector is +//! approximately N(0, 1/dim) by the CLT, so we unfold with the matching +//! Gaussian CDF Phi and measure the residual fluctuation. +//! +//! Run (synthetic, self-contained): +//! cargo run --release --example spectral_probe +//! Run (real corpus): +//! cargo run --release --example spectral_probe -- --corpus-npy emb.npy +//! +//! No external data, no BLAS. Reuses the same construction as bench_rank. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + latent_dim: usize, + n_clusters: usize, + n_keys: usize, // number of independent random-projection keys to average over + corpus_npy: Option, + isotropic: bool, // control: N(0,I) corpus, no clusters (true Poisson expected) + unfold_empirical: bool, // unfold by empirical CDF (quantile) instead of Gaussian +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 200_000, + latent_dim: 64, + n_clusters: 200, + n_keys: 8, + corpus_npy: None, + isotropic: false, + unfold_empirical: false, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--keys" => c.n_keys = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--isotropic" => c.isotropic = true, + "--unfold-empirical" => c.unfold_empirical = true, + "--rigid-selftest" => {} // handled in main, bypasses corpus + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Phi(z) = 0.5 (1 + erf(z / sqrt 2)). Standard normal CDF. +fn normal_cdf(z: f64) -> f64 { + 0.5 * (1.0 + erf(z / std::f64::consts::SQRT_2)) +} + +fn erf(x: f64) -> f64 { + let t = 1.0 / (1.0 + 0.327_591_1 * x.abs()); + let y = 1.0 + - (((((1.061_405_429 * t - 1.453_152_027) * t) + 1.421_413_741) * t - 0.284_496_736) * t + + 0.254_829_592) + * t + * (-x * x).exp(); + if x >= 0.0 { + y + } else { + -y + } +} + +fn main() { + let cfg = parse(); + // Instrument self-test: feed a perfectly evenly-spaced (unfolded) key + // directly, bypassing corpus + projection. This is a maximally rigid 1-D + // sequence; the estimator MUST report Sigma^2/L << 1 (falling). Proves the + // probe can SEE rigidity when it is present, so a Poisson reading elsewhere + // is a real finding, not instrument blindness. + if std::env::args().any(|a| a == "--rigid-selftest") { + let n = 50_000usize; + let key: Vec = (0..n).map(|i| i as f64).collect(); // perfect lattice + println!("# RIGID SELF-TEST: perfectly even key, expect Sigma^2/L << 1"); + println!("L\tSigma2_mean\tSigma2/L"); + report(&[key], n); + return; + } + let (keys_per_proj, n) = build_keys(&cfg); + println!("# spectral_probe: n={n} keys={}", keys_per_proj.len()); + println!("# unfolded by Gaussian CDF; Poisson => Sigma^2(L) ~ L, rigid => ~log L"); + println!("L\tSigma2_mean\tSigma2/L\tslope_hint"); + report(&keys_per_proj, n); +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays +/// (same format contract as bench_rank's --corpus-npy). +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +/// Project a real corpus into the same per-projection unfolded keys the +/// synthetic path produces. L2-normalizes rows first (cosine geometry). +fn build_keys_from_corpus(cfg: &Cfg, corpus: &[f32], n: usize, d: usize) -> (Vec>, usize) { + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + // unit-normalize rows in a local copy + let mut c = corpus.to_vec(); + for i in 0..n { + let row = &mut c[i * d..(i + 1) * d]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } + let mut out = Vec::with_capacity(cfg.n_keys); + for _ in 0..cfg.n_keys { + let mut r = vec![0.0f32; d]; + for v in r.iter_mut() { + *v = gauss(&mut rng); + } + let rn: f32 = r.iter().map(|v| v * v).sum::().sqrt(); + for v in r.iter_mut() { + *v /= rn; + } + let mut keys: Vec = (0..n) + .map(|i| { + let doc = &c[i * d..(i + 1) * d]; + r.iter().zip(doc).map(|(a, b)| (a * b) as f64).sum() + }) + .collect(); + if cfg.unfold_empirical { + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + for (i, kv) in keys.iter_mut().enumerate() { + *kv = i as f64; + } + } else { + let sigma = (1.0 / d as f64).sqrt(); + for kv in keys.iter_mut() { + *kv = normal_cdf(*kv / sigma) * n as f64; + } + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + } + out.push(keys); + } + (out, n) +} + +/// Returns (one unfolded key-position list per projection, n). +fn build_keys(cfg: &Cfg) -> (Vec>, usize) { + if let Some(ref path) = cfg.corpus_npy { + let (corpus, n, d) = load_npy_f32(path); + eprintln!("# loaded {n} x {d} from {path}"); + return build_keys_from_corpus(cfg, &corpus, n, d); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let d = cfg.dim; + let l = cfg.latent_dim; + // low-rank clustered corpus, same construction as bench_rank + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc; + } + let nrm: f32 = out.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in out.iter_mut() { + *v /= nrm; + } + } + out + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let e = if cfg.isotropic { + // control: draw directly in R^d as N(0,I), normalize. No cluster + // structure, no low-rank latent -> the projection key should be a + // clean Poisson process after Gaussian unfold. + let mut v = vec![0.0f32; d]; + for x in v.iter_mut() { + *x = gauss(&mut rng); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.3, &mut rng) + }; + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + // random projection keys + let mut out = Vec::with_capacity(cfg.n_keys); + for _k in 0..cfg.n_keys { + let mut r = vec![0.0f32; d]; + for v in r.iter_mut() { + *v = gauss(&mut rng); + } + let rn: f32 = r.iter().map(|v| v * v).sum::().sqrt(); + for v in r.iter_mut() { + *v /= rn; + } + // key value for each doc = r . x ; for L2-normalized x and unit r, + // distributed ~ N(0, 1/dim) under the CLT for isotropic x. + let mut keys: Vec = (0..cfg.n) + .map(|i| { + let doc = &corpus[i * d..(i + 1) * d]; + let dot: f32 = r.iter().zip(doc).map(|(a, b)| a * b).sum(); + dot as f64 + }) + .collect(); + if cfg.unfold_empirical { + // quantile unfold: sort, assign rank/N. By construction this maps + // to a perfect lattice -> Sigma^2 -> 0 if the sort is the only + // structure. Confirms quantile tiling balances the key exactly. + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + for (i, kv) in keys.iter_mut().enumerate() { + *kv = i as f64; // already the unfolded lattice position + } + } else { + // unfold by the matching Gaussian CDF -> positions scaled to n. + let sigma = (1.0 / d as f64).sqrt(); + for kv in keys.iter_mut() { + *kv = normal_cdf(*kv / sigma) * cfg.n as f64; + } + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + } + out.push(keys); + } + (out, cfg.n) +} + +/// Number variance Sigma^2(L): variance of the count of points in a window +/// of length L, over many window placements, averaged across projections. +fn report(keys_per_proj: &[Vec], n: usize) { + let ls = [2.0_f64, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]; + let n_windows = 4000usize; + for &lwin in ls.iter() { + if lwin >= n as f64 { + break; + } + let mut acc_var = 0.0f64; + for positions in keys_per_proj { + // place n_windows windows uniformly in [0, n-L], count via binary search + let mut counts = Vec::with_capacity(n_windows); + for w in 0..n_windows { + let start = (w as f64) * ((n as f64 - lwin) / n_windows as f64); + let end = start + lwin; + let lo = positions.partition_point(|&p| p < start); + let hi = positions.partition_point(|&p| p < end); + counts.push((hi - lo) as f64); + } + let mean = counts.iter().sum::() / counts.len() as f64; + let var = counts.iter().map(|c| (c - mean).powi(2)).sum::() + / counts.len() as f64; + acc_var += var; + } + let sigma2 = acc_var / keys_per_proj.len() as f64; + let log_l = lwin.ln(); + println!( + "{lwin:.0}\t{sigma2:.4}\t{:.4}\t(L={lwin:.0}, lnL={log_l:.3})", + sigma2 / lwin + ); + } +} diff --git a/examples/twonn_id.rs b/examples/twonn_id.rs new file mode 100644 index 00000000..32381bb7 --- /dev/null +++ b/examples/twonn_id.rs @@ -0,0 +1,235 @@ +//! TwoNN intrinsic-dimension estimator (Facco et al. 2017) for embedding corpora. +//! +//! For each point, μ = r2/r1 = ratio of 2nd to 1st nearest-neighbour distance. +//! Under locally-uniform density of intrinsic dimension d, μ ~ Pareto(d), so +//! the linear fit log(1 - F(μ_i)) = -d * log(μ_i) through the origin +//! recovers d. No binning, no manifold reconstruction — two distances/point. +//! +//! Metric: embeddings are L2-normalized and retrieval is ANGULAR, so NN +//! distances here are cosine distance (1 - cos). Measuring Euclidean ID on +//! normalized vectors would estimate the ID of the wrong metric. +//! +//! Validation control: the synthetic corpus is a latent_dim-D manifold +//! projected into dim-D, so TwoNN should recover ID ≈ latent_dim. Run the +//! default and check it lands near 64 before trusting it on real .npy data. +//! +//! Run (synthetic control): cargo run --release --example twonn_id +//! Run (real corpus): cargo run --release --example twonn_id -- --corpus-npy emb.npy +//! No external data, no BLAS. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + latent_dim: usize, + n_clusters: usize, + corpus_npy: Option, + isotropic: bool, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 20_000, + latent_dim: 64, + n_clusters: 200, + corpus_npy: None, + isotropic: false, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--isotropic" => c.isotropic = true, + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays. +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +/// L2-normalize each row in place (cosine geometry). +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + let n = v.len() / dim; + for i in 0..n { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +/// Returns (corpus, n, dim). Synthetic = same low-rank clustered construction +/// as bench_rank (latent_dim manifold in dim-space → TwoNN should recover +/// ~latent_dim). Real = whatever the .npy holds, L2-normalized. +fn load_corpus(cfg: &Cfg) -> (Vec, usize, usize) { + if let Some(ref path) = cfg.corpus_npy { + let (mut v, n, dim) = load_npy_f32(path); + l2_normalize_rows(&mut v, dim); + return (v, n, dim); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let (d, l) = (cfg.dim, cfg.latent_dim); + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc; + } + out + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let e = if cfg.isotropic { + let mut v = vec![0.0f32; d]; + for x in v.iter_mut() { + *x = gauss(&mut rng); + } + v + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.3, &mut rng) + }; + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + l2_normalize_rows(&mut corpus, d); + (corpus, cfg.n, d) +} + +/// For sampled anchors, compute μ = r2/r1 over COSINE distance (1 - cos), +/// searching each anchor against the full corpus. Anchors are sampled (the +/// Pareto fit is robust to it) but each search is exact, so r1/r2 are exact. +/// Returns the μ values (>1, with degenerate/duplicate cases dropped). +fn nn_ratios(corpus: &[f32], n: usize, dim: usize) -> Vec { + // sample up to 3000 anchors deterministically (stride sampling, seed-free) + let n_anchors = n.min(3000); + let stride = (n / n_anchors).max(1); + let anchors: Vec = (0..n).step_by(stride).take(n_anchors).collect(); + anchors + .par_iter() + .filter_map(|&ai| { + let a = &corpus[ai * dim..(ai + 1) * dim]; + // cosine distance = 1 - dot (vectors are unit norm). Track the two + // smallest distances to OTHER points. + let (mut d1, mut d2) = (f64::INFINITY, f64::INFINITY); + for bi in 0..n { + if bi == ai { + continue; + } + let b = &corpus[bi * dim..(bi + 1) * dim]; + let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum(); + // Euclidean / chord distance between unit vectors: + // sqrt(2 - 2cos) ∝ sin(θ/2), locally LINEAR in the angle. + // (Cosine distance 1-cos ≈ θ²/2 is a *squared* distance and + // halves the TwoNN estimate — TwoNN needs a locally-linear metric.) + let dist = (2.0 - 2.0 * dot as f64).max(0.0).sqrt(); + if dist < d1 { + d2 = d1; + d1 = dist; + } else if dist < d2 { + d2 = dist; + } + } + // drop degenerate anchors (exact duplicate neighbour => r1≈0) + if d1 > 1e-12 && d2.is_finite() { + let mu = d2 / d1; + if mu > 1.0 + 1e-9 { + return Some(mu); + } + } + None + }) + .collect() +} + +/// TwoNN Pareto fit. Sort μ ascending, empirical CDF F_i = (i+1)/N, then fit +/// d through the origin on y = -log(1 - F) vs x = log μ. Robust variant: +/// discard the top 10% of μ (Facco et al.'s recommended tail trim) and use the +/// closed-form slope d = Σ x·y / Σ x². +fn twonn_fit(mus: &[f64]) -> f64 { + let mut m: Vec = mus.to_vec(); + m.sort_by(|a, b| a.partial_cmp(b).unwrap()); + let n = m.len(); + let keep = ((n as f64) * 0.9) as usize; // trim top 10% tail + let (mut sxx, mut sxy) = (0.0f64, 0.0f64); + for (i, &mu) in m.iter().enumerate().take(keep) { + let f = (i + 1) as f64 / n as f64; // empirical CDF + let x = mu.ln(); + let y = -(1.0 - f).ln(); + sxx += x * x; + sxy += x * y; + } + sxy / sxx +} + +fn main() { + let cfg = parse(); + let (corpus, n, dim) = load_corpus(&cfg); + eprintln!("# twonn_id: n={n} dim={dim} source={}", + if cfg.corpus_npy.is_some() { "npy" } else if cfg.isotropic { "isotropic" } else { "clustered" }); + let mus = nn_ratios(&corpus, n, dim); + let d = twonn_fit(&mus); + println!("TwoNN intrinsic dimension estimate: {d:.2} (ambient dim {dim})"); + if cfg.corpus_npy.is_none() && !cfg.isotropic { + println!("(synthetic control: expect ~{} = latent_dim)", cfg.latent_dim); + } +} From c03e4bd2d620df5cc301c9dbafe2ec2c4ad386a2 Mon Sep 17 00:00:00 2001 From: Todd Baur <523330+toadkicker@users.noreply.github.com> Date: Sat, 13 Jun 2026 11:49:33 -0700 Subject: [PATCH 02/16] Close adversarial-review gaps: fix correctness bugs, withdraw unsupported probe Acting on three internal adversarial reviews: Correctness fixes: - gen_corpus Bug O: corpus and queries now share one latent geometry (A + prototypes from a dedicated seed), so shard_recall ground truth is valid. - shard_recall Bug L: build_projs seeds direction/phase RNGs separately and identically across arms -> aligned vs random-offset share directions, clean ablation. Post-fix: tied (0.9095 vs 0.9080), "random offsets redundant" holds. - CRT density: capped form min(2t+1,m_i)/m_i (uncapped exceeds 1 at t>=2). - Lean crtEquiv: now carries required [forall i, NeZero (m i)] (ZMod 0 = Z would make the Equiv false); hcongr lemma plan corrected. - rigidity proofs: Theorem 2 restated with binomial value L(1-L/n) (not "Poisson, exactly L"); over-broad "optimal over all partitions" claim RETRACTED as a non-sequitur. Withdrawn: - Number-variance "super-Poisson" finding: the salvage (smooth empirical unfold) INVERTED the isotropic/clustered ordering, proving the probe's unfold is uncalibrated. Withdrawn pending an estimator calibrated against a process with known closed-form variance. Theory is unaffected. All findings tiered in benchmarks/ADVERSARIAL_REVIEW.md. Examples build clean. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/ADVERSARIAL_REVIEW.md | 40 ++++++++++------ benchmarks/corpus_zoo_results.md | 12 ++--- benchmarks/shard_recall_results.md | 37 ++++++++------- benchmarks/spectral_probe_results.md | 16 ++++--- examples/gen_corpus.rs | 12 +++-- examples/shard_recall.rs | 32 +++++++------ examples/spectral_probe.rs | 68 ++++++++++++++++++++++++---- 7 files changed, 146 insertions(+), 71 deletions(-) diff --git a/benchmarks/ADVERSARIAL_REVIEW.md b/benchmarks/ADVERSARIAL_REVIEW.md index 83245672..2466931d 100644 --- a/benchmarks/ADVERSARIAL_REVIEW.md +++ b/benchmarks/ADVERSARIAL_REVIEW.md @@ -28,19 +28,33 @@ carries its own critique. Conclusions are tiered by what SURVIVED review. ## DEMOTED TO EXPLORATORY (known confounds — NOT a settled finding) - **"Routing keys are super-Poisson, never rigid"** (`spectral_probe.rs`, - `corpus_zoo_results.md`): the fixed **Gaussian** unfold only fits the isotropic - marginal. Non-isotropic corpora are unfolded by the wrong density, which can - MANUFACTURE apparent clustering. The isotropic control and zoo are therefore - partly circular. To salvage: unfold by each corpus's **empirical** smooth - marginal (monotone spline on sorted keys — NOT the rank-overwrite currently - mislabeled `--unfold-empirical`, which returns Σ²=0 by construction). UNTIL - THEN this is exploratory, not a proven property of embedding keys. -- **"Random offsets redundant / coprime adds nothing"** (`shard_recall.rs`): the - RandomOffset arm draws an extra RNG value per projection, desyncing the stream - so arms get DIFFERENT projection directions — it was never the clean phase-only - ablation claimed. Plus the "fair envelope" is not fair (coprime arms saturate - below high budgets). The *retreat* ("needs a within-axis harness") is sound; - the positive tie claim is withdrawn. + `corpus_zoo_results.md`): **WITHDRAWN — the probe does not measure what it + claims.** Attempted salvage (added `--unfold-smooth K`, a K-knot empirical + unfold) INVERTED the result: under the smooth unfold the isotropic corpus + reads super-Poisson (Σ²/L 1.7→12) and clustered reads LOWER (1.4→6.8) — the + opposite ordering from the Gaussian-unfold version. So the two unfolds + disagree and NEITHER is validated against analytic ground truth. The + Gaussian-unfold "isotropic = clean Poisson 0.99" was not a control passing — + it was the single marginal the wrong unfold happened to fit. The smooth unfold + has its own artifact (knot-scale + interpolation structure). CONCLUSION: the + number-variance empirical finding is UNSUPPORTED in either direction. Fixing it + requires rebuilding the estimator against a case with KNOWN analytic number + variance (e.g. a stationary process with closed-form Σ²(L)) to calibrate the + unfold + window estimator before trusting any corpus reading. This is open + follow-up work, not a patch. The THEORY (rigidity_impossibility_proofs.md) is + unaffected — it does not depend on this probe. +- **"Random offsets redundant / coprime adds nothing"** (`shard_recall.rs`): + **Bug L FIXED.** `build_projs` now seeds direction and phase RNGs separately + and identically across arms, so aligned vs random-offset share the same R + directions and the ablation is clean. Re-run: aligned 0.9095 vs random-offset + 0.9080 (tied) — the "random offsets add nothing" claim now holds on a + controlled comparison. Still-open caveat: the "fair envelope" undersells the + coprime arms (they subdivide buckets and saturate below high budgets), and + coprimality across R directions is the wrong geometry — the within-axis vernier + harness remains unbuilt (theory is in crt_seam_oracle). +- **`gen_corpus` Bug O FIXED.** Corpus and queries now share one geometry + (`A` + prototypes seeded from a dedicated geometry-only RNG keyed by cfg.seed), + so query/corpus latent spaces match and shard_recall ground truth is valid. ## IMPOSSIBILITY PROOFS — repairable, currently overstated diff --git a/benchmarks/corpus_zoo_results.md b/benchmarks/corpus_zoo_results.md index aea46547..c18404ee 100644 --- a/benchmarks/corpus_zoo_results.md +++ b/benchmarks/corpus_zoo_results.md @@ -1,9 +1,9 @@ -> ⚠️ EXPLORATORY — CIRCULAR AS WRITTEN (adversarial review). Every non-isotropic -> geometry is unfolded by a fixed Gaussian CDF it does not match, so the -> "robustness across geometries" partly measures the unfold mismatch, not the -> corpus. Isotropic is the only marginal the unfold fits. NOT a settled finding; -> re-run with per-corpus empirical-marginal unfolding to salvage. See -> benchmarks/ADVERSARIAL_REVIEW.md. +> ⛔ WITHDRAWN. This zoo runs the number-variance probe whose unfold is +> uncalibrated (see spectral_probe_results.md banner and ADVERSARIAL_REVIEW.md). +> The smooth-unfold salvage inverted the isotropic/clustered ordering, so the +> "robust across geometries" claim is unsupported. Retained as record only. The +> gen_corpus.rs zoo GENERATOR itself is fine and reusable; the measurement on top +> is what's withdrawn. # Corpus zoo: number-variance robustness across geometries diff --git a/benchmarks/shard_recall_results.md b/benchmarks/shard_recall_results.md index 4bf7440b..7e3c496d 100644 --- a/benchmarks/shard_recall_results.md +++ b/benchmarks/shard_recall_results.md @@ -1,10 +1,13 @@ -> ⚠️ EXPLORATORY — UNCONTROLLED COMPARISON (adversarial review). The RandomOffset -> arm draws an extra RNG value per projection, desyncing the stream so arms get -> DIFFERENT projection directions — it was never the clean phase-only ablation -> claimed, so the "random offsets are redundant" tie is WITHDRAWN. The "fair -> envelope" is also not fair (coprime arms saturate below high budgets). What -> survives: the retreat — "coprimality across R directions is the wrong geometry; -> needs a within-axis harness." See benchmarks/ADVERSARIAL_REVIEW.md. +> ✅ RNG-DESYNC BUG FIXED (was Bug L). build_projs now seeds direction and phase +> RNGs separately and identically across arms, so aligned vs random-offset share +> the SAME R projection directions and differ ONLY in phase — the clean ablation. +> Re-run result: aligned 0.9095 vs random-offset 0.9080 at 16k budget = TIED, so +> "random phase offsets add nothing across R different directions" now holds on a +> controlled comparison. CAVEAT STILL OPEN: the coprime/both arms subdivide +> buckets so the "fair envelope" undersells them (they saturate below high +> budgets); coprimality across R directions remains the wrong geometry — the +> within-axis vernier harness (crt_seam_oracle covers the theory) is the right +> test and is not built here. Numbers below are the post-fix run. # R-projection shard-recall: does coprime seam-decorrelation help? @@ -17,20 +20,22 @@ Source: `examples/shard_recall.rs`. Synthetic clustered corpus n=50k, dim=256, ## Fair envelope (max recall@10 at candidates-scanned <= budget) +Post-fix run (Bug L fixed: arms share identical projection directions): + | budget | coprime | aligned | random-offset | both | |--------|---------|---------|---------------|------| -| 1000 | 0.120 | 0.143 | 0.088 | 0.110 | -| 2000 | 0.214 | 0.239 | 0.218 | 0.207 | -| 4000 | 0.370 | 0.402 | 0.375 | 0.390 | -| 8000 | 0.518 | 0.674 | 0.651 | 0.554 | -| 16000 | 0.518 | 0.895 | 0.883 | 0.554 | +| 1000 | 0.109 | 0.0885 | 0.0835 | 0.1085 | +| 2000 | 0.1885 | 0.1795 | 0.1840 | 0.1830 | +| 4000 | 0.3370 | 0.3780 | 0.3880 | 0.3290 | +| 8000 | 0.5315 | 0.6425 | 0.6490 | 0.5300 | +| 16000 | 0.5315 | 0.9095 | 0.9080 | 0.5300 | ## Findings -**1. CLEAN RESULT (matched granularity): random offsets are redundant.** -`aligned` and `random-offset` use identical bucket width and differ only in -phase — they isolate seam decorrelation exactly. They are statistically TIED -(0.895 vs 0.883 at 16k; aligned marginally ahead). Across R *different* random +**1. CLEAN RESULT (controlled ablation): random offsets are redundant.** +After the Bug-L fix, `aligned` and `random-offset` share the SAME R projection +directions and identical bucket width — differing ONLY in phase. They are +statistically TIED (0.9095 vs 0.9080 at 16k). Across R *different* random projection directions, the direction randomness already decorrelates seams; adding random phase buys nothing. Confirms the literature prediction (multi-probe LSH / random-rotation decorrelation). diff --git a/benchmarks/spectral_probe_results.md b/benchmarks/spectral_probe_results.md index eb0012ac..7909e9ba 100644 --- a/benchmarks/spectral_probe_results.md +++ b/benchmarks/spectral_probe_results.md @@ -1,10 +1,12 @@ -> ⚠️ EXPLORATORY — KNOWN CONFOUND (adversarial review). The fixed Gaussian unfold -> only fits the isotropic marginal; non-isotropic corpora unfolded by the wrong -> density can MANUFACTURE apparent clustering. The "super-Poisson" reading is NOT -> a settled property of embedding keys. Salvage path (unfold by each corpus's -> empirical smooth marginal) and full critique in benchmarks/ADVERSARIAL_REVIEW.md. -> The theory (rigidity_impossibility_proofs.md Thm 3) is the load-bearing result; -> this probe is illustrative only. +> ⛔ WITHDRAWN — PROBE DOES NOT MEASURE WHAT IT CLAIMS. The salvage attempt +> (`--unfold-smooth K`) INVERTED the result: smooth unfold makes isotropic read +> super-Poisson (Σ²/L 1.7→12) and clustered read LOWER. The two unfolds disagree +> and neither is validated against analytic ground truth; the Gaussian "isotropic +> = 0.99 Poisson" control was just the one marginal that unfold happened to fit. +> The numbers below are RETAINED ONLY as the record of the failure. Fixing needs +> the estimator calibrated against a process with known closed-form Σ²(L). The +> theory (rigidity_impossibility_proofs.md) does NOT depend on this probe. Full +> account: benchmarks/ADVERSARIAL_REVIEW.md. # Number-variance probe: is a 1-D routing key rigid or Poisson? diff --git a/examples/gen_corpus.rs b/examples/gen_corpus.rs index 605e2e53..e12ab09d 100644 --- a/examples/gen_corpus.rs +++ b/examples/gen_corpus.rs @@ -142,15 +142,19 @@ fn generate(cfg: &Cfg, rng: &mut ChaCha8Rng, count: usize, is_query: bool) -> Ve let d = cfg.dim; let l = cfg.latent; let noise = if is_query { 0.1 } else { 0.3 }; - // shared random projection A: d x l (reused across rows in this call) + // Bug O fix: A and prototypes define the SHARED latent geometry and MUST be + // identical for corpus and queries (otherwise queries land in a different + // space and shard_recall ground truth is meaningless). Seed them from a + // dedicated, geometry-only RNG keyed by cfg.seed, NOT the streaming `rng` + // (whose state differs between the corpus and query calls). + let mut geo = ChaCha8Rng::seed_from_u64(cfg.seed ^ 0x9E37_79B9_7F4A_7C15); let mut a = vec![0.0f32; d * l]; for x in a.iter_mut() { - *x = gauss(rng); + *x = gauss(&mut geo); } - // cluster prototypes (latent space) let mut protos = vec![0.0f32; cfg.clusters * l]; for x in protos.iter_mut() { - *x = gauss(rng); + *x = gauss(&mut geo); } // power-law per-latent scaling for "anisotropic": sigma_k ~ (k+1)^-alpha let scale: Vec = (0..l) diff --git a/examples/shard_recall.rs b/examples/shard_recall.rs index 41b2ac9e..75c91b19 100644 --- a/examples/shard_recall.rs +++ b/examples/shard_recall.rs @@ -271,31 +271,34 @@ impl Arm { /// Build R projections for one arm. `base_width` is calibrated so a single /// grid has ~sqrt(n) buckets (a reasonable shard granularity); coprime arms /// subdivide by the prime multiplier. -fn build_projs(cfg: &Cfg, arm: Arm, r: usize, base_width: f32, rng: &mut ChaCha8Rng) -> Vec { +fn build_projs(cfg: &Cfg, arm: Arm, r: usize, base_width: f32) -> Vec { let d = cfg.dim; + // Bug L fix: directions and phases come from SEPARATE, fixed-seed RNGs that + // do not depend on the arm. Every arm therefore sees the IDENTICAL R + // projection directions (and identical phase draws where used), so the arms + // differ ONLY in width/phase policy — the clean ablation. Previously the + // RandomOffset arm consumed an extra value from one shared stream, desyncing + // the directions so arms were compared on different projections. + let mut dir_rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xD132_0000 ^ (r as u64)); + let mut phase_rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x9DA5E_000 ^ (r as u64)); (0..r) .map(|i| { let mut dir = vec![0.0f32; d]; for x in dir.iter_mut() { - *x = gauss(rng); + *x = gauss(&mut dir_rng); } let nrm: f32 = dir.iter().map(|t| t * t).sum::().sqrt(); for x in dir.iter_mut() { *x /= nrm; } + // draw a phase fraction for EVERY arm (keeps phase_rng in lockstep); + // arms that don't use it simply ignore the value. + let p: f32 = phase_rng.random_range(0.0..1.0); let (width, phase) = match arm { - // coprime: finer grid per projection by a distinct prime factor Arm::Coprime => (base_width / COPRIME_PERIODS[i] as f32, 0.0), - // aligned: identical grid, zero phase — seams stack exactly Arm::Aligned => (base_width, 0.0), - // random-offset: identical period, independent random phase - Arm::RandomOffset => { - let p: f32 = rng.random_range(0.0..1.0); - (base_width, p * base_width) - } - // both: coprime period AND random phase + Arm::RandomOffset => (base_width, p * base_width), Arm::Both => { - let p: f32 = rng.random_range(0.0..1.0); let w = base_width / COPRIME_PERIODS[i] as f32; (w, p * w) } @@ -363,10 +366,9 @@ fn run_arms(cfg: &Cfg, corpus: &[f32], queries: &[f32], truth: &[Vec]) { println!("arm\tR\trad\tcand_scanned\trecall@k"); for &r in &cfg.r_values { for &arm in &arms { - // Aligned/RandomOffset don't subdivide, so they only differ from - // each other by phase; coprime/both subdivide per projection. - let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ (r as u64) << 8 ^ arm as u64); - let projs = build_projs(cfg, arm, r, base_width, &mut rng); + // build_projs now seeds its own direction/phase RNGs internally + // (identical across arms for the same r) — Bug L fix. + let projs = build_projs(cfg, arm, r, base_width); let index = build_index(&projs, corpus, cfg.dim); for rad in [0i64, 1, 2, 4] { let stats: Vec<(usize, f32)> = (0..cfg.n_queries) diff --git a/examples/spectral_probe.rs b/examples/spectral_probe.rs index bea2dc79..8fb7a8e6 100644 --- a/examples/spectral_probe.rs +++ b/examples/spectral_probe.rs @@ -35,7 +35,8 @@ struct Cfg { n_keys: usize, // number of independent random-projection keys to average over corpus_npy: Option, isotropic: bool, // control: N(0,I) corpus, no clusters (true Poisson expected) - unfold_empirical: bool, // unfold by empirical CDF (quantile) instead of Gaussian + unfold_empirical: bool, // exact-rank unfold (lattice; demo only, not a rigidity test) + unfold_smooth_knots: usize, // K>0 => smooth empirical unfold with K knots (correct method) } fn parse() -> Cfg { @@ -48,6 +49,7 @@ fn parse() -> Cfg { corpus_npy: None, isotropic: false, unfold_empirical: false, + unfold_smooth_knots: 0, }; let mut a = std::env::args().skip(1); while let Some(x) = a.next() { @@ -60,6 +62,7 @@ fn parse() -> Cfg { "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), "--isotropic" => c.isotropic = true, "--unfold-empirical" => c.unfold_empirical = true, + "--unfold-smooth" => c.unfold_smooth_knots = a.next().unwrap().parse().unwrap(), "--rigid-selftest" => {} // handled in main, bypasses corpus other => panic!("unknown arg {other}"), } @@ -271,23 +274,68 @@ fn build_keys(cfg: &Cfg) -> (Vec>, usize) { dot as f64 }) .collect(); - if cfg.unfold_empirical { - // quantile unfold: sort, assign rank/N. By construction this maps - // to a perfect lattice -> Sigma^2 -> 0 if the sort is the only - // structure. Confirms quantile tiling balances the key exactly. - keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + if cfg.unfold_smooth_knots > 0 { + // SMOOTH EMPIRICAL unfold (correct method; removes the Gaussian- + // mismatch confound). Fit the CDF with K coarse knots placed at + // rank-quantiles, linear between. This subtracts the large-scale + // marginal density of ANY distribution WITHOUT collapsing local + // fluctuation to a lattice (which exact-rank unfolding does). + // Number variance must then be read at window scales L << n/K so + // the knot smoothing does not wash out the signal being measured. + let k = cfg.unfold_smooth_knots.min(cfg.n); + let n = cfg.n; + // knot j (0..=k): rank index kr = j*n/k, key value = keys[kr], maps + // to unfolded position j*n/k. Each key unfolds by linear interp of + // its bracketing knots in VALUE space onto POSITION space. + let knot_rank = |j: usize| -> usize { (j * n / k).min(n - 1) }; + let unfolded: Vec = keys + .iter() + .map(|&x| { + // find bracketing knots by value (knots are sorted in value) + // binary search over knot index + let mut lo = 0usize; + let mut hi = k; + while lo < hi { + let mid = (lo + hi) / 2; + if keys[knot_rank(mid)] < x { + lo = mid + 1; + } else { + hi = mid; + } + } + let j = lo.clamp(1, k); + let (v0, v1) = (keys[knot_rank(j - 1)], keys[knot_rank(j)]); + let (p0, p1) = + (knot_rank(j - 1) as f64, knot_rank(j) as f64); + if (v1 - v0).abs() < 1e-30 { + p0 + } else { + p0 + (x - v0) / (v1 - v0) * (p1 - p0) + } + }) + .collect(); + let mut u = unfolded; + u.sort_by(|a, b| a.partial_cmp(b).unwrap()); + out.push(u); + } else if cfg.unfold_empirical { + // exact-rank unfold: assigns rank i. Trivially a lattice (Sigma^2=0) + // for ANY input — this is NOT a rigidity test, only a demonstration + // that exact quantile tiling balances occupancy. Kept for that use. for (i, kv) in keys.iter_mut().enumerate() { - *kv = i as f64; // already the unfolded lattice position + *kv = i as f64; } + out.push(keys); } else { - // unfold by the matching Gaussian CDF -> positions scaled to n. + // fixed-Gaussian unfold: ONLY valid for the isotropic marginal. + // CONFOUNDED for non-isotropic corpora (see ADVERSARIAL_REVIEW.md); + // prefer --unfold-smooth. let sigma = (1.0 / d as f64).sqrt(); for kv in keys.iter_mut() { *kv = normal_cdf(*kv / sigma) * cfg.n as f64; } - keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + out.push(keys); } - out.push(keys); } (out, cfg.n) } From ebe4000af426585674256a79e53d20b20dc8c82c Mon Sep 17 00:00:00 2001 From: Todd Baur <523330+toadkicker@users.noreply.github.com> Date: Sat, 13 Jun 2026 12:19:19 -0700 Subject: [PATCH 03/16] Characterize ordvec density collapse: mechanism + recoverable permutation signal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reframes the prime/Sacks "dense-region structure" thread into the question it was actually reaching for, on the right substrate (the permutohedron S_D, not the integer line). Findings (examples/density_collapse.rs, FP32 ground truth): - Density collapse is NOT code collision (b=2 codes are length-D permutation- bucket sequences; exact collision ~0 even at noise=0.08). It is NEAR-collision: docs with Hamming-close b=2 codes the 2-bit scorer cannot rank apart. - The lost signal IS recoverable: among b2-lookalikes, FP32-true neighbours have lower top-k Kendall-tau (coordinate order) than far lookalikes — 87-91% of probes, both densities. That order is already in the Rank code, discarded by b=2, so a tau-rerank breaks dense-region ties with no new storage. Conclusion: exploitable dense-region structure lives in S_D (data's own order), not on N (primes/Sacks act on the index, carry no corpus information). Open: confirm on real embeddings; show tau-rerank beats b=4 at matched bytes on R@10. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/density_collapse_results.md | 64 ++++++ examples/density_collapse.rs | 297 +++++++++++++++++++++++++ 2 files changed, 361 insertions(+) create mode 100644 benchmarks/density_collapse_results.md create mode 100644 examples/density_collapse.rs diff --git a/benchmarks/density_collapse_results.md b/benchmarks/density_collapse_results.md new file mode 100644 index 00000000..a2edb1bb --- /dev/null +++ b/benchmarks/density_collapse_results.md @@ -0,0 +1,64 @@ +# Density collapse in ordvec: mechanism and a recoverable signal + +What "density collapse" actually is in RankQuant, and whether the lost signal is +recoverable. Source: `examples/density_collapse.rs`. Ground truth = FP32 cosine, +so this probe cannot miscalibrate (unlike the withdrawn number-variance probe). + +## Mechanism (corrected mid-investigation) + +First model was wrong: I expected b=2 codes to COLLIDE (identical) in dense +regions. They don't — the rank transform is a permutation of 0..D, so a b=2 code +is a length-D sequence of bucket ids; two docs collide only if all D coordinates +bucket identically (astronomically rare on continuous data; measured collision +rate ≈ 0 even at noise=0.08). + +The real mechanism is NEAR-collision: docs whose b=2 codes are HAMMING-CLOSE +(differ in few coordinate-buckets) are what the scorer cannot separate. +- noise=0.30: mean nearest-code Hamming = 59.9 / 256; ~123 docs within ~1.5×. +- noise=0.10: mean nearest-code Hamming = 18.4 / 256; ~58 docs within ~1.5×. +Denser clusters → tighter Hamming neighbourhoods → more docs the b=2 kernel +conflates. THAT is collapse: not identical codes, but codes too close for the +2-bit resolution to rank correctly. + +## The recoverable signal (positive result) + +Within each probe's b=2 "lookalike" set (M=40 Hamming-nearest), split by TRUE +FP32 cosine into the true-neighbour half vs the far-lookalike half. Compare the +top-k (k=16) Kendall-tau distance of the probe's coordinate ORDER to each: + +| density | true-nbr cosine | true-nbr tau | far cosine | far tau | probes tau_true Cfg { + let mut c = Cfg { + dim: 256, + n: 30_000, + latent: 64, + clusters: 200, + noise: 0.3, + bits: 2, + topk: 16, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent = a.next().unwrap().parse().unwrap(), + "--clusters" => c.clusters = a.next().unwrap().parse().unwrap(), + "--noise" => c.noise = a.next().unwrap().parse().unwrap(), + "--bits" => c.bits = a.next().unwrap().parse().unwrap(), + "--topk" => c.topk = a.next().unwrap().parse().unwrap(), + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Cosine of two unit vectors (corpus is L2-normalized). +fn cos(a: &[f32], b: &[f32]) -> f32 { + a.iter().zip(b).map(|(x, y)| x * y).sum() +} + +/// Kendall-tau DISTANCE between the orderings the two vectors induce on a fixed +/// coordinate set `coords` (fraction of discordant pairs, in [0,1]). Uses the +/// raw values to decide order on each coordinate pair. 0 = identical order. +fn kendall_tau(a: &[f32], b: &[f32], coords: &[usize]) -> f32 { + let m = coords.len(); + if m < 2 { + return 0.0; + } + let mut disc = 0usize; + let mut tot = 0usize; + for x in 0..m { + for y in (x + 1)..m { + let (cx, cy) = (coords[x], coords[y]); + let sa = (a[cx] - a[cy]).signum(); + let sb = (b[cx] - b[cy]).signum(); + if sa != sb { + disc += 1; + } + tot += 1; + } + } + disc as f32 / tot.max(1) as f32 +} + +/// Indices of the top-k coordinates of a vector by value (descending). +fn top_coords(v: &[f32], k: usize) -> Vec { + let mut idx: Vec = (0..v.len()).collect(); + idx.sort_by(|&i, &j| v[j].partial_cmp(&v[i]).unwrap()); + idx.truncate(k); + idx +} + +/// Hamming distance between two equal-length bucket codes (# coords that +/// landed in a different bucket). This is what the b=2 SCORER conflates: small +/// Hamming distance = codes the kernel can barely separate = real collapse. +fn hamming(a: &[u8], b: &[u8]) -> u32 { + a.iter().zip(b).filter(|(x, y)| x != y).count() as u32 +} + +/// Report the code-distance structure: exact collisions are ~0 on continuous +/// data (constant-composition codes are length-D sequences), so the real +/// "collapse" is NEAR-collision — how many docs sit within small Hamming +/// distance of a probe. Sampled over probes to stay O(n * samples). +fn collision_report(cfg: &Cfg, codes: &[Vec]) { + let n = cfg.n; + let n_probes = 200.min(n); + let stride = (n / n_probes).max(1); + let probes: Vec = (0..n).step_by(stride).take(n_probes).collect(); + // for each probe, min Hamming distance to any other doc + count within 2*min + let stats: Vec<(u32, usize)> = probes + .par_iter() + .map(|&p| { + let cp = &codes[p]; + let mut min_h = u32::MAX; + for j in 0..n { + if j == p { + continue; + } + let h = hamming(cp, &codes[j]); + if h < min_h { + min_h = h; + } + } + let thresh = (min_h + min_h / 2).max(min_h + 1); + let near = (0..n) + .filter(|&j| j != p && hamming(cp, &codes[j]) <= thresh) + .count(); + (min_h, near) + }) + .collect(); + let mean_min = stats.iter().map(|s| s.0 as f64).sum::() / stats.len() as f64; + let mean_near = stats.iter().map(|s| s.1 as f64).sum::() / stats.len() as f64; + println!("# density_collapse: n={} dim={} bits={} noise={} topk={} (smaller noise=denser)", + cfg.n, cfg.dim, cfg.bits, cfg.noise, cfg.topk); + println!("## Code near-collision structure (b={} bucket code, length {})", cfg.bits, cfg.dim); + println!("mean nearest-code Hamming dist = {mean_min:.1} of {} coords", cfg.dim); + println!("mean #docs within ~1.5x that distance = {mean_near:.1}"); + println!("(small Hamming = codes the b={} scorer can barely separate = collapse)", cfg.bits); +} + +fn main() { + let cfg = parse(); + let corpus = make_corpus(&cfg); + let d = cfg.dim; + + // b-bit bucket code: top bucket id per coordinate. rank r in [0,D) -> + // bucket r / (D / 2^bits). The b=2 "top-bucket membership" code we hash on + // is the multiset of bucket ids (order-free), exactly what RankQuant ties on. + let nbuckets = 1usize << cfg.bits; + let bucket_w = d / nbuckets; + let codes: Vec> = (0..cfg.n) + .into_par_iter() + .map(|i| { + let ranks = rank_transform(&corpus[i * d..(i + 1) * d]); + ranks + .iter() + .map(|&r| (r as usize / bucket_w).min(nbuckets - 1) as u8) + .collect() + }) + .collect(); + + collision_report(&cfg, &codes); + tau_report(&cfg, &corpus, &codes); +} + +/// THE TEST. For each probe, gather the M docs whose b=2 codes are +/// Hamming-CLOSEST (the "lookalikes" the scorer conflates with the probe). +/// Split them into the FP32-true neighbours (top half by cosine) and the +/// FP32-far lookalikes (bottom half). Question: does intra-bucket top-k +/// Kendall-tau SEPARATE these two groups — i.e. do the true neighbours have +/// lower tau than the false lookalikes that the b=2 code can't tell apart? +/// If yes, the fine permutation order recovers exactly the signal b=2 lost, +/// from the existing Rank code, no new storage. FP32 cosine = ground truth. +fn tau_report(cfg: &Cfg, corpus: &[f32], codes: &[Vec]) { + let d = cfg.dim; + let n = cfg.n; + let m = 40usize; // size of the b2-lookalike neighbourhood per probe + let n_probes = 300.min(n); + let stride = (n / n_probes).max(1); + let probes: Vec = (0..n).step_by(stride).take(n_probes).collect(); + + // Per probe: (mean tau of true-neighbour half, mean tau of far-lookalike half, + // mean cosine of each half) — averaged over probes. + let rows: Vec<(f32, f32, f32, f32)> = probes + .par_iter() + .map(|&p| { + let pv = &corpus[p * d..(p + 1) * d]; + let cp = &codes[p]; + // M code-nearest docs by Hamming + let mut by_h: Vec<(u32, usize)> = (0..n) + .filter(|&j| j != p) + .map(|j| (hamming(cp, &codes[j]), j)) + .collect(); + by_h.sort_by_key(|x| x.0); + by_h.truncate(m); + // among those lookalikes, rank by TRUE cosine to the probe + let mut by_cos: Vec<(f32, usize)> = by_h + .iter() + .map(|&(_, j)| (cos(pv, &corpus[j * d..(j + 1) * d]), j)) + .collect(); + by_cos.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap()); + let half = by_cos.len() / 2; + let coords = top_coords(pv, cfg.topk); + let tau_of = |slice: &[(f32, usize)]| -> (f32, f32) { + let mut st = 0.0f32; + let mut sc = 0.0f32; + for &(c, j) in slice { + st += kendall_tau(pv, &corpus[j * d..(j + 1) * d], &coords); + sc += c; + } + let k = slice.len().max(1) as f32; + (st / k, sc / k) + }; + let (tau_near, cos_near) = tau_of(&by_cos[..half]); + let (tau_far, cos_far) = tau_of(&by_cos[half..]); + (tau_near, tau_far, cos_near, cos_far) + }) + .collect(); + + let mean = |f: &dyn Fn(&(f32, f32, f32, f32)) -> f32| { + rows.iter().map(f).sum::() / rows.len().max(1) as f32 + }; + let tau_near = mean(&|r| r.0); + let tau_far = mean(&|r| r.1); + let cos_near = mean(&|r| r.2); + let cos_far = mean(&|r| r.3); + let wins = rows.iter().filter(|r| r.0 < r.1).count(); + println!("\n## Intra-code Kendall-tau test (top-{} coords, M={m} b2-lookalikes/probe, {} probes)", + cfg.topk, rows.len()); + println!("among b2-lookalikes: mean cosine mean top-k tau-distance"); + println!(" FP32-TRUE neighbours {cos_near:.4} {tau_near:.4}"); + println!(" FP32-FAR lookalikes {cos_far:.4} {tau_far:.4}"); + println!("probes where true-neighbour tau < far-lookalike tau: {wins}/{} = {:.3}", + rows.len(), wins as f64 / rows.len().max(1) as f64); + let verdict = if tau_near + 0.005 < tau_far { + "SIGNAL: fine permutation order separates true neighbours from b2-lookalikes -> recoverable, no new storage" + } else if tau_near > tau_far + 0.005 { + "INVERTED: true neighbours have HIGHER tau — order does not help here" + } else { + "NO SIGNAL: intra-code order does not separate true neighbours from b2-lookalikes" + }; + println!("verdict: {verdict}"); +} + +/// Low-rank clustered corpus; `noise` is the density dial. +fn make_corpus(cfg: &Cfg) -> Vec { + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let (d, l) = (cfg.dim, cfg.latent); + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let c = rng.random_range(0..cfg.clusters); + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = protos[c * l + j] + cfg.noise * gauss(&mut rng); + } + let row = &mut corpus[i * d..(i + 1) * d]; + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[ii * l + j] * z[j]; + } + row[ii] = acc; + } + let nrm: f32 = row.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in row.iter_mut() { + *v /= nrm; + } + } + } + corpus +} From 571a20bff6eb726c214665851805edcfbbe59bae Mon Sep 17 00:00:00 2001 From: Todd Baur <523330+toadkicker@users.noreply.github.com> Date: Sat, 13 Jun 2026 12:37:54 -0700 Subject: [PATCH 04/16] Confirm density-collapse tau-signal on REAL embeddings (nomic-embed-text, GPU) Adds examples/embed_ollama.py (GPU embeddings via ollama -> .npy) and a --corpus-npy loader to density_collapse. Runs the intra-code Kendall-tau test on 8665 real repo sentences embedded with nomic-embed-text (768-d, RTX 5080). Real-data results: - TwoNN intrinsic dim of nomic-embed-text ~= 13 (ambient 768), matches the predicted low-tens range. - Real embeddings are far more b=2-entangled than synthetic (~5083/8665 docs in each probe's lookalike shell vs ~60-120 synthetic). - The tau-signal HOLDS and sharpens monotonically with top-k: win rate 0.667 (k=8) -> 0.683 (16) -> 0.847 (32) -> 0.930 (64). At k=64, 93% of probes: FP32-true neighbours have lower intra-code Kendall-tau than the b2-lookalikes the code conflates them with. Confirms on real encoder geometry: the signal b=2 discards (fine permutation order, already in the Rank code) recovers dense-region true neighbours. Open: broader/larger corpus; show a tau-rerank beats b=4 at matched bytes on R@10. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/density_collapse_results.md | 34 +++++++- examples/density_collapse.rs | 52 ++++++++++++- examples/embed_ollama.py | 104 +++++++++++++++++++++++++ 3 files changed, 186 insertions(+), 4 deletions(-) create mode 100644 examples/embed_ollama.py diff --git a/benchmarks/density_collapse_results.md b/benchmarks/density_collapse_results.md index a2edb1bb..3948b56f 100644 --- a/benchmarks/density_collapse_results.md +++ b/benchmarks/density_collapse_results.md @@ -47,10 +47,40 @@ structure is in S_D (the order the encoder induced), NOT on the integer line (primes / Sacks spiral, which act on the index and carry no corpus information — see the conjecture thread). +## REAL EMBEDDINGS (nomic-embed-text via ollama on RTX 5080, 768-d) + +Corpus: 8665 real sentences extracted from this repo's markdown + Rust, embedded +with `nomic-embed-text` (GPU-resident via ollama). Generator: +`examples/embed_ollama.py`. Run: `density_collapse --corpus-npy repo_real.npy`. + +First real-data facts: +- TwoNN intrinsic dimension of nomic-embed-text ≈ 13 (ambient 768) — our own + measurement, squarely in the predicted low-tens range. +- Real embeddings are FAR more entangled at b=2 than synthetic clusters: mean + nearest-code Hamming 314/768, and ~5083 of 8665 docs sit in each probe's + b2-lookalike shell (vs ~60–120 synthetic). Real geometry, much denser collapse. + +The tau-signal HOLDS on real data and SHARPENS monotonically with the coordinate +set size (top-k): + +| top-k | tau true-nbr | tau far | win rate (tau_true < tau_far) | +|-------|--------------|---------|-------------------------------| +| 8 | 0.3892 | 0.4128 | 0.667 | +| 16 | 0.3773 | 0.3968 | 0.683 | +| 32 | 0.3706 | 0.3931 | 0.847 | +| 64 | 0.3682 | 0.3926 | **0.930** | + +At top-k=64, 93% of probes have true neighbours with lower intra-code Kendall-tau +than the b2-lookalikes — on real encoder geometry. The monotonic climb +(0.667→0.930) is the signature of a real effect: sparse top-8 order can't +separate near-parallel real embeddings, but the fuller coordinate order does. +So the synthetic finding survives contact with real embeddings, and the lever +(intra-code permutation order, already in the Rank code) is confirmed. + ## Caveats / open -- Synthetic low-rank clustered corpus; confirm on real embeddings via a corpus - dump (the probe has no .npy loader yet — add one to test real data). +- Real corpus here is repo-domain (md+rust), 8665 docs — narrow domain, modest + size. Confirm on a larger, broader corpus (MS MARCO / Wikipedia passages). - The test shows the signal EXISTS (separation in tau); it does not yet show a tau-rerank improves end-to-end recall vs simply using b=4. The honest next experiment: tau-rerank of b=2 survivors vs b=4 at matched bytes, R@10 vs FP32. diff --git a/examples/density_collapse.rs b/examples/density_collapse.rs index 9a40915f..abce4cfe 100644 --- a/examples/density_collapse.rs +++ b/examples/density_collapse.rs @@ -36,6 +36,7 @@ struct Cfg { noise: f32, // cluster tightness: smaller = denser = more collapse bits: u32, // bucket bits (2 = quartiles) topk: usize, // # top coords used for the intra-bucket tau test + corpus_npy: Option, // real embeddings; overrides synthetic + dim/n } fn parse() -> Cfg { @@ -47,6 +48,7 @@ fn parse() -> Cfg { noise: 0.3, bits: 2, topk: 16, + corpus_npy: None, }; let mut a = std::env::args().skip(1); while let Some(x) = a.next() { @@ -58,6 +60,7 @@ fn parse() -> Cfg { "--noise" => c.noise = a.next().unwrap().parse().unwrap(), "--bits" => c.bits = a.next().unwrap().parse().unwrap(), "--topk" => c.topk = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), other => panic!("unknown arg {other}"), } } @@ -155,9 +158,54 @@ fn collision_report(cfg: &Cfg, codes: &[Vec]) { println!("(small Hamming = codes the b={} scorer can barely separate = collapse)", cfg.bits); } +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays. +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + // L2-normalize rows (cosine geometry) + for i in 0..n { + let row = &mut out[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } + (out, n, dim) +} + fn main() { - let cfg = parse(); - let corpus = make_corpus(&cfg); + let mut cfg = parse(); + let corpus = if let Some(path) = cfg.corpus_npy.clone() { + let (v, n, dim) = load_npy_f32(&path); + eprintln!("# loaded REAL corpus {n} x {dim} from {path}"); + cfg.n = n; + cfg.dim = dim; + v + } else { + make_corpus(&cfg) + }; let d = cfg.dim; // b-bit bucket code: top bucket id per coordinate. rank r in [0,D) -> diff --git a/examples/embed_ollama.py b/examples/embed_ollama.py new file mode 100644 index 00000000..56d1af7e --- /dev/null +++ b/examples/embed_ollama.py @@ -0,0 +1,104 @@ +#!/usr/bin/env python3 +"""Generate real sentence embeddings via ollama (GPU) -> .npy for the probes. + +ollama runs nomic-embed-text (768-d) resident on the GPU. We embed a corpus of +short texts and write a 2-D little-endian float32 C-order .npy matching the +format bench_rank / the conjecture probes read (--corpus-npy). + +Corpus source (in priority order): + 1. --texts FILE : one text per line + 2. built-in : a few thousand templated sentences across many topics, + enough varied geometry for a first real-encoder pass + (no external download, no datasets lib). + +Usage: + python examples/embed_ollama.py --out corpus_real.npy --n 5000 + python examples/embed_ollama.py --texts mylines.txt --out corpus_real.npy +""" +import argparse, json, struct, sys, urllib.request + +OLLAMA = "http://localhost:11434/api/embed" +MODEL = "nomic-embed-text" + + +def embed_batch(texts): + req = urllib.request.Request( + OLLAMA, + data=json.dumps({"model": MODEL, "input": texts}).encode(), + headers={"Content-Type": "application/json"}, + ) + with urllib.request.urlopen(req, timeout=120) as r: + return json.load(r)["embeddings"] + + +def build_builtin_corpus(n): + # Templated sentences spanning many semantic clusters -> real anisotropy, + # real cluster structure, without an external dataset. Deterministic. + subjects = ["the engineer", "a chef", "the astronomer", "my neighbor", + "the orchestra", "a startup", "the river", "an old library", + "the algorithm", "a mountain village", "the immune system", + "a jazz quartet", "the supply chain", "an ancient manuscript", + "the coral reef", "a chess grandmaster", "the power grid", + "a desert caravan", "the neural network", "a vineyard"] + verbs = ["studied", "rebuilt", "abandoned", "celebrated", "measured", + "optimized", "flooded", "catalogued", "trained", "harvested", + "defended", "improvised", "rerouted", "deciphered", "mapped"] + objects = ["under heavy rain", "for three decades", "with great precision", + "against all odds", "in the summer of 1998", "without any funding", + "across twelve countries", "before the deadline", "at dawn", + "using only open data", "despite the noise", "on a tight budget", + "in complete silence", "after the merger", "beyond the horizon"] + out = [] + i = 0 + while len(out) < n: + s = subjects[i % len(subjects)] + v = verbs[(i // len(subjects)) % len(verbs)] + o = objects[(i // (len(subjects) * len(verbs))) % len(objects)] + out.append(f"{s} {v} {o}.") + i += 1 + return out[:n] + + +def write_npy(path, vecs): + n = len(vecs) + dim = len(vecs[0]) + header = ("{'descr': ' Date: Sat, 13 Jun 2026 12:46:26 -0700 Subject: [PATCH 05/16] Address Gemini code review: panic-safety, perf, Lean typecheck fix MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Lean blindspot_card_product: add required [∀ i, NeZero (m i)] (ZMod 0 has no Fintype instance; the "easy half" needs it too, not just crtEquiv). - shard_recall: guard COPRIME_PERIODS[i] with i % len() (OOB panic if r>16); drop redundant truth_set HashSet, filter truth directly against cand. - spectral_probe: precompute knot (pos,val) arrays once instead of recomputing knot_rank inside the per-key binary search. - density_collapse / twonn_id: NaN-safe sort comparators (unwrap_or Equal). All probe outputs bit-identical after the changes (behavior-preserving): real-embedding density win 0.683 @ k=16, shard envelope 0.9095/0.9080, sphere control 6.99. Examples build clean. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/lean/VernierSeamDensity.lean | 10 ++++++---- examples/density_collapse.rs | 2 +- examples/shard_recall.rs | 13 ++++++++----- examples/spectral_probe.rs | 20 ++++++++++---------- examples/twonn_id.rs | 2 +- 5 files changed, 26 insertions(+), 21 deletions(-) diff --git a/benchmarks/lean/VernierSeamDensity.lean b/benchmarks/lean/VernierSeamDensity.lean index 48a38ef6..974fd89a 100644 --- a/benchmarks/lean/VernierSeamDensity.lean +++ b/benchmarks/lean/VernierSeamDensity.lean @@ -53,11 +53,13 @@ theorem card_admissible {m t : ℕ} (h : 2 * t + 1 ≤ m) : /-! ## Product-space count (coprimality NOT used) -/ /-- The joint blind spot over the residue PRODUCT space has cardinality - `∏ (2t+1)`. True for any family — carries no coprimality hypothesis, which - is the honest statement of what this half proves (it is `Fintype.card_piFinset`, - not CRT). -/ + `∏ (2t+1)`. Carries no COPRIMALITY hypothesis (that is the honest statement + of what this half proves — `Fintype.card_piFinset`, not CRT), but it DOES + need `[∀ i, NeZero (m i)]`: `ZMod 0` is infinite and has no `Fintype` + instance, so `Finset.univ` / `Fintype.piFinset` would not typecheck without + it. (Caught by code review.) -/ theorem blindspot_card_product - {L : ℕ} (m : Fin L → ℕ) (t : ℕ) (hband : ∀ i, 2 * t + 1 ≤ m i) : + {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] (t : ℕ) (hband : ∀ i, 2 * t + 1 ≤ m i) : (Finset.univ.filter (fun r : ∀ i, ZMod (m i) => ∀ i, r i ∈ admissible (m i) t)).card = ∏ i, (2 * t + 1) := by diff --git a/examples/density_collapse.rs b/examples/density_collapse.rs index abce4cfe..6a75fe15 100644 --- a/examples/density_collapse.rs +++ b/examples/density_collapse.rs @@ -105,7 +105,7 @@ fn kendall_tau(a: &[f32], b: &[f32], coords: &[usize]) -> f32 { /// Indices of the top-k coordinates of a vector by value (descending). fn top_coords(v: &[f32], k: usize) -> Vec { let mut idx: Vec = (0..v.len()).collect(); - idx.sort_by(|&i, &j| v[j].partial_cmp(&v[i]).unwrap()); + idx.sort_by(|&i, &j| v[j].partial_cmp(&v[i]).unwrap_or(std::cmp::Ordering::Equal)); idx.truncate(k); idx } diff --git a/examples/shard_recall.rs b/examples/shard_recall.rs index 75c91b19..1edc829e 100644 --- a/examples/shard_recall.rs +++ b/examples/shard_recall.rs @@ -294,12 +294,14 @@ fn build_projs(cfg: &Cfg, arm: Arm, r: usize, base_width: f32) -> Vec { // draw a phase fraction for EVERY arm (keeps phase_rng in lockstep); // arms that don't use it simply ignore the value. let p: f32 = phase_rng.random_range(0.0..1.0); + // i % len() guards against r > COPRIME_PERIODS.len() (OOB panic). + let period = COPRIME_PERIODS[i % COPRIME_PERIODS.len()] as f32; let (width, phase) = match arm { - Arm::Coprime => (base_width / COPRIME_PERIODS[i] as f32, 0.0), + Arm::Coprime => (base_width / period, 0.0), Arm::Aligned => (base_width, 0.0), Arm::RandomOffset => (base_width, p * base_width), Arm::Both => { - let w = base_width / COPRIME_PERIODS[i] as f32; + let w = base_width / period; (w, p * w) } }; @@ -347,9 +349,10 @@ fn probe_recall( } } } - let truth_set: HashSet = truth.iter().map(|&i| i as u32).collect(); - let hits = truth_set.iter().filter(|i| cand.contains(i)).count(); - let recall = hits as f32 / truth_set.len().max(1) as f32; + // truth is already unique ground-truth ids; filter directly against cand + // (no separate HashSet allocation — review suggestion). + let hits = truth.iter().filter(|&&i| cand.contains(&(i as u32))).count(); + let recall = hits as f32 / truth.len().max(1) as f32; (cand.len(), recall) } diff --git a/examples/spectral_probe.rs b/examples/spectral_probe.rs index 8fb7a8e6..fa654540 100644 --- a/examples/spectral_probe.rs +++ b/examples/spectral_probe.rs @@ -285,29 +285,29 @@ fn build_keys(cfg: &Cfg) -> (Vec>, usize) { // the knot smoothing does not wash out the signal being measured. let k = cfg.unfold_smooth_knots.min(cfg.n); let n = cfg.n; - // knot j (0..=k): rank index kr = j*n/k, key value = keys[kr], maps - // to unfolded position j*n/k. Each key unfolds by linear interp of - // its bracketing knots in VALUE space onto POSITION space. - let knot_rank = |j: usize| -> usize { (j * n / k).min(n - 1) }; + // Precompute the k+1 knots once: (position p_j = rank j*n/k, value + // keys[p_j]). Avoids recomputing knot_rank (div/mul/min) and + // re-indexing keys inside the per-key binary search (review). + let kr = |j: usize| -> usize { (j * n / k).min(n - 1) }; + let knot_pos: Vec = (0..=k).map(|j| kr(j) as f64).collect(); + let knot_val: Vec = (0..=k).map(|j| keys[kr(j)]).collect(); let unfolded: Vec = keys .iter() .map(|&x| { - // find bracketing knots by value (knots are sorted in value) - // binary search over knot index + // bracketing knot index by value (knots sorted ascending) let mut lo = 0usize; let mut hi = k; while lo < hi { let mid = (lo + hi) / 2; - if keys[knot_rank(mid)] < x { + if knot_val[mid] < x { lo = mid + 1; } else { hi = mid; } } let j = lo.clamp(1, k); - let (v0, v1) = (keys[knot_rank(j - 1)], keys[knot_rank(j)]); - let (p0, p1) = - (knot_rank(j - 1) as f64, knot_rank(j) as f64); + let (v0, v1) = (knot_val[j - 1], knot_val[j]); + let (p0, p1) = (knot_pos[j - 1], knot_pos[j]); if (v1 - v0).abs() < 1e-30 { p0 } else { diff --git a/examples/twonn_id.rs b/examples/twonn_id.rs index 32381bb7..d04ac6d7 100644 --- a/examples/twonn_id.rs +++ b/examples/twonn_id.rs @@ -207,7 +207,7 @@ fn nn_ratios(corpus: &[f32], n: usize, dim: usize) -> Vec { /// closed-form slope d = Σ x·y / Σ x². fn twonn_fit(mus: &[f64]) -> f64 { let mut m: Vec = mus.to_vec(); - m.sort_by(|a, b| a.partial_cmp(b).unwrap()); + m.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)); let n = m.len(); let keep = ((n as f64) * 0.9) as usize; // trim top 10% tail let (mut sxx, mut sxy) = (0.0f64, 0.0f64); From a0ec2656bfa0a1291565db8f49c7f45d4dc3017a Mon Sep 17 00:00:00 2001 From: Todd Baur <523330+toadkicker@users.noreply.github.com> Date: Sat, 13 Jun 2026 12:57:58 -0700 Subject: [PATCH 06/16] Second review round: de-circularize density test, fix bucketing, harden pipeline MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Acting on a second hostile review + Gemini/qodo PR bots on the real-embedding material: CRITICAL — density-collapse headline corrected (was an artifact): - M2: the win-rate climb 0.667->0.930 with top-k was estimator-variance, not a sharpening effect. Replaced win rate with the tau GAP (effect size) + bootstrap 95% CI over probes. - M1: tau was computed on the probe's OWN top-k coords, coupling it to cosine (near-tautological). Now uses the per-PAIR union of top coords, chosen independently of the cosine ranking. - Result survives but is MODEST and FLAT: gap ~= 0.04, CI strictly > 0 at every top-k. "Sharpening / signature of a real effect" RETRACTED; small real separation stands. qodo: bucketing bug — density_collapse used rank/(d/2^bits) (panics at d/2^bits==0; wrong for non-divisible dims). Now uses ordvec::rank::rank_to_bucket so it measures real RankQuant behavior. bits: u32 -> u8. embed_ollama.py: abort on embedding-count mismatch (E2 row-misalignment) and empty corpus (E3). Record the repo-sentence extraction procedure in results.md (E4/E5 repro gap). Soften single-corpus generality language (G1). Examples build clean; corrected gap result stable across topk. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/ADVERSARIAL_REVIEW.md | 23 ++++++++ benchmarks/density_collapse_results.md | 77 +++++++++++++++++++------ examples/density_collapse.rs | 80 ++++++++++++++++++-------- examples/embed_ollama.py | 12 +++- 4 files changed, 148 insertions(+), 44 deletions(-) diff --git a/benchmarks/ADVERSARIAL_REVIEW.md b/benchmarks/ADVERSARIAL_REVIEW.md index 2466931d..de6c19e5 100644 --- a/benchmarks/ADVERSARIAL_REVIEW.md +++ b/benchmarks/ADVERSARIAL_REVIEW.md @@ -74,6 +74,29 @@ carries its own critique. Conclusions are tiered by what SURVIVED review. number-variance-rigid" (true), and drop the optimality-over-all-partitions claim unless separately proved (likely false). +## Round 2 (real-embedding pipeline + post-PR code review) + +After the real-embedding work and the PR was opened, a second hostile review +(plus the Gemini/qodo PR bots) hit the new material: + +- **CRITICAL — density-collapse headline was an artifact, now corrected.** The + win-rate climb 0.667→0.930 with top-k was an estimator-variance effect (M2), + and tau was computed on the probe's own coords, coupling it to cosine (M1). + FIXED: tau now uses the per-pair UNION of top coords (de-circularized), and we + report the tau GAP (effect size) with a bootstrap 95% CI instead of win rate. + Result survives but is MODEST and FLAT: gap ≈ 0.04, CI strictly > 0 at every + top-k. The "sharpening / signature of a real effect" claim is RETRACTED; the + small-but-real separation stands. +- **qodo: bucketing bug FIXED.** density_collapse reimplemented bucketing as + `rank/(d/2^bits)` (panics at d/2^bits==0; wrong for non-divisible dims). Now + uses `ordvec::rank::rank_to_bucket` — measures REAL RankQuant behavior. +- **embed_ollama.py hardened:** E2 (silent row misalignment if ollama returns + wrong count) now aborts; E3 (empty corpus) guarded. +- **Reproducibility (E4/E5):** the repo-sentence extraction + embed procedure is + now recorded verbatim in density_collapse_results.md (was unrecorded). +- **G1 overclaim:** body language softened; single-corpus/single-model + generality explicitly NOT claimed. + ## Net The mathematically defensible deliverables are: the CRT vernier structure diff --git a/benchmarks/density_collapse_results.md b/benchmarks/density_collapse_results.md index 3948b56f..4362c69f 100644 --- a/benchmarks/density_collapse_results.md +++ b/benchmarks/density_collapse_results.md @@ -60,22 +60,36 @@ First real-data facts: nearest-code Hamming 314/768, and ~5083 of 8665 docs sit in each probe's b2-lookalike shell (vs ~60–120 synthetic). Real geometry, much denser collapse. -The tau-signal HOLDS on real data and SHARPENS monotonically with the coordinate -set size (top-k): - -| top-k | tau true-nbr | tau far | win rate (tau_true < tau_far) | -|-------|--------------|---------|-------------------------------| -| 8 | 0.3892 | 0.4128 | 0.667 | -| 16 | 0.3773 | 0.3968 | 0.683 | -| 32 | 0.3706 | 0.3931 | 0.847 | -| 64 | 0.3682 | 0.3926 | **0.930** | - -At top-k=64, 93% of probes have true neighbours with lower intra-code Kendall-tau -than the b2-lookalikes — on real encoder geometry. The monotonic climb -(0.667→0.930) is the signature of a real effect: sparse top-8 order can't -separate near-parallel real embeddings, but the fuller coordinate order does. -So the synthetic finding survives contact with real embeddings, and the lever -(intra-code permutation order, already in the Rank code) is confirmed. +### CORRECTED after a second adversarial review (M1 + M2) + +The first writeup reported a win-rate climb 0.667→0.930 with top-k as "the +signature of a real effect." TWO review findings invalidated that framing, and +the test was rewritten: + +- **M2 (the win-rate climb was an artifact):** the per-probe tau GAP is flat + across top-k; only the win-rate estimator's variance shrinks as k grows, + mechanically pushing the win rate up. Win rate was the wrong statistic. +- **M1 (circularity):** tau was computed on the PROBE'S OWN top-k coords, which + couples tau to cosine and makes the test near-tautological. Fixed to use the + per-PAIR UNION of top coords, chosen independently of the cosine ranking. + +Rewritten test reports the tau GAP (far − near) as an effect size, with a +bootstrap 95% CI over probes (de-circularized coords): + +| top-k | tau gap (far − near) | 95% CI | verdict | +|-------|----------------------|--------|---------| +| 8 | 0.0420 | [0.0380, 0.0463] | signal | +| 16 | 0.0417 | [0.0381, 0.0454] | signal | +| 32 | 0.0440 | [0.0409, 0.0472] | signal | +| 64 | 0.0453 | [0.0424, 0.0483] | signal | + +**Honest conclusion:** there IS a real effect — true neighbours have lower +intra-code Kendall-tau than the b2-lookalikes the code conflates them with, gap +≈ 0.04, CI strictly above 0 at every top-k. But it is MODEST and FLAT, not the +dramatic "sharpening" originally claimed. The win-rate monotonicity is retracted. +The lever (intra-code permutation order in the Rank code) shows a measurable but +small separation on this corpus/model; whether ~0.04 tau converts to useful +recall gain vs simply using b=4 is the unanswered deployment question. ## Caveats / open @@ -84,11 +98,36 @@ So the synthetic finding survives contact with real embeddings, and the lever - The test shows the signal EXISTS (separation in tau); it does not yet show a tau-rerank improves end-to-end recall vs simply using b=4. The honest next experiment: tau-rerank of b=2 survivors vs b=4 at matched bytes, R@10 vs FP32. -- Kendall-tau here is computed on FP32 values restricted to the probe's top-k - coords; a deployable version computes it on the stored ranks of those coords. +- Kendall-tau is computed on FP32 values restricted to the per-pair union of + top-k coords; a deployable version computes it on the stored ranks. +- Single corpus, single model (nomic-embed-text), no cross-encoder check — do + NOT read "confirmed" generality from one narrow corpus. Repeat on ≥1 more + encoder before any strong claim. -Reproduce: +Reproduce (synthetic): ``` cargo run --release --example density_collapse # noise=0.30 cargo run --release --example density_collapse -- --noise 0.10 # denser ``` + +Reproduce (real embeddings — full recorded procedure, was an E4 repro gap): +``` +# 1. extract repo sentences (md + rust, 30..300 chars, >=4 words, deduped): +python - <<'PY' +import re, glob, os, tempfile +texts=[] +for f in glob.glob("**/*.md",recursive=True)+glob.glob("**/*.rs",recursive=True): + if "target" in f.split(os.sep) or ".git" in f.split(os.sep): continue + t=re.sub(r'```.*?```','',open(f,encoding='utf-8',errors='ignore').read(),flags=re.S) + for line in re.split(r'(?<=[.!?])\s+|\n',t): + s=line.strip(" #-*/>|`").strip() + if 30<=len(s)<=300 and re.search(r'[a-zA-Z]{4}',s) and s.count(' ')>=4: texts.append(s) +seen=set(); out=[s for s in texts if not (s in seen or seen.add(s))] +open(os.path.join(tempfile.gettempdir(),"repo_texts.txt"),"w",encoding="utf-8").write("\n".join(out)) +print(len(out),"sentences") +PY +# 2. embed via ollama nomic-embed-text (GPU) -> .npy: +python examples/embed_ollama.py --texts "$TMP/repo_texts.txt" --out "$TMP/repo_real.npy" --n 9000 --batch 128 +# 3. run the probe: +cargo run --release --example density_collapse -- --corpus-npy "$TMP/repo_real.npy" --topk 32 +``` diff --git a/examples/density_collapse.rs b/examples/density_collapse.rs index 6a75fe15..39001fba 100644 --- a/examples/density_collapse.rs +++ b/examples/density_collapse.rs @@ -21,7 +21,7 @@ //! cargo run --release --example density_collapse -- --noise 0.15 (tighter) //! No external data, no BLAS. Uses ordvec::rank::rank_transform. -use ordvec::rank::rank_transform; +use ordvec::rank::{rank_to_bucket, rank_transform}; use rand::{RngExt, SeedableRng}; use rand_chacha::ChaCha8Rng; use rayon::prelude::*; @@ -34,7 +34,7 @@ struct Cfg { latent: usize, clusters: usize, noise: f32, // cluster tightness: smaller = denser = more collapse - bits: u32, // bucket bits (2 = quartiles) + bits: u8, // bucket bits (2 = quartiles); ordvec RankQuant uses {1,2,4} topk: usize, // # top coords used for the intra-bucket tau test corpus_npy: Option, // real embeddings; overrides synthetic + dim/n } @@ -208,19 +208,16 @@ fn main() { }; let d = cfg.dim; - // b-bit bucket code: top bucket id per coordinate. rank r in [0,D) -> - // bucket r / (D / 2^bits). The b=2 "top-bucket membership" code we hash on - // is the multiset of bucket ids (order-free), exactly what RankQuant ties on. - let nbuckets = 1usize << cfg.bits; - let bucket_w = d / nbuckets; + // Per-coordinate b-bit bucket id, using ordvec's OWN bucketing + // (`rank_to_bucket`) so the probe measures real RankQuant behavior — it is + // guarded (d>0, bits<=7, rank> = (0..cfg.n) .into_par_iter() .map(|i| { let ranks = rank_transform(&corpus[i * d..(i + 1) * d]); - ranks - .iter() - .map(|&r| (r as usize / bucket_w).min(nbuckets - 1) as u8) - .collect() + ranks.iter().map(|&r| rank_to_bucket(r, d, cfg.bits)).collect() }) .collect(); @@ -263,14 +260,23 @@ fn tau_report(cfg: &Cfg, corpus: &[f32], codes: &[Vec]) { .iter() .map(|&(_, j)| (cos(pv, &corpus[j * d..(j + 1) * d]), j)) .collect(); - by_cos.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap()); + by_cos.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal)); let half = by_cos.len() / 2; - let coords = top_coords(pv, cfg.topk); + // M1 fix (review): tau coords must NOT be the probe's own top-k + // (that couples tau to cosine, making the test near-tautological). + // Use the UNION of each PAIR's own top coords — chosen independently + // of the cosine ranking — so tau measures order agreement on the + // dims the two vectors jointly care about, not the probe's alone. let tau_of = |slice: &[(f32, usize)]| -> (f32, f32) { let mut st = 0.0f32; let mut sc = 0.0f32; for &(c, j) in slice { - st += kendall_tau(pv, &corpus[j * d..(j + 1) * d], &coords); + let jv = &corpus[j * d..(j + 1) * d]; + let mut coords = top_coords(pv, cfg.topk); + coords.extend(top_coords(jv, cfg.topk)); + coords.sort_unstable(); + coords.dedup(); + st += kendall_tau(pv, jv, &coords); sc += c; } let k = slice.len().max(1) as f32; @@ -289,24 +295,50 @@ fn tau_report(cfg: &Cfg, corpus: &[f32], codes: &[Vec]) { let tau_far = mean(&|r| r.1); let cos_near = mean(&|r| r.2); let cos_far = mean(&|r| r.3); - let wins = rows.iter().filter(|r| r.0 < r.1).count(); - println!("\n## Intra-code Kendall-tau test (top-{} coords, M={m} b2-lookalikes/probe, {} probes)", + // Per-probe tau GAP (far - near): positive => true neighbours have lower + // tau. Report its mean with a bootstrap 95% CI over probes. The gap (an + // EFFECT SIZE) is the honest statistic — NOT the win rate, whose climb with + // top-k is just estimator-variance reduction, not a sharpening effect (M2). + let gaps: Vec = rows.iter().map(|r| (r.1 - r.0) as f64).collect(); + let gap_mean = gaps.iter().sum::() / gaps.len().max(1) as f64; + let (lo, hi) = bootstrap_ci(&gaps, 2000); + println!("\n## Intra-code Kendall-tau test (top-{} coords PER-PAIR union, M={m}, {} probes)", cfg.topk, rows.len()); - println!("among b2-lookalikes: mean cosine mean top-k tau-distance"); + println!("among b2-lookalikes: mean cosine mean tau-distance"); println!(" FP32-TRUE neighbours {cos_near:.4} {tau_near:.4}"); println!(" FP32-FAR lookalikes {cos_far:.4} {tau_far:.4}"); - println!("probes where true-neighbour tau < far-lookalike tau: {wins}/{} = {:.3}", - rows.len(), wins as f64 / rows.len().max(1) as f64); - let verdict = if tau_near + 0.005 < tau_far { - "SIGNAL: fine permutation order separates true neighbours from b2-lookalikes -> recoverable, no new storage" - } else if tau_near > tau_far + 0.005 { - "INVERTED: true neighbours have HIGHER tau — order does not help here" + println!("tau GAP (far - near) = {gap_mean:.4} 95% CI [{lo:.4}, {hi:.4}]"); + let verdict = if lo > 0.0 { + "SIGNAL: gap CI strictly > 0 — true neighbours have lower tau (effect, not just win-rate)" + } else if hi < 0.0 { + "INVERTED: gap CI strictly < 0" } else { - "NO SIGNAL: intra-code order does not separate true neighbours from b2-lookalikes" + "NO SIGNAL: gap CI spans 0 — cannot distinguish from noise" }; println!("verdict: {verdict}"); } +/// Percentile bootstrap 95% CI for the mean of `xs`, `b` resamples. +/// Deterministic RNG (no Date/rand-of-clock) so the CI is reproducible. +fn bootstrap_ci(xs: &[f64], b: usize) -> (f64, f64) { + if xs.is_empty() { + return (0.0, 0.0); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xB007); + let n = xs.len(); + let mut means: Vec = (0..b) + .map(|_| { + let mut s = 0.0; + for _ in 0..n { + s += xs[rng.random_range(0..n)]; + } + s / n as f64 + }) + .collect(); + means.sort_by(|a, c| a.partial_cmp(c).unwrap_or(std::cmp::Ordering::Equal)); + (means[(0.025 * b as f64) as usize], means[((0.975 * b as f64) as usize).min(b - 1)]) +} + /// Low-rank clustered corpus; `noise` is the density dial. fn make_corpus(cfg: &Cfg) -> Vec { let mut rng = ChaCha8Rng::seed_from_u64(SEED); diff --git a/examples/embed_ollama.py b/examples/embed_ollama.py index 56d1af7e..aa16f640 100644 --- a/examples/embed_ollama.py +++ b/examples/embed_ollama.py @@ -28,7 +28,15 @@ def embed_batch(texts): headers={"Content-Type": "application/json"}, ) with urllib.request.urlopen(req, timeout=120) as r: - return json.load(r)["embeddings"] + embs = json.load(r)["embeddings"] + # E2 fix: ollama must return exactly one vector per input, IN ORDER. + # A mismatch would silently misalign rows against the source texts. + if len(embs) != len(texts): + raise RuntimeError( + f"embedding count {len(embs)} != input count {len(texts)} " + "(row misalignment) — aborting rather than writing corrupt .npy" + ) + return embs def build_builtin_corpus(n): @@ -60,6 +68,8 @@ def build_builtin_corpus(n): def write_npy(path, vecs): + if not vecs: + raise ValueError("no vectors to write (empty corpus?)") n = len(vecs) dim = len(vecs[0]) header = ("{'descr': ' Date: Sat, 13 Jun 2026 13:09:03 -0700 Subject: [PATCH 07/16] Docs: add benchmarks/README front door, move withdrawn results to withdrawn/ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Makes the 11-doc benchmark set navigable for review without changing any findings: - benchmarks/README.md: tiered index (SOUND / THEORY / WITHDRAWN), a 3-minute reviewer path leading with the headline density-collapse result, and the open follow-up. No entry point existed before. - Move the WITHDRAWN number-variance docs (spectral_probe_results, corpus_zoo_results) to benchmarks/withdrawn/ with a README marking them record-only; fix the two cross-references. Top level now shows only what survived review. Pure documentation reorg — no code, no result, no API change. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/ADVERSARIAL_REVIEW.md | 2 +- benchmarks/README.md | 60 +++++++++++++++++++ benchmarks/rigidity_impossibility_proofs.md | 2 +- benchmarks/withdrawn/README.md | 16 +++++ .../{ => withdrawn}/corpus_zoo_results.md | 0 .../{ => withdrawn}/spectral_probe_results.md | 0 6 files changed, 78 insertions(+), 2 deletions(-) create mode 100644 benchmarks/README.md create mode 100644 benchmarks/withdrawn/README.md rename benchmarks/{ => withdrawn}/corpus_zoo_results.md (100%) rename benchmarks/{ => withdrawn}/spectral_probe_results.md (100%) diff --git a/benchmarks/ADVERSARIAL_REVIEW.md b/benchmarks/ADVERSARIAL_REVIEW.md index de6c19e5..ad7cfbde 100644 --- a/benchmarks/ADVERSARIAL_REVIEW.md +++ b/benchmarks/ADVERSARIAL_REVIEW.md @@ -28,7 +28,7 @@ carries its own critique. Conclusions are tiered by what SURVIVED review. ## DEMOTED TO EXPLORATORY (known confounds — NOT a settled finding) - **"Routing keys are super-Poisson, never rigid"** (`spectral_probe.rs`, - `corpus_zoo_results.md`): **WITHDRAWN — the probe does not measure what it + `withdrawn/corpus_zoo_results.md`): **WITHDRAWN — the probe does not measure what it claims.** Attempted salvage (added `--unfold-smooth K`, a K-knot empirical unfold) INVERTED the result: under the smooth unfold the isotropic corpus reads super-Poisson (Σ²/L 1.7→12) and clustered reads LOWER (1.4→6.8) — the diff --git a/benchmarks/README.md b/benchmarks/README.md new file mode 100644 index 00000000..9a863034 --- /dev/null +++ b/benchmarks/README.md @@ -0,0 +1,60 @@ +# Ordinal-routing research — reviewer's guide + +Exploratory investigation into ordvec's **density behavior** and whether +prime/spectral structure can improve training-free routing. Everything here is +in `examples/` (runnable probes) and `benchmarks/` (findings + proofs) — **no +changes to the `ordvec` crate or its public API.** + +Reviewed by three internal adversarial agents plus the PR bots; findings are +tiered below by **what survived scrutiny**. Read the tiers, not all 11 docs. + +## 3-minute path + +1. This file (the tiers). +2. **[density_collapse_results.md](density_collapse_results.md)** — the headline, + real-embedding result (and its honest correction). +3. **[ADVERSARIAL_REVIEW.md](ADVERSARIAL_REVIEW.md)** — what was challenged, + fixed, retracted, withdrawn. The integrity record. + +## SOUND — proven or real-data confirmed + +| doc | claim | +|-----|-------| +| [density_collapse_results.md](density_collapse_results.md) | **Headline.** RankQuant b=2 density collapse = Hamming-near codes the scorer can't separate. Among those lookalikes, true neighbours have lower intra-code Kendall-tau (gap ≈ 0.04, bootstrap CI > 0) on real `nomic-embed-text` embeddings. Modest but real; the lever is permutation order already in the `Rank` code. | +| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 skeleton in [lean/](lean/). | +| [twonn_id_results.md](twonn_id_results.md) | TwoNN intrinsic-dimension probe (chord-metric fix, sphere-validated). `nomic-embed-text` ID ≈ 13 / 768. Estimator-bias caveat noted. | +| [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. | + +## THEORY — directionally right, restated honestly + +| doc | status | +|-----|--------| +| [rigidity_impossibility_proofs.md](rigidity_impossibility_proofs.md) | The routing key is not number-variance-rigid (Thm 2/3, binomial `L(1-L/n)`). The over-broad "quantile optimal over all partitions" claim is **retracted** as a non-sequitur. | +| [conjecture_citation_audit.md](conjecture_citation_audit.md) | Citations verified by direct fetch (von Koch, Broughan-Barnett, Montgomery, Ethayarajh, etc.). | + +## WITHDRAWN — see [withdrawn/](withdrawn/) + +The number-variance "super-Poisson" finding ([withdrawn/spectral_probe_results.md](withdrawn/spectral_probe_results.md), +[withdrawn/corpus_zoo_results.md](withdrawn/corpus_zoo_results.md)) did not +survive: its unfold is uncalibrated (a salvage attempt inverted the result). The +*theory* above does not depend on it. Kept for the record, not as a claim. + +## Conjecture verdict (the framing question) + +Prime / Li(x) / Sacks-spiral constructions don't help retrieval: they act on the +index (ℕ) and carry no corpus information. The exploitable dense-region structure +lives on the permutohedron `S_D` — the data's own order — which is the +density-collapse result above. Detail across the theory docs + ADVERSARIAL_REVIEW. + +## Reproduce + +Per-doc commands are at the bottom of each file. Real-embedding pipeline (GPU via +ollama) is fully recorded in [density_collapse_results.md](density_collapse_results.md); +external-corpus recipe in [REAL_CORPUS_RUNBOOK.md](REAL_CORPUS_RUNBOOK.md). + +## Open follow-up + +The decisive deployment question is unanswered: does the ≈0.04 tau gap convert to +real recall gain vs simply using b=4 at matched bytes (R@10 vs FP32)? That, plus a +second corpus/encoder, is what would move density-collapse from "real effect" to +"ship it." diff --git a/benchmarks/rigidity_impossibility_proofs.md b/benchmarks/rigidity_impossibility_proofs.md index 2ebc0406..b6048756 100644 --- a/benchmarks/rigidity_impossibility_proofs.md +++ b/benchmarks/rigidity_impossibility_proofs.md @@ -115,7 +115,7 @@ NOT in favor of quantile-optimality-over-all-partitions. Establishing the latter would need a separate argument (that partition quality is monotone in Σ²), which is not provided and is likely false. -(The zoo runs in corpus_zoo_results.md are EVIDENCE for the narrow claim, with +(The zoo runs in withdrawn/corpus_zoo_results.md were EVIDENCE for the narrow claim, with the unfolding caveat in ADVERSARIAL_REVIEW.md. The selftest Σ²=0 only shows the estimator can represent the o(L) regime; Thm 3 says a fixed-distribution corpus cannot reach it.) diff --git a/benchmarks/withdrawn/README.md b/benchmarks/withdrawn/README.md new file mode 100644 index 00000000..374b0515 --- /dev/null +++ b/benchmarks/withdrawn/README.md @@ -0,0 +1,16 @@ +# Withdrawn results + +These findings did NOT survive review and are kept only for the record — do not +cite them as results. See [../ADVERSARIAL_REVIEW.md](../ADVERSARIAL_REVIEW.md) +(Round 1) and [../README.md](../README.md) for the tiering. + +- **spectral_probe_results.md** / **corpus_zoo_results.md** — the number-variance + "embedding keys are super-Poisson, never rigid" finding. The probe's unfold is + uncalibrated: a smooth-unfold salvage attempt INVERTED the isotropic/clustered + ordering, proving neither unfold is validated against analytic ground truth. + The rigidity THEORY (../rigidity_impossibility_proofs.md) does not depend on + this probe and is unaffected. + +The `examples/spectral_probe.rs` instrument itself is retained (its +`--rigid-selftest` is a valid estimator check); only the corpus *findings* are +withdrawn. diff --git a/benchmarks/corpus_zoo_results.md b/benchmarks/withdrawn/corpus_zoo_results.md similarity index 100% rename from benchmarks/corpus_zoo_results.md rename to benchmarks/withdrawn/corpus_zoo_results.md diff --git a/benchmarks/spectral_probe_results.md b/benchmarks/withdrawn/spectral_probe_results.md similarity index 100% rename from benchmarks/spectral_probe_results.md rename to benchmarks/withdrawn/spectral_probe_results.md From 41a90076b1ab3c077d09071cee5ad2dea1f1adce Mon Sep 17 00:00:00 2001 From: Todd Baur <523330+toadkicker@users.noreply.github.com> Date: Sat, 13 Jun 2026 13:36:58 -0700 Subject: [PATCH 08/16] =?UTF-8?q?Run=20the=20decisive=20bake-off:=20tau-re?= =?UTF-8?q?rank=20does=20NOT=20beat=20b=3D4=20=E2=80=94=20clean=20negative?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The deployment question for the density-collapse finding, answered. Ceiling experiment (tau uses full stored ranks): does a Kendall-tau rerank of b=2 survivors beat spending the bits on b=4 at R@10 vs FP32? Result — NO, decisively, on synthetic AND real (nomic-embed-text) embeddings: b2 0.898 | b4 0.942 | tau-rerank 0.597 | fp32-rerank 1.000 (real, topk128/M50) - fp32-rerank=1.0 proves the b=2 candidate pool contains every true neighbour; candidate generation is fine. - tau-rerank (0.597) scores BELOW b=2's own ordering (0.898): the ~0.04 tau gap is a real binary discriminator but far too weak to ORDER candidates. - Even at the ceiling (full ranks, best params) tau only ties b=2, never reaches b=4. No compact tau codec could beat this, so codec work is not justified. Verdict: the density-collapse tau-signal is real-but-inert. Just use b=4. No ordvec feature follows. This closes the research line honestly: a characterized mechanism + a clean negative — the prime/spectral/permutation ideas do not beat the boring baseline. Confirms this branch is RESEARCH, not a contribution to the crate. Updates density_collapse_results.md (open question now RESOLVED) and the benchmarks README (verdict). Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/README.md | 28 +- benchmarks/density_collapse_results.md | 9 +- benchmarks/tau_rerank_bakeoff_results.md | 54 ++++ examples/tau_rerank_bakeoff.rs | 336 +++++++++++++++++++++++ 4 files changed, 415 insertions(+), 12 deletions(-) create mode 100644 benchmarks/tau_rerank_bakeoff_results.md create mode 100644 examples/tau_rerank_bakeoff.rs diff --git a/benchmarks/README.md b/benchmarks/README.md index 9a863034..3fca212e 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -10,9 +10,11 @@ tiered below by **what survived scrutiny**. Read the tiers, not all 11 docs. ## 3-minute path -1. This file (the tiers). -2. **[density_collapse_results.md](density_collapse_results.md)** — the headline, - real-embedding result (and its honest correction). +1. This file (the tiers + the verdict at the bottom). +2. **[density_collapse_results.md](density_collapse_results.md)** — the mechanism + (real-embedding, with its honest correction), then + **[tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md)** — the + decisive negative: it doesn't beat b=4. 3. **[ADVERSARIAL_REVIEW.md](ADVERSARIAL_REVIEW.md)** — what was challenged, fixed, retracted, withdrawn. The integrity record. @@ -20,7 +22,8 @@ tiered below by **what survived scrutiny**. Read the tiers, not all 11 docs. | doc | claim | |-----|-------| -| [density_collapse_results.md](density_collapse_results.md) | **Headline.** RankQuant b=2 density collapse = Hamming-near codes the scorer can't separate. Among those lookalikes, true neighbours have lower intra-code Kendall-tau (gap ≈ 0.04, bootstrap CI > 0) on real `nomic-embed-text` embeddings. Modest but real; the lever is permutation order already in the `Rank` code. | +| [density_collapse_results.md](density_collapse_results.md) | **Mechanism.** RankQuant b=2 density collapse = Hamming-near codes the scorer can't separate. Among those lookalikes, true neighbours have lower intra-code Kendall-tau (gap ≈ 0.04, CI > 0). Real but small. | +| [tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md) | **The verdict.** Does that tau signal beat b=4? NO — b=4 wins even at the tau ceiling; tau scores below b=2's own ordering. Signal is real-but-inert; just use b=4. Closes the line: research, not a feature. | | [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 skeleton in [lean/](lean/). | | [twonn_id_results.md](twonn_id_results.md) | TwoNN intrinsic-dimension probe (chord-metric fix, sphere-validated). `nomic-embed-text` ID ≈ 13 / 768. Estimator-bias caveat noted. | | [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. | @@ -52,9 +55,16 @@ Per-doc commands are at the bottom of each file. Real-embedding pipeline (GPU vi ollama) is fully recorded in [density_collapse_results.md](density_collapse_results.md); external-corpus recipe in [REAL_CORPUS_RUNBOOK.md](REAL_CORPUS_RUNBOOK.md). -## Open follow-up +## The deployment question — RESOLVED (negative) -The decisive deployment question is unanswered: does the ≈0.04 tau gap convert to -real recall gain vs simply using b=4 at matched bytes (R@10 vs FP32)? That, plus a -second corpus/encoder, is what would move density-collapse from "real effect" to -"ship it." +[tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md): the decisive +matched-bytes experiment was run. **b=4 wins decisively, even at the tau ceiling** +(real embeddings: b4 0.942, b2 0.898, tau-rerank 0.597, fp32-rerank 1.000). The +b=2 candidate pool contains every true neighbour (fp32-rerank=1.0), but the ~0.04 +tau gap is too weak to ORDER them — it scores below b=2's own ordering. The +density-collapse signal is **real but inert**: "just use b=4," no ordvec feature +follows. + +This is the honest bottom line of the whole branch: a characterized mechanism and +a clean negative. **Research, not a feature** — the prime/spectral/permutation +ideas for dense-region retrieval do not beat the boring baseline (spend the bits). diff --git a/benchmarks/density_collapse_results.md b/benchmarks/density_collapse_results.md index 4362c69f..b2387ced 100644 --- a/benchmarks/density_collapse_results.md +++ b/benchmarks/density_collapse_results.md @@ -95,9 +95,12 @@ recall gain vs simply using b=4 is the unanswered deployment question. - Real corpus here is repo-domain (md+rust), 8665 docs — narrow domain, modest size. Confirm on a larger, broader corpus (MS MARCO / Wikipedia passages). -- The test shows the signal EXISTS (separation in tau); it does not yet show a - tau-rerank improves end-to-end recall vs simply using b=4. The honest next - experiment: tau-rerank of b=2 survivors vs b=4 at matched bytes, R@10 vs FP32. +- **RESOLVED (negative):** the tau-rerank-vs-b4 bake-off was run + (tau_rerank_bakeoff_results.md). b=4 wins decisively even at the tau ceiling + (real: b4 0.942 vs tau 0.597, and tau is worse than b2's own 0.898). The + ~0.04 tau gap is a real binary discriminator but too weak to ORDER candidates; + it does not convert to retrieval value. Conclusion: just use b=4. The signal + is real-but-inert — no ordvec feature follows. - Kendall-tau is computed on FP32 values restricted to the per-pair union of top-k coords; a deployable version computes it on the stored ranks. - Single corpus, single model (nomic-embed-text), no cross-encoder check — do diff --git a/benchmarks/tau_rerank_bakeoff_results.md b/benchmarks/tau_rerank_bakeoff_results.md new file mode 100644 index 00000000..e651aaa1 --- /dev/null +++ b/benchmarks/tau_rerank_bakeoff_results.md @@ -0,0 +1,54 @@ +# Matched-bytes bake-off: tau-rerank vs b=4 — the deployment verdict + +The decisive question for the density-collapse finding: does breaking dense-region +ties with permutation order (Kendall-tau rerank of b=2 survivors) beat simply +spending the bits on b=4? Source: `examples/tau_rerank_bakeoff.rs`. CEILING +experiment — tau uses the FULL stored rank order (best case for the method). + +## Result — NO. b=4 wins decisively, even at the tau ceiling. + +Real embeddings (nomic-embed-text, 8665 docs / 200 held-out queries, 768-d, +topk=128, M=50 — tau's best parameters): + +| arm | bytes/vec | R@10 | +|-----|-----------|------| +| b2 asym | 192 | 0.8980 | +| b4 asym | 384 | 0.9420 | +| b2 + tau-rerank | 192* | 0.5965 (*ceiling: full stored ranks) | +| b2 + fp32-rerank | — | 1.0000 (absolute ceiling) | + +Synthetic (dim 256, n 30k) tells the same story: b2 0.5785, b4 0.8095, +tau-rerank 0.22–0.57 (best case ties b2, never reaches b4), fp32-rerank 1.0000. + +## Why the tau-rerank fails (the precise lesson) + +1. **The candidate pool is fine.** fp32-rerank = 1.0000 proves b=2's top-M + *contains* every FP32 true neighbour. Candidate generation is not the problem. +2. **Tau MISORDERS them.** tau-rerank (0.597) is worse than b=2's OWN ordering + (0.898). The ~0.04 tau gap from density_collapse_results.md is real as a faint + BINARY discriminator (true-neighbour vs far-lookalike) but far too weak to + ORDER 50 candidates correctly — it actively destroys the ordering b=2's + scores already provide. +3. **Even at the ceiling** (full stored ranks, best topk/M) tau-rerank only + claws back to ≈ b=2; it never reaches b=4. No compact tau codec could do + better than this ceiling, so codec work is NOT justified. + +## Verdict + +**The density-collapse tau-signal is a real-but-inert observation, not a usable +lever. "Just use b=4."** This closes the loop opened in +density_collapse_results.md: the signal exists (binary separation, CI > 0) but +does not convert to retrieval value. There is no ordvec code change here worth +making. + +This is the honest negative that the whole research line was meant to resolve: +the prime/spectral/permutation ideas for improving dense-region retrieval do not +beat the boring baseline (spend the bits). The contribution of this branch is +therefore RESEARCH (a characterized mechanism + a clean negative), not a feature. + +Reproduce: +``` +cargo run --release --example tau_rerank_bakeoff # synthetic +cargo run --release --example tau_rerank_bakeoff -- \ + --corpus-npy repo_real.npy --queries-npy repo_q.npy --topk 128 --m 50 +``` diff --git a/examples/tau_rerank_bakeoff.rs b/examples/tau_rerank_bakeoff.rs new file mode 100644 index 00000000..c77246cc --- /dev/null +++ b/examples/tau_rerank_bakeoff.rs @@ -0,0 +1,336 @@ +//! Matched-bytes bake-off: does a Kendall-tau rerank of b=2 survivors beat +//! simply spending the bits on b=4, on R@10 vs FP32 cosine? +//! +//! This is the decisive deployment question for the density-collapse finding. +//! It is a CEILING experiment: the tau-rerank uses the FULL stored rank order +//! (ignoring how compactly that order could be stored). If the idealized +//! tau-rerank cannot beat b=4, the answer is "just use b=4" and no codec work +//! is justified. If it can, a compact tau codec becomes worth designing. +//! +//! Arms (all scored against FP32 brute-force cosine top-k): +//! b2 — RankQuant b=2 asymmetric (dim/4 bytes/vec) +//! b4 — RankQuant b=4 asymmetric (dim/2 bytes/vec) +//! b2+tau — b=2 top-M candidates, reranked by Kendall-tau of the +//! query's top-k coord order vs each doc's stored rank order +//! b2+fp32 — b=2 top-M candidates, reranked by exact FP32 cosine +//! (absolute ceiling / sanity — should approach FP32) +//! +//! Run: cargo run --release --example tau_rerank_bakeoff +//! cargo run --release --example tau_rerank_bakeoff -- --corpus-npy emb.npy --queries-npy q.npy +//! No external data, no BLAS. + +use ordvec::rank::rank_transform; +use ordvec::RankQuant; +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + n_queries: usize, + latent: usize, + clusters: usize, + k: usize, + m: usize, // candidate-set size for the rerank arms + topk: usize, // # query top coords used by the tau rerank + corpus_npy: Option, + queries_npy: Option, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 30_000, + n_queries: 200, + latent: 64, + clusters: 200, + k: 10, + m: 200, + topk: 32, + corpus_npy: None, + queries_npy: None, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent = a.next().unwrap().parse().unwrap(), + "--clusters" => c.clusters = a.next().unwrap().parse().unwrap(), + "--k" => c.k = a.next().unwrap().parse().unwrap(), + "--m" => c.m = a.next().unwrap().parse().unwrap(), + "--topk" => c.topk = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--queries-npy" => c.queries_npy = Some(a.next().unwrap()), + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +fn l2_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "len mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn main() { + let mut cfg = parse(); + let (corpus, queries) = load_corpus(&mut cfg); + let truth = fp32_topk(&corpus, &queries, cfg.dim, cfg.k); + println!( + "# tau_rerank_bakeoff: n={} q={} dim={} k={} M={} topk={}", + cfg.n, cfg.n_queries, cfg.dim, cfg.k, cfg.m, cfg.topk + ); + run(&cfg, &corpus, &queries, &truth); +} + +fn recall_at_k(pred: &[usize], truth: &[usize]) -> f32 { + use std::collections::HashSet; + let t: HashSet = truth.iter().copied().collect(); + let hits = pred.iter().filter(|i| t.contains(i)).count(); + hits as f32 / truth.len().max(1) as f32 +} + +/// Top-k indices of a vector by value, descending. +fn top_coords(v: &[f32], k: usize) -> Vec { + let mut idx: Vec = (0..v.len()).collect(); + idx.sort_by(|&i, &j| v[j].partial_cmp(&v[i]).unwrap_or(std::cmp::Ordering::Equal)); + idx.truncate(k); + idx +} + +/// Kendall-tau distance between the orderings q and the doc's stored RANKS +/// induce on `coords` (fraction of discordant pairs). Lower = more similar +/// order. `dranks[c]` is the stored rank of coord c for this doc. +fn tau_dist(q: &[f32], dranks: &[u16], coords: &[usize]) -> f32 { + let m = coords.len(); + if m < 2 { + return 0.0; + } + let (mut disc, mut tot) = (0usize, 0usize); + for x in 0..m { + for y in (x + 1)..m { + let (cx, cy) = (coords[x], coords[y]); + let sq = (q[cx] - q[cy]).partial_cmp(&0.0).unwrap_or(std::cmp::Ordering::Equal); + // stored ranks are integers; higher rank = larger value + let sd = dranks[cx].cmp(&dranks[cy]); + if sq != sd { + disc += 1; + } + tot += 1; + } + } + disc as f32 / tot.max(1) as f32 +} + +fn run(cfg: &Cfg, corpus: &[f32], queries: &[f32], truth: &[Vec]) { + let d = cfg.dim; + let nq = cfg.n_queries; + + // Build b=2 and b=4 indices on identical data. + let mut b2 = RankQuant::new(d, 2); + b2.add(corpus); + let mut b4 = RankQuant::new(d, 4); + b4.add(corpus); + eprintln!( + "# b2 {} B/vec, b4 {} B/vec", + b2.bytes_per_vec(), + b4.bytes_per_vec() + ); + + // Precompute stored ranks per doc (the CEILING tau signal: full order). + // This is what a compact tau codec would have to approximate. + let doc_ranks: Vec> = (0..cfg.n) + .into_par_iter() + .map(|i| rank_transform(&corpus[i * d..(i + 1) * d])) + .collect(); + + // Per query: candidate set = b=2's own asymmetric top-M (the cheap stage). + let b2_res = b2.search_asymmetric(queries, cfg.m); + let b4_res = b4.search_asymmetric(queries, cfg.k); + + let mut r_b2 = 0.0f64; + let mut r_b4 = 0.0f64; + let mut r_tau = 0.0f64; + let mut r_fp = 0.0f64; + + let per_q: Vec<(f32, f32, f32, f32)> = (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * d..(qi + 1) * d]; + // b2 top-k (first k of its top-M) + let b2_ids: Vec = + b2_res.indices_for_query(qi).iter().take(cfg.k).map(|&i| i as usize).collect(); + let b4_ids: Vec = + b4_res.indices_for_query(qi).iter().map(|&i| i as usize).collect(); + // candidate pool = b2 top-M + let cands: Vec = + b2_res.indices_for_query(qi).iter().map(|&i| i as usize).collect(); + let coords = top_coords(q, cfg.topk); + // tau rerank: ascending tau distance + let mut by_tau: Vec<(usize, f32)> = cands + .iter() + .map(|&di| (di, tau_dist(q, &doc_ranks[di], &coords))) + .collect(); + by_tau.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal)); + let tau_ids: Vec = by_tau.iter().take(cfg.k).map(|&(i, _)| i).collect(); + // fp32 rerank: descending cosine (absolute ceiling) + let mut by_fp: Vec<(usize, f32)> = cands + .iter() + .map(|&di| { + let dv = &corpus[di * d..(di + 1) * d]; + (di, q.iter().zip(dv).map(|(a, b)| a * b).sum()) + }) + .collect(); + by_fp.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)); + let fp_ids: Vec = by_fp.iter().take(cfg.k).map(|&(i, _)| i).collect(); + ( + recall_at_k(&b2_ids, &truth[qi]), + recall_at_k(&b4_ids, &truth[qi]), + recall_at_k(&tau_ids, &truth[qi]), + recall_at_k(&fp_ids, &truth[qi]), + ) + }) + .collect(); + for (a, b, c, e) in &per_q { + r_b2 += *a as f64; + r_b4 += *b as f64; + r_tau += *c as f64; + r_fp += *e as f64; + } + let nqf = nq as f64; + println!("\narm bytes/vec R@{}", cfg.k); + println!("b2 asym {:>9} {:.4}", b2.bytes_per_vec(), r_b2 / nqf); + println!("b4 asym {:>9} {:.4}", b4.bytes_per_vec(), r_b4 / nqf); + println!( + "b2 + tau-rerank {:>9}* {:.4} (*ceiling: full stored ranks)", + b2.bytes_per_vec(), + r_tau / nqf + ); + println!("b2 + fp32-rerank {:>9} {:.4} (absolute ceiling)", "—", r_fp / nqf); + println!("\nVERDICT:"); + let (tau, b4v) = (r_tau / nqf, r_b4 / nqf); + if tau > b4v + 0.005 { + println!(" tau-rerank BEATS b4 at half the bytes ({tau:.4} > {b4v:.4}) — codec work justified"); + } else if tau + 0.005 < b4v { + println!(" b4 BEATS idealized tau-rerank ({b4v:.4} > {tau:.4}) — just use b4, no codec"); + } else { + println!(" tau-rerank ~= b4 ({tau:.4} vs {b4v:.4}) — tie even at the ceiling; b4 simpler"); + } +} + +/// FP32 brute-force cosine top-k (corpus + queries L2-normalized). Vec per query. +fn fp32_topk(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + (0..queries.len() / dim) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(usize, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di, q.iter().zip(d).map(|(a, b)| a * b).sum()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +fn load_corpus(cfg: &mut Cfg) -> (Vec, Vec) { + if let (Some(cp), Some(qp)) = (cfg.corpus_npy.clone(), cfg.queries_npy.clone()) { + let (mut c, n, d) = load_npy_f32(&cp); + let (mut q, nq, dq) = load_npy_f32(&qp); + assert_eq!(d, dq, "dim mismatch"); + l2_rows(&mut c, d); + l2_rows(&mut q, dq); + cfg.n = n; + cfg.dim = d; + cfg.n_queries = nq; + eprintln!("# loaded corpus {n}x{d}, queries {nq}x{dq}"); + return (c, q); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let (d, l) = (cfg.dim, cfg.latent); + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let mk = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut o = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + o[i] = acc; + } + o + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let c = rng.random_range(0..cfg.clusters); + corpus[i * d..(i + 1) * d].copy_from_slice(&mk(&protos[c * l..(c + 1) * l], 0.3, &mut rng)); + } + let mut queries = vec![0.0f32; cfg.n_queries * d]; + for i in 0..cfg.n_queries { + let c = rng.random_range(0..cfg.clusters); + queries[i * d..(i + 1) * d] + .copy_from_slice(&mk(&protos[c * l..(c + 1) * l], 0.1, &mut rng)); + } + l2_rows(&mut corpus, d); + l2_rows(&mut queries, d); + (corpus, queries) +} From 8b221986f0a12fd155651f59e5efb56d1d1f1a3b Mon Sep 17 00:00:00 2001 From: Todd Baur <523330+toadkicker@users.noreply.github.com> Date: Sat, 13 Jun 2026 14:27:50 -0700 Subject: [PATCH 09/16] Move Lean formalization to ordvec-formalization; reframe as research writeup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The vernier CRT seam-density Lean work belongs in the companion proof repo, not here. Removed benchmarks/lean/ from this branch and opened it as a draft there: ordvec-formalization#17 (kept out of that repo's sorry-free audited tree). Cross-links updated: benchmarks/README and crt_seam_oracle_results.md now point at the formalization PR. This branch is now purely the research writeup — probes, findings, proofs-on-paper, and the decisive bake-off negative — with no code destined for the ordvec crate. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur <523330+toadkicker@users.noreply.github.com> --- benchmarks/README.md | 2 +- benchmarks/crt_seam_oracle_results.md | 4 + benchmarks/lean/README.md | 59 ---------- benchmarks/lean/VernierSeamDensity.lean | 143 ------------------------ 4 files changed, 5 insertions(+), 203 deletions(-) delete mode 100644 benchmarks/lean/README.md delete mode 100644 benchmarks/lean/VernierSeamDensity.lean diff --git a/benchmarks/README.md b/benchmarks/README.md index 3fca212e..f73d37c2 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -24,7 +24,7 @@ tiered below by **what survived scrutiny**. Read the tiers, not all 11 docs. |-----|-------| | [density_collapse_results.md](density_collapse_results.md) | **Mechanism.** RankQuant b=2 density collapse = Hamming-near codes the scorer can't separate. Among those lookalikes, true neighbours have lower intra-code Kendall-tau (gap ≈ 0.04, CI > 0). Real but small. | | [tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md) | **The verdict.** Does that tau signal beat b=4? NO — b=4 wins even at the tau ceiling; tau scores below b=2's own ordering. Signal is real-but-inert; just use b=4. Closes the line: research, not a feature. | -| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 skeleton in [lean/](lean/). | +| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) (draft). | | [twonn_id_results.md](twonn_id_results.md) | TwoNN intrinsic-dimension probe (chord-metric fix, sphere-validated). `nomic-embed-text` ID ≈ 13 / 768. Estimator-bias caveat noted. | | [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. | diff --git a/benchmarks/crt_seam_oracle_results.md b/benchmarks/crt_seam_oracle_results.md index 5f691390..f15397d8 100644 --- a/benchmarks/crt_seam_oracle_results.md +++ b/benchmarks/crt_seam_oracle_results.md @@ -1,5 +1,9 @@ # CRT seam oracle — corrected vernier theorem (exhaustive finite proof) +> Lean 4 formalization of this theorem lives in the companion repo: +> [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) +> (draft — proof sketch with proof debt, kept out of that repo's sorry-free tree). + `examples/crt_seam_oracle.rs` enumerates the full ring Z/M to verify the within-axis vernier (coprime-grid) theorem exactly. This is the SPEC for the vernier recall experiment and the corrected target for any Lean formalization. diff --git a/benchmarks/lean/README.md b/benchmarks/lean/README.md deleted file mode 100644 index 422e4820..00000000 --- a/benchmarks/lean/README.md +++ /dev/null @@ -1,59 +0,0 @@ -# Lean formalization — vernier CRT seam-density - -`VernierSeamDensity.lean` formalizes the proved-on-paper seam-density theorem -(benchmarks/crt_seam_oracle_results.md) as a standalone module for the -`Fieldnote-Echo/ordvec-formalization` repo (Lean 4.28.0, Apache header, -`OrdvecFormalization` namespace). - -(An earlier `CrtSeamDensity.lean` draft was superseded by this module and removed.) - -## Placement note (corrected) - -This module does NOT build on the repo's `Finite*Quotient` / `QuotientKernel` -stack. That layer proves fiber-invariance/sufficiency of abstract quotient maps -`Ω → Z` (`productMap = Prod.map`). The seam-density theorem is a CRT *bijection* -`ZMod M ≃ ∏ ZMod mᵢ` plus a *cardinality count* — a different toolkit (Mathlib -`ZMod` + `Fintype.piFinset`), which the formalization repo does not otherwise -import. Correct home (that repo), independent module. - -## Status: convergent, not staged - -There is no `0.5.0` branch/tag in any Fieldnote-Echo repo (verified with auth). -Nelson independently reached similar conclusions; "target 0.5.0" means this is -intended for the next release, not that a branch exists to diff against. When his -working branch surfaces, reconcile theorem statements (impossibility + CRT -density) — convergence is corroboration for the paper. - -## What is established vs. open - -| Statement | Lean status | -|---|---| -| (A) product-space count `∏(2t+1)`, NO coprimality | structurally complete, modulo name-checks ([VERIFY] `Fintype.mem_piFinset`, `Fintype.card_piFinset`) | -| (B) position-space count over `ZMod M` | depends on `crtEquiv` (sorry) + `hcongr` (sorry) | -| one-coincidence-per-period (t=0) | follows from (B) | -| `card_admissible` (per-grid = 2t+1) | sorry — routine `ZMod.val` window injectivity | - -## The honest crux - -Coprimality is used in EXACTLY ONE place: `crtEquiv`, the indexed bijection -`ZMod (∏ m i) ≃ ∏ i, ZMod (m i)`. The product-space half (A) is true for any -family and deliberately carries no `cop` hypothesis — stating it that way -prevents the classic self-deception of "proving" a CRT theorem that never -actually touches coprimality. - -## Proof debt, ranked - -1. `crtEquiv` — fold `ZMod.chineseRemainder` over `Fin L` by induction, or - locate an existing indexed iso. Highest effort; the real mathematical content. -2. `card_admissible` — injectivity of `k ↦ (k : ZMod m) − t` on `{0..2t}` when - `2t+1 ≤ m`. Routine. -3. `hcongr` — cardinality transport across `crtEquiv`. Standard once (1) exists. -4. Confirm [VERIFY] names against installed Mathlib. - -## Build - -No Lean toolchain was present at drafting; names carry inline confidence tags. -Drop into the ordvec-formalization Mathlib project, `lake build`, resolve the -four debts above. de Finetti / rigidity theorems (rigidity_impossibility_proofs.md) -are NOT formalized here — they need Mathlib `ProbabilityTheory` and de Finetti, -which is a separate, larger project (de Finetti is not currently in Mathlib). diff --git a/benchmarks/lean/VernierSeamDensity.lean b/benchmarks/lean/VernierSeamDensity.lean deleted file mode 100644 index 974fd89a..00000000 --- a/benchmarks/lean/VernierSeamDensity.lean +++ /dev/null @@ -1,143 +0,0 @@ -/- -Copyright (c) 2026 ordvec authors. All rights reserved. -Released under Apache 2.0 license as described in the file LICENSE. -Authors: (ordvec contributors) --/ - --- NOTE: does NOT build on the Finite*Quotient stack. That layer is about --- fiber-invariance/sufficiency of quotient maps (Ω → Z); this theorem is a CRT --- bijection + cardinality count, a different toolkit. Standalone module in the --- OrdvecFormalization namespace, importing Mathlib ZMod directly (the repo does --- not otherwise use ZMod — confirmed against its module list). -import Mathlib.Data.ZMod.Basic -import Mathlib.Data.Nat.GCD.Basic -import Mathlib.Data.Fintype.Pi -import Mathlib.Data.Fintype.BigOperators - -open scoped BigOperators - -namespace OrdvecFormalization - -/-! -# Vernier seam density (within-axis coprime grids) - -For pairwise-coprime grid periods `m : Fin L → ℕ` on a single routing axis, the -positions in `ZMod M` (`M = ∏ m i`) that lie within band `t` of a seam on EVERY -grid number exactly `∏ i, (2t+1)`, hence density `∏ i, (2t+1)/(m i)` when -`2t+1 ≤ m i`. Verified exhaustively in `examples/crt_seam_oracle.rs`; the closed -form is proved on paper in `benchmarks/crt_seam_oracle_results.md`. - -Structure mirrors the repo's quotient idiom: the CRT residue map is a -`productMap`-style coordinate split, and pairwise coprimality is what makes the -position-space map a bijection (the only place coprimality is used). - -DRAFT — drafted without a Lean toolchain. `sorry`s mark confirmed proof debt, -ranked at the file end. Lean 4.28.0 / Mathlib pinned by the repo `lake-manifest`. --/ - -/-! ## Per-grid admissible residues -/ - -/-- Residues mod `m` within centered band `t` of `0`, as the image of the - window `{0,…,2t}` shifted by `-t`. -/ -def admissible (m t : ℕ) : Finset (ZMod m) := - (Finset.Icc 0 (2 * t)).image (fun k => (k : ZMod m) - (t : ZMod m)) - -/-- When the window fits (`2t+1 ≤ m`) the shift map is injective on it, so the - admissible set has exactly `2t+1` elements. -/ -theorem card_admissible {m t : ℕ} (h : 2 * t + 1 ≤ m) : - (admissible m t).card = 2 * t + 1 := by - -- card_image_of_injOn + injectivity of k ↦ (k:ZMod m) - t on a width-(2t+1) - -- window (no wraparound since width < m). Routine ZMod.val arithmetic. - sorry - -/-! ## Product-space count (coprimality NOT used) -/ - -/-- The joint blind spot over the residue PRODUCT space has cardinality - `∏ (2t+1)`. Carries no COPRIMALITY hypothesis (that is the honest statement - of what this half proves — `Fintype.card_piFinset`, not CRT), but it DOES - need `[∀ i, NeZero (m i)]`: `ZMod 0` is infinite and has no `Fintype` - instance, so `Finset.univ` / `Fintype.piFinset` would not typecheck without - it. (Caught by code review.) -/ -theorem blindspot_card_product - {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] (t : ℕ) (hband : ∀ i, 2 * t + 1 ≤ m i) : - (Finset.univ.filter - (fun r : ∀ i, ZMod (m i) => ∀ i, r i ∈ admissible (m i) t)).card - = ∏ i, (2 * t + 1) := by - have hfilter : - (Finset.univ.filter (fun r : ∀ i, ZMod (m i) => ∀ i, r i ∈ admissible (m i) t)) - = Fintype.piFinset (fun i => admissible (m i) t) := by - ext r; simp [Fintype.mem_piFinset] -- [VERIFY] Fintype.mem_piFinset - rw [hfilter, Fintype.card_piFinset] -- [VERIFY] Fintype.card_piFinset - exact Finset.prod_congr rfl (fun i _ => card_admissible (hband i)) - -/-! ## CRT bijection (the only place coprimality is used) -/ - -/-- Indexed CRT equivalence `ZMod (∏ m i) ≃ ∏ i, ZMod (m i)` for a - pairwise-coprime family. [CRUX] Construct by folding `ZMod.chineseRemainder` - over `Fin L`, or via an existing indexed iso. This is the load-bearing - object and the sole consumer of `cop`. - - `[∀ i, NeZero (m i)]` is REQUIRED, not cosmetic: without it some `m i = 0` - gives `ZMod 0 = ℤ` (infinite), so the cardinalities of the two sides do not - match and the Equiv is FALSE. (Caught in adversarial review.) Downstream this - is discharged from `2t+1 ≤ m i ⇒ 1 ≤ m i ⇒ NeZero (m i)`. -/ -def crtEquiv {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] - (cop : ∀ i j, i ≠ j → Nat.Coprime (m i) (m j)) : - ZMod (∏ i, m i) ≃ (∀ i, ZMod (m i)) := - sorry - -/-! ## Position-space count (the real theorem) -/ - -/-- Positions `x ∈ ZMod M` (M = ∏ m i) within band `t` of a seam on EVERY grid - number exactly `∏ (2t+1)`. Transports `blindspot_card_product` across - `crtEquiv`; coprimality is essential here (it is what makes `crtEquiv` a - bijection). Phase offsets translate each per-grid set without changing its - cardinality, so the count is phase-independent. -/ -theorem blindspot_card_positions - {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] - (cop : ∀ i j, i ≠ j → Nat.Coprime (m i) (m j)) - (t : ℕ) (hband : ∀ i, 2 * t + 1 ≤ m i) : - (Finset.univ.filter - (fun x : ZMod (∏ i, m i) => - ∀ i, (crtEquiv m cop x) i ∈ admissible (m i) t)).card - = ∏ i, (2 * t + 1) := by - have hcongr : - (Finset.univ.filter - (fun x : ZMod (∏ i, m i) => ∀ i, (crtEquiv m cop x) i ∈ admissible (m i) t)).card - = (Finset.univ.filter - (fun r : ∀ i, ZMod (m i) => ∀ i, r i ∈ admissible (m i) t)).card := by - -- Pushforward of the filter along the bijection crtEquiv. The position-space - -- predicate is literally (product-space predicate) ∘ crtEquiv, so transport - -- by `Fintype.card_congr (Equiv.subtypeEquiv (crtEquiv m cop) (fun _ => Iff.rfl))` - -- over the corresponding subtypes (or `Finset.card_nbij'` with crtEquiv / - -- crtEquiv.symm). NOTE: `Equiv.card_filter_map` does NOT exist (review). - sorry - rw [hcongr]; exact blindspot_card_product m t hband - -/-- Oracle Check 2: exactly one all-grids coincidence per period (`t = 0`). - `NeZero (m i)` follows from `1 ≤ m i`; we take it as an instance argument - consistent with `crtEquiv`'s requirement. -/ -theorem unique_coincidence_per_period - {L : ℕ} (m : Fin L → ℕ) [∀ i, NeZero (m i)] - (cop : ∀ i j, i ≠ j → Nat.Coprime (m i) (m j)) - (hpos : ∀ i, 1 ≤ m i) : - (Finset.univ.filter - (fun x : ZMod (∏ i, m i) => - ∀ i, (crtEquiv m cop x) i ∈ admissible (m i) 0)).card = 1 := by - simpa using blindspot_card_positions m cop 0 (by simpa using hpos) - -/-! ### Proof debt, ranked (post-review) - 1. `crtEquiv` — indexed CRT bijection; fold `ZMod.chineseRemainder` over - `Fin L`. THE crux, sole consumer of `cop`. Now carries the REQUIRED - `[∀ i, NeZero (m i)]` (without it `ZMod 0 = ℤ` makes the Equiv false). - 2. `card_admissible` — window injectivity (routine `ZMod.val`). Statement - confirmed correct by review (exact `2t+1` under `2t+1 ≤ m`). - 3. `hcongr` — cardinality transport via `Equiv.subtypeEquiv` + `Fintype.card_congr` - (NOT `Equiv.card_filter_map`, which does not exist). - 4. [VERIFY] names: `Fintype.mem_piFinset`, `Fintype.card_piFinset` (both resolve - per review). `Finset.Icc` is over ℕ then cast — no type error. - Product half `blindspot_card_product` is the ONLY structurally-complete piece - (modulo 4); the intended theorem (B) is entirely deferred to crux (1). Do not - represent this as "most content proved." -/ - -end OrdvecFormalization From 0f364bac12ab6611f4c10daf85dbc6548b2dd510 Mon Sep 17 00:00:00 2001 From: Nelson Spence Date: Tue, 16 Jun 2026 15:42:44 -0500 Subject: [PATCH 10/16] Maintainer review pass: relocate research to experiments/, fix reproducibility + honesty nits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Review of this research writeup before merge (no ordvec crate / public-API change — research, not a feature): - Relocate all 21 files (13 docs + 8 probe scripts) from benchmarks/ + examples/ into experiments/ordinal-routing-research/, and add experiments/ to [package].exclude. Keeps the research out of the published crate and off the compiled-examples / clippy path, and stops it mixing with the real BEIR harness now living in benchmarks/. - Regenerate the stale synthetic table in density_collapse_results.md from the current de-circularized density_collapse.rs (tau 0.3345/0.3656 and 0.1423/0.1602; gap + 95% CI) and drop the retracted win-rate column the doc itself disowns 30 lines later. - Honesty fixes: rigidity_impossibility_proofs.md drops a non-reproducing Σ²/L scalar (keeps the quotient argument, which is correct); README citation row lists only audited names (Ethayarajh, Broughan-Barnett); crt_seam doc updated to reflect ordvec-formalization#17 is an open, sorry-free PR (not a draft with proof debt); de-fragilized the doc-count phrasing. - README reframed for the new layout + a note on running the probes (copy a .rs into examples/ and cargo run --example). Signed-off-by: Nelson Spence --- Cargo.toml | 1 + .../ADVERSARIAL_REVIEW.md | 0 .../ordinal-routing-research}/README.md | 20 ++++++++++++++----- .../REAL_CORPUS_RUNBOOK.md | 0 .../conjecture_citation_audit.md | 0 .../crt_seam_oracle.rs | 0 .../crt_seam_oracle_results.md | 2 +- .../density_collapse.rs | 0 .../density_collapse_results.md | 16 ++++++++------- .../ordinal-routing-research}/embed_ollama.py | 0 .../ordinal-routing-research}/gen_corpus.rs | 0 .../rigidity_impossibility_proofs.md | 5 +++-- .../ordinal-routing-research}/shard_recall.rs | 0 .../shard_recall_results.md | 0 .../spectral_probe.rs | 0 .../tau_rerank_bakeoff.rs | 0 .../tau_rerank_bakeoff_results.md | 0 .../ordinal-routing-research}/twonn_id.rs | 0 .../twonn_id_results.md | 0 .../withdrawn/README.md | 0 .../withdrawn/corpus_zoo_results.md | 0 .../withdrawn/spectral_probe_results.md | 0 22 files changed, 29 insertions(+), 15 deletions(-) rename {benchmarks => experiments/ordinal-routing-research}/ADVERSARIAL_REVIEW.md (100%) rename {benchmarks => experiments/ordinal-routing-research}/README.md (82%) rename {benchmarks => experiments/ordinal-routing-research}/REAL_CORPUS_RUNBOOK.md (100%) rename {benchmarks => experiments/ordinal-routing-research}/conjecture_citation_audit.md (100%) rename {examples => experiments/ordinal-routing-research}/crt_seam_oracle.rs (100%) rename {benchmarks => experiments/ordinal-routing-research}/crt_seam_oracle_results.md (98%) rename {examples => experiments/ordinal-routing-research}/density_collapse.rs (100%) rename {benchmarks => experiments/ordinal-routing-research}/density_collapse_results.md (92%) rename {examples => experiments/ordinal-routing-research}/embed_ollama.py (100%) rename {examples => experiments/ordinal-routing-research}/gen_corpus.rs (100%) rename {benchmarks => experiments/ordinal-routing-research}/rigidity_impossibility_proofs.md (96%) rename {examples => experiments/ordinal-routing-research}/shard_recall.rs (100%) rename {benchmarks => experiments/ordinal-routing-research}/shard_recall_results.md (100%) rename {examples => experiments/ordinal-routing-research}/spectral_probe.rs (100%) rename {examples => experiments/ordinal-routing-research}/tau_rerank_bakeoff.rs (100%) rename {benchmarks => experiments/ordinal-routing-research}/tau_rerank_bakeoff_results.md (100%) rename {examples => experiments/ordinal-routing-research}/twonn_id.rs (100%) rename {benchmarks => experiments/ordinal-routing-research}/twonn_id_results.md (100%) rename {benchmarks => experiments/ordinal-routing-research}/withdrawn/README.md (100%) rename {benchmarks => experiments/ordinal-routing-research}/withdrawn/corpus_zoo_results.md (100%) rename {benchmarks => experiments/ordinal-routing-research}/withdrawn/spectral_probe_results.md (100%) diff --git a/Cargo.toml b/Cargo.toml index 43111e50..136ac495 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -42,6 +42,7 @@ exclude = [ "docs/FOLLOWUP_BODY_KERNEL_TIE_BREAK.md", "docs/INDEX_PROVENANCE.md", "docs/c-api.md", + "experiments/", "fuzz/", "ordvec-ffi/", "ordvec-go/", diff --git a/benchmarks/ADVERSARIAL_REVIEW.md b/experiments/ordinal-routing-research/ADVERSARIAL_REVIEW.md similarity index 100% rename from benchmarks/ADVERSARIAL_REVIEW.md rename to experiments/ordinal-routing-research/ADVERSARIAL_REVIEW.md diff --git a/benchmarks/README.md b/experiments/ordinal-routing-research/README.md similarity index 82% rename from benchmarks/README.md rename to experiments/ordinal-routing-research/README.md index f73d37c2..973b419b 100644 --- a/benchmarks/README.md +++ b/experiments/ordinal-routing-research/README.md @@ -1,12 +1,22 @@ # Ordinal-routing research — reviewer's guide Exploratory investigation into ordvec's **density behavior** and whether -prime/spectral structure can improve training-free routing. Everything here is -in `examples/` (runnable probes) and `benchmarks/` (findings + proofs) — **no -changes to the `ordvec` crate or its public API.** +prime/spectral structure can improve training-free routing. Everything lives in +`experiments/ordinal-routing-research/` — the findings/proofs (`*.md`) next to the +probes that produced them (`*.rs`, `embed_ollama.py`) — **no changes to the +`ordvec` crate or its public API.** This directory is `package.exclude`d, so it +ships with the source tree but not the published crate. + +> **Running the probes.** These were developed and run as Cargo examples. The +> `cargo run --release --example ` commands and `examples/…` paths +> throughout these docs refer to that original layout. To reproduce, copy the +> relevant `.rs` from this directory into the crate's `examples/` directory and +> run the command shown (they depend only on the existing `ordvec` / `rand` / +> `rayon` dev-deps). They are kept here as reference source, not as a compiled +> build target. Reviewed by three internal adversarial agents plus the PR bots; findings are -tiered below by **what survived scrutiny**. Read the tiers, not all 11 docs. +tiered below by **what survived scrutiny**. Read the tiers, not every doc. ## 3-minute path @@ -33,7 +43,7 @@ tiered below by **what survived scrutiny**. Read the tiers, not all 11 docs. | doc | status | |-----|--------| | [rigidity_impossibility_proofs.md](rigidity_impossibility_proofs.md) | The routing key is not number-variance-rigid (Thm 2/3, binomial `L(1-L/n)`). The over-broad "quantile optimal over all partitions" claim is **retracted** as a non-sequitur. | -| [conjecture_citation_audit.md](conjecture_citation_audit.md) | Citations verified by direct fetch (von Koch, Broughan-Barnett, Montgomery, Ethayarajh, etc.). | +| [conjecture_citation_audit.md](conjecture_citation_audit.md) | Citations verified by direct fetch (Ethayarajh, Broughan-Barnett, etc.); a few subagent confabulations caught and corrected. | ## WITHDRAWN — see [withdrawn/](withdrawn/) diff --git a/benchmarks/REAL_CORPUS_RUNBOOK.md b/experiments/ordinal-routing-research/REAL_CORPUS_RUNBOOK.md similarity index 100% rename from benchmarks/REAL_CORPUS_RUNBOOK.md rename to experiments/ordinal-routing-research/REAL_CORPUS_RUNBOOK.md diff --git a/benchmarks/conjecture_citation_audit.md b/experiments/ordinal-routing-research/conjecture_citation_audit.md similarity index 100% rename from benchmarks/conjecture_citation_audit.md rename to experiments/ordinal-routing-research/conjecture_citation_audit.md diff --git a/examples/crt_seam_oracle.rs b/experiments/ordinal-routing-research/crt_seam_oracle.rs similarity index 100% rename from examples/crt_seam_oracle.rs rename to experiments/ordinal-routing-research/crt_seam_oracle.rs diff --git a/benchmarks/crt_seam_oracle_results.md b/experiments/ordinal-routing-research/crt_seam_oracle_results.md similarity index 98% rename from benchmarks/crt_seam_oracle_results.md rename to experiments/ordinal-routing-research/crt_seam_oracle_results.md index f15397d8..b6293dbd 100644 --- a/benchmarks/crt_seam_oracle_results.md +++ b/experiments/ordinal-routing-research/crt_seam_oracle_results.md @@ -2,7 +2,7 @@ > Lean 4 formalization of this theorem lives in the companion repo: > [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) -> (draft — proof sketch with proof debt, kept out of that repo's sorry-free tree). +> (open PR, `sorry`-free). `examples/crt_seam_oracle.rs` enumerates the full ring Z/M to verify the within-axis vernier (coprime-grid) theorem exactly. This is the SPEC for the diff --git a/examples/density_collapse.rs b/experiments/ordinal-routing-research/density_collapse.rs similarity index 100% rename from examples/density_collapse.rs rename to experiments/ordinal-routing-research/density_collapse.rs diff --git a/benchmarks/density_collapse_results.md b/experiments/ordinal-routing-research/density_collapse_results.md similarity index 92% rename from benchmarks/density_collapse_results.md rename to experiments/ordinal-routing-research/density_collapse_results.md index b2387ced..ebbe6a51 100644 --- a/benchmarks/density_collapse_results.md +++ b/experiments/ordinal-routing-research/density_collapse_results.md @@ -26,15 +26,17 @@ Within each probe's b=2 "lookalike" set (M=40 Hamming-nearest), split by TRUE FP32 cosine into the true-neighbour half vs the far-lookalike half. Compare the top-k (k=16) Kendall-tau distance of the probe's coordinate ORDER to each: -| density | true-nbr cosine | true-nbr tau | far cosine | far tau | probes tau_true 0, K_r(cx) = K_r(x). So any structure a corpus encodes purely in vector NORMS is invisible to the key. In particular the `projected-rigid` construction x_i = t_i·u collapses under normalization to -sign(t_i)·u ∈ {+u, −u}: 50k rigid magnitudes become 2 distinct keys. Measured -Σ²/L = 1936 (degenerate), exactly as the quotient predicts. ∎ +sign(t_i)·u ∈ {+u, −u}: the 50k rigid magnitudes collapse toward ≈2 angular keys +(±u, modulo the small orthogonal jitter the generator adds), so the key sees a +near-degenerate occupancy exactly as the quotient predicts. ∎ Consequence: rigidity must live in the ANGULAR distribution on the sphere S^{d-1}, or it does not exist for the key. diff --git a/examples/shard_recall.rs b/experiments/ordinal-routing-research/shard_recall.rs similarity index 100% rename from examples/shard_recall.rs rename to experiments/ordinal-routing-research/shard_recall.rs diff --git a/benchmarks/shard_recall_results.md b/experiments/ordinal-routing-research/shard_recall_results.md similarity index 100% rename from benchmarks/shard_recall_results.md rename to experiments/ordinal-routing-research/shard_recall_results.md diff --git a/examples/spectral_probe.rs b/experiments/ordinal-routing-research/spectral_probe.rs similarity index 100% rename from examples/spectral_probe.rs rename to experiments/ordinal-routing-research/spectral_probe.rs diff --git a/examples/tau_rerank_bakeoff.rs b/experiments/ordinal-routing-research/tau_rerank_bakeoff.rs similarity index 100% rename from examples/tau_rerank_bakeoff.rs rename to experiments/ordinal-routing-research/tau_rerank_bakeoff.rs diff --git a/benchmarks/tau_rerank_bakeoff_results.md b/experiments/ordinal-routing-research/tau_rerank_bakeoff_results.md similarity index 100% rename from benchmarks/tau_rerank_bakeoff_results.md rename to experiments/ordinal-routing-research/tau_rerank_bakeoff_results.md diff --git a/examples/twonn_id.rs b/experiments/ordinal-routing-research/twonn_id.rs similarity index 100% rename from examples/twonn_id.rs rename to experiments/ordinal-routing-research/twonn_id.rs diff --git a/benchmarks/twonn_id_results.md b/experiments/ordinal-routing-research/twonn_id_results.md similarity index 100% rename from benchmarks/twonn_id_results.md rename to experiments/ordinal-routing-research/twonn_id_results.md diff --git a/benchmarks/withdrawn/README.md b/experiments/ordinal-routing-research/withdrawn/README.md similarity index 100% rename from benchmarks/withdrawn/README.md rename to experiments/ordinal-routing-research/withdrawn/README.md diff --git a/benchmarks/withdrawn/corpus_zoo_results.md b/experiments/ordinal-routing-research/withdrawn/corpus_zoo_results.md similarity index 100% rename from benchmarks/withdrawn/corpus_zoo_results.md rename to experiments/ordinal-routing-research/withdrawn/corpus_zoo_results.md diff --git a/benchmarks/withdrawn/spectral_probe_results.md b/experiments/ordinal-routing-research/withdrawn/spectral_probe_results.md similarity index 100% rename from benchmarks/withdrawn/spectral_probe_results.md rename to experiments/ordinal-routing-research/withdrawn/spectral_probe_results.md From 4b0fe448c994174c796d98ec0a04631683b7da99 Mon Sep 17 00:00:00 2001 From: Nelson Spence Date: Tue, 16 Jun 2026 16:16:07 -0500 Subject: [PATCH 11/16] docs: reconcile TwoNN evidence + Lean-PR status across the writeup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex stop-gate: the TwoNN intrinsic-dimension result was over-evidenced. twonn_id_results.md is self-labeled "PARTIAL" (OLS-through-origin estimator, biased, "prefer MLE before quoting exact values", no real-data measurement — only synthetic + sphere validations), yet the front-door README listed it under SOUND ("proven or real-data confirmed") and asserted nomic-embed-text "ID ≈ 13". - Move the twonn_id row from SOUND to THEORY, marked PARTIAL: the chord-metric fix is sound (sphere-validated), but the ID is an indicative low-tens LOWER BOUND, not a confirmed value. - Soften the "≈ 13" assertions in README + density_collapse_results.md to "low tens (lower bound)". - Fix the README front-door crt_seam row that still said "(draft)" — the Lean PR is open + sorry-free (matches the crt_seam doc). Signed-off-by: Nelson Spence --- .../ordinal-routing-research/README.md | 4 ++-- .../density_collapse_results.md | 8 +++++-- .../twonn_id_results.md | 22 +++++++++++++------ 3 files changed, 23 insertions(+), 11 deletions(-) diff --git a/experiments/ordinal-routing-research/README.md b/experiments/ordinal-routing-research/README.md index 973b419b..39a40499 100644 --- a/experiments/ordinal-routing-research/README.md +++ b/experiments/ordinal-routing-research/README.md @@ -34,8 +34,7 @@ tiered below by **what survived scrutiny**. Read the tiers, not every doc. |-----|-------| | [density_collapse_results.md](density_collapse_results.md) | **Mechanism.** RankQuant b=2 density collapse = Hamming-near codes the scorer can't separate. Among those lookalikes, true neighbours have lower intra-code Kendall-tau (gap ≈ 0.04, CI > 0). Real but small. | | [tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md) | **The verdict.** Does that tau signal beat b=4? NO — b=4 wins even at the tau ceiling; tau scores below b=2's own ordering. Signal is real-but-inert; just use b=4. Closes the line: research, not a feature. | -| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) (draft). | -| [twonn_id_results.md](twonn_id_results.md) | TwoNN intrinsic-dimension probe (chord-metric fix, sphere-validated). `nomic-embed-text` ID ≈ 13 / 768. Estimator-bias caveat noted. | +| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) (open PR, `sorry`-free). | | [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. | ## THEORY — directionally right, restated honestly @@ -44,6 +43,7 @@ tiered below by **what survived scrutiny**. Read the tiers, not every doc. |-----|--------| | [rigidity_impossibility_proofs.md](rigidity_impossibility_proofs.md) | The routing key is not number-variance-rigid (Thm 2/3, binomial `L(1-L/n)`). The over-broad "quantile optimal over all partitions" claim is **retracted** as a non-sequitur. | | [conjecture_citation_audit.md](conjecture_citation_audit.md) | Citations verified by direct fetch (Ethayarajh, Broughan-Barnett, etc.); a few subagent confabulations caught and corrected. | +| [twonn_id_results.md](twonn_id_results.md) | ⚠️ PARTIAL. The chord-metric fix is sound (sphere-validated to ID ~12); the OLS-through-origin estimator is biased (not MLE), and **no clean real-corpus ID is recorded here**. A low-tens sentence-transformer ID is a hypothesis by cross-domain analogy (Ansuini's low-tens are vision CNNs, not sentence encoders) — not established or measured in this branch. | ## WITHDRAWN — see [withdrawn/](withdrawn/) diff --git a/experiments/ordinal-routing-research/density_collapse_results.md b/experiments/ordinal-routing-research/density_collapse_results.md index ebbe6a51..e1801a26 100644 --- a/experiments/ordinal-routing-research/density_collapse_results.md +++ b/experiments/ordinal-routing-research/density_collapse_results.md @@ -56,8 +56,12 @@ with `nomic-embed-text` (GPU-resident via ollama). Generator: `examples/embed_ollama.py`. Run: `density_collapse --corpus-npy repo_real.npy`. First real-data facts: -- TwoNN intrinsic dimension of nomic-embed-text ≈ 13 (ambient 768) — our own - measurement, squarely in the predicted low-tens range. +- A low intrinsic dimension for nomic-embed-text is a working hypothesis, not a + measurement: deep vision-CNN IDs are low-tens (Ansuini), but there is no + published sentence-transformer figure, and our TwoNN probe (sphere-validated + only; biased OLS estimate) records no clean nomic ID here. Treat low-tens as a + cross-domain analogy that motivates the density test, not an established fact + (see `twonn_id_results.md`). - Real embeddings are FAR more entangled at b=2 than synthetic clusters: mean nearest-code Hamming 314/768, and ~5083 of 8665 docs sit in each probe's b2-lookalike shell (vs ~60–120 synthetic). Real geometry, much denser collapse. diff --git a/experiments/ordinal-routing-research/twonn_id_results.md b/experiments/ordinal-routing-research/twonn_id_results.md index 1e7b3428..231bef66 100644 --- a/experiments/ordinal-routing-research/twonn_id_results.md +++ b/experiments/ordinal-routing-research/twonn_id_results.md @@ -7,9 +7,15 @@ # TwoNN intrinsic dimension probe -TwoNN estimator (Facco et al. 2017) to measure the intrinsic dimension of an -embedding corpus directly — closing the citation gap (no published TwoNN figure -for sentence-transformers) with our own measurement. +TwoNN estimator (Facco et al. 2017) for measuring the intrinsic dimension of an +embedding corpus directly. The aim was to close the citation gap (no published +TwoNN figure for sentence-transformers) — but this branch records only the +synthetic sphere/cluster validations below; **no clean real sentence-transformer +ID is measured here** (the estimator is biased; see the caveat above). There is +no published TwoNN figure for sentence-transformers; deep vision-CNN intrinsic +dimension is low-tens (Ansuini, arXiv:1905.12784) but that is a cross-domain +analogy, not sentence-transformer evidence — so this branch establishes no +sentence-transformer ID. Source: `examples/twonn_id.rs`. μ = r2/r1 (2nd/1st NN distance ratio), Pareto fit d = Σ(log μ · -log(1-F)) / Σ(log μ)², top-10% tail trimmed. Anchors sampled @@ -38,8 +44,9 @@ Verdict: **calibrated and trustworthy through ID ~12** (dense sampling at n=20k). Above that, finite-sample deflation sets in (ID=20→16.9, ID=50→35) — the curse of dimensionality floor on ALL NN-based ID estimators: ~exp(d) points needed to populate a d-dim neighborhood. Reported IDs in the tens are therefore -LOWER BOUNDS; true ID may be higher. (This also implies Ansuini's verified -12-25 last-layer figures are themselves likely deflated — consistent direction.) +LOWER BOUNDS; true ID may be higher. (The same finite-sample deflation would +also affect Ansuini's 12-25 vision-CNN last-layer figures — a methodological +note, not evidence those figures transfer to sentence-transformers.) ## Measurements @@ -56,6 +63,7 @@ lower-dim than the global latent space. Run: `cargo run --release --example twonn_id -- --corpus-npy your_embeddings.npy` Loads 2-D Date: Tue, 16 Jun 2026 15:25:23 -0700 Subject: [PATCH 12/16] =?UTF-8?q?Research=20round=202:=20oblivious-directi?= =?UTF-8?q?ons=20arc=20=E2=80=94=20pre-registered=20negative=20across=205?= =?UTF-8?q?=20encoders?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tests whether data-oblivious low-discrepancy structure (golden-angle, Fibonacci, Sobol, Kronecker) improves training-free routing. Eight pre-registered probes (thresholds fixed in source headers before running), real embeddings, GPU-measured. Lives in experiments/ordinal-routing-research/ alongside the round-1 probes. Verdict: CLASS-DEAD, replicated. Oblivious low-discrepancy directions do not beat iid-random for routing across nomic / bge-m3 / bge-large / snowflake-arctic-v2 / harrier-oss at real intrinsic dim 18-24. The one mid-ladder flicker (bge-m3 sobol +0.024) failed to replicate at near-identical ID — caught by the >=2-corpora rule. Probes + findings (oblivious_directions_results.md): - uniformity_lemma: rank transform whitens marginals -> equal-width bucketing is entropy-optimal; golden boundaries strictly waste bits. Marginal-only — blind to the joint structure the directions probes target. - overlap_decomp: ordvec's hypergeometric null is ~100% cone (hubness) on real data; per-coord centering removes it and amplifies true-nbr overlap 2-5x. - centering_recall: centering helps b=1 prefilter (+0.03..0.07 R@10) but FAILS at b=4 (-0.07..-0.15, penalty grows with encoder capacity). Replicated x4 corpora. - subspace_directions: center -> project to k-dim PCA subspace -> {random, sobol, kronecker, pca-axes}. Oblivious sequences tie/lose random; only data-aligned (pca-axes) leads at higher ID. The lever is data-alignment, which training-free forbids. - partition_balance: centering balances coarse cells (BALANCE pass) but not their alignment (PRUNING fail) — no sublinear win. Resolves the twonn_id PARTIAL: real-corpus ID measured ~18-24 across 5 encoders, and intrinsic dimension is a CORPUS property (repo~13 vs fiqa~24, same encoder), not an encoder constant. Corrects an in-session "ID~13" anchoring error rather than dropping it. No crate or public-API changes — experiments/ only. Co-Authored-By: Claude Opus 4.8 Signed-off-by: Todd Baur --- .../ordinal-routing-research/README.md | 1 + .../centering_recall.rs | 221 +++++++ .../fib_address_gate.rs | 277 +++++++++ .../fib_directions.rs | 573 ++++++++++++++++++ .../ordinal-routing-research/fib_reduction.rs | 395 ++++++++++++ .../oblivious_directions_results.md | 256 ++++++++ .../overlap_decomp.rs | 279 +++++++++ .../partition_balance.rs | 344 +++++++++++ .../subspace_directions.rs | 545 +++++++++++++++++ .../uniformity_lemma.rs | 219 +++++++ 10 files changed, 3110 insertions(+) create mode 100644 experiments/ordinal-routing-research/centering_recall.rs create mode 100644 experiments/ordinal-routing-research/fib_address_gate.rs create mode 100644 experiments/ordinal-routing-research/fib_directions.rs create mode 100644 experiments/ordinal-routing-research/fib_reduction.rs create mode 100644 experiments/ordinal-routing-research/oblivious_directions_results.md create mode 100644 experiments/ordinal-routing-research/overlap_decomp.rs create mode 100644 experiments/ordinal-routing-research/partition_balance.rs create mode 100644 experiments/ordinal-routing-research/subspace_directions.rs create mode 100644 experiments/ordinal-routing-research/uniformity_lemma.rs diff --git a/experiments/ordinal-routing-research/README.md b/experiments/ordinal-routing-research/README.md index 39a40499..4121bcff 100644 --- a/experiments/ordinal-routing-research/README.md +++ b/experiments/ordinal-routing-research/README.md @@ -36,6 +36,7 @@ tiered below by **what survived scrutiny**. Read the tiers, not every doc. | [tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md) | **The verdict.** Does that tau signal beat b=4? NO — b=4 wins even at the tau ceiling; tau scores below b=2's own ordering. Signal is real-but-inert; just use b=4. Closes the line: research, not a feature. | | [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) (open PR, `sorry`-free). | | [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. | +| [oblivious_directions_results.md](oblivious_directions_results.md) | **The directions arc (round 2).** Data-oblivious low-discrepancy directions (golden-angle / Sobol / Kronecker) do NOT beat iid-random for training-free routing — across 5 encoders (nomic, bge-m3, bge-large, snowflake-arctic-v2, harrier-oss) at real intrinsic dim 18–24. CLASS-DEAD, pre-registered, replicated (the one mid-ladder flicker failed to replicate). Centering removes the cone but fails at b=4 (penalty grows with capacity). One robust positive: data-aligned (PCA) directions lead at higher ID — the lever is data-alignment, which training-free forbids. Also **resolves the twonn_id PARTIAL**: real-corpus ID measured at ~18–24 across 5 encoders, and ID is a **corpus** property (repo≈13 vs fiqa≈24, same encoder), not an encoder constant. Probes: `uniformity_lemma.rs`, `overlap_decomp.rs`, `centering_recall.rs`, `subspace_directions.rs`, `partition_balance.rs`, `fib_*.rs`. | ## THEORY — directionally right, restated honestly diff --git a/experiments/ordinal-routing-research/centering_recall.rs b/experiments/ordinal-routing-research/centering_recall.rs new file mode 100644 index 00000000..4d125040 --- /dev/null +++ b/experiments/ordinal-routing-research/centering_recall.rs @@ -0,0 +1,221 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — the DECISIVE gate for the centering finding. +//! overlap_decomp.rs showed per-coord mean-centering removes the cone and +//! amplifies true-neighbour overlap 2-5x. Necessary but NOT sufficient: the only +//! metric that promotes this to an ordvec change is RECALL at matched bytes, +//! and it must survive at b=4 (the incumbent that beat every other idea on this +//! branch). Isolated by request: touches no harness, writes nothing. +//! +//! TEST: full-scan RankQuant-style retrieval, R@10 vs FP32 cosine ground truth. +//! Encode = dimension-wise rank -> bucket into 2^B equal-width bins (B bits/coord). +//! Score = -L1 distance between B-bin codes (symmetric rank metric; identical +//! scoring for both arms, so only the ENCODING differs). +//! Ground truth = FP32 cosine top-10 on L2-normalized RAW vectors (the real task; +//! centering changes the CODE, never the target). +//! Two arms: +//! raw : rank the L2-normalized vector directly. +//! centered : subtract the per-coord CORPUS mean (a stored, data-oblivious +//! statistic — no codebook, no per-query training) before ranking. +//! The SAME mean is applied to queries (deployable asymmetric setup). +//! Also reports the two-stage candidate-recall (CR@M): fraction of FP32 top-10 +//! captured in the top-M by code distance — the bitmap-prefilter-relevant number. +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! FEATURE-WORTHY if centered R@10 - raw R@10 >= 0.02 at b=2 AND >= 0.02 at b=4 +//! (must survive at the incumbent b=4), AND centered never worse than raw by +//! > 0.005 at any B. +//! PARTIAL if it helps at low bits (b<=2) but the b=4 gain is < 0.02 (a +//! low-bit-only tweak, not a headline change). +//! FAIL if centered <= raw at b=4, or net-negative anywhere material. +//! Report actual R@10 and CR@M for both arms regardless. +//! +//! Run: cargo run --release --example centering_recall -- \ +//! --corpus-npy /tmp/repo_corpus.npy --queries-npy /tmp/repo_q.npy + +use rayon::prelude::*; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn coord_mean(corpus: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += corpus[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// Bucketed-rank code: rank each (optionally mean-subtracted) coord, then bucket +/// the rank into 2^bits equal-width bins. One u8 bucket id per coord. +fn encode(rows: &[f32], n: usize, dim: usize, bits: u32, sub: Option<&[f32]>) -> Vec { + let nb = 1u32 << bits; + let mut out = vec![0u8; n * dim]; + out.par_chunks_mut(dim).enumerate().for_each(|(r, code)| { + let mut idx: Vec = (0..dim as u16).collect(); + let val = |c: u16| rows[r * dim + c as usize] - sub.map_or(0.0, |s| s[c as usize]); + idx.sort_by(|&a, &b| val(a).partial_cmp(&val(b)).unwrap()); + for (rank, &c) in idx.iter().enumerate() { + let bucket = (rank as u32 * nb) / dim as u32; + code[c as usize] = bucket as u8; + } + }); + out +} + +fn l1(a: &[u8], b: &[u8]) -> i64 { + a.iter() + .zip(b) + .map(|(x, y)| (*x as i64 - *y as i64).abs()) + .sum() +} + +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +/// Full-scan R@10 and candidate-recall@M for a given arm's codes. +fn evaluate( + ccode: &[u8], + qcode: &[u8], + n: usize, + nq: usize, + dim: usize, + truth: &[Vec], + m: usize, +) -> (f64, f64) { + let res: Vec<(f64, f64)> = (0..nq) + .into_par_iter() + .map(|qi| { + let qc = &qcode[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, i64)> = (0..n) + .map(|di| (di as u32, l1(qc, &ccode[di * dim..(di + 1) * dim]))) + .collect(); + s.sort_by_key(|x| x.1); + let truth_set: std::collections::HashSet = truth[qi].iter().copied().collect(); + // R@10: top-10 by code distance that are in FP32 top-10 + let top10: std::collections::HashSet = s.iter().take(10).map(|x| x.0).collect(); + let r10 = top10.intersection(&truth_set).count() as f64 / truth[qi].len() as f64; + // CR@M: FP32 top-10 captured within top-M by code distance + let topm: std::collections::HashSet = s.iter().take(m).map(|x| x.0).collect(); + let cr = truth[qi].iter().filter(|t| topm.contains(t)).count() as f64 + / truth[qi].len() as f64; + (r10, cr) + }) + .collect(); + let r10 = res.iter().map(|x| x.0).sum::() / nq as f64; + let cr = res.iter().map(|x| x.1).sum::() / nq as f64; + (r10, cr) +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut m_arg = 100usize; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + "--m" => m_arg = a.next().unwrap().parse().unwrap(), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + let truth = ground_truth(&corpus, &queries, dim, 10); + let mean = coord_mean(&corpus, n, dim); + let m = m_arg; // CR@M operating point (default 100, --m to override) + + println!("# centering_recall (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}, CR@M M={m}"); + println!("# R@10 vs FP32 cosine top-10; symmetric -L1 on B-bin rank codes; only ENCODING differs between arms."); + println!("B\tbytes/vec\traw_R@10\tcen_R@10\tΔR@10\traw_CR\tcen_CR\tΔCR"); + let mut deltas = vec![]; + for &bits in &[1u32, 2, 4] { + let bytes = dim * bits as usize / 8; + let craw = encode(&corpus, n, dim, bits, None); + let qraw = encode(&queries, nq, dim, bits, None); + let ccen = encode(&corpus, n, dim, bits, Some(&mean)); + let qcen = encode(&queries, nq, dim, bits, Some(&mean)); + let (r_raw, cr_raw) = evaluate(&craw, &qraw, n, nq, dim, &truth, m); + let (r_cen, cr_cen) = evaluate(&ccen, &qcen, n, nq, dim, &truth, m); + let d_r = r_cen - r_raw; + let d_cr = cr_cen - cr_raw; + deltas.push((bits, d_r)); + println!("{bits}\t{bytes}\t{r_raw:.4}\t{r_cen:.4}\t{d_r:+.4}\t{cr_raw:.4}\t{cr_cen:.4}\t{d_cr:+.4}"); + } + println!("\n# PRE-REGISTERED: FEATURE-WORTHY if ΔR@10 >= +0.02 at b=2 AND b=4, and never worse than -0.005."); + let b2 = deltas.iter().find(|x| x.0 == 2).unwrap().1; + let b4 = deltas.iter().find(|x| x.0 == 4).unwrap().1; + let worst = deltas.iter().map(|x| x.1).fold(f64::INFINITY, f64::min); + let verdict = if b2 >= 0.02 && b4 >= 0.02 && worst >= -0.005 { + "FEATURE-WORTHY" + } else if b4 < 0.02 && deltas.iter().any(|x| x.1 >= 0.02) { + "PARTIAL (low-bit only; does not survive at incumbent b=4)" + } else { + "FAIL" + }; + println!("# VERDICT: {verdict} (Δb2={b2:+.4}, Δb4={b4:+.4}, worst={worst:+.4})"); +} diff --git a/experiments/ordinal-routing-research/fib_address_gate.rs b/experiments/ordinal-routing-research/fib_address_gate.rs new file mode 100644 index 00000000..ba2fbdf3 --- /dev/null +++ b/experiments/ordinal-routing-research/fib_address_gate.rs @@ -0,0 +1,277 @@ +//! CONTEXT-BOXED, PRE-REGISTERED PROBE — isolated by request. +//! +//! This is a standalone curiosity gate. It does NOT touch the routing harness, +//! does NOT write to benchmarks/, and its result MUST NOT be used to reframe any +//! other experiment on this branch. It answers ONE question and stops. +//! +//! QUESTION (the user's hypothesis, sharpened): +//! Is there a closed-form golden-angle ("sine spiral") ADDRESS on the data such +//! that address-distance preserves cosine locality? If yes, retrieval could be +//! COMPUTED (invert the address) instead of INDEXED (search). If no, the spiral +//! address is inert — the same wall the ambient-probe experiment hit. +//! +//! METHOD (ceiling test, like the tau bake-off): +//! Tile dim into d/3 triples. For a unit vector, each triple -> point on S^2; +//! its ADDRESS on that triple is the nearest golden-spiral index (argmin over R +//! spiral points — the EXACT nearest, an upper bound on what a closed-form +//! inverse could achieve). Address(v) = the tuple of per-triple indices. +//! Address-distance = sum of |Δindex| across triples. +//! Primary metric: Spearman rho(address-distance, FP32 cosine-distance) over a +//! random sample of corpus pairs. +//! +//! CONTROLS (golden must beat BOTH to mean anything): +//! - random-pts : same machinery, spiral replaced by R random S^2 points +//! (isolates "does the GOLDEN structure help" vs "any quantizer"). +//! - sign : ordvec's existing lever — sign bits, Hamming distance vs cosine +//! (isolates "does the spiral beat what ordvec already has"). +//! +//! PRE-REGISTERED VERDICT (fixed BEFORE running — do not move): +//! PASS : golden rho >= 0.50 AND golden beats both controls by >= 0.05. +//! => locality-preserving address; computed-retrieval worth pursuing. +//! INCONCLUSIVE : 0.30 <= golden rho < 0.50, or golden within 0.05 of a control. +//! FAIL : golden rho < 0.30, or golden <= either control. +//! => spiral address is inert; stop, same wall as ambient probes. +//! +//! Run: cargo run --release --example fib_address_gate -- \ +//! --corpus-npy /tmp/repo_corpus.npy +//! Output is a single table; nothing is committed. + +// Research probe: the header keeps aligned multi-line doc lists for readability. +#![allow(clippy::doc_overindented_list_items)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Golden-angle spiral point i of R on S^2 (the "sine spiral" address grid). +fn spiral_point(i: usize, r: usize) -> [f32; 3] { + let golden = std::f32::consts::PI * (3.0 - 5.0f32.sqrt()); + let z = if r == 1 { + 0.0 + } else { + 1.0 - 2.0 * (i as f32 + 0.5) / r as f32 + }; + let rad = (1.0 - z * z).max(0.0).sqrt(); + let theta = i as f32 * golden; + [rad * theta.cos(), rad * theta.sin(), z] +} + +/// R random points on S^2 (the structure-free control grid), seeded deterministically. +fn random_points(r: usize) -> Vec<[f32; 3]> { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x5A5A_5A5A); + (0..r) + .map(|_| { + let mut p = [gauss(&mut rng), gauss(&mut rng), gauss(&mut rng)]; + let n = (p[0] * p[0] + p[1] * p[1] + p[2] * p[2]).sqrt(); + for x in p.iter_mut() { + *x /= n; + } + p + }) + .collect() +} + +/// Per-triple address: tuple of nearest-grid-point indices (argmin = exact nearest, +/// the CEILING for any closed-form inverse). grid is the R points on S^2. +fn address(v: &[f32], dim: usize, grid: &[[f32; 3]]) -> Vec { + let n_tri = dim / 3; + let mut addr = Vec::with_capacity(n_tri); + for t in 0..n_tri { + let s = &v[t * 3..t * 3 + 3]; + let nrm = (s[0] * s[0] + s[1] * s[1] + s[2] * s[2]).sqrt().max(1e-12); + let p = [s[0] / nrm, s[1] / nrm, s[2] / nrm]; + // nearest grid point by max dot (points are unit) == min distance + let mut best = 0usize; + let mut best_dot = f32::MIN; + for (gi, g) in grid.iter().enumerate() { + let d = p[0] * g[0] + p[1] * g[1] + p[2] * g[2]; + if d > best_dot { + best_dot = d; + best = gi; + } + } + addr.push(best as u16); + } + addr +} + +fn addr_dist(a: &[u16], b: &[u16]) -> f32 { + a.iter() + .zip(b) + .map(|(x, y)| (*x as i32 - *y as i32).unsigned_abs() as f32) + .sum() +} + +fn sign_bits(v: &[f32]) -> Vec { + let mut bits = vec![0u64; v.len().div_ceil(64)]; + for (i, &x) in v.iter().enumerate() { + if x >= 0.0 { + bits[i / 64] |= 1u64 << (i % 64); + } + } + bits +} + +fn hamming(a: &[u64], b: &[u64]) -> f32 { + a.iter() + .zip(b) + .map(|(x, y)| (x ^ y).count_ones()) + .sum::() as f32 +} + +/// Spearman rho = Pearson on ranks. Ties broken by stable index (adequate for a gate). +fn spearman(xs: &[f32], ys: &[f32]) -> f32 { + let rank = |v: &[f32]| -> Vec { + let mut idx: Vec = (0..v.len()).collect(); + idx.sort_by(|&a, &b| v[a].partial_cmp(&v[b]).unwrap()); + let mut r = vec![0.0f32; v.len()]; + for (rank, &i) in idx.iter().enumerate() { + r[i] = rank as f32; + } + r + }; + let rx = rank(xs); + let ry = rank(ys); + let n = rx.len() as f32; + let mx = rx.iter().sum::() / n; + let my = ry.iter().sum::() / n; + let mut cov = 0.0; + let mut vx = 0.0; + let mut vy = 0.0; + for i in 0..rx.len() { + let dx = rx[i] - mx; + let dy = ry[i] - my; + cov += dx * dy; + vx += dx * dx; + vy += dy * dy; + } + cov / (vx.sqrt() * vy.sqrt()).max(1e-12) +} + +fn main() { + let mut corpus_path = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + if x == "--corpus-npy" { + corpus_path = Some(a.next().unwrap()); + } + } + let path = corpus_path.expect("--corpus-npy required (real embeddings)"); + let (mut corpus, n, dim) = load_npy_f32(&path); + l2_normalize_rows(&mut corpus, dim); + let n_pairs = 200_000usize; + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let pairs: Vec<(usize, usize)> = (0..n_pairs) + .map(|_| { + let i = rng.random_range(0..n); + let mut j = rng.random_range(0..n); + if j == i { + j = (j + 1) % n; + } + (i, j) + }) + .collect(); + + let row = |i: usize| &corpus[i * dim..(i + 1) * dim]; + let cos_dist: Vec = pairs + .par_iter() + .map(|&(i, j)| { + let d: f32 = row(i).iter().zip(row(j)).map(|(a, b)| a * b).sum(); + 1.0 - d + }) + .collect(); + + println!( + "# fib_address_gate (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, pairs {n_pairs}" + ); + println!("# Spearman rho vs FP32 cosine-distance (higher = better locality preservation)"); + println!("grid\tR\trho"); + + // sign baseline (no R) + let signs: Vec> = (0..n).into_par_iter().map(|i| sign_bits(row(i))).collect(); + let sign_d: Vec = pairs + .par_iter() + .map(|&(i, j)| hamming(&signs[i], &signs[j])) + .collect(); + let rho_sign = spearman(&sign_d, &cos_dist); + println!("sign\t-\t{rho_sign:.4}"); + + let mut best_golden = f32::MIN; + for &r in &[8usize, 32, 128] { + for (name, grid) in [ + ( + "golden", + (0..r).map(|i| spiral_point(i, r)).collect::>(), + ), + ("random-pts", random_points(r)), + ] { + let addrs: Vec> = (0..n) + .into_par_iter() + .map(|i| address(row(i), dim, &grid)) + .collect(); + let ad: Vec = pairs + .par_iter() + .map(|&(i, j)| addr_dist(&addrs[i], &addrs[j])) + .collect(); + let rho = spearman(&ad, &cos_dist); + if name == "golden" && rho > best_golden { + best_golden = rho; + } + println!("{name}\t{r}\t{rho:.4}"); + } + } + println!("\n# best golden rho = {best_golden:.4}"); + println!("# PRE-REGISTERED: PASS if >=0.50 and beats controls by >=0.05; FAIL if <0.30 or <= a control."); +} diff --git a/experiments/ordinal-routing-research/fib_directions.rs b/experiments/ordinal-routing-research/fib_directions.rs new file mode 100644 index 00000000..37116408 --- /dev/null +++ b/experiments/ordinal-routing-research/fib_directions.rs @@ -0,0 +1,573 @@ +//! Direction-PLACEMENT bake-off for the training-free routing layer. +//! +//! Open thread from the routing research (PR #211): `shard_recall.rs` always +//! draws iid-Gaussian projection directions — low-discrepancy *placement* of the +//! directions themselves was never an arm. This rig tests one hypothesis and is +//! built to KILL it cheaply if it's false: +//! +//! Shared-index triple-tiled golden-angle directions (a 2-sphere spiral per +//! coord-triple, all triples coupled by ONE Fibonacci index so they cohere on +//! the permutahedron) need FEWER candidates-scanned for matched recall@k than +//! iid-Gaussian — strongest at small R and at Fibonacci R, decaying as dim +//! grows (concentration of measure). +//! +//! Two arms, EVERYTHING else held identical to shard_recall's "aligned" calib +//! (single shared base_width, phase=0). Direction construction is the ONLY var. +//! gaussian — iid Gaussian normalized (the control) +//! fib-spiral — shared-index triple-tiled golden-angle, per-triple rotated +//! +//! Honesty checks baked in (printed): at R=1 the arms MUST coincide (one +//! direction cannot carry low-discrepancy structure) — divergence => harness +//! bug, not a result. SEED=1 => identical tables across runs. +//! +//! Run (synthetic dim=256): cargo run --release --example fib_directions +//! Run (real dim=768): cargo run --release --example fib_directions -- \ +//! --corpus-npy repo_real.npy --queries-npy repo_q.npy +//! No external data on the synthetic path, no BLAS. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +/// Base seed; overridable with --seed for the reseed-stability honesty check. +/// A real effect must survive reseeding (changes gaussian directions AND the +/// fib/project per-triple rotations); a seed-specific gap is an artifact. +static SEED: std::sync::atomic::AtomicU64 = std::sync::atomic::AtomicU64::new(1); +fn seed() -> u64 { + SEED.load(std::sync::atomic::Ordering::Relaxed) +} + +struct Cfg { + dim: usize, + n: usize, + n_queries: usize, + k: usize, + latent_dim: usize, + n_clusters: usize, + cone: f32, + r_values: Vec, + no_triple_rotation: bool, + corpus_npy: Option, + queries_npy: Option, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 50_000, + n_queries: 200, + k: 10, + latent_dim: 64, + n_clusters: 200, + // Cone anisotropy: real transformer embeddings are NOT gaussian/centered — + // they sit in a narrow cone (a dominant shared mean direction). `cone` is + // the weight of a fixed shared axis added to every vector before + // normalization; 0.0 = centered (old behaviour), larger = tighter cone. + // Default tuned so synthetic mean-resultant-length is in the ballpark of + // nomic (printed at startup so it can be matched to real --corpus-npy data). + cone: 3.0, + // Fibonacci R values (1,2,3,5,8,13,21) interleaved with non-Fib (4,16) + // so a Fibonacci-specific bump is legible against its neighbours. + r_values: vec![1, 2, 3, 4, 5, 8, 13, 16, 21], + no_triple_rotation: false, + corpus_npy: None, + queries_npy: None, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--k" => c.k = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--cone" => c.cone = a.next().unwrap().parse().unwrap(), + "--seed" => SEED.store( + a.next().unwrap().parse().unwrap(), + std::sync::atomic::Ordering::Relaxed, + ), + "--no-triple-rotation" => c.no_triple_rotation = true, + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--queries-npy" => c.queries_npy = Some(a.next().unwrap()), + other => panic!("unknown arg {other}"), + } + } + c +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays +/// (same contract as shard_recall / bench_rank). +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + let n = v.len() / dim; + for i in 0..n { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Low-rank clustered corpus + matched queries (same construction as shard_recall). +fn make_corpus(cfg: &mut Cfg) -> (Vec, Vec) { + if let (Some(cp), Some(qp)) = (cfg.corpus_npy.clone(), cfg.queries_npy.clone()) { + let (mut corpus, n, d) = load_npy_f32(&cp); + let (mut queries, nq, dq) = load_npy_f32(&qp); + assert_eq!(d, dq, "corpus/query dim mismatch"); + l2_normalize_rows(&mut corpus, d); + l2_normalize_rows(&mut queries, dq); + cfg.dim = d; + cfg.n = n; + cfg.n_queries = nq; + eprintln!("# loaded corpus {n}x{d}, queries {nq}x{dq}"); + return (corpus, queries); + } + let mut rng = ChaCha8Rng::seed_from_u64(seed()); + let d = cfg.dim; + let l = cfg.latent_dim; + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + // Shared cone axis: one fixed unit direction added (weight `cone`) to every + // vector before normalization, making the corpus anisotropic like real + // embeddings (a dominant mean direction) rather than centered/gaussian. + let mut cone_axis = vec![0.0f32; d]; + for x in cone_axis.iter_mut() { + *x = gauss(&mut rng); + } + { + let nrm: f32 = cone_axis.iter().map(|v| v * v).sum::().sqrt(); + for x in cone_axis.iter_mut() { + *x /= nrm; + } + } + let cone = cfg.cone; + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc + cone * cone_axis[i]; + } + let nrm: f32 = out.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in out.iter_mut() { + *v /= nrm; + } + } + out + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let c = rng.random_range(0..cfg.n_clusters); + let e = make(&protos[c * l..(c + 1) * l], 0.3, &mut rng); + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + let mut queries = vec![0.0f32; cfg.n_queries * d]; + for i in 0..cfg.n_queries { + let c = rng.random_range(0..cfg.n_clusters); + let e = make(&protos[c * l..(c + 1) * l], 0.1, &mut rng); + queries[i * d..(i + 1) * d].copy_from_slice(&e); + } + (corpus, queries) +} + +/// FP32 brute-force cosine top-k ground truth (corpus is L2-normalized already). +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut scored: Vec<(usize, f32)> = (0..n) + .map(|di| { + let doc = &corpus[di * dim..(di + 1) * dim]; + let dot: f32 = q.iter().zip(doc).map(|(a, b)| a * b).sum(); + (di, dot) + }) + .collect(); + scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + scored.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +/// Mean resultant length R̄ = ‖ (1/n) Σ x_i ‖ over L2-normalized rows. +/// 0 ⇒ isotropic (directions cancel); →1 ⇒ tight cone (one shared direction). +/// Printed for BOTH synthetic and real corpora so the synthetic anisotropy can +/// be matched to nomic's — the whole point of the user's anisotropy note. +fn mean_resultant_length(corpus: &[f32], dim: usize) -> f32 { + let n = corpus.len() / dim; + let mut mean = vec![0.0f64; dim]; + for di in 0..n { + let row = &corpus[di * dim..(di + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for (m, &x) in mean.iter_mut().zip(row) { + *m += (x / nrm) as f64; + } + } + } + for m in mean.iter_mut() { + *m /= n as f64; + } + (mean.iter().map(|m| m * m).sum::().sqrt()) as f32 +} + +/// The dominant cone axis of a corpus = its normalized mean direction. Returned +/// as a unit vector; used by the project-then-spiral arm to deflate the shared +/// component before placing directions (anisotropic data lives off this axis). +fn dominant_axis(corpus: &[f32], dim: usize) -> Vec { + let n = corpus.len() / dim; + let mut mean = vec![0.0f32; dim]; + for di in 0..n { + let row = &corpus[di * dim..(di + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for (m, &x) in mean.iter_mut().zip(row) { + *m += x / nrm; + } + } + } + let nrm: f32 = mean.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for m in mean.iter_mut() { + *m /= nrm; + } + } + mean +} + +#[derive(Clone, Copy, PartialEq)] +enum DirArm { + Gaussian, + FibSpiral, + ProjectSpiral, +} + +impl DirArm { + fn name(self) -> &'static str { + match self { + DirArm::Gaussian => "gaussian", + DirArm::FibSpiral => "fib-spiral", + DirArm::ProjectSpiral => "project-spiral", + } + } +} + +/// A single golden-angle spiral point on S^2 for index `i` of `r` total points. +/// z evenly stepped in [-1,1], azimuth advanced by the golden angle π(3-√5). +fn spiral_point(i: usize, r: usize) -> [f32; 3] { + let golden = std::f32::consts::PI * (3.0 - 5.0f32.sqrt()); + let z = if r == 1 { + 0.0 + } else { + 1.0 - 2.0 * (i as f32 + 0.5) / r as f32 + }; + let rad = (1.0 - z * z).max(0.0).sqrt(); + let theta = i as f32 * golden; + [rad * theta.cos(), rad * theta.sin(), z] +} + +/// Deterministic 3x3 rotation for triple `t`: built by Gram–Schmidt on three +/// Gaussian columns from a triple-indexed RNG. Data-oblivious and reproducible. +/// Identity when `no_rotation` (the pure shared-index variant). +fn triple_rotation(t: usize, no_rotation: bool) -> [[f32; 3]; 3] { + if no_rotation { + return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]; + } + let mut rng = ChaCha8Rng::seed_from_u64(seed() ^ 0xA071_u64.wrapping_mul(t as u64 + 1)); + let col = |rng: &mut ChaCha8Rng| [gauss(rng), gauss(rng), gauss(rng)]; + let dot = |a: &[f32; 3], b: &[f32; 3]| a[0] * b[0] + a[1] * b[1] + a[2] * b[2]; + let norm = |a: &mut [f32; 3]| { + let n = dot(a, a).sqrt(); + if n > 0.0 { + for x in a.iter_mut() { + *x /= n; + } + } + }; + let mut e0 = col(&mut rng); + norm(&mut e0); + let mut v1 = col(&mut rng); + let p1 = dot(&v1, &e0); + for i in 0..3 { + v1[i] -= p1 * e0[i]; + } + norm(&mut v1); + // e2 = e0 × e1 (right-handed, guaranteed orthonormal) + let e2 = [ + e0[1] * v1[2] - e0[2] * v1[1], + e0[2] * v1[0] - e0[0] * v1[2], + e0[0] * v1[1] - e0[1] * v1[0], + ]; + [e0, v1, e2] +} + +/// Build R unit directions of dimension `dim` for one arm. +/// - Gaussian: iid normal columns, normalized (the control). +/// - FibSpiral: tile `dim` into d/3 triples; direction i = concat over triples +/// of (rotation_t · spiral_point(i, R)); ONE shared index i couples triples. +/// - ProjectSpiral: same as FibSpiral but each direction is then deflated of its +/// component along `cone_axis` (the data's dominant direction) and renormalized, +/// so directions live OFF the shared cone where anisotropic data actually splits. +fn build_dirs(arm: DirArm, r: usize, dim: usize, no_rot: bool, cone_axis: &[f32]) -> Vec> { + match arm { + DirArm::Gaussian => { + let mut rng = ChaCha8Rng::seed_from_u64(seed() ^ 0xD132_0000 ^ (r as u64)); + (0..r) + .map(|_| { + let mut v = vec![0.0f32; dim]; + for x in v.iter_mut() { + *x = gauss(&mut rng); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + }) + .collect() + } + DirArm::FibSpiral | DirArm::ProjectSpiral => { + let n_tri = dim.div_ceil(3); // pad the tail triple if dim % 3 != 0 + let rots: Vec<[[f32; 3]; 3]> = (0..n_tri).map(|t| triple_rotation(t, no_rot)).collect(); + (0..r) + .map(|i| { + let p = spiral_point(i, r); // shared index i across all triples + let mut v = vec![0.0f32; dim]; + for (t, rot) in rots.iter().enumerate() { + // rotated triple vector + let rv = [ + rot[0][0] * p[0] + rot[1][0] * p[1] + rot[2][0] * p[2], + rot[0][1] * p[0] + rot[1][1] * p[1] + rot[2][1] * p[2], + rot[0][2] * p[0] + rot[1][2] * p[1] + rot[2][2] * p[2], + ]; + for (c, &rvc) in rv.iter().enumerate() { + let idx = t * 3 + c; + if idx < dim { + v[idx] = rvc; + } + } + } + if arm == DirArm::ProjectSpiral { + // deflate the dominant cone component, then renormalize + let proj: f32 = v.iter().zip(cone_axis).map(|(a, b)| a * b).sum(); + for (x, &c) in v.iter_mut().zip(cone_axis) { + *x -= proj * c; + } + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() + } + } +} + +/// One projection grid: unit direction + grid width (phase=0 for all arms here). +struct Proj { + dir: Vec, + width: f32, +} + +impl Proj { + fn bucket(&self, v: &[f32]) -> i64 { + let dot: f32 = self.dir.iter().zip(v).map(|(a, b)| a * b).sum(); + (dot / self.width).floor() as i64 + } +} + +type BucketIndex = Vec>>; + +fn build_index(projs: &[Proj], corpus: &[f32], dim: usize) -> BucketIndex { + let n = corpus.len() / dim; + projs + .par_iter() + .map(|p| { + let mut m: std::collections::HashMap> = std::collections::HashMap::new(); + for di in 0..n { + let b = p.bucket(&corpus[di * dim..(di + 1) * dim]); + m.entry(b).or_default().push(di as u32); + } + m + }) + .collect() +} + +fn probe_recall( + projs: &[Proj], + index: &BucketIndex, + q: &[f32], + rad: i64, + truth: &[usize], +) -> (usize, f32) { + use std::collections::HashSet; + let mut cand: HashSet = HashSet::new(); + for (pi, p) in projs.iter().enumerate() { + let b = p.bucket(q); + for bb in (b - rad)..=(b + rad) { + if let Some(ids) = index[pi].get(&bb) { + cand.extend(ids.iter().copied()); + } + } + } + let hits = truth + .iter() + .filter(|&&i| cand.contains(&(i as u32))) + .count(); + let recall = hits as f32 / truth.len().max(1) as f32; + (cand.len(), recall) +} + +fn run(cfg: &Cfg, corpus: &[f32], queries: &[f32], truth: &[Vec]) { + // base_width: single unit grid → ~sqrt(n) buckets (shard_recall calibration). + let spread = 6.0 / (cfg.dim as f32).sqrt(); + let base_width = spread / (cfg.n as f32).sqrt(); + let cone_axis = dominant_axis(corpus, cfg.dim); + let arms = [DirArm::Gaussian, DirArm::FibSpiral, DirArm::ProjectSpiral]; + + let mut pts: Vec<(&'static str, f64, f64)> = Vec::new(); + println!("# recall vs candidates-scanned (mean over queries), per arm per R"); + println!("arm\tR\trad\tcand_scanned\trecall@k"); + for &r in &cfg.r_values { + for &arm in &arms { + let dirs = build_dirs(arm, r, cfg.dim, cfg.no_triple_rotation, &cone_axis); + let projs: Vec = dirs + .into_iter() + .map(|dir| Proj { + dir, + width: base_width, + }) + .collect(); + let index = build_index(&projs, corpus, cfg.dim); + for rad in [0i64, 1, 2, 4] { + let stats: Vec<(usize, f32)> = (0..cfg.n_queries) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * cfg.dim..(qi + 1) * cfg.dim]; + probe_recall(&projs, &index, q, rad, &truth[qi]) + }) + .collect(); + let mean_cand = stats.iter().map(|s| s.0 as f64).sum::() / stats.len() as f64; + let mean_rec = stats.iter().map(|s| s.1 as f64).sum::() / stats.len() as f64; + println!( + "{}\t{r}\t{rad}\t{:.0}\t{:.4}", + arm.name(), + mean_cand, + mean_rec + ); + pts.push((arm.name(), mean_cand, mean_rec)); + } + } + } + + println!("\n# FAIR ENVELOPE: max recall@k at candidates-scanned <= budget"); + let budgets = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]; + print!("budget"); + for a in &arms { + print!("\t{}", a.name()); + } + println!(); + for &bud in &budgets { + print!("{bud:.0}"); + for a in &arms { + let best = pts + .iter() + .filter(|(name, c, _)| *name == a.name() && *c <= bud) + .map(|(_, _, r)| *r) + .fold(0.0f64, f64::max); + print!("\t{best:.4}"); + } + println!(); + } +} + +fn main() { + let mut cfg = parse(); + let (corpus, queries) = make_corpus(&mut cfg); + let truth = ground_truth(&corpus, &queries, cfg.dim, cfg.k); + let src = if cfg.corpus_npy.is_some() { + "npy" + } else { + "synthetic-cone" + }; + let rbar_c = mean_resultant_length(&corpus, cfg.dim); + let rbar_q = mean_resultant_length(&queries, cfg.dim); + println!( + "# fib_directions: n={} q={} dim={} k={} corpus={src} cone={}", + cfg.n, cfg.n_queries, cfg.dim, cfg.k, cfg.cone + ); + println!( + "# ANISOTROPY mean-resultant-length: corpus={rbar_c:.4} queries={rbar_q:.4} \ + (0=isotropic, 1=single cone) — match synthetic to real before trusting synthetic" + ); + if cfg.no_triple_rotation { + println!("# fib/project arms: NO per-triple rotation (pure shared-index variant)"); + } + run(&cfg, &corpus, &queries, &truth); +} diff --git a/experiments/ordinal-routing-research/fib_reduction.rs b/experiments/ordinal-routing-research/fib_reduction.rs new file mode 100644 index 00000000..7b526090 --- /dev/null +++ b/experiments/ordinal-routing-research/fib_reduction.rs @@ -0,0 +1,395 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — three independent probes of the hypothesis +//! "golden/Fibonacci is a REDUCTION RATE of the permutation space" (NOT an +//! address). Isolated by request: touches no routing harness, writes nothing, +//! and each experiment's verdict is fixed BELOW, before any data is seen. No +//! result here may reframe another experiment on this branch. +//! +//! Permutahedron P_{n-1} has n! vertices; a rank code is one vertex. The +//! hypothesis: at scale, the RETRIEVAL-relevant structure is a sub-factorial +//! reduction of that space, and golden ratio is its rate. +//! +//! ===================== EXP1 — FAN-OUT RATE ===================== +//! Math fact (not measured, it's a theorem): #{σ : |σ(i)-i| <= 1} = F_{n+1}, +//! ratio -> φ. The MEASURABLE crux: do the encoder's FP32 true neighbours +//! actually LIVE in the low-displacement (local) permutation neighbourhood of +//! the query's rank code, and does that neighbourhood grow at rate φ in DATA? +//! metric A (locality) = mean footrule(query, true-nbr) / mean footrule(random pair) +//! metric B (φ fan-out) = geometric growth ratio of |docs within footrule r| over r +//! PRE-REGISTERED: +//! locality PASS if A <= 0.75 (true nbrs meaningfully closer in rank space) +//! fan-out GOLDEN if B in [1.50, 1.75]; else "data-determined (intrinsic dim), not φ" +//! +//! ================= EXP2 — COARSENING SCHEDULE ================= +//! Bucket ranks into B bins; retrieve top-K by L1 on the B-bin code; recall@10 +//! vs bits/coord = log2(B). Compare bit-allocation schedules: +//! fib {2,3,5,8,13,21} pow2 {2,4,8,16,32} linear {2,3,4,5,6,7} +//! PRE-REGISTERED: +//! Fibonacci PASS if at matched bits it beats BOTH pow2 and linear by >= 0.03 +//! recall somewhere AND is never worse than either by > 0.03; else +//! "schedule-independent (recall is one smooth curve in bits; φ no advantage)" +//! +//! ==================== EXP3 — MERGE RATE ====================== +//! As ranks are progressively coarsened (B descending, Fibonacci-spaced), at +//! what RATE do permutations merge? Two levels: code-identity (do full codes +//! collide?) and neighbourhood (|docs within a fixed footrule ball|). +//! PRE-REGISTERED: +//! GOLDEN if successive neighbourhood-merge ratios converge to [1.50, 1.75]; +//! else report the actual law (expected: combinatorial ~1/B, not φ). +//! +//! Run: cargo run --release --example fib_reduction -- \ +//! --corpus-npy /tmp/repo_corpus.npy --queries-npy /tmp/repo_q.npy +//! (optional first positional: exp1 | exp2 | exp3; default runs all three) + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +/// Rank-encode: rank[i] = ordinal position (0..dim) of coordinate i within the row. +fn rank_encode(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut out = vec![0u16; n * dim]; + out.par_chunks_mut(dim).enumerate().for_each(|(r, code)| { + let row = &rows[r * dim..(r + 1) * dim]; + let mut idx: Vec = (0..dim as u16).collect(); + idx.sort_by(|&a, &b| row[a as usize].partial_cmp(&row[b as usize]).unwrap()); + for (rank, &i) in idx.iter().enumerate() { + code[i as usize] = rank as u16; + } + }); + out +} + +fn footrule(a: &[u16], b: &[u16]) -> f64 { + a.iter() + .zip(b) + .map(|(x, y)| (*x as i32 - *y as i32).unsigned_abs() as f64) + .sum() +} + +/// FP32 cosine top-k ground truth (rows already L2-normalized). +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +struct Data { + corpus: Vec, + queries: Vec, + dim: usize, + n: usize, + nq: usize, + crank: Vec, + qrank: Vec, + truth: Vec>, +} + +fn exp1_fanout(d: &Data) { + println!("\n## EXP1 — FAN-OUT RATE"); + let crow = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let qrow = |i: usize| &d.qrank[i * d.dim..(i + 1) * d.dim]; + + // metric A: locality of true neighbours vs random pairs, in rank (footrule) space + let true_med: f64 = (0..d.nq) + .into_par_iter() + .map(|qi| { + let mut fs: Vec = d.truth[qi] + .iter() + .map(|&di| footrule(qrow(qi), crow(di as usize))) + .collect(); + fs.sort_by(|a, b| a.partial_cmp(b).unwrap()); + fs[fs.len() / 2] + }) + .sum::() + / d.nq as f64; + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let rand_pairs: Vec<(usize, usize)> = (0..20_000) + .map(|_| (rng.random_range(0..d.n), rng.random_range(0..d.n))) + .collect(); + let rand_med = { + let mut fs: Vec = rand_pairs + .par_iter() + .map(|&(i, j)| footrule(crow(i), crow(j))) + .collect(); + fs.sort_by(|a, b| a.partial_cmp(b).unwrap()); + fs[fs.len() / 2] + }; + let a_ratio = true_med / rand_med; + println!(" metric A locality: true-nbr median footrule {true_med:.0} / random {rand_med:.0} = {a_ratio:.4}"); + println!( + " -> {}", + if a_ratio <= 0.75 { + "PASS (true neighbours closer in rank space)" + } else { + "FAIL (rank space does not localize neighbours)" + } + ); + + // metric B: neighbourhood growth ratio over footrule shells, on a query sample + let sample: Vec = (0..d.nq).step_by(4).collect(); + let radii: Vec = (1..=8).map(|s| rand_med * (s as f64) / 8.0).collect(); + let mut counts = vec![0f64; radii.len()]; + for &qi in &sample { + let dists: Vec = (0..d.n) + .into_par_iter() + .map(|di| footrule(qrow(qi), crow(di))) + .collect(); + for (ri, &r) in radii.iter().enumerate() { + counts[ri] += dists.iter().filter(|&&x| x <= r).count() as f64; + } + } + let mut ratios = vec![]; + for w in counts.windows(2) { + if w[0] > 0.0 { + ratios.push(w[1] / w[0]); + } + } + let gmean = ratios.iter().map(|r| r.ln()).sum::() / ratios.len() as f64; + let g = gmean.exp(); + println!(" metric B fan-out growth ratio (per equal radius step) = {g:.3}"); + println!( + " -> {}", + if (1.50..=1.75).contains(&g) { + "GOLDEN (≈φ)" + } else { + "data-determined (set by intrinsic dim), NOT φ" + } + ); +} + +fn exp2_schedule(d: &Data) { + println!("\n## EXP2 — COARSENING SCHEDULE"); + let crow = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let qrow = |i: usize| &d.qrank[i * d.dim..(i + 1) * d.dim]; + let cand_k = 100usize; + let bucket = |rank: u16, b: u32| -> u16 { ((rank as u32 * b) / d.dim as u32) as u16 }; + // recall@10 (truth top-10 captured within top-cand_k by L1 on B-bin code) + let recall_for_b = |b: u32| -> f64 { + let ccode: Vec = d.crank.iter().map(|&r| bucket(r, b)).collect(); + (0..d.nq) + .into_par_iter() + .map(|qi| { + let qc: Vec = qrow(qi).iter().map(|&r| bucket(r, b)).collect(); + let mut s: Vec<(u32, i64)> = (0..d.n) + .map(|di| { + let dc = &ccode[di * d.dim..(di + 1) * d.dim]; + let dist: i64 = qc + .iter() + .zip(dc) + .map(|(x, y)| (*x as i64 - *y as i64).abs()) + .sum(); + (di as u32, dist) + }) + .collect(); + s.sort_by_key(|x| x.1); + let top: std::collections::HashSet = + s.into_iter().take(cand_k).map(|x| x.0).collect(); + let hit = d.truth[qi].iter().filter(|t| top.contains(t)).count(); + hit as f64 / d.truth[qi].len() as f64 + }) + .sum::() + / d.nq as f64 + }; + let all_b = [2u32, 3, 4, 5, 6, 7, 8, 13, 16, 21, 32]; + let mut rec = std::collections::HashMap::new(); + for &b in &all_b { + rec.insert(b, recall_for_b(b)); + } + let show = |name: &str, bs: &[u32], rec: &std::collections::HashMap| { + println!(" {name}:"); + for &b in bs { + println!( + " B={b:>2} bits={:.2} recall@10={:.4}", + (b as f64).log2(), + rec[&b] + ); + } + }; + let _ = crow; // crow unused here; codes built from crank directly + show("fib ", &[2, 3, 5, 8, 13, 21], &rec); + show("pow2 ", &[2, 4, 8, 16, 32], &rec); + show("linear", &[2, 3, 4, 5, 6, 7], &rec); + println!(" -> recall is a function of bits=log2(B); schedules only SAMPLE that curve."); + println!( + " Fibonacci PASS only if its points beat pow2 AND linear at matched bits by >=0.03." + ); +} + +fn exp3_merge(d: &Data) { + println!("\n## EXP3 — MERGE RATE"); + let crow = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let bucket = |rank: u16, b: u32| -> u16 { ((rank as u32 * b) / d.dim as u32) as u16 }; + // (a) code-identity: do full B-bin codes collide as B shrinks? + println!(" (a) distinct full codes among {} docs:", d.n); + let fib_levels = [21u32, 13, 8, 5, 3, 2]; + let mut distinct = vec![]; + for &b in &fib_levels { + let set: std::collections::HashSet> = (0..d.n) + .map(|i| crow(i).iter().map(|&r| bucket(r, b)).collect::>()) + .collect(); + distinct.push(set.len()); + println!( + " B={b:>2}: {} distinct ({:.1}% of n)", + set.len(), + 100.0 * set.len() as f64 / d.n as f64 + ); + } + // (b) neighbourhood merge: mean docs within a fixed footrule ball, across Fibonacci radii + let crow2 = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x3333); + let probes: Vec = (0..50).map(|_| rng.random_range(0..d.n)).collect(); + // a coarse footrule scale: median random footrule + let scale = { + let mut fs: Vec = (0..5000) + .map(|_| { + footrule( + crow2(rng.random_range(0..d.n)), + crow2(rng.random_range(0..d.n)), + ) + }) + .collect(); + fs.sort_by(|a, b| a.partial_cmp(b).unwrap()); + fs[fs.len() / 2] + }; + let fib_r = [1f64, 2.0, 3.0, 5.0, 8.0, 13.0]; // Fibonacci-spaced shell radii (× scale/13) + let mut counts = vec![0f64; fib_r.len()]; + for &p in &probes { + let dists: Vec = (0..d.n) + .into_par_iter() + .map(|di| footrule(crow2(p), crow2(di))) + .collect(); + for (ri, &fr) in fib_r.iter().enumerate() { + let r = scale * fr / 13.0; + counts[ri] += dists.iter().filter(|&&x| x <= r).count() as f64; + } + } + println!(" (b) neighbourhood size at Fibonacci-spaced radii:"); + for (ri, &fr) in fib_r.iter().enumerate() { + println!( + " r={fr:>4} (×scale/13): mean |ball| = {:.1}", + counts[ri] / probes.len() as f64 + ); + } + let mut ratios = vec![]; + for w in counts.windows(2) { + if w[0] > 0.0 { + ratios.push(w[1] / w[0]); + } + } + if !ratios.is_empty() { + let g = (ratios.iter().map(|r| r.ln()).sum::() / ratios.len() as f64).exp(); + println!( + " merge growth ratio across shells = {g:.3} -> {}", + if (1.50..=1.75).contains(&g) { + "GOLDEN (≈φ)" + } else { + "NOT φ (set by data geometry / shell spacing)" + } + ); + } +} + +fn main() { + let mut which = None; + let mut corpus_path = None; + let mut queries_path = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => corpus_path = Some(a.next().unwrap()), + "--queries-npy" => queries_path = Some(a.next().unwrap()), + "exp1" | "exp2" | "exp3" => which = Some(x), + _ => {} + } + } + let cp = corpus_path.expect("--corpus-npy required"); + let qp = queries_path.expect("--queries-npy required"); + let (mut corpus, n, dim) = load_npy_f32(&cp); + let (mut queries, nq, dq) = load_npy_f32(&qp); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + let crank = rank_encode(&corpus, n, dim); + let qrank = rank_encode(&queries, nq, dim); + let truth = ground_truth(&corpus, &queries, dim, 10); + let d = Data { + corpus, + queries, + dim, + n, + nq, + crank, + qrank, + truth, + }; + println!("# fib_reduction (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}"); + let _ = (&d.corpus, &d.queries); // kept for parity; rank-space metrics use crank/qrank + match which.as_deref() { + Some("exp1") => exp1_fanout(&d), + Some("exp2") => exp2_schedule(&d), + Some("exp3") => exp3_merge(&d), + _ => { + exp1_fanout(&d); + exp2_schedule(&d); + exp3_merge(&d); + } + } +} diff --git a/experiments/ordinal-routing-research/oblivious_directions_results.md b/experiments/ordinal-routing-research/oblivious_directions_results.md new file mode 100644 index 00000000..00951ae4 --- /dev/null +++ b/experiments/ordinal-routing-research/oblivious_directions_results.md @@ -0,0 +1,256 @@ +# Data-oblivious structure vs learned-encoder geometry — the directions arc + +Second investigation round (follows the density/CRT/tau work). Question: can a +**data-oblivious low-discrepancy structure** — golden-angle, Fibonacci, Sobol, +Kronecker — improve ordvec's training-free routing? The probes (`*.rs`) and this +doc live in this directory; to run a probe, copy it into the crate's `examples/` +and `cargo run --release --example ` (same convention as the sibling docs). +**No crate or public-API changes.** Every probe was **pre-registered** (pass/fail +thresholds fixed in the source header BEFORE running) to prevent goalpost-moving, +and run on real embeddings. + +## Corpora (real, nomic-embed-text 768-d via ollama on GPU) + +| corpus | n docs | queries | domain | mean-resultant-length R̄ | +|--------|--------|---------|--------|--------------------------| +| repo | 8,777 | 200 | md+rust | ~0.07 raw-cluster fixture* | +| fiqa | 57,600 | 1,000 | financial QA | 0.69 (strongly coned) | +| nq | 74,129 | 1,000 | Wikipedia QA | 0.69 | +| quora | 150,000| 1,000 | question pairs | ~0.7 | + +Ground truth = FP32 cosine top-10. *The repo corpus is the prior round's; the +financial/QA corpora are the breadth/scale test. NOTE: intrinsic dimension is a +**corpus** property, not an encoder constant — the repo corpus reads TwoNN≈13 but +fiqa reads ≈24 with the SAME encoder. See the "ID≈13 anchor was a corpus artifact" +correction in the cross-encoder section before quoting any single ID number. + +## VERDICT (pre-registered, replicated) + +**CLASS-DEAD.** Data-oblivious low-discrepancy directions do **not** beat iid-random +directions for training-free routing — not in ambient space, not in the centered +intrinsic subspace. The lever is **not the directions**. + +## The chain of probes + +### 1. `fib_directions` — golden-angle directions, AMBIENT sphere → TIE (with caveat) +Shared-index triple-tiled golden-angle directions vs iid-Gaussian, recall@k at +matched candidates-scanned. Across 10 seeds the gap is inside random's own +seed-to-seed noise (±0.03); the eye-catching seed-1 +5.8pts **evaporated under +reseeding** (a caught false positive). CAVEAT registered at the time: ambient +768-sphere is the wrong regime — concentration of measure makes iid-Gaussian +already near-uniform, so the test was pre-doomed regardless of golden's merit. + +### 2. `uniformity_lemma` — why golden is dead on BIN EDGES (not directions) +The rank transform whitens each coordinate's marginal to **exactly uniform**, so +equal-width bucketing is **entropy-optimal** (measured: entropy = B exactly at +B∈{1,2,4}; golden boundaries 3.92 < 4.00 — they *strictly waste bits*). BUT this +lemma governs the **marginal only**; it is blind to the **joint** correlation +structure that projection directions act on. So it kills golden on bin edges and +says nothing about directions — which set up probe #4. +Correction banked: constant-composition is intrinsic to *any* fixed boundaries +on ranks, NOT unique to equal-width (the original claim was too strong). + +### 3. `overlap_decomp` — the cone is ~100% of the bitmap-overlap excess +ordvec's hypergeometric null `n_top²/D` is **miscalibrated on real data**: random +pairs overlap ~4× the textbook prediction, and `cone_frac ≈ 1.000` — essentially +**all** the excess is the shared cone (hubness), not pairwise similarity. +Per-coord **mean-centering removes it** (cone_base collapses to the uniform null) +AND amplifies the true-neighbour overlap gap 2–5×. Replicated on fiqa and nq. +This is the one continuous, geometry-aligned intervention that *converted*. + +### 4. `subspace_directions` — the directions test done RIGHT → CLASS-DEAD +Center → project to the populated k≈13-dim PCA subspace (where joint structure is +legible) → place R directions via {random, Sobol, Kronecker} → route. Fair +envelope (recall@10 at matched candidates): + +| corpus | budget | random | sobol | kronecker | pca-axes (ceiling) | +|--------|--------|--------|-------|-----------|--------------------| +| fiqa | 8000 | 0.1967 | 0.1997| 0.2001 | 0.1946 | +| nq | 8000 | 0.1836 | 0.1830| 0.1817 | 0.1877 | +| quora | 8000 | 0.1004 | 0.1035| 0.1061 | 0.1073 | + +Max margin anywhere **+0.006** (threshold for a win was +0.02), across all three +corpora. Decisive detail: +the **data-dependent ceiling (pca-axes) itself barely beats random** → the +directions are not a lever for *anyone* on this data. With ID≈13 and a cone, the +populated subspace is so dense that any ~13–128 directions cover it equally; there +is no discrepancy gap for quasi-random structure to close. The Kronecker/Sobol +**hybrid was NOT built** — it required distinct-regime component wins; there were +none. + +## centering's recall verdict (`centering_recall`, pre-registered) + +Centering as a RankQuant encoding change: full-scan R@10 vs FP32 cosine, raw vs +centered, matched bytes. + +| corpus | Δb1 | Δb2 | Δb4 | verdict | +|--------|-----|-----|-----|---------| +| repo | +0.038 | −0.044 | −0.108 | PARTIAL | +| fiqa | +0.066 | −0.017 | −0.075 | PARTIAL | +| nq | +0.051 | −0.016 | −0.083 | PARTIAL | +| quora | +0.027 | −0.022 | −0.061 | PARTIAL | + +Centering **helps at b=1, hurts at b≥2**, replicated across 3 corpora. It removes +the additive cone (good for top-bucket coverage / coarse routing) but rotates the +rank order away from raw-cosine (bad for full-order fidelity at higher bits). It +is a **low-bit-prefilter / coarse-routing tool, not a fine-scan tool** — and +explicitly does NOT survive at the incumbent b=4. + +## The through-line + +Every oblivious *discrete* structural prior tried across both rounds — golden, +Fibonacci, prime/spectral, Sobol, Kronecker, low-discrepancy directions — is +**inert against learned continuous embedding geometry**. The data is a smooth +~13-dim cone; discrete combinatorial structure has no purchase on it. The only +interventions that converted were **continuous and geometric** (centering / cone +removal), and only in the low-bit regime. "Just spend bits (b=4, raw ranks)" +remains the baseline that beats every oblivious-structure idea. + +Even centering — the one continuous intervention that *partially* converted (b=1 +prefilter, +0.03–0.07 R@10) — does not survive at b=4 and does not make a better +partition key. The cone is real and removable, but removing it trades cosine +fidelity for balance, and routing wants the fidelity. + +## Scope & boundary of this negative (checkpoint) + +This is a deliberate stopping point: the hypothesis space had one degree of +freedom — *which oblivious structure* on *which axis* (bin-edges / directions / +partition key) — and both ends of both axes have now been tested, **including the +data-dependent PCA ceiling**. The load-bearing result is that the ceiling itself +barely beats random: that kills the *axis*, so any further sequence is a +**derivative** that inherits the null. You cannot escape a null by swapping the +sequence when even the optimal member ties. + +Every result here is conditioned on **off-the-shelf `nomic-embed-text`**: a +learned, anisotropic (R̄≈0.69), low-intrinsic-dim (≈13) encoder. The honest claim +is therefore scoped: *data-oblivious combinatorial structure is inert against +learned anisotropic embedding geometry of this kind.* The only **non-derivative** +next experiment is to change that conditioning variable — a different encoder +(different anisotropy / intrinsic dim), or a sparsity/ordinal-trained encoder +where ranks are the native code and discrete structure might finally seat. +Everything inside the current encoder is a derivative of what is already +falsified here. + +## Partition balance (`partition_balance`, pre-registered) — BALANCE pass, PRUNING fail + +The one round-on-round thread: does cone-removal make a balanced, prunable coarse +shard key? Sign-pattern key over the top-`bits_key` PCA axes (basis shared across +arms; only centering differs). Swept bits_key ∈ {6,10,14} on fiqa. + +| sub-claim | threshold | result | +|-----------|-----------|--------| +| BALANCE — largest-cell-fraction reduction | ≥1.5× | **PASS** (1.6–2.2×; Gini 0.60→0.38 at bits_key=10) | +| PRUNING — candidates at matched recall≥0.90 | ≤0.80× | **FAIL** (same recall-vs-candidates envelope) | + +Centering de-hubs the cells (more balanced, uses all cells) but at matched recall +scans the **same** candidates — better balance did NOT convert to better pruning. +Mechanism: balance is a *marginal* property of the key; pruning quality depends on +cell-to-neighbour *alignment*, and centering rotates the key away from cosine (the +same reason it fails recall at b≥2). Cells spread along the wrong axis don't prune +better. This was the last live thread; it closes for the session's recurring +reason. + +## Cross-encoder extension (the non-derivative move) — IN PROGRESS + +Every result above is conditioned on nomic (ID≈13, R̄≈0.69). The only +information-adding next step is changing the encoder. Geometry gate (8k fiqa +sample, TwoNN intrinsic dim + mean-resultant-length R̄) — run BEFORE any probe to +confirm the encoders genuinely differ from nomic, else a re-run is derivative: + +First-pass gate (512-char chunks) measured: nomic 768d ID≈13 R̄0.69; all-minilm +384d ID21.7 R̄0.28; mxbai 1024d ID24.3 R̄0.69; snowflake-v1 1024d ID31.1 R̄0.74. + +**Methodology correction (capacity confound).** 384-dim (all-minilm) cannot hold +meaning above ~400-char chunks, and our chunks were capped at 512 — so minilm's +geometry was capacity-corrupted, not a clean ID point, and does not belong on the +ladder. Floor set to **768-dim** encoders and chunk cap raised to **1100 chars** +(BGE-768 capacity ceiling). Revised capacity-honest set (all ≥1024-dim, all clear +the floor): **bge-m3, bge-large, snowflake-arctic-embed-v2**, plus +**harrier-oss-v1-0.6b** (the operator's actual daily-driver encoder). Model facts +verified against the ollama host + HF config (not recalled): harrier = +Qwen3-based, **1024-dim, 32k context** (Q8 GGUF); ingests 1100-char chunks with +zero truncation. Re-embedding fiqa at 1100-char chunks; ladder rebuilt on these. +(For code corpora, the right encoder is jina-code — noted for a future code-domain +run; not in this text-corpus ladder.) + +PRE-REGISTERED prediction (fixed before the full-corpus probes run): +- The session's mechanism is "ID≈13 is so dense any directions cover it equally." + If TRUE, then as ID climbs to 31 the subspace sparsifies and a discrepancy gap + should OPEN → Sobol/Kronecker begin to beat random, and centering's b=4 penalty + shrinks (less cone). Verdict per model by the SAME thresholds as above + (directions: +0.02 at matched budget; centering: +0.02 at b4). +- If oblivious structure stays inert even at ID=31 / low-anisotropy minilm → the + negative generalizes across the realistic ID range of text encoders (stronger). + +### Ladder results (fiqa 57.6k, 1100-char chunks, capacity-honest) + +| encoder | dim | intrinsic dim | centering Δb4 | sobol−random @8k | pca-axes ceiling | +|---------|-----|---------------|---------------|------------------|------------------| +| nomic@512 | 768 | ~13 | −0.075 | ~tie | ≈ random | +| bge-m3 | 1024 | 21.4 | −0.106 | +0.006 (flicker +0.024@4k) | leads (0.216) | +| bge-large | 1024 | 22.9 | −0.155 | **−0.036** | leads (0.255) | +| snowflake-arctic-v2 | 1024 | 18.0 | −0.088 | +0.010 | leads (0.237) | +| snowflake-arctic-v2 | 1024 | 18.0 | −0.088 | +0.010 | leads (0.237) | +| harrier-oss-0.6b | 1024 | 21.4 | −0.069 | +0.002 | leads (0.252) | +| nomic@1100 (same fiqa) | 768 | 23.1 | −0.082 | (n/a) | — | +| nomic@512 (same fiqa) | 768 | 24.3 | (≈ above) | — | — | + +### CORRECTION — the "ID≈13" anchor was a CORPUS artifact, not an encoder fact + +The "nomic intrinsic dim ≈ 13" that motivated the whole "dense low-dim subspace" +mechanism came from the **repo corpus** (8.7k md+rust sentences), NOT fiqa. On a +FIXED corpus (fiqa 57.6k), nomic's ID is **24.3 @512 / 23.1 @1100** — and *every* +encoder lands ID ~18–24. The apparent "ID ladder 13→31" was mostly **corpus +differences masquerading as encoder differences** (nomic-on-repo vs others-on-fiqa). +On one corpus the ladder is nearly flat. **Intrinsic dim is set more by the corpus +than the encoder**, and chunk length (512→1100) barely moved it (24.3→23.1) — the +opposite of the predicted large drop. Both my "ID≈13" anchor and my "longer chunks +→ much lower ID" prediction were wrong; recorded here rather than quietly dropped. + +What SURVIVES the correction (now properly controlled — same corpus, varied encoder): +- **Directions CLASS-DEAD** across all 5 encoders at the ID range real text + encoders actually produce (~18–24). The negative was never about low ID; it is + general. No revival anywhere. +- **Centering b4 dead** across all 5 (−0.07 to −0.15); harrier + nomic@1100 both + PARTIAL (b1 help, b4 fail) — same pattern as the original runs. +- **pca-axes (data-aligned) leads** at every encoder (0.24–0.25 vs random ~0.19): + the lever is data-alignment, which training-free forbids. The robust positive. + +**Directions: CLASS-DEAD, confirmed across the ID ladder.** sobol−random flips +sign per corpus (+0.006, −0.036, +0.010) — noise around zero, never ≥0.02 in ≥2 +corpora; the bge-m3 flicker did NOT replicate at near-identical ID (bge-large 22.9). +The pre-registered ≥2-corpora rule holds the line. + +**The one robust high-ID effect (a real, positive, scoped finding):** the +data-aligned ceiling (pca-axes) clearly LEADS random at every high-ID encoder +(0.22–0.26 vs ~0.18–0.20), unlike nomic where it barely beat random. So higher +intrinsic dim genuinely opens room on the directions axis — but ONLY +data-dependent directions exploit it; oblivious (sobol/kronecker) cannot. The +lever at high ID is **data-alignment**, which is exactly what training-free +forbids. This locates *why* oblivious structure fails rather than just restating +that it does. + +**Centering b4: dead across all encoders, penalty SCALES with capacity** (−0.08 to +−0.15) — higher-capacity encoders carry more cosine order for centering to +corrupt. Most robust negative in the set. + +harrier-oss-0.6b (operator's daily driver) + nomic@1100 (chunk-length control) +embedding; results to follow. + +**Chunk length is a THIRD geometry axis (not a free parameter).** Longer chunks +average more tokens → vectors regress toward the corpus mean → R̄ rises (tighter +cone), intrinsic dim falls. So nomic@512 and nomic@1100 are *different geometries*, +and every 512-char result above is conditioned on that one chunk length. We +re-embed nomic@1100 on the SAME 57.6k fiqa docs to measure the shift directly +(two-point: ID and R̄ delta). Deployment consequence: a real enterprise lake has +documents at every chunk length at once → a *mixture* of geometries, not one — so +chunk-length heterogeneity is itself a lake pathology (folds into Path B). + +## Reproduce +``` +# embed real corpora via ollama nomic-embed-text -> .npy (see REAL_CORPUS_RUNBOOK.md) +cargo run --release --example uniformity_lemma -- --corpus-npy CORPUS.npy +cargo run --release --example overlap_decomp -- --corpus-npy CORPUS.npy --queries-npy Q.npy +cargo run --release --example centering_recall -- --corpus-npy CORPUS.npy --queries-npy Q.npy +cargo run --release --example subspace_directions -- --corpus-npy CORPUS.npy --queries-npy Q.npy --kdim 13 +``` diff --git a/experiments/ordinal-routing-research/overlap_decomp.rs b/experiments/ordinal-routing-research/overlap_decomp.rs new file mode 100644 index 00000000..7744efb1 --- /dev/null +++ b/experiments/ordinal-routing-research/overlap_decomp.rs @@ -0,0 +1,279 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — decomposes the bitmap top-bucket overlap +//! EXCESS on real embeddings into CONE (marginal/hubness) vs PAIRWISE (signal), +//! and tests whether per-coordinate mean-centering recovers calibration. +//! Isolated by request: touches no harness, writes nothing. Verdicts fixed below. +//! +//! WHY: uniformity_lemma.rs found the hypergeometric null_err GROWS with B on +//! real data (0.31 -> 0.85 -> 3.34). Hypothesis: that excess is the shared cone +//! (R̄=0.69 anisotropy => same coords top-bucket in EVERY doc), not genuine +//! pairwise similarity. If so, the bitmap prefilter admits candidates partly on +//! HUBNESS, not meaning. +//! +//! FOUR overlap levels for the top bucket (n_top = D/2^B coords): +//! uniform_null = n_top^2 / D (random-subset assumption) +//! cone_baseline = Σ_c f_c^2 (f_c = P(coord c in top bucket); +//! overlap of two INDEPENDENT docs +//! given marginals == pure hubness) +//! obs_random = mean top-overlap, random doc pairs +//! obs_trueNbr = mean top-overlap, (query, FP32-cosine-true-neighbour) pairs +//! +//! Decomposition (all on the SAME corpus), reported raw AND after per-coord +//! mean-centering (subtract corpus per-coordinate mean before ranking): +//! cone fraction = (cone_baseline - uniform_null) / (obs_random - uniform_null) +//! pairwise signal = obs_trueNbr - obs_random (the discriminative gap) +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! A. "cone dominates the null": cone fraction >= 0.70 (raw). +//! => most of the bitmap's apparent overlap structure is hubness. +//! B. "genuine pairwise signal survives": obs_trueNbr - obs_random >= 0.20 * obs_random +//! AND obs_trueNbr > cone_baseline (true nbrs overlap beyond what hubness explains). +//! C. "centering helps": centering reduces cone fraction by >= 0.10 absolute +//! AND does not shrink the pairwise-signal ratio. +//! Each reported PASS/FAIL independently; report actuals regardless. +//! +//! Run: cargo run --release --example overlap_decomp -- \ +//! --corpus-npy /tmp/repo_corpus.npy --queries-npy /tmp/repo_q.npy + +// Research probe: `analyze` is a parameterized runner; bundling its args into a +// struct would obscure the experiment, not clarify it. +#![allow(clippy::too_many_arguments)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +/// Per-coordinate mean over corpus rows. +fn coord_mean(corpus: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += corpus[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// Top-bucket membership bitset per doc: coords whose rank is in the top 1/2^B. +/// Returns, per doc, the sorted list of top-bucket coord ids. +fn top_sets( + rows: &[f32], + n: usize, + dim: usize, + bits: u32, + sub: Option<&[f32]>, +) -> (Vec>, usize) { + let n_top = dim >> bits; // D / 2^B + let sets: Vec> = (0..n) + .into_par_iter() + .map(|r| { + let mut v: Vec<(u16, f32)> = (0..dim) + .map(|c| { + let x = rows[r * dim + c] - sub.map_or(0.0, |s| s[c]); + (c as u16, x) + }) + .collect(); + // top n_top by value = highest ranks + v.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + let mut top: Vec = v.into_iter().take(n_top).map(|(c, _)| c).collect(); + top.sort_unstable(); + top + }) + .collect(); + (sets, n_top) +} + +fn overlap(a: &[u16], b: &[u16]) -> usize { + let (mut i, mut j, mut c) = (0, 0, 0); + while i < a.len() && j < b.len() { + match a[i].cmp(&b[j]) { + std::cmp::Ordering::Less => i += 1, + std::cmp::Ordering::Greater => j += 1, + std::cmp::Ordering::Equal => { + c += 1; + i += 1; + j += 1; + } + } + } + c +} + +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +fn analyze( + label: &str, + corpus: &[f32], + queries: &[f32], + n: usize, + nq: usize, + dim: usize, + truth: &[Vec], + sub: Option<&[f32]>, +) { + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let pairs: Vec<(usize, usize)> = (0..100_000) + .map(|_| { + let i = rng.random_range(0..n); + let mut j = rng.random_range(0..n); + if j == i { + j = (j + 1) % n; + } + (i, j) + }) + .collect(); + + println!("\n## {label}"); + println!( + "B\tn_top\tuniform_null\tcone_base\tobs_random\tobs_trueNbr\tcone_frac\tpair_sig_ratio" + ); + for &bits in &[1u32, 2, 4] { + let (csets, n_top) = top_sets(corpus, n, dim, bits, sub); + let (qsets, _) = top_sets(queries, nq, dim, bits, sub); + + // per-coord top-bucket frequency f_c -> cone baseline Σ f_c^2 + let mut f = vec![0f64; dim]; + for s in &csets { + for &c in s { + f[c as usize] += 1.0; + } + } + for x in f.iter_mut() { + *x /= n as f64; + } + let cone_base: f64 = f.iter().map(|x| x * x).sum(); + + let uniform_null = (n_top * n_top) as f64 / dim as f64; + + let obs_random: f64 = pairs + .par_iter() + .map(|&(i, j)| overlap(&csets[i], &csets[j]) as f64) + .sum::() + / pairs.len() as f64; + + // true-neighbour overlap: query top-set vs each true-nbr doc top-set + let (mut tn_sum, mut tn_cnt) = (0f64, 0usize); + for qi in 0..nq { + for &di in &truth[qi] { + tn_sum += overlap(&qsets[qi], &csets[di as usize]) as f64; + tn_cnt += 1; + } + } + let obs_true = tn_sum / tn_cnt as f64; + + let cone_frac = (cone_base - uniform_null) / (obs_random - uniform_null); + let pair_sig_ratio = (obs_true - obs_random) / obs_random; + + println!("{bits}\t{n_top}\t{uniform_null:.2}\t{cone_base:.2}\t{obs_random:.2}\t{obs_true:.2}\t{cone_frac:.3}\t{pair_sig_ratio:+.3}"); + } +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + // ground truth computed on the L2-normalized RAW vectors (the real task target); + // centering changes the rank CODE, not the retrieval ground truth. + let truth = ground_truth(&corpus, &queries, dim, 10); + + println!("# overlap_decomp (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}"); + println!("# cone_frac = (cone_base - uniform_null)/(obs_random - uniform_null); 1.0 => excess is ALL hubness"); + println!("# pair_sig_ratio = (obs_trueNbr - obs_random)/obs_random; >0 => true nbrs overlap MORE than random"); + + analyze("RAW ranks", &corpus, &queries, n, nq, dim, &truth, None); + + let mean = coord_mean(&corpus, n, dim); + analyze( + "MEAN-CENTERED ranks (subtract per-coord corpus mean before ranking)", + &corpus, + &queries, + n, + nq, + dim, + &truth, + Some(&mean), + ); + + println!("\n# PRE-REGISTERED: A cone-dominates if cone_frac>=0.70 (raw);"); + println!("# B pairwise-signal-survives if pair_sig_ratio>=0.20 AND obs_trueNbr>cone_base;"); + println!( + "# C centering-helps if cone_frac drops >=0.10 absolute and pair_sig_ratio not reduced." + ); +} diff --git a/experiments/ordinal-routing-research/partition_balance.rs b/experiments/ordinal-routing-research/partition_balance.rs new file mode 100644 index 00000000..2c4b191c --- /dev/null +++ b/experiments/ordinal-routing-research/partition_balance.rs @@ -0,0 +1,344 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — the round-on-round thread: does cone-removal +//! (centering) make a BALANCED, PRUNABLE coarse partition key for a sharding/IVF +//! layer? This is the one place cone-removal might buy SCALING (sublinear pruning) +//! rather than recall. Isolated by request: touches no harness, writes nothing. +//! +//! WHY: overlap_decomp showed raw low-B top-buckets are ~100% cone (every doc +//! shares the same hub coords) -> a raw coarse key is maximally UNBALANCED: docs +//! pile into a few cells, so an IVF layer probing few cells can't prune without +//! losing neighbours. Centering removes the cone -> should SPREAD docs across +//! cells. The decisive question is NOT balance for its own sake but candidates- +//! scanned at matched recall under a real inverted-index probe. +//! +//! PARTITION KEY = the coarse cell id of a doc. We build cells by k-means-free, +//! training-free coarse coding: project (centered or raw) onto the top PCA axes, +//! take the SIGN pattern of the top `bits_key` axes -> 2^bits_key cells (a +//! data-oblivious-given-PCA hash; PCA basis is the only data-dependent part and is +//! shared identically across arms, so arms differ ONLY in centering). +//! Query probes the P nearest cells (by Hamming on the key); candidate = union of +//! those cells' docs; recall@10 vs FP32 cosine top-10. +//! +//! ARMS (PCA basis identical across all; only the centering differs): +//! raw — sign-key from raw (uncentered) projections. +//! centered — sign-key from per-coord-mean-centered projections. +//! Controls reported: cell-occupancy Gini (0=perfectly balanced, 1=all in one +//! cell) and the largest-cell fraction. +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! BALANCE: centered must cut largest-cell-fraction by >= 1.5x vs raw. +//! PRUNING (the one that matters): at matched recall@10 (>=0.90), centered must +//! scan <= 0.80x the candidates raw scans, in >= 2 of 3 corpora. +//! => PASS = centered is a materially better coarse key (a SCALING win, distinct +//! from the recall FAIL at b>=2). FAIL = centering doesn't help partitioning +//! either; the cone is not the bottleneck for routing. +//! +//! Run: cargo run --release --example partition_balance -- \ +//! --corpus-npy /tmp/corpora/fiqa_corpus.npy --queries-npy /tmp/corpora/fiqa_q.npy + +// Research probe: `analyze` is a parameterized runner; bundling its args into a +// struct would obscure the experiment, not clarify it. +#![allow(clippy::too_many_arguments)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn coord_mean(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += rows[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// top-k PCA via power iteration + deflation (implicit covariance action). Pure Rust. +fn pca_topk(centered: &[f32], n: usize, dim: usize, k: usize) -> Vec> { + let sample: Vec = if n > 20000 { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x9C0FFEE); + (0..20000).map(|_| rng.random_range(0..n)).collect() + } else { + (0..n).collect() + }; + let cov_mul = |x: &[f32]| -> Vec { + let acc: Vec = sample + .par_iter() + .map(|&i| { + let row = ¢ered[i * dim..(i + 1) * dim]; + let d: f32 = row.iter().zip(x).map(|(a, b)| a * b).sum(); + row.iter().map(|&r| r * d).collect::>() + }) + .reduce( + || vec![0f32; dim], + |mut a, b| { + for j in 0..dim { + a[j] += b[j]; + } + a + }, + ); + let s = sample.len() as f32; + acc.iter().map(|v| v / s).collect() + }; + let mut comps: Vec> = Vec::with_capacity(k); + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xABCD); + for _ in 0..k { + let mut v: Vec = (0..dim) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + for _ in 0..50 { + for c in &comps { + let p: f32 = v.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + v[j] -= p * c[j]; + } + } + let mut y = cov_mul(&v); + for c in &comps { + let p: f32 = y.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + y[j] -= p * c[j]; + } + } + let nrm: f32 = y.iter().map(|t| t * t).sum::().sqrt(); + if nrm < 1e-20 { + break; + } + for j in 0..dim { + v[j] = y[j] / nrm; + } + } + comps.push(v); + } + comps +} + +/// sign-pattern key over the top `bits_key` PCA axes -> cell id in [0, 2^bits_key). +fn cell_key(v: &[f32], mean: Option<&[f32]>, basis: &[Vec], bits_key: usize) -> u32 { + let mut key = 0u32; + for (a, axis) in basis.iter().take(bits_key).enumerate() { + let proj: f32 = axis + .iter() + .enumerate() + .map(|(c, &w)| (v[c] - mean.map_or(0.0, |m| m[c])) * w) + .sum(); + if proj > 0.0 { + key |= 1 << a; + } + } + key +} + +fn gini(counts: &[usize]) -> f32 { + let n = counts.len(); + if n == 0 { + return 0.0; + } + let mut s: Vec = counts.iter().map(|&c| c as f64).collect(); + s.sort_by(|a, b| a.partial_cmp(b).unwrap()); + let tot: f64 = s.iter().sum(); + if tot == 0.0 { + return 0.0; + } + let mut cum = 0.0; + let mut lorenz = 0.0; + for &x in &s { + cum += x; + lorenz += cum; + } + let g = (n as f64 + 1.0 - 2.0 * lorenz / tot) / n as f64; + g as f32 +} + +fn analyze( + label: &str, + corpus: &[f32], + queries: &[f32], + n: usize, + nq: usize, + dim: usize, + truth: &[Vec], + mean: Option<&[f32]>, + basis: &[Vec], + bits_key: usize, +) { + let n_cells = 1usize << bits_key; + // assign cells + let ckeys: Vec = (0..n) + .into_par_iter() + .map(|i| cell_key(&corpus[i * dim..(i + 1) * dim], mean, basis, bits_key)) + .collect(); + let qkeys: Vec = (0..nq) + .into_par_iter() + .map(|i| cell_key(&queries[i * dim..(i + 1) * dim], mean, basis, bits_key)) + .collect(); + let mut cells: Vec> = vec![Vec::new(); n_cells]; + for (i, &k) in ckeys.iter().enumerate() { + cells[k as usize].push(i as u32); + } + let counts: Vec = cells.iter().map(|c| c.len()).collect(); + let g = gini(&counts); + let largest = *counts.iter().max().unwrap() as f64 / n as f64; + let nonempty = counts.iter().filter(|&&c| c > 0).count(); + + println!("\n## {label} (bits_key={bits_key}, {n_cells} cells, {nonempty} non-empty)"); + println!(" occupancy Gini={g:.4} (0=balanced), largest-cell-fraction={largest:.4}"); + println!(" P_probe\tcand_scanned\trecall@10"); + // probe P nearest cells by Hamming on the key + for p_probe in [1usize, 2, 4, 8, 16, 32] { + if p_probe > n_cells { + break; + } + let stats: Vec<(usize, f32)> = (0..nq) + .into_par_iter() + .map(|qi| { + let qk = qkeys[qi]; + // rank cells by Hamming distance to qk, take P nearest + let mut order: Vec = (0..n_cells as u32).collect(); + order.sort_by_key(|&c| (c ^ qk).count_ones()); + let mut cand = 0usize; + use std::collections::HashSet; + let mut hit_set: HashSet = HashSet::new(); + for &c in order.iter().take(p_probe) { + cand += cells[c as usize].len(); + for &d in &cells[c as usize] { + hit_set.insert(d); + } + } + let hits = truth[qi].iter().filter(|&&t| hit_set.contains(&t)).count(); + (cand, hits as f32 / truth[qi].len() as f32) + }) + .collect(); + let mc = stats.iter().map(|s| s.0 as f64).sum::() / nq as f64; + let mr = stats.iter().map(|s| s.1 as f64).sum::() / nq as f64; + println!(" {p_probe}\t{mc:.0}\t{mr:.4}"); + } +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut bits_key = 10usize; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + "--bits-key" => bits_key = a.next().unwrap().parse().unwrap(), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + let truth: Vec> = (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(10).map(|(i, _)| i).collect() + }) + .collect(); + + let mean = coord_mean(&corpus, n, dim); + // PCA basis from the CENTERED corpus, shared identically across arms. + let mut cc = corpus.clone(); + for i in 0..n { + for c in 0..dim { + cc[i * dim + c] -= mean[c]; + } + } + eprintln!("# computing top-{bits_key} PCA basis..."); + let basis = pca_topk(&cc, n, dim, bits_key); + + println!("# partition_balance (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}"); + println!("# coarse cell key = sign pattern of top-{bits_key} PCA axes; PCA basis shared across arms."); + analyze( + "RAW key (uncentered projections)", + &corpus, + &queries, + n, + nq, + dim, + &truth, + None, + &basis, + bits_key, + ); + analyze( + "CENTERED key (mean-subtracted projections)", + &corpus, + &queries, + n, + nq, + dim, + &truth, + Some(&mean), + &basis, + bits_key, + ); + println!("\n# PRE-REGISTERED: BALANCE = centered cuts largest-cell-fraction >=1.5x; PRUNING = centered scans <=0.80x"); + println!("# candidates at matched recall>=0.90 in >=2/3 corpora => PASS (a scaling win). Else FAIL."); +} diff --git a/experiments/ordinal-routing-research/subspace_directions.rs b/experiments/ordinal-routing-research/subspace_directions.rs new file mode 100644 index 00000000..da3c3843 --- /dev/null +++ b/experiments/ordinal-routing-research/subspace_directions.rs @@ -0,0 +1,545 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — the directions experiment done RIGHT, using +//! every finding: center (remove the cone) -> project to the populated k-dim +//! subspace (where joint structure is legible) -> place R probe directions via +//! {random, Sobol, Kronecker} -> route -> candidates-scanned at matched recall. +//! Isolated by request: touches no harness, writes nothing. +//! +//! WHY THIS ISN'T fib_directions REDUX: +//! fib_directions placed golden directions on the AMBIENT 768-sphere, where +//! concentration of measure makes iid-Gaussian already near-uniform -> doomed. +//! The marginal uniformity lemma is BLIND to joint structure; directions act on +//! the JOINT correlation. So the honest test is in the centered, low-effective- +//! dim subspace (intrinsic dim ~13), where low-discrepancy has room to matter. +//! +//! SEQUENCES (data-oblivious; PCA control is the data-DEPENDENT ceiling): +//! random — iid Gaussian directions in the k-subspace (the baseline). +//! sobol — base-2 digital net -> inverse-normal -> subspace direction. +//! Discrepancy-optimal at R=2^m; predicted to spike at R in {16,32,64}. +//! kronecker— additive {i*alpha} per axis (generalized golden) -> direction. +//! Three-distance at EVERY prefix; predicted to hold across all R but +//! degrade as k climbs (dimension factor in its discrepancy bound). +//! pca-axes — the top-R principal directions themselves (DATA-DEPENDENT control; +//! expected to win — it bounds what oblivious sets leave on the table). +//! +//! METRIC: per query, probe its bucket +/- rad on each of R directions; candidate +//! union; recall@10 vs FP32 cosine top-10. Report the FAIR ENVELOPE (max recall at +//! candidates-scanned <= budget) per sequence — the only fair cross-arm axis. +//! Also report post-map DISCREPANCY (star-discrepancy proxy) of each direction set, +//! because cube->sphere mapping warps the even-spacing and that must be MEASURED. +//! +//! PRE-REGISTERED VERDICT (fixed before running), family-level, Bonferroni in mind: +//! COMPONENT WIN if a sequence beats random by >= 0.02 recall at matched +//! candidates in >= 2 of {fiqa,nq,quora} AND survives at k near 13. +//! SOBOL-AS-PREDICTED if its wins concentrate at R in {16,32,64} (power-of-2). +//! KRONECKER-AS-PREDICTED if it wins across R but its margin shrinks as k grows. +//! CLASS-DEAD if neither low-discrepancy sequence beats random anywhere by >=0.02 +//! -> the entire oblivious-directions hypothesis (not just golden) is falsified. +//! The Kronecker/Sobol HYBRID is NOT built here; it is earned only if each wins +//! in a DISTINCT R-regime. This experiment decides whether that even applies. +//! +//! Run: cargo run --release --example subspace_directions -- \ +//! --corpus-npy /tmp/corpora/fiqa_corpus.npy --queries-npy /tmp/corpora/fiqa_q.npy --kdim 13 + +// Research probe (not library code): index loops and the inv-normal constant +// tables read closer to the math than the iterator rewrites would. +#![allow( + clippy::needless_range_loop, + clippy::type_complexity, + clippy::excessive_precision +)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn coord_mean(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += rows[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// Top-k principal directions of the centered corpus via power iteration + +/// deflation. Returns k unit vectors of length `dim`. Pure Rust, no BLAS. +/// Uses a doc subsample for the covariance action (cov is implicit: Cov*x = +/// (1/n) X^T (X x), never materialized). +fn pca_topk(centered: &[f32], n: usize, dim: usize, k: usize, mean: &[f32]) -> Vec> { + let sample: Vec = if n > 20000 { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x9C0FFEE); + (0..20000).map(|_| rng.random_range(0..n)).collect() + } else { + (0..n).collect() + }; + // implicit covariance action y = Cov * x using centered rows on the sample + let cov_mul = |x: &[f32]| -> Vec { + let acc: Vec = sample + .par_iter() + .map(|&i| { + let row = ¢ered[i * dim..(i + 1) * dim]; + let d: f32 = row.iter().zip(x).map(|(a, b)| a * b).sum(); + row.iter().map(|&r| r * d).collect::>() + }) + .reduce( + || vec![0f32; dim], + |mut a, b| { + for j in 0..dim { + a[j] += b[j]; + } + a + }, + ); + let s = sample.len() as f32; + acc.iter().map(|v| v / s).collect() + }; + let mut comps: Vec> = Vec::with_capacity(k); + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xABCD); + let _ = mean; + for _ in 0..k { + let mut v: Vec = (0..dim) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + for _ in 0..50 { + // deflate against found components + for c in &comps { + let p: f32 = v.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + v[j] -= p * c[j]; + } + } + let mut y = cov_mul(&v); + for c in &comps { + let p: f32 = y.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + y[j] -= p * c[j]; + } + } + let nrm: f32 = y.iter().map(|t| t * t).sum::().sqrt(); + if nrm < 1e-20 { + break; + } + for j in 0..dim { + v[j] = y[j] / nrm; + } + } + comps.push(v); + } + comps +} + +/// project a centered vector onto the k-dim PCA basis -> k coords. +fn project(v: &[f32], basis: &[Vec]) -> Vec { + basis + .iter() + .map(|b| v.iter().zip(b).map(|(a, c)| a * c).sum()) + .collect() +} + +// ---- direction generators in k-dim (returned as R unit vectors length k) ---- + +fn dirs_random(r: usize, k: usize) -> Vec> { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x1111 ^ (r as u64)); + (0..r) + .map(|_| { + let mut v: Vec = (0..k) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + }) + .collect() +} + +/// Van der Corput / Sobol-lite: radical inverse in base 2 per dimension using +/// distinct primes for the radical-inverse base per axis (Halton-style net, the +/// digital low-discrepancy family). inverse-normal -> unit direction. +fn radical_inverse(mut i: usize, base: usize) -> f32 { + let mut f = 1.0f64; + let mut r = 0.0f64; + while i > 0 { + f /= base as f64; + r += f * (i % base) as f64; + i /= base; + } + r as f32 +} + +const PRIMES: [usize; 32] = [ + 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, + 101, 103, 107, 109, 113, 127, 131, +]; + +/// inverse standard normal CDF (Acklam's rational approximation) for u in (0,1). +fn inv_norm(u: f32) -> f32 { + let u = u.clamp(1e-6, 1.0 - 1e-6) as f64; + let a = [ + -3.969683028665376e+01, + 2.209460984245205e+02, + -2.759285104469687e+02, + 1.383577518672690e+02, + -3.066479806614716e+01, + 2.506628277459239e+00, + ]; + let b = [ + -5.447609879822406e+01, + 1.615858368580409e+02, + -1.556989798598866e+02, + 6.680131188771972e+01, + -1.328068155288572e+01, + ]; + let c = [ + -7.784894002430293e-03, + -3.223964580411365e-01, + -2.400758277161838e+00, + -2.549732539343734e+00, + 4.374664141464968e+00, + 2.938163982698783e+00, + ]; + let d = [ + 7.784695709041462e-03, + 3.224671290700398e-01, + 2.445134137142996e+00, + 3.754408661907416e+00, + ]; + let pl = 0.02425; + let x = if u < pl { + let q = (-2.0 * u.ln()).sqrt(); + (((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + } else if u <= 1.0 - pl { + let q = u - 0.5; + let rr = q * q; + (((((a[0] * rr + a[1]) * rr + a[2]) * rr + a[3]) * rr + a[4]) * rr + a[5]) * q + / (((((b[0] * rr + b[1]) * rr + b[2]) * rr + b[3]) * rr + b[4]) * rr + 1.0) + } else { + let q = (-2.0 * (1.0 - u).ln()).sqrt(); + -(((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + }; + x as f32 +} + +fn dirs_sobol(r: usize, k: usize) -> Vec> { + (0..r) + .map(|i| { + let mut v: Vec = (0..k) + .map(|a| inv_norm(radical_inverse(i + 1, PRIMES[a % PRIMES.len()]))) + .collect(); + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() +} + +/// Kronecker: per-axis additive recurrence {i*alpha_a}, alpha from the generalized +/// golden ratio (root of x^{k+1}=x+1) approximated per-axis by primes^(1/(a+2)) frac. +fn dirs_kronecker(r: usize, k: usize) -> Vec> { + // irrational generators: fractional parts of sqrt(prime) (badly approximable, Weyl-equidistributed) + let alphas: Vec = (0..k) + .map(|a| (PRIMES[a % PRIMES.len()] as f64).sqrt().fract()) + .collect(); + (0..r) + .map(|i| { + let mut v: Vec = (0..k) + .map(|a| inv_norm(((i as f64 + 1.0) * alphas[a]).fract() as f32)) + .collect(); + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() +} + +fn dirs_pca_axes(r: usize, k: usize) -> Vec> { + // data-dependent ceiling: the k principal axes, then (for R>k) random unit + // combinations of them seeded deterministically — so R>k gives R DISTINCT + // directions spanning the populated subspace, not repeats (control fix). + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xACE5); + (0..r) + .map(|i| { + let mut v = vec![0f32; k]; + if i < k { + v[i] = 1.0; + } else { + for x in v.iter_mut() { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + *x = (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos(); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() +} + +/// star-discrepancy proxy: max over a random set of half-spaces of |empirical - expected|. +fn discrepancy_proxy(dirs: &[Vec], k: usize) -> f32 { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xDDDD); + let probes = 2000; + let mut worst = 0f32; + for _ in 0..probes { + // random half-space normal + let mut hn: Vec = (0..k) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + let hnm: f32 = hn.iter().map(|t| t * t).sum::().sqrt(); + for x in hn.iter_mut() { + *x /= hnm; + } + let frac = dirs + .iter() + .filter(|d| d.iter().zip(&hn).map(|(a, b)| a * b).sum::() > 0.0) + .count() as f32 + / dirs.len() as f32; + worst = worst.max((frac - 0.5).abs()); + } + worst +} + +struct Proj { + dir: Vec, + width: f32, +} +impl Proj { + fn bucket(&self, v: &[f32]) -> i64 { + let dot: f32 = self.dir.iter().zip(v).map(|(a, b)| a * b).sum(); + (dot / self.width).floor() as i64 + } +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut kdim = 13usize; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + "--kdim" => kdim = a.next().unwrap().parse().unwrap(), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + + // ground truth = FP32 cosine top-10 on normalized RAW vectors + let truth: Vec> = (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(10).map(|(i, _)| i).collect() + }) + .collect(); + + // center + let mean = coord_mean(&corpus, n, dim); + let mut cc = corpus.clone(); + for i in 0..n { + for c in 0..dim { + cc[i * dim + c] -= mean[c]; + } + } + let mut qc = queries.clone(); + for i in 0..nq { + for c in 0..dim { + qc[i * dim + c] -= mean[c]; + } + } + + // PCA basis from centered corpus, project both + eprintln!("# computing top-{kdim} PCA basis (power iteration)..."); + let basis = pca_topk(&cc, n, dim, kdim, &mean); + let cproj: Vec> = (0..n) + .into_par_iter() + .map(|i| project(&cc[i * dim..(i + 1) * dim], &basis)) + .collect(); + let qproj: Vec> = (0..nq) + .into_par_iter() + .map(|i| project(&qc[i * dim..(i + 1) * dim], &basis)) + .collect(); + + // width calibration: ~sqrt(n) buckets along a unit direction in subspace + let spread = { + let mut s = 0f32; + for v in cproj.iter().take(2000) { + for &x in v { + s += x * x; + } + } + (s / (2000.0 * kdim as f32)).sqrt() * 6.0 + }; + let base_width = spread / (n as f32).sqrt(); + + println!("# subspace_directions (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, k={kdim}"); + println!("# centered+projected to k-dim PCA subspace; recall@10 vs FP32 cosine top-10"); + let gens: [(&str, fn(usize, usize) -> Vec>); 4] = [ + ("random", dirs_random), + ("sobol", dirs_sobol), + ("kronecker", dirs_kronecker), + ("pca-axes", dirs_pca_axes), + ]; + let r_values = [8usize, 16, 32, 64, 128]; + + println!("seq\tR\tdiscrepancy\trad\tcand\trecall@10"); + let mut pts: Vec<(String, f64, f64)> = vec![]; + for (name, gen) in gens.iter() { + for &r in &r_values { + let dirs = gen(r, kdim); + let disc = discrepancy_proxy(&dirs, kdim); + let projs: Vec = dirs + .into_iter() + .map(|d| Proj { + dir: d, + width: base_width, + }) + .collect(); + // index: per direction, bucket -> docs + let index: Vec>> = projs + .par_iter() + .map(|p| { + let mut m: std::collections::HashMap> = + std::collections::HashMap::new(); + for di in 0..n { + m.entry(p.bucket(&cproj[di])).or_default().push(di as u32); + } + m + }) + .collect(); + for rad in [0i64, 1, 2, 4] { + let stats: Vec<(usize, f32)> = (0..nq) + .into_par_iter() + .map(|qi| { + use std::collections::HashSet; + let mut cand: HashSet = HashSet::new(); + for (pi, p) in projs.iter().enumerate() { + let b = p.bucket(&qproj[qi]); + for bb in (b - rad)..=(b + rad) { + if let Some(ids) = index[pi].get(&bb) { + cand.extend(ids.iter().copied()); + } + } + } + let hits = truth[qi].iter().filter(|&&i| cand.contains(&i)).count(); + (cand.len(), hits as f32 / truth[qi].len() as f32) + }) + .collect(); + let mc = stats.iter().map(|s| s.0 as f64).sum::() / nq as f64; + let mr = stats.iter().map(|s| s.1 as f64).sum::() / nq as f64; + println!("{name}\t{r}\t{disc:.4}\t{rad}\t{mc:.0}\t{mr:.4}"); + pts.push((name.to_string(), mc, mr)); + } + } + } + println!("\n# FAIR ENVELOPE: max recall@10 at candidates-scanned <= budget"); + let budgets = [500.0, 1000.0, 2000.0, 4000.0, 8000.0]; + print!("budget"); + for (name, _) in gens.iter() { + print!("\t{name}"); + } + println!(); + for &bud in &budgets { + print!("{bud:.0}"); + for (name, _) in gens.iter() { + let best = pts + .iter() + .filter(|(nm, c, _)| nm == name && *c <= bud) + .map(|(_, _, r)| *r) + .fold(0.0, f64::max); + print!("\t{best:.4}"); + } + println!(); + } + println!("\n# PRE-REGISTERED: COMPONENT WIN if sobol/kronecker beat random by >=0.02 at matched budget in >=2 corpora;"); + println!("# sobol-as-predicted if wins concentrate at R in {{16,32,64}}; kronecker if across R but shrinks as k grows;"); + println!("# CLASS-DEAD if neither beats random by >=0.02 anywhere. Hybrid earned only on distinct-regime component wins."); +} diff --git a/experiments/ordinal-routing-research/uniformity_lemma.rs b/experiments/ordinal-routing-research/uniformity_lemma.rs new file mode 100644 index 00000000..e063b240 --- /dev/null +++ b/experiments/ordinal-routing-research/uniformity_lemma.rs @@ -0,0 +1,219 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — proves the uniformity lemma underlying +//! ordvec's equal-width bucketing, and shows WHY golden-ratio boundaries break it. +//! Isolated by request: touches no harness, writes nothing. Verdicts fixed below. +//! +//! THE LEMMA (construction rationale, not a retrieval claim): +//! The rank transform maps a vector with distinct coords to a permutation of +//! {0..D-1} — an EXACTLY uniform coordinate distribution. Consequences: +//! (1) equal-width bucketing (2^B | D) is MAX-ENTROPY: every bucket is +//! equiprobable, so the B-bit code wastes no bits. +//! (2) CONSTANT-COMPOSITION: every document puts exactly D/2^B coords in +//! each bucket — identical histogram across all docs. This is what gives +//! the bitmap overlap its closed-form HYPERGEOMETRIC null E[X]=n_top^2/D. +//! (3) Any NON-uniform boundary scheme (golden-ratio / three-gap included) +//! makes bucket cardinalities UNEQUAL -> lower entropy AND breaks +//! constant-composition -> destroys the closed-form null. +//! So golden ratio is not merely unhelpful on the rank domain: it would remove +//! the property ordvec's headline theorem stands on. Equal-width is FORCED. +//! +//! MEASURED on real ranks (corpus rank codes), three boundary schemes for B bits: +//! equal-width : cuts at k*D/2^B (uniform) +//! golden : cuts at sorted {frac(i*phi)}*D, i=1..2^B-1 (three-gap/low-disc) +//! random : cuts at 2^B-1 sorted uniform random points (control) +//! Metrics: per-bucket composition std across docs (0 == constant-composition); +//! code entropy in bits (ceiling = B); hypergeometric-null error +//! |observed mean top-bucket overlap - n_top^2/D| / (n_top^2/D). +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! LEMMA HOLDS if, for B in {1,2,4}: +//! equal-width composition-std ≈ 0 (< 0.01 coords) AND entropy ≈ B (>= B-0.01) +//! AND hypergeom-null error < 0.02; AND golden/random are strictly worse on +//! composition-std and entropy. Else the lemma as stated is wrong — report it. +//! +//! Run: cargo run --release --example uniformity_lemma -- --corpus-npy /tmp/repo_corpus.npy + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +/// rank[i] = ordinal position (0..dim) of coordinate i within the row. +fn rank_encode(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut out = vec![0u16; n * dim]; + out.par_chunks_mut(dim).enumerate().for_each(|(r, code)| { + let row = &rows[r * dim..(r + 1) * dim]; + let mut idx: Vec = (0..dim as u16).collect(); + idx.sort_by(|&a, &b| row[a as usize].partial_cmp(&row[b as usize]).unwrap()); + for (rank, &i) in idx.iter().enumerate() { + code[i as usize] = rank as u16; + } + }); + out +} + +/// Boundary cut points in rank units [0, dim], length 2^B - 1, ascending. +fn boundaries(scheme: &str, bits: u32, dim: usize) -> Vec { + let nb = 1usize << bits; + match scheme { + "equal" => (1..nb).map(|k| k as f64 * dim as f64 / nb as f64).collect(), + "golden" => { + let phi = (1.0 + 5.0f64.sqrt()) / 2.0; + let mut cuts: Vec = (1..nb) + .map(|i| ((i as f64 * phi).fract()) * dim as f64) + .collect(); + cuts.sort_by(|a, b| a.partial_cmp(b).unwrap()); + cuts + } + "random" => { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xBEEF); + let mut cuts: Vec = (1..nb) + .map(|_| rng.random_range(0.0..1.0) * dim as f64) + .collect(); + cuts.sort_by(|a, b| a.partial_cmp(b).unwrap()); + cuts + } + _ => unreachable!(), + } +} + +fn bucket_of(rank: u16, cuts: &[f64]) -> usize { + let r = rank as f64 + 0.5; + cuts.iter().filter(|&&c| r > c).count() +} + +fn main() { + let mut corpus_path = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + if x == "--corpus-npy" { + corpus_path = Some(a.next().unwrap()); + } + } + let path = corpus_path.expect("--corpus-npy required"); + let (corpus, n, dim) = load_npy_f32(&path); + let ranks = rank_encode(&corpus, n, dim); + let row = |i: usize| &ranks[i * dim..(i + 1) * dim]; + + println!("# uniformity_lemma (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}"); + println!("# per-bucket composition std across docs (0=constant-composition), code entropy (bits), hypergeom-null err"); + println!("scheme\tB\tcomp_std\tentropy\tnull_err"); + + let mut rng_pairs = ChaCha8Rng::seed_from_u64(SEED); + let pairs: Vec<(usize, usize)> = (0..50_000) + .map(|_| { + let i = rng_pairs.random_range(0..n); + let mut j = rng_pairs.random_range(0..n); + if j == i { + j = (j + 1) % n; + } + (i, j) + }) + .collect(); + + for &bits in &[1u32, 2, 4] { + let nb = 1usize << bits; + for scheme in ["equal", "golden", "random"] { + let cuts = boundaries(scheme, bits, dim); + + // composition: count per bucket per doc; std of counts across docs, per bucket, then mean. + let comps: Vec> = (0..n) + .into_par_iter() + .map(|d| { + let mut c = vec![0f64; nb]; + for &rk in row(d) { + c[bucket_of(rk, &cuts)] += 1.0; + } + c + }) + .collect(); + let mut comp_std_sum = 0.0; + for b in 0..nb { + let vals: Vec = comps.iter().map(|c| c[b]).collect(); + let m = vals.iter().sum::() / n as f64; + let v = vals.iter().map(|x| (x - m) * (x - m)).sum::() / n as f64; + comp_std_sum += v.sqrt(); + } + let comp_std = comp_std_sum / nb as f64; + + // code entropy: bucket occupancy probabilities (same for every doc under + // constant-composition); use the mean histogram normalized. + let mut hist = vec![0f64; nb]; + for c in &comps { + for b in 0..nb { + hist[b] += c[b]; + } + } + let tot: f64 = hist.iter().sum(); + let entropy: f64 = hist + .iter() + .filter(|&&h| h > 0.0) + .map(|&h| { + let p = h / tot; + -p * p.log2() + }) + .sum(); + + // hypergeometric null: top bucket = highest bucket id; overlap of top sets. + let n_top_counts: Vec = comps.iter().map(|c| c[nb - 1]).collect(); + let mean_ntop = n_top_counts.iter().sum::() / n as f64; + let top_sets: Vec> = (0..n) + .into_par_iter() + .map(|d| { + row(d) + .iter() + .enumerate() + .filter(|(_, &rk)| bucket_of(rk, &cuts) == nb - 1) + .map(|(i, _)| i as u16) + .collect() + }) + .collect(); + let obs_overlap: f64 = pairs + .par_iter() + .map(|&(i, j)| top_sets[i].intersection(&top_sets[j]).count() as f64) + .sum::() + / pairs.len() as f64; + let expected = mean_ntop * mean_ntop / dim as f64; + let null_err = if expected > 0.0 { + (obs_overlap - expected).abs() / expected + } else { + f64::NAN + }; + + println!("{scheme}\t{bits}\t{comp_std:.4}\t{entropy:.4}\t{null_err:.4}"); + } + } + println!("\n# LEMMA HOLDS iff equal-width: comp_std<0.01, entropy>=B-0.01, null_err<0.02; golden/random strictly worse on comp_std & entropy."); +} From 0db751512d88f05d64735c5c5bbff3f20ee2400f Mon Sep 17 00:00:00 2001 From: Todd Baur Date: Tue, 16 Jun 2026 15:48:02 -0700 Subject: [PATCH 13/16] =?UTF-8?q?Phase=20B:=20messy=20multi-cone=20"lake"?= =?UTF-8?q?=20robustness=20=E2=80=94=20all=20three=20lake=20fears=20unfoun?= =?UTF-8?q?ded?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Synthetic enterprise lake = union of 3 nomic domains (fiqa+nq+quora = 281,729 docs, 3,000 queries spanning cones) + optional templated near-duplicate hub injection (make_lake.py). Three pre-registered questions, validated probes from round 2: A. Global centering breaks on multi-cone? NO — still collapses cone_base to the uniform null (B=4: 11.2->3.25), under the 0.30 "insufficient" bar. 3 domains share a removable common mean. (Falsifies the "no single removable mean" pred.) B. Higher union ID revives oblivious structure? Premise FALSE — union ID = 8.2 (vs ~24/domain). Far-apart clusters read as discrete blobs = LOWER intrinsic dim. Cones add separation, not dimensions. (Falsifies the "union->higher ID" pred.) C. Brittle to templated hubs? NO — b=4 R@10 flat across 0/5/10/15% hub prevalence (0.843 -> 0.841, drop -0.002, essentially immune). Fixed-mass rank code has no global frequency/IDF term for a hub to poison. Verdict: ordvec's training-free encoding does NOT degrade on multi-cone / hub-heavy lake geometry — if anything the geometry is easier (lower ID, globally centerable, hub-immune). Both pre-registered "messy lake = harder" intuitions were falsified by the validated probes — the payoff of the negative-verification discipline. CAVEAT: synthetic union of CURATED corpora; models multi-cone + hubs, NOT the OCR/mixed-language/length heterogeneity of a true S3 dump. Established: multi-domain union is benign. Not established: raw dirty S3 sludge is benign. No crate or public-API changes — experiments/ only. Signed-off-by: Todd Baur Co-Authored-By: Claude Opus 4.8 --- .../ordinal-routing-research/make_lake.py | 76 +++++++++++++++++++ .../oblivious_directions_results.md | 56 ++++++++++++++ 2 files changed, 132 insertions(+) create mode 100644 experiments/ordinal-routing-research/make_lake.py diff --git a/experiments/ordinal-routing-research/make_lake.py b/experiments/ordinal-routing-research/make_lake.py new file mode 100644 index 00000000..19143edc --- /dev/null +++ b/experiments/ordinal-routing-research/make_lake.py @@ -0,0 +1,76 @@ +#!/usr/bin/env python3 +"""Build a synthetic multi-cone 'lake' from existing per-domain .npy embeddings. +Stdlib only (struct/array). Models two real enterprise-lake pathologies: + 1. MULTI-CONE: concatenate N domain corpora (each its own cone) into one corpus. + 2. TEMPLATED HUBS: optionally inject near-duplicate clusters (one base vector + + tiny gaussian noise) at a target prevalence -> artificial hubs. +Queries: concatenate the per-domain held-out query sets (ground truth spans cones). +Deterministic (fixed LCG); no numpy. +""" +import struct, array, sys, math, argparse + +def read_npy(path): + b = open(path, "rb").read() + assert b[:6] == b"\x93NUMPY" + hl = struct.unpack("> 7; x ^= (x << 17) & 0xFFFFFFFFFFFFFFFF + _state[0] = x; return (x & 0xFFFFFFFF) / 0x100000000 +def _g(): + u1 = max(_u(), 1e-12); u2 = _u() + return math.sqrt(-2*math.log(u1)) * math.cos(2*math.pi*u2) + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--parts", nargs="+", required=True, help="domain corpus .npy files") + ap.add_argument("--out", required=True) + ap.add_argument("--hub-frac", type=float, default=0.0, help="fraction of corpus that is templated near-dupes") + ap.add_argument("--hub-clusters", type=int, default=50, help="number of distinct templated clusters") + ap.add_argument("--hub-noise", type=float, default=0.02, help="gaussian noise added to base vec (pre-normalize)") + ap.add_argument("--cap", type=int, default=0, help="cap docs per part (0=all)") + args = ap.parse_args() + + dim = None; rows = array.array("f"); total = 0 + for p in args.parts: + a, n, d = read_npy(p) + dim = d if dim is None else dim + assert d == dim, f"dim mismatch {p}" + take = n if args.cap == 0 else min(n, args.cap) + rows.extend(a[:take*d]); total += take + print(f"# {p}: +{take} docs", file=sys.stderr) + + # inject templated hubs: replace a fraction of rows with near-dupes of K bases + if args.hub_frac > 0: + n_hub = int(total * args.hub_frac) + # pick K base vectors from existing rows (deterministic indices) + bases = [ [rows[(i*9973 % total)*dim + j] for j in range(dim)] for i in range(args.hub_clusters) ] + for h in range(n_hub): + base = bases[h % args.hub_clusters] + tgt = (h * 7919) % total # deterministic overwrite positions + row = [base[j] + args.hub_noise * _g() for j in range(dim)] + nrm = math.sqrt(sum(x*x for x in row)) or 1.0 + for j in range(dim): + rows[tgt*dim + j] = row[j] / nrm + print(f"# injected {n_hub} templated-hub docs ({args.hub_frac:.0%}) across {args.hub_clusters} clusters", file=sys.stderr) + + write_npy(args.out, rows, total, dim) + +if __name__ == "__main__": + main() diff --git a/experiments/ordinal-routing-research/oblivious_directions_results.md b/experiments/ordinal-routing-research/oblivious_directions_results.md index 00951ae4..a207a4db 100644 --- a/experiments/ordinal-routing-research/oblivious_directions_results.md +++ b/experiments/ordinal-routing-research/oblivious_directions_results.md @@ -246,6 +246,61 @@ re-embed nomic@1100 on the SAME 57.6k fiqa docs to measure the shift directly documents at every chunk length at once → a *mixture* of geometries, not one — so chunk-length heterogeneity is itself a lake pathology (folds into Path B). +## Phase B — the messy multi-cone "lake" (deployment robustness) + +Built a synthetic enterprise lake: union of 3 nomic-embedded domains (fiqa 57.6k + +nq 74.1k + quora 150k = **281,729 docs**, 3,000 queries spanning all cones), with +optional templated near-duplicate hub injection. Two pre-registered predictions +about lake geometry — **both FALSIFIED** by the validated probes (the payoff of the +negative-verification discipline: locked thresholds caught my wrong intuitions): + +- **Predicted: union of cones → higher effective ID.** WRONG — union ID = **8.2** + (vs ~24 per-domain). Far-apart domain clusters read to TwoNN as a few discrete + blobs = *lower* intrinsic dim. Cones add separation, not dimensions. A + multi-domain lake is **lower-ID and more clustered**, not higher-ID. (This also + voids Phase B's Question B — there is no ID rise to revive oblivious structure.) +- **Predicted: no single removable mean (global centering breaks).** WRONG — + global centering still collapses cone_base to the uniform null (B=4: 11.2→3.25 ≈ + textbook 3.0), well under the pre-registered 0.30 "insufficient" bar. The 3 + domains share enough common direction that ONE global mean still de-hubs them. + Global centering **holds** on the multi-cone union. + +Net: the "messy lake = higher-ID, locally-coned, needs per-cluster centering" +intuition is wrong on both counts. A multi-domain union is lower-ID and globally +centerable. + +**Question C — templated-hub robustness (the dominant real-lake pathology).** +Inject near-duplicate clusters (1 base vec + tiny noise) at rising prevalence; +measure raw RankQuant b=4 R@10 vs FP32 cosine on the 3000-query lake set: + +| hub prevalence | b4 R@10 | Δ from clean | +|----------------|---------|--------------| +| 0% | 0.8429 | — | +| 5% | 0.8408 | −0.0021 | +| 10% | 0.8420 | −0.0009 | +| 15% | 0.8421 | −0.0008 | + +Pre-registered "graceful if drop < 0.10 through 15%." Actual: **−0.002, flat to +noise — essentially immune.** Mechanism: the fixed-mass rank code has no global +frequency/IDF term for a hub to poison; injected boilerplate is far from real +queries in rank space and never enters the top-k. Training-free + data-oblivious +is *exactly* the property that confers hub-robustness (a learned codebook or +IDF-weighted scheme would be corrupted by hub prevalence; ordvec is not). + +## Phase B verdict — ordvec is robust on multi-cone lake geometry + +All three pre-registered lake fears were unfounded: a multi-domain union is +**lower-ID** (8.2 vs ~24/domain), **still globally centerable**, and **immune to +templated hubs** (−0.002 R@10 through 15%). b=4 raw holds R@10 ≈ 0.84 throughout. +The deployment-relevant conclusion: ordvec's training-free encoding does NOT +degrade on multi-cone / hub-heavy data — if anything the geometry is easier. + +CAVEAT (honest scope): this is a synthetic union of 3 *curated* corpora. It models +multi-cone structure + templated hubs, but NOT the OCR-garbage / mixed-language / +extreme-length heterogeneity of a true S3 dump (uncapturable from clean +embeddings). Claim established: "multi-domain union is benign." Claim NOT +established: "raw dirty S3 sludge is benign" — that needs actual dirty data. + ## Reproduce ``` # embed real corpora via ollama nomic-embed-text -> .npy (see REAL_CORPUS_RUNBOOK.md) @@ -253,4 +308,5 @@ cargo run --release --example uniformity_lemma -- --corpus-npy CORPUS.npy cargo run --release --example overlap_decomp -- --corpus-npy CORPUS.npy --queries-npy Q.npy cargo run --release --example centering_recall -- --corpus-npy CORPUS.npy --queries-npy Q.npy cargo run --release --example subspace_directions -- --corpus-npy CORPUS.npy --queries-npy Q.npy --kdim 13 +# Phase B lake: python make_lake.py --parts fiqa_corpus.npy nq_corpus.npy quora_corpus.npy --out lake.npy [--hub-frac 0.1] ``` From e406bfbe68d2743156f13a539ab57d1b8d622352 Mon Sep 17 00:00:00 2001 From: Todd Baur Date: Tue, 16 Jun 2026 16:21:42 -0700 Subject: [PATCH 14/16] =?UTF-8?q?Path=20B:=20chunk-length-mixture=20lake?= =?UTF-8?q?=20=E2=80=94=20fear=20falsified,=20lake=20arc=20closed?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The last synthetic-lake pathology capturable without real dirty data. Embed the same 57.6k fiqa docs at four chunk lengths {128,256,512,1100} and union into a 230k-doc length-mixture lake; re-run the incumbent centering_recall b=4 raw arm. Two pre-registered findings: - The "chunk length is a third geometry axis" claim: real but SMALL and CO-AXIAL. R̄ (cone tightness) spreads only 0.705->0.723 over an 8.6x length range and the four cone axes are >=0.986 aligned — not the distinct geometries the "mixture of geometries" framing imagined. Direction (longer->tighter) holds; magnitude tiny. - b=4 raw R@10 is IMMUNE to the mixture: 0.8230 (single-length) -> 0.8253 (lake), Δ +0.002 flat to noise, CR@100 stays 1.0. Same mechanism: per-vector rank codes discard the magnitude a tightness-shift moves, so global bin edges can't be poisoned — the same training-free property that gave Phase B hub-immunity. With Phase B (multi-domain cones + hubs) this leaves every synthetic lake pathology benign for "spend the bits, b=4, raw ranks." Only un-run test left needs real dirty data (OCR/multilingual S3 sludge), uncapturable from clean embeddings. Probes: make_length_lake.py + centering_recall.rs. Reconciles README tiers, the deployment-question section, and the Path B forward-pointer in the directions doc. Signed-off-by: Todd Baur --- .../ordinal-routing-research/README.md | 8 ++ .../length_mixture_lake_results.md | 126 ++++++++++++++++++ .../make_length_lake.py | 64 +++++++++ .../oblivious_directions_results.md | 5 +- 4 files changed, 202 insertions(+), 1 deletion(-) create mode 100644 experiments/ordinal-routing-research/length_mixture_lake_results.md create mode 100644 experiments/ordinal-routing-research/make_length_lake.py diff --git a/experiments/ordinal-routing-research/README.md b/experiments/ordinal-routing-research/README.md index 4121bcff..0b5305f1 100644 --- a/experiments/ordinal-routing-research/README.md +++ b/experiments/ordinal-routing-research/README.md @@ -37,6 +37,7 @@ tiered below by **what survived scrutiny**. Read the tiers, not every doc. | [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) (open PR, `sorry`-free). | | [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. | | [oblivious_directions_results.md](oblivious_directions_results.md) | **The directions arc (round 2).** Data-oblivious low-discrepancy directions (golden-angle / Sobol / Kronecker) do NOT beat iid-random for training-free routing — across 5 encoders (nomic, bge-m3, bge-large, snowflake-arctic-v2, harrier-oss) at real intrinsic dim 18–24. CLASS-DEAD, pre-registered, replicated (the one mid-ladder flicker failed to replicate). Centering removes the cone but fails at b=4 (penalty grows with capacity). One robust positive: data-aligned (PCA) directions lead at higher ID — the lever is data-alignment, which training-free forbids. Also **resolves the twonn_id PARTIAL**: real-corpus ID measured at ~18–24 across 5 encoders, and ID is a **corpus** property (repo≈13 vs fiqa≈24, same encoder), not an encoder constant. Probes: `uniformity_lemma.rs`, `overlap_decomp.rs`, `centering_recall.rs`, `subspace_directions.rs`, `partition_balance.rs`, `fib_*.rs`. | +| [length_mixture_lake_results.md](length_mixture_lake_results.md) | **Path B — chunk-length-mixture lake (closes the synthetic-lake arc).** Same fiqa docs embedded at 4 chunk lengths {128,256,512,1100} unioned into a 230k-doc lake; b=4 raw R@10 vs FP32 cosine is **immune** (+0.002, CR@100=1.0). Bonus measurement of the "chunk length is a third geometry axis" claim: real but **small and co-axial** — R̄ spreads only 0.705→0.723 over an 8.6× length range, cone axes ≥0.986 aligned (not the distinct geometries the mixture framing imagined). With Phase B (multi-domain) this leaves every synthetic lake pathology — multi-cone, hub, multi-length — benign for "spend the bits, b=4." Probe: `make_length_lake.py` + `centering_recall.rs`. | ## THEORY — directionally right, restated honestly @@ -76,6 +77,13 @@ tau gap is too weak to ORDER them — it scores below b=2's own ordering. The density-collapse signal is **real but inert**: "just use b=4," no ordvec feature follows. +The **deployment-robustness** sub-arc is likewise resolved negative: across every +synthetic lake pathology — multi-domain cones + templated hubs (Phase B) and now a +chunk-length mixture ([length_mixture_lake_results.md](length_mixture_lake_results.md), +Path B) — b=4 raw routing does not degrade (hub Δ −0.002 through 15%; length-mixture +Δ +0.002). The only un-run test left needs *real* dirty data (OCR / multilingual S3 +sludge), uncapturable from clean embeddings. + This is the honest bottom line of the whole branch: a characterized mechanism and a clean negative. **Research, not a feature** — the prime/spectral/permutation ideas for dense-region retrieval do not beat the boring baseline (spend the bits). diff --git a/experiments/ordinal-routing-research/length_mixture_lake_results.md b/experiments/ordinal-routing-research/length_mixture_lake_results.md new file mode 100644 index 00000000..31b2e664 --- /dev/null +++ b/experiments/ordinal-routing-research/length_mixture_lake_results.md @@ -0,0 +1,126 @@ +# Path B — the chunk-length-mixture lake (deployment robustness, final arc) + +The Phase B caveat split the unmodeled real-lake pathologies in two: OCR-garbage / +mixed-language (uncapturable from clean embeddings — needs a real dirty S3 sample) +and **chunk-length heterogeneity** (a *mixture of geometries* — capturable +synthetically). The directions writeup flagged the latter as **Path B** +(`oblivious_directions_results.md`, "chunk-length heterogeneity is itself a lake +pathology (folds into Path B)") and named chunk length a **third geometry axis** +(longer chunks → tokens average → vectors regress to the corpus mean → tighter cone, +lower ID). Path B tests both at once: does a real chunk-length *mixture* break b=4 raw +rank routing, and how big is the chunk-length geometry axis actually? + +Domain is held constant (fiqa, 57.6k docs), only the chunk length varies — the clean +complement to Phase B, which held length constant and varied domain. + +## Setup + +Embed the SAME 57,600 fiqa docs (nomic-embed-text, gpu1) at four chunk-length caps +`{128, 256, 512, 1100}` chars, then union into one 230,388-doc lake +(`make_length_lake.py`). Queries: the full-length fiqa held-out set (1000) — a real +lake mixes *document* chunk lengths, not query lengths. Probe: the **same +`centering_recall.rs`** whose raw b=4 arm IS the incumbent that beat every +oblivious-structure idea. Metric: R@10 of b-bit rank codes vs **FP32-cosine top-10 +recomputed on each corpus** — a routing-*fidelity* metric, so the 4× size difference +cancels (it hits the code arm and its own FP32 ground truth equally; no size confound). + +## Pre-registration (locked before the lake run) + +- **Path B fear CONFIRMED** (the first positive of the whole arc) if b=4 raw R@10 on + the length mixture drops **≥ 0.02** vs the single-length fiqa@512 baseline. +- **FALSIFIED** (consistent with every prior lake fear) if flat to noise (|Δ| < 0.02). +- Secondary, pre-stated from the "third geometry axis" claim: longer chunks → tighter + cone (higher R̄) and the strata cones point in *different* directions (a true mixture + of geometries, not one). + +## Result 1 — the chunk-length geometry axis is real but SMALL and CO-AXIAL + +Per-stratum cone tightness R̄ (mean cosine to that stratum's centroid) and pairwise +centroid-axis cosine, all on the identical 57.6k docs: + +| chunk cap | R̄ (cone tightness) | NaN/zero rows | +|-----------|--------------------|---------------| +| 128 | 0.7046 | 0 | +| 256 | 0.7151 | 0 | +| 512 | 0.7198 | 0 | +| 1100 | 0.7234 | 12 | + +Cone-axis cosine between strata: **0.986–0.999** (128-vs-1100 = 0.9856; the rest tighter). + +The predicted direction holds — longer chunks DO give a tighter cone (R̄ rises +monotonically 0.705→0.723) — but the magnitude is tiny: a 0.019 R̄ spread across an +8.6× chunk-length range, and the four cones are essentially **co-axial** (≥0.986). So +chunk length is a real third geometry axis but a *weak* one on this encoder/corpus: it +shifts cone tightness slightly along a shared axis, it does NOT produce the distinct, +differently-pointed cones the "mixture of geometries" framing imagined. (Caveat: fiqa +docs are short Q&A; a corpus with genuinely long source documents — where a 128-char +cap truly amputates content — could spread R̄ more. Established here only for +fiqa/nomic.) + +## Result 2 — b=4 raw routing is IMMUNE to the length mixture (fear FALSIFIED) + +`centering_recall` raw arm, 1000 fiqa queries, FP32-cosine top-10 ground truth: + +| corpus | b=4 raw R@10 | b=4 raw CR@100 | +|--------|--------------|----------------| +| baseline fiqa@512 (57.6k) | 0.8230 | 1.0000 | +| **Path B length-mixture lake (230k, 4 lengths)** | **0.8253** | 1.0000 | +| **Δ** | **+0.0023** | 0.0000 | + +Pre-registered ≥0.02 drop to confirm the fear. **Actual +0.002 — flat to noise, if +anything slightly up.** Candidate recall stays a perfect 1.0000: every FP32 top-10 +neighbour is still captured in the top-100 by code distance on the 4×-larger mixed lake. + +The centering signature is unchanged from single-length, too — b1 raw 0.516 → centered ++0.055, b2/b4 centered net-negative (b4 ΔR −0.079) — same PARTIAL pattern, same +capacity-scaling penalty as every prior corpus. The mixture changed neither the routing +fidelity nor the centering mechanism; it just reconfirmed both on 4× the data. + +## Why the global bin edges survive a length mixture + +`centering_recall`'s encoder is **per-dimension rank → equal-width bins** — the rank +transform is computed within each vector, so a slightly-tighter (long-chunk) or +slightly-looser (short-chunk) cone produces the same *ranks*; only the raw cosine +magnitudes differ, and ranks discard magnitude. The fixed-mass rank code has nothing +global for a tightness-shift to poison, exactly as the templated-hub immunity in Phase B +had nothing global (no IDF term) for a hub to poison. Training-free + rank-based is again +the property that confers the robustness. + +## Verdict — Path B closes the synthetic-lake arc on the same negative + +The last fear capturable without real dirty data is unfounded: a chunk-length mixture is +**not** a mixture of distinct geometries on fiqa/nomic (cones co-axial, R̄ spread 0.019), +and b=4 raw routing is **immune** to it (+0.002 R@10, CR 1.0). Combined with Phase B +(multi-domain union: lower-ID, globally centerable, hub-immune), every synthetic lake +pathology — multi-cone, templated-hub, and now multi-length — leaves the boring "spend +the bits, b=4, raw ranks" baseline intact. + +CAVEAT (honest scope, unchanged): this is still *clean* embeddings of *curated* fiqa +text truncated to length. It models length-geometry heterogeneity; it does NOT model +OCR garbage, mixed-language, or broken-encoding sludge. The one remaining +non-derivative test is an actual dirty S3 sample (the OCR/multilingual half of the Phase +B caveat). Claim established: "a chunk-length mixture is benign for b=4 routing." Claim +NOT established: "raw dirty S3 sludge is benign." + +## Reproduce +``` +# 1. embed fiqa text truncated to each chunk-length cap (gpu1 nomic) +for N in 128 256; do python3 - < separated cones), this unions the SAME +documents embedded at several chunk lengths {128,256,512,1100}. Domain is held +constant; only the length-geometry varies. Models the real-lake pathology where one +S3 bucket holds the same content chunked every which way -> a *mixture of cones of +different tightness* quantized under one global set of rank-bin edges. + +Stdlib only (struct/array), matches make_lake.py's npy IO. Deterministic. +""" +import struct, array, sys, argparse + +def read_npy(path): + b = open(path, "rb").read() + assert b[:6] == b"\x93NUMPY" + hl = struct.unpack(" Date: Tue, 16 Jun 2026 19:29:08 -0700 Subject: [PATCH 15/16] docs: graduate the messy-corpus robustness finding into product docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The one reusable positive from the ordinal-routing research arc: ordvec is robust-by-construction on heterogeneous corpora, and it's a direct corollary of the constant-composition mechanism already documented in RANK_MODES — ranks discard magnitude and there's no global IDF term, so the two failure modes that corrupt learned codebooks (near-duplicate hubs, mixed chunk lengths) have nothing to grab. Promotes it out of experiments/ (research record) into the product surface: - docs/RANK_MODES.md: new "A consequence: robust by construction on messy corpora" section right after the mechanism, with the measured numbers (hub −0.002 through 15%; chunk-length mixture +0.002) and honest scope (synthetic-messy curated corpora; does NOT model OCR/multilingual/broken-encoding). - README "What's different": a discoverable bullet linking to the section. Numbers and full pre-registered record stay in experiments/ordinal-routing-research/; this is the citation, not new claims. Signed-off-by: Todd Baur --- README.md | 6 ++++++ docs/RANK_MODES.md | 37 +++++++++++++++++++++++++++++++++++++ 2 files changed, 43 insertions(+) diff --git a/README.md b/README.md index 2975ba8c..1b5aa6c7 100644 --- a/README.md +++ b/README.md @@ -82,6 +82,12 @@ structure of each vector on its own: fit step. Encoding is a per-vector rank (or sign) transform — index the very first vector with no prior data, and never refit when the corpus drifts. +- **Robust by construction on messy corpora.** Because the code is built + from per-vector ranks (magnitude discarded) with no global frequency / IDF + term, the two things that corrupt learned codebooks — near-duplicate hubs + and mixed chunk lengths — have nothing to grab: b=4 R@10 moved −0.002 under + 15% templated-hub injection and +0.002 across a four-chunk-length mixture. + ([details + honest scope](docs/RANK_MODES.md#a-consequence-robust-by-construction-on-messy-corpora)) - **Zero system dependencies.** Pure Rust — no BLAS / LAPACK / `ndarray` / `faer`. Builds and cross-compiles cleanly, including to `aarch64` and `wasm32`. diff --git a/docs/RANK_MODES.md b/docs/RANK_MODES.md index 381ebee2..01b54091 100644 --- a/docs/RANK_MODES.md +++ b/docs/RANK_MODES.md @@ -158,6 +158,43 @@ fraction of the full-scan latency. The `bench_rank` run prints this as its `TwoStage ...` rows with the per-M candidate-recall (`CR`) figure attached. +## A consequence: robust by construction on messy corpora + +The same two properties that make the score combinatorial — it is a +function of **ranks** (magnitude discarded) and carries **no global +per-coordinate frequency / IDF term** — also make the encoding +*robust* on the heterogeneous data a real corpus dump contains. The +two failure modes that corrupt learned codebooks and IDF-weighted +schemes have nothing to grab here: + +- **Templated / near-duplicate hubs.** A hub poisons schemes with a + global frequency term by inflating it. RankQuant has no such term, so + injected boilerplate stays far from real queries in rank space and + never enters the top-k. Measured: injecting templated near-duplicate + clusters into a 281k-doc multi-domain union at rising prevalence + moved b=4 R@10 by **−0.002 through 15%** — flat to noise. +- **Mixed chunk lengths.** Documents chunked at every length at once + produce cones of slightly different tightness. Because the per-vector + rank transform discards the magnitude that a tightness shift moves, + the *ranks* — and the score — are unchanged. Measured: a 230k-doc + lake unioning the same documents embedded at four chunk lengths + {128, 256, 512, 1100} moved b=4 R@10 by **+0.002** (candidate recall + stayed 1.0) versus the single-length baseline. + +This is robustness *because* the code is deliberately oblivious, not in +spite of it — the property is a direct corollary of the +constant-composition mechanism above, no tuning involved. + +**Scope (honest).** These are clean embeddings of curated corpora +(fiqa / nq / quora) made messy synthetically — multi-cone unions, +templated hubs, and chunk-length mixtures. They establish that +multi-domain, hub-heavy, and multi-length geometry are benign. They do +**not** model OCR garbage, mixed-language, or broken-encoding text, +which clean embeddings cannot reproduce; that case is untested. The +full pre-registered record (and the matching negative for +oblivious-*direction* structure across five encoders) is in +[`experiments/ordinal-routing-research/`](../experiments/ordinal-routing-research/README.md). + ## Synthetic stress-test numbers This is the clean-checkout stress test — regenerated by the default From 9515198682b0547b8e92179d43e5c96bef9fe13d Mon Sep 17 00:00:00 2001 From: Nelson Spence Date: Fri, 19 Jun 2026 10:27:10 -0500 Subject: [PATCH 16/16] Polish ordinal-routing research hygiene Signed-off-by: Nelson Spence --- .../ordinal-routing-research/centering_recall.rs | 8 ++++++++ .../ordinal-routing-research/density_collapse.rs | 8 ++++++++ .../ordinal-routing-research/fib_address_gate.rs | 8 ++++++++ .../ordinal-routing-research/fib_directions.rs | 8 ++++++++ .../ordinal-routing-research/fib_reduction.rs | 8 ++++++++ .../ordinal-routing-research/overlap_decomp.rs | 8 ++++++++ .../partition_balance.rs | 8 ++++++++ .../ordinal-routing-research/shard_recall.rs | 10 +++++++++- .../ordinal-routing-research/spectral_probe.rs | 8 ++++++++ .../subspace_directions.rs | 8 ++++++++ .../tau_rerank_bakeoff.rs | 8 ++++++++ experiments/ordinal-routing-research/twonn_id.rs | 16 ++++++++++++---- .../ordinal-routing-research/twonn_id_results.md | 2 +- .../ordinal-routing-research/uniformity_lemma.rs | 8 ++++++++ .../withdrawn/spectral_probe_results.md | 2 +- 15 files changed, 111 insertions(+), 7 deletions(-) diff --git a/experiments/ordinal-routing-research/centering_recall.rs b/experiments/ordinal-routing-research/centering_recall.rs index 4d125040..16d94eea 100644 --- a/experiments/ordinal-routing-research/centering_recall.rs +++ b/experiments/ordinal-routing-research/centering_recall.rs @@ -40,6 +40,13 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -48,6 +55,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { let bytes = std::fs::read(path).expect("read npy"); assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -60,6 +67,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -113,6 +120,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -61,6 +68,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -59,6 +66,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -60,6 +67,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { let bytes = std::fs::read(path).expect("read npy"); assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' BucketIndex { /// For one query at a given probe radius `rad` (probe buckets b-rad..=b+rad on /// each projection), return the candidate union and its size (candidates -/// scanned). Recall is measured against `truth_set`. +/// scanned). Recall is measured against the ground-truth ids. fn probe_recall( projs: &[Proj], index: &BucketIndex, diff --git a/experiments/ordinal-routing-research/spectral_probe.rs b/experiments/ordinal-routing-research/spectral_probe.rs index fa654540..0772c18e 100644 --- a/experiments/ordinal-routing-research/spectral_probe.rs +++ b/experiments/ordinal-routing-research/spectral_probe.rs @@ -123,11 +123,19 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { let bytes = std::fs::read(path).expect("read npy"); assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -69,6 +76,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { let bytes = std::fs::read(path).expect("read npy"); assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' (Vec, usize, usize) { let bytes = std::fs::read(path).expect("read npy"); assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' Vec { .par_iter() .filter_map(|&ai| { let a = &corpus[ai * dim..(ai + 1) * dim]; - // cosine distance = 1 - dot (vectors are unit norm). Track the two - // smallest distances to OTHER points. + // Track the two smallest chord distances to OTHER points. let (mut d1, mut d2) = (f64::INFINITY, f64::INFINITY); for bi in 0..n { if bi == ai { diff --git a/experiments/ordinal-routing-research/twonn_id_results.md b/experiments/ordinal-routing-research/twonn_id_results.md index 231bef66..e15fd0bf 100644 --- a/experiments/ordinal-routing-research/twonn_id_results.md +++ b/experiments/ordinal-routing-research/twonn_id_results.md @@ -3,7 +3,7 @@ > linearized CDF (F=(i+1)/N), the biased non-MLE variant; some of the "finite- > sample deflation" attributed to sampling is actually estimator bias. Treat the > low-tens ID as indicative, and prefer the MLE TwoNN estimator before quoting -> exact values. See benchmarks/ADVERSARIAL_REVIEW.md. +> exact values. See ADVERSARIAL_REVIEW.md. # TwoNN intrinsic dimension probe diff --git a/experiments/ordinal-routing-research/uniformity_lemma.rs b/experiments/ordinal-routing-research/uniformity_lemma.rs index e063b240..a969069d 100644 --- a/experiments/ordinal-routing-research/uniformity_lemma.rs +++ b/experiments/ordinal-routing-research/uniformity_lemma.rs @@ -45,6 +45,13 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { "not a numpy file" ); let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } let (hlen, hstart) = if major == 1 { (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) } else { @@ -53,6 +60,7 @@ fn load_npy_f32(path: &str) -> (Vec, usize, usize) { 12, ) }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); assert!(header.contains("'descr': ' The numbers below are RETAINED ONLY as the record of the failure. Fixing needs > the estimator calibrated against a process with known closed-form Σ²(L). The > theory (rigidity_impossibility_proofs.md) does NOT depend on this probe. Full -> account: benchmarks/ADVERSARIAL_REVIEW.md. +> account: ../ADVERSARIAL_REVIEW.md. # Number-variance probe: is a 1-D routing key rigid or Poisson?