diff --git a/Cargo.toml b/Cargo.toml index 8e7ec30c..9f38180d 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -42,6 +42,7 @@ exclude = [ "docs/FOLLOWUP_BODY_KERNEL_TIE_BREAK.md", "docs/INDEX_PROVENANCE.md", "docs/c-api.md", + "experiments/", "fuzz/", "ordvec-ffi/", "ordvec-go/", diff --git a/README.md b/README.md index 41104549..ff81c9ab 100644 --- a/README.md +++ b/README.md @@ -82,6 +82,12 @@ structure of each vector on its own: fit step. Encoding is a per-vector rank (or sign) transform — index the very first vector with no prior data, and never refit when the corpus drifts. +- **Robust by construction on messy corpora.** Because the code is built + from per-vector ranks (magnitude discarded) with no global frequency / IDF + term, the two things that corrupt learned codebooks — near-duplicate hubs + and mixed chunk lengths — have nothing to grab: b=4 R@10 moved −0.002 under + 15% templated-hub injection and +0.002 across a four-chunk-length mixture. + ([details + honest scope](docs/RANK_MODES.md#a-consequence-robust-by-construction-on-messy-corpora)) - **Zero system dependencies.** Pure Rust — no BLAS / LAPACK / `ndarray` / `faer`. Builds and cross-compiles cleanly, including to `aarch64` and `wasm32`. diff --git a/docs/RANK_MODES.md b/docs/RANK_MODES.md index 381ebee2..01b54091 100644 --- a/docs/RANK_MODES.md +++ b/docs/RANK_MODES.md @@ -158,6 +158,43 @@ fraction of the full-scan latency. The `bench_rank` run prints this as its `TwoStage ...` rows with the per-M candidate-recall (`CR`) figure attached. +## A consequence: robust by construction on messy corpora + +The same two properties that make the score combinatorial — it is a +function of **ranks** (magnitude discarded) and carries **no global +per-coordinate frequency / IDF term** — also make the encoding +*robust* on the heterogeneous data a real corpus dump contains. The +two failure modes that corrupt learned codebooks and IDF-weighted +schemes have nothing to grab here: + +- **Templated / near-duplicate hubs.** A hub poisons schemes with a + global frequency term by inflating it. RankQuant has no such term, so + injected boilerplate stays far from real queries in rank space and + never enters the top-k. Measured: injecting templated near-duplicate + clusters into a 281k-doc multi-domain union at rising prevalence + moved b=4 R@10 by **−0.002 through 15%** — flat to noise. +- **Mixed chunk lengths.** Documents chunked at every length at once + produce cones of slightly different tightness. Because the per-vector + rank transform discards the magnitude that a tightness shift moves, + the *ranks* — and the score — are unchanged. Measured: a 230k-doc + lake unioning the same documents embedded at four chunk lengths + {128, 256, 512, 1100} moved b=4 R@10 by **+0.002** (candidate recall + stayed 1.0) versus the single-length baseline. + +This is robustness *because* the code is deliberately oblivious, not in +spite of it — the property is a direct corollary of the +constant-composition mechanism above, no tuning involved. + +**Scope (honest).** These are clean embeddings of curated corpora +(fiqa / nq / quora) made messy synthetically — multi-cone unions, +templated hubs, and chunk-length mixtures. They establish that +multi-domain, hub-heavy, and multi-length geometry are benign. They do +**not** model OCR garbage, mixed-language, or broken-encoding text, +which clean embeddings cannot reproduce; that case is untested. The +full pre-registered record (and the matching negative for +oblivious-*direction* structure across five encoders) is in +[`experiments/ordinal-routing-research/`](../experiments/ordinal-routing-research/README.md). + ## Synthetic stress-test numbers This is the clean-checkout stress test — regenerated by the default diff --git a/experiments/ordinal-routing-research/ADVERSARIAL_REVIEW.md b/experiments/ordinal-routing-research/ADVERSARIAL_REVIEW.md new file mode 100644 index 00000000..ad7cfbde --- /dev/null +++ b/experiments/ordinal-routing-research/ADVERSARIAL_REVIEW.md @@ -0,0 +1,107 @@ +# Adversarial review findings (three independent hostile reviewers) + +This branch was reviewed by three adversarial agents (Rust probes, math proofs, +Lean skeleton). Their findings are recorded here verbatim-in-summary so the PR +carries its own critique. Conclusions are tiered by what SURVIVED review. + +## SOUND (ships as claimed) + +- **CRT seam oracle** (`examples/crt_seam_oracle.rs`): exhaustive finite proof. + Coincidence spacing = lcm; exactly one all-L coincidence per period; the + honest negative result that phases cannot generically rescue a pointwise floor. + All three reviewers independent-agree this file is honest. +- **TwoNN metric fix** (chord vs cosine): the squared-distance bug fix is correct + and validated on sphere controls. + +## CORRECTED IN THIS BRANCH (was wrong/loose, now fixed) + +- **CRT density closed form**: must be `∏ min(2t+1, m_i)/m_i`, NOT `∏(2t+1)/m_i`. + The literal product exceeds 1 at t=3 (period-3 grid saturates) — the SAME error + class the doc attributed to the rejected `(2t)^L/M` form. The Rust code already + caps with `.min(p)`; the markdown and Lean statements are corrected to match. + Valid precondition for the uncapped form: `2t+1 ≤ min_i m_i` (t≤1 here). +- **Lean `crtEquiv`**: false as typed — `ZMod 0 = ℤ` breaks the cardinality + match. Needs `[∀ i, NeZero (m i)]`. Signature corrected; `hcongr` lemma plan + corrected to `Equiv.subtypeEquiv` + `Fintype.card_congr` (the cited + `Equiv.card_filter_map` does not exist). + +## DEMOTED TO EXPLORATORY (known confounds — NOT a settled finding) + +- **"Routing keys are super-Poisson, never rigid"** (`spectral_probe.rs`, + `withdrawn/corpus_zoo_results.md`): **WITHDRAWN — the probe does not measure what it + claims.** Attempted salvage (added `--unfold-smooth K`, a K-knot empirical + unfold) INVERTED the result: under the smooth unfold the isotropic corpus + reads super-Poisson (Σ²/L 1.7→12) and clustered reads LOWER (1.4→6.8) — the + opposite ordering from the Gaussian-unfold version. So the two unfolds + disagree and NEITHER is validated against analytic ground truth. The + Gaussian-unfold "isotropic = clean Poisson 0.99" was not a control passing — + it was the single marginal the wrong unfold happened to fit. The smooth unfold + has its own artifact (knot-scale + interpolation structure). CONCLUSION: the + number-variance empirical finding is UNSUPPORTED in either direction. Fixing it + requires rebuilding the estimator against a case with KNOWN analytic number + variance (e.g. a stationary process with closed-form Σ²(L)) to calibrate the + unfold + window estimator before trusting any corpus reading. This is open + follow-up work, not a patch. The THEORY (rigidity_impossibility_proofs.md) is + unaffected — it does not depend on this probe. +- **"Random offsets redundant / coprime adds nothing"** (`shard_recall.rs`): + **Bug L FIXED.** `build_projs` now seeds direction and phase RNGs separately + and identically across arms, so aligned vs random-offset share the same R + directions and the ablation is clean. Re-run: aligned 0.9095 vs random-offset + 0.9080 (tied) — the "random offsets add nothing" claim now holds on a + controlled comparison. Still-open caveat: the "fair envelope" undersells the + coprime arms (they subdivide buckets and saturate below high budgets), and + coprimality across R directions is the wrong geometry — the within-axis vernier + harness remains unbuilt (theory is in crt_seam_oracle). +- **`gen_corpus` Bug O FIXED.** Corpus and queries now share one geometry + (`A` + prototypes seeded from a dedicated geometry-only RNG keyed by cfg.seed), + so query/corpus latent spaces match and shard_recall ground truth is valid. + +## IMPOSSIBILITY PROOFS — repairable, currently overstated + +- **Theorem 2**: conclusion (no rigidity) holds, but proof text is WRONG as + written — n i.i.d. uniforms are a BINOMIAL process, Σ²(L) = L(1−L/n), not + "Poisson, Σ²=L exactly, independent". Restate with the binomial value; the + Θ(L) conclusion is unaffected. +- **Theorem 3**: correct under its hypothesis but the hypothesis is smuggled — + "any fixed-distribution corpus" must be stated as "conditionally i.i.d. given + latent θ (mixture of i.i.d.)". de Finetti is decorative; it's the law of total + variance. Finite-without-replacement (the escape hatch) is excluded by + ASSUMPTION, not proof. +- **NON-SEQUITUR (must retract)**: "quantile bucketing is optimal against the + entire achievable class / prime-spectral structure provably is not there" does + NOT follow from Σ²(L) ≥ L. Number variance and partition recall/load-balance + are different figures of merit. Narrow the claim to: "the key is not + number-variance-rigid" (true), and drop the optimality-over-all-partitions + claim unless separately proved (likely false). + +## Round 2 (real-embedding pipeline + post-PR code review) + +After the real-embedding work and the PR was opened, a second hostile review +(plus the Gemini/qodo PR bots) hit the new material: + +- **CRITICAL — density-collapse headline was an artifact, now corrected.** The + win-rate climb 0.667→0.930 with top-k was an estimator-variance effect (M2), + and tau was computed on the probe's own coords, coupling it to cosine (M1). + FIXED: tau now uses the per-pair UNION of top coords (de-circularized), and we + report the tau GAP (effect size) with a bootstrap 95% CI instead of win rate. + Result survives but is MODEST and FLAT: gap ≈ 0.04, CI strictly > 0 at every + top-k. The "sharpening / signature of a real effect" claim is RETRACTED; the + small-but-real separation stands. +- **qodo: bucketing bug FIXED.** density_collapse reimplemented bucketing as + `rank/(d/2^bits)` (panics at d/2^bits==0; wrong for non-divisible dims). Now + uses `ordvec::rank::rank_to_bucket` — measures REAL RankQuant behavior. +- **embed_ollama.py hardened:** E2 (silent row misalignment if ollama returns + wrong count) now aborts; E3 (empty corpus) guarded. +- **Reproducibility (E4/E5):** the repo-sentence extraction + embed procedure is + now recorded verbatim in density_collapse_results.md (was unrecorded). +- **G1 overclaim:** body language softened; single-corpus/single-model + generality explicitly NOT claimed. + +## Net + +The mathematically defensible deliverables are: the CRT vernier structure +(oracle + corrected density + Lean skeleton with corrected `crtEquiv` signature), +and the TwoNN metric fix. The empirical rigidity/routing findings are +exploratory with identified confounds and concrete salvage paths. The +impossibility theorems are directionally right but need restatement (binomial +not Poisson; mixture hypothesis explicit; optimality claim retracted). diff --git a/experiments/ordinal-routing-research/README.md b/experiments/ordinal-routing-research/README.md new file mode 100644 index 00000000..0b5305f1 --- /dev/null +++ b/experiments/ordinal-routing-research/README.md @@ -0,0 +1,89 @@ +# Ordinal-routing research — reviewer's guide + +Exploratory investigation into ordvec's **density behavior** and whether +prime/spectral structure can improve training-free routing. Everything lives in +`experiments/ordinal-routing-research/` — the findings/proofs (`*.md`) next to the +probes that produced them (`*.rs`, `embed_ollama.py`) — **no changes to the +`ordvec` crate or its public API.** This directory is `package.exclude`d, so it +ships with the source tree but not the published crate. + +> **Running the probes.** These were developed and run as Cargo examples. The +> `cargo run --release --example ` commands and `examples/…` paths +> throughout these docs refer to that original layout. To reproduce, copy the +> relevant `.rs` from this directory into the crate's `examples/` directory and +> run the command shown (they depend only on the existing `ordvec` / `rand` / +> `rayon` dev-deps). They are kept here as reference source, not as a compiled +> build target. + +Reviewed by three internal adversarial agents plus the PR bots; findings are +tiered below by **what survived scrutiny**. Read the tiers, not every doc. + +## 3-minute path + +1. This file (the tiers + the verdict at the bottom). +2. **[density_collapse_results.md](density_collapse_results.md)** — the mechanism + (real-embedding, with its honest correction), then + **[tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md)** — the + decisive negative: it doesn't beat b=4. +3. **[ADVERSARIAL_REVIEW.md](ADVERSARIAL_REVIEW.md)** — what was challenged, + fixed, retracted, withdrawn. The integrity record. + +## SOUND — proven or real-data confirmed + +| doc | claim | +|-----|-------| +| [density_collapse_results.md](density_collapse_results.md) | **Mechanism.** RankQuant b=2 density collapse = Hamming-near codes the scorer can't separate. Among those lookalikes, true neighbours have lower intra-code Kendall-tau (gap ≈ 0.04, CI > 0). Real but small. | +| [tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md) | **The verdict.** Does that tau signal beat b=4? NO — b=4 wins even at the tau ceiling; tau scores below b=2's own ordering. Signal is real-but-inert; just use b=4. Closes the line: research, not a feature. | +| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) (open PR, `sorry`-free). | +| [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. | +| [oblivious_directions_results.md](oblivious_directions_results.md) | **The directions arc (round 2).** Data-oblivious low-discrepancy directions (golden-angle / Sobol / Kronecker) do NOT beat iid-random for training-free routing — across 5 encoders (nomic, bge-m3, bge-large, snowflake-arctic-v2, harrier-oss) at real intrinsic dim 18–24. CLASS-DEAD, pre-registered, replicated (the one mid-ladder flicker failed to replicate). Centering removes the cone but fails at b=4 (penalty grows with capacity). One robust positive: data-aligned (PCA) directions lead at higher ID — the lever is data-alignment, which training-free forbids. Also **resolves the twonn_id PARTIAL**: real-corpus ID measured at ~18–24 across 5 encoders, and ID is a **corpus** property (repo≈13 vs fiqa≈24, same encoder), not an encoder constant. Probes: `uniformity_lemma.rs`, `overlap_decomp.rs`, `centering_recall.rs`, `subspace_directions.rs`, `partition_balance.rs`, `fib_*.rs`. | +| [length_mixture_lake_results.md](length_mixture_lake_results.md) | **Path B — chunk-length-mixture lake (closes the synthetic-lake arc).** Same fiqa docs embedded at 4 chunk lengths {128,256,512,1100} unioned into a 230k-doc lake; b=4 raw R@10 vs FP32 cosine is **immune** (+0.002, CR@100=1.0). Bonus measurement of the "chunk length is a third geometry axis" claim: real but **small and co-axial** — R̄ spreads only 0.705→0.723 over an 8.6× length range, cone axes ≥0.986 aligned (not the distinct geometries the mixture framing imagined). With Phase B (multi-domain) this leaves every synthetic lake pathology — multi-cone, hub, multi-length — benign for "spend the bits, b=4." Probe: `make_length_lake.py` + `centering_recall.rs`. | + +## THEORY — directionally right, restated honestly + +| doc | status | +|-----|--------| +| [rigidity_impossibility_proofs.md](rigidity_impossibility_proofs.md) | The routing key is not number-variance-rigid (Thm 2/3, binomial `L(1-L/n)`). The over-broad "quantile optimal over all partitions" claim is **retracted** as a non-sequitur. | +| [conjecture_citation_audit.md](conjecture_citation_audit.md) | Citations verified by direct fetch (Ethayarajh, Broughan-Barnett, etc.); a few subagent confabulations caught and corrected. | +| [twonn_id_results.md](twonn_id_results.md) | ⚠️ PARTIAL. The chord-metric fix is sound (sphere-validated to ID ~12); the OLS-through-origin estimator is biased (not MLE), and **no clean real-corpus ID is recorded here**. A low-tens sentence-transformer ID is a hypothesis by cross-domain analogy (Ansuini's low-tens are vision CNNs, not sentence encoders) — not established or measured in this branch. | + +## WITHDRAWN — see [withdrawn/](withdrawn/) + +The number-variance "super-Poisson" finding ([withdrawn/spectral_probe_results.md](withdrawn/spectral_probe_results.md), +[withdrawn/corpus_zoo_results.md](withdrawn/corpus_zoo_results.md)) did not +survive: its unfold is uncalibrated (a salvage attempt inverted the result). The +*theory* above does not depend on it. Kept for the record, not as a claim. + +## Conjecture verdict (the framing question) + +Prime / Li(x) / Sacks-spiral constructions don't help retrieval: they act on the +index (ℕ) and carry no corpus information. The exploitable dense-region structure +lives on the permutohedron `S_D` — the data's own order — which is the +density-collapse result above. Detail across the theory docs + ADVERSARIAL_REVIEW. + +## Reproduce + +Per-doc commands are at the bottom of each file. Real-embedding pipeline (GPU via +ollama) is fully recorded in [density_collapse_results.md](density_collapse_results.md); +external-corpus recipe in [REAL_CORPUS_RUNBOOK.md](REAL_CORPUS_RUNBOOK.md). + +## The deployment question — RESOLVED (negative) + +[tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md): the decisive +matched-bytes experiment was run. **b=4 wins decisively, even at the tau ceiling** +(real embeddings: b4 0.942, b2 0.898, tau-rerank 0.597, fp32-rerank 1.000). The +b=2 candidate pool contains every true neighbour (fp32-rerank=1.0), but the ~0.04 +tau gap is too weak to ORDER them — it scores below b=2's own ordering. The +density-collapse signal is **real but inert**: "just use b=4," no ordvec feature +follows. + +The **deployment-robustness** sub-arc is likewise resolved negative: across every +synthetic lake pathology — multi-domain cones + templated hubs (Phase B) and now a +chunk-length mixture ([length_mixture_lake_results.md](length_mixture_lake_results.md), +Path B) — b=4 raw routing does not degrade (hub Δ −0.002 through 15%; length-mixture +Δ +0.002). The only un-run test left needs *real* dirty data (OCR / multilingual S3 +sludge), uncapturable from clean embeddings. + +This is the honest bottom line of the whole branch: a characterized mechanism and +a clean negative. **Research, not a feature** — the prime/spectral/permutation +ideas for dense-region retrieval do not beat the boring baseline (spend the bits). diff --git a/experiments/ordinal-routing-research/REAL_CORPUS_RUNBOOK.md b/experiments/ordinal-routing-research/REAL_CORPUS_RUNBOOK.md new file mode 100644 index 00000000..84848edf --- /dev/null +++ b/experiments/ordinal-routing-research/REAL_CORPUS_RUNBOOK.md @@ -0,0 +1,48 @@ +# Real-corpus runbook for the conjecture probes + +All four examples now consume the SAME `.npy` format `bench_rank` documents: +2-D little-endian float32 (`random-offset gap that survives reseeding would be the surprise worth +chasing. diff --git a/experiments/ordinal-routing-research/centering_recall.rs b/experiments/ordinal-routing-research/centering_recall.rs new file mode 100644 index 00000000..16d94eea --- /dev/null +++ b/experiments/ordinal-routing-research/centering_recall.rs @@ -0,0 +1,229 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — the DECISIVE gate for the centering finding. +//! overlap_decomp.rs showed per-coord mean-centering removes the cone and +//! amplifies true-neighbour overlap 2-5x. Necessary but NOT sufficient: the only +//! metric that promotes this to an ordvec change is RECALL at matched bytes, +//! and it must survive at b=4 (the incumbent that beat every other idea on this +//! branch). Isolated by request: touches no harness, writes nothing. +//! +//! TEST: full-scan RankQuant-style retrieval, R@10 vs FP32 cosine ground truth. +//! Encode = dimension-wise rank -> bucket into 2^B equal-width bins (B bits/coord). +//! Score = -L1 distance between B-bin codes (symmetric rank metric; identical +//! scoring for both arms, so only the ENCODING differs). +//! Ground truth = FP32 cosine top-10 on L2-normalized RAW vectors (the real task; +//! centering changes the CODE, never the target). +//! Two arms: +//! raw : rank the L2-normalized vector directly. +//! centered : subtract the per-coord CORPUS mean (a stored, data-oblivious +//! statistic — no codebook, no per-query training) before ranking. +//! The SAME mean is applied to queries (deployable asymmetric setup). +//! Also reports the two-stage candidate-recall (CR@M): fraction of FP32 top-10 +//! captured in the top-M by code distance — the bitmap-prefilter-relevant number. +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! FEATURE-WORTHY if centered R@10 - raw R@10 >= 0.02 at b=2 AND >= 0.02 at b=4 +//! (must survive at the incumbent b=4), AND centered never worse than raw by +//! > 0.005 at any B. +//! PARTIAL if it helps at low bits (b<=2) but the b=4 gain is < 0.02 (a +//! low-bit-only tweak, not a headline change). +//! FAIL if centered <= raw at b=4, or net-negative anywhere material. +//! Report actual R@10 and CR@M for both arms regardless. +//! +//! Run: cargo run --release --example centering_recall -- \ +//! --corpus-npy /tmp/repo_corpus.npy --queries-npy /tmp/repo_q.npy + +use rayon::prelude::*; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn coord_mean(corpus: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += corpus[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// Bucketed-rank code: rank each (optionally mean-subtracted) coord, then bucket +/// the rank into 2^bits equal-width bins. One u8 bucket id per coord. +fn encode(rows: &[f32], n: usize, dim: usize, bits: u32, sub: Option<&[f32]>) -> Vec { + let nb = 1u32 << bits; + let mut out = vec![0u8; n * dim]; + out.par_chunks_mut(dim).enumerate().for_each(|(r, code)| { + let mut idx: Vec = (0..dim as u16).collect(); + let val = |c: u16| rows[r * dim + c as usize] - sub.map_or(0.0, |s| s[c as usize]); + idx.sort_by(|&a, &b| val(a).partial_cmp(&val(b)).unwrap()); + for (rank, &c) in idx.iter().enumerate() { + let bucket = (rank as u32 * nb) / dim as u32; + code[c as usize] = bucket as u8; + } + }); + out +} + +fn l1(a: &[u8], b: &[u8]) -> i64 { + a.iter() + .zip(b) + .map(|(x, y)| (*x as i64 - *y as i64).abs()) + .sum() +} + +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +/// Full-scan R@10 and candidate-recall@M for a given arm's codes. +fn evaluate( + ccode: &[u8], + qcode: &[u8], + n: usize, + nq: usize, + dim: usize, + truth: &[Vec], + m: usize, +) -> (f64, f64) { + let res: Vec<(f64, f64)> = (0..nq) + .into_par_iter() + .map(|qi| { + let qc = &qcode[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, i64)> = (0..n) + .map(|di| (di as u32, l1(qc, &ccode[di * dim..(di + 1) * dim]))) + .collect(); + s.sort_by_key(|x| x.1); + let truth_set: std::collections::HashSet = truth[qi].iter().copied().collect(); + // R@10: top-10 by code distance that are in FP32 top-10 + let top10: std::collections::HashSet = s.iter().take(10).map(|x| x.0).collect(); + let r10 = top10.intersection(&truth_set).count() as f64 / truth[qi].len() as f64; + // CR@M: FP32 top-10 captured within top-M by code distance + let topm: std::collections::HashSet = s.iter().take(m).map(|x| x.0).collect(); + let cr = truth[qi].iter().filter(|t| topm.contains(t)).count() as f64 + / truth[qi].len() as f64; + (r10, cr) + }) + .collect(); + let r10 = res.iter().map(|x| x.0).sum::() / nq as f64; + let cr = res.iter().map(|x| x.1).sum::() / nq as f64; + (r10, cr) +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut m_arg = 100usize; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + "--m" => m_arg = a.next().unwrap().parse().unwrap(), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + let truth = ground_truth(&corpus, &queries, dim, 10); + let mean = coord_mean(&corpus, n, dim); + let m = m_arg; // CR@M operating point (default 100, --m to override) + + println!("# centering_recall (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}, CR@M M={m}"); + println!("# R@10 vs FP32 cosine top-10; symmetric -L1 on B-bin rank codes; only ENCODING differs between arms."); + println!("B\tbytes/vec\traw_R@10\tcen_R@10\tΔR@10\traw_CR\tcen_CR\tΔCR"); + let mut deltas = vec![]; + for &bits in &[1u32, 2, 4] { + let bytes = dim * bits as usize / 8; + let craw = encode(&corpus, n, dim, bits, None); + let qraw = encode(&queries, nq, dim, bits, None); + let ccen = encode(&corpus, n, dim, bits, Some(&mean)); + let qcen = encode(&queries, nq, dim, bits, Some(&mean)); + let (r_raw, cr_raw) = evaluate(&craw, &qraw, n, nq, dim, &truth, m); + let (r_cen, cr_cen) = evaluate(&ccen, &qcen, n, nq, dim, &truth, m); + let d_r = r_cen - r_raw; + let d_cr = cr_cen - cr_raw; + deltas.push((bits, d_r)); + println!("{bits}\t{bytes}\t{r_raw:.4}\t{r_cen:.4}\t{d_r:+.4}\t{cr_raw:.4}\t{cr_cen:.4}\t{d_cr:+.4}"); + } + println!("\n# PRE-REGISTERED: FEATURE-WORTHY if ΔR@10 >= +0.02 at b=2 AND b=4, and never worse than -0.005."); + let b2 = deltas.iter().find(|x| x.0 == 2).unwrap().1; + let b4 = deltas.iter().find(|x| x.0 == 4).unwrap().1; + let worst = deltas.iter().map(|x| x.1).fold(f64::INFINITY, f64::min); + let verdict = if b2 >= 0.02 && b4 >= 0.02 && worst >= -0.005 { + "FEATURE-WORTHY" + } else if b4 < 0.02 && deltas.iter().any(|x| x.1 >= 0.02) { + "PARTIAL (low-bit only; does not survive at incumbent b=4)" + } else { + "FAIL" + }; + println!("# VERDICT: {verdict} (Δb2={b2:+.4}, Δb4={b4:+.4}, worst={worst:+.4})"); +} diff --git a/experiments/ordinal-routing-research/conjecture_citation_audit.md b/experiments/ordinal-routing-research/conjecture_citation_audit.md new file mode 100644 index 00000000..7fd0161c --- /dev/null +++ b/experiments/ordinal-routing-research/conjecture_citation_audit.md @@ -0,0 +1,41 @@ +# Citation audit — prime/RH/embedding-geometry conjecture + +Verified by direct WebFetch against primary sources (subagent sandbox had no web +egress; their "VERIFIED" tags were memory passes and are NOT relied on here). +WebSearch was backend-down; all checks below are WebFetch on known URLs. + +## VERIFIED ✓ (fetched primary source) + +| Claim | Verified value | Source | +|---|---|---| +| GPT-2 final-layer anisotropy | ≈0.99 ("almost perfect cosine similarity"); ~0.6 in layers 2–8, rising exponentially layers 8–12 | Ethayarajh 2019, arXiv:1909.00512 (via ar5iv) | +| Rogue dimensions dominate cosine | "a small number of rogue dimensions, often just 1–3, dominate these measures" (exact quote) | Timkey & van Schijndel 2021, arXiv:2109.04404 | +| Super-prime counting function | π_q(x) ~ x/(log x)²; q_n ~ n(log n)² | Broughan & Barnett 2009, J. Integer Seq. vol 12 | +| Mathlib CRT primitive | `ZMod.chineseRemainder : ZMod (m*n) ≃+* ZMod m × ZMod n` (exact signature) | mathlib4_docs ZMod/Basic | +| Discrepancy rates | uniform grid O(1/N); optimal LDS O(log N/N); random/Poisson O(√(log log N/N)) | Wikipedia Low-discrepancy_sequence | + +## PARTIAL / QUALITATIVE (source confirms direction, not exact figure) + +| Claim | Status | +|---|---| +| Mu & Viswanath D≈d/100 rule | Principle confirmed ("common mean + a few top directions"); the exact d/100 ratio NOT stated in fetched text. Cite as "a few top components", not d/100. arXiv:1702.01417 | +| Ansuini intrinsic dimension | NUMBERS CONFIRMED (ar5iv): last hidden layers ID ≈ 12–25; peak (early layers, rel.depth 0.2–0.4) ≈ 100–120; ID/ambient ≈ 2e-4. Hunchback = rise then contract. arXiv:1905.12784. NOTE: vision/CNN nets, not sentence encoders. | +| Valeriani et al. 2023 transformer ID | Qualitative confirmed: same expand–contract–stabilize ID profile in transformers (protein LM + image). Exact peak number not extracted. arXiv:2302.00294 | +| Routing implication | d_int ≈ low tens ⇒ JL R ≈ c·d_int ⇒ R∈{8,16} projections suffice (matches shard_recall envelope still rising at R=16). Projection budget sized by ~20, not ambient 256/1024. | + +## CORRECTED (subagent confabulations caught) + +- k-fold prime-cascade ~1/(log x)^k: **HEURISTIC, not proved.** Subagent falsely + cited Goldston–Yıldırım (that paper is about small prime gaps — unrelated). + Only the single iteration (Broughan&Barnett) is proved. +- Super-prime count "x/(ln x · ln ln x)": **wrong form**; paper proves x/(log x)². +- Prime-sequence discrepancy "O(n^-1/2)" (my own earlier claim): **unsupported**; + no clean published rate for scaled primes. Verdict holds via uniform/quantile + dominance regardless. + +## STILL UNVERIFIED (needs working WebSearch or specific PDF) + +- Exact intrinsic-dimension number for modern sentence embeddings + (sentence-transformers / E5 / text-embedding-3). No confident published TwoNN + figure located. This is the one empirical gap for the Level-2 anisotropy argument. +- Cai et al. 2021 "local isotropy" arXiv id unconfirmed. diff --git a/experiments/ordinal-routing-research/crt_seam_oracle.rs b/experiments/ordinal-routing-research/crt_seam_oracle.rs new file mode 100644 index 00000000..6832a66b --- /dev/null +++ b/experiments/ordinal-routing-research/crt_seam_oracle.rs @@ -0,0 +1,164 @@ +//! CRT seam oracle: exhaustive finite verification of the vernier theorem. +//! +//! This is the SPEC for the within-axis vernier experiment. The CRT claim is +//! a finite statement over the ring Z/M (M = product of periods), so we +//! enumerate every position and CHECK the predicted structure exactly. +//! Exhaustive enumeration over the finite ring is a proof for these parameters +//! — the same thing a Lean theorem would assert. We surface the phase subtlety +//! (the naive zero-offset statement is FALSE at the origin) before any recall +//! experiment is built on top. +//! +//! Setup: L grids on ONE axis, grid i has period m_i and phase phi_i. A +//! position x is a "seam" of grid i if (x - phi_i) mod m_i == 0. The vernier +//! claim concerns how often seams of different grids COINCIDE. +//! +//! Checks: +//! 1. coincidence spacing: two coprime grids (phi=0) share a seam exactly +//! every lcm(m_i,m_j) = m_i*m_j. NON-coprime share every lcm < product. +//! 2. all-L coincidence count over one period M: with phi=0 there is exactly +//! one (at x=0); the claim "one per period" is about COUNT, and it holds. +//! 3. near-seam blind-spot density: fraction of x within band |.|<=t of a +//! seam on ALL L grids simultaneously ~= (2t)^L / M (the headline). +//! 4. the phase refutation: with phi=0 every grid has a seam at x=0, so the +//! pointwise "max-over-grids distance-to-seam >= c" floor is FALSE there. +//! +//! Run: cargo run --release --example crt_seam_oracle + +fn gcd(mut a: u64, mut b: u64) -> u64 { + while b != 0 { + let t = b; + b = a % b; + a = t; + } + a +} +fn lcm(a: u64, b: u64) -> u64 { + a / gcd(a, b) * b +} + +/// Is x a seam of a grid with period m and phase phi (integer phase)? +/// Seam <=> (x - phi) ≡ 0 (mod m). +fn is_seam(x: i64, m: u64, phi: i64) -> bool { + let m = m as i64; + (((x - phi) % m) + m) % m == 0 +} + +/// Distance (in integer steps, on the ring) from x to the nearest seam of +/// grid (m, phi): min over k of |x - phi - k*m|, i.e. the centered residue. +fn dist_to_seam(x: i64, m: u64, phi: i64) -> u64 { + let m = m as i64; + let r = (((x - phi) % m) + m) % m; // in [0, m) + r.min(m - r) as u64 +} + +fn main() { + println!("# CRT seam oracle — exhaustive finite verification"); + + // ---- Check 1: pairwise coincidence spacing ---- + println!("\n## Check 1: first common seam of two grids (phi=0)"); + println!("m_i\tm_j\tcoprime\tfirst_common\tlcm(=expected)"); + for (mi, mj) in [(3u64, 5u64), (4, 6), (5, 7), (6, 9), (8, 12)] { + let mut first = None; + for x in 1..=(mi * mj) as i64 { + if is_seam(x, mi, 0) && is_seam(x, mj, 0) { + first = Some(x); + break; + } + } + let cop = gcd(mi, mj) == 1; + println!( + "{mi}\t{mj}\t{cop}\t{}\t{}", + first.unwrap(), + lcm(mi, mj) + ); + } + + // remaining checks appended in next pieces + checks_2_3_4(); +} + +fn checks_2_3_4() { + // L coprime grids on one axis. + let periods: [u64; 4] = [3, 5, 7, 11]; + let m: u64 = periods.iter().product(); // M = 1155 + let l = periods.len(); + + // ---- Check 2: all-L coincidence count over one period, phi=0 ---- + // Claim: exactly ONE position in [0, M) is a seam of all L grids. + let mut all_seam_zero = 0u64; + for x in 0..m as i64 { + if periods.iter().all(|&p| is_seam(x, p, 0)) { + all_seam_zero += 1; + } + } + println!("\n## Check 2: all-{l} coincidences in [0,M), phi=0 (M={m})"); + println!("count = {all_seam_zero} (claim: exactly 1, at x=0)"); + + // ---- Check 3: near-seam blind-spot density vs (2t)^L / M ---- + // A position is in the joint blind spot if it is within band |.|<=t of a + // seam on ALL L grids. Predicted density = product over grids of + // (per-grid near-seam density) = ((2t+1)/m_i) ... but the clean continuous + // form is (2t)^L / M. We report measured vs both the integer band count + // ((2t+1) residues per grid) and the continuous (2t)^L/M headline. + println!("\n## Check 3: joint near-seam density vs prediction (phi=0)"); + println!("t\tmeasured\t(2t+1)^L/M_int\t(2t)^L/M_cont"); + for t in [0u64, 1, 2, 3] { + let mut hits = 0u64; + for x in 0..m as i64 { + if periods.iter().all(|&p| dist_to_seam(x, p, 0) <= t) { + hits += 1; + } + } + let measured = hits as f64 / m as f64; + // integer band: each grid admits (2t+1) residues (centered), but capped + // at m_i; product of per-grid fractions = prod((2t+1)/m_i). + let int_pred: f64 = periods + .iter() + .map(|&p| ((2 * t + 1).min(p) as f64) / p as f64) + .product(); + let cont_pred = (2 * t).pow(l as u32) as f64 / m as f64; + println!("{t}\t{measured:.6}\t{int_pred:.6}\t{cont_pred:.6}"); + } + + // ---- Check 4: the phase refutation ---- + // With phi=0, x=0 is a seam on every grid -> max-over-grids dist = 0, so a + // pointwise floor "some grid keeps you >= c from its seam" is FALSE there. + // With chosen distinct phases, the all-seam coincidence can be MOVED but, + // by CRT, still occurs exactly once per period (we just relocate it). + let phis_zero = [0i64; 4]; + let worst_zero = (0..m as i64) + .map(|x| { + periods + .iter() + .zip(phis_zero.iter()) + .map(|(&p, &ph)| dist_to_seam(x, p, ph)) + .max() + .unwrap() + }) + .min() + .unwrap(); + // staggered phases: spread the seams + let phis_stag = [0i64, 1, 2, 3]; + let worst_stag = (0..m as i64) + .map(|x| { + periods + .iter() + .zip(phis_stag.iter()) + .map(|(&p, &ph)| dist_to_seam(x, p, ph)) + .max() + .unwrap() + }) + .min() + .unwrap(); + println!("\n## Check 4: pointwise max-over-grids dist-to-seam (the floor)"); + println!("min_x max_i dist (phi=0) = {worst_zero} (=> 0: floor is FALSE at origin)"); + println!("min_x max_i dist (phi staggered) = {worst_stag} (phases can lift the floor)"); + + // verdict line for downstream + println!("\n## Verdict"); + println!( + "coincidence_count_per_period={all_seam_zero} (=1 ✓); \ + floor_phi0={worst_zero} (=0, naive claim false ✓); \ + floor_staggered={worst_stag} (phase-dependent)" + ); +} diff --git a/experiments/ordinal-routing-research/crt_seam_oracle_results.md b/experiments/ordinal-routing-research/crt_seam_oracle_results.md new file mode 100644 index 00000000..b6293dbd --- /dev/null +++ b/experiments/ordinal-routing-research/crt_seam_oracle_results.md @@ -0,0 +1,87 @@ +# CRT seam oracle — corrected vernier theorem (exhaustive finite proof) + +> Lean 4 formalization of this theorem lives in the companion repo: +> [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) +> (open PR, `sorry`-free). + +`examples/crt_seam_oracle.rs` enumerates the full ring Z/M to verify the +within-axis vernier (coprime-grid) theorem exactly. This is the SPEC for the +vernier recall experiment and the corrected target for any Lean formalization. +Exhaustive enumeration over the finite ring IS the proof for these parameters. + +Params: L=4 grids on one axis, coprime periods {3,5,7,11}, M=1155. + +## CONFIRMED exactly + +1. **Coincidence spacing = lcm.** First common seam of grids (m_i,m_j) at phi=0 + is exactly lcm(m_i,m_j). Coprime ⇒ lcm = m_i·m_j (maximal spacing: 3,5→15; + 5,7→35). Non-coprime coincide early (4,6→12; 6,9→18). This is WHY coprimality + maximizes seam separation. +2. **Exactly one all-L coincidence per period M.** count = 1 over [0,M). ✓ + +## CORRECTED (the closed form is the CAPPED product) + +3. **Joint near-seam density is ∏_i min(2t+1, m_i)/m_i.** The uncapped + ∏(2t+1)/m_i is correct ONLY in the interior `2t+1 ≤ min_i m_i` (here t≤1); + beyond that, bands saturate a grid and that grid's factor caps at 1. The + uncapped form exceeds 1 at t=3 (period-3 grid: 7/3 > 1) — the SAME nonsense + as the rejected continuous (2t)^L/M. (Caught in adversarial review; the Rust + code already computes the capped product via `.min(p)`.) + + | t | measured | ∏min(2t+1,m_i)/m_i (capped) | ∏(2t+1)/m_i (uncapped) | (2t)^L/M | + |---|----------|------------------------------|------------------------|----------| + | 0 | 0.000866 | 0.000866 | 0.000866 | 0.000000 | + | 1 | 0.070130 | 0.070130 | 0.070130 | 0.013853 | + | 2 | 0.324675 | 0.324675 | 1.039 (>1, wrong) | 0.221645 | + | 3 | 0.636364 | 0.636364 | 2.078 (>1, wrong) | 1.122 | + + Measured matches the CAPPED product to 6 d.p. at all t. The uncapped product + is valid only for t≤1 (where 2t+1 ≤ min m_i = 3); it already exceeds 1 at t=2. + The provable/Lean statement is the capped form, OR the uncapped form under the + explicit precondition `2t+1 ≤ min_i m_i`. + +## REFUTED (stronger than the earlier caveat) + +4. **No generic pointwise seam-margin floor — phases cannot rescue it.** + min_x max_i dist_to_seam = 0 with phi=0 AND with staggered phases [0,1,2,3]. + For L=4 with these small periods, some position always sits on every grid's + seam regardless of phase. The "choose phases for a pointwise floor >= c" + claim FAILS in this regime; the floor is not generic, it exists only for + specific (periods, phases, L) and must be verified per-instance, not assumed. + +## Proof of the density closed form (was: numerically confirmed only) + +Claim: joint near-seam density = ∏_i min(2t+1, m_i)/m_i for pairwise-coprime +m_i. Equivalently = ∏_i (2t+1)/m_i under the precondition 2t+1 ≤ min_i m_i. + +Proof. CRT: x ↦ (x mod m_1, …, x mod m_L) is a BIJECTION Z/M → ∏ Z/m_i (M=∏m_i, +pairwise coprime). Position x is within band t of a seam on grid i iff +(x − φ_i) mod m_i lies in the centered band {−t,…,+t} mod m_i. The number of +DISTINCT such residues is min(2t+1, m_i): exactly 2t+1 when the band fits +(2t+1 ≤ m_i), capping at m_i once it saturates (the band wraps to cover all +residues). The wraparound/min in the distance condition is INTRA-coordinate +(depends only on x mod m_i), so it does not couple grids — CRT independence +across coordinates is untouched. The joint count therefore factors: + + count = ∏_i min(2t+1, m_i), density = count/M = ∏_i min(2t+1,m_i)/m_i. ∎ + +(Even m_i have an extra antipodal-residue caveat at the exact half-period; the +chosen odd periods avoid it.) The independence is precisely what CRT supplies; +without coprimality the residue coordinates are coupled and the product fails +(matches Check 1: non-coprime grids coincide early). This is the provable form — +NOT the uncapped ∏(2t+1)/m_i (>1 for t≥2 here) nor the continuous (2t)^L/M. + +## Consequence for the workplan + +- The provable theorem is the DENSITY/COUNT statement (checks 1–3, CAPPED product + ∏min(2t+1,m_i)/m_i), not a pointwise floor. A Lean proof should target the + capped form (or the uncapped form under explicit 2t+1 ≤ min m_i) and + "one coincidence per period", and should NOT claim a phase-tunable margin. +- The vernier RECALL experiment must therefore test whether the (small, + bounded) joint blind-spot region actually costs recall — it cannot lean on a + guaranteed per-query margin, because there isn't one. +- This is the ambush avoided: building the recall test on a phase-tunable-floor + premise, then having Lean fail to prove it, would have invalidated the + experiment after the fact. Caught by finite enumeration first. + +Reproduce: `cargo run --release --example crt_seam_oracle` diff --git a/experiments/ordinal-routing-research/density_collapse.rs b/experiments/ordinal-routing-research/density_collapse.rs new file mode 100644 index 00000000..735e660f --- /dev/null +++ b/experiments/ordinal-routing-research/density_collapse.rs @@ -0,0 +1,385 @@ +//! Density-collapse probe: what does ordvec lose in dense regions, and is the +//! lost signal recoverable from intra-bucket permutation order? +//! +//! Mechanism under test. RankQuant b=2 encodes each vector as which QUARTILE +//! each coordinate's rank falls in. Two vectors that share top-bucket membership +//! COLLIDE — b=2 cannot tell them apart. This is worst in tight semantic +//! clusters (near-parallel vectors share their largest dims → same top ranks). +//! +//! But the FULL rank vector (a permutation of 0..D) still orders the +//! coordinates WITHIN the top bucket; b=2 discards that. The question: +//! +//! Within a collided bucket, do TRUE FP32 neighbors differ in Kendall-tau of +//! their top-k coordinate order LESS than random pairs in the same bucket? +//! +//! If yes, intra-bucket order carries the signal b=2 threw away, and a +//! tau-rerank recovers it WITHOUT new storage (the order is in the Rank code). +//! Ground truth = FP32 cosine, so this cannot miscalibrate (unlike the +//! number-variance probe). +//! +//! Run: cargo run --release --example density_collapse +//! cargo run --release --example density_collapse -- --noise 0.15 (tighter) +//! No external data, no BLAS. Uses ordvec::rank::rank_transform. + +use ordvec::rank::{rank_to_bucket, rank_transform}; +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + latent: usize, + clusters: usize, + noise: f32, // cluster tightness: smaller = denser = more collapse + bits: u8, // bucket bits (2 = quartiles); ordvec RankQuant uses {1,2,4} + topk: usize, // # top coords used for the intra-bucket tau test + corpus_npy: Option, // real embeddings; overrides synthetic + dim/n +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 30_000, + latent: 64, + clusters: 200, + noise: 0.3, + bits: 2, + topk: 16, + corpus_npy: None, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent = a.next().unwrap().parse().unwrap(), + "--clusters" => c.clusters = a.next().unwrap().parse().unwrap(), + "--noise" => c.noise = a.next().unwrap().parse().unwrap(), + "--bits" => c.bits = a.next().unwrap().parse().unwrap(), + "--topk" => c.topk = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Cosine of two unit vectors (corpus is L2-normalized). +fn cos(a: &[f32], b: &[f32]) -> f32 { + a.iter().zip(b).map(|(x, y)| x * y).sum() +} + +/// Kendall-tau DISTANCE between the orderings the two vectors induce on a fixed +/// coordinate set `coords` (fraction of discordant pairs, in [0,1]). Uses the +/// raw values to decide order on each coordinate pair. 0 = identical order. +fn kendall_tau(a: &[f32], b: &[f32], coords: &[usize]) -> f32 { + let m = coords.len(); + if m < 2 { + return 0.0; + } + let mut disc = 0usize; + let mut tot = 0usize; + for x in 0..m { + for y in (x + 1)..m { + let (cx, cy) = (coords[x], coords[y]); + let sa = (a[cx] - a[cy]).signum(); + let sb = (b[cx] - b[cy]).signum(); + if sa != sb { + disc += 1; + } + tot += 1; + } + } + disc as f32 / tot.max(1) as f32 +} + +/// Indices of the top-k coordinates of a vector by value (descending). +fn top_coords(v: &[f32], k: usize) -> Vec { + let mut idx: Vec = (0..v.len()).collect(); + idx.sort_by(|&i, &j| v[j].partial_cmp(&v[i]).unwrap_or(std::cmp::Ordering::Equal)); + idx.truncate(k); + idx +} + +/// Hamming distance between two equal-length bucket codes (# coords that +/// landed in a different bucket). This is what the b=2 SCORER conflates: small +/// Hamming distance = codes the kernel can barely separate = real collapse. +fn hamming(a: &[u8], b: &[u8]) -> u32 { + a.iter().zip(b).filter(|(x, y)| x != y).count() as u32 +} + +/// Report the code-distance structure: exact collisions are ~0 on continuous +/// data (constant-composition codes are length-D sequences), so the real +/// "collapse" is NEAR-collision — how many docs sit within small Hamming +/// distance of a probe. Sampled over probes to stay O(n * samples). +fn collision_report(cfg: &Cfg, codes: &[Vec]) { + let n = cfg.n; + let n_probes = 200.min(n); + let stride = (n / n_probes).max(1); + let probes: Vec = (0..n).step_by(stride).take(n_probes).collect(); + // for each probe, min Hamming distance to any other doc + count within 2*min + let stats: Vec<(u32, usize)> = probes + .par_iter() + .map(|&p| { + let cp = &codes[p]; + let mut min_h = u32::MAX; + for j in 0..n { + if j == p { + continue; + } + let h = hamming(cp, &codes[j]); + if h < min_h { + min_h = h; + } + } + let thresh = (min_h + min_h / 2).max(min_h + 1); + let near = (0..n) + .filter(|&j| j != p && hamming(cp, &codes[j]) <= thresh) + .count(); + (min_h, near) + }) + .collect(); + let mean_min = stats.iter().map(|s| s.0 as f64).sum::() / stats.len() as f64; + let mean_near = stats.iter().map(|s| s.1 as f64).sum::() / stats.len() as f64; + println!("# density_collapse: n={} dim={} bits={} noise={} topk={} (smaller noise=denser)", + cfg.n, cfg.dim, cfg.bits, cfg.noise, cfg.topk); + println!("## Code near-collision structure (b={} bucket code, length {})", cfg.bits, cfg.dim); + println!("mean nearest-code Hamming dist = {mean_min:.1} of {} coords", cfg.dim); + println!("mean #docs within ~1.5x that distance = {mean_near:.1}"); + println!("(small Hamming = codes the b={} scorer can barely separate = collapse)", cfg.bits); +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays. +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + // L2-normalize rows (cosine geometry) + for i in 0..n { + let row = &mut out[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } + (out, n, dim) +} + +fn main() { + let mut cfg = parse(); + let corpus = if let Some(path) = cfg.corpus_npy.clone() { + let (v, n, dim) = load_npy_f32(&path); + eprintln!("# loaded REAL corpus {n} x {dim} from {path}"); + cfg.n = n; + cfg.dim = dim; + v + } else { + make_corpus(&cfg) + }; + let d = cfg.dim; + + // Per-coordinate b-bit bucket id, using ordvec's OWN bucketing + // (`rank_to_bucket`) so the probe measures real RankQuant behavior — it is + // guarded (d>0, bits<=7, rank> = (0..cfg.n) + .into_par_iter() + .map(|i| { + let ranks = rank_transform(&corpus[i * d..(i + 1) * d]); + ranks.iter().map(|&r| rank_to_bucket(r, d, cfg.bits)).collect() + }) + .collect(); + + collision_report(&cfg, &codes); + tau_report(&cfg, &corpus, &codes); +} + +/// THE TEST. For each probe, gather the M docs whose b=2 codes are +/// Hamming-CLOSEST (the "lookalikes" the scorer conflates with the probe). +/// Split them into the FP32-true neighbours (top half by cosine) and the +/// FP32-far lookalikes (bottom half). Question: does intra-bucket top-k +/// Kendall-tau SEPARATE these two groups — i.e. do the true neighbours have +/// lower tau than the false lookalikes that the b=2 code can't tell apart? +/// If yes, the fine permutation order recovers exactly the signal b=2 lost, +/// from the existing Rank code, no new storage. FP32 cosine = ground truth. +fn tau_report(cfg: &Cfg, corpus: &[f32], codes: &[Vec]) { + let d = cfg.dim; + let n = cfg.n; + let m = 40usize; // size of the b2-lookalike neighbourhood per probe + let n_probes = 300.min(n); + let stride = (n / n_probes).max(1); + let probes: Vec = (0..n).step_by(stride).take(n_probes).collect(); + + // Per probe: (mean tau of true-neighbour half, mean tau of far-lookalike half, + // mean cosine of each half) — averaged over probes. + let rows: Vec<(f32, f32, f32, f32)> = probes + .par_iter() + .map(|&p| { + let pv = &corpus[p * d..(p + 1) * d]; + let cp = &codes[p]; + // M code-nearest docs by Hamming + let mut by_h: Vec<(u32, usize)> = (0..n) + .filter(|&j| j != p) + .map(|j| (hamming(cp, &codes[j]), j)) + .collect(); + by_h.sort_by_key(|x| x.0); + by_h.truncate(m); + // among those lookalikes, rank by TRUE cosine to the probe + let mut by_cos: Vec<(f32, usize)> = by_h + .iter() + .map(|&(_, j)| (cos(pv, &corpus[j * d..(j + 1) * d]), j)) + .collect(); + by_cos.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal)); + let half = by_cos.len() / 2; + // M1 fix (review): tau coords must NOT be the probe's own top-k + // (that couples tau to cosine, making the test near-tautological). + // Use the UNION of each PAIR's own top coords — chosen independently + // of the cosine ranking — so tau measures order agreement on the + // dims the two vectors jointly care about, not the probe's alone. + let tau_of = |slice: &[(f32, usize)]| -> (f32, f32) { + let mut st = 0.0f32; + let mut sc = 0.0f32; + for &(c, j) in slice { + let jv = &corpus[j * d..(j + 1) * d]; + let mut coords = top_coords(pv, cfg.topk); + coords.extend(top_coords(jv, cfg.topk)); + coords.sort_unstable(); + coords.dedup(); + st += kendall_tau(pv, jv, &coords); + sc += c; + } + let k = slice.len().max(1) as f32; + (st / k, sc / k) + }; + let (tau_near, cos_near) = tau_of(&by_cos[..half]); + let (tau_far, cos_far) = tau_of(&by_cos[half..]); + (tau_near, tau_far, cos_near, cos_far) + }) + .collect(); + + let mean = |f: &dyn Fn(&(f32, f32, f32, f32)) -> f32| { + rows.iter().map(f).sum::() / rows.len().max(1) as f32 + }; + let tau_near = mean(&|r| r.0); + let tau_far = mean(&|r| r.1); + let cos_near = mean(&|r| r.2); + let cos_far = mean(&|r| r.3); + // Per-probe tau GAP (far - near): positive => true neighbours have lower + // tau. Report its mean with a bootstrap 95% CI over probes. The gap (an + // EFFECT SIZE) is the honest statistic — NOT the win rate, whose climb with + // top-k is just estimator-variance reduction, not a sharpening effect (M2). + let gaps: Vec = rows.iter().map(|r| (r.1 - r.0) as f64).collect(); + let gap_mean = gaps.iter().sum::() / gaps.len().max(1) as f64; + let (lo, hi) = bootstrap_ci(&gaps, 2000); + println!("\n## Intra-code Kendall-tau test (top-{} coords PER-PAIR union, M={m}, {} probes)", + cfg.topk, rows.len()); + println!("among b2-lookalikes: mean cosine mean tau-distance"); + println!(" FP32-TRUE neighbours {cos_near:.4} {tau_near:.4}"); + println!(" FP32-FAR lookalikes {cos_far:.4} {tau_far:.4}"); + println!("tau GAP (far - near) = {gap_mean:.4} 95% CI [{lo:.4}, {hi:.4}]"); + let verdict = if lo > 0.0 { + "SIGNAL: gap CI strictly > 0 — true neighbours have lower tau (effect, not just win-rate)" + } else if hi < 0.0 { + "INVERTED: gap CI strictly < 0" + } else { + "NO SIGNAL: gap CI spans 0 — cannot distinguish from noise" + }; + println!("verdict: {verdict}"); +} + +/// Percentile bootstrap 95% CI for the mean of `xs`, `b` resamples. +/// Deterministic RNG (no Date/rand-of-clock) so the CI is reproducible. +fn bootstrap_ci(xs: &[f64], b: usize) -> (f64, f64) { + if xs.is_empty() { + return (0.0, 0.0); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xB007); + let n = xs.len(); + let mut means: Vec = (0..b) + .map(|_| { + let mut s = 0.0; + for _ in 0..n { + s += xs[rng.random_range(0..n)]; + } + s / n as f64 + }) + .collect(); + means.sort_by(|a, c| a.partial_cmp(c).unwrap_or(std::cmp::Ordering::Equal)); + (means[(0.025 * b as f64) as usize], means[((0.975 * b as f64) as usize).min(b - 1)]) +} + +/// Low-rank clustered corpus; `noise` is the density dial. +fn make_corpus(cfg: &Cfg) -> Vec { + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let (d, l) = (cfg.dim, cfg.latent); + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let c = rng.random_range(0..cfg.clusters); + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = protos[c * l + j] + cfg.noise * gauss(&mut rng); + } + let row = &mut corpus[i * d..(i + 1) * d]; + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[ii * l + j] * z[j]; + } + row[ii] = acc; + } + let nrm: f32 = row.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in row.iter_mut() { + *v /= nrm; + } + } + } + corpus +} diff --git a/experiments/ordinal-routing-research/density_collapse_results.md b/experiments/ordinal-routing-research/density_collapse_results.md new file mode 100644 index 00000000..e1801a26 --- /dev/null +++ b/experiments/ordinal-routing-research/density_collapse_results.md @@ -0,0 +1,142 @@ +# Density collapse in ordvec: mechanism and a recoverable signal + +What "density collapse" actually is in RankQuant, and whether the lost signal is +recoverable. Source: `examples/density_collapse.rs`. Ground truth = FP32 cosine, +so this probe cannot miscalibrate (unlike the withdrawn number-variance probe). + +## Mechanism (corrected mid-investigation) + +First model was wrong: I expected b=2 codes to COLLIDE (identical) in dense +regions. They don't — the rank transform is a permutation of 0..D, so a b=2 code +is a length-D sequence of bucket ids; two docs collide only if all D coordinates +bucket identically (astronomically rare on continuous data; measured collision +rate ≈ 0 even at noise=0.08). + +The real mechanism is NEAR-collision: docs whose b=2 codes are HAMMING-CLOSE +(differ in few coordinate-buckets) are what the scorer cannot separate. +- noise=0.30: mean nearest-code Hamming = 59.9 / 256; ~123 docs within ~1.5×. +- noise=0.10: mean nearest-code Hamming = 18.4 / 256; ~58 docs within ~1.5×. +Denser clusters → tighter Hamming neighbourhoods → more docs the b=2 kernel +conflates. THAT is collapse: not identical codes, but codes too close for the +2-bit resolution to rank correctly. + +## The recoverable signal (positive result) + +Within each probe's b=2 "lookalike" set (M=40 Hamming-nearest), split by TRUE +FP32 cosine into the true-neighbour half vs the far-lookalike half. Compare the +top-k (k=16) Kendall-tau distance of the probe's coordinate ORDER to each: + +| density | true-nbr cosine | true-nbr tau | far cosine | far tau | tau gap (far − near) | 95% CI | +|---------|-----------------|--------------|------------|---------|----------------------|--------| +| noise=0.30 | 0.9397 | 0.3345 | 0.9250 | 0.3656 | 0.0311 | [0.0291, 0.0331] | +| noise=0.10 | 0.9926 | 0.1423 | 0.9905 | 0.1602 | 0.0179 | [0.0166, 0.0191] | + +**The FP32-true neighbours have systematically LOWER intra-code Kendall-tau than +the b2-lookalikes the code conflates them with — gap ≈ 0.018–0.031, 95% CI +strictly above 0 at both densities** (reported as the tau gap, not the retracted +win-rate — see the M1/M2 correction below). The fine permutation order (which +coordinate outranks which, WITHIN the top bucket) separates true neighbours from +false lookalikes. + +## Why this matters + +That fine order is ALREADY in the full `Rank` code (u16 per coord) and is +exactly what RankQuant b=2 discards. So the signal to break dense-region ties is +recoverable WITHOUT new storage — as a rerank stage on Kendall-tau of the top-k +coordinates of the b=2 survivors. This is the data-dependent, on-the-permutohedron +version of "uncover structure in dense regions": the exploitable combinatorial +structure is in S_D (the order the encoder induced), NOT on the integer line +(primes / Sacks spiral, which act on the index and carry no corpus information — +see the conjecture thread). + +## REAL EMBEDDINGS (nomic-embed-text via ollama on RTX 5080, 768-d) + +Corpus: 8665 real sentences extracted from this repo's markdown + Rust, embedded +with `nomic-embed-text` (GPU-resident via ollama). Generator: +`examples/embed_ollama.py`. Run: `density_collapse --corpus-npy repo_real.npy`. + +First real-data facts: +- A low intrinsic dimension for nomic-embed-text is a working hypothesis, not a + measurement: deep vision-CNN IDs are low-tens (Ansuini), but there is no + published sentence-transformer figure, and our TwoNN probe (sphere-validated + only; biased OLS estimate) records no clean nomic ID here. Treat low-tens as a + cross-domain analogy that motivates the density test, not an established fact + (see `twonn_id_results.md`). +- Real embeddings are FAR more entangled at b=2 than synthetic clusters: mean + nearest-code Hamming 314/768, and ~5083 of 8665 docs sit in each probe's + b2-lookalike shell (vs ~60–120 synthetic). Real geometry, much denser collapse. + +### CORRECTED after a second adversarial review (M1 + M2) + +The first writeup reported a win-rate climb 0.667→0.930 with top-k as "the +signature of a real effect." TWO review findings invalidated that framing, and +the test was rewritten: + +- **M2 (the win-rate climb was an artifact):** the per-probe tau GAP is flat + across top-k; only the win-rate estimator's variance shrinks as k grows, + mechanically pushing the win rate up. Win rate was the wrong statistic. +- **M1 (circularity):** tau was computed on the PROBE'S OWN top-k coords, which + couples tau to cosine and makes the test near-tautological. Fixed to use the + per-PAIR UNION of top coords, chosen independently of the cosine ranking. + +Rewritten test reports the tau GAP (far − near) as an effect size, with a +bootstrap 95% CI over probes (de-circularized coords): + +| top-k | tau gap (far − near) | 95% CI | verdict | +|-------|----------------------|--------|---------| +| 8 | 0.0420 | [0.0380, 0.0463] | signal | +| 16 | 0.0417 | [0.0381, 0.0454] | signal | +| 32 | 0.0440 | [0.0409, 0.0472] | signal | +| 64 | 0.0453 | [0.0424, 0.0483] | signal | + +**Honest conclusion:** there IS a real effect — true neighbours have lower +intra-code Kendall-tau than the b2-lookalikes the code conflates them with, gap +≈ 0.04, CI strictly above 0 at every top-k. But it is MODEST and FLAT, not the +dramatic "sharpening" originally claimed. The win-rate monotonicity is retracted. +The lever (intra-code permutation order in the Rank code) shows a measurable but +small separation on this corpus/model; whether ~0.04 tau converts to useful +recall gain vs simply using b=4 is the unanswered deployment question. + +## Caveats / open + +- Real corpus here is repo-domain (md+rust), 8665 docs — narrow domain, modest + size. Confirm on a larger, broader corpus (MS MARCO / Wikipedia passages). +- **RESOLVED (negative):** the tau-rerank-vs-b4 bake-off was run + (tau_rerank_bakeoff_results.md). b=4 wins decisively even at the tau ceiling + (real: b4 0.942 vs tau 0.597, and tau is worse than b2's own 0.898). The + ~0.04 tau gap is a real binary discriminator but too weak to ORDER candidates; + it does not convert to retrieval value. Conclusion: just use b=4. The signal + is real-but-inert — no ordvec feature follows. +- Kendall-tau is computed on FP32 values restricted to the per-pair union of + top-k coords; a deployable version computes it on the stored ranks. +- Single corpus, single model (nomic-embed-text), no cross-encoder check — do + NOT read "confirmed" generality from one narrow corpus. Repeat on ≥1 more + encoder before any strong claim. + +Reproduce (synthetic): +``` +cargo run --release --example density_collapse # noise=0.30 +cargo run --release --example density_collapse -- --noise 0.10 # denser +``` + +Reproduce (real embeddings — full recorded procedure, was an E4 repro gap): +``` +# 1. extract repo sentences (md + rust, 30..300 chars, >=4 words, deduped): +python - <<'PY' +import re, glob, os, tempfile +texts=[] +for f in glob.glob("**/*.md",recursive=True)+glob.glob("**/*.rs",recursive=True): + if "target" in f.split(os.sep) or ".git" in f.split(os.sep): continue + t=re.sub(r'```.*?```','',open(f,encoding='utf-8',errors='ignore').read(),flags=re.S) + for line in re.split(r'(?<=[.!?])\s+|\n',t): + s=line.strip(" #-*/>|`").strip() + if 30<=len(s)<=300 and re.search(r'[a-zA-Z]{4}',s) and s.count(' ')>=4: texts.append(s) +seen=set(); out=[s for s in texts if not (s in seen or seen.add(s))] +open(os.path.join(tempfile.gettempdir(),"repo_texts.txt"),"w",encoding="utf-8").write("\n".join(out)) +print(len(out),"sentences") +PY +# 2. embed via ollama nomic-embed-text (GPU) -> .npy: +python examples/embed_ollama.py --texts "$TMP/repo_texts.txt" --out "$TMP/repo_real.npy" --n 9000 --batch 128 +# 3. run the probe: +cargo run --release --example density_collapse -- --corpus-npy "$TMP/repo_real.npy" --topk 32 +``` diff --git a/experiments/ordinal-routing-research/embed_ollama.py b/experiments/ordinal-routing-research/embed_ollama.py new file mode 100644 index 00000000..aa16f640 --- /dev/null +++ b/experiments/ordinal-routing-research/embed_ollama.py @@ -0,0 +1,114 @@ +#!/usr/bin/env python3 +"""Generate real sentence embeddings via ollama (GPU) -> .npy for the probes. + +ollama runs nomic-embed-text (768-d) resident on the GPU. We embed a corpus of +short texts and write a 2-D little-endian float32 C-order .npy matching the +format bench_rank / the conjecture probes read (--corpus-npy). + +Corpus source (in priority order): + 1. --texts FILE : one text per line + 2. built-in : a few thousand templated sentences across many topics, + enough varied geometry for a first real-encoder pass + (no external download, no datasets lib). + +Usage: + python examples/embed_ollama.py --out corpus_real.npy --n 5000 + python examples/embed_ollama.py --texts mylines.txt --out corpus_real.npy +""" +import argparse, json, struct, sys, urllib.request + +OLLAMA = "http://localhost:11434/api/embed" +MODEL = "nomic-embed-text" + + +def embed_batch(texts): + req = urllib.request.Request( + OLLAMA, + data=json.dumps({"model": MODEL, "input": texts}).encode(), + headers={"Content-Type": "application/json"}, + ) + with urllib.request.urlopen(req, timeout=120) as r: + embs = json.load(r)["embeddings"] + # E2 fix: ollama must return exactly one vector per input, IN ORDER. + # A mismatch would silently misalign rows against the source texts. + if len(embs) != len(texts): + raise RuntimeError( + f"embedding count {len(embs)} != input count {len(texts)} " + "(row misalignment) — aborting rather than writing corrupt .npy" + ) + return embs + + +def build_builtin_corpus(n): + # Templated sentences spanning many semantic clusters -> real anisotropy, + # real cluster structure, without an external dataset. Deterministic. + subjects = ["the engineer", "a chef", "the astronomer", "my neighbor", + "the orchestra", "a startup", "the river", "an old library", + "the algorithm", "a mountain village", "the immune system", + "a jazz quartet", "the supply chain", "an ancient manuscript", + "the coral reef", "a chess grandmaster", "the power grid", + "a desert caravan", "the neural network", "a vineyard"] + verbs = ["studied", "rebuilt", "abandoned", "celebrated", "measured", + "optimized", "flooded", "catalogued", "trained", "harvested", + "defended", "improvised", "rerouted", "deciphered", "mapped"] + objects = ["under heavy rain", "for three decades", "with great precision", + "against all odds", "in the summer of 1998", "without any funding", + "across twelve countries", "before the deadline", "at dawn", + "using only open data", "despite the noise", "on a tight budget", + "in complete silence", "after the merger", "beyond the horizon"] + out = [] + i = 0 + while len(out) < n: + s = subjects[i % len(subjects)] + v = verbs[(i // len(subjects)) % len(verbs)] + o = objects[(i // (len(subjects) * len(verbs))) % len(objects)] + out.append(f"{s} {v} {o}.") + i += 1 + return out[:n] + + +def write_npy(path, vecs): + if not vecs: + raise ValueError("no vectors to write (empty corpus?)") + n = len(vecs) + dim = len(vecs[0]) + header = ("{'descr': ' point on S^2; +//! its ADDRESS on that triple is the nearest golden-spiral index (argmin over R +//! spiral points — the EXACT nearest, an upper bound on what a closed-form +//! inverse could achieve). Address(v) = the tuple of per-triple indices. +//! Address-distance = sum of |Δindex| across triples. +//! Primary metric: Spearman rho(address-distance, FP32 cosine-distance) over a +//! random sample of corpus pairs. +//! +//! CONTROLS (golden must beat BOTH to mean anything): +//! - random-pts : same machinery, spiral replaced by R random S^2 points +//! (isolates "does the GOLDEN structure help" vs "any quantizer"). +//! - sign : ordvec's existing lever — sign bits, Hamming distance vs cosine +//! (isolates "does the spiral beat what ordvec already has"). +//! +//! PRE-REGISTERED VERDICT (fixed BEFORE running — do not move): +//! PASS : golden rho >= 0.50 AND golden beats both controls by >= 0.05. +//! => locality-preserving address; computed-retrieval worth pursuing. +//! INCONCLUSIVE : 0.30 <= golden rho < 0.50, or golden within 0.05 of a control. +//! FAIL : golden rho < 0.30, or golden <= either control. +//! => spiral address is inert; stop, same wall as ambient probes. +//! +//! Run: cargo run --release --example fib_address_gate -- \ +//! --corpus-npy /tmp/repo_corpus.npy +//! Output is a single table; nothing is committed. + +// Research probe: the header keeps aligned multi-line doc lists for readability. +#![allow(clippy::doc_overindented_list_items)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Golden-angle spiral point i of R on S^2 (the "sine spiral" address grid). +fn spiral_point(i: usize, r: usize) -> [f32; 3] { + let golden = std::f32::consts::PI * (3.0 - 5.0f32.sqrt()); + let z = if r == 1 { + 0.0 + } else { + 1.0 - 2.0 * (i as f32 + 0.5) / r as f32 + }; + let rad = (1.0 - z * z).max(0.0).sqrt(); + let theta = i as f32 * golden; + [rad * theta.cos(), rad * theta.sin(), z] +} + +/// R random points on S^2 (the structure-free control grid), seeded deterministically. +fn random_points(r: usize) -> Vec<[f32; 3]> { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x5A5A_5A5A); + (0..r) + .map(|_| { + let mut p = [gauss(&mut rng), gauss(&mut rng), gauss(&mut rng)]; + let n = (p[0] * p[0] + p[1] * p[1] + p[2] * p[2]).sqrt(); + for x in p.iter_mut() { + *x /= n; + } + p + }) + .collect() +} + +/// Per-triple address: tuple of nearest-grid-point indices (argmin = exact nearest, +/// the CEILING for any closed-form inverse). grid is the R points on S^2. +fn address(v: &[f32], dim: usize, grid: &[[f32; 3]]) -> Vec { + let n_tri = dim / 3; + let mut addr = Vec::with_capacity(n_tri); + for t in 0..n_tri { + let s = &v[t * 3..t * 3 + 3]; + let nrm = (s[0] * s[0] + s[1] * s[1] + s[2] * s[2]).sqrt().max(1e-12); + let p = [s[0] / nrm, s[1] / nrm, s[2] / nrm]; + // nearest grid point by max dot (points are unit) == min distance + let mut best = 0usize; + let mut best_dot = f32::MIN; + for (gi, g) in grid.iter().enumerate() { + let d = p[0] * g[0] + p[1] * g[1] + p[2] * g[2]; + if d > best_dot { + best_dot = d; + best = gi; + } + } + addr.push(best as u16); + } + addr +} + +fn addr_dist(a: &[u16], b: &[u16]) -> f32 { + a.iter() + .zip(b) + .map(|(x, y)| (*x as i32 - *y as i32).unsigned_abs() as f32) + .sum() +} + +fn sign_bits(v: &[f32]) -> Vec { + let mut bits = vec![0u64; v.len().div_ceil(64)]; + for (i, &x) in v.iter().enumerate() { + if x >= 0.0 { + bits[i / 64] |= 1u64 << (i % 64); + } + } + bits +} + +fn hamming(a: &[u64], b: &[u64]) -> f32 { + a.iter() + .zip(b) + .map(|(x, y)| (x ^ y).count_ones()) + .sum::() as f32 +} + +/// Spearman rho = Pearson on ranks. Ties broken by stable index (adequate for a gate). +fn spearman(xs: &[f32], ys: &[f32]) -> f32 { + let rank = |v: &[f32]| -> Vec { + let mut idx: Vec = (0..v.len()).collect(); + idx.sort_by(|&a, &b| v[a].partial_cmp(&v[b]).unwrap()); + let mut r = vec![0.0f32; v.len()]; + for (rank, &i) in idx.iter().enumerate() { + r[i] = rank as f32; + } + r + }; + let rx = rank(xs); + let ry = rank(ys); + let n = rx.len() as f32; + let mx = rx.iter().sum::() / n; + let my = ry.iter().sum::() / n; + let mut cov = 0.0; + let mut vx = 0.0; + let mut vy = 0.0; + for i in 0..rx.len() { + let dx = rx[i] - mx; + let dy = ry[i] - my; + cov += dx * dy; + vx += dx * dx; + vy += dy * dy; + } + cov / (vx.sqrt() * vy.sqrt()).max(1e-12) +} + +fn main() { + let mut corpus_path = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + if x == "--corpus-npy" { + corpus_path = Some(a.next().unwrap()); + } + } + let path = corpus_path.expect("--corpus-npy required (real embeddings)"); + let (mut corpus, n, dim) = load_npy_f32(&path); + l2_normalize_rows(&mut corpus, dim); + let n_pairs = 200_000usize; + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let pairs: Vec<(usize, usize)> = (0..n_pairs) + .map(|_| { + let i = rng.random_range(0..n); + let mut j = rng.random_range(0..n); + if j == i { + j = (j + 1) % n; + } + (i, j) + }) + .collect(); + + let row = |i: usize| &corpus[i * dim..(i + 1) * dim]; + let cos_dist: Vec = pairs + .par_iter() + .map(|&(i, j)| { + let d: f32 = row(i).iter().zip(row(j)).map(|(a, b)| a * b).sum(); + 1.0 - d + }) + .collect(); + + println!( + "# fib_address_gate (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, pairs {n_pairs}" + ); + println!("# Spearman rho vs FP32 cosine-distance (higher = better locality preservation)"); + println!("grid\tR\trho"); + + // sign baseline (no R) + let signs: Vec> = (0..n).into_par_iter().map(|i| sign_bits(row(i))).collect(); + let sign_d: Vec = pairs + .par_iter() + .map(|&(i, j)| hamming(&signs[i], &signs[j])) + .collect(); + let rho_sign = spearman(&sign_d, &cos_dist); + println!("sign\t-\t{rho_sign:.4}"); + + let mut best_golden = f32::MIN; + for &r in &[8usize, 32, 128] { + for (name, grid) in [ + ( + "golden", + (0..r).map(|i| spiral_point(i, r)).collect::>(), + ), + ("random-pts", random_points(r)), + ] { + let addrs: Vec> = (0..n) + .into_par_iter() + .map(|i| address(row(i), dim, &grid)) + .collect(); + let ad: Vec = pairs + .par_iter() + .map(|&(i, j)| addr_dist(&addrs[i], &addrs[j])) + .collect(); + let rho = spearman(&ad, &cos_dist); + if name == "golden" && rho > best_golden { + best_golden = rho; + } + println!("{name}\t{r}\t{rho:.4}"); + } + } + println!("\n# best golden rho = {best_golden:.4}"); + println!("# PRE-REGISTERED: PASS if >=0.50 and beats controls by >=0.05; FAIL if <0.30 or <= a control."); +} diff --git a/experiments/ordinal-routing-research/fib_directions.rs b/experiments/ordinal-routing-research/fib_directions.rs new file mode 100644 index 00000000..d2b77ccb --- /dev/null +++ b/experiments/ordinal-routing-research/fib_directions.rs @@ -0,0 +1,581 @@ +//! Direction-PLACEMENT bake-off for the training-free routing layer. +//! +//! Open thread from the routing research (PR #211): `shard_recall.rs` always +//! draws iid-Gaussian projection directions — low-discrepancy *placement* of the +//! directions themselves was never an arm. This rig tests one hypothesis and is +//! built to KILL it cheaply if it's false: +//! +//! Shared-index triple-tiled golden-angle directions (a 2-sphere spiral per +//! coord-triple, all triples coupled by ONE Fibonacci index so they cohere on +//! the permutahedron) need FEWER candidates-scanned for matched recall@k than +//! iid-Gaussian — strongest at small R and at Fibonacci R, decaying as dim +//! grows (concentration of measure). +//! +//! Two arms, EVERYTHING else held identical to shard_recall's "aligned" calib +//! (single shared base_width, phase=0). Direction construction is the ONLY var. +//! gaussian — iid Gaussian normalized (the control) +//! fib-spiral — shared-index triple-tiled golden-angle, per-triple rotated +//! +//! Honesty checks baked in (printed): at R=1 the arms MUST coincide (one +//! direction cannot carry low-discrepancy structure) — divergence => harness +//! bug, not a result. SEED=1 => identical tables across runs. +//! +//! Run (synthetic dim=256): cargo run --release --example fib_directions +//! Run (real dim=768): cargo run --release --example fib_directions -- \ +//! --corpus-npy repo_real.npy --queries-npy repo_q.npy +//! No external data on the synthetic path, no BLAS. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +/// Base seed; overridable with --seed for the reseed-stability honesty check. +/// A real effect must survive reseeding (changes gaussian directions AND the +/// fib/project per-triple rotations); a seed-specific gap is an artifact. +static SEED: std::sync::atomic::AtomicU64 = std::sync::atomic::AtomicU64::new(1); +fn seed() -> u64 { + SEED.load(std::sync::atomic::Ordering::Relaxed) +} + +struct Cfg { + dim: usize, + n: usize, + n_queries: usize, + k: usize, + latent_dim: usize, + n_clusters: usize, + cone: f32, + r_values: Vec, + no_triple_rotation: bool, + corpus_npy: Option, + queries_npy: Option, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 50_000, + n_queries: 200, + k: 10, + latent_dim: 64, + n_clusters: 200, + // Cone anisotropy: real transformer embeddings are NOT gaussian/centered — + // they sit in a narrow cone (a dominant shared mean direction). `cone` is + // the weight of a fixed shared axis added to every vector before + // normalization; 0.0 = centered (old behaviour), larger = tighter cone. + // Default tuned so synthetic mean-resultant-length is in the ballpark of + // nomic (printed at startup so it can be matched to real --corpus-npy data). + cone: 3.0, + // Fibonacci R values (1,2,3,5,8,13,21) interleaved with non-Fib (4,16) + // so a Fibonacci-specific bump is legible against its neighbours. + r_values: vec![1, 2, 3, 4, 5, 8, 13, 16, 21], + no_triple_rotation: false, + corpus_npy: None, + queries_npy: None, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--k" => c.k = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--cone" => c.cone = a.next().unwrap().parse().unwrap(), + "--seed" => SEED.store( + a.next().unwrap().parse().unwrap(), + std::sync::atomic::Ordering::Relaxed, + ), + "--no-triple-rotation" => c.no_triple_rotation = true, + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--queries-npy" => c.queries_npy = Some(a.next().unwrap()), + other => panic!("unknown arg {other}"), + } + } + c +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays +/// (same contract as shard_recall / bench_rank). +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + let n = v.len() / dim; + for i in 0..n { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Low-rank clustered corpus + matched queries (same construction as shard_recall). +fn make_corpus(cfg: &mut Cfg) -> (Vec, Vec) { + if let (Some(cp), Some(qp)) = (cfg.corpus_npy.clone(), cfg.queries_npy.clone()) { + let (mut corpus, n, d) = load_npy_f32(&cp); + let (mut queries, nq, dq) = load_npy_f32(&qp); + assert_eq!(d, dq, "corpus/query dim mismatch"); + l2_normalize_rows(&mut corpus, d); + l2_normalize_rows(&mut queries, dq); + cfg.dim = d; + cfg.n = n; + cfg.n_queries = nq; + eprintln!("# loaded corpus {n}x{d}, queries {nq}x{dq}"); + return (corpus, queries); + } + let mut rng = ChaCha8Rng::seed_from_u64(seed()); + let d = cfg.dim; + let l = cfg.latent_dim; + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + // Shared cone axis: one fixed unit direction added (weight `cone`) to every + // vector before normalization, making the corpus anisotropic like real + // embeddings (a dominant mean direction) rather than centered/gaussian. + let mut cone_axis = vec![0.0f32; d]; + for x in cone_axis.iter_mut() { + *x = gauss(&mut rng); + } + { + let nrm: f32 = cone_axis.iter().map(|v| v * v).sum::().sqrt(); + for x in cone_axis.iter_mut() { + *x /= nrm; + } + } + let cone = cfg.cone; + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc + cone * cone_axis[i]; + } + let nrm: f32 = out.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in out.iter_mut() { + *v /= nrm; + } + } + out + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let c = rng.random_range(0..cfg.n_clusters); + let e = make(&protos[c * l..(c + 1) * l], 0.3, &mut rng); + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + let mut queries = vec![0.0f32; cfg.n_queries * d]; + for i in 0..cfg.n_queries { + let c = rng.random_range(0..cfg.n_clusters); + let e = make(&protos[c * l..(c + 1) * l], 0.1, &mut rng); + queries[i * d..(i + 1) * d].copy_from_slice(&e); + } + (corpus, queries) +} + +/// FP32 brute-force cosine top-k ground truth (corpus is L2-normalized already). +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut scored: Vec<(usize, f32)> = (0..n) + .map(|di| { + let doc = &corpus[di * dim..(di + 1) * dim]; + let dot: f32 = q.iter().zip(doc).map(|(a, b)| a * b).sum(); + (di, dot) + }) + .collect(); + scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + scored.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +/// Mean resultant length R̄ = ‖ (1/n) Σ x_i ‖ over L2-normalized rows. +/// 0 ⇒ isotropic (directions cancel); →1 ⇒ tight cone (one shared direction). +/// Printed for BOTH synthetic and real corpora so the synthetic anisotropy can +/// be matched to nomic's — the whole point of the user's anisotropy note. +fn mean_resultant_length(corpus: &[f32], dim: usize) -> f32 { + let n = corpus.len() / dim; + let mut mean = vec![0.0f64; dim]; + for di in 0..n { + let row = &corpus[di * dim..(di + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for (m, &x) in mean.iter_mut().zip(row) { + *m += (x / nrm) as f64; + } + } + } + for m in mean.iter_mut() { + *m /= n as f64; + } + (mean.iter().map(|m| m * m).sum::().sqrt()) as f32 +} + +/// The dominant cone axis of a corpus = its normalized mean direction. Returned +/// as a unit vector; used by the project-then-spiral arm to deflate the shared +/// component before placing directions (anisotropic data lives off this axis). +fn dominant_axis(corpus: &[f32], dim: usize) -> Vec { + let n = corpus.len() / dim; + let mut mean = vec![0.0f32; dim]; + for di in 0..n { + let row = &corpus[di * dim..(di + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for (m, &x) in mean.iter_mut().zip(row) { + *m += x / nrm; + } + } + } + let nrm: f32 = mean.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for m in mean.iter_mut() { + *m /= nrm; + } + } + mean +} + +#[derive(Clone, Copy, PartialEq)] +enum DirArm { + Gaussian, + FibSpiral, + ProjectSpiral, +} + +impl DirArm { + fn name(self) -> &'static str { + match self { + DirArm::Gaussian => "gaussian", + DirArm::FibSpiral => "fib-spiral", + DirArm::ProjectSpiral => "project-spiral", + } + } +} + +/// A single golden-angle spiral point on S^2 for index `i` of `r` total points. +/// z evenly stepped in [-1,1], azimuth advanced by the golden angle π(3-√5). +fn spiral_point(i: usize, r: usize) -> [f32; 3] { + let golden = std::f32::consts::PI * (3.0 - 5.0f32.sqrt()); + let z = if r == 1 { + 0.0 + } else { + 1.0 - 2.0 * (i as f32 + 0.5) / r as f32 + }; + let rad = (1.0 - z * z).max(0.0).sqrt(); + let theta = i as f32 * golden; + [rad * theta.cos(), rad * theta.sin(), z] +} + +/// Deterministic 3x3 rotation for triple `t`: built by Gram–Schmidt on three +/// Gaussian columns from a triple-indexed RNG. Data-oblivious and reproducible. +/// Identity when `no_rotation` (the pure shared-index variant). +fn triple_rotation(t: usize, no_rotation: bool) -> [[f32; 3]; 3] { + if no_rotation { + return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]; + } + let mut rng = ChaCha8Rng::seed_from_u64(seed() ^ 0xA071_u64.wrapping_mul(t as u64 + 1)); + let col = |rng: &mut ChaCha8Rng| [gauss(rng), gauss(rng), gauss(rng)]; + let dot = |a: &[f32; 3], b: &[f32; 3]| a[0] * b[0] + a[1] * b[1] + a[2] * b[2]; + let norm = |a: &mut [f32; 3]| { + let n = dot(a, a).sqrt(); + if n > 0.0 { + for x in a.iter_mut() { + *x /= n; + } + } + }; + let mut e0 = col(&mut rng); + norm(&mut e0); + let mut v1 = col(&mut rng); + let p1 = dot(&v1, &e0); + for i in 0..3 { + v1[i] -= p1 * e0[i]; + } + norm(&mut v1); + // e2 = e0 × e1 (right-handed, guaranteed orthonormal) + let e2 = [ + e0[1] * v1[2] - e0[2] * v1[1], + e0[2] * v1[0] - e0[0] * v1[2], + e0[0] * v1[1] - e0[1] * v1[0], + ]; + [e0, v1, e2] +} + +/// Build R unit directions of dimension `dim` for one arm. +/// - Gaussian: iid normal columns, normalized (the control). +/// - FibSpiral: tile `dim` into d/3 triples; direction i = concat over triples +/// of (rotation_t · spiral_point(i, R)); ONE shared index i couples triples. +/// - ProjectSpiral: same as FibSpiral but each direction is then deflated of its +/// component along `cone_axis` (the data's dominant direction) and renormalized, +/// so directions live OFF the shared cone where anisotropic data actually splits. +fn build_dirs(arm: DirArm, r: usize, dim: usize, no_rot: bool, cone_axis: &[f32]) -> Vec> { + match arm { + DirArm::Gaussian => { + let mut rng = ChaCha8Rng::seed_from_u64(seed() ^ 0xD132_0000 ^ (r as u64)); + (0..r) + .map(|_| { + let mut v = vec![0.0f32; dim]; + for x in v.iter_mut() { + *x = gauss(&mut rng); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + }) + .collect() + } + DirArm::FibSpiral | DirArm::ProjectSpiral => { + let n_tri = dim.div_ceil(3); // pad the tail triple if dim % 3 != 0 + let rots: Vec<[[f32; 3]; 3]> = (0..n_tri).map(|t| triple_rotation(t, no_rot)).collect(); + (0..r) + .map(|i| { + let p = spiral_point(i, r); // shared index i across all triples + let mut v = vec![0.0f32; dim]; + for (t, rot) in rots.iter().enumerate() { + // rotated triple vector + let rv = [ + rot[0][0] * p[0] + rot[1][0] * p[1] + rot[2][0] * p[2], + rot[0][1] * p[0] + rot[1][1] * p[1] + rot[2][1] * p[2], + rot[0][2] * p[0] + rot[1][2] * p[1] + rot[2][2] * p[2], + ]; + for (c, &rvc) in rv.iter().enumerate() { + let idx = t * 3 + c; + if idx < dim { + v[idx] = rvc; + } + } + } + if arm == DirArm::ProjectSpiral { + // deflate the dominant cone component, then renormalize + let proj: f32 = v.iter().zip(cone_axis).map(|(a, b)| a * b).sum(); + for (x, &c) in v.iter_mut().zip(cone_axis) { + *x -= proj * c; + } + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() + } + } +} + +/// One projection grid: unit direction + grid width (phase=0 for all arms here). +struct Proj { + dir: Vec, + width: f32, +} + +impl Proj { + fn bucket(&self, v: &[f32]) -> i64 { + let dot: f32 = self.dir.iter().zip(v).map(|(a, b)| a * b).sum(); + (dot / self.width).floor() as i64 + } +} + +type BucketIndex = Vec>>; + +fn build_index(projs: &[Proj], corpus: &[f32], dim: usize) -> BucketIndex { + let n = corpus.len() / dim; + projs + .par_iter() + .map(|p| { + let mut m: std::collections::HashMap> = std::collections::HashMap::new(); + for di in 0..n { + let b = p.bucket(&corpus[di * dim..(di + 1) * dim]); + m.entry(b).or_default().push(di as u32); + } + m + }) + .collect() +} + +fn probe_recall( + projs: &[Proj], + index: &BucketIndex, + q: &[f32], + rad: i64, + truth: &[usize], +) -> (usize, f32) { + use std::collections::HashSet; + let mut cand: HashSet = HashSet::new(); + for (pi, p) in projs.iter().enumerate() { + let b = p.bucket(q); + for bb in (b - rad)..=(b + rad) { + if let Some(ids) = index[pi].get(&bb) { + cand.extend(ids.iter().copied()); + } + } + } + let hits = truth + .iter() + .filter(|&&i| cand.contains(&(i as u32))) + .count(); + let recall = hits as f32 / truth.len().max(1) as f32; + (cand.len(), recall) +} + +fn run(cfg: &Cfg, corpus: &[f32], queries: &[f32], truth: &[Vec]) { + // base_width: single unit grid → ~sqrt(n) buckets (shard_recall calibration). + let spread = 6.0 / (cfg.dim as f32).sqrt(); + let base_width = spread / (cfg.n as f32).sqrt(); + let cone_axis = dominant_axis(corpus, cfg.dim); + let arms = [DirArm::Gaussian, DirArm::FibSpiral, DirArm::ProjectSpiral]; + + let mut pts: Vec<(&'static str, f64, f64)> = Vec::new(); + println!("# recall vs candidates-scanned (mean over queries), per arm per R"); + println!("arm\tR\trad\tcand_scanned\trecall@k"); + for &r in &cfg.r_values { + for &arm in &arms { + let dirs = build_dirs(arm, r, cfg.dim, cfg.no_triple_rotation, &cone_axis); + let projs: Vec = dirs + .into_iter() + .map(|dir| Proj { + dir, + width: base_width, + }) + .collect(); + let index = build_index(&projs, corpus, cfg.dim); + for rad in [0i64, 1, 2, 4] { + let stats: Vec<(usize, f32)> = (0..cfg.n_queries) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * cfg.dim..(qi + 1) * cfg.dim]; + probe_recall(&projs, &index, q, rad, &truth[qi]) + }) + .collect(); + let mean_cand = stats.iter().map(|s| s.0 as f64).sum::() / stats.len() as f64; + let mean_rec = stats.iter().map(|s| s.1 as f64).sum::() / stats.len() as f64; + println!( + "{}\t{r}\t{rad}\t{:.0}\t{:.4}", + arm.name(), + mean_cand, + mean_rec + ); + pts.push((arm.name(), mean_cand, mean_rec)); + } + } + } + + println!("\n# FAIR ENVELOPE: max recall@k at candidates-scanned <= budget"); + let budgets = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]; + print!("budget"); + for a in &arms { + print!("\t{}", a.name()); + } + println!(); + for &bud in &budgets { + print!("{bud:.0}"); + for a in &arms { + let best = pts + .iter() + .filter(|(name, c, _)| *name == a.name() && *c <= bud) + .map(|(_, _, r)| *r) + .fold(0.0f64, f64::max); + print!("\t{best:.4}"); + } + println!(); + } +} + +fn main() { + let mut cfg = parse(); + let (corpus, queries) = make_corpus(&mut cfg); + let truth = ground_truth(&corpus, &queries, cfg.dim, cfg.k); + let src = if cfg.corpus_npy.is_some() { + "npy" + } else { + "synthetic-cone" + }; + let rbar_c = mean_resultant_length(&corpus, cfg.dim); + let rbar_q = mean_resultant_length(&queries, cfg.dim); + println!( + "# fib_directions: n={} q={} dim={} k={} corpus={src} cone={}", + cfg.n, cfg.n_queries, cfg.dim, cfg.k, cfg.cone + ); + println!( + "# ANISOTROPY mean-resultant-length: corpus={rbar_c:.4} queries={rbar_q:.4} \ + (0=isotropic, 1=single cone) — match synthetic to real before trusting synthetic" + ); + if cfg.no_triple_rotation { + println!("# fib/project arms: NO per-triple rotation (pure shared-index variant)"); + } + run(&cfg, &corpus, &queries, &truth); +} diff --git a/experiments/ordinal-routing-research/fib_reduction.rs b/experiments/ordinal-routing-research/fib_reduction.rs new file mode 100644 index 00000000..0b09ba43 --- /dev/null +++ b/experiments/ordinal-routing-research/fib_reduction.rs @@ -0,0 +1,403 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — three independent probes of the hypothesis +//! "golden/Fibonacci is a REDUCTION RATE of the permutation space" (NOT an +//! address). Isolated by request: touches no routing harness, writes nothing, +//! and each experiment's verdict is fixed BELOW, before any data is seen. No +//! result here may reframe another experiment on this branch. +//! +//! Permutahedron P_{n-1} has n! vertices; a rank code is one vertex. The +//! hypothesis: at scale, the RETRIEVAL-relevant structure is a sub-factorial +//! reduction of that space, and golden ratio is its rate. +//! +//! ===================== EXP1 — FAN-OUT RATE ===================== +//! Math fact (not measured, it's a theorem): #{σ : |σ(i)-i| <= 1} = F_{n+1}, +//! ratio -> φ. The MEASURABLE crux: do the encoder's FP32 true neighbours +//! actually LIVE in the low-displacement (local) permutation neighbourhood of +//! the query's rank code, and does that neighbourhood grow at rate φ in DATA? +//! metric A (locality) = mean footrule(query, true-nbr) / mean footrule(random pair) +//! metric B (φ fan-out) = geometric growth ratio of |docs within footrule r| over r +//! PRE-REGISTERED: +//! locality PASS if A <= 0.75 (true nbrs meaningfully closer in rank space) +//! fan-out GOLDEN if B in [1.50, 1.75]; else "data-determined (intrinsic dim), not φ" +//! +//! ================= EXP2 — COARSENING SCHEDULE ================= +//! Bucket ranks into B bins; retrieve top-K by L1 on the B-bin code; recall@10 +//! vs bits/coord = log2(B). Compare bit-allocation schedules: +//! fib {2,3,5,8,13,21} pow2 {2,4,8,16,32} linear {2,3,4,5,6,7} +//! PRE-REGISTERED: +//! Fibonacci PASS if at matched bits it beats BOTH pow2 and linear by >= 0.03 +//! recall somewhere AND is never worse than either by > 0.03; else +//! "schedule-independent (recall is one smooth curve in bits; φ no advantage)" +//! +//! ==================== EXP3 — MERGE RATE ====================== +//! As ranks are progressively coarsened (B descending, Fibonacci-spaced), at +//! what RATE do permutations merge? Two levels: code-identity (do full codes +//! collide?) and neighbourhood (|docs within a fixed footrule ball|). +//! PRE-REGISTERED: +//! GOLDEN if successive neighbourhood-merge ratios converge to [1.50, 1.75]; +//! else report the actual law (expected: combinatorial ~1/B, not φ). +//! +//! Run: cargo run --release --example fib_reduction -- \ +//! --corpus-npy /tmp/repo_corpus.npy --queries-npy /tmp/repo_q.npy +//! (optional first positional: exp1 | exp2 | exp3; default runs all three) + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +/// Rank-encode: rank[i] = ordinal position (0..dim) of coordinate i within the row. +fn rank_encode(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut out = vec![0u16; n * dim]; + out.par_chunks_mut(dim).enumerate().for_each(|(r, code)| { + let row = &rows[r * dim..(r + 1) * dim]; + let mut idx: Vec = (0..dim as u16).collect(); + idx.sort_by(|&a, &b| row[a as usize].partial_cmp(&row[b as usize]).unwrap()); + for (rank, &i) in idx.iter().enumerate() { + code[i as usize] = rank as u16; + } + }); + out +} + +fn footrule(a: &[u16], b: &[u16]) -> f64 { + a.iter() + .zip(b) + .map(|(x, y)| (*x as i32 - *y as i32).unsigned_abs() as f64) + .sum() +} + +/// FP32 cosine top-k ground truth (rows already L2-normalized). +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +struct Data { + corpus: Vec, + queries: Vec, + dim: usize, + n: usize, + nq: usize, + crank: Vec, + qrank: Vec, + truth: Vec>, +} + +fn exp1_fanout(d: &Data) { + println!("\n## EXP1 — FAN-OUT RATE"); + let crow = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let qrow = |i: usize| &d.qrank[i * d.dim..(i + 1) * d.dim]; + + // metric A: locality of true neighbours vs random pairs, in rank (footrule) space + let true_med: f64 = (0..d.nq) + .into_par_iter() + .map(|qi| { + let mut fs: Vec = d.truth[qi] + .iter() + .map(|&di| footrule(qrow(qi), crow(di as usize))) + .collect(); + fs.sort_by(|a, b| a.partial_cmp(b).unwrap()); + fs[fs.len() / 2] + }) + .sum::() + / d.nq as f64; + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let rand_pairs: Vec<(usize, usize)> = (0..20_000) + .map(|_| (rng.random_range(0..d.n), rng.random_range(0..d.n))) + .collect(); + let rand_med = { + let mut fs: Vec = rand_pairs + .par_iter() + .map(|&(i, j)| footrule(crow(i), crow(j))) + .collect(); + fs.sort_by(|a, b| a.partial_cmp(b).unwrap()); + fs[fs.len() / 2] + }; + let a_ratio = true_med / rand_med; + println!(" metric A locality: true-nbr median footrule {true_med:.0} / random {rand_med:.0} = {a_ratio:.4}"); + println!( + " -> {}", + if a_ratio <= 0.75 { + "PASS (true neighbours closer in rank space)" + } else { + "FAIL (rank space does not localize neighbours)" + } + ); + + // metric B: neighbourhood growth ratio over footrule shells, on a query sample + let sample: Vec = (0..d.nq).step_by(4).collect(); + let radii: Vec = (1..=8).map(|s| rand_med * (s as f64) / 8.0).collect(); + let mut counts = vec![0f64; radii.len()]; + for &qi in &sample { + let dists: Vec = (0..d.n) + .into_par_iter() + .map(|di| footrule(qrow(qi), crow(di))) + .collect(); + for (ri, &r) in radii.iter().enumerate() { + counts[ri] += dists.iter().filter(|&&x| x <= r).count() as f64; + } + } + let mut ratios = vec![]; + for w in counts.windows(2) { + if w[0] > 0.0 { + ratios.push(w[1] / w[0]); + } + } + let gmean = ratios.iter().map(|r| r.ln()).sum::() / ratios.len() as f64; + let g = gmean.exp(); + println!(" metric B fan-out growth ratio (per equal radius step) = {g:.3}"); + println!( + " -> {}", + if (1.50..=1.75).contains(&g) { + "GOLDEN (≈φ)" + } else { + "data-determined (set by intrinsic dim), NOT φ" + } + ); +} + +fn exp2_schedule(d: &Data) { + println!("\n## EXP2 — COARSENING SCHEDULE"); + let crow = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let qrow = |i: usize| &d.qrank[i * d.dim..(i + 1) * d.dim]; + let cand_k = 100usize; + let bucket = |rank: u16, b: u32| -> u16 { ((rank as u32 * b) / d.dim as u32) as u16 }; + // recall@10 (truth top-10 captured within top-cand_k by L1 on B-bin code) + let recall_for_b = |b: u32| -> f64 { + let ccode: Vec = d.crank.iter().map(|&r| bucket(r, b)).collect(); + (0..d.nq) + .into_par_iter() + .map(|qi| { + let qc: Vec = qrow(qi).iter().map(|&r| bucket(r, b)).collect(); + let mut s: Vec<(u32, i64)> = (0..d.n) + .map(|di| { + let dc = &ccode[di * d.dim..(di + 1) * d.dim]; + let dist: i64 = qc + .iter() + .zip(dc) + .map(|(x, y)| (*x as i64 - *y as i64).abs()) + .sum(); + (di as u32, dist) + }) + .collect(); + s.sort_by_key(|x| x.1); + let top: std::collections::HashSet = + s.into_iter().take(cand_k).map(|x| x.0).collect(); + let hit = d.truth[qi].iter().filter(|t| top.contains(t)).count(); + hit as f64 / d.truth[qi].len() as f64 + }) + .sum::() + / d.nq as f64 + }; + let all_b = [2u32, 3, 4, 5, 6, 7, 8, 13, 16, 21, 32]; + let mut rec = std::collections::HashMap::new(); + for &b in &all_b { + rec.insert(b, recall_for_b(b)); + } + let show = |name: &str, bs: &[u32], rec: &std::collections::HashMap| { + println!(" {name}:"); + for &b in bs { + println!( + " B={b:>2} bits={:.2} recall@10={:.4}", + (b as f64).log2(), + rec[&b] + ); + } + }; + let _ = crow; // crow unused here; codes built from crank directly + show("fib ", &[2, 3, 5, 8, 13, 21], &rec); + show("pow2 ", &[2, 4, 8, 16, 32], &rec); + show("linear", &[2, 3, 4, 5, 6, 7], &rec); + println!(" -> recall is a function of bits=log2(B); schedules only SAMPLE that curve."); + println!( + " Fibonacci PASS only if its points beat pow2 AND linear at matched bits by >=0.03." + ); +} + +fn exp3_merge(d: &Data) { + println!("\n## EXP3 — MERGE RATE"); + let crow = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let bucket = |rank: u16, b: u32| -> u16 { ((rank as u32 * b) / d.dim as u32) as u16 }; + // (a) code-identity: do full B-bin codes collide as B shrinks? + println!(" (a) distinct full codes among {} docs:", d.n); + let fib_levels = [21u32, 13, 8, 5, 3, 2]; + let mut distinct = vec![]; + for &b in &fib_levels { + let set: std::collections::HashSet> = (0..d.n) + .map(|i| crow(i).iter().map(|&r| bucket(r, b)).collect::>()) + .collect(); + distinct.push(set.len()); + println!( + " B={b:>2}: {} distinct ({:.1}% of n)", + set.len(), + 100.0 * set.len() as f64 / d.n as f64 + ); + } + // (b) neighbourhood merge: mean docs within a fixed footrule ball, across Fibonacci radii + let crow2 = |i: usize| &d.crank[i * d.dim..(i + 1) * d.dim]; + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x3333); + let probes: Vec = (0..50).map(|_| rng.random_range(0..d.n)).collect(); + // a coarse footrule scale: median random footrule + let scale = { + let mut fs: Vec = (0..5000) + .map(|_| { + footrule( + crow2(rng.random_range(0..d.n)), + crow2(rng.random_range(0..d.n)), + ) + }) + .collect(); + fs.sort_by(|a, b| a.partial_cmp(b).unwrap()); + fs[fs.len() / 2] + }; + let fib_r = [1f64, 2.0, 3.0, 5.0, 8.0, 13.0]; // Fibonacci-spaced shell radii (× scale/13) + let mut counts = vec![0f64; fib_r.len()]; + for &p in &probes { + let dists: Vec = (0..d.n) + .into_par_iter() + .map(|di| footrule(crow2(p), crow2(di))) + .collect(); + for (ri, &fr) in fib_r.iter().enumerate() { + let r = scale * fr / 13.0; + counts[ri] += dists.iter().filter(|&&x| x <= r).count() as f64; + } + } + println!(" (b) neighbourhood size at Fibonacci-spaced radii:"); + for (ri, &fr) in fib_r.iter().enumerate() { + println!( + " r={fr:>4} (×scale/13): mean |ball| = {:.1}", + counts[ri] / probes.len() as f64 + ); + } + let mut ratios = vec![]; + for w in counts.windows(2) { + if w[0] > 0.0 { + ratios.push(w[1] / w[0]); + } + } + if !ratios.is_empty() { + let g = (ratios.iter().map(|r| r.ln()).sum::() / ratios.len() as f64).exp(); + println!( + " merge growth ratio across shells = {g:.3} -> {}", + if (1.50..=1.75).contains(&g) { + "GOLDEN (≈φ)" + } else { + "NOT φ (set by data geometry / shell spacing)" + } + ); + } +} + +fn main() { + let mut which = None; + let mut corpus_path = None; + let mut queries_path = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => corpus_path = Some(a.next().unwrap()), + "--queries-npy" => queries_path = Some(a.next().unwrap()), + "exp1" | "exp2" | "exp3" => which = Some(x), + _ => {} + } + } + let cp = corpus_path.expect("--corpus-npy required"); + let qp = queries_path.expect("--queries-npy required"); + let (mut corpus, n, dim) = load_npy_f32(&cp); + let (mut queries, nq, dq) = load_npy_f32(&qp); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + let crank = rank_encode(&corpus, n, dim); + let qrank = rank_encode(&queries, nq, dim); + let truth = ground_truth(&corpus, &queries, dim, 10); + let d = Data { + corpus, + queries, + dim, + n, + nq, + crank, + qrank, + truth, + }; + println!("# fib_reduction (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}"); + let _ = (&d.corpus, &d.queries); // kept for parity; rank-space metrics use crank/qrank + match which.as_deref() { + Some("exp1") => exp1_fanout(&d), + Some("exp2") => exp2_schedule(&d), + Some("exp3") => exp3_merge(&d), + _ => { + exp1_fanout(&d); + exp2_schedule(&d); + exp3_merge(&d); + } + } +} diff --git a/experiments/ordinal-routing-research/gen_corpus.rs b/experiments/ordinal-routing-research/gen_corpus.rs new file mode 100644 index 00000000..e12ab09d --- /dev/null +++ b/experiments/ordinal-routing-research/gen_corpus.rs @@ -0,0 +1,279 @@ +//! Corpus zoo generator — writes diverse embedding geometries as .npy so the +//! conjecture probes (spectral_probe, shard_recall, twonn_id, bench_rank) can +//! be verified across distributions, not just the one synthetic clustered +//! Gaussian. Decouples generation from measurement and dogfoods the .npy path. +//! +//! Kinds (each stresses a different failure mode / mimics a real pathology): +//! clustered — low-rank Gaussian clusters (the existing baseline) +//! isotropic — N(0,I) on the sphere (Poisson NULL control) +//! anisotropic — power-law singular spectrum sigma_k ~ k^-alpha (real +//! embedding decay; Gao et al. narrow-cone) +//! rogue — a few dimensions with huge variance (Timkey rogue dims) +//! manifold — nonlinear low-D manifold (sinusoidal) in high-D: the one +//! place angular RIGIDITY could plausibly hide +//! lattice — jittered 1-D lattice projected up: POSITIVE rigidity +//! control. If the number-variance probe never reports +//! sub-Poisson, it can't see rigidity; this proves it can. +//! +//! Run: cargo run --release --example gen_corpus -- --kind anisotropic \ +//! --n 50000 --dim 256 --out corpus_aniso.npy +//! Also emits a matched query file with --queries-out (smaller-noise draws). + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; + +struct Cfg { + kind: String, + n: usize, + n_queries: usize, + dim: usize, + latent: usize, + clusters: usize, + alpha: f32, + rogue: usize, + out: String, + queries_out: Option, + seed: u64, +} + +fn parse() -> Cfg { + let mut c = Cfg { + kind: "clustered".into(), + n: 50_000, + n_queries: 200, + dim: 256, + latent: 64, + clusters: 200, + alpha: 1.0, + rogue: 3, + out: "corpus.npy".into(), + queries_out: None, + seed: 1, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--kind" => c.kind = a.next().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent = a.next().unwrap().parse().unwrap(), + "--clusters" => c.clusters = a.next().unwrap().parse().unwrap(), + "--alpha" => c.alpha = a.next().unwrap().parse().unwrap(), + "--rogue" => c.rogue = a.next().unwrap().parse().unwrap(), + "--out" => c.out = a.next().unwrap(), + "--queries-out" => c.queries_out = Some(a.next().unwrap()), + "--seed" => c.seed = a.next().unwrap().parse().unwrap(), + other => panic!("unknown arg {other}"), + } + } + c +} + +/// Inverse standard-normal CDF (Acklam's rational approximation, |err|<1e-9). +fn inv_norm_cdf(p: f64) -> f64 { + let p = p.clamp(1e-12, 1.0 - 1e-12); + let a = [-3.969683028665376e+01, 2.209460984245205e+02, -2.759285104469687e+02, + 1.383577518672690e+02, -3.066479806614716e+01, 2.506628277459239e+00]; + let b = [-5.447609879822406e+01, 1.615858368580409e+02, -1.556989798598866e+02, + 6.680131188771972e+01, -1.328068155288572e+01]; + let c = [-7.784894002430293e-03, -3.223964580411365e-01, -2.400758277161838e+00, + -2.549732539343734e+00, 4.374664141464968e+00, 2.938163982698783e+00]; + let d = [7.784695709041462e-03, 3.224671290700398e-01, 2.445134137142996e+00, + 3.754408661907416e+00]; + let plow = 0.02425; + if p < plow { + let q = (-2.0 * p.ln()).sqrt(); + (((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + } else if p <= 1.0 - plow { + let q = p - 0.5; + let r = q * q; + (((((a[0] * r + a[1]) * r + a[2]) * r + a[3]) * r + a[4]) * r + a[5]) * q + / (((((b[0] * r + b[1]) * r + b[2]) * r + b[3]) * r + b[4]) * r + 1.0) + } else { + let q = (-2.0 * (1.0 - p).ln()).sqrt(); + -(((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Write a 2-D row-major f32 buffer as a NumPy v1 .npy (().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } +} + +/// Generate `count` L2-normalized rows of dimension cfg.dim for the chosen +/// geometry. `is_query` reduces noise so queries sit near corpus structure. +fn generate(cfg: &Cfg, rng: &mut ChaCha8Rng, count: usize, is_query: bool) -> Vec { + let d = cfg.dim; + let l = cfg.latent; + let noise = if is_query { 0.1 } else { 0.3 }; + // Bug O fix: A and prototypes define the SHARED latent geometry and MUST be + // identical for corpus and queries (otherwise queries land in a different + // space and shard_recall ground truth is meaningless). Seed them from a + // dedicated, geometry-only RNG keyed by cfg.seed, NOT the streaming `rng` + // (whose state differs between the corpus and query calls). + let mut geo = ChaCha8Rng::seed_from_u64(cfg.seed ^ 0x9E37_79B9_7F4A_7C15); + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut geo); + } + let mut protos = vec![0.0f32; cfg.clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut geo); + } + // power-law per-latent scaling for "anisotropic": sigma_k ~ (k+1)^-alpha + let scale: Vec = (0..l) + .map(|k| ((k + 1) as f32).powf(-cfg.alpha)) + .collect(); + + let mut out = vec![0.0f32; count * d]; + for i in 0..count { + let row = &mut out[i * d..(i + 1) * d]; + match cfg.kind.as_str() { + "isotropic" => { + for x in row.iter_mut() { + *x = gauss(rng); + } + } + "clustered" => { + let c = rng.random_range(0..cfg.clusters); + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = protos[c * l + j] + noise * gauss(rng); + } + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[ii * l + j] * z[j]; + } + row[ii] = acc; + } + } + "anisotropic" => { + // power-law spectrum: latent coords scaled by k^-alpha + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = scale[j] * gauss(rng); + } + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[ii * l + j] * z[j]; + } + row[ii] = acc; + } + } + "rogue" => { + // baseline isotropic + a few dims with ~10x variance + for x in row.iter_mut() { + *x = gauss(rng); + } + for r in 0..cfg.rogue.min(d) { + row[r] = 10.0 * gauss(rng); + } + } + "manifold" => { + // nonlinear low-D manifold: latent t in R^m mapped through + // sinusoids of varying frequency, then projected up. Angular + // structure here is the best chance for rigidity to appear. + let m = l.min(8).max(2); + let mut t = vec![0.0f32; m]; + for tj in t.iter_mut() { + *tj = rng.random_range(0.0..std::f32::consts::TAU); + } + // build a smooth high-D embedding: row[ii] = sum_j sin((j+1) t_j + phase_ii) + for ii in 0..d { + let mut acc = 0.0f32; + for j in 0..m { + let phase = a[ii * l + j]; // reuse A as fixed phases + acc += ((j + 1) as f32 * t[j] + phase).sin(); + } + row[ii] = acc + noise * gauss(rng); + } + } + "projected-rigid" => { + // STRONGEST rigidity control: place a quantile-spaced target on + // a FIXED direction u (= col 0 of A, unit-normalized once) and + // make it dominate. x_i = t_i * u (+ tiny ortho jitter). Then + // for ANY probe direction r, r.x_i = t_i * (r.u), a scaled copy + // of the rigid sequence — so the projected key inherits the + // quantile spacing (up to L2 distortion). This tests whether the + // random-projection + normalize pipeline PRESERVES rigidity that + // genuinely lives in the key, closing the selftest gap. + let t = inv_norm_cdf((i as f64 + 0.5) / count as f64) as f32; + for ii in 0..d { + row[ii] = t * a[ii * l] + 0.001 * gauss(rng); + } + } + "lattice" => { + // POSITIVE rigidity control. Each row gets a DISTINCT, evenly + // spaced position pos = i/count in [-1,1] (used once — no + // duplicates). Embedded on a small ARC: a constant base + // direction (col 0 of A) plus a small pos-scaled component along + // an orthogonal direction (col 1). The constant base survives + // L2 normalization (a pure line would collapse to +/- one + // vector); projecting onto the pos-direction recovers the + // evenly-spaced sequence -> SUB-Poisson (rigid) number variance. + // Gaussian-QUANTILE-spaced positions: pos_i = Phi^-1((i+0.5)/count). + // The marginal is then exactly N(0,1), matching the probe's + // Gaussian unfold, so any residual structure is REAL. Evenly + // spaced in probability = maximally regular = rigid. + let pos = inv_norm_cdf((i as f64 + 0.5) / count as f64) as f32; + for ii in 0..d { + let base = a[ii * l]; + let along = a[ii * l + 1]; + row[ii] = base + 0.35 * pos * along + 0.01 * gauss(rng); + } + } + other => panic!("unknown kind {other}"), + } + l2(row); + } + out +} + +fn main() { + let cfg = parse(); + let mut rng = ChaCha8Rng::seed_from_u64(cfg.seed); + let corpus = generate(&cfg, &mut rng, cfg.n, false); + write_npy(&cfg.out, &corpus, cfg.n, cfg.dim); + if let Some(ref qp) = cfg.queries_out { + let q = generate(&cfg, &mut rng, cfg.n_queries, true); + write_npy(qp, &q, cfg.n_queries, cfg.dim); + } +} diff --git a/experiments/ordinal-routing-research/length_mixture_lake_results.md b/experiments/ordinal-routing-research/length_mixture_lake_results.md new file mode 100644 index 00000000..31b2e664 --- /dev/null +++ b/experiments/ordinal-routing-research/length_mixture_lake_results.md @@ -0,0 +1,126 @@ +# Path B — the chunk-length-mixture lake (deployment robustness, final arc) + +The Phase B caveat split the unmodeled real-lake pathologies in two: OCR-garbage / +mixed-language (uncapturable from clean embeddings — needs a real dirty S3 sample) +and **chunk-length heterogeneity** (a *mixture of geometries* — capturable +synthetically). The directions writeup flagged the latter as **Path B** +(`oblivious_directions_results.md`, "chunk-length heterogeneity is itself a lake +pathology (folds into Path B)") and named chunk length a **third geometry axis** +(longer chunks → tokens average → vectors regress to the corpus mean → tighter cone, +lower ID). Path B tests both at once: does a real chunk-length *mixture* break b=4 raw +rank routing, and how big is the chunk-length geometry axis actually? + +Domain is held constant (fiqa, 57.6k docs), only the chunk length varies — the clean +complement to Phase B, which held length constant and varied domain. + +## Setup + +Embed the SAME 57,600 fiqa docs (nomic-embed-text, gpu1) at four chunk-length caps +`{128, 256, 512, 1100}` chars, then union into one 230,388-doc lake +(`make_length_lake.py`). Queries: the full-length fiqa held-out set (1000) — a real +lake mixes *document* chunk lengths, not query lengths. Probe: the **same +`centering_recall.rs`** whose raw b=4 arm IS the incumbent that beat every +oblivious-structure idea. Metric: R@10 of b-bit rank codes vs **FP32-cosine top-10 +recomputed on each corpus** — a routing-*fidelity* metric, so the 4× size difference +cancels (it hits the code arm and its own FP32 ground truth equally; no size confound). + +## Pre-registration (locked before the lake run) + +- **Path B fear CONFIRMED** (the first positive of the whole arc) if b=4 raw R@10 on + the length mixture drops **≥ 0.02** vs the single-length fiqa@512 baseline. +- **FALSIFIED** (consistent with every prior lake fear) if flat to noise (|Δ| < 0.02). +- Secondary, pre-stated from the "third geometry axis" claim: longer chunks → tighter + cone (higher R̄) and the strata cones point in *different* directions (a true mixture + of geometries, not one). + +## Result 1 — the chunk-length geometry axis is real but SMALL and CO-AXIAL + +Per-stratum cone tightness R̄ (mean cosine to that stratum's centroid) and pairwise +centroid-axis cosine, all on the identical 57.6k docs: + +| chunk cap | R̄ (cone tightness) | NaN/zero rows | +|-----------|--------------------|---------------| +| 128 | 0.7046 | 0 | +| 256 | 0.7151 | 0 | +| 512 | 0.7198 | 0 | +| 1100 | 0.7234 | 12 | + +Cone-axis cosine between strata: **0.986–0.999** (128-vs-1100 = 0.9856; the rest tighter). + +The predicted direction holds — longer chunks DO give a tighter cone (R̄ rises +monotonically 0.705→0.723) — but the magnitude is tiny: a 0.019 R̄ spread across an +8.6× chunk-length range, and the four cones are essentially **co-axial** (≥0.986). So +chunk length is a real third geometry axis but a *weak* one on this encoder/corpus: it +shifts cone tightness slightly along a shared axis, it does NOT produce the distinct, +differently-pointed cones the "mixture of geometries" framing imagined. (Caveat: fiqa +docs are short Q&A; a corpus with genuinely long source documents — where a 128-char +cap truly amputates content — could spread R̄ more. Established here only for +fiqa/nomic.) + +## Result 2 — b=4 raw routing is IMMUNE to the length mixture (fear FALSIFIED) + +`centering_recall` raw arm, 1000 fiqa queries, FP32-cosine top-10 ground truth: + +| corpus | b=4 raw R@10 | b=4 raw CR@100 | +|--------|--------------|----------------| +| baseline fiqa@512 (57.6k) | 0.8230 | 1.0000 | +| **Path B length-mixture lake (230k, 4 lengths)** | **0.8253** | 1.0000 | +| **Δ** | **+0.0023** | 0.0000 | + +Pre-registered ≥0.02 drop to confirm the fear. **Actual +0.002 — flat to noise, if +anything slightly up.** Candidate recall stays a perfect 1.0000: every FP32 top-10 +neighbour is still captured in the top-100 by code distance on the 4×-larger mixed lake. + +The centering signature is unchanged from single-length, too — b1 raw 0.516 → centered ++0.055, b2/b4 centered net-negative (b4 ΔR −0.079) — same PARTIAL pattern, same +capacity-scaling penalty as every prior corpus. The mixture changed neither the routing +fidelity nor the centering mechanism; it just reconfirmed both on 4× the data. + +## Why the global bin edges survive a length mixture + +`centering_recall`'s encoder is **per-dimension rank → equal-width bins** — the rank +transform is computed within each vector, so a slightly-tighter (long-chunk) or +slightly-looser (short-chunk) cone produces the same *ranks*; only the raw cosine +magnitudes differ, and ranks discard magnitude. The fixed-mass rank code has nothing +global for a tightness-shift to poison, exactly as the templated-hub immunity in Phase B +had nothing global (no IDF term) for a hub to poison. Training-free + rank-based is again +the property that confers the robustness. + +## Verdict — Path B closes the synthetic-lake arc on the same negative + +The last fear capturable without real dirty data is unfounded: a chunk-length mixture is +**not** a mixture of distinct geometries on fiqa/nomic (cones co-axial, R̄ spread 0.019), +and b=4 raw routing is **immune** to it (+0.002 R@10, CR 1.0). Combined with Phase B +(multi-domain union: lower-ID, globally centerable, hub-immune), every synthetic lake +pathology — multi-cone, templated-hub, and now multi-length — leaves the boring "spend +the bits, b=4, raw ranks" baseline intact. + +CAVEAT (honest scope, unchanged): this is still *clean* embeddings of *curated* fiqa +text truncated to length. It models length-geometry heterogeneity; it does NOT model +OCR garbage, mixed-language, or broken-encoding sludge. The one remaining +non-derivative test is an actual dirty S3 sample (the OCR/multilingual half of the Phase +B caveat). Claim established: "a chunk-length mixture is benign for b=4 routing." Claim +NOT established: "raw dirty S3 sludge is benign." + +## Reproduce +``` +# 1. embed fiqa text truncated to each chunk-length cap (gpu1 nomic) +for N in 128 256; do python3 - < artificial hubs. +Queries: concatenate the per-domain held-out query sets (ground truth spans cones). +Deterministic (fixed LCG); no numpy. +""" +import struct, array, sys, math, argparse + +def read_npy(path): + b = open(path, "rb").read() + assert b[:6] == b"\x93NUMPY" + hl = struct.unpack("> 7; x ^= (x << 17) & 0xFFFFFFFFFFFFFFFF + _state[0] = x; return (x & 0xFFFFFFFF) / 0x100000000 +def _g(): + u1 = max(_u(), 1e-12); u2 = _u() + return math.sqrt(-2*math.log(u1)) * math.cos(2*math.pi*u2) + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--parts", nargs="+", required=True, help="domain corpus .npy files") + ap.add_argument("--out", required=True) + ap.add_argument("--hub-frac", type=float, default=0.0, help="fraction of corpus that is templated near-dupes") + ap.add_argument("--hub-clusters", type=int, default=50, help="number of distinct templated clusters") + ap.add_argument("--hub-noise", type=float, default=0.02, help="gaussian noise added to base vec (pre-normalize)") + ap.add_argument("--cap", type=int, default=0, help="cap docs per part (0=all)") + args = ap.parse_args() + + dim = None; rows = array.array("f"); total = 0 + for p in args.parts: + a, n, d = read_npy(p) + dim = d if dim is None else dim + assert d == dim, f"dim mismatch {p}" + take = n if args.cap == 0 else min(n, args.cap) + rows.extend(a[:take*d]); total += take + print(f"# {p}: +{take} docs", file=sys.stderr) + + # inject templated hubs: replace a fraction of rows with near-dupes of K bases + if args.hub_frac > 0: + n_hub = int(total * args.hub_frac) + # pick K base vectors from existing rows (deterministic indices) + bases = [ [rows[(i*9973 % total)*dim + j] for j in range(dim)] for i in range(args.hub_clusters) ] + for h in range(n_hub): + base = bases[h % args.hub_clusters] + tgt = (h * 7919) % total # deterministic overwrite positions + row = [base[j] + args.hub_noise * _g() for j in range(dim)] + nrm = math.sqrt(sum(x*x for x in row)) or 1.0 + for j in range(dim): + rows[tgt*dim + j] = row[j] / nrm + print(f"# injected {n_hub} templated-hub docs ({args.hub_frac:.0%}) across {args.hub_clusters} clusters", file=sys.stderr) + + write_npy(args.out, rows, total, dim) + +if __name__ == "__main__": + main() diff --git a/experiments/ordinal-routing-research/make_length_lake.py b/experiments/ordinal-routing-research/make_length_lake.py new file mode 100644 index 00000000..0410f94e --- /dev/null +++ b/experiments/ordinal-routing-research/make_length_lake.py @@ -0,0 +1,64 @@ +#!/usr/bin/env python3 +"""Path B — build a CHUNK-LENGTH-MIXTURE lake (one domain, many length geometries). + +Unlike make_lake.py (multi-DOMAIN union -> separated cones), this unions the SAME +documents embedded at several chunk lengths {128,256,512,1100}. Domain is held +constant; only the length-geometry varies. Models the real-lake pathology where one +S3 bucket holds the same content chunked every which way -> a *mixture of cones of +different tightness* quantized under one global set of rank-bin edges. + +Stdlib only (struct/array), matches make_lake.py's npy IO. Deterministic. +""" +import struct, array, sys, argparse + +def read_npy(path): + b = open(path, "rb").read() + assert b[:6] == b"\x93NUMPY" + hl = struct.unpack("` (same convention as the sibling docs). +**No crate or public-API changes.** Every probe was **pre-registered** (pass/fail +thresholds fixed in the source header BEFORE running) to prevent goalpost-moving, +and run on real embeddings. + +## Corpora (real, nomic-embed-text 768-d via ollama on GPU) + +| corpus | n docs | queries | domain | mean-resultant-length R̄ | +|--------|--------|---------|--------|--------------------------| +| repo | 8,777 | 200 | md+rust | ~0.07 raw-cluster fixture* | +| fiqa | 57,600 | 1,000 | financial QA | 0.69 (strongly coned) | +| nq | 74,129 | 1,000 | Wikipedia QA | 0.69 | +| quora | 150,000| 1,000 | question pairs | ~0.7 | + +Ground truth = FP32 cosine top-10. *The repo corpus is the prior round's; the +financial/QA corpora are the breadth/scale test. NOTE: intrinsic dimension is a +**corpus** property, not an encoder constant — the repo corpus reads TwoNN≈13 but +fiqa reads ≈24 with the SAME encoder. See the "ID≈13 anchor was a corpus artifact" +correction in the cross-encoder section before quoting any single ID number. + +## VERDICT (pre-registered, replicated) + +**CLASS-DEAD.** Data-oblivious low-discrepancy directions do **not** beat iid-random +directions for training-free routing — not in ambient space, not in the centered +intrinsic subspace. The lever is **not the directions**. + +## The chain of probes + +### 1. `fib_directions` — golden-angle directions, AMBIENT sphere → TIE (with caveat) +Shared-index triple-tiled golden-angle directions vs iid-Gaussian, recall@k at +matched candidates-scanned. Across 10 seeds the gap is inside random's own +seed-to-seed noise (±0.03); the eye-catching seed-1 +5.8pts **evaporated under +reseeding** (a caught false positive). CAVEAT registered at the time: ambient +768-sphere is the wrong regime — concentration of measure makes iid-Gaussian +already near-uniform, so the test was pre-doomed regardless of golden's merit. + +### 2. `uniformity_lemma` — why golden is dead on BIN EDGES (not directions) +The rank transform whitens each coordinate's marginal to **exactly uniform**, so +equal-width bucketing is **entropy-optimal** (measured: entropy = B exactly at +B∈{1,2,4}; golden boundaries 3.92 < 4.00 — they *strictly waste bits*). BUT this +lemma governs the **marginal only**; it is blind to the **joint** correlation +structure that projection directions act on. So it kills golden on bin edges and +says nothing about directions — which set up probe #4. +Correction banked: constant-composition is intrinsic to *any* fixed boundaries +on ranks, NOT unique to equal-width (the original claim was too strong). + +### 3. `overlap_decomp` — the cone is ~100% of the bitmap-overlap excess +ordvec's hypergeometric null `n_top²/D` is **miscalibrated on real data**: random +pairs overlap ~4× the textbook prediction, and `cone_frac ≈ 1.000` — essentially +**all** the excess is the shared cone (hubness), not pairwise similarity. +Per-coord **mean-centering removes it** (cone_base collapses to the uniform null) +AND amplifies the true-neighbour overlap gap 2–5×. Replicated on fiqa and nq. +This is the one continuous, geometry-aligned intervention that *converted*. + +### 4. `subspace_directions` — the directions test done RIGHT → CLASS-DEAD +Center → project to the populated k≈13-dim PCA subspace (where joint structure is +legible) → place R directions via {random, Sobol, Kronecker} → route. Fair +envelope (recall@10 at matched candidates): + +| corpus | budget | random | sobol | kronecker | pca-axes (ceiling) | +|--------|--------|--------|-------|-----------|--------------------| +| fiqa | 8000 | 0.1967 | 0.1997| 0.2001 | 0.1946 | +| nq | 8000 | 0.1836 | 0.1830| 0.1817 | 0.1877 | +| quora | 8000 | 0.1004 | 0.1035| 0.1061 | 0.1073 | + +Max margin anywhere **+0.006** (threshold for a win was +0.02), across all three +corpora. Decisive detail: +the **data-dependent ceiling (pca-axes) itself barely beats random** → the +directions are not a lever for *anyone* on this data. With ID≈13 and a cone, the +populated subspace is so dense that any ~13–128 directions cover it equally; there +is no discrepancy gap for quasi-random structure to close. The Kronecker/Sobol +**hybrid was NOT built** — it required distinct-regime component wins; there were +none. + +## centering's recall verdict (`centering_recall`, pre-registered) + +Centering as a RankQuant encoding change: full-scan R@10 vs FP32 cosine, raw vs +centered, matched bytes. + +| corpus | Δb1 | Δb2 | Δb4 | verdict | +|--------|-----|-----|-----|---------| +| repo | +0.038 | −0.044 | −0.108 | PARTIAL | +| fiqa | +0.066 | −0.017 | −0.075 | PARTIAL | +| nq | +0.051 | −0.016 | −0.083 | PARTIAL | +| quora | +0.027 | −0.022 | −0.061 | PARTIAL | + +Centering **helps at b=1, hurts at b≥2**, replicated across 3 corpora. It removes +the additive cone (good for top-bucket coverage / coarse routing) but rotates the +rank order away from raw-cosine (bad for full-order fidelity at higher bits). It +is a **low-bit-prefilter / coarse-routing tool, not a fine-scan tool** — and +explicitly does NOT survive at the incumbent b=4. + +## The through-line + +Every oblivious *discrete* structural prior tried across both rounds — golden, +Fibonacci, prime/spectral, Sobol, Kronecker, low-discrepancy directions — is +**inert against learned continuous embedding geometry**. The data is a smooth +~13-dim cone; discrete combinatorial structure has no purchase on it. The only +interventions that converted were **continuous and geometric** (centering / cone +removal), and only in the low-bit regime. "Just spend bits (b=4, raw ranks)" +remains the baseline that beats every oblivious-structure idea. + +Even centering — the one continuous intervention that *partially* converted (b=1 +prefilter, +0.03–0.07 R@10) — does not survive at b=4 and does not make a better +partition key. The cone is real and removable, but removing it trades cosine +fidelity for balance, and routing wants the fidelity. + +## Scope & boundary of this negative (checkpoint) + +This is a deliberate stopping point: the hypothesis space had one degree of +freedom — *which oblivious structure* on *which axis* (bin-edges / directions / +partition key) — and both ends of both axes have now been tested, **including the +data-dependent PCA ceiling**. The load-bearing result is that the ceiling itself +barely beats random: that kills the *axis*, so any further sequence is a +**derivative** that inherits the null. You cannot escape a null by swapping the +sequence when even the optimal member ties. + +Every result here is conditioned on **off-the-shelf `nomic-embed-text`**: a +learned, anisotropic (R̄≈0.69), low-intrinsic-dim (≈13) encoder. The honest claim +is therefore scoped: *data-oblivious combinatorial structure is inert against +learned anisotropic embedding geometry of this kind.* The only **non-derivative** +next experiment is to change that conditioning variable — a different encoder +(different anisotropy / intrinsic dim), or a sparsity/ordinal-trained encoder +where ranks are the native code and discrete structure might finally seat. +Everything inside the current encoder is a derivative of what is already +falsified here. + +## Partition balance (`partition_balance`, pre-registered) — BALANCE pass, PRUNING fail + +The one round-on-round thread: does cone-removal make a balanced, prunable coarse +shard key? Sign-pattern key over the top-`bits_key` PCA axes (basis shared across +arms; only centering differs). Swept bits_key ∈ {6,10,14} on fiqa. + +| sub-claim | threshold | result | +|-----------|-----------|--------| +| BALANCE — largest-cell-fraction reduction | ≥1.5× | **PASS** (1.6–2.2×; Gini 0.60→0.38 at bits_key=10) | +| PRUNING — candidates at matched recall≥0.90 | ≤0.80× | **FAIL** (same recall-vs-candidates envelope) | + +Centering de-hubs the cells (more balanced, uses all cells) but at matched recall +scans the **same** candidates — better balance did NOT convert to better pruning. +Mechanism: balance is a *marginal* property of the key; pruning quality depends on +cell-to-neighbour *alignment*, and centering rotates the key away from cosine (the +same reason it fails recall at b≥2). Cells spread along the wrong axis don't prune +better. This was the last live thread; it closes for the session's recurring +reason. + +## Cross-encoder extension (the non-derivative move) — IN PROGRESS + +Every result above is conditioned on nomic (ID≈13, R̄≈0.69). The only +information-adding next step is changing the encoder. Geometry gate (8k fiqa +sample, TwoNN intrinsic dim + mean-resultant-length R̄) — run BEFORE any probe to +confirm the encoders genuinely differ from nomic, else a re-run is derivative: + +First-pass gate (512-char chunks) measured: nomic 768d ID≈13 R̄0.69; all-minilm +384d ID21.7 R̄0.28; mxbai 1024d ID24.3 R̄0.69; snowflake-v1 1024d ID31.1 R̄0.74. + +**Methodology correction (capacity confound).** 384-dim (all-minilm) cannot hold +meaning above ~400-char chunks, and our chunks were capped at 512 — so minilm's +geometry was capacity-corrupted, not a clean ID point, and does not belong on the +ladder. Floor set to **768-dim** encoders and chunk cap raised to **1100 chars** +(BGE-768 capacity ceiling). Revised capacity-honest set (all ≥1024-dim, all clear +the floor): **bge-m3, bge-large, snowflake-arctic-embed-v2**, plus +**harrier-oss-v1-0.6b** (the operator's actual daily-driver encoder). Model facts +verified against the ollama host + HF config (not recalled): harrier = +Qwen3-based, **1024-dim, 32k context** (Q8 GGUF); ingests 1100-char chunks with +zero truncation. Re-embedding fiqa at 1100-char chunks; ladder rebuilt on these. +(For code corpora, the right encoder is jina-code — noted for a future code-domain +run; not in this text-corpus ladder.) + +PRE-REGISTERED prediction (fixed before the full-corpus probes run): +- The session's mechanism is "ID≈13 is so dense any directions cover it equally." + If TRUE, then as ID climbs to 31 the subspace sparsifies and a discrepancy gap + should OPEN → Sobol/Kronecker begin to beat random, and centering's b=4 penalty + shrinks (less cone). Verdict per model by the SAME thresholds as above + (directions: +0.02 at matched budget; centering: +0.02 at b4). +- If oblivious structure stays inert even at ID=31 / low-anisotropy minilm → the + negative generalizes across the realistic ID range of text encoders (stronger). + +### Ladder results (fiqa 57.6k, 1100-char chunks, capacity-honest) + +| encoder | dim | intrinsic dim | centering Δb4 | sobol−random @8k | pca-axes ceiling | +|---------|-----|---------------|---------------|------------------|------------------| +| nomic@512 | 768 | ~13 | −0.075 | ~tie | ≈ random | +| bge-m3 | 1024 | 21.4 | −0.106 | +0.006 (flicker +0.024@4k) | leads (0.216) | +| bge-large | 1024 | 22.9 | −0.155 | **−0.036** | leads (0.255) | +| snowflake-arctic-v2 | 1024 | 18.0 | −0.088 | +0.010 | leads (0.237) | +| snowflake-arctic-v2 | 1024 | 18.0 | −0.088 | +0.010 | leads (0.237) | +| harrier-oss-0.6b | 1024 | 21.4 | −0.069 | +0.002 | leads (0.252) | +| nomic@1100 (same fiqa) | 768 | 23.1 | −0.082 | (n/a) | — | +| nomic@512 (same fiqa) | 768 | 24.3 | (≈ above) | — | — | + +### CORRECTION — the "ID≈13" anchor was a CORPUS artifact, not an encoder fact + +The "nomic intrinsic dim ≈ 13" that motivated the whole "dense low-dim subspace" +mechanism came from the **repo corpus** (8.7k md+rust sentences), NOT fiqa. On a +FIXED corpus (fiqa 57.6k), nomic's ID is **24.3 @512 / 23.1 @1100** — and *every* +encoder lands ID ~18–24. The apparent "ID ladder 13→31" was mostly **corpus +differences masquerading as encoder differences** (nomic-on-repo vs others-on-fiqa). +On one corpus the ladder is nearly flat. **Intrinsic dim is set more by the corpus +than the encoder**, and chunk length (512→1100) barely moved it (24.3→23.1) — the +opposite of the predicted large drop. Both my "ID≈13" anchor and my "longer chunks +→ much lower ID" prediction were wrong; recorded here rather than quietly dropped. + +What SURVIVES the correction (now properly controlled — same corpus, varied encoder): +- **Directions CLASS-DEAD** across all 5 encoders at the ID range real text + encoders actually produce (~18–24). The negative was never about low ID; it is + general. No revival anywhere. +- **Centering b4 dead** across all 5 (−0.07 to −0.15); harrier + nomic@1100 both + PARTIAL (b1 help, b4 fail) — same pattern as the original runs. +- **pca-axes (data-aligned) leads** at every encoder (0.24–0.25 vs random ~0.19): + the lever is data-alignment, which training-free forbids. The robust positive. + +**Directions: CLASS-DEAD, confirmed across the ID ladder.** sobol−random flips +sign per corpus (+0.006, −0.036, +0.010) — noise around zero, never ≥0.02 in ≥2 +corpora; the bge-m3 flicker did NOT replicate at near-identical ID (bge-large 22.9). +The pre-registered ≥2-corpora rule holds the line. + +**The one robust high-ID effect (a real, positive, scoped finding):** the +data-aligned ceiling (pca-axes) clearly LEADS random at every high-ID encoder +(0.22–0.26 vs ~0.18–0.20), unlike nomic where it barely beat random. So higher +intrinsic dim genuinely opens room on the directions axis — but ONLY +data-dependent directions exploit it; oblivious (sobol/kronecker) cannot. The +lever at high ID is **data-alignment**, which is exactly what training-free +forbids. This locates *why* oblivious structure fails rather than just restating +that it does. + +**Centering b4: dead across all encoders, penalty SCALES with capacity** (−0.08 to +−0.15) — higher-capacity encoders carry more cosine order for centering to +corrupt. Most robust negative in the set. + +harrier-oss-0.6b (operator's daily driver) + nomic@1100 (chunk-length control) +embedding; results to follow. + +**Chunk length is a THIRD geometry axis (not a free parameter).** Longer chunks +average more tokens → vectors regress toward the corpus mean → R̄ rises (tighter +cone), intrinsic dim falls. So nomic@512 and nomic@1100 are *different geometries*, +and every 512-char result above is conditioned on that one chunk length. We +re-embed nomic@1100 on the SAME 57.6k fiqa docs to measure the shift directly +(two-point: ID and R̄ delta). Deployment consequence: a real enterprise lake has +documents at every chunk length at once → a *mixture* of geometries, not one — so +chunk-length heterogeneity is itself a lake pathology (folds into Path B). **Path B +was run** — see [length_mixture_lake_results.md](length_mixture_lake_results.md): the +chunk-length axis is real but small and co-axial (R̄ 0.705→0.723 over 8.6× length, +cones ≥0.986 aligned), and a 4-length mixture lake leaves b=4 raw R@10 immune (+0.002). + +## Phase B — the messy multi-cone "lake" (deployment robustness) + +Built a synthetic enterprise lake: union of 3 nomic-embedded domains (fiqa 57.6k + +nq 74.1k + quora 150k = **281,729 docs**, 3,000 queries spanning all cones), with +optional templated near-duplicate hub injection. Two pre-registered predictions +about lake geometry — **both FALSIFIED** by the validated probes (the payoff of the +negative-verification discipline: locked thresholds caught my wrong intuitions): + +- **Predicted: union of cones → higher effective ID.** WRONG — union ID = **8.2** + (vs ~24 per-domain). Far-apart domain clusters read to TwoNN as a few discrete + blobs = *lower* intrinsic dim. Cones add separation, not dimensions. A + multi-domain lake is **lower-ID and more clustered**, not higher-ID. (This also + voids Phase B's Question B — there is no ID rise to revive oblivious structure.) +- **Predicted: no single removable mean (global centering breaks).** WRONG — + global centering still collapses cone_base to the uniform null (B=4: 11.2→3.25 ≈ + textbook 3.0), well under the pre-registered 0.30 "insufficient" bar. The 3 + domains share enough common direction that ONE global mean still de-hubs them. + Global centering **holds** on the multi-cone union. + +Net: the "messy lake = higher-ID, locally-coned, needs per-cluster centering" +intuition is wrong on both counts. A multi-domain union is lower-ID and globally +centerable. + +**Question C — templated-hub robustness (the dominant real-lake pathology).** +Inject near-duplicate clusters (1 base vec + tiny noise) at rising prevalence; +measure raw RankQuant b=4 R@10 vs FP32 cosine on the 3000-query lake set: + +| hub prevalence | b4 R@10 | Δ from clean | +|----------------|---------|--------------| +| 0% | 0.8429 | — | +| 5% | 0.8408 | −0.0021 | +| 10% | 0.8420 | −0.0009 | +| 15% | 0.8421 | −0.0008 | + +Pre-registered "graceful if drop < 0.10 through 15%." Actual: **−0.002, flat to +noise — essentially immune.** Mechanism: the fixed-mass rank code has no global +frequency/IDF term for a hub to poison; injected boilerplate is far from real +queries in rank space and never enters the top-k. Training-free + data-oblivious +is *exactly* the property that confers hub-robustness (a learned codebook or +IDF-weighted scheme would be corrupted by hub prevalence; ordvec is not). + +## Phase B verdict — ordvec is robust on multi-cone lake geometry + +All three pre-registered lake fears were unfounded: a multi-domain union is +**lower-ID** (8.2 vs ~24/domain), **still globally centerable**, and **immune to +templated hubs** (−0.002 R@10 through 15%). b=4 raw holds R@10 ≈ 0.84 throughout. +The deployment-relevant conclusion: ordvec's training-free encoding does NOT +degrade on multi-cone / hub-heavy data — if anything the geometry is easier. + +CAVEAT (honest scope): this is a synthetic union of 3 *curated* corpora. It models +multi-cone structure + templated hubs, but NOT the OCR-garbage / mixed-language / +extreme-length heterogeneity of a true S3 dump (uncapturable from clean +embeddings). Claim established: "multi-domain union is benign." Claim NOT +established: "raw dirty S3 sludge is benign" — that needs actual dirty data. + +## Reproduce +``` +# embed real corpora via ollama nomic-embed-text -> .npy (see REAL_CORPUS_RUNBOOK.md) +cargo run --release --example uniformity_lemma -- --corpus-npy CORPUS.npy +cargo run --release --example overlap_decomp -- --corpus-npy CORPUS.npy --queries-npy Q.npy +cargo run --release --example centering_recall -- --corpus-npy CORPUS.npy --queries-npy Q.npy +cargo run --release --example subspace_directions -- --corpus-npy CORPUS.npy --queries-npy Q.npy --kdim 13 +# Phase B lake: python make_lake.py --parts fiqa_corpus.npy nq_corpus.npy quora_corpus.npy --out lake.npy [--hub-frac 0.1] +``` diff --git a/experiments/ordinal-routing-research/overlap_decomp.rs b/experiments/ordinal-routing-research/overlap_decomp.rs new file mode 100644 index 00000000..969c79a9 --- /dev/null +++ b/experiments/ordinal-routing-research/overlap_decomp.rs @@ -0,0 +1,287 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — decomposes the bitmap top-bucket overlap +//! EXCESS on real embeddings into CONE (marginal/hubness) vs PAIRWISE (signal), +//! and tests whether per-coordinate mean-centering recovers calibration. +//! Isolated by request: touches no harness, writes nothing. Verdicts fixed below. +//! +//! WHY: uniformity_lemma.rs found the hypergeometric null_err GROWS with B on +//! real data (0.31 -> 0.85 -> 3.34). Hypothesis: that excess is the shared cone +//! (R̄=0.69 anisotropy => same coords top-bucket in EVERY doc), not genuine +//! pairwise similarity. If so, the bitmap prefilter admits candidates partly on +//! HUBNESS, not meaning. +//! +//! FOUR overlap levels for the top bucket (n_top = D/2^B coords): +//! uniform_null = n_top^2 / D (random-subset assumption) +//! cone_baseline = Σ_c f_c^2 (f_c = P(coord c in top bucket); +//! overlap of two INDEPENDENT docs +//! given marginals == pure hubness) +//! obs_random = mean top-overlap, random doc pairs +//! obs_trueNbr = mean top-overlap, (query, FP32-cosine-true-neighbour) pairs +//! +//! Decomposition (all on the SAME corpus), reported raw AND after per-coord +//! mean-centering (subtract corpus per-coordinate mean before ranking): +//! cone fraction = (cone_baseline - uniform_null) / (obs_random - uniform_null) +//! pairwise signal = obs_trueNbr - obs_random (the discriminative gap) +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! A. "cone dominates the null": cone fraction >= 0.70 (raw). +//! => most of the bitmap's apparent overlap structure is hubness. +//! B. "genuine pairwise signal survives": obs_trueNbr - obs_random >= 0.20 * obs_random +//! AND obs_trueNbr > cone_baseline (true nbrs overlap beyond what hubness explains). +//! C. "centering helps": centering reduces cone fraction by >= 0.10 absolute +//! AND does not shrink the pairwise-signal ratio. +//! Each reported PASS/FAIL independently; report actuals regardless. +//! +//! Run: cargo run --release --example overlap_decomp -- \ +//! --corpus-npy /tmp/repo_corpus.npy --queries-npy /tmp/repo_q.npy + +// Research probe: `analyze` is a parameterized runner; bundling its args into a +// struct would obscure the experiment, not clarify it. +#![allow(clippy::too_many_arguments)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +/// Per-coordinate mean over corpus rows. +fn coord_mean(corpus: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += corpus[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// Top-bucket membership bitset per doc: coords whose rank is in the top 1/2^B. +/// Returns, per doc, the sorted list of top-bucket coord ids. +fn top_sets( + rows: &[f32], + n: usize, + dim: usize, + bits: u32, + sub: Option<&[f32]>, +) -> (Vec>, usize) { + let n_top = dim >> bits; // D / 2^B + let sets: Vec> = (0..n) + .into_par_iter() + .map(|r| { + let mut v: Vec<(u16, f32)> = (0..dim) + .map(|c| { + let x = rows[r * dim + c] - sub.map_or(0.0, |s| s[c]); + (c as u16, x) + }) + .collect(); + // top n_top by value = highest ranks + v.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + let mut top: Vec = v.into_iter().take(n_top).map(|(c, _)| c).collect(); + top.sort_unstable(); + top + }) + .collect(); + (sets, n_top) +} + +fn overlap(a: &[u16], b: &[u16]) -> usize { + let (mut i, mut j, mut c) = (0, 0, 0); + while i < a.len() && j < b.len() { + match a[i].cmp(&b[j]) { + std::cmp::Ordering::Less => i += 1, + std::cmp::Ordering::Greater => j += 1, + std::cmp::Ordering::Equal => { + c += 1; + i += 1; + j += 1; + } + } + } + c +} + +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +fn analyze( + label: &str, + corpus: &[f32], + queries: &[f32], + n: usize, + nq: usize, + dim: usize, + truth: &[Vec], + sub: Option<&[f32]>, +) { + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let pairs: Vec<(usize, usize)> = (0..100_000) + .map(|_| { + let i = rng.random_range(0..n); + let mut j = rng.random_range(0..n); + if j == i { + j = (j + 1) % n; + } + (i, j) + }) + .collect(); + + println!("\n## {label}"); + println!( + "B\tn_top\tuniform_null\tcone_base\tobs_random\tobs_trueNbr\tcone_frac\tpair_sig_ratio" + ); + for &bits in &[1u32, 2, 4] { + let (csets, n_top) = top_sets(corpus, n, dim, bits, sub); + let (qsets, _) = top_sets(queries, nq, dim, bits, sub); + + // per-coord top-bucket frequency f_c -> cone baseline Σ f_c^2 + let mut f = vec![0f64; dim]; + for s in &csets { + for &c in s { + f[c as usize] += 1.0; + } + } + for x in f.iter_mut() { + *x /= n as f64; + } + let cone_base: f64 = f.iter().map(|x| x * x).sum(); + + let uniform_null = (n_top * n_top) as f64 / dim as f64; + + let obs_random: f64 = pairs + .par_iter() + .map(|&(i, j)| overlap(&csets[i], &csets[j]) as f64) + .sum::() + / pairs.len() as f64; + + // true-neighbour overlap: query top-set vs each true-nbr doc top-set + let (mut tn_sum, mut tn_cnt) = (0f64, 0usize); + for qi in 0..nq { + for &di in &truth[qi] { + tn_sum += overlap(&qsets[qi], &csets[di as usize]) as f64; + tn_cnt += 1; + } + } + let obs_true = tn_sum / tn_cnt as f64; + + let cone_frac = (cone_base - uniform_null) / (obs_random - uniform_null); + let pair_sig_ratio = (obs_true - obs_random) / obs_random; + + println!("{bits}\t{n_top}\t{uniform_null:.2}\t{cone_base:.2}\t{obs_random:.2}\t{obs_true:.2}\t{cone_frac:.3}\t{pair_sig_ratio:+.3}"); + } +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + // ground truth computed on the L2-normalized RAW vectors (the real task target); + // centering changes the rank CODE, not the retrieval ground truth. + let truth = ground_truth(&corpus, &queries, dim, 10); + + println!("# overlap_decomp (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}"); + println!("# cone_frac = (cone_base - uniform_null)/(obs_random - uniform_null); 1.0 => excess is ALL hubness"); + println!("# pair_sig_ratio = (obs_trueNbr - obs_random)/obs_random; >0 => true nbrs overlap MORE than random"); + + analyze("RAW ranks", &corpus, &queries, n, nq, dim, &truth, None); + + let mean = coord_mean(&corpus, n, dim); + analyze( + "MEAN-CENTERED ranks (subtract per-coord corpus mean before ranking)", + &corpus, + &queries, + n, + nq, + dim, + &truth, + Some(&mean), + ); + + println!("\n# PRE-REGISTERED: A cone-dominates if cone_frac>=0.70 (raw);"); + println!("# B pairwise-signal-survives if pair_sig_ratio>=0.20 AND obs_trueNbr>cone_base;"); + println!( + "# C centering-helps if cone_frac drops >=0.10 absolute and pair_sig_ratio not reduced." + ); +} diff --git a/experiments/ordinal-routing-research/partition_balance.rs b/experiments/ordinal-routing-research/partition_balance.rs new file mode 100644 index 00000000..c5afd2a5 --- /dev/null +++ b/experiments/ordinal-routing-research/partition_balance.rs @@ -0,0 +1,352 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — the round-on-round thread: does cone-removal +//! (centering) make a BALANCED, PRUNABLE coarse partition key for a sharding/IVF +//! layer? This is the one place cone-removal might buy SCALING (sublinear pruning) +//! rather than recall. Isolated by request: touches no harness, writes nothing. +//! +//! WHY: overlap_decomp showed raw low-B top-buckets are ~100% cone (every doc +//! shares the same hub coords) -> a raw coarse key is maximally UNBALANCED: docs +//! pile into a few cells, so an IVF layer probing few cells can't prune without +//! losing neighbours. Centering removes the cone -> should SPREAD docs across +//! cells. The decisive question is NOT balance for its own sake but candidates- +//! scanned at matched recall under a real inverted-index probe. +//! +//! PARTITION KEY = the coarse cell id of a doc. We build cells by k-means-free, +//! training-free coarse coding: project (centered or raw) onto the top PCA axes, +//! take the SIGN pattern of the top `bits_key` axes -> 2^bits_key cells (a +//! data-oblivious-given-PCA hash; PCA basis is the only data-dependent part and is +//! shared identically across arms, so arms differ ONLY in centering). +//! Query probes the P nearest cells (by Hamming on the key); candidate = union of +//! those cells' docs; recall@10 vs FP32 cosine top-10. +//! +//! ARMS (PCA basis identical across all; only the centering differs): +//! raw — sign-key from raw (uncentered) projections. +//! centered — sign-key from per-coord-mean-centered projections. +//! Controls reported: cell-occupancy Gini (0=perfectly balanced, 1=all in one +//! cell) and the largest-cell fraction. +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! BALANCE: centered must cut largest-cell-fraction by >= 1.5x vs raw. +//! PRUNING (the one that matters): at matched recall@10 (>=0.90), centered must +//! scan <= 0.80x the candidates raw scans, in >= 2 of 3 corpora. +//! => PASS = centered is a materially better coarse key (a SCALING win, distinct +//! from the recall FAIL at b>=2). FAIL = centering doesn't help partitioning +//! either; the cone is not the bottleneck for routing. +//! +//! Run: cargo run --release --example partition_balance -- \ +//! --corpus-npy /tmp/corpora/fiqa_corpus.npy --queries-npy /tmp/corpora/fiqa_q.npy + +// Research probe: `analyze` is a parameterized runner; bundling its args into a +// struct would obscure the experiment, not clarify it. +#![allow(clippy::too_many_arguments)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn coord_mean(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += rows[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// top-k PCA via power iteration + deflation (implicit covariance action). Pure Rust. +fn pca_topk(centered: &[f32], n: usize, dim: usize, k: usize) -> Vec> { + let sample: Vec = if n > 20000 { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x9C0FFEE); + (0..20000).map(|_| rng.random_range(0..n)).collect() + } else { + (0..n).collect() + }; + let cov_mul = |x: &[f32]| -> Vec { + let acc: Vec = sample + .par_iter() + .map(|&i| { + let row = ¢ered[i * dim..(i + 1) * dim]; + let d: f32 = row.iter().zip(x).map(|(a, b)| a * b).sum(); + row.iter().map(|&r| r * d).collect::>() + }) + .reduce( + || vec![0f32; dim], + |mut a, b| { + for j in 0..dim { + a[j] += b[j]; + } + a + }, + ); + let s = sample.len() as f32; + acc.iter().map(|v| v / s).collect() + }; + let mut comps: Vec> = Vec::with_capacity(k); + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xABCD); + for _ in 0..k { + let mut v: Vec = (0..dim) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + for _ in 0..50 { + for c in &comps { + let p: f32 = v.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + v[j] -= p * c[j]; + } + } + let mut y = cov_mul(&v); + for c in &comps { + let p: f32 = y.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + y[j] -= p * c[j]; + } + } + let nrm: f32 = y.iter().map(|t| t * t).sum::().sqrt(); + if nrm < 1e-20 { + break; + } + for j in 0..dim { + v[j] = y[j] / nrm; + } + } + comps.push(v); + } + comps +} + +/// sign-pattern key over the top `bits_key` PCA axes -> cell id in [0, 2^bits_key). +fn cell_key(v: &[f32], mean: Option<&[f32]>, basis: &[Vec], bits_key: usize) -> u32 { + let mut key = 0u32; + for (a, axis) in basis.iter().take(bits_key).enumerate() { + let proj: f32 = axis + .iter() + .enumerate() + .map(|(c, &w)| (v[c] - mean.map_or(0.0, |m| m[c])) * w) + .sum(); + if proj > 0.0 { + key |= 1 << a; + } + } + key +} + +fn gini(counts: &[usize]) -> f32 { + let n = counts.len(); + if n == 0 { + return 0.0; + } + let mut s: Vec = counts.iter().map(|&c| c as f64).collect(); + s.sort_by(|a, b| a.partial_cmp(b).unwrap()); + let tot: f64 = s.iter().sum(); + if tot == 0.0 { + return 0.0; + } + let mut cum = 0.0; + let mut lorenz = 0.0; + for &x in &s { + cum += x; + lorenz += cum; + } + let g = (n as f64 + 1.0 - 2.0 * lorenz / tot) / n as f64; + g as f32 +} + +fn analyze( + label: &str, + corpus: &[f32], + queries: &[f32], + n: usize, + nq: usize, + dim: usize, + truth: &[Vec], + mean: Option<&[f32]>, + basis: &[Vec], + bits_key: usize, +) { + let n_cells = 1usize << bits_key; + // assign cells + let ckeys: Vec = (0..n) + .into_par_iter() + .map(|i| cell_key(&corpus[i * dim..(i + 1) * dim], mean, basis, bits_key)) + .collect(); + let qkeys: Vec = (0..nq) + .into_par_iter() + .map(|i| cell_key(&queries[i * dim..(i + 1) * dim], mean, basis, bits_key)) + .collect(); + let mut cells: Vec> = vec![Vec::new(); n_cells]; + for (i, &k) in ckeys.iter().enumerate() { + cells[k as usize].push(i as u32); + } + let counts: Vec = cells.iter().map(|c| c.len()).collect(); + let g = gini(&counts); + let largest = *counts.iter().max().unwrap() as f64 / n as f64; + let nonempty = counts.iter().filter(|&&c| c > 0).count(); + + println!("\n## {label} (bits_key={bits_key}, {n_cells} cells, {nonempty} non-empty)"); + println!(" occupancy Gini={g:.4} (0=balanced), largest-cell-fraction={largest:.4}"); + println!(" P_probe\tcand_scanned\trecall@10"); + // probe P nearest cells by Hamming on the key + for p_probe in [1usize, 2, 4, 8, 16, 32] { + if p_probe > n_cells { + break; + } + let stats: Vec<(usize, f32)> = (0..nq) + .into_par_iter() + .map(|qi| { + let qk = qkeys[qi]; + // rank cells by Hamming distance to qk, take P nearest + let mut order: Vec = (0..n_cells as u32).collect(); + order.sort_by_key(|&c| (c ^ qk).count_ones()); + let mut cand = 0usize; + use std::collections::HashSet; + let mut hit_set: HashSet = HashSet::new(); + for &c in order.iter().take(p_probe) { + cand += cells[c as usize].len(); + for &d in &cells[c as usize] { + hit_set.insert(d); + } + } + let hits = truth[qi].iter().filter(|&&t| hit_set.contains(&t)).count(); + (cand, hits as f32 / truth[qi].len() as f32) + }) + .collect(); + let mc = stats.iter().map(|s| s.0 as f64).sum::() / nq as f64; + let mr = stats.iter().map(|s| s.1 as f64).sum::() / nq as f64; + println!(" {p_probe}\t{mc:.0}\t{mr:.4}"); + } +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut bits_key = 10usize; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + "--bits-key" => bits_key = a.next().unwrap().parse().unwrap(), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + let truth: Vec> = (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(10).map(|(i, _)| i).collect() + }) + .collect(); + + let mean = coord_mean(&corpus, n, dim); + // PCA basis from the CENTERED corpus, shared identically across arms. + let mut cc = corpus.clone(); + for i in 0..n { + for c in 0..dim { + cc[i * dim + c] -= mean[c]; + } + } + eprintln!("# computing top-{bits_key} PCA basis..."); + let basis = pca_topk(&cc, n, dim, bits_key); + + println!("# partition_balance (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, queries {nq}"); + println!("# coarse cell key = sign pattern of top-{bits_key} PCA axes; PCA basis shared across arms."); + analyze( + "RAW key (uncentered projections)", + &corpus, + &queries, + n, + nq, + dim, + &truth, + None, + &basis, + bits_key, + ); + analyze( + "CENTERED key (mean-subtracted projections)", + &corpus, + &queries, + n, + nq, + dim, + &truth, + Some(&mean), + &basis, + bits_key, + ); + println!("\n# PRE-REGISTERED: BALANCE = centered cuts largest-cell-fraction >=1.5x; PRUNING = centered scans <=0.80x"); + println!("# candidates at matched recall>=0.90 in >=2/3 corpora => PASS (a scaling win). Else FAIL."); +} diff --git a/experiments/ordinal-routing-research/rigidity_impossibility_proofs.md b/experiments/ordinal-routing-research/rigidity_impossibility_proofs.md new file mode 100644 index 00000000..82956616 --- /dev/null +++ b/experiments/ordinal-routing-research/rigidity_impossibility_proofs.md @@ -0,0 +1,122 @@ +# Why oblivious random-projection keys cannot be rigid — proofs, not corpora + +The zoo gives evidence; these give the reason. Three claims, stated and proved. +"Rigid" = number variance Σ²(L) grows sub-linearly (o(L)); "Poisson-or-worse" += Σ²(L) = Ω(L). The routing key is K_r(x) = r·x / ||x|| for a probe direction r +(cosine geometry: the key is a function of the RAY, magnitude-free). + +------------------------------------------------------------------------ +## Lemma 0 (magnitude is quotiented out) + +For any x ≠ 0 and scalar c > 0, K_r(cx) = K_r(x). So any structure a corpus +encodes purely in vector NORMS is invisible to the key. In particular the +`projected-rigid` construction x_i = t_i·u collapses under normalization to +sign(t_i)·u ∈ {+u, −u}: the 50k rigid magnitudes collapse toward ≈2 angular keys +(±u, modulo the small orthogonal jitter the generator adds), so the key sees a +near-degenerate occupancy exactly as the quotient predicts. ∎ + +Consequence: rigidity must live in the ANGULAR distribution on the sphere +S^{d-1}, or it does not exist for the key. + +------------------------------------------------------------------------ +## Theorem 1 (isotropy ⇒ exact Poisson key) + +Let x be distributed so that x/||x|| is uniform on S^{d-1} (isotropic). For a +fixed unit r, the projected key K_r = r·(x/||x||) has the known marginal with +density ∝ (1−s²)^((d-3)/2) on [−1,1]. Drawing n i.i.d. corpus points gives n +i.i.d. key values from a fixed continuous law F. After unfolding by F, the +points are a homogeneous Poisson process of rate 1, for which Σ²(L) = L exactly. + +Proof: i.i.d. samples under their own CDF are uniform on [0,1]; counts in +disjoint windows have Binomial counts → Σ²(L) = L(1−L/n) (≈ L for L ≪ n). ∎ +(Matches isotropic zoo row: Σ²/L = 0.99, flat.) + +------------------------------------------------------------------------ +## Theorem 2 (i.i.d. corpus ⇒ Σ²(L) = L(1−L/n), NO rigidity, for ANY distribution) + +This is the special case Var_θ(p) = 0 of Theorem 3 (i.i.d. = degenerate mixing +measure). Let corpus points be i.i.d. from ANY distribution D on R^d (isotropic +or not, clustered, manifold, heavy-tailed). Fix r. The keys K_r(x_1..x_n) are +i.i.d. from the induced 1-D law F_r. Unfolding by F_r (probability integral +transform) gives n i.i.d. Uniform[0,1] points. + +Proof. n i.i.d. uniforms form a BINOMIAL point process (NOT Poisson — review +correction). A window of width L/n has count N ~ Binomial(n, L/n), so + Σ²(L) = Var(N) = n·(L/n)(1 − L/n) = L(1 − L/n). +Thus Σ²(L) = L(1−L/n) = Θ(L): sub-LINEAR rigidity (Σ² = o(L)) is impossible. ∎ + +NOTE (corrected): the earlier "homogeneous Poisson, Σ²=L exactly, independent +window counts, Var=mean=L" was WRONG — that is the Poisson framing this work +exists to debunk. For fixed n the process is binomial; counts in disjoint +windows are NEGATIVELY correlated (they sum to n), and Σ²(L)=L(1−L/n) < L. The +CONCLUSION (no sub-linear rigidity) is unaffected because the value is Θ(L). + +This is why clustered/anisotropic/rogue/manifold are not sub-linear: by Thm 3 +the mixing variance only ADDS (Σ² ≥ L(1−L/n), and clustering pushes above L). +Genuine rigidity (Σ² = o(L)) needs a negatively-associated, FINITELY-exchangeable +generator (lattice / determinantal point process) — see Thm 3's escape hatch — +which a fixed data distribution does not produce. + +------------------------------------------------------------------------ +## Theorem 3 (no rigidity for any fixed-distribution corpus — full derivation) + +Setup. Corpus rows exchangeable ⇒ keys exchangeable ⇒ unfolded positions +U_1..U_n exchangeable on [0,1]. Window W of length L/n, so E[count]=L. Let +I_j = 1{U_j ∈ W}, N = Σ I_j, p = L/n. By exchangeability every I_j has mean p +and every pair shares one covariance c = Cov(I_1,I_2): + + Var(N) = n·p(1−p) + n(n−1)·c. (★) + +de Finetti (infinite exchangeable ⇒ conditionally i.i.d. given latent θ): +I_j are independent Bernoulli(p(θ)) given θ, so + c = Cov(I_1,I_2) = E[p(θ)²] − p̄² = Var_θ(p(θ)) ≥ 0, p̄ = E[p(θ)]. +Substituting into (★): + + Σ²(L) = Var(N) = n·p̄(1−p̄) + n(n−1)·Var_θ(p(θ)) + ≥ n·p̄(1−p̄) = L(1 − L/n). ∎ + +So Σ²(L) ≥ L(1−L/n): rigidity (sub-Poisson, Σ²=o(L)) is IMPOSSIBLE because the +mixing-variance term Var_θ(p(θ)) is non-negative and can only RAISE the count +variance above the Poisson line. Clustering = large Var_θ(p(θ)) = strict excess. + +THE PRECISE ESCAPE HATCH (named, not gestured at). de Finetti requires INFINITE +exchangeability. Finitely-exchangeable sequences (sampling WITHOUT replacement +from a fixed finite set — i.e. a lattice / determinantal point process) violate +the c ≥ 0 step: there c < 0 and Σ²(L) < L (genuinely rigid). But "drawn from a +fixed data distribution" IS conditional-i.i.d.-given-θ — the infinite-exchangeable +case — so any corpus an encoder produces from a data distribution falls under +the theorem. Rigidity requires a negatively-associated generator (lattice/DPP) +AND preserved row order; an oblivious router over a fixed distribution has +neither. ∎ + +------------------------------------------------------------------------ +## What this proves about the conjecture (NARROWED post-review) + +The original hope was that prime/spectral structure could exploit *rigidity* in +the routing key. Theorems 1–3 show the key has Σ²(L) ≥ L(1−L/n) for any +conditionally-i.i.d. (mixture-of-i.i.d.) corpus — equality at isotropy, strict +excess under clustering. So: + + **The routing key is not number-variance-rigid.** Sub-linear Σ²(L) requires a + finitely-exchangeable / negatively-associated generator (lattice/DPP) AND + preserved row order — neither holds for an oblivious router over a fixed + distribution. + +That is the defensible claim, and it is all the mathematics supports. + +RETRACTED (non-sequitur, flagged in adversarial review). The earlier conclusion +— "quantile bucketing is optimal against the entire achievable class; no +spectral/prime partition can beat it because the structure is provably not +there" — DOES NOT FOLLOW from Σ²(L) ≥ L. Number variance and partition +quality (load balance / recall) are different figures of merit; Σ² ≥ L says +nothing about whether some partition outperforms quantile bucketing. A Poisson +key can still admit a useful partition. The impossibility result refutes only +the specific "exploit rigidity" route, NOT all spectral/prime partitions, and +NOT in favor of quantile-optimality-over-all-partitions. Establishing the latter +would need a separate argument (that partition quality is monotone in Σ²), which +is not provided and is likely false. + +(The zoo runs in withdrawn/corpus_zoo_results.md were EVIDENCE for the narrow claim, with +the unfolding caveat in ADVERSARIAL_REVIEW.md. The selftest Σ²=0 only shows the +estimator can represent the o(L) regime; Thm 3 says a fixed-distribution corpus +cannot reach it.) diff --git a/experiments/ordinal-routing-research/shard_recall.rs b/experiments/ordinal-routing-research/shard_recall.rs new file mode 100644 index 00000000..c053af2c --- /dev/null +++ b/experiments/ordinal-routing-research/shard_recall.rs @@ -0,0 +1,445 @@ +//! R-projection shard-recall experiment for the training-free routing layer. +//! +//! Open empirical question from the prime/index conjecture investigation: +//! can a *data-oblivious* router (R random 1-D projections, each bucketed by +//! a uniform grid) capture true neighbours well enough to route, and does +//! making the grid periods pairwise COPRIME beat independent random OFFSETS +//! at decorrelating bucket seams? +//! +//! Method (per agent-5 spec): for each query, the router probes a budget of +//! buckets across the R projections; shard-recall@k = fraction of the FP32 +//! cosine top-k that land in the probed union. We plot recall vs +//! CANDIDATES-SCANNED (not vs probe radius), so arms with different bucket +//! counts are compared at equal work — the only fair axis. +//! +//! Arms: +//! coprime — R grids, pairwise-coprime periods +//! aligned — R grids, identical period (worst case: seams stack) +//! random-offset — R grids, same period, independent random phase +//! both — coprime periods AND random offsets +//! +//! Decision rule: if coprime ~= random-offset within noise, coprimality adds +//! nothing (justified only on build-determinism). If coprime materially beats +//! random-offset at equal candidates-scanned and survives reseeding, that is +//! the publishable surprise. +//! +//! Run (synthetic): cargo run --release --example shard_recall +//! No external data, no BLAS. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + n_queries: usize, + k: usize, + latent_dim: usize, + n_clusters: usize, + r_values: Vec, // projection counts to sweep + isotropic: bool, + corpus_npy: Option, + queries_npy: Option, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 50_000, + n_queries: 200, + k: 10, + latent_dim: 64, + n_clusters: 200, + r_values: vec![1, 2, 4, 8, 16], + isotropic: false, + corpus_npy: None, + queries_npy: None, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--k" => c.k = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--isotropic" => c.isotropic = true, + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--queries-npy" => c.queries_npy = Some(a.next().unwrap()), + other => panic!("unknown arg {other}"), + } + } + c +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays +/// (same format contract as bench_rank's --corpus-npy / --queries-npy). +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + let n = v.len() / dim; + for i in 0..n { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Low-rank clustered corpus + matched queries (same construction as bench_rank). +/// Returns (corpus, queries) as flat row-major dim-strided buffers. +fn make_corpus(cfg: &mut Cfg) -> (Vec, Vec) { + // Real corpus path: both files required (need real queries for honest recall). + if let (Some(cp), Some(qp)) = (cfg.corpus_npy.clone(), cfg.queries_npy.clone()) { + let (mut corpus, n, d) = load_npy_f32(&cp); + let (mut queries, nq, dq) = load_npy_f32(&qp); + assert_eq!(d, dq, "corpus/query dim mismatch"); + l2_normalize_rows(&mut corpus, d); + l2_normalize_rows(&mut queries, dq); + cfg.dim = d; + cfg.n = n; + cfg.n_queries = nq; + eprintln!("# loaded corpus {n}x{d}, queries {nq}x{dq}"); + return (corpus, queries); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let d = cfg.dim; + let l = cfg.latent_dim; + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc; + } + let nrm: f32 = out.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in out.iter_mut() { + *v /= nrm; + } + } + out + }; + let iso = |rng: &mut ChaCha8Rng| -> Vec { + let mut v = vec![0.0f32; d]; + for x in v.iter_mut() { + *x = gauss(rng); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let e = if cfg.isotropic { + iso(&mut rng) + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.3, &mut rng) + }; + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + let mut queries = vec![0.0f32; cfg.n_queries * d]; + for i in 0..cfg.n_queries { + let e = if cfg.isotropic { + iso(&mut rng) + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.1, &mut rng) + }; + queries[i * d..(i + 1) * d].copy_from_slice(&e); + } + (corpus, queries) +} + +/// FP32 brute-force cosine top-k ground truth (corpus is L2-normalized already). +fn ground_truth(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + let nq = queries.len() / dim; + (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut scored: Vec<(usize, f32)> = (0..n) + .map(|di| { + let doc = &corpus[di * dim..(di + 1) * dim]; + let dot: f32 = q.iter().zip(doc).map(|(a, b)| a * b).sum(); + (di, dot) + }) + .collect(); + scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + scored.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +/// One projection: a unit random direction + a grid period + a phase offset. +/// A doc's bucket on this projection is floor((r.x - phase) / width). +struct Proj { + dir: Vec, + width: f32, + phase: f32, +} + +impl Proj { + fn bucket(&self, v: &[f32]) -> i64 { + let dot: f32 = self.dir.iter().zip(v).map(|(a, b)| a * b).sum(); + ((dot - self.phase) / self.width).floor() as i64 + } +} + +/// Pairwise-coprime-ish period multipliers (distinct small primes). The grid +/// WIDTH is base_width / period, so more-divided grids have more, finer buckets. +/// Coprime periods => the seam sets (multiples of width) share no interior +/// coincidence until the lcm — the vernier effect. +const COPRIME_PERIODS: [u32; 16] = [ + 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, +]; + +#[derive(Clone, Copy)] +enum Arm { + Coprime = 0, + Aligned = 1, + RandomOffset = 2, + Both = 3, +} + +impl Arm { + fn name(self) -> &'static str { + match self { + Arm::Coprime => "coprime", + Arm::Aligned => "aligned", + Arm::RandomOffset => "random-offset", + Arm::Both => "both", + } + } +} + +/// Build R projections for one arm. `base_width` is calibrated so a single +/// grid has ~sqrt(n) buckets (a reasonable shard granularity); coprime arms +/// subdivide by the prime multiplier. +fn build_projs(cfg: &Cfg, arm: Arm, r: usize, base_width: f32) -> Vec { + let d = cfg.dim; + // Bug L fix: directions and phases come from SEPARATE, fixed-seed RNGs that + // do not depend on the arm. Every arm therefore sees the IDENTICAL R + // projection directions (and identical phase draws where used), so the arms + // differ ONLY in width/phase policy — the clean ablation. Previously the + // RandomOffset arm consumed an extra value from one shared stream, desyncing + // the directions so arms were compared on different projections. + let mut dir_rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xD132_0000 ^ (r as u64)); + let mut phase_rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x9DA5E_000 ^ (r as u64)); + (0..r) + .map(|i| { + let mut dir = vec![0.0f32; d]; + for x in dir.iter_mut() { + *x = gauss(&mut dir_rng); + } + let nrm: f32 = dir.iter().map(|t| t * t).sum::().sqrt(); + for x in dir.iter_mut() { + *x /= nrm; + } + // draw a phase fraction for EVERY arm (keeps phase_rng in lockstep); + // arms that don't use it simply ignore the value. + let p: f32 = phase_rng.random_range(0.0..1.0); + // i % len() guards against r > COPRIME_PERIODS.len() (OOB panic). + let period = COPRIME_PERIODS[i % COPRIME_PERIODS.len()] as f32; + let (width, phase) = match arm { + Arm::Coprime => (base_width / period, 0.0), + Arm::Aligned => (base_width, 0.0), + Arm::RandomOffset => (base_width, p * base_width), + Arm::Both => { + let w = base_width / period; + (w, p * w) + } + }; + Proj { dir, width, phase } + }) + .collect() +} + +/// Assign every doc a bucket-key tuple (one i64 per projection), grouped into +/// an index: for each projection, map bucket-id -> list of doc ids. +type BucketIndex = Vec>>; + +fn build_index(projs: &[Proj], corpus: &[f32], dim: usize) -> BucketIndex { + let n = corpus.len() / dim; + projs + .par_iter() + .map(|p| { + let mut m: std::collections::HashMap> = std::collections::HashMap::new(); + for di in 0..n { + let b = p.bucket(&corpus[di * dim..(di + 1) * dim]); + m.entry(b).or_default().push(di as u32); + } + m + }) + .collect() +} + +/// For one query at a given probe radius `rad` (probe buckets b-rad..=b+rad on +/// each projection), return the candidate union and its size (candidates +/// scanned). Recall is measured against the ground-truth ids. +fn probe_recall( + projs: &[Proj], + index: &BucketIndex, + q: &[f32], + rad: i64, + truth: &[usize], +) -> (usize, f32) { + use std::collections::HashSet; + let mut cand: HashSet = HashSet::new(); + for (pi, p) in projs.iter().enumerate() { + let b = p.bucket(q); + for bb in (b - rad)..=(b + rad) { + if let Some(ids) = index[pi].get(&bb) { + cand.extend(ids.iter().copied()); + } + } + } + // truth is already unique ground-truth ids; filter directly against cand + // (no separate HashSet allocation — review suggestion). + let hits = truth.iter().filter(|&&i| cand.contains(&(i as u32))).count(); + let recall = hits as f32 / truth.len().max(1) as f32; + (cand.len(), recall) +} + +fn run_arms(cfg: &Cfg, corpus: &[f32], queries: &[f32], truth: &[Vec]) { + // Calibrate base_width so a single unit-direction grid yields ~sqrt(n) + // buckets. r.x ~ N(0, 1/dim) for L2-normalized x, so its spread ~ 6/sqrt(dim) + // across the corpus; divide that range into ~sqrt(n) cells. + let spread = 6.0 / (cfg.dim as f32).sqrt(); + let base_width = spread / (cfg.n as f32).sqrt(); + let arms = [Arm::Coprime, Arm::Aligned, Arm::RandomOffset, Arm::Both]; + // Collect (arm, cand, recall) across all (R, rad) for the fair envelope. + let mut pts: Vec<(&'static str, f64, f64)> = Vec::new(); + println!("# recall vs candidates-scanned (mean over queries), per arm per R"); + println!("arm\tR\trad\tcand_scanned\trecall@k"); + for &r in &cfg.r_values { + for &arm in &arms { + // build_projs now seeds its own direction/phase RNGs internally + // (identical across arms for the same r) — Bug L fix. + let projs = build_projs(cfg, arm, r, base_width); + let index = build_index(&projs, corpus, cfg.dim); + for rad in [0i64, 1, 2, 4] { + let stats: Vec<(usize, f32)> = (0..cfg.n_queries) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * cfg.dim..(qi + 1) * cfg.dim]; + probe_recall(&projs, &index, q, rad, &truth[qi]) + }) + .collect(); + let mean_cand = + stats.iter().map(|s| s.0 as f64).sum::() / stats.len() as f64; + let mean_rec = + stats.iter().map(|s| s.1 as f64).sum::() / stats.len() as f64; + println!( + "{}\t{r}\t{rad}\t{:.0}\t{:.4}", + arm.name(), + mean_cand, + mean_rec + ); + pts.push((arm.name(), mean_cand, mean_rec)); + } + } + } + // Fair envelope: best recall achievable per arm at or below each candidate + // budget. This is the only honest cross-arm comparison — equal work. + println!("\n# FAIR ENVELOPE: max recall@k at candidates-scanned <= budget"); + let budgets = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]; + print!("budget"); + for a in &arms { + print!("\t{}", a.name()); + } + println!(); + for &bud in &budgets { + print!("{bud:.0}"); + for a in &arms { + let best = pts + .iter() + .filter(|(name, c, _)| *name == a.name() && *c <= bud) + .map(|(_, _, r)| *r) + .fold(0.0f64, f64::max); + print!("\t{best:.4}"); + } + println!(); + } +} + +fn main() { + let mut cfg = parse(); + let (corpus, queries) = make_corpus(&mut cfg); + let truth = ground_truth(&corpus, &queries, cfg.dim, cfg.k); + let src = if cfg.corpus_npy.is_some() { + "npy" + } else if cfg.isotropic { + "isotropic" + } else { + "clustered" + }; + println!( + "# shard_recall: n={} q={} dim={} k={} corpus={src}", + cfg.n, cfg.n_queries, cfg.dim, cfg.k + ); + run_arms(&cfg, &corpus, &queries, &truth); +} diff --git a/experiments/ordinal-routing-research/shard_recall_results.md b/experiments/ordinal-routing-research/shard_recall_results.md new file mode 100644 index 00000000..7e3c496d --- /dev/null +++ b/experiments/ordinal-routing-research/shard_recall_results.md @@ -0,0 +1,69 @@ +> ✅ RNG-DESYNC BUG FIXED (was Bug L). build_projs now seeds direction and phase +> RNGs separately and identically across arms, so aligned vs random-offset share +> the SAME R projection directions and differ ONLY in phase — the clean ablation. +> Re-run result: aligned 0.9095 vs random-offset 0.9080 at 16k budget = TIED, so +> "random phase offsets add nothing across R different directions" now holds on a +> controlled comparison. CAVEAT STILL OPEN: the coprime/both arms subdivide +> buckets so the "fair envelope" undersells them (they saturate below high +> budgets); coprimality across R directions remains the wrong geometry — the +> within-axis vernier harness (crt_seam_oracle covers the theory) is the right +> test and is not built here. Numbers below are the post-fix run. + +# R-projection shard-recall: does coprime seam-decorrelation help? + +Experiment for the training-free routing layer. Five projection arms over a +random-projection ensemble; recall@10 measured against FP32 cosine top-10, +compared at EQUAL candidates-scanned (the only fair axis). + +Source: `examples/shard_recall.rs`. Synthetic clustered corpus n=50k, dim=256, +200 queries, k=10, seed=1. + +## Fair envelope (max recall@10 at candidates-scanned <= budget) + +Post-fix run (Bug L fixed: arms share identical projection directions): + +| budget | coprime | aligned | random-offset | both | +|--------|---------|---------|---------------|------| +| 1000 | 0.109 | 0.0885 | 0.0835 | 0.1085 | +| 2000 | 0.1885 | 0.1795 | 0.1840 | 0.1830 | +| 4000 | 0.3370 | 0.3780 | 0.3880 | 0.3290 | +| 8000 | 0.5315 | 0.6425 | 0.6490 | 0.5300 | +| 16000 | 0.5315 | 0.9095 | 0.9080 | 0.5300 | + +## Findings + +**1. CLEAN RESULT (controlled ablation): random offsets are redundant.** +After the Bug-L fix, `aligned` and `random-offset` share the SAME R projection +directions and identical bucket width — differing ONLY in phase. They are +statistically TIED (0.9095 vs 0.9080 at 16k). Across R *different* random +projection directions, the direction randomness already decorrelates seams; +adding random phase buys nothing. Confirms the literature prediction +(multi-probe LSH / random-rotation decorrelation). + +**2. CONFOUND (do NOT read as "coprime is worse"):** the coprime/both arms +subdivide each projection by a distinct prime (W/2..W/53), so their grids are +mostly tiny — they scan few candidates and plateau at ~0.52 due to collapsed +bucket VOLUME, not seam structure. This is a parameterization flaw. + +**3. GEOMETRY MISMATCH (the real lesson):** coprime PERIODS are a +within-single-axis stacked-grid (vernier) effect. Across R different random +directions there is only ONE grid per axis, so coprimality of periods across +directions is not even the right test. The vernier idea, to be tested properly, +needs multiple coprime-period grids overlaid on ONE projection direction — a +different harness (TODO). + +## Verdict + +For the multi-direction routing layer: **plain R random projections with a +single shared grid width are as good as any seam-decorrelation scheme tried +here.** Coprimality adds nothing in this geometry and is only potentially +meaningful in the within-axis stacked-grid setup, which this experiment does +not test. Recommendation: build the oblivious router with R shared-width +random-projection grids; do not invest in coprime periods unless the within-axis +vernier experiment shows a surprise. + +Next: (a) within-axis vernier test (coprime grids on one direction); +(b) re-run on real embeddings via --corpus-npy; +(c) add a quantile-k-means control arm to size the train-free vs trained gap. + +Reproduce: `cargo run --release --example shard_recall` diff --git a/experiments/ordinal-routing-research/spectral_probe.rs b/experiments/ordinal-routing-research/spectral_probe.rs new file mode 100644 index 00000000..0772c18e --- /dev/null +++ b/experiments/ordinal-routing-research/spectral_probe.rs @@ -0,0 +1,383 @@ +//! Spectral / number-variance probe for the "prime mile-marker" conjecture. +//! +//! Decisive cheap experiment #1: is a 1-D routing key over an embedding +//! corpus *Poisson* (number variance grows like the window length L) or +//! *rigid* (variance grows like log L, as GUE-spectra do)? +//! +//! If Poisson, there is nothing for spectral / prime-gap structure to grip +//! and plain quantile bucketing is the whole story. Rigidity (sub-linear +//! number variance) would be the only empirical opening for the conjecture. +//! +//! Key subtlety handled below: number variance is only meaningful after +//! *unfolding* by the SMOOTH density, not the empirical CDF. Unfolding by +//! rank trivially yields a perfect lattice (Sigma^2 = 0) and measures +//! nothing. A random projection r . x of an L2-normalized vector is +//! approximately N(0, 1/dim) by the CLT, so we unfold with the matching +//! Gaussian CDF Phi and measure the residual fluctuation. +//! +//! Run (synthetic, self-contained): +//! cargo run --release --example spectral_probe +//! Run (real corpus): +//! cargo run --release --example spectral_probe -- --corpus-npy emb.npy +//! +//! No external data, no BLAS. Reuses the same construction as bench_rank. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + latent_dim: usize, + n_clusters: usize, + n_keys: usize, // number of independent random-projection keys to average over + corpus_npy: Option, + isotropic: bool, // control: N(0,I) corpus, no clusters (true Poisson expected) + unfold_empirical: bool, // exact-rank unfold (lattice; demo only, not a rigidity test) + unfold_smooth_knots: usize, // K>0 => smooth empirical unfold with K knots (correct method) +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 200_000, + latent_dim: 64, + n_clusters: 200, + n_keys: 8, + corpus_npy: None, + isotropic: false, + unfold_empirical: false, + unfold_smooth_knots: 0, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--keys" => c.n_keys = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--isotropic" => c.isotropic = true, + "--unfold-empirical" => c.unfold_empirical = true, + "--unfold-smooth" => c.unfold_smooth_knots = a.next().unwrap().parse().unwrap(), + "--rigid-selftest" => {} // handled in main, bypasses corpus + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Phi(z) = 0.5 (1 + erf(z / sqrt 2)). Standard normal CDF. +fn normal_cdf(z: f64) -> f64 { + 0.5 * (1.0 + erf(z / std::f64::consts::SQRT_2)) +} + +fn erf(x: f64) -> f64 { + let t = 1.0 / (1.0 + 0.327_591_1 * x.abs()); + let y = 1.0 + - (((((1.061_405_429 * t - 1.453_152_027) * t) + 1.421_413_741) * t - 0.284_496_736) * t + + 0.254_829_592) + * t + * (-x * x).exp(); + if x >= 0.0 { + y + } else { + -y + } +} + +fn main() { + let cfg = parse(); + // Instrument self-test: feed a perfectly evenly-spaced (unfolded) key + // directly, bypassing corpus + projection. This is a maximally rigid 1-D + // sequence; the estimator MUST report Sigma^2/L << 1 (falling). Proves the + // probe can SEE rigidity when it is present, so a Poisson reading elsewhere + // is a real finding, not instrument blindness. + if std::env::args().any(|a| a == "--rigid-selftest") { + let n = 50_000usize; + let key: Vec = (0..n).map(|i| i as f64).collect(); // perfect lattice + println!("# RIGID SELF-TEST: perfectly even key, expect Sigma^2/L << 1"); + println!("L\tSigma2_mean\tSigma2/L"); + report(&[key], n); + return; + } + let (keys_per_proj, n) = build_keys(&cfg); + println!("# spectral_probe: n={n} keys={}", keys_per_proj.len()); + println!("# unfolded by Gaussian CDF; Poisson => Sigma^2(L) ~ L, rigid => ~log L"); + println!("L\tSigma2_mean\tSigma2/L\tslope_hint"); + report(&keys_per_proj, n); +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays +/// (same format contract as bench_rank's --corpus-npy). +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +/// Project a real corpus into the same per-projection unfolded keys the +/// synthetic path produces. L2-normalizes rows first (cosine geometry). +fn build_keys_from_corpus(cfg: &Cfg, corpus: &[f32], n: usize, d: usize) -> (Vec>, usize) { + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + // unit-normalize rows in a local copy + let mut c = corpus.to_vec(); + for i in 0..n { + let row = &mut c[i * d..(i + 1) * d]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } + let mut out = Vec::with_capacity(cfg.n_keys); + for _ in 0..cfg.n_keys { + let mut r = vec![0.0f32; d]; + for v in r.iter_mut() { + *v = gauss(&mut rng); + } + let rn: f32 = r.iter().map(|v| v * v).sum::().sqrt(); + for v in r.iter_mut() { + *v /= rn; + } + let mut keys: Vec = (0..n) + .map(|i| { + let doc = &c[i * d..(i + 1) * d]; + r.iter().zip(doc).map(|(a, b)| (a * b) as f64).sum() + }) + .collect(); + if cfg.unfold_empirical { + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + for (i, kv) in keys.iter_mut().enumerate() { + *kv = i as f64; + } + } else { + let sigma = (1.0 / d as f64).sqrt(); + for kv in keys.iter_mut() { + *kv = normal_cdf(*kv / sigma) * n as f64; + } + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + } + out.push(keys); + } + (out, n) +} + +/// Returns (one unfolded key-position list per projection, n). +fn build_keys(cfg: &Cfg) -> (Vec>, usize) { + if let Some(ref path) = cfg.corpus_npy { + let (corpus, n, d) = load_npy_f32(path); + eprintln!("# loaded {n} x {d} from {path}"); + return build_keys_from_corpus(cfg, &corpus, n, d); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let d = cfg.dim; + let l = cfg.latent_dim; + // low-rank clustered corpus, same construction as bench_rank + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc; + } + let nrm: f32 = out.iter().map(|v| v * v).sum::().sqrt(); + if nrm > 0.0 { + for v in out.iter_mut() { + *v /= nrm; + } + } + out + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let e = if cfg.isotropic { + // control: draw directly in R^d as N(0,I), normalize. No cluster + // structure, no low-rank latent -> the projection key should be a + // clean Poisson process after Gaussian unfold. + let mut v = vec![0.0f32; d]; + for x in v.iter_mut() { + *x = gauss(&mut rng); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.3, &mut rng) + }; + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + // random projection keys + let mut out = Vec::with_capacity(cfg.n_keys); + for _k in 0..cfg.n_keys { + let mut r = vec![0.0f32; d]; + for v in r.iter_mut() { + *v = gauss(&mut rng); + } + let rn: f32 = r.iter().map(|v| v * v).sum::().sqrt(); + for v in r.iter_mut() { + *v /= rn; + } + // key value for each doc = r . x ; for L2-normalized x and unit r, + // distributed ~ N(0, 1/dim) under the CLT for isotropic x. + let mut keys: Vec = (0..cfg.n) + .map(|i| { + let doc = &corpus[i * d..(i + 1) * d]; + let dot: f32 = r.iter().zip(doc).map(|(a, b)| a * b).sum(); + dot as f64 + }) + .collect(); + keys.sort_by(|x, y| x.partial_cmp(y).unwrap()); + if cfg.unfold_smooth_knots > 0 { + // SMOOTH EMPIRICAL unfold (correct method; removes the Gaussian- + // mismatch confound). Fit the CDF with K coarse knots placed at + // rank-quantiles, linear between. This subtracts the large-scale + // marginal density of ANY distribution WITHOUT collapsing local + // fluctuation to a lattice (which exact-rank unfolding does). + // Number variance must then be read at window scales L << n/K so + // the knot smoothing does not wash out the signal being measured. + let k = cfg.unfold_smooth_knots.min(cfg.n); + let n = cfg.n; + // Precompute the k+1 knots once: (position p_j = rank j*n/k, value + // keys[p_j]). Avoids recomputing knot_rank (div/mul/min) and + // re-indexing keys inside the per-key binary search (review). + let kr = |j: usize| -> usize { (j * n / k).min(n - 1) }; + let knot_pos: Vec = (0..=k).map(|j| kr(j) as f64).collect(); + let knot_val: Vec = (0..=k).map(|j| keys[kr(j)]).collect(); + let unfolded: Vec = keys + .iter() + .map(|&x| { + // bracketing knot index by value (knots sorted ascending) + let mut lo = 0usize; + let mut hi = k; + while lo < hi { + let mid = (lo + hi) / 2; + if knot_val[mid] < x { + lo = mid + 1; + } else { + hi = mid; + } + } + let j = lo.clamp(1, k); + let (v0, v1) = (knot_val[j - 1], knot_val[j]); + let (p0, p1) = (knot_pos[j - 1], knot_pos[j]); + if (v1 - v0).abs() < 1e-30 { + p0 + } else { + p0 + (x - v0) / (v1 - v0) * (p1 - p0) + } + }) + .collect(); + let mut u = unfolded; + u.sort_by(|a, b| a.partial_cmp(b).unwrap()); + out.push(u); + } else if cfg.unfold_empirical { + // exact-rank unfold: assigns rank i. Trivially a lattice (Sigma^2=0) + // for ANY input — this is NOT a rigidity test, only a demonstration + // that exact quantile tiling balances occupancy. Kept for that use. + for (i, kv) in keys.iter_mut().enumerate() { + *kv = i as f64; + } + out.push(keys); + } else { + // fixed-Gaussian unfold: ONLY valid for the isotropic marginal. + // CONFOUNDED for non-isotropic corpora (see ADVERSARIAL_REVIEW.md); + // prefer --unfold-smooth. + let sigma = (1.0 / d as f64).sqrt(); + for kv in keys.iter_mut() { + *kv = normal_cdf(*kv / sigma) * cfg.n as f64; + } + out.push(keys); + } + } + (out, cfg.n) +} + +/// Number variance Sigma^2(L): variance of the count of points in a window +/// of length L, over many window placements, averaged across projections. +fn report(keys_per_proj: &[Vec], n: usize) { + let ls = [2.0_f64, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]; + let n_windows = 4000usize; + for &lwin in ls.iter() { + if lwin >= n as f64 { + break; + } + let mut acc_var = 0.0f64; + for positions in keys_per_proj { + // place n_windows windows uniformly in [0, n-L], count via binary search + let mut counts = Vec::with_capacity(n_windows); + for w in 0..n_windows { + let start = (w as f64) * ((n as f64 - lwin) / n_windows as f64); + let end = start + lwin; + let lo = positions.partition_point(|&p| p < start); + let hi = positions.partition_point(|&p| p < end); + counts.push((hi - lo) as f64); + } + let mean = counts.iter().sum::() / counts.len() as f64; + let var = counts.iter().map(|c| (c - mean).powi(2)).sum::() + / counts.len() as f64; + acc_var += var; + } + let sigma2 = acc_var / keys_per_proj.len() as f64; + let log_l = lwin.ln(); + println!( + "{lwin:.0}\t{sigma2:.4}\t{:.4}\t(L={lwin:.0}, lnL={log_l:.3})", + sigma2 / lwin + ); + } +} diff --git a/experiments/ordinal-routing-research/subspace_directions.rs b/experiments/ordinal-routing-research/subspace_directions.rs new file mode 100644 index 00000000..1bfec38f --- /dev/null +++ b/experiments/ordinal-routing-research/subspace_directions.rs @@ -0,0 +1,553 @@ +//! CONTEXT-BOXED, PRE-REGISTERED — the directions experiment done RIGHT, using +//! every finding: center (remove the cone) -> project to the populated k-dim +//! subspace (where joint structure is legible) -> place R probe directions via +//! {random, Sobol, Kronecker} -> route -> candidates-scanned at matched recall. +//! Isolated by request: touches no harness, writes nothing. +//! +//! WHY THIS ISN'T fib_directions REDUX: +//! fib_directions placed golden directions on the AMBIENT 768-sphere, where +//! concentration of measure makes iid-Gaussian already near-uniform -> doomed. +//! The marginal uniformity lemma is BLIND to joint structure; directions act on +//! the JOINT correlation. So the honest test is in the centered, low-effective- +//! dim subspace (intrinsic dim ~13), where low-discrepancy has room to matter. +//! +//! SEQUENCES (data-oblivious; PCA control is the data-DEPENDENT ceiling): +//! random — iid Gaussian directions in the k-subspace (the baseline). +//! sobol — base-2 digital net -> inverse-normal -> subspace direction. +//! Discrepancy-optimal at R=2^m; predicted to spike at R in {16,32,64}. +//! kronecker— additive {i*alpha} per axis (generalized golden) -> direction. +//! Three-distance at EVERY prefix; predicted to hold across all R but +//! degrade as k climbs (dimension factor in its discrepancy bound). +//! pca-axes — the top-R principal directions themselves (DATA-DEPENDENT control; +//! expected to win — it bounds what oblivious sets leave on the table). +//! +//! METRIC: per query, probe its bucket +/- rad on each of R directions; candidate +//! union; recall@10 vs FP32 cosine top-10. Report the FAIR ENVELOPE (max recall at +//! candidates-scanned <= budget) per sequence — the only fair cross-arm axis. +//! Also report post-map DISCREPANCY (star-discrepancy proxy) of each direction set, +//! because cube->sphere mapping warps the even-spacing and that must be MEASURED. +//! +//! PRE-REGISTERED VERDICT (fixed before running), family-level, Bonferroni in mind: +//! COMPONENT WIN if a sequence beats random by >= 0.02 recall at matched +//! candidates in >= 2 of {fiqa,nq,quora} AND survives at k near 13. +//! SOBOL-AS-PREDICTED if its wins concentrate at R in {16,32,64} (power-of-2). +//! KRONECKER-AS-PREDICTED if it wins across R but its margin shrinks as k grows. +//! CLASS-DEAD if neither low-discrepancy sequence beats random anywhere by >=0.02 +//! -> the entire oblivious-directions hypothesis (not just golden) is falsified. +//! The Kronecker/Sobol HYBRID is NOT built here; it is earned only if each wins +//! in a DISTINCT R-regime. This experiment decides whether that even applies. +//! +//! Run: cargo run --release --example subspace_directions -- \ +//! --corpus-npy /tmp/corpora/fiqa_corpus.npy --queries-npy /tmp/corpora/fiqa_q.npy --kdim 13 + +// Research probe (not library code): index loops and the inv-normal constant +// tables read closer to the math than the iterator rewrites would. +#![allow( + clippy::needless_range_loop, + clippy::type_complexity, + clippy::excessive_precision +)] + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn coord_mean(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut m = vec![0f32; dim]; + for i in 0..n { + for c in 0..dim { + m[c] += rows[i * dim + c]; + } + } + for m_c in m.iter_mut() { + *m_c /= n as f32; + } + m +} + +/// Top-k principal directions of the centered corpus via power iteration + +/// deflation. Returns k unit vectors of length `dim`. Pure Rust, no BLAS. +/// Uses a doc subsample for the covariance action (cov is implicit: Cov*x = +/// (1/n) X^T (X x), never materialized). +fn pca_topk(centered: &[f32], n: usize, dim: usize, k: usize, mean: &[f32]) -> Vec> { + let sample: Vec = if n > 20000 { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x9C0FFEE); + (0..20000).map(|_| rng.random_range(0..n)).collect() + } else { + (0..n).collect() + }; + // implicit covariance action y = Cov * x using centered rows on the sample + let cov_mul = |x: &[f32]| -> Vec { + let acc: Vec = sample + .par_iter() + .map(|&i| { + let row = ¢ered[i * dim..(i + 1) * dim]; + let d: f32 = row.iter().zip(x).map(|(a, b)| a * b).sum(); + row.iter().map(|&r| r * d).collect::>() + }) + .reduce( + || vec![0f32; dim], + |mut a, b| { + for j in 0..dim { + a[j] += b[j]; + } + a + }, + ); + let s = sample.len() as f32; + acc.iter().map(|v| v / s).collect() + }; + let mut comps: Vec> = Vec::with_capacity(k); + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xABCD); + let _ = mean; + for _ in 0..k { + let mut v: Vec = (0..dim) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + for _ in 0..50 { + // deflate against found components + for c in &comps { + let p: f32 = v.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + v[j] -= p * c[j]; + } + } + let mut y = cov_mul(&v); + for c in &comps { + let p: f32 = y.iter().zip(c).map(|(a, b)| a * b).sum(); + for j in 0..dim { + y[j] -= p * c[j]; + } + } + let nrm: f32 = y.iter().map(|t| t * t).sum::().sqrt(); + if nrm < 1e-20 { + break; + } + for j in 0..dim { + v[j] = y[j] / nrm; + } + } + comps.push(v); + } + comps +} + +/// project a centered vector onto the k-dim PCA basis -> k coords. +fn project(v: &[f32], basis: &[Vec]) -> Vec { + basis + .iter() + .map(|b| v.iter().zip(b).map(|(a, c)| a * c).sum()) + .collect() +} + +// ---- direction generators in k-dim (returned as R unit vectors length k) ---- + +fn dirs_random(r: usize, k: usize) -> Vec> { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0x1111 ^ (r as u64)); + (0..r) + .map(|_| { + let mut v: Vec = (0..k) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + v + }) + .collect() +} + +/// Van der Corput / Sobol-lite: radical inverse in base 2 per dimension using +/// distinct primes for the radical-inverse base per axis (Halton-style net, the +/// digital low-discrepancy family). inverse-normal -> unit direction. +fn radical_inverse(mut i: usize, base: usize) -> f32 { + let mut f = 1.0f64; + let mut r = 0.0f64; + while i > 0 { + f /= base as f64; + r += f * (i % base) as f64; + i /= base; + } + r as f32 +} + +const PRIMES: [usize; 32] = [ + 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, + 101, 103, 107, 109, 113, 127, 131, +]; + +/// inverse standard normal CDF (Acklam's rational approximation) for u in (0,1). +fn inv_norm(u: f32) -> f32 { + let u = u.clamp(1e-6, 1.0 - 1e-6) as f64; + let a = [ + -3.969683028665376e+01, + 2.209460984245205e+02, + -2.759285104469687e+02, + 1.383577518672690e+02, + -3.066479806614716e+01, + 2.506628277459239e+00, + ]; + let b = [ + -5.447609879822406e+01, + 1.615858368580409e+02, + -1.556989798598866e+02, + 6.680131188771972e+01, + -1.328068155288572e+01, + ]; + let c = [ + -7.784894002430293e-03, + -3.223964580411365e-01, + -2.400758277161838e+00, + -2.549732539343734e+00, + 4.374664141464968e+00, + 2.938163982698783e+00, + ]; + let d = [ + 7.784695709041462e-03, + 3.224671290700398e-01, + 2.445134137142996e+00, + 3.754408661907416e+00, + ]; + let pl = 0.02425; + let x = if u < pl { + let q = (-2.0 * u.ln()).sqrt(); + (((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + } else if u <= 1.0 - pl { + let q = u - 0.5; + let rr = q * q; + (((((a[0] * rr + a[1]) * rr + a[2]) * rr + a[3]) * rr + a[4]) * rr + a[5]) * q + / (((((b[0] * rr + b[1]) * rr + b[2]) * rr + b[3]) * rr + b[4]) * rr + 1.0) + } else { + let q = (-2.0 * (1.0 - u).ln()).sqrt(); + -(((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5]) + / ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0) + }; + x as f32 +} + +fn dirs_sobol(r: usize, k: usize) -> Vec> { + (0..r) + .map(|i| { + let mut v: Vec = (0..k) + .map(|a| inv_norm(radical_inverse(i + 1, PRIMES[a % PRIMES.len()]))) + .collect(); + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() +} + +/// Kronecker: per-axis additive recurrence {i*alpha_a}, alpha from the generalized +/// golden ratio (root of x^{k+1}=x+1) approximated per-axis by primes^(1/(a+2)) frac. +fn dirs_kronecker(r: usize, k: usize) -> Vec> { + // irrational generators: fractional parts of sqrt(prime) (badly approximable, Weyl-equidistributed) + let alphas: Vec = (0..k) + .map(|a| (PRIMES[a % PRIMES.len()] as f64).sqrt().fract()) + .collect(); + (0..r) + .map(|i| { + let mut v: Vec = (0..k) + .map(|a| inv_norm(((i as f64 + 1.0) * alphas[a]).fract() as f32)) + .collect(); + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + if nrm > 0.0 { + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() +} + +fn dirs_pca_axes(r: usize, k: usize) -> Vec> { + // data-dependent ceiling: the k principal axes, then (for R>k) random unit + // combinations of them seeded deterministically — so R>k gives R DISTINCT + // directions spanning the populated subspace, not repeats (control fix). + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xACE5); + (0..r) + .map(|i| { + let mut v = vec![0f32; k]; + if i < k { + v[i] = 1.0; + } else { + for x in v.iter_mut() { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + *x = (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos(); + } + let nrm: f32 = v.iter().map(|t| t * t).sum::().sqrt(); + for x in v.iter_mut() { + *x /= nrm; + } + } + v + }) + .collect() +} + +/// star-discrepancy proxy: max over a random set of half-spaces of |empirical - expected|. +fn discrepancy_proxy(dirs: &[Vec], k: usize) -> f32 { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xDDDD); + let probes = 2000; + let mut worst = 0f32; + for _ in 0..probes { + // random half-space normal + let mut hn: Vec = (0..k) + .map(|_| { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() + }) + .collect(); + let hnm: f32 = hn.iter().map(|t| t * t).sum::().sqrt(); + for x in hn.iter_mut() { + *x /= hnm; + } + let frac = dirs + .iter() + .filter(|d| d.iter().zip(&hn).map(|(a, b)| a * b).sum::() > 0.0) + .count() as f32 + / dirs.len() as f32; + worst = worst.max((frac - 0.5).abs()); + } + worst +} + +struct Proj { + dir: Vec, + width: f32, +} +impl Proj { + fn bucket(&self, v: &[f32]) -> i64 { + let dot: f32 = self.dir.iter().zip(v).map(|(a, b)| a * b).sum(); + (dot / self.width).floor() as i64 + } +} + +fn main() { + let mut cp = None; + let mut qp = None; + let mut kdim = 13usize; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--corpus-npy" => cp = Some(a.next().unwrap()), + "--queries-npy" => qp = Some(a.next().unwrap()), + "--kdim" => kdim = a.next().unwrap().parse().unwrap(), + _ => {} + } + } + let (mut corpus, n, dim) = load_npy_f32(&cp.expect("--corpus-npy")); + let (mut queries, nq, dq) = load_npy_f32(&qp.expect("--queries-npy")); + assert_eq!(dim, dq); + l2_normalize_rows(&mut corpus, dim); + l2_normalize_rows(&mut queries, dim); + + // ground truth = FP32 cosine top-10 on normalized RAW vectors + let truth: Vec> = (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(u32, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di as u32, q.iter().zip(d).map(|(a, b)| a * b).sum::()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); + s.into_iter().take(10).map(|(i, _)| i).collect() + }) + .collect(); + + // center + let mean = coord_mean(&corpus, n, dim); + let mut cc = corpus.clone(); + for i in 0..n { + for c in 0..dim { + cc[i * dim + c] -= mean[c]; + } + } + let mut qc = queries.clone(); + for i in 0..nq { + for c in 0..dim { + qc[i * dim + c] -= mean[c]; + } + } + + // PCA basis from centered corpus, project both + eprintln!("# computing top-{kdim} PCA basis (power iteration)..."); + let basis = pca_topk(&cc, n, dim, kdim, &mean); + let cproj: Vec> = (0..n) + .into_par_iter() + .map(|i| project(&cc[i * dim..(i + 1) * dim], &basis)) + .collect(); + let qproj: Vec> = (0..nq) + .into_par_iter() + .map(|i| project(&qc[i * dim..(i + 1) * dim], &basis)) + .collect(); + + // width calibration: ~sqrt(n) buckets along a unit direction in subspace + let spread = { + let mut s = 0f32; + for v in cproj.iter().take(2000) { + for &x in v { + s += x * x; + } + } + (s / (2000.0 * kdim as f32)).sqrt() * 6.0 + }; + let base_width = spread / (n as f32).sqrt(); + + println!("# subspace_directions (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}, k={kdim}"); + println!("# centered+projected to k-dim PCA subspace; recall@10 vs FP32 cosine top-10"); + let gens: [(&str, fn(usize, usize) -> Vec>); 4] = [ + ("random", dirs_random), + ("sobol", dirs_sobol), + ("kronecker", dirs_kronecker), + ("pca-axes", dirs_pca_axes), + ]; + let r_values = [8usize, 16, 32, 64, 128]; + + println!("seq\tR\tdiscrepancy\trad\tcand\trecall@10"); + let mut pts: Vec<(String, f64, f64)> = vec![]; + for (name, gen) in gens.iter() { + for &r in &r_values { + let dirs = gen(r, kdim); + let disc = discrepancy_proxy(&dirs, kdim); + let projs: Vec = dirs + .into_iter() + .map(|d| Proj { + dir: d, + width: base_width, + }) + .collect(); + // index: per direction, bucket -> docs + let index: Vec>> = projs + .par_iter() + .map(|p| { + let mut m: std::collections::HashMap> = + std::collections::HashMap::new(); + for di in 0..n { + m.entry(p.bucket(&cproj[di])).or_default().push(di as u32); + } + m + }) + .collect(); + for rad in [0i64, 1, 2, 4] { + let stats: Vec<(usize, f32)> = (0..nq) + .into_par_iter() + .map(|qi| { + use std::collections::HashSet; + let mut cand: HashSet = HashSet::new(); + for (pi, p) in projs.iter().enumerate() { + let b = p.bucket(&qproj[qi]); + for bb in (b - rad)..=(b + rad) { + if let Some(ids) = index[pi].get(&bb) { + cand.extend(ids.iter().copied()); + } + } + } + let hits = truth[qi].iter().filter(|&&i| cand.contains(&i)).count(); + (cand.len(), hits as f32 / truth[qi].len() as f32) + }) + .collect(); + let mc = stats.iter().map(|s| s.0 as f64).sum::() / nq as f64; + let mr = stats.iter().map(|s| s.1 as f64).sum::() / nq as f64; + println!("{name}\t{r}\t{disc:.4}\t{rad}\t{mc:.0}\t{mr:.4}"); + pts.push((name.to_string(), mc, mr)); + } + } + } + println!("\n# FAIR ENVELOPE: max recall@10 at candidates-scanned <= budget"); + let budgets = [500.0, 1000.0, 2000.0, 4000.0, 8000.0]; + print!("budget"); + for (name, _) in gens.iter() { + print!("\t{name}"); + } + println!(); + for &bud in &budgets { + print!("{bud:.0}"); + for (name, _) in gens.iter() { + let best = pts + .iter() + .filter(|(nm, c, _)| nm == name && *c <= bud) + .map(|(_, _, r)| *r) + .fold(0.0, f64::max); + print!("\t{best:.4}"); + } + println!(); + } + println!("\n# PRE-REGISTERED: COMPONENT WIN if sobol/kronecker beat random by >=0.02 at matched budget in >=2 corpora;"); + println!("# sobol-as-predicted if wins concentrate at R in {{16,32,64}}; kronecker if across R but shrinks as k grows;"); + println!("# CLASS-DEAD if neither beats random by >=0.02 anywhere. Hybrid earned only on distinct-regime component wins."); +} diff --git a/experiments/ordinal-routing-research/tau_rerank_bakeoff.rs b/experiments/ordinal-routing-research/tau_rerank_bakeoff.rs new file mode 100644 index 00000000..43bf3bf1 --- /dev/null +++ b/experiments/ordinal-routing-research/tau_rerank_bakeoff.rs @@ -0,0 +1,344 @@ +//! Matched-bytes bake-off: does a Kendall-tau rerank of b=2 survivors beat +//! simply spending the bits on b=4, on R@10 vs FP32 cosine? +//! +//! This is the decisive deployment question for the density-collapse finding. +//! It is a CEILING experiment: the tau-rerank uses the FULL stored rank order +//! (ignoring how compactly that order could be stored). If the idealized +//! tau-rerank cannot beat b=4, the answer is "just use b=4" and no codec work +//! is justified. If it can, a compact tau codec becomes worth designing. +//! +//! Arms (all scored against FP32 brute-force cosine top-k): +//! b2 — RankQuant b=2 asymmetric (dim/4 bytes/vec) +//! b4 — RankQuant b=4 asymmetric (dim/2 bytes/vec) +//! b2+tau — b=2 top-M candidates, reranked by Kendall-tau of the +//! query's top-k coord order vs each doc's stored rank order +//! b2+fp32 — b=2 top-M candidates, reranked by exact FP32 cosine +//! (absolute ceiling / sanity — should approach FP32) +//! +//! Run: cargo run --release --example tau_rerank_bakeoff +//! cargo run --release --example tau_rerank_bakeoff -- --corpus-npy emb.npy --queries-npy q.npy +//! No external data, no BLAS. + +use ordvec::rank::rank_transform; +use ordvec::RankQuant; +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + n_queries: usize, + latent: usize, + clusters: usize, + k: usize, + m: usize, // candidate-set size for the rerank arms + topk: usize, // # query top coords used by the tau rerank + corpus_npy: Option, + queries_npy: Option, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 30_000, + n_queries: 200, + latent: 64, + clusters: 200, + k: 10, + m: 200, + topk: 32, + corpus_npy: None, + queries_npy: None, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--queries" => c.n_queries = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent = a.next().unwrap().parse().unwrap(), + "--clusters" => c.clusters = a.next().unwrap().parse().unwrap(), + "--k" => c.k = a.next().unwrap().parse().unwrap(), + "--m" => c.m = a.next().unwrap().parse().unwrap(), + "--topk" => c.topk = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--queries-npy" => c.queries_npy = Some(a.next().unwrap()), + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +fn l2_rows(v: &mut [f32], dim: usize) { + for i in 0..v.len() / dim { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "len mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +fn main() { + let mut cfg = parse(); + let (corpus, queries) = load_corpus(&mut cfg); + let truth = fp32_topk(&corpus, &queries, cfg.dim, cfg.k); + println!( + "# tau_rerank_bakeoff: n={} q={} dim={} k={} M={} topk={}", + cfg.n, cfg.n_queries, cfg.dim, cfg.k, cfg.m, cfg.topk + ); + run(&cfg, &corpus, &queries, &truth); +} + +fn recall_at_k(pred: &[usize], truth: &[usize]) -> f32 { + use std::collections::HashSet; + let t: HashSet = truth.iter().copied().collect(); + let hits = pred.iter().filter(|i| t.contains(i)).count(); + hits as f32 / truth.len().max(1) as f32 +} + +/// Top-k indices of a vector by value, descending. +fn top_coords(v: &[f32], k: usize) -> Vec { + let mut idx: Vec = (0..v.len()).collect(); + idx.sort_by(|&i, &j| v[j].partial_cmp(&v[i]).unwrap_or(std::cmp::Ordering::Equal)); + idx.truncate(k); + idx +} + +/// Kendall-tau distance between the orderings q and the doc's stored RANKS +/// induce on `coords` (fraction of discordant pairs). Lower = more similar +/// order. `dranks[c]` is the stored rank of coord c for this doc. +fn tau_dist(q: &[f32], dranks: &[u16], coords: &[usize]) -> f32 { + let m = coords.len(); + if m < 2 { + return 0.0; + } + let (mut disc, mut tot) = (0usize, 0usize); + for x in 0..m { + for y in (x + 1)..m { + let (cx, cy) = (coords[x], coords[y]); + let sq = (q[cx] - q[cy]).partial_cmp(&0.0).unwrap_or(std::cmp::Ordering::Equal); + // stored ranks are integers; higher rank = larger value + let sd = dranks[cx].cmp(&dranks[cy]); + if sq != sd { + disc += 1; + } + tot += 1; + } + } + disc as f32 / tot.max(1) as f32 +} + +fn run(cfg: &Cfg, corpus: &[f32], queries: &[f32], truth: &[Vec]) { + let d = cfg.dim; + let nq = cfg.n_queries; + + // Build b=2 and b=4 indices on identical data. + let mut b2 = RankQuant::new(d, 2); + b2.add(corpus); + let mut b4 = RankQuant::new(d, 4); + b4.add(corpus); + eprintln!( + "# b2 {} B/vec, b4 {} B/vec", + b2.bytes_per_vec(), + b4.bytes_per_vec() + ); + + // Precompute stored ranks per doc (the CEILING tau signal: full order). + // This is what a compact tau codec would have to approximate. + let doc_ranks: Vec> = (0..cfg.n) + .into_par_iter() + .map(|i| rank_transform(&corpus[i * d..(i + 1) * d])) + .collect(); + + // Per query: candidate set = b=2's own asymmetric top-M (the cheap stage). + let b2_res = b2.search_asymmetric(queries, cfg.m); + let b4_res = b4.search_asymmetric(queries, cfg.k); + + let mut r_b2 = 0.0f64; + let mut r_b4 = 0.0f64; + let mut r_tau = 0.0f64; + let mut r_fp = 0.0f64; + + let per_q: Vec<(f32, f32, f32, f32)> = (0..nq) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * d..(qi + 1) * d]; + // b2 top-k (first k of its top-M) + let b2_ids: Vec = + b2_res.indices_for_query(qi).iter().take(cfg.k).map(|&i| i as usize).collect(); + let b4_ids: Vec = + b4_res.indices_for_query(qi).iter().map(|&i| i as usize).collect(); + // candidate pool = b2 top-M + let cands: Vec = + b2_res.indices_for_query(qi).iter().map(|&i| i as usize).collect(); + let coords = top_coords(q, cfg.topk); + // tau rerank: ascending tau distance + let mut by_tau: Vec<(usize, f32)> = cands + .iter() + .map(|&di| (di, tau_dist(q, &doc_ranks[di], &coords))) + .collect(); + by_tau.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal)); + let tau_ids: Vec = by_tau.iter().take(cfg.k).map(|&(i, _)| i).collect(); + // fp32 rerank: descending cosine (absolute ceiling) + let mut by_fp: Vec<(usize, f32)> = cands + .iter() + .map(|&di| { + let dv = &corpus[di * d..(di + 1) * d]; + (di, q.iter().zip(dv).map(|(a, b)| a * b).sum()) + }) + .collect(); + by_fp.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)); + let fp_ids: Vec = by_fp.iter().take(cfg.k).map(|&(i, _)| i).collect(); + ( + recall_at_k(&b2_ids, &truth[qi]), + recall_at_k(&b4_ids, &truth[qi]), + recall_at_k(&tau_ids, &truth[qi]), + recall_at_k(&fp_ids, &truth[qi]), + ) + }) + .collect(); + for (a, b, c, e) in &per_q { + r_b2 += *a as f64; + r_b4 += *b as f64; + r_tau += *c as f64; + r_fp += *e as f64; + } + let nqf = nq as f64; + println!("\narm bytes/vec R@{}", cfg.k); + println!("b2 asym {:>9} {:.4}", b2.bytes_per_vec(), r_b2 / nqf); + println!("b4 asym {:>9} {:.4}", b4.bytes_per_vec(), r_b4 / nqf); + println!( + "b2 + tau-rerank {:>9}* {:.4} (*ceiling: full stored ranks)", + b2.bytes_per_vec(), + r_tau / nqf + ); + println!("b2 + fp32-rerank {:>9} {:.4} (absolute ceiling)", "—", r_fp / nqf); + println!("\nVERDICT:"); + let (tau, b4v) = (r_tau / nqf, r_b4 / nqf); + if tau > b4v + 0.005 { + println!(" tau-rerank BEATS b4 at half the bytes ({tau:.4} > {b4v:.4}) — codec work justified"); + } else if tau + 0.005 < b4v { + println!(" b4 BEATS idealized tau-rerank ({b4v:.4} > {tau:.4}) — just use b4, no codec"); + } else { + println!(" tau-rerank ~= b4 ({tau:.4} vs {b4v:.4}) — tie even at the ceiling; b4 simpler"); + } +} + +/// FP32 brute-force cosine top-k (corpus + queries L2-normalized). Vec per query. +fn fp32_topk(corpus: &[f32], queries: &[f32], dim: usize, k: usize) -> Vec> { + let n = corpus.len() / dim; + (0..queries.len() / dim) + .into_par_iter() + .map(|qi| { + let q = &queries[qi * dim..(qi + 1) * dim]; + let mut s: Vec<(usize, f32)> = (0..n) + .map(|di| { + let d = &corpus[di * dim..(di + 1) * dim]; + (di, q.iter().zip(d).map(|(a, b)| a * b).sum()) + }) + .collect(); + s.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)); + s.into_iter().take(k).map(|(i, _)| i).collect() + }) + .collect() +} + +fn load_corpus(cfg: &mut Cfg) -> (Vec, Vec) { + if let (Some(cp), Some(qp)) = (cfg.corpus_npy.clone(), cfg.queries_npy.clone()) { + let (mut c, n, d) = load_npy_f32(&cp); + let (mut q, nq, dq) = load_npy_f32(&qp); + assert_eq!(d, dq, "dim mismatch"); + l2_rows(&mut c, d); + l2_rows(&mut q, dq); + cfg.n = n; + cfg.dim = d; + cfg.n_queries = nq; + eprintln!("# loaded corpus {n}x{d}, queries {nq}x{dq}"); + return (c, q); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let (d, l) = (cfg.dim, cfg.latent); + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let mk = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut o = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + o[i] = acc; + } + o + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let c = rng.random_range(0..cfg.clusters); + corpus[i * d..(i + 1) * d].copy_from_slice(&mk(&protos[c * l..(c + 1) * l], 0.3, &mut rng)); + } + let mut queries = vec![0.0f32; cfg.n_queries * d]; + for i in 0..cfg.n_queries { + let c = rng.random_range(0..cfg.clusters); + queries[i * d..(i + 1) * d] + .copy_from_slice(&mk(&protos[c * l..(c + 1) * l], 0.1, &mut rng)); + } + l2_rows(&mut corpus, d); + l2_rows(&mut queries, d); + (corpus, queries) +} diff --git a/experiments/ordinal-routing-research/tau_rerank_bakeoff_results.md b/experiments/ordinal-routing-research/tau_rerank_bakeoff_results.md new file mode 100644 index 00000000..e651aaa1 --- /dev/null +++ b/experiments/ordinal-routing-research/tau_rerank_bakeoff_results.md @@ -0,0 +1,54 @@ +# Matched-bytes bake-off: tau-rerank vs b=4 — the deployment verdict + +The decisive question for the density-collapse finding: does breaking dense-region +ties with permutation order (Kendall-tau rerank of b=2 survivors) beat simply +spending the bits on b=4? Source: `examples/tau_rerank_bakeoff.rs`. CEILING +experiment — tau uses the FULL stored rank order (best case for the method). + +## Result — NO. b=4 wins decisively, even at the tau ceiling. + +Real embeddings (nomic-embed-text, 8665 docs / 200 held-out queries, 768-d, +topk=128, M=50 — tau's best parameters): + +| arm | bytes/vec | R@10 | +|-----|-----------|------| +| b2 asym | 192 | 0.8980 | +| b4 asym | 384 | 0.9420 | +| b2 + tau-rerank | 192* | 0.5965 (*ceiling: full stored ranks) | +| b2 + fp32-rerank | — | 1.0000 (absolute ceiling) | + +Synthetic (dim 256, n 30k) tells the same story: b2 0.5785, b4 0.8095, +tau-rerank 0.22–0.57 (best case ties b2, never reaches b4), fp32-rerank 1.0000. + +## Why the tau-rerank fails (the precise lesson) + +1. **The candidate pool is fine.** fp32-rerank = 1.0000 proves b=2's top-M + *contains* every FP32 true neighbour. Candidate generation is not the problem. +2. **Tau MISORDERS them.** tau-rerank (0.597) is worse than b=2's OWN ordering + (0.898). The ~0.04 tau gap from density_collapse_results.md is real as a faint + BINARY discriminator (true-neighbour vs far-lookalike) but far too weak to + ORDER 50 candidates correctly — it actively destroys the ordering b=2's + scores already provide. +3. **Even at the ceiling** (full stored ranks, best topk/M) tau-rerank only + claws back to ≈ b=2; it never reaches b=4. No compact tau codec could do + better than this ceiling, so codec work is NOT justified. + +## Verdict + +**The density-collapse tau-signal is a real-but-inert observation, not a usable +lever. "Just use b=4."** This closes the loop opened in +density_collapse_results.md: the signal exists (binary separation, CI > 0) but +does not convert to retrieval value. There is no ordvec code change here worth +making. + +This is the honest negative that the whole research line was meant to resolve: +the prime/spectral/permutation ideas for improving dense-region retrieval do not +beat the boring baseline (spend the bits). The contribution of this branch is +therefore RESEARCH (a characterized mechanism + a clean negative), not a feature. + +Reproduce: +``` +cargo run --release --example tau_rerank_bakeoff # synthetic +cargo run --release --example tau_rerank_bakeoff -- \ + --corpus-npy repo_real.npy --queries-npy repo_q.npy --topk 128 --m 50 +``` diff --git a/experiments/ordinal-routing-research/twonn_id.rs b/experiments/ordinal-routing-research/twonn_id.rs new file mode 100644 index 00000000..c446ec07 --- /dev/null +++ b/experiments/ordinal-routing-research/twonn_id.rs @@ -0,0 +1,243 @@ +//! TwoNN intrinsic-dimension estimator (Facco et al. 2017) for embedding corpora. +//! +//! For each point, μ = r2/r1 = ratio of 2nd to 1st nearest-neighbour distance. +//! Under locally-uniform density of intrinsic dimension d, μ ~ Pareto(d), so +//! the linear fit log(1 - F(μ_i)) = -d * log(μ_i) through the origin +//! recovers d. No binning, no manifold reconstruction — two distances/point. +//! +//! Metric: embeddings are L2-normalized and retrieval is ANGULAR, so NN +//! distances here use chord / Euclidean distance between unit vectors +//! (`sqrt(2 - 2cos)`). Cosine distance (`1 - cos`) is squared-angle locally +//! and biases TwoNN estimates downward. +//! +//! Validation control: the synthetic corpus is a latent_dim-D manifold +//! projected into dim-D, so TwoNN should recover ID ≈ latent_dim. Run the +//! default and check it lands near 64 before trusting it on real .npy data. +//! +//! Run (synthetic control): cargo run --release --example twonn_id +//! Run (real corpus): cargo run --release --example twonn_id -- --corpus-npy emb.npy +//! No external data, no BLAS. + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +struct Cfg { + dim: usize, + n: usize, + latent_dim: usize, + n_clusters: usize, + corpus_npy: Option, + isotropic: bool, +} + +fn parse() -> Cfg { + let mut c = Cfg { + dim: 256, + n: 20_000, + latent_dim: 64, + n_clusters: 200, + corpus_npy: None, + isotropic: false, + }; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + match x.as_str() { + "--dim" => c.dim = a.next().unwrap().parse().unwrap(), + "--n" => c.n = a.next().unwrap().parse().unwrap(), + "--latent" => c.latent_dim = a.next().unwrap().parse().unwrap(), + "--clusters" => c.n_clusters = a.next().unwrap().parse().unwrap(), + "--corpus-npy" => c.corpus_npy = Some(a.next().unwrap()), + "--isotropic" => c.isotropic = true, + other => panic!("unknown arg {other}"), + } + } + c +} + +fn gauss(rng: &mut ChaCha8Rng) -> f32 { + let u1: f32 = rng.random_range(1e-9..1.0); + let u2: f32 = rng.random_range(0.0..1.0); + (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos() +} + +/// Minimal NumPy v1/v2 .npy reader for 2-D little-endian f32 C-order arrays. +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!(bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", "not a numpy file"); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner.split(',').filter_map(|s| s.trim().parse().ok()).collect(); + assert_eq!(dims.len(), 2, "expected 2-D array"); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + assert_eq!(bytes.len() - dstart, n * dim * 4, "data length mismatch"); + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +/// L2-normalize each row in place (cosine geometry). +fn l2_normalize_rows(v: &mut [f32], dim: usize) { + let n = v.len() / dim; + for i in 0..n { + let row = &mut v[i * dim..(i + 1) * dim]; + let nrm: f32 = row.iter().map(|x| x * x).sum::().sqrt(); + if nrm > 0.0 { + for x in row.iter_mut() { + *x /= nrm; + } + } + } +} + +/// Returns (corpus, n, dim). Synthetic = same low-rank clustered construction +/// as bench_rank (latent_dim manifold in dim-space → TwoNN should recover +/// ~latent_dim). Real = whatever the .npy holds, L2-normalized. +fn load_corpus(cfg: &Cfg) -> (Vec, usize, usize) { + if let Some(ref path) = cfg.corpus_npy { + let (mut v, n, dim) = load_npy_f32(path); + l2_normalize_rows(&mut v, dim); + return (v, n, dim); + } + let mut rng = ChaCha8Rng::seed_from_u64(SEED); + let (d, l) = (cfg.dim, cfg.latent_dim); + let mut a = vec![0.0f32; d * l]; + for x in a.iter_mut() { + *x = gauss(&mut rng); + } + let mut protos = vec![0.0f32; cfg.n_clusters * l]; + for x in protos.iter_mut() { + *x = gauss(&mut rng); + } + let make = |proto: &[f32], noise: f32, rng: &mut ChaCha8Rng| -> Vec { + let mut z = vec![0.0f32; l]; + for j in 0..l { + z[j] = proto[j] + noise * gauss(rng); + } + let mut out = vec![0.0f32; d]; + for i in 0..d { + let mut acc = 0.0f32; + for j in 0..l { + acc += a[i * l + j] * z[j]; + } + out[i] = acc; + } + out + }; + let mut corpus = vec![0.0f32; cfg.n * d]; + for i in 0..cfg.n { + let e = if cfg.isotropic { + let mut v = vec![0.0f32; d]; + for x in v.iter_mut() { + *x = gauss(&mut rng); + } + v + } else { + let c = rng.random_range(0..cfg.n_clusters); + make(&protos[c * l..(c + 1) * l], 0.3, &mut rng) + }; + corpus[i * d..(i + 1) * d].copy_from_slice(&e); + } + l2_normalize_rows(&mut corpus, d); + (corpus, cfg.n, d) +} + +/// For sampled anchors, compute μ = r2/r1 over COSINE distance (1 - cos), +/// searching each anchor against the full corpus. Anchors are sampled (the +/// Pareto fit is robust to it) but each search is exact, so r1/r2 are exact. +/// Returns the μ values (>1, with degenerate/duplicate cases dropped). +fn nn_ratios(corpus: &[f32], n: usize, dim: usize) -> Vec { + // sample up to 3000 anchors deterministically (stride sampling, seed-free) + let n_anchors = n.min(3000); + let stride = (n / n_anchors).max(1); + let anchors: Vec = (0..n).step_by(stride).take(n_anchors).collect(); + anchors + .par_iter() + .filter_map(|&ai| { + let a = &corpus[ai * dim..(ai + 1) * dim]; + // Track the two smallest chord distances to OTHER points. + let (mut d1, mut d2) = (f64::INFINITY, f64::INFINITY); + for bi in 0..n { + if bi == ai { + continue; + } + let b = &corpus[bi * dim..(bi + 1) * dim]; + let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum(); + // Euclidean / chord distance between unit vectors: + // sqrt(2 - 2cos) ∝ sin(θ/2), locally LINEAR in the angle. + // (Cosine distance 1-cos ≈ θ²/2 is a *squared* distance and + // halves the TwoNN estimate — TwoNN needs a locally-linear metric.) + let dist = (2.0 - 2.0 * dot as f64).max(0.0).sqrt(); + if dist < d1 { + d2 = d1; + d1 = dist; + } else if dist < d2 { + d2 = dist; + } + } + // drop degenerate anchors (exact duplicate neighbour => r1≈0) + if d1 > 1e-12 && d2.is_finite() { + let mu = d2 / d1; + if mu > 1.0 + 1e-9 { + return Some(mu); + } + } + None + }) + .collect() +} + +/// TwoNN Pareto fit. Sort μ ascending, empirical CDF F_i = (i+1)/N, then fit +/// d through the origin on y = -log(1 - F) vs x = log μ. Robust variant: +/// discard the top 10% of μ (Facco et al.'s recommended tail trim) and use the +/// closed-form slope d = Σ x·y / Σ x². +fn twonn_fit(mus: &[f64]) -> f64 { + let mut m: Vec = mus.to_vec(); + m.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)); + let n = m.len(); + let keep = ((n as f64) * 0.9) as usize; // trim top 10% tail + let (mut sxx, mut sxy) = (0.0f64, 0.0f64); + for (i, &mu) in m.iter().enumerate().take(keep) { + let f = (i + 1) as f64 / n as f64; // empirical CDF + let x = mu.ln(); + let y = -(1.0 - f).ln(); + sxx += x * x; + sxy += x * y; + } + sxy / sxx +} + +fn main() { + let cfg = parse(); + let (corpus, n, dim) = load_corpus(&cfg); + eprintln!("# twonn_id: n={n} dim={dim} source={}", + if cfg.corpus_npy.is_some() { "npy" } else if cfg.isotropic { "isotropic" } else { "clustered" }); + let mus = nn_ratios(&corpus, n, dim); + let d = twonn_fit(&mus); + println!("TwoNN intrinsic dimension estimate: {d:.2} (ambient dim {dim})"); + if cfg.corpus_npy.is_none() && !cfg.isotropic { + println!("(synthetic control: expect ~{} = latent_dim)", cfg.latent_dim); + } +} diff --git a/experiments/ordinal-routing-research/twonn_id_results.md b/experiments/ordinal-routing-research/twonn_id_results.md new file mode 100644 index 00000000..e15fd0bf --- /dev/null +++ b/experiments/ordinal-routing-research/twonn_id_results.md @@ -0,0 +1,69 @@ +> ⚠️ PARTIAL (adversarial review). The chord-vs-cosine METRIC FIX is correct and +> sphere-validated — that stands. BUT the estimator uses OLS-through-origin on the +> linearized CDF (F=(i+1)/N), the biased non-MLE variant; some of the "finite- +> sample deflation" attributed to sampling is actually estimator bias. Treat the +> low-tens ID as indicative, and prefer the MLE TwoNN estimator before quoting +> exact values. See ADVERSARIAL_REVIEW.md. + +# TwoNN intrinsic dimension probe + +TwoNN estimator (Facco et al. 2017) for measuring the intrinsic dimension of an +embedding corpus directly. The aim was to close the citation gap (no published +TwoNN figure for sentence-transformers) — but this branch records only the +synthetic sphere/cluster validations below; **no clean real sentence-transformer +ID is measured here** (the estimator is biased; see the caveat above). There is +no published TwoNN figure for sentence-transformers; deep vision-CNN intrinsic +dimension is low-tens (Ansuini, arXiv:1905.12784) but that is a cross-domain +analogy, not sentence-transformer evidence — so this branch establishes no +sentence-transformer ID. + +Source: `examples/twonn_id.rs`. μ = r2/r1 (2nd/1st NN distance ratio), Pareto +fit d = Σ(log μ · -log(1-F)) / Σ(log μ)², top-10% tail trimmed. Anchors sampled +(≤3000), each searched exactly against the full corpus. + +## CRITICAL: metric must be locally LINEAR in distance + +First implementation used cosine distance (1 - cos θ ≈ θ²/2), a *squared* +distance — this HALVED every estimate (μ squared → d halved through log μ). +Fixed to chord/Euclidean distance between unit vectors: sqrt(2 - 2cos) ∝ +sin(θ/2), locally linear in the angle. Anyone adapting this for cosine spaces +must use the chord metric, not cosine distance. + +## Validation (isotropic sphere, true ID = ambient - 1) + +| ambient dim | true ID | TwoNN (chord) | TwoNN (cosine bug) | +|-------------|---------|---------------|--------------------| +| 3 | 2 | 1.99 | 0.99 | +| 5 | 4 | 4.19 | 2.09 | +| 8 | 7 | 6.99 | 3.49 | +| 12 | 11 | 10.41 | 5.21 | +| 20 | 19 | 16.89 | — | +| 50 | 49 | 35.18 | — | + +Verdict: **calibrated and trustworthy through ID ~12** (dense sampling at +n=20k). Above that, finite-sample deflation sets in (ID=20→16.9, ID=50→35) — +the curse of dimensionality floor on ALL NN-based ID estimators: ~exp(d) points +needed to populate a d-dim neighborhood. Reported IDs in the tens are therefore +LOWER BOUNDS; true ID may be higher. (The same finite-sample deflation would +also affect Ansuini's 12-25 vision-CNN last-layer figures — a methodological +note, not evidence those figures transfer to sentence-transformers.) + +## Measurements + +| corpus | n | ambient | TwoNN ID | +|--------|---|---------|----------| +| synthetic clustered (latent_dim=64) | 20k | 256 | 27.5 | +| isotropic dim=256 | 10k | 256 | (sampling-limited) | + +The synthetic clustered corpus (built from a 64-D latent manifold) reads ID≈27 +— deflated from 64 by finite-sample + the within-cluster local geometry being +lower-dim than the global latent space. + +## For real embeddings + +Run: `cargo run --release --example twonn_id -- --corpus-npy your_embeddings.npy` +Loads 2-D lower entropy AND breaks +//! constant-composition -> destroys the closed-form null. +//! So golden ratio is not merely unhelpful on the rank domain: it would remove +//! the property ordvec's headline theorem stands on. Equal-width is FORCED. +//! +//! MEASURED on real ranks (corpus rank codes), three boundary schemes for B bits: +//! equal-width : cuts at k*D/2^B (uniform) +//! golden : cuts at sorted {frac(i*phi)}*D, i=1..2^B-1 (three-gap/low-disc) +//! random : cuts at 2^B-1 sorted uniform random points (control) +//! Metrics: per-bucket composition std across docs (0 == constant-composition); +//! code entropy in bits (ceiling = B); hypergeometric-null error +//! |observed mean top-bucket overlap - n_top^2/D| / (n_top^2/D). +//! +//! PRE-REGISTERED VERDICT (fixed before running): +//! LEMMA HOLDS if, for B in {1,2,4}: +//! equal-width composition-std ≈ 0 (< 0.01 coords) AND entropy ≈ B (>= B-0.01) +//! AND hypergeom-null error < 0.02; AND golden/random are strictly worse on +//! composition-std and entropy. Else the lemma as stated is wrong — report it. +//! +//! Run: cargo run --release --example uniformity_lemma -- --corpus-npy /tmp/repo_corpus.npy + +use rand::{RngExt, SeedableRng}; +use rand_chacha::ChaCha8Rng; +use rayon::prelude::*; + +const SEED: u64 = 1; + +fn load_npy_f32(path: &str) -> (Vec, usize, usize) { + let bytes = std::fs::read(path).expect("read npy"); + assert!( + bytes.len() >= 10 && &bytes[..6] == b"\x93NUMPY", + "not a numpy file" + ); + let major = bytes[6]; + assert!( + major == 1 || major == 2, + "unsupported numpy .npy major version {major}" + ); + if major == 2 { + assert!(bytes.len() >= 12, "truncated numpy v2 header"); + } + let (hlen, hstart) = if major == 1 { + (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10) + } else { + ( + u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, + 12, + ) + }; + assert!(hstart + hlen <= bytes.len(), "truncated numpy header"); + let header = std::str::from_utf8(&bytes[hstart..hstart + hlen]).expect("utf8 header"); + assert!(header.contains("'descr': ' = inner + .split(',') + .filter_map(|s| s.trim().parse().ok()) + .collect(); + let (n, dim) = (dims[0], dims[1]); + let dstart = hstart + hlen; + let mut out = vec![0.0f32; n * dim]; + for (i, ch) in bytes[dstart..].chunks_exact(4).enumerate() { + out[i] = f32::from_le_bytes([ch[0], ch[1], ch[2], ch[3]]); + } + (out, n, dim) +} + +/// rank[i] = ordinal position (0..dim) of coordinate i within the row. +fn rank_encode(rows: &[f32], n: usize, dim: usize) -> Vec { + let mut out = vec![0u16; n * dim]; + out.par_chunks_mut(dim).enumerate().for_each(|(r, code)| { + let row = &rows[r * dim..(r + 1) * dim]; + let mut idx: Vec = (0..dim as u16).collect(); + idx.sort_by(|&a, &b| row[a as usize].partial_cmp(&row[b as usize]).unwrap()); + for (rank, &i) in idx.iter().enumerate() { + code[i as usize] = rank as u16; + } + }); + out +} + +/// Boundary cut points in rank units [0, dim], length 2^B - 1, ascending. +fn boundaries(scheme: &str, bits: u32, dim: usize) -> Vec { + let nb = 1usize << bits; + match scheme { + "equal" => (1..nb).map(|k| k as f64 * dim as f64 / nb as f64).collect(), + "golden" => { + let phi = (1.0 + 5.0f64.sqrt()) / 2.0; + let mut cuts: Vec = (1..nb) + .map(|i| ((i as f64 * phi).fract()) * dim as f64) + .collect(); + cuts.sort_by(|a, b| a.partial_cmp(b).unwrap()); + cuts + } + "random" => { + let mut rng = ChaCha8Rng::seed_from_u64(SEED ^ 0xBEEF); + let mut cuts: Vec = (1..nb) + .map(|_| rng.random_range(0.0..1.0) * dim as f64) + .collect(); + cuts.sort_by(|a, b| a.partial_cmp(b).unwrap()); + cuts + } + _ => unreachable!(), + } +} + +fn bucket_of(rank: u16, cuts: &[f64]) -> usize { + let r = rank as f64 + 0.5; + cuts.iter().filter(|&&c| r > c).count() +} + +fn main() { + let mut corpus_path = None; + let mut a = std::env::args().skip(1); + while let Some(x) = a.next() { + if x == "--corpus-npy" { + corpus_path = Some(a.next().unwrap()); + } + } + let path = corpus_path.expect("--corpus-npy required"); + let (corpus, n, dim) = load_npy_f32(&path); + let ranks = rank_encode(&corpus, n, dim); + let row = |i: usize| &ranks[i * dim..(i + 1) * dim]; + + println!("# uniformity_lemma (CONTEXT-BOXED, pre-registered). corpus {n}x{dim}"); + println!("# per-bucket composition std across docs (0=constant-composition), code entropy (bits), hypergeom-null err"); + println!("scheme\tB\tcomp_std\tentropy\tnull_err"); + + let mut rng_pairs = ChaCha8Rng::seed_from_u64(SEED); + let pairs: Vec<(usize, usize)> = (0..50_000) + .map(|_| { + let i = rng_pairs.random_range(0..n); + let mut j = rng_pairs.random_range(0..n); + if j == i { + j = (j + 1) % n; + } + (i, j) + }) + .collect(); + + for &bits in &[1u32, 2, 4] { + let nb = 1usize << bits; + for scheme in ["equal", "golden", "random"] { + let cuts = boundaries(scheme, bits, dim); + + // composition: count per bucket per doc; std of counts across docs, per bucket, then mean. + let comps: Vec> = (0..n) + .into_par_iter() + .map(|d| { + let mut c = vec![0f64; nb]; + for &rk in row(d) { + c[bucket_of(rk, &cuts)] += 1.0; + } + c + }) + .collect(); + let mut comp_std_sum = 0.0; + for b in 0..nb { + let vals: Vec = comps.iter().map(|c| c[b]).collect(); + let m = vals.iter().sum::() / n as f64; + let v = vals.iter().map(|x| (x - m) * (x - m)).sum::() / n as f64; + comp_std_sum += v.sqrt(); + } + let comp_std = comp_std_sum / nb as f64; + + // code entropy: bucket occupancy probabilities (same for every doc under + // constant-composition); use the mean histogram normalized. + let mut hist = vec![0f64; nb]; + for c in &comps { + for b in 0..nb { + hist[b] += c[b]; + } + } + let tot: f64 = hist.iter().sum(); + let entropy: f64 = hist + .iter() + .filter(|&&h| h > 0.0) + .map(|&h| { + let p = h / tot; + -p * p.log2() + }) + .sum(); + + // hypergeometric null: top bucket = highest bucket id; overlap of top sets. + let n_top_counts: Vec = comps.iter().map(|c| c[nb - 1]).collect(); + let mean_ntop = n_top_counts.iter().sum::() / n as f64; + let top_sets: Vec> = (0..n) + .into_par_iter() + .map(|d| { + row(d) + .iter() + .enumerate() + .filter(|(_, &rk)| bucket_of(rk, &cuts) == nb - 1) + .map(|(i, _)| i as u16) + .collect() + }) + .collect(); + let obs_overlap: f64 = pairs + .par_iter() + .map(|&(i, j)| top_sets[i].intersection(&top_sets[j]).count() as f64) + .sum::() + / pairs.len() as f64; + let expected = mean_ntop * mean_ntop / dim as f64; + let null_err = if expected > 0.0 { + (obs_overlap - expected).abs() / expected + } else { + f64::NAN + }; + + println!("{scheme}\t{bits}\t{comp_std:.4}\t{entropy:.4}\t{null_err:.4}"); + } + } + println!("\n# LEMMA HOLDS iff equal-width: comp_std<0.01, entropy>=B-0.01, null_err<0.02; golden/random strictly worse on comp_std & entropy."); +} diff --git a/experiments/ordinal-routing-research/withdrawn/README.md b/experiments/ordinal-routing-research/withdrawn/README.md new file mode 100644 index 00000000..374b0515 --- /dev/null +++ b/experiments/ordinal-routing-research/withdrawn/README.md @@ -0,0 +1,16 @@ +# Withdrawn results + +These findings did NOT survive review and are kept only for the record — do not +cite them as results. See [../ADVERSARIAL_REVIEW.md](../ADVERSARIAL_REVIEW.md) +(Round 1) and [../README.md](../README.md) for the tiering. + +- **spectral_probe_results.md** / **corpus_zoo_results.md** — the number-variance + "embedding keys are super-Poisson, never rigid" finding. The probe's unfold is + uncalibrated: a smooth-unfold salvage attempt INVERTED the isotropic/clustered + ordering, proving neither unfold is validated against analytic ground truth. + The rigidity THEORY (../rigidity_impossibility_proofs.md) does not depend on + this probe and is unaffected. + +The `examples/spectral_probe.rs` instrument itself is retained (its +`--rigid-selftest` is a valid estimator check); only the corpus *findings* are +withdrawn. diff --git a/experiments/ordinal-routing-research/withdrawn/corpus_zoo_results.md b/experiments/ordinal-routing-research/withdrawn/corpus_zoo_results.md new file mode 100644 index 00000000..c18404ee --- /dev/null +++ b/experiments/ordinal-routing-research/withdrawn/corpus_zoo_results.md @@ -0,0 +1,66 @@ +> ⛔ WITHDRAWN. This zoo runs the number-variance probe whose unfold is +> uncalibrated (see spectral_probe_results.md banner and ADVERSARIAL_REVIEW.md). +> The smooth-unfold salvage inverted the isotropic/clustered ordering, so the +> "robust across geometries" claim is unsupported. Retained as record only. The +> gen_corpus.rs zoo GENERATOR itself is fine and reusable; the measurement on top +> is what's withdrawn. + +# Corpus zoo: number-variance robustness across geometries + +Rigorous multi-corpus verification of the "embedding routing keys are not rigid" +finding. Generates 6 geometries as .npy (`examples/gen_corpus.rs`) and runs the +number-variance probe on each. Includes instrument controls at BOTH ends. + +n=50k, dim=256, 8 random-projection keys, Gaussian unfold. + +## Σ²(L)/L (≈1 Poisson · >1 clustered · <1 RIGID) + +| kind | L=8 | L=32 | L=128 | L=512 | reading | +|-------------|-------|-------|-------|--------|---------| +| isotropic | 0.99 | 0.99 | 0.99 | 0.93 | Poisson NULL ✓ | +| clustered | 1.34 | 2.39 | 6.13 | 18.5 | super-Poisson | +| anisotropic | 2.38 | 6.48 | 22.7 | 85.0 | super-Poisson | +| rogue | 1.60 | 3.34 | 10.0 | 34.1 | super-Poisson | +| manifold | 2.93 | 8.73 | 30.3 | 96.4 | super-Poisson | + +## Instrument controls (CRITICAL — proves the probe isn't blind) + +| control | result | proves | +|---------|--------|--------| +| `--rigid-selftest` (perfect lattice key) | Σ²/L = 0.0000 all L | probe DETECTS rigidity ✓ | +| isotropic corpus | Σ²/L = 0.99 flat | probe DETECTS Poisson ✓ | + +## Findings + +1. **Robust across 5 geometries: embedding-like keys are super-Poisson + (clustered), never rigid.** Crucially this includes the NONLINEAR MANIFOLD + (2.9→96) — the one geometry where angular rigidity could plausibly have + hidden. It didn't. +2. **The probe is validated at both ends.** Perfect lattice → Σ²=0 (sees + rigidity); isotropic → Σ²/L≈1 (sees Poisson). So the super-Poisson readings + are real, not instrument blindness. This was the gap that would have let a + skeptic dismiss the whole spectral finding; it is now closed. +3. **Bonus substantive result (random-projection lattice control):** a sequence + that is rigid along ONE direction does NOT produce a rigid key after random + projection + L2 normalization (tested: still super-Poisson). I.e. even if an + embedding had rigid structure on a special axis, a data-oblivious + random-projection router would not see it — reinforcing that spectral + structure is not exploitable training-free. (The instrument self-test + isolates the estimator from this pipeline effect.) + +## Verdict + +The "keys are non-rigid → quantile bucketing is the whole story, no spectral +opening" conclusion is now verified across diverse corpus geometries with +validated instrument controls, not a single synthetic distribution. A FALLING +Σ²/L on a real embedding dump would still be the surprise worth chasing — run +the zoo recipe on real data via the runbook. + +Reproduce: +``` +for K in isotropic clustered anisotropic rogue manifold; do + cargo run --release --example gen_corpus -- --kind $K --n 50000 --dim 256 --out zoo_$K.npy + cargo run --release --example spectral_probe -- --corpus-npy zoo_$K.npy +done +cargo run --release --example spectral_probe -- --rigid-selftest # instrument control +``` diff --git a/experiments/ordinal-routing-research/withdrawn/spectral_probe_results.md b/experiments/ordinal-routing-research/withdrawn/spectral_probe_results.md new file mode 100644 index 00000000..68a678a4 --- /dev/null +++ b/experiments/ordinal-routing-research/withdrawn/spectral_probe_results.md @@ -0,0 +1,54 @@ +> ⛔ WITHDRAWN — PROBE DOES NOT MEASURE WHAT IT CLAIMS. The salvage attempt +> (`--unfold-smooth K`) INVERTED the result: smooth unfold makes isotropic read +> super-Poisson (Σ²/L 1.7→12) and clustered read LOWER. The two unfolds disagree +> and neither is validated against analytic ground truth; the Gaussian "isotropic +> = 0.99 Poisson" control was just the one marginal that unfold happened to fit. +> The numbers below are RETAINED ONLY as the record of the failure. Fixing needs +> the estimator calibrated against a process with known closed-form Σ²(L). The +> theory (rigidity_impossibility_proofs.md) does NOT depend on this probe. Full +> account: ../ADVERSARIAL_REVIEW.md. + +# Number-variance probe: is a 1-D routing key rigid or Poisson? + +Experiment for the "prime mile-marker / spectral index" conjecture. The +conjecture needs the routing key to exhibit **rigidity** (sub-linear number +variance, Σ²(L) ~ log L, like GUE eigenvalue / zeta-zero spectra) for any +spectral/prime-gap partition structure to beat plain quantile bucketing. + +Source: `examples/spectral_probe.rs`. Number variance Σ²(L) = variance of the +count of keys in a window of length L, after unfolding to unit mean density. +Poisson ⇒ Σ²(L) = L (ratio 1, flat). Rigid ⇒ Σ²(L) ~ log L (ratio → 0). + +## Results (synthetic, n=200k, dim=256, 8 random-projection keys, seed=1) + +| L | clustered Σ²/L | isotropic Σ²/L (control) | quantile-unfold Σ² | +|---|---|---|---| +| 2 | 1.28 | 1.00 | 0.0000 | +| 8 | 3.05 | 1.00 | 0.0000 | +| 32 | 3.26 | 1.00 | 0.0000 | +| 128 | 5.94 | 1.01 | 0.0000 | +| 512 | 17.93 | 1.01 | 0.0000 | + +## Verdict + +1. **Isotropic control = exact Poisson (ratio 1.00, flat).** Validates the + probe: it reports "no structure" when there is none. The clustered signal is + therefore real, not an unfold artifact. +2. **Clustered key is SUPER-Poissonian (ratio climbs 1.3→18, Σ² ≈ 0.037·L²).** + Clustering, the *opposite* of the rigidity the conjecture requires. There is + no sub-linear (log-L) regime for spectral/prime structure to grip. +3. **Quantile unfold = Σ² 0.0000 exactly.** Quantile (inverse-CDF) tiling + balances the key perfectly by construction — empirical confirmation that it + strictly dominates any fixed-density tiling (including Li(x)) on occupancy. + +Conclusion: on this corpus the scalar routing key is Poisson-to-clustered, never +rigid. Quantile bucketing is the whole story for occupancy balance. The opening +the conjecture needed (rigidity) is absent. Re-run on real embeddings with +`--corpus-npy` to confirm; prior is strongly that real keys are also non-rigid. + +Reproduce: +``` +cargo run --release --example spectral_probe # clustered +cargo run --release --example spectral_probe -- --isotropic # Poisson control +cargo run --release --example spectral_probe -- --unfold-empirical # quantile +```