23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -9,10 +9,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Security

- Hardened the Python binding's GIL-released search, candidate, scoring, and
`add` paths: NumPy inputs are now copied into Rust-owned buffers before
`py.detach`, so safe Python code cannot race a detached Rust read by mutating
the same array from another thread. This intentionally trades zero-copy
detached reads for race-free copied inputs; large calls may temporarily require
an additional input-sized buffer.
- Updated release governance to document and audit the two-approver
`crates-io` / `pypi` GitHub Environment gates: `Fieldnote-Echo` and
`toadkicker` are listed as required reviewers, self-review is blocked, and a
30-minute wait timer applies before registry publish jobs can proceed.
- Exposed the calibration-profile byte limit through the `ordvec-manifest`
Python bindings, including the default constant, `default_resource_limits()`,
and verifier/create keyword arguments.
- Aligned `.ovfs` / `OVFS` security and provenance docs with the now-public
`RankQuantFastscan` persistence loader and fuzz target.
- Updated formalization links and release invariants after the companion
`ordvec-formalization` repository moved under `Project-Navi`.

### Fixed

- Added a persisted-format registry that drives probe, manifest-coverage, and
C-ABI load decisions from one table; `.ovfs` now remains explicitly
known-but-not-probeable/not-manifest-covered, and the C ABI reports it as an
unsupported format rather than a corrupt index.
- Hid the `SubsetScratch::capacities_for_test` helper behind the non-default
`test-utils` feature and cleaned stale release-doc comments around FastScan
and b=8 bucket rustdoc.

## 0.5.0 - 2026-06-19

6 changes: 3 additions & 3 deletions Cargo.toml
@@ -58,9 +58,9 @@ exclude = [
"tests/release_signed_release_invariants.sh",
]

# docs.rs build configuration: build with default features only, so the
# experimental MultiBucketBitmap scaffold stays off the published docs.
# (The `#[doc(hidden)]` FastScan path is hidden by its attribute either way.)
# docs.rs build configuration: build with default features only. Stable default
# APIs, including `RankQuantFastscan`, are documented; the experimental
# MultiBucketBitmap scaffold stays off the published docs.
[package.metadata.docs.rs]
all-features = false

17 changes: 9 additions & 8 deletions README.md
@@ -170,7 +170,7 @@ are machine-checked in Lean 4, both `sorry`-free on Lean's standard axiom base
signal model makes an overlap-count threshold Bayes-optimal among
deterministic admission rules, and the uniform constant-weight bitmap null
assigns that same threshold event exactly the hypergeometric upper tail — in
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization)
[`ordvec-formalization`](https://github.com/Project-Navi/ordvec-formalization)
(theorem `exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`).

This is an *in-model* result. It proves the rule shape and the idealized finite
@@ -276,9 +276,10 @@ The runtime dependency floor is `numpy>=2.2`.
The consolidated cross-language ownership and lifetime contract is in
[`docs/bindings-safety.md`](docs/bindings-safety.md).

Python search, candidate-generation, and scoring methods release the GIL and
read NumPy inputs in place. Callers must not mutate query, corpus, candidate,
or scoring input arrays passed to those methods until the call returns.
Python search, candidate-generation, scoring, and `add` methods release the GIL
after copying NumPy inputs into Rust-owned buffers, so ordinary Python in-place
array mutation in another thread cannot race the detached Rust scan. Large calls
may temporarily require an additional input-sized buffer.

The C ABI allows concurrent search and info calls on one loaded handle.
`ordvec_index_free` must not race with any other call on the same handle.
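
Reviewer sketch of the copy-before-release behaviour described above for the Python bindings. This is not the binding's code: the `Mutex` is only a stand-in for the GIL, and `SharedArray`, `snapshot`, `scan`, and `demo` are hypothetical names. It shows why a scan over a Rust-owned copy is unaffected by a concurrent writer, at the cost of one extra input-sized buffer.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a caller-owned NumPy array. The mutex plays the
/// role of the GIL: "Python-side" code only touches the data while holding it.
type SharedArray = Arc<Mutex<Vec<f32>>>;

/// Copy the input while the lock (GIL analogue) is held. The returned buffer
/// is owned by the caller of `snapshot` and no longer aliases shared memory.
fn snapshot(input: &SharedArray) -> Vec<f32> {
    let owned: Vec<f32> = input.lock().unwrap().clone();
    owned // the guard was dropped at the end of the `let` statement
}

/// Stand-in for the heavy detached work: it only ever sees an owned slice.
fn scan(values: &[f32]) -> f32 {
    values.iter().sum()
}

/// A writer mutates the shared array while the scan runs on its snapshot.
/// Returns (sum seen by the detached scan, sum of the array after the write).
fn demo() -> (f32, f32) {
    let shared: SharedArray = Arc::new(Mutex::new(vec![1.0_f32; 4]));
    let owned = snapshot(&shared);

    let writer = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            let mut guard = shared.lock().unwrap();
            guard.iter_mut().for_each(|x| *x = 100.0);
        })
    };

    // May overlap with the writer; the result depends only on the snapshot.
    let detached = scan(&owned);
    writer.join().unwrap();

    let after = scan(&shared.lock().unwrap());
    (detached, after)
}
```

Calling `demo()` yields the pre-mutation sum for the detached scan regardless of how the writer thread is scheduled, which is the property the changelog entry trades zero-copy reads for.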
@@ -310,10 +311,10 @@ candidate slices passed to `Search` until the call returns.
[`docs/compatibility-policy.md`](docs/compatibility-policy.md) defines the
stable, experimental, repo-local sidecar, persisted-format, examples/docs,
MSRV, and release-note review surfaces.
- **Formal proof spine:** [`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization),
including its [`proof-spine`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/proof-spine.md),
[`theorem-map`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/theorem-map.md),
and [`reviewer brief`](https://github.com/Fieldnote-Echo/ordvec-formalization/blob/main/docs/reviewer-brief.md).
- **Formal proof spine:** [`ordvec-formalization`](https://github.com/Project-Navi/ordvec-formalization),
including its [`proof-spine`](https://github.com/Project-Navi/ordvec-formalization/blob/main/docs/proof-spine.md),
[`theorem-map`](https://github.com/Project-Navi/ordvec-formalization/blob/main/docs/theorem-map.md),
and [`reviewer brief`](https://github.com/Project-Navi/ordvec-formalization/blob/main/docs/reviewer-brief.md).
- **API docs:** <https://docs.rs/ordvec>, <https://docs.rs/ordvec-manifest>
- **Paper (OrdVec / RankQuant):** _link TBD — see
[Research collaboration](#research-collaboration)._
12 changes: 6 additions & 6 deletions SECURITY.md
@@ -16,12 +16,12 @@ Use GitHub's private vulnerability reporting:
We aim to acknowledge reports within a few business days.

`ordvec` parses serialized index files (`.ovr` / `.ovrq` / `.ovbm` /
`.ovsb`; the loaders also accept the legacy `.tvr` / `.tvrq` / `.tvbm` /
`.tvsb` magics); the loaders are fuzzed (`cargo +nightly fuzz`), so
parsing-robustness reports against the deserialization paths are especially
welcome. Reports are also welcome against the `unsafe` SIMD kernels (shape /
bounds invariants), the Python FFI contract (buffer handling, GIL discipline),
and the release pipeline.
`.ovsb` / `.ovfs`; `.ovfs` uses `OVFS` FastScan magic, and the other loaders
also accept the legacy `.tvr` / `.tvrq` / `.tvbm` / `.tvsb` magics); the loaders
are fuzzed (`cargo +nightly fuzz`), so parsing-robustness reports against the
deserialization paths are especially welcome. Reports are also welcome against
the `unsafe` SIMD kernels (shape / bounds invariants), the Python FFI contract
(buffer handling, GIL discipline), and the release pipeline.

## Threat model

18 changes: 10 additions & 8 deletions THREAT_MODEL.md
@@ -49,14 +49,16 @@ See also: [`SECURITY.md`](SECURITY.md) (reporting), [`RELEASING.md`](RELEASING.m

## Maintenance budget

`ordvec` is maintained by a single primary contributor. Mitigations are
prioritized when they are (1) low-maintenance once merged, (2) enforceable by
tests or CI, (3) local to the library boundary, and (4) unlikely to add
operational burden downstream. Heavyweight controls (mandatory index signing,
long-running fuzz farms, service-level admission control) are documented as
**deployment guidance** until there is maintainer capacity to own them. The
absence of a second maintainer is itself a tracked supply-chain residual
(see THREAT-SUPPLY-001).
`ordvec` has one project lead plus an additional maintainer / release
approver. Mitigations are prioritized when they are (1) low-maintenance once
merged, (2) enforceable by tests or CI, (3) local to the library boundary, and
(4) unlikely to add operational burden downstream. Heavyweight controls
(mandatory index signing, long-running fuzz farms, service-level admission
control) are documented as **deployment guidance** unless the project has
maintainer capacity to own them. Release publication requires a non-triggering
approver through protected GitHub Environments; the residual release
supply-chain risk is approver account compromise / collusion, not a
single-owner project structure (see THREAT-SUPPLY-001).

---

9 changes: 6 additions & 3 deletions docs/INDEX_PROVENANCE.md
@@ -25,10 +25,13 @@ files without panicking, aborting, or silently accepting garbage:
- an exact file-length match (trailing bytes or short files are rejected);
- per-row **structural** invariants: `Rank` rows must be a true permutation of
`[0, dim)`, `RankQuant` rows must satisfy constant composition, `Bitmap` rows
must have exactly `n_top` bits set.
must have exactly `n_top` bits set, and direct `RankQuantFastscan` `.ovfs`
rows must use valid FastScan nibbles, satisfy b=2 constant composition, and
have zero block-tail padding.

A file that survives all of this is **structurally well-formed**. The four
loaders are exercised by `cargo fuzz` (the `load_*` targets).
A file that survives all of this is **structurally well-formed**. The five
loaders are exercised by `cargo fuzz` (the `load_*` targets, including
`load_fastscan` for `.ovfs`).
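
Reviewer sketch of two of the per-row structural invariants listed above: the `Rank` permutation check and the `Bitmap` exact-popcount check. The element types (`u32` ranks, `u64` bitmap words) and the function names are assumptions made for illustration; the real loaders' on-disk layout is not shown in this diff.

```rust
/// True when `row` is a permutation of `[0, dim)`: correct length, every
/// value in range, and no value repeated.
fn is_permutation(row: &[u32], dim: usize) -> bool {
    if row.len() != dim {
        return false;
    }
    let mut seen = vec![false; dim];
    for &rank in row {
        let rank = rank as usize;
        if rank >= dim || seen[rank] {
            return false; // out of range or duplicate
        }
        seen[rank] = true;
    }
    true
}

/// True when the bitmap row has exactly `n_top` bits set across all words.
fn has_exact_popcount(words: &[u64], n_top: u32) -> bool {
    words.iter().map(|w| w.count_ones()).sum::<u32>() == n_top
}
```

Both checks are linear in the row size and reject malformed rows without panicking, which matches the "reject, do not abort" posture the section describes.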

## What the loaders do NOT validate

2 changes: 1 addition & 1 deletion docs/RANK_MODES.md
@@ -130,7 +130,7 @@ unknown embedding distribution.
**Checked finite model: symmetry, quotient sufficiency, threshold,
calibration.** The proof chain now has a larger machine-checked middle
than the implementation docs used to claim. In
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization),
[`ordvec-formalization`](https://github.com/Project-Navi/ordvec-formalization),
Lean proves that literal bitmap overlap is the canonical invariant
under query-preserving coordinate relabelings; finite quotient
sufficiency reduces the admission decision to ordered overlap
15 changes: 9 additions & 6 deletions docs/bindings-safety.md
@@ -13,8 +13,9 @@ still own scheduling, path trust, input mutability, and deployment provenance.
Mutation methods such as `add` require exclusive access.
- Python search, candidate-generation, scoring, and `add` methods release the
GIL while Rust performs the heavy work. PyO3 still enforces object borrow
rules, but caller-owned NumPy arrays are read in place while the GIL is
released.
rules, and the binding copies NumPy input arrays into Rust-owned buffers
before releasing the GIL. Large calls may temporarily require an additional
input-sized buffer.
- The C ABI permits concurrent `ordvec_index_search`,
`ordvec_index_probe`, and `ordvec_index_info` calls on one loaded handle.
`ordvec_index_free` must not race with any other call on that handle.
@@ -23,11 +24,13 @@ still own scheduling, path trust, input mutability, and deployment provenance.

## Borrowed Inputs

Caller-provided buffers are borrowed for the duration of the call and are not
retained after the function returns.
Caller-provided Rust slices, C buffers, and Go slices are borrowed for the
duration of the call and are not retained after the function returns. Python
NumPy inputs that cross a GIL-released call are copied before the GIL is
released.

- Do not mutate Rust slices, NumPy arrays, C buffers, or Go slices while a call
that received them is in progress.
- Do not mutate Rust slices, C buffers, or Go slices while a call that received
them is in progress.
- Query, corpus, candidate, output, hit, and stats buffers remain caller-owned
unless a specific API says otherwise.
- Candidate lists are entry lists, not sets. Duplicate candidate IDs are scored
2 changes: 1 addition & 1 deletion experiments/ordinal-routing-research/README.md
@@ -34,7 +34,7 @@ tiered below by **what survived scrutiny**. Read the tiers, not every doc.
|-----|-------|
| [density_collapse_results.md](density_collapse_results.md) | **Mechanism.** RankQuant b=2 density collapse = Hamming-near codes the scorer can't separate. Among those lookalikes, true neighbours have lower intra-code Kendall-tau (gap ≈ 0.04, CI > 0). Real but small. |
| [tau_rerank_bakeoff_results.md](tau_rerank_bakeoff_results.md) | **The verdict.** Does that tau signal beat b=4? NO — b=4 wins even at the tau ceiling; tau scores below b=2's own ordering. Signal is real-but-inert; just use b=4. Closes the line: research, not a feature. |
| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17) (open PR, `sorry`-free). |
| [crt_seam_oracle_results.md](crt_seam_oracle_results.md) | CRT vernier seam theorem — exhaustive finite proof: lcm spacing, one coincidence/period, capped density `∏min(2t+1,m_i)/m_i`. Lean 4 formalization lives in the companion repo: [ordvec-formalization#17](https://github.com/Project-Navi/ordvec-formalization/pull/17) (open PR, `sorry`-free). |
| [shard_recall_results.md](shard_recall_results.md) | Controlled ablation (post RNG-desync fix): random phase offsets add nothing vs aligned grids across R random directions. |
| [oblivious_directions_results.md](oblivious_directions_results.md) | **The directions arc (round 2).** Data-oblivious low-discrepancy directions (golden-angle / Sobol / Kronecker) do NOT beat iid-random for training-free routing — across 5 encoders (nomic, bge-m3, bge-large, snowflake-arctic-v2, harrier-oss) at real intrinsic dim 18–24. CLASS-DEAD, pre-registered, replicated (the one mid-ladder flicker failed to replicate). Centering removes the cone but fails at b=4 (penalty grows with capacity). One robust positive: data-aligned (PCA) directions lead at higher ID — the lever is data-alignment, which training-free forbids. Also **resolves the twonn_id PARTIAL**: real-corpus ID measured at ~18–24 across 5 encoders, and ID is a **corpus** property (repo≈13 vs fiqa≈24, same encoder), not an encoder constant. Probes: `uniformity_lemma.rs`, `overlap_decomp.rs`, `centering_recall.rs`, `subspace_directions.rs`, `partition_balance.rs`, `fib_*.rs`. |
| [length_mixture_lake_results.md](length_mixture_lake_results.md) | **Path B — chunk-length-mixture lake (closes the synthetic-lake arc).** Same fiqa docs embedded at 4 chunk lengths {128,256,512,1100} unioned into a 230k-doc lake; b=4 raw R@10 vs FP32 cosine is **immune** (+0.002, CR@100=1.0). Bonus measurement of the "chunk length is a third geometry axis" claim: real but **small and co-axial** — R̄ spreads only 0.705→0.723 over an 8.6× length range, cone axes ≥0.986 aligned (not the distinct geometries the mixture framing imagined). With Phase B (multi-domain) this leaves every synthetic lake pathology — multi-cone, hub, multi-length — benign for "spend the bits, b=4." Probe: `make_length_lake.py` + `centering_recall.rs`. |
@@ -1,7 +1,7 @@
# CRT seam oracle — corrected vernier theorem (exhaustive finite proof)

> Lean 4 formalization of this theorem lives in the companion repo:
> [ordvec-formalization#17](https://github.com/Fieldnote-Echo/ordvec-formalization/pull/17)
> [ordvec-formalization#17](https://github.com/Project-Navi/ordvec-formalization/pull/17)
> (open PR, `sorry`-free).

`examples/crt_seam_oracle.rs` enumerates the full ring Z/M to verify the
4 changes: 4 additions & 0 deletions ordvec-ffi/include/ordvec.h
@@ -234,6 +234,10 @@ void ordvec_index_free(ordvec_index_t *index);
* and may be unsorted or duplicated. Duplicate candidates are scored as
* separate entries and can produce duplicate hits; callers that need unique
* output rows must deduplicate before calling.
* Full search is represented by `candidate_count == 0 && candidate_rows == NULL`.
* ABI v1 treats `candidate_count == 0 && candidate_rows != NULL` as
* `ORDVEC_STATUS_BAD_ARGUMENT`; callers should short-circuit explicit empty
* survivor sets before crossing the ABI.
*
* # Safety
*
59 changes: 41 additions & 18 deletions ordvec-ffi/src/lib.rs
@@ -10,7 +10,10 @@ use std::path::Path;
use std::ptr;
use std::time::Instant;

use ordvec::{probe_index_metadata, Bitmap, IndexKind, IndexMetadata, IndexParams, RankQuant};
use ordvec::{
probe_index_metadata, Bitmap, FfiLoadSupport, IndexKind, IndexMetadata, IndexParams,
PersistedFormat, RankQuant,
};

pub type ordvec_status_t = u32;
pub type ordvec_index_kind_t = u32;
@@ -733,25 +736,29 @@ pub unsafe extern "C" fn ordvec_index_load(
.map_err(|err| io_to_ffi(err, "stat index"))?
.len();

// Accept both the current `OV*` magics and the legacy turbovec-era
// `TV*` magics (back-compat) — mirrors the loaders in `rank_io.rs`.
let index = match &magic {
b"OVRQ" | b"TVRQ" => LoadedIndex::RankQuant(
RankQuant::load(path).map_err(|err| io_to_ffi(err, "load RankQuant index"))?,
),
b"OVBM" | b"TVBM" => LoadedIndex::Bitmap(
Bitmap::load(path).map_err(|err| io_to_ffi(err, "load Bitmap index"))?,
),
b"OVR1" | b"OVSB" | b"TVR1" | b"TVSB" => {
return Err(FfiError::new(
ORDVEC_STATUS_UNSUPPORTED_FORMAT,
"ABI v1 supports only RankQuant and Bitmap indexes",
))
let spec = ordvec::format::lookup_magic(&magic).ok_or_else(|| {
FfiError::new(
ORDVEC_STATUS_CORRUPT_INDEX,
"unrecognized ordvec index magic",
)
})?;
let index = match spec.ffi_load {
FfiLoadSupport::Supported => match spec.format {
PersistedFormat::RankQuant => LoadedIndex::RankQuant(
RankQuant::load(path).map_err(|err| io_to_ffi(err, "load RankQuant index"))?,
),
PersistedFormat::Bitmap => LoadedIndex::Bitmap(
Bitmap::load(path).map_err(|err| io_to_ffi(err, "load Bitmap index"))?,
),
_ => unreachable!("only RankQuant and Bitmap are FFI-loadable in ABI v1"),
},
FfiLoadSupport::Unsupported { reason } => {
return Err(FfiError::new(ORDVEC_STATUS_UNSUPPORTED_FORMAT, reason))
}
_ => {
return Err(FfiError::new(
ORDVEC_STATUS_CORRUPT_INDEX,
"unrecognized ordvec index magic",
ORDVEC_STATUS_UNSUPPORTED_FORMAT,
"ABI v1 does not support this persisted index format",
))
}
};
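
Reviewer sketch of the table-driven load decision in the hunk above. It is self-contained and does not use the `ordvec` crate: the `FormatSpec` layout, the `REGISTRY` contents, the `Rank` / `SignBitmap` / `Fastscan` variant names, the reason strings, and `classify` are assumptions for illustration. Only the magics and the three outcomes (loadable, unsupported format, corrupt index) come from the diff.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PersistedFormat {
    Rank,
    RankQuant,
    Bitmap,
    SignBitmap,
    Fastscan,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FfiLoadSupport {
    Supported,
    Unsupported { reason: &'static str },
}

/// One registry row: every magic that maps to a format, plus the C-ABI policy.
struct FormatSpec {
    magics: &'static [&'static [u8; 4]],
    format: PersistedFormat,
    ffi_load: FfiLoadSupport,
}

const ABI_V1_ONLY: FfiLoadSupport = FfiLoadSupport::Unsupported {
    reason: "ABI v1 supports only RankQuant and Bitmap indexes",
};

/// Hypothetical registry; current `OV*` and legacy `TV*` magics share a row.
static REGISTRY: &[FormatSpec] = &[
    FormatSpec { magics: &[b"OVRQ", b"TVRQ"], format: PersistedFormat::RankQuant, ffi_load: FfiLoadSupport::Supported },
    FormatSpec { magics: &[b"OVBM", b"TVBM"], format: PersistedFormat::Bitmap, ffi_load: FfiLoadSupport::Supported },
    FormatSpec { magics: &[b"OVR1", b"TVR1"], format: PersistedFormat::Rank, ffi_load: ABI_V1_ONLY },
    FormatSpec { magics: &[b"OVSB", b"TVSB"], format: PersistedFormat::SignBitmap, ffi_load: ABI_V1_ONLY },
    FormatSpec { magics: &[b"OVFS"], format: PersistedFormat::Fastscan, ffi_load: ABI_V1_ONLY },
];

fn lookup_magic(magic: &[u8; 4]) -> Option<&'static FormatSpec> {
    REGISTRY
        .iter()
        .find(|spec| spec.magics.iter().any(|m| *m == magic))
}

#[derive(Debug, PartialEq, Eq)]
enum LoadStatus {
    Loadable(PersistedFormat),
    UnsupportedFormat(&'static str),
    CorruptIndex,
}

/// Unknown magic is corrupt; known-but-unloadable magic is unsupported.
fn classify(magic: &[u8; 4]) -> LoadStatus {
    match lookup_magic(magic) {
        None => LoadStatus::CorruptIndex,
        Some(spec) => match spec.ffi_load {
            FfiLoadSupport::Supported => LoadStatus::Loadable(spec.format),
            FfiLoadSupport::Unsupported { reason } => LoadStatus::UnsupportedFormat(reason),
        },
    }
}
```

With a single table, adding a new magic such as `OVFS` changes the C-ABI status from corrupt to unsupported without touching the match arms, which appears to be the intent of the changelog's "one table" wording.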
@@ -894,6 +901,10 @@ pub unsafe extern "C" fn ordvec_index_free(index: *mut ordvec_index_t) {
/// and may be unsorted or duplicated. Duplicate candidates are scored as
/// separate entries and can produce duplicate hits; callers that need unique
/// output rows must deduplicate before calling.
/// Full search is represented by `candidate_count == 0 && candidate_rows == NULL`.
/// ABI v1 treats `candidate_count == 0 && candidate_rows != NULL` as
/// `ORDVEC_STATUS_BAD_ARGUMENT`; callers should short-circuit explicit empty
/// survivor sets before crossing the ABI.
///
/// # Safety
///
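
Reviewer sketch of a caller-side guard for the candidate contract documented above. `RowId = u64`, `CandidatePlan`, `plan_candidates`, and `raw_args` are hypothetical and not part of the `ordvec` ABI. The point it illustrates: an empty Rust slice has a non-null (dangling) pointer, so passing `(0, slice.as_ptr())` would be exactly the `candidate_count == 0 && candidate_rows != NULL` case that ABI v1 rejects.

```rust
use std::ptr;

/// Row identifier type; `u64` is an assumption for this sketch.
type RowId = u64;

/// What the caller should do with an optional candidate list.
#[derive(Debug, PartialEq)]
enum CandidatePlan<'a> {
    /// No candidate filter: cross the ABI with `(0, NULL)`.
    FullSearch,
    /// Explicit empty survivor set: return no hits without calling the ABI.
    SkipCall,
    /// Non-empty candidate list: pass `(len, ptr)`.
    Subset(&'a [RowId]),
}

fn plan_candidates(candidates: Option<&[RowId]>) -> CandidatePlan<'_> {
    match candidates {
        None => CandidatePlan::FullSearch,
        Some(rows) if rows.is_empty() => CandidatePlan::SkipCall,
        Some(rows) => CandidatePlan::Subset(rows),
    }
}

/// Raw `(candidate_count, candidate_rows)` pair, or `None` when the call
/// should be skipped entirely.
fn raw_args(plan: &CandidatePlan<'_>) -> Option<(usize, *const RowId)> {
    match plan {
        CandidatePlan::FullSearch => Some((0, ptr::null())),
        CandidatePlan::SkipCall => None,
        CandidatePlan::Subset(rows) => Some((rows.len(), rows.as_ptr())),
    }
}
```

Separating "no filter" (`None`) from "filter that matched nothing" (`Some(&[])`) at the type level keeps an empty survivor set from silently turning into a full search.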
@@ -1500,14 +1511,25 @@ mod tests {
sign.add(&[0.0f32; 64]);
sign.write(&sign_path).unwrap();

let fastscan_path = temp_path("fastscan", "ovfs");
let mut fastscan = Vec::new();
fastscan.extend_from_slice(b"OVFS");
fastscan.push(1);
fastscan.extend_from_slice(&8u32.to_le_bytes());
fastscan.extend_from_slice(&0u32.to_le_bytes());
std::fs::File::create(&fastscan_path)
.unwrap()
.write_all(&fastscan)
.unwrap();

let corrupt_path = temp_path("corrupt", "ovrq");
std::fs::File::create(&corrupt_path)
.unwrap()
.write_all(b"OVRQ\x01")
.unwrap();

unsafe {
for path in [&rank_path, &sign_path] {
for path in [&rank_path, &sign_path, &fastscan_path] {
let cpath = CString::new(path.to_str().unwrap()).unwrap();
let mut out = ptr::null_mut();
assert_eq!(
@@ -1526,6 +1548,7 @@ mod tests {
}
std::fs::remove_file(rank_path).ok();
std::fs::remove_file(sign_path).ok();
std::fs::remove_file(fastscan_path).ok();
std::fs::remove_file(corrupt_path).ok();
}
}
2 changes: 2 additions & 0 deletions ordvec-manifest-python/python/ordvec_manifest/__init__.py
@@ -11,6 +11,7 @@
DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES,
DEFAULT_MAX_AUXILIARY_ARTIFACTS,
DEFAULT_MAX_CACHED_REPORT_BYTES,
DEFAULT_MAX_CALIBRATION_PROFILE_BYTES,
DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES,
DEFAULT_MAX_MANIFEST_BYTES,
DEFAULT_MAX_REPORT_ISSUES,
@@ -37,6 +38,7 @@
"DEFAULT_MAX_ROW_IDENTITY_TRACKED_DB_ID_BYTES",
"DEFAULT_MAX_AUXILIARY_ARTIFACTS",
"DEFAULT_MAX_AUXILIARY_ARTIFACT_BYTES",
"DEFAULT_MAX_CALIBRATION_PROFILE_BYTES",
"DEFAULT_MAX_ENCODER_DISTORTION_PROFILE_BYTES",
"DEFAULT_MAX_REPORT_ISSUES",
"DEFAULT_MAX_CACHED_REPORT_BYTES",