diff --git a/README.md b/README.md index 98f6ed00..2975ba8c 100644 --- a/README.md +++ b/README.md @@ -516,7 +516,9 @@ floor from `is_multiple_of`). Because the kernels are built against those intrinsics, this is a hard compile floor, not just a convenience pin: a toolchain below 1.89 won't build the crate. Raising the MSRV is treated as a minor-version change under the -[compatibility policy](docs/compatibility-policy.md). +[compatibility policy](docs/compatibility-policy.md). The current feature +stability matrix and downstream embedding notes live in +[`docs/msrv-and-features.md`](docs/msrv-and-features.md). ## License diff --git a/RELEASING.md b/RELEASING.md index fc9a763f..1b9289ff 100644 --- a/RELEASING.md +++ b/RELEASING.md @@ -150,9 +150,16 @@ filename. Until a record is updated, the corresponding gated publish fails 3. Bump the lockstep version (`Cargo.toml`, `ordvec-manifest/Cargo.toml` including its `ordvec` dependency, `ordvec-python/Cargo.toml`, `ordvec-python/pyproject.toml`, - `ordvec-python/python/ordvec/__init__.py`, and `ordvec-ffi/Cargo.toml`) and - update `CHANGELOG.md` with migration notes for every intentional - compatibility break. Commit on `main`. + `ordvec-python/python/ordvec/__init__.py`, + `ordvec-manifest-python/Cargo.toml`, + `ordvec-manifest-python/pyproject.toml`, + `ordvec-manifest-python/python/ordvec_manifest/__init__.py`, and + `ordvec-ffi/Cargo.toml`) and update `CHANGELOG.md` with migration notes for + every intentional compatibility break. Commit on `main`. + - Run `python tests/release_publish_invariants.py` after the bump; it checks + lockstep versions, MSRV/docs drift, registry metadata parity, Python + classifier/URL parity, docs.rs feature policy, package contents, and + release workflow invariants. 4. Confirm CI is **green for current `main` HEAD**. 
`require-ci-green` checks `main` HEAD's SHA — which needs a **completed, successful** (not `cancelled`, not in-progress) run of `ci.yml`, `python.yml`, `fuzz.yml`, diff --git a/docs/INDEX_PROVENANCE.md b/docs/INDEX_PROVENANCE.md index 6f1b98ea..ac94b8d8 100644 --- a/docs/INDEX_PROVENANCE.md +++ b/docs/INDEX_PROVENANCE.md @@ -67,9 +67,11 @@ descriptors, or make mutable shared storage immutable; callers still own the final policy decision and should load from the returned paths only while the verified files remain under their control. `ordvec-manifest/README.md` shows the intended verify-then-immediate-load -pattern. If another process can mutate the manifest, index, row map, or sidecar -between verification and load, re-run `verify_for_load` at the load boundary or -load from immutable storage or a caller-owned loading path that pins bytes. +pattern, a concrete `manifest.json + index.ovrq + ids.bin` sidecar-backed +bundle, and the stable report fields/codes for sidecar audit logs. If another +process can mutate the manifest, index, row map, or sidecar between +verification and load, re-run `verify_for_load` at the load boundary or load +from immutable storage or a caller-owned loading path that pins bytes. The manifest verifier checks: diff --git a/docs/compatibility-policy.md b/docs/compatibility-policy.md index 7be7d561..ab0edb01 100644 --- a/docs/compatibility-policy.md +++ b/docs/compatibility-policy.md @@ -154,6 +154,8 @@ Deployment-side provenance guidance lives in The Rust MSRV is Rust 1.89. Raising it is a minor-version compatibility change and requires a reason in release notes. Keep `Cargo.toml` `rust-version`, the README MSRV badge/section, and the CI MSRV job synchronized. +The release-facing feature matrix lives in +[`msrv-and-features.md`](msrv-and-features.md). The core crate has no required system or numerical dependencies. 
Adding one, or adding an optional dependency feature that changes build expectations for diff --git a/docs/msrv-and-features.md b/docs/msrv-and-features.md new file mode 100644 index 00000000..2c95ba3a --- /dev/null +++ b/docs/msrv-and-features.md @@ -0,0 +1,69 @@ +# MSRV and Feature Stability + +This matrix is the release-facing build contract for downstream embedders, +packagers, and host systems. It complements the +[pre-1.0 compatibility policy](compatibility-policy.md), which defines how +compatibility-impacting changes are classified. + +Current MSRV: Rust 1.89. + +The MSRV applies to all Rust crates in this repository: `ordvec`, +`ordvec-manifest`, `ordvec-python`, `ordvec-manifest-python`, and +`ordvec-ffi`. The CI MSRV job, each `Cargo.toml` `rust-version`, and the +README MSRV badge/section must stay synchronized. Raising the MSRV is a +minor-version compatibility change and release notes must state the reason and +any migration note. + +## Feature Matrix + +| Surface | Default features | Stable default-off features | Optional dependency features | Experimental/internal features | +| --- | --- | --- | --- | --- | +| `ordvec` | none | none | none | `experimental` exposes `MultiBucketBitmap`; `test-utils` is repo-test-only and has no public stability promise. | +| `ordvec-manifest` | none | none | `cli`, `sqlite`, `sqlite-bundled` | none | +| `ordvec-python` | n/a | n/a | n/a | n/a | +| `ordvec-manifest-python` | n/a | n/a | n/a | n/a | +| `ordvec-ffi` | none | none | none | none | + +SIMD dispatch in `ordvec` is not feature-gated. x86_64 dispatches AVX-512 and +AVX2 at runtime, aarch64 uses NEON, wasm32 can use `simd128` when the target is +built with that target feature, and other targets use the scalar fallback. +Host systems should not need BLAS, LAPACK, `ndarray`, `faer`, or a native graph +library to embed the core crate. + +`ordvec-manifest` keeps its library default feature set empty. 
The `cli` +feature enables the `ordvec-manifest` binary and its `clap` dependency. The +`sqlite` feature enables the local cache/audit subcommands; `sqlite-bundled` +adds the bundled SQLite build through `rusqlite`. + +## Change Policy + +New feature flags must declare a stability class before merging: + +- stable default feature; +- stable default-off feature; +- optional dependency feature; +- experimental/default-off feature; +- internal repo-test-only feature. + +Changing the default feature set is compatibility-impacting and must be +classified in release notes. Adding a new required system dependency, changing +wheel platform expectations, or making an optional dependency effectively +required is also compatibility-impacting. + +Experimental and internal features can change before 1.0, but releases should +still call out changes likely to affect known downstream users. Stable feature +changes should include examples or migration notes when the visible build or +API surface changes. + +## Release Checks + +`python tests/release_publish_invariants.py` keeps the following in sync: + +- lockstep crate and Python package versions; +- Rust MSRV declarations, README badge text, and CI MSRV toolchain; +- crates.io metadata, PyPI metadata, docs.rs feature policy, and package + contents; +- release workflow and registry preflight expectations. + +Release review should also compare touched code against this matrix so host +systems can embed `ordvec` without hidden platform or feature surprises. diff --git a/fuzz/fuzz_targets/load_bitmap.rs b/fuzz/fuzz_targets/load_bitmap.rs index 985aa8de..ffb2f347 100644 --- a/fuzz/fuzz_targets/load_bitmap.rs +++ b/fuzz/fuzz_targets/load_bitmap.rs @@ -1,15 +1,11 @@ //! libFuzzer target for the `.ovbm` / `OVBM` loader (which also accepts the //! legacy `.tvbm` / `TVBM` magic), driven through the public -//! `ordvec::Bitmap::load` entry point. +//! `ordvec::Bitmap::load_from_bytes` entry point. //! //! 
The low-level `rank_io::load_bitmap` parser is crate-internal -//! (`pub(crate)`), so the fuzzer exercises it through `Bitmap::load` — which -//! runs that exact loader and then the type's post-load checks (the full -//! public load path). `load` takes a `&Path`, and the only public load entry -//! points are path-based (there is no public `&[u8]`/`Read` loader — issue -//! #6), so a shared process-local scratch file (see [`scratch`]) feeds the -//! loader the fuzz bytes without the per-iteration `mkstemp`/`unlink` churn a -//! fresh `NamedTempFile` each run would incur. +//! (`pub(crate)`), so the fuzzer exercises it through +//! `Bitmap::load_from_bytes` — which runs that exact loader and then the +//! type's post-load checks (the full public in-memory load path). //! //! Contract: on arbitrary bytes the loader must return `Ok(..)` or //! `Err(..)` — never panic, abort, or read out of bounds. libFuzzer @@ -20,10 +16,6 @@ use libfuzzer_sys::fuzz_target; -mod scratch; - fuzz_target!(|data: &[u8]| { - scratch::with_scratch_file(data, |path| { - let _ = ordvec::Bitmap::load(path); - }); + let _ = ordvec::Bitmap::load_from_bytes(data); }); diff --git a/fuzz/fuzz_targets/load_fastscan.rs b/fuzz/fuzz_targets/load_fastscan.rs index 85c6cbeb..26b08008 100644 --- a/fuzz/fuzz_targets/load_fastscan.rs +++ b/fuzz/fuzz_targets/load_fastscan.rs @@ -1,13 +1,12 @@ //! libFuzzer target for the `.ovfs` / `OVFS` loader (the FastScan b=2 //! persistence format — new in the ordvec format, no legacy `TV*` magic), -//! driven through the public `ordvec::RankQuantFastscan::load` entry point. +//! driven through the public `ordvec::RankQuantFastscan::load_from_bytes` +//! entry point. //! //! The low-level `rank_io::load_fastscan` parser is crate-internal -//! (`pub(crate)`), so the fuzzer exercises it through `RankQuantFastscan::load` -//! — which runs that exact loader (the full public load path). `load` takes a -//! 
`&Path` and the only public load entry points are path-based (issue #6), so -//! a shared process-local scratch file (see [`scratch`]) feeds the loader the -//! fuzz bytes without per-iteration `mkstemp`/`unlink` churn. +//! (`pub(crate)`), so the fuzzer exercises it through +//! `RankQuantFastscan::load_from_bytes` — which runs that exact loader (the +//! full public in-memory load path). //! //! Contract: on arbitrary bytes the loader must return `Ok(..)` or `Err(..)` — //! never panic, abort, or read out of bounds. libFuzzer treats any panic/abort @@ -17,11 +16,7 @@ use libfuzzer_sys::fuzz_target; -mod scratch; - fuzz_target!(|data: &[u8]| { - scratch::with_scratch_file(data, |path| { - // The only thing under test: arbitrary bytes -> Ok | Err, no panic. - let _ = ordvec::RankQuantFastscan::load(path); - }); + // The only thing under test: arbitrary bytes -> Ok | Err, no panic. + let _ = ordvec::RankQuantFastscan::load_from_bytes(data); }); diff --git a/fuzz/fuzz_targets/load_rank.rs b/fuzz/fuzz_targets/load_rank.rs index 62488b7f..b086e155 100644 --- a/fuzz/fuzz_targets/load_rank.rs +++ b/fuzz/fuzz_targets/load_rank.rs @@ -1,15 +1,11 @@ //! libFuzzer target for the `.ovr` / `OVR1` loader (which also accepts the -//! legacy `.tvr` / `TVR1` magic), driven through the public `ordvec::Rank::load` -//! entry point. +//! legacy `.tvr` / `TVR1` magic), driven through the public +//! `ordvec::Rank::load_from_bytes` entry point. //! //! The low-level `rank_io::load_rank` parser is crate-internal (`pub(crate)`), -//! so the fuzzer exercises it through `Rank::load` — which runs that exact -//! loader and then the type's post-load length check (the full public load -//! path). `load` takes a `&Path`, and the only public load entry points are -//! path-based (there is no public `&[u8]`/`Read` loader — issue #6), so a -//! shared process-local scratch file (see [`scratch`]) feeds the loader the -//! fuzz bytes without the per-iteration `mkstemp`/`unlink` churn a fresh -//! 
`NamedTempFile` each run would incur. +//! so the fuzzer exercises it through `Rank::load_from_bytes` — which runs +//! that exact loader and then the type's post-load length check (the full +//! public in-memory load path). //! //! Contract: on arbitrary bytes the loader must return `Ok(..)` or //! `Err(..)` — never panic, abort, or read out of bounds. libFuzzer @@ -20,11 +16,7 @@ use libfuzzer_sys::fuzz_target; -mod scratch; - fuzz_target!(|data: &[u8]| { - scratch::with_scratch_file(data, |path| { - // The only thing under test: arbitrary bytes -> Ok | Err, no panic. - let _ = ordvec::Rank::load(path); - }); + // The only thing under test: arbitrary bytes -> Ok | Err, no panic. + let _ = ordvec::Rank::load_from_bytes(data); }); diff --git a/fuzz/fuzz_targets/load_rankquant.rs b/fuzz/fuzz_targets/load_rankquant.rs index 95b329bd..644def06 100644 --- a/fuzz/fuzz_targets/load_rankquant.rs +++ b/fuzz/fuzz_targets/load_rankquant.rs @@ -1,15 +1,11 @@ //! libFuzzer target for the `.ovrq` / `OVRQ` loader (which also accepts the //! legacy `.tvrq` / `TVRQ` magic), driven through the public -//! `ordvec::RankQuant::load` entry point. +//! `ordvec::RankQuant::load_from_bytes` entry point. //! //! The low-level `rank_io::load_rankquant` parser is crate-internal -//! (`pub(crate)`), so the fuzzer exercises it through `RankQuant::load` — -//! which runs that exact loader and then the type's post-load checks (the -//! full public load path). `load` takes a `&Path`, and the only public load -//! entry points are path-based (there is no public `&[u8]`/`Read` loader — -//! issue #6), so a shared process-local scratch file (see [`scratch`]) feeds -//! the loader the fuzz bytes without the per-iteration `mkstemp`/`unlink` -//! churn a fresh `NamedTempFile` each run would incur. +//! (`pub(crate)`), so the fuzzer exercises it through +//! `RankQuant::load_from_bytes` — which runs that exact loader and then the +//! type's post-load checks (the full public in-memory load path). //! 
//! Contract: on arbitrary bytes the loader must return `Ok(..)` or //! `Err(..)` — never panic, abort, or read out of bounds. libFuzzer @@ -22,10 +18,6 @@ use libfuzzer_sys::fuzz_target; -mod scratch; - fuzz_target!(|data: &[u8]| { - scratch::with_scratch_file(data, |path| { - let _ = ordvec::RankQuant::load(path); - }); + let _ = ordvec::RankQuant::load_from_bytes(data); }); diff --git a/fuzz/fuzz_targets/load_sign_bitmap.rs b/fuzz/fuzz_targets/load_sign_bitmap.rs index 061f9869..1c7ac3de 100644 --- a/fuzz/fuzz_targets/load_sign_bitmap.rs +++ b/fuzz/fuzz_targets/load_sign_bitmap.rs @@ -1,15 +1,11 @@ //! libFuzzer target for the `.ovsb` / `OVSB` loader (which also accepts the //! legacy `.tvsb` / `TVSB` magic), driven through the public -//! `ordvec::SignBitmap::load` entry point. +//! `ordvec::SignBitmap::load_from_bytes` entry point. //! //! The low-level `rank_io::load_sign_bitmap` parser is crate-internal -//! (`pub(crate)`), so the fuzzer exercises it through `SignBitmap::load` — -//! which runs that exact loader and then the type's post-load checks (the full -//! public load path). `load` takes a `&Path`, and the only public load entry -//! points are path-based (there is no public `&[u8]`/`Read` loader — issue -//! #6), so a shared process-local scratch file (see [`scratch`]) feeds the -//! loader the fuzz bytes without the per-iteration `mkstemp`/`unlink` churn a -//! fresh `NamedTempFile` each run would incur. +//! (`pub(crate)`), so the fuzzer exercises it through +//! `SignBitmap::load_from_bytes` — which runs that exact loader and then the +//! type's post-load checks (the full public in-memory load path). //! //! Contract: on arbitrary bytes the loader must return `Ok(..)` or //! `Err(..)` — never panic, abort, or read out of bounds. 
libFuzzer @@ -22,10 +18,6 @@ use libfuzzer_sys::fuzz_target; -mod scratch; - fuzz_target!(|data: &[u8]| { - scratch::with_scratch_file(data, |path| { - let _ = ordvec::SignBitmap::load(path); - }); + let _ = ordvec::SignBitmap::load_from_bytes(data); }); diff --git a/fuzz/fuzz_targets/scratch.rs b/fuzz/fuzz_targets/scratch.rs deleted file mode 100644 index 24053b4b..00000000 --- a/fuzz/fuzz_targets/scratch.rs +++ /dev/null @@ -1,121 +0,0 @@ -//! Shared per-worker scratch temp file for the `.ovr` / `.ovrq` / `.ovbm` / -//! `.ovsb` loader fuzz targets (the loaders also accept the legacy `.tv*` -//! magics). -//! -//! # Why this exists (issue #6) -//! -//! The four `load_*` targets exercise the production loaders, but the only -//! public load entry points (`Rank::load` / `RankQuant::load` / `Bitmap::load` -//! / `SignBitmap::load`, and `probe_index_metadata`) take a `&Path` and open a -//! real file — the low-level `rank_io::load_*` parsers (which operate on a -//! generic `R: Read + Seek` and would accept a `Cursor<&[u8]>`) are -//! `pub(crate)` and unreachable from this external fuzz crate. A *true* -//! zero-temp-file in-memory driver therefore needs a new **public core API** -//! (e.g. `Type::load_from_bytes(&[u8])` or a `pub` `Read + Seek` loader), -//! tracked separately and out of scope for a fuzz-only change. -//! -//! What this *does* remove is the avoidable per-iteration filesystem churn the -//! issue calls out: instead of allocating a fresh `NamedTempFile` (an -//! `mkstemp` + `open`) and unlinking it on drop every single iteration, each -//! fuzzer worker creates **one** temp file and rewrites it in place. The loader -//! still runs its exact real path (`File::open` + `metadata().len()` + parse) on -//! the precise fuzz bytes, so the loader code path and the corpus/format -//! contract are unchanged. -//! -//! # Storage scope (per worker thread) -//! -//! The scratch file lives in a [`thread_local!`], i.e. one file **per worker -//! 
thread** — not a single shared file across threads. libFuzzer drives each -//! fuzz target from a single thread, and fork mode runs each job in its own -//! process (hence its own thread-local), so in practice this is one file per -//! fuzzer worker, never shared between concurrent workers — so reuse is -//! race-free. The file is auto-removed when the thread/process exits -//! (`NamedTempFile` drop), so a multi-million-run campaign does not leak into -//! `$TMPDIR`. -//! -//! # Determinism & truncation -//! -//! Reusing a file means a shorter iteration must not inherit trailing bytes from -//! a longer previous one — that would change the bytes the loader sees and break -//! determinism (a fresh `NamedTempFile` gave clean truncation for free). Each -//! call rewinds, writes exactly `data`, and truncates to `data.len()` **only -//! when the new input is shorter** than the previous one (a longer-or-equal -//! write already overwrites the old length). Identical input therefore always -//! yields identical loader input and behaviour. - -use std::cell::RefCell; -use std::io::{Seek, SeekFrom, Write}; -use std::path::{Path, PathBuf}; - -use tempfile::NamedTempFile; - -/// One reused temp file plus the byte length last written to it, so we only -/// `set_len` (truncate) when the next input is strictly shorter. -struct Scratch { - file: NamedTempFile, - len: usize, -} - -thread_local! { - /// One temp file per worker thread, reused across iterations. `None` until - /// the first successful create, and reset to `None` after any IO error so a - /// broken descriptor is discarded and recreated on the next iteration. - static SCRATCH: RefCell<Option<Scratch>> = const { RefCell::new(None) }; -} - -/// Write `data` to a per-worker scratch file and invoke `run` with its path. 
-/// -/// On any transient temp-file/IO error (create, seek, write, truncate, flush) -/// the scratch state is reset and the iteration is skipped without calling -/// `run` — such failures are environmental, not loader bugs, so they must not be -/// reported as crashes. -/// -/// The thread-local `RefCell` borrow is released **before** `run` is invoked, so -/// a `run` that (directly or indirectly) re-enters `with_scratch_file` will not -/// trip a `RefCell` double-borrow panic — the helper is *borrow-safe*. It is -/// **not** safe for genuinely nested use, however: there is one scratch file per -/// worker thread, so a nested call would rewrite the same file the outer call is -/// still pointing `run` at, clobbering its bytes. The loader fuzz targets never -/// nest (one synchronous call per iteration), so this is a forward-looking -/// caveat, not a current bug. -pub fn with_scratch_file<F>(data: &[u8], run: F) -where - F: FnOnce(&Path), -{ - // Prepare the file under the borrow and hand back an owned path; the borrow - // is dropped when this closure returns, before `run` is called below. - let path: Option<PathBuf> = SCRATCH.with(|cell| { - let mut slot = cell.borrow_mut(); - if slot.is_none() { - match NamedTempFile::new() { - Ok(file) => *slot = Some(Scratch { file, len: 0 }), - Err(_) => return None, - } - } - let scratch = slot.as_mut().expect("scratch temp file initialized above"); - - // Overwrite the file with exactly `data`: rewind, write, then truncate - // to the new length only when it shrank (a longer/equal write already - // overwrites the old bytes) so stale trailing bytes from a longer - // previous iteration cannot leak in. 
- let ok = { - let file = scratch.file.as_file_mut(); - file.seek(SeekFrom::Start(0)).is_ok() - && file.write_all(data).is_ok() - && (data.len() >= scratch.len || file.set_len(data.len() as u64).is_ok()) - && file.flush().is_ok() - }; - if ok { - scratch.len = data.len(); - Some(scratch.file.path().to_path_buf()) - } else { - // Discard the possibly-broken descriptor; the next call recreates it. - *slot = None; - None - } - }); - - if let Some(path) = path { - run(&path); - } -} diff --git a/ordvec-manifest/README.md b/ordvec-manifest/README.md index 2b64a2cf..5a651073 100644 --- a/ordvec-manifest/README.md +++ b/ordvec-manifest/README.md @@ -60,6 +60,75 @@ let _app_ids = plan.require_auxiliary("app.ids")?; let index = ordvec::RankQuant::load(plan.artifact_path())?; ``` +Concrete sidecar-backed bundle pattern: + +```text +docs.odb/ + manifest.json + index.ovrq + ids.bin +``` + +The application writes the ordvec index and its own sidecar bytes first: + +```rust +use std::{fs, path::Path}; + +fn write_bundle( + index: &ordvec::RankQuant, + doc_ids: &[u64], + bundle: &Path, +) -> Result<(), Box<dyn std::error::Error>> { + fs::create_dir_all(bundle)?; + index.write(bundle.join("index.ovrq"))?; + + let mut id_bytes = Vec::with_capacity(doc_ids.len() * std::mem::size_of::<u64>()); + for id in doc_ids { + id_bytes.extend_from_slice(&id.to_le_bytes()); + } + fs::write(bundle.join("ids.bin"), id_bytes)?; + Ok(()) +} +``` + +Then create and verify a manifest that binds both files: + +```sh +cargo run -p ordvec-manifest --features cli -- create \ + --index docs.odb/index.ovrq \ + --row-id-is-identity \ + --aux app.ids=docs.odb/ids.bin \ + --embedding-model bge-small-en-v1.5 \ + --out docs.odb/manifest.json + +cargo run -p ordvec-manifest --features cli -- verify \ + --manifest docs.odb/manifest.json \ + --json +``` + +The load side verifies the bundle before any caller-owned sidecar parsing: + +```rust +let plan = ordvec_manifest::verify_for_load( + 
&bundle.join("manifest.json"), + 
ordvec_manifest::VerifyOptions::default(), +)?; + +let metadata = plan.metadata(); +assert_eq!(metadata.vector_count, expected_rows); + +let index = ordvec::RankQuant::load(plan.artifact_path())?; +let ids_path = plan.require_auxiliary("app.ids")?; +let ids_bytes = std::fs::read(ids_path)?; +let doc_ids = parse_caller_owned_ids(&ids_bytes)?; +``` + +`ordvec-manifest` owns the path, size, SHA-256, and index-metadata checks. +The application still owns the `ids.bin` schema, count check, duplicate policy, +endianness, and any database reconstruction rules. If `ids.bin` is modified +after manifest creation, verification fails with a stable auxiliary artifact +size or digest code before the caller parses those bytes. + Racy load pattern: ```rust @@ -150,13 +219,58 @@ required auxiliary artifact (e.g. `app.ids`). That makes the vector row count an ordvec invariant while leaving the caller's `u64` document IDs as caller-owned sidecar bytes. Do not encode the ID sidecar as `RowIdentity::Jsonl`: v1 JSONL row identity is UUID-oriented (`id_kind = "uuid"`), and generic row-map ID -formats are intentionally deferred. The reserved `row_identity.db` metadata -block is rejected in v1 because it is not byte-bound or path-checked. +formats are intentionally deferred to +[#145](https://github.com/Fieldnote-Echo/ordvec/issues/145). The reserved +`row_identity.db` metadata block is rejected in v1 because it is not byte-bound +or path-checked. 
+ +Stable row-identity boundary codes: + +| Condition | Verification report code | +| --- | --- | +| JSONL row identity declares an ID kind other than `uuid` | `row_identity_id_kind_unsupported` | +| JSONL row identity includes reserved `row_identity.db` metadata | `row_identity_db_unsupported` | +| JSONL `db_id` / `parent_id` is empty | `row_identity_db_id_empty` / `row_identity_parent_id_empty` | +| JSONL `db_id` / `parent_id` contains NUL | `row_identity_db_id_contains_nul` / `row_identity_parent_id_contains_nul` | +| JSONL `db_id` / `parent_id` is not a UUID | `row_identity_db_id_invalid_uuid` / `row_identity_parent_id_invalid_uuid` | The unified JSON report carries per-sidecar audit fields. A successful auxiliary artifact verification includes the manifest path, resolved/canonical paths, declared digest/length, and observed digest/length: +Stable sidecar report fields: + +| Field | Meaning | +| --- | --- | +| `auxiliary_artifacts[].name` | Caller-owned sidecar name from the manifest. | +| `manifest_path` | Manifest-declared relative path. | +| `resolved_path` / `canonical_path` | Path used for verification and its canonical form when available. | +| `expected_sha256` / `expected_size_bytes` | Manifest-declared digest and byte length. | +| `sha256` / `size_bytes` | Observed digest and byte length when bytes could be read. | +| `required` | Whether absence is a verification error. | +| `state` | One of `verified`, `optional_absent`, `missing_required`, or `failed`. | +| `reason_code` | Stable null-or-string reason for any non-verified state, or the first failure reason. | + +Stable sidecar states: + +| `state` | `reason_code` | Report outcome | +| --- | --- | --- | +| `verified` | `null` | The declared sidecar was present and matched path policy, size, and digest. | +| `optional_absent` | `auxiliary_artifact_optional_absent` | The optional sidecar was absent; this is not an error. 
| +| `missing_required` | `auxiliary_artifact_missing_required` | A required sidecar was absent and verification fails. | +| `failed` | Code-specific | Path policy, hashing, size, digest, or limit validation failed. | + +Common `failed` reason codes include `auxiliary_artifact_path_empty`, +`auxiliary_artifact_base_dir_unavailable`, +`auxiliary_artifact_path_unavailable`, +`auxiliary_artifact_path_escape_rejected`, +`auxiliary_artifact_file_too_large`, +`auxiliary_artifact_file_size_mismatch`, and +`auxiliary_artifact_sha256_mismatch`. `errors[].code` and `warnings[].code` +carry the same stable code namespace. `skipped_checks[]` is machine-readable +and records checks that were intentionally not run, such as +`attestations_absent`. + ```json { "ok": true, diff --git a/src/bitmap.rs b/src/bitmap.rs index 42737ec2..0ac18459 100644 --- a/src/bitmap.rs +++ b/src/bitmap.rs @@ -520,6 +520,11 @@ impl Bitmap { crate::rank_io::write_bitmap(path, self.dim, self.n_top, self.n_vectors, &self.bitmaps) } + /// Persist to any byte writer using the `.ovbm` format. + pub fn write_to<W: std::io::Write>(&self, writer: W) -> std::io::Result<()> { + crate::rank_io::write_bitmap_to(writer, self.dim, self.n_top, self.n_vectors, &self.bitmaps) + } + /// Load from a `.ovbm` file produced by [`Self::write`]. /// /// Legacy `.tvbm` files (magic `TVBM`) written by older versions of this @@ -531,6 +536,29 @@ impl Bitmap { /// expected `n_vectors * dim / 64` u64 lanes). pub fn load(path: impl AsRef<Path>) -> std::io::Result<Self> { let (dim, n_top, n_vectors, bitmaps) = crate::rank_io::load_bitmap(path)?; + Self::from_persisted_parts(dim, n_top, n_vectors, bitmaps) + } + + /// Load a `.ovbm`/legacy `.tvbm` index from any reader that can seek. + /// + /// The reader is parsed from its current position through EOF; any trailing + /// bytes after the declared payload are rejected. 
+ pub fn read_from<R: std::io::Read + std::io::Seek>(reader: R) -> std::io::Result<Self> { + let (dim, n_top, n_vectors, bitmaps) = crate::rank_io::load_bitmap_from(reader)?; + Self::from_persisted_parts(dim, n_top, n_vectors, bitmaps) + } + + /// Load a `.ovbm`/legacy `.tvbm` index from an in-memory byte slice. + pub fn load_from_bytes(bytes: &[u8]) -> std::io::Result<Self> { + Self::read_from(std::io::Cursor::new(bytes)) + } + + fn from_persisted_parts( + dim: usize, + n_top: usize, + n_vectors: usize, + bitmaps: Vec<u64>, + ) -> std::io::Result<Self> { let qpv = dim / 64; // `checked_mul` (not `saturating`): on a 32-bit target `n_vectors * qpv` // can overflow `usize`; treat overflow as malformed rather than letting diff --git a/src/fastscan.rs b/src/fastscan.rs index a1c3c0e8..86b26f7d 100644 --- a/src/fastscan.rs +++ b/src/fastscan.rs @@ -653,6 +653,11 @@ impl RankQuantFastscan { crate::rank_io::write_fastscan(path, self.dim, self.n_vectors, &self.packed_fs) } + /// Persist to any byte writer using the `.ovfs` format. + pub fn write_to<W: std::io::Write>(&self, writer: W) -> std::io::Result<()> { + crate::rank_io::write_fastscan_to(writer, self.dim, self.n_vectors, &self.packed_fs) + } + /// Load a `.ovfs` FastScan index previously written by [`Self::write`]. /// /// The loader validates the header and that the payload length is exactly @@ -666,4 +671,22 @@ impl RankQuantFastscan { packed_fs, }) } + + /// Load a `.ovfs` FastScan index from any reader that can seek. + /// + /// The reader is parsed from its current position through EOF; any trailing + /// bytes after the declared payload are rejected. + pub fn read_from<R: std::io::Read + std::io::Seek>(reader: R) -> std::io::Result<Self> { + let (dim, n_vectors, packed_fs) = crate::rank_io::load_fastscan_from(reader)?; + Ok(Self { + dim, + n_vectors, + packed_fs, + }) + } + + /// Load a `.ovfs` FastScan index from an in-memory byte slice. 
+ pub fn load_from_bytes(bytes: &[u8]) -> std::io::Result<Self> { + Self::read_from(std::io::Cursor::new(bytes)) + } } diff --git a/src/quant.rs b/src/quant.rs index ee9f0dbd..0c961535 100644 --- a/src/quant.rs +++ b/src/quant.rs @@ -882,6 +882,25 @@ impl RankQuant { crate::rank_io::write_rankquant(path, self.bits, self.dim, self.n_vectors, &self.packed) } + /// Persist to any byte writer using the `.ovrq` format. + pub fn write_to<W: std::io::Write>(&self, writer: W) -> std::io::Result<()> { + if self.bits == 8 { + return Err(std::io::Error::new( + std::io::ErrorKind::Unsupported, + "RankQuant b=8 persistence is not supported yet (the .ovrq loader \ + accepts bits ∈ {1, 2, 4}); b=8 is an in-memory evidence surface \ + in this phase", + )); + } + crate::rank_io::write_rankquant_to( + writer, + self.bits, + self.dim, + self.n_vectors, + &self.packed, + ) + } + /// Load from a `.ovrq` file produced by [`Self::write`]. /// /// Legacy `.tvrq` files (magic `TVRQ`) written by older versions of this @@ -893,10 +912,33 @@ impl RankQuant { /// any violation — never panics on malformed input. pub fn load(path: impl AsRef<Path>) -> std::io::Result<Self> { let (bits, dim, n_vectors, packed) = crate::rank_io::load_rankquant(path)?; + Self::from_persisted_parts(bits, dim, n_vectors, packed) + } + + /// Load a `.ovrq`/legacy `.tvrq` index from any reader that can seek. + /// + /// The reader is parsed from its current position through EOF; any trailing + /// bytes after the declared payload are rejected. + pub fn read_from<R: std::io::Read + std::io::Seek>(reader: R) -> std::io::Result<Self> { + let (bits, dim, n_vectors, packed) = crate::rank_io::load_rankquant_from(reader)?; + Self::from_persisted_parts(bits, dim, n_vectors, packed) + } + + /// Load a `.ovrq`/legacy `.tvrq` index from an in-memory byte slice. 
+ pub fn load_from_bytes(bytes: &[u8]) -> std::io::Result<Self> { + Self::read_from(std::io::Cursor::new(bytes)) + } + + fn from_persisted_parts( + bits: u8, + dim: usize, + n_vectors: usize, + packed: Vec<u8>, + ) -> std::io::Result<Self> { // load_rankquant already validates bits ∈ {1,2,4} and bounds // dim/n_vectors; we replay the per-type invariants here. let n_buckets = 1usize << bits; - if dim % n_buckets != 0 { + if !dim.is_multiple_of(n_buckets) { return Err(std::io::Error::new( std::io::ErrorKind::InvalidData, format!( @@ -906,7 +948,7 @@ impl RankQuant { )); } let codes_per_byte = (8 / bits) as usize; - if dim % codes_per_byte != 0 { + if !dim.is_multiple_of(codes_per_byte) { return Err(std::io::Error::new( std::io::ErrorKind::InvalidData, format!("OVRQ dim {dim} is not a multiple of codes_per_byte = {codes_per_byte}",), diff --git a/src/rank.rs b/src/rank.rs index 10cd1e2b..902557db 100644 --- a/src/rank.rs +++ b/src/rank.rs @@ -545,6 +545,11 @@ impl Rank { crate::rank_io::write_rank(path, self.dim, self.n_vectors, &self.ranks) } + /// Persist to any byte writer using the `.ovr` format. + pub fn write_to<W: std::io::Write>(&self, writer: W) -> std::io::Result<()> { + crate::rank_io::write_rank_to(writer, self.dim, self.n_vectors, &self.ranks) + } + /// Load from a `.ovr` file produced by [`Self::write`]. /// /// Legacy `.tvr` files (magic `TVR1`) written by older versions of this @@ -557,6 +562,28 @@ impl Rank { /// specific to `Rank` are checked here. pub fn load(path: impl AsRef<Path>) -> std::io::Result<Self> { let (dim, n_vectors, ranks) = crate::rank_io::load_rank(path)?; + Self::from_persisted_parts(dim, n_vectors, ranks) + } + + /// Load a `.ovr`/legacy `.tvr` index from any reader that can seek. + /// + /// The reader is parsed from its current position through EOF; any trailing + /// bytes after the declared payload are rejected. 
+ pub fn read_from(reader: R) -> std::io::Result { + let (dim, n_vectors, ranks) = crate::rank_io::load_rank_from(reader)?; + Self::from_persisted_parts(dim, n_vectors, ranks) + } + + /// Load a `.ovr`/legacy `.tvr` index from an in-memory byte slice. + pub fn load_from_bytes(bytes: &[u8]) -> std::io::Result { + Self::read_from(std::io::Cursor::new(bytes)) + } + + fn from_persisted_parts( + dim: usize, + n_vectors: usize, + ranks: Vec, + ) -> std::io::Result { // `checked_mul` (not `saturating`): on a 32-bit target `n_vectors * dim` // can overflow `usize`; treat that as malformed rather than letting a // saturated `usize::MAX` stand in for the expected length. diff --git a/src/rank_io.rs b/src/rank_io.rs index 27e6dc68..59a92562 100644 --- a/src/rank_io.rs +++ b/src/rank_io.rs @@ -42,13 +42,19 @@ //! //! # Persistence API & round-trip contract //! -//! The supported persistence API is the index types' `write()` / `load()` -//! methods: [`Rank`](crate::Rank) / [`RankQuant`](crate::RankQuant) / -//! [`Bitmap`](crate::Bitmap) / [`SignBitmap`](crate::SignBitmap) / +//! The supported persistence API is the index types' path, stream, and byte +//! loaders/writers: `write(path)`, `write_to(writer)`, `load(path)`, +//! `read_from(reader)`, and `load_from_bytes(bytes)` on +//! [`Rank`](crate::Rank), [`RankQuant`](crate::RankQuant), +//! [`Bitmap`](crate::Bitmap), [`SignBitmap`](crate::SignBitmap), and //! [`RankQuantFastscan`](crate::RankQuantFastscan) (the last via the `.ovfs` -//! format). The -//! `write_*` / `load_*` format helpers in this module are **crate-internal** -//! (`pub(crate)`); only the `MAX_*` capacity constants are public. +//! format). `read_from` parses from the reader's current position through EOF +//! and rejects trailing bytes; callers embedding an index inside a larger +//! container should pass a length-bounded reader such as `Cursor<&[u8]>`. +//! +//! The `write_*` / `load_*` format helpers in this module are +//! 
**crate-internal** (`pub(crate)`); only the `MAX_*` capacity constants are +//! public. //! //! Round-trip is a guarantee of the **index types**: each constructor //! validates its parameters (matching the loaders' `dim` / `n_top` / `bits` / @@ -63,7 +69,7 @@ //! that the helpers are no longer reachable with arbitrary external input. use std::fs::File; -use std::io::{self, BufReader, BufWriter, Read, Seek, Write}; +use std::io::{self, BufReader, BufWriter, Read, Seek, SeekFrom, Write}; use std::path::Path; // Current ordvec magics — written by this crate going forward. @@ -221,6 +227,13 @@ fn check_payload_matches_file( Ok(()) } +fn stream_len_from_current(reader: &mut R) -> io::Result { + let start = reader.stream_position()?; + let end = reader.seek(SeekFrom::End(0))?; + reader.seek(SeekFrom::Start(start))?; + Ok(end) +} + fn check_dim(dim: usize) -> io::Result<()> { if !(2..=MAX_DIM).contains(&dim) { return Err(invalid(format!("dim {dim} out of range [2, {MAX_DIM}]"))); @@ -521,10 +534,34 @@ pub(crate) fn write_rank( // Enforce the loaders' MAX_PAYLOAD cap *before* File::create so a rejected // oversized write never truncates an existing file. Defense-in-depth; the // round-trip guarantee is type-level (see module docs). Mirrors load_rank. 
+ check_rank_write(dim, n_vectors, ranks)?; + write_rank_to_checked(File::create(path)?, dim, n_vectors, ranks) +} + +pub(crate) fn write_rank_to( + writer: W, + dim: usize, + n_vectors: usize, + ranks: &[u16], +) -> io::Result<()> { + check_rank_write(dim, n_vectors, ranks)?; + write_rank_to_checked(writer, dim, n_vectors, ranks) +} + +fn check_rank_write(dim: usize, n_vectors: usize, ranks: &[u16]) -> io::Result<()> { let payload_bytes = rank_payload_bytes(dim, n_vectors)?; check_payload_bytes(payload_bytes)?; assert_eq!(ranks.len(), payload_bytes / 2); - let mut f = BufWriter::new(File::create(path)?); + Ok(()) +} + +fn write_rank_to_checked( + writer: W, + dim: usize, + n_vectors: usize, + ranks: &[u16], +) -> io::Result<()> { + let mut f = BufWriter::new(writer); f.write_all(OVR_MAGIC)?; f.write_all(&[VERSION])?; f.write_all(&(dim as u32).to_le_bytes())?; @@ -545,7 +582,20 @@ pub(crate) fn load_rank(path: impl AsRef) -> io::Result<(usize, usize, Vec // the trailing-byte check. Both are wrong on a metadata race (NFS/procfs). let file_len = file.metadata()?.len(); let mut f = BufReader::new(file); - let magic = read_magic(&mut f, "OVR1")?; + load_rank_from_stream(&mut f, file_len) +} + +pub(crate) fn load_rank_from(reader: R) -> io::Result<(usize, usize, Vec)> { + let mut reader = reader; + let file_len = stream_len_from_current(&mut reader)?; + load_rank_from_stream(&mut reader, file_len) +} + +fn load_rank_from_stream( + mut f: &mut R, + file_len: u64, +) -> io::Result<(usize, usize, Vec)> { + let magic = read_magic(f, "OVR1")?; if &magic != OVR_MAGIC && &magic != TVR_MAGIC { return Err(invalid("not an OVR1/TVR1 (Rank) file: wrong magic")); } @@ -614,10 +664,36 @@ pub(crate) fn write_rankquant( // Enforce the loaders' MAX_PAYLOAD cap *before* File::create (defense-in- // depth; a rejected write must not truncate an existing file). Mirrors // load_rankquant: checked multiply before the /8 divide. 
+ check_rankquant_write(bits, dim, n_vectors, packed)?; + write_rankquant_to_checked(File::create(path)?, bits, dim, n_vectors, packed) +} + +pub(crate) fn write_rankquant_to( + writer: W, + bits: u8, + dim: usize, + n_vectors: usize, + packed: &[u8], +) -> io::Result<()> { + check_rankquant_write(bits, dim, n_vectors, packed)?; + write_rankquant_to_checked(writer, bits, dim, n_vectors, packed) +} + +fn check_rankquant_write(bits: u8, dim: usize, n_vectors: usize, packed: &[u8]) -> io::Result<()> { let payload_bytes = rankquant_payload_bytes(dim, n_vectors, bits)?; check_payload_bytes(payload_bytes)?; assert_eq!(packed.len(), payload_bytes); - let mut f = BufWriter::new(File::create(path)?); + Ok(()) +} + +fn write_rankquant_to_checked( + writer: W, + bits: u8, + dim: usize, + n_vectors: usize, + packed: &[u8], +) -> io::Result<()> { + let mut f = BufWriter::new(writer); f.write_all(OVRQ_MAGIC)?; f.write_all(&[VERSION])?; f.write_all(&[bits])?; @@ -637,7 +713,22 @@ pub(crate) fn load_rankquant(path: impl AsRef) -> io::Result<(u8, usize, u // the trailing-byte check. Both are wrong on a metadata race (NFS/procfs). let file_len = file.metadata()?.len(); let mut f = BufReader::new(file); - let magic = read_magic(&mut f, "OVRQ")?; + load_rankquant_from_stream(&mut f, file_len) +} + +pub(crate) fn load_rankquant_from( + reader: R, +) -> io::Result<(u8, usize, usize, Vec)> { + let mut reader = reader; + let file_len = stream_len_from_current(&mut reader)?; + load_rankquant_from_stream(&mut reader, file_len) +} + +fn load_rankquant_from_stream( + mut f: &mut R, + file_len: u64, +) -> io::Result<(u8, usize, usize, Vec)> { + let magic = read_magic(f, "OVRQ")?; if &magic != OVRQ_MAGIC && &magic != TVRQ_MAGIC { return Err(invalid("not an OVRQ/TVRQ (RankQuant) file: wrong magic")); } @@ -724,10 +815,36 @@ pub(crate) fn write_bitmap( // Enforce the loaders' MAX_PAYLOAD cap *before* File::create (defense-in- // depth; a rejected write must not truncate an existing file). 
Mirrors // load_bitmap. + check_bitmap_write(dim, n_vectors, bitmaps)?; + write_bitmap_to_checked(File::create(path)?, dim, n_top, n_vectors, bitmaps) +} + +pub(crate) fn write_bitmap_to( + writer: W, + dim: usize, + n_top: usize, + n_vectors: usize, + bitmaps: &[u64], +) -> io::Result<()> { + check_bitmap_write(dim, n_vectors, bitmaps)?; + write_bitmap_to_checked(writer, dim, n_top, n_vectors, bitmaps) +} + +fn check_bitmap_write(dim: usize, n_vectors: usize, bitmaps: &[u64]) -> io::Result<()> { let payload_bytes = bitmap_payload_bytes(dim, n_vectors, "OVBM")?; check_payload_bytes(payload_bytes)?; assert_eq!(bitmaps.len(), payload_bytes / 8); - let mut f = BufWriter::new(File::create(path)?); + Ok(()) +} + +fn write_bitmap_to_checked( + writer: W, + dim: usize, + n_top: usize, + n_vectors: usize, + bitmaps: &[u64], +) -> io::Result<()> { + let mut f = BufWriter::new(writer); f.write_all(OVBM_MAGIC)?; f.write_all(&[VERSION])?; f.write_all(&(dim as u32).to_le_bytes())?; @@ -749,7 +866,22 @@ pub(crate) fn load_bitmap(path: impl AsRef) -> io::Result<(usize, usize, u // the trailing-byte check. Both are wrong on a metadata race (NFS/procfs). 
let file_len = file.metadata()?.len(); let mut f = BufReader::new(file); - let magic = read_magic(&mut f, "OVBM")?; + load_bitmap_from_stream(&mut f, file_len) +} + +pub(crate) fn load_bitmap_from( + reader: R, +) -> io::Result<(usize, usize, usize, Vec)> { + let mut reader = reader; + let file_len = stream_len_from_current(&mut reader)?; + load_bitmap_from_stream(&mut reader, file_len) +} + +fn load_bitmap_from_stream( + mut f: &mut R, + file_len: u64, +) -> io::Result<(usize, usize, usize, Vec)> { + let magic = read_magic(f, "OVBM")?; if &magic != OVBM_MAGIC && &magic != TVBM_MAGIC { return Err(invalid("not an OVBM/TVBM (Bitmap) file: wrong magic")); } @@ -815,10 +947,34 @@ pub(crate) fn write_sign_bitmap( // Enforce the loaders' MAX_PAYLOAD cap *before* File::create (defense-in- // depth; a rejected write must not truncate an existing file). Mirrors // load_sign_bitmap. + check_sign_bitmap_write(dim, n_vectors, bitmaps)?; + write_sign_bitmap_to_checked(File::create(path)?, dim, n_vectors, bitmaps) +} + +pub(crate) fn write_sign_bitmap_to( + writer: W, + dim: usize, + n_vectors: usize, + bitmaps: &[u64], +) -> io::Result<()> { + check_sign_bitmap_write(dim, n_vectors, bitmaps)?; + write_sign_bitmap_to_checked(writer, dim, n_vectors, bitmaps) +} + +fn check_sign_bitmap_write(dim: usize, n_vectors: usize, bitmaps: &[u64]) -> io::Result<()> { let payload_bytes = bitmap_payload_bytes(dim, n_vectors, "OVSB")?; check_payload_bytes(payload_bytes)?; assert_eq!(bitmaps.len(), payload_bytes / 8); - let mut f = BufWriter::new(File::create(path)?); + Ok(()) +} + +fn write_sign_bitmap_to_checked( + writer: W, + dim: usize, + n_vectors: usize, + bitmaps: &[u64], +) -> io::Result<()> { + let mut f = BufWriter::new(writer); f.write_all(OVSB_MAGIC)?; f.write_all(&[VERSION])?; f.write_all(&(dim as u32).to_le_bytes())?; @@ -854,7 +1010,22 @@ pub(crate) fn load_sign_bitmap(path: impl AsRef) -> io::Result<(usize, usi // the trailing-byte check. 
Both are wrong on a metadata race (NFS/procfs). let file_len = file.metadata()?.len(); let mut f = BufReader::new(file); - let magic = read_magic(&mut f, "OVSB")?; + load_sign_bitmap_from_stream(&mut f, file_len) +} + +pub(crate) fn load_sign_bitmap_from( + reader: R, +) -> io::Result<(usize, usize, Vec)> { + let mut reader = reader; + let file_len = stream_len_from_current(&mut reader)?; + load_sign_bitmap_from_stream(&mut reader, file_len) +} + +fn load_sign_bitmap_from_stream( + mut f: &mut R, + file_len: u64, +) -> io::Result<(usize, usize, Vec)> { + let magic = read_magic(f, "OVSB")?; if &magic != OVSB_MAGIC && &magic != TVSB_MAGIC { return Err(invalid("not an OVSB/TVSB (SignBitmap) file: wrong magic")); } @@ -904,6 +1075,21 @@ pub(crate) fn write_fastscan( n_vectors: usize, packed_fs: &[u8], ) -> io::Result<()> { + check_fastscan_write(dim, n_vectors, packed_fs)?; + write_fastscan_to_checked(File::create(path)?, dim, n_vectors, packed_fs) +} + +pub(crate) fn write_fastscan_to( + writer: W, + dim: usize, + n_vectors: usize, + packed_fs: &[u8], +) -> io::Result<()> { + check_fastscan_write(dim, n_vectors, packed_fs)?; + write_fastscan_to_checked(writer, dim, n_vectors, packed_fs) +} + +fn check_fastscan_write(dim: usize, n_vectors: usize, packed_fs: &[u8]) -> io::Result<()> { // Validate every header parameter *before* File::create, so a now-public // persistence API never (a) silently truncates `dim`/`n_vectors` through the // `as u32` casts below, (b) writes a corrupt/oversized file (the loaders' @@ -924,7 +1110,16 @@ pub(crate) fn write_fastscan( packed_fs.len() ))); } - let mut f = BufWriter::new(File::create(path)?); + Ok(()) +} + +fn write_fastscan_to_checked( + writer: W, + dim: usize, + n_vectors: usize, + packed_fs: &[u8], +) -> io::Result<()> { + let mut f = BufWriter::new(writer); f.write_all(OVFS_MAGIC)?; f.write_all(&[VERSION])?; f.write_all(&(dim as u32).to_le_bytes())?; @@ -938,7 +1133,20 @@ pub(crate) fn load_fastscan(path: impl AsRef) -> 
io::Result<(usize, usize, let file = File::open(path)?; let file_len = file.metadata()?.len(); let mut f = BufReader::new(file); - let magic = read_magic(&mut f, "OVFS")?; + load_fastscan_from_stream(&mut f, file_len) +} + +pub(crate) fn load_fastscan_from(reader: R) -> io::Result<(usize, usize, Vec)> { + let mut reader = reader; + let file_len = stream_len_from_current(&mut reader)?; + load_fastscan_from_stream(&mut reader, file_len) +} + +fn load_fastscan_from_stream( + mut f: &mut R, + file_len: u64, +) -> io::Result<(usize, usize, Vec)> { + let magic = read_magic(f, "OVFS")?; // OVFS is new in the ordvec format: there is no legacy TV* fastscan magic. if &magic != OVFS_MAGIC { return Err(invalid("not an OVFS (RankQuantFastscan) file: wrong magic")); diff --git a/src/sign_bitmap.rs b/src/sign_bitmap.rs index 2004a514..4c7b7899 100644 --- a/src/sign_bitmap.rs +++ b/src/sign_bitmap.rs @@ -459,6 +459,11 @@ impl SignBitmap { crate::rank_io::write_sign_bitmap(path, self.dim, self.n_vectors, &self.bitmaps) } + /// Persist to any byte writer using the `.ovsb` format. + pub fn write_to(&self, writer: W) -> std::io::Result<()> { + crate::rank_io::write_sign_bitmap_to(writer, self.dim, self.n_vectors, &self.bitmaps) + } + /// Load from a `.ovsb` file produced by [`Self::write`]. /// /// Legacy `.tvsb` files (magic `TVSB`) written by older versions of this @@ -470,6 +475,28 @@ impl SignBitmap { /// expected `n_vectors * dim / 64` u64 lanes. pub fn load(path: impl AsRef) -> std::io::Result { let (dim, n_vectors, bitmaps) = crate::rank_io::load_sign_bitmap(path)?; + Self::from_persisted_parts(dim, n_vectors, bitmaps) + } + + /// Load a `.ovsb`/legacy `.tvsb` index from any reader that can seek. + /// + /// The reader is parsed from its current position through EOF; any trailing + /// bytes after the declared payload are rejected. 
+ pub fn read_from(reader: R) -> std::io::Result { + let (dim, n_vectors, bitmaps) = crate::rank_io::load_sign_bitmap_from(reader)?; + Self::from_persisted_parts(dim, n_vectors, bitmaps) + } + + /// Load a `.ovsb`/legacy `.tvsb` index from an in-memory byte slice. + pub fn load_from_bytes(bytes: &[u8]) -> std::io::Result { + Self::read_from(std::io::Cursor::new(bytes)) + } + + fn from_persisted_parts( + dim: usize, + n_vectors: usize, + bitmaps: Vec, + ) -> std::io::Result { let qpv = dim / 64; // `checked_mul` (not `saturating`): on a 32-bit target `n_vectors * qpv` // can overflow `usize`; treat overflow as malformed rather than letting diff --git a/tests/index/fastscan.rs b/tests/index/fastscan.rs index f30cd6df..ae4bca12 100644 --- a/tests/index/fastscan.rs +++ b/tests/index/fastscan.rs @@ -10,6 +10,7 @@ //! roundtrip) carried over from the author's earlier rank-modes //! development. +use std::io::Cursor; use std::sync::Arc; use std::thread; @@ -302,6 +303,76 @@ fn fastscan_write_load_roundtrip_searches_identically() { assert_eq!(after.scores, before.scores, "reloaded scores must match"); } +#[test] +fn fastscan_stream_persistence_roundtrips() { + const FD: usize = 128; + const FN: usize = 96; + const PREFIX: &[u8] = b"container-prefix"; + + let mut rng = ChaCha8Rng::seed_from_u64(909091); + let docs: Vec = (0..FN * FD).map(|_| rng.random_range(-1.0..1.0)).collect(); + let queries: Vec = (0..2 * FD).map(|_| rng.random_range(-1.0..1.0)).collect(); + + let mut idx = RankQuantFastscan::new(FD); + idx.add(&docs); + let before = idx.search(&queries, 10); + + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + assert_eq!(&bytes[..4], b"OVFS"); + + let path = fs_tmp("stream_bytes"); + idx.write(&path).unwrap(); + assert_eq!(std::fs::read(&path).unwrap(), bytes); + std::fs::remove_file(&path).ok(); + + let from_bytes = RankQuantFastscan::load_from_bytes(&bytes).unwrap(); + let mut prefixed = PREFIX.to_vec(); + prefixed.extend_from_slice(&bytes); + let 
mut cursor = std::io::Cursor::new(prefixed); + cursor.set_position(PREFIX.len() as u64); + let from_reader = RankQuantFastscan::read_from(cursor).unwrap(); + + for loaded in [from_bytes, from_reader] { + assert_eq!(loaded.dim(), FD); + assert_eq!(loaded.len(), FN); + assert_eq!(loaded.byte_size(), idx.byte_size()); + let after = loaded.search(&queries, 10); + assert_eq!(after.indices, before.indices); + assert_eq!(after.scores, before.scores); + } +} + +#[test] +fn fastscan_reader_does_not_buffer_past_reported_trailing_bytes() { + const FD: usize = 128; + const FN: usize = 96; + + let mut rng = ChaCha8Rng::seed_from_u64(909092); + let docs: Vec = (0..FN * FD).map(|_| rng.random_range(-1.0..1.0)).collect(); + + let mut idx = RankQuantFastscan::new(FD); + idx.add(&docs); + + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + bytes.extend_from_slice(b"next-record"); + + let mut cursor = Cursor::new(bytes); + let Err(err) = RankQuantFastscan::read_from(&mut cursor) else { + panic!("FastScan reader accepted trailing bytes"); + }; + assert!( + err.to_string().contains("OVFS payload has trailing bytes"), + "unexpected error: {err}" + ); + assert_eq!( + cursor.position(), + 13, + "FastScan reader should stop after header" + ); +} + #[test] fn fastscan_empty_index_roundtrips() { let idx = RankQuantFastscan::new(64); // never add()-ed → 0 vectors, empty payload diff --git a/tests/index/loader_validation.rs b/tests/index/loader_validation.rs index bc6fae41..2c45f9a6 100644 --- a/tests/index/loader_validation.rs +++ b/tests/index/loader_validation.rs @@ -9,7 +9,7 @@ //! Each case pairs a positive control (a freshly-written valid index still //! round-trips) with a corrupted-but-well-shaped negative case. 
-use std::io::Write; +use std::io::{Cursor, Write}; use ordvec::{Bitmap, Rank, RankQuant, SignBitmap}; @@ -44,6 +44,202 @@ fn set_u32_field(bytes: &mut [u8], offset: usize, value: u32) { bytes[offset..offset + 4].copy_from_slice(&value.to_le_bytes()); } +fn prefixed_cursor(bytes: &[u8]) -> Cursor> { + const PREFIX: &[u8] = b"container-prefix"; + let mut prefixed = PREFIX.to_vec(); + prefixed.extend_from_slice(bytes); + let mut cursor = Cursor::new(prefixed); + cursor.set_position(PREFIX.len() as u64); + cursor +} + +fn append_trailer(mut bytes: Vec) -> Cursor> { + bytes.extend_from_slice(b"next-record"); + Cursor::new(bytes) +} + +#[test] +fn public_stream_persistence_roundtrips_core_formats() { + let corpus = make_corpus(90_001); + let query = &make_corpus(90_002)[..D]; + + { + let mut idx = Rank::new(D); + idx.add(&corpus); + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + assert_eq!(&bytes[..4], b"OVR1"); + + let p = tmp("rank_stream_bytes"); + idx.write(&p).unwrap(); + assert_eq!(read_bytes(&p), bytes); + std::fs::remove_file(&p).ok(); + + let from_bytes = Rank::load_from_bytes(&bytes).unwrap(); + let from_reader = Rank::read_from(prefixed_cursor(&bytes)).unwrap(); + assert_eq!(from_bytes.len(), idx.len()); + assert_eq!(from_reader.dim(), idx.dim()); + assert_eq!( + from_bytes.search(query, 10).indices_for_query(0), + idx.search(query, 10).indices_for_query(0) + ); + assert_eq!( + from_reader.search(query, 10).indices_for_query(0), + idx.search(query, 10).indices_for_query(0) + ); + } + + { + let mut idx = RankQuant::new(D, 2); + idx.add(&corpus); + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + assert_eq!(&bytes[..4], b"OVRQ"); + + let p = tmp("rankquant_stream_bytes"); + idx.write(&p).unwrap(); + assert_eq!(read_bytes(&p), bytes); + std::fs::remove_file(&p).ok(); + + let from_bytes = RankQuant::load_from_bytes(&bytes).unwrap(); + let from_reader = RankQuant::read_from(prefixed_cursor(&bytes)).unwrap(); + 
assert_eq!(from_bytes.len(), idx.len()); + assert_eq!(from_reader.bits(), idx.bits()); + assert_eq!( + from_bytes.search_asymmetric(query, 10).indices_for_query(0), + idx.search_asymmetric(query, 10).indices_for_query(0) + ); + assert_eq!( + from_reader + .search_asymmetric(query, 10) + .indices_for_query(0), + idx.search_asymmetric(query, 10).indices_for_query(0) + ); + } + + { + let mut idx = Bitmap::new(D, D / 4); + idx.add(&corpus); + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + assert_eq!(&bytes[..4], b"OVBM"); + + let p = tmp("bitmap_stream_bytes"); + idx.write(&p).unwrap(); + assert_eq!(read_bytes(&p), bytes); + std::fs::remove_file(&p).ok(); + + let from_bytes = Bitmap::load_from_bytes(&bytes).unwrap(); + let from_reader = Bitmap::read_from(prefixed_cursor(&bytes)).unwrap(); + assert_eq!(from_bytes.len(), idx.len()); + assert_eq!(from_reader.n_top(), idx.n_top()); + assert_eq!( + from_bytes.search(query, 10).indices_for_query(0), + idx.search(query, 10).indices_for_query(0) + ); + assert_eq!( + from_reader.top_m_candidates(query, 32), + idx.top_m_candidates(query, 32) + ); + } + + { + let mut idx = SignBitmap::new(D); + idx.add(&corpus); + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + assert_eq!(&bytes[..4], b"OVSB"); + + let p = tmp("sign_bitmap_stream_bytes"); + idx.write(&p).unwrap(); + assert_eq!(read_bytes(&p), bytes); + std::fs::remove_file(&p).ok(); + + let from_bytes = SignBitmap::load_from_bytes(&bytes).unwrap(); + let from_reader = SignBitmap::read_from(prefixed_cursor(&bytes)).unwrap(); + assert_eq!(from_bytes.len(), idx.len()); + assert_eq!(from_reader.dim(), idx.dim()); + assert_eq!(from_bytes.score_all(query), idx.score_all(query)); + assert_eq!( + from_reader.top_m_candidates(query, 32), + idx.top_m_candidates(query, 32) + ); + } +} + +#[test] +fn public_readers_do_not_buffer_past_reported_trailing_bytes() { + let corpus = make_corpus(90_101); + + { + let mut idx = Rank::new(D); + idx.add(&corpus); + 
let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + let mut cursor = append_trailer(bytes); + assert_load_err_contains( + Rank::read_from(&mut cursor), + "OVR1 payload has trailing bytes", + ); + assert_eq!( + cursor.position(), + 13, + "Rank reader should stop after header" + ); + } + + { + let mut idx = RankQuant::new(D, 2); + idx.add(&corpus); + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + let mut cursor = append_trailer(bytes); + assert_load_err_contains( + RankQuant::read_from(&mut cursor), + "OVRQ payload has trailing bytes", + ); + assert_eq!( + cursor.position(), + 14, + "RankQuant reader should stop after header" + ); + } + + { + let mut idx = Bitmap::new(D, D / 4); + idx.add(&corpus); + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + let mut cursor = append_trailer(bytes); + assert_load_err_contains( + Bitmap::read_from(&mut cursor), + "OVBM payload has trailing bytes", + ); + assert_eq!( + cursor.position(), + 17, + "Bitmap reader should stop after header" + ); + } + + { + let mut idx = SignBitmap::new(D); + idx.add(&corpus); + let mut bytes = Vec::new(); + idx.write_to(&mut bytes).unwrap(); + let mut cursor = append_trailer(bytes); + assert_load_err_contains( + SignBitmap::read_from(&mut cursor), + "OVSB payload has trailing bytes", + ); + assert_eq!( + cursor.position(), + 13, + "SignBitmap reader should stop after header" + ); + } +} + fn rank_payload_cases(dim: usize) -> (Vec, Vec) { let p = tmp("rank_empty_payload_case"); Rank::new(dim).write(&p).unwrap(); @@ -236,6 +432,25 @@ fn public_loaders_report_stable_malformed_payload_context() { ), _ => unreachable!(), } + match label { + "OVR1" => assert_load_err_contains( + Rank::load_from_bytes(&truncated_header), + &format!("{label} payload truncated"), + ), + "OVRQ" => assert_load_err_contains( + RankQuant::load_from_bytes(&truncated_header), + &format!("{label} payload truncated"), + ), + "OVBM" => assert_load_err_contains( + 
Bitmap::load_from_bytes(&truncated_header), + &format!("{label} payload truncated"), + ), + "OVSB" => assert_load_err_contains( + SignBitmap::load_from_bytes(&truncated_header), + &format!("{label} payload truncated"), + ), + _ => unreachable!(), + } std::fs::remove_file(&truncated).ok(); trailing_bytes.push(0); @@ -260,6 +475,25 @@ fn public_loaders_report_stable_malformed_payload_context() { ), _ => unreachable!(), } + match label { + "OVR1" => assert_load_err_contains( + Rank::load_from_bytes(&trailing_bytes), + &format!("{label} payload has trailing bytes"), + ), + "OVRQ" => assert_load_err_contains( + RankQuant::load_from_bytes(&trailing_bytes), + &format!("{label} payload has trailing bytes"), + ), + "OVBM" => assert_load_err_contains( + Bitmap::load_from_bytes(&trailing_bytes), + &format!("{label} payload has trailing bytes"), + ), + "OVSB" => assert_load_err_contains( + SignBitmap::load_from_bytes(&trailing_bytes), + &format!("{label} payload has trailing bytes"), + ), + _ => unreachable!(), + } std::fs::remove_file(&trailing).ok(); } } diff --git a/tests/release_publish_invariants.py b/tests/release_publish_invariants.py index 57d3c576..a7b420ad 100644 --- a/tests/release_publish_invariants.py +++ b/tests/release_publish_invariants.py @@ -344,6 +344,13 @@ def package_publish_setting(path: str) -> bool: return publish +def string_sequence(value: Any, context: str) -> list[str]: + items = sequence(value, context) + if not all(isinstance(item, str) for item in items): + fail(f"{context} must contain only strings") + return items + + def project_version(path: str) -> str: data = load_toml(path) project = mapping(data.get("project"), f"{path}: project") @@ -456,6 +463,21 @@ def check_release_compatibility_sync() -> None: if f"The Rust MSRV is Rust {core_msrv}." not in compatibility: fail(f"docs/compatibility-policy.md must mention Rust {core_msrv}") + msrv_features = read_text("docs/msrv-and-features.md") + if f"Current MSRV: Rust {core_msrv}." 
not in msrv_features: + fail(f"docs/msrv-and-features.md must mention Rust {core_msrv}") + for required in ( + "`Cargo.toml` `rust-version`", + "README MSRV badge/section", + "New feature flags must declare a stability class before merging", + "`experimental` exposes `MultiBucketBitmap`", + "`test-utils` is repo-test-only", + "`cli`, `sqlite`, `sqlite-bundled`", + "without hidden platform or feature surprises", + ): + if required not in msrv_features: + fail(f"docs/msrv-and-features.md must mention {required!r}") + ci = read_text(".github/workflows/ci.yml") msrv_toolchain = f"{core_msrv}.0" if f"name: msrv ({msrv_toolchain})" not in ci: @@ -464,6 +486,46 @@ def check_release_compatibility_sync() -> None: fail(f".github/workflows/ci.yml MSRV job must pin toolchain {msrv_toolchain}") +def check_registry_metadata_parity() -> None: + expected_crates = { + "Cargo.toml": { + "license": "MIT OR Apache-2.0", + "repository": "https://github.com/Fieldnote-Echo/ordvec", + "homepage": "https://github.com/Fieldnote-Echo/ordvec", + "documentation": "https://docs.rs/ordvec", + "readme": "README.md", + "keywords": ["vector-search", "quantization", "nearest-neighbor", "ann", "simd"], + "categories": ["algorithms", "science", "compression"], + }, + "ordvec-manifest/Cargo.toml": { + "license": "MIT OR Apache-2.0", + "repository": "https://github.com/Fieldnote-Echo/ordvec", + "homepage": "https://github.com/Fieldnote-Echo/ordvec", + "documentation": "https://docs.rs/ordvec-manifest", + "readme": "README.md", + "keywords": ["vector-search", "manifest", "provenance", "verification", "quantization"], + "categories": ["algorithms", "command-line-utilities", "data-structures"], + }, + } + + for path, expected in expected_crates.items(): + data = load_toml(path) + package = mapping(data.get("package"), f"{path}: package") + for key in ("license", "repository", "homepage", "documentation", "readme"): + if package.get(key) != expected[key]: + fail(f"{path}: package.{key} is 
{package.get(key)!r}, expected {expected[key]!r}") + for key in ("keywords", "categories"): + actual = string_sequence(package.get(key), f"{path}: package.{key}") + if actual != expected[key]: + fail(f"{path}: package.{key} is {actual!r}, expected {expected[key]!r}") + + metadata = mapping(package.get("metadata"), f"{path}: package.metadata") + docs = mapping(metadata.get("docs"), f"{path}: package.metadata.docs") + docs_rs = mapping(docs.get("rs"), f"{path}: package.metadata.docs.rs") + if docs_rs.get("all-features") is not False: + fail(f"{path}: package.metadata.docs.rs.all-features must be false") + + def check_publication_model() -> None: expected_publish = { "Cargo.toml": True, @@ -493,6 +555,38 @@ def check_python_package_metadata() -> None: ) if "numpy>=2.2" not in dependencies: fail("ordvec-python/pyproject.toml: project.dependencies must include numpy>=2.2") + license_table = mapping(project.get("license"), "ordvec-python/pyproject.toml: project.license") + if license_table.get("text") != "MIT OR Apache-2.0": + fail("ordvec-python/pyproject.toml: project.license.text must be MIT OR Apache-2.0") + classifiers = set( + string_sequence( + project.get("classifiers"), "ordvec-python/pyproject.toml: project.classifiers" + ) + ) + for classifier in ( + "Development Status :: 3 - Alpha", + "License :: OSI Approved :: MIT License", + "License :: OSI Approved :: Apache Software License", + "Operating System :: POSIX :: Linux", + "Operating System :: MacOS", + "Operating System :: Microsoft :: Windows", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Programming Language :: Python :: 3.13", + "Programming Language :: Rust", + ): + if classifier not in classifiers: + fail(f"ordvec-python/pyproject.toml: missing classifier {classifier!r}") + urls = mapping(project.get("urls"), "ordvec-python/pyproject.toml: project.urls") + for key, expected in { + "Homepage": 
"https://github.com/Fieldnote-Echo/ordvec", + "Repository": "https://github.com/Fieldnote-Echo/ordvec", + "Issues": "https://github.com/Fieldnote-Echo/ordvec/issues", + "Formalization": "https://github.com/Fieldnote-Echo/ordvec-formalization", + }.items(): + if urls.get(key) != expected: + fail(f"ordvec-python/pyproject.toml: project.urls.{key} must be {expected!r}") cargo = load_toml("ordvec-python/Cargo.toml") dependencies_table = mapping(cargo.get("dependencies"), "ordvec-python/Cargo.toml: dependencies") @@ -513,6 +607,49 @@ def check_python_package_metadata() -> None: fail("ordvec-manifest-python/pyproject.toml: project.name must be 'ordvec-manifest'") if manifest_project.get("requires-python") != ">=3.10": fail("ordvec-manifest-python/pyproject.toml: project.requires-python must be >=3.10") + manifest_license = mapping( + manifest_project.get("license"), + "ordvec-manifest-python/pyproject.toml: project.license", + ) + if manifest_license.get("text") != "MIT OR Apache-2.0": + fail( + "ordvec-manifest-python/pyproject.toml: project.license.text must be MIT OR Apache-2.0" + ) + manifest_classifiers = set( + string_sequence( + manifest_project.get("classifiers"), + "ordvec-manifest-python/pyproject.toml: project.classifiers", + ) + ) + for classifier in ( + "Development Status :: 3 - Alpha", + "License :: OSI Approved :: MIT License", + "License :: OSI Approved :: Apache Software License", + "Operating System :: POSIX :: Linux", + "Operating System :: MacOS", + "Operating System :: Microsoft :: Windows", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Programming Language :: Python :: 3.13", + "Programming Language :: Rust", + ): + if classifier not in manifest_classifiers: + fail(f"ordvec-manifest-python/pyproject.toml: missing classifier {classifier!r}") + manifest_urls = mapping( + manifest_project.get("urls"), + "ordvec-manifest-python/pyproject.toml: project.urls", + ) + 
for key, expected in { + "Homepage": "https://github.com/Fieldnote-Echo/ordvec", + "Repository": "https://github.com/Fieldnote-Echo/ordvec", + "Issues": "https://github.com/Fieldnote-Echo/ordvec/issues", + }.items(): + if manifest_urls.get(key) != expected: + fail( + "ordvec-manifest-python/pyproject.toml: " + f"project.urls.{key} must be {expected!r}" + ) manifest_cargo = load_toml("ordvec-manifest-python/Cargo.toml") manifest_dependencies = mapping( @@ -1720,6 +1857,7 @@ def main() -> None: ci_workflow = load_workflow(CI_WORKFLOW_PATH) check_release_version_sync() check_release_compatibility_sync() + check_registry_metadata_parity() check_publication_model() check_python_package_metadata() check_release_docs_include_manifest_pypi_lane()
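Reviewer note on the `read_from` contract introduced in `src/rank_io.rs`: the new stream loaders measure the stream from the reader's current position through EOF and reject trailing bytes before reading the payload. The sketch below illustrates that idea on a toy record format. It is a minimal sketch under stated assumptions, not the crate's implementation: the `DEMO` magic, the 8-byte header layout, and the names `stream_end_from_current`, `read_record`, and `read_record_from_bytes` are all hypothetical and chosen only for illustration.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Returns the absolute end offset of the stream and restores the reader
/// to the position it had on entry. (Hypothetical helper for this sketch.)
fn stream_end_from_current<R: Seek>(reader: &mut R) -> io::Result<u64> {
    let start = reader.stream_position()?;
    let end = reader.seek(SeekFrom::End(0))?;
    reader.seek(SeekFrom::Start(start))?;
    Ok(end)
}

/// Reads one toy record: 4-byte magic `DEMO`, a little-endian u32 payload
/// length, then the payload. The record must run from the reader's current
/// position through EOF; both truncation and trailing bytes are rejected
/// after the header is parsed and before any payload byte is read.
fn read_record<R: Read + Seek>(mut reader: R) -> io::Result<Vec<u8>> {
    let end = stream_end_from_current(&mut reader)?;

    let mut header = [0u8; 8];
    reader.read_exact(&mut header)?;
    if &header[..4] != b"DEMO" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "wrong magic"));
    }
    let declared = u32::from_le_bytes([header[4], header[5], header[6], header[7]]) as u64;

    // Bytes left between the end of the header and EOF.
    let remaining = end.saturating_sub(reader.stream_position()?);
    if remaining < declared {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "payload truncated"));
    }
    if remaining > declared {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "payload has trailing bytes",
        ));
    }

    let mut payload = vec![0u8; declared as usize];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}

/// Byte-slice convenience wrapper: a `Cursor` over the slice is already
/// length-bounded, so the EOF check applies to exactly these bytes.
fn read_record_from_bytes(bytes: &[u8]) -> io::Result<Vec<u8>> {
    read_record(Cursor::new(bytes))
}
```

A caller embedding such a record inside a larger container would position the reader at the record start and make sure the reader ends where the record ends, for example by wrapping just the record's bytes in a `Cursor`. Passing `&mut cursor` keeps the cursor inspectable afterwards, which is how the tests in this diff check that a rejected read stops right after the header.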