diff --git a/RELEASING.md b/RELEASING.md new file mode 100644 index 00000000..605dca36 --- /dev/null +++ b/RELEASING.md @@ -0,0 +1,69 @@ +# Releasing `ordvec` + +> **Publish is held.** A real `cargo publish` / PyPI publish happens only +> on the maintainer's explicit go. CI never publishes for real — the crate job +> runs `cargo publish -p ordvec --dry-run --locked`, and the PyPI wheel is +> `publish = false` on crates.io and ships separately. + +`ordvec` (the Rust crate) and `ordvec` on PyPI (the PyO3 wheel built from +`ordvec-python/`) are released by **manually dispatching** the release +workflows. Nothing ships on a tag push or a merge. + +## Release pipeline controls + +Both `release-crate.yml` and `release-python.yml`: + +- are **`workflow_dispatch`-only** (no `push` / tag trigger); +- run a **`require-ci-green`** gate confirming `ci.yml` (and, for the wheel, + `python.yml`) are green for the target commit on `main`; +- publish via **OIDC trusted publishing** (no long-lived crates.io / PyPI + tokens in the repo); +- emit **SLSA build provenance** (`actions/attest-build-provenance`) **before** + publishing — a failed attestation fails the release closed, so nothing ships + without provenance recorded first; +- pin every third-party action by **commit SHA**, set + `persist-credentials: false`, and default to `permissions: contents: read`. + +`release-python.yml` additionally produces **PEP 740** attestations via the PyPI +Trusted Publishing step. + +### Environment protection (configured in repo settings, not in code) + +- **Required reviewer** — each environment (`crates-io`, `pypi`) requires + maintainer (`Fieldnote-Echo`) approval before the publish job runs. +- **Deployment branch** — each environment is restricted to **`main`**, the + only ref a release may be dispatched from. This makes "only `main` can + publish" a configuration invariant rather than a manual check at approval + time. 
+ +> These two settings are the supply-chain backstop the workflow code cannot +> express on its own (THREAT-SUPPLY-001 in [THREAT_MODEL.md](THREAT_MODEL.md)). + +### Recommended (open) + +- A **`v*` tag-protection ruleset** (block update + deletion) and a basic + `main` ruleset, so a release tag cannot be force-moved and `main` cannot be + force-pushed/deleted (THREAT-SUPPLY-002). Registries are already immutable + (crates.io is yank-only; PyPI burns a version on delete), so this closes the + remaining GitHub-side mutability surface. + +## Checklist + +1. Land everything on `main`; confirm the working tree and `Cargo.lock` are in + sync (`cargo build --locked`). +2. Bump the version (crate `Cargo.toml`, and `ordvec-python` if the wheel + changed) and update `CHANGELOG.md`. Commit on `main`. +3. Confirm CI is **green for that exact `main` SHA** (the dispatch ref must be + `main` — the environment will refuse any other branch). +4. Get the maintainer's explicit go to publish. +5. Dispatch `release-crate.yml` (crate) and/or `release-python.yml` (wheel) + from **`main`**. +6. Approve the environment deployment when prompted (required reviewer). +7. Verify the published artifact (crates.io / docs.rs / PyPI) and its + provenance, and — for a coordinated release — the Zenodo deposit. + +## Coordinated release note + +The crate publish, the PyPI wheel, and the paper's Zenodo deposit are +coordinated (the paper consumes the bindings for a final cold-repro run). Do +not ship one leg in isolation without the maintainer's go. diff --git a/SECURITY.md b/SECURITY.md index cb71eee6..c2cba27b 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -18,4 +18,12 @@ We aim to acknowledge reports within a few business days. `ordvec` parses serialized index files (`.tvr` / `.tvrq` / `.tvbm` / `.tvsb`); the loaders are fuzzed (`cargo +nightly fuzz`), so parsing-robustness reports against the deserialization paths are especially -welcome. +welcome. 
Reports are also welcome against the `unsafe` SIMD kernels (shape / +bounds invariants), the Python FFI contract (buffer handling, GIL discipline), +and the release pipeline. + +## Threat model + +See [`THREAT_MODEL.md`](THREAT_MODEL.md) for the full attack-surface analysis — +existing defenses, known residual risks, and the library-owned vs +deployment-owned split. diff --git a/THREAT_MODEL.md b/THREAT_MODEL.md new file mode 100644 index 00000000..3786f416 --- /dev/null +++ b/THREAT_MODEL.md @@ -0,0 +1,450 @@ +# Threat Model — `ordvec` + +> **Status:** v0.2.0 (pre-1.0), 2026-05-25. This is the maintained threat model +> for the `ordvec` Rust crate and the `ordvec` PyO3/maturin Python bindings. It +> is reviewed when the attack surface changes (new persistence formats, new +> `unsafe` kernels, new FFI surface, or release-pipeline changes). +> +> Scope discipline: `ordvec` is a **pure computational library** — no network +> surface, no authentication/authorization, no secrets handling, no +> multi-tenancy of its own. This document deliberately does **not** enumerate +> web-application threats (SQLi/XSS/CSRF/session) that do not apply. It covers +> the surfaces that actually exist: untrusted-input parsing, `unsafe` SIMD, the +> Python FFI boundary, the supply chain, and resource use under untrusted +> callers. Deployment-owned risks (corpus trust, co-tenancy, admission control) +> are documented as *context* for integrators, not as library action items. + +See also: [`SECURITY.md`](SECURITY.md) (reporting), [`RELEASING.md`](RELEASING.md) +(release controls), [`docs/INDEX_PROVENANCE.md`](docs/INDEX_PROVENANCE.md) +(what the loaders do and do not guarantee). + +--- + +## Scope and security ownership + +**`ordvec` owns:** + +- Memory safety of all safe public APIs. +- Robust rejection of malformed serialized index files — no panic, no OOM + abort, no silent data corruption, no trailing-data acceptance. +- Deterministic, finite-input behavior for valid embeddings. 
+- Clear, documented failure contracts for invalid caller input (non-finite + floats, dimension mismatches, shape errors) — panic in Rust, `ValueError` + in Python. +- Supply-chain hygiene for the published crate and Python wheels. + +**`ordvec` does not own:** + +- Trustworthiness of the upstream embedding model. +- Corpus provenance or document-level poisoning. +- Authorization over which documents may be indexed or retrieved. +- Tenant isolation or microarchitectural isolation on a hosting platform. +- Cryptographic verification of index-file origin (callers add this externally + — see [`docs/INDEX_PROVENANCE.md`](docs/INDEX_PROVENANCE.md)). + +> A structurally valid index file can still be semantically malicious. The +> loaders validate format invariants — not truth, authorization, or corpus +> integrity. + +## Maintenance budget + +`ordvec` is maintained by a single primary contributor. Mitigations are +prioritized when they are (1) low-maintenance once merged, (2) enforceable by +tests or CI, (3) local to the library boundary, and (4) unlikely to add +operational burden downstream. Heavyweight controls (mandatory index signing, +long-running fuzz farms, service-level admission control) are documented as +**deployment guidance** until there is maintainer capacity to own them. The +absence of a second maintainer is itself a tracked supply-chain residual +(see THREAT-SUPPLY-001). + +--- + +## 1. 
Architecture and trust boundaries + +### 1.1 Component map + +| Layer | Components | Trust boundary | +|---|---|---| +| **Deserialization** | `rank_io.rs` — `.tvr` / `.tvrq` / `.tvbm` / `.tvsb` loaders | Untrusted filesystem / network byte stream | +| **Compute kernels** | `fastscan.rs`, `quant_kernels.rs`, `bitmap.rs`, `sign_bitmap.rs` | Trust established after format validation | +| **Index API** | `rank.rs`, `quant.rs`, `bitmap.rs`, `sign_bitmap.rs` | Caller-controlled query embeddings | +| **Python FFI** | `ordvec-python` (PyO3 / maturin) | Python ↔ Rust boundary; NumPy buffers | +| **CI / supply chain** | 12 GitHub Actions workflows; `Cargo.lock`; crates.io + PyPI | GitHub OIDC, crates.io, PyPI trust chains | + +The `fuzz/` directory holds **seven** cargo-fuzz targets: `load_rank`, +`load_rankquant`, `load_bitmap`, `load_sign_bitmap` (deserialization); +`roundtrip_rankquant` (write→load round-trip); `search_rankquant` (the +single-rate ingest + asymmetric-search compute path); and `fastscan_b2` (the +FastScan b=2 block-32 kernel — the one `unsafe`-heavy scan path the others do +not reach). + +### 1.2 Deployment contexts (for integrators) + +- **Offline / batch indexing** — a trusted operator encodes a corpus and writes + index files. Low risk unless files later cross a trust boundary. +- **Serving pipeline** — an index loaded at startup, then queried by + user-controlled embeddings. Query vectors cross the trust boundary on every + search call (see §6). +- **RAG substrate** — `ordvec` retrieves the *k* nearest documents fed to an + LLM. The retrieval layer becomes a target for corpus-level poisoning; this is + a **deployment risk**, not a parser risk (see §7). +- **Multi-tenant / cloud** — tenants sharing one process share SIMD execution + units. Microarchitectural isolation is a hosting-platform responsibility + (see THREAT-SIMD-002). + +--- + +## 2. 
Deserialization threats (THREAT-DESER) — library-owned + +### 2.1 Existing defenses (code-verified) + +`rank_io.rs` implements layered parser hardening: + +- Magic + version checks before any allocation. +- Fallible allocation via `try_reserve_exact` — an attacker-controlled length + field returns `InvalidData`, never an OOM abort. +- All payload sizes computed with `usize::checked_mul`; overflow returns `Err`. +- A 128 GiB `MAX_PAYLOAD` cap and `MAX_VECTORS` (64 Mi) / `MAX_DIM` caps, + enforced on **both** the load and write paths (the write-side cap runs + *before* `File::create`, so a rejected write cannot truncate an existing + file). +- Exact file-length match (`check_payload_matches_file`): trailing bytes or + short files are rejected. +- Per-row **structural** invariants: `Rank` rows must be a true permutation of + `[0, dim)` (verified by bound + duplicate checks ⇒ pigeonhole); + `RankQuant` rows must satisfy constant composition (uniform per-bucket + histogram); `Bitmap` rows must have exactly `n_top` bits set. +- No `panic!` on malformed data — all validation returns + `io::Error(InvalidData)`. +- The raw `rank_io` read/write functions are `pub(crate)`; the only public + persistence API is the index types' `write()` / `load()`, making the + write→load round-trip a type-level guarantee. + +The four loaders are covered by cargo-fuzz targets (the `load_*` targets). + +### 2.2 Index-file risk classes + +**THREAT-DESER-001 (library-owned, P4): Malformed index file.** +The loader must reject corrupt/invalid files without panic, OOM, or +trailing-data acceptance. The current implementation satisfies this for all +four formats. *Residual:* `file.metadata()?.len()` is sampled at open time; +on NFS/FUSE mounts with concurrent writers a TOCTOU window exists between +`metadata()` and the reads. On writable shared mounts the practical outcome is +a read error or `InvalidData`, not an exploit. *Likelihood:* Very Low. +*Impact:* error surfaced. 
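The size checks listed in §2.1 can be sketched as follows. This is a minimal illustration, not the code in `rank_io.rs`: the helper names `checked_payload_len` and `alloc_payload`, the header layout, and the `MAX_DIM` value are assumptions made for the example; only the 128 GiB and 64 Mi caps, the `checked_mul` / `try_reserve_exact` pattern, and the exact file-length match come from the text above.

```rust
use std::convert::TryFrom;
use std::io::{self, ErrorKind};

// Caps quoted in the threat model; MAX_DIM is a placeholder value (assumption).
const MAX_PAYLOAD: u64 = 128 << 30; // 128 GiB
const MAX_VECTORS: u64 = 64 << 20; // 64 Mi
const MAX_DIM: u64 = 1 << 16; // hypothetical, not the crate's real cap

fn invalid(msg: &str) -> io::Error {
    io::Error::new(ErrorKind::InvalidData, msg.to_string())
}

/// Hypothetical helper: validate attacker-controlled header fields against
/// the caps and the real file length, returning the payload size in bytes.
fn checked_payload_len(
    n_vectors: u64,
    dim: u64,
    bytes_per_elem: u64,
    header_len: u64,
    file_len: u64,
) -> io::Result<usize> {
    if n_vectors > MAX_VECTORS || dim > MAX_DIM {
        return Err(invalid("header field exceeds cap"));
    }
    // Every multiplication is checked, so overflow becomes an error.
    let payload = n_vectors
        .checked_mul(dim)
        .and_then(|x| x.checked_mul(bytes_per_elem))
        .ok_or_else(|| invalid("payload size overflows"))?;
    if payload > MAX_PAYLOAD {
        return Err(invalid("payload exceeds cap"));
    }
    // Exact match: both trailing bytes and short files are rejected.
    let expected = header_len
        .checked_add(payload)
        .ok_or_else(|| invalid("file size overflows"))?;
    if expected != file_len {
        return Err(invalid("file length does not match header"));
    }
    usize::try_from(payload).map_err(|_| invalid("payload does not fit in usize"))
}

/// Hypothetical helper: fallible allocation, so a hostile length field
/// yields `InvalidData` rather than an allocation abort.
fn alloc_payload(len: usize) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    buf.try_reserve_exact(len)
        .map_err(|_| invalid("allocation failed"))?;
    Ok(buf)
}
```

Ordering matters in a sketch like this: the caps and the file-length match run before any allocation, so the allocation size is already bounded by the actual file size when `try_reserve_exact` is reached.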
+ +**THREAT-DESER-002 (deployment-owned, P3 docs): Malicious-but-valid index.** +A structurally valid index with semantically poisoned contents passes every +parser check and returns attacker-influenced results. This is a *provenance* +problem, not a parser problem. *Mitigation (no format change):* +[`docs/INDEX_PROVENANCE.md`](docs/INDEX_PROVENANCE.md) documents that `ordvec` +validates structure, not origin, and lists verification options (checksum +manifest, artifact-store integrity, Sigstore / GitHub artifact attestation) +for deployments where index files cross trust boundaries. An optional sidecar +verifier (HMAC / BLAKE3) can be added later without a format bump; it is +deliberately **not** shipped now (no concrete deployment requires it, and an +in-format crypto layer would add unowned key management). + +--- + +## 3. Unsafe SIMD and memory-safety threats (THREAT-SIMD) — library-owned + +### 3.1 What the FastScan kernel does + +`scan_b2_fastscan_avx512` uses unaligned loads (`_mm256_loadu_si256`), +byte-shuffle LUT lookups (`_mm256_shuffle_epi8` / VPSHUFB), broadcast, widen +(`_mm256_cvtepu8_epi16`, `_mm512_cvtepu16_epi32`), and accumulate +(`_mm512_add_epi16/epi32`, `_mm512_storeu_si512`). It is a load/shuffle/widen/ +accumulate sequence with **no gather instructions**. The Intel DOWNFALL (GDS) +vulnerability is specific to gather-based data sampling and does **not** apply +to this kernel. + +### 3.2 Risks + +**THREAT-SIMD-001 (P1, mitigated this cycle; crate-wide rollout tracked): +Unsafe-kernel invariant preservation under future refactors.** +`scan_b2_fastscan_avx512` safety depends on caller-established invariants — +`packed_fs.len() == n_blocks * pairs * 32` (formed via `checked_mul`, overflow +⇒ caller panics) and `lut_u8.len() == pairs * 16`. These are asserted by the +`pub(crate)` entry point `search_asymmetric_fastscan_b2` before dispatch, and +`RankQuantFastscan::search` is the type-level safe wrapper that owns the shape +by construction. 
A future refactor calling the inner function directly could +bypass the asserts. *Mitigations:* the runtime asserts + the type wrapper are +the primary boundary; the scalar-vs-SIMD equivalence test +(`fastscan_b2_top10_matches_avx512_kernel`) guards behavior; and +**`#![deny(unsafe_op_in_unsafe_fn)]` is now enforced in `fastscan.rs`**, so +every unsafe operation in the kernel sits in an explicit `unsafe {}` block and +stays visible to future edits. *Open:* roll the lint out crate-wide to the +other SIMD modules (`bitmap.rs`, `sign_bitmap.rs`, `quant_kernels.rs`, +`util.rs` NEON) — tracked as a follow-up. + +**THREAT-SIMD-002 (P4, deployment note): Microarchitectural side channels in +co-tenancy.** `ordvec` does not claim protection against microarchitectural +side channels under hostile multi-tenant co-residency. The kernel uses no +gather instructions (ruling out DOWNFALL/GDS), but SIMD execution units are +shared across SMT threads, and port-contention timing channels remain +theoretically possible on vulnerable hardware. Sensitive deployments should +avoid sharing physical cores across trust domains and rely on the +OS/hypervisor side-channel posture. Not a library action item. + +**THREAT-SIMD-003 (P3): FastScan approximation is not CPU-dependent +divergence.** The 8-bit global-affine LUT in `build_fastscan_b2_query` +introduces `O(span/255)` per-pair approximation error — an intentional +trade-off matching FAISS FastScan semantics, documented in the code. The +scalar and AVX-512 paths agree on the same quantized inputs (equivalence test), +and `TopK` uses `total_cmp` for deterministic tie-breaking across all paths. +This is approximate *scoring*, not a CPU oracle. FastScan is a `#[doc(hidden)]` +pre-ranker; callers needing exact scores use `RankQuant::search_asymmetric`. + +--- + +## 4. 
Python FFI threats (THREAT-FFI) — binding-owned + +### 4.1 Existing defenses (code-verified) + +The binding takes `PyReadonlyArray`, rejects non-C-contiguous arrays with a +clear `ValueError`, validates finiteness (`ensure_finite`), maps shape errors +to `ValueError`, and releases the GIL (`py.detach`) around the pure-Rust +(Rayon-parallel) compute in every heavy method while reading the input arrays +in place. PyO3's `&mut self` borrow tracking means a second thread re-entering +the **same** index object during a released-GIL call gets a clean +`Already borrowed` `RuntimeError`, never concurrent mutation. + +### 4.2 Risks (documented contracts, implemented) + +**THREAT-FFI-001 (P2, documented): Concurrent input-array mutation during a +released-GIL call.** `PyReadonlyArray` keeps the input buffer alive and blocks +`rust-numpy`-mediated writes for the call's duration, but it cannot stop +another thread or native extension from mutating the *same backing memory* +through a reference obtained before the call. This can yield numerically +inconsistent results — a numeric-extension contract issue, not a UAF. *Status:* +documented in the module docstring and the per-method docs ("do not mutate an +input array from another thread while an `ordvec` call is in progress"), +matching the standard contract for GIL-releasing NumPy extensions. An optional +`safe_copy=True` hard-isolation parameter remains a possible future ergonomic. + +**THREAT-FFI-002 (P2, documented): Unsanitized filesystem-path forwarding.** +`write()` / `load()` forward the path to the filesystem unmodified (no `..` / +traversal sanitization). A service exposing these path arguments to user input +could enable traversal or arbitrary-file overwrite. This is a **caller +responsibility**. *Status:* documented in the module docstring and on every +`write`/`load` method ("treat the path as trusted input; web/multi-user +applications must validate paths before calling"). + +--- + +## 5. 
Supply-chain threats (THREAT-SUPPLY) + +### 5.1 Existing controls (verified) + +**Workflow code (all 12 workflows):** third-party actions pinned by commit +SHA; `persist-credentials: false` on every checkout; `permissions: contents: +read` default. **Release workflows** (`release-crate.yml`, `release-python.yml`) +are `workflow_dispatch`-only (no tag/push trigger), run a `require-ci-green` +gate against `main`, publish via **OIDC trusted publishing** (no long-lived +registry tokens), and emit **SLSA build provenance** +(`actions/attest-build-provenance`) **before** publish — a failed attestation +fails the release closed. `release-python` additionally gets **PEP 740** +attestations via Trusted Publishing. + +**Static / supply-chain analysis:** **CodeQL** scans Rust, Python, and Actions +(no-build databases); **OpenSSF Scorecard** publishes SARIF to code scanning +and the score badge; **zizmor** audits workflow hardening (pinned); a +`cargo-deny` / audit job gates advisories and licenses. The core crate has near +zero non-Rust dependencies by design (the `deps` gate greps `cargo tree -p +ordvec`); the Python binding's larger tree (numpy → ndarray) is intentional and +scoped to the wheel. + +### 5.2 Risks + +**THREAT-SUPPLY-001 (mitigated; residual = single-maintainer account +compromise): Release configuration and ownership.** The release **environments** +(`pypi`, `crates-io`) now require **approval by the maintainer** and restrict +deployment to the **`main`** branch only — so a release cannot be dispatched +from an unmerged or attacker branch, and no publish runs without an explicit +human approval. The remaining residual is *maintainer-account compromise*: a +single owner is both dispatcher and approver, so account takeover (or social +engineering) is not caught by a second human. 
*Mitigations:* strong 2FA / +passkeys on the maintainer account; recruiting a **second owner/maintainer** +(also an open OpenSSF Best-Practices item) — which would additionally make a +deployment **wait timer** worthwhile (a second party able to cancel a bad +release during the window). See [`RELEASING.md`](RELEASING.md). + +**THREAT-SUPPLY-002 (P3): Release immutability and tag integrity.** Published +artifacts are **immutable by registry design** — crates.io is yank-only (a +published version's bytes can never be overwritten) and PyPI burns a version on +delete (no different artifact may be re-uploaded under the same version). So +post-publish "silent replacement" of a version is not possible on either +registry, and consumers can verify artifacts against the SLSA / PEP 740 +provenance above. *Residual (GitHub-side):* `changelog.yml` cuts tagged GitHub +Releases, but the repo currently has **no tag-protection ruleset and no `main` +ruleset**, so a tag could be force-moved or a release asset replaced. +*Mitigation:* add a `v*` **tag ruleset** (block update + deletion) and a basic +`main` ruleset; optionally enable GitHub immutable releases. + +**THREAT-SUPPLY-003 (P3): Typosquatting adjacent names.** Namespace-adjacent +crate/package names (`ord-vec`, `ordvecs`, `order-vec`) could be registered to +typosquat dependents. *Mitigation:* publish the first functional release +promptly; optionally register adjacent names. + +--- + +## 6. Query and resource-exhaustion threats (THREAT-QUERY) — library-adjacent + +These arise from correct behavior on large-but-valid inputs from untrusted +callers, not from parser or unsafe bugs. + +**THREAT-QUERY-001 (P2, deployment docs): Caller-controlled batch / `k` +exhaustion.** `result_buffer_len(nq, k)` checks `nq * k` overflow and panics +loudly rather than under-allocating; `k` is clamped to `n_vectors`. 
But a +serving application can still be CPU/memory-exhausted by large query batches +(`nq`), large `k`, or concurrent scans over a large corpus. `ordvec` does not +enforce service-level quotas — by design (it is a library, not a server). +*Mitigation:* callers exposing search over a network must independently bound +batch size, `k`, request rate, and corpus size; a configurable `max_nq` / +`max_k` at the binding level is a possible future convenience. + +**THREAT-QUERY-002 (P3): Panic on contract violation in Rust server contexts.** +Rust APIs fail fast on invalid contract input (non-finite floats, dimension / +shape violations) via `assert!` / `expect`. In a Rust-native server an +unhandled panic crashes the thread/process; the Python bindings convert these +to typed `ValueError`. *Mitigation:* Rust service callers must validate +untrusted input before calling, or catch panics at the request boundary. + +--- + +## 7. Corpus and embedding poisoning (THREAT-POISON) — deployment-owned + +These sit **outside** the library's security perimeter; they are documented as +context for integrators using `ordvec` as a RAG substrate. Corpus poisoning of +embedding retrievers is a documented attack class (see PoisonedRAG and OWASP +LLM08:2025 in the references); the mitigations are corpus provenance, ingestion +access control, and (where applicable) hybrid lexical + vector retrieval — all +deployment concerns. The points below are the `ordvec`-specific shape of that +class. + +**THREAT-POISON-001: Ordinal rank inversion.** Because `ordvec` is +training-free, the rank transform is deterministic and invertible. An attacker +who controls the embedding pipeline can engineer an embedding whose ordinal +(Spearman) correlation with target queries is maximized — the ordinal analogue +of embedding-inversion attacks. 
`ordvec` has no codebook to protect and cannot +prevent construction of maximally correlated embeddings; mitigation requires +access control and provenance on the embedding source. + +**THREAT-POISON-002: Top-`n_top` overlap poisoning.** `Bitmap` scores documents +by `popcount(Q AND D)`. The loader enforces exactly `n_top` bits per row, so an +injected document cannot set arbitrary bits — the realistic attack is crafting a +document whose top-`n_top` coordinates maximally overlap the most-queried +coordinates. Requires knowledge of the query distribution and corpus write +access. + +**THREAT-POISON-003: RankQuant boundary exploitation.** `RankQuant` uses +equal-width bucket quantization; documents near bucket boundaries can be crafted +to score highly under the coarse pre-filter yet differ under exact reranking, +exploiting quantization information loss to pass the coarse stage. Requires +knowledge of quantization parameters and the document distribution. + +--- + +## 8. Fuzzing coverage (THREAT-FUZZ) + +Seven targets cover the four loaders, the write→load round-trip, the +single-rate compute path, and (new) the FastScan kernel. + +**THREAT-FUZZ-001 (closed this cycle): FastScan path was unfuzzed.** The +`fastscan_b2` target now drives `RankQuantFastscan` (`pack_fastscan_b2` + +`search_asymmetric_fastscan_b2` + the scalar/AVX-512 kernel), crossing the +32-doc block boundary so tail-padding blocks are exercised. On +non-AVX-512 CI runners it exercises the scalar reference kernel; under Intel SDE +it exercises the AVX-512 kernel. + +**THREAT-FUZZ-002 (P3): No CI-bound fuzzing for continuous regression.** Fuzzing +is run manually; there is no CI gate. A bounded weekly smoke job (e.g. +`-runs=50000` on `load_rank`, `load_rankquant`, and `fastscan_b2`) would catch +regressions between manual runs. (Low overhead; weighed against maintenance +budget.) 
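To make the block-boundary cases mentioned under THREAT-FUZZ-001 concrete, the shape arithmetic behind the FastScan invariants from §3.2 can be sketched as below. `fastscan_shape` and `FastscanShape` are hypothetical names for this illustration, and rounding the doc count up to whole 32-doc blocks is an assumption inferred from the "tail-padding blocks" wording; the `n_blocks * pairs * 32` and `pairs * 16` lengths and the `dim` constraints are taken from the text.

```rust
const BLOCK: usize = 32; // docs per FastScan block

#[derive(Debug, PartialEq, Eq)]
struct FastscanShape {
    n_blocks: usize,
    pairs: usize,
    packed_len: usize, // expected `packed_fs.len()`
    lut_len: usize,    // expected `lut_u8.len()`
}

/// Hypothetical helper: derive the buffer lengths the kernel relies on,
/// returning `None` on an unsupported `dim` or arithmetic overflow.
fn fastscan_shape(n_docs: usize, dim: usize) -> Option<FastscanShape> {
    // Constraints stated for `RankQuantFastscan::new`.
    if dim % 4 != 0 || dim > u16::MAX as usize {
        return None;
    }
    let pairs = dim / 2;
    // Round up to whole blocks (assumed); a partial tail block is padded.
    let n_blocks = n_docs / BLOCK + usize::from(n_docs % BLOCK != 0);
    let packed_len = n_blocks.checked_mul(pairs)?.checked_mul(BLOCK)?;
    let lut_len = pairs.checked_mul(16)?;
    Some(FastscanShape { n_blocks, pairs, packed_len, lut_len })
}
```

Under these assumptions, doc counts of 31, 32, and 33 at `dim = 64` map to one, one, and two blocks respectively, which is the boundary a fuzzer-chosen doc count needs to straddle to exercise both a full block and a padded tail.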
+ +*Note on `load_sign_bitmap`:* all bit patterns are structurally valid for sign +bitmaps (no per-row invariant), so that target is correctly scoped to parser +robustness — no OOM, no panic, no trailing-data acceptance. + +--- + +## 9. CI/CD pipeline threats (THREAT-CICD) + +**THREAT-CICD-001 (P3, mitigated by control): Workflow injection via PR +metadata.** If a `run:` step interpolated user-controlled context (PR title, +branch name) into a shell expression via `${{ ... }}` without an `env:` hop, a +script-injection could run in the runner. *Mitigation:* `zizmor` audits exactly +this class of issue and runs in CI; pass user-controlled context through `env:` +rather than inline `${{ }}` in `run:` blocks. SHA-pinned actions bound the +blast radius of a compromised dependency separately. + +--- + +## 10. Threat register + +| ID | Category | Owner | Description | Likelihood | Impact | Status / priority | +|---|---|---|---|---|---|---| +| THREAT-SIMD-001 | Memory safety | Library | Unsafe-kernel invariant bypass on refactor | Medium | High | **P1** — lint enforced in `fastscan.rs`; crate-wide rollout tracked | +| THREAT-FFI-001 | FFI | Binding | Concurrent input mutation during released-GIL call | Medium | Medium | **P2** — documented contract | +| THREAT-FFI-002 | FFI | Binding | Unsanitized path forwarding | Medium | Medium | **P2** — documented contract | +| THREAT-SUPPLY-001 | Supply chain | Config | Release config / single-owner | Low | Critical | **Mitigated** (reviewer + main-only); residual = account compromise / 2nd owner | +| THREAT-SUPPLY-002 | Supply chain | Config | Release immutability / tag integrity | Low | High | **P3** — registries immutable; add tag ruleset | +| THREAT-SUPPLY-003 | Supply chain | Config | Typosquatting adjacent names | Medium | Medium | P3 | +| THREAT-QUERY-001 | Resource | Deployment | Batch / `k` exhaustion in serving | Medium | Medium | **P2** — deployment docs | +| THREAT-QUERY-002 | Resource | Deployment | Panic on contract 
violation (Rust servers) | Low | Medium | P3 | +| THREAT-FUZZ-001 | Fuzzing | Library | FastScan path unfuzzed | Medium | High | **Closed** (`fastscan_b2` added) | +| THREAT-FUZZ-002 | Fuzzing | Library | No CI-bound fuzzing | Medium | Medium | P3 | +| THREAT-DESER-001 | Deserialization | Library | TOCTOU on shared mounts | Very Low | Low | P4 | +| THREAT-DESER-002 | Provenance | Deployment | Malicious-but-valid index | Medium | High | P3 (docs — `INDEX_PROVENANCE.md`) | +| THREAT-CICD-001 | CI/CD | Library | Workflow injection via PR metadata | Low | High | P3 — mitigated by `zizmor` | +| THREAT-SIMD-002 | Side channel | Deployment | Microarchitectural co-tenancy (no gather) | Low | Medium | P4 | +| THREAT-SIMD-003 | Semantic | Library | FastScan approximation (doc clarity) | Low | Low | P3 | +| THREAT-POISON-001 | Index poisoning | Deployment | Ordinal rank inversion | Medium | High | Deployment | +| THREAT-POISON-002 | Index poisoning | Deployment | Top-`n_top` overlap poisoning | Low | Medium | Deployment | +| THREAT-POISON-003 | Index poisoning | Deployment | RankQuant boundary exploitation | Low | Low | Deployment | + +--- + +## 11. Open mitigations + +**Done this cycle:** `#![deny(unsafe_op_in_unsafe_fn)]` in `fastscan.rs` +(SIMD-001); `fastscan_b2` fuzz target (FUZZ-001); release-environment reviewers ++ main-only deployment (SUPPLY-001); [`docs/INDEX_PROVENANCE.md`](docs/INDEX_PROVENANCE.md) +(DESER-002); [`RELEASING.md`](RELEASING.md) (SUPPLY-001). + +**Open, low cost:** + +1. Add a `v*` tag-protection ruleset (+ basic `main` ruleset) and optionally + enable GitHub immutable releases (THREAT-SUPPLY-002). +2. Roll `#![deny(unsafe_op_in_unsafe_fn)]` out crate-wide across the remaining + SIMD modules (THREAT-SIMD-001). +3. Add a bounded weekly CI fuzz smoke job (THREAT-FUZZ-002). +4. Document recommended `nq` / `k` / corpus bounds for single-process serving + in the Rust and Python API docs (THREAT-QUERY-001). 
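For item 4, a caller-side admission check of the kind THREAT-QUERY-001 recommends might look like the sketch below. `SearchLimits`, `Rejection`, and `admit_search` are hypothetical names and are not part of `ordvec`'s API; the limit values are whatever the deployment chooses. The clamp of `k` to `n_vectors` and the checked `nq * k` product mirror the behavior described in §6.

```rust
/// Deployment-chosen bounds (hypothetical type, not an `ordvec` API).
#[derive(Debug, Clone, Copy)]
struct SearchLimits {
    max_nq: usize,
    max_k: usize,
    max_results: usize, // cap on nq * k result slots
}

#[derive(Debug, PartialEq, Eq)]
enum Rejection {
    EmptyRequest,
    TooManyQueries,
    KTooLarge,
    ResultBufferTooLarge,
}

/// Validate an untrusted search request before it reaches the index.
/// Returns the effective `k` (clamped to the corpus size) on success.
fn admit_search(
    nq: usize,
    k: usize,
    n_vectors: usize,
    limits: SearchLimits,
) -> Result<usize, Rejection> {
    if nq == 0 || k == 0 {
        return Err(Rejection::EmptyRequest);
    }
    if nq > limits.max_nq {
        return Err(Rejection::TooManyQueries);
    }
    if k > limits.max_k {
        return Err(Rejection::KTooLarge);
    }
    let k_eff = k.min(n_vectors);
    // Checked product: reject instead of panicking or under-allocating.
    match nq.checked_mul(k_eff) {
        Some(n) if n <= limits.max_results => Ok(k_eff),
        _ => Err(Rejection::ResultBufferTooLarge),
    }
}
```

Returning a typed rejection rather than panicking keeps the request boundary in control of the failure, which also addresses the THREAT-QUERY-002 concern for Rust-native servers.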
+ +**Later (not release blockers):** a second maintainer/owner (then a release +wait timer becomes meaningful); an optional sidecar index verifier +(`ordvec verify` / external HMAC/BLAKE3 manifest) if a deployment requires +tamper-evidence (DESER-002); a `safe_copy=True` FFI isolation option +(FFI-001). + +--- + +## References + +Only load-bearing, verifiable sources are listed. + +- **PoisonedRAG** — *Knowledge Corruption Attacks to Retrieval-Augmented + Generation of Large Language Models* (arXiv:2402.07867). Establishes that + injecting a small number of poisoned passages into a retriever corpus + achieves high attack-success rates — context for §7. +- **OWASP LLM08:2025 — Vector and Embedding Weaknesses.** Retrieval-layer risk + class (poisoning, embedding inversion, access-control bypass) — context for + §7 / scope. +- **"Memory-Safety Challenge Considered Solved? An In-Depth Study with All Rust + CVEs"** (arXiv:2003.03296). Real-world Rust memory-safety bugs require + `unsafe` code — the rationale for the §3 focus on the SIMD kernels. +- **GitHub Security Lab — preventing pwn-requests.** Expression-injection in + `run:` steps and untrusted-context handling — basis for THREAT-CICD-001. diff --git a/codecov.yml b/codecov.yml new file mode 100644 index 00000000..05f4ee71 --- /dev/null +++ b/codecov.yml @@ -0,0 +1,25 @@ +# Codecov is a dashboard + README badge for this repo. The *enforced* coverage +# gate is the cargo-llvm-cov `--fail-under-lines 78` floor in +# .github/workflows/coverage.yml — set under the AVX-512-free runner figure: +# the hosted coverage runner has no AVX-512, so the runtime SIMD dispatch never +# reaches the AVX-512 kernels (they are exercised by the separate `avx512` job +# under Intel SDE). See issue #68. 
+coverage: + status: + project: + default: + target: 78% # mirror the enforced cargo-llvm-cov floor + threshold: 1% + patch: + default: + # The AVX-512 kernels cannot be covered on the no-AVX-512 coverage + # runner, so patch coverage on any SIMD-kernel change is a false signal + # (touching a kernel re-indents lines the runner never executes — see + # #68). Keep patch advisory rather than blocking PRs on it; real + # coverage enforcement lives in the workflow floor above. + informational: true + +# The cargo-fuzz workspace is excluded from the crate build and is not part of +# the tested surface measured by cargo-llvm-cov. +ignore: + - "fuzz" diff --git a/docs/INDEX_PROVENANCE.md b/docs/INDEX_PROVENANCE.md new file mode 100644 index 00000000..ec8d3fce --- /dev/null +++ b/docs/INDEX_PROVENANCE.md @@ -0,0 +1,55 @@ +# Index file provenance + +`ordvec` persists indexes as `.tvr` / `.tvrq` / `.tvbm` / `.tvsb` files and +reloads them through `Rank::load`, `RankQuant::load`, `Bitmap::load`, and +`SignBitmap::load`. This note states exactly **what the loaders guarantee and +what they do not**, so you can decide whether an index file needs out-of-band +verification before you load it. + +## What the loaders validate + +The loaders treat the byte stream as **untrusted input** and reject malformed +files without panicking, aborting, or silently accepting garbage: + +- magic + version checks before any allocation; +- fallible allocation (`try_reserve_exact`) — an attacker-controlled length + field returns `InvalidData`, never an OOM abort; +- all payload sizes computed with `checked_mul`; overflow is an error; +- a 128 GiB `MAX_PAYLOAD` cap plus `MAX_VECTORS` / `MAX_DIM` caps; +- an exact file-length match (trailing bytes or short files are rejected); +- per-row **structural** invariants: `Rank` rows must be a true permutation of + `[0, dim)`, `RankQuant` rows must satisfy constant composition, `Bitmap` rows + must have exactly `n_top` bits set. 
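The per-row structural invariants in the last bullet can be illustrated with the following sketch. It assumes unpacked in-memory rows (`u16` ranks, `u8` bucket ids, `u64` bitmap words), which is a simplification for the example and not necessarily the on-disk layout; the function names are hypothetical.

```rust
/// `Rank` row: a true permutation of `0..dim`. A bound check plus a
/// duplicate check suffices, since `dim` distinct in-range values must
/// cover the whole range (pigeonhole).
fn is_permutation(row: &[u16], dim: usize) -> bool {
    if row.len() != dim {
        return false;
    }
    let mut seen = vec![false; dim];
    for &r in row {
        let r = r as usize;
        if r >= dim || seen[r] {
            return false;
        }
        seen[r] = true;
    }
    true
}

/// `RankQuant` row: constant composition, i.e. every bucket id occurs
/// the same number of times (a uniform per-bucket histogram).
fn has_constant_composition(row: &[u8], n_buckets: usize) -> bool {
    if n_buckets == 0 || row.len() % n_buckets != 0 {
        return false;
    }
    let per_bucket = row.len() / n_buckets;
    let mut counts = vec![0usize; n_buckets];
    for &b in row {
        let b = b as usize;
        if b >= n_buckets {
            return false;
        }
        counts[b] += 1;
    }
    counts.iter().all(|&c| c == per_bucket)
}

/// `Bitmap` row: exactly `n_top` bits set across all words.
fn has_exact_popcount(row_words: &[u64], n_top: u32) -> bool {
    let set: u64 = row_words.iter().map(|w| u64::from(w.count_ones())).sum();
    set == u64::from(n_top)
}
```

Checks like these establish that a row is well-formed; as the next section notes, they say nothing about whether the values were chosen honestly.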
+ +A file that survives all of this is **structurally well-formed**. The four +loaders are exercised by `cargo fuzz` (the `load_*` targets). + +## What the loaders do NOT validate + +The loaders validate **structure, not origin or truth**: + +- They do **not** authenticate who produced the file or whether it was modified + in transit or at rest. There is no signature, MAC, or checksum in the format. +- A **structurally valid but semantically poisoned** index — one whose ranks, + buckets, or bitmaps were crafted to bias retrieval — passes every check and + returns attacker-influenced results. This is a *provenance* problem, not a + parser problem (THREAT-DESER-002 / THREAT-POISON-\* in + [../THREAT_MODEL.md](../THREAT_MODEL.md)). + +## Guidance for deployments where index files cross a trust boundary + +If you load index files that were produced elsewhere, transferred over a +network, or stored on shared/mutable infrastructure, verify them **before** +loading using whatever your deployment already trusts: + +- a checksum manifest (e.g. SHA-256) recorded by the build that produced the + index, verified at load time; +- your artifact store's integrity controls; +- a signature / attestation layer (e.g. Sigstore, GitHub artifact attestations) + over the index files. + +`ordvec` deliberately ships **no** built-in signing/MAC layer today: without a +concrete deployment requiring it, an in-format crypto layer would add key +management with no clear owner. A sidecar verifier (e.g. an `ordvec verify` +utility, or an external HMAC/BLAKE3 manifest) can be added later **without a +file-format change** if a real deployment needs tamper-evidence. 
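The "verify before loading" guidance above amounts to a gate in front of the loader. A minimal sketch, assuming a manifest that stores one hex digest per index file: `verify_before_load` and `decode_hex` are hypothetical names, not part of `ordvec`, and the digest function is deliberately caller-supplied because the Rust standard library has no SHA-256.

```rust
/// Decode a hex string into bytes; returns None on malformed input.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(s.len() / 2);
    for pair in s.as_bytes().chunks(2) {
        let hi = (pair[0] as char).to_digit(16)?;
        let lo = (pair[1] as char).to_digit(16)?;
        out.push((hi * 16 + lo) as u8);
    }
    Some(out)
}

/// Gate in front of a loader: compare the digest of the raw index bytes with
/// the digest recorded in a manifest by the build that produced the index.
/// `digest` is supplied by the caller (hypothetical parameter); the sketch
/// does not pick a hash function itself.
fn verify_before_load<D>(index_bytes: &[u8], expected_hex: &str, digest: D) -> Result<(), String>
where
    D: Fn(&[u8]) -> Vec<u8>,
{
    let expected = decode_hex(expected_hex.trim())
        .ok_or_else(|| "manifest digest is not valid hex".to_string())?;
    if digest(index_bytes) == expected {
        Ok(())
    } else {
        Err("index digest does not match manifest".to_string())
    }
}
```

In a real deployment the `digest` argument would wrap a vetted SHA-256 or BLAKE3 implementation, and the loader would be called only after this returns `Ok(())`. A plain checksum detects accidental or in-transit modification only if the manifest itself travels over a channel the deployment already trusts.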
diff --git a/fuzz/Cargo.lock b/fuzz/Cargo.lock index 8e0abbb1..14cf0d4a 100644 --- a/fuzz/Cargo.lock +++ b/fuzz/Cargo.lock @@ -246,9 +246,9 @@ checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50" [[package]] name = "ordered-float" -version = "4.6.0" +version = "5.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7bb71e1b3fa6ca1c61f383464aaf2bb0e2f8e772a1f01d486832464de363b951" +checksum = "b7d950ca161dc355eaf28f82b11345ed76c6e1f6eb1f4f4479e0323b9e2fbd0e" dependencies = [ "num-traits", ] diff --git a/fuzz/Cargo.toml b/fuzz/Cargo.toml index 1929471c..f94324be 100644 --- a/fuzz/Cargo.toml +++ b/fuzz/Cargo.toml @@ -64,3 +64,12 @@ path = "fuzz_targets/roundtrip_rankquant.rs" test = false doc = false bench = false + +# FastScan b=2 compute path (`RankQuantFastscan`): the one unsafe-heavy scan +# kernel the `search_rankquant` target does not reach. +[[bin]] +name = "fastscan_b2" +path = "fuzz_targets/fastscan_b2.rs" +test = false +doc = false +bench = false diff --git a/fuzz/fuzz_targets/fastscan_b2.rs b/fuzz/fuzz_targets/fastscan_b2.rs new file mode 100644 index 00000000..5f7b5178 --- /dev/null +++ b/fuzz/fuzz_targets/fastscan_b2.rs @@ -0,0 +1,52 @@ +//! libFuzzer target for the FastScan b=2 compute path (`RankQuantFastscan`): +//! `add` (rank_transform -> bucket -> block-32 re-pack via `pack_fastscan_b2`) +//! then `search` (`search_asymmetric_fastscan_b2` -> the scalar / AVX-512 +//! VPSHUFB-LUT kernel -> TopK). This is the one `unsafe`-heavy scan path the +//! `search_rankquant` target does NOT reach: `RankQuant::search_asymmetric` +//! dispatches the single-rate kernels, never the FastScan block-32 kernel. +//! +//! `dim` is fixed at 64 — `RankQuantFastscan::new` requires `dim % 4 == 0` +//! (b=2 constant composition) and `dim <= u16::MAX`; 64 also gives a +//! `dim / 2 = 32`-pair inner loop. The fuzzer shapes the doc count (crossing +//! 
the 32-doc block boundary so tail-padding blocks are exercised), the +//! embedding/query values, and `k` (including `k == 0`). Values map to finite +//! f32: the public API rejects NaN / ±Inf by contract, so raw float bit +//! patterns would only re-exercise that guard, not the kernel. +//! +//! On CI runners without AVX-512 this drives the scalar reference kernel +//! (`scan_b2_fastscan_scalar`); under Intel SDE it drives the AVX-512 kernel. +//! +//! Contract: no panic, abort, or out-of-bounds access on any input. +#![no_main] + +use libfuzzer_sys::fuzz_target; +use ordvec::RankQuantFastscan; + +fuzz_target!(|data: &[u8]| { + if data.len() < 3 { + return; + } + // dim % 4 == 0 and dim <= u16::MAX (RankQuantFastscan::new contract). + const DIM: usize = 64; + // 1..=100 docs: crosses the 32-doc block boundary (1..=4 blocks) so the + // tail-padding path (`n % 32 != 0`) is exercised. + let n = (data[0] as usize % 100) + 1; + let k = data[1] as usize % (n + 1); // 0..=n + + let payload = &data[2..]; + let total = (n + 1) * DIM; + let floats: Vec<f32> = (0..total) + .map(|i| { + if payload.is_empty() { + 0.0 + } else { + payload[i % payload.len()] as f32 - 128.0 + } + }) + .collect(); + let (vecs, query) = floats.split_at(n * DIM); + + let mut idx = RankQuantFastscan::new(DIM); + idx.add(vecs); + let _ = idx.search(query, k); +}); diff --git a/src/fastscan.rs b/src/fastscan.rs index d2210082..7a757aed 100644 --- a/src/fastscan.rs +++ b/src/fastscan.rs @@ -35,6 +35,13 @@ //! [`l2_normalise`](crate::util::l2_normalise), and `k` is clamped to //! `n_vectors` exactly as the sibling search methods do. +// Make every unsafe operation inside an `unsafe fn` require an explicit +// `unsafe {}` block rather than leaning on the fn-level `unsafe`. This is +// defense-in-depth for the AVX-512 FastScan kernel below: it keeps the kernel's +// unsafe surface visible to future edits. 
Crate-wide rollout to the other SIMD +// modules is tracked separately (see THREAT_MODEL.md, THREAT-SIMD-001). +#![deny(unsafe_op_in_unsafe_fn)] + use rayon::prelude::*; use crate::rank::{bucket_ranks, rank_transform, rankquant_norm}; @@ -212,27 +219,37 @@ unsafe fn scan_b2_fastscan_avx512( // is 255, so FLUSH × 255 must fit in u16: FLUSH ≤ 257. Pick 256. const FLUSH: usize = 256; - for b in 0..n_blocks { - let block_ptr = packed_fs.as_ptr().add(b * bytes_per_block); - - // 32-lane u32 accumulators (split across two __m512i, lo/hi 16). - let mut acc32_lo = _mm512_setzero_si512(); - let mut acc32_hi = _mm512_setzero_si512(); - - let mut p = 0usize; - while p < pairs { - let chunk = (pairs - p).min(FLUSH); - - // 32-lane u16 accumulator split: each holds 16 u16 values - // in its low 256 bits. - let mut acc16_lo = _mm512_setzero_si512(); // lanes 0..16 - let mut acc16_hi = _mm512_setzero_si512(); // lanes 16..32 - - let inner_end = p + chunk; - let inner_chunks_4 = chunk / 4; - let mut pp = p; - - for _ in 0..inner_chunks_4 { + // SAFETY: every raw load/store and AVX-512 intrinsic in this loop is + // in-bounds and feature-gated per the function-level SAFETY comment above. + // The explicit block is required by `#![deny(unsafe_op_in_unsafe_fn)]`. + unsafe { + for b in 0..n_blocks { + let block_ptr = packed_fs.as_ptr().add(b * bytes_per_block); + + // 32-lane u32 accumulators (split across two __m512i, lo/hi 16). + let mut acc32_lo = _mm512_setzero_si512(); + let mut acc32_hi = _mm512_setzero_si512(); + + let mut p = 0usize; + while p < pairs { + let chunk = (pairs - p).min(FLUSH); + + // 32-lane u16 accumulator split: each holds 16 u16 values + // in its low 256 bits. 
+ let mut acc16_lo = _mm512_setzero_si512(); // lanes 0..16 + let mut acc16_hi = _mm512_setzero_si512(); // lanes 16..32 + + let inner_end = p + chunk; + let inner_chunks_4 = chunk / 4; + let mut pp = p; + + // Score one coord-pair across all 32 lanes: VPSHUFB the per-pair + // 16-byte LUT (broadcast into both 128-bit halves) by the packed + // nibble codes, widen u8 -> u16, accumulate. `pp` / `block_ptr` / + // `lut_u8` / `acc16_*` are captured by name at each call site. + // (macro_rules is expanded at compile time, so defining it here + // has no runtime cost; it keeps the unrolled body in one place and + // is reused by the remainder loop below.) macro_rules! step { ($off:expr) => {{ let codes256 = @@ -250,53 +267,48 @@ unsafe fn scan_b2_fastscan_avx512( acc16_hi = _mm512_add_epi16(acc16_hi, _mm512_castsi256_si512(hi256)); }}; } - step!(0); - step!(1); - step!(2); - step!(3); - pp += 4; - } - while pp < inner_end { - let codes256 = _mm256_loadu_si256(block_ptr.add(pp * 32) as *const __m256i); - let lut128 = _mm_loadu_si128(lut_u8.as_ptr().add(pp * 16) as *const __m128i); - let lut256 = _mm256_broadcastsi128_si256(lut128); - let contrib = _mm256_shuffle_epi8(lut256, codes256); - let lo128 = _mm256_castsi256_si128(contrib); - let hi128 = _mm256_extracti128_si256(contrib, 1); - let lo256 = _mm256_cvtepu8_epi16(lo128); - let hi256 = _mm256_cvtepu8_epi16(hi128); - acc16_lo = _mm512_add_epi16(acc16_lo, _mm512_castsi256_si512(lo256)); - acc16_hi = _mm512_add_epi16(acc16_hi, _mm512_castsi256_si512(hi256)); - pp += 1; - } + // 4-wide unroll, then the remainder one pair at a time. + for _ in 0..inner_chunks_4 { + step!(0); + step!(1); + step!(2); + step!(3); + pp += 4; + } - // Widen u16 → u32. Meaningful u16s sit in the low 256 bits. 
- let lo256_u16 = _mm512_castsi512_si256(acc16_lo); - let hi256_u16 = _mm512_castsi512_si256(acc16_hi); - let lo32 = _mm512_cvtepu16_epi32(lo256_u16); - let hi32 = _mm512_cvtepu16_epi32(hi256_u16); - acc32_lo = _mm512_add_epi32(acc32_lo, lo32); - acc32_hi = _mm512_add_epi32(acc32_hi, hi32); + while pp < inner_end { + step!(0); + pp += 1; + } - p = inner_end; - } + // Widen u16 → u32. Meaningful u16s sit in the low 256 bits. + let lo256_u16 = _mm512_castsi512_si256(acc16_lo); + let hi256_u16 = _mm512_castsi512_si256(acc16_hi); + let lo32 = _mm512_cvtepu16_epi32(lo256_u16); + let hi32 = _mm512_cvtepu16_epi32(hi256_u16); + acc32_lo = _mm512_add_epi32(acc32_lo, lo32); + acc32_hi = _mm512_add_epi32(acc32_hi, hi32); - let mut tmp_lo = [0u32; 16]; - let mut tmp_hi = [0u32; 16]; - _mm512_storeu_si512(tmp_lo.as_mut_ptr() as *mut _, acc32_lo); - _mm512_storeu_si512(tmp_hi.as_mut_ptr() as *mut _, acc32_hi); + p = inner_end; + } - let doc_base = b * 32; - let docs_in_block = (n - doc_base).min(32); - for lane in 0..docs_in_block { - let acc = if lane < 16 { - tmp_lo[lane] - } else { - tmp_hi[lane - 16] - }; - let raw = bias_sum + (acc as f32) * inv_q; - top.maybe_insert(raw * scale, doc_base + lane); + let mut tmp_lo = [0u32; 16]; + let mut tmp_hi = [0u32; 16]; + _mm512_storeu_si512(tmp_lo.as_mut_ptr() as *mut _, acc32_lo); + _mm512_storeu_si512(tmp_hi.as_mut_ptr() as *mut _, acc32_hi); + + let doc_base = b * 32; + let docs_in_block = (n - doc_base).min(32); + for lane in 0..docs_in_block { + let acc = if lane < 16 { + tmp_lo[lane] + } else { + tmp_hi[lane - 16] + }; + let raw = bias_sum + (acc as f32) * inv_q; + top.maybe_insert(raw * scale, doc_base + lane); + } } } }
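For readers tracing the accumulator widths in the kernel above, a scalar model of one 32-doc block may help. It is a sketch under stated assumptions, not the crate's `scan_b2_fastscan_scalar`: the name `scan_block_scalar` is hypothetical, and the byte layout (32 code bytes and 16 LUT bytes per coordinate pair) is inferred from the `pp * 32` / `pp * 16` pointer offsets in the diff. It mirrors the two ideas in the hunk: a VPSHUFB-style table lookup per lane, and a u16 accumulator flushed into u32 every `FLUSH` pairs so it cannot overflow.

```rust
/// Pairs accumulated in u16 before widening: 256 * 255 = 65_280 <= u16::MAX.
const FLUSH: usize = 256;

/// Scalar model of scoring one 32-lane block (hypothetical helper).
/// `block` holds 32 code bytes per coordinate pair, `lut` holds 16 bytes per pair.
fn scan_block_scalar(block: &[u8], lut: &[u8], pairs: usize) -> [u32; 32] {
    assert_eq!(block.len(), pairs * 32);
    assert_eq!(lut.len(), pairs * 16);

    let mut acc32 = [0u32; 32];
    let mut p = 0usize;
    while p < pairs {
        let chunk = (pairs - p).min(FLUSH);
        let mut acc16 = [0u16; 32];

        for pp in p..p + chunk {
            for lane in 0..32 {
                let code = block[pp * 32 + lane];
                // VPSHUFB semantics: a control byte with the high bit set
                // yields 0; otherwise the low nibble indexes the 16-byte table.
                let contrib = if code & 0x80 != 0 {
                    0
                } else {
                    lut[pp * 16 + (code & 0x0F) as usize]
                };
                // checked_add documents the invariant: with at most FLUSH
                // contributions of at most 255 each, this never overflows.
                acc16[lane] = acc16[lane]
                    .checked_add(contrib as u16)
                    .expect("u16 accumulator overflow: FLUSH too large");
            }
        }

        // Widen u16 to u32 and fold into the long-running accumulator.
        for lane in 0..32 {
            acc32[lane] += acc16[lane] as u32;
        }
        p += chunk;
    }
    acc32
}
```

With 300 pairs that each contribute 255, a single u16 would overflow (300 × 255 = 76 500), while the flush after 256 pairs keeps every partial sum at or below 65 280.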