feat(python): input dtype boundary contracts (candidate ids → u32, embeddings → f32)#90
Conversation
RankQuant.search_asymmetric_subset and Bitmap.body_overlap_scores_subset declared their candidate/doc-id arrays as PyReadonlyArray1<u32>, which rust-numpy matches strictly. The int64 arrays NumPy produces by default (np.arange, np.where()[0], np.array([...]), fancy indexing, np.argpartition) therefore raised an opaque "ndarray cannot be cast as ndarray" TypeError. ordvec's own top_m_candidates* emit uint32, so the happy path worked and the suite never exercised a user-built candidate set. Accept any integer dtype and convert to the core's u32 with checked bounds: negatives and values >= 2^32 raise a clear ValueError rather than silently wrapping (np.asarray(-1, uint32) -> 4294967295, 2**32 -> 0, which would score the wrong document). Already-uint32 contiguous arrays are still borrowed zero-copy; other dtypes are copied once. The >= n bound check and the body_overlap sorted-ids policy are unchanged. Flips the two red-team tests that asserted the old strict-uint32 rejection to the new contract (integer dtypes converted by value, non-integer dtypes still TypeError) and adds dtype-matrix + fail-loud regression tests. 390 pytest pass. Reported via external review: candidate ids naturally arrive as int64. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
Review Summary by QodoAccept any integer dtype for candidate and doc-id arrays
WalkthroughsDescription• Accept any integer dtype for candidate/doc-id arrays, not just uint32 - Converts int64, int32, uint64, int16, uint16, int8, uint8 to u32 with checked bounds - Already-uint32 contiguous arrays remain zero-copy; others copied once • Reject out-of-range ids with clear ValueError instead of silent wrap - Negative ids and values >= 2**32 now raise ValueError rather than wrapping - Preserves existing >= len(index) IndexError check • Reframe dtype-validation tests to new contract - Non-integer dtypes still raise TypeError - Integer dtypes converted by value, never byte-reinterpreted Diagramflowchart LR
A["NumPy array<br/>any integer dtype"] -->|"coerce_candidate_ids"| B["uint32 contiguous?"]
B -->|"yes"| C["Borrow zero-copy<br/>CandidateIds::Borrowed"]
B -->|"no"| D["Try each dtype<br/>int64, i32, u64, etc."]
D -->|"success"| E["Convert with bounds check<br/>CandidateIds::Owned"]
D -->|"failure"| F["TypeError:<br/>non-integer dtype"]
E -->|"negative or >= 2^32"| G["ValueError:<br/>out of range"]
E -->|"valid"| H["check_ids_in_range"]
H -->|">= corpus size"| I["IndexError:<br/>out of range"]
H -->|"valid"| J["Dispatch to core"]
File Changes1. ordvec-python/src/lib.rs
|
Code Review by Qodo
1. Contiguity docs outdated
|
There was a problem hiding this comment.
Code Review
This pull request introduces robust integer dtype coercion for candidate and document ID arrays in the Python bindings, allowing the core to accept any integer dtype (such as the default int64 from NumPy) and safely convert them to u32 with checked bounds. It also adds comprehensive test coverage for these input guards. The review feedback suggests optimizing the u32 array handling in coerce_candidate_ids to process both contiguous and non-contiguous paths in the initial fast-path block, which avoids redundant casting and unnecessary bounds checking, and subsequently removing the redundant try_int_dtype!(u32); macro invocation.
| if let Ok(a) = arr.cast::<PyArray1<u32>>() { | ||
| let ro = a.readonly(); | ||
| if ro.as_slice().is_ok() { | ||
| return Ok(CandidateIds::Borrowed(ro)); | ||
| } | ||
| // Non-contiguous uint32 falls through to the copying path below. | ||
| } |
There was a problem hiding this comment.
For u32 arrays, we can handle both the contiguous (zero-copy borrow) and non-contiguous (copying) paths directly within this block. This avoids a redundant cast and readonly call, and bypasses the unnecessary try_from bounds-checking for u32 values (which are guaranteed to fit in u32).
if let Ok(a) = arr.cast::<PyArray1<u32>>() {
let ro = a.readonly();
if ro.as_slice().is_ok() {
return Ok(CandidateIds::Borrowed(ro));
}
let view = ro.as_array();
let out = view.iter().copied().collect();
return Ok(CandidateIds::Owned(out));
}| // u32 first so non-contiguous uint32 (which fell through above) is handled | ||
| // before the wider/narrower dtypes; order is otherwise irrelevant since each | ||
| // downcast is an exact dtype match. | ||
| try_int_dtype!(u32); |
There was a problem hiding this comment.
Since u32 is now fully handled in the fast path above, we can remove the redundant try_int_dtype!(u32); macro invocation and simplify the comments.
| // u32 first so non-contiguous uint32 (which fell through above) is handled | |
| // before the wider/narrower dtypes; order is otherwise irrelevant since each | |
| // downcast is an exact dtype match. | |
| try_int_dtype!(u32); | |
| // Order is irrelevant since each downcast is an exact dtype match. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…d contract
Every embedding parameter was declared PyReadonlyArray{1,2}<f32>, which rust-numpy
matches strictly — so float64 (NumPy's default for np.array([...]) and most API
embeddings) and float16 raised an opaque TypeError. The premise of ordvec is
'float vector in -> rank/sign transform', so float32 is the internal working dtype,
not a contract the caller must pre-satisfy.
Add two choke-point helpers, as_f32_1d / as_f32_2d, that every embedding entry point
(18 methods + free functions) routes through, so dtype/layout/finite policy is
defined once:
- coerce float16/float32/float64 -> float32. The rank/sign transforms are
order/sign-only and f64->f32 rounding is monotonic, so coercion is faithful; the
asymmetric LUT scores against f32-quantised docs, so sub-f32 query precision is
meaningless there too.
- REJECT bool + all integer dtypes (TypeError): a {0,1}/narrow-int vector
rank-transforms to a degenerate index-tie artefact (silent retrieval garbage) — a
deliberate usage-error guard, not an ergonomic gap.
- reject complex/object/string (complex would silently drop the imaginary part).
- reject wrong ndim (TypeError).
- reject non-C-contiguous input (ValueError) BEFORE coercion, so a transposed
float64 is never silently laundered into a contiguous float32 (a hidden copy can
dominate runtime / poison benchmarks — the copy decision stays with the caller).
- all-finite check AFTER coercion (f64 > f32::MAX rounds to +inf — caught here).
Candidate IDs are a different boundary (labels, not measurements), so they keep the
permissive contract: rename coerce_candidate_ids -> as_u32_ids_1d, accept any
range-safe integer dtype (int64 included), reject bool/float/negative/overflow with
sharper messages. Coherent split: vectors = float-only/C-contiguous/finite;
candidate IDs = integer labels range-safe-coerced to u32.
Core, persistence (.tvr/.tvrq/.tvbm/.tvsb store no floats), and the integer
primitives are untouched. Inverts the 5 tests that asserted the old strict-f32
rejection (now coerced+faithful); adds test_input_dtype.py covering the full
accept/reject matrix across all four index types. 483 pytest pass; fmt + clippy
-D warnings clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
There was a problem hiding this comment.
Pull request overview
This PR establishes two input boundary contracts at the Python FFI layer of ordvec-python: embedding vectors are coerced to float32 via new as_f32_1d/as_f32_2d helpers, and candidate/doc IDs accept any integer dtype via as_u32_ids_1d. The goal is to remove opaque TypeError papercuts when callers pass float64 (NumPy default) embeddings or int64 index arrays (from np.arange, np.where, np.argpartition, etc.), while still rejecting categorically wrong inputs (bool/int for vectors; floats/bools for IDs; non-contiguous; non-finite; out-of-range).
Changes:
- Added
as_u32_ids_1d+check_ids_in_rangewith checked dtype conversion and bounds-checking, wired intoRankQuant.search_asymmetric_subsetandBitmap.body_overlap_scores_subset. - Added
as_f32_1d/as_f32_2d(withgate_float_ndim,require_c_contiguous, post-coercion finite check), wired into 18 embedding entry points plusrank_transform/search_asymmetric_byte_lut. - Updated existing
test_add_float64_is_rejectedtests totest_add_float64_is_coerced; added newtest_input_dtype.pyand candidate-ID coverage intest_input_guards.py; reframed the red-team dtype tests.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| ordvec-python/src/lib.rs | New as_f32_* / as_u32_ids_1d choke-point helpers; entry-point signatures changed from typed PyReadonlyArray* to &Bound<'py, PyAny> with coercion. |
| ordvec-python/tests/test_input_dtype.py | New dedicated dtype/layout/finite contract suite parametrized across all 4 index types and 1-D query path. |
| ordvec-python/tests/test_input_guards.py | Adds candidate/doc-id integer-dtype acceptance, fail-loud negative/overflow, non-contiguous uint32 cases. |
| ordvec-python/tests/test_redteam_fuzz.py | Updates _WRONG_F32_DTYPES (drops f16/f64), adds _NON_INTEGER_ID_DTYPES, reframes subset/body-overlap dtype rejection tests. |
| ordvec-python/tests/test_rank.py, test_rank_quant.py, test_bitmap.py, test_sign_bitmap.py | Replaces test_add_float64_is_rejected with test_add_float64_is_coerced (asserts f64 index matches f32 index). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The choke-point helpers coerced non-float32 input to float32 (a full copy) before the caller's width check ran, so a wrong-width float64 array was fully converted only to be rejected — wasteful, and a potential OOM on a large misshapen input. Move the width check into as_f32_1d / as_f32_2d via a cheap shape-metadata read (axis_len), so dtype -> ndim -> width -> contiguity are all validated on the ORIGINAL array before the ascontiguousarray copy. as_f32_2d takes the expected dim; as_f32_1d takes Option<usize> (rank_transform has no width constraint). Adds a wrong-width-float64 regression across all four index types. 491 pytest pass; fmt + clippy -D warnings clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
Summary
Surfaced by Signal-Ridge-SysAdmin's external review of the published wheel (thank you!): the bindings declared NumPy parameters with strict dtypes, so the dtypes NumPy produces by default raised opaque
TypeErrors. The reported papercut —RankQuant.search_asymmetric_subset()rejectingint64candidate arrays — generalised into two coherent boundary contracts, one per kind of input, applied at single choke points so no entry point can diverge:The split is deliberate: vector values define ordinal structure (so bool/narrow-int must be rejected — they'd rank-transform to garbage), while candidate IDs are labels (so
int64must be accepted — rejecting it is hostile Python ergonomics).Commit 1 — candidate IDs (
as_u32_ids_1d)RankQuant.search_asymmetric_subset(candidates=…)andBitmap.body_overlap_scores_subset(doc_ids=…)required exactlyuint32;int64(whatnp.arange,np.where()[0], fancy-indexing,np.argpartitionall produce) raisedTypeError: 'ndarray' object cannot be cast as 'ndarray'.uint32(zero-copy) + any range-safe integer dtype (int64/int32/uint64/…).>= 2**32, wrong ndim — with clear messages, never a silent wrap (np.asarray(-1, uint32) → 4294967295would score the wrong doc).Commit 2 — embeddings (
as_f32_1d/as_f32_2d)Every embedding param was
PyReadonlyArray{1,2}<f32>(strict), sofloat64(NumPy's default) andfloat16raisedTypeError. The premise of ordvec is float vector in → rank/sign transform, so float32 is the internal working dtype, not a contract the caller must pre-satisfy.All 18 embedding entry points + the
rank_transform/search_asymmetric_byte_lutfree functions now route through two choke-point helpers:float16/float32/float64 → float32. The transforms are order-only (rank) or sign-only, andf64→f32rounding is monotonic — it can never reorder coordinates, only collapse a near-tie at the f32 floor (strictly less perturbation than the rank/bucket quantization). The asymmetric LUT keeps the query floats but scores against f32-quantized docs, so sub-f32 query precision is meaningless there too.bool+ all integer dtypes (TypeError): a{0,1}or narrow-int vector rank-transforms to a degenerate index-tie artifact — silent retrieval garbage. A deliberate usage-error guard.ValueError) before coercion — so a transposedfloat64is never silently laundered into a contiguousfloat32(a hidden copy that can dominate runtime / poison benchmarks; the copy decision stays with the caller).f64 > f32::MAXrounds to+inf— caught, not silently indexed).What is NOT touched
Core Rust, persistence (
.tvr/.tvrq/.tvbm/.tvsbstore only u16/u8/u64 — dtype is an ingestion/search boundary issue only), and the integer-alphabet primitives (bucket_ranks/pack_buckets/…, where a float is a category error).These reverse tests that asserted the old strict contracts — they are deliberate, not regressions:
uint32candidate rejection → reframed (integer dtypes converted by value; non-integer stillTypeError).test_add_float64_is_rejected→test_add_float64_is_coerced(now accepted + faithful to the f32 index)._WRONG_F32_DTYPESdropsfloat64/float16(now coerced); keeps int/bool/complex/object.Test plan
cargo fmt -p ordvec-python --check,cargo clippy -p ordvec-python --all-targets -- -D warningspytest ordvec-python/tests— 483 passedtest_input_dtype.py: full accept/reject matrix (float16/32/64 accepted+faithful; bool/int/complex/object/string/scalar/3-D/transpose-f32/transpose-f64/NaN/±inf/f64-overflow rejected) parametrized across all 4 index types, plus the 1-D query pathdetach, no byte reinterpretation, no panic paths, gate/error-type correctnessCredit
Reported by Signal-Ridge-SysAdmin, who reviewed the published wheel and caught the candidate-ID dtype papercut (and the broader "should this keep the floats at all?" question that motivated the embedding contract).
🤖 Generated with Claude Code