Skip to content

feat(python): input dtype boundary contracts (candidate ids → u32, embeddings → f32)#90

Merged
Navi Bot (project-navi-bot) merged 3 commits into
mainfrom
fix/python-candidate-id-dtype
May 28, 2026
Merged

feat(python): input dtype boundary contracts (candidate ids → u32, embeddings → f32)#90
Navi Bot (project-navi-bot) merged 3 commits into
mainfrom
fix/python-candidate-id-dtype

Conversation

@Fieldnote-Echo

@Fieldnote-Echo Fieldnote-Echo commented May 27, 2026

Copy link
Copy Markdown
Member

Summary

Surfaced by Signal-Ridge-SysAdmin's external review of the published wheel (thank you!): the bindings declared NumPy parameters with strict dtypes, so the dtypes NumPy produces by default raised opaque TypeErrors. The reported papercut — RankQuant.search_asymmetric_subset() rejecting int64 candidate arrays — generalised into two coherent boundary contracts, one per kind of input, applied at single choke points so no entry point can diverge:

vectors:        float-only, C-contiguous, finite after f32 coercion   (as_f32_1d / as_f32_2d)
candidate IDs:  integer labels, range-safe-coerced to u32             (as_u32_ids_1d)

The split is deliberate: vector values define ordinal structure (so bool/narrow-int must be rejected — they'd rank-transform to garbage), while candidate IDs are labels (so int64 must be accepted — rejecting it is hostile Python ergonomics).

Commit 1 — candidate IDs (as_u32_ids_1d)

RankQuant.search_asymmetric_subset(candidates=…) and Bitmap.body_overlap_scores_subset(doc_ids=…) required exactly uint32; int64 (what np.arange, np.where()[0], fancy-indexing, np.argpartition all produce) raised TypeError: 'ndarray' object cannot be cast as 'ndarray'.

  • Accept uint32 (zero-copy) + any range-safe integer dtype (int64/int32/uint64/…).
  • Reject bool, float, complex/object/string, negatives, >= 2**32, wrong ndim — with clear messages, never a silent wrap (np.asarray(-1, uint32) → 4294967295 would score the wrong doc).

Commit 2 — embeddings (as_f32_1d / as_f32_2d)

Every embedding param was PyReadonlyArray{1,2}<f32> (strict), so float64 (NumPy's default) and float16 raised TypeError. The premise of ordvec is float vector in → rank/sign transform, so float32 is the internal working dtype, not a contract the caller must pre-satisfy.

All 18 embedding entry points + the rank_transform/search_asymmetric_byte_lut free functions now route through two choke-point helpers:

  • Coerce float16/float32/float64 → float32. The transforms are order-only (rank) or sign-only, and f64→f32 rounding is monotonic — it can never reorder coordinates, only collapse a near-tie at the f32 floor (strictly less perturbation than the rank/bucket quantization). The asymmetric LUT keeps the query floats but scores against f32-quantized docs, so sub-f32 query precision is meaningless there too.
  • Reject bool + all integer dtypes (TypeError): a {0,1} or narrow-int vector rank-transforms to a degenerate index-tie artifact — silent retrieval garbage. A deliberate usage-error guard.
  • Reject complex/object/string (complex would silently drop its imaginary part), and wrong ndim.
  • Reject non-C-contiguous input (ValueError) before coercion — so a transposed float64 is never silently laundered into a contiguous float32 (a hidden copy that can dominate runtime / poison benchmarks; the copy decision stays with the caller).
  • All-finite check after coercion (f64 > f32::MAX rounds to +inf — caught, not silently indexed).

What is NOT touched

Core Rust, persistence (.tvr/.tvrq/.tvbm/.tvsb store only u16/u8/u64 — dtype is an ingestion/search boundary issue only), and the integer-alphabet primitives (bucket_ranks/pack_buckets/…, where a float is a category error).

⚠️ Intentional contract changes (reviewers please note)

These reverse tests that asserted the old strict contracts — they are deliberate, not regressions:

  • The two red-team tests asserting strict-uint32 candidate rejection → reframed (integer dtypes converted by value; non-integer still TypeError).
  • The five test_add_float64_is_rejectedtest_add_float64_is_coerced (now accepted + faithful to the f32 index).
  • _WRONG_F32_DTYPES drops float64/float16 (now coerced); keeps int/bool/complex/object.

Test plan

  • cargo fmt -p ordvec-python --check, cargo clippy -p ordvec-python --all-targets -- -D warnings
  • pytest ordvec-python/tests483 passed
  • New test_input_dtype.py: full accept/reject matrix (float16/32/64 accepted+faithful; bool/int/complex/object/string/scalar/3-D/transpose-f32/transpose-f64/NaN/±inf/f64-overflow rejected) parametrized across all 4 index types, plus the 1-D query path
  • New candidate-ID coverage: dtype matrix, natural NumPy idioms, fail-loud negative/overflow, non-contiguous uint32
  • Two code-reviewer passes (one per commit): APPROVE, 0 findings — GIL-lifetime soundness across detach, no byte reinterpretation, no panic paths, gate/error-type correctness

Credit

Reported by Signal-Ridge-SysAdmin, who reviewed the published wheel and caught the candidate-ID dtype papercut (and the broader "should this keep the floats at all?" question that motivated the embedding contract).

🤖 Generated with Claude Code

RankQuant.search_asymmetric_subset and Bitmap.body_overlap_scores_subset
declared their candidate/doc-id arrays as PyReadonlyArray1<u32>, which
rust-numpy matches strictly. The int64 arrays NumPy produces by default
(np.arange, np.where()[0], np.array([...]), fancy indexing, np.argpartition)
therefore raised an opaque "ndarray cannot be cast as ndarray" TypeError.
ordvec's own top_m_candidates* emit uint32, so the happy path worked and the
suite never exercised a user-built candidate set.

Accept any integer dtype and convert to the core's u32 with checked bounds:
negatives and values >= 2^32 raise a clear ValueError rather than silently
wrapping (np.asarray(-1, uint32) -> 4294967295, 2**32 -> 0, which would score
the wrong document). Already-uint32 contiguous arrays are still borrowed
zero-copy; other dtypes are copied once. The >= n bound check and the
body_overlap sorted-ids policy are unchanged.

Flips the two red-team tests that asserted the old strict-uint32 rejection to
the new contract (integer dtypes converted by value, non-integer dtypes still
TypeError) and adds dtype-matrix + fail-loud regression tests. 390 pytest pass.

Reported via external review: candidate ids naturally arrive as int64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Accept any integer dtype for candidate and doc-id arrays

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Accept any integer dtype for candidate/doc-id arrays, not just uint32
  - Converts int64, int32, uint64, int16, uint16, int8, uint8 to u32 with checked
  bounds
  - Already-uint32 contiguous arrays remain zero-copy; others copied once
• Reject out-of-range ids with clear ValueError instead of silent wrap
  - Negative ids and values >= 2**32 now raise ValueError rather than wrapping
  - Preserves existing >= len(index) IndexError check
• Reframe dtype-validation tests to new contract
  - Non-integer dtypes still raise TypeError
  - Integer dtypes converted by value, never byte-reinterpreted
Diagram
flowchart LR
  A["NumPy array<br/>any integer dtype"] -->|"coerce_candidate_ids"| B["uint32 contiguous?"]
  B -->|"yes"| C["Borrow zero-copy<br/>CandidateIds::Borrowed"]
  B -->|"no"| D["Try each dtype<br/>int64, i32, u64, etc."]
  D -->|"success"| E["Convert with bounds check<br/>CandidateIds::Owned"]
  D -->|"failure"| F["TypeError:<br/>non-integer dtype"]
  E -->|"negative or >= 2^32"| G["ValueError:<br/>out of range"]
  E -->|"valid"| H["check_ids_in_range"]
  H -->|">= corpus size"| I["IndexError:<br/>out of range"]
  H -->|"valid"| J["Dispatch to core"]

Loading

Grey Divider

File Changes

1. ordvec-python/src/lib.rs ✨ Enhancement +123/-37

Implement flexible integer dtype coercion for candidate ids

• Added coerce_candidate_ids helper to accept any integer dtype and convert to u32 with checked
 bounds
• Added CandidateIds enum to distinguish borrowed zero-copy arrays from owned converted vectors
• Added check_ids_in_range helper to validate ids are within corpus bounds
• Added not_contiguous_err helper for consistent error messaging
• Updated search_asymmetric_subset to use coerce_candidate_ids instead of strict
 PyReadonlyArray1
• Updated body_overlap_scores_subset to use coerce_candidate_ids instead of strict
 PyReadonlyArray1
• Updated docstrings to document new dtype acceptance and bounds-checking behavior

ordvec-python/src/lib.rs


2. ordvec-python/tests/test_input_guards.py 🧪 Tests +122/-0

Add comprehensive dtype acceptance and bounds-checking tests

• Added parametrized test test_subset_candidate_dtype_accepted_and_equivalent covering all 8
 integer dtypes
• Added test_subset_candidate_natural_numpy_idioms_accepted for np.arange, np.where,
 np.argpartition
• Added test_subset_noncontiguous_uint32_candidates_accepted for strided arrays
• Added test_subset_negative_candidate_raises_value_error for fail-loud negative id rejection
• Added test_subset_overflow_candidate_raises_value_error for fail-loud >= 2**32 rejection
• Added test_subset_out_of_range_int64_candidate_raises_index_error for corpus bounds check
• Added test_subset_float_candidates_raise_type_error for non-integer dtype rejection
• Added test_body_overlap_doc_ids_int64_accepted for Bitmap.body_overlap_scores_subset dtype
 acceptance

ordvec-python/tests/test_input_guards.py


3. ordvec-python/tests/test_redteam_fuzz.py 🧪 Tests +36/-5

Reframe dtype tests to new integer-acceptance contract

• Added _NON_INTEGER_ID_DTYPES list for parametrized non-integer dtype tests
• Renamed test_subset_candidates_wrong_dtype_raises_type_error to
 test_subset_candidates_noninteger_dtype_raises_type_error with _NON_INTEGER_ID_DTYPES parameter
• Added test_subset_candidates_integer_dtype_converted_by_value adversarial test to verify
 value-based conversion, not byte reinterpretation
• Renamed test_body_overlap_doc_ids_wrong_dtype_raises_type_error to
 test_body_overlap_doc_ids_noninteger_dtype_raises_type_error with _NON_INTEGER_ID_DTYPES
 parameter

ordvec-python/tests/test_redteam_fuzz.py


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented May 27, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Remediation recommended

1. Contiguity docs outdated 🐞 Bug ⚙ Maintainability
Description
The new coerce_candidate_ids() intentionally accepts non-contiguous uint32 candidate/doc-id
arrays by copying, but the module-level documentation still claims the FFI “inline guard rejects
non-C-contiguous arrays”. This mismatch can mislead users/maintainers into thinking
np.ascontiguousarray() is required for candidates/doc_ids when it is not.
Code

ordvec-python/src/lib.rs[R140-147]

Evidence
The module docs state non-contiguous arrays are rejected, but coerce_candidate_ids() explicitly
falls back to a copying path for non-contiguous uint32, and the test suite asserts this behavior
is accepted.

ordvec-python/src/lib.rs[14-18]
ordvec-python/src/lib.rs[140-147]
ordvec-python/tests/test_input_guards.py[282-292]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The Rust module-level docs claim the bindings reject non-C-contiguous arrays, but candidate/doc-id inputs now explicitly accept non-contiguous arrays (they are copied through the checked conversion path). This documentation mismatch can mislead callers and future maintainers.

### Issue Context
- `coerce_candidate_ids()` borrows only when `uint32` and contiguous; otherwise it copies (including non-contiguous `uint32`).
- Tests explicitly validate that non-contiguous `uint32` candidate arrays are accepted.

### Fix Focus Areas
- ordvec-python/src/lib.rs[14-18]
- ordvec-python/src/lib.rs[140-147]
- ordvec-python/tests/test_input_guards.py[282-292]

### Suggested fix
Adjust the module-level comment to clarify that *most* array inputs must be C-contiguous, but candidate/doc-id arrays are allowed to be non-contiguous (they will be copied unless they hit the `uint32` contiguous fast path).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

2. Range error has spacing 🐞 Bug ⚙ Maintainability
Description
The out-of-range ValueError message in coerce_candidate_ids() is built using a line-continuation
string literal with heavy indentation, which will insert a large run of spaces into the final
exception text. This makes the user-facing error message look malformed (e.g., many spaces before
“(must be in 0..=4294967295)”).
Code

ordvec-python/src/lib.rs[R156-160]

Evidence
The string literal uses \ to continue onto the next line, and the next line is indented; Rust
preserves that indentation as spaces in the resulting string.

ordvec-python/src/lib.rs[149-162]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
A Rust string literal using `\` line-continuation preserves the indentation whitespace on the next line, so the raised `ValueError` message will contain many spaces before the parenthetical range hint.

### Issue Context
Current message effectively becomes:
`"candidate id <x> is out of range for a u32 index                              (must be in 0..=4294967295)"`

### Fix Focus Areas
- ordvec-python/src/lib.rs[155-162]

### Suggested fix
Replace the multi-line literal with a single-line format string (or use `concat!` without indentation), e.g.:
```rust
pyo3::exceptions::PyValueError::new_err(format!(
   "{what} {x} is out of range for a u32 index (must be in 0..=4294967295)"
))
```

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces robust integer dtype coercion for candidate and document ID arrays in the Python bindings, allowing the core to accept any integer dtype (such as the default int64 from NumPy) and safely convert them to u32 with checked bounds. It also adds comprehensive test coverage for these input guards. The review feedback suggests optimizing the u32 array handling in coerce_candidate_ids to process both contiguous and non-contiguous paths in the initial fast-path block, which avoids redundant casting and unnecessary bounds checking, and subsequently removing the redundant try_int_dtype!(u32); macro invocation.

Comment thread ordvec-python/src/lib.rs
Comment on lines +141 to +147
if let Ok(a) = arr.cast::<PyArray1<u32>>() {
let ro = a.readonly();
if ro.as_slice().is_ok() {
return Ok(CandidateIds::Borrowed(ro));
}
// Non-contiguous uint32 falls through to the copying path below.
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For u32 arrays, we can handle both the contiguous (zero-copy borrow) and non-contiguous (copying) paths directly within this block. This avoids a redundant cast and readonly call, and bypasses the unnecessary try_from bounds-checking for u32 values (which are guaranteed to fit in u32).

    if let Ok(a) = arr.cast::<PyArray1<u32>>() {
        let ro = a.readonly();
        if ro.as_slice().is_ok() {
            return Ok(CandidateIds::Borrowed(ro));
        }
        let view = ro.as_array();
        let out = view.iter().copied().collect();
        return Ok(CandidateIds::Owned(out));
    }

Comment thread ordvec-python/src/lib.rs
Comment on lines +167 to +170
// u32 first so non-contiguous uint32 (which fell through above) is handled
// before the wider/narrower dtypes; order is otherwise irrelevant since each
// downcast is an exact dtype match.
try_int_dtype!(u32);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since u32 is now fully handled in the fast path above, we can remove the redundant try_int_dtype!(u32); macro invocation and simplify the comments.

Suggested change
// u32 first so non-contiguous uint32 (which fell through above) is handled
// before the wider/narrower dtypes; order is otherwise irrelevant since each
// downcast is an exact dtype match.
try_int_dtype!(u32);
// Order is irrelevant since each downcast is an exact dtype match.

@codecov

codecov Bot commented May 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.68508% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
ordvec-python/src/lib.rs 96.68% 6 Missing ⚠️

📢 Thoughts on this report? Let us know!

…d contract

Every embedding parameter was declared PyReadonlyArray{1,2}<f32>, which rust-numpy
matches strictly — so float64 (NumPy's default for np.array([...]) and most API
embeddings) and float16 raised an opaque TypeError. The premise of ordvec is
'float vector in -> rank/sign transform', so float32 is the internal working dtype,
not a contract the caller must pre-satisfy.

Add two choke-point helpers, as_f32_1d / as_f32_2d, that every embedding entry point
(18 methods + free functions) routes through, so dtype/layout/finite policy is
defined once:
  - coerce float16/float32/float64 -> float32. The rank/sign transforms are
    order/sign-only and f64->f32 rounding is monotonic, so coercion is faithful; the
    asymmetric LUT scores against f32-quantised docs, so sub-f32 query precision is
    meaningless there too.
  - REJECT bool + all integer dtypes (TypeError): a {0,1}/narrow-int vector
    rank-transforms to a degenerate index-tie artefact (silent retrieval garbage) — a
    deliberate usage-error guard, not an ergonomic gap.
  - reject complex/object/string (complex would silently drop the imaginary part).
  - reject wrong ndim (TypeError).
  - reject non-C-contiguous input (ValueError) BEFORE coercion, so a transposed
    float64 is never silently laundered into a contiguous float32 (a hidden copy can
    dominate runtime / poison benchmarks — the copy decision stays with the caller).
  - all-finite check AFTER coercion (f64 > f32::MAX rounds to +inf — caught here).

Candidate IDs are a different boundary (labels, not measurements), so they keep the
permissive contract: rename coerce_candidate_ids -> as_u32_ids_1d, accept any
range-safe integer dtype (int64 included), reject bool/float/negative/overflow with
sharper messages. Coherent split: vectors = float-only/C-contiguous/finite;
candidate IDs = integer labels range-safe-coerced to u32.

Core, persistence (.tvr/.tvrq/.tvbm/.tvsb store no floats), and the integer
primitives are untouched. Inverts the 5 tests that asserted the old strict-f32
rejection (now coerced+faithful); adds test_input_dtype.py covering the full
accept/reject matrix across all four index types. 483 pytest pass; fmt + clippy
-D warnings clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) changed the title fix(python): accept any integer dtype for subset candidate ids feat(python): input dtype boundary contracts (candidate ids → u32, embeddings → f32) May 28, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR establishes two input boundary contracts at the Python FFI layer of ordvec-python: embedding vectors are coerced to float32 via new as_f32_1d/as_f32_2d helpers, and candidate/doc IDs accept any integer dtype via as_u32_ids_1d. The goal is to remove opaque TypeError papercuts when callers pass float64 (NumPy default) embeddings or int64 index arrays (from np.arange, np.where, np.argpartition, etc.), while still rejecting categorically wrong inputs (bool/int for vectors; floats/bools for IDs; non-contiguous; non-finite; out-of-range).

Changes:

  • Added as_u32_ids_1d + check_ids_in_range with checked dtype conversion and bounds-checking, wired into RankQuant.search_asymmetric_subset and Bitmap.body_overlap_scores_subset.
  • Added as_f32_1d / as_f32_2d (with gate_float_ndim, require_c_contiguous, post-coercion finite check), wired into 18 embedding entry points plus rank_transform / search_asymmetric_byte_lut.
  • Updated existing test_add_float64_is_rejected tests to test_add_float64_is_coerced; added new test_input_dtype.py and candidate-ID coverage in test_input_guards.py; reframed the red-team dtype tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
ordvec-python/src/lib.rs New as_f32_* / as_u32_ids_1d choke-point helpers; entry-point signatures changed from typed PyReadonlyArray* to &Bound<'py, PyAny> with coercion.
ordvec-python/tests/test_input_dtype.py New dedicated dtype/layout/finite contract suite parametrized across all 4 index types and 1-D query path.
ordvec-python/tests/test_input_guards.py Adds candidate/doc-id integer-dtype acceptance, fail-loud negative/overflow, non-contiguous uint32 cases.
ordvec-python/tests/test_redteam_fuzz.py Updates _WRONG_F32_DTYPES (drops f16/f64), adds _NON_INTEGER_ID_DTYPES, reframes subset/body-overlap dtype rejection tests.
ordvec-python/tests/test_rank.py, test_rank_quant.py, test_bitmap.py, test_sign_bitmap.py Replaces test_add_float64_is_rejected with test_add_float64_is_coerced (asserts f64 index matches f32 index).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The choke-point helpers coerced non-float32 input to float32 (a full copy) before
the caller's width check ran, so a wrong-width float64 array was fully converted
only to be rejected — wasteful, and a potential OOM on a large misshapen input.

Move the width check into as_f32_1d / as_f32_2d via a cheap shape-metadata read
(axis_len), so dtype -> ndim -> width -> contiguity are all validated on the
ORIGINAL array before the ascontiguousarray copy. as_f32_2d takes the expected
dim; as_f32_1d takes Option<usize> (rank_transform has no width constraint). Adds
a wrong-width-float64 regression across all four index types. 491 pytest pass;
fmt + clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@project-navi-bot Navi Bot (project-navi-bot) merged commit 4ea6d0b into main May 28, 2026
32 checks passed
@project-navi-bot Navi Bot (project-navi-bot) deleted the fix/python-candidate-id-dtype branch May 28, 2026 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants