IVF-SQ C++ API by viclafargue · Pull Request #1865 · rapidsai/cuvs

viclafargue · 2026-03-03T09:49:30Z

Closes #1291.

Overview

IVF-SQ combines an inverted file (IVF) partitioning scheme with 8-bit scalar quantization (SQ8) of residuals. Each float32 dimension is compressed to a single uint8 code, giving a 4x memory reduction over IVF-Flat while retaining high recall. The index implements various metrics (L2, inner-product, and cosine distance), data type (float, half) and also filtering.

Build

K-Means clustering : A training subset (controlled by kmeans_trainset_fraction) is sampled from the dataset. Balanced K-Means is run on it to produce n_lists centroids that partition the vector space.
SQ parameter training : The training vectors are assigned to their nearest centroid and residuals are computed. A custom CUDA kernel computes the per-dimension min/max of these residuals (fast CUB reduction in shared memory). The observed range is expanded by a 5% margin on each side to reduce clipping on unseen data. The delta for each dimension is computed as sq_delta[d] = (range + 2*margin) / 255. These two per-dimension parameters (sq_vmin the lower end of the range, sq_delta the scale or quantization step) are stored in the index and are all that is needed to encode/decode any vector.
Data insertion : If add_data_on_build is true (the default), the full dataset is inserted via the extend path described below.

Extend

Extend adds new vectors to an existing index without retraining centroids or SQ parameters and in a batched fashion:

Cluster assignment : New vectors are assigned to their nearest centroid via K-Means predict. ~~(adaptative centers : when enabled, centroids are incrementally updated as new data arrives, and center norms are recomputed)~~.
List resizing : Per-list sizes are histogrammed and the IVF lists are grown to accommodate the new vectors, respecting the interleaved group alignment (kIndexGroupSize = 32).
Residual computation + SQ encoding : A first kernel computes residuals. A second kernel quantize each residual dimension to uint8 and write the code into the interleaved list layout.

Search

Search proceeds in three stages:

Coarse search (GEMM-based) : All query-to-centroid distances are computed in a single batched GEMM (queries x centers^T), with metric-specific pre/post-processing. The top n_probes nearest clusters per query are selected via select_k.
Fine scan (ivf_sq_scan_kernel) : This is the performance-critical kernel, templated on <BlockDim, Capacity, Metric, IdxT, SampleFilter>. It operates in two grid modes controlled by the compile-time Capacity parameter:
- Fused top-k path (Capacity > 0): grid is (grid_dim_x, n_queries), each block loops over its share of probes and maintains a block_sort (warpsort) priority queue, writing only the local top-k to global memory. grid_dim_x is chosen via an occupancy query (cudaOccupancyMaxActiveBlocksPerMultiprocessor) to saturate the GPU. Instantiated for Capacity ∈ {32, 64, 128, 256}, covering k ≤ 256.
- Materialized path (Capacity == 0): grid is (n_probes, n_queries), one block per (query, probe) pair writes every distance to a flat buffer. Used as a fallback when k > 256 or the fused kernel has zero occupancy.

Three dim-length shared-memory arrays are filled in two phases: Phase 1 (invariant across probes) loads sq_delta and query terms; Phase 2 (per-probe) folds in the centroid so the inner loop reduces to s_query_term − code × s_sq_scale (L2) or s_query_term × (s_aux + code × s_sq_scale) (IP/Cosine). Each warp processes one interleaved group of 32 vectors; each lane loads a uint4, giving a fully coalesced 512-byte read per dimension block. Distance accumulation is fused with decoding — no separate decompress pass.

Final top-k selection : In the fused path, a second select_k merges the per-block top-k results across grid_dim_x blocks (skipped when grid_dim_x == 1). In the materialized path, a single select_k over the full buffer extracts the global top-k. Distances then receive metric-specific fixups (e.g. sqrt for L2Sqrt) and list-local positions are mapped back to original dataset IDs.

Benchmarks on B200

jinsolp

Thanks @viclafargue ! Sharing my first batch of comments for build and extend.
Aside from this, I also think ivf_sq.hpp overall needs some documentation!

jinsolp · 2026-03-12T21:48:28Z

+  auto orig_centroids_view = raft::make_device_matrix_view<const float, int64_t>(
+    index->centers().data_handle(), n_lists, dim);
+
+  constexpr size_t kReasonableMaxBatchSize = 65536;


how is this determined? 👀

This is an arbitrary value used in other index types. This probably was a safe value that ensured that we never ran out of memory in smaller systems. But, thanks for pointing this out as it might be interesting to see if we could improve the way this value is determined (dataset dimensions and available VRAM). cc @achirkin @tfeher

That is correct, and one more argument here is that this size is well beyond the point where the GPU is saturated:

{big enough grid size} < kReasonableMaxBatchSize < {small enough memory footprint}

cjnolet · 2026-03-13T11:33:48Z

@viclafargue can you please share the build time speedups over Faiss IVFSQ on GPU and CPU? Those are going to be crucial as the major value prop of cuVS over Faiss (both cpu and gpu) is index build speedup.

jinsolp

Second batch of comments! I looked through search and serialize this time, looks good!

viclafargue · 2026-04-27T14:37:29Z

The wheel-build CI was failing because the IVF-SQ scan kernel pushed libcuvs.so over the pip wheel size limit. To unblock the PR, I ported the scan kernel to JIT-LTO (same model as IVF-Flat). The kernel body now lives in cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh and is linked at runtime from generated fatbins instead of being baked into the library.

coderabbitai · 2026-04-27T14:38:52Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added IVF‑SQ ANN index: build, extend, search, serialize/deserialize with float/half support, sample filtering, and JIT‑LTO fused scan/filter kernels covering multiple distance metrics and capacity variants.
Benchmarking
- Added benchmark targets and configs for cuVS, FAISS (GPU and CPU) IVF‑SQ.
Tests
- Added comprehensive parameterized unit tests for build/search/extend/serialize and filter correctness.
Documentation
- Added C++ API docs for IVF‑SQ.

Walkthrough

Adds a complete IVF‑SQ nearest‑neighbor implementation: public C++ API, device build/extend/search/serialize logic, JIT‑LTO scan/filter kernels and matrices, CMake/JIT wiring, benchmark integrations and configs, comprehensive tests, and documentation.

Changes

IVF‑SQ Feature (single cohesive DAG)

Layer / File(s)	Summary
Data Shape / Public API `cpp/include/cuvs/neighbors/ivf_sq.hpp`	Introduces public IVF‑SQ API types: `index_params`, `search_params`, `index<uint8_t>`, and free functions `build`/`extend`/`search`/`serialize`/`deserialize`.
Core Index Implementation `cpp/src/neighbors/ivf_sq_index.cpp`	Implements `index<CodeT>` constructors, device allocations, accessors, optional `center_norms` allocation, and `check_consistency`; explicit instantiation for `uint8_t`.
Build / Extend Core Logic `cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh`, `cpp/src/neighbors/ivf_sq/ivf_sq_build_extend_*_uint8_t_int64_t.cu`	Vectorized fused column min/max, k‑means training/prediction, residual computation, SQ quantization/encoding, clone/extend flows, per‑list resizing and warp‑cooperative fills; host/device wrappers and instantiations for `float`/`half`.
Search Core Logic `cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh`, `cpp/src/neighbors/ivf_sq/ivf_sq_search_uint8_t_int64_t.cu`	Coarse GEMM selection, chunk index calc, runtime→compile-time dispatch for metric and capacity, fused-local-top‑k JIT‑LTO scan with occupancy probe and fallback materialize path, sample filter dispatch, and exported search entrypoints.
JIT‑LTO Scan Kernels & Planner `cpp/src/.../detail/jit_lto_kernels/scan_impl.cuh`, `.../scan_kernel.cu.in`, `.../kernel_def.hpp`, `.../scan_planner.hpp`	Adds `ivf_sq_scan_impl<Capacity>` device implementation (fused top‑k and materialize modes), exported kernel wrappers parameterized by capacity, function‑pointer signature alias, and planner registering metric/setup/accumulate/finalize/filter fragments.
JIT‑LTO Fragment Tags & Device Functions `cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp`, `cpp/src/.../device_functions.cuh`, `.../accumulate_distance_impl.cuh`, `.../finalize_distance_impl.cuh`, `.../setup_*_impl.cuh`	Declares fragment tag types for metrics/filters and device function forward declarations; provides metric‑specific implementations for setup, accumulate, and finalize steps.
JIT‑LTO Filter Wiring & Matrices `cpp/src/.../filter_kernel.cu.in`, `.../filter_matrix.json`, `.../scan_matrix.json`, `.../*_matrix.json`	Specializes `sample_filter<int64_t>` wiring for filter fragments; JSON matrices declare supported `capacity`, `metric_name`, and `filter_name` options.
Kernel Template Includes `cpp/src/.../_kernel.cu.in`, `cpp/src/.../_impl.cuh`	Adds metric- and filter‑parameterized `.cu.in` templates that bind to `*_impl` functions via compile‑time placeholders used by JIT‑LTO generator.
Serialization `cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh`, `cpp/src/neighbors/ivf_sq/ivf_sq_serialize_uint8_t.cu`	Implements `serialize`/`deserialize` for `index<CodeT>` (dtype/version/metadata, centers, optional center_norms, `sq_vmin`/`sq_delta`, per‑list data) and instantiates file wrappers for `uint8_t`.
JIT/CMake & Build Integration `cpp/CMakeLists.txt`	Adds IVF‑SQ JIT‑LTO kernel fragment registrations to `jit_lto_files` and appends new `ivf_sq` implementation sources to `cuvs_objs`.
Benchmarks: CMake / Wrapper / Parser `cpp/bench/ann/CMakeLists.txt`, `cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h`, `.../cuvs_ivf_sq.cu`, `.../cuvs_benchmark.cu`, `.../cuvs_ann_bench_param_parser.h`	Adds CUVS/FAISS CPU/GPU IVF‑SQ benchmark toggles and targets; new `cuvs_ivf_sq<T>` benchmark wrapper, explicit instantiations, and JSON parsing for `nlist`, `niter`, `ratio`, `nprobe`.
Tests `cpp/tests/neighbors/ann_ivf_sq.cuh`, `cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu`, `.../test_half_int64_t.cu`, `cpp/tests/CMakeLists.txt`	Adds parameterized GoogleTest fixture exercising build/extend/search/serialize/filter with `float` and `half`, test instantiations, and CMake test wiring.
Docs `docs/source/cpp_api/neighbors.rst`, `docs/source/cpp_api/neighbors_ivf_sq.rst`	Adds `neighbors_ivf_sq.rst` to the neighbors TOC and a new API page documenting IVF‑SQ types and functions.
Python benchmark configs & constraints `python/cuvs_bench/.../algorithms.yaml`, `.../constraints/__init__.py`, `.../cuvs_ivf_sq.yaml`, `.../faiss_*_ivf_sq.yaml`	Updates algorithm entries, adds `cuvs_ivf_sq`/`faiss_cpu_ivf_sq`/`faiss_gpu_ivf_sq` configs and constraint functions enforcing `nlist >= nprobe`, and adds YAML parameter grids for base/large/test groups.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'IVF-SQ C++ API' clearly and concisely describes the main feature being added—a new C++ API for the IVF-SQ index algorithm.
Description check	✅ Passed	The PR description is comprehensive and directly related to the changeset, covering build/extend/search implementation details, algorithm parameters, metrics support, and performance benchmarks.
Linked Issues check	✅ Passed	The PR successfully implements the IVF-SQ index feature (`#1291`) with complete build, extend, search, and serialization APIs; JIT-LTO kernel support for multiple metrics/capacities; and comprehensive tests.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with implementing IVF-SQ support. Documentation, CMake configuration, benchmark setup, test infrastructure, and benchmark algorithm configs are all in scope for a new feature.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 8

🧹 Nitpick comments (4)

docs/source/cpp_api/neighbors_ivf_sq.rst (1)

4-4: Consider adding a one-line capability summary in the intro.

A short explicit note for supported metrics (L2/IP/Cosine), data types (float/half), and filtering would make the page easier to scan.

As per coding guidelines: "Clarity: Flag confusing explanations, missing prerequisites, or unclear examples".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/cpp_api/neighbors_ivf_sq.rst` at line 4, Add a one-line
capability summary at the top of the IVF-SQ intro that explicitly lists
supported distance metrics (L2/IP/Cosine), supported data types (float, half),
and whether filtering is supported, while briefly referencing the key parameters
n_lists and n_probes and the scalar quantization behavior; update the paragraph
that begins "The IVF-SQ method is an ANN algorithm..." to prepend this concise
capability note so readers can quickly scan supported features.

cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh (1)

16-23: Narrow this declaration to the supported index type.

cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_kernel.cu.in only defines sample_filter<int64_t>. Keeping a generic declaration here advertises broader support and turns accidental IndexT != int64_t use into a late device-link failure instead of an earlier, clearer error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh` around
lines 16 - 23, The declaration for sample_filter should be narrowed to the
concrete index type that is actually implemented; replace the templated
declaration of sample_filter with a non-templated declaration that uses int64_t
for the index type so it matches the single instantiation defined elsewhere
(sample_filter<int64_t>), ensuring parameter types (inds_ptrs, bitset_len,
original_nbits, etc.) use int64_t and avoiding accidental use of other IndexT
types that would cause late device-link failures.

cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu (1)

15-15: Replace INSTANTIATE_TEST_CASE_P with INSTANTIATE_TEST_SUITE_P on line 15.

INSTANTIATE_TEST_CASE_P is the legacy GoogleTest macro. Use the modern INSTANTIATE_TEST_SUITE_P to avoid carrying deprecated spelling into new test files.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu` at line 15, Replace the
deprecated GoogleTest macro INSTANTIATE_TEST_CASE_P with the modern
INSTANTIATE_TEST_SUITE_P in the test instantiation line (currently calling
INSTANTIATE_TEST_CASE_P for AnnIVFSQTest/AnnIVFSQTestF_float and
::testing::ValuesIn(inputs)); change only the macro name so the test still
instantiates the same parameterized test using the same test suite and value
source.

python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py (1)

53-57: Deduplicate repeated IVF-SQ nlist/nprobe constraint logic.

All three new functions implement the same condition; extracting a small helper will keep this easier to maintain.

♻️ Proposed refactor

+def _check_nlist_nprobe(params, build_params):
+    if "nlist" in build_params and "nprobe" in params:
+        return build_params["nlist"] >= params["nprobe"]
+    return True
+
 def cuvs_ivf_sq_search(params, build_params, k, batch_size):
-    if "nlist" in build_params and "nprobe" in params:
-        return build_params["nlist"] >= params["nprobe"]
-    return True
+    return _check_nlist_nprobe(params, build_params)

 def faiss_gpu_ivf_sq_search(params, build_params, k, batch_size):
-    if "nlist" in build_params and "nprobe" in params:
-        return build_params["nlist"] >= params["nprobe"]
-    return True
+    return _check_nlist_nprobe(params, build_params)

 def faiss_cpu_ivf_sq_search(params, build_params, k, batch_size):
-    if "nlist" in build_params and "nprobe" in params:
-        return build_params["nlist"] >= params["nprobe"]
-    return True
+    return _check_nlist_nprobe(params, build_params)

Also applies to: 64-68, 70-74

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py` around
lines 53 - 57, Extract the duplicated nlist/nprobe check into a small helper
(e.g., _ivf_sq_nlist_nprobe_ok(params, build_params)) that returns
build_params["nlist"] >= params["nprobe"] when both keys exist, otherwise True;
replace the inline conditional in cuvs_ivf_sq_search and the other two IVF‑SQ
constraint functions with a call to this helper, keep the helper in the same
module and use the original function names (cuvs_ivf_sq_search and the other
IVF‑SQ constraint functions) to locate the places to change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h`:
- Around line 125-138: The search() method in cuvs_ivf_sq<T>::search
dereferences filter_ unconditionally but filter_ is only set in
set_search_param(), so calling build()/load() then search() can crash;
initialize filter_ to a safe default (e.g., make_cuvs_filter(nullptr,
index_->size()) or none_sample_filter) when the index is created/loaded (or in
the cuvs_ivf_sq constructor) and ensure set_search_param() still replaces it
when called, so cuvs_ivf_sq::search can safely pass *filter_ without null
checks.
- Around line 96-101: The current cuvs_ivf_sq<T>::set_search_param uses
index_params_.n_lists for validating search_params_.n_probes which can differ
from a loaded serialized index; change the logic to first check if index_ is
non-null (the in-memory index loaded via load()), and validate or clamp
search_params_.n_probes against index_->n_lists() when index_ is present
(fallback to index_params_.n_lists() only if index_ is null); replace the raw
assert against index_params_.n_lists with a guarded clamp/validation of
search_params_.n_probes (and optionally a warning) so release builds don't skip
the check and serialized indexes are handled correctly.

In `@cpp/include/cuvs/neighbors/ivf_sq.hpp`:
- Around line 133-139: The Doxygen comment for scalar quantization incorrectly
multiplies by 255; update the public docs in the comment block referencing
sq_vmin and sq_delta to match the implementation by replacing the formula with
code_i = round((x_i - vmin_i) / delta_i) (and note that delta_i is the per-level
step size used to produce the uint8_t code), and ensure any adjacent doc text
clarifies that decoding uses the inverse with sq_delta and sq_vmin.

In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh`:
- Around line 223-228: The current branch only writes out_distances/out_indices
when (valid), leaving skipped slots (especially when Capacity==0) uninitialized;
change the else branch that computes out_idx = query_ix * max_samples +
sample_offset + row so it always stores a deterministic sentinel for
filtered-out slots (e.g., distance = +INF or max sentinel value and index =
UINT32_MAX or -1) when !valid, and keep writing the actual dist/index when valid
so downstream top-k reads deterministic values.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh`:
- Around line 687-703: The public pointer-overload of extend dereferences idx
(calling idx->dim()) before validating it; add a nullptr guard at the top of the
overload(s) so RAFT_EXPECTS(idx != nullptr, "...") runs before any use of idx,
then use idx->dim() only after that check; apply the same change to the other
pointer overload variant (the overload covering lines ~706-723) so both
overloads produce a clean RAFT_EXPECTS failure instead of crashing when passed
nullptr.
- Around line 549-563: Validate the kmeans_trainset_fraction parameter before
it’s converted/used: add a precondition (using RAFT_EXPECTS) that
params.kmeans_trainset_fraction is > 0 and <= 1 (or whatever valid
inclusive/exclusive bounds your API requires) immediately before computing
trainset_ratio in ivf_sq_build.cuh so invalid zero/negative/out-of-range values
fail fast instead of being cast to size_t; reference the
params.kmeans_trainset_fraction symbol and perform the check prior to the
trainset_ratio/n_rows_train computation.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh`:
- Around line 647-650: After clamping n_probes you must reject the case where
the index has zero lists to avoid launching kernels with grid.x == 0; add a
post-clamp check such as RAFT_EXPECTS(n_probes > 0, "index must contain at least
one list") or equivalently assert index.n_lists() > 0 before proceeding. Locate
the clamp code that computes auto n_probes = std::min<uint32_t>(params.n_probes,
index.n_lists()) (symbols: params.n_probes, n_probes, index.n_lists()) and
insert the RAFT_EXPECTS check immediately after it so ivf_sq_scan is never
invoked with grid.x == 0.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh`:
- Around line 154-159: The public wrapper function deserialize(raft::resources
const&, const std::string&, cuvs::neighbors::ivf_sq::index<CodeT>*) must
validate the out-pointer before dereferencing; add a nullptr check at the start
of that function and fail fast (e.g., throw std::invalid_argument or return an
error) with a clear message like "index is null" instead of unconditionally
doing *index = ..., so callers passing nullptr won't segfault.

---

Nitpick comments:
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh`:
- Around line 16-23: The declaration for sample_filter should be narrowed to the
concrete index type that is actually implemented; replace the templated
declaration of sample_filter with a non-templated declaration that uses int64_t
for the index type so it matches the single instantiation defined elsewhere
(sample_filter<int64_t>), ensuring parameter types (inds_ptrs, bitset_len,
original_nbits, etc.) use int64_t and avoiding accidental use of other IndexT
types that would cause late device-link failures.

In `@cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu`:
- Line 15: Replace the deprecated GoogleTest macro INSTANTIATE_TEST_CASE_P with
the modern INSTANTIATE_TEST_SUITE_P in the test instantiation line (currently
calling INSTANTIATE_TEST_CASE_P for AnnIVFSQTest/AnnIVFSQTestF_float and
::testing::ValuesIn(inputs)); change only the macro name so the test still
instantiates the same parameterized test using the same test suite and value
source.

In `@docs/source/cpp_api/neighbors_ivf_sq.rst`:
- Line 4: Add a one-line capability summary at the top of the IVF-SQ intro that
explicitly lists supported distance metrics (L2/IP/Cosine), supported data types
(float, half), and whether filtering is supported, while briefly referencing the
key parameters n_lists and n_probes and the scalar quantization behavior; update
the paragraph that begins "The IVF-SQ method is an ANN algorithm..." to prepend
this concise capability note so readers can quickly scan supported features.

In `@python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py`:
- Around line 53-57: Extract the duplicated nlist/nprobe check into a small
helper (e.g., _ivf_sq_nlist_nprobe_ok(params, build_params)) that returns
build_params["nlist"] >= params["nprobe"] when both keys exist, otherwise True;
replace the inline conditional in cuvs_ivf_sq_search and the other two IVF‑SQ
constraint functions with a call to this helper, keep the helper in the same
module and use the original function names (cuvs_ivf_sq_search and the other
IVF‑SQ constraint functions) to locate the places to change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 86c332b9-8d53-48ad-b32a-0efb7774f2b0

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 3ba5e70.

📒 Files selected for processing (35)

cpp/CMakeLists.txt
cpp/bench/ann/CMakeLists.txt
cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h
cpp/bench/ann/src/cuvs/cuvs_benchmark.cu
cpp/bench/ann/src/cuvs/cuvs_ivf_sq.cu
cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h
cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp
cpp/include/cuvs/neighbors/ivf_sq.hpp
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/kernel_def.hpp
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_planner.hpp
cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh
cpp/src/neighbors/ivf_sq/ivf_sq_build_extend_float_uint8_t_int64_t.cu
cpp/src/neighbors/ivf_sq/ivf_sq_build_extend_half_uint8_t_int64_t.cu
cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh
cpp/src/neighbors/ivf_sq/ivf_sq_search_uint8_t_int64_t.cu
cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh
cpp/src/neighbors/ivf_sq/ivf_sq_serialize_uint8_t.cu
cpp/src/neighbors/ivf_sq_index.cpp
cpp/tests/CMakeLists.txt
cpp/tests/neighbors/ann_ivf_sq.cuh
cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu
cpp/tests/neighbors/ann_ivf_sq/test_half_int64_t.cu
docs/source/cpp_api/neighbors.rst
docs/source/cpp_api/neighbors_ivf_sq.rst
python/cuvs_bench/cuvs_bench/config/algorithms.yaml
python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py
python/cuvs_bench/cuvs_bench/config/algos/cuvs_ivf_sq.yaml
python/cuvs_bench/cuvs_bench/config/algos/faiss_cpu_ivf_sq.yaml
python/cuvs_bench/cuvs_bench/config/algos/faiss_gpu_ivf_sq.yaml

divyegala · 2026-04-30T18:12:49Z

+//     Reconstructed vector component: s_aux[d] + code*s_sq_scale[d].
+//
+//   After all probes are scanned, the smem is reused for block_sort merge.
+template <int Capacity, typename MetricTag>


We should not structure the kernel in a way where we need MetricTag. Usually, that is a way to tell us that related functionality could be split out in another device function fragment. Look at @KyleFromNVIDIA's work with IVF PQ for examples.

Also Kyle's rule of thumb: if a template parameter does not have a function parameter argument, it is likely that code related to that template parameter can be pulled into another fragment.

The rule of thumb that a template parameter missing from a signature means the templated code can be split out into another fragment only holds when said template parameter isn't tightly integrated into the body of the function. This one is borderline. Each fatbin generated by this fragment is <20KB (for a total of ~300KB), and I don't know how much of a gain we'd get by splitting out the metric-specific parts into their own fragment (it would be ~200KB at most, and that's not including the additional metric fragment). I'm inclined to think it's not worth it. At the very least, we can certainly save it for a follow-up.

My approval stands.

I think kAscending could be promoted to a template parameter of the main kernel as many metrics would share the same value (most would be true?), so you wouldn't have to instantiate this kernel for every metric and internally infer the same value.

The parts related to kIsCosine can be separated out to different fragments, a cosine specific one and a no-op one.

Thanks both for the careful reviews! Took some time to test different things. Conclusion : even though the metrics are tightly coupled with the main body it is still possible to recover some beyond the filter fragment.

Two changes:

Move metric specialization out of the kernel template: into four device-function fragments :

setup_invariant_smem

setup_per_probe_smem

accumulate_distance

finalize_distance (the cosine vs no-op one)

The entrypoint now needs (Capacity, Ascending) instead of (Capacity, MetricTag).

Apply the IVF-PQ inner-product trick: noticed that IVF-PQ negates IP scores in the kernel so warpsort::block_sort always runs ascending, then flip them back in postprocess_distances via account_for_max_close. Ascending becomes hardcoded, so the kernel is now templated on Capacity alone.

Net effect:

Before : 5 Capacity * 3 Metrics -> 15 main body instantiations -> 91MB libcuvs.so

After : 5 Capacity -> 5 main body instantiations -> 89MB libcuvs.so

No measurable performance impact

KyleFromNVIDIA

Approved CMake changes and JIT+LTO changes. Very nice.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh (1)

89-98: 💤 Low value

Add compile-time guard for unsupported filter types.

If this template is instantiated with an unsupported FilterT, the function has no return statement, leading to undefined behavior or an obscure compile error. A static_assert in the else branch would provide a clearer diagnostic.

Suggested improvement

 template <typename FilterT>
 constexpr auto get_filter_type_tag()
 {
   using namespace cuvs::neighbors::filtering;
   if constexpr (std::is_same_v<FilterT, none_sample_filter>) {
     return cuvs::neighbors::detail::tag_filter_none{};
   } else if constexpr (std::is_same_v<FilterT, bitset_filter<uint32_t, int64_t>>) {
     return cuvs::neighbors::detail::tag_filter_bitset{};
+  } else {
+    static_assert(sizeof(FilterT) == 0, "Unsupported IvfSampleFilterT for IVF-SQ scan.");
   }
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh` around lines 89 - 98, The
function get_filter_type_tag currently has no return for unsupported FilterT
values; add a compile-time guard in the final else branch that triggers a clear
static_assert when FilterT is not none_sample_filter or bitset_filter (use a
dependent-false pattern like always_false<FilterT>) to produce a readable
diagnostic; reference the template get_filter_type_tag, type parameters FilterT,
the known cases none_sample_filter and bitset_filter<uint32_t,int64_t>, and the
expected return tags tag_filter_none and tag_filter_bitset so the static_assert
message can state "unsupported FilterT for get_filter_type_tag" and prevent the
missing-return/UB compile scenario.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh`:
- Around line 39-43: The function finalize_distance_cosine_impl currently
returns 0.0f when denom <= 0 which wrongly promotes zero-norm vectors; change
the divide-protection to use a small epsilon (e.g. eps = 1e-8f) when checking
denom and, if denom <= eps, return a worst-distance sentinel (e.g. FLT_MAX or
INFINITY) instead of 0.0f; update finalize_distance_cosine_impl (variables:
denom, v_norm_sq, query_norm) accordingly and add the needed include for the
sentinel (e.g. <cfloat>) if not already present.

---

Nitpick comments:
In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh`:
- Around line 89-98: The function get_filter_type_tag currently has no return
for unsupported FilterT values; add a compile-time guard in the final else
branch that triggers a clear static_assert when FilterT is not
none_sample_filter or bitset_filter (use a dependent-false pattern like
always_false<FilterT>) to produce a readable diagnostic; reference the template
get_filter_type_tag, type parameters FilterT, the known cases none_sample_filter
and bitset_filter<uint32_t,int64_t>, and the expected return tags
tag_filter_none and tag_filter_bitset so the static_assert message can state
"unsupported FilterT for get_filter_type_tag" and prevent the missing-return/UB
compile scenario.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26665f07-6856-4bdb-b23d-ebf00065ec53

📥 Commits

Reviewing files that changed from the base of the PR and between 1e638e5 and c5948a2.

📒 Files selected for processing (20)

cpp/CMakeLists.txt
cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_impl.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_planner.hpp
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_impl.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_impl.cuh
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_matrix.json
cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh

✅ Files skipped from review due to trivial changes (9)

cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_matrix.json
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_kernel.cu.in
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_matrix.json
cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh

🚧 Files skipped from review as they are similar to previous changes (2)

cpp/CMakeLists.txt
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.in

coderabbitai · 2026-05-05T11:10:50Z

+__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)
+{
+  float denom = query_norm * sqrtf(v_norm_sq);
+  return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f;
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cosine zero-norm fallback currently returns a best-score distance.

At Line 42, denom <= 0 returns 0.0f, which is min-close and can incorrectly promote invalid cosine comparisons to top results. Use a worst-distance sentinel instead.

Suggested fix

__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm) { float denom = query_norm * sqrtf(v_norm_sq); - return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f; + return (denom > 0.0f) ? 1.0f - dist / denom : CUDART_INF_F; }

As per coding guidelines, "Floating-point computations (distance calculations, kernel evaluations, accumulations) must handle numerical edge cases: protect divisions with epsilon checks, handle zero-norm vectors..."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)

{

float denom = query_norm * sqrtf(v_norm_sq);

return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f;

}

__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)

{

float denom = query_norm * sqrtf(v_norm_sq);

return (denom > 0.0f) ? 1.0f - dist / denom : CUDART_INF_F;

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh` around lines 39 - 43, The function finalize_distance_cosine_impl currently returns 0.0f when denom <= 0 which wrongly promotes zero-norm vectors; change the divide-protection to use a small epsilon (e.g. eps = 1e-8f) when checking denom and, if denom <= eps, return a worst-distance sentinel (e.g. FLT_MAX or INFINITY) instead of 0.0f; update finalize_distance_cosine_impl (variables: denom, v_norm_sq, query_norm) accordingly and add the needed include for the sentinel (e.g. <cfloat>) if not already present.

viclafargue added 3 commits February 27, 2026 18:10

IVF-SQ

75ceb6f

add IVF-SQ bench constraints

cf19a86

Update default IVF-SQ benchmark config

6a95e8a

viclafargue requested review from a team as code owners March 3, 2026 09:49

github-project-automation Bot added this to Unstructured Data Processing Mar 3, 2026

viclafargue self-assigned this Mar 3, 2026

viclafargue added feature request New feature or request non-breaking Introduces a non-breaking change C++ labels Mar 3, 2026

aamijar moved this to In Progress in Unstructured Data Processing Mar 3, 2026

Merge branch 'main' into ivf-sq

2d78609

viclafargue mentioned this pull request Mar 11, 2026

IVF-SQ C API #1910

Open

viclafargue added 2 commits March 12, 2026 11:40

Update postprocess_neighbors signature

83b8c63

update testing

1050deb

jinsolp reviewed Mar 13, 2026

View reviewed changes

documentation

3a911d8

viclafargue added 6 commits March 13, 2026 14:03

memset in index constructor

b124628

random sampling

641c6ca

inplace residuals

70ca00a

improved kernel layout for residuals computation

e7d660c

raft::device_vector

96b28db

drop adaptative_centers feature

206cb2e

viclafargue requested a review from a team as a code owner March 13, 2026 16:32

viclafargue added 2 commits March 16, 2026 11:19

Add IVF-SQ FAISS benchmark

e34bdd8

Merge branch 'main' into ivf-sq

dcb8a59

jinsolp reviewed Mar 19, 2026

View reviewed changes

Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh

Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh

Comment thread cpp/src/neighbors/ivf_sq_index.cpp

Comment thread cpp/src/neighbors/ivf_sq_index.cpp

Adressing review

9bd7bc0

Addressing review (tests updates)

3a3427f

viclafargue force-pushed the ivf-sq branch from 0977a8a to 7ee0f7f Compare April 7, 2026 12:43

viclafargue requested review from a team as code owners April 7, 2026 12:43

viclafargue requested a review from gforsyth April 7, 2026 12:43

Merge branch 'main' into ivf-sq

7a238d3

viclafargue force-pushed the ivf-sq branch from 7ee0f7f to 7a238d3 Compare April 7, 2026 12:48

gforsyth approved these changes Apr 7, 2026

View reviewed changes

lowener requested changes Apr 17, 2026

View reviewed changes

viclafargue added 7 commits April 20, 2026 13:34

Swap IdxT for CodeT

1b182d7

addressing review

8c44557

Merge branch 'main' into ivf-sq

e087e19

Merge branch 'main' into ivf-sq

d8ada75

account for RAFT update

80a55fd

IVF-SQ JIT-LTO

ac8ea4e

Merge branch 'main' into ivf-sq

3ba5e70

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

doc fix + build assert addition

55a91bf

lowener reviewed Apr 30, 2026

View reviewed changes

Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated

divyegala requested changes Apr 30, 2026

View reviewed changes

divyegala requested a review from KyleFromNVIDIA April 30, 2026 18:13

KyleFromNVIDIA approved these changes Apr 30, 2026

View reviewed changes

viclafargue added 5 commits May 4, 2026 10:45

Switching to raft::TxN_t

6d5ec72

Merge branch 'main' into ivf-sq

1e638e5

Dropping the MetricTag template parameter

6889624

Inner product trick

df55c51

Fix + minor cleanups

c5948a2

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

Conversation

viclafargue commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Build

Extend

Search

Benchmarks on B200

Uh oh!

jinsolp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jinsolp Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

viclafargue Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

achirkin Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cjnolet commented Mar 13, 2026

Uh oh!

jinsolp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

viclafargue commented Apr 27, 2026

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

divyegala Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

KyleFromNVIDIA Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

divyegala Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

viclafargue commented Mar 3, 2026 •

edited

Loading

viclafargue Mar 13, 2026 •

edited

Loading

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

KyleFromNVIDIA Apr 30, 2026 •

edited

Loading