Skip to content

IVF-SQ C++ API#1865

Open
viclafargue wants to merge 40 commits intorapidsai:mainfrom
viclafargue:ivf-sq
Open

IVF-SQ C++ API#1865
viclafargue wants to merge 40 commits intorapidsai:mainfrom
viclafargue:ivf-sq

Conversation

@viclafargue
Copy link
Copy Markdown
Contributor

@viclafargue viclafargue commented Mar 3, 2026

Closes #1291.

Overview

IVF-SQ combines an inverted file (IVF) partitioning scheme with 8-bit scalar quantization (SQ8) of residuals. Each float32 dimension is compressed to a single uint8 code, giving a 4x memory reduction over IVF-Flat while retaining high recall. The index implements various metrics (L2, inner-product, and cosine distance), data type (float, half) and also filtering.

Build

  • K-Means clustering : A training subset (controlled by kmeans_trainset_fraction) is sampled from the dataset. Balanced K-Means is run on it to produce n_lists centroids that partition the vector space.
  • SQ parameter training : The training vectors are assigned to their nearest centroid and residuals are computed. A custom CUDA kernel computes the per-dimension min/max of these residuals (fast CUB reduction in shared memory). The observed range is expanded by a 5% margin on each side to reduce clipping on unseen data. The delta for each dimension is computed as sq_delta[d] = (range + 2*margin) / 255. These two per-dimension parameters (sq_vmin the lower end of the range, sq_delta the scale or quantization step) are stored in the index and are all that is needed to encode/decode any vector.
  • Data insertion : If add_data_on_build is true (the default), the full dataset is inserted via the extend path described below.

Extend

Extend adds new vectors to an existing index without retraining centroids or SQ parameters and in a batched fashion:

  • Cluster assignment : New vectors are assigned to their nearest centroid via K-Means predict. (adaptative centers : when enabled, centroids are incrementally updated as new data arrives, and center norms are recomputed).
  • List resizing : Per-list sizes are histogrammed and the IVF lists are grown to accommodate the new vectors, respecting the interleaved group alignment (kIndexGroupSize = 32).
  • Residual computation + SQ encoding : A first kernel computes residuals. A second kernel quantize each residual dimension to uint8 and write the code into the interleaved list layout.

Search

Search proceeds in three stages:

  • Coarse search (GEMM-based) : All query-to-centroid distances are computed in a single batched GEMM (queries x centers^T), with metric-specific pre/post-processing. The top n_probes nearest clusters per query are selected via select_k.
  • Fine scan (ivf_sq_scan_kernel) : This is the performance-critical kernel, templated on <BlockDim, Capacity, Metric, IdxT, SampleFilter>. It operates in two grid modes controlled by the compile-time Capacity parameter:
    • Fused top-k path (Capacity > 0): grid is (grid_dim_x, n_queries), each block loops over its share of probes and maintains a block_sort (warpsort) priority queue, writing only the local top-k to global memory. grid_dim_x is chosen via an occupancy query (cudaOccupancyMaxActiveBlocksPerMultiprocessor) to saturate the GPU. Instantiated for Capacity ∈ {32, 64, 128, 256}, covering k ≤ 256.
    • Materialized path (Capacity == 0): grid is (n_probes, n_queries), one block per (query, probe) pair writes every distance to a flat buffer. Used as a fallback when k > 256 or the fused kernel has zero occupancy.

Three dim-length shared-memory arrays are filled in two phases: Phase 1 (invariant across probes) loads sq_delta and query terms; Phase 2 (per-probe) folds in the centroid so the inner loop reduces to s_query_term − code × s_sq_scale (L2) or s_query_term × (s_aux + code × s_sq_scale) (IP/Cosine). Each warp processes one interleaved group of 32 vectors; each lane loads a uint4, giving a fully coalesced 512-byte read per dimension block. Distance accumulation is fused with decoding — no separate decompress pass.

  • Final top-k selection : In the fused path, a second select_k merges the per-block top-k results across grid_dim_x blocks (skipped when grid_dim_x == 1). In the materialized path, a single select_k over the full buffer extracts the global top-k. Distances then receive metric-specific fixups (e.g. sqrt for L2Sqrt) and list-local positions are mapped back to original dataset IDs.

Benchmarks on B200

ivf_sq_build_bench

wiki_all_10M-bench

deep-100M-bench

@viclafargue viclafargue requested review from a team as code owners March 3, 2026 09:49
@viclafargue viclafargue self-assigned this Mar 3, 2026
@viclafargue viclafargue added feature request New feature or request non-breaking Introduces a non-breaking change C++ labels Mar 3, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing Mar 3, 2026
@viclafargue viclafargue mentioned this pull request Mar 11, 2026
Copy link
Copy Markdown
Contributor

@jinsolp jinsolp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viclafargue ! Sharing my first batch of comments for build and extend.
Aside from this, I also think ivf_sq.hpp overall needs some documentation!

Comment thread cpp/include/cuvs/neighbors/ivf_sq.hpp Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_sq.hpp
Comment thread cpp/include/cuvs/neighbors/ivf_sq.hpp
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
auto orig_centroids_view = raft::make_device_matrix_view<const float, int64_t>(
index->centers().data_handle(), n_lists, dim);

constexpr size_t kReasonableMaxBatchSize = 65536;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this determined? 👀

Copy link
Copy Markdown
Contributor Author

@viclafargue viclafargue Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an arbitrary value used in other index types. This probably was a safe value that ensured that we never ran out of memory in smaller systems. But, thanks for pointing this out as it might be interesting to see if we could improve the way this value is determined (dataset dimensions and available VRAM). cc @achirkin @tfeher

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct, and one more argument here is that this size is well beyond the point where the GPU is saturated:

{big enough grid size} < kReasonableMaxBatchSize < {small enough memory footprint}

Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh
@cjnolet
Copy link
Copy Markdown
Member

cjnolet commented Mar 13, 2026

@viclafargue can you please share the build time speedups over Faiss IVFSQ on GPU and CPU? Those are going to be crucial as the major value prop of cuVS over Faiss (both cpu and gpu) is index build speedup.

@viclafargue viclafargue requested a review from a team as a code owner March 13, 2026 16:32
Copy link
Copy Markdown
Contributor

@jinsolp jinsolp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second batch of comments! I looked through search and serialize this time, looks good!

Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh
Comment thread cpp/src/neighbors/ivf_sq_index.cpp
Comment thread cpp/src/neighbors/ivf_sq_index.cpp
Comment thread cpp/include/cuvs/neighbors/ivf_sq.hpp Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
Comment thread cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu Outdated
@viclafargue
Copy link
Copy Markdown
Contributor Author

The wheel-build CI was failing because the IVF-SQ scan kernel pushed libcuvs.so over the pip wheel size limit. To unblock the PR, I ported the scan kernel to JIT-LTO (same model as IVF-Flat). The kernel body now lives in cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh and is linked at runtime from generated fatbins instead of being baked into the library.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added IVF‑SQ ANN index: build, extend, search, serialize/deserialize with float/half support, sample filtering, and JIT‑LTO fused scan/filter kernels covering multiple distance metrics and capacity variants.
  • Benchmarking

    • Added benchmark targets and configs for cuVS, FAISS (GPU and CPU) IVF‑SQ.
  • Tests

    • Added comprehensive parameterized unit tests for build/search/extend/serialize and filter correctness.
  • Documentation

    • Added C++ API docs for IVF‑SQ.

Walkthrough

Adds a complete IVF‑SQ nearest‑neighbor implementation: public C++ API, device build/extend/search/serialize logic, JIT‑LTO scan/filter kernels and matrices, CMake/JIT wiring, benchmark integrations and configs, comprehensive tests, and documentation.

Changes

IVF‑SQ Feature (single cohesive DAG)

Layer / File(s) Summary
Data Shape / Public API
cpp/include/cuvs/neighbors/ivf_sq.hpp
Introduces public IVF‑SQ API types: index_params, search_params, index<uint8_t>, and free functions build/extend/search/serialize/deserialize.
Core Index Implementation
cpp/src/neighbors/ivf_sq_index.cpp
Implements index<CodeT> constructors, device allocations, accessors, optional center_norms allocation, and check_consistency; explicit instantiation for uint8_t.
Build / Extend Core Logic
cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh, cpp/src/neighbors/ivf_sq/ivf_sq_build_extend_*_uint8_t_int64_t.cu
Vectorized fused column min/max, k‑means training/prediction, residual computation, SQ quantization/encoding, clone/extend flows, per‑list resizing and warp‑cooperative fills; host/device wrappers and instantiations for float/half.
Search Core Logic
cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh, cpp/src/neighbors/ivf_sq/ivf_sq_search_uint8_t_int64_t.cu
Coarse GEMM selection, chunk index calc, runtime→compile-time dispatch for metric and capacity, fused-local-top‑k JIT‑LTO scan with occupancy probe and fallback materialize path, sample filter dispatch, and exported search entrypoints.
JIT‑LTO Scan Kernels & Planner
cpp/src/.../detail/jit_lto_kernels/scan_impl.cuh, .../scan_kernel.cu.in, .../kernel_def.hpp, .../scan_planner.hpp
Adds ivf_sq_scan_impl<Capacity> device implementation (fused top‑k and materialize modes), exported kernel wrappers parameterized by capacity, function‑pointer signature alias, and planner registering metric/setup/accumulate/finalize/filter fragments.
JIT‑LTO Fragment Tags & Device Functions
cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp, cpp/src/.../device_functions.cuh, .../accumulate_distance_impl.cuh, .../finalize_distance_impl.cuh, .../setup_*_impl.cuh
Declares fragment tag types for metrics/filters and device function forward declarations; provides metric‑specific implementations for setup, accumulate, and finalize steps.
JIT‑LTO Filter Wiring & Matrices
cpp/src/.../filter_kernel.cu.in, .../filter_matrix.json, .../scan_matrix.json, .../*_matrix.json
Specializes sample_filter<int64_t> wiring for filter fragments; JSON matrices declare supported capacity, metric_name, and filter_name options.
Kernel Template Includes
cpp/src/.../*_kernel.cu.in, cpp/src/.../*_impl.cuh
Adds metric- and filter‑parameterized .cu.in templates that bind to *_impl functions via compile‑time placeholders used by JIT‑LTO generator.
Serialization
cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh, cpp/src/neighbors/ivf_sq/ivf_sq_serialize_uint8_t.cu
Implements serialize/deserialize for index<CodeT> (dtype/version/metadata, centers, optional center_norms, sq_vmin/sq_delta, per‑list data) and instantiates file wrappers for uint8_t.
JIT/CMake & Build Integration
cpp/CMakeLists.txt
Adds IVF‑SQ JIT‑LTO kernel fragment registrations to jit_lto_files and appends new ivf_sq implementation sources to cuvs_objs.
Benchmarks: CMake / Wrapper / Parser
cpp/bench/ann/CMakeLists.txt, cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h, .../cuvs_ivf_sq.cu, .../cuvs_benchmark.cu, .../cuvs_ann_bench_param_parser.h
Adds CUVS/FAISS CPU/GPU IVF‑SQ benchmark toggles and targets; new cuvs_ivf_sq<T> benchmark wrapper, explicit instantiations, and JSON parsing for nlist, niter, ratio, nprobe.
Tests
cpp/tests/neighbors/ann_ivf_sq.cuh, cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu, .../test_half_int64_t.cu, cpp/tests/CMakeLists.txt
Adds parameterized GoogleTest fixture exercising build/extend/search/serialize/filter with float and half, test instantiations, and CMake test wiring.
Docs
docs/source/cpp_api/neighbors.rst, docs/source/cpp_api/neighbors_ivf_sq.rst
Adds neighbors_ivf_sq.rst to the neighbors TOC and a new API page documenting IVF‑SQ types and functions.
Python benchmark configs & constraints
python/cuvs_bench/.../algorithms.yaml, .../constraints/__init__.py, .../cuvs_ivf_sq.yaml, .../faiss_*_ivf_sq.yaml
Updates algorithm entries, adds cuvs_ivf_sq/faiss_cpu_ivf_sq/faiss_gpu_ivf_sq configs and constraint functions enforcing nlist >= nprobe, and adds YAML parameter grids for base/large/test groups.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'IVF-SQ C++ API' clearly and concisely describes the main feature being added—a new C++ API for the IVF-SQ index algorithm.
Description check ✅ Passed The PR description is comprehensive and directly related to the changeset, covering build/extend/search implementation details, algorithm parameters, metrics support, and performance benchmarks.
Linked Issues check ✅ Passed The PR successfully implements the IVF-SQ index feature (#1291) with complete build, extend, search, and serialization APIs; JIT-LTO kernel support for multiple metrics/capacities; and comprehensive tests.
Out of Scope Changes check ✅ Passed All changes are directly aligned with implementing IVF-SQ support. Documentation, CMake configuration, benchmark setup, test infrastructure, and benchmark algorithm configs are all in scope for a new feature.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (4)
docs/source/cpp_api/neighbors_ivf_sq.rst (1)

4-4: Consider adding a one-line capability summary in the intro.

A short explicit note for supported metrics (L2/IP/Cosine), data types (float/half), and filtering would make the page easier to scan.

As per coding guidelines: "Clarity: Flag confusing explanations, missing prerequisites, or unclear examples".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/cpp_api/neighbors_ivf_sq.rst` at line 4, Add a one-line
capability summary at the top of the IVF-SQ intro that explicitly lists
supported distance metrics (L2/IP/Cosine), supported data types (float, half),
and whether filtering is supported, while briefly referencing the key parameters
n_lists and n_probes and the scalar quantization behavior; update the paragraph
that begins "The IVF-SQ method is an ANN algorithm..." to prepend this concise
capability note so readers can quickly scan supported features.
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh (1)

16-23: Narrow this declaration to the supported index type.

cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_kernel.cu.in only defines sample_filter<int64_t>. Keeping a generic declaration here advertises broader support and turns accidental IndexT != int64_t use into a late device-link failure instead of an earlier, clearer error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh` around
lines 16 - 23, The declaration for sample_filter should be narrowed to the
concrete index type that is actually implemented; replace the templated
declaration of sample_filter with a non-templated declaration that uses int64_t
for the index type so it matches the single instantiation defined elsewhere
(sample_filter<int64_t>), ensuring parameter types (inds_ptrs, bitset_len,
original_nbits, etc.) use int64_t and avoiding accidental use of other IndexT
types that would cause late device-link failures.
cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu (1)

15-15: Replace INSTANTIATE_TEST_CASE_P with INSTANTIATE_TEST_SUITE_P on line 15.

INSTANTIATE_TEST_CASE_P is the legacy GoogleTest macro. Use the modern INSTANTIATE_TEST_SUITE_P to avoid carrying deprecated spelling into new test files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu` at line 15, Replace the
deprecated GoogleTest macro INSTANTIATE_TEST_CASE_P with the modern
INSTANTIATE_TEST_SUITE_P in the test instantiation line (currently calling
INSTANTIATE_TEST_CASE_P for AnnIVFSQTest/AnnIVFSQTestF_float and
::testing::ValuesIn(inputs)); change only the macro name so the test still
instantiates the same parameterized test using the same test suite and value
source.
python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py (1)

53-57: Deduplicate repeated IVF-SQ nlist/nprobe constraint logic.

All three new functions implement the same condition; extracting a small helper will keep this easier to maintain.

♻️ Proposed refactor
+def _check_nlist_nprobe(params, build_params):
+    if "nlist" in build_params and "nprobe" in params:
+        return build_params["nlist"] >= params["nprobe"]
+    return True
+
 def cuvs_ivf_sq_search(params, build_params, k, batch_size):
-    if "nlist" in build_params and "nprobe" in params:
-        return build_params["nlist"] >= params["nprobe"]
-    return True
+    return _check_nlist_nprobe(params, build_params)

 def faiss_gpu_ivf_sq_search(params, build_params, k, batch_size):
-    if "nlist" in build_params and "nprobe" in params:
-        return build_params["nlist"] >= params["nprobe"]
-    return True
+    return _check_nlist_nprobe(params, build_params)

 def faiss_cpu_ivf_sq_search(params, build_params, k, batch_size):
-    if "nlist" in build_params and "nprobe" in params:
-        return build_params["nlist"] >= params["nprobe"]
-    return True
+    return _check_nlist_nprobe(params, build_params)

Also applies to: 64-68, 70-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py` around
lines 53 - 57, Extract the duplicated nlist/nprobe check into a small helper
(e.g., _ivf_sq_nlist_nprobe_ok(params, build_params)) that returns
build_params["nlist"] >= params["nprobe"] when both keys exist, otherwise True;
replace the inline conditional in cuvs_ivf_sq_search and the other two IVF‑SQ
constraint functions with a call to this helper, keep the helper in the same
module and use the original function names (cuvs_ivf_sq_search and the other
IVF‑SQ constraint functions) to locate the places to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h`:
- Around line 125-138: The search() method in cuvs_ivf_sq<T>::search
dereferences filter_ unconditionally but filter_ is only set in
set_search_param(), so calling build()/load() then search() can crash;
initialize filter_ to a safe default (e.g., make_cuvs_filter(nullptr,
index_->size()) or none_sample_filter) when the index is created/loaded (or in
the cuvs_ivf_sq constructor) and ensure set_search_param() still replaces it
when called, so cuvs_ivf_sq::search can safely pass *filter_ without null
checks.
- Around line 96-101: The current cuvs_ivf_sq<T>::set_search_param uses
index_params_.n_lists for validating search_params_.n_probes which can differ
from a loaded serialized index; change the logic to first check if index_ is
non-null (the in-memory index loaded via load()), and validate or clamp
search_params_.n_probes against index_->n_lists() when index_ is present
(fallback to index_params_.n_lists() only if index_ is null); replace the raw
assert against index_params_.n_lists with a guarded clamp/validation of
search_params_.n_probes (and optionally a warning) so release builds don't skip
the check and serialized indexes are handled correctly.

In `@cpp/include/cuvs/neighbors/ivf_sq.hpp`:
- Around line 133-139: The Doxygen comment for scalar quantization incorrectly
multiplies by 255; update the public docs in the comment block referencing
sq_vmin and sq_delta to match the implementation by replacing the formula with
code_i = round((x_i - vmin_i) / delta_i) (and note that delta_i is the per-level
step size used to produce the uint8_t code), and ensure any adjacent doc text
clarifies that decoding uses the inverse with sq_delta and sq_vmin.

In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh`:
- Around line 223-228: The current branch only writes out_distances/out_indices
when (valid), leaving skipped slots (especially when Capacity==0) uninitialized;
change the else branch that computes out_idx = query_ix * max_samples +
sample_offset + row so it always stores a deterministic sentinel for
filtered-out slots (e.g., distance = +INF or max sentinel value and index =
UINT32_MAX or -1) when !valid, and keep writing the actual dist/index when valid
so downstream top-k reads deterministic values.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh`:
- Around line 687-703: The public pointer-overload of extend dereferences idx
(calling idx->dim()) before validating it; add a nullptr guard at the top of the
overload(s) so RAFT_EXPECTS(idx != nullptr, "...") runs before any use of idx,
then use idx->dim() only after that check; apply the same change to the other
pointer overload variant (the overload covering lines ~706-723) so both
overloads produce a clean RAFT_EXPECTS failure instead of crashing when passed
nullptr.
- Around line 549-563: Validate the kmeans_trainset_fraction parameter before
it’s converted/used: add a precondition (using RAFT_EXPECTS) that
params.kmeans_trainset_fraction is > 0 and <= 1 (or whatever valid
inclusive/exclusive bounds your API requires) immediately before computing
trainset_ratio in ivf_sq_build.cuh so invalid zero/negative/out-of-range values
fail fast instead of being cast to size_t; reference the
params.kmeans_trainset_fraction symbol and perform the check prior to the
trainset_ratio/n_rows_train computation.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh`:
- Around line 647-650: After clamping n_probes you must reject the case where
the index has zero lists to avoid launching kernels with grid.x == 0; add a
post-clamp check such as RAFT_EXPECTS(n_probes > 0, "index must contain at least
one list") or equivalently assert index.n_lists() > 0 before proceeding. Locate
the clamp code that computes auto n_probes = std::min<uint32_t>(params.n_probes,
index.n_lists()) (symbols: params.n_probes, n_probes, index.n_lists()) and
insert the RAFT_EXPECTS check immediately after it so ivf_sq_scan is never
invoked with grid.x == 0.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh`:
- Around line 154-159: The public wrapper function deserialize(raft::resources
const&, const std::string&, cuvs::neighbors::ivf_sq::index<CodeT>*) must
validate the out-pointer before dereferencing; add a nullptr check at the start
of that function and fail fast (e.g., throw std::invalid_argument or return an
error) with a clear message like "index is null" instead of unconditionally
doing *index = ..., so callers passing nullptr won't segfault.

---

Nitpick comments:
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh`:
- Around line 16-23: The declaration for sample_filter should be narrowed to the
concrete index type that is actually implemented; replace the templated
declaration of sample_filter with a non-templated declaration that uses int64_t
for the index type so it matches the single instantiation defined elsewhere
(sample_filter<int64_t>), ensuring parameter types (inds_ptrs, bitset_len,
original_nbits, etc.) use int64_t and avoiding accidental use of other IndexT
types that would cause late device-link failures.

In `@cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu`:
- Line 15: Replace the deprecated GoogleTest macro INSTANTIATE_TEST_CASE_P with
the modern INSTANTIATE_TEST_SUITE_P in the test instantiation line (currently
calling INSTANTIATE_TEST_CASE_P for AnnIVFSQTest/AnnIVFSQTestF_float and
::testing::ValuesIn(inputs)); change only the macro name so the test still
instantiates the same parameterized test using the same test suite and value
source.

In `@docs/source/cpp_api/neighbors_ivf_sq.rst`:
- Line 4: Add a one-line capability summary at the top of the IVF-SQ intro that
explicitly lists supported distance metrics (L2/IP/Cosine), supported data types
(float, half), and whether filtering is supported, while briefly referencing the
key parameters n_lists and n_probes and the scalar quantization behavior; update
the paragraph that begins "The IVF-SQ method is an ANN algorithm..." to prepend
this concise capability note so readers can quickly scan supported features.

In `@python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py`:
- Around line 53-57: Extract the duplicated nlist/nprobe check into a small
helper (e.g., _ivf_sq_nlist_nprobe_ok(params, build_params)) that returns
build_params["nlist"] >= params["nprobe"] when both keys exist, otherwise True;
replace the inline conditional in cuvs_ivf_sq_search and the other two IVF‑SQ
constraint functions with a call to this helper, keep the helper in the same
module and use the original function names (cuvs_ivf_sq_search and the other
IVF‑SQ constraint functions) to locate the places to change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 86c332b9-8d53-48ad-b32a-0efb7774f2b0

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 3ba5e70.

📒 Files selected for processing (35)
  • cpp/CMakeLists.txt
  • cpp/bench/ann/CMakeLists.txt
  • cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h
  • cpp/bench/ann/src/cuvs/cuvs_benchmark.cu
  • cpp/bench/ann/src/cuvs/cuvs_ivf_sq.cu
  • cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h
  • cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp
  • cpp/include/cuvs/neighbors/ivf_sq.hpp
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/kernel_def.hpp
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_planner.hpp
  • cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh
  • cpp/src/neighbors/ivf_sq/ivf_sq_build_extend_float_uint8_t_int64_t.cu
  • cpp/src/neighbors/ivf_sq/ivf_sq_build_extend_half_uint8_t_int64_t.cu
  • cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh
  • cpp/src/neighbors/ivf_sq/ivf_sq_search_uint8_t_int64_t.cu
  • cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh
  • cpp/src/neighbors/ivf_sq/ivf_sq_serialize_uint8_t.cu
  • cpp/src/neighbors/ivf_sq_index.cpp
  • cpp/tests/CMakeLists.txt
  • cpp/tests/neighbors/ann_ivf_sq.cuh
  • cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu
  • cpp/tests/neighbors/ann_ivf_sq/test_half_int64_t.cu
  • docs/source/cpp_api/neighbors.rst
  • docs/source/cpp_api/neighbors_ivf_sq.rst
  • python/cuvs_bench/cuvs_bench/config/algorithms.yaml
  • python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py
  • python/cuvs_bench/cuvs_bench/config/algos/cuvs_ivf_sq.yaml
  • python/cuvs_bench/cuvs_bench/config/algos/faiss_cpu_ivf_sq.yaml
  • python/cuvs_bench/cuvs_bench/config/algos/faiss_gpu_ivf_sq.yaml

Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h
Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h
Comment thread cpp/include/cuvs/neighbors/ivf_sq.hpp Outdated
Comment thread cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh
Comment thread cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh Outdated
// Reconstructed vector component: s_aux[d] + code*s_sq_scale[d].
//
// After all probes are scanned, the smem is reused for block_sort merge.
template <int Capacity, typename MetricTag>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not structure the kernel in a way where we need MetricTag. Usually, that is a way to tell us that related functionality could be split out in another device function fragment. Look at @KyleFromNVIDIA's work with IVF PQ for examples.

Also Kyle's rule of thumb: if a template parameter does not have a function parameter argument, it is likely that code related to that template parameter can be pulled into another fragment.

Copy link
Copy Markdown
Member

@KyleFromNVIDIA KyleFromNVIDIA Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule of thumb that a template parameter missing from a signature means the templated code can be split out into another fragment only holds when said template parameter isn't tightly integrated into the body of the function. This one is borderline. Each fatbin generated by this fragment is <20KB (for a total of ~300KB), and I don't know how much of a gain we'd get by splitting out the metric-specific parts into their own fragment (it would be ~200KB at most, and that's not including the additional metric fragment). I'm inclined to think it's not worth it. At the very least, we can certainly save it for a follow-up.

My approval stands.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think kAscending could be promoted to a template parameter of the main kernel as many metrics would share the same value (most would be true?), so you wouldn't have to instantiate this kernel for every metric and internally infer the same value.

The parts related to kIsCosine can be separated out to different fragments, a cosine specific one and a no-op one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both for the careful reviews! Took some time to test different things. Conclusion : even though the metrics are tightly coupled with the main body it is still possible to recover some beyond the filter fragment.

Two changes:

  1. Move metric specialization out of the kernel template: into four device-function fragments :
  • setup_invariant_smem
  • setup_per_probe_smem
  • accumulate_distance
  • finalize_distance (the cosine vs no-op one)

The entrypoint now needs (Capacity, Ascending) instead of (Capacity, MetricTag).

  1. Apply the IVF-PQ inner-product trick: noticed that IVF-PQ negates IP scores in the kernel so warpsort::block_sort always runs ascending, then flip them back in postprocess_distances via account_for_max_close. Ascending becomes hardcoded, so the kernel is now templated on Capacity alone.

Net effect:

  • Before : 5 Capacity * 3 Metrics -> 15 main body instantiations -> 91MB libcuvs.so
  • After : 5 Capacity -> 5 main body instantiations -> 89MB libcuvs.so
  • No measurable performance impact

Copy link
Copy Markdown
Member

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved CMake changes and JIT+LTO changes. Very nice.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh (1)

89-98: 💤 Low value

Add compile-time guard for unsupported filter types.

If this template is instantiated with an unsupported FilterT, the function has no return statement, leading to undefined behavior or an obscure compile error. A static_assert in the else branch would provide a clearer diagnostic.

Suggested improvement
 template <typename FilterT>
 constexpr auto get_filter_type_tag()
 {
   using namespace cuvs::neighbors::filtering;
   if constexpr (std::is_same_v<FilterT, none_sample_filter>) {
     return cuvs::neighbors::detail::tag_filter_none{};
   } else if constexpr (std::is_same_v<FilterT, bitset_filter<uint32_t, int64_t>>) {
     return cuvs::neighbors::detail::tag_filter_bitset{};
+  } else {
+    static_assert(sizeof(FilterT) == 0, "Unsupported IvfSampleFilterT for IVF-SQ scan.");
   }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh` around lines 89 - 98, The
function get_filter_type_tag currently has no return for unsupported FilterT
values; add a compile-time guard in the final else branch that triggers a clear
static_assert when FilterT is not none_sample_filter or bitset_filter (use a
dependent-false pattern like always_false<FilterT>) to produce a readable
diagnostic; reference the template get_filter_type_tag, type parameters FilterT,
the known cases none_sample_filter and bitset_filter<uint32_t,int64_t>, and the
expected return tags tag_filter_none and tag_filter_bitset so the static_assert
message can state "unsupported FilterT for get_filter_type_tag" and prevent the
missing-return/UB compile scenario.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh`:
- Around line 39-43: The function finalize_distance_cosine_impl currently
returns 0.0f when denom <= 0 which wrongly promotes zero-norm vectors; change
the divide-protection to use a small epsilon (e.g. eps = 1e-8f) when checking
denom and, if denom <= eps, return a worst-distance sentinel (e.g. FLT_MAX or
INFINITY) instead of 0.0f; update finalize_distance_cosine_impl (variables:
denom, v_norm_sq, query_norm) accordingly and add the needed include for the
sentinel (e.g. <cfloat>) if not already present.

---

Nitpick comments:
In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh`:
- Around line 89-98: The function get_filter_type_tag currently has no return
for unsupported FilterT values; add a compile-time guard in the final else
branch that triggers a clear static_assert when FilterT is not
none_sample_filter or bitset_filter (use a dependent-false pattern like
always_false<FilterT>) to produce a readable diagnostic; reference the template
get_filter_type_tag, type parameters FilterT, the known cases none_sample_filter
and bitset_filter<uint32_t,int64_t>, and the expected return tags
tag_filter_none and tag_filter_bitset so the static_assert message can state
"unsupported FilterT for get_filter_type_tag" and prevent the missing-return/UB
compile scenario.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26665f07-6856-4bdb-b23d-ebf00065ec53

📥 Commits

Reviewing files that changed from the base of the PR and between 1e638e5 and c5948a2.

📒 Files selected for processing (20)
  • cpp/CMakeLists.txt
  • cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_impl.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_planner.hpp
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_impl.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_impl.cuh
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_matrix.json
  • cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh
✅ Files skipped from review due to trivial changes (9)
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_matrix.json
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_kernel.cu.in
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_matrix.json
  • cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/CMakeLists.txt
  • cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.in

Comment on lines +39 to +43
__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)
{
float denom = query_norm * sqrtf(v_norm_sq);
return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cosine zero-norm fallback currently returns a best-score distance.

At Line 42, denom <= 0 returns 0.0f, which is min-close and can incorrectly promote invalid cosine comparisons to top results. Use a worst-distance sentinel instead.

Suggested fix
 __device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)
 {
   float denom = query_norm * sqrtf(v_norm_sq);
-  return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f;
+  return (denom > 0.0f) ? 1.0f - dist / denom : CUDART_INF_F;
 }

As per coding guidelines, "Floating-point computations (distance calculations, kernel evaluations, accumulations) must handle numerical edge cases: protect divisions with epsilon checks, handle zero-norm vectors..."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)
{
float denom = query_norm * sqrtf(v_norm_sq);
return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f;
}
__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)
{
float denom = query_norm * sqrtf(v_norm_sq);
return (denom > 0.0f) ? 1.0f - dist / denom : CUDART_INF_F;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh`
around lines 39 - 43, The function finalize_distance_cosine_impl currently
returns 0.0f when denom <= 0 which wrongly promotes zero-norm vectors; change
the divide-protection to use a small epsilon (e.g. eps = 1e-8f) when checking
denom and, if denom <= eps, return a worst-distance sentinel (e.g. FLT_MAX or
INFINITY) instead of 0.0f; update finalize_distance_cosine_impl (variables:
denom, v_norm_sq, query_norm) accordingly and add the needed include for the
sentinel (e.g. <cfloat>) if not already present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C++ feature request New feature or request non-breaking Introduces a non-breaking change

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA] Support IVF-SQ index

10 participants