IVF-SQ C++ API#1865
Conversation
jinsolp
left a comment
There was a problem hiding this comment.
Thanks @viclafargue ! Sharing my first batch of comments for build and extend.
Aside from this, I also think ivf_sq.hpp overall needs some documentation!
| auto orig_centroids_view = raft::make_device_matrix_view<const float, int64_t>( | ||
| index->centers().data_handle(), n_lists, dim); | ||
|
|
||
| constexpr size_t kReasonableMaxBatchSize = 65536; |
There was a problem hiding this comment.
This is an arbitrary value used in other index types. This probably was a safe value that ensured that we never ran out of memory in smaller systems. But, thanks for pointing this out as it might be interesting to see if we could improve the way this value is determined (dataset dimensions and available VRAM). cc @achirkin @tfeher
There was a problem hiding this comment.
That is correct, and one more argument here is that this size is well beyond the point where the GPU is saturated:
{big enough grid size} < kReasonableMaxBatchSize < {small enough memory footprint}
|
@viclafargue can you please share the build time speedups over Faiss IVFSQ on GPU and CPU? Those are going to be crucial as the major value prop of cuVS over Faiss (both cpu and gpu) is index build speedup. |
jinsolp
left a comment
There was a problem hiding this comment.
Second batch of comments! I looked through search and serialize this time, looks good!
|
The wheel-build CI was failing because the IVF-SQ scan kernel pushed |
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a complete IVF‑SQ nearest‑neighbor implementation: public C++ API, device build/extend/search/serialize logic, JIT‑LTO scan/filter kernels and matrices, CMake/JIT wiring, benchmark integrations and configs, comprehensive tests, and documentation. ChangesIVF‑SQ Feature (single cohesive DAG)
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (4)
docs/source/cpp_api/neighbors_ivf_sq.rst (1)
4-4: Consider adding a one-line capability summary in the intro.A short explicit note for supported metrics (L2/IP/Cosine), data types (float/half), and filtering would make the page easier to scan.
As per coding guidelines: "Clarity: Flag confusing explanations, missing prerequisites, or unclear examples".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/source/cpp_api/neighbors_ivf_sq.rst` at line 4, Add a one-line capability summary at the top of the IVF-SQ intro that explicitly lists supported distance metrics (L2/IP/Cosine), supported data types (float, half), and whether filtering is supported, while briefly referencing the key parameters n_lists and n_probes and the scalar quantization behavior; update the paragraph that begins "The IVF-SQ method is an ANN algorithm..." to prepend this concise capability note so readers can quickly scan supported features.cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh (1)
16-23: Narrow this declaration to the supported index type.
cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_kernel.cu.inonly definessample_filter<int64_t>. Keeping a generic declaration here advertises broader support and turns accidentalIndexT != int64_tuse into a late device-link failure instead of an earlier, clearer error.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh` around lines 16 - 23, The declaration for sample_filter should be narrowed to the concrete index type that is actually implemented; replace the templated declaration of sample_filter with a non-templated declaration that uses int64_t for the index type so it matches the single instantiation defined elsewhere (sample_filter<int64_t>), ensuring parameter types (inds_ptrs, bitset_len, original_nbits, etc.) use int64_t and avoiding accidental use of other IndexT types that would cause late device-link failures.cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu (1)
15-15: ReplaceINSTANTIATE_TEST_CASE_PwithINSTANTIATE_TEST_SUITE_Pon line 15.
INSTANTIATE_TEST_CASE_Pis the legacy GoogleTest macro. Use the modernINSTANTIATE_TEST_SUITE_Pto avoid carrying deprecated spelling into new test files.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu` at line 15, Replace the deprecated GoogleTest macro INSTANTIATE_TEST_CASE_P with the modern INSTANTIATE_TEST_SUITE_P in the test instantiation line (currently calling INSTANTIATE_TEST_CASE_P for AnnIVFSQTest/AnnIVFSQTestF_float and ::testing::ValuesIn(inputs)); change only the macro name so the test still instantiates the same parameterized test using the same test suite and value source.python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py (1)
53-57: Deduplicate repeated IVF-SQnlist/nprobeconstraint logic.All three new functions implement the same condition; extracting a small helper will keep this easier to maintain.
♻️ Proposed refactor
+def _check_nlist_nprobe(params, build_params): + if "nlist" in build_params and "nprobe" in params: + return build_params["nlist"] >= params["nprobe"] + return True + def cuvs_ivf_sq_search(params, build_params, k, batch_size): - if "nlist" in build_params and "nprobe" in params: - return build_params["nlist"] >= params["nprobe"] - return True + return _check_nlist_nprobe(params, build_params) def faiss_gpu_ivf_sq_search(params, build_params, k, batch_size): - if "nlist" in build_params and "nprobe" in params: - return build_params["nlist"] >= params["nprobe"] - return True + return _check_nlist_nprobe(params, build_params) def faiss_cpu_ivf_sq_search(params, build_params, k, batch_size): - if "nlist" in build_params and "nprobe" in params: - return build_params["nlist"] >= params["nprobe"] - return True + return _check_nlist_nprobe(params, build_params)Also applies to: 64-68, 70-74
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py` around lines 53 - 57, Extract the duplicated nlist/nprobe check into a small helper (e.g., _ivf_sq_nlist_nprobe_ok(params, build_params)) that returns build_params["nlist"] >= params["nprobe"] when both keys exist, otherwise True; replace the inline conditional in cuvs_ivf_sq_search and the other two IVF‑SQ constraint functions with a call to this helper, keep the helper in the same module and use the original function names (cuvs_ivf_sq_search and the other IVF‑SQ constraint functions) to locate the places to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.h`:
- Around line 125-138: The search() method in cuvs_ivf_sq<T>::search
dereferences filter_ unconditionally but filter_ is only set in
set_search_param(), so calling build()/load() then search() can crash;
initialize filter_ to a safe default (e.g., make_cuvs_filter(nullptr,
index_->size()) or none_sample_filter) when the index is created/loaded (or in
the cuvs_ivf_sq constructor) and ensure set_search_param() still replaces it
when called, so cuvs_ivf_sq::search can safely pass *filter_ without null
checks.
- Around line 96-101: The current cuvs_ivf_sq<T>::set_search_param uses
index_params_.n_lists for validating search_params_.n_probes which can differ
from a loaded serialized index; change the logic to first check if index_ is
non-null (the in-memory index loaded via load()), and validate or clamp
search_params_.n_probes against index_->n_lists() when index_ is present
(fallback to index_params_.n_lists() only if index_ is null); replace the raw
assert against index_params_.n_lists with a guarded clamp/validation of
search_params_.n_probes (and optionally a warning) so release builds don't skip
the check and serialized indexes are handled correctly.
In `@cpp/include/cuvs/neighbors/ivf_sq.hpp`:
- Around line 133-139: The Doxygen comment for scalar quantization incorrectly
multiplies by 255; update the public docs in the comment block referencing
sq_vmin and sq_delta to match the implementation by replacing the formula with
code_i = round((x_i - vmin_i) / delta_i) (and note that delta_i is the per-level
step size used to produce the uint8_t code), and ensure any adjacent doc text
clarifies that decoding uses the inverse with sq_delta and sq_vmin.
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuh`:
- Around line 223-228: The current branch only writes out_distances/out_indices
when (valid), leaving skipped slots (especially when Capacity==0) uninitialized;
change the else branch that computes out_idx = query_ix * max_samples +
sample_offset + row so it always stores a deterministic sentinel for
filtered-out slots (e.g., distance = +INF or max sentinel value and index =
UINT32_MAX or -1) when !valid, and keep writing the actual dist/index when valid
so downstream top-k reads deterministic values.
In `@cpp/src/neighbors/ivf_sq/ivf_sq_build.cuh`:
- Around line 687-703: The public pointer-overload of extend dereferences idx
(calling idx->dim()) before validating it; add a nullptr guard at the top of the
overload(s) so RAFT_EXPECTS(idx != nullptr, "...") runs before any use of idx,
then use idx->dim() only after that check; apply the same change to the other
pointer overload variant (the overload covering lines ~706-723) so both
overloads produce a clean RAFT_EXPECTS failure instead of crashing when passed
nullptr.
- Around line 549-563: Validate the kmeans_trainset_fraction parameter before
it’s converted/used: add a precondition (using RAFT_EXPECTS) that
params.kmeans_trainset_fraction is > 0 and <= 1 (or whatever valid
inclusive/exclusive bounds your API requires) immediately before computing
trainset_ratio in ivf_sq_build.cuh so invalid zero/negative/out-of-range values
fail fast instead of being cast to size_t; reference the
params.kmeans_trainset_fraction symbol and perform the check prior to the
trainset_ratio/n_rows_train computation.
In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh`:
- Around line 647-650: After clamping n_probes you must reject the case where
the index has zero lists to avoid launching kernels with grid.x == 0; add a
post-clamp check such as RAFT_EXPECTS(n_probes > 0, "index must contain at least
one list") or equivalently assert index.n_lists() > 0 before proceeding. Locate
the clamp code that computes auto n_probes = std::min<uint32_t>(params.n_probes,
index.n_lists()) (symbols: params.n_probes, n_probes, index.n_lists()) and
insert the RAFT_EXPECTS check immediately after it so ivf_sq_scan is never
invoked with grid.x == 0.
In `@cpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuh`:
- Around line 154-159: The public wrapper function deserialize(raft::resources
const&, const std::string&, cuvs::neighbors::ivf_sq::index<CodeT>*) must
validate the out-pointer before dereferencing; add a nullptr check at the start
of that function and fail fast (e.g., throw std::invalid_argument or return an
error) with a clear message like "index is null" instead of unconditionally
doing *index = ..., so callers passing nullptr won't segfault.
---
Nitpick comments:
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh`:
- Around line 16-23: The declaration for sample_filter should be narrowed to the
concrete index type that is actually implemented; replace the templated
declaration of sample_filter with a non-templated declaration that uses int64_t
for the index type so it matches the single instantiation defined elsewhere
(sample_filter<int64_t>), ensuring parameter types (inds_ptrs, bitset_len,
original_nbits, etc.) use int64_t and avoiding accidental use of other IndexT
types that would cause late device-link failures.
In `@cpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cu`:
- Line 15: Replace the deprecated GoogleTest macro INSTANTIATE_TEST_CASE_P with
the modern INSTANTIATE_TEST_SUITE_P in the test instantiation line (currently
calling INSTANTIATE_TEST_CASE_P for AnnIVFSQTest/AnnIVFSQTestF_float and
::testing::ValuesIn(inputs)); change only the macro name so the test still
instantiates the same parameterized test using the same test suite and value
source.
In `@docs/source/cpp_api/neighbors_ivf_sq.rst`:
- Line 4: Add a one-line capability summary at the top of the IVF-SQ intro that
explicitly lists supported distance metrics (L2/IP/Cosine), supported data types
(float, half), and whether filtering is supported, while briefly referencing the
key parameters n_lists and n_probes and the scalar quantization behavior; update
the paragraph that begins "The IVF-SQ method is an ANN algorithm..." to prepend
this concise capability note so readers can quickly scan supported features.
In `@python/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.py`:
- Around line 53-57: Extract the duplicated nlist/nprobe check into a small
helper (e.g., _ivf_sq_nlist_nprobe_ok(params, build_params)) that returns
build_params["nlist"] >= params["nprobe"] when both keys exist, otherwise True;
replace the inline conditional in cuvs_ivf_sq_search and the other two IVF‑SQ
constraint functions with a call to this helper, keep the helper in the same
module and use the original function names (cuvs_ivf_sq_search and the other
IVF‑SQ constraint functions) to locate the places to change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 86c332b9-8d53-48ad-b32a-0efb7774f2b0
📒 Files selected for processing (35)
cpp/CMakeLists.txtcpp/bench/ann/CMakeLists.txtcpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.hcpp/bench/ann/src/cuvs/cuvs_benchmark.cucpp/bench/ann/src/cuvs/cuvs_ivf_sq.cucpp/bench/ann/src/cuvs/cuvs_ivf_sq_wrapper.hcpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hppcpp/include/cuvs/neighbors/ivf_sq.hppcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_kernel.cu.incpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/filter_matrix.jsoncpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/kernel_def.hppcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.incpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_matrix.jsoncpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_planner.hppcpp/src/neighbors/ivf_sq/ivf_sq_build.cuhcpp/src/neighbors/ivf_sq/ivf_sq_build_extend_float_uint8_t_int64_t.cucpp/src/neighbors/ivf_sq/ivf_sq_build_extend_half_uint8_t_int64_t.cucpp/src/neighbors/ivf_sq/ivf_sq_search.cuhcpp/src/neighbors/ivf_sq/ivf_sq_search_uint8_t_int64_t.cucpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuhcpp/src/neighbors/ivf_sq/ivf_sq_serialize_uint8_t.cucpp/src/neighbors/ivf_sq_index.cppcpp/tests/CMakeLists.txtcpp/tests/neighbors/ann_ivf_sq.cuhcpp/tests/neighbors/ann_ivf_sq/test_float_int64_t.cucpp/tests/neighbors/ann_ivf_sq/test_half_int64_t.cudocs/source/cpp_api/neighbors.rstdocs/source/cpp_api/neighbors_ivf_sq.rstpython/cuvs_bench/cuvs_bench/config/algorithms.yamlpython/cuvs_bench/cuvs_bench/config/algos/constraints/__init__.pypython/cuvs_bench/cuvs_bench/config/algos/cuvs_ivf_sq.yamlpython/cuvs_bench/cuvs_bench/config/algos/faiss_cpu_ivf_sq.yamlpython/cuvs_bench/cuvs_bench/config/algos/faiss_gpu_ivf_sq.yaml
| // Reconstructed vector component: s_aux[d] + code*s_sq_scale[d]. | ||
| // | ||
| // After all probes are scanned, the smem is reused for block_sort merge. | ||
| template <int Capacity, typename MetricTag> |
There was a problem hiding this comment.
We should not structure the kernel in a way where we need MetricTag. Usually, that is a way to tell us that related functionality could be split out in another device function fragment. Look at @KyleFromNVIDIA's work with IVF PQ for examples.
Also Kyle's rule of thumb: if a template parameter does not have a function parameter argument, it is likely that code related to that template parameter can be pulled into another fragment.
There was a problem hiding this comment.
The rule of thumb that a template parameter missing from a signature means the templated code can be split out into another fragment only holds when said template parameter isn't tightly integrated into the body of the function. This one is borderline. Each fatbin generated by this fragment is <20KB (for a total of ~300KB), and I don't know how much of a gain we'd get by splitting out the metric-specific parts into their own fragment (it would be ~200KB at most, and that's not including the additional metric fragment). I'm inclined to think it's not worth it. At the very least, we can certainly save it for a follow-up.
My approval stands.
There was a problem hiding this comment.
I think kAscending could be promoted to a template parameter of the main kernel as many metrics would share the same value (most would be true?), so you wouldn't have to instantiate this kernel for every metric and internally infer the same value.
The parts related to kIsCosine can be separated out to different fragments, a cosine specific one and a no-op one.
There was a problem hiding this comment.
Thanks both for the careful reviews! Took some time to test different things. Conclusion : even though the metrics are tightly coupled with the main body it is still possible to recover some beyond the filter fragment.
Two changes:
- Move metric specialization out of the kernel template: into four device-function fragments :
setup_invariant_smemsetup_per_probe_smemaccumulate_distancefinalize_distance(the cosine vs no-op one)
The entrypoint now needs (Capacity, Ascending) instead of (Capacity, MetricTag).
- Apply the IVF-PQ inner-product trick: noticed that IVF-PQ negates IP scores in the kernel so
warpsort::block_sortalways runs ascending, then flip them back inpostprocess_distancesviaaccount_for_max_close. Ascending becomes hardcoded, so the kernel is now templated on Capacity alone.
Net effect:
- Before : 5 Capacity * 3 Metrics -> 15 main body instantiations -> 91MB
libcuvs.so - After : 5 Capacity -> 5 main body instantiations -> 89MB
libcuvs.so - No measurable performance impact
KyleFromNVIDIA
left a comment
There was a problem hiding this comment.
Approved CMake changes and JIT+LTO changes. Very nice.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh (1)
89-98: 💤 Low valueAdd compile-time guard for unsupported filter types.
If this template is instantiated with an unsupported
FilterT, the function has no return statement, leading to undefined behavior or an obscure compile error. Astatic_assertin the else branch would provide a clearer diagnostic.Suggested improvement
template <typename FilterT> constexpr auto get_filter_type_tag() { using namespace cuvs::neighbors::filtering; if constexpr (std::is_same_v<FilterT, none_sample_filter>) { return cuvs::neighbors::detail::tag_filter_none{}; } else if constexpr (std::is_same_v<FilterT, bitset_filter<uint32_t, int64_t>>) { return cuvs::neighbors::detail::tag_filter_bitset{}; + } else { + static_assert(sizeof(FilterT) == 0, "Unsupported IvfSampleFilterT for IVF-SQ scan."); } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh` around lines 89 - 98, The function get_filter_type_tag currently has no return for unsupported FilterT values; add a compile-time guard in the final else branch that triggers a clear static_assert when FilterT is not none_sample_filter or bitset_filter (use a dependent-false pattern like always_false<FilterT>) to produce a readable diagnostic; reference the template get_filter_type_tag, type parameters FilterT, the known cases none_sample_filter and bitset_filter<uint32_t,int64_t>, and the expected return tags tag_filter_none and tag_filter_bitset so the static_assert message can state "unsupported FilterT for get_filter_type_tag" and prevent the missing-return/UB compile scenario.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh`:
- Around line 39-43: The function finalize_distance_cosine_impl currently
returns 0.0f when denom <= 0 which wrongly promotes zero-norm vectors; change
the divide-protection to use a small epsilon (e.g. eps = 1e-8f) when checking
denom and, if denom <= eps, return a worst-distance sentinel (e.g. FLT_MAX or
INFINITY) instead of 0.0f; update finalize_distance_cosine_impl (variables:
denom, v_norm_sq, query_norm) accordingly and add the needed include for the
sentinel (e.g. <cfloat>) if not already present.
---
Nitpick comments:
In `@cpp/src/neighbors/ivf_sq/ivf_sq_search.cuh`:
- Around line 89-98: The function get_filter_type_tag currently has no return
for unsupported FilterT values; add a compile-time guard in the final else
branch that triggers a clear static_assert when FilterT is not
none_sample_filter or bitset_filter (use a dependent-false pattern like
always_false<FilterT>) to produce a readable diagnostic; reference the template
get_filter_type_tag, type parameters FilterT, the known cases none_sample_filter
and bitset_filter<uint32_t,int64_t>, and the expected return tags
tag_filter_none and tag_filter_bitset so the static_assert message can state
"unsupported FilterT for get_filter_type_tag" and prevent the missing-return/UB
compile scenario.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 26665f07-6856-4bdb-b23d-ebf00065ec53
📒 Files selected for processing (20)
cpp/CMakeLists.txtcpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hppcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_impl.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_kernel.cu.incpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_matrix.jsoncpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_kernel.cu.incpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_matrix.jsoncpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_impl.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.incpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_matrix.jsoncpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_planner.hppcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_impl.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_kernel.cu.incpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_matrix.jsoncpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_impl.cuhcpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_kernel.cu.incpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_matrix.jsoncpp/src/neighbors/ivf_sq/ivf_sq_search.cuh
✅ Files skipped from review due to trivial changes (9)
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_matrix.json
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_invariant_smem_matrix.json
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/accumulate_distance_kernel.cu.in
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_kernel.cu.in
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_matrix.json
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/setup_per_probe_smem_kernel.cu.in
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_matrix.json
- cpp/include/cuvs/detail/jit_lto/ivf_sq/scan_fragments.hpp
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/device_functions.cuh
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/CMakeLists.txt
- cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/scan_kernel.cu.in
| __device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm) | ||
| { | ||
| float denom = query_norm * sqrtf(v_norm_sq); | ||
| return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f; | ||
| } |
There was a problem hiding this comment.
Cosine zero-norm fallback currently returns a best-score distance.
At Line 42, denom <= 0 returns 0.0f, which is min-close and can incorrectly promote invalid cosine comparisons to top results. Use a worst-distance sentinel instead.
Suggested fix
__device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm)
{
float denom = query_norm * sqrtf(v_norm_sq);
- return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f;
+ return (denom > 0.0f) ? 1.0f - dist / denom : CUDART_INF_F;
}As per coding guidelines, "Floating-point computations (distance calculations, kernel evaluations, accumulations) must handle numerical edge cases: protect divisions with epsilon checks, handle zero-norm vectors..."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| __device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm) | |
| { | |
| float denom = query_norm * sqrtf(v_norm_sq); | |
| return (denom > 0.0f) ? 1.0f - dist / denom : 0.0f; | |
| } | |
| __device__ float finalize_distance_cosine_impl(float dist, float v_norm_sq, float query_norm) | |
| { | |
| float denom = query_norm * sqrtf(v_norm_sq); | |
| return (denom > 0.0f) ? 1.0f - dist / denom : CUDART_INF_F; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/neighbors/ivf_sq/detail/jit_lto_kernels/finalize_distance_impl.cuh`
around lines 39 - 43, The function finalize_distance_cosine_impl currently
returns 0.0f when denom <= 0 which wrongly promotes zero-norm vectors; change
the divide-protection to use a small epsilon (e.g. eps = 1e-8f) when checking
denom and, if denom <= eps, return a worst-distance sentinel (e.g. FLT_MAX or
INFINITY) instead of 0.0f; update finalize_distance_cosine_impl (variables:
denom, v_norm_sq, query_norm) accordingly and add the needed include for the
sentinel (e.g. <cfloat>) if not already present.
Closes #1291.
Overview
IVF-SQ combines an inverted file (IVF) partitioning scheme with 8-bit scalar quantization (SQ8) of residuals. Each float32 dimension is compressed to a single uint8 code, giving a 4x memory reduction over IVF-Flat while retaining high recall. The index implements various metrics (L2, inner-product, and cosine distance), data type (float, half) and also filtering.
Build
kmeans_trainset_fraction) is sampled from the dataset. Balanced K-Means is run on it to producen_listscentroids that partition the vector space.sq_delta[d] = (range + 2*margin) / 255. These two per-dimension parameters (sq_vminthe lower end of the range,sq_deltathe scale or quantization step) are stored in the index and are all that is needed to encode/decode any vector.add_data_on_buildis true (the default), the full dataset is inserted via the extend path described below.Extend
Extend adds new vectors to an existing index without retraining centroids or SQ parameters and in a batched fashion:
(adaptative centers : when enabled, centroids are incrementally updated as new data arrives, and center norms are recomputed).uint8and write the code into the interleaved list layout.Search
Search proceeds in three stages:
queries x centers^T), with metric-specific pre/post-processing. The topn_probesnearest clusters per query are selected viaselect_k.<BlockDim, Capacity, Metric, IdxT, SampleFilter>. It operates in two grid modes controlled by the compile-time Capacity parameter:(grid_dim_x, n_queries), each block loops over its share of probes and maintains a block_sort (warpsort) priority queue, writing only the local top-k to global memory. grid_dim_x is chosen via an occupancy query (cudaOccupancyMaxActiveBlocksPerMultiprocessor) to saturate the GPU. Instantiated for Capacity ∈ {32, 64, 128, 256}, covering k ≤ 256.(n_probes, n_queries), one block per (query, probe) pair writes every distance to a flat buffer. Used as a fallback when k > 256 or the fused kernel has zero occupancy.Three dim-length shared-memory arrays are filled in two phases: Phase 1 (invariant across probes) loads sq_delta and query terms; Phase 2 (per-probe) folds in the centroid so the inner loop reduces to s_query_term − code × s_sq_scale (L2) or s_query_term × (s_aux + code × s_sq_scale) (IP/Cosine). Each warp processes one interleaved group of 32 vectors; each lane loads a uint4, giving a fully coalesced 512-byte read per dimension block. Distance accumulation is fused with decoding — no separate decompress pass.
Benchmarks on B200