JIT LTO Cagra Search #1807
Conversation
```yaml
  - libcusolver-dev
  - libcusparse-dev
  - libnvjitlink-dev
  - cuda-nvrtc-dev
```
Please remove this as it was already added in #1804.
This still hasn't been removed.
```yaml
  - cuda-nvrtc-dev
```
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable. Use the following commands to manage reviews:
📝 Walkthrough

The changes refactor CAGRA neighbor search from inline static kernels to a JIT-LTO (Just-In-Time Link-Time Optimization) compilation system. This introduces templated kernel fragments, descriptor-based metadata propagation, launcher factories, and dynamic kernel compilation infrastructure while removing the existing inline implementations.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/neighbors/detail/cagra/search_multi_kernel.cuh (1)
604-666: ⚠️ Potential issue | 🟠 Major — Restore a final CUDA launch check before returning.
The tail of this function now returns after several async launches without a `cudaPeekAtLastError()`. If `remove_parent_bit`, `apply_filter_jit`, or one of the final `batched_memcpy` kernels has an invalid launch configuration, that failure will surface on an unrelated later CUDA call instead of here.

Suggested fix
```diff
   batched_memcpy(topk_distances_ptr,
                  topk,
                  result_distances_ptr,
                  result_buffer_allocation_size,
                  topk,
                  num_queries,
                  stream);
 }
+
+ RAFT_CUDA_TRY(cudaPeekAtLastError());
 if (num_executed_iterations) {
   for (std::uint32_t i = 0; i < num_queries; i++) {
     num_executed_iterations[i] = iter;
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/search_multi_kernel.cuh` around lines 604 - 666, The function ends after launching asynchronous kernels (apply_filter_jit, remove_parent_bit, batched_memcpy) but lacks a final CUDA launch error check; insert a cudaPeekAtLastError() (or your project's equivalent CUDA launch-check macro) right before the function returns (after the batched_memcpy/topk_distances_ptr handling) to catch invalid kernel launch configurations and surface errors immediately; reference the existing symbols apply_filter_jit, remove_parent_bit, batched_memcpy, result_distances_ptr/result_indices_ptr/topk_distances_ptr when placing this check.
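The sticky-error pattern the comment asks for can be modeled host-side. This is a minimal sketch: the error enum, the fake launch, and the peek function are stand-ins for `cudaError_t`, a kernel launch, and `cudaPeekAtLastError()`, which need a real device to demonstrate.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Host-side model of the RAFT_CUDA_TRY(cudaPeekAtLastError()) idea: after a
// batch of fire-and-forget launches, poll the sticky "last error" once so an
// invalid launch surfaces here, not on an unrelated later call.
enum class fake_error { success, invalid_configuration };

fake_error g_last_error = fake_error::success;

void fake_launch(bool valid) {
  // CUDA keeps the last launch error sticky until it is queried; mimic that.
  if (!valid) { g_last_error = fake_error::invalid_configuration; }
}

fake_error fake_peek_at_last_error() { return g_last_error; }

void check_last_launch(const char* where) {
  if (fake_peek_at_last_error() != fake_error::success) {
    throw std::runtime_error(std::string("launch failed in ") + where);
  }
}
```

In the real code the check is a single `RAFT_CUDA_TRY(cudaPeekAtLastError());` placed before the return, after the last asynchronous launch.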
♻️ Duplicate comments (2)
dependencies.yaml (1)
378-378: ⚠️ Potential issue | 🟡 Minor — Remove duplicate `cuda-nvrtc-dev` in the CUDA package list.

`cuda-nvrtc-dev` at Line 378 duplicates the same entry already present at Line 370 in the same list.

Proposed fix
```diff
 - output_types: [conda]
   packages:
     - cuda-nvrtc-dev
     - cuda-nvtx-dev
     - cuda-profiler-api
     - libcublas-dev
     - libcurand-dev
     - libcusolver-dev
     - libcusparse-dev
     - libnvjitlink-dev
-    - cuda-nvrtc-dev
```

```bash
#!/bin/bash
set -euo pipefail
# Expect exactly one match after fix.
rg -n '^\s+- cuda-nvrtc-dev$' dependencies.yaml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dependencies.yaml` at line 378, Remove the duplicate package entry "cuda-nvrtc-dev" from the CUDA package list in dependencies.yaml by deleting the redundant line (the duplicate at Line 378) so only a single "- cuda-nvrtc-dev" entry remains; ensure no other CUDA package entries are altered and run the provided grep check to confirm exactly one match remains.

cpp/src/neighbors/detail/smem_utils.cuh (1)
35-39: ⚠️ Potential issue | 🔴 Critical — The `cudaKernel_t` launch path is still not thread-safe.

`map_mutex` has automatic storage, so concurrent calls do not actually serialize access to `jit_smem_sizes`. Also, `safely_launch_kernel_with_smem_size_impl()` still coordinates through the shared `last_smem_size`/`last_kernel` statics and never uses the passed `current_smem_size`, so different JIT kernels can race and publish a stale high-water mark.

🐛 Minimal safe fallback
```diff
 template <typename KernelLauncherT>
 void safely_launch_kernel_with_smem_size(cudaKernel_t kernel,
                                          uint32_t smem_size,
                                          KernelLauncherT const& launch)
 {
-  // For JIT kernels, track by kernel pointer since all cudaKernel_t have the same type
-  static std::unordered_map<cudaKernel_t, std::pair<std::mutex, std::atomic<uint32_t>>>
-    jit_smem_sizes;
-  std::mutex map_mutex;
-
-  std::pair<std::mutex, std::atomic<uint32_t>>* current_smem_size;
-  {
-    std::lock_guard<std::mutex> map_lock{map_mutex};
-    current_smem_size = &jit_smem_sizes[kernel];
-  }
-  safely_launch_kernel_with_smem_size_impl<cudaKernel_t, KernelLauncherT>(
-    kernel, smem_size, launch, current_smem_size->first, current_smem_size->second);
+  static std::mutex mutex;
+  static std::atomic<uint32_t> current_smem_size{0};
+  safely_launch_kernel_with_smem_size_impl<cudaKernel_t, KernelLauncherT>(
+    kernel, smem_size, launch, mutex, current_smem_size);
 }
```

```bash
#!/bin/bash
set -euo pipefail
sed -n '35,121p' cpp/src/neighbors/detail/smem_utils.cuh
printf '\n--- symbols involved in the JIT cache path ---\n'
rg -n 'map_mutex|jit_smem_sizes|current_smem_size|last_smem_size|last_kernel' cpp/src/neighbors/detail/smem_utils.cuh
```

Also applies to: 91-106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/smem_utils.cuh` around lines 35 - 39, The cudaKernel_t launch path is not thread-safe because map_mutex is local and the function still uses the shared statics (last_smem_size/last_kernel) instead of the passed current_smem_size; fix by making the mutex that protects jit_smem_sizes a single shared (static or external) mutex referenced by safely_launch_kernel_with_smem_size_impl, remove/use last_smem_size/last_kernel only under that same shared lock, and replace any reads/writes to the shared high-water mark with atomic operations on the passed current_smem_size (std::atomic<uint32_t>&) so each kernel instance updates its own high-water mark; also ensure the cudaKernel_t launch path uses the same shared mutex and per-kernel key (e.g., kernel identity or hashed signature) when looking up/setting jit_smem_sizes to avoid races between different JIT kernels.
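The per-kernel variant the prompt describes (a shared static mutex guarding the map, plus a per-key atomic high-water mark) can be sketched host-side. This is a model, not the cuVS code: `kernel_handle`, `smem_state`, and `ensure_smem` are hypothetical names, and the `cudaFuncSetAttribute` call is only indicated by a comment.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for cudaKernel_t: any stable per-kernel key works.
using kernel_handle = const void*;

// Per-kernel state: a mutex serializing the (rare) attribute update and an
// atomic high-water mark for the fast path. std::map keeps node addresses
// stable across inserts, so returned references stay valid.
struct smem_state {
  std::mutex m;
  std::atomic<uint32_t> high_water{0};
};

smem_state& state_for(kernel_handle k) {
  static std::map<kernel_handle, smem_state> states;
  static std::mutex map_mutex;  // static, unlike the automatic map_mutex flagged above
  std::lock_guard<std::mutex> lock{map_mutex};
  return states[k];
}

// Returns true when this call had to raise the high-water mark, i.e. the point
// where the real code would call cudaFuncSetAttribute for this kernel.
bool ensure_smem(kernel_handle k, uint32_t smem_size) {
  auto& st = state_for(k);
  if (st.high_water.load(std::memory_order_acquire) >= smem_size) return false;
  std::lock_guard<std::mutex> lock{st.m};
  if (st.high_water.load(std::memory_order_relaxed) >= smem_size) return false;
  // ... cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)
  st.high_water.store(smem_size, std::memory_order_release);
  return true;
}
```

Because the mark only ever grows under the per-key mutex, concurrent callers for different kernels cannot publish each other's sizes.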
🧹 Nitpick comments (15)
cpp/src/neighbors/detail/cagra/jit_lto_kernels/device_intrinsics.hpp (1)
11-13: Extract `_RAFT_HOST_DEVICE` into a cuVS-local alternative to remove the internal RAFT header dependency.

The TODO comment already flags this concern: pulling `raft/core/detail/macros.hpp` into reusable helpers creates a brittle dependency on RAFT internals. While this pattern is already established in the codebase (other `jit_lto_kernels` headers and even public headers like `common.hpp` use it), it remains a fragile coupling that makes the code vulnerable to RAFT API changes.

A small local wrapper macro for `_RAFT_HOST_DEVICE` in this translation unit would eliminate this dependency without affecting the rest of the codebase. This would also resolve the TODO in the same PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/device_intrinsics.hpp` around lines 11 - 13, The code currently depends on raft/core/detail/macros.hpp for _RAFT_HOST_DEVICE; remove that internal RAFT include and add a cuVS-local alternative macro (e.g., define CUVS_HOST_DEVICE or locally define _RAFT_HOST_DEVICE if you prefer) inside device_intrinsics.hpp so functions/classes in this translation unit use the local macro instead of RAFT internals; ensure the macro expands to the same host/device attributes (and a no-op fallback when compiling non-CUDA) and replace any uses of the RAFT-provided symbol in this file with the local symbol to eliminate the fragile RAFT dependency.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_l2_impl.cuh (1)
13-14: Optional consistency: cast operands to `DISTANCE_T` before subtraction.

This keeps behavior/style aligned with the other metric implementations.
Proposed tweak
```diff
-  DISTANCE_T diff = a - b;
+  DISTANCE_T diff = static_cast<DISTANCE_T>(a) - static_cast<DISTANCE_T>(b);
   return diff * diff;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_l2_impl.cuh` around lines 13 - 14, Cast the operands to DISTANCE_T before performing the subtraction so the subtraction uses the intended type: update the L2 distance expression in dist_op_l2_impl (where DISTANCE_T diff = a - b; return diff * diff;) to convert 'a' and 'b' to DISTANCE_T prior to computing diff, ensuring the subtraction and subsequent multiplication are done in DISTANCE_T.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_cosine_impl.cuh (1)
10-14: Consider deduplicating cosine and inner-product `dist_op` bodies.

Lines 10-14 are currently identical to `dist_op_inner_product_impl.cuh`; sharing one implementation would reduce drift risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_cosine_impl.cuh` around lines 10 - 14, Duplicate dist_op template bodies for cosine and inner-product; extract the shared template implementation into a single common header (e.g., a new common dist_op_common.cuh) and include it from both dist_op_cosine_impl.cuh and dist_op_inner_product_impl.cuh; move the template definition for template<typename QUERY_T, typename DISTANCE_T> __device__ DISTANCE_T dist_op(QUERY_T a, QUERY_T b) into that common header, keep the same symbol name dist_op, and update both files to `#include` the common header so they both reuse the single implementation.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_hamming_impl.cuh (1)
14-14: Prefer explicit cast on return for template clarity.

Line 14 currently relies on implicit conversion from `int` to `DISTANCE_T`.

Proposed tweak
```diff
-  return __popc(v);
+  return static_cast<DISTANCE_T>(__popc(v));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_hamming_impl.cuh` at line 14, The return statement currently returns __popc(v) implicitly as an int; change it to explicitly cast the result to DISTANCE_T (e.g., use static_cast<DISTANCE_T>(__popc(v))) so the function in dist_op_hamming_impl.cuh returns the template type explicitly and avoids implicit int-to-DISTANCE_T conversion.

cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan_jit.cuh (1)
8-8: Move `sample_filter_data.h` to a shared JIT detail path.

Pulling IVF-flat through `src/neighbors/detail/cagra/...` couples it to CAGRA's private layout. A neutral shared JIT location would keep these algorithms decoupled and make future file moves less brittle.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan_jit.cuh` at line 8, The include in ivf_flat_interleaved_scan_jit.cuh currently pulls sample_filter_data.h from the CAGRA-specific path; move sample_filter_data.h into a neutral shared JIT detail location (e.g., a neighbors/detail/jit or neighbors/detail/shared_jit directory) and update the include in ivf_flat_interleaved_scan_jit.cuh to reference that new path instead of ../detail/cagra/jit_lto_kernels/sample_filter_data.h; also update the build rules (include paths) so the compiler finds the new location and keep the header's include guards/namespaces unchanged.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_vpq_impl.cuh (1)
119-122: Tighten the VPQ specialization guard.

This entrypoint accepts any `PQ_BITS > 0 && PQ_LEN > 0`, but the implementation above already assumes the 8-bit half-codebook layout, and `PQ_LEN == 1` makes the codebook permutation divide by zero. Please reject unsupported VPQ shapes here so bad matrix entries fail at the boundary instead of deeper in template instantiation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_vpq_impl.cuh` around lines 119 - 122, Tighten the VPQ specialization guard in the static_assert inside setup_workspace_vpq_impl.cuh: replace the current condition (PQ_BITS > 0 && PQ_LEN > 0 && std::is_same_v<CodebookT, half> && std::is_same_v<QueryT, half>) with a stricter check that enforces the 8-bit half-codebook layout and forbids PQ_LEN==1 (e.g. PQ_BITS == 8 && PQ_LEN > 1 && std::is_same_v<CodebookT, half> && std::is_same_v<QueryT, half>), and update the static_assert message to reflect these exact requirements so unsupported VPQ shapes are rejected at the boundary.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_helpers.cuh (1)
44-45: Use unsigned types for the ballot prefix mask, or remove the signed left-shift concern.

The code `(1 << threadIdx.x)` operates on a signed `int`. While C++20 (the target standard for this codebase) well-defines `1 << 31` as INT_MIN, using unsigned types is a clearer, more portable idiom for bit manipulation. However, the suggested fix `__lanemask_lt()` is incompatible: it requires sm_90+ (Hopper) but this code targets sm_70-sm_75. If refactoring for clarity, use `(1U << threadIdx.x) - 1` instead, or verify the fix against the actual target architectures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_helpers.cuh` around lines 44 - 45, The bitmask calculation for candidate_id uses a signed left shift which can be ambiguous; change the mask expression in the __popc call to use an unsigned shift so the prefix mask is unsigned—replace ((1 << threadIdx.x) - 1) with an unsigned variant such as ((1U << threadIdx.x) - 1) (or otherwise ensure the left-shift operand is unsigned) when computing candidate_id from ballot_mask produced by __ballot_sync to avoid signed-shift pitfalls on target SM70–SM75.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh (2)
26-26: Typo: `num_distilation` should be `num_distillation`.

This parameter name appears to be misspelled. Consider renaming to `num_distillation` for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh` at line 26, The parameter name num_distilation is misspelled; rename it to num_distillation everywhere it appears (function signatures, declarations, calls, and any comments) in the search_multi_jit.cuh kernels and related functions so identifiers match; update the parameter in the affected function/method prototypes and all call sites that pass num_distilation (e.g., in the search_multi_jit kernel/function signatures and usages) to use num_distillation to avoid compilation/semantic mismatches.
204-204: Inconsistent invalid index check.

Line 204 checks `result_indices_ptr[index] != ~index_msb_1_mask`, but elsewhere in the codebase (e.g., lines 86, 141-144, 174) `utils::get_max_value<IndexT>()` is used for invalid indices. Consider using the same pattern for consistency:

```diff
- if (result_indices_ptr[index] != ~index_msb_1_mask) {
+ if (result_indices_ptr[index] != utils::get_max_value<IndexT>()) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh` at line 204, The check in the conditional uses a bitmask (~index_msb_1_mask) instead of the project's canonical invalid-index sentinel; replace the comparison in the if statement so it tests result_indices_ptr[index] against utils::get_max_value<IndexT>() (the same sentinel used at other sites) and ensure the surrounding code that assigns invalid indices uses the same utils::get_max_value<IndexT>() value so the check is consistent with functions/variables like result_indices_ptr, index_msb_1_mask, and the utils::get_max_value<IndexT>() usage elsewhere.

cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh (4)
570-573: `const_cast` on bitset data may be unsafe.

Using `const_cast` to remove constness from `bitset_view.data()` is risky. If the underlying bitset data is truly const (e.g., stored in read-only memory), this could lead to undefined behavior if any code path attempts to write through this pointer.

Since the kernel only reads from `bitset_ptr`, consider changing the member type to `const uint32_t*` instead:

Proposed fix in member declaration (around line 461)
```diff
- uint32_t* bitset_ptr;  // Bitset data pointer (nullptr for none_filter)
+ const uint32_t* bitset_ptr;  // Bitset data pointer (nullptr for none_filter)
```

And update the extraction:

```diff
- bitset_ptr = const_cast<uint32_t*>(bitset_view.data());
+ bitset_ptr = bitset_view.data();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh` around lines 570 - 573, The code unsafely uses const_cast on bitset_view.data() to assign bitset_ptr; change the member declaration of bitset_ptr to be a const uint32_t* (instead of uint32_t*) and remove the const_cast so you assign bitset_ptr = bitset_view.data(); also update any callers/usages (e.g., where bitset_ptr is passed into the kernel or stored in the launch params) to accept a const uint32_t* and ensure no code attempts to mutate through bitset_ptr; keep other casts for sizes (bitset_len, original_nbits) unchanged.
874-908: Unused lambda parameter `kernel`.

The `kernel_launcher` lambda at line 874 declares an unused parameter `kernel` in its signature. This parameter appears to be vestigial.

Proposed fix
```diff
- auto kernel_launcher = [&](auto const& kernel) -> void {
+ auto kernel_launcher = [&](auto const& /* kernel */) -> void {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh` around lines 874 - 908, The lambda kernel_launcher currently declares an unused parameter named kernel; remove this unused parameter from the lambda signature (i.e., change kernel_launcher to capture-only or take no parameters) and update any call sites accordingly so that the body still calls launcher->dispatch<search_single_cta_kernel_func_t<DataT, IndexT, DistanceT, SourceIndexT>>(...) with the same arguments; ensure the symbol kernel_launcher and the template search_single_cta_kernel_func_t remain intact and that no captured variables are accidentally removed when converting the lambda to a no-arg form.
955-977: Detached thread may cause issues during process shutdown.

The detached thread for runner expiration could cause problems:

- If the process exits while the thread is sleeping, it may not terminate cleanly
- Static destruction order issues with `persistent` global state

Consider using a joinable thread with proper shutdown signaling, or a thread pool with managed lifetime.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh` around lines 955 - 977, Replace the detached expiration thread with a joinable, managed thread and explicit shutdown signaling: create and store a std::thread member (instead of detaching) alongside an atomic<bool> or condition_variable stop flag in the persistent structure, have the thread (the lambda that creates runner_outer, uses lifetime, ready, runner_weak and persistent.lock) wait on the condition_variable or check the stop flag rather than unbounded sleep_for, and on persistent teardown (or when persistent.runner is reset) set the stop flag/notify and join the thread to ensure clean shutdown and avoid static-destruction races.
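The joinable-thread shape the comment suggests can be sketched as a small watcher class. This is a hedged model, not the cuVS runner code: the expiry work itself is elided (a tick counter stands in), and `expiry_watcher` is a hypothetical name.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// A watcher that sleeps in bounded intervals but can be woken and joined
// deterministically at shutdown, instead of a detached thread that may outlive
// static destruction.
class expiry_watcher {
 public:
  explicit expiry_watcher(std::chrono::milliseconds period)
    : worker_{[this, period] {
        std::unique_lock<std::mutex> lk{m_};
        while (!stop_) {
          // wait_for releases the lock while sleeping; notify_one wakes it early
          cv_.wait_for(lk, period);
          ++ticks_;  // stands in for the real expiry check
        }
      }} {}

  ~expiry_watcher() {
    {
      std::lock_guard<std::mutex> lk{m_};
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();  // clean shutdown: no thread left running at exit
  }

  int ticks() const {
    std::lock_guard<std::mutex> lk{m_};
    return ticks_;
  }

 private:
  mutable std::mutex m_;
  std::condition_variable cv_;
  bool stop_{false};
  int ticks_{0};
  std::thread worker_;  // declared last so the other members exist before it starts
};
```

The destructor's notify-then-join sequence is what the detached version cannot provide: teardown blocks until the worker has actually exited.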
813-813: Unused variable `dev_desc_persistent`.

The variable `dev_desc_persistent` is assigned but never used; the runner obtains the descriptor via `std::cref(dataset_desc)` rather than from `dataset_desc.dev_ptr(stream)`.

Proposed fix
```diff
- const auto* dev_desc_persistent = dataset_desc.dev_ptr(stream);
  get_runner_jit<runner_type>(std::cref(dataset_desc),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh` at line 813, Remove the unused local variable dev_desc_persistent (the assignment const auto* dev_desc_persistent = dataset_desc.dev_ptr(stream);) from search_single_cta_kernel_launcher_jit.cuh; rely on the existing std::cref(dataset_desc) passed into the runner instead, and ensure no other code paths expect dev_desc_persistent to exist—if any do, replace those uses with dataset_desc.dev_ptr(stream) or the dataset_desc reference as appropriate.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_jit.cuh (1)
594-594: Hardcoded constant should use `kMaxJobsNum`.

Line 594 hardcodes `8192` but should use `kMaxJobsNum`, which is already defined and used elsewhere in the codebase, for consistency.

Proposed fix
```diff
- constexpr uint32_t kMaxJobsNum = 8192;
- job_ix = raft::shfl(job_ix, 0);
- if (threadIdx.x < job_desc_type::kBlobSize && job_ix < kMaxJobsNum) {
+ job_ix = raft::shfl(job_ix, 0);
+ if (threadIdx.x < job_desc_type::kBlobSize && job_ix < single_cta_search::kMaxJobsNum) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_jit.cuh` at line 594, Replace the hardcoded literal 8192 with the existing constant kMaxJobsNum in the search_single_cta_jit.cuh usage so the code uses the defined kMaxJobsNum symbol for consistency; find the occurrence where 8192 is used (near the constexpr uint32_t kMaxJobsNum = 8192; declaration) and change that literal to reference kMaxJobsNum so all job-size logic relies on the single shared constant.

cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp (1)
37-153: Consider using `dispatch_cagra_team_dim` to reduce code duplication.

The `add_setup_workspace_device_function` method manually enumerates all combinations of `(team_size, dataset_block_dim)` with nested if-else chains. The same pattern is repeated for `add_compute_distance_device_function`. Consider refactoring to use the existing `dispatch_cagra_team_dim` helper with an additional pq_len dispatch to reduce duplication.

Example refactor approach
```cpp
void add_setup_workspace_device_function(...) {
  auto add_for_team_dim = [&]<uint32_t TeamSz, uint32_t Dim>() {
    if constexpr (std::is_same_v<CodebookTag, tag_codebook_none>) {
      add.template operator()<TeamSz, Dim, 0u, 0u>();
    } else {
      if (pq_len == 2) {
        add.template operator()<TeamSz, Dim, 8u, 2u>();
      } else {
        add.template operator()<TeamSz, Dim, 8u, 4u>();
      }
    }
  };
  dispatch_cagra_team_dim(team_size, dataset_block_dim, add_for_team_dim);
}
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp` around lines 37 - 153, The method add_setup_workspace_device_function currently duplicates nested if/else dispatch over (team_size, dataset_block_dim, pq_len); replace that manual enumeration by using the existing dispatch_cagra_team_dim helper: create a templated lambda (e.g., add_for_team_dim) accepting template parameters <uint32_t TeamSz, uint32_t Dim> that inside uses if constexpr on CodebookTag to call add.template operator()<TeamSz,Dim,0u,0u>() for tag_codebook_none or selects 8u and pq_len (2u or 4u) for VPQ, then call dispatch_cagra_team_dim(team_size, dataset_block_dim, add_for_team_dim); apply the same refactor pattern to add_compute_distance_device_function to remove duplicated team/dim branches.
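The runtime-to-compile-time dispatch idea behind `dispatch_cagra_team_dim` can be demonstrated with a self-contained host sketch. The supported pairs and the function name `dispatch_team_dim` are illustrative, not the project's actual matrix.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>

// Map runtime (team_size, dataset_block_dim) pairs onto compile-time template
// arguments by invoking a templated lambda, instead of hand-writing every
// if/else branch at each call site. Unknown pairs fail fast.
template <typename F>
void dispatch_team_dim(uint32_t team_size, uint32_t block_dim, F&& f) {
  if (team_size == 8 && block_dim == 128) {
    std::forward<F>(f).template operator()<8u, 128u>();
  } else if (team_size == 16 && block_dim == 256) {
    std::forward<F>(f).template operator()<16u, 256u>();
  } else if (team_size == 32 && block_dim == 512) {
    std::forward<F>(f).template operator()<32u, 512u>();
  } else {
    throw std::invalid_argument("unsupported (team_size, dataset_block_dim)");
  }
}
```

A caller passes a C++20 templated lambda; the branch table lives in one place, and every algorithm-specific action (adding a device function, launching a kernel) reuses it.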
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/cuvs/neighbors/ivf_flat.hpp`:
- Line 174: Add a Doxygen comment block describing the newly public method
set_metric(cuvs::distance::DistanceType) above its declaration in ivf_flat.hpp
(explain purpose, parameters, behavior, thread-safety and default/allowed metric
values, and any effects on indexing/search), and update the repository
API-change docs (e.g., CHANGELOG or docs/api) to note that set_metric was made
public and describe compatibility/usage guidance; ensure the doc text references
the cuvs::distance::DistanceType type and the ivf_flat class for
discoverability.
In `@cpp/src/detail/jit_lto/AlgorithmLauncher.cpp`:
- Around line 68-71: The static launchers map returned by get_cached_launchers()
is not thread-safe and can be corrupted under concurrent JIT callers; add a
process-wide static std::mutex (e.g., get_cached_launchers_mutex()) in the same
translation unit and require locking it around all accesses
(insert/find/erase/rehash) to the map returned by get_cached_launchers(); update
all call sites that touch AlgorithmLauncher cache to use
std::lock_guard<std::mutex> lock(get_cached_launchers_mutex()) before operating
on launchers to prevent races and UB.
In
`@cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_normalization_standard_noop_impl.cuh`:
- Line 8: The include line references a misspelled header
"../compute_distance_standard-impl.cuh" which breaks the build; update the
include in apply_normalization_standard_noop_impl.cuh to the correct header name
(e.g., "../compute_distance_standard_impl.cuh") so it matches the actual file in
the repo and ensure the included symbol/implementation used by this file
(compute_distance_standard_impl.cuh) is present and exported.
In
`@cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp`:
- Around line 69-129: The code incorrectly defaults all non-Hamming metrics to
QueryTag = query_type_tag_standard_t<..., DistanceType::L2Expanded>; update the
logic around dataset_desc.metric where QueryTag and planner are instantiated
(the blocks that create CagraSingleCtaSearchPlanner, call planner.add_* and
return planner.get_launcher()) to explicitly handle each supported standard
metric by selecting QueryTag = query_type_tag_standard_t<..., <ActualMetric>>
per dataset_desc.metric (e.g., L1, L2, Cosine, etc.), and for
unknown/unsupported metrics fail fast (throw or assert) instead of falling back
to L2Expanded; apply the same change to the other two similar regions noted
(around the other occurrences referenced in the review).
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh`:
- Around line 188-205: The kernel currently checks the min-iteration cutoff with
(iter >= min_iteration) after deciding no parent is available, but since iter is
incremented at the end of the loop this allows one extra no-op iteration; update
the condition to use (iter + 1 >= min_iteration) so the break honors the
minimum-iteration boundary immediately when parent_indices_buffer[0] ==
invalid_index. Change the condition in the loop that reads
(parent_indices_buffer[0] == invalid_index) && (iter >= min_iteration) to use
iter + 1 and leave other logic (iter increment at loop end) unchanged.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh`:
- Around line 47-52: The early return based on global_team_index causes thread
divergence at the __syncthreads() barrier; replace the early return with an
active boolean (e.g., bool active = global_team_index < num_pickup) so all
threads reach the barrier, only invoking
setup_workspace_base<DataT,IndexT,DistanceT>(...) and using smem_desc when
active, and ensure any inactive threads still participate in the __syncthreads()
(for example: if (!active) { __syncthreads(); return; } or move the barrier
after computing active and conditionally skipping work), referencing
global_team_index, num_pickup, setup_workspace_base, smem_desc and __syncthreads
in the change.
In
`@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_device_helpers.cuh`:
- Around line 647-650: The compaction check uses an undefined left shift:
replace the expression "who_has_invalid << (warp_size - lane_id)" with a safe
mask-based test that checks for any set bits from lanes with smaller IDs;
specifically compute a mask for lanes < lane_id (e.g. uint32_t mask = (lane_id
== 0 ? 0u : (1u << lane_id) - 1u)) and change the condition to "if
(who_has_invalid & mask)". Update usage of I_found_invalid, who_has_invalid
(from raft::ballot), lane_id and warp_size accordingly to use unsigned types so
shifts are never performed with 32, and ensure the mask computation handles
lane_id == 0 safely.
- Around line 123-125: The prefix mask calculation is using block-wide
threadIdx.x which assumes single-warp execution; make it lane-local by computing
the lane index (e.g., const unsigned lane = threadIdx.x & 31u;) and replace ((1
<< threadIdx.x) - 1) with ((1u << lane) - 1) when computing the prefix: use the
warp-local ballot_mask from __ballot_sync and __popc(ballot_mask & ((1u << lane)
- 1)) + num_new_parents so the code using ballot_mask, new_parent,
__ballot_sync, and __popc is correct even if callers aren't single-warp.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_jit.cuh`:
- Around line 300-321: The write to *filter_flag inside the loop is a data race;
replace the plain store "*filter_flag = 1;" with an atomic update (e.g.
atomicOr(filter_flag, 1) or atomicExch(filter_flag, 1)) so multiple threads can
set the flag safely; ensure filter_flag is an int/unsigned int (or cast to the
correct atomic pointer type) and keep the initial clear at threadIdx.x == 0,
leaving the rest of the logic around sample_filter, result_indices_buffer,
result_distances_buffer and the surrounding __syncthreads() unchanged.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/set_value_batch.cuh`:
- Around line 27-37: In set_value_batch, avoid launching the kernel when
count==0 or batch_size==0 by returning early from the function; then after
calling set_value_batch_kernel<<<...>>>(...) check the CUDA launch result with
cudaGetLastError() (and/or cudaDeviceSynchronize() if desired) and surface
failures immediately (e.g., throw or propagate a descriptive runtime_error
including cudaGetErrorString(err)). This change should be applied inside
set_value_batch around the grid_size computation and right after the kernel
launch to prevent invalid <<<0,...>>> launches and to report any launch
failures.
In `@cpp/src/neighbors/detail/cagra/search_multi_cta.cuh`:
- Around line 100-101: The workspace is declared as
rmm::device_uvector<uint32_t> (topk_workspace) but _cuann_find_topk_bufferSize()
returns size in bytes, so resize() is currently treating the byte count as a
count of uint32_t elements and over-allocating; change the topk_workspace
declaration/type to rmm::device_uvector<uint8_t> and call
topk_workspace.resize(buffer_size_bytes) using the value returned by
_cuann_find_topk_bufferSize(), then pass topk_workspace.data() (or
topk_workspace.data().get()) as the void* workspace to the API to ensure exact
byte-sized allocation.
In `@cpp/src/neighbors/detail/cagra/search_multi_kernel_launcher_jit.cuh`:
- Around line 115-148: The dispatch currently hardcodes
cuvs::neighbors::filtering::none_sample_filter{}; instead thread the caller's
sample_filter through the multi-kernel JIT path: update the
make_cagra_multi_kernel_jit_launcher registration to register
get_sample_filter_name<SAMPLE_FILTER_T>() for the multi-kernel path, ensure
compute_distance_to_child_nodes_kernel_func_t<DataT,IndexT,DistanceT,SourceIndexT>
(and any kernel-template instantiation) accepts a SAMPLE_FILTER_T parameter, and
replace the hardcoded none_sample_filter{} argument in the
launcher->dispatch(...) call with the forwarded sample_filter object so the
kernel receives and applies the caller's filter.
In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh`:
- Around line 230-231: Add a proper error check around the
cudaStreamCreateWithFlags call: capture its return into a cudaError_t (e.g., err
= cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking)), test against
cudaSuccess, and on failure log/propagate the error using
cudaGetErrorString(err) and perform appropriate cleanup/early return or throw
(so downstream code using stream is not executed). Update the call site
referencing cudaStreamCreateWithFlags and the local stream variable to use this
checked pattern.
In `@cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp`:
- Around line 88-111: get_sample_filter_name currently falls back to returning
"filter_none_source_index_ui" for unrecognized SAMPLE_FILTER_T, silently
disabling filtering; instead fail fast at compile time for unknown filters.
Modify get_sample_filter_name (templated on SAMPLE_FILTER_T / DecayedFilter) to
replace the final default return with a static_assert using a dependent-false
idiom (e.g. template<typename> inline constexpr bool always_false = false;) that
triggers a clear message including DecayedFilter (and suggest mentioning
CagraSampleFilterWithQueryIdOffset and SourceIndexT) so new/unknown wrappers
don't silently map to none; keep the existing cases for none_sample_filter and
bitset_filter checks using is_bitset_filter/bitset_filter.
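A compilable sketch of the dependent-false idiom, with stand-in filter types rather than the real cuVS ones (the type names and returned strings mimic the ones mentioned above but are assumptions of this sketch):

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <type_traits>

// Stand-ins for the real cuVS filter types.
struct none_sample_filter {};
template <typename BitsetT, typename IndexT> struct bitset_filter {};

// Dependent-false helper: only fails when actually instantiated.
template <typename> inline constexpr bool always_false = false;

template <typename T> struct is_bitset_filter : std::false_type {};
template <typename B, typename I>
struct is_bitset_filter<bitset_filter<B, I>> : std::true_type {};

template <typename SampleFilterT>
constexpr std::string_view get_sample_filter_name()
{
  using DecayedFilter = std::decay_t<SampleFilterT>;
  if constexpr (std::is_same_v<DecayedFilter, none_sample_filter>) {
    return "filter_none_source_index_ui";
  } else if constexpr (is_bitset_filter<DecayedFilter>::value) {
    return "filter_bitset_source_index_ui";
  } else {
    // Any new filter wrapper must be registered here explicitly;
    // unknown types now fail at compile time instead of mapping to "none".
    static_assert(always_false<DecayedFilter>, "unknown sample filter type");
  }
}
```

With this shape, adding a new wrapper without updating the mapping produces a clear compiler error at the call site rather than silently disabled filtering.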
In `@cpp/src/neighbors/detail/smem_utils.cuh`:
- Around line 7-10: The header is missing a direct include for
std::unordered_map, causing a dependency on transitive includes; add the
explicit include directive for <unordered_map> at the top of smem_utils.cuh
(near the other standard includes so it is available where std::unordered_map is
used), ensuring the code that references std::unordered_map compiles regardless
of include order.
In `@cpp/src/neighbors/ivf_flat_index.cpp`:
- Around line 63-67: set_metric currently mutates metric_ without updating
center_norms_, which can leave center_norms_ as nullopt and cause unsafe
dereferences in search paths (e.g., ivf_flat_search.cuh cases L2Expanded and
CosineExpanded that call index.center_norms()->data_handle()). Fix by making
set_metric resilient: when switching to a metric that requires center norms,
allocate or recompute center_norms_ (or set a safe empty buffer) and when
switching to a metric that does not need norms, clear center_norms_;
additionally, ensure callers that dereference index.center_norms() (search code
paths L2Expanded, CosineExpanded) either check has_value() or rely on the
invariant established by set_metric so center_norms_ is always present when
those metrics are selected; update the index<T,IdxT>::set_metric implementation
and add an assertion or optional-resync helper used by set_metric to maintain
this invariant.
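The invariant can be modeled with `std::optional`; `toy_index` below is a simplified stand-in for the real `ivf_flat` index (its fields and the norm recomputation are placeholders), showing only the shape of a `set_metric` that keeps `center_norms_` in sync:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

enum class DistanceType { L2Expanded, InnerProduct, CosineExpanded };

inline bool metric_needs_norms(DistanceType m)
{
  return m == DistanceType::L2Expanded || m == DistanceType::CosineExpanded;
}

struct toy_index {
  DistanceType metric_ = DistanceType::InnerProduct;
  std::optional<std::vector<float>> center_norms_;  // present iff metric_ needs it
  std::size_t n_lists_ = 0;

  void set_metric(DistanceType m)
  {
    metric_ = m;
    if (metric_needs_norms(m)) {
      // (Re)allocate; real code would recompute the norms from the centers.
      if (!center_norms_) { center_norms_.emplace(n_lists_, 0.0f); }
    } else {
      center_norms_.reset();  // norms unused: drop them so stale data can't leak
    }
  }
};
```

Search paths that dereference `center_norms_` can then rely on the invariant instead of checking `has_value()` at every site.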
---
Outside diff comments:
In `@cpp/src/neighbors/detail/cagra/search_multi_kernel.cuh`:
- Around line 604-666: The function ends after launching asynchronous kernels
(apply_filter_jit, remove_parent_bit, batched_memcpy) but lacks a final CUDA
launch error check; insert a cudaPeekAtLastError() (or your project's equivalent
CUDA launch-check macro) right before the function returns (after the
batched_memcpy/topk_distances_ptr handling) to catch invalid kernel launch
configurations and surface errors immediately; reference the existing symbols
apply_filter_jit, remove_parent_bit, batched_memcpy,
result_distances_ptr/result_indices_ptr/topk_distances_ptr when placing this
check.
---
Duplicate comments:
In `@cpp/src/neighbors/detail/smem_utils.cuh`:
- Around line 35-39: The cudaKernel_t launch path is not thread-safe because
map_mutex is local and the function still uses the shared statics
(last_smem_size/last_kernel) instead of the passed current_smem_size; fix by
making the mutex that protects jit_smem_sizes a single shared (static or
external) mutex referenced by safely_launch_kernel_with_smem_size_impl,
remove/use last_smem_size/last_kernel only under that same shared lock, and
replace any reads/writes to the shared high-water mark with atomic operations on
the passed current_smem_size (std::atomic<uint32_t>&) so each kernel instance
updates its own high-water mark; also ensure the cudaKernel_t launch path uses
the same shared mutex and per-kernel key (e.g., kernel identity or hashed
signature) when looking up/setting jit_smem_sizes to avoid races between
different JIT kernels.
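A host-only sketch of the suggested locking scheme, assuming a string key per kernel (the real code would key on kernel identity or a hashed signature, and would then raise the CUDA shared-memory attribute when a new maximum is recorded):

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// One shared mutex guards the per-kernel map of shared-memory high-water
// marks, so every launch path (JIT or not) synchronizes on the same lock.
inline std::mutex& jit_smem_mutex()
{
  static std::mutex m;
  return m;
}

inline std::unordered_map<std::string, std::uint32_t>& jit_smem_sizes()
{
  static std::unordered_map<std::string, std::uint32_t> sizes;
  return sizes;
}

// Returns true when the requested size is a new maximum for this kernel,
// i.e. the kernel attribute would need to be raised before launching.
inline bool update_high_water_mark(const std::string& kernel_key, std::uint32_t smem_size)
{
  std::lock_guard<std::mutex> lock(jit_smem_mutex());
  auto& current = jit_smem_sizes()[kernel_key];  // value-initialized to 0 on first use
  if (smem_size > current) {
    current = smem_size;
    return true;
  }
  return false;
}
```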
In `@dependencies.yaml`:
- Line 378: Remove the duplicate package entry "cuda-nvrtc-dev" from the CUDA
package list in dependencies.yaml by deleting the redundant line (the duplicate
at Line 378) so only a single "- cuda-nvrtc-dev" entry remains; ensure no other
CUDA package entries are altered and run the provided grep check to confirm
exactly one match remains.
---
Nitpick comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp`:
- Around line 37-153: The method add_setup_workspace_device_function currently
duplicates nested if/else dispatch over (team_size, dataset_block_dim, pq_len);
replace that manual enumeration by using the existing dispatch_cagra_team_dim
helper: create a templated lambda (e.g., add_for_team_dim) accepting template
parameters <uint32_t TeamSz, uint32_t Dim> that inside uses if constexpr on
CodebookTag to call add.template operator()<TeamSz,Dim,0u,0u>() for
tag_codebook_none or selects 8u and pq_len (2u or 4u) for VPQ, then call
dispatch_cagra_team_dim(team_size, dataset_block_dim, add_for_team_dim); apply
the same refactor pattern to add_compute_distance_device_function to remove
duplicated team/dim branches.
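The runtime-to-compile-time dispatch pattern this refactor relies on looks roughly like the following; the supported pairs and names are illustrative stand-ins, not the actual `dispatch_cagra_team_dim` table:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Maps a runtime (team_size, dataset_block_dim) pair onto compile-time
// template parameters by invoking a generic functor with explicit NTTPs.
template <typename F>
void dispatch_team_dim(std::uint32_t team_size, std::uint32_t block_dim, F&& f)
{
  if (team_size == 8 && block_dim == 128) {
    f.template operator()<8u, 128u>();
  } else if (team_size == 16 && block_dim == 256) {
    f.template operator()<16u, 256u>();
  } else if (team_size == 32 && block_dim == 512) {
    f.template operator()<32u, 512u>();
  } else {
    throw std::invalid_argument("unsupported (team_size, dataset_block_dim) pair");
  }
}

// Example use: the lambda body stands in for
// add.template operator()<TeamSz, Dim, PqBits, PqLen>() in the planner.
inline std::uint32_t selected_product(std::uint32_t team_size, std::uint32_t block_dim)
{
  std::uint32_t result = 0;
  dispatch_team_dim(team_size, block_dim, [&]<std::uint32_t TeamSz, std::uint32_t Dim>() {
    result = TeamSz * Dim;
  });
  return result;
}
```

The planner-side `if constexpr` on `CodebookTag` then lives inside the lambda, so each (team, dim) pair is enumerated exactly once.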
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/device_intrinsics.hpp`:
- Around line 11-13: The code currently depends on raft/core/detail/macros.hpp
for _RAFT_HOST_DEVICE; remove that internal RAFT include and add a cuVS-local
alternative macro (e.g., define CUVS_HOST_DEVICE or locally define
_RAFT_HOST_DEVICE if you prefer) inside device_intrinsics.hpp so
functions/classes in this translation unit use the local macro instead of RAFT
internals; ensure the macro expands to the same host/device attributes (and a
no-op fallback when compiling non-CUDA) and replace any uses of the
RAFT-provided symbol in this file with the local symbol to eliminate the fragile
RAFT dependency.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_cosine_impl.cuh`:
- Around line 10-14: Duplicate dist_op template bodies for cosine and
inner-product; extract the shared template implementation into a single common
header (e.g., a new common dist_op_common.cuh) and include it from both
dist_op_cosine_impl.cuh and dist_op_inner_product_impl.cuh; move the template
definition for template<typename QUERY_T, typename DISTANCE_T> __device__
DISTANCE_T dist_op(QUERY_T a, QUERY_T b) into that common header, keep the same
symbol name dist_op, and update both files to `#include` the common header so they
both reuse the single implementation.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_hamming_impl.cuh`:
- Line 14: The return statement currently returns __popc(v) implicitly as an
int; change it to explicitly cast the result to DISTANCE_T (e.g., use
static_cast<DISTANCE_T>(__popc(v))) so the function in dist_op_hamming_impl.cuh
returns the template type explicitly and avoids implicit int-to-DISTANCE_T
conversion.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_l2_impl.cuh`:
- Around line 13-14: Cast the operands to DISTANCE_T before performing the
subtraction so the subtraction uses the intended type: update the L2 distance
expression in dist_op_l2_impl (where DISTANCE_T diff = a - b; return diff *
diff;) to convert 'a' and 'b' to DISTANCE_T prior to computing diff, ensuring
the subtraction and subsequent multiplication are done in DISTANCE_T.
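A compilable illustration of why the cast matters: with 32-bit unsigned operands the subtraction wraps modulo 2^32 before the conversion to the distance type, while casting first keeps the arithmetic in `DISTANCE_T`. (For 8-bit operands, integer promotion to `int` happens to mask the bug, which is why it is easy to miss.)

```cpp
#include <cassert>
#include <cstdint>

// Subtraction happens in DATA_T's (promoted) type, then converts: for
// uint32_t inputs, 1u - 3u wraps to 4294967294 before becoming a float.
template <typename DISTANCE_T, typename DATA_T>
DISTANCE_T l2_term_unsafe(DATA_T a, DATA_T b)
{
  DISTANCE_T diff = a - b;
  return diff * diff;
}

// Both operands are converted to DISTANCE_T first, so the difference is
// computed as a signed/floating value: (1.0f - 3.0f)^2 == 4.0f.
template <typename DISTANCE_T, typename DATA_T>
DISTANCE_T l2_term_safe(DATA_T a, DATA_T b)
{
  const DISTANCE_T diff = static_cast<DISTANCE_T>(a) - static_cast<DISTANCE_T>(b);
  return diff * diff;
}
```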
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_helpers.cuh`:
- Around line 44-45: The bitmask calculation for candidate_id uses a signed left
shift which can be ambiguous; change the mask expression in the __popc call to
use an unsigned shift so the prefix mask is unsigned—replace ((1 << threadIdx.x)
- 1) with an unsigned variant such as ((1U << threadIdx.x) - 1) (or otherwise
ensure the left-shift operand is unsigned) when computing candidate_id from
ballot_mask produced by __ballot_sync to avoid signed-shift pitfalls on target
SM70–SM75.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh`:
- Line 26: The parameter name num_distilation is misspelled; rename it to
num_distillation everywhere it appears (function signatures, declarations,
calls, and any comments) in the search_multi_jit.cuh kernels and related
functions so identifiers match; update the parameter in the affected
function/method prototypes and all call sites that pass num_distilation (e.g.,
in the search_multi_jit kernel/function signatures and usages) to use
num_distillation to avoid compilation/semantic mismatches.
- Line 204: The check in the conditional uses a bitmask (~index_msb_1_mask)
instead of the project's canonical invalid-index sentinel; replace the
comparison in the if statement so it tests result_indices_ptr[index] against
utils::get_max_value<IndexT>() (the same sentinel used at other sites) and
ensure the surrounding code that assigns invalid indices uses the same
utils::get_max_value<IndexT>() value so the check is consistent with
functions/variables like result_indices_ptr, index_msb_1_mask, and the
utils::get_max_value<IndexT>() usage elsewhere.
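A minimal sketch of the sentinel convention, using `std::numeric_limits` max as a stand-in for `utils::get_max_value` (the consumer function is hypothetical): producers write exactly this value for invalid slots, and consumers test exactly this value before using an entry as a subscript.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

template <typename IndexT>
constexpr IndexT get_max_value()
{
  return std::numeric_limits<IndexT>::max();
}

// Counts entries that are safe to dereference; sentinel slots are skipped
// rather than fed into source_indices_ptr[...] as an out-of-range subscript.
template <typename IndexT>
std::size_t count_valid(const std::vector<IndexT>& result_indices)
{
  std::size_t n = 0;
  for (const auto idx : result_indices) {
    if (idx == get_max_value<IndexT>()) { continue; }  // sentinel: never dereference
    ++n;
  }
  return n;
}
```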
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_jit.cuh`:
- Line 594: Replace the hardcoded literal 8192 with the existing constant
kMaxJobsNum in the search_single_cta_jit.cuh usage so the code uses the defined
kMaxJobsNum symbol for consistency; find the occurrence where 8192 is used (near
the constexpr uint32_t kMaxJobsNum = 8192; declaration) and change that literal
to reference kMaxJobsNum so all job-size logic relies on the single shared
constant.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_vpq_impl.cuh`:
- Around line 119-122: Tighten the VPQ specialization guard in the static_assert
inside setup_workspace_vpq_impl.cuh: replace the current condition (PQ_BITS > 0
&& PQ_LEN > 0 && std::is_same_v<CodebookT, half> && std::is_same_v<QueryT,
half>) with a stricter check that enforces the 8-bit half-codebook layout and
forbids PQ_LEN==1 (e.g. PQ_BITS == 8 && PQ_LEN > 1 && std::is_same_v<CodebookT,
half> && std::is_same_v<QueryT, half>), and update the static_assert message to
reflect these exact requirements so unsupported VPQ shapes are rejected at the
boundary.
In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh`:
- Around line 570-573: The code unsafely uses const_cast on bitset_view.data()
to assign bitset_ptr; change the member declaration of bitset_ptr to be a const
uint32_t* (instead of uint32_t*) and remove the const_cast so you assign
bitset_ptr = bitset_view.data(); also update any callers/usages (e.g., where
bitset_ptr is passed into the kernel or stored in the launch params) to accept a
const uint32_t* and ensure no code attempts to mutate through bitset_ptr; keep
other casts for sizes (bitset_len, original_nbits) unchanged.
- Around line 874-908: The lambda kernel_launcher currently declares an unused
parameter named kernel; remove this unused parameter from the lambda signature
(i.e., change kernel_launcher to capture-only or take no parameters) and update
any call sites accordingly so that the body still calls
launcher->dispatch<search_single_cta_kernel_func_t<DataT, IndexT, DistanceT,
SourceIndexT>>(...) with the same arguments; ensure the symbol kernel_launcher
and the template search_single_cta_kernel_func_t remain intact and that no
captured variables are accidentally removed when converting the lambda to a
no-arg form.
- Around line 955-977: Replace the detached expiration thread with a joinable,
managed thread and explicit shutdown signaling: create and store a std::thread
member (instead of detaching) alongside an atomic<bool> or condition_variable
stop flag in the persistent structure, have the thread (the lambda that creates
runner_outer, uses lifetime, ready, runner_weak and persistent.lock) wait on the
condition_variable or check the stop flag rather than unbounded sleep_for, and
on persistent teardown (or when persistent.runner is reset) set the stop
flag/notify and join the thread to ensure clean shutdown and avoid
static-destruction races.
- Line 813: Remove the unused local variable dev_desc_persistent (the assignment
const auto* dev_desc_persistent = dataset_desc.dev_ptr(stream);) from
search_single_cta_kernel_launcher_jit.cuh; rely on the existing
std::cref(dataset_desc) passed into the runner instead, and ensure no other code
paths expect dev_desc_persistent to exist—if any do, replace those uses with
dataset_desc.dev_ptr(stream) or the dataset_desc reference as appropriate.
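The joinable-worker shape suggested above for the detached expiration thread can be sketched as follows; `expiration_worker` and its tick counter are illustrative, not the persistent runner's real members. The worker waits on a condition variable with a timeout instead of an unbounded sleep, and the destructor signals and joins, so there is no detached thread left racing static destruction.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class expiration_worker {
 public:
  explicit expiration_worker(std::chrono::milliseconds period)
  {
    worker_ = std::thread([this, period] {
      std::unique_lock<std::mutex> lock(mtx_);
      while (!stop_) {
        // wait_for returns early when notify_all() fires on shutdown
        cv_.wait_for(lock, period, [this] { return stop_.load(); });
        if (!stop_) { ++ticks_; }  // placeholder for the runner lifetime/expiry check
      }
    });
  }

  ~expiration_worker()
  {
    {
      std::lock_guard<std::mutex> lock(mtx_);
      stop_ = true;
    }
    cv_.notify_all();
    if (worker_.joinable()) { worker_.join(); }  // clean shutdown, never detach
  }

  int ticks() const { return ticks_.load(); }

 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  std::atomic<bool> stop_{false};
  std::thread worker_;
  std::atomic<int> ticks_{0};
};
```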
In `@cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan_jit.cuh`:
- Line 8: The include in ivf_flat_interleaved_scan_jit.cuh currently pulls
sample_filter_data.h from the CAGRA-specific path; move sample_filter_data.h
into a neutral shared JIT detail location (e.g., a neighbors/detail/jit or
neighbors/detail/shared_jit directory) and update the include in
ivf_flat_interleaved_scan_jit.cuh to reference that new path instead of
../detail/cagra/jit_lto_kernels/sample_filter_data.h; also update the build
rules (include paths) so the compiler finds the new location and keep the
header’s include guards/namespaces unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: c996c135-34b9-4e34-bb9e-693fbb63ea6d
📒 Files selected for processing (97)
- conda/recipes/libcuvs/recipe.yaml
- cpp/CMakeLists.txt
- cpp/cmake/modules/generate_jit_lto_kernels.cmake
- cpp/include/cuvs/detail/jit_lto/AlgorithmLauncher.hpp
- cpp/include/cuvs/detail/jit_lto/cagra/cagra_fragments.hpp
- cpp/include/cuvs/detail/jit_lto/common_fragments.hpp
- cpp/include/cuvs/detail/jit_lto/ivf_flat/interleaved_scan_fragments.hpp
- cpp/include/cuvs/neighbors/ivf_flat.hpp
- cpp/src/detail/jit_lto/AlgorithmLauncher.cpp
- cpp/src/detail/jit_lto/AlgorithmPlanner.cpp
- cpp/src/neighbors/cagra.cuh
- cpp/src/neighbors/detail/cagra/cagra_build.cuh
- cpp/src/neighbors/detail/cagra/cagra_search.cuh
- cpp/src/neighbors/detail/cagra/compute_distance.hpp
- cpp/src/neighbors/detail/cagra/compute_distance_standard-impl.cuh
- cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh
- cpp/src/neighbors/detail/cagra/device_common.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_filter_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_filter_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_normalization_standard_cosine_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_normalization_standard_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_normalization_standard_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_normalization_standard_noop_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/bitonic.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_standard_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_to_child_nodes_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_to_child_nodes_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_vpq_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/device_common_jit.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/device_intrinsics.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/device_memory_ops.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_cosine_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_hamming_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_inner_product_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_l1_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_l2_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/dist_op_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/extern_device_functions.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/graph_core.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/hashmap.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/kernel_def.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/random_pickup_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/random_pickup_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/sample_filter_bitset_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/sample_filter_data.h
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/sample_filter_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/sample_filter_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/sample_filter_none_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_helpers.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_planner.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_kernel_planner.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_device_helpers.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_jit.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_p_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_p_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_planner.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/set_value_batch.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_standard_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_vpq_impl.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/topk_by_radix.cuh
- cpp/src/neighbors/detail/cagra/search_multi_cta.cuh
- cpp/src/neighbors/detail/cagra/search_multi_cta_inst.cuh
- cpp/src/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh
- cpp/src/neighbors/detail/cagra/search_multi_cta_kernel_launcher_jit.cuh
- cpp/src/neighbors/detail/cagra/search_multi_kernel.cuh
- cpp/src/neighbors/detail/cagra/search_multi_kernel_launcher_jit.cuh
- cpp/src/neighbors/detail/cagra/search_plan.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_inst.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel_explicit_inst.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_common.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh
- cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp
- cpp/src/neighbors/detail/nn_descent.cuh
- cpp/src/neighbors/detail/smem_utils.cuh
- cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan_explicit_inst.cuh
- cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan_ext.cuh
- cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan_jit.cuh
- cpp/src/neighbors/ivf_flat/ivf_flat_search.cuh
- cpp/src/neighbors/ivf_flat_index.cpp
- cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh
- cpp/src/neighbors/refine/refine_device.cuh
- dependencies.yaml
💤 Files with no reviewable changes (3)
- cpp/src/neighbors/detail/cagra/device_common.hpp
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
- cpp/src/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh
Actionable comments posted: 10
♻️ Duplicate comments (3)
cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh (1)
204-204: ⚠️ Potential issue | 🟠 Major
Off-by-one: `iter >= min_iteration` should be `iter + 1 >= min_iteration`.
This is the same issue flagged in the past review. Since `iter` is incremented at the end of the loop (line 275), the current check allows one extra no-op iteration before breaking when no parent is available.
🔧 Suggested fix
```diff
- if ((parent_indices_buffer[0] == invalid_index) && (iter >= min_iteration)) { break; }
+ if ((parent_indices_buffer[0] == invalid_index) && (iter + 1 >= min_iteration)) { break; }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh` at line 204: the loop break condition uses `iter >= min_iteration`, allowing one extra iteration because `iter` is incremented at the end of the loop; update the condition that checks `parent_indices_buffer[0]` against `invalid_index` to use `iter + 1 >= min_iteration` so the check accounts for the imminent increment.
cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp (1)
24-50: ⚠️ Potential issue | 🟡 Minor
Add a `static_assert` or unreachable fallback for unsupported types.
These `get_*_type_tag()` functions have no return statement when `T` doesn't match any supported type. If called with an unsupported type, the function falls off the end without returning, which is undefined behavior.
🛠️ Suggested fix pattern
```diff
 template <typename T>
 constexpr auto get_data_type_tag()
 {
   if constexpr (std::is_same_v<T, float>) { return cuvs::neighbors::detail::tag_f{}; }
-  if constexpr (std::is_same_v<T, __half>) { return cuvs::neighbors::detail::tag_h{}; }
-  if constexpr (std::is_same_v<T, int8_t>) { return cuvs::neighbors::detail::tag_i8{}; }
-  if constexpr (std::is_same_v<T, uint8_t>) { return cuvs::neighbors::detail::tag_u8{}; }
+  else if constexpr (std::is_same_v<T, __half>) { return cuvs::neighbors::detail::tag_h{}; }
+  else if constexpr (std::is_same_v<T, int8_t>) { return cuvs::neighbors::detail::tag_i8{}; }
+  else if constexpr (std::is_same_v<T, uint8_t>) { return cuvs::neighbors::detail::tag_u8{}; }
+  else { static_assert(cagra_jit_get_sample_filter_name_type_always_false<T>, "Unsupported data type"); }
 }
```
Apply the same pattern to `get_index_type_tag`, `get_distance_type_tag`, and `get_source_index_type_tag`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp` around lines 24-50: the `get_*_type_tag()` helpers (`get_data_type_tag`, `get_index_type_tag`, `get_distance_type_tag`, `get_source_index_type_tag`) can fall off the end for unsupported `T`; keep the existing `if constexpr` branches and add a final `else` with a dependent-false `static_assert` (e.g. `static_assert(always_false<T>, "unsupported type for get_data_type_tag")`) so unsupported types fail cleanly at compile time rather than producing UB.
cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp (1)
69-130: ⚠️ Potential issue | 🟠 Major
Don't route every non-Hamming standard metric through the `L2Expanded` tag.
These fallback branches still instantiate `query_type_tag_standard_t<..., DistanceType::L2Expanded>` for every standard metric that is not `BitwiseHamming`. A supported metric like L1 or Cosine will link the wrong specialization instead of failing fast.
Also applies to: 172-225, 274-331
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp` around lines 69-130: the code incorrectly defaults non-Hamming standard metrics to `query_type_tag_standard_t<..., DistanceType::L2Expanded>`; update the factory to select the `QueryTag` from `dataset_desc.metric` (e.g. a switch or if-chain mapping `DistanceType::L1`, `::Cosine`, `::L2Expanded`, etc. to the matching `query_type_tag_standard_t` specialization) and only instantiate `CagraSingleCtaSearchPlanner` with that matching tag; for unsupported metrics return an explicit error or throw/assert so the wrong specialization is never linked. Apply the same change to the other duplicate blocks that instantiate the planner with `L2Expanded`.
🧹 Nitpick comments (3)
cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp (1)
96-98: Simplify the redundant `is_bitset_filter` check.
The condition `is_bitset_filter<InnerFilter>::value` already matches any `bitset_filter<bitset_t, index_t>` instantiation. The explicit checks for `bitset_filter<uint32_t, int64_t>` and `bitset_filter<uint32_t, uint32_t>` are redundant.
♻️ Suggested simplification
```diff
 using InnerFilter = decltype(std::declval<DecayedFilter>().filter);
- if constexpr (is_bitset_filter<InnerFilter>::value ||
-               std::is_same_v<InnerFilter, bitset_filter<uint32_t, int64_t>> ||
-               std::is_same_v<InnerFilter, bitset_filter<uint32_t, uint32_t>>) {
+ if constexpr (is_bitset_filter<InnerFilter>::value) {
   return "filter_bitset_source_index_ui";
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp` around lines 96-98: the `constexpr if` redundantly checks specific instantiations of `bitset_filter`; replace the compound condition with the single check `is_bitset_filter<InnerFilter>::value` so the branch triggers for any `bitset_filter<bitset_t, index_t>` instantiation, keeping the surrounding logic unchanged (symbols: `is_bitset_filter`, `InnerFilter`, `bitset_filter`).
cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh (1)
257-273: Filter loop only processes `p = 0` due to the `p < 1` bound.
The loop `for (unsigned p = threadIdx.x; p < 1; p += blockDim.x)` only executes for `threadIdx.x == 0`. This is intentional since there's only one parent, but the loop construct is confusing. Consider using a simple `if (threadIdx.x == 0)` for clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh` around lines 257-273: the filter loop `for (unsigned p = threadIdx.x; p < 1; p += blockDim.x)` only runs for `threadIdx.x == 0` and is confusing; replace it with a clear single-thread guard (`if (threadIdx.x == 0) { ... }`), keeping the existing body that references `parent_indices_buffer`, `result_indices_buffer`, `result_distances_buffer`, `invalid_index`, the bitset, and the call to `sample_filter<SourceIndexT>(...)`, and leave the trailing `__syncthreads()` in place so the kernel semantics stay identical.
cpp/src/neighbors/detail/cagra/search_multi_cta_kernel_launcher_jit.cuh (1)
70-79: Buffer size clamping is correct but could log the upgrade.
The logic correctly clamps `result_buffer_size` to supported values (64/128/256). Consider adding a debug log when the buffer is upgraded, as this affects memory usage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `cpp/src/neighbors/detail/cagra/search_multi_cta_kernel_launcher_jit.cuh` around lines 70-79: the clamp branch sets `max_elements` based on `result_buffer_size` but silently upgrades smaller requested buffers to larger supported sizes; add a debug log in the same block (around the assignments to `max_elements`) that reports the original `result_buffer_size` and the chosen `max_elements` whenever the buffer is increased (e.g. "requested %u -> using %u"), using the project's preferred debug/logging macro, so callers can see when memory usage is increased; keep the existing THROW fallback.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_filter_kernel.cu.in`:
- Line 26: The parameter type for query_id_offset in the apply_filter kernel is
inconsistent: it is declared as const index_t query_id_offset but callers and
related code (e.g., cagra_sample_filter in cagra_bitset.cuh and other kernels)
expect std::uint32_t; change the parameter declaration to const std::uint32_t
query_id_offset in the apply_filter_kernel (and any matching kernel signatures
in this template) so the kernel signature matches the calling convention and
avoids template-instantiation mismatches when index_t != uint32_t.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh`:
- Around line 200-215: The check that skips invalid slots uses
"~index_msb_1_mask" but invalid entries are set to
utils::get_max_value<IndexT>(), so ensure you skip those before dereferencing
source_indices_ptr: modify the conditional around the sample_filter path to
treat result_indices_ptr[index] == utils::get_max_value<IndexT>() as invalid (or
add an additional early-return/continue when result_indices_ptr[index] equals
utils::get_max_value<IndexT>()), so you never compute node_id =
source_indices_ptr[result_indices_ptr[index]] with an out-of-range UINT_MAX;
update the logic in the block using result_indices_ptr, source_indices_ptr,
sample_filter, and utils::get_max_value<IndexT>() accordingly.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_jit.cuh`:
- Around line 354-357: The compaction treats invalid slots as valid because the
code masks before comparing to the unmasked sentinel; fix by computing validity
from the unmasked stored value instead of the masked value (or compare
masked-to-masked). Replace the current line that computes is_valid_index with a
check against invalid_index on the raw buffer value, e.g. use is_valid_index =
(result_indices_buffer[src_position] != invalid_index) (or compute masked_index
= result_indices_buffer[src_position] & (~index_msb_1_mask) and compare to
(invalid_index & (~index_msb_1_mask))). Keep the rest of the flow (new_position
and scan_op_t(...).InclusiveSum) unchanged.
- Around line 424-427: The code currently masks off the MSB and calls
to_source_index on every entry (result_indices_buffer), which remaps invalid
sentinel values into large positive indices and can read past
source_indices_ptr; change the logic in the loop around
result_indices_buffer[ii] to first test the MSB sentinel (use the same
index_msb_1_mask check) and if the entry is invalid skip calling to_source_index
and instead write the invalid sentinel (or a known INVALID_INDEX) into the
output via write_indices; only call to_source_index and write the mapped
source_index when the MSB is not set. Ensure you update the branch around
result_indices_buffer, index_msb_1_mask, to_source_index and write_indices
accordingly.
In `@cpp/src/neighbors/detail/cagra/search_multi_kernel_launcher_jit.cuh`:
- Around line 114-149: The sample filter returned by extract_cagra_sample_filter
contains both bitset and query_id_offset but the launcher dispatch in
make_cagra_multi_kernel_jit_launcher currently only passes bf.bitset; update the
dispatch call for compute_distance_to_child_nodes_kernel_func_t to also pass
bf.query_id_offset (and update the argument list passed through
launcher->dispatch) and then mirror the same added parameter in the kernel
signature/usage in compute_distance_to_child_nodes_kernel_jit (in
cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh) so the
kernel receives and uses query_id_offset instead of relying on raw blockIdx.y.
In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh`:
- Around line 715-745: In select_and_run, guard against num_queries == 0 by
returning early before any launch-state or launcher_jit_t/ grid configuration is
built; check num_queries at the top of the function (before computing grid.y,
constructing launcher_jit_t, or allocating zero-capacity deques) and return
immediately to avoid creating a zero-capacity launch state or submitting query 0
on both the non-persistent and persistent paths.
- Around line 699-706: The calculation of n_blocks (from persistent_device_usage
* (ctas_per_sm * num_sm)) can truncate to zero for small but valid
persistent_device_usage, leaving the worker queue empty and breaking
launcher_jit_t; after the current truncation and max-cap check, add a guard that
if persistent_device_usage > 0 and n_blocks == 0 then set n_blocks = 1 (and
optionally log a RAFT_LOG_WARN noting the adjustment), ensuring at least one
persistent worker is launched while preserving the existing kMaxWorkersNum cap
and existing behavior for zero usage.
- Around line 463-493: The hash returned by calculate_parameter_hash(...) omits
the sample_filter state, so persistent_runner_jit_t can reuse a runner with a
different filter; include the filter in the cache key by mixing in the filter's
identifying state (e.g., the bitset pointer/address and the
query_id_offset/epoch value that persistent_runner_jit_t captures) into the
returned uint64_t hash alongside the other fields (for example XOR or other
lightweight mixing of sample_filter.bitset pointer and
sample_filter.query_id_offset); update calculate_parameter_hash to read those
sample_filter fields and fold them into the final hash.
In `@cpp/src/neighbors/detail/smem_utils.cuh`:
- Around line 96-108: The local mutex map_mutex is not static and therefore does
not protect the shared unordered_map jit_smem_sizes across calls/threads; change
map_mutex to a static std::mutex (i.e., make it have static storage duration) so
the std::lock_guard<std::mutex> map_lock{map_mutex} actually serializes access
when obtaining current_smem_size, then proceed to use current_smem_size with
safely_launch_kernel_with_smem_size_impl<cudaKernel_t, KernelLauncherT>(kernel,
smem_size, launch, current_smem_size->first, current_smem_size->second).
---
Duplicate comments:
In
`@cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp`:
- Around line 69-130: The code incorrectly defaults non-Hamming standard metrics
to query_type_tag_standard_t<..., DistanceType::L2Expanded>; update the factory
to select the QueryTag based on dataset_desc.metric (e.g., switch or if-chain
mapping DistanceType::L1, ::Cosine, ::L2Expanded, etc. to the correct
query_type_tag_standard_t specializations) and only instantiate
CagraSingleCtaSearchPlanner with the matching QueryTag (symbols:
query_type_tag_standard_t, CagraSingleCtaSearchPlanner, planner.get_launcher,
dataset_desc.metric); for unsupported metrics return an explicit error or
throw/assert so the wrong specialization is not linked. Apply the same change to
the other duplicate blocks that instantiate planner with L2Expanded.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh`:
- Line 204: The loop break condition incorrectly uses "iter >= min_iteration"
allowing one extra iteration because "iter" is incremented at the end of the
loop; update the condition that checks parent_indices_buffer[0] against
invalid_index to use "iter + 1 >= min_iteration" instead so the check accounts
for the imminent increment—modify the conditional around
parent_indices_buffer[0], invalid_index, iter, and min_iteration in the search
loop (the same block that currently reads the if (...) { break; }) to perform
the off-by-one correction.
In `@cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp`:
- Around line 24-50: The get_*_type_tag() helpers (get_data_type_tag,
get_index_type_tag, get_distance_type_tag, get_source_index_type_tag) can fall
off the end for unsupported T; add a compile-time guard and unreachable
fallback: inside each template, keep the existing if constexpr branches and
after them add a static_assert(false, "unsupported type for <function-name>")
(or static_assert always dependent on T, e.g.
static_assert(std::dependent_false_v<T>, "...")) or return a
cuvs::neighbors::detail::tag_unreachable{} combined with unreachable() to ensure
a hard compile-time/error path; update the messages to reference the specific
function (e.g., get_index_type_tag) so unsupported types fail cleanly rather
than producing UB.
---
Nitpick comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh`:
- Around line 257-273: The filter loop currently uses for (unsigned p =
threadIdx.x; p < 1; p += blockDim.x) which only runs for threadIdx.x == 0 and is
confusing; replace that loop with a clear single-thread guard (if (threadIdx.x
== 0) { ... }) keeping the existing body that references parent_indices_buffer,
result_indices_buffer, result_distances_buffer, invalid_index, bitset and the
call to sample_filter<SourceIndexT>(...), and leave the trailing __syncthreads()
in place so the kernel semantics stay identical.
In `@cpp/src/neighbors/detail/cagra/search_multi_cta_kernel_launcher_jit.cuh`:
- Around line 70-79: The clamp branch sets max_elements based on
result_buffer_size but silently upgrades smaller requested buffers to larger
supported sizes; add a debug log in the same block (around the assignments to
max_elements) that reports the original result_buffer_size and the chosen
max_elements whenever you increase the buffer (e.g., "requested %u -> using
%u"), using the project's debug/logging macro (e.g., LOG_DEBUG/VLOG or the
project's preferred logger) so callers can see when memory usage is increased;
reference the variables result_buffer_size, max_elements and the THROW fallback
when adding the log.
In `@cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp`:
- Around line 96-98: The condition in the constexpr if redundantly checks
specific instantiations of bitset_filter; replace the entire compound condition
with a single check using is_bitset_filter<InnerFilter>::value so the branch
triggers for any bitset_filter<bitset_t, index_t> instantiation; update the
if-expression that currently mentions is_bitset_filter<InnerFilter>::value ||
std::is_same_v<InnerFilter, bitset_filter<uint32_t, int64_t>> ||
std::is_same_v<InnerFilter, bitset_filter<uint32_t, uint32_t>> to just
is_bitset_filter<InnerFilter>::value, keeping the surrounding logic in the same
scope (symbols: is_bitset_filter, InnerFilter, bitset_filter).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: fe0cc595-4e89-44b0-bdbb-bdd7ea0e5437
📒 Files selected for processing (19)
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_filter_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_bitset.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_to_child_nodes_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/extern_device_functions.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/kernel_def.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_jit.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_jit.cuh
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_p_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_kernel.cu.in
- cpp/src/neighbors/detail/cagra/search_multi_cta_kernel_launcher_jit.cuh
- cpp/src/neighbors/detail/cagra/search_multi_kernel_launcher_jit.cuh
- cpp/src/neighbors/detail/cagra/search_single_cta_kernel_launcher_jit.cuh
- cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp
- cpp/src/neighbors/detail/smem_utils.cuh
🚧 Files skipped from review as they are similar to previous changes (6)
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_single_cta_p_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_to_child_nodes_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_cta_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/kernel_def.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_kernel.cu.in
Actionable comments posted: 4
♻️ Duplicate comments (2)
cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh (2)
**200-215: ⚠️ Potential issue | 🔴 Critical**

Skip `utils::get_max_value<IndexT>()` entries before dereferencing `source_indices_ptr`.

Line 200 uses the wrong invalid check for this buffer, so Line 205 can index `source_indices_ptr` with an invalid value.

🐛 Proposed fix

```diff
-    if (result_indices_ptr[index] != ~index_msb_1_mask) {
+    const auto raw_index = result_indices_ptr[index];
+    if (raw_index != utils::get_max_value<IndexT>() && raw_index != ~index_msb_1_mask) {
       // Use extern sample_filter function with 3 params: query_id, node_id, filter_data
       // filter_data is a void* pointer to bitset_filter_data_t (or nullptr for none_filter)
       SourceIndexT node_id = source_indices_ptr == nullptr
-                               ? static_cast<SourceIndexT>(result_indices_ptr[index])
-                               : source_indices_ptr[result_indices_ptr[index]];
+                               ? static_cast<SourceIndexT>(raw_index)
+                               : source_indices_ptr[raw_index];
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh` around lines 200 - 215, The code currently dereferences source_indices_ptr using result_indices_ptr[index] without first checking for the other sentinel value (utils::get_max_value<IndexT>()), which can cause an out-of-bounds access; update the conditional around the block that computes node_id and calls sample_filter to first ensure result_indices_ptr[index] is not equal to utils::get_max_value<IndexT>() (in addition to the existing ~index_msb_1_mask check), and only then compute node_id by indexing source_indices_ptr (or keep the ternary but move it inside that guarded branch); reference result_indices_ptr, source_indices_ptr, utils::get_max_value<IndexT>(), IndexT, SourceIndexT, and sample_filter in your change.
47-52:⚠️ Potential issue | 🔴 CriticalDo not return before
__syncthreads()in this kernel.Line 47 can exit a subset of threads while others reach Line 52, which can deadlock on partial blocks.
🐛 Proposed fix
- if (global_team_index >= num_pickup) { return; } + const bool active = global_team_index < num_pickup; extern __shared__ uint8_t smem[]; auto smem_desc = setup_workspace_base<DataT, IndexT, DistanceT>(dataset_desc, smem, queries_ptr, query_id); __syncthreads(); + if (!active) { return; }#!/bin/bash # Verify launch sizing/block-team alignment for random pickup kernel rg -n -C4 'random_pickup_kernel_jit|random_pickup<<<|num_pickup|num_teams_per_threadblock|blockDim' cpp/src/neighbors/detail/cagra/🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh` around lines 47 - 52, The early exit "if (global_team_index >= num_pickup) { return; }" exits a subset of threads before the shared-memory setup and __syncthreads(), causing a potential deadlock; move that conditional so all threads execute setup_workspace_base<DataT, IndexT, DistanceT>(...) and reach __syncthreads() first, then perform "if (global_team_index >= num_pickup) return;" (or otherwise make out-of-range threads noop after the barrier) so every lane hits __syncthreads(); update the code paths around setup_workspace_base, smem, query_id, and __syncthreads() accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh`:
- Around line 133-141: The code returns or writes an invalid distance in several
paths without setting the corresponding index slot, which can leave stale
indices in result_indices_ptr; update the branches that check parent_list_index
== utils::get_max_value<INDEX_T>() (early return), the raw_parent_index ==
utils::get_max_value<INDEX_T>() branch where you set result_distances_ptr[ldd *
blockIdx.y + global_team_id], and the similar later branch (around the second
invalid-distance write) to also write a sentinel into result_indices_ptr[ldd *
blockIdx.y + global_team_id] (use utils::get_max_value<INDEX_T>() or an
appropriate INDEX_T sentinel) before returning, so every path that sets an
invalid distance also sets the corresponding invalid index.
In `@cpp/src/neighbors/detail/cagra/search_multi_kernel_launcher_jit.cuh`:
- Around line 54-57: The kernel launcher must return early when the computed
launch grid would be zero-sized to avoid invalid-configuration errors: before
calling dispatch(...) check the computed grid (e.g., the dim3 grid_size computed
from block_size, num_teams_per_threadblock, num_pickup, num_queries) and if
grid_size.x == 0 || grid_size.y == 0 (or equivalently if num_queries == 0 or
num_pickup == 0) simply return/do nothing; apply the same guard to the other
launch sites referenced (the analogous grid computations at the places around
the symbols on lines ~122 and ~196) so empty inputs become no-ops instead of
attempting a launch.
- Around line 62-65: Before narrowing the public std::size_t stride, add a guard
that checks if ldr > UINT32_MAX and handle that overflow before the
static_cast<uint32_t>(ldr) and before the subsequent dispatch/JIT kernel call;
specifically, replace the unchecked cast to ldr_u32 with a runtime check (if ldr
too large) and then either return an error/throw/abort or choose a safe fallback
path so the kernel never receives a truncated stride; update places that refer
to ldr_u32 and the dispatch mechanism to rely on this validated value.
In `@cpp/src/neighbors/detail/smem_utils.cuh`:
- Around line 36-45: The function safely_launch_kernel_with_smem_size_impl
currently uses function-static shared state last_smem_size and last_kernel while
callers may pass per-kernel mutexes (e.g., for cudaKernel_t), causing races; fix
by removing the function-static last_* atomics and fully using the per-kernel
state passed in (current_smem_size) so the lock and protected data share scope:
read and compare smem_size against *current_smem_size in the fast path, and
under the provided mutex update *current_smem_size and call cudaFuncSetAttribute
as needed (ensure all code paths that use cudaKernel_t consult and update
current_smem_size under that same mutex); delete last_smem_size/last_kernel and
any logic referencing them to avoid cross-kernel races.
---
Duplicate comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh`:
- Around line 200-215: The code currently dereferences source_indices_ptr using
result_indices_ptr[index] without first checking for the other sentinel value
(utils::get_max_value<IndexT>()), which can cause an out-of-bounds access;
update the conditional around the block that computes node_id and calls
sample_filter to first ensure result_indices_ptr[index] is not equal to
utils::get_max_value<IndexT>() (in addition to the existing ~index_msb_1_mask
check), and only then compute node_id by indexing source_indices_ptr (or keep
the ternary but move it inside that guarded branch); reference
result_indices_ptr, source_indices_ptr, utils::get_max_value<IndexT>(), IndexT,
SourceIndexT, and sample_filter in your change.
- Around line 47-52: The early exit "if (global_team_index >= num_pickup) {
return; }" exits a subset of threads before the shared-memory setup and
__syncthreads(), causing a potential deadlock; move that conditional so all
threads execute setup_workspace_base<DataT, IndexT, DistanceT>(...) and reach
__syncthreads() first, then perform "if (global_team_index >= num_pickup)
return;" (or otherwise make out-of-range threads noop after the barrier) so
every lane hits __syncthreads(); update the code paths around
setup_workspace_base, smem, query_id, and __syncthreads() accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5ea81bc2-0517-44b3-bc50-c1f16eea78cc
📒 Files selected for processing (6)
- cpp/CMakeLists.txt
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_filter_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/kernel_def.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/search_multi_jit.cuh
- cpp/src/neighbors/detail/cagra/search_multi_kernel_launcher_jit.cuh
- cpp/src/neighbors/detail/smem_utils.cuh
✅ Files skipped from review due to trivial changes (2)
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/apply_filter_kernel.cu.in
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/kernel_def.hpp
Actionable comments posted: 1
♻️ Duplicate comments (1)
cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp (1)
**96-99: ⚠️ Potential issue | 🟠 Major**

`int64_t` filter inner type is still mapped to the `*_source_index_ui` symbol.

Line 97 allows `bitset_filter<uint32_t, int64_t>`, but Line 99 returns `"filter_bitset_source_index_ui"`. In `cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp` (Line 394-Line 396), that name maps to `tag_index_u32`, so `int64_t` source-index filters are not distinguishable and can bind the wrong fragment. Please either add an explicit i64 symbol/fragment path or fail fast for int64 until that path exists.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp` around lines 96 - 99, The current branch in shared_launcher_jit.hpp treats bitset_filter<uint32_t, int64_t> the same as the u32 case and returns "filter_bitset_source_index_ui", which collides with tag_index_u32 in cagra_planner_base.hpp; change the conditional in the is_bitset_filter handling so that bitset_filter<uint32_t, int64_t> is either mapped to a distinct symbol (e.g., "filter_bitset_source_index_i64") and ensure cagra_planner_base.hpp has the matching tag/fragment, or add a compile-time failure (static_assert or enable_if) when InnerFilter's source-index type is int64_t to fail fast until the i64 fragment/symbol is implemented; reference the symbols is_bitset_filter, bitset_filter<uint32_t, int64_t>, "filter_bitset_source_index_ui", and tag_index_u32 when making the change so the int64 path is handled explicitly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp`:
- Around line 63-153: The registration branches in
add_setup_workspace_device_function and add_compute_distance_device_function
only cover specific (team_size, dataset_block_dim) pairs and silently skip
others; replace the manual fall-through logic with a call to
dispatch_cagra_team_dim(team_size, dataset_block_dim, ...) (or otherwise invoke
the same dispatcher used elsewhere) so unsupported combinations immediately
trigger RAFT_FAIL instead of leaving fragments unregistered; ensure both VPQ and
non-VPQ paths use this dispatcher (or explicitly call RAFT_FAIL with a clear
message) when no matching team_size/dataset_block_dim case is found.
---
Duplicate comments:
In `@cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp`:
- Around line 96-99: The current branch in shared_launcher_jit.hpp treats
bitset_filter<uint32_t, int64_t> the same as the u32 case and returns
"filter_bitset_source_index_ui", which collides with tag_index_u32 in
cagra_planner_base.hpp; change the conditional in the is_bitset_filter handling
so that bitset_filter<uint32_t, int64_t> is either mapped to a distinct symbol
(e.g., "filter_bitset_source_index_i64") and ensure cagra_planner_base.hpp has
the matching tag/fragment, or add a compile-time failure (static_assert or
enable_if) when InnerFilter's source-index type is int64_t to fail fast until
the i64 fragment/symbol is implemented; reference the symbols is_bitset_filter,
bitset_filter<uint32_t, int64_t>, "filter_bitset_source_index_ui", and
tag_index_u32 when making the change so the int64 path is handled explicitly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eb75bb76-ee9d-405f-b90f-e53d794b7c48
📒 Files selected for processing (5)
- cpp/CMakeLists.txt
- cpp/include/cuvs/detail/jit_lto/cagra/cagra_fragments.hpp
- cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp
- cpp/src/neighbors/detail/cagra/shared_launcher_jit.hpp
- cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan_jit.cuh
✅ Files skipped from review due to trivial changes (1)
- cpp/include/cuvs/detail/jit_lto/cagra/cagra_fragments.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/CMakeLists.txt
dantegd
left a comment
Very nice work! Had less comments than I thought I would've for such a long PR
```cpp
using args_t = typename dataset_descriptor_base_t<data_t, index_t, distance_t>::args_t;
template __device__ distance_t
apply_normalization_standard<@team_size@, @dataset_block_dim@, data_t, index_t, distance_t, query_t>(distance_t,
```
There's probably room to turn this into an adapter function, remove team_size and dataset_block_dim from the signature, and thus shrink down whatever calls it, but I'm happy to do that in a follow-up.
This is not too large of a concern, it is not part of the main kernel. It is linked to a device function (that links to the main kernel) that already does not have these templates.
```cpp
(void)metric;
(void)is_vpq;
(void)pq_bits;
```
If we're not using these arguments anymore, could we remove them?
On a broader level: we usually use template parameters that take fragment tags for these add_*_function() functions. Any reason we're not doing that here?
> Any reason we're not doing that here?
See answer below.
```cpp
namespace cuvs::neighbors::cagra::detail {

template <typename DataT, typename IndexT, typename DistanceT>
extern __device__ const dataset_descriptor_base_t<DataT, IndexT, DistanceT>* setup_workspace_base(
```
Nitpick: I've usually called the device adapter functions e.g. setup_workspace() and the implementations setup_workspace_impl().
```cpp
namespace cuvs::neighbors::cagra::detail {

template <class T>
__global__ void set_value_batch_kernel(T* const dev_ptr,
```
This is not compiled into a JIT+LTO fragment. Should it be in the jit_lto_kernels directory?
On the whole, I love this. The one other overarching comment I'll give is that there are lots of small changes that seem to be unrelated to the purpose of the PR - comments and blank lines added, etc. Unless there's a good reason for adding them, I think we should try to keep the diff as minimal as possible - this is already a huge PR as it is. |
Apply updates from
CAGRA related PRs:
- `cudaFuncAttributeMaxDynamicSharedMemorySize` with thread-safety #1771

JIT related PRs:
Benchmark:
