Multi segment cagra search #133

Draft: jamxia155 wants to merge 5 commits into rapidsai:main from jamxia155:multi-segment-cagra-search

Conversation

@jamxia155

Companion PR to cuvs!2035, addresses #124.

Existing CAGRA search code path for each search query:

  • Call the CAGRA search API on one index segment
  • Copy the results back to host
  • Add the results to a host-side global top-k priority queue
  • Repeat for all index segments

Proposed change:

  • Leverage the new multi-segment CAGRA search API to launch all per-segment searches in one API call
  • Leave the results on device and run the GPU-accelerated select-k API to compute the global top-k
  • Copy the final top-k results to host

rewrite() is overridden to run all segment searches into a shared device
buffer and merge with cuvsSelectK entirely on the GPU, eliminating
per-segment D2H copies and the CPU-side TopDocs.merge(). It falls back to
the standard Lucene per-segment path when any segment lacks a CAGRA index,
an explicit filter is set, or k > 1024 (see the sketch below).

ordToDoc() and getCagraIndexForField() helpers are added to
CuVS2510GPUVectorsReader to support result decoding.
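
A minimal sketch of the rewritten path, under a hypothetical class skeleton: the constructor shape and the helpers allSegmentsHaveCagraIndex() and multiSegmentRewrite() are illustrative, and only the guard conditions and the three device-side steps come from this PR.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;

public class GPUKnnFloatVectorQuery extends KnnFloatVectorQuery {
  private final int k;
  private final Query filter;

  public GPUKnnFloatVectorQuery(String field, float[] target, int k, Query filter) {
    super(field, target, k, filter);
    this.k = k;
    this.filter = filter;
  }

  @Override
  public Query rewrite(IndexSearcher searcher) throws IOException {
    boolean gpuEligible = filter == null
        && k <= 1024
        && allSegmentsHaveCagraIndex(searcher.getIndexReader());
    if (!gpuEligible) {
      // Standard Lucene path: per-segment search, per-segment D2H copy,
      // CPU-side TopDocs.merge().
      return super.rewrite(searcher);
    }
    // 1) One API call launches every per-segment CAGRA search; results
    //    stay in a shared device buffer.
    // 2) cuvsSelectK merges them into a global top-k on the GPU.
    // 3) A single D2H copy brings back the k (ordinal, score) pairs,
    //    decoded via ordToDoc() into a doc-and-score query.
    return multiSegmentRewrite(searcher);
  }

  private boolean allSegmentsHaveCagraIndex(IndexReader reader) {
    return false; // elided in this sketch
  }

  private Query multiSegmentRewrite(IndexSearcher searcher) {
    throw new UnsupportedOperationException("elided in this sketch");
  }
}
```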

  Fixes for Lucene 10.2 API changes: CodecReader moved to
  org.apache.lucene.index; createRewrittenQuery() removed and replaced
  with an inline docAndScoreQuery() implementation using the public
  Weight/ScorerSupplier API.
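
As an illustration of that pattern (not the PR's actual implementation), a minimal doc-and-score query over precomputed global (doc, score) pairs might look like the following; the no-argument Scorer constructor assumes Lucene 10's removal of the Weight reference from Scorer.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.*;

// Holds precomputed results as parallel arrays sorted by global doc id.
final class DocAndScoreQuery extends Query {
  private final int[] docs;     // ascending global doc ids
  private final float[] scores; // parallel scores

  DocAndScoreQuery(int[] docs, float[] scores) {
    this.docs = docs;
    this.scores = scores;
  }

  // Index of the first element >= key.
  private static int lowerBound(int[] a, int key) {
    int i = Arrays.binarySearch(a, key);
    return i >= 0 ? i : -i - 1;
  }

  @Override
  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
    return new Weight(this) {
      @Override
      public Explanation explain(LeafReaderContext ctx, int doc) {
        int i = Arrays.binarySearch(docs, ctx.docBase + doc);
        return i < 0 ? Explanation.noMatch("not in precomputed top-k")
                     : Explanation.match(scores[i] * boost, "precomputed kNN score");
      }

      @Override
      public ScorerSupplier scorerSupplier(LeafReaderContext ctx) {
        // The slice of the global arrays that belongs to this segment.
        int lo = lowerBound(docs, ctx.docBase);
        int hi = lowerBound(docs, ctx.docBase + ctx.reader().maxDoc());
        if (lo == hi) return null; // no precomputed hits in this segment
        return new ScorerSupplier() {
          @Override
          public Scorer get(long leadCost) {
            return new Scorer() { // no-arg Scorer assumes Lucene 10
              private int idx = lo - 1; // cursor into docs[lo, hi)

              private int currentDoc() {
                if (idx < lo) return -1;
                return idx >= hi ? DocIdSetIterator.NO_MORE_DOCS : docs[idx] - ctx.docBase;
              }

              @Override public int docID() { return currentDoc(); }
              @Override public float score() { return scores[idx] * boost; }
              @Override public float getMaxScore(int upTo) { return Float.POSITIVE_INFINITY; }

              @Override
              public DocIdSetIterator iterator() {
                return new DocIdSetIterator() {
                  @Override public int docID() { return currentDoc(); }
                  @Override public int nextDoc() { idx++; return currentDoc(); }
                  @Override public long cost() { return hi - lo; }

                  @Override
                  public int advance(int target) {
                    idx = Math.max(idx + 1, lowerBound(docs, ctx.docBase + target));
                    return currentDoc();
                  }
                };
              }
            };
          }

          @Override public long cost() { return hi - lo; }
        };
      }

      @Override public boolean isCacheable(LeafReaderContext ctx) { return true; }
    };
  }

  @Override public String toString(String field) { return "DocAndScoreQuery(" + docs.length + ")"; }
  @Override public void visit(QueryVisitor visitor) { visitor.visitLeaf(this); }

  @Override
  public boolean equals(Object o) {
    return sameClassAs(o) && Arrays.equals(docs, ((DocAndScoreQuery) o).docs)
        && Arrays.equals(scores, ((DocAndScoreQuery) o).scores);
  }

  @Override
  public int hashCode() {
    return 31 * classHash() + 31 * Arrays.hashCode(docs) + Arrays.hashCode(scores);
  }
}
```
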
- CuVS2510GPUVectorsFormat: call CuVSProvider.provider().enableRMMAsyncMemory()
  in the static initializer so that cuda_async_memory_resource is active for
  the lifetime of the codec. This makes CAGRA workspace deallocations
  stream-ordered and non-blocking, which is required for the CudaStreamPool
  to provide any parallelism benefit (sketch below).
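
A short sketch of that initializer; the CuVSProvider.provider().enableRMMAsyncMemory() call is from the PR text, while the failure handling is an assumption.

```java
import com.nvidia.cuvs.CuVSProvider;

// Sketch of the static initializer inside CuVS2510GPUVectorsFormat
// (the real class extends Lucene's KnnVectorsFormat; its body is elided).
final class RmmAsyncInit {
  static {
    try {
      // cuda_async_memory_resource stays active for the codec's lifetime,
      // making CAGRA workspace deallocations stream-ordered and non-blocking.
      CuVSProvider.provider().enableRMMAsyncMemory();
    } catch (Throwable t) {
      // ASSUMPTION: failure handling isn't specified in the PR text;
      // failing fast avoids silently running with blocking deallocations.
      throw new ExceptionInInitializerError(t);
    }
  }
}
```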

- GPUKnnFloatVectorQuery: upload the query vector to device once before the
  per-segment loop and share the resulting CuVSMatrix across all CagraQuery
  instances, reducing host-to-device copies from O(numSegments) to 1 per
  query. Wrap the shared device matrix in try-with-resources to close the
  RMM allocation promptly after MultiSegmentCagraSearch.search() returns,
  as sketched below.
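
Roughly as follows; this is a sketch, and the CuVSMatrix.ofArray() factory, the CagraQuery.Builder methods, and the MultiSegmentCagraSearch.search() signature shown here are assumptions (only the class names appear in this PR).

```java
// One host-to-device upload, shared by every per-segment CagraQuery.
// Assumes reader, queryVector, k, and searchParams are in scope.
try (CuVSMatrix queryOnDevice = CuVSMatrix.ofArray(new float[][] {queryVector})) {
  List<CagraQuery> perSegment = new ArrayList<>();
  for (LeafReaderContext ctx : reader.leaves()) {
    perSegment.add(new CagraQuery.Builder()
        .withQueryVectors(queryOnDevice) // shared device matrix, no extra H2D copy
        .withTopK(k)
        .withSearchParams(searchParams)
        .build());
  }
  MultiSegmentCagraSearch.search(perSegment);
} // closing the matrix releases its RMM allocation promptly
```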

- FilterCuVSProvider: delegate enableRMMAsyncMemory() to the wrapped provider.

GPUKnnFloatVectorQuery / GPUPerLeafCuVSKnnCollector:
- Add persistent, persistentLifetime, and persistentDeviceUsage
  parameters, threaded through all constructor overloads and forwarded
  to CagraSearchParams.Builder in both the multi-segment rewrite() path
  and the per-segment approximateSearch() fallback path.
- Add threadBlockSize parameter (0 = auto) to allow tuning of the
  persistent kernel's worker_queue_size, which determines how many
  concurrent query threads can run without a latency increase (sketched below).
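
A sketch of the forwarding described above; the builder method names are assumptions mapped from the parameter names (and note that a later commit in this PR removes the persistent knobs again).

```java
// Forward the tuning knobs to the cuVS search-params builder.
CagraSearchParams.Builder builder = new CagraSearchParams.Builder();
if (threadBlockSize > 0) {
  builder.withThreadBlockSize(threadBlockSize); // 0 = auto-select
}
if (persistent) {
  builder.withPersistent(true)
      .withPersistentLifetime(persistentLifetime)        // how long the kernel stays resident
      .withPersistentDeviceUsage(persistentDeviceUsage); // fraction of the device to occupy
}
CagraSearchParams searchParams = builder.build();
```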

Fix persistent-runner hash instability across segments (rewrite() path):
- When max_iterations is 0 (auto), CAGRA computes it from each
  segment's dataset size. Different-sized segments produce different
  values, causing a distinct runner hash per segment and a
  destroy/recreate cycle on every search call.
- Add computeMaxIterations(), which mirrors adjust_search_params() from
  search_plan.cuh, and call it once using the largest segment's graph
  size and degree. All segments then share the same max_iterations,
  producing a stable runner hash across the full multi-segment query, as
  sketched below.
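
The call pattern, sketched; the CagraSegment holder type is hypothetical, and the computeMaxIterations() body below is an openly invented placeholder standing in for the adjust_search_params() heuristic from search_plan.cuh.

```java
import java.util.List;

final class MaxIterations {
  // Hypothetical per-segment metadata holder.
  record CagraSegment(long graphSize, int graphDegree) {}

  // Compute max_iterations once, from the largest segment, and reuse it
  // for every per-segment search so all searches hash identically.
  static int shared(List<CagraSegment> segments) {
    long largestGraphSize = 0;
    int largestGraphDegree = 0;
    for (CagraSegment seg : segments) {
      largestGraphSize = Math.max(largestGraphSize, seg.graphSize());
      largestGraphDegree = Math.max(largestGraphDegree, seg.graphDegree());
    }
    return computeMaxIterations(largestGraphSize, largestGraphDegree);
  }

  // PLACEHOLDER ONLY: stands in for the adjust_search_params() heuristic
  // in search_plan.cuh; do not read this as the cuVS formula.
  static int computeMaxIterations(long graphSize, int graphDegree) {
    return 10 + (int) Math.ceil(
        Math.log(Math.max(2, graphSize)) / Math.log(Math.max(2, graphDegree)));
  }
}
```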

CuVS2510GPUVectorsReader:
- Forward threadBlockSize, persistent, persistentLifetime, and
  persistentDeviceUsage from GPUPerLeafCuVSKnnCollector to
  CagraSearchParams.Builder in the per-segment fallback path.

Remove persistent kernel mode:
- Drop persistent, persistentLifetime, and persistentDeviceUsage fields
  and parameters from GPUKnnFloatVectorQuery, GPUPerLeafCuVSKnnCollector,
  and CuVS2510GPUVectorsReader. The persistent kernel is superseded by
  the native multi-segment kernel (cuvsCagraSearchMultiSegment) which
  achieves better concurrency without the per-runner lifecycle overhead.
- Collapse the 11-argument GPUKnnFloatVectorQuery constructor (which only
  existed to accept persistent parameters) into the standard 8-argument
  form.
- Remove stale comments that described max_iterations uniformity in terms
  of persistent-runner hash stability; replace with accurate explanation
  (consistent search quality across segments of different sizes).

Add workspace pool configuration:
- Add WORKSPACE_POOL_SIZE_PROPERTY constant to
  ThreadLocalCuVSResourcesProvider.
- On resources creation, read com.nvidia.cuvs.workspacePoolSize system
  property and call setWorkspacePool() if set, so callers can pre-warm
  the per-thread RMM pool without modifying cuvs-lucene source (sketch below).
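
A sketch of the hook: setWorkspacePool() and the property name come from the PR text, while the long-valued size argument and the surrounding method shape are assumptions.

```java
import com.nvidia.cuvs.CuVSResources;

final class WorkspacePoolConfig {
  static final String WORKSPACE_POOL_SIZE_PROPERTY = "com.nvidia.cuvs.workspacePoolSize";

  // Called on per-thread resources creation; pre-warms the RMM workspace
  // pool when the system property is set, with no cuvs-lucene source change.
  static CuVSResources newResources() throws Throwable {
    CuVSResources resources = CuVSResources.create();
    String poolSize = System.getProperty(WORKSPACE_POOL_SIZE_PROPERTY);
    if (poolSize != null) {
      // ASSUMPTION: the numeric value is passed straight through; its unit
      // semantics are whatever setWorkspacePool() expects.
      resources.setWorkspacePool(Long.parseLong(poolSize));
    }
    return resources;
  }
}
```

Deployments would then set -Dcom.nvidia.cuvs.workspacePoolSize=<size> on the JVM command line.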

copy-pr-bot Bot commented Apr 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

