Multi segment cagra search #133

Draft: jamxia155 wants to merge 5 commits into rapidsai:main from jamxia155:multi-segment-cagra-search

Conversation

@jamxia155

Companion PR to cuvs!2035, addresses #124.

Existing CAGRA search code path for each search query:

  • Call the CAGRA search API on one index segment
  • Copy the results back to host
  • Add the results to a host-side global top-k priority queue
  • Repeat for all index segments

Proposed change:

  • Leverage the new multi-segment CAGRA search API to launch all per-segment searches in one API call
  • Leave the results on device and run the GPU-accelerated select-k API to compute the global top-k
  • Copy the final top-k results to host

rewrite() is overridden to run all segment searches into a shared device
buffer and merge with cuvsSelectK entirely on the GPU, eliminating
per-segment D2H copies and the CPU-side TopDocs.merge(). It falls back to
the standard Lucene per-segment path when any segment lacks a CAGRA index,
an explicit filter is set, or k > 1024 (see the sketch below).

ordToDoc() and getCagraIndexForField() helpers are added to
CuVS2510GPUVectorsReader to support result decoding.
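
A minimal sketch of the rewritten path, under a hypothetical class skeleton: the constructor shape and the helpers allSegmentsHaveCagraIndex() and multiSegmentRewrite() are illustrative, and only the guard conditions and the three device-side steps come from this PR.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;

public class GPUKnnFloatVectorQuery extends KnnFloatVectorQuery {
  private final int k;
  private final Query filter;

  public GPUKnnFloatVectorQuery(String field, float[] target, int k, Query filter) {
    super(field, target, k, filter);
    this.k = k;
    this.filter = filter;
  }

  @Override
  public Query rewrite(IndexSearcher searcher) throws IOException {
    boolean gpuEligible = filter == null
        && k <= 1024
        && allSegmentsHaveCagraIndex(searcher.getIndexReader());
    if (!gpuEligible) {
      // Standard Lucene path: per-segment search, per-segment D2H copy,
      // CPU-side TopDocs.merge().
      return super.rewrite(searcher);
    }
    // 1) One API call launches every per-segment CAGRA search; results
    //    stay in a shared device buffer.
    // 2) cuvsSelectK merges them into a global top-k on the GPU.
    // 3) A single D2H copy brings back the k (ordinal, score) pairs,
    //    decoded via ordToDoc() into a doc-and-score query.
    return multiSegmentRewrite(searcher);
  }

  private boolean allSegmentsHaveCagraIndex(IndexReader reader) {
    return false; // elided in this sketch
  }

  private Query multiSegmentRewrite(IndexSearcher searcher) {
    throw new UnsupportedOperationException("elided in this sketch");
  }
}
```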

  Fixes for Lucene 10.2 API changes: CodecReader moved to
  org.apache.lucene.index; createRewrittenQuery() removed and replaced
  with an inline docAndScoreQuery() implementation using the public
  Weight/ScorerSupplier API.
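
As an illustration of that pattern (not the PR's actual implementation), a minimal doc-and-score query over precomputed global (doc, score) pairs might look like the following; the no-argument Scorer constructor assumes Lucene 10's removal of the Weight reference from Scorer.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.*;

// Holds precomputed results as parallel arrays sorted by global doc id.
final class DocAndScoreQuery extends Query {
  private final int[] docs;     // ascending global doc ids
  private final float[] scores; // parallel scores

  DocAndScoreQuery(int[] docs, float[] scores) {
    this.docs = docs;
    this.scores = scores;
  }

  // Index of the first element >= key.
  private static int lowerBound(int[] a, int key) {
    int i = Arrays.binarySearch(a, key);
    return i >= 0 ? i : -i - 1;
  }

  @Override
  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) {
    return new Weight(this) {
      @Override
      public Explanation explain(LeafReaderContext ctx, int doc) {
        int i = Arrays.binarySearch(docs, ctx.docBase + doc);
        return i < 0 ? Explanation.noMatch("not in precomputed top-k")
                     : Explanation.match(scores[i] * boost, "precomputed kNN score");
      }

      @Override
      public ScorerSupplier scorerSupplier(LeafReaderContext ctx) {
        // The slice of the global arrays that belongs to this segment.
        int lo = lowerBound(docs, ctx.docBase);
        int hi = lowerBound(docs, ctx.docBase + ctx.reader().maxDoc());
        if (lo == hi) return null; // no precomputed hits in this segment
        return new ScorerSupplier() {
          @Override
          public Scorer get(long leadCost) {
            return new Scorer() { // no-arg Scorer assumes Lucene 10
              private int idx = lo - 1; // cursor into docs[lo, hi)

              private int currentDoc() {
                if (idx < lo) return -1;
                return idx >= hi ? DocIdSetIterator.NO_MORE_DOCS : docs[idx] - ctx.docBase;
              }

              @Override public int docID() { return currentDoc(); }
              @Override public float score() { return scores[idx] * boost; }
              @Override public float getMaxScore(int upTo) { return Float.POSITIVE_INFINITY; }

              @Override
              public DocIdSetIterator iterator() {
                return new DocIdSetIterator() {
                  @Override public int docID() { return currentDoc(); }
                  @Override public int nextDoc() { idx++; return currentDoc(); }
                  @Override public long cost() { return hi - lo; }

                  @Override
                  public int advance(int target) {
                    idx = Math.max(idx + 1, lowerBound(docs, ctx.docBase + target));
                    return currentDoc();
                  }
                };
              }
            };
          }

          @Override public long cost() { return hi - lo; }
        };
      }

      @Override public boolean isCacheable(LeafReaderContext ctx) { return true; }
    };
  }

  @Override public String toString(String field) { return "DocAndScoreQuery(" + docs.length + ")"; }
  @Override public void visit(QueryVisitor visitor) { visitor.visitLeaf(this); }

  @Override
  public boolean equals(Object o) {
    return sameClassAs(o) && Arrays.equals(docs, ((DocAndScoreQuery) o).docs)
        && Arrays.equals(scores, ((DocAndScoreQuery) o).scores);
  }

  @Override
  public int hashCode() {
    return 31 * classHash() + 31 * Arrays.hashCode(docs) + Arrays.hashCode(scores);
  }
}
```
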
- CuVS2510GPUVectorsFormat: call CuVSProvider.provider().enableRMMAsyncMemory()
  in the static initializer so that cuda_async_memory_resource is active for
  the lifetime of the codec. This makes CAGRA workspace deallocations
  stream-ordered and non-blocking, which is required for the CudaStreamPool
  to provide any parallelism benefit (sketch below).
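
A short sketch of that initializer; the CuVSProvider.provider().enableRMMAsyncMemory() call is from the PR text, while the failure handling is an assumption.

```java
import com.nvidia.cuvs.CuVSProvider;

// Sketch of the static initializer inside CuVS2510GPUVectorsFormat
// (the real class extends Lucene's KnnVectorsFormat; its body is elided).
final class RmmAsyncInit {
  static {
    try {
      // cuda_async_memory_resource stays active for the codec's lifetime,
      // making CAGRA workspace deallocations stream-ordered and non-blocking.
      CuVSProvider.provider().enableRMMAsyncMemory();
    } catch (Throwable t) {
      // ASSUMPTION: failure handling isn't specified in the PR text;
      // failing fast avoids silently running with blocking deallocations.
      throw new ExceptionInInitializerError(t);
    }
  }
}
```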

- GPUKnnFloatVectorQuery: upload the query vector to device once before the
  per-segment loop and share the resulting CuVSMatrix across all CagraQuery
  instances, reducing host-to-device copies from O(numSegments) to 1 per
  query. Wrap the shared device matrix in try-with-resources to close the
  RMM allocation promptly after MultiSegmentCagraSearch.search() returns,
  as sketched below.
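
Roughly as follows; this is a sketch, and the CuVSMatrix.ofArray() factory, the CagraQuery.Builder methods, and the MultiSegmentCagraSearch.search() signature shown here are assumptions (only the class names appear in this PR).

```java
// One host-to-device upload, shared by every per-segment CagraQuery.
// Assumes reader, queryVector, k, and searchParams are in scope.
try (CuVSMatrix queryOnDevice = CuVSMatrix.ofArray(new float[][] {queryVector})) {
  List<CagraQuery> perSegment = new ArrayList<>();
  for (LeafReaderContext ctx : reader.leaves()) {
    perSegment.add(new CagraQuery.Builder()
        .withQueryVectors(queryOnDevice) // shared device matrix, no extra H2D copy
        .withTopK(k)
        .withSearchParams(searchParams)
        .build());
  }
  MultiSegmentCagraSearch.search(perSegment);
} // closing the matrix releases its RMM allocation promptly
```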

- FilterCuVSProvider: delegate enableRMMAsyncMemory() to the wrapped provider.

GPUKnnFloatVectorQuery / GPUPerLeafCuVSKnnCollector:
- Add persistent, persistentLifetime, and persistentDeviceUsage
  parameters, threaded through all constructor overloads and forwarded
  to CagraSearchParams.Builder in both the multi-segment rewrite() path
  and the per-segment approximateSearch() fallback path.
- Add threadBlockSize parameter (0 = auto) to allow tuning of the
  persistent kernel's worker_queue_size, which determines how many
  concurrent query threads can run without a latency increase (sketched below).
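
A sketch of the forwarding described above; the builder method names are assumptions mapped from the parameter names (and note that a later commit in this PR removes the persistent knobs again).

```java
// Forward the tuning knobs to the cuVS search-params builder.
CagraSearchParams.Builder builder = new CagraSearchParams.Builder();
if (threadBlockSize > 0) {
  builder.withThreadBlockSize(threadBlockSize); // 0 = auto-select
}
if (persistent) {
  builder.withPersistent(true)
      .withPersistentLifetime(persistentLifetime)        // how long the kernel stays resident
      .withPersistentDeviceUsage(persistentDeviceUsage); // fraction of the device to occupy
}
CagraSearchParams searchParams = builder.build();
```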

Fix persistent-runner hash instability across segments (rewrite() path):
- When max_iterations is 0 (auto), CAGRA computes it from each
  segment's dataset size. Different-sized segments produce different
  values, causing a distinct runner hash per segment and a
  destroy/recreate cycle on every search call.
- Add computeMaxIterations(), which mirrors adjust_search_params() from
  search_plan.cuh, and call it once using the largest segment's graph
  size and degree. All segments then share the same max_iterations,
  producing a stable runner hash across the full multi-segment query, as
  sketched below.
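
The call pattern, sketched; the CagraSegment holder type is hypothetical, and the computeMaxIterations() body below is an openly invented placeholder standing in for the adjust_search_params() heuristic from search_plan.cuh.

```java
import java.util.List;

final class MaxIterations {
  // Hypothetical per-segment metadata holder.
  record CagraSegment(long graphSize, int graphDegree) {}

  // Compute max_iterations once, from the largest segment, and reuse it
  // for every per-segment search so all searches hash identically.
  static int shared(List<CagraSegment> segments) {
    long largestGraphSize = 0;
    int largestGraphDegree = 0;
    for (CagraSegment seg : segments) {
      largestGraphSize = Math.max(largestGraphSize, seg.graphSize());
      largestGraphDegree = Math.max(largestGraphDegree, seg.graphDegree());
    }
    return computeMaxIterations(largestGraphSize, largestGraphDegree);
  }

  // PLACEHOLDER ONLY: stands in for the adjust_search_params() heuristic
  // in search_plan.cuh; do not read this as the cuVS formula.
  static int computeMaxIterations(long graphSize, int graphDegree) {
    return 10 + (int) Math.ceil(
        Math.log(Math.max(2, graphSize)) / Math.log(Math.max(2, graphDegree)));
  }
}
```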

CuVS2510GPUVectorsReader:
- Forward threadBlockSize, persistent, persistentLifetime, and
  persistentDeviceUsage from GPUPerLeafCuVSKnnCollector to
  CagraSearchParams.Builder in the per-segment fallback path.

Remove persistent kernel mode:
- Drop persistent, persistentLifetime, and persistentDeviceUsage fields
  and parameters from GPUKnnFloatVectorQuery, GPUPerLeafCuVSKnnCollector,
  and CuVS2510GPUVectorsReader. The persistent kernel is superseded by
  the native multi-segment kernel (cuvsCagraSearchMultiSegment) which
  achieves better concurrency without the per-runner lifecycle overhead.
- Collapse the 11-argument GPUKnnFloatVectorQuery constructor (which only
  existed to accept persistent parameters) into the standard 8-argument
  form.
- Remove stale comments that described max_iterations uniformity in terms
  of persistent-runner hash stability; replace with accurate explanation
  (consistent search quality across segments of different sizes).

Add workspace pool configuration:
- Add WORKSPACE_POOL_SIZE_PROPERTY constant to
  ThreadLocalCuVSResourcesProvider.
- On resources creation, read com.nvidia.cuvs.workspacePoolSize system
  property and call setWorkspacePool() if set, so callers can pre-warm
  the per-thread RMM pool without modifying cuvs-lucene source (sketch below).
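
A sketch of the hook: setWorkspacePool() and the property name come from the PR text, while the long-valued size argument and the surrounding method shape are assumptions.

```java
import com.nvidia.cuvs.CuVSResources;

final class WorkspacePoolConfig {
  static final String WORKSPACE_POOL_SIZE_PROPERTY = "com.nvidia.cuvs.workspacePoolSize";

  // Called on per-thread resources creation; pre-warms the RMM workspace
  // pool when the system property is set, with no cuvs-lucene source change.
  static CuVSResources newResources() throws Throwable {
    CuVSResources resources = CuVSResources.create();
    String poolSize = System.getProperty(WORKSPACE_POOL_SIZE_PROPERTY);
    if (poolSize != null) {
      // ASSUMPTION: the numeric value is passed straight through; its unit
      // semantics are whatever setWorkspacePool() expects.
      resources.setWorkspacePool(Long.parseLong(poolSize));
    }
    return resources;
  }
}
```

Deployments would then set -Dcom.nvidia.cuvs.workspacePoolSize=<size> on the JVM command line.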

copy-pr-bot Bot commented Apr 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

