Centralize shared utilities across benchmark backends #2040
The PR adds one new file (+192 lines), a shared utilities module:

```python
#
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
# SPDX-License-Identifier: Apache-2.0
#

"""
Shared utilities for cuvs-bench backends.

Provides common functions for Python-native backends (e.g., OpenSearch,
Elasticsearch):

- Vector file I/O: used internally by the Dataset class for transparent
  vector loading. The C++ backend does not use these since it passes file
  paths directly to the subprocess.
- Parameter expansion: converts YAML param specs into lists of param dicts.
- Recall computation: computes recall@k from neighbors and ground truth.

The dtype_from_filename function originates from generate_groundtruth/utils.py.
Note: generate_groundtruth/utils.py uses .hbin for float16 while the rest of
the codebase (get_dataset/fbin_to_f16bin.py, OpenSearch backend) uses .f16bin.
We standardize on .f16bin here to match the naming convention of the other
formats (.fbin, .i8bin, .u8bin).
"""

import itertools
import os
from typing import Any, Dict, List, Optional

import numpy as np


def dtype_from_filename(filename):
    """Map file extension to numpy dtype.

    Parameters
    ----------
    filename : str
        Path or filename with a supported extension.

    Returns
    -------
    numpy.dtype
        The corresponding numpy dtype.

    Raises
    ------
    RuntimeError
        If the file extension is not supported.
    """
    ext = os.path.splitext(filename)[1]
    if ext == ".fbin":
        return np.float32
    elif ext == ".f16bin":
        return np.float16
    elif ext == ".ibin":
        return np.int32
    elif ext == ".u8bin":
        return np.ubyte
    elif ext == ".i8bin":
        return np.byte
    else:
        raise RuntimeError(f"Unsupported file extension: {ext}")


def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
    """
    Read a binary vector file into a numpy array.

    Supports the standard big-ann-bench binary format used by cuvs-bench
    datasets: a 4-byte uint32 ``n_rows``, a 4-byte uint32 ``n_cols``,
    followed by ``n_rows * n_cols`` elements of the dtype inferred from
    the file extension via ``dtype_from_filename``.

    Parameters
    ----------
    path : str
        Path to the binary file. The dtype is inferred from the extension:
        ``.fbin`` (float32), ``.f16bin`` (float16), ``.u8bin`` (uint8),
        ``.i8bin`` (int8), ``.ibin`` (int32).
    subset_size : Optional[int]
        If provided, only the first ``subset_size`` rows are loaded.

    Returns
    -------
    np.ndarray
        Array of shape ``(n_rows, n_cols)`` with the inferred dtype.

    Raises
    ------
    FileNotFoundError
        If the file does not exist.
    RuntimeError
        If the file extension is not supported (raised by
        ``dtype_from_filename``).
    ValueError
        If ``subset_size`` is not positive or the file is truncated.
    """
    dtype = dtype_from_filename(path)
    if subset_size is not None and subset_size < 1:
        raise ValueError(
            f"subset_size must be a positive integer, got {subset_size}"
        )
    with open(path, "rb") as f:
        header = f.read(8)
        if len(header) < 8:
            raise ValueError(
                f"File too small to contain a valid header (expected 8 bytes, "
                f"got {len(header)}): {path}"
            )
        n_rows = int(np.frombuffer(header[:4], dtype=np.uint32)[0])
        n_cols = int(np.frombuffer(header[4:], dtype=np.uint32)[0])
        if subset_size is not None:
            n_rows = min(n_rows, subset_size)
```
**coderabbitai[bot]** commented on lines +96 to +110:

Validate `subset_size` type. Lines 96-110 only enforce a lower bound. Non-integer values (e.g., a float) pass the `< 1` check and only fail later, with a confusing error, once the row count enters the size arithmetic. Proposed fix:

```diff
 def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
@@
-    if subset_size is not None and subset_size < 1:
-        raise ValueError(
-            f"subset_size must be a positive integer, got {subset_size}"
-        )
+    if subset_size is not None:
+        if not isinstance(subset_size, (int, np.integer)):
+            raise ValueError(
+                f"subset_size must be an integer, got {type(subset_size).__name__}"
+            )
+        if subset_size < 1:
+            raise ValueError(
+                f"subset_size must be a positive integer, got {subset_size}"
+            )
```

As per coding guidelines, "Ensure missing validation does not cause crashes on invalid input through proper size/type checks."
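To make the failure mode concrete (hypothetical call; any valid vector file would do), a float clears the lower-bound check and only fails deep inside the read path:

```python
# 2.5 is not < 1, so validation passes; min(n_rows, 2.5) then makes the row
# count a float, expected_bytes becomes a float, and f.read() raises TypeError.
load_vectors("base.fbin", subset_size=2.5)
```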
The hunk continues:

```python
        # (load_vectors, continued)
        expected_bytes = n_rows * n_cols * np.dtype(dtype).itemsize
        raw = f.read(expected_bytes)
        if len(raw) < expected_bytes:
            raise ValueError(
                f"File is truncated: expected {expected_bytes} bytes of data "
                f"({n_rows} rows x {n_cols} cols x {np.dtype(dtype).itemsize} bytes), "
                f"got {len(raw)}: {path}"
            )
        data = np.frombuffer(raw, dtype=dtype)
        return data.reshape(n_rows, n_cols)
```
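For reference, a round-trip sketch of the on-disk layout (the temp path is illustrative): two uint32 header words for `n_rows` and `n_cols`, then the row-major payload whose dtype is implied by the extension:

```python
import numpy as np

# Write a 4x2 float32 file in the layout load_vectors expects (.fbin -> float32).
data = np.arange(8, dtype=np.float32).reshape(4, 2)
with open("/tmp/example.fbin", "wb") as f:
    f.write(np.array(data.shape, dtype=np.uint32).tobytes())  # n_rows, n_cols
    f.write(data.tobytes())                                   # row-major payload

loaded = load_vectors("/tmp/example.fbin")
assert loaded.shape == (4, 2) and loaded.dtype == np.float32

# subset_size trims to the leading rows without reading the rest of the file.
head = load_vectors("/tmp/example.fbin", subset_size=2)
assert head.shape == (2, 2)
```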
```python
def expand_param_grid(param_spec: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """
    Expand a parameter specification into all combinations via Cartesian product.

    Takes a dict where each key maps to a list of values (as defined in
    algorithm YAML configs) and produces a list of dicts, one per combination.

    Parameters
    ----------
    param_spec : Dict[str, List[Any]]
        Parameter specification, e.g.,
        {"m": [16, 32], "ef_construction": [100, 200]}

    Returns
    -------
    List[Dict[str, Any]]
        List of parameter dicts, e.g.,
        [{"m": 16, "ef_construction": 100}, {"m": 16, "ef_construction": 200},
         {"m": 32, "ef_construction": 100}, {"m": 32, "ef_construction": 200}].
        Returns [{}] if param_spec is empty.
    """
    if not param_spec:
        return [{}]
    keys = list(param_spec.keys())
    return [dict(zip(keys, vals)) for vals in itertools.product(*param_spec.values())]
```
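As a usage sketch (the YAML fragment is illustrative, not taken from the repo's algorithm configs), a parsed param spec expands into one dict per combination:

```python
import yaml  # PyYAML

spec = yaml.safe_load("""
m: [16, 32]
ef_construction: [100, 200]
""")
grid = expand_param_grid(spec)
assert len(grid) == 4
assert {"m": 16, "ef_construction": 100} in grid
assert expand_param_grid({}) == [{}]  # empty spec -> single empty combination
```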
```python
def compute_recall(
    neighbors: np.ndarray, groundtruth: np.ndarray, k: int
) -> float:
    """
    Compute recall@k by comparing returned neighbors against ground truth.

    For each query, counts how many of the k returned neighbors appear
    in the ground truth set, then averages across all queries.

    Parameters
    ----------
    neighbors : np.ndarray
        Returned neighbor IDs, shape (n_queries, k)
    groundtruth : np.ndarray
        Ground truth neighbor IDs, shape (n_queries, gt_k)
    k : int
        Number of neighbors to evaluate

    Returns
    -------
    float
        Recall@k in range [0.0, 1.0]
    """
    if not isinstance(neighbors, np.ndarray) or not isinstance(groundtruth, np.ndarray):
        raise ValueError("neighbors and groundtruth must be numpy ndarrays")
    if neighbors.ndim != 2 or groundtruth.ndim != 2:
        raise ValueError(
            f"neighbors and groundtruth must be 2-D arrays, got "
            f"neighbors.shape={neighbors.shape}, groundtruth.shape={groundtruth.shape}"
        )
    n_queries = neighbors.shape[0]
    if groundtruth.shape[0] != n_queries:
        raise ValueError(
            f"Row count mismatch: neighbors has {n_queries} rows, "
            f"groundtruth has {groundtruth.shape[0]} rows"
        )
    gt_k = min(k, groundtruth.shape[1])
    if gt_k == 0 or n_queries == 0:
        return 0.0
    n_correct = sum(
        len(set(neighbors[i, :k].tolist()) & set(groundtruth[i, :gt_k].tolist()))
        for i in range(n_queries)
    )
    return n_correct / (n_queries * gt_k)
```
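A hand-checkable sketch: two queries at k=3, with two hits in the first row and one in the second, so recall is 3 / (2 * 3):

```python
import numpy as np

neighbors = np.array([[1, 2, 3], [4, 5, 6]])    # returned IDs, shape (2, 3)
groundtruth = np.array([[1, 2, 9], [4, 9, 8]])  # true IDs, shape (2, 3)

# Query 0 matches {1, 2}; query 1 matches {4}: 3 hits over 6 slots.
assert compute_recall(neighbors, groundtruth, k=3) == 0.5
```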
**coderabbitai[bot]** commented on lines +185 to +192 (conversation marked as resolved):

Reject negative `k`. Lines 185-192 compute `gt_k = min(k, groundtruth.shape[1])` without validating `k`, so a negative `k` slips past the `gt_k == 0` guard and silently produces a meaningless (negative) recall. Proposed fix:

```diff
 def compute_recall(
     neighbors: np.ndarray, groundtruth: np.ndarray, k: int
 ) -> float:
@@
+    if not isinstance(k, (int, np.integer)):
+        raise ValueError(f"k must be an integer, got {type(k).__name__}")
+    if k < 0:
+        raise ValueError(f"k must be >= 0, got {k}")
+
     n_queries = neighbors.shape[0]
@@
     gt_k = min(k, groundtruth.shape[1])
```

As per coding guidelines, "Provide clear and actionable error messages that include expected vs actual values where helpful."
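A quick sketch of the behavior this guards against: a negative `k` makes `gt_k` negative, the slices silently shrink, and the denominator flips sign:

```python
import numpy as np

neighbors = np.array([[1, 2, 3], [4, 5, 6]])
groundtruth = np.array([[1, 2, 3], [4, 5, 6]])

# Without the fix: gt_k = min(-1, 3) = -1, each slice drops its last column
# (2 matches per query), and the result is 4 / (2 * -1) = -2.0.
print(compute_recall(neighbors, groundtruth, k=-1))  # -2.0, not a valid recall
```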
**Reviewer:**

Why is `_groundtruth_neighbors` treated differently here than `_base_vectors` and `_query_vectors`?

**Author:**

`base_vectors` and `query_vectors` default to None in the constructor but are converted to `np.empty((0, 0))` internally so that the `dims`, `n_base`, and `n_queries` properties can safely call `.shape` and `.size` without None guards. `groundtruth_neighbors` stays as None because it's optional (not every dataset has ground truth), and those properties don't depend on it.

However, I can refactor both to match `groundtruth_neighbors` by defaulting to None instead and introducing a None guard in the `dims`, `n_base`, and `n_queries` properties. This would make all three fields consistent.
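For illustration, a minimal sketch of the two conventions being weighed; the class and property names here are hypothetical, since the Dataset implementation is not part of this hunk:

```python
from typing import Optional

import numpy as np


class Dataset:  # hypothetical sketch, not the PR's actual class
    def __init__(self, base_vectors: Optional[np.ndarray] = None):
        # Convention used for base/query vectors: replace None with an empty
        # 2-D sentinel so shape-based properties need no None guard.
        self._base_vectors = (
            base_vectors if base_vectors is not None else np.empty((0, 0))
        )
        # Convention used for ground truth: keep None, since it is optional
        # and no shape-based property depends on it.
        self._groundtruth_neighbors: Optional[np.ndarray] = None

    @property
    def n_base(self) -> int:
        return self._base_vectors.shape[0]  # sentinel makes this safe

    @property
    def has_groundtruth(self) -> bool:
        return self._groundtruth_neighbors is not None
```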