Benchmark: SQ uint8 vs float32 on 138M docs (MSMarco V2 SPLADE) #12

@model-collapse

Benchmark Results: SQ uint8 vs Float32

Setup: 3× r6i.4xlarge (16 vCPU, 128 GB RAM each), JVM with 16 GB heap, 1 segment per shard (~45.4M docs each), 136,362,605 docs in total (MSMarco V2 SPLADE).
Queries: 3,903 queries, recall@10 measured against ground truth, top_n=3 (server-side token pruning).
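
For readers who want to reproduce the metric, here is a minimal sketch of recall@k. It assumes the usual definition (the fraction of the exact top-k document IDs that appear in the retrieved top-k, averaged over queries) and that retrieved IDs are unique. The function names and the `uint32_t` ID type are illustrative and are not taken from the benchmark harness.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Recall@k for one query: share of the first k ground-truth IDs that also
// appear among the first k retrieved IDs. Assumes retrieved IDs are unique.
double recall_at_k(const std::vector<std::uint32_t>& retrieved,
                   const std::vector<std::uint32_t>& ground_truth,
                   std::size_t k) {
    assert(k > 0);
    const std::size_t truth_n = std::min(k, ground_truth.size());
    if (truth_n == 0) return 0.0;
    const std::unordered_set<std::uint32_t> truth(
        ground_truth.begin(),
        ground_truth.begin() + static_cast<std::ptrdiff_t>(truth_n));
    const std::size_t ret_n = std::min(k, retrieved.size());
    std::size_t hits = 0;
    for (std::size_t i = 0; i < ret_n; ++i) {
        hits += truth.count(retrieved[i]);
    }
    return static_cast<double>(hits) / static_cast<double>(truth_n);
}

// Mean recall@k over a query set; both inputs hold one ID list per query.
double mean_recall_at_k(
    const std::vector<std::vector<std::uint32_t>>& retrieved,
    const std::vector<std::vector<std::uint32_t>>& ground_truth,
    std::size_t k) {
    assert(retrieved.size() == ground_truth.size());
    if (retrieved.empty()) return 0.0;
    double sum = 0.0;
    for (std::size_t q = 0; q < retrieved.size(); ++q) {
        sum += recall_at_k(retrieved[q], ground_truth[q], k);
    }
    return sum / static_cast<double>(retrieved.size());
}
```

Calling `mean_recall_at_k(results, truth, 10)` over all 3,903 queries would give a number comparable to the recall columns in this report, provided the harness uses the same definition.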

Search: Recall & Latency (top_n=3)

| heap_factor | SQ uint8 recall | float32 recall | recall gap (absolute) | SQ uint8 p50 | float32 p50 | latency delta |
|---|---|---|---|---|---|---|
| 1.03 | 0.8283 | 0.8413 | -1.3% | 10ms | 3ms | +7ms* |
| 1.05 | 0.8428 | 0.8559 | -1.3% | 2ms | 3ms | -1ms |
| 1.07 | 0.8552 | 0.8682 | -1.3% | 2ms | 3ms | -1ms |
| 1.08 | 0.8606 | 0.8736 | -1.3% | 2ms | 3ms | -1ms |
| 1.10 | 0.8700 | 0.8834 | -1.3% | 2ms | 3ms | -1ms |
| 1.15 | 0.8874 | 0.9036 | -1.6% | 2ms | 3ms | -1ms |
| 1.20 | 0.8998 | 0.9172 | -1.7% | 3ms | 3ms | same |
| 1.50 | 0.9262 | 0.9467 | -2.1% | 4ms | 4ms | same |
| 2.00 | 0.9311 | 0.9514 | -2.0% | 6ms | 7ms | -1ms |

*The 10ms SQ uint8 p50 at hf=1.03 is a warmup artifact: the first sweep point pays the loading cost.

Index Build Time (SQ uint8, optimized batched clustering)

| Phase | Duration | Notes |
|---|---|---|
| Lucene segment merge | 34 min | Segment I/O (3 shards in parallel) |
| Batch add (CSR construction) | 1 min | With reserve() pre-allocation (previously 10 min without pre-allocation) |
| K-means clustering (32 threads) | 15 min | Memory-aware batched clustering with per-batch inverted list construction + immediate free |
| Save to disk | 2 min | 32 GB sequential write |
| Total | 51 min | Per-phase figures as listed sum to 52 min |
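
The effect of reserve() on the batch-add phase can be illustrated with a small CSR builder. This is a sketch under stated assumptions, not the code from this build: `CsrBuilder`, `SparseRow`, the `expected_total_rows` hint, and the 10% headroom are all hypothetical. It estimates total NNZ from the first batch and reserves once, so later batches append without reallocating as long as the estimate holds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One sparse document: (term id, weight) pairs. Hypothetical layout.
using SparseRow = std::vector<std::pair<std::uint32_t, float>>;

struct CsrBuilder {
    std::vector<std::size_t> indptr = std::vector<std::size_t>(1, std::size_t{0});
    std::vector<std::uint32_t> indices;
    std::vector<float> values;
    bool reserved = false;

    // Appends a batch of rows. On the first non-empty batch, extrapolates the
    // total NNZ from that batch's average row length and reserves storage once.
    void add_batch(const std::vector<SparseRow>& batch,
                   std::size_t expected_total_rows) {
        if (!reserved && !batch.empty()) {
            std::size_t batch_nnz = 0;
            for (const SparseRow& row : batch) batch_nnz += row.size();
            // Integer extrapolation plus 10% headroom (headroom is an assumption).
            std::size_t est = batch_nnz * expected_total_rows / batch.size();
            est += est / 10;
            indices.reserve(est);
            values.reserve(est);
            indptr.reserve(expected_total_rows + 1);
            reserved = true;
        }
        for (const SparseRow& row : batch) {
            for (const auto& term : row) {
                indices.push_back(term.first);
                values.push_back(term.second);
            }
            indptr.push_back(indices.size());  // end offset of this row
        }
    }
};
```

If the first batch is not representative of the corpus, the estimate will be off and the vectors fall back to ordinary growth, so the worst case matches the behavior without reserve().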

Optimizations applied

  1. Batch add with reserve(): Pre-allocates CSR vector storage based on the total NNZ estimated from the first batch. Cuts batch add from ~10 min to 1 min by eliminating repeated std::vector growth and reallocation.
  2. Memory-aware batched clustering: Reads MemAvailable from /proc/meminfo to determine the batch count. Processes posting lists in batches that fit within available memory and frees each batch's inverted lists immediately after clustering. This prevents the glibc heap fragmentation that previously left ~38 GB of memory that could not be returned to the OS after the build.
  3. release_build_memory() after save: Explicitly releases vectors_ and clustered_inverted_lists immediately after writing to disk (before deleteIndex). This is combined with mallopt(M_MMAP_THRESHOLD, 128 KB), which forces large allocations through mmap so that each one can be reclaimed individually.
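
Optimizations 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions rather than the actual implementation: the function names, the ceiling-division batching rule, and the idea of passing in a byte budget are all hypothetical. The mallopt call is glibc-specific, so it is guarded and reports failure on other C libraries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

#if defined(__GLIBC__)
#include <malloc.h>  // mallopt, M_MMAP_THRESHOLD (glibc only)
#endif

// Finds the "MemAvailable:  <n> kB" line in a meminfo-style stream and
// returns the value in bytes (the kernel's "kB" unit is 1024 bytes).
std::optional<std::uint64_t> parse_mem_available_bytes(std::istream& in) {
    const std::string key = "MemAvailable:";
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, key.size(), key) == 0) {
            std::istringstream fields(line.substr(key.size()));
            std::uint64_t kib = 0;
            if (fields >> kib) return kib * 1024;
            return std::nullopt;
        }
    }
    return std::nullopt;
}

// Convenience wrapper for the real file; returns nullopt if it is missing.
std::optional<std::uint64_t> read_mem_available_bytes() {
    std::ifstream f("/proc/meminfo");
    if (!f) return std::nullopt;
    return parse_mem_available_bytes(f);
}

// Number of batches so that each batch's share of the working set fits in
// budget_bytes (ceiling division, at least one batch).
std::size_t choose_batch_count(std::uint64_t working_set_bytes,
                               std::uint64_t budget_bytes) {
    assert(budget_bytes > 0);
    const std::uint64_t batches = working_set_bytes / budget_bytes +
                                  (working_set_bytes % budget_bytes != 0 ? 1 : 0);
    return static_cast<std::size_t>(std::max<std::uint64_t>(batches, 1));
}

// Pins glibc's mmap threshold so allocations at or above threshold_bytes are
// served by mmap and returned to the OS when freed. Returns true on success.
bool pin_mmap_threshold(int threshold_bytes = 128 * 1024) {
#if defined(__GLIBC__)
    return mallopt(M_MMAP_THRESHOLD, threshold_bytes) == 1;
#else
    (void)threshold_bytes;
    return false;
#endif
}
```

A caller might take some fraction of `read_mem_available_bytes()` as the budget, for example half, and pass the estimated size of all inverted lists as the working set. The right fraction depends on what else the process keeps resident and is not specified in this report.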

Resource Usage

| Metric | SQ uint8 |
|---|---|
| .nsparse file size per shard | 32 GB |
| Post-load RSS (search steady state) | 49.5 GB |
| Peak RSS during build | 103.8 GB |

Build Parameters

  • SQ uint8: idmap,seismic_sq,quantizer=8bit|vmin=0.0|vmax=4.0, quantization_ceiling_search=4.0
  • Float32: idmap,seismic
  • Both: lambda=22,724 (auto), beta=2,272 (auto, 0.1×lambda), alpha=0.4, OMP_THREADS=32
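
To make the quantizer=8bit|vmin=0.0|vmax=4.0 setting concrete, here is a minimal sketch of uniform scalar quantization to uint8. It assumes a plain linear mapping of [vmin, vmax] onto 0..255 with clamping and round-to-nearest. Whether seismic_sq uses exactly this mapping is an assumption, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps a finite weight to one of 256 levels. Values outside [vmin, vmax]
// are clamped, so weights above vmax all collapse to 255.
std::uint8_t quantize_u8(float v, float vmin = 0.0f, float vmax = 4.0f) {
    assert(vmax > vmin);
    const float clamped = std::clamp(v, vmin, vmax);
    const float scaled = (clamped - vmin) / (vmax - vmin) * 255.0f;
    return static_cast<std::uint8_t>(std::lround(scaled));
}

// Reconstructs the approximate weight from its level.
float dequantize_u8(std::uint8_t q, float vmin = 0.0f, float vmax = 4.0f) {
    return vmin + (static_cast<float>(q) / 255.0f) * (vmax - vmin);
}
```

Under this mapping the step size is (vmax - vmin) / 255, about 0.0157 for the range above, so an in-range weight is reconstructed to within about 0.008. Small per-weight errors of this kind are one plausible source of the recall gap reported above.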

Summary

SQ uint8 trades 1.3 points of recall@10 (at hf ≤ 1.10; the gap widens to 1.6–2.1 points at hf ≥ 1.15) for:

  • 33% lower p50 search latency (2ms vs 3ms at hf=1.08)
  • About 48% less disk/RAM (32GB vs 62GB per shard)

Recommended operating points:

  • hf=1.08: 86% recall @ 2ms p50 (best latency)
  • hf=1.15: 89% recall @ 2ms p50 (best recall/latency balance)
