Add JNI layer and core library improvements for OpenSearch integration #13
Open
model-collapse wants to merge 1 commit into
Adds JNI bindings for the native sparse vector library to integrate with OpenSearch's neural-search plugin.

Key capabilities:
- Index lifecycle: create, add vectors (with IDs), build, save, load, delete
- Search: dense-scoring SEISMIC and SEISMIC-SQ (scalar quantized) search
- Memory management: streaming build_and_save, release_build_memory, malloc_trim
- Page cache eviction: madvise(MADV_DONTNEED) via /proc/self/maps scanning

Core library changes for production-scale operation (138M+ documents):
- int64 offsets (offset_t) to handle >2B non-zeros in CSR indptr
- Streaming build_and_save: writes clusters batch-by-batch to reduce peak RSS
- Scalar quantizer improvements: configurable quantization ceilings
- IDMapIndex: add_with_ids support for external document ID mapping
- K-means clustering: OpenMP parallelization for 32-thread builds
- Distance functions: unified SIMD dispatch across AVX2/AVX512/NEON/SVE

Validated at 138M docs (3 shards × 46M), uint8 SQ SEISMIC:
- Build time: 12.7 min per shard (32 threads)
- Peak RSS: 59-61 GB during force merge (on 128 GB nodes)
- Search: 3 ms p50 at heap_factor=1.08, top_n=3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
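
To illustrate why the commit message calls for int64 offsets in the CSR indptr array, here is a minimal sketch. It is not the library's code: the `offset_t` alias only mirrors the name used in the commit message, and `build_indptr`, `row_length`, and `exceeds_int32` are hypothetical helpers. The point is that indptr stores a running total of non-zeros, so its last entry passes the signed 32-bit limit (2,147,483,647) as soon as a shard holds more than about 2.1 billion non-zeros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical alias mirroring the offset_t named in the commit message.
using offset_t = std::int64_t;

// Build a CSR row-pointer array from per-row non-zero counts.
// indptr[i] is the offset of row i's first non-zero; indptr.back() is the total.
std::vector<offset_t> build_indptr(const std::vector<std::uint32_t>& row_nnz) {
    std::vector<offset_t> indptr(row_nnz.size() + 1, 0);
    for (std::size_t i = 0; i < row_nnz.size(); ++i) {
        // Accumulate in 64 bits so the running total cannot wrap at 2^31.
        indptr[i + 1] = indptr[i] + static_cast<offset_t>(row_nnz[i]);
    }
    return indptr;
}

// Number of non-zeros stored for one row.
offset_t row_length(const std::vector<offset_t>& indptr, std::size_t row) {
    assert(row + 1 < indptr.size());
    return indptr[row + 1] - indptr[row];
}

// True when the total no longer fits in a signed 32-bit offset.
bool exceeds_int32(offset_t total_nnz) {
    return total_nnz > std::numeric_limits<std::int32_t>::max();
}
```

With three rows of one billion non-zeros each, `build_indptr` ends at 3,000,000,000, which `exceeds_int32` flags; the same accumulation in a 32-bit signed type would overflow.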
Summary
- JNI layer (jni/) for integrating the native sparse vector library with OpenSearch's neural-search plugin

Changes
JNI Layer (jni/nsparse_jni.cpp)
- Index lifecycle: createIndex, addVectors, addVectorsWithIds, buildIndex, buildAndSaveIndex, saveIndex, loadIndex, deleteIndex
- Search: search (SEISMIC) and searchSQ (scalar-quantized SEISMIC)
- Memory management: release_build_memory() + malloc_trim() to return RSS after build
- Page cache eviction: evictPageCache using madvise(MADV_DONTNEED) via /proc/self/maps scanning with dynamic buffer (getline) and error diagnostics

Core Library
- int64 offsets (offset_t): handles >2B non-zeros in CSR indptr arrays (required at 45M+ docs per shard)
- Streaming build_and_save: writes clusters batch-by-batch to reduce peak RSS from ~104 GB to ~50 GB
- Scalar quantizer: configurable quantization_ceiling parameter for ingest and search paths
- IDMapIndex: add_with_ids support for external document ID mapping

Tests
- IDMapIndex tests for add_with_ids and save/load round-trip

Benchmark Results (138M docs, 3 nodes × 46M docs/shard, uint8 SQ)
Force Merge (Index Build)
Search Latency (3903 queries, k=10, top_n=3)
Test plan
- cmake -S . -B build -DNSPARSE_ENABLE_TESTS=ON && cmake --build build -j && ctest --test-dir build
- cmake -S . -B build -DNSPARSE_ENABLE_JNI=ON && cmake --build build --target nsparse_jni

🤖 Generated with Claude Code
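
The page cache eviction item in this PR relies on scanning /proc/self/maps. The sketch below covers only the parsing step, under stated assumptions: `MapsEntry`, `parse_maps_line`, and `mapped_under` are hypothetical names, not functions from jni/nsparse_jni.cpp, and the sketch takes one line as a string instead of reading the file with getline. It relies on the standard Linux maps layout of `start-end perms offset dev inode [path]`. A caller could then pass each matching `[start, end)` range to `madvise(..., MADV_DONTNEED)`; that call is deliberately left out here.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// One parsed line of /proc/self/maps (hypothetical struct for illustration).
struct MapsEntry {
    std::uint64_t start = 0;  // first address of the mapping
    std::uint64_t end = 0;    // one past the last address
    std::string perms;        // e.g. "r--s"
    std::string path;         // empty for anonymous mappings
};

// Parse "start-end perms offset dev inode [path]". Returns false on malformed input.
bool parse_maps_line(const std::string& line, MapsEntry& out) {
    std::istringstream in(line);
    std::string range, perms, offset, dev, inode;
    if (!(in >> range >> perms >> offset >> dev >> inode)) {
        return false;  // fewer than five whitespace-separated fields
    }
    const std::size_t dash = range.find('-');
    if (dash == std::string::npos) {
        return false;
    }
    try {
        // Addresses are printed as hexadecimal without a 0x prefix.
        out.start = std::stoull(range.substr(0, dash), nullptr, 16);
        out.end = std::stoull(range.substr(dash + 1), nullptr, 16);
    } catch (const std::exception&) {
        return false;  // non-hex or out-of-range address
    }
    out.perms = perms;
    out.path.clear();
    // The rest of the line, if any, is the path (it may contain spaces).
    std::getline(in >> std::ws, out.path);
    return out.start < out.end;
}

// True when the mapping is file-backed and its path starts with the given prefix.
bool mapped_under(const MapsEntry& entry, const std::string& prefix) {
    return !prefix.empty() && entry.path.compare(0, prefix.size(), prefix) == 0;
}
```

Keeping the parser separate from the eviction call means the address arithmetic can be unit-tested on fixed strings without touching the process's real mappings.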