Summary
SeismicIndex::build() causes OOM on segments with 46M+ documents (5.9B NNZ) on 128GB nodes. The root cause is that build_inverted_lists() allocates a full transpose of all vectors before clustering begins, creating a peak memory footprint of ~2.3x the input data size before any useful work starts.
Environment
- 46M documents per shard, avg 127 non-zeros per document (5.9B total NNZ)
- 128GB RAM nodes (117GB available after JVM overhead)
- Index description:
"idmap,seismic,lambda=-1|beta=-1|alpha=0.40"
- cluster_ratio=0.1, OMP threads=48
- Linux OOM killer triggered at anon-rss ~127GB
- Occurred on all 3 nodes simultaneously during force merge (each building its own shard)
Memory Breakdown (verified from source)
Phase 1: Input vectors (SparseVectors vectors_) — 35.6GB resident
indptr_: int32_t × (46M + 1) = 184MB
indices_: uint16_t × 5.9B = 11.8GB
values_: float(U32) × 5.9B = 23.6GB
────────────────────────────────────────────
Total: 35.6GB
Phase 2: build_inverted_lists() adds 47.2GB (seismic_common.h:133-135)
ArrayInvertedLists::build_inverted_lists() (inverted_lists.cpp:159-180) iterates all 5.9B NNZ and stores each entry as:
doc_ids_: int32_t (4 bytes)
codes_: element_size=4 bytes (U32 float)
Per NNZ: 8 bytes → 5.9B × 8 = 47.2GB
Allocated all at once in a single sequential loop before clustering starts:
// inverted_lists.cpp:170-178
for (size_t i = 0; i < n_docs; ++i) {
int start = indptr_data[i];
int n_tokens = indptr_data[i + 1] - indptr_data[i];
for (size_t j = start; j < start + n_tokens; ++j) {
term_t term_id = indices_data[j];
inverted_lists->add_entry(term_id, i, values_data + j * element_size);
}
}
Peak: 82.8GB BASE before clustering begins
vectors_ (stays resident): 35.6GB
+ inverted_lists (all at once): +47.2GB
──────────────────────────────────────────
BASE: 82.8GB
Phase 3: Clustering loop adds ~40GB more at peak
The #pragma omp parallel for loop (seismic_common.h:142-152):
prune_and_keep_doc_ids(lambda) — temporary sort per term
RandomKMeans::train() — may allocate dense centroid vectors per thread
summarize() — output summaries accumulating toward ~27GB
invlist.clear() — frees each term incrementally, but early in the loop nearly all 47GB is still resident
Observed peak: ~120-130GB → OOM killed.
Timeline from Logs
17:24:43 — Created native index (46M docs)
17:29:01 — All 24 batches added (5.9B NNZ). Building index... ← build() starts
17:32:11 — OOM killed (anon-rss: 127,175,828 kB) ← 3 min later
Root Cause
build_inverted_lists() constructs the complete transpose of all vectors eagerly, coexisting in memory with the original vectors. This 82.8GB floor leaves only 34GB for clustering on a 117GB-available node — insufficient for parallel clustering temps + output summaries.
The invlist.clear() pattern in the clustering loop is a good optimization but insufficient because the full 47GB allocation happens before any term is processed.
Comparison with Streaming Approach
A Java SEISMIC implementation processes the same 46M docs without OOM by:
- Reading one term's posting list at a time from an on-disk inverted index
- Clustering just that term's postings (bounded to lambda ≈ 23K docs, ~368KB)
- Writing output and freeing memory before processing the next term
- Never materializing the full inverted list transpose in memory
Suggested Fixes
Option A: Batch-of-terms approach (best CPU/memory tradeoff)
Process inverted lists in batches (e.g., 1000 terms per batch):
- Scan vectors once per batch, building inverted lists only for those terms
- Cluster the batch (can still parallelize within it)
- Free batch, proceed to next
Memory: vectors (35.6GB) + one batch (~0.5GB) + output (growing) ≈ 60-70GB peak vs 130GB.
Option B: Sort-based external transpose
Write (term_id, doc_id, value) triples to temp file, external sort by term_id, read back sequentially.
Option C: Free vectors_ after transpose
If RandomKMeans::train() and summarize() can be refactored to read from inverted list data instead of original vectors, then vectors_ (35.6GB) could be freed after build_inverted_lists().
Option D: Memory budget parameter
Allow callers to specify max memory. If estimated peak exceeds budget, automatically fall back to streaming.
Current Workaround
Using 6 shards (23M docs each, ~60GB peak) instead of 3 shards (46M docs each, ~130GB peak) on 128GB nodes.
Key Source Files
nsparse/seismic_common.h:129-154 — build_inverted_lists_clusters()
nsparse/invlists/inverted_lists.cpp:159-180 — ArrayInvertedLists::build_inverted_lists()
nsparse/seismic_index.cpp:130-136 — SeismicIndex::build()
nsparse/invlists/inverted_lists.h — InvertedList data structures
nsparse/sparse_vectors.h — SparseVectors storage layout
nsparse/types.h — idx_t = int32_t, term_t = uint16_t
Summary
SeismicIndex::build()causes OOM on segments with 46M+ documents (5.9B NNZ) on 128GB nodes. The root cause is thatbuild_inverted_lists()allocates a full transpose of all vectors before clustering begins, creating a peak memory footprint of ~2.3x the input data size before any useful work starts.Environment
"idmap,seismic,lambda=-1|beta=-1|alpha=0.40"Memory Breakdown (verified from source)
Phase 1: Input vectors (
SparseVectors vectors_) — 35.6GB residentPhase 2:
build_inverted_lists()adds 47.2GB (seismic_common.h:133-135)ArrayInvertedLists::build_inverted_lists()(inverted_lists.cpp:159-180) iterates all 5.9B NNZ and stores each entry as:doc_ids_: int32_t (4 bytes)codes_: element_size=4 bytes (U32 float)Per NNZ: 8 bytes → 5.9B × 8 = 47.2GB
Allocated all at once in a single sequential loop before clustering starts:
Peak: 82.8GB BASE before clustering begins
Phase 3: Clustering loop adds ~40GB more at peak
The
#pragma omp parallel forloop (seismic_common.h:142-152):prune_and_keep_doc_ids(lambda)— temporary sort per termRandomKMeans::train()— may allocate dense centroid vectors per threadsummarize()— output summaries accumulating toward ~27GBinvlist.clear()— frees each term incrementally, but early in the loop nearly all 47GB is still residentObserved peak: ~120-130GB → OOM killed.
Timeline from Logs
Root Cause
build_inverted_lists()constructs the complete transpose of all vectors eagerly, coexisting in memory with the original vectors. This 82.8GB floor leaves only 34GB for clustering on a 117GB-available node — insufficient for parallel clustering temps + output summaries.The
invlist.clear()pattern in the clustering loop is a good optimization but insufficient because the full 47GB allocation happens before any term is processed.Comparison with Streaming Approach
A Java SEISMIC implementation processes the same 46M docs without OOM by:
Suggested Fixes
Option A: Batch-of-terms approach (best CPU/memory tradeoff)
Process inverted lists in batches (e.g., 1000 terms per batch):
Memory: vectors (35.6GB) + one batch (~0.5GB) + output (growing) ≈ 60-70GB peak vs 130GB.
Option B: Sort-based external transpose
Write (term_id, doc_id, value) triples to temp file, external sort by term_id, read back sequentially.
Option C: Free
vectors_after transposeIf
RandomKMeans::train()andsummarize()can be refactored to read from inverted list data instead of original vectors, thenvectors_(35.6GB) could be freed afterbuild_inverted_lists().Option D: Memory budget parameter
Allow callers to specify max memory. If estimated peak exceeds budget, automatically fall back to streaming.
Current Workaround
Using 6 shards (23M docs each, ~60GB peak) instead of 3 shards (46M docs each, ~130GB peak) on 128GB nodes.
Key Source Files
nsparse/seismic_common.h:129-154—build_inverted_lists_clusters()nsparse/invlists/inverted_lists.cpp:159-180—ArrayInvertedLists::build_inverted_lists()nsparse/seismic_index.cpp:130-136—SeismicIndex::build()nsparse/invlists/inverted_lists.h—InvertedListdata structuresnsparse/sparse_vectors.h—SparseVectorsstorage layoutnsparse/types.h—idx_t = int32_t,term_t = uint16_t