Skip to content

buildIndex() OOM on large segments: all-at-once inverted list construction requires 2.3x input memory #11

Description

@model-collapse

Summary

SeismicIndex::build() causes OOM on segments with 46M+ documents (5.9B NNZ) on 128GB nodes. The root cause is that build_inverted_lists() allocates a full transpose of all vectors before clustering begins, creating a peak memory footprint of ~2.3x the input data size before any useful work starts.

Environment

  • 46M documents per shard, avg 127 non-zeros per document (5.9B total NNZ)
  • 128GB RAM nodes (117GB available after JVM overhead)
  • Index description: "idmap,seismic,lambda=-1|beta=-1|alpha=0.40"
  • cluster_ratio=0.1, OMP threads=48
  • Linux OOM killer triggered at anon-rss ~127GB
  • Occurred on all 3 nodes simultaneously during force merge (each building its own shard)

Memory Breakdown (verified from source)

Phase 1: Input vectors (SparseVectors vectors_) — 35.6GB resident

indptr_:   int32_t × (46M + 1)  =   184MB
indices_:  uint16_t × 5.9B      = 11.8GB
values_:   float(U32) × 5.9B    = 23.6GB
────────────────────────────────────────────
Total:                             35.6GB

Phase 2: build_inverted_lists() adds 47.2GB (seismic_common.h:133-135)

ArrayInvertedLists::build_inverted_lists() (inverted_lists.cpp:159-180) iterates all 5.9B NNZ and stores each entry as:

  • doc_ids_: int32_t (4 bytes)
  • codes_: element_size=4 bytes (U32 float)

Per NNZ: 8 bytes → 5.9B × 8 = 47.2GB

Allocated all at once in a single sequential loop before clustering starts:

// inverted_lists.cpp:170-178
for (size_t i = 0; i < n_docs; ++i) {
    int start = indptr_data[i];
    int n_tokens = indptr_data[i + 1] - indptr_data[i];
    for (size_t j = start; j < start + n_tokens; ++j) {
        term_t term_id = indices_data[j];
        inverted_lists->add_entry(term_id, i, values_data + j * element_size);
    }
}

Peak: 82.8GB BASE before clustering begins

vectors_ (stays resident):       35.6GB
+ inverted_lists (all at once): +47.2GB
──────────────────────────────────────────
BASE:                            82.8GB

Phase 3: Clustering loop adds ~40GB more at peak

The #pragma omp parallel for loop (seismic_common.h:142-152):

  • prune_and_keep_doc_ids(lambda) — temporary sort per term
  • RandomKMeans::train() — may allocate dense centroid vectors per thread
  • summarize() — output summaries accumulating toward ~27GB
  • invlist.clear() — frees each term incrementally, but early in the loop nearly all 47GB is still resident

Observed peak: ~120-130GB → OOM killed.

Timeline from Logs

17:24:43 — Created native index (46M docs)
17:29:01 — All 24 batches added (5.9B NNZ). Building index...  ← build() starts
17:32:11 — OOM killed (anon-rss: 127,175,828 kB)               ← 3 min later

Root Cause

build_inverted_lists() constructs the complete transpose of all vectors eagerly, coexisting in memory with the original vectors. This 82.8GB floor leaves only 34GB for clustering on a 117GB-available node — insufficient for parallel clustering temps + output summaries.

The invlist.clear() pattern in the clustering loop is a good optimization but insufficient because the full 47GB allocation happens before any term is processed.

Comparison with Streaming Approach

A Java SEISMIC implementation processes the same 46M docs without OOM by:

  1. Reading one term's posting list at a time from an on-disk inverted index
  2. Clustering just that term's postings (bounded to lambda ≈ 23K docs, ~368KB)
  3. Writing output and freeing memory before processing the next term
  4. Never materializing the full inverted list transpose in memory

Suggested Fixes

Option A: Batch-of-terms approach (best CPU/memory tradeoff)

Process inverted lists in batches (e.g., 1000 terms per batch):

  1. Scan vectors once per batch, building inverted lists only for those terms
  2. Cluster the batch (can still parallelize within it)
  3. Free batch, proceed to next

Memory: vectors (35.6GB) + one batch (~0.5GB) + output (growing) ≈ 60-70GB peak vs 130GB.

Option B: Sort-based external transpose

Write (term_id, doc_id, value) triples to temp file, external sort by term_id, read back sequentially.

Option C: Free vectors_ after transpose

If RandomKMeans::train() and summarize() can be refactored to read from inverted list data instead of original vectors, then vectors_ (35.6GB) could be freed after build_inverted_lists().

Option D: Memory budget parameter

Allow callers to specify max memory. If estimated peak exceeds budget, automatically fall back to streaming.

Current Workaround

Using 6 shards (23M docs each, ~60GB peak) instead of 3 shards (46M docs each, ~130GB peak) on 128GB nodes.

Key Source Files

  • nsparse/seismic_common.h:129-154build_inverted_lists_clusters()
  • nsparse/invlists/inverted_lists.cpp:159-180ArrayInvertedLists::build_inverted_lists()
  • nsparse/seismic_index.cpp:130-136SeismicIndex::build()
  • nsparse/invlists/inverted_lists.hInvertedList data structures
  • nsparse/sparse_vectors.hSparseVectors storage layout
  • nsparse/types.hidx_t = int32_t, term_t = uint16_t

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions