buildIndex() OOM on large segments: all-at-once inverted list construction requires 2.3x input memory

## Summary

`SeismicIndex::build()` causes OOM on segments with 46M+ documents (5.9B NNZ) on 128GB nodes. The root cause is that `build_inverted_lists()` allocates a full transpose of all vectors before clustering begins, creating a peak memory footprint of ~2.3x the input data size before any useful work starts.

## Environment

- 46M documents per shard, avg 127 non-zeros per document (5.9B total NNZ)
- 128GB RAM nodes (117GB available after JVM overhead)
- Index description: `"idmap,seismic,lambda=-1|beta=-1|alpha=0.40"`
- cluster_ratio=0.1, OMP threads=48
- Linux OOM killer triggered at anon-rss ~127GB
- Occurred on all 3 nodes simultaneously during force merge (each building its own shard)

## Memory Breakdown (verified from source)

### Phase 1: Input vectors (`SparseVectors vectors_`) — 35.6GB resident

```
indptr_:   int32_t × (46M + 1)  =   184MB
indices_:  uint16_t × 5.9B      = 11.8GB
values_:   float(U32) × 5.9B    = 23.6GB
────────────────────────────────────────────
Total:                             35.6GB
```

### Phase 2: `build_inverted_lists()` adds 47.2GB (seismic_common.h:133-135)

`ArrayInvertedLists::build_inverted_lists()` (inverted_lists.cpp:159-180) iterates all 5.9B NNZ and stores each entry as:
- `doc_ids_`: int32_t (4 bytes)
- `codes_`: element_size=4 bytes (U32 float)

Per NNZ: **8 bytes** → 5.9B × 8 = **47.2GB**

Allocated **all at once** in a single sequential loop before clustering starts:

```cpp
// inverted_lists.cpp:170-178
for (size_t i = 0; i < n_docs; ++i) {
    int start = indptr_data[i];
    int n_tokens = indptr_data[i + 1] - indptr_data[i];
    for (size_t j = start; j < start + n_tokens; ++j) {
        term_t term_id = indices_data[j];
        inverted_lists->add_entry(term_id, i, values_data + j * element_size);
    }
}
```

### Peak: 82.8GB BASE before clustering begins

```
vectors_ (stays resident):       35.6GB
+ inverted_lists (all at once): +47.2GB
──────────────────────────────────────────
BASE:                            82.8GB
```

### Phase 3: Clustering loop adds ~40GB more at peak

The `#pragma omp parallel for` loop (seismic_common.h:142-152):
- `prune_and_keep_doc_ids(lambda)` — temporary sort per term
- `RandomKMeans::train()` — may allocate dense centroid vectors per thread
- `summarize()` — output summaries accumulating toward ~27GB
- `invlist.clear()` — frees each term incrementally, but early in the loop nearly all 47GB is still resident

**Observed peak: ~120-130GB** → OOM killed.

## Timeline from Logs

```
17:24:43 — Created native index (46M docs)
17:29:01 — All 24 batches added (5.9B NNZ). Building index...  ← build() starts
17:32:11 — OOM killed (anon-rss: 127,175,828 kB)               ← 3 min later
```

## Root Cause

`build_inverted_lists()` constructs the **complete transpose** of all vectors eagerly, coexisting in memory with the original vectors. This 82.8GB floor leaves only 34GB for clustering on a 117GB-available node — insufficient for parallel clustering temps + output summaries.

The `invlist.clear()` pattern in the clustering loop is a good optimization but insufficient because the full 47GB allocation happens before any term is processed.

## Comparison with Streaming Approach

A Java SEISMIC implementation processes the same 46M docs without OOM by:
1. Reading one term's posting list at a time from an on-disk inverted index
2. Clustering just that term's postings (bounded to lambda ≈ 23K docs, ~368KB)
3. Writing output and freeing memory before processing the next term
4. **Never materializing the full inverted list transpose in memory**

## Suggested Fixes

### Option A: Batch-of-terms approach (best CPU/memory tradeoff)

Process inverted lists in batches (e.g., 1000 terms per batch):
1. Scan vectors once per batch, building inverted lists only for those terms
2. Cluster the batch (can still parallelize within it)
3. Free batch, proceed to next

Memory: vectors (35.6GB) + one batch (~0.5GB) + output (growing) ≈ **60-70GB peak** vs 130GB.

### Option B: Sort-based external transpose

Write (term_id, doc_id, value) triples to temp file, external sort by term_id, read back sequentially.

### Option C: Free `vectors_` after transpose

If `RandomKMeans::train()` and `summarize()` can be refactored to read from inverted list data instead of original vectors, then `vectors_` (35.6GB) could be freed after `build_inverted_lists()`.

### Option D: Memory budget parameter

Allow callers to specify max memory. If estimated peak exceeds budget, automatically fall back to streaming.

## Current Workaround

Using 6 shards (23M docs each, ~60GB peak) instead of 3 shards (46M docs each, ~130GB peak) on 128GB nodes.

## Key Source Files

- `nsparse/seismic_common.h:129-154` — `build_inverted_lists_clusters()`
- `nsparse/invlists/inverted_lists.cpp:159-180` — `ArrayInvertedLists::build_inverted_lists()`
- `nsparse/seismic_index.cpp:130-136` — `SeismicIndex::build()`
- `nsparse/invlists/inverted_lists.h` — `InvertedList` data structures
- `nsparse/sparse_vectors.h` — `SparseVectors` storage layout
- `nsparse/types.h` — `idx_t = int32_t`, `term_t = uint16_t`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

buildIndex() OOM on large segments: all-at-once inverted list construction requires 2.3x input memory #11

Summary

Environment

Memory Breakdown (verified from source)

Phase 1: Input vectors (`SparseVectors vectors_`) — 35.6GB resident

Phase 2: `build_inverted_lists()` adds 47.2GB (seismic_common.h:133-135)

Peak: 82.8GB BASE before clustering begins

Phase 3: Clustering loop adds ~40GB more at peak

Timeline from Logs

Root Cause

Comparison with Streaming Approach

Suggested Fixes

Option A: Batch-of-terms approach (best CPU/memory tradeoff)

Option B: Sort-based external transpose

Option C: Free `vectors_` after transpose

Option D: Memory budget parameter

Current Workaround

Key Source Files

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

buildIndex() OOM on large segments: all-at-once inverted list construction requires 2.3x input memory #11

Description

Summary

Environment

Memory Breakdown (verified from source)

Phase 1: Input vectors (SparseVectors vectors_) — 35.6GB resident

Phase 2: build_inverted_lists() adds 47.2GB (seismic_common.h:133-135)

Peak: 82.8GB BASE before clustering begins

Phase 3: Clustering loop adds ~40GB more at peak

Timeline from Logs

Root Cause

Comparison with Streaming Approach

Suggested Fixes

Option A: Batch-of-terms approach (best CPU/memory tradeoff)

Option B: Sort-based external transpose

Option C: Free vectors_ after transpose

Option D: Memory budget parameter

Current Workaround

Key Source Files

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Phase 1: Input vectors (`SparseVectors vectors_`) — 35.6GB resident

Phase 2: `build_inverted_lists()` adds 47.2GB (seismic_common.h:133-135)

Option C: Free `vectors_` after transpose