Skip to content

perf(tsdb/index): allocation-free postings offset table scan#5159

Merged
simonswine merged 2 commits into
grafana:mainfrom
simonswine:20260514_index-no-allocs-offset-table
May 15, 2026
Merged

perf(tsdb/index): allocation-free postings offset table scan#5159
simonswine merged 2 commits into
grafana:mainfrom
simonswine:20260514_index-no-allocs-offset-table

Conversation

@simonswine
Copy link
Copy Markdown
Contributor

@simonswine simonswine commented May 15, 2026

Summary

newReader spent ~96% of its allocations inside ReadOffsetTable: every entry materialised a []string plus two string clones for name and value, even for the ~31/32 entries that symbolFactor sampling immediately discards.

This PR introduces readPostingsOffsetTable, a sibling of ReadOffsetTable that yields (name, value []byte) slices aliasing the mmap buffer instead of allocating strings. The newReader callback:

  • uses yoloString for map lookups (zero allocation),
  • defers string(name) / string(value) conversion until an entry is actually retained, and
  • tracks the "last value" candidate as raw byte-slice aliases, cloning only at flush time.

Impact

This path represents 66% of the allocation counts in query-backends (and a very small amount of compaction workers)

image

Benchmark

BenchmarkNewReader with 10 label names × 200 values each:

             │      sec/op       │      sec/op      vs base  │
NewReader-12      305.1µ ± 1%       124.2µ ± 1%    -59.30%

             │       B/op        │       B/op       vs base  │
NewReader-12      270.7Ki ± 0%      169.8Ki ± 0%   -37.30%

             │     allocs/op     │   allocs/op    vs base    │
NewReader-12       6075.0 ± 0%       244.0 ± 0%   -95.98%

Allocation count drops from 6 075 → 244 (−96%), wall time from 305 µs → 124 µs (−59%).

Notes

Test plan

  • go test ./pkg/phlaredb/tsdb/index/... -run . -bench BenchmarkNewReader passes and shows regression-free numbers
  • Existing index reader tests (TestNewReader, table-driven cases) pass unchanged

Note

Medium Risk
Touches TSDB index reading/parsing and introduces more buffer-aliasing ([]byte/yoloString) to avoid allocations, which could cause subtle lifetime/aliasing bugs if misused despite being localized and covered by tests/bench.

Overview
Reduces NewReader startup allocations/time by replacing the generic ReadOffsetTable postings scan with an allocation-free readPostingsOffsetTable that returns (name, value []byte) slices aliasing the index buffer, and only materializes string values for sampled/required entries (including correct handling of the per-label “last value” entry).

Updates PostingsRanges to use the new postings-table iterator, adjusts the label-indices test to decode the table directly, and adds BenchmarkNewReader to track open-index performance/regressions.

Reviewed by Cursor Bugbot for commit e07c1e0. Bugbot is set up for automated code reviews on this repo. Configure here.

newReader spent ~96% of its allocations in ReadOffsetTable: each entry
allocated a []string plus string clones for name and value, even for
the ~31/32 entries discarded by symbolFactor sampling.

Switch to readPostingsOffsetTable which yields (name, value []byte)
aliasing the index buffer. The callback uses yoloString for map lookups,
defers string conversion until an entry is retained, and tracks the
"last" candidate as raw byte-slice aliases — cloning only at flush time.

BenchmarkNewReader (10 label names × 200 values):

             │      sec/op       │   sec/op     vs base  │
NewReader-12      305.1µ           124.2µ ± 1%  -59.30%

             │       B/op        │     B/op     vs base  │
NewReader-12      270.7Ki          169.8Ki ± 0%  -37.30%

             │     allocs/op     │ allocs/op  vs base    │
NewReader-12       6075.0          244.0 ± 0%   -95.98%
aleks-p
aleks-p previously approved these changes May 15, 2026
Copy link
Copy Markdown
Contributor

@aleks-p aleks-p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement!

Comment thread pkg/phlaredb/tsdb/index/index.go Outdated
}

var lastKey []string
// lastNameB/lastValueB alias the mmap buffer and are only converted to
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, maybe we can remove the mention of mmap here to avoid confusion, it looks like the data is always heap allocated (and it shouldn't matter in general)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely coming from the prometheus context 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e07c1e0

Address review feedback — the reader uses heap-allocated buffers, not
mmap, so the term was misleading.
@simonswine simonswine merged commit 8c1d10c into grafana:main May 15, 2026
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants