
Track quicklist memory incrementally via zmalloc_usable capture#2

Draft
liorsve wants to merge 6 commits into unstable from min-overhead-qicklist-mem-tracking

Conversation


@liorsve liorsve commented Mar 2, 2026

Summary

Add incremental memory tracking to quicklist by capturing the allocation sizes already computed internally by zmalloc_usable/zrealloc_usable. This trades 8 bytes per quicklistNode and 8 bytes per quicklist for eliminating zmalloc_size() calls on mutation hot paths (LPUSH, RPUSH, LPOP, RPOP, LINSERT, LREM, LTRIM, LSET). Also adds objectComputeSizeWithTrackedSize() as a parallel to objectComputeSize() that uses the tracked size for quicklists, computing LIST memory in $O(1)$ instead of iterating all nodes.


Struct changes

quicklistNode — added size_t entry_alloc_sz

Stores the usable allocation size of the node's entry buffer (listpack or plain data).

quicklist — added size_t tracked_size

Running total of all memory allocated for the quicklist: sizeof(quicklist) + sum of sizeof(quicklistNode) + entry_alloc_sz for every node. Maintained incrementally on insert, delete, compress, decompress, merge, split, and replace.


How allocation size capture works

zmalloc_usable() and zrealloc_usable() already call zmalloc_size() internally to update jemalloc memory stats — the result was previously discarded. We now capture it instead of throwing it away.

There are three different capture paths depending on the operation:

  1. Listpack mutations (push, pop, delete, insert, replace, merge, split, delete-range):
    The listpack allocator macros lp_malloc/lp_realloc in listpack_malloc.h now pass &lp_last_alloc_size instead of NULL to zmalloc_usable/zrealloc_usable. After any listpack operation, quicklist.c reads the captured size via lpLastAllocSize() and updates the node's entry_alloc_sz and the quicklist's tracked_size through the quicklistTrackEntryResize() macro.

  2. Compress/decompress:
    __quicklistCompressNode calls zrealloc_usable() directly with &new_entry_alloc_sz when reallocating the LZF buffer. __quicklistDecompressNode calls zmalloc_usable() directly with &new_entry_alloc_sz when allocating the decompressed buffer. Both compute the delta against the old entry_alloc_sz and update tracked_size.

    Note: We cannot reuse the existing sz field in quicklistNode because sz stores the logical uncompressed data size, while entry_alloc_sz reflects the actual jemalloc allocation (which differs due to size-class rounding, and changes entirely when the entry is compressed into an LZF struct).

  3. Plain node creation (__quicklistCreateNode with QUICKLIST_NODE_CONTAINER_PLAIN):
    Calls zmalloc_usable() directly with &new_node->entry_alloc_sz.


Remaining zmalloc_size() call sites

Two RDB load paths still call zmalloc_size() because they receive pre-allocated buffers from the RDB loader (not through the listpack allocator):

  • quicklistAppendListpack — receives a listpack allocated during RDB deserialization
  • quicklistAppendPlainNode — receives a plain data buffer allocated during RDB deserialization

Listpack no-realloc paths

When a listpack operation skips reallocation (same-size replacement, growth within jemalloc slack, no-op delete/range), lp_last_alloc_size is explicitly set to lp_malloc_size(lp) so that lpLastAllocSize() returns the current allocation size. This is done in:

  • lpInsert — growth fits within jemalloc slack, or same-size replacement
  • lpShrinkToFit — allocation already at minimum
  • lpDeleteRange / lpDeleteRangeWithEntry — early returns on num == 0 or seek failure

Tradeoff

This approach adds 8 bytes per quicklistNode (entry_alloc_sz) and 8 bytes per quicklist (tracked_size) to avoid calling zmalloc_size() on mutation hot paths. Instead of two zmalloc_size() calls per operation (before/after), allocation sizes are captured for free from zmalloc_usable/zrealloc_usable which already compute them internally.

The implementation is more involved than a simple zmalloc_size() approach: the listpack layer must set lp_last_alloc_size in every no-realloc code path (same-size replacement, jemalloc slack, no-op deletes) to prevent stale values, and every new allocation path (compress, decompress, plain nodes) must explicitly capture its size.

Files changed

  • src/quicklist.h: Added entry_alloc_sz to quicklistNode, tracked_size to quicklist
  • src/quicklist.c: Added quicklistTrackEntryResize() macro; updated all mutation sites to maintain tracked_size; threaded quicklist* through compress/decompress for tracking deltas
  • src/listpack_malloc.h: Changed lp_malloc/lp_realloc to capture usable size into lp_last_alloc_size
  • src/listpack.c: Added lpLastAllocSize() getter; updated no-realloc paths to set lp_last_alloc_size
  • src/listpack.h: Declared lpLastAllocSize()
  • src/object.c: Added objectComputeSizeWithTrackedSize() that uses tracked_size for quicklists
  • src/unit/test_quicklist_tracking.cpp: 23 gtest cases validating tracked_size matches objectComputeSize() across all mutation types

liorsve added 6 commits March 1, 2026 21:40
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
@liorsve liorsve changed the title Min overhead qicklist mem tracking Track quicklist memory incrementally via zmalloc_usable capture Mar 2, 2026

ranshid commented Mar 17, 2026

I've been thinking about a simpler alternative that avoids modifying the listpack allocator, avoids adding entry_alloc_sz per node, and avoids all the no-realloc edge case handling — while still achieving O(1) memory tracking with zero zmalloc_size() calls on hot paths.

Key insight

Every place node->sz changes already goes through one of two patterns:

  1. The quicklistNodeUpdateSz(node) macro (14 call sites)
  2. Direct node->sz = value assignment (9 call sites — mostly node creation before linking)

And every node enters/leaves the quicklist through:

  • __quicklistInsertNode() (insert)
  • __quicklistDelNode() (delete)

We can track the logical data size (sum of node->sz across all nodes) instead of the allocator-level size, using only these existing choke points.

The change

1. Add one field to quicklist:

typedef struct quicklist {
    quicklistNode *head;
    quicklistNode *tail;
    unsigned long count;
    unsigned long len;
    signed int fill : QL_FILL_BITS;
    unsigned int compress : QL_COMP_BITS;
    unsigned int bookmark_count : QL_BM_BITS;
    size_t tracked_data_bytes;  /* NEW: running sum of node->sz for all nodes */
    quicklistBookmark bookmarks[];
} quicklist;

2. Track on node link/unlink:

static void __quicklistInsertNode(quicklist *quicklist, ..., quicklistNode *new_node, ...) {
    /* ... existing linking code ... */
    quicklist->len++;
    quicklist->tracked_data_bytes += new_node->sz;  /* NEW */
}

static void __quicklistDelNode(quicklist *quicklist, quicklistNode *node) {
    /* ... existing unlinking code ... */
    quicklist->tracked_data_bytes -= node->sz;  /* NEW */
    quicklist->len--;
    /* ... rest of function ... */
}

3. Extend the macro to track deltas on in-place mutations:

#define quicklistNodeUpdateSz(ql, node)                              \
    do {                                                             \
        size_t _old_sz = (node)->sz;                                 \
        (node)->sz = lpBytes((node)->entry);                         \
        (ql)->tracked_data_bytes += (node)->sz - _old_sz;            \
    } while (0)

All 14 call sites of quicklistNodeUpdateSz already have the quicklist * pointer available — this is a mechanical update.

4. One companion macro for the single direct-assignment case on a linked node:

#define quicklistNodeSetSz(ql, node, new_sz)                         \
    do {                                                             \
        (ql)->tracked_data_bytes += (new_sz) - (node)->sz;           \
        (node)->sz = (new_sz);                                       \
    } while (0)

This is only needed in quicklistReplaceEntry() where a plain node's entry->node->sz = sz is assigned on an already-linked node (line 742). All other direct node->sz = X assignments happen on nodes before they are linked via __quicklistInsertNode, so the insert hook picks up the correct value automatically.

Why this works for every mutation site

| Site | Current code | How it's tracked |
| --- | --- | --- |
| quicklistCreateNode() | node->sz = 0 | Node not linked yet — no tracking needed |
| __quicklistCreateNode() | new_node->sz = sz | Node not linked yet — __quicklistInsertNode adds it |
| quicklistPushHead/Tail | quicklistNodeUpdateSz(head/node) | Extended macro tracks delta ✓ |
| quicklistAppendListpack | node->sz = lpBytes(zl) then insert | Insert hook adds node->sz |
| quicklistAppendPlainNode | node->sz = sz then insert | Insert hook adds node->sz |
| quicklistDelIndex | quicklistNodeUpdateSz(node) or __quicklistDelNode | Macro tracks delta / delete hook subtracts ✓ |
| quicklistDelEntry | calls quicklistDelIndex | Same as above ✓ |
| quicklistReplaceEntry | quicklistNodeUpdateSz(entry->node) or entry->node->sz = sz | Macro or quicklistNodeSetSz |
| _quicklistMergeNodes | quicklistNodeUpdateSz(keep) + __quicklistDelNode(nokeep) | Both tracked ✓ |
| _quicklistSplitNode | quicklistNodeUpdateSz(node) + quicklistNodeUpdateSz(new_node) | Both tracked (new_node linked after) ✓ |
| _quicklistInsert (all branches) | quicklistNodeUpdateSz(node/new_node) | Extended macro ✓ |
| quicklistDup | node->sz = current->sz then insert into copy | Insert hook on copy adds it ✓ |

Comparison with the current PR approach

| Aspect | This PR (#2) | Alternative |
| --- | --- | --- |
| New fields on quicklistNode | entry_alloc_sz (8 bytes per node) | None |
| New fields on quicklist | tracked_size (8 bytes) | tracked_data_bytes (8 bytes) |
| Listpack changes | Modified lp_malloc/lp_realloc macros, added lp_last_alloc_size global, added lpLastAllocSize() | None |
| What's tracked | Allocator-level size (jemalloc usable size) | Logical data size (node->sz = uncompressed entry bytes) |
| zmalloc_size() calls | 2 remaining (RDB load paths) | Zero |
| No-realloc edge cases | Must handle same-size replacement, jemalloc slack, no-op deletes in listpack | None — lpBytes() always returns the correct value |
| Compression handling | Must track compressed vs uncompressed allocation sizes separately | Transparent — node->sz always stores uncompressed size |
| Lines changed | ~300+ across quicklist.c, listpack.c, listpack_malloc.h, listpack.h, object.c | ~30 in quicklist.c + quicklist.h |

What we track and what we don't

We track the logical uncompressed data size — the sum of node->sz for all nodes. This is the raw user data. It does NOT include:

  • sizeof(quicklist) — fixed, known at compile time
  • sizeof(quicklistNode) * ql->len — computable in O(1) from existing fields
  • Allocator rounding / jemalloc size classes
  • Compression savings (LZF compressed size < uncompressed)

For per-slot memory reporting, the metadata portion (sizeof(quicklist) + ql->len * sizeof(quicklistNode)) is trivially computable from existing fields. The data portion is ql->tracked_data_bytes. Both are O(1).

Note on accuracy

The logical size is arguably more useful than the allocator size for per-slot memory balancing. It represents the actual user data volume and is deterministic — it doesn't vary based on jemalloc version, size classes, or compression settings. Two nodes with the same data will report the same size regardless of whether they're compressed or which allocator is in use.

