Skip to content

Fixing pg_buffercache tests#2

Open
palak-chaturvedi wants to merge 8 commits intoashutosh-bapat:dev/shbuf_resizefrom
palak-chaturvedi:dev/resizebuffer
Open

Fixing pg_buffercache tests#2
palak-chaturvedi wants to merge 8 commits intoashutosh-bapat:dev/shbuf_resizefrom
palak-chaturvedi:dev/resizebuffer

Conversation

@palak-chaturvedi
Copy link

This PR fixes the code that was failing pg_buffercache tests.

ashutosh-bapat and others added 6 commits November 25, 2025 13:58
The view exposes the contents of the shared buffer lookup table for
debugging, testing and investigation.

This helped me in debugging issues where the buffer descriptor array and
buffer lookup table were out of sync; either the buffer lookup table had
a mapping page->buffer which wasn't present in the buffer descriptor
array or a page in the buffer descriptor array didn't have corresponding
entry in the buffer lookup table. pg_buffercache doesn't help with those
kind of issues. Also doing that under the debugger in very painful.

I intend to keep this patch while the rest of the code matures. If it is
found useful as a debugging tool, we may consider make it committable
and commit it.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
This has three changes

1. Allow to use multiple shared memory mappings
============================================

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits ways how the shared memory could be organized.

Introduce possibility to allocate multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment.
A new shared memory API is introduced, extended with a segment as a new
parameter. As a path of least resistance, the original API is kept in
place, utilizing the main shared memory segment.

Modifies pg_shmem_allocations to report shared memory segment as well.
Adds pg_shmem_segments to report shared memory segment information.

2. Address space reservation for shared memory
============================================

Currently the shared memory layout is designed to pack everything tight
together, leaving no space between mappings for resizing. Here is how it
looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
anonymous shared memory we talk about:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    ...

Make the layout more dynamic via splitting every shared memory segment
into two parts:

* An anonymous file, which actually contains shared memory content.
  Such an anonymous file is created via memfd_create, it lives in
  memory, behaves like a regular file and semantically equivalent to an
  anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, which size is much larger than required shared
  segment size. This mapping is created with flag MAP_NORESERVE (to not
  count the reserved space against memory limits). The anonymous file is
  mapped into this reservation mapping.

If we have to change the address maps while resizing the shared buffer
pool, it is needed to be done in Postmaster too, so that the new
backends will inherit the resized address space from the Postmaster.
However, Postmaster is not invovled in ProcSignalBarrier mechanism and
we don't want it to spend time in things other than its core
functionality. To achive that, maximum required address space maps are
setup upfront with read and write access when starting the server. When
resizing the buffer pool only the backing file object is resized from
the coordinator. This also makes the ProcSignalBarrier handling code
light for backends other than the coordinator.

The resulting layout looks like this:

    00400000-00490000         /path/bin/postgres
    ...
    3f526000-3f590000 rw-p 		[heap]
    7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted) -- anon file
    7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted) -- reservation
    7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
    7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34

To resize a shared memory segment in this layout it's possible to use
ftruncate on the memory mapped file.

This approach also do not impact the actual memory usage as reported by
the kernel.

TODO: Verify that Cgroup v2 doesn't have any problems with that as well. To verify a new cgroup
was created with the memory limit 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has being successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has being successfully launched from that shell
    $ cat memory.current
    20770816 (~19.8 MB)

There are also few unrelated advantages of using memory mapped files:

* We've got a file descriptor, which could be used for regular file
  operations (modification, truncation, you name it).

* The file could be given a name, which improves readability when it
  comes to process maps.

* By default, Linux will not add file-backed shared mappings into a core dump,
  making it more convenient to work with them in PostgreSQL: no more huge dumps
  to process. - Some hackers have expressed concerns over it.

The downside is that memfd_create is Linux specific.

3. Refactor CalculateShmemSize()
=============================

This function calls many functions which return the amount of shared
memory required for different shared memory data structures. Up until
now, the returned total of these sizes was used to create a single
shared memory segment. With this change, CalculateShmemSize() needs to
estimate memory requirements for each of the segments. It now takes an
array of MemoryMappingSizes, containing as many elements as the number
of segments, as an argument. The sizes returned by all the function it
calls, except BufferManagerShmemSize(), are added and saved in the first
element (index 0) of the array.  BufferManagerShmemSize() is modified to
save the amount of memory required for buffer manager related segments
in the corresponding array element. Additionally it also saves the
amount of reserved space. For now, the amount of reserved address space
is same as the amount of required memory but that is expected to change
with the next commit which implements buffer pool resize.
CalculateShmemSize() now returns the total of sizes corresponding to all
the sizes.

Author: Dmitrii Dolgov and Ashutosh Bapat
Reviewed-by: Tomas Vondra
shared_buffers is now PGC_SIGHUP instead of PGC_POSTMASTER.  The value
of this GUC is saved in NBuffersPending instead of NBuffers. When the
server starts, the shared memory size is estimated and the memory is
allocated using NBuffersPending.

When a server is running, the new value of GUC (set using ALTER SYSTEM
... SET shared_buffers = ...; followed by SELECT pg_reload_conf()) does
not come into effect immediately. Instead a function
pg_resize_shared_buffers() is used to resize the buffer pool. The
function uses the current value of GUC in the backends where it is
executed. The function also coordinates the buffer resizing
synchronization across backends.

SHOW shared_buffers now shows the current size of the shared buffer pool
but it also shows pending size of shared buffers, if any.

A new GUC max_shared_buffers is introduced to control the maximum value
of shared_buffers that can be set. By default it is 0. When explicitly
set it needs to be higher than 'shared_buffers'. When max_shared_buffers
is set to 0, it assumes the same value as GUC shared_buffers. This GUC
determines the size of address space reserved for future buffer pool
sizes and the size of buffer look up table.

TBD: Describe the protocol used by pg_resize_shared_buffers() to
synchronize buffer resizing operation with other backends.

When shrinking the shared buffers pool, each buffer in the area being
shrunk needs to be flushed if it's dirty so as not to loose the changes
to that buffer after shrinking. Also, each such buffer needs to be
removed from the buffer mapping table so that backends do not access it
after shrinking.

If a buffer being evicted is pinned, we abort the resizing operation.
There are other alternative which are not implemented in the current
patches 1. to wait for the pinned buffer to get unpinned, 2. the backend
is killed or it itself cancels the query  or 3.  rollback the operation.
Note that option 1 and 2 would require the pinning related local and
shared records to be accessed. But we need infrastructure to do either
of this right now.

So far the buffer pool metdata (NBuffers and the shared memory segment
address space) is saved in process local heap memory since it's static
for the life of a server. It is passed to a new backend through
Postmaster. But with buffer pool being resized while the server running,
we need Postmaster to update its buffer pool metadata as the resizing
progresses and pass it to the new backend. This has few complications:

1. Postmaster does not receive ProcSignalBarrier. So we need to signal
   it separately.

2. Postmaster's local state is inherited by the new backend when
   fork()ed. But we need more complex implementation to pass it to an
   exec()ed backend.

3. A new backend may receive the updated state from Postmaster and also
   the signal barrier which prompts the same update. Thus the proc
   signal barrier code needs to be idempotent; adding further complexity
   to it.

4. This task takes away Postmaster resources from it's core
   functionality.

This can be avoided by following two changes:
1. The shared memory is resized only in a single backend without
   requiring any changes to the memory address space.
2. Maintaining the buffer pool metadata in the shared memory instead of
   process local memory. This change may affect performance so verify
   that performance is not degraded.

TODO: In case the backend executing pg_resize_shared_buffers() exits
before the operation finishes, we need to make sure that the changes
made to the shared memory while resizing are cleaned up properly.

Removing the evicted buffers from buffer ring
=============================================
If the buffer pool has been shrunk, the buffers in the buffer ring may
not be valid anymore. Modify GetBufferFromRing to check if the buffer is
still valid before using it. This makes GetBufferFromRing() a bit more
expensive because of additional boolean condition and masks any bug that
introduces an invalid buffer into the ring. The alternative fix is more
complex as explained below.

The strategy object is created in CurrentMemoryContext and is not
available in any global structure thus accessible when processing buffer
resizing barriers. We may modify GetAccessStrategy() to register
strategy in a global linked list and then arrange to deregister it once
it's no more in use. Looking at the places which use
GetAccessStrategy(), fixing all those may be some work.

Author: Ashutosh Bapat
Author: Dmitrii Dolgov
Author of some tests: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Reviewed-by: Tomas Vondra

More detailed note follow: Need to see which of those fit in the commit
message and which should be removed.

Reinitializing strategry control area
=====================================
The commit introduces a separate function StrategyReInitialize() instead
of reusing StrategyInitialize() since some of the things that the second
one does are not required in the first one. Here's list of what
StrategyReInitialize() does and how does it differ from
StrategyInitialize().

1. StrategyControl pointer needn't be fetched again since it should not
   change. But added an Assert to make sure the pointer is valid.
2. &StrategyControl->buffer_strategy_lock need not be initialized again.
3. nextVictimBuffer, completePasses and numBufferAllocs are viewed in
   the context of NBuffers. Now that NBuffers itself has changed, those
   three do not make sense. Reset them as if the server has restarted
   again.

Ability to delay resizing operation
===================================
This commit introduces a flag delay_shmem_resize, which postgresql
backends and workers can use to signal the coordinator to delay resizing
operation. Background writer sets this flag when its scanning buffers.

Background writer operation (needs a rethink)
===========================
Background writer is blocked when the actual resizing is in progress. It
stops a scan in progress when it sees that the resizing has begun or is
about to begin. Once the buffer resizing is finished, before resuming
the regular operation, bgwriter resets the information saved so far.
This information is viewed in the context of NBuffers and hence does not
make sense after resizing which chanegs NBuffers.

Buffer lookup table
===================
Right now there is no way to free shared memory. Even if we shrink the
buffer lookup table when shrinking the buffer pool the unused hash table
entries can not be freed. When we expand the buffer pool, more entries
can be allocated but we can not resize the hash table directory without
rehashing all the entries. Just allocating more entries will lead to
more contention. Hence we setup the buffer lookup table considering the
maximum possible size of the buffer pool which is MaxAvailableMemory
only once at the beginning.  Shared buffer lookup table and
StrategyControl are not resized even if the buffer pool is resized hence
they are allocated in the main shared memory segment

BgWriter refactoring
====================

The way BgBufferSync is written today, it packs four functionalities:
setting up the buffer sync state, performing the buffer sync, resetting
the buffer sync state when bgwriter_lru_maxpages <= 0 and setting it up
again after bgwriter_lru_maxpages > 0. That makes the code hard to read.
It will be good to divide this function into 3/4 different functions
each performing one functionality. Then pack all the state (the local
variables from that function converted to static global) into a
structure, which is passed to these functions. Once that happens
BgBufferSyncReset() will call one of the functions to reset the state
when buffer pool is resized.
A new test triggers an injection point in the BufferSync() after it has
collected buffers to flushed. Simultaneously it starts buffer shrinking. The
expectation is that the checkpointer would crash accessing a buffer (descriptor)
outside the new range of shared buffers. But that does not happen because of a
bug in synchronization. The checkpointer does not reload configuration when
checkpoint is going on. It does not load the new value of the configuration.
When the resizing is triggered by the PM, checkpointer receives the proc signal
barrier but it does not start it doesn't enter the barrier mechanism and doesn't
alter its address maps or memory sizes. Hence the test does not crash. But of
course it means that it won't consider the correct size of buffers next time it
performs a checkpoint.

The test was at least useful to detect this anomaly. Once we fix the
synchronization issue we should see the crash and then fix the crash.

Author: Ashutosh Bapat

Notes to reviewers
------------------

1. pg_buffercache used a query on pg_settings to fetch the value of the
   number of buffers. That doesn't work anymore because of change in the
   SHOW shared_buffers. Modified the test to convert the setting value
   to the number of shared buffers, save it in a variable and use the
   variable in queries which need the number of shared buffers. We could
   instead fix ShowGUCOption() to pass use_units flag to show_hook and
   let it output the number of shared buffers instead.  But that seems a
   larger change. There aren't other GUCs whose show_hook outputs their
   values with units. So this local fix might be better.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments