Skip to content

Fix process and file descriptor leaks in Salt Master#69227

Open
dwoz wants to merge 18 commits into
saltstack:3006.xfrom
dwoz:3006leak
Open

Fix process and file descriptor leaks in Salt Master#69227
dwoz wants to merge 18 commits into
saltstack:3006.xfrom
dwoz:3006leak

Conversation

@dwoz

@dwoz dwoz commented May 26, 2026

Copy link
Copy Markdown
Contributor

Ensure proper resource lifecycle management and process reaping to resolve leaks introduced between 3006.20 and 3006.25.

  • Call wait() after kill() in TimedProc to prevent zombie processes.
  • Implement context manager protocol and destroy() in SaltEvent, RunnerClient, WheelClient, and MasterMinion.
  • Update masterapi.py to ensure RunnerClient is used within a with statement.
  • Explicitly destroy persistent objects in RemoteFuncs and LocalFuncs during teardown.
  • Initialize internal attributes to None and fix variable scope issues to achieve 10/10 pylint rating.

Comment thread salt/utils/timed_subprocess.py
@dwoz

dwoz commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Heads up: 30 integration shard 4-6 failures across distros (zeromq/tcp) at 29ea4d6 — same pattern as the earlier 57, same root cause: MWorker._handle_payload destroys aes_funcs/clear_funcs mid-flight on the 100th request while the tornado coroutine still holds futures bound to those handlers. Sampled tests/pytests/integration/proxy/test_deltaproxy.py etc; FactoryTimeouts + stuck "state.highstate is running as PID..." on every distro. Recycling block is correct in intent but needs coroutine drain + atomic swap before it's safe. Branch is rebased onto origin/3006.x; let me know if you want it removed again or left for your follow-up.

Daniel A. Wozniak and others added 18 commits June 17, 2026 15:38
Ensure proper resource lifecycle management and process reaping to resolve leaks introduced between 3006.20 and 3006.25.

- Call wait() after kill() in TimedProc to prevent zombie processes.
- Implement context manager protocol and destroy() in SaltEvent, RunnerClient, WheelClient, and MasterMinion.
- Update masterapi.py to ensure RunnerClient is used within a with statement.
- Explicitly destroy persistent objects in RemoteFuncs and LocalFuncs during teardown.
- Initialize internal attributes to None and fix variable scope issues to achieve 10/10 pylint rating.
Resolve process, file descriptor, and memory leaks by adding explicit
teardown and context managers for clients, periodic worker cache resets,
and optimizing token listing with os.scandir.
Include a GitHub Actions workflow that performs aggressive stress testing
on Master, Minion, and API components, with automated Prometheus-based
regression analysis to detect resource leaks.
Address pylint warnings in analyze_stats.py and fd_exporter.py related
to unused imports, broad exceptions, and resource management.
Enable post-build introspection by snapshotting the Prometheus data
directory and uploading it as a GitHub Action artifact.
Explicit resource cleanup is preferred; __del__ methods introduced
during memory and file handle leak fixes have been removed from:
- MasterMinion
- RunnerClient
- SaltEvent
- WheelClient
Close the underlying PyZMQ monitor socket explicitly in `ZeroMQSocketMonitor.stop()`
to prevent eventfd accumulation during repeated Minion connection failures.

Fixes test_fd_leak.py
…ripts

- rest_cherrypy Login.POST referenced self.api which was removed from
  LowDataAdapter; wrap _is_master_running in a NetapiClient with-block.
  This fixes HTTP 500 on /login in functional/integration tests.
- store_job now uses 'with MasterMinion(...)'; MockMasterMinion and
  MockNetapiClient need __enter__/__exit__.
- Exclude tests/monitoring scripts from test_module_names check.
The per-100-request recycling destroyed aes_funcs/clear_funcs (channels,
event, ckminions, loadauth, masterapi) on a live worker while requests
were still in flight, silently breaking publish/return on heavier
integration shards (4-6) across every distro. Removed the block; the
existing destroy() paths on shutdown handle cleanup correctly.
- Restored worker_resource_backcount logic in MWorker to prevent memory
  accumulation in long-running clear/aes functions.
- Increased flat memory buffer in test_publisher_mem to prevent false
  positive failures from PyZMQ/Python memory fragmentation.
NetapiClient.runner now uses RunnerClient as a context manager
(salt/netapi/__init__.py). The four regression tests in
test_netapi_client_runner.py patched salt.runner.RunnerClient with a
FakeRunner that lacked __enter__/__exit__, so every distro on unit
zeromq shard 3 failed with AttributeError: __enter__.

Add no-op __enter__ returning self and __exit__ returning False to
each FakeRunner so the with-block in NetapiClient.runner works against
the test double.
Address several long-standing memory leaks discovered during stress
testing of the salt master and minions.

salt/daemons/masterapi.py + salt/master.py
  Maintenance loader caching. clean_expired_tokens() and
  clean_old_jobs() now accept optional loadauth/mminion arguments so
  callers can reuse a long-lived instance.  Maintenance.__init__
  caches one LoadAuth and one MasterMinion in _post_fork_init and
  reuses them every iteration, destroying both in destroy().  Without
  this each Maintenance loop iteration constructed fresh LoadAuth +
  MasterMinion instances, triggering a fresh LazyLoader + __virtual__
  cascade + module-load chain whose bytecode/dict/string allocations
  were retained in sys.modules.  This was the dominant driver of the
  Maintenance-process slow drift (~2.4 MB/min) — now flat.

salt/transport/frame.py + salt/transport/ipc.py
  4-byte big-endian length prefix on frame_msg_ipc, and matching
  exact-length readers in IPCServer.handle_stream and
  IPCMessageSubscriber._read (drops the streaming msgpack Unpacker).
  The Unix-domain-socket atomic-write boundary (~65 536 bytes) was
  causing concurrent large writes (e.g. beacon status frames + flood
  events) to interleave, leaving the streaming Unpacker desynchronised
  and producing UnicodeDecodeError / ExtraData crashes in
  EventReturn and any other long-running subscribers.  With explicit
  framing the receiver always knows where one message ends and the
  next begins.

salt/transport/ipc.py
  IPCMessagePublisher._write converted from a @gen.coroutine to a
  regular function with a done-callback.  Each published message was
  spawning a long-lived gen.Runner per subscriber stream that waited
  inside the stream.write yield until the OS drained the bytes.
  Under high event rates the Runner / generator / frame / Future
  quadruple was the dominant residual minion leak (905+ Runners
  observed).  Now the callback fires asynchronously without spawning
  a coroutine.

salt/transport/zeromq.py
  Three ZMQ callback-registration sites (RequestServer,
  PublishServer pull_sock, and PublishClient.on_recv) now wrap the
  Tornado @gen.coroutine handler in a _dispatch shim that routes
  through io_loop.spawn_callback() and returns None to PyZMQ.
  Previously PyZMQ's _run_callback wrapped any Awaitable return
  value with asyncio.ensure_future(), creating Tasks on the asyncio
  event loop that was never driven by the Tornado IOLoop — Tasks
  (plus their gen.Runner / Future / WeakRef tracking sets)
  accumulated indefinitely.  The minion-side fix (PublishClient)
  alone removed ~18,800 Tornado Runners observed under stress.

salt/utils/event.py
  Three independent hardening changes:
  - SaltEvent._get_event now catches SaltDeserializationError so a
    single malformed/corrupted IPC frame can no longer kill the
    entire subscriber loop.
  - EventPublisher.run installs a 5-minute PeriodicCallback that
    calls libc.malloc_trim(0) to release glibc arena pages.  High-
    throughput event publishing fragments the allocator heavily; the
    EventPublisher's RSS routinely sat at >1 GB of free-but-unreturned
    pages without this.
  - EventReturn now validates configured event_return returners at
    startup (emitting a clear one-shot error if anything is missing)
    and rate-limits the per-event "returner not found" message to
    once per 60 s per returner.  With event_return_queue=0 the
    previous code emitted that error for every single event, which
    could fill log volumes in minutes under stress.

salt/minion.py
  Periodic gc.collect() PeriodicCallback in Minion.tune_in.  Tornado
  coroutine timeouts (FutureWithTimeout, Runner.handle_yield
  closures, traceback objects, etc.) create reference cycles that
  Python's default GC thresholds (700, 10, 10) collect too rarely
  for the rate at which they accumulate in a busy minion.  Running
  a full collection every 60 s keeps the working set steady.
…n, dashboard

Improvements to the tests/monitoring stack so the salt master/minion
leak hunt is reproducible and observable.

tests/monitoring/Dockerfile.salt
  Rebuild Python 3.10.20 from source with
  CFLAGS="-g -O2 -fno-omit-frame-pointer" so memray's --native frame
  unwinder and gdb can resolve CPython symbols (the stock python:3.10
  slim image strips them).  Adds gdb and memray at image-build time
  so attaching a profiler to a long-running process no longer requires
  apt-get + pip inside the container.  Also fixes the stale
  requirements/base.in -> .txt path.

tests/monitoring/master.conf + tests/monitoring/minion.conf
  Set ipc_write_buffer: 104857600 (100 MB) on both master and minion
  to cap Tornado IOStream._write_buffer growth on the local event-bus
  IPC publisher.  Without this, one slow subscriber on either side
  caused a single bytearray to grow unbounded under stress (>1 GB
  observed on master EventPublisher, ~80+ MB/process and climbing on
  minions).
  Switch minion log_level from debug -> warning; debug logging in
  long-running container stress runs filled tens of GB of Docker JSON
  logs per minion.

tests/monitoring/prometheus.yml
  Move the salt-fds target to port 8002 to match fd_exporter.py's
  listen port.

tests/monitoring/stress_api.sh
  Drop the per-iteration "frequent logins" call.  Hammering /login
  10x/sec generated a CherryPy session per request, which inflated
  the salt-api Netapi process to >1 GB of session state — not a salt
  bug, just an unrealistic stress pattern.  Real clients reuse one
  token.

tests/monitoring/grafana/.../salt_monitoring.json
  Add a Current Time stat panel (uses Prometheus time() so the
  dashboard prominently shows when "now" is during long captures),
  default to a 30-minute window with 10-second auto-refresh, and
  honour the browser timezone.  Reshapes the row heights to make room.
The maintenance loader caching work (commit d4e2e07) moved cached
LoadAuth + MasterMinion instance construction into Maintenance._post_fork_init
and made the main loop reference self._cached_mminion / self._cached_loadauth.

test_run_func mocks _post_fork_init wholesale, so those attributes never
get set, and Maintenance.run() now raises AttributeError on the first
iteration. Have the mocked _post_fork_init seed both attributes with
MagicMock so the loop body still has something to call.
The earlier IPC framing rewrite collapsed _read to a single read.
read_async, however, calls _read(None, callback) once and expects
the coroutine to loop forever invoking the callback on every
incoming message, the same shape the streaming-Unpacker version
had via its `while True:` outer loop.

With the loop gone, every subscriber registered via
SaltEvent.set_event_handler delivered exactly one event and then
went deaf.  On the minion that breaks the
`__master_req_channel_payload/<id>/<master>` handler, so command
returns never reach the master and `salt '*' ...` reports
"Minion did not return [No response]".

Restore the `while True:` loop, breaking out only when no callback
was supplied (one-shot read) or the stream closes / times out.
Drop the `timeout` after the first length prefix arrives so the
payload read is not artificially constrained.
CI run 27918914154 surfaced two regressions introduced by
d4e2e07 ("Fix multiple memory leaks ...").

1. EventPublisher hardcoded ctypes.CDLL("libc.so.6")

   The malloc_trim PeriodicCallback was glibc-only and raised
   OSError on macOS and Windows where libc.so.6 does not exist.
   The EventPublisher process crashed at startup and was
   restarted in a tight loop by the SignalHandlingProcess
   parent, so the master fixture never became fully usable and
   every test in every chunk that depends on a real master
   failed with FactoryNotStarted.  malloc_trim was never a
   real leak fix to begin with -- it only released free()'d
   glibc arena pages back to the OS to make RSS look smaller
   on graphs; glibc would have re-used the same pages on the
   next allocation cycle.  Drop the malloc_trim call entirely
   (and the now-unused `import ctypes`).

2. IPCMessagePublisher.publish iterated a live set while
   _write could discard from it

   When _write was converted from a coroutine to a regular
   function it began calling self.streams.discard(stream)
   synchronously on StreamClosedError.  publish() was iterating
   self.streams directly, so a stream that was closed at write
   time raised RuntimeError: Set changed size during iteration.
   The exception killed EventPublisher's handle_publish loop,
   so beacon events (and many other minion-local fire_event
   payloads) never reached the local subscribers, and
   salt-call commands like beacons.reset hung until the
   pytest-shellutils factory timed out.

   Iterate tuple(self.streams) so _write's discards do not
   mutate the iteration target.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants