Skip to content

Review fixes: close leaks, UAF race, framing, sanitizer CI#4

Merged
nyo16 merged 20 commits into
masterfrom
review/code-review-fixes
Apr 17, 2026
Merged

Review fixes: close leaks, UAF race, framing, sanitizer CI#4
nyo16 merged 20 commits into
masterfrom
review/code-review-fixes

Conversation

@nyo16
Copy link
Copy Markdown
Owner

@nyo16 nyo16 commented Apr 17, 2026

Summary

Implements the findings from the professional code review in
/Users/nmaroulis/.claude/plans/do-a-thoroough-code-wild-bengio.md,
addressing zero-leak / zero-zombie guarantees across the NIF, shepherd,
and Elixir layers. 18 commits, each a focused fix.

Critical fixes (correctness)

  • FD leak on mutex alloc failure (nif_create_fd) — previously the
    dtor gated close() on res->lock being non-NULL, so an ENOMEM
    during mutex create leaked the fd and armed a NULL-deref in any
    later nif_close.
  • Use-after-close racenif_read/nif_write copied fd out of
    the lock before the syscall; a concurrent nif_close / owner-death
    could close the fd before the syscall ran, hitting a recycled fd.
    Lock is now held across the syscall and enif_select; the actual
    close() runs in the io_resource_stop callback after BEAM drains
    any pending select.
  • Lost initial stderr in :consume modekick_stderr_read sent
    {:stderr_data, _} to self() but no handle_info matched, so
    first-chunk stderr for short-lived processes was dropped.
  • write_loop {:ok, 0} spin — if the kernel ever returned 0 on
    a non-empty write, the GenServer spun forever. Bounded with a 1 ms
    sleep-retry.

High-severity fixes (shepherd C layer)

  • Command framing — shepherd parsed only buf[0], discarding any
    coalesced/tail commands. Now frames are length-dispatched per
    opcode with a 64-byte carry-over buffer across poll() iterations.
  • sendmsg in send_fds — EINTR loop, verify full 1-byte send,
    log errno on failure instead of returning an opaque -1.
  • Post-fork signal safety — replaced fprintf/strerror in the
    post-fork child window with a write(2)-based child_fail()
    helper (async-signal-safe). Checked every dup2 / setsid /
    TIOCSCTTY return; on failure the child exits 127 with a
    diagnostic instead of running with broken stdio.
  • waitpid after SIGKILL — bounded WNOHANG loop (~3 s cap) so
    shepherd can't hang on a D-state child.
  • SIGCHLD reaping — loop waitpid(-1, ..., WNOHANG) per drain;
    previously one signal could leave zombies if ever coalesced.
  • Cgroup & UDS path hardening — validate every snprintf return,
    reject too-long UDS paths, set FD_CLOEXEC on PTY master, treat
    user-requested cgroup setup failure as fatal, and replace the fixed
    100 ms usleep in cgroup_cleanup with a bounded polling rmdir.

Elixir layer hardening

  • Stream consumer crash cleanup — Process GenServer now accepts an
    optional :owner pid, monitors it, and SIGKILL-stops the OS process
    on DOWN. NetRunner.Stream.stream/3 passes self() so a consumer
    crash tears down the chain even though Stream.resource's after
    didn't fire.
  • Watcher non-blocking escalationProcess.sleep(5_000) in
    handle_info replaced with Process.send_after/3 + a new clause.
  • Parked-caller monitoringOperations.park/4 now
    Process.monitors the caller. Dead callers are pruned on DOWN
    instead of accumulating in the pending map.
  • read_uds_message — replaced peek-then-recv (race-prone when
    the payload arrives a moment after the opcode) with opcode-first
    reads and proper timeouts.
  • Input validation — reject non-binary / empty / NUL-containing
    cmd and args at the spawn boundary.
  • File.rm error handling — tolerate :enoent on the UDS socket
    path, propagate real errors.
  • Signal source of truthSignal.resolve delegates to the NIF
    instead of maintaining a duplicate allow-list; integer range
    clamped to 1..31.
  • Daemon drain resilience — warn (and keep running) when a drain
    task dies abnormally; wrap drain_loop so reader/logger exceptions
    can't take the daemon down via Task link.
  • terminate/2 explicitly closes the shepherd Port after the
    UDS socket so the teardown order is deterministic.

Tooling

  • ASan + UBSan build: SANITIZE=1 make all (or make asan). CI
    gains a sanitizers job that builds the NIF and shepherd with
    AddressSanitizer + UBSan, preloads libasan, and runs the full
    mix test. The publish job depends on it.
  • Stale socket sweep in test_helper.exs (before + after suite).
  • New test: binary-with-NUL-byte round-trip through the shepherd.

Test plan

  • mix compile --warnings-as-errors
  • mix format --check-formatted
  • mix credo --strict (clean)
  • mix dialyzer (0 errors)
  • mix test — 120 passed, 10× consecutive runs all green
  • SANITIZE=1 make clean all builds locally on macOS
  • CI sanitizers job runs clean on Linux (verified on push)

nyo16 added 20 commits April 16, 2026 19:17
If enif_mutex_create returned NULL the resource was returned to Elixir
with res->lock == NULL and res->fd set. The destructor gated close() on
a non-NULL lock, so the FD leaked; any subsequent nif_close would also
dereference NULL and crash.

Now nif_create_fd checks the mutex result, releases the resource on
failure (so dtor reclaims the fd), and returns {:error, :mutex_failed}.
io_resource_dtor no longer gates close() on lock presence.
nif_read / nif_write copied res->fd under lock and released the lock
before the read()/write() syscall. A concurrent nif_close or owner-death
down callback could close the fd on another thread before the syscall
ran, letting the read/write target a recycled fd.

Restructure so the mutex is held across the syscall and the subsequent
enif_select registration. Since fds are O_NONBLOCK, the lock is held
only briefly.

Move the actual close() into the enif_select stop callback. nif_close
and io_resource_down now mark the resource closed, detach the fd from
res->fd, release the lock, and call enif_select(SELECT_STOP). BEAM waits
for any in-flight select to drain before running the stop callback,
which closes the fd safely.
In :consume mode, kick_stderr_read is called from init/1 to arm
enif_select. If the initial Pipe.read returned {:ok, data} synchronously
(short-lived processes whose stderr is already in the pipe buffer), it
sent {:stderr_data, data} to self() but no handle_info clause matched,
so the generic catch-all dropped it. The first (and sometimes only)
chunk of stderr was lost, corrupting AbnormalExit messages and
run/2 stderr capture for fast-exiting commands.

Add the missing handle_info({:stderr_data, _}, state) clause that
appends to stderr_buffer, records the bytes in stats, and drains any
remaining data via consume_stderr.

Also pass @default_read_size explicitly from kick_stderr_read instead
of relying on Pipe.read/1's implicit default, keeping read sizes
consistent across the module.
Two issues in the stdin write path:

1. If nif_write returned {:ok, 0} on a non-empty buffer (legal for
   write(2) on non-regular files), write_loop would recurse forever
   with the same data, wedging the GenServer. Park the caller and
   wait for :ready_output instead of spinning.

2. On repeated partial writes, write_loop kept recursing inside
   handle_call, delaying unrelated messages (kill, await_exit,
   :select) until either completion or EAGAIN. After one successful
   partial, park the caller with the remainder and let :ready_output
   drive subsequent writes.

Apply the same shape to retry_write_loop on the :ready_output path.
The event loop read up to 16 bytes from the UDS socket and passed them
straight to handle_command, which only parsed buf[0] and ignored the
remainder. Coalesced commands (e.g. CMD_CLOSE_STDIN followed by
CMD_KILL delivered in a single read) dropped everything after the
first opcode. A partial tail crossing a read boundary was also lost.

Introduce command_length() per opcode, a loop-parsing handle_commands()
helper, and a 64-byte carry-over buffer in event_loop() that retains
any truncated tail across reads. Unknown opcodes advance one byte so
the parser cannot stall on bad input. read() errors other than EINTR
/ EAGAIN now kill the child and exit rather than silently looping.
send_fds:
  * Loop on EINTR and require ret == 1 (the full 1-byte send) instead
    of accepting any positive return.
  * Log errno on both calloc and sendmsg failures so the caller sees
    a diagnostic instead of an opaque "failed to send FDs".

Post-fork child:
  * Between fork() and execvp(), POSIX allows only async-signal-safe
    functions. Replace fprintf/strerror on failure with a new
    child_fail() helper that uses only write(2) on a stack buffer.
  * Check setsid, TIOCSCTTY, and every dup2 return value. If any of
    these fail the child now reports via child_fail() and _exit(127)
    instead of silently running with the wrong stdio or no controlling
    terminal.
kill_child() previously called waitpid(child_pid, NULL, 0) after
SIGKILL. If the child was in uninterruptible kernel sleep (D-state),
this blocked the shepherd indefinitely — leaking the process group
and preventing any further protocol communication with BEAM. Replace
with a bounded WNOHANG reap loop (~3s cap) so control returns even
when the kernel cannot reap immediately; cgroup_cleanup() provides the
authoritative kill path on Linux.

The event-loop SIGCHLD handler reaped at most one child per drain.
SIGCHLD is coalesced by the kernel, so multiple pending exits could
deliver as a single signal; the single waitpid missed them. Loop
with waitpid(-1, ..., WNOHANG) until no more reapable children remain
and flip child_exited only for the managed pid.
* Validate every snprintf return value for cgroup full_path / procs_path
  / kill_path; treat truncation as error instead of silently writing to
  the wrong file.
* Validate uds_path length before strncpy into sockaddr_un.sun_path;
  reject paths that would truncate rather than failing opaquely in
  connect().
* Replace the fixed 100ms usleep before rmdir in cgroup_cleanup() with
  a bounded polling loop (~1s) that retries on EBUSY / ENOTEMPTY.
  Stops cgroup directories leaking when the kernel reaper is slow.
* Treat cgroup_setup failure as fatal for the spawn (send_error, kill
  child, exit). A user who supplied a cgroup_path expects isolation;
  silently running unisolated violates the contract.
* Set FD_CLOEXEC on the PTY master fd so any future shepherd fork
  doesn't leak it.
Stream.resource's after callback is only run on normal termination or
{:halt, _}. If the consumer crashed mid-iteration (e.g. pattern-match
failure), cleanup_process/1 was never invoked, the Process GenServer
lived on, and Watcher couldn't help because it monitors the GenServer,
not the consumer.

Add an optional :owner option to NetRunner.Process that takes a pid
to monitor. On DOWN the GenServer SIGKILLs the OS process (through
both the shepherd protocol and a direct NIF kill) and stops :normal.

NetRunner.Stream.stream/3 now passes self() as :owner so consumer
crashes reliably tear down the whole chain.
process.ex terminate/2 now explicitly closes the shepherd Port after
pipes and the UDS socket. The Port is linked so it would die with the
GenServer anyway, but closing it explicitly makes the cleanup order
deterministic.

watcher.ex replaced Process.sleep(5000) inside handle_info/2 with a
Process.send_after(:escalate_to_sigkill, 5000) message and a new
handle_info/2 clause. A Watcher can now still receive messages (e.g.
supervisor shutdown) during the SIGTERM grace window instead of being
wedged unresponsive.
* Exec.read_uds_message: replace peek-then-recv (which can desync if
  the first byte arrives ahead of its payload) with opcode-first reads.
  Read 1 byte for the opcode, then the opcode-specific tail. Timeouts
  bumped to 500ms.
* Exec.spawn_process: validate cmd and args before Port.open. Reject
  non-binary, empty, or NUL-byte-containing values. Passing these
  through Port.open's args: is undefined behaviour on the C side.
* Exec.cleanup_listener: propagate File.rm errors except :enoent
  (which is expected when the shepherd already unlinked the socket).
* Signal.resolve: reject integer signals outside the POSIX 1..31 range
  instead of forwarding arbitrary ints to kill(2).
Operations.park previously stored the GenServer.from tuple but did not
monitor the caller pid. If a caller timed out its GenServer.call or
crashed while parked on EAGAIN, its entry stayed in pending until the
Process exited; finish_exit then replied to a dead pid. For a long-
running process with many client timeouts this leaked memory.

Operations now takes Process.monitor(caller_pid) in park/4 and tracks a
second map (monitors: caller_monitor_ref → op_ref). pop/2 demonitors
the caller. New pop_by_monitor/2 lets the GenServer remove an entry
when it observes the caller's :DOWN. reply_all/2 flushes all monitors.

The Process GenServer's :DOWN handler now dispatches: if the monitor
ref matches owner_ref (the stream consumer) it SIGKILLs and stops, else
it tries Operations.pop_by_monitor to prune a dead parked caller.
Two defects in NetRunner.Daemon:

* The generic handle_info({:DOWN, ...}) clause matched any :DOWN and
  dropped it. A drain Task crashing (abnormal reason) looked identical
  to normal EOF, so a failing drain stopped invisibly and output
  thereafter piled up in the pipe until blocking the child process.
  Narrow the handler to refs we recognise (drain_ref / stderr_drain_ref)
  and log a Logger.warning when reason != :normal.

* drain_loop relied on reader.() never raising. If the Proc GenServer
  terminated mid-call or on_output blew up on a non-UTF-8 chunk, the
  linked drain Task crashed, which propagated to the Daemon through
  Task.async's link and crashed the whole daemon. Wrap drain_loop in
  try/rescue/catch and isolate on_output in its own rescue so neither
  path can take down the Daemon.
Signal.resolve previously maintained an Elixir @signals allow-list that
duplicated the strcmp list inside nif_signal_number in C. Adding or
renaming a signal meant editing two places and any drift produced a
silent "unknown_signal" even when the other side supported it.

Delegate to the NIF for any atom input; integer-range validation stays
in Elixir. The NIF's existing {:error, :unknown_signal} return surfaces
unsupported atoms unchanged.
Enable \`SANITIZE=1 make all\` (or the \`make asan\` convenience target)
to rebuild the NIF and shepherd with AddressSanitizer + UBSan. When
SANITIZE=1 the build switches to -O1 and drops _FORTIFY_SOURCE (they
conflict with ASan's memcpy interceptors).

Add a new CI job \`sanitizers\` that runs the full Elixir test suite
with LD_PRELOAD=libasan so the NIF and shepherd catch memory-safety
and undefined-behaviour bugs in the C layer on every push. Gate the
publish job on the sanitizers job passing.
The earlier change that parked the caller after a single partial write
broke large-buffer streaming. Parking skipped the NIF's EAGAIN path,
which is the only place enif_select is armed for :ready_output; the
parked caller then hung forever because no readiness signal would
arrive. A 100 KB cat stream test reliably timed out after 60s.

Restore the original write_loop shape (recurse until the NIF returns
either :eagain or completion) so enif_select stays in charge of
readiness. Keep the {:ok, 0} guard from the original review fix — if
the kernel ever returns 0 on a non-empty write we sleep(1) and retry
instead of spinning the dirty scheduler. retry_write_loop mirrors the
same shape.

Also add a test_helper.exs hook that purges stale /tmp/net_runner_*.sock
files before and after every test run; stops accumulation from tests
that crash before cleanup_listener runs.
Runs \`printf 'a\0b\0c'\` through the shepherd and asserts the full 5
bytes survive the read path. Guards against any accidental use of
String functions on the output binary that would truncate at the
first NUL.
Credo strict flagged the cond/true as a refactoring opportunity;
with only two branches an if/else is clearer and identical in
behaviour.
Public-API fix: NetRunner.run pattern-matched {:ok, pid} on Proc.start,
which raised MatchError for any {:error, reason} — including the new
validation errors from E12. Return the error tuple directly.

New regression tests:

* test/signal_test.exs — Signal.resolve range, unknown atoms, bad
  types, resolve! raising behaviour (E15).
* test/process_test.exs —
    * NUL-byte rejection in cmd/args (E12)
    * Proc.start with empty cmd rejected (E12)
    * stderr-only fast-exit command: clean exit + bytes_err stats
      nonzero (E1)
    * :owner monitor SIGKILLs OS process when owner crashes (E3)
* test/net_runner_test.exs —
    * run/2 and stream/2 surface validation errors instead of crashing
    * Timeout path reaps OS processes (no sleep pid leaks after five
      100ms timeouts against sleep 30)
Fix CI regression from C9 (treat cgroup setup failure as fatal):

The existing Linux-only test assumed CI had write access to
/sys/fs/cgroup/; it does not. Before C9 this silently succeeded
(child ran outside the cgroup). After C9 the shepherd correctly
reports {:error, _}. Rewrite the test to accept either outcome
— success on a privileged run, a clean error on an unprivileged
one. Either confirms the option is plumbed through.

Add a negative test for path validation (absolute paths and
traversal are rejected).

CHANGELOG: document every fix from the review pass under
[Unreleased] — FD leak on mutex failure, use-after-close race,
lost stderr, write-loop spin, shepherd framing, sendmsg
hardening, post-fork signal safety, waitpid bounds, SIGCHLD
reap loop, cgroup/path validation, stream consumer-crash
cleanup, Watcher send_after, parked-caller monitors, UDS read
redesign, NUL-byte validation, run/2 error surface, File.rm
errors, Signal range/source-of-truth, Daemon drain resilience,
explicit Port.close, ASan/UBSan build + CI job, stale socket
sweep, regression tests.

README: new sections on input validation / error returns,
binary output with NUL bytes, the Command DSL, an error-handling
cheatsheet, and examples for :owner monitoring and per-call
kill_timeout.
@nyo16 nyo16 merged commit f75659f into master Apr 17, 2026
10 checks passed
@nyo16 nyo16 deleted the review/code-review-fixes branch April 17, 2026 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant