Merged
20 commits
76d7cbb
Fix FD leak in nif_create_fd when mutex allocation fails
nyo16 Apr 16, 2026
b0dea55
Fix use-after-close race in NIF FD resource
nyo16 Apr 16, 2026
a29fe16
Recover initial stderr chunk and use explicit read size
nyo16 Apr 16, 2026
829a735
Fix write_loop spin on {:ok, 0} and message starvation on large writes
nyo16 Apr 16, 2026
51b68f6
Frame shepherd UDS commands correctly with carry-over buffer
nyo16 Apr 16, 2026
1357251
Harden send_fds and post-fork child path
nyo16 Apr 16, 2026
50237f3
Harden waitpid paths in shepherd
nyo16 Apr 16, 2026
1ee2aab
Harden cgroup and UDS path handling in shepherd
nyo16 Apr 16, 2026
bb4a226
Stop Process GenServer when stream consumer crashes
nyo16 Apr 16, 2026
9a4f463
Explicit Port.close in terminate/2; non-blocking Watcher escalation
nyo16 Apr 16, 2026
10391a1
Harden exec boundary: validate cmd/args, UDS cleanup, signal range
nyo16 Apr 16, 2026
75ad97f
Monitor parked callers in Operations; prune on caller DOWN
nyo16 Apr 16, 2026
8c70722
Surface Daemon drain-task crashes instead of silently dropping them
nyo16 Apr 16, 2026
069b15e
Make the NIF the single source of truth for signal atoms
nyo16 Apr 16, 2026
c4cd677
Add ASan/UBSan opt-in build and a CI sanitizers job
nyo16 Apr 17, 2026
611912b
Revert write_loop parking, keep {:ok, 0} guard; stale socket cleanup
nyo16 Apr 17, 2026
f143b0f
Add binary-with-NUL-bytes round-trip test
nyo16 Apr 17, 2026
fa2546d
Replace cond with if/else in Process DOWN dispatch
nyo16 Apr 17, 2026
dc69018
Add regression tests for review fixes + fix run/2 error surface
nyo16 Apr 17, 2026
ad15ff8
Update cgroup_test for fatal setup + CHANGELOG + README examples
nyo16 Apr 17, 2026
34 changes: 33 additions & 1 deletion .github/workflows/ci.yml
@@ -165,10 +165,42 @@ jobs:
      - run: mix compile
      - run: mix dialyzer

  sanitizers:
    name: Sanitizers (ASan + UBSan)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: erlef/setup-beam@v1
        with:
          elixir-version: ${{ env.ELIXIR_VERSION }}
          otp-version: ${{ env.OTP_VERSION }}

      - name: Cache deps & build
        uses: actions/cache@v4
        with:
          path: |
            deps
            _build
          key: ${{ runner.os }}-asan-${{ env.ELIXIR_VERSION }}-${{ env.OTP_VERSION }}-${{ hashFiles('mix.lock') }}

      - run: mix deps.get
      - run: mix deps.compile

      - name: Build C with ASan + UBSan
        run: SANITIZE=1 make clean all

      - name: Run tests under sanitizers
        env:
          ASAN_OPTIONS: "detect_leaks=0:halt_on_error=1:abort_on_error=1"
          UBSAN_OPTIONS: "halt_on_error=1:print_stacktrace=1"
          LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libasan.so.8
        run: mix test

  publish:
    name: Publish to Hex
    runs-on: ubuntu-latest
    needs: [compile, format, credo, test, dialyzer]
    needs: [compile, format, credo, test, dialyzer, sanitizers]
    if: startsWith(github.ref, 'refs/tags/v')
    steps:
      - uses: actions/checkout@v4
110 changes: 110 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,116 @@ All notable changes to this project will be documented in this file.

This project adheres to [Semantic Versioning](https://semver.org/).

## [Unreleased]

Focused code-review pass across the NIF, shepherd, and Elixir layers.
Correctness-first: closes two real-world race/leak bugs, hardens the
post-fork child window, and adds an AddressSanitizer + UBSan CI job.

### Fixed

- **FD leak in `nif_create_fd`** when `enif_mutex_create` failed
— the destructor previously gated `close(fd)` on a non-NULL lock,
so a failed mutex allocation leaked the file descriptor and armed
a NULL-deref in any later `nif_close`. The mutex result is checked
and the dtor now closes the fd unconditionally.
- **Use-after-close race in NIF read/write vs. close/down**
— `nif_read`/`nif_write` copied `res->fd` under the mutex and
released the lock before the syscall; a concurrent `nif_close` or
owner-death callback could close the fd before the syscall ran,
letting the read/write target a recycled fd. The mutex is now held
across the syscall and the subsequent `enif_select` registration;
the actual `close()` is deferred to the `io_resource_stop` callback
so BEAM can drain pending selects before the fd is released.
- **Lost initial stderr chunk in `:consume` mode**
— `kick_stderr_read` in `init/1` sent `{:stderr_data, data}` to
`self()` but no `handle_info/2` clause matched, so the first (and
often only) chunk of stderr for fast-exiting processes was silently
dropped. The missing handler now appends to the stderr buffer and
drains any remainder.
- **`write_loop` spin on `{:ok, 0}`** — if the kernel ever returned
0 bytes on a non-empty write, the GenServer would recurse forever
on the dirty scheduler. Bounded with a 1 ms sleep-retry.
- **Shepherd UDS command framing** — the event loop parsed only
`buf[0]`, discarding any coalesced or tail commands (e.g.
`CMD_CLOSE_STDIN` followed immediately by `CMD_KILL`). Frames are
now length-dispatched per opcode with a carry-over buffer across
`poll()` iterations.
- **Post-fork child stdio and signal safety** — replaced `fprintf` /
`strerror` in the post-fork / pre-exec window with a `write(2)`-
based `child_fail()` helper (async-signal-safe). Every `dup2`,
`setsid`, and `TIOCSCTTY` return is now checked; on failure the
child exits 127 with a diagnostic instead of running with broken
stdio.
- **`waitpid` after SIGKILL** — replaced the unbounded
`waitpid(child_pid, NULL, 0)` with a bounded WNOHANG loop
(~3 s cap) so the shepherd cannot hang on a child stuck in
uninterruptible kernel sleep (D-state).
- **SIGCHLD reap loop** — reap all pending children per SIGCHLD
(`while waitpid(-1, ..., WNOHANG) > 0`) so a coalesced signal
never leaks zombies.
- **Cgroup / UDS path hardening** — validate every `snprintf` return,
reject too-long UDS paths, set `FD_CLOEXEC` on the PTY master,
treat user-requested cgroup setup failure as fatal, and replace
the fixed 100 ms `usleep` in `cgroup_cleanup` with a bounded
polling `rmdir`.
- **`Stream` consumer crash cleanup** — `Stream.resource`'s `after`
callback is only run on normal termination. A consumer crash
orphaned the `NetRunner.Process` GenServer and its OS child.
`NetRunner.Process.start/3` now accepts an `:owner` option that
monitors the caller; `NetRunner.Stream.stream/3` passes `self()`,
so a consumer crash SIGKILLs the OS process and stops the
GenServer.
- **Watcher blocking on `Process.sleep`** — the 5 s sleep in
`handle_info/2` wedged the Watcher unresponsive (including to
supervisor shutdown). Replaced with `Process.send_after/3` and a
new `:escalate_to_sigkill` handler.
- **Parked-caller tracking in `Operations`** — callers parked on
EAGAIN are now `Process.monitor/1`-ed; dead callers are pruned on
`:DOWN` instead of lingering in the pending map until process
exit.
- **`read_uds_message` race** — replaced the `:peek` + full-recv
pattern (which could time out if the payload arrived a moment
after the opcode) with an opcode-first read flow and longer
timeouts.
- **`cmd` / `args` validation** — reject non-binary, empty, or
NUL-containing cmd and args at the spawn boundary. Passing NUL
bytes through `Port.open`'s `args:` is undefined on the C side.
- **`NetRunner.run/2` error surface** — previously pattern-matched
`{:ok, pid}` from `Proc.start`, raising `MatchError` when
validation failed. Now returns `{:error, reason}` cleanly.
- **`File.rm` cleanup of UDS socket** — tolerate `:enoent`
(shepherd may have unlinked), propagate other errors.
- **`Signal.resolve` integer range** — integer signals outside
POSIX `1..31` now return `{:error, :unknown_signal}` instead of
being forwarded to `kill(2)`.
- **`Signal` single source of truth** — `Signal.resolve` delegates
to the NIF for known-atom lookup instead of maintaining a duplicate
allow-list that drifted from the C side.
- **Daemon drain resilience** — drain-task crashes used to match a
catch-all `:DOWN` handler and silently stop draining; the pipe
then filled until the child blocked. Narrowed to recognised refs
with a warning log; `drain_loop` wrapped in `try/rescue/catch` so
a reader or logger exception cannot take the daemon down through
the linked Task.
- **`terminate/2`** explicitly closes the shepherd `Port` after the
UDS socket for deterministic teardown order.
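
As an illustration of the shepherd UDS command framing entry above, here is a minimal C sketch of length-dispatched parsing with a carry-over buffer. It is not the shepherd's actual code. The opcode values, the one-byte signal payload for `CMD_KILL`, the 64-byte cap, and the names `carry_t`, `frame_len`, `feed_bytes`, `frame_log`, and `log_frame` are assumptions made for this example; only the `CMD_CLOSE_STDIN` and `CMD_KILL` names come from the changelog.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode values; the real ones live in protocol.h. */
enum { CMD_KILL = 1, CMD_CLOSE_STDIN = 2 };

#define CARRY_CAP 64

/* Bytes received but not yet consumed, kept across poll() iterations. */
typedef struct {
    uint8_t buf[CARRY_CAP];
    size_t  len;
} carry_t;

typedef void (*frame_cb)(uint8_t opcode, const uint8_t *payload,
                         size_t payload_len, void *ctx);

/* Total frame length (opcode byte + payload) per opcode; 0 if unknown.
 * Assumption: CMD_KILL carries one signal byte, CMD_CLOSE_STDIN none. */
size_t frame_len(uint8_t opcode) {
    switch (opcode) {
    case CMD_CLOSE_STDIN: return 1;
    case CMD_KILL:        return 2;
    default:              return 0;
    }
}

/* Append newly read bytes, dispatch every complete frame, and keep any
 * partial tail for the next call. Returns the number of frames
 * dispatched, or -1 on an unknown opcode or buffer overflow. */
int feed_bytes(carry_t *c, const uint8_t *data, size_t n,
               frame_cb cb, void *ctx) {
    if (n > CARRY_CAP - c->len) return -1;
    if (n > 0) memcpy(c->buf + c->len, data, n);
    c->len += n;

    size_t off = 0;
    int frames = 0;
    while (off < c->len) {
        size_t need = frame_len(c->buf[off]);
        if (need == 0) return -1;          /* protocol error */
        if (c->len - off < need) break;    /* incomplete frame: wait */
        cb(c->buf[off], c->buf + off + 1, need - 1, ctx);
        off += need;
        frames++;
    }

    /* Slide the unconsumed tail to the front of the buffer. */
    memmove(c->buf, c->buf + off, c->len - off);
    c->len -= off;
    return frames;
}

/* Example callback: records the opcodes it sees, in order. */
typedef struct { uint8_t ops[16]; size_t count; } frame_log;

void log_frame(uint8_t opcode, const uint8_t *payload,
               size_t payload_len, void *ctx) {
    (void)payload;
    (void)payload_len;
    frame_log *seen = ctx;
    if (seen->count < 16) seen->ops[seen->count++] = opcode;
}
```

In this sketch, a single read that delivers `CMD_CLOSE_STDIN` followed by only the first byte of `CMD_KILL` dispatches one frame and leaves one byte in the carry-over buffer. The `CMD_KILL` frame is dispatched once its payload byte arrives in a later read, so neither coalesced nor split commands are lost.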

### Added

- **AddressSanitizer + UBSan** — opt-in build via `SANITIZE=1 make all`
or `make asan`. New CI job (`sanitizers`) rebuilds the NIF and
shepherd with `-fsanitize=address,undefined`, preloads `libasan`,
and runs the full `mix test`. The publish job depends on it.
- **Stale UDS socket sweep** in `test/test_helper.exs` (before and
after the suite) — stops accumulation from test crashes before
`cleanup_listener/2` runs.
- **Regression tests** for: NUL-byte validation in `cmd` and `args`,
`Signal.resolve` range + type handling, `:owner` monitor SIGKILL
path, stderr-only fast-exit stats, binary-with-NUL round-trip, and
`NetRunner.run` / `NetRunner.stream` returning validation errors
cleanly.

## [1.0.0] - 2026-02-26

Initial release.
28 changes: 26 additions & 2 deletions Makefile
@@ -16,7 +16,21 @@ ERL_INTERFACE_LIB_DIR ?= $(shell erl -noshell -eval "io:format(\"~ts\", [code:li
UNAME_S := $(shell uname -s)

CC ?= cc
CFLAGS_BASE = -O2 -Wall -Wextra -Werror -std=c99 -fstack-protector-strong -D_FORTIFY_SOURCE=2

# Opt-in sanitizer build. Usage:
# make clean && SANITIZE=1 make all
# mix test (from Elixir — the NIF and shepherd are rebuilt with ASan/UBSan)
#
# Requires LD_PRELOAD of libasan at runtime on Linux when the BEAM isn't
# built with sanitizers; see ci.yml for the invocation.
ifeq ($(SANITIZE),1)
# _FORTIFY_SOURCE is incompatible with ASan (ASan already intercepts
# memcpy/etc.). Lower optimisation to -O1 and skip FORTIFY.
SAN_FLAGS = -fsanitize=address,undefined -fno-omit-frame-pointer -g
CFLAGS_BASE = -O1 -Wall -Wextra -Werror -std=c99 -fstack-protector-strong $(SAN_FLAGS)
else
CFLAGS_BASE = -O2 -Wall -Wextra -Werror -std=c99 -fstack-protector-strong -D_FORTIFY_SOURCE=2
endif

ifeq ($(UNAME_S),Darwin)
# macOS needs _DARWIN_C_SOURCE for SCM_RIGHTS, CMSG_SPACE, etc.
@@ -31,6 +45,11 @@ else
NIF_EXT = .so
endif

ifeq ($(SANITIZE),1)
NIF_LDFLAGS += $(SAN_FLAGS)
SHEPHERD_LDFLAGS += $(SAN_FLAGS)
endif

NIF_CFLAGS = $(CFLAGS) -I$(ERTS_INCLUDE_DIR) -I$(C_SRC_DIR) -fPIC

# Targets
@@ -45,10 +64,15 @@ NIF_OBJ = $(C_SRC_DIR)/net_runner_nif.o

HEADERS = $(C_SRC_DIR)/protocol.h $(C_SRC_DIR)/utils.h

.PHONY: all clean
.PHONY: all clean asan

all: $(PRIV_DIR) $(SHEPHERD) $(NIF_LIB)

# Convenience: force a sanitizer rebuild. Same as SANITIZE=1 make clean all.
asan:
	$(MAKE) clean
	$(MAKE) SANITIZE=1 all

$(PRIV_DIR):
	mkdir -p $(PRIV_DIR)

125 changes: 125 additions & 0 deletions README.md
@@ -83,6 +83,45 @@ Enum.to_list(stream)
NetRunner.run(~w(my_server), kill_timeout: 2000, timeout: 10_000)
```

## Input Validation and Error Returns

`run/2` and `stream/2` return tagged errors for bad input instead of
crashing. NUL bytes inside `cmd` or `args` are rejected early (they
are undefined in `argv` on the C side).

```elixir
# Empty executable
{:error, {:invalid_cmd, _}} = NetRunner.run([""])

# NUL byte in an argument
{:error, {:invalid_args, _}} = NetRunner.run(["echo", "he\0llo"])

# Same behaviour for streaming
{:error, {:invalid_args, _}} = NetRunner.stream(["echo", "he\0llo"])

# Unknown signal atoms and out-of-range integers come back as tagged errors, not raises
{:error, :unknown_signal} = NetRunner.Signal.resolve(:sigwhatever)
{:error, :unknown_signal} = NetRunner.Signal.resolve(99)
```

## Working with Binary Output

stdout is delivered as a raw BEAM binary, with no assumption that it is
valid UTF-8. Bytes containing NUL, high-bit values, or anything else
pass through the pipeline unchanged.

```elixir
# NUL bytes round-trip unchanged
{out, 0} = NetRunner.run(["sh", "-c", ~S|printf 'a\0b\0c'|])
byte_size(out) # => 5
out == "a\0b\0c" # => true

# UTF-8 boundaries straddle chunks fine — just concatenate and then
# decode.
"héllo\n" =
  NetRunner.stream!(~w(echo héllo))
  |> Enum.join()
```

## Process API

For fine-grained control over the OS process lifecycle:
@@ -164,12 +203,45 @@ Proc.await_exit(pid)
stats = Proc.stats(pid)
stats.bytes_in # => 5 (bytes written to stdin)
stats.bytes_out # => 5 (bytes read from stdout)
stats.bytes_err # => 0 (bytes read from stderr, :consume mode)
stats.read_count # => 1 (number of read calls)
stats.write_count # => 1 (number of write calls)
stats.duration_ms # => 3 (wall-clock time)
stats.exit_status # => 0 (exit code)
```

### Tying an OS process to an owner

If the calling process crashes, the OS process it launched should go
with it. Pass `:owner` to have the Process GenServer monitor a pid;
on `:DOWN` it SIGKILLs the child and stops cleanly. `NetRunner.stream/2`
does this automatically with `self()`.

```elixir
# Spawn a long-lived command tied to the caller
parent = self()

spawn(fn ->
  {:ok, pid} = Proc.start("sleep", ["30"], owner: self())
  send(parent, {:os_pid, Proc.os_pid(pid)})
  exit(:boom) # caller dies → Process SIGKILLs sleep, stops itself
end)
```

### Per-call kill timeout

Tune the SIGTERM→SIGKILL escalation window per-process. Useful when a
command has its own graceful shutdown hook you want to honour, or when
you need a fast hard-kill.

```elixir
# Give my_server 10s to drain on SIGTERM before SIGKILL
{:ok, pid} = Proc.start("my_server", [], kill_timeout: 10_000)

# Or make it effectively immediate for tests
{:ok, pid} = Proc.start("sleep", ["100"], kill_timeout: 100)
```

## PTY Mode

Run commands with a pseudo-terminal for programs that require a TTY. PTY mode is designed for **interactive and long-running programs** — shells, REPLs, curses apps.
@@ -268,6 +340,59 @@ Isolate child processes in a cgroup v2 hierarchy for resource control:

The shepherd creates the cgroup directory, moves the child into it, and cleans up on exit (kills all processes via `cgroup.kill`, then removes the directory). No-op on macOS.

## Command DSL

Bundle an executable, default args, and default options into a reusable
`%NetRunner.Command{}`. Both `run/2` and `stream/2` accept it, and
call-site options override the defaults.

```elixir
alias NetRunner.Command

# Inline construction
cmd = Command.new("curl", ["-sS"], timeout: 30_000)
{body, 0} = NetRunner.run(cmd, args: ["https://example.com"])

# Extend at call time (args append; opts merge with runtime winning)
listing = Command.new("ls", ["-la"])
{out, 0} = NetRunner.run(listing, args: ["/tmp"])

# `defcommand` in your own module captures a reusable template:
defmodule MyCmds do
  use NetRunner.Command

  defcommand :curl, "curl", ["-sS", "--max-time", "30"]
  defcommand :echo, "echo"
end

{out, 0} = NetRunner.run(MyCmds.echo(["hi"]))
{:ok, stream} = NetRunner.stream(MyCmds.curl(["https://example.com"]))
```

## Error Handling Cheatsheet

```elixir
case NetRunner.run(["my_tool", arg], timeout: 5_000) do
  {output, 0} ->
    {:ok, output}

  {_partial, status} when status != 0 ->
    {:error, {:nonzero_exit, status}}

  {:error, :timeout} ->
    {:error, :took_too_long}

  {:error, {:max_output_exceeded, partial}} ->
    {:error, {:too_much_output, byte_size(partial)}}

  {:error, {:invalid_cmd, msg}} ->
    {:error, {:bad_cmd, msg}}

  {:error, {:invalid_args, msg}} ->
    {:error, {:bad_args, msg}}
end
```

## Parallel Execution

Every NetRunner process is fully independent — no shared state, no singleton bottleneck: