From ce4f51fa9fd413871412231585f3e09d8183436f Mon Sep 17 00:00:00 2001
From: Niko Maroulis
Date: Fri, 17 Apr 2026 12:09:11 -0400
Subject: [PATCH] Retry UDS drain on port exit to fix macOS CI race
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

macOS runners can deliver the shepherd's Port {:exit_status} message
before the UDS buffer has MSG_CHILD_EXITED readable. A single 500 ms
recv then missed the payload, the status stayed :running, and 5 s
later :force_exit_timeout synthesised 137 — producing the flaky
{"", 137} instead of {"", 0} on echo-to-stderr fast-exit tests.

Replace the single call in maybe_read_exit_status with
drain_uds_for_exit/2: retry up to 5 times (2.5 s worst case) on
:no_message, stop immediately on :closed (shepherd died without
sending) or unexpected frames, and call finish_exit on success.
Still well under the 5 s force timeout fallback.
---
 lib/net_runner/process.ex | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/lib/net_runner/process.ex b/lib/net_runner/process.ex
index 6bafa9a..738cfc2 100644
--- a/lib/net_runner/process.ex
+++ b/lib/net_runner/process.ex
@@ -516,11 +516,33 @@ defmodule NetRunner.Process do
   defp maybe_read_exit_status(%{status: :exited} = state), do: state

   defp maybe_read_exit_status(state) do
+    # Shepherd has exited. The UDS may or may not have delivered
+    # MSG_CHILD_EXITED yet — on slow CI runners (notably macOS) the
+    # buffer can trail the Port's {:exit_status, _} notification.
+    # Retry a few times on timeout; bail immediately on :closed so
+    # the 5 s force_exit_timeout can apply a synthetic 137.
+    drain_uds_for_exit(state, _tries_left = 5)
+  end
+
+  defp drain_uds_for_exit(state, 0), do: state
+
+  defp drain_uds_for_exit(state, tries_left) do
     case Exec.read_uds_message(state.uds_socket) do
       {:child_exited, status} ->
         finish_exit(state, status)

-      _ ->
+      {:error, reason} when reason in [:closed, :econnreset, :enotconn] ->
+        # Peer closed without delivering an exit message; fall through to
+        # force_exit_timeout which will apply status 137.
+        state
+
+      {:error, :no_message} ->
+        # Read timed out (data not yet buffered). Give the kernel another
+        # chance — read_uds_message already waited 500 ms per attempt.
+        drain_uds_for_exit(state, tries_left - 1)
+
+      _other ->
+        # Unexpected shape — stop draining to avoid spinning on bad data.
         state
     end
   end