
Conversation

@Aman-Cool

Prevent IPC hangs during container startup

This PR fixes a long-standing reliability issue in urunc’s IPC handshake between create and start.

Previously, the IPC helper AwaitMessage() would block indefinitely while waiting for a Unix socket connection and message. If the peer process never connected — for example because containerd restarted, the urunc start process was OOM-killed, or the node was under heavy load — the waiting process would never exit. This resulted in orphaned urunc --reexec processes, containers stuck in ContainerCreating, and gradual resource leaks on the node, with no clear error reported.

The fix adds a bounded timeout to the IPC accept and read steps. When the expected message is not received in time, the process now exits with a clear error instead of hanging forever. This makes failed container startups deterministic and observable, while leaving the normal, successful startup path unchanged.

In short: container creation now either succeeds, fails, or times out — but it no longer gets stuck silently.
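
The mechanics of the change can be sketched in a few lines of Go. The snippet below is illustrative only: the awaitMessage signature, the listener plumbing, and the newline-delimited message format are assumptions, not urunc's actual code; only the two timeout values come from this PR.

```go
package ipc

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

const (
	ipcAcceptTimeout = 60 * time.Second // bound on waiting for the peer to connect
	ipcReadTimeout   = 10 * time.Second // bound on waiting for the message itself
)

// awaitMessage waits for a peer to connect on l and send the expected
// message, but never blocks past the configured deadlines.
func awaitMessage(l *net.UnixListener, expected string) error {
	// Bound the accept: if the counterpart never connects (crashed,
	// OOM-killed, containerd restarted), fail instead of hanging forever.
	if err := l.SetDeadline(time.Now().Add(ipcAcceptTimeout)); err != nil {
		return err
	}
	conn, err := l.Accept()
	if err != nil {
		return fmt.Errorf("awaiting IPC connection: %w", err)
	}
	defer conn.Close()

	// Bound the read as well, so a peer that connects but never writes
	// cannot stall the process either.
	if err := conn.SetReadDeadline(time.Now().Add(ipcReadTimeout)); err != nil {
		return err
	}
	msg, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return fmt.Errorf("awaiting IPC message: %w", err)
	}
	if msg != expected+"\n" {
		return fmt.Errorf("unexpected IPC message %q", msg)
	}
	return nil
}
```

On a deadline expiry, both Accept and the read return a timeout error, so the caller can report a clear failure and exit rather than wait forever.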

@netlify

netlify bot commented Jan 25, 2026

Deploy Preview for urunc canceled.

🔨 Latest commit: 247d9be
🔍 Latest deploy log: https://app.netlify.com/projects/urunc/deploys/6975e57460b1a80008a2665f

- Add IPCAcceptTimeout (60s) and IPCReadTimeout (10s) to prevent
  orphaned processes when counterpart never connects
- Fix closure bug in executeHooksConcurrently using wrong loop variable (see the sketch below)
- Fix isRunning() using annotType instead of annotHypervisor
- Add tests for timeout and wrong message handling

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool force-pushed the fix/ipc-timeout-prevent-hanging branch from d84f485 to 247d9be on January 25, 2026 at 09:42
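
For context on the second bullet in the commit message above: the loop-variable issue it names is the classic Go closure-capture bug. The sketch below uses an invented signature, not urunc's actual executeHooksConcurrently.

```go
package hooks

import "sync"

// executeHooksConcurrently is a generic stand-in that shows the
// loop-variable capture fix; urunc's real function is not reproduced here.
func executeHooksConcurrently(hooks []func() error) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, hook := range hooks {
		wg.Add(1)
		// Buggy form (pre-Go 1.22): `go func() { _ = hook() }()`.
		// Every goroutine would share the single loop variable `hook`,
		// so several of them could end up running the same (last) hook.
		// Passing it as an argument gives each goroutine its own copy.
		go func(h func() error) {
			defer wg.Done()
			if err := h(); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(hook)
	}
	wg.Wait()
	return errs
}
```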
@Aman-Cool
Author

This adds reasonable IPC timeouts so urunc doesn’t hang indefinitely during create/start, making failures safer and easier to recover from.

@cmainas
Contributor

cmainas commented Jan 26, 2026

Hello @Aman-Cool ,

thank you for this contribution. Please create an issue before opening a PR. Have you actually encountered the issue you describe? Are there any steps to reproduce it?

Having the reexec process wait is a container-runtime design choice. I am not opposed to adding a timeout, but I think we need to look a bit more at how other container runtimes handle such cases and what a reasonable timeout would be.

@Aman-Cool
Author

Thanks @cmainas for the feedback.
I agree it’s worth looking at how other runtimes approach IPC handshakes, but I want to clarify my perspective on the timeout itself. The intent here isn’t to tune a performance parameter, but to avoid an unbounded wait in a failure path. In the scenarios I’ve observed (e.g. peer process never connecting due to restart or termination), an infinite block results in leaked processes and stuck container state, whereas a conservative timeout makes the failure explicit and recoverable.
I’ll open an issue to document the problem and the conditions under which this occurs, and we can use that as a place to discuss whether the timeout should be configurable or adjusted further.
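
If configurability ends up being the preferred direction, one minimal shape for it could look like the sketch below. Both the URUNC_IPC_TIMEOUT variable and the helper are hypothetical, meant only to illustrate the option, not existing urunc knobs.

```go
package ipc

import (
	"os"
	"time"
)

// ipcTimeout returns the operator-configured timeout, falling back to the
// compiled-in default when the variable is unset or unparsable.
func ipcTimeout(fallback time.Duration) time.Duration {
	if v := os.Getenv("URUNC_IPC_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return fallback
}
```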

