@sidneychang sidneychang commented Jan 26, 2026

  • Call new function cleanupOrphanTaps() at the start of DynamicNetwork.NetworkSetup().
  • Add cleanupOrphanTaps() (new): scan current netns for interfaces matching ^tap.*_urunc$ and use kernel carrier/operational state as the sole criterion:
    • NO-CARRIER => delete (orphan) after removing TC/qdisc
    • LOWER_UP / OperUp / FlagRunning => consider in-use and abort with error
  • Preserve existing networkSetup() create-only semantics and ensure TC/qdisc cleanup before link deletion.

This resolves an issue observed on Kubernetes where restarting/exiting urunc left orphan TAP devices in the pod's network namespace, causing subsequent network setup to fail. The new cleanup removes such orphan TAPs so a new urunc instance can create and configure a fresh TAP.

Description

In Kubernetes setups, when a pod is restarted, the network namespace (created by the pause container) remains active, so the tap0_urunc device still exists. When urunc (re)creates the container, it finds the existing tap0_urunc device and does not recreate it.

Related issues

How was this tested?

  1. Deploy the test Deployment/Service and observe the Pod status
kubectl apply -f nginx-urunc.yaml
kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-urunc-67f8694dd6-ntvgg 1/1 Running 0 8s
  2. Find the QEMU process on the host (record the PID)
ps aux | grep qemu

Example output (initial QEMU PID = 1166356):

root      377624  0.0  0.0 838184 84172 ?        Ssl  Jan23   2:52 /usr/bin/qemu-system-x86_64 ...
root     1166356  4.7  0.0 840108 84624 ?        Ssl  18:42   0:00 /usr/bin/qemu-system-x86_64 ...
root     1166481  0.0  0.0   9212  2308 pts/1    S+   18:42   0:00 grep --color=auto qemu
  3. Inspect the tap device inside that QEMU netns (should be LOWER_UP)
sudo nsenter -t 1166356 -n ip link show tap0_urunc
3: tap0_urunc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 96:27:d0:99:cd:9a brd ff:ff:ff:ff:ff:ff
  4. Force-kill QEMU (simulate a crash)
sudo kill -9 1166356

Then observe the Pod status and restart:

kubectl get pods
NAME                           READY   STATUS   RESTARTS   AGE
nginx-urunc-67f8694dd6-ntvgg   0/1     Error    0          46s
# then the Pod restarts successfully
NAME                           READY   STATUS    RESTARTS     AGE
nginx-urunc-67f8694dd6-ntvgg   1/1     Running   1 (6s ago)   48s
  5. Find the new QEMU process and verify the new tap (new PID = 1166964)
ps aux | grep qemu
root 1166964 9.0 0.0 840108 84456 ? Ssl 18:43 0:00 /usr/bin/qemu-system-x86_64 ...

sudo nsenter -t 1166964 -n ip link show tap0_urunc
3: tap0_urunc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether d2:b6:52:78:bf:0e brd ff:ff:ff:ff:ff:ff
  6. Verify the Pod IP and network connectivity
kubectl get pods -o wide
ping -c 3 <POD_IP>

NAME                           READY   STATUS    RESTARTS      AGE   IP            NODE   ...
nginx-urunc-67f8694dd6-ntvgg   1/1     Running   1 (30s ago)   72s   10.88.0.104   test

PING 10.88.0.104 (10.88.0.104) 56(84) bytes of data.
64 bytes from 10.88.0.104: icmp_seq=1 ttl=255 time=0.513 ms
64 bytes from 10.88.0.104: icmp_seq=2 ttl=255 time=0.245 ms
--- 10.88.0.104 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss
  7. Optional: List all interfaces in the netns to confirm both eth0 and tap are UP
sudo nsenter -t 1166964 -n ip link
# old tap interface has been removed
1: lo: <LOOPBACK,UP,LOWER_UP> ...
2: eth0@if281: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
4: tap0_urunc: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
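
The manual checks in the steps above rely on reading the flag list that `ip link show` prints between angle brackets. As a rough aid for scripting that check, the sketch below parses such a header line and reports whether the link has a carrier. `parseLinkFlags` and `hasCarrier` are hypothetical helper names, not part of urunc, and the NO-CARRIER sample line in `main` is illustrative rather than captured from the test run.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLinkFlags returns the comma-separated flags found between the
// first '<' and the following '>' of an `ip link show` header line.
func parseLinkFlags(line string) []string {
	start := strings.Index(line, "<")
	if start < 0 {
		return nil
	}
	end := strings.Index(line[start:], ">")
	if end < 0 {
		return nil
	}
	inner := line[start+1 : start+end]
	if inner == "" {
		return nil
	}
	return strings.Split(inner, ",")
}

// hasCarrier reports true only when LOWER_UP is present and
// NO-CARRIER is absent in the flag list.
func hasCarrier(line string) bool {
	lowerUp := false
	for _, f := range parseLinkFlags(line) {
		switch f {
		case "NO-CARRIER":
			return false
		case "LOWER_UP":
			lowerUp = true
		}
	}
	return lowerUp
}

func main() {
	// Header line as shown in the test steps above.
	live := "3: tap0_urunc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000"
	// Illustrative line for a TAP whose backing process has gone away.
	stale := "3: tap0_urunc: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000"
	fmt.Println(hasCarrier(live), hasCarrier(stale))
}
```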

LLM usage

I used an LLM to discuss how to identify an orphan TAP device.

Checklist

  • I have read the contribution guide.
  • The linter passes locally (make lint).
  • The e2e tests of at least one tool pass locally (make test_ctr, make test_nerdctl, make test_docker, make test_crictl).
  • If LLMs were used: I have read the llm policy.

netlify bot commented Jan 26, 2026

Deploy Preview for urunc ready!

Name Link
🔨 Latest commit dbb9c93
🔍 Latest deploy log https://app.netlify.com/projects/urunc/deploys/6978272d95f0a00008c59aef
😎 Deploy Preview https://deploy-preview-407--urunc.netlify.app

@sidneychang sidneychang force-pushed the cleanup_zoombie_tap branch 2 times, most recently from 2e20d21 to 157d093 Compare January 26, 2026 17:21
@sidneychang sidneychang force-pushed the cleanup_zoombie_tap branch 2 times, most recently from 7fcd818 to dbb9c93 Compare January 27, 2026 02:47
…r state

- Call new function cleanupOrphanTaps() at the start of
  DynamicNetwork.NetworkSetup().
- Add cleanupOrphanTaps(): scan netns for interfaces matching
  ^tap.*_urunc$ and use kernel carrier/operational state as the
  sole criterion:
  - NO-CARRIER => delete orphan (remove TC/qdisc, then delete link)
  - LOWER_UP / operational up / FlagRunning => treat as in-use and abort
- Do not scan /proc or check /dev/net/tun; do not attempt to reuse TAPs.
- Skip cleanup when no container interface (e.g. no eth0) is present.
- Remove PID/FD based checks and netns flock; document the single
  unikernel-per-netns assumption.
- Preserve networkSetup() create-only semantics and ensure TC/qdisc
  cleanup before link deletion.

This resolves an issue on Kubernetes where restarting urunc left
orphan TAP devices in the pod network namespace and prevented
subsequent network setup.

Signed-off-by: sidneychang <2190206983@qq.com>

Development

Successfully merging this pull request may close these issues.

Orphan tap*_urunc left after urunc restart in Kubernetes, preventing NetworkSetup
