fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change#378
fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change#378woshihoujinxin wants to merge 1 commit into
Conversation
|
Claude finished @woshihoujinxin's task in 2m 17s —— View job Reviewing PR #378 —
|
…DR change When the CIDR in Cubelet config changes, several code paths silently accepted old-CIDR IPs via Assign(), causing sandboxes to receive IPs from the wrong network segment. - ipam: add Contains() to check if IP falls within current CIDR range - recover(): destroy kernel taps with out-of-range IPs - reconcileState(): reject sandbox with IP outside current CIDR - dequeueTapLocked(): defensively destroy out-of-range taps from pool - Wire CIDR and tap_init_num via environment variables Signed-off-by: HouJinxin <houjinxin@example.com>
c4c97a3 to
48e4078
Compare
|
Claude finished @woshihoujinxin's task in 1m 46s —— View job Review of PR #378 —
|
| Test | What it validates |
|---|---|
TestReconcileStateRejectsOutOfCIDRSandbox |
reconcileState rejects old-CIDR IPs |
TestReconcileStateAcceptsInCIDRSandbox |
reconcileState accepts new-CIDR IPs |
TestReconcileStateRejectsInvalidSandboxIP |
reconcileState rejects unparseable IPs |
TestRecoverDestroysOutOfRangeTapsOnCIDRChange |
recover destroys old-CIDR taps (2 taps, both destroyed, none pooled) |
TestRecoverSkipsPersistedStateWithOutOfCIDRIP |
recover cleans up persisted state for destroyed taps |
TestRecoverSkipsSandboxWhenReconcileFailsForCIDRChange |
recover handles reconcile failure gracefully (IP released, state cleaned) |
TestDequeueTapLockedDestroysOutOfRangeTapAfterCIDRChange |
dequeueTapLocked destroys out-of-range tap from pool |
TestEnsureNetworkAllocatesFromNewCIDRAfterChange |
EnsureNetwork allocates from new CIDR |
The Contains() method is tested in ipam_test.go:261-288 with 9 cases covering boundary values (network addr, gateway, mid-range, broadcast) and rejections (out-of-range, nil, IPv6).
Issues with the main.go / logging.go changes
The main.go and logging.go changes are tangentially related but bring real improvements:
- Panic recovery handler to prevent silent crashes
initServicefailure writes to stderr +os.Exit(1)instead ofCubeLog.Fatalf(which may not be initialized)- Safe
localIP := "unknown"fallback inlogging.go
These should be called out separately from the CIDR fix.
Summary
Overall: Correct with minor cleanup suggestions. The three-layer defense-in-depth correctly prevents old-CIDR IP allocation. Test coverage is comprehensive. The main improvement I'd recommend is addressing Finding 2 (port mapping cleanup for out-of-CIDR taps in recover()) and updating the PR description to accurately describe the main.go/logging.go changes.
|
The premise of this PR is wrong. Changing the local tap CIDR inherently breaks networking for all running sandboxes — their tap interfaces already have IPs from the old segment. Destroying those taps just kills their connectivity. There is no way to make this safe at runtime. This is a deployment-time config that should never change on a live node. If the CIDR must change, drain the node and restart clean. Adding 600+ lines of validation and test code to handle a scenario that should not happen operationally only adds complexity to already-complex paths like If anything is needed, it is a simple startup check: refuse to start if the CIDR in config does not match persisted state, and tell the operator to drain the node first. That is it. What is the actual operational scenario that requires runtime CIDR changes? |
|
I agree with @fslongjin . Let's focus on #379 instead. |
chenhengqi
left a comment
There was a problem hiding this comment.
I think this change is not the right way.
|
The default CIDR conflicts with my K8s cluster network range, which is why I modify the CIDR after the initial installation. Changing the CIDR requires cleaning up the TAP network resources created during that first install. At this point, no sandboxes have been started yet. The scenario I'm describing is: |
|
Additionally, if the TAP devices were created during installation and no sandbox is currently using them, they should be removed when the service is stopped. The current situation is: after installation, the system doesn't work, and when I try to uninstall and reinstall, the system has left behind a bunch of |
A reboot should work. Let's stick to fixing the installation script. Thanks. |
Problem
When the CIDR in Cubelet config is changed and network-agent restarts, several code paths silently accepted old-CIDR IPs via
Assign(), causing sandboxes to receive IPs from the wrong network segment.Changes
Contains()to check if IP falls within current CIDR rangeTesting
cidr_change_test.gocover recover, reconcileState, and dequeue pathsContains()tested inipam_test.go