fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change by woshihoujinxin · Pull Request #378 · TencentCloud/CubeSandbox

woshihoujinxin · 2026-05-27T11:39:12Z

Problem

When the CIDR in Cubelet config is changed and network-agent restarts, several code paths silently accepted old-CIDR IPs via Assign(), causing sandboxes to receive IPs from the wrong network segment.

Changes

ipam: add Contains() to check if IP falls within current CIDR range
recover(): destroy kernel taps with out-of-range IPs instead of pooling them
reconcileState(): reject sandbox with IP outside current CIDR
dequeueTapLocked(): defensively destroy out-of-range taps from pool
Wire CIDR and tap_init_num via environment variables
Add comprehensive unit tests covering CIDR change scenarios

Testing

Unit tests in cidr_change_test.go cover recover, reconcileState, and dequeue paths
IPAM Contains() tested in ipam_test.go

cubesandboxbot · 2026-05-27T11:39:36Z

Claude finished @woshihoujinxin's task in 2m 17s —— View job

Reviewing PR #378 — `fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change`

Progress Checklist

Gather context and read changed files
Analyze the code changes
Check for correctness, edge cases, and potential issues
Review test coverage
~~Run build/tests~~ (blocked by sandbox permissions)

Overview

This PR fixes a real problem: when the CIDR in Cubelet config changes and network-agent restarts, the allocator's Assign() accepted old-CIDR IPs silently, causing sandboxes to get IPs from the wrong network segment. The fix adds Contains() checks at three different layers for defense-in-depth.

Strengths

Defense in depth — CIDR validation happens at three distinct boundaries:
- recover() at startup — taps with out-of-range IPs are destroyed immediately (local_service.go:492-500)
- reconcileState() on reconciliation — sandbox IPs outside the CIDR return an error (local_service.go:370-372)
- dequeueTapLocked() — defensive check for out-of-range taps that made it into the pool (tap_lifecycle.go:65-74)
  This layered approach correctly handles all entry points.
Graceful cleanup — When out-of-range taps are destroyed, all resources are properly released: port mappings cleared, cubeVS device deleted, kernel tap destroyed, and persisted state cleaned up.
Good test coverage — The 648-line cidr_change_test.go covers:
- reconcileState rejects out-of-CIDR sandboxes
- reconcileState accepts in-CIDR sandboxes
- reconcileState rejects invalid/unparseable sandbox IPs
- recover destroys out-of-range taps and does not pool them
- recover skips persisted states with out-of-CIDR IPs and cleans up
- recover handles reconcileState failure gracefully (cleans up and moves on)
- dequeueTapLocked destroys out-of-range taps from pool
- EnsureNetwork correctly allocates from the new CIDR
IP allocator validation at startup — NewLocalService (local_service.go:74-85) now validates that tap_init_num doesn't exceed available IPs in the CIDR, preventing a startup with unusable configuration.

Findings

1. PR description inaccuracy: "Wire CIDR and tap_init_num via environment variables"

The PR description mentions wiring CIDR and tap_init_num via environment variables, but the actual changes in main.go and logging.go are about:

Adding a panic recovery handler to main()
Writing init errors to stderr + os.Exit(1) instead of CubeLog.Fatalf (which may not be initialized yet)
Fixing a potential nil dereference on CubeLog.LocalIP in logging.go

CIDR and tap_init_num are read from the Cubelet TOML config file via LoadConfigFromCubeletTOML in config.go, which was already the case before this PR. No environment variable wiring was added. The description is misleading — consider updating it.

2. `recover()` function complexity

The recover() function (local_service.go:451-586) is already quite complex with multiple layers of branching. The CIDR change path adds additional branches. For example, the CIDR out-of-range check at line 492 uses continue to skip the tap, then the function falls through to the stale-state cleanup loop at line 561. While correct, this function would benefit from a refactor into smaller helper functions in a follow-up.

3. `recover()` — skipped IP release is intentional but not documented

When a tap has an IP outside the current CIDR (local_service.go:492-500):

if !s.allocator.Contains(tap.IP) {
    // ... destroy ...
    continue
}

The code does not call s.allocator.Release(tap.IP). This is correct because s.allocator.Assign() is never called for out-of-range IPs (the Contains check gates Assign at line 502). However, this is a subtle correctness point — a short comment explaining this would help future readers avoid wondering whether there's a leak.

4. `dequeueTapLocked()` — `cubevsDelTAPDevice` error ignored

At tap_lifecycle.go:71, the result of cubevsDelTAPDevice for out-of-range taps is discarded:

_ = cubevsDelTAPDevice(uint32(tap.Index), tap.IP.To4())

This is consistent with other destruction paths in this codebase (there's already precedent at local_service.go:498), but if this fails, the tap device entry remains in the cubeVS map — a potential source of stale state. Consider at least logging the error.

5. `Contains()` vs `Assign()` consistency

The Contains() method (ipam.go:150-157) uses the exact same index calculation logic as Assign() and Release(), so there can't be false negatives/positives due to inconsistent bounds. This is good engineering.

Minor Notes

The main.go panic recovery and logging changes are tangentially related to the CIDR fix but are reasonable safety improvements to include.
The logging.go LocalIP fix (localIP := "unknown" guard) prevents a panic when CubeLog.LocalIP is empty — a nice defensive fix.
tap.IP.To4() is called at local_service.go:499 without a nil check, but Contains() already validated the IP is in-range (implying it's a valid IPv4), so this is safe.

Summary

Verdict: Changes are correct and well-structured. The defense-in-depth approach with checks in recover(), reconcileState(), and dequeueTapLocked() correctly prevents old-CIDR IPs from being assigned to new sandboxes. The test coverage is comprehensive. The code properly cleans up all resources (kernel taps, cubeVS entries, port mappings, persisted state) when out-of-range taps are detected.

The main actionable finding is issue #1 — the PR description should be updated to accurately describe the main.go/logging.go changes.

…DR change When the CIDR in Cubelet config changes, several code paths silently accepted old-CIDR IPs via Assign(), causing sandboxes to receive IPs from the wrong network segment. - ipam: add Contains() to check if IP falls within current CIDR range - recover(): destroy kernel taps with out-of-range IPs - reconcileState(): reject sandbox with IP outside current CIDR - dequeueTapLocked(): defensively destroy out-of-range taps from pool - Wire CIDR and tap_init_num via environment variables Signed-off-by: HouJinxin <houjinxin@example.com>

cubesandboxbot · 2026-05-27T11:54:28Z

Claude finished @woshihoujinxin's task in 1m 46s —— View job

Review of PR #378 — `fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change`

Progress

Read all 7 changed files
Analyze correctness of CIDR validation logic
Review test coverage
Check for edge cases and resource leaks

Overview

This PR prevents sandboxes from receiving IPs from an outdated network segment when the CIDR in Cubelet config changes and network-agent restarts. The fix adds Contains() validation at three layers (defense-in-depth) in recover(), reconcileState(), and dequeueTapLocked(), plus a TapInitNum validation at startup in NewLocalService().

Correctness Analysis

1. Three-layer defense is sound

Each layer catches a distinct code path through which an old-CIDR IP could reach a sandbox:

recover() (local_service.go:492-500): Startup — taps with out-of-range IPs are destroyed before anything can use them.
reconcileState() (local_service.go:370-372): Reconciliation — a sandbox whose persisted IP doesn't match the current CIDR is rejected.
dequeueTapLocked() (tap_lifecycle.go:65-74): Defensive — an out-of-range tap surviving in the pool (race/maint loop) is destroyed on dequeue.

All three paths properly clean up: clearPortMappings → cubevsDelTAPDevice → destroyTap. No resources are leaked.

2. Contains() uses consistent index math (ipam.go:150-157)

Same ip2Idx - startIdx calculation as Assign() and Release(), so there's no divergence between what the allocator considers "in range" vs what the guard checks consider "in range."

3. TapInitNum validation at startup (local_service.go:74-85)

Rejects a configuration where tap_init_num exceeds available IPs in the CIDR with a clear error message suggesting fixes. This prevents a silent startup failure where the allocator can't create enough pool taps.

Findings

Finding 1: PR description inaccuracy — "Wire CIDR and tap_init_num via environment variables"

The PR description mentions wiring CIDR and tap_init_num via environment variables, but these values are read from the Cubelet TOML config file via LoadConfigFromCubeletTOML — no environment variable wiring was added. The description should be updated.

Finding 2: Potential port mapping leak for out-of-CIDR taps in recover()

In recover() (local_service.go:492-500), when an out-of-CIDR tap is destroyed:

if !s.allocator.Contains(tap.IP) {
    s.clearPortMappings(tap)       // tap.PortMappings is nil here
    _ = cubevsDelTAPDevice(...)
    _ = destroyTapFunc(tap.Index)
    continue
}

At this point, tap.PortMappings is nil (set by listCubeTapsFunc, not yet populated from mappingsByIfindex at line 503). So clearPortMappings is a no-op — the port mappings previously loaded at line 464-466 are assigned in s.ports but never released. While cubevsDelTAPDevice may cascade-delete port mappings in cubeVS, s.ports retains the stale assignments. Impact is low (on restart, s.ports is fresh), but it's a subtle inconsistency worth noting.

To fix, the code could look up mappingsByIfindex[tap.Index] before destroying:

if !s.allocator.Contains(tap.IP) {
    tap.PortMappings = append([]PortMapping(nil), mappingsByIfindex[tap.Index]...)
    s.clearPortMappings(tap)
    _ = cubevsDelTAPDevice(...)
    _ = destroyTapFunc(tap.Index)
    continue
}

Finding 3: recover() nil IP edge case

If listCubeTapsFunc returns a tap with a nil IP, s.allocator.Contains(nil) returns false (because ip.To4() on nil returns nil), so the code enters the out-of-CIDR destroy path and calls tap.IP.To4() — which would panic on nil. In practice, kernel taps always have valid IPs, so this isn't reachable. Adding a nil guard before the Contains check would make the code robust against unexpected data:

if tap.IP == nil || !s.allocator.Contains(tap.IP) {
    // ...
}

Finding 4: dequeueTapLocked() — cubevsDelTAPDevice error silently ignored

At tap_lifecycle.go:71:

_ = cubevsDelTAPDevice(uint32(tap.Index), tap.IP.To4())

The error is discarded. If this fails, the cubeVS TAP entry remains, creating stale state. Unless there's a strong reason (like the call sometimes "legitimately" fails), a logged warning would help debugging. This pattern exists elsewhere in the codebase too.

Finding 5: recover() function complexity

The recover() function (local_service.go:451-586) handles three distinct concerns in a single function: (a) loading persisted states, (b) iterating kernel taps and matching/destroying/recovering them, and (c) cleaning up stale persisted states with no matching tap. The CIDR check adds another branch. Extracting helper functions (e.g., destroyOutOfRangeTap, recoverTapWithState, recoverTapNoState) would improve readability.

Test Coverage

The 648-line cidr_change_test.go covers all three CIDR validation paths plus edge cases:

Test	What it validates
`TestReconcileStateRejectsOutOfCIDRSandbox`	`reconcileState` rejects old-CIDR IPs
`TestReconcileStateAcceptsInCIDRSandbox`	`reconcileState` accepts new-CIDR IPs
`TestReconcileStateRejectsInvalidSandboxIP`	`reconcileState` rejects unparseable IPs
`TestRecoverDestroysOutOfRangeTapsOnCIDRChange`	`recover` destroys old-CIDR taps (2 taps, both destroyed, none pooled)
`TestRecoverSkipsPersistedStateWithOutOfCIDRIP`	`recover` cleans up persisted state for destroyed taps
`TestRecoverSkipsSandboxWhenReconcileFailsForCIDRChange`	`recover` handles reconcile failure gracefully (IP released, state cleaned)
`TestDequeueTapLockedDestroysOutOfRangeTapAfterCIDRChange`	`dequeueTapLocked` destroys out-of-range tap from pool
`TestEnsureNetworkAllocatesFromNewCIDRAfterChange`	`EnsureNetwork` allocates from new CIDR

The Contains() method is tested in ipam_test.go:261-288 with 9 cases covering boundary values (network addr, gateway, mid-range, broadcast) and rejections (out-of-range, nil, IPv6).

Issues with the main.go / logging.go changes

The main.go and logging.go changes are tangentially related but bring real improvements:

Panic recovery handler to prevent silent crashes
initService failure writes to stderr + os.Exit(1) instead of CubeLog.Fatalf (which may not be initialized)
Safe localIP := "unknown" fallback in logging.go

These should be called out separately from the CIDR fix.

Summary

Overall: Correct with minor cleanup suggestions. The three-layer defense-in-depth correctly prevents old-CIDR IP allocation. Test coverage is comprehensive. The main improvement I'd recommend is addressing Finding 2 (port mapping cleanup for out-of-CIDR taps in recover()) and updating the PR description to accurately describe the main.go/logging.go changes.

fslongjin · 2026-05-27T13:11:06Z

The premise of this PR is wrong. Changing the local tap CIDR inherently breaks networking for all running sandboxes — their tap interfaces already have IPs from the old segment. Destroying those taps just kills their connectivity. There is no way to make this safe at runtime.

This is a deployment-time config that should never change on a live node. If the CIDR must change, drain the node and restart clean. Adding 600+ lines of validation and test code to handle a scenario that should not happen operationally only adds complexity to already-complex paths like recover().

If anything is needed, it is a simple startup check: refuse to start if the CIDR in config does not match persisted state, and tell the operator to drain the node first. That is it.

What is the actual operational scenario that requires runtime CIDR changes?

chenhengqi · 2026-05-28T02:03:06Z

I agree with @fslongjin . Let's focus on #379 instead.

chenhengqi

I think this change is not the right way.

woshihoujinxin · 2026-05-29T03:36:38Z

The default CIDR conflicts with my K8s cluster network range, which is why I modify the CIDR after the initial installation. Changing the CIDR requires cleaning up the TAP network resources created during that first install. At this point, no sandboxes have been started yet. The scenario I'm describing is:
▎ discovering a CIDR conflict after the initial installation that prevents the K8s cluster from replying to packets, forcing a CIDR change.

woshihoujinxin · 2026-05-29T04:13:38Z

Additionally, if the TAP devices were created during installation and no sandbox is currently using them, they should be removed when the service is stopped. The current situation is: after installation, the system doesn't work, and when I try to uninstall and reinstall, the system has left behind a bunch of
▎ orphaned TAP resources that prevent network-agent from starting after reinstallation.

chenhengqi · 2026-05-29T06:49:49Z

Additionally, if the TAP devices were created during installation and no sandbox is currently using them, they should be removed when the service is stopped. The current situation is: after installation, the system doesn't work, and when I try to uninstall and reinstall, the system has left behind a bunch of ▎ orphaned TAP resources that prevent network-agent from starting after reinstallation.

A reboot should work.

Let's stick to fixing the installation script. Thanks.

woshihoujinxin force-pushed the fix/network-agent-cidr-ip-validation branch from c4c97a3 to 48e4078 Compare May 27, 2026 11:53

woshihoujinxin requested a review from fslongjin as a code owner May 27, 2026 11:53

chenhengqi requested changes May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change#378

fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change#378
woshihoujinxin wants to merge 1 commit into
TencentCloud:masterfrom
woshihoujinxin:fix/network-agent-cidr-ip-validation

woshihoujinxin commented May 27, 2026

Uh oh!

cubesandboxbot Bot commented May 27, 2026 •

edited

Loading

Uh oh!

cubesandboxbot Bot commented May 27, 2026 •

edited

Loading

Uh oh!

fslongjin commented May 27, 2026

Uh oh!

chenhengqi commented May 28, 2026

Uh oh!

chenhengqi left a comment

Uh oh!

woshihoujinxin commented May 29, 2026

Uh oh!

woshihoujinxin commented May 29, 2026

Uh oh!

chenhengqi commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

woshihoujinxin commented May 27, 2026

Problem

Changes

Testing

Uh oh!

cubesandboxbot Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewing PR #378 — fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change

Progress Checklist

Overview

Strengths

Findings

1. PR description inaccuracy: "Wire CIDR and tap_init_num via environment variables"

2. recover() function complexity

3. recover() — skipped IP release is intentional but not documented

4. dequeueTapLocked() — cubevsDelTAPDevice error ignored

5. Contains() vs Assign() consistency

Minor Notes

Summary

Uh oh!

cubesandboxbot Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of PR #378 — fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change

Progress

Overview

Correctness Analysis

Findings

Test Coverage

Issues with the main.go / logging.go changes

Summary

Uh oh!

fslongjin commented May 27, 2026

Uh oh!

chenhengqi commented May 28, 2026

Uh oh!

chenhengqi left a comment

Choose a reason for hiding this comment

Uh oh!

woshihoujinxin commented May 29, 2026

Uh oh!

woshihoujinxin commented May 29, 2026

Uh oh!

chenhengqi commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cubesandboxbot Bot commented May 27, 2026 •

edited

Loading

Reviewing PR #378 — `fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change`

2. `recover()` function complexity

3. `recover()` — skipped IP release is intentional but not documented

4. `dequeueTapLocked()` — `cubevsDelTAPDevice` error ignored

5. `Contains()` vs `Assign()` consistency

cubesandboxbot Bot commented May 27, 2026 •

edited

Loading

Review of PR #378 — `fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change`