Skip to content

fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change#378

Open
woshihoujinxin wants to merge 1 commit into
TencentCloud:masterfrom
woshihoujinxin:fix/network-agent-cidr-ip-validation
Open

fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change#378
woshihoujinxin wants to merge 1 commit into
TencentCloud:masterfrom
woshihoujinxin:fix/network-agent-cidr-ip-validation

Conversation

@woshihoujinxin
Copy link
Copy Markdown

Problem

When the CIDR in Cubelet config is changed and network-agent restarts, several code paths silently accepted old-CIDR IPs via Assign(), causing sandboxes to receive IPs from the wrong network segment.

Changes

  • ipam: add Contains() to check if IP falls within current CIDR range
  • recover(): destroy kernel taps with out-of-range IPs instead of pooling them
  • reconcileState(): reject sandbox with IP outside current CIDR
  • dequeueTapLocked(): defensively destroy out-of-range taps from pool
  • Wire CIDR and tap_init_num via environment variables
  • Add comprehensive unit tests covering CIDR change scenarios

Testing

  • Unit tests in cidr_change_test.go cover recover, reconcileState, and dequeue paths
  • IPAM Contains() tested in ipam_test.go

@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 27, 2026

Claude finished @woshihoujinxin's task in 2m 17s —— View job


Reviewing PR #378fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change

Progress Checklist

  • Gather context and read changed files
  • Analyze the code changes
  • Check for correctness, edge cases, and potential issues
  • Review test coverage
  • Run build/tests (blocked by sandbox permissions)

Overview

This PR fixes a real problem: when the CIDR in Cubelet config changes and network-agent restarts, the allocator's Assign() accepted old-CIDR IPs silently, causing sandboxes to get IPs from the wrong network segment. The fix adds Contains() checks at three different layers for defense-in-depth.

Strengths

  1. Defense in depth — CIDR validation happens at three distinct boundaries:

    • recover() at startup — taps with out-of-range IPs are destroyed immediately (local_service.go:492-500)
    • reconcileState() on reconciliation — sandbox IPs outside the CIDR return an error (local_service.go:370-372)
    • dequeueTapLocked() — defensive check for out-of-range taps that made it into the pool (tap_lifecycle.go:65-74)
      This layered approach correctly handles all entry points.
  2. Graceful cleanup — When out-of-range taps are destroyed, all resources are properly released: port mappings cleared, cubeVS device deleted, kernel tap destroyed, and persisted state cleaned up.

  3. Good test coverage — The 648-line cidr_change_test.go covers:

    • reconcileState rejects out-of-CIDR sandboxes
    • reconcileState accepts in-CIDR sandboxes
    • reconcileState rejects invalid/unparseable sandbox IPs
    • recover destroys out-of-range taps and does not pool them
    • recover skips persisted states with out-of-CIDR IPs and cleans up
    • recover handles reconcileState failure gracefully (cleans up and moves on)
    • dequeueTapLocked destroys out-of-range taps from pool
    • EnsureNetwork correctly allocates from the new CIDR
  4. IP allocator validation at startupNewLocalService (local_service.go:74-85) now validates that tap_init_num doesn't exceed available IPs in the CIDR, preventing a startup with unusable configuration.

Findings

1. PR description inaccuracy: "Wire CIDR and tap_init_num via environment variables"

The PR description mentions wiring CIDR and tap_init_num via environment variables, but the actual changes in main.go and logging.go are about:

  • Adding a panic recovery handler to main()
  • Writing init errors to stderr + os.Exit(1) instead of CubeLog.Fatalf (which may not be initialized yet)
  • Fixing a potential nil dereference on CubeLog.LocalIP in logging.go

CIDR and tap_init_num are read from the Cubelet TOML config file via LoadConfigFromCubeletTOML in config.go, which was already the case before this PR. No environment variable wiring was added. The description is misleading — consider updating it.

2. recover() function complexity

The recover() function (local_service.go:451-586) is already quite complex with multiple layers of branching. The CIDR change path adds additional branches. For example, the CIDR out-of-range check at line 492 uses continue to skip the tap, then the function falls through to the stale-state cleanup loop at line 561. While correct, this function would benefit from a refactor into smaller helper functions in a follow-up.

3. recover() — skipped IP release is intentional but not documented

When a tap has an IP outside the current CIDR (local_service.go:492-500):

if !s.allocator.Contains(tap.IP) {
    // ... destroy ...
    continue
}

The code does not call s.allocator.Release(tap.IP). This is correct because s.allocator.Assign() is never called for out-of-range IPs (the Contains check gates Assign at line 502). However, this is a subtle correctness point — a short comment explaining this would help future readers avoid wondering whether there's a leak.

4. dequeueTapLocked()cubevsDelTAPDevice error ignored

At tap_lifecycle.go:71, the result of cubevsDelTAPDevice for out-of-range taps is discarded:

_ = cubevsDelTAPDevice(uint32(tap.Index), tap.IP.To4())

This is consistent with other destruction paths in this codebase (there's already precedent at local_service.go:498), but if this fails, the tap device entry remains in the cubeVS map — a potential source of stale state. Consider at least logging the error.

5. Contains() vs Assign() consistency

The Contains() method (ipam.go:150-157) uses the exact same index calculation logic as Assign() and Release(), so there can't be false negatives/positives due to inconsistent bounds. This is good engineering.

Minor Notes

  • The main.go panic recovery and logging changes are tangentially related to the CIDR fix but are reasonable safety improvements to include.
  • The logging.go LocalIP fix (localIP := "unknown" guard) prevents a panic when CubeLog.LocalIP is empty — a nice defensive fix.
  • tap.IP.To4() is called at local_service.go:499 without a nil check, but Contains() already validated the IP is in-range (implying it's a valid IPv4), so this is safe.

Summary

Verdict: Changes are correct and well-structured. The defense-in-depth approach with checks in recover(), reconcileState(), and dequeueTapLocked() correctly prevents old-CIDR IPs from being assigned to new sandboxes. The test coverage is comprehensive. The code properly cleans up all resources (kernel taps, cubeVS entries, port mappings, persisted state) when out-of-range taps are detected.

The main actionable finding is issue #1 — the PR description should be updated to accurately describe the main.go/logging.go changes.


…DR change

When the CIDR in Cubelet config changes, several code paths silently
accepted old-CIDR IPs via Assign(), causing sandboxes to receive IPs
from the wrong network segment.

- ipam: add Contains() to check if IP falls within current CIDR range
- recover(): destroy kernel taps with out-of-range IPs
- reconcileState(): reject sandbox with IP outside current CIDR
- dequeueTapLocked(): defensively destroy out-of-range taps from pool
- Wire CIDR and tap_init_num via environment variables

Signed-off-by: HouJinxin <houjinxin@example.com>
@woshihoujinxin woshihoujinxin force-pushed the fix/network-agent-cidr-ip-validation branch from c4c97a3 to 48e4078 Compare May 27, 2026 11:53
@woshihoujinxin woshihoujinxin requested a review from fslongjin as a code owner May 27, 2026 11:53
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 27, 2026

Claude finished @woshihoujinxin's task in 1m 46s —— View job


Review of PR #378fix(network-agent): prevent out-of-CIDR IP allocation on CIDR change

Progress

  • Read all 7 changed files
  • Analyze correctness of CIDR validation logic
  • Review test coverage
  • Check for edge cases and resource leaks

Overview

This PR prevents sandboxes from receiving IPs from an outdated network segment when the CIDR in Cubelet config changes and network-agent restarts. The fix adds Contains() validation at three layers (defense-in-depth) in recover(), reconcileState(), and dequeueTapLocked(), plus a TapInitNum validation at startup in NewLocalService().

Correctness Analysis

1. Three-layer defense is sound

Each layer catches a distinct code path through which an old-CIDR IP could reach a sandbox:

  • recover() (local_service.go:492-500): Startup — taps with out-of-range IPs are destroyed before anything can use them.
  • reconcileState() (local_service.go:370-372): Reconciliation — a sandbox whose persisted IP doesn't match the current CIDR is rejected.
  • dequeueTapLocked() (tap_lifecycle.go:65-74): Defensive — an out-of-range tap surviving in the pool (race/maint loop) is destroyed on dequeue.

All three paths properly clean up: clearPortMappingscubevsDelTAPDevicedestroyTap. No resources are leaked.

2. Contains() uses consistent index math (ipam.go:150-157)

Same ip2Idx - startIdx calculation as Assign() and Release(), so there's no divergence between what the allocator considers "in range" vs what the guard checks consider "in range."

3. TapInitNum validation at startup (local_service.go:74-85)

Rejects a configuration where tap_init_num exceeds available IPs in the CIDR with a clear error message suggesting fixes. This prevents a silent startup failure where the allocator can't create enough pool taps.

Findings

Finding 1: PR description inaccuracy — "Wire CIDR and tap_init_num via environment variables"

The PR description mentions wiring CIDR and tap_init_num via environment variables, but these values are read from the Cubelet TOML config file via LoadConfigFromCubeletTOML — no environment variable wiring was added. The description should be updated.

Finding 2: Potential port mapping leak for out-of-CIDR taps in recover()

In recover() (local_service.go:492-500), when an out-of-CIDR tap is destroyed:

if !s.allocator.Contains(tap.IP) {
    s.clearPortMappings(tap)       // tap.PortMappings is nil here
    _ = cubevsDelTAPDevice(...)
    _ = destroyTapFunc(tap.Index)
    continue
}

At this point, tap.PortMappings is nil (set by listCubeTapsFunc, not yet populated from mappingsByIfindex at line 503). So clearPortMappings is a no-op — the port mappings previously loaded at line 464-466 are assigned in s.ports but never released. While cubevsDelTAPDevice may cascade-delete port mappings in cubeVS, s.ports retains the stale assignments. Impact is low (on restart, s.ports is fresh), but it's a subtle inconsistency worth noting.

To fix, the code could look up mappingsByIfindex[tap.Index] before destroying:

if !s.allocator.Contains(tap.IP) {
    tap.PortMappings = append([]PortMapping(nil), mappingsByIfindex[tap.Index]...)
    s.clearPortMappings(tap)
    _ = cubevsDelTAPDevice(...)
    _ = destroyTapFunc(tap.Index)
    continue
}

Finding 3: recover() nil IP edge case

If listCubeTapsFunc returns a tap with a nil IP, s.allocator.Contains(nil) returns false (because ip.To4() on nil returns nil), so the code enters the out-of-CIDR destroy path and calls tap.IP.To4() — which would panic on nil. In practice, kernel taps always have valid IPs, so this isn't reachable. Adding a nil guard before the Contains check would make the code robust against unexpected data:

if tap.IP == nil || !s.allocator.Contains(tap.IP) {
    // ...
}

Finding 4: dequeueTapLocked()cubevsDelTAPDevice error silently ignored

At tap_lifecycle.go:71:

_ = cubevsDelTAPDevice(uint32(tap.Index), tap.IP.To4())

The error is discarded. If this fails, the cubeVS TAP entry remains, creating stale state. Unless there's a strong reason (like the call sometimes "legitimately" fails), a logged warning would help debugging. This pattern exists elsewhere in the codebase too.

Finding 5: recover() function complexity

The recover() function (local_service.go:451-586) handles three distinct concerns in a single function: (a) loading persisted states, (b) iterating kernel taps and matching/destroying/recovering them, and (c) cleaning up stale persisted states with no matching tap. The CIDR check adds another branch. Extracting helper functions (e.g., destroyOutOfRangeTap, recoverTapWithState, recoverTapNoState) would improve readability.

Test Coverage

The 648-line cidr_change_test.go covers all three CIDR validation paths plus edge cases:

Test What it validates
TestReconcileStateRejectsOutOfCIDRSandbox reconcileState rejects old-CIDR IPs
TestReconcileStateAcceptsInCIDRSandbox reconcileState accepts new-CIDR IPs
TestReconcileStateRejectsInvalidSandboxIP reconcileState rejects unparseable IPs
TestRecoverDestroysOutOfRangeTapsOnCIDRChange recover destroys old-CIDR taps (2 taps, both destroyed, none pooled)
TestRecoverSkipsPersistedStateWithOutOfCIDRIP recover cleans up persisted state for destroyed taps
TestRecoverSkipsSandboxWhenReconcileFailsForCIDRChange recover handles reconcile failure gracefully (IP released, state cleaned)
TestDequeueTapLockedDestroysOutOfRangeTapAfterCIDRChange dequeueTapLocked destroys out-of-range tap from pool
TestEnsureNetworkAllocatesFromNewCIDRAfterChange EnsureNetwork allocates from new CIDR

The Contains() method is tested in ipam_test.go:261-288 with 9 cases covering boundary values (network addr, gateway, mid-range, broadcast) and rejections (out-of-range, nil, IPv6).

Issues with the main.go / logging.go changes

The main.go and logging.go changes are tangentially related but bring real improvements:

  • Panic recovery handler to prevent silent crashes
  • initService failure writes to stderr + os.Exit(1) instead of CubeLog.Fatalf (which may not be initialized)
  • Safe localIP := "unknown" fallback in logging.go

These should be called out separately from the CIDR fix.

Summary

Overall: Correct with minor cleanup suggestions. The three-layer defense-in-depth correctly prevents old-CIDR IP allocation. Test coverage is comprehensive. The main improvement I'd recommend is addressing Finding 2 (port mapping cleanup for out-of-CIDR taps in recover()) and updating the PR description to accurately describe the main.go/logging.go changes.


@fslongjin
Copy link
Copy Markdown
Member

The premise of this PR is wrong. Changing the local tap CIDR inherently breaks networking for all running sandboxes — their tap interfaces already have IPs from the old segment. Destroying those taps just kills their connectivity. There is no way to make this safe at runtime.

This is a deployment-time config that should never change on a live node. If the CIDR must change, drain the node and restart clean. Adding 600+ lines of validation and test code to handle a scenario that should not happen operationally only adds complexity to already-complex paths like recover().

If anything is needed, it is a simple startup check: refuse to start if the CIDR in config does not match persisted state, and tell the operator to drain the node first. That is it.

What is the actual operational scenario that requires runtime CIDR changes?

@chenhengqi
Copy link
Copy Markdown
Collaborator

I agree with @fslongjin . Let's focus on #379 instead.

Copy link
Copy Markdown
Collaborator

@chenhengqi chenhengqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is not the right way.

@woshihoujinxin
Copy link
Copy Markdown
Author

The default CIDR conflicts with my K8s cluster network range, which is why I modify the CIDR after the initial installation. Changing the CIDR requires cleaning up the TAP network resources created during that first install. At this point, no sandboxes have been started yet. The scenario I'm describing is:
▎ discovering a CIDR conflict after the initial installation that prevents the K8s cluster from replying to packets, forcing a CIDR change.

@woshihoujinxin
Copy link
Copy Markdown
Author

Additionally, if the TAP devices were created during installation and no sandbox is currently using them, they should be removed when the service is stopped. The current situation is: after installation, the system doesn't work, and when I try to uninstall and reinstall, the system has left behind a bunch of
▎ orphaned TAP resources that prevent network-agent from starting after reinstallation.

@chenhengqi
Copy link
Copy Markdown
Collaborator

Additionally, if the TAP devices were created during installation and no sandbox is currently using them, they should be removed when the service is stopped. The current situation is: after installation, the system doesn't work, and when I try to uninstall and reinstall, the system has left behind a bunch of ▎ orphaned TAP resources that prevent network-agent from starting after reinstallation.

A reboot should work.

Let's stick to fixing the installation script. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants