Skip to content

fix(dhcp): add expiry for declined IPs to prevent pool exhaustion#152

Open
AprilNEA wants to merge 2 commits intomasterfrom
fix/dhcp-decline-ip-expiry
Open

fix(dhcp): add expiry for declined IPs to prevent pool exhaustion#152
AprilNEA wants to merge 2 commits intomasterfrom
fix/dhcp-decline-ip-expiry

Conversation

@AprilNEA
Copy link
Copy Markdown
Member

Summary

  • Add 5-minute quarantine for DHCP DECLINE'd IPs instead of permanent allocation
  • Track declined IPs with timestamps in declined_ips: HashMap<Ipv4Addr, Instant>
  • Release quarantined IPs back to pool during cleanup_expired_leases()

Test plan

  • Unit test: test_declined_ip_quarantine_and_expiry

Closes ABX-235

Copilot AI review requested due to automatic review settings March 31, 2026 12:38
@linear
Copy link
Copy Markdown

linear Bot commented Mar 31, 2026

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21b37fea8d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +383 to +385
for ip in expired_declines {
self.declined_ips.remove(&ip);
self.allocator.release(ip);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve reservations when decline quarantine expires

When quarantine entries expire, cleanup_expired_leases unconditionally calls self.allocator.release(ip). This breaks reserved addresses: if a reserved IP is ever declined, it is added to declined_ips and then released after 5 minutes even though self.reservations still maps that IP to a MAC. At that point the allocator can reassign the reserved IP to other clients, causing reservation violations and potential duplicate-IP conflicts when the reserved client reconnects. The release path should skip IPs that are still reserved.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a time-bounded quarantine for DHCP DECLINE’d IP addresses to avoid permanent pool exhaustion, and tightens machine creation concurrency to prevent duplicate machine names under concurrent create requests.

Changes:

  • Track declined IPs with a timestamp and release them back to the allocator after a 5-minute quarantine.
  • Update DHCPDECLINE handling to remove the client’s pending lease and mark the declined IP as allocated during quarantine.
  • Hold a write lock across MachineManager::create to eliminate TOCTOU races on machine name uniqueness.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
virt/arcbox-dhcp/src/server.rs Adds declined-IP quarantine tracking, expiry cleanup logic, and a unit test for quarantine expiry behavior.
app/arcbox-core/src/machine.rs Changes locking strategy in MachineManager::create to prevent concurrent duplicate creation of the same machine name.

Comment on lines 333 to +350
if let Some(ip) = packet.requested_ip {
// Remove the lease so the client can start fresh with DISCOVER.
if let Some(lease) = self.leases.remove(&mac) {
// If the declined IP differs from the lease IP (unusual, but
// possible), release the lease IP since the client won't use it.
if lease.ip != ip && !self.reservations.contains_key(&mac) {
self.allocator.release(lease.ip);
}
}

// Ensure the declined IP is marked as allocated so it cannot be
// handed out during the quarantine window. If it was already
// allocated (the common case), this is a no-op.
self.allocator.allocate_specific(ip);

tracing::warn!("DHCPDECLINE: {} reported as in use", ip);
// Record the quarantine start time.
self.declined_ips.insert(ip, Instant::now());

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_decline() records any requested_ip in declined_ips even if it is outside the configured pool range. Since allocate_specific() returns false for out-of-range addresses, these entries aren't actually quarantined in the allocator and can accumulate until cleanup runs. Consider validating the IP is within the pool range (or only inserting into declined_ips when the IP is in-range) before quarantining.

Copilot uses AI. Check for mistakes.
Comment on lines +488 to +491
server.declined_ips.insert(
ip,
Instant::now() - DECLINE_QUARANTINE - Duration::from_secs(1),
);
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test backdates the timestamp via Instant::now() - DECLINE_QUARANTINE - .... Subtracting a duration from Instant can panic if the system uptime is less than that duration (e.g., freshly started CI runner). Use checked_sub(...) (with a fallback) to avoid panics and test flakiness.

Suggested change
server.declined_ips.insert(
ip,
Instant::now() - DECLINE_QUARANTINE - Duration::from_secs(1),
);
let quarantine_expired_at = Instant::now()
.checked_sub(DECLINE_QUARANTINE + Duration::from_secs(1))
.unwrap_or_else(Instant::now);
server.declined_ips.insert(ip, quarantine_expired_at);

Copilot uses AI. Check for mistakes.
Comment on lines +317 to 324
// Hold the write lock for the entire create operation to prevent TOCTOU
// races: without this, two concurrent creates with the same name could
// both pass the existence check before either inserts.
let mut machines = self.machines.write().map_err(|_| CoreError::LockPoisoned)?;

if machines.contains_key(&config.name) {
return Err(CoreError::already_exists(config.name));
}
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create() now holds a std::sync::RwLock write guard across filesystem I/O (create_dir_all), VM creation, and persistence writes. This blocks all other readers/writers of machines for the entire duration and can degrade concurrency (and potentially block a Tokio worker thread if this runs on the async runtime). Consider narrowing the lock scope (e.g., insert a temporary "creating" entry under the lock, drop the guard while doing I/O, then finalize/update under the lock with rollback on error).

Copilot uses AI. Check for mistakes.
Comment on lines +383 to +384
for ip in expired_declines {
self.declined_ips.remove(&ip);
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup_expired_leases() releases quarantined declined IPs unconditionally. If a client sends a DECLINE for an IP that is still in use (e.g., belongs to an active lease for a different MAC, or is reserved in reservations), this will remove it from the allocator and allow it to be reallocated, creating duplicate assignments and breaking reservations. Consider only releasing a quarantined IP if it is not reserved for any MAC and is not currently present in any active lease (and ideally only quarantine/release IPs that were actually offered to the declining MAC).

Suggested change
for ip in expired_declines {
self.declined_ips.remove(&ip);
for ip in expired_declines {
// Only release the IP back to the allocator if it is not currently
// in use by any active lease and not reserved for any MAC.
let ip_in_use = self.leases.values().any(|lease| lease.ip == ip);
let ip_reserved = self
.reservations
.values()
.any(|reserved_ip| *reserved_ip == ip);
self.declined_ips.remove(&ip);
if ip_in_use || ip_reserved {
tracing::debug!(
"Skipped releasing quarantined declined IP {} (in_use={}, reserved={})",
ip,
ip_in_use,
ip_reserved
);
continue;
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants