fix(dhcp): add expiry for declined IPs to prevent pool exhaustion#152
fix(dhcp): add expiry for declined IPs to prevent pool exhaustion#152
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 21b37fea8d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for ip in expired_declines { | ||
| self.declined_ips.remove(&ip); | ||
| self.allocator.release(ip); |
There was a problem hiding this comment.
Preserve reservations when decline quarantine expires
When quarantine entries expire, cleanup_expired_leases unconditionally calls self.allocator.release(ip). This breaks reserved addresses: if a reserved IP is ever declined, it is added to declined_ips and then released after 5 minutes even though self.reservations still maps that IP to a MAC. At that point the allocator can reassign the reserved IP to other clients, causing reservation violations and potential duplicate-IP conflicts when the reserved client reconnects. The release path should skip IPs that are still reserved.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Adds a time-bounded quarantine for DHCP DECLINE’d IP addresses to avoid permanent pool exhaustion, and tightens machine creation concurrency to prevent duplicate machine names under concurrent create requests.
Changes:
- Track declined IPs with a timestamp and release them back to the allocator after a 5-minute quarantine.
- Update DHCPDECLINE handling to remove the client’s pending lease and mark the declined IP as allocated during quarantine.
- Hold a write lock across
MachineManager::createto eliminate TOCTOU races on machine name uniqueness.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
virt/arcbox-dhcp/src/server.rs |
Adds declined-IP quarantine tracking, expiry cleanup logic, and a unit test for quarantine expiry behavior. |
app/arcbox-core/src/machine.rs |
Changes locking strategy in MachineManager::create to prevent concurrent duplicate creation of the same machine name. |
| if let Some(ip) = packet.requested_ip { | ||
| // Remove the lease so the client can start fresh with DISCOVER. | ||
| if let Some(lease) = self.leases.remove(&mac) { | ||
| // If the declined IP differs from the lease IP (unusual, but | ||
| // possible), release the lease IP since the client won't use it. | ||
| if lease.ip != ip && !self.reservations.contains_key(&mac) { | ||
| self.allocator.release(lease.ip); | ||
| } | ||
| } | ||
|
|
||
| // Ensure the declined IP is marked as allocated so it cannot be | ||
| // handed out during the quarantine window. If it was already | ||
| // allocated (the common case), this is a no-op. | ||
| self.allocator.allocate_specific(ip); | ||
|
|
||
| tracing::warn!("DHCPDECLINE: {} reported as in use", ip); | ||
| // Record the quarantine start time. | ||
| self.declined_ips.insert(ip, Instant::now()); | ||
|
|
There was a problem hiding this comment.
handle_decline() records any requested_ip in declined_ips even if it is outside the configured pool range. Since allocate_specific() returns false for out-of-range addresses, these entries aren't actually quarantined in the allocator and can accumulate until cleanup runs. Consider validating the IP is within the pool range (or only inserting into declined_ips when the IP is in-range) before quarantining.
| server.declined_ips.insert( | ||
| ip, | ||
| Instant::now() - DECLINE_QUARANTINE - Duration::from_secs(1), | ||
| ); |
There was a problem hiding this comment.
This test backdates the timestamp via Instant::now() - DECLINE_QUARANTINE - .... Subtracting a duration from Instant can panic if the system uptime is less than that duration (e.g., freshly started CI runner). Use checked_sub(...) (with a fallback) to avoid panics and test flakiness.
| server.declined_ips.insert( | |
| ip, | |
| Instant::now() - DECLINE_QUARANTINE - Duration::from_secs(1), | |
| ); | |
| let quarantine_expired_at = Instant::now() | |
| .checked_sub(DECLINE_QUARANTINE + Duration::from_secs(1)) | |
| .unwrap_or_else(Instant::now); | |
| server.declined_ips.insert(ip, quarantine_expired_at); |
| // Hold the write lock for the entire create operation to prevent TOCTOU | ||
| // races: without this, two concurrent creates with the same name could | ||
| // both pass the existence check before either inserts. | ||
| let mut machines = self.machines.write().map_err(|_| CoreError::LockPoisoned)?; | ||
|
|
||
| if machines.contains_key(&config.name) { | ||
| return Err(CoreError::already_exists(config.name)); | ||
| } |
There was a problem hiding this comment.
create() now holds a std::sync::RwLock write guard across filesystem I/O (create_dir_all), VM creation, and persistence writes. This blocks all other readers/writers of machines for the entire duration and can degrade concurrency (and potentially block a Tokio worker thread if this runs on the async runtime). Consider narrowing the lock scope (e.g., insert a temporary "creating" entry under the lock, drop the guard while doing I/O, then finalize/update under the lock with rollback on error).
| for ip in expired_declines { | ||
| self.declined_ips.remove(&ip); |
There was a problem hiding this comment.
cleanup_expired_leases() releases quarantined declined IPs unconditionally. If a client sends a DECLINE for an IP that is still in use (e.g., belongs to an active lease for a different MAC, or is reserved in reservations), this will remove it from the allocator and allow it to be reallocated, creating duplicate assignments and breaking reservations. Consider only releasing a quarantined IP if it is not reserved for any MAC and is not currently present in any active lease (and ideally only quarantine/release IPs that were actually offered to the declining MAC).
| for ip in expired_declines { | |
| self.declined_ips.remove(&ip); | |
| for ip in expired_declines { | |
| // Only release the IP back to the allocator if it is not currently | |
| // in use by any active lease and not reserved for any MAC. | |
| let ip_in_use = self.leases.values().any(|lease| lease.ip == ip); | |
| let ip_reserved = self | |
| .reservations | |
| .values() | |
| .any(|reserved_ip| *reserved_ip == ip); | |
| self.declined_ips.remove(&ip); | |
| if ip_in_use || ip_reserved { | |
| tracing::debug!( | |
| "Skipped releasing quarantined declined IP {} (in_use={}, reserved={})", | |
| ip, | |
| ip_in_use, | |
| ip_reserved | |
| ); | |
| continue; | |
| } |
Summary
declined_ips: HashMap<Ipv4Addr, Instant>cleanup_expired_leases()Test plan
test_declined_ip_quarantine_and_expiryCloses ABX-235