feat: periodic fstrim + arcbox disk usage/compact CLI#274
feat: periodic fstrim + arcbox disk usage/compact CLI#274
Conversation
Add background fstrim_loop (hourly) so the guest reclaims sparse file space on the host. First tick fires immediately to catch historical waste. run_fstrim_now() helper is invoked by the DiskTrim RPC handler added in the following commit. (discard=async on the Btrfs mount was already landed on master.)
- Add DiskTrimRequest/Response to agent.proto and wire protocol. - Wire up DiskTrim RPC in guest agent (calls run_fstrim_now). - Add disk_trim() method to host-side AgentClient. - Add `arcbox disk usage` command: reports logical/physical/available sizes of the sparse docker.img file. - Add `arcbox disk compact` command: prints guidance and current stats (full RPC-based compact via gRPC relay deferred to follow-up).
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Greptile SummaryThis PR adds hourly background
Confidence Score: 4/5Safe to merge after fixing the blocking std::process::Command calls inside async contexts. One P1 finding: both fstrim_loop and run_fstrim_now invoke std::process::Command synchronously on tokio worker threads. On a large filesystem this can block for minutes, degrading or stalling the guest agent's RPC handling. All other additions are clean and follow existing patterns. guest/arcbox-agent/src/agent/mod.rs — fstrim_loop (line 459) and run_fstrim_now (line 482) Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI as arcbox CLI
participant Core as arcbox-core (AgentClient)
participant Vsock as vsock wire (MessageType)
participant Agent as arcbox-agent (guest)
participant OS as fstrim (OS)
note over Agent: startup
Agent->>Agent: tokio::spawn(fstrim_loop())
loop Every 3600s
Agent->>OS: std::process::Command fstrim /var/lib/docker
Agent->>OS: std::process::Command fstrim /var/lib/containerd
end
note over CLI: arcbox disk usage
CLI->>CLI: stat(docker.img) → logical / physical bytes
CLI->>CLI: print Logical / Physical / Available (sparse delta)
note over CLI: arcbox disk compact (stub)
CLI->>CLI: print guidance + current stats
note over CLI: future — arcbox disk compact (full)
CLI->>Core: disk_trim()
Core->>Vsock: DiskTrimRequest (0x000C)
Vsock->>Agent: parse_request → RpcRequest::DiskTrim
Agent->>OS: run_fstrim_now() — fstrim -v (blocking)
OS-->>Agent: stdout trim summary
Agent-->>Vsock: DiskTrimResponse (0x100C) {result}
Vsock-->>Core: decode DiskTrimResponse
Core-->>CLI: DiskTrimResponse
Reviews (1): Last reviewed commit: "feat(cli): add arcbox disk usage/compact..." | Re-trigger Greptile |
| match std::process::Command::new("fstrim").arg(mount).status() { | ||
| Ok(s) if s.success() => { | ||
| tracing::info!("fstrim {} completed", mount); | ||
| } | ||
| Ok(s) => { | ||
| tracing::debug!( | ||
| "fstrim {} exited with code {}", | ||
| mount, | ||
| s.code().unwrap_or(-1) | ||
| ); | ||
| } | ||
| Err(e) => { | ||
| tracing::debug!("fstrim {} failed: {}", mount, e); | ||
| } | ||
| } |
There was a problem hiding this comment.
Blocking syscall on tokio thread in
fstrim_loop
std::process::Command::status() is synchronous — it blocks the OS thread until fstrim exits. Because fstrim_loop is an async fn spawned with tokio::spawn, this call holds a tokio worker thread for the full duration of the trim. On a large Btrfs volume, fstrim routinely takes tens of seconds to minutes, during which any other async task scheduled on that thread cannot run.
The same issue applies to run_fstrim_now() (line 482), which is called synchronously from the handle_disk_trim() RPC handler, blocking the connection handler task for the entire duration of the trim.
Use tokio::process::Command or wrap both calls in tokio::task::spawn_blocking.
| let available_gib = | ||
| logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0); | ||
| let usage_pct = if logical_bytes > 0 { | ||
| (physical_bytes as f64 / logical_bytes as f64) * 100.0 | ||
| } else { | ||
| 0.0 | ||
| }; | ||
|
|
||
| println!("Docker data disk:"); | ||
| println!(" Path: {}", img_path.display()); | ||
| println!(" Logical: {logical_gib:.1} GiB"); | ||
| println!(" Physical: {physical_gib:.1} GiB ({usage_pct:.1}%)"); | ||
| println!(" Available: {available_gib:.1} GiB"); |
There was a problem hiding this comment.
"Available" label is misleading
available_gib = logical - physical is the sparse-allocation delta — the blocks the host hasn't yet written to (or has reclaimed via fstrim). "Available" normally means free capacity remaining on a filesystem. Consider a label like Reclaimable or Sparse savings instead.
| let available_gib = | |
| logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0); | |
| let usage_pct = if logical_bytes > 0 { | |
| (physical_bytes as f64 / logical_bytes as f64) * 100.0 | |
| } else { | |
| 0.0 | |
| }; | |
| println!("Docker data disk:"); | |
| println!(" Path: {}", img_path.display()); | |
| println!(" Logical: {logical_gib:.1} GiB"); | |
| println!(" Physical: {physical_gib:.1} GiB ({usage_pct:.1}%)"); | |
| println!(" Available: {available_gib:.1} GiB"); | |
| println!(" Reclaimable: {available_gib:.1} GiB"); |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Pull request overview
Adds guest-side disk space reclamation and exposes trim/usage surfaces so users can understand and reclaim sparse docker.img growth over time.
Changes:
- Guest agent: periodic hourly
fstrimloop plus an on-demandDiskTrimrequest handler. - Protocol/wire: adds
DiskTrimRequest/DiskTrimResponseprotobuf messages and assigns new wireMessageTypeIDs. - Host/CLI: adds
AgentClient::disk_trim()and introducesabctl disk {usage,compact}commands for disk stats/guidance.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| rpc/arcbox-protocol/src/lib.rs | Re-exports new DiskTrim types for backward compatibility. |
| rpc/arcbox-protocol/src/generated/arcbox.v1.rs | Generated Rust types for DiskTrim request/response messages. |
| rpc/arcbox-protocol/proto/agent.proto | Adds DiskTrim RPC + message definitions to the agent proto. |
| guest/arcbox-agent/src/rpc.rs | Extends guest RPC envelope parsing/encoding with DiskTrim request/response. |
| guest/arcbox-agent/src/agent/mod.rs | Implements periodic fstrim task and DiskTrim handler. |
| common/arcbox-constants/src/wire.rs | Reserves wire IDs and updates mapping/tests for DiskTrim message types. |
| app/arcbox-core/src/agent_client.rs | Adds host-side disk_trim() helper on the vsock RPC client. |
| app/arcbox-cli/src/main.rs | Routes new disk top-level command to command handler. |
| app/arcbox-cli/src/commands/mod.rs | Declares disk command module + adds Commands::Disk. |
| app/arcbox-cli/src/commands/disk.rs | New CLI implementation for disk usage and disk compact. |
| let logical_gib = logical_bytes as f64 / (1024.0 * 1024.0 * 1024.0); | ||
| let physical_gib = physical_bytes as f64 / (1024.0 * 1024.0 * 1024.0); | ||
| let available_gib = | ||
| logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0); | ||
| let usage_pct = if logical_bytes > 0 { | ||
| (physical_bytes as f64 / logical_bytes as f64) * 100.0 | ||
| } else { |
There was a problem hiding this comment.
The new disk usage calculation (logical/physical/available/percent) is currently embedded in the command handler and isn’t unit-tested. Since other CLI command modules include unit tests, consider extracting the size math into a small pure helper and adding tests for edge cases (e.g., logical=0, physical>logical due to rounding, large values).
| loop { | ||
| interval.tick().await; | ||
| for mount in [DOCKER_DATA_MOUNT_POINT, CONTAINERD_DATA_MOUNT_POINT] { | ||
| match std::process::Command::new("fstrim").arg(mount).status() { |
There was a problem hiding this comment.
fstrim_loop runs std::process::Command from inside an async task. fstrim can take seconds/minutes on large filesystems, and this will block a Tokio worker thread. Prefer tokio::process::Command (with .status().await) or wrap the call in tokio::task::spawn_blocking to avoid stalling other RPC work on the runtime.
| match std::process::Command::new("fstrim").arg(mount).status() { | |
| match tokio::process::Command::new("fstrim") | |
| .arg(mount) | |
| .status() | |
| .await | |
| { |
| /// Periodically runs `fstrim` on data mount points to reclaim sparse file | ||
| /// space on the host. First tick fires immediately to trim historical waste. | ||
| async fn fstrim_loop() { | ||
| let mut interval = tokio::time::interval(Duration::from_secs(3600)); |
There was a problem hiding this comment.
tokio::time::interval defaults to MissedTickBehavior::Burst, so if the VM is paused/suspended for a while the loop may immediately run fstrim many times back-to-back on resume. Consider setting interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip) (or Delay) so only one trim runs after long delays.
| let mut interval = tokio::time::interval(Duration::from_secs(3600)); | |
| let mut interval = tokio::time::interval(Duration::from_secs(3600)); | |
| interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip); |
| fn handle_disk_trim() -> RpcResponse { | ||
| let result = run_fstrim_now(); |
There was a problem hiding this comment.
handle_disk_trim() calls run_fstrim_now() synchronously, which runs an external process and captures its output. This blocks the per-connection async task and can monopolize a Tokio worker thread during the trim. Consider making this handler async and using tokio::process::Command/spawn_blocking, returning the response once the trim completes (or returning an error response if the command cannot be spawned).
| fn handle_disk_trim() -> RpcResponse { | |
| let result = run_fstrim_now(); | |
| async fn handle_disk_trim() -> RpcResponse { | |
| let result = match tokio::task::spawn_blocking(run_fstrim_now).await { | |
| Ok(result) => result, | |
| Err(err) => format!("failed to run fstrim task: {err}"), | |
| }; |
| let available_gib = | ||
| logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0); | ||
| let usage_pct = if logical_bytes > 0 { | ||
| (physical_bytes as f64 / logical_bytes as f64) * 100.0 |
There was a problem hiding this comment.
usage_pct can exceed 100% because st_blocks is block-rounded (e.g., very small files or filesystems with large allocation units). Consider clamping the percentage (e.g., min(physical_bytes, logical_bytes) for the numerator) or adjusting the displayed label so the output can’t imply >100% physical usage.
| (physical_bytes as f64 / logical_bytes as f64) * 100.0 | |
| let usage_bytes = physical_bytes.min(logical_bytes); | |
| (usage_bytes as f64 / logical_bytes as f64) * 100.0 |
Summary
Two-commit series that closes the "container data keeps eating disk, no way to reclaim it" gap. Revives work from the stale
feat/disk-and-hostfsbranch (2026-03-10, 465 commits behind) and ports it to current master. The/hostVirtioFS changes and Docker bind-mount rewrite from that same branch are not in this PR — they are architectural and deserve their own discussion.Commit 1 —
feat(agent): enable periodic fstrim for space reclamationGuest-side background task (
fstrim_loop) that runsfstrimhourly on/var/lib/dockerand/var/lib/containerd, with the first tick firing immediately so historical waste from existing machines is picked up on next boot.run_fstrim_now()helper is what theDiskTrimRPC in commit 2 calls.discard=asyncon the Btrfs mount was already on master — that half of the original commit is a no-op and the conflict was resolved by keeping master's (more recent) mount options (zstd:1 + noatime + space_cache=v2).Commit 2 —
feat(cli): add arcbox disk usage/compact and DiskTrim RPCWire + surface for on-demand trim:
DiskTrimRequest/DiskTrimResponseinagent.proto+ generated code,MessageType::DiskTrim{Request,Response}=0x000C/0x100Cinwire.rs.handle_disk_trim()dispatches torun_fstrim_now()and returns per-mount trim summary.AgentClient::disk_trim()method.arcbox disk usage(logical/physical/available sizes of sparsedocker.img),arcbox disk compact(prints stats + guidance; full RPC-driven compact via gRPC relay is intentionally left as a follow-up).Rebase notes
Cherry-picked the two commits onto master. Path migrations handled:
guest/arcbox-agent/src/agent.rs→agent/mod.rs(file split)comm/arcbox-protocol/→rpc/arcbox-protocol/(directory rename)Merge decisions (for reviewer sanity):
zstd:3form.tokio::spawn(fstrim_loop())alongside the existing Kubernetes API proxy spawn, not replacing it.DiskTrimarm appended toRpcRequest/RpcResponse/handle_request/message_type/encode_payload/ dispatcher match — never replacing Kubernetes/Shutdown/MmapReadFile arms.handle_disk_trim()next to the other per-request handlers.// =====section divider the original branch introduced (current convention: section dividers trigger a file split).DiskTrimMessageType IDs (0x000C/0x100C) were chosen to followMmapReadFile(0x000B/0x100B) — the branch didn't touchwire.rs, so this PR picks the IDs.Test plan
cargo check -p arcbox-core -p arcbox-clicleancargo check -p arcbox-agent --target aarch64-unknown-linux-muslclean (guest cross-compile)cargo clippy --all-targets -- -D warningsclean for CLI + corecargo fmt --checkcleanarcbox disk usageon a warm machine and verify logical vs physical divergearcbox disk compactend-to-end once the RPC surface is wired on the daemon side (follow-up task)fstrim_looprun for an hour, confirmdocker.imgphysical size shrinks after deleting a large containerNot in this PR (intentional)
From the same ancestor branch, these were dropped because they're architectural and need separate review:
/hostVirtioFS mount