Skip to content

feat: periodic fstrim + arcbox disk usage/compact CLI#274

Open
AprilNEA wants to merge 2 commits intomasterfrom
feat/disk-management
Open

feat: periodic fstrim + arcbox disk usage/compact CLI#274
AprilNEA wants to merge 2 commits intomasterfrom
feat/disk-management

Conversation

@AprilNEA
Copy link
Copy Markdown
Member

Summary

Two-commit series that closes the "container data keeps eating disk, no way to reclaim it" gap. Revives work from the stale feat/disk-and-hostfs branch (2026-03-10, 465 commits behind) and ports it to current master. The /host VirtioFS changes and Docker bind-mount rewrite from that same branch are not in this PR — they are architectural and deserve their own discussion.

Commit 1 — feat(agent): enable periodic fstrim for space reclamation

Guest-side background task (fstrim_loop) that runs fstrim hourly on /var/lib/docker and /var/lib/containerd, with the first tick firing immediately so historical waste from existing machines is picked up on next boot. run_fstrim_now() helper is what the DiskTrim RPC in commit 2 calls.

discard=async on the Btrfs mount was already on master — that half of the original commit is a no-op and the conflict was resolved by keeping master's (more recent) mount options (zstd:1 + noatime + space_cache=v2).

Commit 2 — feat(cli): add arcbox disk usage/compact and DiskTrim RPC

Wire + surface for on-demand trim:

  • DiskTrimRequest/DiskTrimResponse in agent.proto + generated code, MessageType::DiskTrim{Request,Response} = 0x000C / 0x100C in wire.rs.
  • Guest: handle_disk_trim() dispatches to run_fstrim_now() and returns per-mount trim summary.
  • Host: AgentClient::disk_trim() method.
  • CLI: arcbox disk usage (logical/physical/available sizes of sparse docker.img), arcbox disk compact (prints stats + guidance; full RPC-driven compact via gRPC relay is intentionally left as a follow-up).

Rebase notes

Cherry-picked the two commits onto master. Path migrations handled:

  • guest/arcbox-agent/src/agent.rsagent/mod.rs (file split)
  • comm/arcbox-protocol/rpc/arcbox-protocol/ (directory rename)

Merge decisions (for reviewer sanity):

  • Kept master's btrfs mount options, not the branch's older zstd:3 form.
  • Added tokio::spawn(fstrim_loop()) alongside the existing Kubernetes API proxy spawn, not replacing it.
  • DiskTrim arm appended to RpcRequest / RpcResponse / handle_request / message_type / encode_payload / dispatcher match — never replacing Kubernetes/Shutdown/MmapReadFile arms.
  • Added handle_disk_trim() next to the other per-request handlers.
  • Dropped the // ===== section divider the original branch introduced (current convention: section dividers trigger a file split).
  • DiskTrim MessageType IDs (0x000C / 0x100C) were chosen to follow MmapReadFile (0x000B / 0x100B) — the branch didn't touch wire.rs, so this PR picks the IDs.

Test plan

  • cargo check -p arcbox-core -p arcbox-cli clean
  • cargo check -p arcbox-agent --target aarch64-unknown-linux-musl clean (guest cross-compile)
  • cargo clippy --all-targets -- -D warnings clean for CLI + core
  • cargo fmt --check clean
  • Manually invoke arcbox disk usage on a warm machine and verify logical vs physical diverge
  • Invoke arcbox disk compact end-to-end once the RPC surface is wired on the daemon side (follow-up task)
  • Let fstrim_loop run for an hour, confirm docker.img physical size shrinks after deleting a large container

Not in this PR (intentional)

From the same ancestor branch, these were dropped because they're architectural and need separate review:

  • Switching the host share model to a single /host VirtioFS mount
  • Rewriting Docker bind-mount paths through the hostfs share

Add background fstrim_loop (hourly) so the guest reclaims sparse file
space on the host. First tick fires immediately to catch historical
waste. run_fstrim_now() helper is invoked by the DiskTrim RPC handler
added in the following commit.

(discard=async on the Btrfs mount was already landed on master.)
- Add DiskTrimRequest/Response to agent.proto and wire protocol.
- Wire up DiskTrim RPC in guest agent (calls run_fstrim_now).
- Add disk_trim() method to host-side AgentClient.
- Add `arcbox disk usage` command: reports logical/physical/available
  sizes of the sparse docker.img file.
- Add `arcbox disk compact` command: prints guidance and current stats
  (full RPC-based compact via gRPC relay deferred to follow-up).
Copilot AI review requested due to automatic review settings April 24, 2026 15:13
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR adds hourly background fstrim in the guest agent plus arcbox disk usage / arcbox disk compact CLI commands and the DiskTrim RPC to request an on-demand trim from the host. The wire protocol, proto definitions, generated types, and RPC plumbing are all consistent with existing patterns.

  • P1 — blocking process calls on tokio worker threads: Both fstrim_loop (hourly background task) and run_fstrim_now (called synchronously from the handle_disk_trim RPC handler) use std::process::Command, which is blocking. fstrim on a large volume can take tens of seconds to minutes, tying up tokio worker threads and potentially stalling the agent's connection handling. These should use tokio::process::Command or be wrapped in tokio::task::spawn_blocking.

Confidence Score: 4/5

Safe to merge after fixing the blocking std::process::Command calls inside async contexts.

One P1 finding: both fstrim_loop and run_fstrim_now invoke std::process::Command synchronously on tokio worker threads. On a large filesystem this can block for minutes, degrading or stalling the guest agent's RPC handling. All other additions are clean and follow existing patterns.

guest/arcbox-agent/src/agent/mod.rs — fstrim_loop (line 459) and run_fstrim_now (line 482)

Important Files Changed

Filename Overview
guest/arcbox-agent/src/agent/mod.rs Adds fstrim_loop background task and run_fstrim_now helper, but both use blocking std::process::Command inside async contexts, which can stall tokio worker threads for the duration of fstrim.
app/arcbox-cli/src/commands/disk.rs New arcbox disk usage/compact CLI; execute_compact is intentionally a stub (documented follow-up). "Available" label in usage output is semantically misleading (sparse delta vs free space).
common/arcbox-constants/src/wire.rs Adds DiskTrimRequest = 0x000C and DiskTrimResponse = 0x100C following the established sequential ID pattern; round-trip test cases included.
rpc/arcbox-protocol/proto/agent.proto Adds DiskTrim RPC + empty request / string-result response messages; clean proto additions with no conflicts.
app/arcbox-core/src/agent_client.rs New disk_trim() method follows the identical pattern used by all other RPC calls; response-type guard and decode error-path are correct.
guest/arcbox-agent/src/rpc.rs Adds DiskTrim variants to RpcRequest/RpcResponse enums and the parse_request dispatcher; straightforward extension of the existing pattern.
rpc/arcbox-protocol/src/generated/arcbox.v1.rs Generated protobuf types for DiskTrimRequest and DiskTrimResponse; matches the proto definition correctly.

Sequence Diagram

sequenceDiagram
    participant CLI as arcbox CLI
    participant Core as arcbox-core (AgentClient)
    participant Vsock as vsock wire (MessageType)
    participant Agent as arcbox-agent (guest)
    participant OS as fstrim (OS)

    note over Agent: startup
    Agent->>Agent: tokio::spawn(fstrim_loop())
    loop Every 3600s
        Agent->>OS: std::process::Command fstrim /var/lib/docker
        Agent->>OS: std::process::Command fstrim /var/lib/containerd
    end

    note over CLI: arcbox disk usage
    CLI->>CLI: stat(docker.img) → logical / physical bytes
    CLI->>CLI: print Logical / Physical / Available (sparse delta)

    note over CLI: arcbox disk compact (stub)
    CLI->>CLI: print guidance + current stats

    note over CLI: future — arcbox disk compact (full)
    CLI->>Core: disk_trim()
    Core->>Vsock: DiskTrimRequest (0x000C)
    Vsock->>Agent: parse_request → RpcRequest::DiskTrim
    Agent->>OS: run_fstrim_now() — fstrim -v (blocking)
    OS-->>Agent: stdout trim summary
    Agent-->>Vsock: DiskTrimResponse (0x100C) {result}
    Vsock-->>Core: decode DiskTrimResponse
    Core-->>CLI: DiskTrimResponse
Loading

Reviews (1): Last reviewed commit: "feat(cli): add arcbox disk usage/compact..." | Re-trigger Greptile

Comment on lines +459 to +473
match std::process::Command::new("fstrim").arg(mount).status() {
Ok(s) if s.success() => {
tracing::info!("fstrim {} completed", mount);
}
Ok(s) => {
tracing::debug!(
"fstrim {} exited with code {}",
mount,
s.code().unwrap_or(-1)
);
}
Err(e) => {
tracing::debug!("fstrim {} failed: {}", mount, e);
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Blocking syscall on tokio thread in fstrim_loop

std::process::Command::status() is synchronous — it blocks the OS thread until fstrim exits. Because fstrim_loop is an async fn spawned with tokio::spawn, this call holds a tokio worker thread for the full duration of the trim. On a large Btrfs volume, fstrim routinely takes tens of seconds to minutes, during which any other async task scheduled on that thread cannot run.

The same issue applies to run_fstrim_now() (line 482), which is called synchronously from the handle_disk_trim() RPC handler, blocking the connection handler task for the entire duration of the trim.

Use tokio::process::Command or wrap both calls in tokio::task::spawn_blocking.

Comment on lines +49 to +61
let available_gib =
logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0);
let usage_pct = if logical_bytes > 0 {
(physical_bytes as f64 / logical_bytes as f64) * 100.0
} else {
0.0
};

println!("Docker data disk:");
println!(" Path: {}", img_path.display());
println!(" Logical: {logical_gib:.1} GiB");
println!(" Physical: {physical_gib:.1} GiB ({usage_pct:.1}%)");
println!(" Available: {available_gib:.1} GiB");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 "Available" label is misleading

available_gib = logical - physical is the sparse-allocation delta — the blocks the host hasn't yet written to (or has reclaimed via fstrim). "Available" normally means free capacity remaining on a filesystem. Consider a label like Reclaimable or Sparse savings instead.

Suggested change
let available_gib =
logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0);
let usage_pct = if logical_bytes > 0 {
(physical_bytes as f64 / logical_bytes as f64) * 100.0
} else {
0.0
};
println!("Docker data disk:");
println!(" Path: {}", img_path.display());
println!(" Logical: {logical_gib:.1} GiB");
println!(" Physical: {physical_gib:.1} GiB ({usage_pct:.1}%)");
println!(" Available: {available_gib:.1} GiB");
println!(" Reclaimable: {available_gib:.1} GiB");

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds guest-side disk space reclamation and exposes trim/usage surfaces so users can understand and reclaim sparse docker.img growth over time.

Changes:

  • Guest agent: periodic hourly fstrim loop plus an on-demand DiskTrim request handler.
  • Protocol/wire: adds DiskTrimRequest/DiskTrimResponse protobuf messages and assigns new wire MessageType IDs.
  • Host/CLI: adds AgentClient::disk_trim() and introduces abctl disk {usage,compact} commands for disk stats/guidance.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
rpc/arcbox-protocol/src/lib.rs Re-exports new DiskTrim types for backward compatibility.
rpc/arcbox-protocol/src/generated/arcbox.v1.rs Generated Rust types for DiskTrim request/response messages.
rpc/arcbox-protocol/proto/agent.proto Adds DiskTrim RPC + message definitions to the agent proto.
guest/arcbox-agent/src/rpc.rs Extends guest RPC envelope parsing/encoding with DiskTrim request/response.
guest/arcbox-agent/src/agent/mod.rs Implements periodic fstrim task and DiskTrim handler.
common/arcbox-constants/src/wire.rs Reserves wire IDs and updates mapping/tests for DiskTrim message types.
app/arcbox-core/src/agent_client.rs Adds host-side disk_trim() helper on the vsock RPC client.
app/arcbox-cli/src/main.rs Routes new disk top-level command to command handler.
app/arcbox-cli/src/commands/mod.rs Declares disk command module + adds Commands::Disk.
app/arcbox-cli/src/commands/disk.rs New CLI implementation for disk usage and disk compact.

Comment on lines +47 to +53
let logical_gib = logical_bytes as f64 / (1024.0 * 1024.0 * 1024.0);
let physical_gib = physical_bytes as f64 / (1024.0 * 1024.0 * 1024.0);
let available_gib =
logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0);
let usage_pct = if logical_bytes > 0 {
(physical_bytes as f64 / logical_bytes as f64) * 100.0
} else {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new disk usage calculation (logical/physical/available/percent) is currently embedded in the command handler and isn’t unit-tested. Since other CLI command modules include unit tests, consider extracting the size math into a small pure helper and adding tests for edge cases (e.g., logical=0, physical>logical due to rounding, large values).

Copilot uses AI. Check for mistakes.
loop {
interval.tick().await;
for mount in [DOCKER_DATA_MOUNT_POINT, CONTAINERD_DATA_MOUNT_POINT] {
match std::process::Command::new("fstrim").arg(mount).status() {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fstrim_loop runs std::process::Command from inside an async task. fstrim can take seconds/minutes on large filesystems, and this will block a Tokio worker thread. Prefer tokio::process::Command (with .status().await) or wrap the call in tokio::task::spawn_blocking to avoid stalling other RPC work on the runtime.

Suggested change
match std::process::Command::new("fstrim").arg(mount).status() {
match tokio::process::Command::new("fstrim")
.arg(mount)
.status()
.await
{

Copilot uses AI. Check for mistakes.
/// Periodically runs `fstrim` on data mount points to reclaim sparse file
/// space on the host. First tick fires immediately to trim historical waste.
async fn fstrim_loop() {
let mut interval = tokio::time::interval(Duration::from_secs(3600));
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokio::time::interval defaults to MissedTickBehavior::Burst, so if the VM is paused/suspended for a while the loop may immediately run fstrim many times back-to-back on resume. Consider setting interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip) (or Delay) so only one trim runs after long delays.

Suggested change
let mut interval = tokio::time::interval(Duration::from_secs(3600));
let mut interval = tokio::time::interval(Duration::from_secs(3600));
interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

Copilot uses AI. Check for mistakes.
Comment on lines +1308 to +1309
fn handle_disk_trim() -> RpcResponse {
let result = run_fstrim_now();
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_disk_trim() calls run_fstrim_now() synchronously, which runs an external process and captures its output. This blocks the per-connection async task and can monopolize a Tokio worker thread during the trim. Consider making this handler async and using tokio::process::Command/spawn_blocking, returning the response once the trim completes (or returning an error response if the command cannot be spawned).

Suggested change
fn handle_disk_trim() -> RpcResponse {
let result = run_fstrim_now();
async fn handle_disk_trim() -> RpcResponse {
let result = match tokio::task::spawn_blocking(run_fstrim_now).await {
Ok(result) => result,
Err(err) => format!("failed to run fstrim task: {err}"),
};

Copilot uses AI. Check for mistakes.
let available_gib =
logical_bytes.saturating_sub(physical_bytes) as f64 / (1024.0 * 1024.0 * 1024.0);
let usage_pct = if logical_bytes > 0 {
(physical_bytes as f64 / logical_bytes as f64) * 100.0
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usage_pct can exceed 100% because st_blocks is block-rounded (e.g., very small files or filesystems with large allocation units). Consider clamping the percentage (e.g., min(physical_bytes, logical_bytes) for the numerator) or adjusting the displayed label so the output can’t imply >100% physical usage.

Suggested change
(physical_bytes as f64 / logical_bytes as f64) * 100.0
let usage_bytes = physical_bytes.min(logical_bytes);
(usage_bytes as f64 / logical_bytes as f64) * 100.0

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants