Skip to content

feat(orion): add Dicfuse warmup, remove isolation-dir, fix event log tailing.#2017

Open
Ivanbeethoven wants to merge 3 commits intoweb3infra-foundation:mainfrom
Ivanbeethoven:main
Open

feat(orion): add Dicfuse warmup, remove isolation-dir, fix event log tailing.#2017
Ivanbeethoven wants to merge 3 commits intoweb3infra-foundation:mainfrom
Ivanbeethoven:main

Conversation

@Ivanbeethoven
Copy link
Collaborator

  • Add Dicfuse warmup on Orion startup to eliminate cold-start latency
    for the first build request (antares.rs)
  • Remove --isolation-dir from all buck2 invocations; each FUSE mount
    already produces a unique project root so Buck2 naturally isolates
    daemons without it. Drop ring/hex dependencies accordingly
  • Switch tracing init to EnvFilter so RUST_LOG can suppress noisy
    dependency warnings (e.g. rfuse3 alignment messages)
  • Move .env loading before logger init so RUST_LOG from .env takes effect
  • Fix --event-log placement: it is a build subcommand flag, not a
    global option
  • Fix event log tail hanging: wait up to 30s for the file to appear,
    signal tail loop to exit once decompression finishes, and abort
    tracker tasks on build completion to prevent leaks
  • Bump scorpiofs 0.2.0 → 0.2.1
  • Run as root in systemd unit to avoid capability issues with chmod
  • Keep failed-build mounts alive for debugging instead of unmounting
  • Suppress dead_code warnings for temporarily-disabled unmount infra

…tailing

Signed-off-by: Luxian <lux1an@qq.com>
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 647857ecec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// Stop the build-status tracker cleanly.
target_build_track.cancellation.cancel();
target_build_track.tail_handle.abort();
target_build_track.process_handle.abort();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow status loop to flush before aborting

build cancels the tracker and then immediately calls process_handle.abort(), but run_processing_loop only emits its final buffered TargetBuildStatusBatch in the cleanup path after it observes cancellation. Aborting the task short-circuits that cleanup, so any updates buffered near process exit are dropped, which causes incomplete per-target status reporting for clients at the end of successful/failed builds.

Useful? React with 👍 / 👎.

// Wait for the event log file to appear (buck2 creates it after startup).
// Use a bounded wait; if the file never shows up (e.g. zero targets) we
// exit gracefully so the caller is not blocked.
let wait_timeout = Duration::from_secs(30);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Increase event-log wait timeout to startup budget

tail_compressed_buck2_events hard-codes a 30s file-appearance timeout, then returns early and never tails events if the log is still missing. In the same build flow, Buck is allowed to spend up to 1200s on daemon startup (BUCKD_STARTUP_INIT_TIMEOUT), so cold-start environments can legitimately exceed 30s and lose all target-status events for that build because tracking exits permanently.

Useful? React with 👍 / 👎.

Copy link
Member

@genedna genedna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #2017

Overall this is a well-structured PR that addresses several real operational pain points (cold-start latency, event-log tail hanging, noisy dependency logs). The changes are well-motivated and the commit message does an excellent job explaining why each change is needed. A few issues worth addressing below.

Highlights

  • Dicfuse warmup — Good idea to eliminate cold-start latency by eagerly triggering import at startup. Spawning the wait in the background so Orion can continue booting is the right design.
  • Event log tail fix — The decompress_done atomic flag plus bounded empty-poll exit is a clean solution to the tail-hanging problem. Much better than the previous infinite-poll approach.
  • EnvFilter migration — Moving to EnvFilter with .env loaded first is the correct ordering for RUST_LOG to take effect.
  • --isolation-dir removal — Good cleanup. The rationale that each FUSE mount already produces a unique project root is sound, and dropping ring/hex dependencies is a nice side-effect.

Concerns

See inline comments for specific issues.

if attempt == MAX_TARGETS_ATTEMPTS {
break;
}
// Keep mounts alive on failure for debugging.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code quality: Commented-out unmount calls

Rather than commenting out guard.unmount().await / guard_old_repo.unmount().await with a note to readers, consider using an explicit flag (e.g., const KEEP_MOUNTS_ON_FAILURE: bool = true;) or environment variable (e.g., ORION_KEEP_FAILED_MOUNTS=1). Commented-out code tends to rot and makes it unclear whether the behavior is intentional vs. a debugging leftover.

This pattern also appears in the MountGuard::Drop impl on main. A single toggle would be cleaner than scattering // TODO: re-enable comments across multiple sites.

"Waiting for Antares Dicfuse warmup to finish (timeout: {}s)",
DICFUSE_WARMUP_TIMEOUT_SECS
);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 20-minute warmup timeout

DICFUSE_WARMUP_TIMEOUT_SECS = 1200 (20 minutes) is quite generous. If the warmup is genuinely stuck, Orion will appear healthy (accepting WebSocket connections) but the first build will still hit cold-start latency since the warmup is just warn-logged and non-blocking.

Consider:

  1. Making this configurable via an environment variable (e.g., ORION_DICFUSE_WARMUP_TIMEOUT_SECS) so operators can tune it per-deployment.
  2. Exposing warmup status via a health check endpoint so load balancers can delay routing traffic until warmup completes.

tracing::warn!(
"Event log file did not appear within {}s, skipping tail: {}",
wait_timeout.as_secs(),
compressed_path.display()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Robustness: Race between exists() check and File::open

After this loop exits, the code proceeds to File::open(compressed_path).await? a few lines below. There's a small TOCTOU window where the file could disappear between the exists() check and the open (unlikely but possible if the buck2 process is killed). More importantly, if the file exists but is empty or not yet written to, the ZstdDecoder may fail immediately with a decode error.

Not critical since buck2 owns the file lifecycle, but worth a brief comment noting the assumption.

let mut buffer = vec![0u8; 256 * 1024];
loop {
let bytes_read = decoder.read(&mut buffer).await?;
if bytes_read == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good pattern: Using Arc<AtomicBool> to signal decompression completion is clean and avoids the need for a heavier synchronization primitive. The Release/Acquire ordering is correct here.

One minor consideration: since MAX_EMPTY_POLLS_AFTER_DONE is 5 and the poll interval is passed in from outside (typically 200ms), the tail task will wait up to ~1 second after decompression finishes before exiting. This is fine for the current use case but worth documenting the relationship between MAX_EMPTY_POLLS_AFTER_DONE and poll_interval for future maintainers.

Some(s) => s,
None => child.wait().await?,
};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix: Aborting tracker tasks on completion

target_build_track.tail_handle.abort();
target_build_track.process_handle.abort();

This is the right approach — the previous code could leave the tail and process tasks leaked indefinitely if the build completed quickly. Calling abort() before await ensures clean shutdown.

However, note that abort() + await on a JoinHandle will return Err(JoinError::Cancelled). The let _ = silently ignores this, which is fine, but if these tasks have side effects (e.g., flushing final state), the abort may prevent the final flush in run_processing_loop. The cancellation token signal (already called on the line above) should handle this gracefully, but verify that the processing loop's final flush_buffer call completes before the abort fires.

.env("BUCKD_STARTUP_INIT_TIMEOUT", "1200");
let cmd = cmd
.arg("build")
.args(["--event-log", EVENT_LOG_FILE])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improvement: BUCKD_STARTUP_INIT_TIMEOUT changed from 120 to 1200

This is a 10x increase. While the other invocations already use 1200, it's worth a brief comment explaining why the build command now also needs 1200s (e.g., "FUSE-backed repos may trigger lazy loading during daemon init, which can be slow on cold caches"). Otherwise a future reader might wonder if this was a typo.


# Grant CAP_DAC_READ_SEARCH file capability so orion can bypass
# DAC read/search checks without running as root
sudo setcap cap_dac_read_search+ep /home/orion/orion-runner/orion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security note: setcap + NoNewPrivileges=false

Granting CAP_DAC_READ_SEARCH via file capabilities is a reasonable approach to avoid running as root. However, the systemd unit sets NoNewPrivileges=false, which means the process can gain additional capabilities/privileges post-exec. This is already noted in the service file as "relaxed for FUSE operations", but be aware that CAP_DAC_READ_SEARCH allows the process to bypass DAC read permissions on any file on the system — not just FUSE mounts.

Consider documenting the threat model: what files does orion need to read that it otherwise couldn't, and whether CAP_DAC_OVERRIDE (already in CapabilityBoundingSet) is a superset that makes this redundant.


config_file = "./config.toml"
# Git 提交信息
git_author = "MEGA"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The added config_file = "./config.toml" uses a relative path. This will resolve relative to the working directory at runtime, which may vary depending on how orion is launched (systemd, manual, CI). Consider using an absolute path or documenting the expected CWD.

preheat(repo_path)?;

for attempt in 1..=MAX_ATTEMPTS {
tracing::debug!("Get targets for repo {repo_path:?} (attempt {attempt}/{MAX_ATTEMPTS})");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: preheat() call removed from get_repo_targets

The previous version called preheat(repo_path)? before running buck2 targets. This is now removed, presumably because the new warmup_dicfuse() at startup handles the warming. However, warmup_dicfuse() warms the Dicfuse import (metadata tree), while preheat() warmed the mounted FUSE directory via read_dir walks.

These serve different purposes — Dicfuse import populates the backing store, while preheat warms the kernel's VFS/FUSE cache for already-mounted paths. If the mount is fresh (new UUID per attempt), the FUSE cache will be cold regardless of Dicfuse warmup status. Was this removal intentional? If so, consider adding a brief comment explaining why preheat is no longer needed here.

Ivanbeethoven and others added 2 commits March 14, 2026 14:32
Co-authored-by: Quanyi Ma <eli@patch.sh>
Signed-off-by: Xiaoyang Han <lux1an@qq.com>
Signed-off-by: Luxian <lux1an@qq.com>
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9afd74b57e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// the select! loop.
target_build_track.cancellation.cancel();
// Abort the tail task — no more events need to be ingested.
target_build_track.tail_handle.abort();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow event-log tailer to drain before aborting tracker

After the build exits, this unconditionally aborts the tail task before it can finish decompressing and forwarding the last event-log lines. In large/slow builds, the tailer is often still catching up when child.wait() returns, so aborting here drops final status events and leaves clients with incomplete per-target results even though the build is done. Prefer waiting for tail completion (with a bounded timeout) instead of aborting immediately.

Useful? React with 👍 / 👎.

Comment on lines +59 to +63
let wait_timeout_secs: u64 = std::env::var("ORION_EVENT_LOG_WAIT_TIMEOUT_SECS")
.ok()
.and_then(|v| v.parse().ok())
.unwrap_or(300);
let wait_timeout = Duration::from_secs(wait_timeout_secs);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match event-log wait default to buck startup timeout

The new default wait for event-log file creation is 300s, but the build path allows Buck daemon startup for up to 1200s (BUCKD_STARTUP_INIT_TIMEOUT in orion/src/buck_controller.rs). On cold environments where startup takes longer than 5 minutes, this returns early and disables target-status streaming for the entire build despite Buck still starting within its allowed budget. Fresh evidence in this revision is the hard-coded 300s default added here; it should align with the 1200s startup window.

Useful? React with 👍 / 👎.

@Ivanbeethoven Ivanbeethoven requested a review from jerry609 March 14, 2026 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants