feat(orion): add Dicfuse warmup, remove isolation-dir, fix event log tailing. by Ivanbeethoven · Pull Request #2017 · web3infra-foundation/mega

Ivanbeethoven · 2026-03-13T16:18:47Z

Add Dicfuse warmup on Orion startup to eliminate cold-start latency
for the first build request (antares.rs)
Remove --isolation-dir from all buck2 invocations; each FUSE mount
already produces a unique project root so Buck2 naturally isolates
daemons without it. Drop ring/hex dependencies accordingly
Switch tracing init to EnvFilter so RUST_LOG can suppress noisy
dependency warnings (e.g. rfuse3 alignment messages)
Move .env loading before logger init so RUST_LOG from .env takes effect
Fix --event-log placement: it is a build subcommand flag, not a
global option
Fix event log tail hanging: wait up to 30s for the file to appear,
signal tail loop to exit once decompression finishes, and abort
tracker tasks on build completion to prevent leaks
Bump scorpiofs 0.2.0 → 0.2.1
Run as root in systemd unit to avoid capability issues with chmod
Keep failed-build mounts alive for debugging instead of unmounting
Suppress dead_code warnings for temporarily-disabled unmount infra

…tailing Signed-off-by: Luxian <lux1an@qq.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 647857ecec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-13T16:24:04Z

orion/src/buck_controller.rs

+        // Stop the build-status tracker cleanly.
        target_build_track.cancellation.cancel();
+        target_build_track.tail_handle.abort();
+        target_build_track.process_handle.abort();


Allow status loop to flush before aborting

build cancels the tracker and then immediately calls process_handle.abort(), but run_processing_loop only emits its final buffered TargetBuildStatusBatch in the cleanup path after it observes cancellation. Aborting the task short-circuits that cleanup, so any updates buffered near process exit are dropped, which causes incomplete per-target status reporting for clients at the end of successful/failed builds.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-13T16:24:04Z

orion/td_util/src/file_tail.rs

+    // Wait for the event log file to appear (buck2 creates it after startup).
+    // Use a bounded wait; if the file never shows up (e.g. zero targets) we
+    // exit gracefully so the caller is not blocked.
+    let wait_timeout = Duration::from_secs(30);


Increase event-log wait timeout to startup budget

tail_compressed_buck2_events hard-codes a 30s file-appearance timeout, then returns early and never tails events if the log is still missing. In the same build flow, Buck is allowed to spend up to 1200s on daemon startup (BUCKD_STARTUP_INIT_TIMEOUT), so cold-start environments can legitimately exceed 30s and lose all target-status events for that build because tracking exits permanently.

Useful? React with 👍 / 👎.

genedna

Code Review — PR #2017

Overall this is a well-structured PR that addresses several real operational pain points (cold-start latency, event-log tail hanging, noisy dependency logs). The changes are well-motivated and the commit message does an excellent job explaining why each change is needed. A few issues worth addressing below.

Highlights

Dicfuse warmup — Good idea to eliminate cold-start latency by eagerly triggering import at startup. Spawning the wait in the background so Orion can continue booting is the right design.
Event log tail fix — The decompress_done atomic flag plus bounded empty-poll exit is a clean solution to the tail-hanging problem. Much better than the previous infinite-poll approach.
EnvFilter migration — Moving to EnvFilter with .env loaded first is the correct ordering for RUST_LOG to take effect.
--isolation-dir removal — Good cleanup. The rationale that each FUSE mount already produces a unique project root is sound, and dropping ring/hex dependencies is a nice side-effect.

Concerns

See inline comments for specific issues.

orion/src/buck_controller.rs

genedna · 2026-03-13T16:38:42Z

orion/src/buck_controller.rs

-                if attempt == MAX_TARGETS_ATTEMPTS {
-                    break;
-                }
+                // Keep mounts alive on failure for debugging.


Code quality: Commented-out unmount calls

Rather than commenting out guard.unmount().await / guard_old_repo.unmount().await with a note to readers, consider using an explicit flag (e.g., const KEEP_MOUNTS_ON_FAILURE: bool = true;) or environment variable (e.g., ORION_KEEP_FAILED_MOUNTS=1). Commented-out code tends to rot and makes it unclear whether the behavior is intentional vs. a debugging leftover.

This pattern also appears in the MountGuard::Drop impl on main. A single toggle would be cleaner than scattering // TODO: re-enable comments across multiple sites.

genedna · 2026-03-13T16:38:42Z

orion/src/antares.rs

+            "Waiting for Antares Dicfuse warmup to finish (timeout: {}s)",
+            DICFUSE_WARMUP_TIMEOUT_SECS
+        );
+


Nit: 20-minute warmup timeout

DICFUSE_WARMUP_TIMEOUT_SECS = 1200 (20 minutes) is quite generous. If the warmup is genuinely stuck, Orion will appear healthy (accepting WebSocket connections) but the first build will still hit cold-start latency since the warmup is just warn-logged and non-blocking.

Consider:

Making this configurable via an environment variable (e.g., ORION_DICFUSE_WARMUP_TIMEOUT_SECS) so operators can tune it per-deployment.

Exposing warmup status via a health check endpoint so load balancers can delay routing traffic until warmup completes.

genedna · 2026-03-13T16:38:42Z

orion/td_util/src/file_tail.rs

+            tracing::warn!(
+                "Event log file did not appear within {}s, skipping tail: {}",
+                wait_timeout.as_secs(),
+                compressed_path.display()


Robustness: Race between exists() check and File::open

After this loop exits, the code proceeds to File::open(compressed_path).await? a few lines below. There's a small TOCTOU window where the file could disappear between the exists() check and the open (unlikely but possible if the buck2 process is killed). More importantly, if the file exists but is empty or not yet written to, the ZstdDecoder may fail immediately with a decode error.

Not critical since buck2 owns the file lifecycle, but worth a brief comment noting the assumption.

genedna · 2026-03-13T16:38:42Z

orion/td_util/src/file_tail.rs

+            let mut buffer = vec![0u8; 256 * 1024];
+            loop {
+                let bytes_read = decoder.read(&mut buffer).await?;
+                if bytes_read == 0 {


Good pattern: Using Arc<AtomicBool> to signal decompression completion is clean and avoids the need for a heavier synchronization primitive. The Release/Acquire ordering is correct here.

One minor consideration: since MAX_EMPTY_POLLS_AFTER_DONE is 5 and the poll interval is passed in from outside (typically 200ms), the tail task will wait up to ~1 second after decompression finishes before exiting. This is fine for the current use case but worth documenting the relationship between MAX_EMPTY_POLLS_AFTER_DONE and poll_interval for future maintainers.

genedna · 2026-03-13T16:38:42Z

orion/src/buck_controller.rs

+            Some(s) => s,
+            None => child.wait().await?,
+        };



Good fix: Aborting tracker tasks on completion

target_build_track.tail_handle.abort(); target_build_track.process_handle.abort();

This is the right approach — the previous code could leave the tail and process tasks leaked indefinitely if the build completed quickly. Calling abort() before await ensures clean shutdown.

However, note that abort() + await on a JoinHandle will return Err(JoinError::Cancelled). The let _ = silently ignores this, which is fine, but if these tasks have side effects (e.g., flushing final state), the abort may prevent the final flush in run_processing_loop. The cancellation token signal (already called on the line above) should handle this gracefully, but verify that the processing loop's final flush_buffer call completes before the abort fires.

genedna · 2026-03-13T16:38:42Z

orion/src/buck_controller.rs

+            .env("BUCKD_STARTUP_INIT_TIMEOUT", "1200");
        let cmd = cmd
            .arg("build")
+            .args(["--event-log", EVENT_LOG_FILE])


Improvement: BUCKD_STARTUP_INIT_TIMEOUT changed from 120 to 1200

This is a 10x increase. While the other invocations already use 1200, it's worth a brief comment explaining why the build command now also needs 1200s (e.g., "FUSE-backed repos may trigger lazy loading during daemon init, which can be slow on cold caches"). Otherwise a future reader might wonder if this was a typo.

genedna · 2026-03-13T16:38:42Z

.github/workflows/orion-client-deploy.yml


+            # Grant CAP_DAC_READ_SEARCH file capability so orion can bypass
+            # DAC read/search checks without running as root
+            sudo setcap cap_dac_read_search+ep /home/orion/orion-runner/orion


Security note: setcap + NoNewPrivileges=false

Granting CAP_DAC_READ_SEARCH via file capabilities is a reasonable approach to avoid running as root. However, the systemd unit sets NoNewPrivileges=false, which means the process can gain additional capabilities/privileges post-exec. This is already noted in the service file as "relaxed for FUSE operations", but be aware that CAP_DAC_READ_SEARCH allows the process to bypass DAC read permissions on any file on the system — not just FUSE mounts.

Consider documenting the threat model: what files does orion need to read that it otherwise couldn't, and whether CAP_DAC_OVERRIDE (already in CapabilityBoundingSet) is a superset that makes this redundant.

genedna · 2026-03-13T16:38:42Z

orion/scorpio.toml

-
+config_file = "./config.toml"
 # Git 提交信息
 git_author = "MEGA"


Nit: The added config_file = "./config.toml" uses a relative path. This will resolve relative to the working directory at runtime, which may vary depending on how orion is launched (systemd, manual, CI). Consider using an absolute path or documenting the expected CWD.

genedna · 2026-03-13T16:38:42Z

orion/src/buck_controller.rs

-    preheat(repo_path)?;

    for attempt in 1..=MAX_ATTEMPTS {
        tracing::debug!("Get targets for repo {repo_path:?} (attempt {attempt}/{MAX_ATTEMPTS})");


Note: preheat() call removed from get_repo_targets

The previous version called preheat(repo_path)? before running buck2 targets. This is now removed, presumably because the new warmup_dicfuse() at startup handles the warming. However, warmup_dicfuse() warms the Dicfuse import (metadata tree), while preheat() warmed the mounted FUSE directory via read_dir walks.

These serve different purposes — Dicfuse import populates the backing store, while preheat warms the kernel's VFS/FUSE cache for already-mounted paths. If the mount is fresh (new UUID per attempt), the FUSE cache will be cold regardless of Dicfuse warmup status. Was this removal intentional? If so, consider adding a brief comment explaining why preheat is no longer needed here.

Co-authored-by: Quanyi Ma <eli@patch.sh> Signed-off-by: Xiaoyang Han <lux1an@qq.com>

Signed-off-by: Luxian <lux1an@qq.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9afd74b57e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-14T06:56:34Z

orion/src/buck_controller.rs

+        // the select! loop.
        target_build_track.cancellation.cancel();
+        // Abort the tail task — no more events need to be ingested.
+        target_build_track.tail_handle.abort();


Allow event-log tailer to drain before aborting tracker

After the build exits, this unconditionally aborts the tail task before it can finish decompressing and forwarding the last event-log lines. In large/slow builds, the tailer is often still catching up when child.wait() returns, so aborting here drops final status events and leaves clients with incomplete per-target results even though the build is done. Prefer waiting for tail completion (with a bounded timeout) instead of aborting immediately.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-14T06:56:34Z

orion/td_util/src/file_tail.rs

+    let wait_timeout_secs: u64 = std::env::var("ORION_EVENT_LOG_WAIT_TIMEOUT_SECS")
+        .ok()
+        .and_then(|v| v.parse().ok())
+        .unwrap_or(300);
+    let wait_timeout = Duration::from_secs(wait_timeout_secs);


Match event-log wait default to buck startup timeout

The new default wait for event-log file creation is 300s, but the build path allows Buck daemon startup for up to 1200s (BUCKD_STARTUP_INIT_TIMEOUT in orion/src/buck_controller.rs). On cold environments where startup takes longer than 5 minutes, this returns early and disables target-status streaming for the entire build despite Buck still starting within its allowed budget. Fresh evidence in this revision is the hard-coded 300s default added here; it should align with the 1200s startup window.

Useful? React with 👍 / 👎.

feat(orion): add Dicfuse warmup, remove isolation-dir, fix event log …

647857e

…tailing Signed-off-by: Luxian <lux1an@qq.com>

chatgpt-codex-connector bot reviewed Mar 13, 2026

View reviewed changes

genedna reviewed Mar 13, 2026

View reviewed changes

Ivanbeethoven and others added 2 commits March 14, 2026 14:32

Update orion/src/buck_controller.rs

ee2090f

Co-authored-by: Quanyi Ma <eli@patch.sh> Signed-off-by: Xiaoyang Han <lux1an@qq.com>

fix(orion): address PR review comments

9afd74b

Signed-off-by: Luxian <lux1an@qq.com>

chatgpt-codex-connector bot reviewed Mar 14, 2026

View reviewed changes

Ivanbeethoven requested a review from jerry609 March 14, 2026 08:34

Conversation

Ivanbeethoven commented Mar 13, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

genedna left a comment

Choose a reason for hiding this comment

Code Review — PR #2017

Highlights

Concerns

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants