
queue-runner: resolve CA derivations after dependency outputs land #1629

Open
amaanq wants to merge 5 commits into NixOS:master from obsidiansystems:ca-resolve-two-phase

Conversation

@amaanq (Member) commented Mar 30, 2026

Instead of resolving at dispatch time and carrying two drv identities through the gRPC layer, resolve in succeed_step once outputs are in the DB. The resolved drv becomes a fresh Step that dispatch picks up normally; the builder only ever sees one drv. A new resolvedFromBuild/resolvedFromStep foreign key in BuildSteps ties the resolved step back to the dependency that triggered it.

Comment on lines +296 to +299
-- If this step was created by resolving a CA derivation, points to the
-- original unresolved step that triggered resolution.
resolvedFromBuild integer,
resolvedFromStep integer,

Has to be the "to" direction, not the "from" direction, because many derivations will resolve to the same derivation.

@Ericson2314 (Member) left a comment

At the point where we called try_resolve before, we do this:

  • If the result is None or "same derivation":

    just continue with this step, same as we always did

  • If the result is Some(drv) and that drv is not the same as the current drv:

    finish this step with status "resolved", fill in the DB pointer to the new step, and let the new step build; this step is done.


The current "try resolve reverse dependencies" pass is a possible optimisation (incrementally resolving reverse dependencies, reminiscent of union-find), but that's a more complex thing we shouldn't worry about at this time.
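The two cases above can be sketched as a small decision function. This is purely illustrative: `StepAction` and `decide` are invented names, not the actual queue-runner API.

```rust
// Illustrative sketch only: `StepAction` and `decide` stand in for the real
// queue-runner types, which differ.

#[derive(Debug, PartialEq)]
enum StepAction {
    /// Resolution returned None or the same derivation: continue with this
    /// step, same as we always did.
    BuildHere,
    /// Resolution produced a different drv: finish this step with status
    /// "resolved", fill in the DB pointer, and let the new step build.
    FinishAsResolved { resolved_drv: String },
}

fn decide(current_drv: &str, resolution: Option<&str>) -> StepAction {
    match resolution {
        None => StepAction::BuildHere,
        Some(r) if r == current_drv => StepAction::BuildHere,
        Some(r) => StepAction::FinishAsResolved {
            resolved_drv: r.to_string(),
        },
    }
}

fn main() {
    assert_eq!(decide("a.drv", None), StepAction::BuildHere);
    assert_eq!(decide("a.drv", Some("a.drv")), StepAction::BuildHere);
    assert_eq!(
        decide("a.drv", Some("b.drv")),
        StepAction::FinishAsResolved { resolved_drv: "b.drv".to_string() }
    );
    println!("ok");
}
```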

@amaanq amaanq force-pushed the ca-resolve-two-phase branch 2 times, most recently from ae692ac to 05eef39 on March 31, 2026 03:10
NarSizeLimitExceeded = 11,
NotDeterministic = 12,
Busy = 100, // not stored
Resolved = 13, // step was resolved to a CA derivation, see resolvedTo FK

Is this enum used in the database? I'm realizing this might be the one from Nix's BuildResult that's already in Harmonia, and in those places it already has a resolved entry.

// Try to resolve CA derivation inputs. If resolution yields a
// different drv, mark this step as Resolved in the DB and create a new
// step for the resolved drv that the builder will actually build.
let resolved = if let Some(guard) = step_info.step.get_drv() {

Scrolling up, I think there is a way to get the drv path here which is not partial? E.g. from the job.


job only has the drv as a StorePath. Resolution requires a parsed nix_utils::Derivation, which is already stored in the Step object. We could parse it again, but reusing the stored one avoids unneeded work.

nr
};
job.step_nr = resolved_step_nr;
resolved_path

There should be some sort of early return here. We shouldn't directly go build the resolved step, but rather let the DB wake us up when it's ready to build, I think.


That's also how dynamic derivations will work, when the new stuff is not 1-1 with the old stuff and thus this way of manually pulling it out of the queue will not work.

It's therefore good preparation for that to figure out how to do this now.


Just to clarify, the issue is that we are replacing the current step with the new resolved one. Instead, we should tell the scheduler that this new step is pending, then exit the current step, which is now finished as resolved.

@Ericson2314 Ericson2314 force-pushed the ca-resolve-two-phase branch from 05eef39 to de9aed8 on March 31, 2026 18:45
@artemist artemist force-pushed the ca-resolve-two-phase branch 6 times, most recently from 0768499 to 80430bf on April 6, 2026 18:37
@artemist (Member) commented Apr 6, 2026

I think I have this working, but there are a few things I'm unclear on and would like to figure out:

  • Do the BuildSteps entries we create in the database properly map to the Step objects in the queue-runner, and do they have sufficient information to recover state after a restart and be shown in the UI?
  • How should we handle cached failures, and what is actually needed?
  • Are we doing any double-building, and how can we test that?

I also believe my change needs substantial refactoring, as I'm using only parts of several functions (e.g. create_step).

@artemist artemist force-pushed the ca-resolve-two-phase branch 2 times, most recently from 3c7b586 to 7eb2319 on April 7, 2026 18:49
@artemist (Member) commented Apr 7, 2026

Rebased onto #1642

Ericson2314 and others added 5 commits April 7, 2026 15:57
When Hydra is configured with a local store whose physical location
differs from its logical store dir (e.g. `local?root=X&store=Y` where `Y
!= X/nix/store`), `read_text` on the input's store path fails: the path
is a logical store path but the file lives at the real (physical)
location.

Shell out to `nix store cat` instead, which lets the store resolve the
physical path internally. We already import `Hydra::Helper::Exec`, so
reuse its `captureStdoutStderr` helper rather than pulling in `IPC::Run3`.

Not only will this work for relocated local stores, `nix store cat` will
work for arbitrary store implementations. This sort of decoupling is
generally a good thing to pursue.

In the future, we might extend the Nix Perl bindings so we don't need to
shell out for this. (This script already uses the Nix Perl bindings.)
`registerRoot` short-circuits when the target file already exists, so
calling it on a build whose GC root was already registered is a no-op.
The check is `-e $link`, which follows symlinks.

In the traditional physical = logical case this is fine: if another
process (e.g. `nix-eval-jobs --gc-roots-dir`) has already created a
symlink at `$link` pointing at the store path, the symlink's target is
a real file on disk, so `-e` returns true and we skip correctly.

Once we allow for redirected local stores, where the logical and
physical store directory paths differ, the assumption breaks down. The
symlink target is the *logical* store path, but the store object lives
at the *physical* store path, so `-e` (following the symlink) returns
false. We then fall through to `open(">", $link)`, which also follows
the symlink and tries to create its (non-existent) target — failing
with `No such file or directory`.

The solution is to switch to `readlink` + `isValidPath`. This routes the
existence decision through the store rather than relying on filesystem
checks that assume physical = logical.
Add a `read_derivation` C++ FFI wrapper that calls
`store->readDerivation()` and serialises the result to JSON via
`nlohmann::json`. Expose it in Rust as
`LocalStore::read_derivation_json()`.

Rewrite `query_drv` to:

1. Check the path via `is_valid_path` (which consults the store's own
   database) instead of `fs::try_exists` on the logical store path (the
   logical store path's presence on disk is not a reliable signal —
   only the physical path is).

2. Fetch the derivation contents via the new `read_derivation_json` FFI
   (so the store resolves the physical location internally) instead of
   reading the `.drv` file directly from the filesystem.

3. Deserialise using the harmonia `Derivation` type's
   `serde::Deserialize` impl, matching the JSON format produced on the
   C++ side.

This fixes `query_drv` for stores where the physical store directory
differs from the logical one (e.g. `local?root=X&store=Y` where `Y !=
X/nix/store`): previously `print_store_path` returned the logical path
and the direct filesystem access failed.

In fact, this fixes the code not only for redirected local stores, but
for *any* store implementation.

As an added bonus, with ATerm parsing gone, we drop the
`harmonia-store-aterm` dependency entirely.
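The rewritten `query_drv` flow can be sketched roughly as below. The `Store` trait, `StubStore`, and the toy "name=…" payload format are stand-ins for the real FFI-backed store, the harmonia `Derivation` type, and serde JSON, whose actual signatures differ; only the three-step shape is the point.

```rust
use std::collections::HashMap;

// Illustrative stand-ins: the real code uses an FFI-backed store and the
// harmonia `Derivation` type with serde; everything here is faked so the
// three-step flow is self-contained.

#[derive(Debug, PartialEq)]
struct Derivation {
    name: String,
}

trait Store {
    // Step 1: consult the store's own database, not the filesystem.
    fn is_valid_path(&self, path: &str) -> bool;
    // Step 2: let the store resolve the physical location internally.
    fn read_derivation_json(&self, path: &str) -> Option<String>;
}

fn query_drv(store: &impl Store, path: &str) -> Option<Derivation> {
    if !store.is_valid_path(path) {
        return None;
    }
    let payload = store.read_derivation_json(path)?;
    // Step 3: deserialise. The real code uses serde; this toy format is
    // just "name=<drv name>".
    let name = payload.strip_prefix("name=")?.to_string();
    Some(Derivation { name })
}

// A store backed by an in-memory map, standing in for the FFI store.
struct StubStore(HashMap<String, String>);

impl Store for StubStore {
    fn is_valid_path(&self, path: &str) -> bool {
        self.0.contains_key(path)
    }
    fn read_derivation_json(&self, path: &str) -> Option<String> {
        self.0.get(path).cloned()
    }
}

fn main() {
    let mut paths = HashMap::new();
    paths.insert("/nix/store/aaa-hello.drv".to_string(), "name=hello".to_string());
    let store = StubStore(paths);
    assert_eq!(
        query_drv(&store, "/nix/store/aaa-hello.drv"),
        Some(Derivation { name: "hello".to_string() })
    );
    // An invalid path returns None without ever touching the filesystem.
    assert_eq!(query_drv(&store, "/nix/store/bbb-missing.drv"), None);
    println!("ok");
}
```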
A handful of test infrastructure fixes needed once the queue runner and
builder use distinct physical Nix stores sharing the same logical
store dir.

- `HydraTestContext`: create the builder and central roots up front
  with `make_path` (the `local?root=...` store won't initialise
  otherwise), and give the builder its own `nix_state_dir` /
  `nix_log_dir` fields alongside the existing `nix_store_dir`.

- `QueueRunnerBuildOne`:
  - Set `NIX_STORE_DIR` when starting `hydra-builder` with a `TODO`
    note: the builder currently reads this to advertise its logical
    store dir to the queue runner; it should take the value from the
    store URI instead.
  - Pump the queue-runner `IPC::Run` harness alongside the builder
    and flush accumulated stderr on every poll, so when a test hangs
    we actually see the queue runner's logs instead of a blank
    "Builder stderr: ..." wall.

- `queue-runner/notifications.t`: the "prebuild into a binary cache"
  preamble uses `nix-build` / `nix copy` / `nix log` / `nix-store`
  directly, which inherits the central store URI. That store has
  physical != logical, so building derivations whose `$PATH` points
  at the host `/nix/store/...-coreutils/bin` fails with `mkdir: not
  found`. Point all of these at the builder store (which has physical
  = logical) via `--store`, and clean up the builder's log dir at
  the end of the subtest instead of the central one.

  This also makes sure the test doesn't depend on `nix-build` etc. doing
  things to the central / queue runner's store that we didn't realize;
  if it did, the test would now fail rather than silently relying on them.
Instead of resolving at StepInfo construction and carrying two drv identities
through the gRPC layer, resolve in realise_drv_on_valid_machine once all deps
are built. If resolution yields a different drv, the original step is marked
Resolved and a new DB step is created for the resolved drv with a resolvedTo FK
linking them. The builder only ever sees one drv.

We create a new Step for that resolution and bunt it back to the
scheduler. This grants us more flexibility in execution and the
method can be used in the future for dynamic derivations,
which won't map 1:1 with the original derivations.

In order to make tests more consistent, CA derivations will fail if they
cannot be fully resolved. Otherwise, there could be inconsistent
successes depending on which builder a step was performed on.
@artemist artemist force-pushed the ca-resolve-two-phase branch from 7eb2319 to 62abe6f on April 7, 2026 20:03
.set_max_silent_time(biggest_max_silent_time.unwrap_or(build.max_silent_time));
build_options.set_build_timeout(biggest_build_timeout.unwrap_or(build.timeout));
build.id
build.clone()

Could this be a move?

Comment on lines +382 to +404
// If we try to build an unresolved CA derivation it might work, but it might not.
// Make sure it always fails for consistency.
let is_ca = drv_ref
    .outputs
    .iter()
    .all(|output| matches!(output.1, nix_utils::DerivationOutput::CAFloating(_)));

let needs_resolution = drv_ref.inputs.iter().any(|input| {
    matches!(
        input,
        nix_utils::SingleDerivedPath::Built {
            drv_path: _,
            output: _
        }
    )
});

if is_ca && needs_resolution && resolved.is_none() {
    return Err(anyhow::anyhow!(
        "Failed to resolve CAFloating derivation {drv}"
    ));
}

@Ericson2314 (Member) commented Apr 7, 2026

We should probably do this stuff first, before we bother to compute the resolution?
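That reordering could look roughly like the sketch below, with the cheap guards evaluated before any resolution work is paid for. `Drv`, `try_resolve`, and `realise` are illustrative stand-ins, not the real queue-runner functions.

```rust
// Illustrative stand-ins, not the real queue-runner types.
struct Drv {
    all_outputs_ca_floating: bool,
    has_unbuilt_drv_inputs: bool,
    resolvable: bool,
}

// Stand-in for the (potentially expensive) resolution computation.
fn try_resolve(drv: &Drv) -> Option<String> {
    drv.resolvable.then(|| "resolved.drv".to_string())
}

fn realise(drv: &Drv) -> Result<Option<String>, String> {
    // Cheap guards first: only compute the resolution once we know we need it.
    let is_ca = drv.all_outputs_ca_floating;
    let needs_resolution = drv.has_unbuilt_drv_inputs;
    if is_ca && needs_resolution {
        return match try_resolve(drv) {
            Some(resolved) => Ok(Some(resolved)),
            // An unresolved floating-CA drv must fail, for consistency.
            None => Err("Failed to resolve CAFloating derivation".to_string()),
        };
    }
    Ok(None) // nothing to resolve; build the drv as-is
}

fn main() {
    let ok = Drv { all_outputs_ca_floating: true, has_unbuilt_drv_inputs: true, resolvable: true };
    assert_eq!(realise(&ok), Ok(Some("resolved.drv".to_string())));
    let bad = Drv { all_outputs_ca_floating: true, has_unbuilt_drv_inputs: true, resolvable: false };
    assert!(realise(&bad).is_err());
    println!("ok");
}
```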

Comment on lines +406 to +411
} else {
None
}
};

if let Some(resolved_path) = resolved {

I think we can combine this with the previous branch? Or at least we could compute a "do we want to resolve" bool, and then have the actual resolving (which must succeed) and this stuff in the same branch.

if let Some(resolved_path) = resolved {
tracing::info!("resolved CA derivation {drv} -> {resolved_path}");

// Finish original step as "resolved".

If possible, I would like to create the follow-up build step before we marked this resolved. This will bode well for the future when we start doing more of the state machine in the database.

Ideally, we will have a check constraint on the resolved status, saying it can only be used when the foreign key to the next step is set!
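One possible shape for such a constraint, assuming the status code and column name shown in the diffs above (13 = Resolved, `resolvedToStep`); this is a hypothetical sketch, not the actual migration:

```sql
-- Hypothetical sketch only: status 13 ("Resolved") may appear only when
-- the pointer to the follow-up step is set. Names follow the diffs in
-- this PR; the real migration may differ.
alter table BuildSteps
    add constraint buildsteps_resolved_has_target
    check (status <> 13 or resolvedToStep is not null);
```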

resolved_result.set_start_time_now();
resolved_result.set_stop_time_now();
resolved_result.log_file.clone_from(&job.result.log_file);
finish_build_step(

Let's try to do everything in one transaction here. finish_build_step currently creates its own transaction, so let's make it take the transaction as an argument instead.
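A minimal sketch of that refactor, using a toy `Tx` type in place of the real transaction; the point is that `finish_build_step` records its work on the caller's transaction, so creating the new step, finishing the old one, and setting the pointer all commit together.

```rust
// Toy `Tx` standing in for the real database transaction type; it just
// records the statements it is given.
#[derive(Default)]
struct Tx {
    statements: Vec<String>,
}

impl Tx {
    fn execute(&mut self, sql: &str) {
        self.statements.push(sql.to_string());
    }
    fn commit(self) -> Vec<String> {
        // In the real code this is where everything becomes visible at once.
        self.statements
    }
}

// After the refactor: finish_build_step no longer opens its own transaction;
// it records its work on the one the caller passes in.
fn finish_build_step(tx: &mut Tx, step_nr: i32) {
    tx.execute(&format!(
        "update BuildSteps set status = 13 where stepnr = {step_nr}"
    ));
}

fn resolve_in_one_transaction() -> Vec<String> {
    let mut tx = Tx::default();
    tx.execute("insert into BuildSteps ... -- the new resolved step");
    finish_build_step(&mut tx, 1); // mark the original step Resolved
    tx.execute("update BuildSteps set resolvedToStep = ...");
    tx.commit() // all three statements land atomically
}

fn main() {
    assert_eq!(resolve_in_one_transaction().len(), 3);
    println!("ok");
}
```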


tx.set_resolved_to(build_id, step_nr, nr).await?;
tx.commit().await?;


TODO: think about restart resiliency. I believe everything else after this point in this branch doesn't use the database, so that's good: we aren't logically splitting a transaction in two (assuming the previous stuff is done).

What is left to think about is: if the queue runner was restarted at this point, would the general "rebuild state machine from build database state" logic get us into the same step as what this code right here does?


We could make a variation of our CA tests where the queue runner dies here and then restarts, to test this.

Comment on lines +458 to +460
Arc::new(parking_lot::RwLock::new(HashSet::new())),
Arc::new(parking_lot::RwLock::new(HashSet::new())),
Arc::new(parking_lot::RwLock::new(HashSet::new())),

We should have a big fat comment explaining why it is correct not to give this function the mutable state it expects.

Comment on lines +489 to +490
// If we're the root of a build then we need to become the new root,
// lest the Step get garbage collected.

The comment should be clear that this is about the Arc keeping things alive, not the store keeping .drv files alive.

We should also be clear that in the non-root case, the idea is that we registered some reverse deps above that will keep us alive. Maybe we should assert that we did in fact go through that loop.

@Ericson2314 (Member) commented Apr 7, 2026

We will need to figure out how to keep the .drv file alive for all resolved derivations, and the GC roots that nix-eval-jobs made will not do the job since these are new drvs.

foreign key (build) references Builds(id) on delete cascade,
foreign key (propagatedFrom) references Builds(id) on delete cascade
foreign key (propagatedFrom) references Builds(id) on delete cascade,
foreign key (build, resolvedToStep) references BuildSteps(build, stepnr) on delete cascade

So long term, when the database distinguishes derivations from build attempts, this should actually point to a derivation, not another build attempt.

A build attempt samples the state of the database at that point to resolve, and creates a new (resolved) derivation. That derivation can be attempted multiple times (we don't really care to privilege the first attempt with a foreign key). However, subsequent attempts to build the original derivation, if we rebuild other upstream derivations and they are content-addressed and non-deterministic, will result in a different resolved derivation, and that too can be attempted multiple times.

Memoizing realization in this way does give us something close to a derived build trace, but the fact that our build trace is thus keyed off attempts, not derivations, gives us some ability to represent (and not choke on) incoherence.
