test(sentinel/checkpoint): isolate via CONTINUUM_CHECKPOINT_DIR env override by joelteply · Pull Request #971 · CambrianTech/continuum

joelteply · 2026-04-25T18:18:30Z

Summary

Checkpoint tests no longer write to the user's real ~/.continuum/sentinel/checkpoints/ — they redirect to a process-shared tempfile::TempDir via the new CONTINUUM_CHECKPOINT_DIR env override.
Production code path is unchanged: when the env var is unset (always, in prod), checkpoints_dir() resolves to the same default ~/.continuum/sentinel/checkpoints path as before.
Fixes a real prepush flake: on dev machines whose ~/.continuum/sentinel/ directory is root-owned (left over from any docker compose run that bind-mounted $HOME and chmod'd dirs under root), cargo test --lib panics with "Permission denied (os error 13)" in 3 checkpoint tests and blocks the prepush gate.

Background — empirical hit

Hit on PR #950 prepush from bigmama-wsl 2026-04-25:

modules::sentinel::checkpoint::tests::test_save_load_checkpoint
  Failed to write checkpoint tmp file: Permission denied (os error 13)
modules::sentinel::checkpoint::tests::test_list_checkpoints
  Failed to write checkpoint tmp file: Permission denied (os error 13)
modules::sentinel::checkpoint::tests::test_recover_interrupted
  Failed to write checkpoint tmp file: Permission denied (os error 13)
test result: FAILED. 1826 passed; 3 failed; 28 ignored; 0 measured; 0 filtered out

ls -la ~/.continuum/sentinel/checkpoints/ showed drwxr-xr-x 2 root root — created by a prior root-uid docker process that wrote into the bind-mounted home. Tests had no isolation: they wrote to the production path.

Fix structure

Two-part change in src/workers/continuum-core/src/modules/sentinel/checkpoint.rs:

checkpoints_dir() env override — checks CONTINUUM_CHECKPOINT_DIR first, falls through to the existing home-derived default. Production callers see no behavior change unless they explicitly set the var.
Test-mode tempdir isolation — new ensure_test_checkpoint_dir() helper, called at the top of each test. It lazy-initializes a process-global tempfile::TempDir via OnceLock and sets the env var to its path. Cargo runs tests in parallel within one process, and set_var is process-global, so a per-test TempDir would race; sharing one is fine because tests already use UUID-derived handles that structurally don't collide.

Validation

$ source src/scripts/shared/cargo-features.sh
$ cd src/workers/continuum-core
$ cargo test --lib $CARGO_GPU_FEATURES modules::sentinel::checkpoint
test modules::sentinel::checkpoint::tests::test_save_load_checkpoint ... ok
test modules::sentinel::checkpoint::tests::test_list_checkpoints ... ok
test modules::sentinel::checkpoint::tests::test_recover_interrupted ... ok
test result: ok. 3 passed; 0 failed; 0 ignored

$ cargo test --lib $CARGO_GPU_FEATURES
test result: ok. 1829 passed; 0 failed; 28 ignored; 0 measured; 0 filtered out

Why this matters beyond one machine

The checkpoint module's hard-coded ~/.continuum dependency is a latent foot-gun for any contributor whose dev box has had a root-mounted docker run over $HOME — a common state given how often we run docker compose with bind-mounts. Env override + tempdir isolation makes the test suite portable across dev environments without per-machine sudo chown gymnastics.

Test plan

cargo test --lib --features cuda,load-dynamic-ort — 1829 pass, 0 fail (was 1826/3 fail on bigmama-wsl).
Repeated runs share TempDir cleanly (parallel test execution, no panics).
Prod path (env var unset) unchanged — checkpoints_dir() returns ~/.continuum/sentinel/checkpoints exactly as before.
Re-run on a machine with root-owned ~/.continuum/sentinel/ to confirm the bug is gone there too. (Will validate post-merge on bigmama-wsl with the offending state intact.)

…verride Empirical hit on PR #950 prepush: cargo test --lib panicked in 3 checkpoint tests with "Failed to write checkpoint tmp file: Permission denied (os error 13)". Root cause: ~/.continuum/sentinel/checkpoints/ was owned by root:root with mode 755 on this dev machine — left over from a prior docker container that mounted $HOME and chmod'd the dir under root. Tests wrote to that production path, no isolation. Two-part fix in one commit, both in checkpoint.rs: 1. checkpoints_dir() now honors CONTINUUM_CHECKPOINT_DIR env override. Production code still falls through to ~/.continuum/sentinel/ checkpoints when the var is unset (no behavior change in prod). 2. Tests share a process-global tempfile::TempDir via OnceLock, exposed through ensure_test_checkpoint_dir() called at the top of each test. Cargo runs tests in parallel within one process and set_var is process-global — a per-test TempDir would race. Shared dir is fine because tests use UUID-derived handles; collisions structurally don't happen. Validated locally: cargo test --lib --features cuda,load-dynamic-ort → 1829 passed; 0 failed; 28 ignored (was: 1826 passed; 3 failed in checkpoint tests on this machine.) Why this matters beyond the local fix: the checkpoint module's hard- coded ~/.continuum dependency is a latent foot-gun for ANY contributor whose dev box has had a root-mounted docker run over $HOME — a common state given how often we run docker compose with bind-mounts. Env override + tempdir isolation makes the test suite portable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adjusts sentinel checkpoint tests to avoid writing into the user’s real ~/.continuum/sentinel/checkpoints by introducing an env-var override and having tests redirect storage to a shared temp directory.

Changes:

Added CONTINUUM_CHECKPOINT_DIR override support to checkpoints_dir().
Introduced a process-global TempDir for checkpoint tests and set the override in each test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-25T18:21:04Z

+    if let Ok(override_dir) = std::env::var("CONTINUUM_CHECKPOINT_DIR") {
+        return PathBuf::from(override_dir);


checkpoints_dir() uses std::env::var, which requires UTF-8 and will also accept an empty string (mapping to PathBuf::from(""), i.e., the current working directory). Since this env var is intended to be a filesystem path override, consider using std::env::var_os and ignoring empty values so an accidentally-set empty/invalid override doesn’t redirect checkpoint writes into the process CWD.

Suggested change

if let Ok(override_dir) = std::env::var("CONTINUUM_CHECKPOINT_DIR") {

return PathBuf::from(override_dir);

if let Some(override_dir) = std::env::var_os("CONTINUUM_CHECKPOINT_DIR") {

if !override_dir.is_empty() {

return PathBuf::from(override_dir);

}

Copilot · 2026-04-25T18:21:04Z

+        let dir = TMPDIR.get_or_init(|| {
+            tempfile::tempdir().expect("Failed to create test checkpoint tempdir")
+        });
+        // SAFETY: set_var is unsafe in newer Rust (process-global, racy with
+        // other threads reading env). Tests in this module only ever write
+        // the SAME path, so concurrent setters write the same value — race-
+        // free in practice.
+        std::env::set_var(
+            "CONTINUUM_CHECKPOINT_DIR",
+            dir.path(),
+        );


ensure_test_checkpoint_dir() calls std::env::set_var on every test invocation. With cargo’s parallel test execution, multiple tests can call set_var concurrently, which is exactly the racy/unsound pattern the comment warns about (even if all writers set the same value). To avoid this, set the env var only once (e.g., inside the OnceLock init path or via an additional Once guard) so only a single thread ever performs the set_var call.

Suggested change

let dir = TMPDIR.get_or_init(|| {

tempfile::tempdir().expect("Failed to create test checkpoint tempdir")

});

// SAFETY: set_var is unsafe in newer Rust (process-global, racy with

// other threads reading env). Tests in this module only ever write

// the SAME path, so concurrent setters write the same value — race-

// free in practice.

std::env::set_var(

"CONTINUUM_CHECKPOINT_DIR",

dir.path(),

);

TMPDIR.get_or_init(|| {

let dir =

tempfile::tempdir().expect("Failed to create test checkpoint tempdir");

// SAFETY: set_var is process-global and must not race with other

// writers. Performing it inside the OnceLock init closure ensures

// only a single thread sets the test override for the lifetime of

// this test process.

std::env::set_var(

"CONTINUUM_CHECKPOINT_DIR",

dir.path(),

);

dir

});

joelteply · 2026-04-25T19:45:57Z

CI gate status — needs your call

After my workflow_dispatch rerun, the docker pipeline ran end-to-end:

Job	Result
verify-architectures	✅
rebuild-stale-amd64	✅ (rebuilt 6 of 7 images at HEAD; cuda skipped — GHA runner has no GPU to build cuda)
rebuild-stale-arm64	✅
verify-after-rebuild	❌

The verify-after-rebuild fail is the tag-overwrite race + cuda-cant-build-on-CI combo I've hit before:

rebuild-stale-amd64 pushed amd64 to :a8c71d3bd and :pr-971. Concurrently rebuild-stale-arm64 pushed arm64 to the same :a8c71d3bd and :pr-971. The second push (arm64, ran after amd64) clobbered amd64. Now continuum-core:pr-971 and continuum-livekit-bridge:pr-971 are arm64-ONLY in the registry. The amd64 pulls happened, the bits exist as orphan manifest digests, but the tag points to arm64.
continuum-core-cuda:pr-971 is at the OLD revision (752d35019 — my pre-merge run) because GHA runners have no GPU and skip the cuda build. It needs a CUDA-capable host (BigMama).

Why the smart-staleness skip doesn't help here

verify-image-revisions.sh's image_relevant_paths for cuda includes src/workers. The diff between 752d35019 and a8c71d3bd is exactly src/workers/continuum-core/src/modules/sentinel/checkpoint.rs — a test-only cfg(test) change that has zero effect on the production binary. But the path-glob doesn't know about cfg(test), so it flags as relevant → no skip.

Two paths forward (your call)

A. Admin-merge. This PR's diff is exactly 3 lines of production code (the checkpoints_dir() env override) plus 39 lines of #[cfg(test)] test isolation. The production binary content of the docker image is unchanged from main's. The verify-after-rebuild fail is structural CI plumbing, not a real divergence between code and shipped image. If you've reviewed the diff, admin-merge is safe and correct.

B. I rebuild amd64 of core + livekit-bridge + cuda on BigMama (~30-45 min) and imagetools-merge to fix the gate. This costs the rebuild time but exercises the full path. Worth it only if you want the discipline of "every PR earns its green gate the proper way."

Either way, the underlying issues are real and worth tracking:

The tag-overwrite race during parallel rebuild jobs (separate, architectural — I'd file as a follow-up if it doesn't already have one).
image_relevant_paths is too coarse for test-only Rust changes (could add cfg(test)-aware diff filter, but that's its own non-trivial PR).

Awaiting direction. The diff itself is ready.

Copilot AI review requested due to automatic review settings April 25, 2026 18:18

Copilot started reviewing on behalf of joelteply April 25, 2026 18:18 View session

github-actions Bot added the size: S label Apr 25, 2026

Copilot AI reviewed Apr 25, 2026

View reviewed changes

joelteply mentioned this pull request Apr 25, 2026

ci(workflow): Verify Docker Images PR-trigger paths too narrow — non-Rust/non-docker PRs perpetually BLOCKED #974

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(sentinel/checkpoint): isolate via CONTINUUM_CHECKPOINT_DIR env override#971

test(sentinel/checkpoint): isolate via CONTINUUM_CHECKPOINT_DIR env override#971
joelteply wants to merge 1 commit intomainfrom
fix/checkpoint-test-isolation

joelteply commented Apr 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

joelteply commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if let Ok(override_dir) = std::env::var("CONTINUUM_CHECKPOINT_DIR") {
		return PathBuf::from(override_dir);

Conversation

joelteply commented Apr 25, 2026

Summary

Background — empirical hit

Fix structure

Validation

Why this matters beyond one machine

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

joelteply commented Apr 25, 2026

CI gate status — needs your call

Why the smart-staleness skip doesn't help here

Two paths forward (your call)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants