test(sentinel/checkpoint): isolate via CONTINUUM_CHECKPOINT_DIR env override#971
test(sentinel/checkpoint): isolate via CONTINUUM_CHECKPOINT_DIR env override#971
Conversation
…verride Empirical hit on PR #950 prepush: cargo test --lib panicked in 3 checkpoint tests with "Failed to write checkpoint tmp file: Permission denied (os error 13)". Root cause: ~/.continuum/sentinel/checkpoints/ was owned by root:root with mode 755 on this dev machine — left over from a prior docker container that mounted $HOME and chmod'd the dir under root. Tests wrote to that production path, no isolation. Two-part fix in one commit, both in checkpoint.rs: 1. checkpoints_dir() now honors CONTINUUM_CHECKPOINT_DIR env override. Production code still falls through to ~/.continuum/sentinel/ checkpoints when the var is unset (no behavior change in prod). 2. Tests share a process-global tempfile::TempDir via OnceLock, exposed through ensure_test_checkpoint_dir() called at the top of each test. Cargo runs tests in parallel within one process and set_var is process-global — a per-test TempDir would race. Shared dir is fine because tests use UUID-derived handles; collisions structurally don't happen. Validated locally: cargo test --lib --features cuda,load-dynamic-ort → 1829 passed; 0 failed; 28 ignored (was: 1826 passed; 3 failed in checkpoint tests on this machine.) Why this matters beyond the local fix: the checkpoint module's hard- coded ~/.continuum dependency is a latent foot-gun for ANY contributor whose dev box has had a root-mounted docker run over $HOME — a common state given how often we run docker compose with bind-mounts. Env override + tempdir isolation makes the test suite portable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adjusts sentinel checkpoint tests to avoid writing into the user’s real ~/.continuum/sentinel/checkpoints by introducing an env-var override and having tests redirect storage to a shared temp directory.
Changes:
- Added
CONTINUUM_CHECKPOINT_DIRoverride support tocheckpoints_dir(). - Introduced a process-global
TempDirfor checkpoint tests and set the override in each test.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if let Ok(override_dir) = std::env::var("CONTINUUM_CHECKPOINT_DIR") { | ||
| return PathBuf::from(override_dir); |
There was a problem hiding this comment.
checkpoints_dir() uses std::env::var, which requires UTF-8 and will also accept an empty string (mapping to PathBuf::from(""), i.e., the current working directory). Since this env var is intended to be a filesystem path override, consider using std::env::var_os and ignoring empty values so an accidentally-set empty/invalid override doesn’t redirect checkpoint writes into the process CWD.
| if let Ok(override_dir) = std::env::var("CONTINUUM_CHECKPOINT_DIR") { | |
| return PathBuf::from(override_dir); | |
| if let Some(override_dir) = std::env::var_os("CONTINUUM_CHECKPOINT_DIR") { | |
| if !override_dir.is_empty() { | |
| return PathBuf::from(override_dir); | |
| } |
| let dir = TMPDIR.get_or_init(|| { | ||
| tempfile::tempdir().expect("Failed to create test checkpoint tempdir") | ||
| }); | ||
| // SAFETY: set_var is unsafe in newer Rust (process-global, racy with | ||
| // other threads reading env). Tests in this module only ever write | ||
| // the SAME path, so concurrent setters write the same value — race- | ||
| // free in practice. | ||
| std::env::set_var( | ||
| "CONTINUUM_CHECKPOINT_DIR", | ||
| dir.path(), | ||
| ); |
There was a problem hiding this comment.
ensure_test_checkpoint_dir() calls std::env::set_var on every test invocation. With cargo’s parallel test execution, multiple tests can call set_var concurrently, which is exactly the racy/unsound pattern the comment warns about (even if all writers set the same value). To avoid this, set the env var only once (e.g., inside the OnceLock init path or via an additional Once guard) so only a single thread ever performs the set_var call.
| let dir = TMPDIR.get_or_init(|| { | |
| tempfile::tempdir().expect("Failed to create test checkpoint tempdir") | |
| }); | |
| // SAFETY: set_var is unsafe in newer Rust (process-global, racy with | |
| // other threads reading env). Tests in this module only ever write | |
| // the SAME path, so concurrent setters write the same value — race- | |
| // free in practice. | |
| std::env::set_var( | |
| "CONTINUUM_CHECKPOINT_DIR", | |
| dir.path(), | |
| ); | |
| TMPDIR.get_or_init(|| { | |
| let dir = | |
| tempfile::tempdir().expect("Failed to create test checkpoint tempdir"); | |
| // SAFETY: set_var is process-global and must not race with other | |
| // writers. Performing it inside the OnceLock init closure ensures | |
| // only a single thread sets the test override for the lifetime of | |
| // this test process. | |
| std::env::set_var( | |
| "CONTINUUM_CHECKPOINT_DIR", | |
| dir.path(), | |
| ); | |
| dir | |
| }); |
CI gate status — needs your callAfter my workflow_dispatch rerun, the docker pipeline ran end-to-end:
The verify-after-rebuild fail is the tag-overwrite race + cuda-cant-build-on-CI combo I've hit before:
Why the smart-staleness skip doesn't help here
Two paths forward (your call)A. Admin-merge. This PR's diff is exactly 3 lines of production code (the B. I rebuild amd64 of core + livekit-bridge + cuda on BigMama (~30-45 min) and imagetools-merge to fix the gate. This costs the rebuild time but exercises the full path. Worth it only if you want the discipline of "every PR earns its green gate the proper way." Either way, the underlying issues are real and worth tracking:
Awaiting direction. The diff itself is ready. |
Summary
~/.continuum/sentinel/checkpoints/— they redirect to a process-sharedtempfile::TempDirvia the newCONTINUUM_CHECKPOINT_DIRenv override.checkpoints_dir()resolves to the same default~/.continuum/sentinel/checkpointspath as before.~/.continuum/sentinel/directory is root-owned (left over from any docker compose run that bind-mounted$HOMEand chmod'd dirs under root),cargo test --libpanics with "Permission denied (os error 13)" in 3 checkpoint tests and blocks the prepush gate.Background — empirical hit
Hit on PR #950 prepush from
bigmama-wsl2026-04-25:ls -la ~/.continuum/sentinel/checkpoints/showeddrwxr-xr-x 2 root root— created by a prior root-uid docker process that wrote into the bind-mounted home. Tests had no isolation: they wrote to the production path.Fix structure
Two-part change in
src/workers/continuum-core/src/modules/sentinel/checkpoint.rs:checkpoints_dir()env override — checksCONTINUUM_CHECKPOINT_DIRfirst, falls through to the existing home-derived default. Production callers see no behavior change unless they explicitly set the var.Test-mode tempdir isolation — new
ensure_test_checkpoint_dir()helper, called at the top of each test. It lazy-initializes a process-globaltempfile::TempDirviaOnceLockand sets the env var to its path. Cargo runs tests in parallel within one process, andset_varis process-global, so a per-test TempDir would race; sharing one is fine because tests already use UUID-derived handles that structurally don't collide.Validation
Why this matters beyond one machine
The checkpoint module's hard-coded
~/.continuumdependency is a latent foot-gun for any contributor whose dev box has had a root-mounted docker run over$HOME— a common state given how often we rundocker composewith bind-mounts. Env override + tempdir isolation makes the test suite portable across dev environments without per-machine sudo chown gymnastics.Test plan
cargo test --lib --features cuda,load-dynamic-ort— 1829 pass, 0 fail (was 1826/3 fail on bigmama-wsl).checkpoints_dir()returns~/.continuum/sentinel/checkpointsexactly as before.~/.continuum/sentinel/to confirm the bug is gone there too. (Will validate post-merge on bigmama-wsl with the offending state intact.)