fix(compose): audit batch 1 — GC data-loss, baked-policy modes, overlay truncation, atomic config refresh#162
Merged
Merged
Conversation
…emovable entries gc_orphan_session_dirs treated a failed 'docker ps -a' (empty output, nonzero rc) the same as 'no containers': every live session dir looked orphaned and the GC rm -rf'd the bind-mounted ~/.claude of RUNNING sessions. The rc is now checked explicitly — on daemon failure the pass is skipped with a warning and retried on the next invocation. Empty SUCCESS output still reaps orphans as before. The reap loop also violated its 'Returns 0 always' contract under set -e: one unremovable entry (e.g. a root-owned file) aborted GC and therefore every 'drydock run' for every project. rm -rf is now guarded with a warning and the sweep continues to the remaining dirs.
COPY preserves build-context file modes: a builder with umask 077 bakes 600 root-owned files. In the agent image that makes every JSON under /etc/claude-code/managed-settings.d/ unreadable by the non-root user — the entire managed-settings guardrail layer (INV-3) silently fails to load. In the sidecar image the tinyproxy.conf readability depended solely on the later chown -R step. Both COPY lines now carry --chmod=644. Directory traversal for the managed-settings path is pinned with an explicit RUN chmod 755 (BuildKit applies --chmod to copied directories too, so 644 must never be the directory mode). Static bats assertions guard both lines.
…truncation compose_files was consumed at six sites as 'while read ... < <(compose_files ...)' — process substitution discards the producer's exit status. A fallible generator dying mid-stream (submount/links overlay write on a full disk or unwritable TMPDIR) left the caller with a TRUNCATED -f list: the session could launch with no hardening overlay and no mode overlay — a "contained" session with open egress and no cap_drop. Two layers: - compose_files now emits the mandatory, infallible args FIRST (base, mode overlay, hardening) before the fallible generators, and propagates generator failure explicitly. Truncation can only degrade features, never the security posture. Compose merge is per-key last-wins and the feature overlays share no keys with mode/hardening, so the effective config is unchanged. - compose_files_into replaces the consumer idiom at all six call sites: capture to a temp file, check the producer's rc, err out on failure instead of launching with a partial list.
… jq rc check cmd_setup and cmd_sync filtered the host ~/.claude.json with 'jq del(...) >"$CONTAINER_CLAUDE_JSON"' — the redirect truncates the target BEFORE jq parses the source, so a host config caught mid-rewrite (invalid JSON) left a 0-byte container config. Worse, the auto-sync path calls 'cmd_sync || warn': bash suppresses set -e inside the function in that context, so execution fell through to the freshness-marker touch and a lying "Sync done". Both call sites now use _refresh_container_claude_json: jq writes a same-directory temp file, the rc is checked explicitly (if !), failure removes the temp + warns that the container config was NOT refreshed + returns 1 — no marker touch, no success message — and success moves the temp into place atomically.
docker-compose.yml mounts ${HOME}/.gitconfig (:ro) and ${HOME}/.local
(:rw) with no host-side existence guard. On a fresh machine the Docker
daemon auto-creates a missing source as a root-owned DIRECTORY —
~/.gitconfig-as-a-dir breaks host git entirely until a manual sudo rm.
Same class as #71 (hooks subdir) and #109 (~/.config/gh), which
ensure_runtime_dirs already pre-creates.
ensure_runtime_dirs now touches ~/.gitconfig only when absent (never
rewriting an existing host-owned file — INV-1 non-contamination class),
creates ~/.local, and creates the prototype's projects/ shared-store
sub-mount source for prototypes that predate the shared store.
Mount-source sweep across all compose files found no other unguarded
unconditional sources: every overlay mount is either generated per
session, dir-existence-gated at activation, :?-guarded, or validated by
ensure_prereqs. CHANGELOG records all six audit batch-1 fixes.
# Conflicts: # lib/compose.sh
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Batch 1 (data-loss / security) of the full-codebase audit of the v0.3.3 feature set: six fixes, each empirically grounded (repros with isolated fake-HOME fixtures and stubbed docker), all present on
devas well asmain. ~146 lines of product code; the bulk of the diff is the TDD test coverage (365 lines).size:exceptionagreed for keeping the audit batch as one reviewable unit — the six fixes ship as separate work-unit commits.F1+F2 — session-dir GC: daemon failure no longer reaps live sessions; GC honors its never-fatal contract (
09b774f)gc_orphan_session_dirscaptureddocker ps -awith|| true, making a daemon outage indistinguishable from "zero containers": every session dir — including dirs bind-mounted as~/.claudeinside RUNNING containers — looked orphaned and wasrm -rfed (settings,.credentials.json, attach/resume mapping, across all projects). Now adocker psfailure warns and skips GC for that invocation. Additionally, the unguardedrm -rf/rm -fviolated the function's documented "returns 0 always" contract: one unremovable orphan entry aborted every subsequentdrydock runfor every project; both are now guarded (|| warn/|| true), consistent with the already-guarded sidecar reap.F3 — pin baked config modes with
COPY --chmod=644(740d9fb)COPYpreserves build-context modes: a builder withumask 077baked600 root:rootmanaged-settings drop-ins that the non-root container user cannot read — silently disabling the ENTIRE tier-1 guardrail layer (deny rules, SessionStart, PreToolUse), exactly the failure mode INV-3 exists to prevent. Both Dockerfiles now pin file modes viaCOPY --chmod=644; directory traversal perms are pinned by a separateRUN chmod 755(BuildKit applies--chmodto created directories too, where 644 would break traversal). Static tests assert both.F4 — compose_files: mandatory overlays first, fail-loud consumption (
c3a3cd8)compose_fileswas consumed via process substitution, which discards the producer's exit status: a fallible generator step dying mid-stream (submount overlay redirect on a full disk — threat model A's "runaway loop filling disk") left the caller with a truncated-flist, launching a session WITHOUT the hardening overlay and, worse, without the mode overlay — a "contained" session with open egress while the banner claims otherwise. Two defense layers: (a) emission reordered so base → mode overlay → hardening are emitted before any fallible generator — truncation can now only degrade features, never security posture (the reorder is merge-semantics-neutral: environment merges per variable name and volumes per target path, disjoint in every overlay combination); (b) a newcompose_files_into <array> <dir>helper (temp-file capture + explicit rc check +err) replaces all six process-substitution consumer sites, so assembly failure aborts loudly.F5 — refresh container
claude.jsonatomically with an explicit jq rc check (99350a1)jq 'del(...)' "$HOST_CLAUDE_JSON" >"$CONTAINER_CLAUDE_JSON"truncated the target before jq ran: a host~/.claude.jsoncaught mid-rewrite (a file Claude Code rewrites constantly) left a 0-byte container config — and in the auto-sync path (cmd_sync || warn, where bash suppresses errexit inside the function) execution fell through took "refreshed", stamped.drydock-last-sync, and printed "Sync done" over the corrupted artifact. New_refresh_container_claude_json: jq into a same-dir mktemp with an explicitif !rc check (immune to the errexit suppression), failure → temp removed, warning,return 1, marker untouched, no success message; success →chmod 644+ atomicmv. Applied to bothcmd_setupandcmd_sync.F6 — pre-create unconditional
${HOME}bind-mount sources (caa3f99)docker-compose.ymlmounts${HOME}/.gitconfig:roand${HOME}/.localunconditionally; on a fresh host dockerd auto-creates a missing bind source as a root-owned DIRECTORY — a root-owned~/.gitconfigdir breaks hostgit config --globaluntilsudo rm(the host-contamination class INV-1/#89 exists to prevent).ensure_runtime_dirsalready guards~/.config/ghand the prototypehooks/for exactly this reason; it now also guards~/.gitconfig(touch if absent),~/.local, and the prototypeprojects/dir (upgrade-path gap for prototypes predating the shared store). A sweep of all remaining${HOME}-sourced mounts across base + overlays found every other source either feature-gated on existence or session-generated.Invariant impact
Verification
scripts/test.sh: 1181/1181 pass (each fix implemented test-first; review nits added root-skip guards to the three chmod-based tests).bash -n,shellcheck,shfmt -d,scripts/lint-commits.sh: clean.fa64694.Out of scope (noted for follow-up)
The same daemon-failure-equals-empty conflation survives in two sibling sites (
seed_session_config_dir, the disc-collision retry loop) — backstopped by dir-existence checks and_ensure_docker_responsive, tracked for a follow-up issue rather than scope-creeping this batch.