Skip to content

fix(compose): audit batch 1 — GC data-loss, baked-policy modes, overlay truncation, atomic config refresh#162

Merged
jraicr merged 7 commits into
devfrom
fix/audit-batch1-dataloss
Jun 11, 2026
Merged

fix(compose): audit batch 1 — GC data-loss, baked-policy modes, overlay truncation, atomic config refresh#162
jraicr merged 7 commits into
devfrom
fix/audit-batch1-dataloss

Conversation

@jraicr

@jraicr jraicr commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Batch 1 (data-loss / security) of the full-codebase audit of the v0.3.3 feature set: six fixes, each empirically grounded (repros with isolated fake-HOME fixtures and stubbed docker), all present on dev as well as main. ~146 lines of product code; the bulk of the diff is the TDD test coverage (365 lines). size:exception agreed for keeping the audit batch as one reviewable unit — the six fixes ship as separate work-unit commits.

F1+F2 — session-dir GC: daemon failure no longer reaps live sessions; GC honors its never-fatal contract (09b774f)

gc_orphan_session_dirs captured docker ps -a with || true, making a daemon outage indistinguishable from "zero containers": every session dir — including dirs bind-mounted as ~/.claude inside RUNNING containers — looked orphaned and was rm -rfed (settings, .credentials.json, attach/resume mapping, across all projects). Now a docker ps failure warns and skips GC for that invocation. Additionally, the unguarded rm -rf/rm -f violated the function's documented "returns 0 always" contract: one unremovable orphan entry aborted every subsequent drydock run for every project; both are now guarded (|| warn / || true), consistent with the already-guarded sidecar reap.

F3 — pin baked config modes with COPY --chmod=644 (740d9fb)

COPY preserves build-context modes: a builder with umask 077 baked 600 root:root managed-settings drop-ins that the non-root container user cannot read — silently disabling the ENTIRE tier-1 guardrail layer (deny rules, SessionStart, PreToolUse), exactly the failure mode INV-3 exists to prevent. Both Dockerfiles now pin file modes via COPY --chmod=644; directory traversal perms are pinned by a separate RUN chmod 755 (BuildKit applies --chmod to created directories too, where 644 would break traversal). Static tests assert both.

F4 — compose_files: mandatory overlays first, fail-loud consumption (c3a3cd8)

compose_files was consumed via process substitution, which discards the producer's exit status: a fallible generator step dying mid-stream (submount overlay redirect on a full disk — threat model A's "runaway loop filling disk") left the caller with a truncated -f list, launching a session WITHOUT the hardening overlay and, worse, without the mode overlay — a "contained" session with open egress while the banner claims otherwise. Two defense layers: (a) emission reordered so base → mode overlay → hardening are emitted before any fallible generator — truncation can now only degrade features, never security posture (the reorder is merge-semantics-neutral: environment merges per variable name and volumes per target path, disjoint in every overlay combination); (b) a new compose_files_into <array> <dir> helper (temp-file capture + explicit rc check + err) replaces all six process-substitution consumer sites, so assembly failure aborts loudly.

F5 — refresh container claude.json atomically with an explicit jq rc check (99350a1)

jq 'del(...)' "$HOST_CLAUDE_JSON" >"$CONTAINER_CLAUDE_JSON" truncated the target before jq ran: a host ~/.claude.json caught mid-rewrite (a file Claude Code rewrites constantly) left a 0-byte container config — and in the auto-sync path (cmd_sync || warn, where bash suppresses errexit inside the function) execution fell through to ok "refreshed", stamped .drydock-last-sync, and printed "Sync done" over the corrupted artifact. New _refresh_container_claude_json: jq into a same-dir mktemp with an explicit if ! rc check (immune to the errexit suppression), failure → temp removed, warning, return 1, marker untouched, no success message; success → chmod 644 + atomic mv. Applied to both cmd_setup and cmd_sync.

F6 — pre-create unconditional ${HOME} bind-mount sources (caa3f99)

docker-compose.yml mounts ${HOME}/.gitconfig:ro and ${HOME}/.local unconditionally; on a fresh host dockerd auto-creates a missing bind source as a root-owned DIRECTORY — a root-owned ~/.gitconfig dir breaks host git config --global until sudo rm (the host-contamination class INV-1/#89 exists to prevent). ensure_runtime_dirs already guards ~/.config/gh and the prototype hooks/ for exactly this reason; it now also guards ~/.gitconfig (touch if absent), ~/.local, and the prototype projects/ dir (upgrade-path gap for prototypes predating the shared store). A sweep of all remaining ${HOME}-sourced mounts across base + overlays found every other source either feature-gated on existence or session-generated.

Invariant impact

  • INV-3: strengthened — F3 removes a silent-disable path for the baked policy layer.
  • INV-8 / INV-9: strengthened — F4 guarantees the hardening and mode overlays survive generator failure; no overlay content changed.
  • INV-1: strengthened — F6 closes a host-HOME contamination path; drydock still never mutates host-owned file content.
  • INV-2: untouched — F5 writes only the container-side config; the host file remains read-only input.

Verification

  • Full suite via scripts/test.sh: 1181/1181 pass (each fix implemented test-first; review nits added root-skip guards to the three chmod-based tests).
  • bash -n, shellcheck, shfmt -d, scripts/lint-commits.sh: clean.
  • Fresh-context adversarial review: READY-WITH-NITS — key traps verified (sed -i preserves COPY modes; compose merge semantics across overlay reorder; errexit propagation through the new helper; mapfile no-word-splitting). All nits addressed in fa64694.

Out of scope (noted for follow-up)

The same daemon-failure-equals-empty conflation survives in two sibling sites (seed_session_config_dir, the disc-collision retry loop) — backstopped by dir-existence checks and _ensure_docker_responsive, tracked for a follow-up issue rather than scope-creeping this batch.

jraicr added 6 commits June 11, 2026 07:10
…emovable entries

gc_orphan_session_dirs treated a failed 'docker ps -a' (empty output,
nonzero rc) the same as 'no containers': every live session dir looked
orphaned and the GC rm -rf'd the bind-mounted ~/.claude of RUNNING
sessions. The rc is now checked explicitly — on daemon failure the pass
is skipped with a warning and retried on the next invocation. Empty
SUCCESS output still reaps orphans as before.

The reap loop also violated its 'Returns 0 always' contract under
set -e: one unremovable entry (e.g. a root-owned file) aborted GC and
therefore every 'drydock run' for every project. rm -rf is now guarded
with a warning and the sweep continues to the remaining dirs.
COPY preserves build-context file modes: a builder with umask 077 bakes
600 root-owned files. In the agent image that makes every JSON under
/etc/claude-code/managed-settings.d/ unreadable by the non-root user —
the entire managed-settings guardrail layer (INV-3) silently fails to
load. In the sidecar image the tinyproxy.conf readability depended
solely on the later chown -R step.

Both COPY lines now carry --chmod=644. Directory traversal for the
managed-settings path is pinned with an explicit RUN chmod 755 (BuildKit
applies --chmod to copied directories too, so 644 must never be the
directory mode). Static bats assertions guard both lines.
…truncation

compose_files was consumed at six sites as
'while read ... < <(compose_files ...)' — process substitution discards
the producer's exit status. A fallible generator dying mid-stream
(submount/links overlay write on a full disk or unwritable TMPDIR) left
the caller with a TRUNCATED -f list: the session could launch with no
hardening overlay and no mode overlay — a "contained" session with open
egress and no cap_drop.

Two layers:
- compose_files now emits the mandatory, infallible args FIRST (base,
  mode overlay, hardening) before the fallible generators, and
  propagates generator failure explicitly. Truncation can only degrade
  features, never the security posture. Compose merge is per-key
  last-wins and the feature overlays share no keys with mode/hardening,
  so the effective config is unchanged.
- compose_files_into replaces the consumer idiom at all six call sites:
  capture to a temp file, check the producer's rc, err out on failure
  instead of launching with a partial list.
… jq rc check

cmd_setup and cmd_sync filtered the host ~/.claude.json with
'jq del(...) >"$CONTAINER_CLAUDE_JSON"' — the redirect truncates the
target BEFORE jq parses the source, so a host config caught mid-rewrite
(invalid JSON) left a 0-byte container config. Worse, the auto-sync path
calls 'cmd_sync || warn': bash suppresses set -e inside the function in
that context, so execution fell through to the freshness-marker touch
and a lying "Sync done".

Both call sites now use _refresh_container_claude_json: jq writes a
same-directory temp file, the rc is checked explicitly (if !), failure
removes the temp + warns that the container config was NOT refreshed +
returns 1 — no marker touch, no success message — and success moves the
temp into place atomically.
docker-compose.yml mounts ${HOME}/.gitconfig (:ro) and ${HOME}/.local
(:rw) with no host-side existence guard. On a fresh machine the Docker
daemon auto-creates a missing source as a root-owned DIRECTORY —
~/.gitconfig-as-a-dir breaks host git entirely until a manual sudo rm.
Same class as #71 (hooks subdir) and #109 (~/.config/gh), which
ensure_runtime_dirs already pre-creates.

ensure_runtime_dirs now touches ~/.gitconfig only when absent (never
rewriting an existing host-owned file — INV-1 non-contamination class),
creates ~/.local, and creates the prototype's projects/ shared-store
sub-mount source for prototypes that predate the shared store.

Mount-source sweep across all compose files found no other unguarded
unconditional sources: every overlay mount is either generated per
session, dir-existence-gated at activation, :?-guarded, or validated by
ensure_prereqs. CHANGELOG records all six audit batch-1 fixes.
@jraicr jraicr added type:fix Bug fix size:exception Over 400 lines, pre-approved by maintainer labels Jun 11, 2026
@jraicr jraicr merged commit f5200cf into dev Jun 11, 2026
4 checks passed
@jraicr jraicr deleted the fix/audit-batch1-dataloss branch June 11, 2026 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:exception Over 400 lines, pre-approved by maintainer type:fix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant