feat(compose): fail fast when the Docker daemon is unresponsive #159
Merged
ensure_prereqs only checked that the docker socket file exists, not that the daemon actually answers. A degraded daemon (e.g. after host memory pressure) keeps the socket present but stops replying, so a later `drydock build`/`run` hung indefinitely with no message: a silent failure, against drydock's ethos.

Add _ensure_docker_responsive: probe `docker version` under a bounded `timeout` (default 12s, override via DRYDOCK_DOCKER_PROBE_TIMEOUT) and, on no reply, abort with backend-agnostic guidance naming both Docker Desktop and native Docker Engine. The daemon is unreachable at that point, so the backend cannot be detected and none is presumed. The probe is skipped when `timeout` is absent (e.g. stock macOS) so the check itself can never hang.

test/lib_compose.bats: 4 unit tests via the DOCKER seam (replies / errors / hangs-bounded-by-timeout / backend-agnostic message). Full suite 1165/1165 green; shellcheck + shfmt clean.
This was referenced Jun 6, 2026
jraicr added a commit that referenced this pull request Jun 11, 2026
…soning, unanchored user ERE (#161)

* fix(egress): make generated allowlist filter world-readable (644)

  The effective filter was written via mktemp (always 0600) and mv (mode preserved), then RO bind-mounted to /etc/tinyproxy/filter. tinyproxy in the sidecar runs as a non-root uid with cap_drop ALL (no DAC_OVERRIDE), so it could not read the filter: it logged 'filter file: Permission denied' and exited, the healthcheck never passed, and the agent's depends_on: service_healthy aborted every contained-mode run.

  chmod 644 before the move in both generators: _generate_egress_filter (lib/compose.sh) and _write_filter (scripts/egress-smoke.sh). The allowlist is non-secret data; world-readable is correct.

  Also move egress-smoke.sh's cleanup trap from top level into main(): a top-level 'trap _cleanup EXIT' fired on every sourcing shell's exit and ran 'docker rm -f' against the live daemon, which made the helpers untestable via the source-guard (the new test/egress_smoke.bats asserts sourcing is side-effect-free).

* fix(egress): smoke run removes only the networks it created

  _check_preconditions pre-created the production fixed-name networks (drydock_internal / drydock_egress) via 'docker network create' when absent, and _cleanup deliberately never removed networks. A plain 'docker network create' attaches no compose labels, so a subsequent contained 'docker compose up' referencing the same fixed name fails fatally ('network ... has incorrect label com.docker.compose.network set to ""'). Since setup-gates.sh runs the G2 smoke before the G1 capture sessions, the gate host was poisoned for every later contained run until a manual 'docker network rm'.

  Fix: track exactly the networks this run creates (SMOKE_CREATED_NETWORKS) and remove them in _cleanup, both on success and on the failure paths via the EXIT trap. This restores the host to its prior state by construction. Pre-existing networks (e.g. owned by a live contained session) are never touched.

  Creating the networks with hand-built compose labels was rejected: compose v2 validates the label set against its own project metadata, and an approximated label set that diverges from a future compose version silently reintroduces the poisoning. Removal is the only guarantee that does not depend on matching compose internals.

* fix(egress): anchor bare-hostname user allowlist entries as exact ERE

  tinyproxy's FilterType ere matches unanchored substrings, so a user allowlist line in the documented bare-hostname form ('echo example.com >> ~/.config/drydock/egress-allowlist') also matched example.com.evil.io and example1com.net, silently widening the deny-by-default filter. The shipped baseline was already anchored (^api\.anthropic\.com$); only user additions were exposed.

  _generate_egress_filter now normalizes every line after comment-stripping and before dedup, uniformly across all three sources (baseline, global user file, per-project user file):

  - a bare hostname (alphanumerics, dots, hyphens) is escaped and anchored exactly: example.com -> ^example\.com$
  - any other line passes through verbatim as raw ERE (expert escape hatch; the baseline's anchored lines are untouched)

  Consequently, 'example.com' and '^example\.com$' dedup to one line.

  Same transform as _to_ere_baseline_pattern in scripts/egress-capture.sh, replicated because lib/ must not source scripts/. Documented in docs/troubleshooting.md and the baseline header.

  Also closes a coverage gap: the per-project allowlist file (egress-allowlist-<project>) now has a test proving it feeds the generated filter.

* docs(changelog): daemon fail-fast (#159) and egress audit fixes (#149)

* docs(egress): state the exact bare-hostname anchoring rule
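The bare-hostname rule described in that commit can be illustrated with a small shell sketch. This is an illustration only, not the project's `_generate_egress_filter` or `_to_ere_baseline_pattern`: the function name `normalize_allowlist_line` is hypothetical, and the sketch assumes a "bare hostname" means a non-empty line made up solely of alphanumerics, dots, and hyphens, as the commit message states.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the bare-hostname normalization described above.
# The function name is an illustration, not taken from the drydock sources.

normalize_allowlist_line() {
  line="$1"
  case "$line" in
    '' | *[!A-Za-z0-9.-]*)
      # Empty, or contains a character outside [A-Za-z0-9.-]:
      # treat as raw ERE and pass it through verbatim.
      printf '%s\n' "$line"
      ;;
    *)
      # Bare hostname: escape every dot, then anchor both ends
      # so the pattern can only match the exact host.
      escaped="$(printf '%s' "$line" | sed 's/\./\\./g')"
      printf '^%s$\n' "$escaped"
      ;;
  esac
}
```

With this sketch, `normalize_allowlist_line example.com` prints `^example\.com$`, while an already anchored line such as `^api\.anthropic\.com$` is printed unchanged, so piping the normalized lines through `sort -u` would collapse the two spellings of the same host into one entry.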
What
`ensure_prereqs` only checked that the docker socket file exists, not that
the daemon answers. A degraded daemon (e.g. after host memory pressure) keeps
the socket present but stops replying, so a later `drydock build`/`run` hangs
indefinitely with no message. This adds a bounded liveness probe that fails fast
with a clear, backend-agnostic message.
Context
This surfaced live: after host memory pressure left Docker Desktop's daemon
unresponsive, `drydock build` printed "Building…" and then hung silently (the
socket existed, but the daemon didn't answer). drydock should say so, not hang.
Change
`_ensure_docker_responsive` (lib/compose.sh), called at the end of `ensure_prereqs`:
- probes `docker version` under `timeout` (default 12s, override with `DRYDOCK_DOCKER_PROBE_TIMEOUT`);
- on no reply, errs with guidance that names both Docker Desktop and native Docker Engine: when the daemon is unreachable the backend can't be detected, so it must not presume one (per review feedback);
- is skipped when `timeout` is unavailable (e.g. stock macOS without `coreutils`) so the probe itself can never hang.
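A minimal sketch of such a bounded probe follows, under stated assumptions. It is not the code merged in lib/compose.sh: the function name `probe_docker_responsive` and the wording of the error message are hypothetical, and only the `DOCKER` seam, the 12s default, and the `DRYDOCK_DOCKER_PROBE_TIMEOUT` override come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded daemon liveness probe; not drydock's code.

# DOCKER is the command seam: tests can point it at a stub executable.
DOCKER="${DOCKER:-docker}"

probe_docker_responsive() {
  # Probe budget in seconds; 12 is the default named in the PR.
  limit="${DRYDOCK_DOCKER_PROBE_TIMEOUT:-12}"

  # Without `timeout` the probe could itself hang, so skip it entirely.
  if ! command -v timeout >/dev/null 2>&1; then
    return 0
  fi

  # `docker version` has to contact the daemon; `timeout` kills the
  # command and returns non-zero if no answer arrives within the budget.
  if timeout "$limit" "$DOCKER" version >/dev/null 2>&1; then
    return 0
  fi

  # Illustrative wording only: the backend is unknown here, so name both.
  printf '%s\n' \
    "error: the Docker daemon is not responding." \
    "Restart Docker Desktop or the native Docker Engine service, then retry." >&2
  return 1
}
```

Because the probe wraps the whole `docker version` call in `timeout`, a daemon that never replies costs at most the configured budget instead of blocking the build or run that follows.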
Tests
`test/lib_compose.bats` has 4 unit tests via the existing `DOCKER` seam: daemon
replies (returns 0), daemon errors (errs, names "not responding"), a hanging
daemon is bounded by the probe timeout, and the message is backend-agnostic
(asserts both Docker Desktop and Docker Engine appear).
Full suite 1165/1165 green; shellcheck + shfmt clean.
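The seam-based approach can be sketched in plain shell. The real tests are written in bats and are not reproduced here; in this sketch `probe`, `make_stub`, and `run_seam_checks` are hypothetical names, and the stand-in `probe` only assumes the behaviour described in this PR (run `"$DOCKER" version` under `timeout`, fail with a message naming both backends).

```shell
#!/usr/bin/env bash
# Plain-shell sketch of seam-based checks; the real suite uses bats.
# Every function name below is illustrative, not taken from drydock.
set -u

# Minimal stand-in for the function under test (assumed shape).
probe() {
  command -v timeout >/dev/null 2>&1 || return 0
  timeout "${DRYDOCK_DOCKER_PROBE_TIMEOUT:-12}" "$DOCKER" version \
    >/dev/null 2>&1 && return 0
  echo "Docker daemon not responding: check Docker Desktop or Docker Engine" >&2
  return 1
}

# Write an executable stub that stands in for the docker CLI.
make_stub() { # usage: make_stub <path> <shell body>
  printf '#!/bin/sh\n%s\n' "$2" >"$1"
  chmod +x "$1"
}

run_seam_checks() {
  local dir msg start end failed=0 DOCKER DRYDOCK_DOCKER_PROBE_TIMEOUT=1
  if ! command -v timeout >/dev/null 2>&1; then
    echo "skip: timeout not installed, so the probe is a no-op"
    return 0
  fi
  dir="$(mktemp -d)"
  make_stub "$dir/replies" 'exit 0'
  make_stub "$dir/errors" 'exit 1'
  make_stub "$dir/hangs" 'exec sleep 30'

  # 1. Daemon replies: the probe succeeds.
  DOCKER="$dir/replies"
  probe || { echo "FAIL: replies"; failed=1; }

  # 2. Daemon errors: the probe fails and the message names both backends.
  DOCKER="$dir/errors"
  if msg="$(probe 2>&1)"; then echo "FAIL: errors"; failed=1; fi
  case "$msg" in
    *"Docker Desktop"*"Docker Engine"*) ;;
    *) echo "FAIL: message is not backend-agnostic"; failed=1 ;;
  esac

  # 3. Daemon hangs: the probe fails well before the stub's 30s sleep ends.
  DOCKER="$dir/hangs"
  start="$(date +%s)"
  if probe 2>/dev/null; then echo "FAIL: hang not detected"; failed=1; fi
  end="$(date +%s)"
  [ $((end - start)) -lt 10 ] || { echo "FAIL: probe not bounded"; failed=1; }

  rm -rf "$dir"
  return "$failed"
}
```

Calling `run_seam_checks` returns 0 when every check passes. Because `timeout` executes a program rather than a shell function, the stubs are small executable scripts on disk, which is one reason a path-valued seam like `DOCKER` is convenient for this kind of test.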
Cost
Adds one `docker version` call (sub-second when healthy) per `ensure_prereqs`. Cheap
relative to the build/run it guards, and it converts an infinite hang into an
actionable error that arrives within the probe timeout.