feat(compose): fail fast when the Docker daemon is unresponsive by jraicr · Pull Request #159 · sideralith/drydock

jraicr · 2026-06-06T00:17:28Z

What

ensure_prereqs only checked that the docker socket file exists — not that
the daemon answers. A degraded daemon (e.g. after host memory pressure) keeps
the socket present but stops replying, so a later drydock build/run hangs
indefinitely with no message. This adds a bounded liveness probe that fails fast
with a clear, backend-agnostic message.

Context

This surfaced live: after host memory pressure left Docker Desktop's daemon
unresponsive, drydock build printed "Building…" and then hung silently — the
socket existed, the daemon didn't answer. drydock should say so, not hang.

Change

_ensure_docker_responsive (lib/compose.sh), called at the end of
ensure_prereqs:

- probes docker version under timeout (default 12s, override with
  DRYDOCK_DOCKER_PROBE_TIMEOUT);
- on no reply, fails with guidance that names both Docker Desktop and
  native Docker Engine: when the daemon is unreachable the backend can't be
  detected, so the message must not presume one (per review feedback);
- is skipped when timeout is unavailable (e.g. stock macOS without
  coreutils) so the probe itself can never hang.
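
The behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the code merged in lib/compose.sh: the error wording, the restart hints, and the way the DOCKER seam is read are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded Docker liveness probe (not drydock's actual code).

# DOCKER is assumed to be the command seam; tests can point it at a stub executable.
DOCKER="${DOCKER:-docker}"

_ensure_docker_responsive() {
  local probe_timeout="${DRYDOCK_DOCKER_PROBE_TIMEOUT:-12}"

  # Without `timeout` the probe could itself hang, so skip it entirely.
  if ! command -v timeout >/dev/null 2>&1; then
    return 0
  fi

  # `docker version` has to reach the daemon; `timeout` kills it if no reply arrives in time.
  if timeout "$probe_timeout" "$DOCKER" version >/dev/null 2>&1; then
    return 0
  fi

  # The backend cannot be detected at this point, so the guidance names both (wording assumed).
  printf '%s\n' \
    "error: the Docker daemon is not responding (probe limit: ${probe_timeout}s)." \
    "  Docker Desktop: restart the application." \
    "  Docker Engine: restart the docker service." >&2
  return 1
}
```

Calling `DRYDOCK_DOCKER_PROBE_TIMEOUT=3 _ensure_docker_responsive || exit 1` before a build would then cap the wait at three seconds in this sketch.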

Tests

test/lib_compose.bats — 4 unit tests via the existing DOCKER seam: daemon
replies (returns 0), daemon errors (errs, names "not responding"), a hanging
daemon is bounded by the probe timeout, and the message is backend-agnostic
(asserts both Docker Desktop and Docker Engine appear).
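
As an illustration of the hanging-daemon case, the plain bash sketch below simulates a daemon that never answers and checks that the probe is cut off. It is not the bats test from test/lib_compose.bats; the stub layout, the 1-second limit, and the variable names are assumptions, and it presumes `timeout` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: simulate a hung daemon with a stub executable.
# `timeout` runs a program, not a shell function, so the stub must be a file on disk.

stub_dir="$(mktemp -d)"
cat >"$stub_dir/docker" <<'EOF'
#!/bin/sh
# Stand-in for a daemon that never answers.
exec sleep 30
EOF
chmod +x "$stub_dir/docker"

start=$SECONDS
probe_status=0
# GNU timeout exits with status 124 when it has to kill the command.
timeout 1 "$stub_dir/docker" version >/dev/null 2>&1 || probe_status=$?
elapsed=$((SECONDS - start))

rm -rf "$stub_dir"
echo "probe exit status: $probe_status, elapsed: ${elapsed}s"
```

The point of the check is the elapsed time: it should stay close to the probe limit rather than the 30 seconds the stub would otherwise sleep.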

Full suite 1165/1165 green; shellcheck + shfmt clean.

Cost

Adds one docker version (~sub-second when healthy) per ensure_prereqs. Cheap
relative to the build/run it guards, and it converts an infinite hang into an
immediate, actionable error.

ensure_prereqs only checked that the docker socket FILE exists, not that the daemon actually answers. A degraded daemon (e.g. after host memory pressure) keeps the socket present but stops replying, so a later `drydock build`/`run` hung indefinitely with no message: a silent failure against drydock's ethos.

Add _ensure_docker_responsive: probe `docker version` under a bounded `timeout` (default 12s, override via DRYDOCK_DOCKER_PROBE_TIMEOUT) and, on no reply, abort with backend-agnostic guidance naming BOTH Docker Desktop and native Docker Engine. The daemon is unreachable at that point, so the backend cannot be detected, hence no presumption. The probe is skipped when `timeout` is absent (e.g. stock macOS) so the check itself can never hang.

test/lib_compose.bats: 4 unit tests via the DOCKER seam (replies / errors / hangs-bounded-by-timeout / backend-agnostic message). Full suite 1165/1165 green; shellcheck + shfmt clean.

…soning, unanchored user ERE (#161)

* fix(egress): make generated allowlist filter world-readable (644)

The effective filter was written via mktemp (always 0600) and mv (mode preserved), then RO bind-mounted to /etc/tinyproxy/filter. tinyproxy in the sidecar runs as a non-root uid with cap_drop ALL (no DAC_OVERRIDE), so it could not read the filter: it logged 'filter file: Permission denied' and exited, the healthcheck never passed, and the agent's depends_on: service_healthy aborted every contained-mode run.

chmod 644 before the move in both generators: _generate_egress_filter (lib/compose.sh) and _write_filter (scripts/egress-smoke.sh). The allowlist is non-secret data; world-readable is correct.

Also move egress-smoke.sh's cleanup trap from top level into main(): a top-level 'trap _cleanup EXIT' fired on every sourcing shell's exit and ran 'docker rm -f' against the live daemon, which made the helpers untestable via the source-guard (the new test/egress_smoke.bats asserts sourcing is side-effect-free).

* fix(egress): smoke run removes only the networks it created

_check_preconditions pre-created the production fixed-name networks (drydock_internal / drydock_egress) via 'docker network create' when absent, and _cleanup deliberately never removed networks. A plain 'docker network create' attaches no compose labels, so a subsequent contained 'docker compose up' referencing the same fixed name fails fatally ('network ... has incorrect label com.docker.compose.network set to ""'). Since setup-gates.sh runs the G2 smoke before the G1 capture sessions, the gate host was poisoned for every later contained run until a manual 'docker network rm'.

Fix: track exactly the networks THIS run creates (SMOKE_CREATED_NETWORKS) and remove them in _cleanup, on success and on the failure paths via the EXIT trap. This restores the host to its prior state by construction. Pre-existing networks (e.g. owned by a live contained session) are never touched.

Creating the networks with hand-built compose labels was rejected: compose v2 validates the label set against its own project metadata, and an approximated label set that diverges from a future compose version silently reintroduces the poisoning. Removal is the only guarantee that does not depend on matching compose internals.

* fix(egress): anchor bare-hostname user allowlist entries as exact ERE

tinyproxy's FilterType ere matches UNANCHORED substrings, so a user allowlist line in the documented bare-hostname form ('echo example.com >> ~/.config/drydock/egress-allowlist') also matched example.com.evil.io and example1com.net, silently widening the deny-by-default filter. The shipped baseline was already anchored (^api\.anthropic\.com$); only user additions were exposed.

_generate_egress_filter now normalizes every line after comment-stripping and before dedup, uniformly across all three sources (baseline, global user file, per-project user file):

- a bare hostname (alphanumerics, dots, hyphens) is escaped and anchored exactly: example.com -> ^example\.com$
- any other line passes through verbatim as raw ERE (expert escape hatch; the baseline's anchored lines are untouched), so 'example.com' and '^example\.com$' dedup to one line

Same transform as _to_ere_baseline_pattern in scripts/egress-capture.sh, replicated because lib/ must not source scripts/. Documented in docs/troubleshooting.md and the baseline header.

Also closes a coverage gap: the per-project allowlist file (egress-allowlist-<project>) now has a test proving it feeds the generated filter.

* docs(changelog): daemon fail-fast (#159) and egress audit fixes (#149)

* docs(egress): state the exact bare-hostname anchoring rule
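
The bare-hostname anchoring rule from the commit above can be sketched as a per-line transform. The helper name `_anchor_allowlist_line` and the exact character class are assumptions made for illustration; the real logic lives in _generate_egress_filter and may differ in detail.

```shell
#!/usr/bin/env bash
# Hypothetical helper: normalize one allowlist line (name and regex are assumptions).
_anchor_allowlist_line() {
  local line="$1" escaped
  # Bare hostname: only alphanumerics, dots, and hyphens.
  if [[ "$line" =~ ^[A-Za-z0-9.-]+$ ]]; then
    # Escape each dot so it matches literally, then anchor both ends.
    escaped="$(printf '%s' "$line" | sed 's/\./\\./g')"
    printf '^%s$\n' "$escaped"
  else
    # Anything else is treated as raw ERE and passed through unchanged.
    printf '%s\n' "$line"
  fi
}
```

With this sketch, `_anchor_allowlist_line example.com` yields an anchored pattern that `grep -E` no longer matches against example.com.evil.io, while a line that is already anchored, such as the baseline's ^api\.anthropic\.com$, comes back untouched.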

jraicr added the type:feat (Feature work) and size:s (Small: under 100 lines) labels Jun 6, 2026

jraicr merged commit 01993c6 into dev Jun 6, 2026
4 checks passed

jraicr deleted the feat/docker-daemon-fail-fast branch June 6, 2026 00:42

This was referenced Jun 6, 2026

docs: egress jail as a security layer + daemon fail-fast troubleshooting #160

Merged

fix(egress): pre-gate criticals — filter mode 0600, smoke network poisoning, unanchored user ERE #161

Merged

jraicr merged 1 commit into dev from feat/docker-daemon-fail-fast
