feat(compose): fail fast when the Docker daemon is unresponsive #159

Merged
jraicr merged 1 commit into dev from feat/docker-daemon-fail-fast on Jun 6, 2026
jraicr commented Jun 6, 2026

What

ensure_prereqs only checked that the docker socket file exists — not that
the daemon answers. A degraded daemon (e.g. after host memory pressure) keeps
the socket present but stops replying, so a later drydock build/run hangs
indefinitely with no message. This adds a bounded liveness probe that fails fast
with a clear, backend-agnostic message.

Context

This surfaced live: after host memory pressure left Docker Desktop's daemon
unresponsive, drydock build printed "Building…" and then hung silently — the
socket existed, the daemon didn't answer. drydock should say so, not hang.

Change

_ensure_docker_responsive (lib/compose.sh), called at the end of
ensure_prereqs:

  • probes docker version under timeout (default 12s, override with
    DRYDOCK_DOCKER_PROBE_TIMEOUT);
  • on no reply, errs with guidance that names both Docker Desktop and
    native Docker Engine — when the daemon is unreachable the backend can't be
    detected, so it must not presume one (per review feedback);
  • is skipped when timeout is unavailable (e.g. stock macOS without
    coreutils) so the probe itself can never hang.
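
The behaviour listed above can be sketched roughly as follows. This is not the code from lib/compose.sh: the function name probe_docker_responsive, the exact message wording, and the assumption that the DOCKER seam is an environment variable naming the docker binary are all illustrative.

```shell
#!/usr/bin/env bash
# Sketch only; the real _ensure_docker_responsive may differ in detail.
# Assumption: DOCKER names the docker binary (defaults to "docker").

probe_docker_responsive() {
  local docker_bin="${DOCKER:-docker}"
  local limit="${DRYDOCK_DOCKER_PROBE_TIMEOUT:-12}"

  # Without timeout(1) the probe could hang itself, so skip it entirely.
  if ! command -v timeout >/dev/null 2>&1; then
    return 0
  fi

  # timeout(1) exits non-zero when the limit elapses, and docker exits
  # non-zero on error; either way the daemon gave no usable reply.
  if ! timeout "$limit" "$docker_bin" version >/dev/null 2>&1; then
    # Hypothetical wording: names both backends, presumes neither.
    printf '%s\n' \
      "error: the Docker daemon is not responding (waited ${limit}s)." \
      "Restart Docker Desktop, or the native Docker Engine service, then retry." >&2
    return 1
  fi
}
```

Because the binary is looked up through a variable, a test can point it at a stub script that sleeps and check that the call returns within the timeout.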

Tests

test/lib_compose.bats — 4 unit tests via the existing DOCKER seam: daemon
replies (returns 0), daemon errors (errs, names "not responding"), a hanging
daemon is bounded by the probe timeout, and the message is backend-agnostic
(asserts both Docker Desktop and Docker Engine appear).

Full suite 1165/1165 green; shellcheck + shfmt clean.

Cost

Adds one docker version call (sub-second when healthy) per ensure_prereqs. Cheap
relative to the build/run it guards, and it converts an infinite hang into an
immediate, actionable error.

ensure_prereqs only checked that the docker socket FILE exists, not that the
daemon actually answers. A degraded daemon (e.g. after host memory pressure)
keeps the socket present but stops replying, so a later `drydock build`/`run`
hung indefinitely with no message — a silent failure against drydock's ethos.

Add _ensure_docker_responsive: probe `docker version` under a bounded `timeout`
(default 12s, override via DRYDOCK_DOCKER_PROBE_TIMEOUT) and, on no reply, abort
with backend-agnostic guidance naming BOTH Docker Desktop and native Docker
Engine — the daemon is unreachable at that point, so the backend cannot be
detected, hence no presumption. The probe is skipped when `timeout` is absent
(e.g. stock macOS) so the check itself can never hang.

test/lib_compose.bats: 4 unit tests via the DOCKER seam (replies / errors /
hangs-bounded-by-timeout / backend-agnostic message). Full suite 1165/1165
green; shellcheck + shfmt clean.
jraicr added the type:feat (Feature work) and size:s (Small: under 100 lines) labels Jun 6, 2026
jraicr merged commit 01993c6 into dev Jun 6, 2026
4 checks passed
jraicr deleted the feat/docker-daemon-fail-fast branch June 6, 2026 00:42
jraicr added a commit that referenced this pull request Jun 11, 2026
…soning, unanchored user ERE (#161)

* fix(egress): make generated allowlist filter world-readable (644)

The effective filter was written via mktemp (always 0600) and mv (mode
preserved), then RO bind-mounted to /etc/tinyproxy/filter. tinyproxy in
the sidecar runs as a non-root uid with cap_drop ALL (no DAC_OVERRIDE),
so it could not read the filter: it logged 'filter file: Permission
denied' and exited, the healthcheck never passed, and the agent's
depends_on: service_healthy aborted every contained-mode run.

chmod 644 before the move in both generators — _generate_egress_filter
(lib/compose.sh) and _write_filter (scripts/egress-smoke.sh). The
allowlist is non-secret data; world-readable is correct.
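
A minimal sketch of the write-then-chmod-then-move pattern described here. The helper name write_world_readable is hypothetical; the real generators are _generate_egress_filter and _write_filter and may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: atomically replace a file while leaving it mode 644.
# mktemp creates the temp file as 0600 and a same-filesystem mv keeps
# that mode, so the chmod has to happen before the move.

write_world_readable() {
  # $1: destination path; the new content is read from stdin.
  local dest="$1" tmp
  tmp="$(mktemp "${dest}.XXXXXX")" || return 1
  if cat >"$tmp" && chmod 644 "$tmp" && mv -f "$tmp" "$dest"; then
    return 0
  fi
  rm -f "$tmp" # do not leave a half-written temp file behind
  return 1
}
```

For example, `printf '%s\n' '^api\.anthropic\.com$' | write_world_readable ./filter` leaves ./filter readable by a non-root process that lacks DAC_OVERRIDE.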

Also move egress-smoke.sh's cleanup trap from top level into main():
a top-level 'trap _cleanup EXIT' fired on every sourcing shell's exit
and ran 'docker rm -f' against the live daemon, which made the helpers
untestable via the source-guard (the new test/egress_smoke.bats asserts
sourcing is side-effect-free).

* fix(egress): smoke run removes only the networks it created

_check_preconditions pre-created the production fixed-name networks
(drydock_internal / drydock_egress) via 'docker network create' when
absent, and _cleanup deliberately never removed networks. A plain
'docker network create' attaches no compose labels, so a subsequent
contained 'docker compose up' referencing the same fixed name fails
fatally ('network ... has incorrect label com.docker.compose.network
set to ""'). Since setup-gates.sh runs the G2 smoke before the G1
capture sessions, the gate host was poisoned for every later contained
run until a manual 'docker network rm'.

Fix: track exactly the networks THIS run creates
(SMOKE_CREATED_NETWORKS) and remove them in _cleanup — on success and
on the failure paths via the EXIT trap. This restores the host to its
prior state by construction. Pre-existing networks (e.g. owned by a
live contained session) are never touched. Creating the networks with
hand-built compose labels was rejected: compose v2 validates the label
set against its own project metadata, and an approximated label set
that diverges from a future compose version silently reintroduces the
poisoning — removal is the only guarantee that does not depend on
matching compose internals.
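
The tracking approach could look roughly like this. Only SMOKE_CREATED_NETWORKS comes from the commit message; the helper names ensure_network and cleanup_networks, and the use of a DOCKER variable as the binary seam, are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: remember the networks this run creates, remove only those.
# Assumption: DOCKER names the docker binary (defaults to "docker").

SMOKE_CREATED_NETWORKS=""

ensure_network() {
  local name="$1"
  # Pre-existing network: leave it alone and do not record it.
  if "${DOCKER:-docker}" network inspect "$name" >/dev/null 2>&1; then
    return 0
  fi
  "${DOCKER:-docker}" network create "$name" >/dev/null || return 1
  SMOKE_CREATED_NETWORKS="${SMOKE_CREATED_NETWORKS} ${name}"
}

cleanup_networks() {
  local name
  # Word splitting is intentional: the variable is a space-separated list.
  for name in $SMOKE_CREATED_NETWORKS; do
    "${DOCKER:-docker}" network rm "$name" >/dev/null 2>&1 || true
  done
  SMOKE_CREATED_NETWORKS=""
}
```

In keeping with the source-guard fix above, a script using this would register `trap cleanup_networks EXIT` inside main() rather than at top level, so that sourcing the file stays side-effect-free.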

* fix(egress): anchor bare-hostname user allowlist entries as exact ERE

tinyproxy's FilterType ere matches UNANCHORED substrings, so a user
allowlist line in the documented bare-hostname form ('echo example.com
>> ~/.config/drydock/egress-allowlist') also matched
example.com.evil.io and example1com.net — silently widening the
deny-by-default filter. The shipped baseline was already anchored
(^api\.anthropic\.com$); only user additions were exposed.

_generate_egress_filter now normalizes every line after
comment-stripping and before dedup, uniformly across all three sources
(baseline, global user file, per-project user file):

  - a bare hostname (alphanumerics, dots, hyphens) is escaped and
    anchored exactly: example.com -> ^example\.com$
  - any other line passes through verbatim as raw ERE (expert escape
    hatch; the baseline's anchored lines are untouched), so
    'example.com' and '^example\.com$' dedup to one line
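
A minimal sketch of the two rules above, assuming one allowlist entry per call. The function name normalize_allowlist_line is hypothetical and this is not the code in _generate_egress_filter.

```shell
#!/usr/bin/env bash
# Sketch only: anchor bare hostnames, pass anything else through as raw ERE.

normalize_allowlist_line() {
  local line="$1"
  case "$line" in
    '' | *[!A-Za-z0-9.-]*)
      # Empty, or contains a character outside the bare-hostname set:
      # treat as raw ERE and leave it untouched.
      printf '%s\n' "$line"
      ;;
    *)
      # Bare hostname: escape each dot, then anchor both ends.
      printf '^%s$\n' "$(printf '%s' "$line" | sed 's/\./\\./g')"
      ;;
  esac
}
```

With this, `normalize_allowlist_line example.com` and `normalize_allowlist_line '^example\.com$'` produce the same line, so a later `sort -u` collapses them. The need for the anchors can be checked with `grep -E`, which also matches unanchored substrings: `echo example.com.evil.io | grep -E 'example.com'` matches, while the anchored form does not.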

Same transform as _to_ere_baseline_pattern in scripts/egress-capture.sh,
replicated because lib/ must not source scripts/. Documented in
docs/troubleshooting.md and the baseline header. Also closes a coverage
gap: the per-project allowlist file (egress-allowlist-<project>) now has
a test proving it feeds the generated filter.

* docs(changelog): daemon fail-fast (#159) and egress audit fixes (#149)

* docs(egress): state the exact bare-hostname anchoring rule