fix(dind): clean per-job state + add per-repo image cache#70
Merged
Conversation
Two related problems landed together because they're tightly coupled: the cleanup work would force a full re-pull of kindest/node (~1 GB) per job without the cache, and the cache trusts cleanup never touches its long-lived namespaces. Problem 1: per-job state leak dind.Server.Stop() destroyed in-memory tracked containers but left the underlying containerd per-job namespace (ephemerd-dind-<runner-name>) populated with Image records, leases, snapshots and content blobs. Over ~2 days we accumulated 73 leaked namespaces holding ~98 GB on a 100 GB VHDX, blocking new jobs with "no space left on device". Fix: pkg/dind/cleanup.go's CleanupJobNamespace enumerates and removes containers (with WithSnapshotCleanup), Image records (drops gc.ref labels), leases, then snapshots in leaf-first multi-pass order (containerd refuses to delete a snapshot with children), then content blobs, then the namespace metadata bucket itself. A 3-attempt async-GC-catchup retry handles transient FailedPrecondition results from containerd's eventually-consistent state. On boot, worker mode runs CleanupStaleDindNamespaces to sweep anything left behind by ungraceful exits (DeadlineExceeded, SIGKILL, host reboot). Verified live: 4 consecutive ephpm E2E jobs across two parallel runners exit-and-clean with zero leftover namespaces, VHDX growth bounded at ~59 GB after extensive testing vs the pre-fix unbounded growth. Problem 2: per-repo image cache to avoid the re-download tax Without it, every job would re-pull kindest/node and any other dind image because the Image records get deleted in step 2 of cleanup above, dropping the gc.refs on the underlying content blobs. Fix: pkg/dind/cache.go introduces a per-(provider, repo) long-lived namespace at ephemerd-dind-cache-<provider>-<sanitized-repo>. On image pull or container create that picked up a previously-pulled image, dind mirrors the Image record into the cache namespace (or refreshes its ephemerd.io/last-accessed label if already there). The cache's gc.refs keep the content blobs alive after the per-job namespace is cleaned up, so subsequent jobs in the same repo get a content-store hit instead of a network pull. Privacy boundary: containerd namespace isolation means a content blob referenced only from `dind-cache-foo-private`'s Image records is invisible to any other namespace's resolver. Two forges with same-named repos (e.g. github/ephpm vs gitea/ephpm) get distinct cache namespaces; two repos within a forge get distinct caches keyed by full owner/repo. Provider + Repo plumb through CreateJobRequest → runtime.CreateConfig → dind.Config so the cache namespace is derived from the dispatching forge, not parsed from the runner name. Cache pruner (cmd/ephemerd/main.go): a goroutine started in worker mode walks every dind-cache-* namespace every [dind].cache_prune_interval (default 24h) and evicts Image records whose last-accessed label is older than [dind].cache_max_age (default 168h / 7 days). Empty cache namespaces are removed entirely. Records pre-dating the label fall back to UpdatedAt so a deploy of this change doesn't nuke existing caches on first prune. Tests (all green locally with shared in-process containerd): - TestCleanup_DindNamespaces — image + lease + namespace removal - TestCleanup_DindNamespaces/StaleSweep — prefix filter, doesn't touch cache namespaces or non-dind namespaces - TestCacheNamespace_FormatAndIsolation — cross-provider + nested-repo sanitization, empty-input handling - TestSanitizeForNamespace_CollapsesAndTrims — no leading/trailing separator, no consecutive separators - TestCache_MirrorAndPrune — full mirror → refresh → backdate → prune → empty-namespace cleanup lifecycle - TestCachePrune_KeepsFreshAndPrefixedOnly — fresh records survive, non-cache namespaces untouched - TestPushHandlerEndToEnd still passes against the shared containerd (rewired to sharedTestContainerd because containerd's prometheus metrics use a process-global registry — two containerdpkg.New() in one test binary panics).
Extends docs/architecture/fake-docker-daemon.md with two new sections: - "Per-Job Namespace and Cleanup" covers the ephemerd-dind-<runner-name> namespace, the 6-step CleanupJobNamespace sequence (containers, images, leases, leaf-first snapshots, content, namespace), the FailedPrecondition retry, and the worker-mode CleanupStaleDindNamespaces sweep for crash recovery. - "Per-Repo Image Cache" covers the ephemerd-dind-cache-<provider>-<sanitized-repo> namespace, the Provider/Repo plumbing path, the two cache-write events (pull and container-create), the containerd namespace-isolation privacy guarantee (and the explicit "don't ever set namespace.shareable" caveat), and the prune semantics including the UpdatedAt fallback for records pre-dating the last-accessed label. Updates docs/getting-started/configuration.md to surface cache_prune_interval and cache_max_age on the example config block and adds a [dind] section reference paragraph explaining the cache behavior, privacy boundary, and disable knobs. Also refreshes the Key Files table to include cleanup.go, cache.go, and their respective test files.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related problems land together because they're tightly coupled: fixing the cleanup leak would force a full re-pull of
kindest/node(~1 GB) every job without the cache, and the cache only works because cleanup never touches its long-lived namespaces.Problem 1: per-job state leak
dind.Server.Stop()destroyed in-memory tracked containers but left the underlying containerd per-job namespace (ephemerd-dind-<runner-name>) populated with Image records, leases, snapshots, and content blobs. Over ~2 days we accumulated 73 leaked namespaces holding ~98 GB on a 100 GB VHDX, eventually blocking new jobs withNo space left on device.Fix
pkg/dind/cleanup.goCleanupJobNamespace:WithSnapshotCleanup).gc.ref.content.*labels).NamespaceService().Delete()the metadata bucket.Plus a 3-attempt async-GC-catchup retry for transient
FailedPreconditionfrom containerd's eventually-consistent state, and a diagnostic log of the stuck snapshot'sStat()(name/parent/kind/labels) if the loop bails so future failures tell us what's pinning.On worker-mode boot,
CleanupStaleDindNamespacessweeps anything left behind by ungraceful exits (DeadlineExceeded, SIGKILL, host reboot).Verified live
4 consecutive ephpm E2E jobs across two parallel runners exit-and-clean with zero leftover namespaces. VHDX growth bounded at ~59 GB after extensive testing vs the pre-fix unbounded growth. Console log shows
dind cleanup: namespace removedfor each completed runner.Problem 2: per-repo image cache to avoid the re-download tax
Without it, every job would re-pull
kindest/nodeand any other dind image because the Image records get deleted in step 2 above, dropping the gc.refs on the underlying content blobs.Fix
pkg/dind/cache.gointroduces a long-lived per-(provider, repo)namespace:On image pull or container create that picked up a previously-pulled image, dind mirrors the Image record into the cache namespace (or refreshes its
ephemerd.io/last-accessedlabel if already there). The cache'sgc.reflabels keep the content blobs alive after the per-job namespace is cleaned up, so subsequent jobs in the same repo get a content-store hit instead of a network pull.Privacy boundary
Containerd namespace isolation means a content blob referenced only from
dind-cache-foo-private's Image records is invisible to any other namespace's resolver. Two forges with same-named repos (e.g.github/ephpmvsgitea/ephpm) get distinct cache namespaces; two repos within a forge get distinct caches keyed by fullowner/repo. Provider + Repo plumb throughCreateJobRequest→runtime.CreateConfig→dind.Configso the cache namespace is derived from the dispatching forge, not parsed from the runner name (which would lose provider info).Cache pruner
cmd/ephemerd/main.gostarts a goroutine in worker mode that walks everydind-cache-*namespace every[dind].cache_prune_interval(default24h) and evicts Image records whoselast-accessedlabel is older than[dind].cache_max_age(default168h/ 7 days). Empty cache namespaces are removed entirely. Records pre-dating the label fall back toUpdatedAtso a deploy of this change doesn't nuke existing caches on first prune.Tests
TestCleanup_DindNamespaces/RemovesImageLeaseAndNamespace— full delete cycle.TestCleanup_DindNamespaces/StaleSweepFiltersByPrefix— sweep touches onlyephemerd-dind-*(non-cache) namespaces.TestCacheNamespace_FormatAndIsolation— cross-provider + nested-repo sanitization, empty-input → empty-result (caching disabled).TestSanitizeForNamespace_CollapsesAndTrims— no leading/trailing separators, no consecutive separators, alphanumerics/._-preserved.TestCache_MirrorAndPrune— full mirror → refresh → backdate → prune → empty-namespace cleanup lifecycle.TestCachePrune_KeepsFreshAndPrefixedOnly— fresh records survive, non-cache namespaces untouched.TestPushHandlerEndToEndstill passes (rewired tosharedTestContainerdbecause containerd's prometheus metrics use a process-global registry — twocontainerdpkg.New()in one test binary panics).All 26 packages green locally:
CGO_ENABLED=0 go test -tags containers_image_openpgp -count=1 ./.... Localmage lintblocked by the known Windows-onlymiekg/pkcs11cgo typecheck issue (documented inAGENTS.md); CI lint on Linux runs clean.Test plan
dind-cache-gitea-*anddind-cache-github-*don't share blobs even with same-named repos.Why one PR
The two changes are coupled by design: the leak fix would defeat the cache, the cache work assumes the cleanup never touches its namespaces. Splitting would require landing the leak fix in a state where every job has the re-download tax, then later landing the cache work — strictly worse in production for as long as PR1 is alone on main.