
feat: add Swarm mode with atomic Caddyfile rollout via configs #778

Open
oneingan wants to merge 7 commits into lucaslorentz:master from oneingan:feat/swarm-mode-configs-upstream

Conversation

@oneingan (Contributor) commented Mar 3, 2026

Implements the Swarm mode discussed in #773 (and the earlier atomic-write motivation from #766).

  • Adds --mode=swarm: controller-only; renders a full Caddyfile from labels and rolls it out via immutable Swarm configs.
  • Updates an existing Swarm service (--swarm-service) by swapping the mounted config at /etc/caddy/Caddyfile (or --swarm-caddyfile-target).
  • Keeps Caddy workers socketless and the Admin API closed; rollout happens via docker service update, so disruption is governed by the service's update/rollback settings.
  • Includes docs + examples/swarm.yaml and a swarm-mode test script.

Notes:

  • Each config change creates a new Swarm config object named <prefix>-<sha256>; old configs are not garbage-collected yet (documented in README). A GC pass can be added as follow-up.
  • This has been running on a production Swarm cluster for a couple of weeks without issues.

AI assistance disclosure: this PR was prepared with help from OpenCode (AI coding assistant) using OpenAI model gpt-5.2-xhigh.

```go
if err != nil {
	// If another instance created it concurrently, inspect again.
	if errdefs.IsConflict(err) {
		return ensureSwarmConfig(ctx, dockerClient, name, data, fullHash)
```
Owner commented:

This recursive call has no depth limit. If ConfigCreate keeps returning a conflict (e.g. a genuine name collision with different data), it will overflow the stack.

Consider a bounded loop like the one in ensureServiceCaddyfileConfig just below (line 190).

```go
if err != nil {
	log.Error("Failed to convert caddyfile into json config", zap.Error(err))
if dockerLoader.options.SwarmMode {
	if err := dockerLoader.updateSwarmService(); err != nil {
```
Owner commented:

updateSwarmService() runs on every update() cycle, not just when caddyfileChanged is true. This means every polling interval / Docker event triggers ServiceInspectWithRaw + ConfigList even when nothing changed.

The existing flow only does expensive work (Adapt + server push) inside the if caddyfileChanged block. Could this be guarded the same way?
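One way to guard it, sketched with a hypothetical hash cache (`lastCaddyfileHash` and `shouldUpdateSwarmService` are not names from the PR; the real loader already tracks `caddyfileChanged`):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// lastCaddyfileHash caches the hash of the last rendered Caddyfile so the
// update cycle can skip the expensive Swarm calls when nothing changed.
var lastCaddyfileHash string

// shouldUpdateSwarmService reports whether the rendered Caddyfile differs
// from the one already rolled out, mirroring the existing caddyfileChanged guard.
func shouldUpdateSwarmService(caddyfile []byte) bool {
	sum := sha256.Sum256(caddyfile)
	h := hex.EncodeToString(sum[:])
	if h == lastCaddyfileHash {
		return false // no change: skip ServiceInspectWithRaw + ConfigList
	}
	lastCaddyfileHash = h
	return true
}
```

Only a changed hash would then reach `updateSwarmService()`, so idle polling intervals and unrelated Docker events become no-ops.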

```go
	defaultSwarmConfigPrefix  = "caddyfile"
	defaultSwarmConfigHashLen = 32
	maxSwarmConfigSizeBytes   = 1000 * 1024
)
```
Owner commented:

These defaults are also defined in the flag declarations in cmd.go (lines 75-86). In the existing code, defaults live only in the flag definitions. Having two sources means they can silently drift apart.

Owner commented:

Example:

```go
fs.Int("swarm-config-hash-len", 32,
	"Length of sha256 hex used in generated Swarm config name (swarm mode only)")
```

Having a shared const would be nice

```sh
docker config ls --format "{{.Name}}" | grep "^${CONFIG_PREFIX}-" | xargs -r docker config rm >/dev/null || true
}

cleanup_configs
```
Owner commented:

cleanup_configs runs here at the start but not on exit. docker stack rm (which the main test runner calls) won't remove standalone Swarm config objects created by this mode — unlike other tests, this one creates Docker objects outside the stack.

Consider adding a trap:

```sh
trap cleanup_configs EXIT
```

@lucaslorentz (Owner) commented:

Reviewed with Claude Opus

@lucaslorentz (Owner) commented:

A concern I have is the lack of config garbage collection. Each Caddyfile change creates a new Swarm config object (<prefix>-<sha256>) that is never removed. In a cluster with frequent deploys, this will silently accumulate and fill up Swarm's Raft store, which can degrade the entire cluster's performance and eventually prevent new configs/secrets from being created.

Would it be possible to tag the created configs with a label identifying them as managed by this controller, and include a cleanup routine that scans for old configs with that label and removes any that are no longer mounted on the target service?
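As a sketch of the cleanup half of that idea (all names hypothetical; the Docker API calls — listing label-filtered configs, inspecting the service's mounted configs, ConfigRemove — are abstracted away so only the keep/remove decision is shown):

```go
package main

// gcConfigs removes controller-managed configs that are no longer mounted on
// the target service. managed would come from ConfigList filtered by the
// controller's label; mounted from the service spec; remove wraps ConfigRemove.
func gcConfigs(managed []string, mounted map[string]bool, remove func(id string) error) (removed []string, err error) {
	for _, id := range managed {
		if mounted[id] {
			continue // still referenced by the target service: keep it
		}
		if err := remove(id); err != nil {
			// Stop on the first failure so a transient API error does not
			// silently skip configs; the next GC pass will retry.
			return removed, err
		}
		removed = append(removed, id)
	}
	return removed, nil
}
```

Running such a pass after each successful `docker service update` would keep the Raft store bounded to roughly one stale config per in-flight rollout.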
