Implement socket-activated zero-downtime deploy switchover#67

Merged
retlehs merged 1 commit into main from fix/zero-downtime-deploy
Apr 4, 2026

@retlehs
Member

@retlehs commented Mar 25, 2026

Summary

  • Implement systemd socket activation for zero-downtime deploy switchover
  • Add Go-side LISTEN_FDS detection with fallback to normal listen for local dev
  • Bump Caddy retry window (lb_try_duration 5s→10s, lb_try_interval 250ms→100ms) as safety net

Why

We were seeing brief 502/503 responses during deploy because restarting the service drops the listening socket. Socket activation (wppackages.socket) keeps the socket open across service restarts — incoming connections queue at the kernel instead of failing.

Builds on #95 which separated Litestream into its own service, unblocking socket activation (the old litestream -exec wrapper wouldn't pass through the socket fd).
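
The queueing behavior described above can be seen without systemd at all. The following is a minimal sketch, not code from this PR: the function name `queuedBeforeAccept` is hypothetical, and it only assumes that a listening TCP socket exists while nothing is calling `Accept`, which is the situation during a restart when the socket unit holds the fd.

```go
package main

import (
	"io"
	"net"
)

// queuedBeforeAccept (hypothetical name) opens a listening socket, connects
// and sends msg before any Accept call is made, then accepts the connection
// and returns whatever the kernel buffered in the meantime.
func queuedBeforeAccept(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// Nothing is accepting yet. The kernel completes the TCP handshake and
	// parks the connection in the listen backlog, so Dial still succeeds.
	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	if _, err := client.Write([]byte(msg)); err != nil {
		client.Close()
		return "", err
	}
	client.Close()

	// The "restarted" server now picks the queued connection up.
	conn, err := ln.Accept()
	if err != nil {
		return "", err
	}
	defer conn.Close()

	data, err := io.ReadAll(conn)
	return string(data), err
}
```

The backlog is finite, so this is only expected to cover a short restart window, which is presumably why the Caddy retry settings stay in place as a safety net.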

Changes

  • internal/http/server.go: systemdListener() consumes the fd passed by systemd via LISTEN_FDS/LISTEN_PID; falls back to ListenAndServe when not socket-activated (local dev)
  • templates/wppackages.socket.j2: new systemd socket unit listening on {{ go_listen_addr }}
  • templates/wppackages.service.j2: adds Requires=wppackages.socket
  • tasks/main.yml: deploys and enables the socket unit before the service
  • Caddyfile.j2: retry tuning as additional safety net

Test plan

  • Run provision and verify wppackages.socket is active (systemctl status wppackages.socket)
  • Confirm wppackages.service starts via socket activation (journalctl -u wppackages shows "using systemd socket activation")
  • Run deploy and monitor for 502/503 elimination during switchover
  • Verify local dev still works without socket activation (normal make dev)
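
One possible way to do the monitoring step is a small probe that requests the site in a loop while the deploy runs and counts gateway errors. The sketch below is only a suggestion and not part of this PR: `fetchFunc`, `httpFetch`, and `probe` are hypothetical names, and the URL, request count, and interval are left to the caller.

```go
package main

import (
	"net/http"
	"time"
)

// fetchFunc (hypothetical) performs one request and reports its status code.
type fetchFunc func() (status int, err error)

// httpFetch builds a fetchFunc that issues a GET against url.
func httpFetch(client *http.Client, url string) fetchFunc {
	return func() (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		return resp.StatusCode, nil
	}
}

// probe calls fetch n times, sleeping interval between calls, and returns
// how many responses were 502/503 and how many requests failed outright.
func probe(fetch fetchFunc, n int, interval time.Duration) (gatewayErrs, transportErrs int) {
	for i := 0; i < n; i++ {
		status, err := fetch()
		switch {
		case err != nil:
			transportErrs++
		case status == http.StatusBadGateway || status == http.StatusServiceUnavailable:
			gatewayErrs++
		}
		time.Sleep(interval)
	}
	return gatewayErrs, transportErrs
}
```

Running `probe(httpFetch(http.DefaultClient, url), n, interval)` across a deploy, both counts would be expected to stay at zero if the switchover behaves as intended.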

🤖 Generated with Claude Code

@retlehs self-assigned this Mar 25, 2026
@retlehs changed the title from "Increase Caddy retry window to reduce deploy 502s" to "Implement socket-activated zero-downtime deploy switchover" Apr 2, 2026
@swalkinshaw
Member

🤔 Not sure this is entirely correct or solves the problem. I think it's partly because we're wrapping the Go command with litestream -exec, when really we should separate Litestream out into its own service. In fact, Litestream might not even pass the socket fd through to our binary...

Assuming we separate them, not even sure the readiness checks in Ansible are needed or the Caddy retries 🤔

@swalkinshaw
Member

Though the readiness checks aren't bad anyway just to be safe

Systemd socket activation keeps the listening socket open across
service restarts so connections queue at the kernel instead of
getting 503s from Caddy. The Go server detects LISTEN_FDS and
uses the inherited fd, falling back to normal listen for local dev.
Caddy retry window bumped as a safety net.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@retlehs force-pushed the fix/zero-downtime-deploy branch from e8fc77d to a62be24 April 4, 2026 16:07
@retlehs merged commit e276195 into main Apr 4, 2026
5 checks passed
@retlehs deleted the fix/zero-downtime-deploy branch April 4, 2026 19:07
2 participants