Commit 034d188

Sbussiso and claude committed
ci(deploy): switch to build + fly machine update for reliable in-place deploys
`fly deploy` for our single-machine + single-volume topology is non-deterministic: sometimes it updates the existing machine in place, sometimes it tries to provision a NEW machine alongside and fails because the volume only has one attachment slot. Hit this on consecutive runs today with both strategy=rolling AND strategy=immediate; the manual `fly machine update` recovery worked reliably.

Workflow now does:
1. flyctl deploy --build-only --push (build + push image, no machine touch)
2. capture the image tag from the build output
3. flyctl machine update <id> --image <tag> (explicit in-place API)

`fly machine update` targets a specific machine ID and cannot try to create a parallel machine. ~30-60s of downtime per deploy, same as strategy=immediate on its best day, but reliably works.

Loops will be needed if we ever scale to multiple machines (or migrate off SQLite-on-volume, which is the long-term unblocker for zero-downtime deploys).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
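
Step 2 (capturing the image tag) is done in this commit with `grep -oP`, which needs a grep built with PCRE support. Purely as an illustration, the same extraction can be sketched with POSIX `sed`. The function name `extract_image` and the sample output line are hypothetical: they are not part of the commit and are not copied from real flyctl logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print the last
# "image: <ref>" value found in captured build output, using POSIX sed
# instead of GNU grep's -P.
extract_image() {
  # $1: captured build output. Prints the last image reference, or nothing.
  printf '%s\n' "$1" \
    | sed -n 's/^.*image:[[:space:]]\{1,\}\([^[:space:]]\{1,\}\).*$/\1/p' \
    | tail -n 1
}

# Made-up sample line, shaped like the one described in the workflow comment.
sample_output='image: registry.fly.io/example-app:deployment-HYPOTHETICAL'
extract_image "$sample_output"
# prints: registry.fly.io/example-app:deployment-HYPOTHETICAL
```

One thing to keep in mind with either variant: in `OUT=$(flyctl ... | tee /dev/stderr)` the pipeline's exit status is `tee`'s unless `set -o pipefail` is active, so the empty-`$IMAGE` check is what actually catches a failed build.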
1 parent c2f9a85 commit 034d188

1 file changed

Lines changed: 59 additions & 10 deletions

.github/workflows/deploy.yml

@@ -38,15 +38,64 @@ jobs:
 
       - uses: superfly/flyctl-actions/setup-flyctl@master
 
-      # `--depot=false` falls back to Fly's built-in remote builder
-      # (a Fly Machine running `docker build`). Slower than depot
-      # (~3 min vs ~30s for a cached build) but doesn't depend on
-      # depot.dev's availability — depot has timed out twice in a
-      # row on consecutive deploys (2026-04-28), wedging CI for ~10
-      # min each before falling through to a final auth-handshake
-      # error. Switch back to `--depot=true` (or remove the flag)
-      # once depot stabilises, or accept the slower-but-reliable
-      # path indefinitely if uptime matters more than build speed.
-      - run: flyctl deploy --remote-only --depot=false
+      # Why two-step (build → machine update) instead of plain `fly deploy`:
+      #
+      # We run a single Fly Machine with a single persistent volume
+      # (`opensentry_data` at /data, holding the SQLite DB). `fly deploy`
+      # for this topology is non-deterministic: sometimes it sees the
+      # existing machine and updates it in place, sometimes it decides
+      # the image config has "drifted enough" and tries to provision a
+      # NEW machine alongside the old. The new-machine path errors
+      # immediately because the volume only has one attachment slot:
+      # "creating a new machine in group 'app' requires an
+      # unattached 'opensentry_data' volume."
+      # We hit this on consecutive runs 2026-04-28 with strategy=rolling
+      # AND strategy=immediate, and `max_unavailable` is rolling-only so
+      # it didn't help either.
+      #
+      # `fly machine update --image …` is the explicit in-place API.
+      # It targets a specific machine ID, restarts it on the new image,
+      # and the volume stays attached throughout. It cannot try to
+      # create a new machine. ~30-60s of downtime per deploy (same as
+      # `strategy = "immediate"` on a good day) but reliably works.
+      #
+      # Note: `--depot=false` is intentional — depot.dev timed out
+      # twice in a row on 2026-04-28 (5 min each), wedging CI for
+      # ~10 min before failing. Standard remote builder is slower
+      # (~3 min cached vs ~30s depot) but reliable.
+      - name: Build + push image to Fly registry
+        id: build
+        run: |
+          set -e
+          # --build-only --push: build the image and push to Fly's
+          # registry, but do NOT touch any machines. The image tag is
+          # printed on a line like:
+          # image: registry.fly.io/opensentry-command:deployment-XXXX
+          # Capture it so the next step can target it explicitly.
+          OUT=$(flyctl deploy --remote-only --depot=false --build-only --push 2>&1 | tee /dev/stderr)
+          IMAGE=$(echo "$OUT" | grep -oP 'image:\s+\K[^\s]+' | tail -1)
+          if [ -z "$IMAGE" ]; then
+            echo "::error::Could not find image tag in flyctl output"
+            exit 1
+          fi
+          echo "Captured image: $IMAGE"
+          echo "image=$IMAGE" >> "$GITHUB_OUTPUT"
         env:
           FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
+
+      - name: Update machine in place
+        run: |
+          set -e
+          # We currently run exactly one machine. If we ever scale to
+          # multiple machines (or migrate to LiteFS / Postgres so we
+          # don't need a single volume), this script needs to loop.
+          MACHINE=$(flyctl machines list -a opensentry-command --json | jq -r '.[0].id')
+          if [ -z "$MACHINE" ] || [ "$MACHINE" = "null" ]; then
+            echo "::error::No machines found for opensentry-command"
+            exit 1
+          fi
+          echo "Updating machine $MACHINE to $IMAGE"
+          flyctl machine update "$MACHINE" --image "$IMAGE" --yes -a opensentry-command
+        env:
+          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
+          IMAGE: ${{ steps.build.outputs.image }}
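
The comment in the update step says the script needs to loop once there is more than one machine. One possible shape for that loop is sketched below, assuming every machine can simply be updated in sequence to the same image; whether that is safe depends on how volumes and traffic are laid out by then. `update_machines` and the `FLYCTL` override variable are hypothetical names introduced for this sketch. The `flyctl machine update` flags are the ones the commit already uses.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not part of the commit): update several machines
# one at a time, stopping at the first failure.
# FLYCTL is an assumed override hook so the loop can be dry-run or tested.
FLYCTL="${FLYCTL:-flyctl}"

# update_machines APP IMAGE
# Reads machine IDs from stdin, one per line.
update_machines() {
  local app="$1" image="$2" id count=0
  while IFS= read -r id; do
    # Skip blank lines and the literal "null" that `jq -r` prints for a
    # missing id.
    if [ -z "$id" ] || [ "$id" = "null" ]; then
      continue
    fi
    echo "Updating machine $id to $image"
    # </dev/null keeps the command from consuming the remaining IDs on stdin.
    "$FLYCTL" machine update "$id" --image "$image" --yes -a "$app" </dev/null || return 1
    count=$((count + 1))
  done
  if [ "$count" -eq 0 ]; then
    echo "::error::No machines found for $app" >&2
    return 1
  fi
}

# Possible call site, mirroring the single-machine step above:
#   flyctl machines list -a opensentry-command --json | jq -r '.[].id' \
#     | update_machines opensentry-command "$IMAGE"
```

Reading IDs from stdin keeps the `jq` query out of the function, so the same empty-list guard as the single-machine version still applies when `jq` returns nothing.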
