
ci: split release pipeline into per-OS native builds #69

Merged: luthermonson merged 2 commits into main from release/per-os-builds on May 15, 2026

Conversation

@luthermonson
Contributor

Summary

Replaces the single-Linux-container goreleaser pipeline with four parallel native builds, so v0.0.1 and future tags actually ship working binaries on every advertised platform.

Why

The old release.yml ran goreleaser inside a single [self-hosted, linux, x64] container and cross-compiled all four platforms. That left three of the four binaries broken:

| Platform | go:embed needs | Present after old release.yml? |
|---|---|---|
| linux/amd64 | linux x64 runner, cni, shim | ✓ |
| linux/arm64 | linux arm64 runner, cni, shim | ✗ (would embed x64 runner: wrong arch) |
| windows/amd64 | vmlinuz, initrd, rootfs, ephemerd-linux, win runner | ✗ missing vmlinuz/initrd → Hyper-V VM won't boot |
| darwin/arm64 | aarch64 kernel/initrd/rootfs/ephemerd-linux/runner | ✗ none of the aarch64 assets are downloaded |

Root cause: mage download:all on Linux fetches only Linux-host assets (Linux ephemerd has no VM dependency), and EnsurePlaceholders() then creates zero-byte files for the other platforms' embeds so the build compiles regardless. Compiles fine, runs broken.
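One way to make the "compiles fine, runs broken" failure visible is a startup guard that rejects empty embedded assets. The following is a minimal sketch, not code from this repository: `missingEmbeds`, `checkEmbeds`, and the asset table are hypothetical names, and it assumes each embedded asset is available as a byte slice.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// missingEmbeds returns the sorted names of assets whose content is empty,
// which is what a zero-byte placeholder looks like once it has been embedded.
func missingEmbeds(assets map[string][]byte) []string {
	var missing []string
	for name, data := range assets {
		if len(data) == 0 {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

// checkEmbeds turns a non-empty list of missing assets into an error that a
// binary could return at startup instead of failing later at VM boot.
func checkEmbeds(assets map[string][]byte) error {
	if missing := missingEmbeds(assets); len(missing) > 0 {
		return fmt.Errorf("embedded assets are empty placeholders: %s",
			strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Hypothetical asset table; in a real binary these would be byte slices
	// populated by //go:embed directives.
	assets := map[string][]byte{
		"vmlinuz": nil,
		"initrd":  {},
		"rootfs":  []byte("not-empty"),
	}
	if err := checkEmbeds(assets); err != nil {
		fmt.Println(err)
	}
}
```

A guard like this would not replace native builds; it would only turn a silent placeholder into an explicit error.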

How

Four parallel build jobs, each on its native self-hosted runner running the appropriate full mage target:

  • build-linux-amd64: mage build:build on [self-hosted, linux, x64]
  • build-linux-arm64: mage build:build on [self-hosted, linux, arm64]
  • build-windows-amd64: mage build:windows on [self-hosted, windows, x64] — full two-stage build with Linuxembed + Rootfs + Kernelx86 + Initrdx86
  • build-darwin-arm64: mage build:macos on [self-hosted, macos, arm64] — aarch64 Linux VM assets + codesign

Each job packages its binary (tar.gz on unix, zip on Windows) and uploads it as an artifact. A final release job downloads all four, computes sha256sum > checksums.txt, and creates a draft GitHub release via gh release create --draft --prerelease --generate-notes. Publish step stays manual.
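For readers unfamiliar with the checksums.txt layout, here is a rough Go sketch of what the `sha256sum` step produces: one line per archive with the hex digest, two spaces, and the file name. The workflow itself uses the `sha256sum` command; `checksumLines` and the archive names below are hypothetical, and sorting by name is a choice made here for deterministic output (`sha256sum` follows argument order).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// checksumLines renders one "<hex digest>  <name>" line per archive, sorted
// by name, in the same text layout that sha256sum prints.
func checksumLines(archives map[string][]byte) string {
	names := make([]string, 0, len(archives))
	for name := range archives {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		sum := sha256.Sum256(archives[name])
		fmt.Fprintf(&b, "%s  %s\n", hex.EncodeToString(sum[:]), name)
	}
	return b.String()
}

func main() {
	// Hypothetical archive names and contents, for illustration only.
	archives := map[string][]byte{
		"ephemerd-linux-amd64.tar.gz": []byte("linux build"),
		"ephemerd-windows-amd64.zip":  []byte("windows build"),
	}
	fmt.Print(checksumLines(archives))
}
```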

.goreleaser.yml is deleted — gh release create handles the release, mage handles per-OS cross-compile.

Test plan

  • After merge, push tag v0.0.1-rc1 and watch all four build jobs run on their native runners.
  • Verify each artifact contains a real binary (non-zero size, non-zero embedded VM assets where applicable).
  • Verify draft release is created with 4 archives + checksums.txt.
  • Smoke-test one binary per platform: ephemerd --version returns the tag, ephemerd start boots without errors.
  • Once happy, publish the draft.
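The "non-zero size" check in the test plan could be scripted rather than eyeballed. The sketch below is one possible approach and is not part of this PR: `emptyFiles` and `buildArchive` are hypothetical helpers, it only handles the tar.gz archives (the Windows zip would need `archive/zip`), and it cannot see inside the binary, so empty go:embed assets still need the smoke test.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// emptyFiles returns the names of regular files in a .tar.gz stream whose
// recorded size is zero.
func emptyFiles(r io.Reader) ([]string, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	var empty []string
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return empty, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag == tar.TypeReg && hdr.Size == 0 {
			empty = append(empty, hdr.Name)
		}
	}
}

// buildArchive creates an in-memory .tar.gz so the example runs standalone.
func buildArchive(files map[string][]byte) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gw)
	for name, data := range files {
		hdr := &tar.Header{
			Name:     name,
			Typeflag: tar.TypeReg,
			Mode:     0o755,
			Size:     int64(len(data)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(data); err != nil {
			return nil, err
		}
	}
	// Close the tar writer before the gzip writer so both trailers are flushed.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	// Hypothetical archive contents: one real file and one zero-byte file.
	buf, err := buildArchive(map[string][]byte{
		"ephemerd":        []byte("pretend binary"),
		"placeholder.bin": nil,
	})
	if err != nil {
		panic(err)
	}
	empty, err := emptyFiles(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("zero-byte files:", empty)
}
```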

Caveats

  • All four builds must succeed for the release job to run. Partial-platform releases aren't supported by this design — by intent, since "ephemerd v0.0.1 ships Linux but not Windows" would be confusing. If a platform's runner is offline at tag time, push a new tag once it's back.
  • The release job depends on the Windows self-hosted runner and macOS self-hosted runner being online. If either is offline, the release won't complete until it's available; this is the same situation as PR #68's Build (Linux arm64).

Arch doc?

This is the first cross-platform release pipeline for the project. Happy to add docs/arch/release-pipeline.md if you'd like — it'd document the per-OS native build invariant + why goreleaser cross-compile from Linux is wrong here, so the next person doesn't try to consolidate it back.

The previous release.yml ran goreleaser inside a single Linux container
and cross-compiled for linux/amd64, linux/arm64, windows/amd64, and
darwin/arm64. That approach left the cross-compiled binaries with empty
go:embed sections — the Hyper-V kernel/initrd/rootfs, the macOS Vz
aarch64 VM assets, and the arm64 GHA runner tarball all live in
platform-specific paths fed by mage targets gated to the matching host.
Cross-compiling from Linux saw only the x64 Linux assets and filled the
rest with EnsurePlaceholders() zero-byte files — the resulting Windows
and Darwin binaries would compile fine but be unable to boot a Linux VM.

Replace with four parallel build jobs, one per platform, each running
on its native self-hosted runner via the appropriate mage target:

- linux/amd64: mage build:build on [self-hosted, linux, x64]
- linux/arm64: mage build:build on [self-hosted, linux, arm64]
- windows/amd64: mage build:windows on [self-hosted, windows, x64] —
  the full two-stage build (Linuxembed, Rootfs, Kernelx86, Initrdx86)
- darwin/arm64: mage build:macos on [self-hosted, macos, arm64] — the
  Darwin build with aarch64 Linux VM assets + codesign

Each job packages its binary (tar.gz on unix, zip on Windows) and
uploads it as a workflow artifact. A final release job downloads all
four artifacts, writes their sha256 checksums to checksums.txt, and
creates a draft GitHub release via `gh release create --draft
--prerelease`. The draft gate is intentional: release notes are
auto-generated by --generate-notes, and publishing is manual.

Drop .goreleaser.yml; the workflow uses `gh release create` directly
and mage handles cross-compile via its existing per-OS build:* targets.

Two related changes, both surfaced by the PR #69 CI run failing on the
same TestPushHandlerEndToEnd flake from PR #68, which I had "fixed"
with a post-stage Info() diagnostic.

1. AGENTS.md — short hard-rules file for any agent (Claude, Cursor, etc.)
   working in this repo. Centerpiece is "run mage lint AND mage test
   before every push, no exceptions". Local cgo failures on Windows are
   not a free pass — that's exactly the path that produced two recent
   red CI runs (errcheck on debugexec_linux.go in PR #68, flake-mask
   regression in PR #69). Also documents the no-flake-masking rule:
   never paper over a flaky test with a diagnostic call, sleep, or
   label that "might help".

2. The real TestPushHandlerEndToEnd fix — hold a 5-minute lease across
   the entire staging→push lifecycle via leases.Create + WithLease.
   Without an active lease, content.WriteBlob's addContentLease is a
   no-op (leases.FromContext returns false), and the staged blobs are
   namespace-bucket-registered but un-leased and un-labeled. That
   combination flakes in CI in ways that look like the layer digest
   "doesn't exist" mid-push.

   Replaces the post-stage Info() diagnostic from PR #68, which was
   flake-masking: it made one CI run pass but the underlying race was
   never fixed.

   Verified: 5 sequential `go test -run=TestPushHandlerEndToEnd` runs
   pass on Windows (CGO_ENABLED=0) in 0.65–1.0s each, vs. the previous
   half-second-flake behavior.

Note on this commit's lint coverage: golangci-lint on this Windows box
fails on the miekg/pkcs11 cgo cross-import (a known local-env issue
documented in AGENTS.md). golangci-lint reports "0 issues" before
exiting on that typecheck. `go build ./pkg/dind/...` and `go test
./pkg/dind/...` both pass.
@luthermonson luthermonson merged commit c6a9ef9 into main May 15, 2026
4 checks passed
@luthermonson luthermonson deleted the release/per-os-builds branch May 16, 2026 16:31