Skip to content

feat(buildrunner): add Buildkite BuildRunner implementation#190

Open
JamyDev wants to merge 8 commits into
mainfrom
feat/buildkite-buildrunner
Open

feat(buildrunner): add Buildkite BuildRunner implementation#190
JamyDev wants to merge 8 commits into
mainfrom
feat/buildkite-buildrunner

Conversation

@JamyDev
Copy link
Copy Markdown

@JamyDev JamyDev commented Jun 4, 2026

What

Adds submitqueue/extension/buildrunner/buildkite — a BuildRunner implementation backed by the Buildkite CI platform. The package is four files:

  • buildkite.gorunner struct implementing BuildRunner.Trigger, .Status, and .Cancel
  • client.go — thin HTTP wrapper over the three Buildkite REST endpoints (create/get/cancel build)
  • buildkite_test.go — 16 unit tests backed by an in-process httptest server
  • BUILD.bazel — Bazel build file

Trigger calls the Buildkite API to create a build, encoding the base and head change URIs as JSON-encoded environment variables (SQ_BASE_URIS, SQ_HEAD_URIS, SQ_QUEUE). The Buildkite
pipeline script uses these to fetch diffs via the GitHub API, apply them in order, and run CI. Status and Cancel derive the Buildkite build number from the BuildID string — no local
state is needed.

Auth and base URL are injected via HTTP transport layers on the caller-provided *http.Client (using httpclient.BaseURLTransport + an auth transport), following the existing httpclient
pattern and keeping credentials out of the struct.

Why

The BuildRunner extension interface was designed to support multiple CI platforms as pluggable backends. This adds the first real (non-noop) implementation. Buildkite is the target CI
platform for speculative merges: its pipeline model maps cleanly onto the base/head separation in BuildRunner.Trigger — the pipeline applies the base layer first (potentially cacheable
across batches), then the head layer, then runs the full test suite.

Using environment variables to pass change URIs keeps the SQ→Buildkite contract simple and auditable; the pipeline script is responsible for materialising the diffs, isolating SQ from
Buildkite's pipeline YAML.

Tests

All three interface methods are tested against an in-process httptest server — no external dependencies or mocks:

  • Trigger: validates HTTP method, request payload (env var keys/values), URI flattening across multiple Change entries, and that a nil base produces [] (not null) in JSON
  • Status: state mapping — exhaustive table over all documented Buildkite states (creating, scheduled, running, blocked, passed, failed, not_run, skipped, canceling,
    canceled) plus unknown future states (mapped to non-terminal Unknown so the poll loop keeps waiting rather than failing a batch on an unrecognised state); also validates build URL is
    returned in metadata
  • Cancel: verifies the cancel endpoint is called; verifies 422 (already-terminal build) is treated as a no-op
  • Helpers: encode/parse round-trip, invalid build ID rejection

Implements buildrunner.BuildRunner and buildrunner.Factory backed by the
Buildkite CI platform, under submitqueue/extension/buildrunner/buildkite.

Design:
- Trigger/Cancel return promptly (per the async/sync contract in the
  build-runner RFC): both enqueue work on a buffered channel and a background
  worker contacts Buildkite, keeping the orchestrator's queue loops decoupled
  from Buildkite availability.
- The in-memory build-ID -> Buildkite-ref map is a pure latency cache, not the
  source of truth. The SQ build ID is stamped into the Buildkite build's
  metadata at create time; Status and Cancel re-derive the ref via a
  metadata-filtered build lookup on cache miss, so they remain correct after a
  process restart that empties the cache (no orphaned Accepted-forever builds).
- The background worker retries transient create/cancel failures with backoff.
  If a submission exhausts its retries, the build is recorded as a submission
  failure and Status reports terminal Failed (with the reason in
  BuildMetadata["error"]) rather than polling Accepted forever.
- mapState collapses Buildkite states into BuildStatus; unrecognised states map
  to the non-terminal Unknown (not Failed), so an unknown state does not
  terminally fail a batch.
- Each job is dispatched to its own goroutine so a retrying submit does not
  head-of-line block other builds.

Errors are returned plain (classification left to the controller, per
core/errs). Includes config knobs (queue sizes, submit timeout, retry
attempts/backoff, test overrides) and unit tests covering trigger/status/cancel,
cache-miss recovery, retry-then-succeed, retry-exhaustion-fails, and state
mapping.

Known limitation: a process restart while a trigger job is still buffered
(never submitted to Buildkite) leaves the build Accepted with nothing to
recover; catching that belongs to an orchestrator-level Accepted deadline,
which is out of scope for this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread submitqueue/extension/buildrunner/buildkite/config.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/config.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/config.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/config.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/config.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/config.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/factory.go Outdated
JamyDev added 3 commits June 4, 2026 11:42
… via HTTP transport

- Replace New(Config) with NewBuildRunner(Params) following the MergeChecker
  Params pattern: Config holds queue-specific settings, HTTPClient is injected
  by the caller with auth and base URL already configured via transports
- Remove APIToken from Config; auth is now the caller's concern via a
  transport layer (e.g. Authorization-header RoundTripper)
- Remove BaseURL from Config and client; client now uses relative paths so
  BaseURLTransport on the injected client resolves them to absolute URLs
- Delete factory.go; the Factory belongs in the example server (per-queue
  wiring is an application-level concern, not an extension concern)
Replace the channel-based async machinery (triggerCh, cancelCh, background
worker goroutine, retry loop, submitFailures map) with direct synchronous
Buildkite API calls in Trigger and Cancel. Errors propagate to the caller so
the queue consumer can nack and retry through the normal path.

Config is simplified to QueueName only; the channel buffer sizes, submit
timeout, max attempts, and backoff fields are removed. Tests drop the
drainTrigger/drainCancel helpers and async-specific cases in favour of
straightforward call-and-assert patterns.
Comment thread submitqueue/extension/buildrunner/buildkite/config.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/buildkite.go Outdated
Comment thread submitqueue/extension/buildrunner/buildkite/buildkite.go Outdated
Comment on lines +89 to +91
if httpClient == nil {
httpClient = http.DefaultClient
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we fail given we expect transports to be wired up and be passed and if nil, there is no point is continueing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants