|
| 1 | +# Build Runner |
| 2 | + |
| 3 | +Design notes for the `BuildRunner` extension that lets the orchestrator drive external Build Runners from the build stage of the workflow pipeline. |
| 4 | + |
| 5 | +This document captures **design decisions and rationale only**. |
| 6 | + |
| 7 | +## Problem |
| 8 | + |
| 9 | +The build stage needs a vendor-agnostic abstraction for talking to a Build Runner — one that fits the existing extension family (storage, queue, conflict) and that does not have to change when the first real backend lands. |
| 10 | + |
| 11 | +## Flow |
| 12 | + |
| 13 | +`build` triggers the runner and hands the `buildID` to the `buildsignal` poll loop. The loop calls `Status` on its own partition per build until the build is terminal: terminal results wake the batch state machine via `speculate`; non-terminal results re-enqueue the same `buildID` after a delay (`PublishAfter`). A webhook-capable backend can publish a status message into the same queue — the consumer cannot tell a push from a poll. |
| 14 | + |
| 15 | +``` |
| 16 | + ┌────────────────────────────────────────────────────┐ |
| 17 | + │ build │ |
| 18 | + │ Trigger(base, head) → buildID │ |
| 19 | + │ persist Build{Accepted}, enqueue buildID │ |
| 20 | + └──────────────────────────┬─────────────────────────┘ |
| 21 | + │ buildID |
| 22 | + ▼ |
| 23 | + ┌────────────────────────────────────────────────────┐ |
| 24 | + │ buildsignal (poll loop, one partition per build) │ |
| 25 | + │ load Build, call Status(buildID) │ |
| 26 | + └──────────────────────────┬─────────────────────────┘ |
| 27 | + │ |
| 28 | + ┌──────────────────┴─────────────────┐ |
| 29 | + ▼ ▼ |
| 30 | + ┌───────────────┐ ┌──────────────────────────────────┐ |
| 31 | + │ terminal │ │ non-terminal │ |
| 32 | + │ → speculate │ │ → PublishAfter(buildID, delay) │ |
| 33 | + │ re-evaluate │ │ re-enqueues to buildsignal │ |
| 34 | + └───────────────┘ └──────────────────────────────────┘ |
| 35 | +``` |
| 36 | + |
| 37 | +## Interface |
| 38 | + |
| 39 | +`BuildRunner` exposes three verbs, all keyed by a build identifier (`entity.BuildID`): |
| 40 | + |
| 41 | +- **`Trigger`** — submit a build for a queue, given the ordered `base` and `head` change sets plus a free-form metadata map; returns the new build's ID. Runner-side work is asynchronous. |
| 42 | +- **`Status`** — fetch the current `BuildStatus` and runner-defined metadata for a build; MAY round-trip to the runner. |
| 43 | +- **`Cancel`** — request cancellation; returns once the request reaches the runner, not once the build stops. |
| 44 | + |
| 45 | +See `extension/buildrunner/build_runner.go` for the exact Go signatures. The sections below record why the contract is shaped this way. |
| 46 | + |
| 47 | +### Trigger: base + head |
| 48 | + |
| 49 | +`Trigger` takes two ordered lists of changes and a free-form metadata map: |
| 50 | + |
| 51 | +- **`base`** — changes from the dependency batches (assumed-good prefix). Ordered. |
| 52 | +- **`head`** — changes from the batch being verified. Ordered. |
| 53 | +- **`metadata`** — caller-supplied attributes (requester, ticket ID, trace ID, etc.) the provider MAY persist or echo back via `Status`. Schema is caller/provider-defined; the interface treats it as opaque. `nil` is equivalent to an empty map. |
| 54 | + |
| 55 | +The provider applies `base` then `head` in order on top of the queue's target branch and validates the resulting tree. Validation is **implicit and holistic**: it is not a per-change action, it is what the provider does after applying everything. |
| 56 | + |
| 57 | +Why split base and head: |
| 58 | + |
| 59 | +- The orchestrator's internal model already distinguishes them — a speculation path has a head batch and a list of base batches assumed to pass. |
| 60 | +- Lets a provider cache or short-circuit the base when it has validated the same prefix before — a hot path for stacked-batch speculation. |
| 61 | +- Lets the provider attribute terminal failure to base vs head in `BuildMetadata` without the orchestrator having to round-trip the split itself. |
| 62 | + |
| 63 | +Rejected: a single flat `changes []entity.Change`. Provider would have to deduce base via prefix matching and could not distinguish "base broke" from "head broke" without out-of-band hints. Loses the orchestrator's clearest piece of structural information at the boundary for no gain. |
| 64 | + |
| 65 | +Rejected: list-of-lists, one slice per batch. Pushes batch structure across the boundary, which the provider does not care about. The provider thinks in terms of "stuff to apply before" and "stuff to validate" — base / head matches that mental model. Batches are an orchestrator concept. |
| 66 | + |
| 67 | +### Async vs sync contract |
| 68 | + |
| 69 | +| Verb | Must return promptly? | Notes | |
| 70 | +|---|---|---| |
| 71 | +| `Trigger` | yes | provider-side work is asynchronous | |
| 72 | +| `Cancel` | yes | returns once the request reaches the provider, **not** once the build stops | |
| 73 | +| `Status` | no | a provider round trip is typical | |
| 74 | + |
| 75 | +This keeps the orchestrator's queue loops snappy while admitting that reading state is remote. |
| 76 | + |
| 77 | +### Status delivery: polling, as queue traffic |
| 78 | + |
| 79 | +The interface is pull-only — the only way to learn a build's current state is to call `Status`. But polling is not a tight loop. It runs as queue traffic, with each `buildID` as its own partition. |
| 80 | + |
| 81 | +A status-loop consumer receives a `buildID` message, calls `Status`, and either: |
| 82 | + |
| 83 | +- **terminal** → forwards downstream, or |
| 84 | +- **non-terminal** → re-publishes to its own queue with a delay. |
| 85 | + |
| 86 | +This makes polling behave like everything else in the orchestrator: |
| 87 | + |
| 88 | +- **Independent partitions** — slow polls on one build don't block others. |
| 89 | +- **Restart-safe** — pending polls live in the queue, not in memory. |
| 90 | +- **Retry-native** — a `Status` call that errors out is `Nack`'d and redelivered with the queue's normal backoff, separate from polling cadence. |
| 91 | +- **Tunable cadence** — re-publish delay can vary by status (longer for `Accepted`, shorter for `Running`). |
| 92 | + |
| 93 | +### Polling primitive: `PublishAfter`, not `Nack` |
| 94 | + |
| 95 | +Postponing the next poll needs a "publish-with-delay" verb. Two candidates exist in or near the queue extension: |
| 96 | + |
| 97 | +- **`Publisher.PublishAfter(topic, msg, delayMs)`** — a new primitive. A fresh message, made visible only after `delayMs`. The SQL-backed queue already has the column needed (`invisible_until`); `PublishAfter` is `Publish` with a non-zero delay. |
| 98 | +- **`Delivery.Nack(requeueAfterMs)`** — the existing primitive. Re-uses the same message, sets it invisible until `now + delay`, increments `retry_count`. |
| 99 | + |
| 100 | +Both deliver the same surface behaviour: one message per build at a time, redelivered after the chosen delay. The difference is what `retry_count` means. |
| 101 | + |
| 102 | +`Nack` is "this delivery failed, try again," and `retry_count` feeds `MaxAttempts` and DLQ. Using it for "build not yet done" overloads that counter — every poll bumps a number that is supposed to flag problems. |
| 103 | + |
| 104 | +`PublishAfter` is "postpone this work." Each poll cycle is a fresh message with `retry_count = 0`. `Nack` stays available for true `Status` failures with its normal bounded-retry-then-DLQ behaviour. The two signals stay separate. |
| 105 | + |
| 106 | +**Why not `Nack` with `MaxAttempts = ∞`** (one message per build, just keep cycling)? The mechanism works. Three things break: |
| 107 | + |
| 108 | +- **No DLQ escape valve.** A malformed `buildID`, or a build the provider has lost, fails `Status` every call. With unbounded retries the message spins forever; the operator gets no signal that something is permanently wrong. DLQ exists for exactly this case; opting out for the buildsignal subscription means opting out of every poison-message signal it offers. |
| 109 | +- **Conflated metric.** `retry_count` is the obvious dashboard signal for "this consumer is having trouble." With infinite-retry polling, a `retry_count` of 500 might mean "build has been running 30 minutes" *or* "Status has errored 500 times" — operationally indistinguishable. |
| 110 | +- **Visibility-timeout coupling.** If the consumer crashes mid-poll before its `Nack`, the queue's visibility timeout redelivers the message and bumps `retry_count`. One number ends up counting legitimate polls, real errors, *and* consumer crashes — three signals fused. |
| 111 | + |
| 112 | +`PublishAfter` costs one new queue primitive. It buys back the queue's diagnostic semantics. |
| 113 | + |
| 114 | +Trade-off acknowledged: `PublishAfter` writes more — Ack deletes the old message, PublishAfter inserts a new one — vs `Nack` updating one row in place. At minute cadence the difference is noise; at second cadence it is real but small. |
| 115 | + |
| 116 | +### Push, when a backend supports it |
| 117 | + |
| 118 | +A webhook-capable backend publishes a status message into the same queue, keyed by the same `buildID`. The consumer cannot tell whether a message came from a poll or a push — both shapes are identical, both partition the same way. |
| 119 | + |
| 120 | +This keeps the `BuildRunner` contract pull-only: push is implemented at the queue boundary, not on the interface. A backend without webhooks needs zero extra code; a backend with webhooks needs only a webhook receiver that publishes into the existing queue. |
| 121 | + |
| 122 | +Rejected: a push method on `BuildRunner` (e.g. `Subscribe`). Forces every implementation to either expose a real push primitive or fake one, and gives the orchestrator two parallel update paths. Pushing into the queue collapses both paths. |
| 123 | + |
| 124 | +Rejected: long-polling on `Status`. Not every backend supports efficient server-side blocking; making it part of the contract forces backends to fake it. Same latency, more interface complexity. |
| 125 | + |
| 126 | +### Lifecycle |
| 127 | + |
| 128 | +Implementations are long-lived singletons bound to provider config at construction. Every method is concurrent-safe; connection pools and caches live inside the manager; anything that must survive a restart belongs in persistent storage, not the manager. |
| 129 | + |
| 130 | +### Transient failures |
| 131 | + |
| 132 | +The manager's problem, not the caller's. Reconnect and retry-with-backoff internally; during the recovery window return plain errors rather than block indefinitely. |
| 133 | + |
| 134 | +### Error classification |
| 135 | + |
| 136 | +Methods return plain errors. The controller classifies errors as user vs infra and retryable vs not — the manager does not. It MAY mark errors as retryable when it knows a failure is transient. Domain sentinels (e.g. "build not found") land with the first implementation that needs them, not before. |
| 137 | + |
| 138 | +## `BuildStatus` |
| 139 | + |
| 140 | +A deliberately narrow set — `Unknown`, `Accepted`, `Running`, `Succeeded`, `Failed`, `Cancelled`. Every backend must collapse its native lifecycle into one of these. |
| 141 | + |
| 142 | +`Accepted` covers any pre-execution state — submitted, queued for a worker, waiting on capacity. The orchestrator cannot act differently on "submitted" vs "queued", so collapsing them removes a distinction nobody consumes. `Running` is a separate first-class non-terminal state because "in a queue" and "burning compute right now" are operationally different: cancellation cost differs, and future speculation decisions can use the signal. |
| 143 | + |
| 144 | +`Succeeded` is the common terminal-success name across Build Runners. `Failed` covers terminal failure of any cause — including runner-initiated terminations like timeouts, OOM, or worker crashes, which are real failures the orchestrator must see and react to. `Cancelled` is the terminal state for a build that was cancelled. |
| 145 | + |
| 146 | +The set excludes `Blocked` — a wait on an upstream gate is the orchestrator's concept, owned by speculation. Folding it into build status would conflate two systems and invite a controller branch that should not exist. |
| 147 | + |
| 148 | +`IsTerminal` is a predicate on the type: `Succeeded`, `Failed`, `Cancelled` are terminal. Living on the enum prevents callers from reimplementing the list and drifting. |
| 149 | + |
| 150 | +## `BuildMetadata` |
| 151 | + |
| 152 | +`BuildMetadata` is a free-form `map[string]string` from `Status`. Every backend exposes different metadata (URL, duration, SHA, runner ID, base-vs-head failure attribution); the orchestrator's job is to round-trip it to users, not interpret it. |
0 commit comments