|
| 1 | +# Build Runner |
| 2 | + |
| 3 | +Design notes for the `BuildRunner` extension that lets the orchestrator |
| 4 | +drive external Build Runners from the build stage of the workflow pipeline. |
| 5 | + |
| 6 | +This document captures **design decisions and rationale only**. |
| 7 | + |
| 8 | +## Problem |
| 9 | + |
| 10 | +The build stage needs a vendor-agnostic abstraction for talking to a |
| 11 | +Build Runner — one that fits the existing extension family (storage, |
| 12 | +queue, conflict) and that does not have to change when the first real |
| 13 | +backend lands. |
| 14 | + |
| 15 | +## Interface |
| 16 | + |
| 17 | +```go |
| 18 | +type BuildRunner interface { |
| 19 | + Trigger( |
| 20 | + ctx context.Context, |
| 21 | + queueName string, |
| 22 | + base []entity.Change, |
| 23 | + head []entity.Change, |
| 24 | + metadata entity.BuildMetadata, |
| 25 | + ) (buildID string, err error) |
| 26 | + |
| 27 | + Status( |
| 28 | + ctx context.Context, |
| 29 | + buildID string, |
| 30 | + ) (entity.BuildStatus, entity.BuildMetadata, error) |
| 31 | + |
| 32 | + Cancel(ctx context.Context, buildID string) error |
| 33 | +} |
| 34 | +``` |
| 35 | + |
| 36 | +### Trigger: base + head |
| 37 | + |
| 38 | +`Trigger` takes two ordered lists of changes and a free-form metadata map: |
| 39 | + |
| 40 | +- **`base`** — changes from the dependency batches (assumed-good prefix). |
| 41 | + Ordered. |
| 42 | +- **`head`** — changes from the batch being verified. Ordered. |
| 43 | +- **`metadata`** — caller-supplied attributes (requester, ticket ID, trace |
| 44 | + ID, etc.) the provider MAY persist or echo back via `Status`. Schema is |
| 45 | + caller/provider-defined; the interface treats it as opaque. `nil` is |
| 46 | + equivalent to an empty map. |
| 47 | + |
| 48 | +The provider applies `base` then `head` in order on top of the queue's |
| 49 | +target branch and validates the resulting tree. Validation is **implicit |
| 50 | +and holistic**: it is not a per-change action, it is what the provider does |
| 51 | +after applying everything. |
| 52 | + |
| 53 | +Why split base and head: |
| 54 | + |
| 55 | +- The orchestrator's internal model already distinguishes them — a |
| 56 | + speculation path has a head batch and a list of base batches assumed to |
| 57 | + pass. |
| 58 | +- Lets a provider cache or short-circuit the base when it has validated |
| 59 | + the same prefix before — a hot path for stacked-batch speculation. |
| 60 | +- Lets the provider attribute terminal failure to base vs head in |
| 61 | + `BuildMetadata` without the orchestrator having to round-trip the split |
| 62 | + itself. |
| 63 | + |
| 64 | +Rejected: a single flat `changes []entity.Change`. Provider would have to |
| 65 | +deduce base via prefix matching and could not distinguish "base broke" |
| 66 | +from "head broke" without out-of-band hints. Loses the orchestrator's |
| 67 | +clearest piece of structural information at the boundary for no gain. |
| 68 | + |
| 69 | +Rejected: list-of-lists, one slice per batch. Pushes batch structure |
| 70 | +across the boundary, which the provider does not care about. The provider |
| 71 | +thinks in terms of "stuff to apply before" and "stuff to validate" — base |
| 72 | +/ head matches that mental model. Batches are an orchestrator concept. |
| 73 | + |
| 74 | +### Async vs sync contract |
| 75 | + |
| 76 | +| Verb | Must return promptly? | Notes | |
| 77 | +|---|---|---| |
| 78 | +| `Trigger` | yes | provider-side work is asynchronous | |
| 79 | +| `Cancel` | yes | returns once the request reaches the provider, **not** once the build stops | |
| 80 | +| `Status` | no | a provider round trip is typical | |
| 81 | + |
| 82 | +This keeps the orchestrator's queue loops snappy while admitting that |
| 83 | +reading state is remote. |
| 84 | + |
| 85 | +### Status delivery: polling, as queue traffic |
| 86 | + |
| 87 | +The interface is pull-only — the only way to learn a build's current |
| 88 | +state is to call `Status`. But polling is not a tight loop. It runs as |
| 89 | +queue traffic, with each `buildID` as its own partition. |
| 90 | + |
| 91 | +A status-loop consumer receives a `buildID` message, calls `Status`, |
| 92 | +and either: |
| 93 | + |
| 94 | +- **terminal** → forwards downstream, or |
| 95 | +- **non-terminal** → re-publishes to its own queue with a delay. |
| 96 | + |
| 97 | +This makes polling behave like everything else in the orchestrator: |
| 98 | + |
| 99 | +- **Independent partitions** — slow polls on one build don't block |
| 100 | + others. |
| 101 | +- **Restart-safe** — pending polls live in the queue, not in memory. |
| 102 | +- **Retry-native** — a `Status` call that errors out is `Nack`'d and |
| 103 | + redelivered with the queue's normal backoff, separate from polling |
| 104 | + cadence. |
| 105 | +- **Tunable cadence** — re-publish delay can vary by status (longer for |
| 106 | + `Accepted`, shorter for `Running`). |
| 107 | + |
| 108 | +### Polling primitive: `PublishAfter`, not `Nack` |
| 109 | + |
| 110 | +Postponing the next poll needs a "publish-with-delay" verb. Two |
| 111 | +candidates exist in or near the queue extension: |
| 112 | + |
| 113 | +- **`Publisher.PublishAfter(topic, msg, delayMs)`** — a new primitive. |
| 114 | + A fresh message, made visible only after `delayMs`. The SQL-backed |
| 115 | + queue already has the column needed (`invisible_until`); |
| 116 | + `PublishAfter` is `Publish` with a non-zero delay. |
| 117 | +- **`Delivery.Nack(requeueAfterMs)`** — the existing primitive. Re-uses |
| 118 | + the same message, sets it invisible until `now + delay`, increments |
| 119 | + `retry_count`. |
| 120 | + |
| 121 | +Both deliver the same surface behaviour: one message per build at a |
| 122 | +time, redelivered after the chosen delay. The difference is what |
| 123 | +`retry_count` means. |
| 124 | + |
| 125 | +`Nack` is "this delivery failed, try again," and `retry_count` feeds |
| 126 | +`MaxAttempts` and DLQ. Using it for "build not yet done" overloads that |
| 127 | +counter — every poll bumps a number that is supposed to flag problems. |
| 128 | + |
| 129 | +`PublishAfter` is "postpone this work." Each poll cycle is a fresh |
| 130 | +message with `retry_count = 0`. `Nack` stays available for true |
| 131 | +`Status` failures with its normal bounded-retry-then-DLQ behaviour. |
| 132 | +The two signals stay separate. |
| 133 | + |
| 134 | +**Why not `Nack` with `MaxAttempts = ∞`** (one message per build, just |
| 135 | +keep cycling)? The mechanism works. Three things break: |
| 136 | + |
| 137 | +- **No DLQ escape valve.** A malformed `buildID`, or a build the |
| 138 | + provider has lost, fails `Status` every call. With unbounded retries |
| 139 | + the message spins forever; the operator gets no signal that something |
| 140 | + is permanently wrong. DLQ exists for exactly this case; opting out |
| 141 | + for the buildsignal subscription means opting out of every |
| 142 | + poison-message signal it offers. |
| 143 | +- **Conflated metric.** `retry_count` is the obvious dashboard signal |
| 144 | + for "this consumer is having trouble." With infinite-retry polling, a |
| 145 | + `retry_count` of 500 might mean "build has been running 30 minutes" |
| 146 | + *or* "Status has errored 500 times" — operationally indistinguishable. |
| 147 | +- **Visibility-timeout coupling.** If the consumer crashes mid-poll |
| 148 | + before its `Nack`, the queue's visibility timeout redelivers the |
| 149 | + message and bumps `retry_count`. One number ends up counting |
| 150 | + legitimate polls, real errors, *and* consumer crashes — three signals |
| 151 | + fused. |
| 152 | + |
| 153 | +`PublishAfter` costs one new queue primitive. It buys back the queue's |
| 154 | +diagnostic semantics. |
| 155 | + |
| 156 | +Trade-off acknowledged: `PublishAfter` writes more — Ack deletes the |
| 157 | +old message, PublishAfter inserts a new one — vs `Nack` updating one |
| 158 | +row in place. At minute cadence the difference is noise; at second |
| 159 | +cadence it is real but small. |
| 160 | + |
| 161 | +### Push, when a backend supports it |
| 162 | + |
| 163 | +A webhook-capable backend publishes a status message into the same |
| 164 | +queue, keyed by the same `buildID`. The consumer cannot tell whether |
| 165 | +a message came from a poll or a push — both shapes are identical, both |
| 166 | +partition the same way. |
| 167 | + |
| 168 | +This keeps the `BuildRunner` contract pull-only: push is implemented |
| 169 | +at the queue boundary, not on the interface. A backend without webhooks |
| 170 | +needs zero extra code; a backend with webhooks needs only a webhook |
| 171 | +receiver that publishes into the existing queue. |
| 172 | + |
| 173 | +Rejected: a push method on `BuildRunner` (e.g. `Subscribe`). Forces |
| 174 | +every implementation to either expose a real push primitive or fake one, |
| 175 | +and gives the orchestrator two parallel update paths. Pushing into the |
| 176 | +queue collapses both paths. |
| 177 | + |
| 178 | +Rejected: long-polling on `Status`. Not every backend supports efficient |
| 179 | +server-side blocking; making it part of the contract forces backends to |
| 180 | +fake it. Same latency, more interface complexity. |
| 181 | + |
| 182 | +### Lifecycle |
| 183 | + |
| 184 | +Implementations are long-lived singletons bound to provider config at |
| 185 | +construction. Every method is concurrent-safe; connection pools and caches |
| 186 | +live inside the manager; anything that must survive a restart belongs in |
| 187 | +persistent storage, not the manager. |
| 188 | + |
| 189 | +### Transient failures |
| 190 | + |
| 191 | +The manager's problem, not the caller's. Reconnect and retry-with-backoff |
| 192 | +internally; during the recovery window return plain errors rather than |
| 193 | +block indefinitely. |
| 194 | + |
| 195 | +### Error classification |
| 196 | + |
| 197 | +Methods return plain errors. The controller classifies errors as user vs |
| 198 | +infra and retryable vs not — the manager does not. It MAY mark errors as |
| 199 | +retryable when it knows a failure is transient. Domain sentinels (e.g. |
| 200 | +"build not found") land with the first implementation that needs them, |
| 201 | +not before. |
| 202 | + |
| 203 | +## `BuildStatus` |
| 204 | + |
| 205 | +``` |
| 206 | +{Unknown, Accepted, Running, Succeeded, Failed, Cancelled} |
| 207 | +``` |
| 208 | + |
| 209 | +A deliberately narrow set. Every backend must collapse its native |
| 210 | +lifecycle into one of these. |
| 211 | + |
| 212 | +`Accepted` covers any pre-execution state — submitted, queued for a |
| 213 | +worker, waiting on capacity. The orchestrator cannot act differently on |
| 214 | +"submitted" vs "queued", so collapsing them removes a distinction nobody |
| 215 | +consumes. `Running` is a separate first-class non-terminal state because |
| 216 | +"in a queue" and "burning compute right now" are operationally different: |
| 217 | +cancellation cost differs, and future speculation decisions can use the |
| 218 | +signal. |
| 219 | + |
| 220 | +`Succeeded` is the common terminal-success name across Build Runners. |
| 221 | +`Failed` covers terminal failure of any cause — including |
| 222 | +runner-initiated terminations like timeouts, OOM, or worker crashes, |
| 223 | +which are real failures the orchestrator must see and react to. |
| 224 | +`Cancelled` is the terminal state for a build that was cancelled. |
| 225 | + |
| 226 | +The set excludes `Blocked` — a wait on an upstream gate is the |
| 227 | +orchestrator's concept, owned by speculation. Folding it into build |
| 228 | +status would conflate two systems and invite a controller branch that |
| 229 | +should not exist. |
| 230 | + |
| 231 | +`IsTerminal` is a predicate on the type: `Succeeded`, `Failed`, |
| 232 | +`Cancelled` are terminal. Living on the enum prevents callers from |
| 233 | +reimplementing the list and drifting. |
| 234 | + |
| 235 | +## `BuildMetadata` |
| 236 | + |
| 237 | +`BuildMetadata` is a free-form `map[string]string` from `Status`. Every |
| 238 | +backend exposes different metadata (URL, duration, SHA, runner ID, |
| 239 | +base-vs-head failure attribution); the orchestrator's job is to |
| 240 | +round-trip it to users, not interpret it. |
0 commit comments