Skip to content

Commit b96ced6

Browse files
committed
feat(buildrunner): add vendor-agnostic BuildRunner interface + RFC
## Summary ### Why? The build stage needs a vendor-agnostic abstraction for talking to a Build Runner — one that fits the existing extension family (storage, queue, conflict) and that does not have to change when the first real backend lands. Design rationale lives in `doc/rfc/build-runner.md`. ### What? Introduces the `BuildRunner` contract: ```go Trigger(ctx, queueName, base, head []entity.Change, metadata entity.BuildMetadata) (buildID, error) Status(ctx, buildID) (entity.BuildStatus, entity.BuildMetadata, error) Cancel(ctx, buildID) error ``` Highlights: - `Trigger` takes ordered `base` (assumed-good prefix from dependency batches) and `head` (changes being verified) — a runner can cache or short-circuit a prefix it has validated before, and attribute terminal failures to base vs head via `BuildMetadata`. - `metadata` is a free-form `map[string]string` for caller-supplied attributes (requester, ticket ID, trace ID) the runner MAY persist or echo back via `Status`. - `Trigger` and `Cancel` return promptly; `Status` MAY be lengthy. - The `BuildStatus` enum is narrowed to `Accepted` / `Running` / `Succeeded` / `Failed` / `Cancelled` so every backend can map its native lifecycle without leaking runner-specific stages. - `BuildMetadata` is added as a free-form map for round-tripping runner-defined attributes (build URL, duration, SHA, etc.). Naming: the interface is `BuildRunner` (not `BuildManager` — the "Manager" suffix doesn't describe what it does, and "Build" alone would collide cognitively with `entity.Build`). The package is `extension/buildrunner` to follow the repo convention used by `mergechecker`, `changeprovider`, `pusher`, `counter`, and `storage`. Implementation (noop backend, queue `PublishAfter` primitive, controller wiring, and the poll-driven `buildsignal` loop) lands in the follow-up PRs stacked on top of this one. ## Test Plan - ✅ `make test` — entity/build tests pass; mock-only consumers of the interface compile and pass. - ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean. - ✅ `make build` — all targets compile.
1 parent b4ad71e commit b96ced6

12 files changed

Lines changed: 486 additions & 35 deletions

File tree

Makefile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -285,7 +285,7 @@ local-stovepipe-gateway-start: build-stovepipe-gateway-linux ## Start Stovepipe
285285

286286
mocks: ## Generate mock files using mockgen
287287
@echo "Generating mocks..."
288-
@$(BAZEL) run @rules_go//go -- generate ./extension/storage/... ./extension/changestore/... ./extension/counter/... ./extension/queue/... ./extension/queueconfig/... ./extension/mergechecker/... ./extension/pusher/... ./extension/scorer/... ./extension/conflict/... ./core/consumer/...
288+
@$(BAZEL) run @rules_go//go -- generate ./extension/storage/... ./extension/buildrunner/... ./extension/changestore/... ./extension/counter/... ./extension/queue/... ./extension/queueconfig/... ./extension/mergechecker/... ./extension/pusher/... ./extension/scorer/... ./extension/conflict/... ./core/consumer/...
289289
@echo "Mocks generated successfully!"
290290

291291
proto: ## Generate protobuf files from .proto definitions

doc/rfc/build-runner.md

Lines changed: 240 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,240 @@
1+
# Build Runner
2+
3+
Design notes for the `BuildRunner` extension that lets the orchestrator
4+
drive external Build Runners from the build stage of the workflow pipeline.
5+
6+
This document captures **design decisions and rationale only**.
7+
8+
## Problem
9+
10+
The build stage needs a vendor-agnostic abstraction for talking to a
11+
Build Runner — one that fits the existing extension family (storage,
12+
queue, conflict) and that does not have to change when the first real
13+
backend lands.
14+
15+
## Interface
16+
17+
```go
18+
type BuildRunner interface {
19+
Trigger(
20+
ctx context.Context,
21+
queueName string,
22+
base []entity.Change,
23+
head []entity.Change,
24+
metadata entity.BuildMetadata,
25+
) (buildID string, err error)
26+
27+
Status(
28+
ctx context.Context,
29+
buildID string,
30+
) (entity.BuildStatus, entity.BuildMetadata, error)
31+
32+
Cancel(ctx context.Context, buildID string) error
33+
}
34+
```
35+
36+
### Trigger: base + head
37+
38+
`Trigger` takes two ordered lists of changes and a free-form metadata map:
39+
40+
- **`base`** — changes from the dependency batches (assumed-good prefix).
41+
Ordered.
42+
- **`head`** — changes from the batch being verified. Ordered.
43+
- **`metadata`** — caller-supplied attributes (requester, ticket ID, trace
44+
ID, etc.) the provider MAY persist or echo back via `Status`. Schema is
45+
caller/provider-defined; the interface treats it as opaque. `nil` is
46+
equivalent to an empty map.
47+
48+
The provider applies `base` then `head` in order on top of the queue's
49+
target branch and validates the resulting tree. Validation is **implicit
50+
and holistic**: it is not a per-change action, it is what the provider does
51+
after applying everything.
52+
53+
Why split base and head:
54+
55+
- The orchestrator's internal model already distinguishes them — a
56+
speculation path has a head batch and a list of base batches assumed to
57+
pass.
58+
- Lets a provider cache or short-circuit the base when it has validated
59+
the same prefix before — a hot path for stacked-batch speculation.
60+
- Lets the provider attribute terminal failure to base vs head in
61+
`BuildMetadata` without the orchestrator having to round-trip the split
62+
itself.
63+
64+
Rejected: a single flat `changes []entity.Change`. Provider would have to
65+
deduce base via prefix matching and could not distinguish "base broke"
66+
from "head broke" without out-of-band hints. Loses the orchestrator's
67+
clearest piece of structural information at the boundary for no gain.
68+
69+
Rejected: list-of-lists, one slice per batch. Pushes batch structure
70+
across the boundary, which the provider does not care about. The provider
71+
thinks in terms of "stuff to apply before" and "stuff to validate" — base
72+
/ head matches that mental model. Batches are an orchestrator concept.
73+
74+
### Async vs sync contract
75+
76+
| Verb | Must return promptly? | Notes |
77+
|---|---|---|
78+
| `Trigger` | yes | provider-side work is asynchronous |
79+
| `Cancel` | yes | returns once the request reaches the provider, **not** once the build stops |
80+
| `Status` | no | a provider round trip is typical |
81+
82+
This keeps the orchestrator's queue loops snappy while admitting that
83+
reading state is remote.
84+
85+
### Status delivery: polling, as queue traffic
86+
87+
The interface is pull-only — the only way to learn a build's current
88+
state is to call `Status`. But polling is not a tight loop. It runs as
89+
queue traffic, with each `buildID` as its own partition.
90+
91+
A status-loop consumer receives a `buildID` message, calls `Status`,
92+
and either:
93+
94+
- **terminal** → forwards downstream, or
95+
- **non-terminal** → re-publishes to its own queue with a delay.
96+
97+
This makes polling behave like everything else in the orchestrator:
98+
99+
- **Independent partitions** — slow polls on one build don't block
100+
others.
101+
- **Restart-safe** — pending polls live in the queue, not in memory.
102+
- **Retry-native** — a `Status` call that errors out is `Nack`'d and
103+
redelivered with the queue's normal backoff, separate from polling
104+
cadence.
105+
- **Tunable cadence** — re-publish delay can vary by status (longer for
106+
`Accepted`, shorter for `Running`).
107+
108+
### Polling primitive: `PublishAfter`, not `Nack`
109+
110+
Postponing the next poll needs a "publish-with-delay" verb. Two
111+
candidates exist in or near the queue extension:
112+
113+
- **`Publisher.PublishAfter(topic, msg, delayMs)`** — a new primitive.
114+
A fresh message, made visible only after `delayMs`. The SQL-backed
115+
queue already has the column needed (`invisible_until`);
116+
`PublishAfter` is `Publish` with a non-zero delay.
117+
- **`Delivery.Nack(requeueAfterMs)`** — the existing primitive. Re-uses
118+
the same message, sets it invisible until `now + delay`, increments
119+
`retry_count`.
120+
121+
Both deliver the same surface behaviour: one message per build at a
122+
time, redelivered after the chosen delay. The difference is what
123+
`retry_count` means.
124+
125+
`Nack` is "this delivery failed, try again," and `retry_count` feeds
126+
`MaxAttempts` and DLQ. Using it for "build not yet done" overloads that
127+
counter — every poll bumps a number that is supposed to flag problems.
128+
129+
`PublishAfter` is "postpone this work." Each poll cycle is a fresh
130+
message with `retry_count = 0`. `Nack` stays available for true
131+
`Status` failures with its normal bounded-retry-then-DLQ behaviour.
132+
The two signals stay separate.
133+
134+
**Why not `Nack` with `MaxAttempts = ∞`** (one message per build, just
135+
keep cycling)? The mechanism works. Three things break:
136+
137+
- **No DLQ escape valve.** A malformed `buildID`, or a build the
138+
provider has lost, fails `Status` every call. With unbounded retries
139+
the message spins forever; the operator gets no signal that something
140+
is permanently wrong. DLQ exists for exactly this case; opting out
141+
for the buildsignal subscription means opting out of every
142+
poison-message signal it offers.
143+
- **Conflated metric.** `retry_count` is the obvious dashboard signal
144+
for "this consumer is having trouble." With infinite-retry polling, a
145+
`retry_count` of 500 might mean "build has been running 30 minutes"
146+
*or* "Status has errored 500 times" — operationally indistinguishable.
147+
- **Visibility-timeout coupling.** If the consumer crashes mid-poll
148+
before its `Nack`, the queue's visibility timeout redelivers the
149+
message and bumps `retry_count`. One number ends up counting
150+
legitimate polls, real errors, *and* consumer crashes — three signals
151+
fused.
152+
153+
`PublishAfter` costs one new queue primitive. It buys back the queue's
154+
diagnostic semantics.
155+
156+
Trade-off acknowledged: `PublishAfter` writes more — Ack deletes the
157+
old message, PublishAfter inserts a new one — vs `Nack` updating one
158+
row in place. At minute cadence the difference is noise; at second
159+
cadence it is real but small.
160+
161+
### Push, when a backend supports it
162+
163+
A webhook-capable backend publishes a status message into the same
164+
queue, keyed by the same `buildID`. The consumer cannot tell whether
165+
a message came from a poll or a push — both shapes are identical, both
166+
partition the same way.
167+
168+
This keeps the `BuildRunner` contract pull-only: push is implemented
169+
at the queue boundary, not on the interface. A backend without webhooks
170+
needs zero extra code; a backend with webhooks needs only a webhook
171+
receiver that publishes into the existing queue.
172+
173+
Rejected: a push method on `BuildRunner` (e.g. `Subscribe`). Forces
174+
every implementation to either expose a real push primitive or fake one,
175+
and gives the orchestrator two parallel update paths. Pushing into the
176+
queue collapses both paths.
177+
178+
Rejected: long-polling on `Status`. Not every backend supports efficient
179+
server-side blocking; making it part of the contract forces backends to
180+
fake it. Same latency, more interface complexity.
181+
182+
### Lifecycle
183+
184+
Implementations are long-lived singletons bound to provider config at
185+
construction. Every method is concurrent-safe; connection pools and caches
186+
live inside the manager; anything that must survive a restart belongs in
187+
persistent storage, not the manager.
188+
189+
### Transient failures
190+
191+
The manager's problem, not the caller's. Reconnect and retry-with-backoff
192+
internally; during the recovery window return plain errors rather than
193+
block indefinitely.
194+
195+
### Error classification
196+
197+
Methods return plain errors. The controller classifies errors as user vs
198+
infra and retryable vs not — the manager does not. It MAY mark errors as
199+
retryable when it knows a failure is transient. Domain sentinels (e.g.
200+
"build not found") land with the first implementation that needs them,
201+
not before.
202+
203+
## `BuildStatus`
204+
205+
```
206+
{Unknown, Accepted, Running, Succeeded, Failed, Cancelled}
207+
```
208+
209+
A deliberately narrow set. Every backend must collapse its native
210+
lifecycle into one of these.
211+
212+
`Accepted` covers any pre-execution state — submitted, queued for a
213+
worker, waiting on capacity. The orchestrator cannot act differently on
214+
"submitted" vs "queued", so collapsing them removes a distinction nobody
215+
consumes. `Running` is a separate first-class non-terminal state because
216+
"in a queue" and "burning compute right now" are operationally different:
217+
cancellation cost differs, and future speculation decisions can use the
218+
signal.
219+
220+
`Succeeded` is the common terminal-success name across Build Runners.
221+
`Failed` covers terminal failure of any cause — including
222+
runner-initiated terminations like timeouts, OOM, or worker crashes,
223+
which are real failures the orchestrator must see and react to.
224+
`Cancelled` is the terminal state for a build that was cancelled.
225+
226+
The set excludes `Blocked` — a wait on an upstream gate is the
227+
orchestrator's concept, owned by speculation. Folding it into build
228+
status would conflate two systems and invite a controller branch that
229+
should not exist.
230+
231+
`IsTerminal` is a predicate on the type: `Succeeded`, `Failed`,
232+
`Cancelled` are terminal. Living on the enum prevents callers from
233+
reimplementing the list and drifting.
234+
235+
## `BuildMetadata`
236+
237+
`BuildMetadata` is a free-form `map[string]string` from `Status`. Every
238+
backend exposes different metadata (URL, duration, SHA, runner ID,
239+
base-vs-head failure attribution); the orchestrator's job is to
240+
round-trip it to users, not interpret it.

doc/rfc/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,3 +6,4 @@ Design documents and technical proposals for SubmitQueue.
66

77
- [SQL-Based Distributed Queue](sql-queue-rfc.md) - MySQL-based distributed message queue with partition leasing and at-least-once delivery
88
- [Orchestrator Workflow](workflow.md) - Queue-driven controller pipeline from gateway entry through batching, scoring, build, merge, and conclude
9+
- [Build Runner](build-runner.md) - Vendor-agnostic BuildRunner interface, provider-neutral BuildStatus lifecycle, and how the orchestrator wires it into the build stage

entity/build.go

Lines changed: 21 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -16,41 +16,41 @@ package entity
1616

1717
import "encoding/json"
1818

19-
// BuildStatus defines the possible states of a build.
19+
// BuildStatus defines the possible states of a build. The set is
20+
// intentionally narrow: every supported build provider must be able to map
21+
// its native lifecycle into one of these values without leaking
22+
// provider-specific stages.
2023
type BuildStatus string
2124

2225
const (
23-
// BuildStatusUnknown is the unreachable state. It is set by default when the structure is initialized. It should never be seen in the system.
26+
// BuildStatusUnknown is the unreachable zero value, set by default when
27+
// the structure is initialized. It should never be seen in the system.
2428
BuildStatusUnknown BuildStatus = ""
2529

26-
// BuildStatusQueued indicates the build has been scheduled but not yet started.
27-
BuildStatusQueued BuildStatus = "queued"
30+
// BuildStatusAccepted indicates the build has been accepted for
31+
// execution.
32+
BuildStatusAccepted BuildStatus = "accepted"
2833

2934
// BuildStatusRunning indicates the build is currently executing.
3035
BuildStatusRunning BuildStatus = "running"
3136

32-
// BuildStatusPassed indicates the build completed successfully.
37+
// BuildStatusSucceeded indicates the build completed successfully.
3338
// This is a terminal state.
34-
BuildStatusPassed BuildStatus = "passed"
39+
BuildStatusSucceeded BuildStatus = "succeeded"
3540

36-
// BuildStatusFailed indicates the build completed with failures.
41+
// BuildStatusFailed indicates the build did not complete successfully.
3742
// This is a terminal state.
3843
BuildStatusFailed BuildStatus = "failed"
3944

40-
// BuildStatusCancelled indicates the build was cancelled before completion.
45+
// BuildStatusCancelled indicates the build was cancelled.
4146
// This is a terminal state.
4247
BuildStatusCancelled BuildStatus = "cancelled"
43-
44-
// BuildStatusBlocked indicates the build is waiting for manual approval or unblocking.
45-
// Some CI systems (like BuildKite) support manual approval steps.
46-
BuildStatusBlocked BuildStatus = "blocked"
4748
)
4849

49-
// IsTerminal returns true if the build state represents a final state (passed, failed, or cancelled).
50-
// Terminal states indicate the build has finished and will not change state again.
51-
// Note: BuildStatusBlocked is NOT terminal as blocked builds can be unblocked and continue execution.
50+
// IsTerminal returns true if the status represents a final state
51+
// (Succeeded, Failed, or Cancelled).
5252
func (s BuildStatus) IsTerminal() bool {
53-
return s == BuildStatusPassed || s == BuildStatusFailed || s == BuildStatusCancelled
53+
return s == BuildStatusSucceeded || s == BuildStatusFailed || s == BuildStatusCancelled
5454
}
5555

5656
// SpeculationPathInfo represents the base and head commits of a speculation path used in a build.
@@ -88,3 +88,8 @@ func BuildFromBytes(data []byte) (Build, error) {
8888
err := json.Unmarshal(data, &build)
8989
return build, err
9090
}
91+
92+
// BuildMetadata carries provider-defined free-form metadata about a build
93+
// (e.g. build URL, duration, commit SHA). Keys and values are
94+
// implementation-defined; callers should not assume any particular schema.
95+
type BuildMetadata map[string]string

0 commit comments

Comments
 (0)