Skip to content

Commit ab8796d

Browse files
authored
feat(buildrunner): add vendor-agnostic BuildRunner interface + RFC (#175)
## Summary ### Why? The build stage needs a vendor-agnostic abstraction for talking to a Build Runner — one that fits the existing extension family (storage, queue, conflict) and that does not have to change when the first real backend lands. Design rationale lives in `doc/rfc/build-runner.md`. ### What? Introduces the `BuildRunner` contract — three verbs, all keyed by an `entity.BuildID`: `Trigger` submits a build (ordered `base` + `head` change sets plus a free-form metadata map) and returns its ID; `Status` fetches the current `BuildStatus` and runner-defined metadata; `Cancel` requests cancellation. See `extension/buildrunner/build_runner.go` for the signatures. Highlights: - `Trigger` takes ordered `base` (assumed-good prefix from dependency batches) and `head` (changes being verified) — a runner can cache or short-circuit a prefix it has validated before, and attribute terminal failures to base vs head via `BuildMetadata`. - Build identifiers are the typed `entity.BuildID` (a lightweight `{ID string}` value, also the queue payload envelope) rather than a bare `string`, so the runner contract and the on-queue messages share one identifier type and the compiler distinguishes a build ID from other string IDs. - `metadata` is a free-form `map[string]string` for caller-supplied attributes (requester, ticket ID, trace ID) the runner MAY persist or echo back via `Status`. - `Trigger` and `Cancel` return promptly; `Status` MAY be lengthy. - The `BuildStatus` enum is narrowed to `Accepted` / `Running` / `Succeeded` / `Failed` / `Cancelled` so every backend can map its native lifecycle without leaking runner-specific stages. - `BuildMetadata` is added as a free-form map for round-tripping runner-defined attributes (build URL, duration, SHA, etc.). Naming: the interface is `BuildRunner` (not `BuildManager` — the "Manager" suffix doesn't describe what it does, and "Build" alone would collide cognitively with `entity.Build`). The package is `extension/buildrunner` to follow the repo convention used by `mergechecker`, `changeprovider`, `pusher`, `counter`, and `storage`. Implementation (noop backend, queue `PublishAfter` primitive, controller wiring, and the poll-driven `buildsignal` loop) lands in the follow-up PRs stacked on top of this one. ## Test Plan - ✅ `make test` — entity/build tests pass; mock-only consumers of the interface compile and pass. - ✅ `make fmt lint check-tidy check-gazelle check-mocks` — clean. - ✅ `make build` — all targets compile. ## Stack 1. @ #175 1. #179 1. #176 1. #177
1 parent 4bc4aef commit ab8796d

13 files changed

Lines changed: 409 additions & 35 deletions

File tree

CLAUDE.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -154,6 +154,7 @@ Generated proto files are committed. When modifying `.proto` files:
154154
- **Files**: `{method}.go`, `{entity}.go`, `{file}_test.go`, `BUILD.bazel`
155155
- **Proto files**: `{service}.proto`
156156
- **README files**: Do not duplicate interface or type definitions as code blocks in READMEs. Describe behavior in prose and let readers navigate to the source. Only include code samples when explicitly instructed.
157+
- **Markdown prose width**: Do not hard-wrap prose in Markdown docs (RFCs under `doc/`, READMEs). Write one line per paragraph and one line per list item, and let the editor soft-wrap — hard wrapping at a fixed column renders as a narrow fixed-width column regardless of window size. Code blocks, tables, and ASCII diagrams keep their own line breaks.
157158
158159
### Makefile
159160

Makefile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -334,7 +334,7 @@ local-stovepipe-gateway-start: build-stovepipe-gateway-linux ## Start Stovepipe
334334

335335
mocks: ## Generate mock files using mockgen
336336
@echo "Generating mocks..."
337-
@$(BAZEL) run @rules_go//go -- generate ./extension/storage/... ./extension/changestore/... ./extension/counter/... ./extension/queue/... ./extension/queueconfig/... ./extension/mergechecker/... ./extension/pusher/... ./extension/scorer/... ./extension/conflict/... ./core/consumer/...
337+
@$(BAZEL) run @rules_go//go -- generate ./extension/storage/... ./extension/buildrunner/... ./extension/changestore/... ./extension/counter/... ./extension/queue/... ./extension/queueconfig/... ./extension/mergechecker/... ./extension/pusher/... ./extension/scorer/... ./extension/conflict/... ./core/consumer/...
338338
@echo "Mocks generated successfully!"
339339

340340
proto: ## Generate protobuf files from .proto definitions

doc/rfc/build-runner.md

Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,152 @@
1+
# Build Runner
2+
3+
Design notes for the `BuildRunner` extension that lets the orchestrator drive external Build Runners from the build stage of the workflow pipeline.
4+
5+
This document captures **design decisions and rationale only**.
6+
7+
## Problem
8+
9+
The build stage needs a vendor-agnostic abstraction for talking to a Build Runner — one that fits the existing extension family (storage, queue, conflict) and that does not have to change when the first real backend lands.
10+
11+
## Flow
12+
13+
`build` triggers the runner and hands the `buildID` to the `buildsignal` poll loop. The loop calls `Status` on its own partition per build until the build is terminal: terminal results wake the batch state machine via `speculate`; non-terminal results re-enqueue the same `buildID` after a delay (`PublishAfter`). A webhook-capable backend can publish a status message into the same queue — the consumer cannot tell a push from a poll.
14+
15+
```
16+
┌────────────────────────────────────────────────────┐
17+
│ build │
18+
│ Trigger(base, head) → buildID │
19+
│ persist Build{Accepted}, enqueue buildID │
20+
└──────────────────────────┬─────────────────────────┘
21+
│ buildID
22+
23+
┌────────────────────────────────────────────────────┐
24+
│ buildsignal (poll loop, one partition per build) │
25+
│ load Build, call Status(buildID) │
26+
└──────────────────────────┬─────────────────────────┘
27+
28+
┌──────────────────┴─────────────────┐
29+
▼ ▼
30+
┌───────────────┐ ┌──────────────────────────────────┐
31+
│ terminal │ │ non-terminal │
32+
│ → speculate │ │ → PublishAfter(buildID, delay) │
33+
│ re-evaluate │ │ re-enqueues to buildsignal │
34+
└───────────────┘ └──────────────────────────────────┘
35+
```
36+
37+
## Interface
38+
39+
`BuildRunner` exposes three verbs, all keyed by a build identifier (`entity.BuildID`):
40+
41+
- **`Trigger`** — submit a build for a queue, given the ordered `base` and `head` change sets plus a free-form metadata map; returns the new build's ID. Runner-side work is asynchronous.
42+
- **`Status`** — fetch the current `BuildStatus` and runner-defined metadata for a build; MAY round-trip to the runner.
43+
- **`Cancel`** — request cancellation; returns once the request reaches the runner, not once the build stops.
44+
45+
See `extension/buildrunner/build_runner.go` for the exact Go signatures. The sections below record why the contract is shaped this way.
46+
47+
### Trigger: base + head
48+
49+
`Trigger` takes two ordered lists of changes and a free-form metadata map:
50+
51+
- **`base`** — changes from the dependency batches (assumed-good prefix). Ordered.
52+
- **`head`** — changes from the batch being verified. Ordered.
53+
- **`metadata`** — caller-supplied attributes (requester, ticket ID, trace ID, etc.) the provider MAY persist or echo back via `Status`. Schema is caller/provider-defined; the interface treats it as opaque. `nil` is equivalent to an empty map.
54+
55+
The provider applies `base` then `head` in order on top of the queue's target branch and validates the resulting tree. Validation is **implicit and holistic**: it is not a per-change action, it is what the provider does after applying everything.
56+
57+
Why split base and head:
58+
59+
- The orchestrator's internal model already distinguishes them — a speculation path has a head batch and a list of base batches assumed to pass.
60+
- Lets a provider cache or short-circuit the base when it has validated the same prefix before — a hot path for stacked-batch speculation.
61+
- Lets the provider attribute terminal failure to base vs head in `BuildMetadata` without the orchestrator having to round-trip the split itself.
62+
63+
Rejected: a single flat `changes []entity.Change`. Provider would have to deduce base via prefix matching and could not distinguish "base broke" from "head broke" without out-of-band hints. Loses the orchestrator's clearest piece of structural information at the boundary for no gain.
64+
65+
Rejected: list-of-lists, one slice per batch. Pushes batch structure across the boundary, which the provider does not care about. The provider thinks in terms of "stuff to apply before" and "stuff to validate" — base / head matches that mental model. Batches are an orchestrator concept.
66+
67+
### Async vs sync contract
68+
69+
| Verb | Must return promptly? | Notes |
70+
|---|---|---|
71+
| `Trigger` | yes | provider-side work is asynchronous |
72+
| `Cancel` | yes | returns once the request reaches the provider, **not** once the build stops |
73+
| `Status` | no | a provider round trip is typical |
74+
75+
This keeps the orchestrator's queue loops snappy while admitting that reading state is remote.
76+
77+
### Status delivery: polling, as queue traffic
78+
79+
The interface is pull-only — the only way to learn a build's current state is to call `Status`. But polling is not a tight loop. It runs as queue traffic, with each `buildID` as its own partition.
80+
81+
A status-loop consumer receives a `buildID` message, calls `Status`, and either:
82+
83+
- **terminal** → forwards downstream, or
84+
- **non-terminal** → re-publishes to its own queue with a delay.
85+
86+
This makes polling behave like everything else in the orchestrator:
87+
88+
- **Independent partitions** — slow polls on one build don't block others.
89+
- **Restart-safe** — pending polls live in the queue, not in memory.
90+
- **Retry-native** — a `Status` call that errors out is `Nack`'d and redelivered with the queue's normal backoff, separate from polling cadence.
91+
- **Tunable cadence** — re-publish delay can vary by status (longer for `Accepted`, shorter for `Running`).
92+
93+
### Polling primitive: `PublishAfter`, not `Nack`
94+
95+
Postponing the next poll needs a "publish-with-delay" verb. Two candidates exist in or near the queue extension:
96+
97+
- **`Publisher.PublishAfter(topic, msg, delayMs)`** — a new primitive. A fresh message, made visible only after `delayMs`. The SQL-backed queue already has the column needed (`invisible_until`); `PublishAfter` is `Publish` with a non-zero delay.
98+
- **`Delivery.Nack(requeueAfterMs)`** — the existing primitive. Re-uses the same message, sets it invisible until `now + delay`, increments `retry_count`.
99+
100+
Both deliver the same surface behaviour: one message per build at a time, redelivered after the chosen delay. The difference is what `retry_count` means.
101+
102+
`Nack` is "this delivery failed, try again," and `retry_count` feeds `MaxAttempts` and DLQ. Using it for "build not yet done" overloads that counter — every poll bumps a number that is supposed to flag problems.
103+
104+
`PublishAfter` is "postpone this work." Each poll cycle is a fresh message with `retry_count = 0`. `Nack` stays available for true `Status` failures with its normal bounded-retry-then-DLQ behaviour. The two signals stay separate.
105+
106+
**Why not `Nack` with `MaxAttempts = ∞`** (one message per build, just keep cycling)? The mechanism works. Three things break:
107+
108+
- **No DLQ escape valve.** A malformed `buildID`, or a build the provider has lost, fails `Status` every call. With unbounded retries the message spins forever; the operator gets no signal that something is permanently wrong. DLQ exists for exactly this case; opting out for the buildsignal subscription means opting out of every poison-message signal it offers.
109+
- **Conflated metric.** `retry_count` is the obvious dashboard signal for "this consumer is having trouble." With infinite-retry polling, a `retry_count` of 500 might mean "build has been running 30 minutes" *or* "Status has errored 500 times" — operationally indistinguishable.
110+
- **Visibility-timeout coupling.** If the consumer crashes mid-poll before its `Nack`, the queue's visibility timeout redelivers the message and bumps `retry_count`. One number ends up counting legitimate polls, real errors, *and* consumer crashes — three signals fused.
111+
112+
`PublishAfter` costs one new queue primitive. It buys back the queue's diagnostic semantics.
113+
114+
Trade-off acknowledged: `PublishAfter` writes more — Ack deletes the old message, PublishAfter inserts a new one — vs `Nack` updating one row in place. At minute cadence the difference is noise; at second cadence it is real but small.
115+
116+
### Push, when a backend supports it
117+
118+
A webhook-capable backend publishes a status message into the same queue, keyed by the same `buildID`. The consumer cannot tell whether a message came from a poll or a push — both shapes are identical, both partition the same way.
119+
120+
This keeps the `BuildRunner` contract pull-only: push is implemented at the queue boundary, not on the interface. A backend without webhooks needs zero extra code; a backend with webhooks needs only a webhook receiver that publishes into the existing queue.
121+
122+
Rejected: a push method on `BuildRunner` (e.g. `Subscribe`). Forces every implementation to either expose a real push primitive or fake one, and gives the orchestrator two parallel update paths. Pushing into the queue collapses both paths.
123+
124+
Rejected: long-polling on `Status`. Not every backend supports efficient server-side blocking; making it part of the contract forces backends to fake it. Same latency, more interface complexity.
125+
126+
### Lifecycle
127+
128+
Implementations are long-lived singletons bound to provider config at construction. Every method is concurrent-safe; connection pools and caches live inside the manager; anything that must survive a restart belongs in persistent storage, not the manager.
129+
130+
### Transient failures
131+
132+
The manager's problem, not the caller's. Reconnect and retry-with-backoff internally; during the recovery window return plain errors rather than block indefinitely.
133+
134+
### Error classification
135+
136+
Methods return plain errors. The controller classifies errors as user vs infra and retryable vs not — the manager does not. It MAY mark errors as retryable when it knows a failure is transient. Domain sentinels (e.g. "build not found") land with the first implementation that needs them, not before.
137+
138+
## `BuildStatus`
139+
140+
A deliberately narrow set — `Unknown`, `Accepted`, `Running`, `Succeeded`, `Failed`, `Cancelled`. Every backend must collapse its native lifecycle into one of these.
141+
142+
`Accepted` covers any pre-execution state — submitted, queued for a worker, waiting on capacity. The orchestrator cannot act differently on "submitted" vs "queued", so collapsing them removes a distinction nobody consumes. `Running` is a separate first-class non-terminal state because "in a queue" and "burning compute right now" are operationally different: cancellation cost differs, and future speculation decisions can use the signal.
143+
144+
`Succeeded` is the common terminal-success name across Build Runners. `Failed` covers terminal failure of any cause — including runner-initiated terminations like timeouts, OOM, or worker crashes, which are real failures the orchestrator must see and react to. `Cancelled` is the terminal state for a build that was cancelled.
145+
146+
The set excludes `Blocked` — a wait on an upstream gate is the orchestrator's concept, owned by speculation. Folding it into build status would conflate two systems and invite a controller branch that should not exist.
147+
148+
`IsTerminal` is a predicate on the type: `Succeeded`, `Failed`, `Cancelled` are terminal. Living on the enum prevents callers from reimplementing the list and drifting.
149+
150+
## `BuildMetadata`
151+
152+
`BuildMetadata` is a free-form `map[string]string` from `Status`. Every backend exposes different metadata (URL, duration, SHA, runner ID, base-vs-head failure attribution); the orchestrator's job is to round-trip it to users, not interpret it.

doc/rfc/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,3 +6,4 @@ Design documents and technical proposals for SubmitQueue.
66

77
- [SQL-Based Distributed Queue](sql-queue-rfc.md) - MySQL-based distributed message queue with partition leasing and at-least-once delivery
88
- [Orchestrator Workflow](workflow.md) - Queue-driven controller pipeline from gateway entry through batching, scoring, build, merge, and conclude
9+
- [Build Runner](build-runner.md) - Vendor-agnostic BuildRunner interface, provider-neutral BuildStatus lifecycle, and how the orchestrator wires it into the build stage

entity/build.go

Lines changed: 39 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -16,41 +16,41 @@ package entity
1616

1717
import "encoding/json"
1818

19-
// BuildStatus defines the possible states of a build.
19+
// BuildStatus defines the possible states of a build. The set is
20+
// intentionally narrow: every supported build provider must be able to map
21+
// its native lifecycle into one of these values without leaking
22+
// provider-specific stages.
2023
type BuildStatus string
2124

2225
const (
23-
// BuildStatusUnknown is the unreachable state. It is set by default when the structure is initialized. It should never be seen in the system.
26+
// BuildStatusUnknown is the unreachable zero value, set by default when
27+
// the structure is initialized. It should never be seen in the system.
2428
BuildStatusUnknown BuildStatus = ""
2529

26-
// BuildStatusQueued indicates the build has been scheduled but not yet started.
27-
BuildStatusQueued BuildStatus = "queued"
30+
// BuildStatusAccepted indicates the build has been accepted for
31+
// execution.
32+
BuildStatusAccepted BuildStatus = "accepted"
2833

2934
// BuildStatusRunning indicates the build is currently executing.
3035
BuildStatusRunning BuildStatus = "running"
3136

32-
// BuildStatusPassed indicates the build completed successfully.
37+
// BuildStatusSucceeded indicates the build completed successfully.
3338
// This is a terminal state.
34-
BuildStatusPassed BuildStatus = "passed"
39+
BuildStatusSucceeded BuildStatus = "succeeded"
3540

36-
// BuildStatusFailed indicates the build completed with failures.
41+
// BuildStatusFailed indicates the build did not complete successfully.
3742
// This is a terminal state.
3843
BuildStatusFailed BuildStatus = "failed"
3944

40-
// BuildStatusCancelled indicates the build was cancelled before completion.
45+
// BuildStatusCancelled indicates the build was cancelled.
4146
// This is a terminal state.
4247
BuildStatusCancelled BuildStatus = "cancelled"
43-
44-
// BuildStatusBlocked indicates the build is waiting for manual approval or unblocking.
45-
// Some CI systems (like BuildKite) support manual approval steps.
46-
BuildStatusBlocked BuildStatus = "blocked"
4748
)
4849

49-
// IsTerminal returns true if the build state represents a final state (passed, failed, or cancelled).
50-
// Terminal states indicate the build has finished and will not change state again.
51-
// Note: BuildStatusBlocked is NOT terminal as blocked builds can be unblocked and continue execution.
50+
// IsTerminal returns true if the status represents a final state
51+
// (Succeeded, Failed, or Cancelled).
5252
func (s BuildStatus) IsTerminal() bool {
53-
return s == BuildStatusPassed || s == BuildStatusFailed || s == BuildStatusCancelled
53+
return s == BuildStatusSucceeded || s == BuildStatusFailed || s == BuildStatusCancelled
5454
}
5555

5656
// SpeculationPathInfo represents the base and head commits of a speculation path used in a build.
@@ -88,3 +88,26 @@ func BuildFromBytes(data []byte) (Build, error) {
8888
err := json.Unmarshal(data, &build)
8989
return build, err
9090
}
91+
92+
// BuildID is a lightweight entity for publishing and consuming just the build identifier via the queue.
93+
type BuildID struct {
94+
// ID is the globally unique identifier for the build.
95+
ID string `json:"id"`
96+
}
97+
98+
// ToBytes serializes the BuildID to JSON bytes for queue message payload.
99+
func (b BuildID) ToBytes() ([]byte, error) {
100+
return json.Marshal(b)
101+
}
102+
103+
// BuildIDFromBytes deserializes a BuildID from JSON bytes.
104+
func BuildIDFromBytes(data []byte) (BuildID, error) {
105+
var bid BuildID
106+
err := json.Unmarshal(data, &bid)
107+
return bid, err
108+
}
109+
110+
// BuildMetadata carries provider-defined free-form metadata about a build
111+
// (e.g. build URL, duration, commit SHA). Keys and values are
112+
// implementation-defined; callers should not assume any particular schema.
113+
type BuildMetadata map[string]string

0 commit comments

Comments
 (0)