You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
## Summary
- Adds a `submitqueue/orchestrator/controller/dlq/` package that drains
every primary pipeline topic's `{topic}_dlq` companion and transitions
the affected request or batch to a terminal failed state. Without this,
a request that exhausts its retry budget in the primary pipeline stays
stuck in a non-terminal state and the gateway keeps reporting it as "in
progress".
- Splits error-classification policy from consumer transport via a new
`errs.ErrorProcessor` interface with two implementations:
`NewClassifierProcessor` for primary pipeline consumers (framework-wrap
short-circuit + per-node classifier walk) and `AlwaysRetryableProcessor`
for DLQ consumers (every failure redelivered). DLQ subscriptions disable
their own DLQ and run with `Retry.MaxAttempts=1000` to avoid a
`_dlq_dlq` cascade while preserving convergence.
- Documentation: new `dlq/README.md` (DLQ-is-final-destination design
constraints), updated `core/errs/README.md` (processor choices), updated
`submitqueue/core/consumer/README.md` (`ErrorProcessor` argument +
error-handling rewrite), new DLQ reconciliation section in
`doc/rfc/submitqueue/workflow.md`, and a sentence in `CLAUDE.md` on the
two default classifiers.
## Design notes
- **DLQ is the final destination.** A genuinely unprocessable DLQ
message (e.g. malformed payload) must be removed by an operator. The
controllers do not attempt to recover; they only reconcile request/batch
state to terminal failed states with the same optimistic-locking CAS the
primary pipeline uses.
- **Two controller shapes.** `RequestController` for request-scoped
topics (start, validate, batch, cancel, log); `BatchController` for
batch-scoped topics (score, speculate, build, merge, conclude) with
fan-out to member requests; small dedicated controller for `buildsignal`
payloads.
- **Error handling in `dlq.go`.** `storage.ErrNotFound` → warn+skip;
`storage.ErrVersionMismatch` → wrapped retryable; everything else →
plain (then forced retryable by `AlwaysRetryableProcessor`).
## Test plan
- [x] `bazel test //core/errs/...
//submitqueue/core/consumer:consumer_test
//submitqueue/orchestrator/controller/dlq:dlq_test` — all PASSED
locally.
- [x] `bazel build //example/submitqueue/... //example/stovepipe/...` —
clean locally.
- [ ] Watch CI on the draft PR.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: sergeyb <sergeyb@uber.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: CLAUDE.md
+1Lines changed: 1 addition & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -304,3 +304,4 @@ Errors are classified by origin (user vs infra) and retryability. The framework
304
304
3.**Extensions return plain errors** — extension interfaces (`MergeChecker`, `Storage`, `Publisher`) return standard `error` values with their own domain sentinels (e.g. `storage.ErrNotFound`). They do NOT classify errors as user or infra.
305
305
4.**Controllers classify errors** — the service controller that calls an extension decides whether the failure is user-caused or infrastructure-caused. The same extension error may be classified differently depending on context.
306
306
5.**Error chain works end-to-end** — extensions wrap custom errors, controllers wrap with `errs.New*Error`, and `errors.Is`/`errors.As` walks the full chain.
307
+
6.**Default classifiers** — primary pipeline consumers compose one or more classifiers (each owning a focused concern such as transport-level signals or a specific driver/library's errors) into `errs.NewClassifierProcessor(...)`. Pick classifiers that match the failure surfaces the consumer actually touches; add a new classifier package when a backend introduces error shapes that no existing one understands, rather than teaching an unrelated classifier about them. DLQ reconciliation consumers use `errs.AlwaysRetryableProcessor` instead so any failure is redelivered rather than dropped.
Copy file name to clipboardExpand all lines: core/errs/README.md
+27-15Lines changed: 27 additions & 15 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -23,11 +23,11 @@ Errors are classified along two axes:
23
23
A returned `error` reaches `IsUserError` / `IsRetryable` / `IsDependencyError` carrying one of the framework types (`*userError` / `*infraError`). It gets there one of two ways:
24
24
25
25
1.**Explicit wrap by the controller** — the controller knows the meaning of the failure and wraps the cause with `NewUserError`, `NewRetryableError`, `NewDependencyError`, or `NewRetryableDependencyError` before returning.
26
-
2.**Automatic wrap by `Classify`** — the controller returns a raw driver/library/sentinel error, and a per-backend `Classifier` recognises it later in the pipeline (typically inside the consumer) and adds the appropriate framework wrap.
26
+
2.**Automatic wrap by the classifier-based `ErrorProcessor`** — the controller returns a raw driver/library/sentinel error, and a per-backend `Classifier` recognises it later in the pipeline (typically inside the consumer, after `ErrorProcessor.Process` runs) and adds the appropriate framework wrap.
27
27
28
28
Both routes feed the same downstream helpers; the chain that reaches `IsRetryable` looks identical regardless of who wrapped it.
29
29
30
-
## `Classify`and the `Classifier` Interface
30
+
## `ErrorProcessor`, `Classifier`, and the Processing Pass
31
31
32
32
`Classifier` inspects a **single error node** and returns a `Verdict`:
`Classify(err, classifiers...)` is the single, explicit pass that turns a raw chain into a wrapped one. It is called **exactly once per chain** — typically by the consumer immediately after the controller returns. After that point, callers use only the `IsXxx` helpers, which are pure type checks.
42
+
An `ErrorProcessor` runs the per-chain pass that turns a raw chain into a wrapped one. It is called **exactly once per chain** — typically by the consumer immediately after the controller returns. After that point, callers use only the `IsXxx` helpers, which are pure type checks.
43
43
44
-
`Classify` walks the chain twice:
44
+
Two implementations ship in this package:
45
45
46
-
1.**Pass 1 — framework-wrap check.** A cheap type switch looks for an existing `*userError` / `*infraError` anywhere in the chain. If found, the chain is already interpretable and `Classify` returns `err` unchanged. **No classifier is invoked.**
47
-
2.**Pass 2 — classifier walk.** From outermost to innermost node, each registered classifier is asked for a verdict. The first non-`Unknown` verdict wins and `err` is wrapped with the matching framework constructor.
46
+
-**`NewClassifierProcessor(classifiers...)`** — the standard pass for primary pipeline consumers. Walks the chain twice:
47
+
1.**Pass 1 — framework-wrap check.** A cheap type switch looks for an existing `*userError` / `*infraError` anywhere in the chain. If found, the chain is already interpretable and the processor returns `err` unchanged. **No classifier is invoked.**
48
+
2.**Pass 2 — classifier walk.** From outermost to innermost node, each registered classifier is asked for a verdict. The first non-`Unknown` verdict wins and `err` is wrapped with the matching framework constructor.
48
49
49
-
If no classifier recognises anything, `err` is returned unchanged — and behaves as non-retryable infra at the helper layer.
50
+
If no classifier recognises anything, `err` is returned unchanged — and behaves as non-retryable infra at the helper layer.
51
+
52
+
-**`AlwaysRetryableProcessor`** — unconditionally wraps every non-nil error with `NewRetryableError`, overriding any inner framework wrap. Use it for narrowly-scoped consumers — typically DLQ reconciliation — that must redeliver on any failure because there is no further dead-letter destination. Side-effect: an inner `*infraError(dependency=true)` is masked by the outer `retryable=true` wrap, since `errors.As` matches the outermost `*infraError` first. This is acceptable for the intended DLQ use case where only `IsRetryable` drives transport behaviour; do not pair this processor with a primary pipeline consumer or genuine user errors will retry forever instead of reaching their DLQ.
53
+
54
+
### Choosing a processor
55
+
56
+
-**Primary pipeline consumer** → `NewClassifierProcessor(...)`. Controllers' explicit `NewUserError` / `NewDependencyError` wraps must survive so user errors don't get retried, and unclassified backend errors must be inspected by the registered classifiers.
57
+
-**DLQ reconciliation consumer** → `AlwaysRetryableProcessor`. The DLQ is the last stop; any unprocessable message must come back for another attempt rather than silently drop. The DLQ subscription itself runs with a very high `Retry.MaxAttempts` and with its own DLQ disabled, so "always retryable + bounded-but-effectively-infinite attempts" is the convergence guarantee.
Servers wire each classifier into the consumer as a vararg. Order matters only when two classifiers might both match a node — earlier classifiers win:
88
+
Servers wire each classifier into the consumer's `ErrorProcessor`. Order matters only when two classifiers might both match a node — earlier classifiers win:
Tests follow the same shape: assert per-node behaviour against `Classifier.Classify(node)` directly, and assert end-to-end behaviour by running `errs.Classify(err, Classifier)` and checking the helpers (`IsRetryable`, `IsUserError`, …) on the result. See `core/errs/mysql/mysql_test.go` and `core/errs/generic/generic_test.go`.
105
+
Tests follow the same shape: assert per-node behaviour against `Classifier.Classify(node)` directly, and assert end-to-end behaviour by running `errs.NewClassifierProcessor(Classifier).Process(err)` and checking the helpers (`IsRetryable`, `IsUserError`, …) on the result. See `core/errs/mysql/mysql_test.go` and `core/errs/generic/generic_test.go`.
95
106
96
107
## Overriding Classification from a Controller
97
108
@@ -106,8 +117,9 @@ if errors.Is(err, storage.ErrNotFound) {
// Hand the raw error to Classify — the mysql classifier will recognise
110
-
// deadlocks, lock-wait timeouts, etc. and wrap them as retryable infra.
120
+
// Hand the raw error to the consumer's ErrorProcessor — the mysql
121
+
// classifier will recognise deadlocks, lock-wait timeouts, etc. and wrap
122
+
// them as retryable infra.
111
123
return fmt.Errorf("get %s: %w", id, err)
112
124
}
113
125
```
@@ -119,7 +131,7 @@ Two practical rules fall out of the short-circuit semantics:
119
131
120
132
## Extensions Return Plain Go Errors
121
133
122
-
Extension interfaces (`MergeChecker`, `Storage`, `Publisher`) return standard `error` values. They may define their own domain-specific sentinel errors (e.g. `storage.ErrNotFound`, `storage.ErrVersionMismatch`) but they do **not** classify errors as user or infra — that is the controller's (and `Classify`'s) job.
134
+
Extension interfaces (`MergeChecker`, `Storage`, `Publisher`) return standard `error` values. They may define their own domain-specific sentinel errors (e.g. `storage.ErrNotFound`, `storage.ErrVersionMismatch`) but they do **not** classify errors as user or infra — that is the controller's (and the consumer's `ErrorProcessor`'s) job.
123
135
124
136
This separation keeps extensions reusable across contexts. The same `storage.ErrNotFound` might be a user error in one controller (user requested a non-existent resource) and an infra error in another (expected record is missing).
125
137
@@ -151,4 +163,4 @@ errors.Is(err, ErrNotFound) // true — cause is in the chain
151
163
|`IsRetryable(err)`|`err` is or wraps an infra error with the retryable flag set |
152
164
|`IsDependencyError(err)`|`err` is or wraps an infra error marked as dependency |
153
165
154
-
All three are type-only checks. They do not invoke classifiers — pair them with a preceding `Classify` call when the controller's error may not carry an explicit wrap.
166
+
All three are type-only checks. They do not invoke classifiers — pair them with a preceding `ErrorProcessor.Process` call when the controller's error may not carry an explicit wrap.
0 commit comments