feat(routing): pre-content failover, error cooldowns, tool-schema normalization, capability floors by Gajesh2007 · Pull Request #322 · Layr-Labs/d-inference

Gajesh2007 · 2026-06-11T22:56:38Z

Incident

An OpenRouter partner reported Gemma vision requests returning a role delta then {"error":{"message":"provider disconnected"}} with zero completion tokens. Root cause analysis (prod logs + live repro) found two compounding failure classes:

Deterministic template crash: requests carrying OpenAI tool schemas with nullable types ("type":["string","null"]) crashed Gemma's chat_template.jinja (Runtime error: upper filter requires string) on providers running pre-0.6.3 binaries. The coordinator retried 3×, but every retry landed on an identically-buggy provider — 6/6 partner requests failed after burning all attempts (42 errors in 42s in prod logs).
Premature commit: the dispatch loop committed to a provider on its first chunk — but that chunk is boilerplate (role-only delta) emitted before any failure-prone work. A provider dying during prefill surfaced an unretried in-band error even though zero bytes had been written to the client.

Fixes

Deferred commit + pre-content failover (api/consumer.go): dispatch commits only on the first content-bearing chunk. Boilerplate (role-only delta, response.created/in_progress) buffers in a held-chunks queue; provider error/disconnect before content → invisible retry on another provider (status:retry_precontent). Side effects: the speculative TTFT race now measures real first tokens (boilerplate no longer falsely wins and cancels the backup), and a role-then-stall zombie is bounded by a 90s preamble→content budget (preambleContentTimeout) instead of the full 600s. Genuine AcceptedCh model-load waits keep the full window. In-band SSE errors remain possible only after real content has flowed — now alertable via inference.in_band_error.
Inference-error circuit breaker (registry/error_cooldown.go): 2× sickness-shaped errors (500 = provider bug, 502 = disconnect flush, 504 = accepted-then-silent; 503 capacity/lifecycle signals exempt) per (provider, model) within 60s → 5-min routing cooldown; success clears. Accepted-then-silent timeout arms feed the breaker too.
Capability version floors + render gate (registry/request_traits.go): tool-bearing requests route only to providers ≥ 0.6.3 (provider-side schema normalization, fix(provider): gemma-4 tool calls 500 on nullable (array-typed) schema fields #310) whose model doesn't advertise template_render_ok=false. Tools fail-fast mirrors the vision fail-fast: clean immediate 503 instead of a 120s queue stall when no capable provider exists. Retries soft-prefer a different provider binary version (Traits.AvoidVersion, never fail-closed).
Coordinator-side tool-schema normalization (api/toolschema.go): Go port of the Swift ToolSchemaNormalization, extended to all three wire shapes — chat function.parameters, Responses-API flat parameters, Anthropic input_schema (broader than the provider-side normalizer, which covers only chat shape as of 0.6.4). Wired pre-parse in both handleChatCompletions and handleGenericInference, so lagging providers never see template-crashing shapes the moment the coordinator deploys.
Provider template render self-check (Swift, ProviderCoreFoundation/TemplateRenderCheck.swift): at scan time, renders each model's chat template(s) against canonical fixtures (multimodal parts for vision models, post-normalization tools, tool-call/response flows) and advertises tri-state template_render_ok on the wire (false MUST encode — it is the routing signal). Protocol field mirrored in Go protocol/messages.go. Fences off future bad template publishes before they serve traffic.

Protocol changes

ModelInfo.template_render_ok (*bool, omitempty) added symmetrically to coordinator/protocol/messages.go and provider-swift/.../Protocol/Types.swift. Absent = pre-0.6.5 provider (no opinion); merge paths verified to preserve it.

Verification

go test ./... — 19 packages green (incl. e2e testbed module); -race clean on api + registry; linux/amd64 cross-compile OK
Swift: full non-live suite green (695 swift-testing + all XCTest, 18 new tests); swift-jinja 2.3.5 pinned as direct dep, Package.resolved unchanged
New integration tests (fake WS providers): pre-content failover on error AND disconnect (the partner's exact case), post-content errors still surfaced (correctness boundary), boilerplate-then-clean-close, cooldown exclusion, tools version floor, template_render_ok gate, tools fail-fast latency, decrypted-body assertion proving normalized schemas reach providers
Reviewed by two independent passes (Codex + Claude); all findings fixed: stranded provider-extra refunds on pre-commit errors, trait-blind capacity preflight, /v1/messages+/v1/completions bypass, 503-as-sickness breaker miscount, 504 stall arms not feeding the breaker

Follow-ups (not in this PR)

Swift inbound normalizer parity for Responses-flat + Anthropic input_schema shapes (needs provider release)
TTFT deadline estimation ignores image tokens (vision requests lean on the liveness extension)
Mid-stream continuation failover for post-content provider deaths
DD monitors/dashboards for inference.in_band_error, routing.cooldown_entered
Provider version constants intentionally NOT bumped — no release cut here

🤖 Generated with Claude Code

^{Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.}

vercel · 2026-06-11T22:56:46Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
d-inference	Ready	Preview	Jun 12, 2026 1:16am
d-inference-console-ui-dev	Ready	Preview	Jun 12, 2026 1:16am
d-inference-landing	Ready	Preview	Jun 12, 2026 1:16am

github-actions · 2026-06-11T22:57:18Z

This PR introduces shape-keyed inference error circuit-breaking and a serial-constrained vision-provider check; neither change weakens any existing mitigation, and the key-collision fix in inferenceErrorKey is a minor positive for T-034's integrity assumptions.

Trust Boundaries Touched

TB-002 — coordinator-to-provider WebSocket (registry routing logic)
TB-007 — provider inference engine (error circuit-breaker gating provider selection)

Threat Analysis

Threat	Assessment	Notes
T-034 — Provider runs modified code while advertising a trusted identity	ℹ️ Neutral	The `CodeAttested` gate in `providerSupportsPrivateTextLocked` is not touched by this diff. The routing chokepoint is unmodified. The comment in `inferenceErrorStrikes` that closes the "colon-collision" key note is defensive housekeeping — it does not alter the attestation path.
T-006 — Unauthenticated WebSocket connection floods provider registry	ℹ️ Neutral	New `inferenceErrorStrikes` and `inferenceErrorCooldowns` maps are guarded by `r.mu` (same as `dispatchLoadCooldowns`). No new unauthenticated entry points are opened.
T-007 / T-027 — Provider serves manipulated outputs / weight substitution	ℹ️ Neutral	Error circuit-breaker operates on 5xx routing failures, not on weight-hash mismatches. Fail-open weight-hash enforcement (SEC-007) is unchanged.

New Attack Surface

HasVisionProviderForModel variadic allowedSerials parameter (registry.go, new signature at line ~1706):

The function now builds an allowedSet map from caller-supplied serial strings and filters provider candidates against it. This is a net improvement (prevents false-positive capacity reports for constrained requests), but introduces one thing worth verifying in follow-on code review:

Callers that pass zero serials still get the old unrestricted behavior (the len(allowedSet) > 0 guard at line ~1726). Confirm every call site that should be serial-constrained actually passes the serials — an omission silently falls back to the unconstrained check, which the commit message identifies as a "latent gap." This is not a threat-model violation on its own, but a mis-call would re-introduce the false-serviceable reporting the fix targets.
RequestTraits{} zero value passed to QuickCapacityCheck at the RejectUnservableQueuedRequests call site (line ~2669): the comment explicitly documents this as the intended base-shape check. Acceptable, but worth confirming that downstream callers of RejectUnservableQueuedRequests for tool-heavy workloads don't rely on this to gate tool-capable routing (they should not — this path is only for fast-fail of clearly unservable queued requests).

Neither item is an exploitable vulnerability under the current threat model; both are correctness concerns that could become security-relevant if the routing logic evolves.

SEC-* Findings Resolved

None claimed by this PR. No open findings from the threat model are resolved by this diff.

🔐 Threat model: docs/threat-model.yaml · Updates on each push to this PR

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea3c34b01d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T23:05:37Z

+	// a broken chat-template render — the request can never route. Without
+	// this gate it passes the trait-blind QuickCapacityCheck preflight, queues
+	// for up to 120s, and dies with a misleading capacity 429.
+	if hasTools && !s.registry.HasToolCapableProviderForModel(model) {


Honor routing constraints in tool capability checks

When a tool request is constrained to a provider allowlist or exclusive self-route, this global capability check can be satisfied by an unrelated public provider while every allowed/owned provider is still rejected later by Traits.HasTools. The subsequent QuickCapacityCheck/selfRouteUnavailable paths are trait-blind, so those constrained requests can queue until timeout instead of failing fast with the intended tool-capability error. Please apply the same allowlist/self-route constraints when checking for a tool-capable provider, or make the preflight capacity check trait-aware.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T23:05:37Z

+	if strings.Contains(chunk, `"response.created"`) || strings.Contains(chunk, `"response.in_progress"`) {
+		return true


Parse Responses boilerplate before matching by substring

If a chat-completion content delta contains the literal string response.created or response.in_progress, this returns true before parsing the chunk and the dispatcher treats real output as pre-content boilerplate. When that provider then errors or stalls before another content chunk, the held chunk is discarded/retried or timed out even though user-visible content had already been generated. Please identify Responses lifecycle events from their parsed event/type fields rather than a raw substring match.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T23:05:38Z

+	if !r.providerMeetsTraitFloorsLocked(p, t) {
+		return false
+	}
+	if t.HasTools && providerTemplateRenderBrokenLocked(p, model) {


Apply template-render failures beyond tools

When a 0.6.5+ provider advertises template_render_ok=false because the scan failed a plain-chat or multimodal fixture, this check still allows non-tool text/vision requests to route to that provider. Since renderOK folds all canonical fixture failures into the same boolean, a bad model publish that cannot render ordinary or image/video prompts is not fenced off unless the consumer also sends tools; either split the advertised result by capability or gate every request shape represented by the failed check.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T23:05:38Z

+	if r.inferenceErrorCooldownActiveLocked(p.ID, model, now) {
+		return routingSnapshot{}, false


Keep cooled pairs out of capacity preflight

When all otherwise-eligible providers for a model are in the new inference-error cooldown, ReserveProviderEx skips them here but QuickCapacityCheck and the queue preflight still count them as admissible because they do not consult inferenceErrorCooldowns. A fresh public request can therefore pass capacity preflight, fail to reserve any provider, and sit in the 120s queue until timeout instead of failing/retrying immediately. Mirror this cooldown in the capacity/queue admission path as well as in scheduler snapshots.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T23:05:38Z

+							// Both missed deadline.
+							s.cancelDispatch(provider, pr)
+							s.cancelDispatch(backupProvider, backupPR)


Feed preamble race timeouts into the breaker

In the speculative race path, if either provider emits only held boilerplate and then both racers miss the extended preamble-to-content deadline, this timeout arm cancels both attempts without calling noteInferenceError with the 504 that the single-provider acceptedWait path records. Repeated role-then-stall failures during speculative dispatch therefore never trip the new provider/model cooldown, so the same zombie pair can keep soaking retries. Record a 504 for any racer with held preamble before canceling it here.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T23:05:38Z

+func (r *Registry) RecordInferenceSuccess(providerID, modelID string) {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	key := providerID + ":" + modelID
+	delete(r.inferenceErrorStrikes, key)
+	delete(r.inferenceErrorCooldowns, key)


Do not clear shape-specific cooldowns on any success

A clean non-tool request to the same provider/model reaches RecordInferenceSuccess and deletes the entire cooldown, even when the strikes came from a deterministic tool/template failure that the non-tool path never exercised. In mixed traffic this lets ordinary completions immediately re-enable routing of tool requests to the same broken provider before the TTL expires, causing the original failure loop to recur. Key the breaker by the failing request traits or only clear it on a success for the same shape that generated the strikes.

Useful? React with 👍 / 👎.

Follow-up to the routing-failover squash addressing 6 P2 findings from the Codex review plus the threat-model key note: - Shape-key the inference-error breaker by {provider, model, shape}: a non-tool success no longer resets the strike counter a deterministic tool/template failure is accumulating (mixed-traffic quarantine bug). The struct key also closes the colon-collision note; success clears only its own shape bucket. - QuickCapacityCheck + queue preflight are now cooldown- and trait-aware, and a fast 503 is returned when no provider is structurally eligible (all offline / trait-gated / shape-cooled) instead of queueing 120s to a misleading 429. - template_render_ok=false now fences ALL request shapes, not just tools (a crashing chat template breaks every shape); version floor stays tools-only. One shared providerPassesRoutingGatesLocked helper backs selection, admit, and preflight so they can't drift. - Tool + vision fail-fasts honor allowlist/self-route constraints (allowedSerials; guarded for self-route/prefer). - Responses lifecycle events classified by parsed type, not substring, so a chat content delta containing the literal isn't dropped as boilerplate. - Speculative-race preamble timeout feeds the breaker for held-preamble racers (previously only the single-provider path did). go test ./... green (16 pkgs), -race clean on api+registry, linux/amd64 OK. New tests: constrained-tool fast-fail, shape-keyed breaker mixed traffic, cooled-pair preflight exclusion.

…malization, capability floors Root-caused from the OpenRouter partner incident (vision+tools requests 500ing fleet-wide, 'provider disconnected' after a bare role delta): - Deferred commit: dispatch no longer commits on boilerplate chunks (role-only delta / response.created). Pre-content provider errors and disconnects retry invisibly on another provider; the speculative TTFT race now measures real first tokens instead of being defeated by instant boilerplate. In-band SSE errors only after content has flowed. - Inference-error circuit breaker: 2x 5xx within 60s puts the (provider, model) pair on a 5-min routing cooldown; success clears. - Version-diverse retry: after a 5xx, the next attempt soft-prefers a provider running a different binary version (never fail-closed). - Capability floors: tool-bearing requests route only to providers >= 0.6.3 (provider-side schema normalization, #310) whose models do not report template_render_ok=false. - Coordinator-side tool-schema normalization (Go port of the Swift ToolSchemaNormalization): nullable array types collapsed, missing types inferred, applied pre-encryption so lagging providers never see template-crashing schemas. - Provider scan-time chat-template render self-check (Swift): renders each model's template against canonical fixtures (multimodal parts, normalized tools, tool responses) and advertises template_render_ok tri-state on the wire. - DD: inference.in_band_error, routing.cooldown_entered, inference.dispatches{status:retry_precontent}. Coordinator: go test ./... green (16 pkgs), -race clean on api+registry, linux/amd64 cross-compile OK. Provider: full non-live Swift suite green (695 swift-testing + XCTest incl. 18 new).

Follow-up to the routing-failover squash addressing 6 P2 findings from the Codex review plus the threat-model key note: - Shape-key the inference-error breaker by {provider, model, shape}: a non-tool success no longer resets the strike counter a deterministic tool/template failure is accumulating (mixed-traffic quarantine bug). The struct key also closes the colon-collision note; success clears only its own shape bucket. - QuickCapacityCheck + queue preflight are now cooldown- and trait-aware, and a fast 503 is returned when no provider is structurally eligible (all offline / trait-gated / shape-cooled) instead of queueing 120s to a misleading 429. - template_render_ok=false now fences ALL request shapes, not just tools (a crashing chat template breaks every shape); version floor stays tools-only. One shared providerPassesRoutingGatesLocked helper backs selection, admit, and preflight so they can't drift. - Tool + vision fail-fasts honor allowlist/self-route constraints (allowedSerials; guarded for self-route/prefer). - Responses lifecycle events classified by parsed type, not substring, so a chat content delta containing the literal isn't dropped as boilerplate. - Speculative-race preamble timeout feeds the breaker for held-preamble racers (previously only the single-provider path did). go test ./... green (16 pkgs), -race clean on api+registry, linux/amd64 OK. New tests: constrained-tool fast-fail, shape-keyed breaker mixed traffic, cooled-pair preflight exclusion.

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ⚠️ SKIPPED

Error: Expecting property name enclosed in double quotes: line 2 column 2 (char 3)

Performance — 6 finding(s) (5 blocking)

🔵 [INFO] coordinator/api/consumer.go:1664 — Potential unbounded growth of heldChunks slice in dispatch loop
- Suggestion: The heldChunks slice could grow without bounds if a provider sends many boilerplate chunks. Consider adding a size limit or timeout to prevent memory exhaustion.
🟡 [MEDIUM] coordinator/api/consumer.go:1943 — Repeated calls to isBoilerplateChunk() for each chunk
- Suggestion: The isBoilerplateChunk() function performs JSON parsing on every chunk. Consider caching results or optimizing the parsing logic to avoid redundant work.
🔴 [CRITICAL] coordinator/api/toolschema.go:93-114 — JSON parsing and normalization on request hot path
- Suggestion: NormalizeToolSchemas performs full JSON decode/encode on every request with tools. This blocks the request handler. Consider async processing or caching normalized schemas.
🟡 [MEDIUM] coordinator/api/toolschema.go:156-200 — Recursive traversal of tool schemas without depth limits
- Suggestion: The injectDefaultTypes function recursively traverses nested schemas without depth limits, potentially causing stack overflow or excessive CPU usage on deeply nested schemas.
🟡 [MEDIUM] coordinator/registry/error_cooldown.go:95-105 — O(n) map cleanup operations in hot path
- Suggestion: The opportunistic cleanup of expired entries iterates through all map entries when size exceeds 1024. Consider using a more efficient data structure like a time-ordered queue for cleanup.
🟡 [MEDIUM] coordinator/api/consumer.go:3855-3968 — JSON parsing in isBoilerplateChunk for every streaming chunk
- Suggestion: The function parses JSON for every chunk to determine if it's boilerplate. Consider using string matching for common patterns before falling back to JSON parsing.

Type_diligence — ✅ No issues found

Additive_complexity — 8 finding(s) (5 blocking)

🟡 [MEDIUM] coordinator/api/consumer.go:1307-1316 — Tool schema normalization duplicated across endpoints
- Suggestion: Extract NormalizeToolSchemas call to a shared middleware or helper function since it's identical in handleChatCompletions and handleGenericInference
🟡 [MEDIUM] coordinator/api/consumer.go:5224-5233 — Identical tool schema normalization in handleGenericInference
- Suggestion: Consolidate with the identical block in handleChatCompletions - this is the same 10-line pattern repeated
🔵 [INFO] coordinator/api/consumer.go:162-175 — Inference error recording mixed with refund logic
- Suggestion: Split noteDispatchProviderError into separate error recording and refund operations - the function does too many unrelated things
🔴 [CRITICAL] coordinator/api/consumer.go:1928-2818 — 890-line nested state machine in dispatch loop
- Suggestion: Extract the speculative dispatch racing logic into separate functions - this single block handles too many scenarios and is hard to follow
🟡 [MEDIUM] coordinator/api/toolschema.go:1-317 — Complex tool schema normalization for simple fix
- Suggestion: Consider if a simpler approach could handle the core issue - 317 lines to fix template crashes seems heavy for the problem scope
🔵 [INFO] coordinator/registry/error_cooldown.go:78-130 — Multiple cooldown parameters and complex sliding window
- Suggestion: Consider if simpler exponential backoff or fixed cooldown would suffice - the sliding window adds complexity
🔵 [INFO] coordinator/registry/request_traits.go:47-59 — The capabilityVersionFloors map has only one entry ('tools': '0.6.3') - consider a simpler constant until more floors are needed
🟡 [MEDIUM] provider-swift/Sources/ProviderCoreFoundation/TemplateRenderCheck.swift:1-387 — 387-line template validation for startup check
- Suggestion: Consider if a smaller subset of fixtures could catch the same issues - this adds significant startup overhead

14 finding(s) total, 10 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

ethenotethan · 2026-06-12T00:13:21Z

@@ -1500,14 +1664,18 @@ func (s *Server) handleChatCompletions(w http.ResponseWriter, r *http.Request) {
 	deadline := ttftDeadline(estimatedPromptTokens)


🔵 [INFO] ⚡ Potential unbounded growth of heldChunks slice in dispatch loop

💡 Suggestion: The heldChunks slice could grow without bounds if a provider sends many boilerplate chunks. Consider adding a size limit or timeout to prevent memory exhaustion.

📊 Score: 2×3 = 6 · Category: unbounded_allocation

ethenotethan · 2026-06-12T00:13:21Z

 		// - If speculative timer fires → dispatch backup and race.
-		// - If full deadline expires → fail.
+		// - If full deadline expires with no liveness → fail.



🟡 [MEDIUM] ⚡ Repeated calls to isBoilerplateChunk() for each chunk

💡 Suggestion: The isBoilerplateChunk() function performs JSON parsing on every chunk. Consider caching results or optimizing the parsing logic to avoid redundant work.

📊 Score: 3×4 = 12 · Category: repeated_work

ethenotethan · 2026-06-12T00:13:21Z

+		return body
+	}
+	root, ok := decoded.(map[string]any)
+	if !ok {
+		return body
+	}
+	tools, ok := root["tools"].([]any)
+	if !ok {
+		return body
+	}
+	for i, tool := range tools {
+		tools[i] = normalizeToolEntry(tool)
+	}
+
+	var buf bytes.Buffer
+	buf.Grow(len(body))
+	enc := json.NewEncoder(&buf)
+	enc.SetEscapeHTML(false)
+	if err := enc.Encode(root); err != nil {
+		return body
+	}
+	// Encoder appends a newline after the document; the input had none.


🔴 [CRITICAL] ⚡ JSON parsing and normalization on request hot path

💡 Suggestion: NormalizeToolSchemas performs full JSON decode/encode on every request with tools. This blocks the request handler. Consider async processing or caching normalized schemas.

📊 Score: 4×5 = 20 · Category: blocking_io

ethenotethan · 2026-06-12T00:13:21Z

+// injectDefaultTypes recursively default-fills `type` on JSON-Schema nodes. A
+// node gets a type only when it looks like a schema node (has properties /
+// items / additionalProperties / enum / description / anyOf / oneOf / allOf)
+// — we never invent types on arbitrary maps. The inferred default favours
+// structure: object when it has properties, array when it has items,
+// otherwise string.
+func injectDefaultTypes(node any) any {
+	switch n := node.(type) {
+	case []any:
+		for i, v := range n {
+			n[i] = injectDefaultTypes(v)
+		}
+		return n
+	case map[string]any:
+		return injectDefaultTypesIntoSchema(n)
+	default:
+		return node
+	}
+}
+
+// injectDefaultTypesIntoSchema is the map-shaped arm of injectDefaultTypes.
+// Children are normalized BEFORE this node's own type is repaired — ordering
+// is load-bearing: a union member declaring `"type": ["string","null"]` must
+// collapse first so the parent's union inference sees a concrete string.
+func injectDefaultTypesIntoSchema(dict map[string]any) map[string]any {
+	if props, ok := dict["properties"].(map[string]any); ok {
+		for k, v := range props {
+			props[k] = injectDefaultTypes(v)
+		}
+	}
+	if items, ok := dict["items"]; ok {
+		dict["items"] = injectDefaultTypes(items)
+	}
+	// additionalProperties may itself be a schema (map-shaped params, e.g.
+	// {"additionalProperties":{"type":"string"}}) — recurse so its inner schema
+	// gets a default type too. A bare `true`/`false` is left untouched.
+	if addl, ok := dict["additionalProperties"].(map[string]any); ok {
+		dict["additionalProperties"] = injectDefaultTypesIntoSchema(addl)
+	}
+	for _, key := range schemaUnionKeys {
+		if variants, ok := dict[key].([]any); ok {
+			for i, v := range variants {
+				variants[i] = injectDefaultTypes(v)
+			}
+		}


🟡 [MEDIUM] ⚡ Recursive traversal of tool schemas without depth limits

💡 Suggestion: The injectDefaultTypes function recursively traverses nested schemas without depth limits, potentially causing stack overflow or excessive CPU usage on deeply nested schemas.

📊 Score: 3×4 = 12 · Category: inefficient_data_structures

ethenotethan · 2026-06-12T00:13:21Z

+			if !now.Before(expiry) {
+				delete(r.inferenceErrorCooldowns, key)
+			}
+		}
+	}
+	if len(r.inferenceErrorStrikes) > 1024 {
+		for key, strikes := range r.inferenceErrorStrikes {
+			if len(strikes) == 0 || !strikes[len(strikes)-1].Add(inferenceErrorWindow).After(now) {
+				delete(r.inferenceErrorStrikes, key)
+			}
+		}


🟡 [MEDIUM] ⚡ O(n) map cleanup operations in hot path

💡 Suggestion: The opportunistic cleanup of expired entries iterates through all map entries when size exceeds 1024. Consider using a more efficient data structure like a time-ordered queue for cleanup.

📊 Score: 3×3 = 9 · Category: inefficient_data_structures

ethenotethan · 2026-06-12T00:13:21Z


 		// ---- Speculative TTFT-aware first-chunk wait ----
 		//
-		// Phase 1: Wait for first chunk with speculative timer.
-		// - If primary sends first chunk → commit.
+		// Phase 1: Wait for first CONTENT chunk with speculative timer.
+		// - Boilerplate preamble chunks (role delta / Responses lifecycle) are
+		//   held, not committed: the provider hasn't survived its failure-prone
+		//   work (media decode, template render, prefill) yet. They count as a
+		//   liveness signal — the TTFT deadline stops failing the attempt — but
+		//   the speculative backup may still race for first content.
+		// - If primary sends a content chunk → commit (held chunks are emitted
+		//   ahead of it).
 		// - If primary sends accepted → extend to inferenceTimeout (model reload).
-		// - If primary errors → retry immediately (sequential fallback).
+		// - If primary errors → discard held chunks and retry invisibly.
 		// - If speculative timer fires → dispatch backup and race.
-		// - If full deadline expires → fail.
+		// - If full deadline expires with no liveness → fail.

 		speculativeTimer := time.NewTimer(speculativeAt)
 		deadlineTimer := time.NewTimer(deadline)
 		accepted := false
-
-		select {
-		case chunk, ok := <-pr.ChunkCh:
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			if ok {
-				firstChunk = chunk
-				pr.Timing.FirstChunkAt = time.Now()
-				committed = true
-			} else {
-				select {
-				case errMsg := <-pr.ErrorCh:
-					excludeProviders[provider.ID] = struct{}{}
-					s.cancelDispatch(provider, pr)
-					lastErr = errMsg.Error
-					lastErrCode = errMsg.StatusCode
-					provider = nil
-					pr = nil
-					continue
-				default:
+		// preambleLiveness distinguishes WHY the extended first-content wait was
+		// entered: a genuine AcceptedCh (cold model load — keeps the full
+		// inferenceTimeout) vs a held-boilerplate liveness extension past an
+		// expired TTFT deadline (zero bytes written to the client — bounded by
+		// preambleContentTimeout so a role-then-stall zombie fails over).
+		preambleLiveness := false
+
+	firstChunkWait:
+		for {
+			select {
+			case chunk, ok := <-pr.ChunkCh:
+				if ok && len(heldChunks) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+					heldChunks = append(heldChunks, chunk)
+					if pr.Timing.FirstChunkAt.IsZero() {
+						pr.Timing.FirstChunkAt = time.Now()
+					}
+					continue firstChunkWait
+				}
+				speculativeTimer.Stop()
+				deadlineTimer.Stop()
+				if ok {
+					firstChunk = chunk
+					if pr.Timing.FirstChunkAt.IsZero() {
+						pr.Timing.FirstChunkAt = time.Now()
+					}
 					committed = true
+				} else {
+					select {
+					case errMsg := <-pr.ErrorCh:
+						excludeProviders[provider.ID] = struct{}{}
+						s.cancelDispatch(provider, pr)
+						lastErr = errMsg.Error
+						lastErrCode = errMsg.StatusCode
+						lastFailedVersion = failedProviderVersion(provider)
+						if !s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks) {
+							s.ddIncr("inference.dispatches", []string{"status:retry"})
+						}
+						provider = nil
+						pr = nil
+						continue dispatch
+					default:
+						// Closed without error — commit (held chunks only is
+						// fine: a preamble-then-complete stream is empty output).
+						committed = true
+					}
 				}
-			}
-
-		case <-pr.AcceptedCh:
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			accepted = true
-
-		case errMsg := <-pr.ErrorCh:
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			excludeProviders[provider.ID] = struct{}{}
-			s.cancelDispatch(provider, pr)
-			lastErr = errMsg.Error
-			lastErrCode = errMsg.StatusCode
-			s.logger.Warn("provider failed, retrying",
-				"request_id", requestID,
-				"provider_id", provider.ID,
-				"attempt", attempt+1,
-				"error", errMsg.Error,
-			)
-			s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-				"provider failed, retrying",
-				map[string]any{
-					"provider_id": provider.ID,
-					"attempt":     attempt + 1,
-					"reason":      "provider_error",
-					"status_code": errMsg.StatusCode,
-				})
-			if s.metrics != nil {
-				s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-			}
-			s.ddIncr("inference.dispatches", []string{"status:retry"})
-			provider = nil
-			pr = nil
-			continue
+				break firstChunkWait

-		case <-speculativeTimer.C:
-			deadlineTimer.Stop()
-			// Primary is slow. Attempt speculative backup dispatch.
-			s.ddIncr("inference.speculative_dispatch", []string{"model:" + model})
-
-			var backupProvider *registry.Provider
-			var backupPR *registry.PendingRequest
-
-			// Do NOT speculatively race a paid PUBLIC backup against a prefer
-			// request that is being served by the caller's OWN machine: the user
-			// opted into "prefer my machine (free)", so a slow owned machine must
-			// be waited on, not raced (and billed) by the public fleet. (Exclusive
-			// self-route is already safe — its backup selection is owned-only and
-			// returns nil when there's no other owned machine.) When the prefer
-			// primary is itself a public provider (the owner owns nothing / fell
-			// back), normal speculative behaviour applies.
-			skipBackup := false
-			if policy.prefer {
-				provider.Mu().Lock()
-				skipBackup = policy.ownerAccountID != "" && provider.AccountID == policy.ownerAccountID
-				provider.Mu().Unlock()
-			}
+			case <-pr.AcceptedCh:
+				speculativeTimer.Stop()
+				deadlineTimer.Stop()
+				accepted = true
+				break firstChunkWait

-			if !skipBackup {
-				backupExclude := make(map[string]struct{}, len(excludeProviders)+1)
-				for id := range excludeProviders {
-					backupExclude[id] = struct{}{}
+			case errMsg := <-pr.ErrorCh:
+				speculativeTimer.Stop()
+				deadlineTimer.Stop()
+				excludeProviders[provider.ID] = struct{}{}
+				s.cancelDispatch(provider, pr)
+				lastErr = errMsg.Error
+				lastErrCode = errMsg.StatusCode
+				lastFailedVersion = failedProviderVersion(provider)
+				s.logger.Warn("provider failed, retrying",
+					"request_id", requestID,
+					"provider_id", provider.ID,
+					"attempt", attempt+1,
+					"error", errMsg.Error,
+				)
+				s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
+					"provider failed, retrying",
+					map[string]any{
+						"provider_id": provider.ID,
+						"attempt":     attempt + 1,
+						"reason":      "provider_error",
+						"status_code": errMsg.StatusCode,
+					})
+				if s.metrics != nil {
+					s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
+				}
+				if !s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks) {
+					s.ddIncr("inference.dispatches", []string{"status:retry"})
+				}
+				provider = nil
+				pr = nil
+				continue dispatch
+
+			case <-speculativeTimer.C:
+				deadlineTimer.Stop()
+				// Primary is slow. Attempt speculative backup dispatch.
+				s.ddIncr("inference.speculative_dispatch", []string{"model:" + model})
+
+				var backupProvider *registry.Provider
+				var backupPR *registry.PendingRequest
+
+				// Do NOT speculatively race a paid PUBLIC backup against a prefer
+				// request that is being served by the caller's OWN machine: the user
+				// opted into "prefer my machine (free)", so a slow owned machine must
+				// be waited on, not raced (and billed) by the public fleet. (Exclusive
+				// self-route is already safe — its backup selection is owned-only and
+				// returns nil when there's no other owned machine.) When the prefer
+				// primary is itself a public provider (the owner owns nothing / fell
+				// back), normal speculative behaviour applies.
+				skipBackup := false
+				if policy.prefer {
+					provider.Mu().Lock()
+					skipBackup = policy.ownerAccountID != "" && provider.AccountID == policy.ownerAccountID
+					provider.Mu().Unlock()
 				}
-				backupExclude[provider.ID] = struct{}{}

-				backupProvider, backupPR, _, _, _ = s.dispatchOneProvider(
-					r, model, publicModel, rawBody, consumerKey, consumerLocation, reservedMicroUSD,
-					estimatedPromptTokens, requestedMaxTokens, requiresVision, allowedProviderSerials,
-					isResponsesAPI, policy, &registry.RequestTiming{ReceivedAt: timing.ReceivedAt},
-					backupExclude,
-				)
-			}
+				if !skipBackup {
+					backupExclude := make(map[string]struct{}, len(excludeProviders)+1)
+					for id := range excludeProviders {
+						backupExclude[id] = struct{}{}
+					}
+					backupExclude[provider.ID] = struct{}{}
+
+					backupProvider, backupPR, _, _, _ = s.dispatchOneProvider(
+						r, model, publicModel, rawBody, consumerKey, consumerLocation, reservedMicroUSD,
+						estimatedPromptTokens, requestedMaxTokens, requiresVision,
+						registry.RequestTraits{HasTools: hasTools, AvoidVersion: lastFailedVersion},
+						allowedProviderSerials, isResponsesAPI, policy,
+						&registry.RequestTiming{ReceivedAt: timing.ReceivedAt},
+						backupExclude,
+					)
+				}

-			if backupProvider == nil {
-				// No backup available. Keep waiting for primary with remaining deadline.
-				s.logger.Info("speculative_dispatch_no_backup",
-					"request_id", requestID,
-					"primary_provider", provider.ID,
-				)
-				remainingDeadline := time.NewTimer(deadline - speculativeAt)
-				select {
-				case chunk, ok := <-pr.ChunkCh:
-					remainingDeadline.Stop()
-					if ok {
-						firstChunk = chunk
-						pr.Timing.FirstChunkAt = time.Now()
-						committed = true
-					} else {
+				if backupProvider == nil {
+					// No backup available. Keep waiting for primary with remaining deadline.
+					s.logger.Info("speculative_dispatch_no_backup",
+						"request_id", requestID,
+						"primary_provider", provider.ID,
+					)
+					remainingDeadline := time.NewTimer(deadline - speculativeAt)
+				noBackupWait:
+					for {
 						select {
+						case chunk, ok := <-pr.ChunkCh:
+							if ok && len(heldChunks) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+								heldChunks = append(heldChunks, chunk)
+								if pr.Timing.FirstChunkAt.IsZero() {
+									pr.Timing.FirstChunkAt = time.Now()
+								}
+								continue noBackupWait
+							}
+							remainingDeadline.Stop()
+							if ok {
+								firstChunk = chunk
+								if pr.Timing.FirstChunkAt.IsZero() {
+									pr.Timing.FirstChunkAt = time.Now()
+								}
+								committed = true
+							} else {
+								select {
+								case errMsg := <-pr.ErrorCh:
+									excludeProviders[provider.ID] = struct{}{}
+									s.cancelDispatch(provider, pr)
+									lastErr = errMsg.Error
+									lastErrCode = errMsg.StatusCode
+									lastFailedVersion = failedProviderVersion(provider)
+									if !s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks) {
+										s.ddIncr("inference.dispatches", []string{"status:retry"})
+									}
+									provider = nil
+									pr = nil
+									continue dispatch
+								default:
+									committed = true
+								}
+							}
+							break noBackupWait
+						case <-pr.AcceptedCh:
+							remainingDeadline.Stop()
+							accepted = true
+							break noBackupWait
 						case errMsg := <-pr.ErrorCh:
+							remainingDeadline.Stop()
 							excludeProviders[provider.ID] = struct{}{}
 							s.cancelDispatch(provider, pr)
 							lastErr = errMsg.Error
 							lastErrCode = errMsg.StatusCode
+							lastFailedVersion = failedProviderVersion(provider)
+							if s.metrics != nil {
+								s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
+							}
+							if !s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks) {
+								s.ddIncr("inference.dispatches", []string{"status:retry"})
+							}
 							provider = nil
 							pr = nil
-							continue
-						default:
-							committed = true
+							continue dispatch
+						case <-remainingDeadline.C:
+							if len(heldChunks) > 0 {
+								// Liveness: the provider already produced its preamble —
+								// vision prefill / template render may legitimately
+								// exceed the TTFT deadline. Fall through to the
+								// extended (preambleContentTimeout) wait for first
+								// content, with ErrorCh still armed for retry.
+								accepted = true
+								preambleLiveness = true
+								break noBackupWait
+							}
+							excludeProviders[provider.ID] = struct{}{}
+							s.cancelDispatch(provider, pr)
+							lastErr = "timeout waiting for first response"
+							lastErrCode = http.StatusGatewayTimeout
+							s.logger.Warn("provider timeout (no backup), retrying",
+								"request_id", requestID,
+								"provider_id", provider.ID,
+								"attempt", attempt+1,
+							)
+							s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
+								"provider first-chunk timeout",
+								map[string]any{
+									"provider_id": provider.ID,
+									"attempt":     attempt + 1,
+									"reason":      "first_chunk_timeout",
+								})
+							if s.metrics != nil {
+								s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
+							}
+							s.ddIncr("inference.dispatches", []string{"status:timeout"})
+							provider = nil
+							pr = nil
+							continue dispatch
+						case <-r.Context().Done():
+							remainingDeadline.Stop()
+							s.cancelDispatch(provider, pr)
+							refundReservation()
+							return
 						}
 					}
-				case <-pr.AcceptedCh:
-					remainingDeadline.Stop()
-					accepted = true
-				case errMsg := <-pr.ErrorCh:
-					remainingDeadline.Stop()
-					excludeProviders[provider.ID] = struct{}{}
-					s.cancelDispatch(provider, pr)
-					lastErr = errMsg.Error
-					lastErrCode = errMsg.StatusCode
-					if s.metrics != nil {
-						s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-					}
-					s.ddIncr("inference.dispatches", []string{"status:retry"})
-					provider = nil
-					pr = nil
-					continue
-				case <-remainingDeadline.C:
-					excludeProviders[provider.ID] = struct{}{}
-					s.cancelDispatch(provider, pr)
-					lastErr = "timeout waiting for first response"
-					lastErrCode = http.StatusGatewayTimeout
-					s.logger.Warn("provider timeout (no backup), retrying",
+				} else {
+					// Backup dispatched — race primary vs backup.
+					s.logger.Info("speculative_dispatch",
 						"request_id", requestID,
-						"provider_id", provider.ID,
-						"attempt", attempt+1,
+						"primary_provider", provider.ID,
+						"backup_provider", backupProvider.ID,
+						"ttft_deadline_ms", deadline.Milliseconds(),
+						"speculative_at_ms", speculativeAt.Milliseconds(),
 					)
-					s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-						"provider first-chunk timeout",
-						map[string]any{
-							"provider_id": provider.ID,
-							"attempt":     attempt + 1,
-							"reason":      "first_chunk_timeout",
-						})
-					if s.metrics != nil {
-						s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
+
+					raceDeadline := time.NewTimer(deadline - speculativeAt)
+					// One-shot extension: when the race deadline expires but a racer
+					// has shown liveness (preamble received), the race continues for
+					// the full inference window instead of failing the request.
+					raceExtended := false
+					// Preamble chunks from the backup are buffered separately —
+					// held chunks must never mix providers.
+					var backupHeld []string
+
+				race:
+					for {
+						select {
+						case chunk, ok := <-pr.ChunkCh:
+							if ok && len(heldChunks) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+								// Preamble only — the primary hasn't proven it can
+								// generate; keep the backup racing for first content.
+								heldChunks = append(heldChunks, chunk)
+								if pr.Timing.FirstChunkAt.IsZero() {
+									pr.Timing.FirstChunkAt = time.Now()
+								}
+								continue race
+							}
+							// Primary wins!
+							raceDeadline.Stop()
+							s.cancelDispatch(backupProvider, backupPR)
+							if ok {
+								firstChunk = chunk
+								if pr.Timing.FirstChunkAt.IsZero() {
+									pr.Timing.FirstChunkAt = time.Now()
+								}
+								committed = true
+							} else {
+								select {
+								case errMsg := <-pr.ErrorCh:
+									// Primary failed but we already cancelled backup.
+									excludeProviders[provider.ID] = struct{}{}
+									s.cancelDispatch(provider, pr)
+									lastErr = errMsg.Error
+									lastErrCode = errMsg.StatusCode
+									lastFailedVersion = failedProviderVersion(provider)
+									if !s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks) {
+										s.ddIncr("inference.dispatches", []string{"status:retry"})
+									}
+									provider = nil
+									pr = nil
+									continue dispatch
+								default:
+									committed = true
+								}
+							}
+							break race
+
+						case chunk, ok := <-backupPR.ChunkCh:
+							if ok && len(backupHeld) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+								// Backup preamble doesn't win the race — first CONTENT does.
+								backupHeld = append(backupHeld, chunk)
+								if backupPR.Timing.FirstChunkAt.IsZero() {
+									backupPR.Timing.FirstChunkAt = time.Now()
+								}
+								continue race
+							}
+							// Backup wins!
+							raceDeadline.Stop()
+							s.cancelDispatch(provider, pr)
+							s.ddIncr("inference.speculative_win", []string{"model:" + model})
+							if ok {
+								provider = backupProvider
+								pr = backupPR
+								requestID = pr.RequestID
+								heldChunks = backupHeld
+								firstChunk = chunk
+								if pr.Timing.FirstChunkAt.IsZero() {
+									pr.Timing.FirstChunkAt = time.Now()
+								}
+								committed = true
+							} else {
+								select {
+								case errMsg := <-backupPR.ErrorCh:
+									// Backup failed too. Keep primary context for retry.
+									excludeProviders[backupProvider.ID] = struct{}{}
+									lastFailedVersion = failedProviderVersion(backupProvider)
+									s.noteDispatchProviderError(backupProvider, backupPR, errMsg.StatusCode, &backupHeld)
+									// Wait remaining deadline for primary.
+									remainingPrimary := time.NewTimer(deadline - speculativeAt)
+								backupFailedPrimaryWait:
+									for {
+										select {
+										case chunk, ok := <-pr.ChunkCh:
+											if ok && len(heldChunks) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+												heldChunks = append(heldChunks, chunk)
+												if pr.Timing.FirstChunkAt.IsZero() {
+													pr.Timing.FirstChunkAt = time.Now()
+												}
+												continue backupFailedPrimaryWait
+											}
+											remainingPrimary.Stop()
+											if ok {
+												firstChunk = chunk
+												if pr.Timing.FirstChunkAt.IsZero() {
+													pr.Timing.FirstChunkAt = time.Now()
+												}
+												committed = true
+											} else {
+												select {
+												case errMsg2 := <-pr.ErrorCh:
+													excludeProviders[provider.ID] = struct{}{}
+													s.cancelDispatch(provider, pr)
+													lastErr = errMsg2.Error
+													lastErrCode = errMsg2.StatusCode
+													lastFailedVersion = failedProviderVersion(provider)
+													if !s.noteDispatchProviderError(provider, pr, errMsg2.StatusCode, &heldChunks) {
+														s.ddIncr("inference.dispatches", []string{"status:retry"})
+													}
+													provider = nil
+													pr = nil
+													continue dispatch
+												default:
+													committed = true
+												}
+											}
+											break backupFailedPrimaryWait
+										case <-pr.AcceptedCh:
+											remainingPrimary.Stop()
+											accepted = true
+											break backupFailedPrimaryWait
+										case errMsg2 := <-pr.ErrorCh:
+											// Defensive: both ErrorCh senders currently send before
+											// closing ChunkCh (the closed-ChunkCh check above catches
+											// them), but a direct arm keeps this loop correct if that
+											// ordering ever changes — mirroring its sibling wait loops.
+											remainingPrimary.Stop()
+											excludeProviders[provider.ID] = struct{}{}
+											s.cancelDispatch(provider, pr)
+											lastErr = errMsg2.Error
+											lastErrCode = errMsg2.StatusCode
+											lastFailedVersion = failedProviderVersion(provider)
+											if !s.noteDispatchProviderError(provider, pr, errMsg2.StatusCode, &heldChunks) {
+												s.ddIncr("inference.dispatches", []string{"status:retry"})
+											}
+											provider = nil
+											pr = nil
+											continue dispatch
+										case <-remainingPrimary.C:
+											if len(heldChunks) > 0 {
+												// Primary preamble liveness — extend to the
+												// preamble-to-content budget instead of failing.
+												accepted = true
+												preambleLiveness = true
+												break backupFailedPrimaryWait
+											}
+											// The PRIMARY timed out here (the backup's earlier error
+											// is already recorded); report the timeout, not the
+											// backup's stale error text.
+											excludeProviders[provider.ID] = struct{}{}
+											s.cancelDispatch(provider, pr)
+											lastErr = "timeout waiting for first response"
+											lastErrCode = http.StatusGatewayTimeout
+											if s.metrics != nil {
+												s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
+											}
+											s.ddIncr("inference.dispatches", []string{"status:timeout"})
+											provider = nil
+											pr = nil
+											continue dispatch
+										case <-r.Context().Done():
+											remainingPrimary.Stop()
+											s.cancelDispatch(provider, pr)
+											refundReservation()
+											return
+										}
+									}
+								default:
+									// Backup channel closed with no error — treat as committed.
+									s.cancelDispatch(provider, pr)
+									provider = backupProvider
+									pr = backupPR
+									requestID = pr.RequestID
+									heldChunks = backupHeld
+									committed = true
+								}
+							}
+							break race
+
+						case <-pr.AcceptedCh:
+							// Primary accepted (model reload). Cancel backup, extend deadline.
+							raceDeadline.Stop()
+							s.cancelDispatch(backupProvider, backupPR)
+							accepted = true
+							break race
+
+						case <-backupPR.AcceptedCh:
+							// Backup accepted (model reload). Cancel primary, extend deadline.
+							raceDeadline.Stop()
+							s.cancelDispatch(provider, pr)
+							provider = backupProvider
+							pr = backupPR
+							requestID = pr.RequestID
+							heldChunks = backupHeld
+							accepted = true
+							break race
+
+						case errMsg := <-pr.ErrorCh:
+							// Primary failed. Keep waiting for backup.
+							raceDeadline.Stop()
+							excludeProviders[provider.ID] = struct{}{}
+							s.cancelDispatch(provider, pr)
+							lastFailedVersion = failedProviderVersion(provider)
+							s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks)
+							// Wait for backup with remaining deadline.
+							backupDeadline := time.NewTimer(deadline - speculativeAt)
+						primaryFailedBackupWait:
+							for {
+								select {
+								case chunk, ok := <-backupPR.ChunkCh:
+									if ok && len(backupHeld) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+										backupHeld = append(backupHeld, chunk)
+										if backupPR.Timing.FirstChunkAt.IsZero() {
+											backupPR.Timing.FirstChunkAt = time.Now()
+										}
+										continue primaryFailedBackupWait
+									}
+									backupDeadline.Stop()
+									if ok {
+										provider = backupProvider
+										pr = backupPR
+										requestID = pr.RequestID
+										heldChunks = backupHeld
+										firstChunk = chunk
+										if pr.Timing.FirstChunkAt.IsZero() {
+											pr.Timing.FirstChunkAt = time.Now()
+										}
+										committed = true
+									} else {
+										select {
+										case errMsg2 := <-backupPR.ErrorCh:
+											excludeProviders[backupProvider.ID] = struct{}{}
+											s.cancelDispatch(backupProvider, backupPR)
+											lastErr = errMsg2.Error
+											lastErrCode = errMsg2.StatusCode
+											lastFailedVersion = failedProviderVersion(backupProvider)
+											if !s.noteDispatchProviderError(backupProvider, backupPR, errMsg2.StatusCode, &backupHeld) {
+												s.ddIncr("inference.dispatches", []string{"status:retry"})
+											}
+											provider = nil
+											pr = nil
+											continue dispatch
+										default:
+											provider = backupProvider
+											pr = backupPR
+											requestID = pr.RequestID
+											heldChunks = backupHeld
+											committed = true
+										}
+									}
+									break primaryFailedBackupWait
+								case <-backupPR.AcceptedCh:
+									backupDeadline.Stop()
+									provider = backupProvider
+									pr = backupPR
+									requestID = pr.RequestID
+									heldChunks = backupHeld
+									accepted = true
+									break primaryFailedBackupWait
+								case errMsg2 := <-backupPR.ErrorCh:
+									backupDeadline.Stop()
+									excludeProviders[backupProvider.ID] = struct{}{}
+									s.cancelDispatch(backupProvider, backupPR)
+									lastErr = errMsg2.Error
+									lastErrCode = errMsg2.StatusCode
+									lastFailedVersion = failedProviderVersion(backupProvider)
+									s.noteDispatchProviderError(backupProvider, backupPR, errMsg2.StatusCode, &backupHeld)
+									provider = nil
+									pr = nil
+									continue dispatch
+								case <-backupDeadline.C:
+									if len(backupHeld) > 0 {
+										// Backup preamble liveness — promote it and extend
+										// by the preamble-to-content budget for first content.
+										provider = backupProvider
+										pr = backupPR
+										requestID = pr.RequestID
+										heldChunks = backupHeld
+										accepted = true
+										preambleLiveness = true
+										break primaryFailedBackupWait
+									}
+									excludeProviders[backupProvider.ID] = struct{}{}
+									s.cancelDispatch(backupProvider, backupPR)
+									lastErr = "timeout waiting for first response (backup)"
+									lastErrCode = http.StatusGatewayTimeout
+									if s.metrics != nil {
+										s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
+									}
+									s.ddIncr("inference.dispatches", []string{"status:timeout"})
+									provider = nil
+									pr = nil
+									continue dispatch
+								case <-r.Context().Done():
+									backupDeadline.Stop()
+									s.cancelDispatch(backupProvider, backupPR)
+									refundReservation()
+									return
+								}
+							}
+							break race
+
+						case errMsg := <-backupPR.ErrorCh:
+							// Backup failed. Keep waiting for primary.
+							raceDeadline.Stop()
+							excludeProviders[backupProvider.ID] = struct{}{}
+							s.cancelDispatch(backupProvider, backupPR)
+							lastFailedVersion = failedProviderVersion(backupProvider)
+							s.noteDispatchProviderError(backupProvider, backupPR, errMsg.StatusCode, &backupHeld)
+							primaryDeadline := time.NewTimer(deadline - speculativeAt)
+						backupFailedWaitPrimary:
+							for {
+								select {
+								case chunk, ok := <-pr.ChunkCh:
+									if ok && len(heldChunks) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+										heldChunks = append(heldChunks, chunk)
+										if pr.Timing.FirstChunkAt.IsZero() {
+											pr.Timing.FirstChunkAt = time.Now()
+										}
+										continue backupFailedWaitPrimary
+									}
+									primaryDeadline.Stop()
+									if ok {
+										firstChunk = chunk
+										if pr.Timing.FirstChunkAt.IsZero() {
+											pr.Timing.FirstChunkAt = time.Now()
+										}
+										committed = true
+									} else {
+										select {
+										case errMsg2 := <-pr.ErrorCh:
+											excludeProviders[provider.ID] = struct{}{}
+											s.cancelDispatch(provider, pr)
+											lastErr = errMsg2.Error
+											lastErrCode = errMsg2.StatusCode
+											lastFailedVersion = failedProviderVersion(provider)
+											if !s.noteDispatchProviderError(provider, pr, errMsg2.StatusCode, &heldChunks) {
+												s.ddIncr("inference.dispatches", []string{"status:retry"})
+											}
+											provider = nil
+											pr = nil
+											continue dispatch
+										default:
+											committed = true
+										}
+									}
+									break backupFailedWaitPrimary
+								case <-pr.AcceptedCh:
+									primaryDeadline.Stop()
+									accepted = true
+									break backupFailedWaitPrimary
+								case errMsg2 := <-pr.ErrorCh:
+									primaryDeadline.Stop()
+									excludeProviders[provider.ID] = struct{}{}
+									s.cancelDispatch(provider, pr)
+									lastErr = errMsg2.Error
+									lastErrCode = errMsg2.StatusCode
+									lastFailedVersion = failedProviderVersion(provider)
+									s.noteDispatchProviderError(provider, pr, errMsg2.StatusCode, &heldChunks)
+									provider = nil
+									pr = nil
+									continue dispatch
+								case <-primaryDeadline.C:
+									if len(heldChunks) > 0 {
+										// Primary preamble liveness — extend by the
+										// preamble-to-content budget instead of failing.
+										accepted = true
+										preambleLiveness = true
+										break backupFailedWaitPrimary
+									}
+									excludeProviders[provider.ID] = struct{}{}
+									s.cancelDispatch(provider, pr)
+									lastErr = "timeout waiting for first response"
+									lastErrCode = http.StatusGatewayTimeout
+									if s.metrics != nil {
+										s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
+									}
+									s.ddIncr("inference.dispatches", []string{"status:timeout"})
+									provider = nil
+									pr = nil
+									continue dispatch
+								case <-r.Context().Done():
+									primaryDeadline.Stop()
+									s.cancelDispatch(provider, pr)
+									refundReservation()
+									return
+								}
+							}
+							break race
+
+						case <-raceDeadline.C:
+							if !raceExtended && (len(heldChunks) > 0 || len(backupHeld) > 0) {
+								// Liveness from at least one racer: don't fail at the
+								// TTFT deadline — extend once by the preamble-to-content
+								// budget (zero bytes have reached the client; a genuine
+								// cold load would have signalled AcceptedCh) and keep both
+								// racing for first content, with both error channels still
+								// armed for retry.
+								raceExtended = true
+								raceDeadline = time.NewTimer(preambleContentTimeout)
+								continue race
+							}
+							// Both missed deadline. A racer that held preamble (role
+							// then stall) is a 504-shaped sickness — feed the breaker
+							// before cancelling, mirroring the single-provider
+							// acceptedWait timeout path so a stalling provider/model
+							// (shape-keyed) trips its cooldown.
+							if len(heldChunks) > 0 {
+								s.noteInferenceError(provider.ID, pr, http.StatusGatewayTimeout)
+							}
+							if len(backupHeld) > 0 {
+								s.noteInferenceError(backupProvider.ID, backupPR, http.StatusGatewayTimeout)
+							}
+							s.cancelDispatch(provider, pr)
+							s.cancelDispatch(backupProvider, backupPR)
+							excludeProviders[provider.ID] = struct{}{}
+							excludeProviders[backupProvider.ID] = struct{}{}
+							lastErr = "timeout waiting for first response (both providers)"
+							lastErrCode = http.StatusGatewayTimeout
+							if s.metrics != nil {
+								s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
+							}
+							s.ddIncr("inference.dispatches", []string{"status:timeout"})
+							provider = nil
+							pr = nil
+							continue dispatch
+
+						case <-r.Context().Done():
+							raceDeadline.Stop()
+							s.cancelDispatch(provider, pr)
+							s.cancelDispatch(backupProvider, backupPR)
+							refundReservation()
+							return
+						}
 					}
-					s.ddIncr("inference.dispatches", []string{"status:timeout"})
-					provider = nil
-					pr = nil
-					continue
-				case <-r.Context().Done():
-					remainingDeadline.Stop()
-					s.cancelDispatch(provider, pr)
-					refundReservation()
-					return
 				}
-			} else {
-				// Backup dispatched — race primary vs backup.
-				s.logger.Info("speculative_dispatch",
+				break firstChunkWait
+
+			case <-deadlineTimer.C:
+				speculativeTimer.Stop()
+				if len(heldChunks) > 0 {
+					// Preamble liveness — the provider is alive but still in its
+					// pre-content phase. Fall through to the extended
+					// (preambleContentTimeout) wait instead of failing the attempt.
+					accepted = true
+					preambleLiveness = true
+					break firstChunkWait
+				}
+				excludeProviders[provider.ID] = struct{}{}
+				s.cancelDispatch(provider, pr)
+				lastErr = "timeout waiting for first response"
+				lastErrCode = http.StatusGatewayTimeout
+				s.logger.Warn("provider timeout (full deadline), retrying",
 					"request_id", requestID,
-					"primary_provider", provider.ID,
-					"backup_provider", backupProvider.ID,
-					"ttft_deadline_ms", deadline.Milliseconds(),
-					"speculative_at_ms", speculativeAt.Milliseconds(),
+					"provider_id", provider.ID,
+					"attempt", attempt+1,
 				)
+				s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
+					"provider first-chunk timeout",
+					map[string]any{
+						"provider_id": provider.ID,
+						"attempt":     attempt + 1,
+						"reason":      "first_chunk_timeout",
+					})
+				if s.metrics != nil {
+					s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
+				}
+				s.ddIncr("inference.dispatches", []string{"status:timeout"})
+				provider = nil
+				pr = nil
+				continue dispatch

-				raceDeadline := time.NewTimer(deadline - speculativeAt)
+			case <-r.Context().Done():
+				speculativeTimer.Stop()
+				deadlineTimer.Stop()
+				s.cancelDispatch(provider, pr)
+				refundReservation()
+				return
+			}
+		}

+		// Provider accepted or sent first content chunk — commit to this
+		// provider. If not committed yet, wait for the first content chunk.
+		// The budget depends on WHY we're here: a genuine AcceptedCh (the
+		// backend is reloading the model — legitimately minutes) keeps the
+		// full inferenceTimeout; a boilerplate-liveness extension past an
+		// expired TTFT deadline gets only preambleContentTimeout — zero bytes
+		// have been written to the client, so a preamble-then-stall provider
+		// must fail over instead of pinning the request for 10 minutes.
+		if accepted && !committed {
+			firstContentBudget := inferenceTimeout
+			if preambleLiveness {
+				firstContentBudget = preambleContentTimeout
+			}
+			chunkTimer := time.NewTimer(firstContentBudget)
+		acceptedWait:
+			for {
 				select {
 				case chunk, ok := <-pr.ChunkCh:
-					// Primary wins!
-					raceDeadline.Stop()
-					s.cancelDispatch(backupProvider, backupPR)
+					if ok && len(heldChunks) < maxHeldBoilerplate && isBoilerplateChunk(chunk) {
+						heldChunks = append(heldChunks, chunk)
+						if pr.Timing.FirstChunkAt.IsZero() {
+							pr.Timing.FirstChunkAt = time.Now()
+						}
+						continue acceptedWait
+					}
+					chunkTimer.Stop()
 					if ok {
 						firstChunk = chunk
-						pr.Timing.FirstChunkAt = time.Now()
+						if pr.Timing.FirstChunkAt.IsZero() {
+							pr.Timing.FirstChunkAt = time.Now()
+						}
 						committed = true
 					} else {
+						// Closed — check for error. Use a short grace
+						// period instead of a non-blocking default to
+						// close the race where Go's select picks the
+						// ChunkCh close before the ErrorCh value (sent
+						// by the provider handler before closing ChunkCh).
 						select {
 						case errMsg := <-pr.ErrorCh:
-							// Primary failed but we already cancelled backup.
 							excludeProviders[provider.ID] = struct{}{}
 							s.cancelDispatch(provider, pr)
 							lastErr = errMsg.Error
 							lastErrCode = errMsg.StatusCode
+							lastFailedVersion = failedProviderVersion(provider)
+							s.logger.Warn("provider failed after accepting request, retrying",
+								"request_id", requestID,
+								"provider_id", provider.ID,
+								"attempt", attempt+1,
+								"error", errMsg.Error,
+							)
+							s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
+								"provider failed after accepting request, retrying",
+								map[string]any{
+									"provider_id": provider.ID,
+									"attempt":     attempt + 1,
+									"reason":      "provider_error",
+									"status_code": errMsg.StatusCode,
+								})
+							if s.metrics != nil {
+								s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
+							}
+							if !s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks) {
+								s.ddIncr("inference.dispatches", []string{"status:retry"})
+							}
 							provider = nil
 							pr = nil
-							continue
-						default:
-							committed = true
-						}
-					}
-
-				case chunk, ok := <-backupPR.ChunkCh:
-					// Backup wins!
-					raceDeadline.Stop()
-					s.cancelDispatch(provider, pr)
-					s.ddIncr("inference.speculative_win", []string{"model:" + model})
-					if ok {
-						provider = backupProvider
-						pr = backupPR
-						requestID = pr.RequestID
-						firstChunk = chunk
-						pr.Timing.FirstChunkAt = time.Now()
-						committed = true
-					} else {
-						select {
-						case errMsg := <-backupPR.ErrorCh:
-							// Backup failed too. Keep primary context for retry.
-							excludeProviders[backupProvider.ID] = struct{}{}
-							// Wait remaining deadline for primary.
-							remainingPrimary := time.NewTimer(deadline - speculativeAt)
-							select {
-							case chunk, ok := <-pr.ChunkCh:
-								remainingPrimary.Stop()
-								if ok {
-									firstChunk = chunk
-									pr.Timing.FirstChunkAt = time.Now()
-									committed = true
-								} else {
-									select {
-									case errMsg2 := <-pr.ErrorCh:
-										excludeProviders[provider.ID] = struct{}{}
-										s.cancelDispatch(provider, pr)
-										lastErr = errMsg2.Error
-										lastErrCode = errMsg2.StatusCode
-										provider = nil
-										pr = nil
-										continue
-									default:
-										committed = true
-									}
-								}
-							case <-pr.AcceptedCh:
-								remainingPrimary.Stop()
-								accepted = true
-							case <-remainingPrimary.C:
-								excludeProviders[provider.ID] = struct{}{}
-								s.cancelDispatch(provider, pr)
-								lastErr = errMsg.Error
-								lastErrCode = errMsg.StatusCode
-								provider = nil
-								pr = nil
-								continue
-							case <-r.Context().Done():
-								remainingPrimary.Stop()
-								s.cancelDispatch(provider, pr)
-								refundReservation()
-								return
-							}
-						default:
-							// Backup channel closed with no error — treat as committed.
-							s.cancelDispatch(provider, pr)
-							provider = backupProvider
-							pr = backupPR
-							requestID = pr.RequestID
+							continue dispatch
+						case <-time.After(50 * time.Millisecond):
 							committed = true
 						}
 					}
-
-				case <-pr.AcceptedCh:
-					// Primary accepted (model reload). Cancel backup, extend deadline.
-					raceDeadline.Stop()
-					s.cancelDispatch(backupProvider, backupPR)
-					accepted = true
-
-				case <-backupPR.AcceptedCh:
-					// Backup accepted (model reload). Cancel primary, extend deadline.
-					raceDeadline.Stop()
-					s.cancelDispatch(provider, pr)
-					provider = backupProvider
-					pr = backupPR
-					requestID = pr.RequestID
-					accepted = true
-
+					break acceptedWait
 				case errMsg := <-pr.ErrorCh:
-					// Primary failed. Keep waiting for backup.
-					raceDeadline.Stop()
+					chunkTimer.Stop()
 					excludeProviders[provider.ID] = struct{}{}
 					s.cancelDispatch(provider, pr)
-					// Wait for backup with remaining deadline.
-					backupDeadline := time.NewTimer(deadline - speculativeAt)
-					select {
-					case chunk, ok := <-backupPR.ChunkCh:
-						backupDeadline.Stop()
-						_ = errMsg // used implicitly via excludeProviders
-						if ok {
-							provider = backupProvider
-							pr = backupPR
-							requestID = pr.RequestID
-							firstChunk = chunk
-							pr.Timing.FirstChunkAt = time.Now()
-							committed = true
-						} else {
-							select {
-							case errMsg2 := <-backupPR.ErrorCh:
-								excludeProviders[backupProvider.ID] = struct{}{}
-								s.cancelDispatch(backupProvider, backupPR)
-								lastErr = errMsg2.Error
-								lastErrCode = errMsg2.StatusCode
-								provider = nil
-								pr = nil
-								continue
-							default:
-								provider = backupProvider
-								pr = backupPR
-								requestID = pr.RequestID
-								committed = true
-							}
-						}
-					case <-backupPR.AcceptedCh:
-						backupDeadline.Stop()
-						provider = backupProvider
-						pr = backupPR
-						requestID = pr.RequestID
-						accepted = true
-					case errMsg2 := <-backupPR.ErrorCh:
-						backupDeadline.Stop()
-						excludeProviders[backupProvider.ID] = struct{}{}
-						s.cancelDispatch(backupProvider, backupPR)
-						lastErr = errMsg2.Error
-						lastErrCode = errMsg2.StatusCode
-						provider = nil
-						pr = nil
-						continue
-					case <-backupDeadline.C:
-						excludeProviders[backupProvider.ID] = struct{}{}
-						s.cancelDispatch(backupProvider, backupPR)
-						lastErr = "timeout waiting for first response (backup)"
-						lastErrCode = http.StatusGatewayTimeout
-						if s.metrics != nil {
-							s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-						}
-						s.ddIncr("inference.dispatches", []string{"status:timeout"})
-						provider = nil
-						pr = nil
-						continue
-					case <-r.Context().Done():
-						backupDeadline.Stop()
-						s.cancelDispatch(backupProvider, backupPR)
-						refundReservation()
-						return
+					lastErr = errMsg.Error
+					lastErrCode = errMsg.StatusCode
+					lastFailedVersion = failedProviderVersion(provider)
+					s.logger.Warn("provider failed after accepting request, retrying",
+						"request_id", requestID,
+						"provider_id", provider.ID,
+						"attempt", attempt+1,
+						"error", errMsg.Error,
+					)
+					s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
+						"provider failed after accepting request, retrying",
+						map[string]any{
+							"provider_id": provider.ID,
+							"attempt":     attempt + 1,
+							"reason":      "provider_error",
+							"status_code": errMsg.StatusCode,
+						})
+					if s.metrics != nil {
+						s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
 					}
-
-				case errMsg := <-backupPR.ErrorCh:
-					// Backup failed. Keep waiting for primary.
-					raceDeadline.Stop()
-					excludeProviders[backupProvider.ID] = struct{}{}
-					s.cancelDispatch(backupProvider, backupPR)
-					_ = errMsg
-					primaryDeadline := time.NewTimer(deadline - speculativeAt)
-					select {
-					case chunk, ok := <-pr.ChunkCh:
-						primaryDeadline.Stop()
-						if ok {
-							firstChunk = chunk
-							pr.Timing.FirstChunkAt = time.Now()
-							committed = true
-						} else {
-							select {
-							case errMsg2 := <-pr.ErrorCh:
-								excludeProviders[provider.ID] = struct{}{}
-								s.cancelDispatch(provider, pr)
-								lastErr = errMsg2.Error
-								lastErrCode = errMsg2.StatusCode
-								provider = nil
-								pr = nil
-								continue
-							default:
-								committed = true
-							}
-						}
-					case <-pr.AcceptedCh:
-						primaryDeadline.Stop()
-						accepted = true
-					case errMsg2 := <-pr.ErrorCh:
-						primaryDeadline.Stop()
-						excludeProviders[provider.ID] = struct{}{}
-						s.cancelDispatch(provider, pr)
-						lastErr = errMsg2.Error
-						lastErrCode = errMsg2.StatusCode
-						provider = nil
-						pr = nil
-						continue
-					case <-primaryDeadline.C:
-						excludeProviders[provider.ID] = struct{}{}
-						s.cancelDispatch(provider, pr)
-						lastErr = "timeout waiting for first response"
-						lastErrCode = http.StatusGatewayTimeout
-						if s.metrics != nil {
-							s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-						}
-						s.ddIncr("inference.dispatches", []string{"status:timeout"})
-						provider = nil
-						pr = nil
-						continue
-					case <-r.Context().Done():
-						primaryDeadline.Stop()
-						s.cancelDispatch(provider, pr)
-						refundReservation()
-						return
+					if !s.noteDispatchProviderError(provider, pr, errMsg.StatusCode, &heldChunks) {
+						s.ddIncr("inference.dispatches", []string{"status:retry"})
 					}
-
-				case <-raceDeadline.C:
-					// Both missed deadline.
-					s.cancelDispatch(provider, pr)
-					s.cancelDispatch(backupProvider, backupPR)
+					provider = nil
+					pr = nil
+					continue dispatch
+				case <-chunkTimer.C:
 					excludeProviders[provider.ID] = struct{}{}
-					excludeProviders[backupProvider.ID] = struct{}{}
-					lastErr = "timeout waiting for first response (both providers)"
+					s.cancelDispatch(provider, pr)
+					// Accepted-then-silent (or preamble-then-stall) is a
+					// provider-at-fault 504 — feed the breaker so a provider
+					// that repeatedly acks and stalls enters cooldown instead
+					// of soaking retries forever. (504 is one of the breaker's
+					// counted codes; this arm is where those 504s originate.)
+					s.noteInferenceError(provider.ID, pr, http.StatusGatewayTimeout)
+					lastErr = "provider accepted but timed out before first chunk"
+					if preambleLiveness {
+						lastErr = "provider sent preamble but stalled before first content"
+					}
 					lastErrCode = http.StatusGatewayTimeout
+					s.logger.Warn("provider timed out after accepting request, retrying",
+						"request_id", requestID,
+						"provider_id", provider.ID,
+						"attempt", attempt+1,
+						"preamble_liveness", preambleLiveness,
+					)
+					s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
+						"provider accepted timeout",
+						map[string]any{
+							"provider_id": provider.ID,
+							"attempt":     attempt + 1,
+							"reason":      "accepted_timeout",
+						})
 					if s.metrics != nil {
 						s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
 					}
 					s.ddIncr("inference.dispatches", []string{"status:timeout"})
 					provider = nil
 					pr = nil
-					continue
-
+					continue dispatch
 				case <-r.Context().Done():
-					raceDeadline.Stop()
 					s.cancelDispatch(provider, pr)
-					s.cancelDispatch(backupProvider, backupPR)
 					refundReservation()
 					return
 				}
 			}
-
-		case <-deadlineTimer.C:
-			speculativeTimer.Stop()
-			excludeProviders[provider.ID] = struct{}{}
-			s.cancelDispatch(provider, pr)
-			lastErr = "timeout waiting for first response"
-			lastErrCode = http.StatusGatewayTimeout
-			s.logger.Warn("provider timeout (full deadline), retrying",
-				"request_id", requestID,
-				"provider_id", provider.ID,
-				"attempt", attempt+1,
-			)
-			s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-				"provider first-chunk timeout",
-				map[string]any{
-					"provider_id": provider.ID,
-					"attempt":     attempt + 1,
-					"reason":      "first_chunk_timeout",
-				})
-			if s.metrics != nil {
-				s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-			}
-			s.ddIncr("inference.dispatches", []string{"status:timeout"})
-			provider = nil
-			pr = nil
-			continue
-
-		case <-r.Context().Done():
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			s.cancelDispatch(provider, pr)
-			refundReservation()
-			return
-		}
-
-		// Provider accepted or sent first chunk — commit to this provider.
-		// If only accepted (no chunk yet), wait for the first chunk with
-		// the full inference timeout since the backend may be reloading.
-		if accepted && !committed {
-			chunkTimer := time.NewTimer(inferenceTimeout)
-			select {
-			case chunk, ok := <-pr.ChunkCh:
-				chunkTimer.Stop()
-				if ok {
-					firstChunk = chunk
-					pr.Timing.FirstChunkAt = time.Now()
-					committed = true
-				} else {
-					// Closed — check for error. Use a short grace
-					// period instead of a non-blocking default to
-					// close the race where Go's select picks the
-					// ChunkCh close before the ErrorCh value (sent
-					// by the provider handler before closing ChunkCh).
-					select {
-					case errMsg := <-pr.ErrorCh:
-						excludeProviders[provider.ID] = struct{}{}
-						s.cancelDispatch(provider, pr)
-						lastErr = errMsg.Error
-						lastErrCode = errMsg.StatusCode
-						s.logger.Warn("provider failed after accepting request, retrying",
-							"request_id", requestID,
-							"provider_id", provider.ID,
-							"attempt", attempt+1,
-							"error", errMsg.Error,
-						)
-						s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-							"provider failed after accepting request, retrying",
-							map[string]any{
-								"provider_id": provider.ID,
-								"attempt":     attempt + 1,
-								"reason":      "provider_error",
-								"status_code": errMsg.StatusCode,
-							})
-						if s.metrics != nil {
-							s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-						}
-						s.ddIncr("inference.dispatches", []string{"status:retry"})
-						provider = nil
-						pr = nil
-						continue
-					case <-time.After(50 * time.Millisecond):
-						committed = true
-					}
-				}
-			case errMsg := <-pr.ErrorCh:
-				chunkTimer.Stop()
-				excludeProviders[provider.ID] = struct{}{}
-				s.cancelDispatch(provider, pr)
-				lastErr = errMsg.Error
-				lastErrCode = errMsg.StatusCode
-				s.logger.Warn("provider failed after accepting request, retrying",
-					"request_id", requestID,
-					"provider_id", provider.ID,
-					"attempt", attempt+1,
-					"error", errMsg.Error,
-				)
-				s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-					"provider failed after accepting request, retrying",
-					map[string]any{
-						"provider_id": provider.ID,
-						"attempt":     attempt + 1,
-						"reason":      "provider_error",
-						"status_code": errMsg.StatusCode,
-					})
-				if s.metrics != nil {
-					s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-				}
-				s.ddIncr("inference.dispatches", []string{"status:retry"})
-				provider = nil
-				pr = nil
-				continue
-			case <-chunkTimer.C:
-				excludeProviders[provider.ID] = struct{}{}
-				s.cancelDispatch(provider, pr)
-				lastErr = "provider accepted but timed out before first chunk"
-				lastErrCode = http.StatusGatewayTimeout
-				s.logger.Warn("provider timed out after accepting request, retrying",
-					"request_id", requestID,
-					"provider_id", provider.ID,
-					"attempt", attempt+1,
-				)
-				s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-					"provider accepted timeout",
-					map[string]any{
-						"provider_id": provider.ID,
-						"attempt":     attempt + 1,
-						"reason":      "accepted_timeout",
-					})
-				if s.metrics != nil {
-					s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-				}
-				s.ddIncr("inference.dispatches", []string{"status:timeout"})
-				provider = nil
-				pr = nil
-				continue
-			case <-r.Context().Done():
-				s.cancelDispatch(provider, pr)
-				refundReservation()
-				return
-			}
 		}

 		break


🔴 [CRITICAL] 🧩 890-line nested state machine in dispatch loop

💡 Suggestion: Extract the speculative dispatch racing logic into separate functions - this single block handles too many scenarios and is hard to follow

📊 Score: 4×5 = 20 · Category: over-complexity

ethenotethan · 2026-06-12T00:13:21Z

+package api
+
+import (
+	"bytes"
+	"encoding/json"
+	"errors"
+	"io"
+	"slices"
+)
+
+// Tool-schema normalization (DAR-130), a Go port of the Swift provider's
+// ToolSchemaNormalization.ensureParameterTypes
+// (provider-swift/Sources/ProviderCore/Inference/ToolSchemaNormalization.swift).
+// The per-schema repair semantics (injectDefaultTypes and helpers) must stay
+// semantically in sync with the Swift implementation.
+//
+// Gemma-style chat templates render `{{ value['type'] | upper }}` over each
+// tool parameter. A `type` that is missing — a legitimate OpenAI shape (e.g.
+// an enum-only or anyOf property) — or present but not a string (the
+// JSON-Schema nullable idiom `"type": ["string","null"]` that Pydantic emits
+// for every Optional[...] field) makes the Jinja `| upper` filter throw,
+// surfacing to the consumer as a 500. Providers normalize since 0.6.3, but
+// the fleet updates slowly; normalizing centrally protects consumers from
+// lagging providers the moment the coordinator deploys.
+//
+// Three wire shapes put a JSON-Schema on a tool entry, all of which reach the
+// same templates (the same DAR-130 incident class), so all three are repaired
+// (per-entry detection rules in normalizeToolEntry):
+//
+//  1. OpenAI chat completions: tools[].function.parameters — the original
+//     shape, and the only one the Swift provider-side normalizer covers as
+//     of 0.6.4.
+//  2. OpenAI Responses API (flat): tools[].parameters with no "function"
+//     wrapper. The coordinator converts Responses→chat AFTER this
+//     normalization runs and copies parameters verbatim, so repairing the
+//     flat shape pre-conversion fixes that path end-to-end.
+//  3. Anthropic Messages: tools[].input_schema, served via /v1/messages.
+//
+// Because the provider-side normalizer covers only shape 1, this
+// coordinator-side breadth is the fleet's only protection for shapes 2 and 3.
+
+// maxToolNormalizationBytes is the upper bound on the body we'll JSON
+// round-trip for tool-schema normalization. Tool definitions are tiny (KB),
+// so a multi-MB body — e.g. a long prompt that merely contains the word
+// "tools" — should not trigger a full parse + recursive traversal. Above this
+// we skip normalization, bounding the cost on the (already size-capped)
+// inference path.
+const maxToolNormalizationBytes = 4 * 1024 * 1024
+
+// toolsKeyNeedle is the cheap byte gate: only bodies carrying these bytes pay
+// the JSON round-trip.
+var toolsKeyNeedle = []byte(`"tools"`)
+
+// schemaUnionKeys are the JSON-Schema combinators whose members are
+// themselves schemas.
+var schemaUnionKeys = []string{"anyOf", "oneOf", "allOf"}
+
+// NormalizeToolSchemas returns body with default `type`s injected into every
+// JSON-Schema node under each tool's schema home — function.parameters (chat
+// completions), top-level parameters (Responses flat shape), or input_schema
+// (Anthropic Messages) — so chat templates always have a string to
+// upper-case.
+//
+// Fast-paths out (returns the input unchanged) when the body exceeds
+// maxToolNormalizationBytes, carries no "tools" bytes, isn't a JSON object,
+// or its "tools" value isn't an array. On ANY error path the input is
+// returned unchanged — this function must never break a request that would
+// otherwise work.
+//
+// The body is decoded with json.Decoder.UseNumber so numbers round-trip
+// verbatim (no float64 mangling of int64s or high-precision decimals).
+// Re-marshalling reorders keys and normalizes whitespace; every field other
+// than "tools" survives value-equivalent.
+func NormalizeToolSchemas(body []byte) []byte {
+	// Bound the work: skip the round-trip for oversized bodies (see the constant).
+	if len(body) > maxToolNormalizationBytes {
+		return body
+	}
+	// Cheap gate: only pay the JSON round-trip for requests that carry tools.
+	if !bytes.Contains(body, toolsKeyNeedle) {
+		return body
+	}
+	dec := json.NewDecoder(bytes.NewReader(body))
+	dec.UseNumber()
+	var decoded any
+	if err := dec.Decode(&decoded); err != nil {
+		return body
+	}
+	// Trailing content after the JSON document means the body isn't a single
+	// well-formed object (Swift's JSONSerialization rejects it too) — leave it
+	// for downstream validation to handle.
+	if _, err := dec.Token(); !errors.Is(err, io.EOF) {
+		return body
+	}
+	root, ok := decoded.(map[string]any)
+	if !ok {
+		return body
+	}
+	tools, ok := root["tools"].([]any)
+	if !ok {
+		return body
+	}
+	for i, tool := range tools {
+		tools[i] = normalizeToolEntry(tool)
+	}
+
+	var buf bytes.Buffer
+	buf.Grow(len(body))
+	enc := json.NewEncoder(&buf)
+	enc.SetEscapeHTML(false)
+	if err := enc.Encode(root); err != nil {
+		return body
+	}
+	// Encoder appends a newline after the document; the input had none.
+	return bytes.TrimSuffix(buf.Bytes(), []byte{'\n'})
+}
+
+// normalizeToolEntry rewrites one element of the "tools" array. A tool's
+// JSON-Schema lives in one of three homes, detected per entry:
+//
+//   - Chat-completions shape: "function" is an object — its "parameters"
+//     value (when the key is present) is normalized. An object "function"
+//     wrapper claims the entry: a stray top-level "parameters" beside it is
+//     not a recognized shape and stays untouched (never double-repair).
+//   - Responses-API flat shape: no object "function" wrapper (the key is
+//     absent, or holds a non-object value, which itself passes through
+//     verbatim) — the top-level "parameters" value (when the key is present)
+//     is normalized.
+//   - Anthropic Messages shape: the "input_schema" value (when the key is
+//     present) is normalized, independent of the two OpenAI homes — the
+//     shapes are mutually exclusive on real traffic, but an entry carrying
+//     several schema keys gets each repaired.
+//
+// Entries matching no shape — scalars, schema-less maps — pass through
+// untouched (mirrors the Swift per-tool guard). Schema values are handed to
+// injectDefaultTypes as-is: nulls and scalars come back verbatim, and keys
+// are never invented on the tool entry itself.
+func normalizeToolEntry(tool any) any {
+	toolDict, ok := tool.(map[string]any)
+	if !ok {
+		return tool
+	}
+	if function, ok := toolDict["function"].(map[string]any); ok {
+		if parameters, ok := function["parameters"]; ok {
+			function["parameters"] = injectDefaultTypes(parameters)
+		}
+	} else if parameters, ok := toolDict["parameters"]; ok {
+		toolDict["parameters"] = injectDefaultTypes(parameters)
+	}
+	if inputSchema, ok := toolDict["input_schema"]; ok {
+		toolDict["input_schema"] = injectDefaultTypes(inputSchema)
+	}
+	return toolDict
+}
+
+// injectDefaultTypes recursively default-fills `type` on JSON-Schema nodes. A
+// node gets a type only when it looks like a schema node (has properties /
+// items / additionalProperties / enum / description / anyOf / oneOf / allOf)
+// — we never invent types on arbitrary maps. The inferred default favours
+// structure: object when it has properties, array when it has items,
+// otherwise string.
+func injectDefaultTypes(node any) any {
+	switch n := node.(type) {
+	case []any:
+		for i, v := range n {
+			n[i] = injectDefaultTypes(v)
+		}
+		return n
+	case map[string]any:
+		return injectDefaultTypesIntoSchema(n)
+	default:
+		return node
+	}
+}
+
+// injectDefaultTypesIntoSchema is the map-shaped arm of injectDefaultTypes.
+// Children are normalized BEFORE this node's own type is repaired — ordering
+// is load-bearing: a union member declaring `"type": ["string","null"]` must
+// collapse first so the parent's union inference sees a concrete string.
+func injectDefaultTypesIntoSchema(dict map[string]any) map[string]any {
+	if props, ok := dict["properties"].(map[string]any); ok {
+		for k, v := range props {
+			props[k] = injectDefaultTypes(v)
+		}
+	}
+	if items, ok := dict["items"]; ok {
+		dict["items"] = injectDefaultTypes(items)
+	}
+	// additionalProperties may itself be a schema (map-shaped params, e.g.
+	// {"additionalProperties":{"type":"string"}}) — recurse so its inner schema
+	// gets a default type too. A bare `true`/`false` is left untouched.
+	if addl, ok := dict["additionalProperties"].(map[string]any); ok {
+		dict["additionalProperties"] = injectDefaultTypesIntoSchema(addl)
+	}
+	for _, key := range schemaUnionKeys {
+		if variants, ok := dict[key].([]any); ok {
+			for i, v := range variants {
+				variants[i] = injectDefaultTypes(v)
+			}
+		}
+	}
+
+	// A type that is PRESENT but not a string crashes `| upper` just like a
+	// missing one. The common real-world shape is the JSON-Schema array form
+	// for nullable fields — `"type": ["string","null"]` — which Pydantic
+	// emits for every Optional[...] tool parameter. Collapse it to a single
+	// representative string (never delete the key: a node whose only content
+	// is its type would not be refilled below and would crash anyway).
+	// Nullability is preserved losslessly: the gemma template natively
+	// renders the standard `nullable` key, so collapsing away a "null"
+	// member sets it (without clobbering an explicit value).
+	if t, present := dict["type"]; present {
+		if _, isString := t.(string); !isString {
+			members := typeStringMembers(t)
+			if slices.Contains(members, "null") &&
+				slices.ContainsFunc(members, func(m string) bool { return m != "null" }) {
+				if _, hasNullable := dict["nullable"]; !hasNullable {
+					dict["nullable"] = true
+				}
+			}
+			dict["type"] = collapsedType(members, dict)
+		}
+	}
+
+	if _, present := dict["type"]; !present && looksLikeSchemaNode(dict) {
+		dict["type"] = inferredType(dict)
+	}
+	return dict
+}
+
+// typeStringMembers extracts the string members of an array-form `type`
+// value. Any other shape (number, bool, object, null) yields no members,
+// pushing the collapse to structural inference.
+func typeStringMembers(t any) []string {
+	arr, ok := t.([]any)
+	if !ok {
+		return nil
+	}
+	members := make([]string, 0, len(arr))
+	for _, m := range arr {
+		if s, ok := m.(string); ok {
+			members = append(members, s)
+		}
+	}
+	return members
+}
+
+// collapsedType collapses a non-string `type` value (pre-extracted string
+// members of the array form) to one renderable string: the first concrete
+// (non-"null") member, the lone "null" when that is all the array declares,
+// else fall back to structural inference.
+func collapsedType(members []string, dict map[string]any) string {
+	for _, m := range members {
+		if m != "null" {
+			return m
+		}
+	}
+	if len(members) > 0 {
+		return members[0]
+	}
+	return inferredType(dict)
+}
+
+// looksLikeSchemaNode reports whether the map carries any JSON-Schema marker
+// key. Only such nodes receive a defaulted `type`.
+func looksLikeSchemaNode(dict map[string]any) bool {
+	for _, key := range []string{
+		"properties", "items", "additionalProperties",
+		"enum", "description", "anyOf", "oneOf", "allOf",
+	} {
+		if _, ok := dict[key]; ok {
+			return true
+		}
+	}
+	return false
+}
+
+// inferredType is the structural default for a schema node's `type`: object
+// when it has properties (or additionalProperties), array when it has items,
+// a union member's type when it is an anyOf/oneOf/allOf (skipping "null" —
+// mislabelling a union as a string would be wrong), otherwise string.
+func inferredType(dict map[string]any) string {
+	_, hasProps := dict["properties"]
+	_, hasAddl := dict["additionalProperties"]
+	if hasProps || hasAddl {
+		return "object"
+	}
+	if _, ok := dict["items"]; ok {
+		return "array"
+	}
+	if t, ok := unionMemberType(dict); ok {
+		return t
+	}
+	return "string"
+}
+
+// unionMemberType derives a representative `type` for a union node from the
+// first member that declares a concrete, non-"null" type. The second return
+// is false when none is found.
+func unionMemberType(dict map[string]any) (string, bool) {
+	for _, key := range schemaUnionKeys {
+		variants, ok := dict[key].([]any)
+		if !ok {
+			continue
+		}
+		for _, variant := range variants {
+			v, ok := variant.(map[string]any)
+			if !ok {
+				continue
+			}
+			if t, ok := v["type"].(string); ok && t != "null" {
+				return t, true
+			}
+		}
+	}
+	return "", false
+}


🟡 [MEDIUM] 🧩 Complex tool schema normalization for simple fix

💡 Suggestion: Consider if a simpler approach could handle the core issue - 317 lines to fix template crashes seems heavy for the problem scope

📊 Score: 3×3 = 9 · Category: over-abstraction

ethenotethan · 2026-06-12T00:13:21Z

+func (r *Registry) RecordInferenceError(providerID, modelID string, statusCode int, shape string) (enteredCooldown bool) {
+	switch statusCode {
+	case 500, 502, 504:
+		// Provider-sickness shapes: count the strike.
+	default:
+		return false
+	}
+
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	now := time.Now()
+
+	// Opportunistic sweep, mirroring dispatchLoadCooldowns: provider ids are
+	// per-session UUIDs, so dead entries never get re-keyed — bound both maps
+	// by dropping expired ones once they grow.
+	if len(r.inferenceErrorCooldowns) > 1024 {
+		for key, expiry := range r.inferenceErrorCooldowns {
+			if !now.Before(expiry) {
+				delete(r.inferenceErrorCooldowns, key)
+			}
+		}
+	}
+	if len(r.inferenceErrorStrikes) > 1024 {
+		for key, strikes := range r.inferenceErrorStrikes {
+			if len(strikes) == 0 || !strikes[len(strikes)-1].Add(inferenceErrorWindow).After(now) {
+				delete(r.inferenceErrorStrikes, key)
+			}
+		}
+	}
+
+	key := inferenceErrorKey{ProviderID: providerID, ModelID: modelID, Shape: shape}
+
+	// Slide the window: keep only strikes still inside it, then add this one.
+	strikes := r.inferenceErrorStrikes[key]
+	kept := strikes[:0]
+	for _, ts := range strikes {
+		if now.Sub(ts) < inferenceErrorWindow {
+			kept = append(kept, ts)
+		}
+	}
+	kept = append(kept, now)
+	r.inferenceErrorStrikes[key] = kept
+
+	if len(kept) < inferenceErrorThreshold {
+		return false
+	}
+
+	expiry, active := r.inferenceErrorCooldowns[key]
+	active = active && now.Before(expiry)
+	// Threshold met: (re-)arm the cool-down. Repeated failures extend an
+	// active cool-down, but only the transition reports true.
+	r.inferenceErrorCooldowns[key] = now.Add(inferenceErrorCooldownTTL)
+	return !active


🔵 [INFO] 🧩 Multiple cooldown parameters and complex sliding window

💡 Suggestion: Consider if simpler exponential backoff or fixed cooldown would suffice - the sliding window adds complexity

📊 Score: 2×3 = 6 · Category: over-configuration

ethenotethan · 2026-06-12T00:13:21Z

+	return "base"
+}
+
+// capabilityVersionFloors maps a request trait to the minimum provider binary
+// version able to serve it. Providers BELOW a floor — including providers that
+// report no version at all — are excluded from requests carrying that trait.
+//
+// "tools": 0.6.3 is the first version with provider-side tool-schema
+// normalization (#310); older binaries crash Gemma's chat template on OpenAI
+// tool schemas with nullable/missing types.
+var capabilityVersionFloors = map[string]string{
+	"tools": "0.6.3",
+}


🔵 [INFO] 🧩 The capabilityVersionFloors map has only one entry ('tools': '0.6.3') - consider a simpler constant until more floors are needed

📊 Score: 2×3 = 6 · Category: over-configuration

ethenotethan · 2026-06-12T00:13:21Z

+// Copyright © 2026 Eigen Labs.
+//
+// Scan-time template-render self-check.
+//
+// Background (DAR-130): Gemma's `chat_template.jinja` applies `| upper` to
+// tool-schema `type` fields; legitimate request shapes the template didn't
+// anticipate crashed the Jinja render at request time and surfaced to
+// consumers as 500s. Binary-side normalization (`ToolSchemaNormalization`)
+// fixed that specific shape, but the failure CLASS — a registry-published
+// chat template that throws on legitimate request shapes — will recur with
+// future model publishes.
+//
+// This check runs at model-scan time: it renders every chat template the
+// snapshot ships against canonical fixtures that mirror exactly what the
+// runtime hands the template — post-normalization tool schemas (every node
+// has a string `type`, nullability as `nullable: true`), runtime message
+// dictionaries (`MLXBatchedEngineServerEngine+Translation.templateMessage()`
+// shapes for tool calls/responses, `MessageGenerator` content-parts shapes
+// for multimodal) — using the same Jinja engine and compile options the
+// runtime tokenizer uses (swift-jinja via swift-transformers, with
+// `lstripBlocks`/`trimBlocks` enabled). The result is advertised per model
+// as `template_render_ok` so the coordinator can refuse to route
+// tool-bearing requests to a (provider, model) whose template is broken.
+
+import Foundation
+import Jinja
+
+public enum TemplateRenderCheck {
+
+    // MARK: - Template discovery
+
+    /// Collect every chat-template STRING the runtime could use from a model
+    /// snapshot directory, in the runtime's precedence order
+    /// (swift-transformers `Hub.swift`): `chat_template.jinja` (plain text),
+    /// then `chat_template.json` (`chat_template` key), then
+    /// `tokenizer_config.json` (`chat_template` key — string form, or the
+    /// list-of-`{name, template}` form, all entries collected).
+    ///
+    /// All sources present in the snapshot are collected (not just the
+    /// winning one): the runtime's selection can change across paths
+    /// (named-template lookup, `tool_use` template selection), so a broken
+    /// template anywhere in the snapshot is a latent request-time crash.
+    public static func templateSources(at snapshotDir: URL) -> [String] {
+        var sources: [String] = []
+
+        // 1. chat_template.jinja — plain-text Jinja file.
+        let jinjaURL = snapshotDir.appendingPathComponent("chat_template.jinja")
+        if let text = try? String(contentsOf: jinjaURL, encoding: .utf8),
+            !text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
+            sources.append(text)
+        }
+
+        // 2. chat_template.json — {"chat_template": <string | [{name, template}]>}.
+        sources.append(contentsOf: chatTemplateValues(
+            fromJSONFile: snapshotDir.appendingPathComponent("chat_template.json")))
+
+        // 3. tokenizer_config.json — same `chat_template` key shapes.
+        sources.append(contentsOf: chatTemplateValues(
+            fromJSONFile: snapshotDir.appendingPathComponent("tokenizer_config.json")))
+
+        return sources
+    }
+
+    /// Extract template strings from a JSON file's `chat_template` key.
+    /// Handles the string form and the list-of-`{name, template}` form
+    /// (all list entries are collected). Returns [] when the file is
+    /// missing, unparseable, or carries no usable template.
+    private static func chatTemplateValues(fromJSONFile url: URL) -> [String] {
+        guard let data = try? Data(contentsOf: url),
+            let json = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
+            let value = json["chat_template"]
+        else { return [] }
+
+        if let template = value as? String {
+            return template.isEmpty ? [] : [template]
+        }
+        if let entries = value as? [Any] {
+            return entries.compactMap { entry in
+                guard let dict = entry as? [String: Any],
+                    let template = dict["template"] as? String,
+                    !template.isEmpty
+                else { return nil }
+                return template
+            }
+        }
+        return []
+    }
+
+    // MARK: - Render check
+
+    /// Render every template the snapshot ships against every canonical
+    /// fixture.
+    ///
+    /// - Returns: `nil` when the snapshot has no chat template (unknown —
+    ///   the key is omitted on the wire, matching old providers); `false`
+    ///   when any template fails to compile or any fixture render throws
+    ///   (the routing signal); `true` when everything renders.
+    ///
+    /// Never throws and never traps: this runs inside the startup model
+    /// scan for every cached model, so any unexpected condition degrades to
+    /// a result, not a crash. Multimodal (content-parts) fixtures are only
+    /// rendered for vision models (`config.json` declares `vision_config`),
+    /// mirroring the runtime: `MessageGenerator`s only emit content-parts
+    /// message shapes for VLM models, so judging a text-only template
+    /// against parts it will never see would false-flag healthy models.
+    public static func renderOK(at snapshotDir: URL) -> Bool? {
+        let sources = templateSources(at: snapshotDir)
+        guard !sources.isEmpty else { return nil }
+
+        let includeMultimodal = configDeclaresVision(at: snapshotDir)
+        let fixtures = canonicalFixtures(includeMultimodal: includeMultimodal)
+        let specialTokens = specialTokenContext(at: snapshotDir)
+
+        for source in sources {
+            // Same compile options as the runtime tokenizer
+            // (swift-transformers `compiledTemplate(for:)`).
+            let template: Template
+            do {
+                template = try Template(source, with: .init(lstripBlocks: true, trimBlocks: true))
+            } catch {
+                // A template that doesn't compile can't render at request time.
+                return false
+            }
+            for fixture in fixtures {
+                do {
+                    let context = try renderContext(for: fixture, specialTokens: specialTokens)
+                    _ = try template.render(context)
+                } catch {
+                    return false
+                }
+            }
+        }
+        return true
+    }
+
+    /// Whether `config.json` declares a vision tower (`vision_config`).
+    /// Mirrors `ModelScanner.configDeclaresVision` / `ProviderLoop.modelIsVLM`,
+    /// re-derived here so the Linux-buildable foundation target stays
+    /// dependency-free.
+    static func configDeclaresVision(at snapshotDir: URL) -> Bool {
+        let configURL = snapshotDir.appendingPathComponent("config.json")
+        guard let data = try? Data(contentsOf: configURL),
+            let json = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any]
+        else { return false }
+        return json["vision_config"] != nil
+    }
+
+    // MARK: - Render context
+
+    /// Special-token attributes the runtime tokenizer injects into the
+    /// render context (swift-transformers `specialTokenAttributes`, minus
+    /// `additional_special_tokens` which no chat template interpolates).
+    private static let specialTokenAttributes: [String] = [
+        "bos_token", "eos_token", "unk_token", "sep_token",
+        "pad_token", "cls_token", "mask_token",
+    ]
+
+    /// Build the special-token slice of the render context. `bos_token` and
+    /// `eos_token` default to "" (so a bare `chat_template.jinja` with no
+    /// tokenizer_config still renders — real templates open with
+    /// `{{ bos_token }}`); every attribute present in tokenizer_config.json
+    /// overrides, accepting both the plain-string and the added-token
+    /// (`{"content": "..."}`) forms, mirroring the runtime's
+    /// `addedTokenAsString` handling.
+    static func specialTokenContext(at snapshotDir: URL) -> [String: Value] {
+        var tokens: [String: Value] = [
+            "bos_token": .string(""),
+            "eos_token": .string(""),
+        ]
+        let configURL = snapshotDir.appendingPathComponent("tokenizer_config.json")
+        guard let data = try? Data(contentsOf: configURL),
+            let json = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any]
+        else { return tokens }
+
+        for attribute in specialTokenAttributes {
+            guard let value = json[attribute] else { continue }
+            if let string = value as? String {
+                tokens[attribute] = .string(string)
+            } else if let dict = value as? [String: Any],
+                let content = dict["content"] as? String {
+                tokens[attribute] = .string(content)
+            }
+        }
+        return tokens
+    }
+
+    /// Assemble the full Jinja render context for one fixture: `messages`,
+    /// `add_generation_prompt` (always true — the runtime entrypoint
+    /// `applyChatTemplate(messages:tools:additionalContext:)` always
+    /// generates), `tools` only when the fixture carries them (mirrors the
+    /// runtime, which only sets the context key for tool-bearing requests),
+    /// plus the special-token attributes.
+    private static func renderContext(
+        for fixture: Fixture,
+        specialTokens: [String: Value]
+    ) throws -> [String: Value] {
+        var context: [String: Value] = specialTokens
+        context["messages"] = .array(try fixture.messages.map { try Value(any: $0) })
+        context["add_generation_prompt"] = .boolean(true)
+        if let tools = fixture.tools {
+            context["tools"] = .array(try tools.map { try Value(any: $0) })
+        }
+        return context
+    }
+
+    // MARK: - Canonical fixtures
+
+    /// A canonical request shape: messages exactly as the runtime's
+    /// template layer sees them, plus tools when the fixture exercises the
+    /// tool branches.
+    struct Fixture {
+        let name: String
+        let messages: [[String: Any]]
+        let tools: [[String: Any]]?
+
+        init(name: String, messages: [[String: Any]], tools: [[String: Any]]? = nil) {
+            self.name = name
+            self.messages = messages
+            self.tools = tools
+        }
+    }
+
+    /// The canonical fixture set. Multimodal (content-parts) fixtures are
+    /// included only for vision models — see `renderOK(at:)`.
+    static func canonicalFixtures(includeMultimodal: Bool) -> [Fixture] {
+        var fixtures: [Fixture] = [
+            plainChatFixture,
+            toolFlowFixture,
+            emptyAssistantTailFixture,
+        ]
+        if includeMultimodal {
+            fixtures.append(imagePartsFixture)
+            fixtures.append(videoPartsFixture)
+        }
+        return fixtures
+    }
+
+    /// (a) Plain system + user chat — the baseline every template must render.
+    static var plainChatFixture: Fixture {
+        Fixture(
+            name: "plain_chat",
+            messages: [
+                ["role": "system", "content": "You are a helpful assistant."],
+                ["role": "user", "content": "Write one sentence about the sea."],
+            ]
+        )
+    }
+
+    /// (b) User content as PARTS — image variant. Mirrors the
+    /// `MessageGenerator` shape (`[{"type":"image"},{"type":"text",...}]`)
+    /// that VLM models receive (e.g. `Gemma4MessageGenerator`).
+    static var imagePartsFixture: Fixture {
+        Fixture(
+            name: "image_parts",
+            messages: [
+                [
+                    "role": "user",
+                    "content": [
+                        ["type": "image"],
+                        ["type": "text", "text": "Describe this image."],
+                    ] as [[String: Any]],
+                ]
+            ]
+        )
+    }
+
+    /// (b) User content as PARTS — video variant.
+    static var videoPartsFixture: Fixture {
+        Fixture(
+            name: "video_parts",
+            messages: [
+                [
+                    "role": "user",
+                    "content": [
+                        ["type": "video"],
+                        ["type": "text", "text": "Describe this video."],
+                    ] as [[String: Any]],
+                ]
+            ]
+        )
+    }
+
+    /// (c) Full tool flow in POST-NORMALIZATION shape (the runtime
+    /// normalizes inbound schemas since 0.6.3 — `ToolSchemaNormalization` —
+    /// so every schema node HAS a string `type` and nullability is
+    /// `nullable: true`). One tool with nested object params including an
+    /// enum property, array items, and required lists; an assistant turn
+    /// with `tool_calls` (arguments as a dict — mirrors
+    /// `decodeToolCallArguments`) followed by a `role: "tool"` response
+    /// message, exercising the declaration, call, and response template
+    /// branches.
+    static var toolFlowFixture: Fixture {
+        Fixture(
+            name: "tool_flow",
+            messages: [
+                ["role": "user", "content": "What's the weather in Paris?"],
+                [
+                    "role": "assistant",
+                    "content": "",
+                    "tool_calls": [
+                        [
+                            "id": "call_0001",
+                            "type": "function",
+                            "function": [
+                                "name": "get_weather",
+                                "arguments": [
+                                    "location": "Paris, France",
+                                    "unit": "celsius",
+                                ] as [String: Any],
+                            ] as [String: Any],
+                        ] as [String: Any]
+                    ] as [[String: Any]],
+                ],
+                [
+                    "role": "tool",
+                    "content": "{\"temperature\": 21, \"condition\": \"sunny\"}",
+                    "tool_call_id": "call_0001",
+                    "name": "get_weather",
+                ],
+            ],
+            tools: [Self.canonicalTool]
+        )
+    }
+
+    /// (d) `add_generation_prompt` with an empty assistant tail — exercises
+    /// last-message indexing and empty-content trims.
+    static var emptyAssistantTailFixture: Fixture {
+        Fixture(
+            name: "empty_assistant_tail",
+            messages: [
+                ["role": "user", "content": "Continue from here."],
+                ["role": "assistant", "content": ""],
+            ]
+        )
+    }
+
+    /// The canonical post-normalization tool: nested object params with an
+    /// enum property, array items, a nullable property, and required lists
+    /// at both levels. Every schema node carries a string `type` —
+    /// `ToolSchemaNormalization.ensureParameterTypes` guarantees this for
+    /// real inbound requests, so this is exactly what templates see.
+    static var canonicalTool: [String: Any] {
+        [
+            "type": "function",
+            "function": [
+                "name": "get_weather",
+                "description": "Get the current weather and short-term forecast for a location.",
+                "parameters": [
+                    "type": "object",
+                    "properties": [
+                        "location": [
+                            "type": "string",
+                            "description": "City and country, e.g. Paris, France",
+                        ],
+                        "unit": [
+                            "type": "string",
+                            "enum": ["celsius", "fahrenheit"],
+                            "description": "Temperature unit",
+                        ],
+                        "days": [
+                            "type": "array",
+                            "items": ["type": "integer"] as [String: Any],
+                            "description": "Number of forecast days to include",
+                        ],
+                        "options": [
+                            "type": "object",
+                            "properties": [
+                                "verbose": [
+                                    "type": "boolean",
+                                    "description": "Include extended fields",
+                                ],
+                                "retries": [
+                                    "type": "integer",
+                                    "nullable": true,
+                                    "description": "Optional retry budget",
+                                ] as [String: Any],
+                            ] as [String: Any],
+                            "required": ["verbose"],
+                            "description": "Advanced lookup options",
+                        ] as [String: Any],
+                    ] as [String: Any],
+                    "required": ["location", "unit"],
+                ] as [String: Any],
+            ] as [String: Any],
+        ]
+    }
+}


🟡 [MEDIUM] 🧩 387-line template validation for startup check

💡 Suggestion: Consider if a smaller subset of fixtures could catch the same issues - this adds significant startup overhead

📊 Score: 3×3 = 9 · Category: over-complexity

blacksmith-sh · 2026-06-12T00:19:28Z

Found 1 test failure on Blacksmith runners:

Failure

Test	View Logs
`github.com/eigeninference/d-inference/e2e/TestIntegration_ConcurrentRequests`	View Logs

^{Need help on this PR? Tag /codesmith with what you need.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9cc6769a3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T00:22:10Z

+	// Opportunistic sweep, mirroring dispatchLoadCooldowns: provider ids are
+	// per-session UUIDs, so dead entries never get re-keyed — bound both maps
+	// by dropping expired ones once they grow.


Clear inference cooldowns for fresh provider sessions

This new breaker state assumes provider IDs are per-session and only expires entries opportunistically, but Register already treats the same provider ID as reusable by clearing dispatchLoadCooldowns on re-registration. If a provider trips the inference breaker, restarts or updates to a fixed binary, and reconnects with the same ID, the stale (provider, model, shape) cooldown/strikes remain and routing will keep skipping that fresh session until the 5-minute TTL or strike window expires; clear these maps on re-registration/disconnect just like the dispatch-load cooldowns.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T00:22:10Z

+var capabilityVersionFloors = map[string]string{
+	"tools": "0.6.3",
+}


Do not floor out providers after central tool normalization

For tool requests when the available fleet is still on pre-0.6.3 binaries, this floor rejects every older provider before dispatch even though NormalizeToolSchemas now rewrites the exact nullable/missing type schemas that made those binaries crash. That makes the coordinator-side normalizer ineffective for its stated lagging-provider case and turns otherwise serviceable normalized tool calls into model_unavailable; either remove/relax this floor for centrally-normalized requests or gate on a capability those older providers truly lack.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T00:22:10Z

 		case <-timer.C:
 			s.refundReservedBalance(pr, "provider_timeout:"+pr.RequestID)
+			s.ddIncr("inference.in_band_error", []string{"model:" + pr.Model, "reason:timeout"})
 			emitter.emitError("timeout", "request timed out")


Feed committed stream timeouts into the breaker

When a provider starts a Responses stream and then stalls until inferenceTimeout, this timeout path only refunds/emits an SSE error and never records the 504 with noteInferenceError. Repeated post-commit stalls from the same provider/model therefore never enter the new cooldown, so subsequent requests keep routing to the zombie pair; record a 504 here as the accepted/pre-content timeout paths do.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T00:22:10Z

 				// Channel closed — inference complete.
+				s.noteInferenceSuccess(pr)


Do not clear cooldowns on incomplete zero-cost streams

If a stream closes without CompleteCh or ErrorCh on a zero-reservation request (exclusive self-route/free path, or billing disabled), refundReservedBalance returns false and this falls through to noteInferenceSuccess, clearing the provider/model cooldown even though the provider ended incompletely. That lets broken owned/free streams erase strikes and be routed again; treat missing completion as an error independent of whether there was money to refund.

Useful? React with 👍 / 👎.

…est preprocessing; harden toolschema Addresses the automated-review complexity/perf findings on PR #322, and satisfies CLAUDE.md's behavior-preserving modular-refactor pass. Purely structural — the full failover/breaker/race suite stays green and -race clean. - Extract the ~890-line dispatch loop into coordinator/api/dispatch.go: a dispatchState struct (per-request mutable state + immutable inputs) and a dispatchOutcome enum, with each former labeled wait-loop (firstChunkWait, noBackupWait, race + sub-waits, acceptedWait) now a method; run() orchestrates. consumer.go 5889 -> 4475 lines. Every select arm, timer Stop/Reset, closed-ChunkCh+ErrorCh grace, heldChunks cap, preamble/accepted liveness split, speculative race + cancel-loser, refund-exactly-once, and breaker/DD call is carried over verbatim (old->new mapping reviewed; passing-test set unchanged). - De-duplicate the request preprocessing shared by handleChatCompletions and handleGenericInference into coordinator/api/inference_preprocess.go: parseInferencePrelude (body read -> NormalizeToolSchemas -> parse -> model required -> key allowlist) and visionToolsFailFast (vision + tools fast-fail, allowlist/self-route aware). Handler-specific bits (Responses media reject, capacity/503 ladder, body re-marshal) stay per-handler, parameterized not flattened. - toolschema: bound injectDefaultTypes recursion (maxToolSchemaDepth) against pathologically-nested schemas, and skip the JSON re-encode entirely when no normalization was needed (return input bytes unchanged). go test ./... green (16 pkgs), -race clean on api+registry, linux/amd64 OK, named contract suite 49 pass / 0 fail.

Gajesh2007 · 2026-06-12T01:15:33Z

Addressed the review (commit `757d9031`)

Thanks @ethenotethan — went through every finding. Summary of what changed and the few I'm pushing back on with evidence.

Fixed (the substantive ones)

CRITICAL "890-line nested state machine" → extracted into a focused coordinator/api/dispatch.go: a dispatchState struct + dispatchOutcome enum, with each former labeled wait-loop (firstChunkWait, noBackupWait, race + sub-waits, acceptedWait) now its own method orchestrated by run(). consumer.go shrank 5889 → 4475 lines. The extraction is purely structural — verified behavior-equivalent by two independent reviews (line-by-line old→new mapping + exact-count invariants: timers 9 NewTimer/35 Stop, refunds 11/5, cancelDispatch 36) and by the full failover/breaker/race suite + -race staying green and the passing-test set being byte-identical to before.
MEDIUM "duplicated tool-schema normalization / preprocessing across endpoints" → de-duplicated into coordinator/api/inference_preprocess.go: parseInferencePrelude (read → normalize → parse → model-required → key allowlist) and visionToolsFailFast, both called by handleChatCompletions and handleGenericInference. Handler-specific bits (Responses-media reject, n>1, alias re-marshal, stripProviderRoutingFields, the capacity/503 ladder) stay per-handler — parameterized, not flattened.
MEDIUM "recursive traversal without depth limits" (injectDefaultTypes) → added a maxToolSchemaDepth recursion ceiling (pathological/deeply-nested schemas can't blow the stack).
CRITICAL "JSON re-encode on the request hot path" → NormalizeToolSchemas now returns the input bytes unchanged (no decode/re-encode) when nothing needs normalizing.

Respectfully pushing back (with evidence)

"Potential unbounded growth of heldChunks" — there's already a hard cap: maxHeldBoilerplate = 8, enforced at all 8 append sites (len(heldChunks) < maxHeldBoilerplate). Bounded by design.
"JSON parse on every streaming chunk" (isBoilerplateChunk) — it has a cheap pre-gate (only parses chunks containing "role" / a response. event prefix) and runs pre-commit only, not per content chunk. Negligible.
"toolschema blocking_io on hot path → async/cache" — it's a CPU transform on KB-sized tool defs, gated by a "tools" byte-check and a 4 MiB cap; it's microseconds, not blocking I/O, and can't be async (it must complete before dispatch). The skip-re-encode above removes the no-op cost; depth-limit removes the pathological case.
"O(n) cooldown cleanup" — only runs when the map exceeds 1024 entries (fleet ≈ 100) and mirrors the long-standing dispatchLoadCooldowns pattern.

go test ./... green (16 pkgs), -race clean on api+registry, linux/amd64 build OK, 49 named contract tests pass. Re-requesting your review.

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ✅ No issues found

Performance — 5 finding(s) (3 blocking)

🟡 [MEDIUM] coordinator/api/consumer.go:1575 — excludeProviders map allocated without size hint in hot dispatch path
- Suggestion: Pre-allocate excludeProviders with estimated capacity based on maxDispatchAttempts: make(map[string]struct{}, maxDispatchAttempts)
🔵 [INFO] coordinator/api/dispatch.go:158-160 — backupExclude map grows without size hint during speculative dispatch
- Suggestion: Pre-allocate backupExclude with known size: make(map[string]struct{}, len(excludeProviders)+1)
🟡 [MEDIUM] coordinator/registry/error_cooldown.go:115-118 — Sliding window filtering creates new slice on every error record
- Suggestion: Reuse the existing slice by reslicing in-place: strikes = strikes[:kept_count] instead of creating new slice
🟡 [MEDIUM] coordinator/api/toolschema.go:139-141 — JSON re-encoding allocates new buffer without size hint
- Suggestion: Pre-allocate buffer with input size hint: buf.Grow(len(body)) to reduce allocations during normalization
🔵 [INFO] coordinator/registry/scheduler.go:407-413 — Version-diverse retry creates new slice without capacity hint
- Suggestion: Pre-allocate diverse slice with pool capacity: make([]*routingCandidate, 0, len(pool))

Type_diligence — 3 finding(s)

🔵 [INFO] coordinator/api/consumer.go:507 — Potential nil pointer dereference on pr.Timing without nil check
- Suggestion: The nil check was added but should be consistent - either always check or ensure pr.Timing is never nil during construction
🔵 [INFO] coordinator/api/toolschema.go:282 — Function parameter uses bare any type
- Suggestion: Consider using a more specific type like map[string]any or create a proper struct type for the node parameter
🔵 [INFO] coordinator/api/toolschema.go:295 — Function parameter uses bare any type
- Suggestion: Consider using a more specific type like map[string]any or create a proper struct type for the node parameter

Additive_complexity — 7 finding(s) (4 blocking)

🔴 [CRITICAL] coordinator/api/dispatch.go:1-1553 — 1553-line dispatch.go file extracts inline dispatch loop into complex state machine
- Suggestion: Consider keeping the dispatch logic inline or breaking into smaller, focused modules rather than one massive state machine file
🟡 [MEDIUM] coordinator/api/toolschema.go:1-361 — 361-line custom JSON schema normalization reimplements JSON processing
- Suggestion: Evaluate if existing JSON schema libraries could handle this normalization instead of custom implementation
🔵 [INFO] coordinator/api/toolschema_test.go:1-1118 — 1118-line test file with repetitive test helper patterns
- Suggestion: Extract common test setup patterns into shared helpers to reduce duplication
🟡 [MEDIUM] coordinator/api/failover_integration_test.go:1-786 — Large integration test file duplicates provider setup patterns
- Suggestion: Extract fake provider harness into reusable test utilities shared across test files
🔵 [INFO] coordinator/registry/request_traits.go:27-48 — RequestTraits struct with multiple capability flags could grow complex
- Suggestion: Consider a more extensible trait system if more request attributes will be added
🟡 [MEDIUM] coordinator/api/consumer.go:1575-1600 — Consumer handler delegates entire dispatch logic to external state machine
- Suggestion: Keep core dispatch logic in the consumer handler and extract only reusable components
🔵 [INFO] coordinator/registry/error_cooldown.go:47-57 — Complex struct key for circuit breaker when string key might suffice
- Suggestion: Consider if a well-formatted string key could replace the struct key while avoiding collisions

15 finding(s) total, 7 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

ethenotethan · 2026-06-12T01:16:22Z

@@ -1481,1047 +1575,49 @@ func (s *Server) handleChatCompletions(w http.ResponseWriter, r *http.Request) {
 	//


🟡 [MEDIUM] ⚡ excludeProviders map allocated without size hint in hot dispatch path

💡 Suggestion: Pre-allocate excludeProviders with estimated capacity based on maxDispatchAttempts: make(map[string]struct{}, maxDispatchAttempts)

📊 Score: 2×4 = 8 · Category: unbounded allocations

ethenotethan · 2026-06-12T01:16:22Z

+			d.lastErr = dispatchErr
+			d.lastErrCode = dispatchErrCode
+			return outcomeFailFast


🔵 [INFO] ⚡ backupExclude map grows without size hint during speculative dispatch

💡 Suggestion: Pre-allocate backupExclude with known size: make(map[string]struct{}, len(excludeProviders)+1)

📊 Score: 2×3 = 6 · Category: unbounded allocations

ethenotethan · 2026-06-12T01:16:22Z

+			kept = append(kept, ts)
+		}
+	}
+	kept = append(kept, now)


🟡 [MEDIUM] ⚡ Sliding window filtering creates new slice on every error record

💡 Suggestion: Reuse the existing slice by reslicing in-place: strikes = strikes[:kept_count] instead of creating new slice

📊 Score: 3×3 = 9 · Category: repeated work

ethenotethan · 2026-06-12T01:16:22Z

+}
+
+// normalizeToolEntry rewrites one element of the "tools" array. A tool's


🟡 [MEDIUM] ⚡ JSON re-encoding allocates new buffer without size hint

💡 Suggestion: Pre-allocate buffer with input size hint: buf.Grow(len(body)) to reduce allocations during normalization

📊 Score: 2×4 = 8 · Category: unbounded allocations

ethenotethan · 2026-06-12T01:16:22Z

+	// when every candidate runs the avoided version, keep the full pool rather
+	// than failing the request.
+	if pr.Traits.AvoidVersion != "" {
+		diverse := make([]*routingCandidate, 0, len(pool))
+		for _, c := range pool {
+			if providerVersion(c.provider) != pr.Traits.AvoidVersion {
+				diverse = append(diverse, c)


🔵 [INFO] ⚡ Version-diverse retry creates new slice without capacity hint

💡 Suggestion: Pre-allocate diverse slice with pool capacity: make([]*routingCandidate, 0, len(pool))

📊 Score: 2×3 = 6 · Category: unbounded allocations

ethenotethan · 2026-06-12T01:16:22Z

+package api
+
+import (
+	"bytes"
+	"encoding/json"
+	"reflect"
+	"strings"
+	"testing"
+)
+
+// Tests for NormalizeToolSchemas (DAR-130), ported case-for-case from the
+// Swift provider's ToolSchemaNormalizationTests
+// (provider-swift/Tests/ProviderCoreTests/ToolSchemaNormalizationTests.swift),
+// plus Go-specific coverage: number round-tripping (UseNumber), sibling-field
+// survival, idempotency, and the byte-gate/JSON-gate split.
+
+// tsnDecode decodes a body with UseNumber (matching the implementation) and
+// asserts it is a JSON object.
+func tsnDecode(t *testing.T, body []byte) map[string]any {
+	t.Helper()
+	dec := json.NewDecoder(bytes.NewReader(body))
+	dec.UseNumber()
+	var v any
+	if err := dec.Decode(&v); err != nil {
+		t.Fatalf("decoding body: %v\nbody: %s", err, body)
+	}
+	m, ok := v.(map[string]any)
+	if !ok {
+		t.Fatalf("body decoded to %T, want JSON object", v)
+	}
+	return m
+}
+
+// tsnMap asserts v is a JSON object.
+func tsnMap(t *testing.T, v any, what string) map[string]any {
+	t.Helper()
+	m, ok := v.(map[string]any)
+	if !ok {
+		t.Fatalf("%s is %T (%v), want JSON object", what, v, v)
+	}
+	return m
+}
+
+// tsnTools returns the decoded tools array from a body, asserting its length.
+func tsnTools(t *testing.T, body []byte, wantLen int) []any {
+	t.Helper()
+	root := tsnDecode(t, body)
+	tools, ok := root["tools"].([]any)
+	if !ok || len(tools) != wantLen {
+		t.Fatalf("tools = %v (%T), want array of %d", root["tools"], root["tools"], wantLen)
+	}
+	return tools
+}
+
+// tsnFirstToolFn returns tools[0].function from a body.
+func tsnFirstToolFn(t *testing.T, body []byte) map[string]any {
+	t.Helper()
+	root := tsnDecode(t, body)
+	tools, ok := root["tools"].([]any)
+	if !ok || len(tools) == 0 {
+		t.Fatalf("tools is %T (%v), want non-empty array", root["tools"], root["tools"])
+	}
+	return tsnMap(t, tsnMap(t, tools[0], "tools[0]")["function"], "tools[0].function")
+}
+
+// tsnParams returns tools[0].function.parameters.
+func tsnParams(t *testing.T, body []byte) map[string]any {
+	t.Helper()
+	return tsnMap(t, tsnFirstToolFn(t, body)["parameters"], "parameters")
+}
+
+// tsnProps returns tools[0].function.parameters.properties.
+func tsnProps(t *testing.T, body []byte) map[string]any {
+	t.Helper()
+	return tsnMap(t, tsnParams(t, body)["properties"], "parameters.properties")
+}
+
+// tsnType asserts the node's `type` is a string and returns it.
+func tsnType(t *testing.T, node map[string]any, what string) string {
+	t.Helper()
+	s, ok := node["type"].(string)
+	if !ok {
+		t.Fatalf("%s type is %T (%v), want string", what, node["type"], node["type"])
+	}
+	return s
+}
+
+// tsnPadBody builds a valid, normalizable tool body of exactly total bytes
+// (a typeless enum-only property that WOULD gain a type if parsed).
+func tsnPadBody(t *testing.T, total int) []byte {
+	t.Helper()
+	const prefix = `{"pad":"`
+	const suffix = `","tools":[{"type":"function","function":{"name":"f","parameters":{"properties":{"u":{"enum":["c","f"]}}}}}]}`
+	pad := total - len(prefix) - len(suffix)
+	if pad < 0 {
+		t.Fatalf("total %d smaller than the fixed body parts", total)
+	}
+	body := prefix + strings.Repeat("a", pad) + suffix
+	if len(body) != total {
+		t.Fatalf("built %d bytes, want %d", len(body), total)
+	}
+	return []byte(body)
+}
+
+// Swift: injectsTypeIntoTypelessParameterPropertyAndObject
+func TestNormalizeToolSchemas_InjectsTypeIntoTypelessParameterPropertyAndObject(t *testing.T) {
+	// A legitimate OpenAI schema: the `unit` property has enum+description but
+	// no explicit `type`, and the parameters object itself omits `type`.
+	body := []byte(`{"model":"gemma-4-26b","messages":[{"role":"user","content":"hi"}],
+	 "tools":[{"type":"function","function":{"name":"get_weather",
+	   "parameters":{"properties":{"unit":{"enum":["c","f"],"description":"unit"}}}}}]}`)
+
+	out := NormalizeToolSchemas(body)
+	params := tsnParams(t, out)
+	if got := tsnType(t, params, "parameters"); got != "object" {
+		t.Errorf("parameters type = %q, want object", got)
+	}
+	unit := tsnMap(t, tsnMap(t, params["properties"], "properties")["unit"], "unit")
+	// Defaulted to "string" so `{{ value['type'] | upper }}` no longer throws.
+	if got := tsnType(t, unit, "unit"); got != "string" {
+		t.Errorf("unit type = %q, want string", got)
+	}
+	// The original enum/description are preserved.
+	if enum, ok := unit["enum"].([]any); !ok || len(enum) != 2 {
+		t.Errorf("unit enum = %v, want 2 members", unit["enum"])
+	}
+	if unit["description"] != "unit" {
+		t.Errorf("unit description = %v, want %q", unit["description"], "unit")
+	}
+}
+
+// Swift: preservesExistingTypesAndNestedArrays
+func TestNormalizeToolSchemas_PreservesExistingTypesAndNestedArrays(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "tags":{"type":"array","items":{"description":"a tag"}},
+	    "q":{"type":"string"}}}}}]}`)
+
+	props := tsnProps(t, NormalizeToolSchemas(body))
+	// Existing types untouched.
+	if got := tsnType(t, tsnMap(t, props["q"], "q"), "q"); got != "string" {
+		t.Errorf("q type = %q, want string", got)
+	}
+	tags := tsnMap(t, props["tags"], "tags")
+	if got := tsnType(t, tags, "tags"); got != "array" {
+		t.Errorf("tags type = %q, want array", got)
+	}
+	// Nested array `items` schema with no type gets defaulted.
+	items := tsnMap(t, tags["items"], "tags.items")
+	if got := tsnType(t, items, "tags.items"); got != "string" {
+		t.Errorf("items type = %q, want string", got)
+	}
+	if items["description"] != "a tag" {
+		t.Errorf("items description = %v, want %q", items["description"], "a tag")
+	}
+}
+
+// Swift: nonToolBodyReturnedUnchanged
+func TestNormalizeToolSchemas_NonToolBodyReturnedUnchanged(t *testing.T) {
+	noTools := []byte(`{"model":"m","messages":[]}`)
+	if out := NormalizeToolSchemas(noTools); !bytes.Equal(out, noTools) {
+		t.Errorf("no-tools body changed: %s", out)
+	}
+}
+
+// Swift: recursesIntoAdditionalProperties
+func TestNormalizeToolSchemas_RecursesIntoAdditionalProperties(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "meta":{"additionalProperties":{"description":"a value"}}}}}}]}`)
+
+	meta := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["meta"], "meta")
+	// The map-shaped param node is typed "object"...
+	if got := tsnType(t, meta, "meta"); got != "object" {
+		t.Errorf("meta type = %q, want object", got)
+	}
+	// ...and its inner additionalProperties schema gets a default type too.
+	addl := tsnMap(t, meta["additionalProperties"], "meta.additionalProperties")
+	if got := tsnType(t, addl, "additionalProperties"); got != "string" {
+		t.Errorf("additionalProperties type = %q, want string", got)
+	}
+	if addl["description"] != "a value" {
+		t.Errorf("additionalProperties description = %v, want %q", addl["description"], "a value")
+	}
+}
+
+// Swift: derivesUnionTypeInsteadOfBlanketString
+func TestNormalizeToolSchemas_DerivesUnionTypeInsteadOfBlanketString(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "n":{"anyOf":[{"type":"number"},{"type":"null"}]}}}}}]}`)
+
+	n := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["n"], "n")
+	// A nullable-number union borrows "number", not a mislabelling "string".
+	if got := tsnType(t, n, "n"); got != "number" {
+		t.Errorf("n type = %q, want number", got)
+	}
+	if variants, ok := n["anyOf"].([]any); !ok || len(variants) != 2 {
+		t.Errorf("anyOf = %v, want 2 members preserved", n["anyOf"])
+	}
+}
+
+// Swift: skipsNormalizationForOversizedBodies (+ at-cap boundary processed)
+func TestNormalizeToolSchemas_SkipsNormalizationForOversizedBodies(t *testing.T) {
+	// A body above the cap is returned unchanged BEFORE any parse, even though
+	// it contains "tools" and a schema that WOULD be repaired — bounding the
+	// JSON round-trip cost (DoS amplification).
+	over := tsnPadBody(t, maxToolNormalizationBytes+1)
+	if out := NormalizeToolSchemas(over); !bytes.Equal(out, over) {
+		t.Error("oversized body was modified")
+	}
+	// At exactly the cap the body is still normalized (the gate is strictly >).
+	at := tsnPadBody(t, maxToolNormalizationBytes)
+	out := NormalizeToolSchemas(at)
+	if bytes.Equal(out, at) {
+		t.Fatal("at-cap body was not normalized")
+	}
+	u := tsnMap(t, tsnProps(t, out)["u"], "u")
+	if got := tsnType(t, u, "u"); got != "string" {
+		t.Errorf("u type = %q, want string", got)
+	}
+}
+
+// Array-typed (nullable) `type` values — the second DAR-130 class.
+// `"type": ["string","null"]` is what Pydantic emits for Optional[...] tool
+// parameters; the gemma template's `| upper` crashed on the list ("upper
+// filter requires string", reproduced on prod 2026-06-10).
+
+// Swift: collapsesNullableArrayTypeToConcreteMember
+func TestNormalizeToolSchemas_CollapsesNullableArrayTypeToConcreteMember(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"get_weather",
+	  "parameters":{"type":"object","properties":{
+	    "city":{"type":["string","null"],"description":"city"}},
+	    "required":["city"]}}}]}`)
+
+	out := NormalizeToolSchemas(body)
+	city := tsnMap(t, tsnProps(t, out)["city"], "city")
+	if got := tsnType(t, city, "city"); got != "string" {
+		t.Errorf("city type = %q, want string", got)
+	}
+	// Nullability preserved losslessly via the template-supported key.
+	if city["nullable"] != true {
+		t.Errorf("city nullable = %v, want true", city["nullable"])
+	}
+	if req, ok := tsnParams(t, out)["required"].([]any); !ok || len(req) != 1 || req[0] != "city" {
+		t.Errorf("required = %v, want [city]", tsnParams(t, out)["required"])
+	}
+}
+
+// Swift: collapsesArrayTypeSkippingLeadingNull
+func TestNormalizeToolSchemas_CollapsesArrayTypeSkippingLeadingNull(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "n":{"type":["null","integer"]}}}}}]}`)
+
+	n := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["n"], "n")
+	if got := tsnType(t, n, "n"); got != "integer" {
+		t.Errorf("n type = %q, want integer", got)
+	}
+	if n["nullable"] != true {
+		t.Errorf("n nullable = %v, want true", n["nullable"])
+	}
+}
+
+// Swift: collapsesNullOnlyArrayTypeToNullString
+func TestNormalizeToolSchemas_CollapsesNullOnlyArrayTypeToNullString(t *testing.T) {
+	// ["null"] has no concrete member — keep the honest "null", which still
+	// renders (it is a string for `| upper`).
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "x":{"type":["null"]}}}}}]}`)
+
+	x := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["x"], "x")
+	if got := tsnType(t, x, "x"); got != "null" {
+		t.Errorf("x type = %q, want null", got)
+	}
+	// No concrete member was collapsed away, so nullable is NOT synthesized.
+	if _, ok := x["nullable"]; ok {
+		t.Errorf("x nullable = %v, want absent", x["nullable"])
+	}
+}
+
+// Swift: collapsesArrayTypeInNestedObjectAndItems
+func TestNormalizeToolSchemas_CollapsesArrayTypeInNestedObjectAndItems(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"set_alarm",
+	  "parameters":{"type":"object","properties":{
+	    "opts":{"type":"object","properties":{"snooze":{"type":["integer","null"]}}},
+	    "tags":{"type":"array","items":{"type":["string","null"]}}}}}}]}`)
+
+	props := tsnProps(t, NormalizeToolSchemas(body))
+	snooze := tsnMap(t, tsnMap(t, tsnMap(t, props["opts"], "opts")["properties"], "opts.properties")["snooze"], "snooze")
+	if got := tsnType(t, snooze, "snooze"); got != "integer" {
+		t.Errorf("snooze type = %q, want integer", got)
+	}
+	if snooze["nullable"] != true {
+		t.Errorf("snooze nullable = %v, want true", snooze["nullable"])
+	}
+	items := tsnMap(t, tsnMap(t, props["tags"], "tags")["items"], "tags.items")
+	if got := tsnType(t, items, "tags.items"); got != "string" {
+		t.Errorf("items type = %q, want string", got)
+	}
+	if items["nullable"] != true {
+		t.Errorf("items nullable = %v, want true", items["nullable"])
+	}
+}
+
+// Swift: malformedNonStringTypeFallsBackToStructuralInference
+func TestNormalizeToolSchemas_MalformedNonStringTypeFallsBackToStructuralInference(t *testing.T) {
+	// A numeric `type` is invalid JSON Schema; repair it from structure
+	// (properties present → object) instead of leaving the list/number for
+	// the template to choke on.
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "cfg":{"type":42,"properties":{"k":{"type":"string"}}},
+	    "v":{"type":7,"description":"v"}}}}}]}`)
+
+	props := tsnProps(t, NormalizeToolSchemas(body))
+	cfg := tsnMap(t, props["cfg"], "cfg")
+	if got := tsnType(t, cfg, "cfg"); got != "object" {
+		t.Errorf("cfg type = %q, want object", got)
+	}
+	v := tsnMap(t, props["v"], "v")
+	if got := tsnType(t, v, "v"); got != "string" {
+		t.Errorf("v type = %q, want string", got)
+	}
+	// No "null" member was collapsed, so nullable is not synthesized.
+	for name, node := range map[string]map[string]any{"cfg": cfg, "v": v} {
+		if _, ok := node["nullable"]; ok {
+			t.Errorf("%s nullable = %v, want absent", name, node["nullable"])
+		}
+	}
+}
+
+// Swift: unionMemberWithArrayTypeStillDrivesParentInference
+func TestNormalizeToolSchemas_UnionMemberWithArrayTypeStillDrivesParentInference(t *testing.T) {
+	// Ordering is load-bearing: members collapse BEFORE the parent's union
+	// inference, so a first member declaring ["string","null"] must yield a
+	// "string" parent type (not fall through to the default).
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "u":{"anyOf":[{"type":["string","null"]},{"type":"integer"}],"description":"u"}}}}}]}`)
+
+	u := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["u"], "u")
+	if got := tsnType(t, u, "u"); got != "string" {
+		t.Errorf("u type = %q, want string", got)
+	}
+	variants, ok := u["anyOf"].([]any)
+	if !ok || len(variants) != 2 {
+		t.Fatalf("anyOf = %v, want 2 members", u["anyOf"])
+	}
+	first := tsnMap(t, variants[0], "anyOf[0]")
+	if got := tsnType(t, first, "anyOf[0]"); got != "string" {
+		t.Errorf("anyOf[0] type = %q, want string (collapsed before parent inference)", got)
+	}
+	if first["nullable"] != true {
+		t.Errorf("anyOf[0] nullable = %v, want true", first["nullable"])
+	}
+}
+
+// Swift: collapsesArrayTypeOnTopLevelParametersNode
+func TestNormalizeToolSchemas_CollapsesArrayTypeOnTopLevelParametersNode(t *testing.T) {
+	// The template also renders params['type'] | upper at the top level.
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":["object","null"],"properties":{"q":{"type":"string"}}}}}]}`)
+
+	params := tsnParams(t, NormalizeToolSchemas(body))
+	if got := tsnType(t, params, "parameters"); got != "object" {
+		t.Errorf("parameters type = %q, want object", got)
+	}
+	if params["nullable"] != true {
+		t.Errorf("parameters nullable = %v, want true", params["nullable"])
+	}
+}
+
+// Swift: collapsesArrayTypeInsideAdditionalPropertiesSchema
+func TestNormalizeToolSchemas_CollapsesArrayTypeInsideAdditionalPropertiesSchema(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "kv":{"type":"object","additionalProperties":{"type":["number","null"]}}}}}}]}`)
+
+	kv := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["kv"], "kv")
+	addl := tsnMap(t, kv["additionalProperties"], "kv.additionalProperties")
+	if got := tsnType(t, addl, "additionalProperties"); got != "number" {
+		t.Errorf("additionalProperties type = %q, want number", got)
+	}
+	if addl["nullable"] != true {
+		t.Errorf("additionalProperties nullable = %v, want true", addl["nullable"])
+	}
+}
+
+// Go-specific: an explicit `nullable` survives the array-type collapse.
+func TestNormalizeToolSchemas_ExplicitNullableLeftUntouched(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "kept":{"type":["string","null"],"nullable":false},
+	    "set":{"type":["string","null"]}}}}}]}`)
+
+	props := tsnProps(t, NormalizeToolSchemas(body))
+	kept := tsnMap(t, props["kept"], "kept")
+	if got := tsnType(t, kept, "kept"); got != "string" {
+		t.Errorf("kept type = %q, want string", got)
+	}
+	// The explicit value is NOT clobbered, even though it disagrees.
+	if kept["nullable"] != false {
+		t.Errorf("kept nullable = %v, want explicit false preserved", kept["nullable"])
+	}
+	set := tsnMap(t, props["set"], "set")
+	if set["nullable"] != true {
+		t.Errorf("set nullable = %v, want synthesized true", set["nullable"])
+	}
+}
+
+// Go-specific: a property whose only marker is `enum` gains "string".
+func TestNormalizeToolSchemas_EnumOnlyPropertyGainsStringType(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{"e":{"enum":["a","b"]}}}}}]}`)
+
+	e := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["e"], "e")
+	if got := tsnType(t, e, "e"); got != "string" {
+		t.Errorf("e type = %q, want string", got)
+	}
+	if enum, ok := e["enum"].([]any); !ok || len(enum) != 2 {
+		t.Errorf("e enum = %v, want 2 members preserved", e["enum"])
+	}
+}
+
+// Go-specific: bare boolean additionalProperties is left untouched (only a
+// map-shaped value is recursed), but its presence still drives "object"
+// inference on a typeless parent.
+func TestNormalizeToolSchemas_BareAdditionalPropertiesBoolUntouched(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "open":{"type":"object","additionalProperties":true},
+	    "closed":{"additionalProperties":false}}}}}]}`)
+
+	props := tsnProps(t, NormalizeToolSchemas(body))
+	open := tsnMap(t, props["open"], "open")
+	if open["additionalProperties"] != true {
+		t.Errorf("open additionalProperties = %v (%T), want bare true", open["additionalProperties"], open["additionalProperties"])
+	}
+	closed := tsnMap(t, props["closed"], "closed")
+	if got := tsnType(t, closed, "closed"); got != "object" {
+		t.Errorf("closed type = %q, want object (inferred from additionalProperties)", got)
+	}
+	if closed["additionalProperties"] != false {
+		t.Errorf("closed additionalProperties = %v (%T), want bare false", closed["additionalProperties"], closed["additionalProperties"])
+	}
+}
+
+// Go-specific: recursion reaches items.items.properties three levels down.
+func TestNormalizeToolSchemas_DeeplyNestedItemsProperties(t *testing.T) {
+	body := []byte(`{"tools":[{"type":"function","function":{"name":"f",
+	  "parameters":{"type":"object","properties":{
+	    "grid":{"type":"array","items":{"items":{"properties":{"name":{"description":"n"}}}}}}}}}]}`)
+
+	grid := tsnMap(t, tsnProps(t, NormalizeToolSchemas(body))["grid"], "grid")
+	l1 := tsnMap(t, grid["items"], "grid.items")
+	if got := tsnType(t, l1, "grid.items"); got != "array" {
+		t.Errorf("grid.items type = %q, want array (inferred from items)", got)
+	}
+	l2 := tsnMap(t, l1["items"], "grid.items.items")
+	if got := tsnType(t, l2, "grid.items.items"); got != "object" {
+		t.Errorf("grid.items.items type = %q, want object (inferred from properties)", got)
+	}
+	name := tsnMap(t, tsnMap(t, l2["properties"], "innermost properties")["name"], "name")
+	if got := tsnType(t, name, "name"); got != "string" {
+		t.Errorf("name type = %q, want string", got)
+	}
+}
+
+// Go-specific: bodies that pass the cheap `"tools"` byte gate but carry no
+// top-level tools array must no-op byte-identically.
+func TestNormalizeToolSchemas_ToolsBytesWithoutToolsArrayNoOp(t *testing.T) {
+	bodies := map[string][]byte{
+		// The literal string value "tools" inside a message — the quoted bytes
+		// match the gate even though there is no tools key at the top level.
+		"tools as prompt string": []byte(`{"model":"m","messages":[{"role":"user","content":"tools"}]}`),
+		// A nested "tools" key is NOT the top-level tools array; the embedded
+		// typeless schema must stay untouched.
+		"nested tools key": []byte(`{"metadata":{"tools":[{"function":{"parameters":{"properties":{"u":{"enum":["c"]}}}}}]},"model":"m"}`),
+		// Top-level "tools" present but not an array.
+		"tools is a string":  []byte(`{"tools":"none"}`),
+		"tools is an object": []byte(`{"tools":{"a":1}}`),
+		"tools is null":      []byte(`{"tools":null}`),
+	}
+	for name, body := range bodies {
+		if !bytes.Contains(body, []byte(`"tools"`)) {
+			t.Fatalf("%s: test body must pass the byte gate to exercise the JSON gate", name)
+		}
+		if out := NormalizeToolSchemas(body); !bytes.Equal(out, body) {
+			t.Errorf("%s: body changed:\n in: %s\nout: %s", name, body, out)
+		}
+	}
+}
+
+// Go-specific: non-object roots and malformed JSON are returned unchanged.
+func TestNormalizeToolSchemas_NonObjectOrMalformedBodyUnchanged(t *testing.T) {
+	bodies := map[string][]byte{
+		"array root":       []byte(`["tools"]`),
+		"string root":      []byte(`"tools"`),
+		"truncated JSON":   []byte(`{"tools":[`),
+		"trailing garbage": []byte(`{"tools":[]}{"x":1}`),
+		"trailing text":    []byte(`{"tools":[]}garbage`),
+		"empty body":       {},
+	}
+	for name, body := range bodies {
+		if out := NormalizeToolSchemas(body); !bytes.Equal(out, body) {
+			t.Errorf("%s: body changed:\n in: %s\nout: %s", name, body, out)
+		}
+	}
+}
+
+// Go-specific: UseNumber round-trip — int64-range-exceeding integers and
+// high-precision decimals survive exactly, and every top-level field other
+// than tools survives value-equivalent.
+func TestNormalizeToolSchemas_NumbersAndSiblingsSurviveExactly(t *testing.T) {
+	// 9007199254740993 = 2^53+1: unrepresentable in float64 (would mangle to
+	// ...992 without UseNumber).
+	body := []byte(`{"model":"gemma-4-26b","max_tokens":9007199254740993,` +
+		`"temperature":0.30000000000000004,"top_p":1e-7,"stream":true,"stop":null,` +
+		`"metadata":{"trace":"a<>&b","big":123456789012345678901234567890.5},` +
+		`"messages":[{"role":"user","content":"hi"}],` +
+		`"tools":[{"type":"function","function":{"name":"f","parameters":` +
+		`{"type":"object","properties":{"limit":{"description":"l","default":9007199254740993}}}}}]}`)
+
+	out := NormalizeToolSchemas(body)
+	if bytes.Equal(out, body) {
+		t.Fatal("body was not normalized (limit should gain a type)")
+	}
+	// The exact integer literal survives in both positions.
+	if got := bytes.Count(out, []byte("9007199254740993")); got != 2 {
+		t.Errorf("exact literal 9007199254740993 appears %d times, want 2\nout: %s", got, out)
+	}
+
+	root := tsnDecode(t, out)
+	for field, want := range map[string]json.Number{
+		"max_tokens":  "9007199254740993",
+		"temperature": "0.30000000000000004",
+		"top_p":       "1e-7",
+	} {
+		if got := root[field]; got != want {
+			t.Errorf("%s = %v (%T), want json.Number %s", field, got, got, want)
+		}
+	}
+	if got := tsnMap(t, root["metadata"], "metadata")["big"]; got != json.Number("123456789012345678901234567890.5") {
+		t.Errorf("metadata.big = %v (%T), want exact decimal", got, got)
+	}
+	limit := tsnMap(t, tsnProps(t, out)["limit"], "limit")
+	if got := tsnType(t, limit, "limit"); got != "string" {
+		t.Errorf("limit type = %q, want string", got)
+	}
+	if limit["default"] != json.Number("9007199254740993") {
+		t.Errorf("limit default = %v (%T), want exact json.Number", limit["default"], limit["default"])
+	}
+
+	// Every sibling of "tools" survives value-equivalent.
+	in, outRoot := tsnDecode(t, body), tsnDecode(t, out)
+	delete(in, "tools")
+	delete(outRoot, "tools")
+	if !reflect.DeepEqual(in, outRoot) {
+		t.Errorf("non-tools fields diverged:\n in: %#v\nout: %#v", in, outRoot)
+	}
+}
+
+// Go-specific: normalizing twice equals normalizing once, byte for byte.
+func TestNormalizeToolSchemas_Idempotent(t *testing.T) {
+	body := []byte(`{"model":"m","max_tokens":9007199254740993,` +
+		`"tools":[{"type":"function","function":{"name":"f","parameters":` +
+		`{"type":["object","null"],"properties":{` +
+		`"u":{"type":["string","null"],"description":"u"},` +
+		`"e":{"enum":["a"]},` +
+		`"n":{"anyOf":[{"type":["integer","null"]},{"type":"null"}]}}}}}]}`)
+
+	once := NormalizeToolSchemas(body)
+	if bytes.Equal(once, body) {
+		t.Fatal("first pass did not normalize")
+	}
+	twice := NormalizeToolSchemas(once)
+	if !bytes.Equal(once, twice) {
+		t.Errorf("not idempotent:\n once: %s\ntwice: %s", once, twice)
+	}
+}
+
+// Go-specific: the exact shape from the 2026-06-10 prod incident.
+func TestNormalizeToolSchemas_RealWorldIncidentShape(t *testing.T) {
+	body := []byte(`{"model":"gemma-4-26b","messages":[{"role":"user","content":"weather in SF"}],` +
+		`"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather",` +
+		`"parameters":{"type":"object","properties":{` +
+		`"location":{"type":"string","description":"city"},` +
+		`"unit":{"type":["string","null"],"description":"optional unit"}},` +
+		`"required":["location"]}}}]}`)
+
+	props := tsnProps(t, NormalizeToolSchemas(body))
+	unit := tsnMap(t, props["unit"], "unit")
+	if got := tsnType(t, unit, "unit"); got != "string" {
+		t.Errorf("unit type = %q, want string", got)
+	}
+	if unit["nullable"] != true {
+		t.Errorf("unit nullable = %v, want true", unit["nullable"])
+	}
+	if unit["description"] != "optional unit" {
+		t.Errorf("unit description = %v, want %q", unit["description"], "optional unit")
+	}
+	location := tsnMap(t, props["location"], "location")
+	if got := tsnType(t, location, "location"); got != "string" {
+		t.Errorf("location type = %q, want string", got)
+	}
+	if _, ok := location["nullable"]; ok {
+		t.Errorf("location nullable = %v, want absent", location["nullable"])
+	}
+}
+
+// Go-specific: tool entries that don't match the expected shape pass through
+// value-equivalent; well-formed siblings are still normalized.
+func TestNormalizeToolSchemas_MalformedToolEntriesPassedThrough(t *testing.T) {
+	body := []byte(`{"model":"m","tools":[` +
+		`42,` +
+		`"x",` +
+		`{"type":"function"},` +
+		`{"function":"notdict"},` +
+		`{"function":{"name":"f"}},` +
+		`{"function":{"name":"g","parameters":null}},` +
+		`{"function":{"name":"h","parameters":{"properties":{"q":{"description":"q"}}}}}]}`)
+
+	out := NormalizeToolSchemas(body)
+	tools, ok := tsnDecode(t, out)["tools"].([]any)
+	if !ok || len(tools) != 7 {
+		t.Fatalf("tools = %v, want 7 entries", tools)
+	}
+	if tools[0] != json.Number("42") || tools[1] != "x" {
+		t.Errorf("scalar entries changed: %v, %v", tools[0], tools[1])
+	}
+	if !reflect.DeepEqual(tools[2], map[string]any{"type": "function"}) {
+		t.Errorf("function-less entry changed: %#v", tools[2])
+	}
+	if got := tsnMap(t, tools[3], "tools[3]")["function"]; got != "notdict" {
+		t.Errorf("non-object function changed: %v", got)
+	}
+	fn4 := tsnMap(t, tsnMap(t, tools[4], "tools[4]")["function"], "tools[4].function")
+	if _, ok := fn4["parameters"]; ok {
+		t.Errorf("parameters key invented on tools[4]: %v", fn4["parameters"])
+	}
+	fn5 := tsnMap(t, tsnMap(t, tools[5], "tools[5]")["function"], "tools[5].function")
+	if v, ok := fn5["parameters"]; !ok || v != nil {
+		t.Errorf("null parameters = %v (present=%v), want preserved null", v, ok)
+	}
+	params6 := tsnMap(t, tsnMap(t, tsnMap(t, tools[6], "tools[6]")["function"], "tools[6].function")["parameters"], "tools[6].parameters")
+	if got := tsnType(t, params6, "tools[6].parameters"); got != "object" {
+		t.Errorf("valid sibling parameters type = %q, want object", got)
+	}
+	q := tsnMap(t, tsnMap(t, params6["properties"], "tools[6].properties")["q"], "q")
+	if got := tsnType(t, q, "q"); got != "string" {
+		t.Errorf("valid sibling q type = %q, want string", got)
+	}
+}
+
+// Responses-API flat and Anthropic Messages shapes (the DAR-130 breadth
+// extension): the coordinator repairs three wire shapes per tool entry; the
+// Swift provider-side normalizer covers only the chat shape as of 0.6.4, so
+// these paths have no provider-side fallback.
+
+// Responses-API flat tool: parameters at the tool's top level, no "function"
+// wrapper. Responses→chat conversion runs AFTER normalization and copies
+// parameters verbatim, so the flat repair fixes that path end-to-end. A
+// chat-shape sibling in the same array is normalized by its own rule.
+func TestNormalizeToolSchemas_ResponsesFlatToolNormalized(t *testing.T) {
+	body := []byte(`{"model":"gemma-4-26b","input":"weather in SF","tools":[` +
+		`{"type":"function","name":"get_weather","description":"Get weather",` +
+		`"parameters":{"type":"object","properties":{` +
+		`"city":{"type":["string","null"],"description":"city"},` +
+		`"unit":{"enum":["c","f"]}},"required":["city"]}},` +
+		`{"type":"function","function":{"name":"chat_sibling",` +
+		`"parameters":{"properties":{"q":{"type":["string","null"]}}}}}]}`)
+
+	tools := tsnTools(t, NormalizeToolSchemas(body), 2)
+
+	flat := tsnMap(t, tools[0], "tools[0]")
+	// The flat entry's own identity fields survive, and no wrapper is invented.
+	if flat["type"] != "function" || flat["name"] != "get_weather" || flat["description"] != "Get weather" {
+		t.Errorf("flat tool identity changed: type=%v name=%v description=%v",
+			flat["type"], flat["name"], flat["description"])
+	}
+	if _, ok := flat["function"]; ok {
+		t.Errorf("function wrapper invented on flat tool: %v", flat["function"])
+	}
+	params := tsnMap(t, flat["parameters"], "flat parameters")
+	props := tsnMap(t, params["properties"], "flat properties")
+	// Nullable array type collapses with nullability preserved.
+	city := tsnMap(t, props["city"], "city")
+	if got := tsnType(t, city, "city"); got != "string" {
+		t.Errorf("city type = %q, want string", got)
+	}
+	if city["nullable"] != true {
+		t.Errorf("city nullable = %v, want true", city["nullable"])
+	}
+	// Typeless enum-only property gains a type.
+	unit := tsnMap(t, props["unit"], "unit")
+	if got := tsnType(t, unit, "unit"); got != "string" {
+		t.Errorf("unit type = %q, want string", got)
+	}
+	if req, ok := params["required"].([]any); !ok || len(req) != 1 || req[0] != "city" {
+		t.Errorf("required = %v, want [city]", params["required"])
+	}
+
+	// The chat-shape sibling is still normalized through the function wrapper.
+	sibFn := tsnMap(t, tsnMap(t, tools[1], "tools[1]")["function"], "tools[1].function")
+	sibQ := tsnMap(t, tsnMap(t, tsnMap(t, sibFn["parameters"], "sibling parameters")["properties"], "sibling properties")["q"], "q")
+	if got := tsnType(t, sibQ, "sibling q"); got != "string" {
+		t.Errorf("sibling q type = %q, want string", got)
+	}
+	if sibQ["nullable"] != true {
+		t.Errorf("sibling q nullable = %v, want true", sibQ["nullable"])
+	}
+}
+
+// Pins the flat-detection rule for a degenerate hybrid: when "function" is
+// present but NOT an object, it is opaque garbage rather than a chat wrapper
+// — the entry counts as flat, its top-level parameters are normalized, and
+// the garbage value itself passes through verbatim.
+func TestNormalizeToolSchemas_FlatToolWithNonObjectFunctionStillNormalized(t *testing.T) {
+	body := []byte(`{"tools":[{"name":"f","function":"notdict",` +
+		`"parameters":{"properties":{"u":{"enum":["c","f"]}}}}]}`)
+
+	tool := tsnMap(t, tsnTools(t, NormalizeToolSchemas(body), 1)[0], "tools[0]")
+	if tool["function"] != "notdict" {
+		t.Errorf("garbage function value changed: %v", tool["function"])
+	}
+	params := tsnMap(t, tool["parameters"], "parameters")
+	if got := tsnType(t, params, "parameters"); got != "object" {
+		t.Errorf("parameters type = %q, want object", got)
+	}
+	u := tsnMap(t, tsnMap(t, params["properties"], "properties")["u"], "u")
+	if got := tsnType(t, u, "u"); got != "string" {
+		t.Errorf("u type = %q, want string", got)
+	}
+}
+
+// Pins the converse rule: an object "function" wrapper claims the entry as
+// chat-shape, so a stray top-level "parameters" beside it is not a
+// recognized schema home and stays untouched — an entry's two possible
+// OpenAI parameter homes are never both repaired.
+func TestNormalizeToolSchemas_ObjectFunctionWrapperClaimsEntry(t *testing.T) {
+	body := []byte(`{"tools":[{` +
+		`"function":{"name":"f","parameters":{"properties":{"a":{"enum":["x"]}}}},` +
+		`"parameters":{"properties":{"b":{"enum":["y"]}}}}]}`)
+
+	tool := tsnMap(t, tsnTools(t, NormalizeToolSchemas(body), 1)[0], "tools[0]")
+	// The wrapped parameters are normalized...
+	fnParams := tsnMap(t, tsnMap(t, tool["function"], "function")["parameters"], "function.parameters")
+	if got := tsnType(t, fnParams, "function.parameters"); got != "object" {
+		t.Errorf("function.parameters type = %q, want object", got)
+	}
+	a := tsnMap(t, tsnMap(t, fnParams["properties"], "function properties")["a"], "a")
+	if got := tsnType(t, a, "a"); got != "string" {
+		t.Errorf("function.parameters a type = %q, want string", got)
+	}
+	// ...the stray top-level ones are passed through untouched.
+	topParams := tsnMap(t, tool["parameters"], "top-level parameters")
+	if v, ok := topParams["type"]; ok {
+		t.Errorf("stray top-level parameters gained a type: %v", v)
+	}
+	b := tsnMap(t, tsnMap(t, topParams["properties"], "top-level properties")["b"], "b")
+	if v, ok := b["type"]; ok {
+		t.Errorf("stray top-level property gained a type: %v", v)
+	}
+}
+
+// Anthropic Messages tool: the JSON-Schema lives under "input_schema"
+// (served via /v1/messages). Same template crash class; nullable array types
+// collapse and recursion reaches nested properties and items.
+func TestNormalizeToolSchemas_AnthropicInputSchemaNormalized(t *testing.T) {
+	body := []byte(`{"model":"m","messages":[{"role":"user","content":"hi"}],` +
+		`"tools":[{"name":"get_weather","description":"Get weather",` +
+		`"input_schema":{"type":["object","null"],"properties":{` +
+		`"city":{"type":["string","null"],"description":"city"},` +
+		`"days":{"type":"array","items":{"type":["integer","null"]}},` +
+		`"unit":{"enum":["c","f"]}}}}]}`)
+
+	tool := tsnMap(t, tsnTools(t, NormalizeToolSchemas(body), 1)[0], "tools[0]")
+	// The entry's own fields are not schema nodes — identity survives and no
+	// type is invented on the tool itself despite its "description" key.
+	if tool["name"] != "get_weather" || tool["description"] != "Get weather" {
+		t.Errorf("tool identity changed: name=%v description=%v", tool["name"], tool["description"])
+	}
+	if v, ok := tool["type"]; ok {
+		t.Errorf("type invented on the tool entry itself: %v", v)
+	}
+	schema := tsnMap(t, tool["input_schema"], "input_schema")
+	if got := tsnType(t, schema, "input_schema"); got != "object" {
+		t.Errorf("input_schema type = %q, want object", got)
+	}
+	if schema["nullable"] != true {
+		t.Errorf("input_schema nullable = %v, want true", schema["nullable"])
+	}
+	props := tsnMap(t, schema["properties"], "input_schema.properties")
+	city := tsnMap(t, props["city"], "city")
+	if got := tsnType(t, city, "city"); got != "string" {
+		t.Errorf("city type = %q, want string", got)
+	}
+	if city["nullable"] != true {
+		t.Errorf("city nullable = %v, want true", city["nullable"])
+	}
+	items := tsnMap(t, tsnMap(t, props["days"], "days")["items"], "days.items")
+	if got := tsnType(t, items, "days.items"); got != "integer" {
+		t.Errorf("days.items type = %q, want integer", got)
+	}
+	if items["nullable"] != true {
+		t.Errorf("days.items nullable = %v, want true", items["nullable"])
+	}
+	unit := tsnMap(t, props["unit"], "unit")
+	if got := tsnType(t, unit, "unit"); got != "string" {
+		t.Errorf("unit type = %q, want string", got)
+	}
+}
+
+// Anthropic entries without input_schema (e.g. server-tool stubs) pass
+// through value-equivalent; a schema-carrying sibling is still normalized.
+func TestNormalizeToolSchemas_AnthropicToolWithoutInputSchemaUnchanged(t *testing.T) {
+	body := []byte(`{"tools":[` +
+		`{"type":"web_search_20250305","name":"web_search","max_uses":3},` +
+		`{"name":"f","input_schema":{"properties":{"q":{"description":"q"}}}}]}`)
+
+	tools := tsnTools(t, NormalizeToolSchemas(body), 2)
+	stub := tsnMap(t, tools[0], "tools[0]")
+	want := map[string]any{"type": "web_search_20250305", "name": "web_search", "max_uses": json.Number("3")}
+	if !reflect.DeepEqual(stub, want) {
+		t.Errorf("schema-less tool changed: %#v, want %#v", stub, want)
+	}
+	schema := tsnMap(t, tsnMap(t, tools[1], "tools[1]")["input_schema"], "tools[1].input_schema")
+	if got := tsnType(t, schema, "input_schema"); got != "object" {
+		t.Errorf("sibling input_schema type = %q, want object", got)
+	}
+	q := tsnMap(t, tsnMap(t, schema["properties"], "tools[1].properties")["q"], "q")
+	if got := tsnType(t, q, "q"); got != "string" {
+		t.Errorf("sibling q type = %q, want string", got)
+	}
+}
+
+// All three shapes plus a malformed scalar in ONE tools array — each entry
+// is detected and repaired by its own rule.
+func TestNormalizeToolSchemas_MixedShapesInOneToolsArray(t *testing.T) {
+	body := []byte(`{"model":"m","tools":[` +
+		`{"type":"function","function":{"name":"chat","parameters":{"properties":{"a":{"type":["string","null"]}}}}},` +
+		`{"type":"function","name":"flat","parameters":{"properties":{"b":{"enum":["x"]}}}},` +
+		`{"name":"anthropic","input_schema":{"type":["object","null"],"properties":{"c":{"description":"c"}}}},` +
+		`42]}`)
+
+	tools := tsnTools(t, NormalizeToolSchemas(body), 4)
+
+	chatParams := tsnMap(t, tsnMap(t, tsnMap(t, tools[0], "tools[0]")["function"], "tools[0].function")["parameters"], "chat parameters")
+	a := tsnMap(t, tsnMap(t, chatParams["properties"], "chat properties")["a"], "a")
+	if got := tsnType(t, a, "a"); got != "string" || a["nullable"] != true {
+		t.Errorf("chat a = type %q nullable %v, want string/true", got, a["nullable"])
+	}
+
+	flatParams := tsnMap(t, tsnMap(t, tools[1], "tools[1]")["parameters"], "flat parameters")
+	if got := tsnType(t, flatParams, "flat parameters"); got != "object" {
+		t.Errorf("flat parameters type = %q, want object", got)
+	}
+	b := tsnMap(t, tsnMap(t, flatParams["properties"], "flat properties")["b"], "b")
+	if got := tsnType(t, b, "b"); got != "string" {
+		t.Errorf("flat b type = %q, want string", got)
+	}
+
+	schema := tsnMap(t, tsnMap(t, tools[2], "tools[2]")["input_schema"], "input_schema")
+	if got := tsnType(t, schema, "input_schema"); got != "object" || schema["nullable"] != true {
+		t.Errorf("input_schema = type %q nullable %v, want object/true", got, schema["nullable"])
+	}
+	c := tsnMap(t, tsnMap(t, schema["properties"], "anthropic properties")["c"], "c")
+	if got := tsnType(t, c, "c"); got != "string" {
+		t.Errorf("anthropic c type = %q, want string", got)
+	}
+
+	if tools[3] != json.Number("42") {
+		t.Errorf("scalar entry changed: %v (%T)", tools[3], tools[3])
+	}
+}
+
+// Go-specific: null schema values are preserved verbatim in all three homes.
+func TestNormalizeToolSchemas_NullSchemasPreservedAcrossShapes(t *testing.T) {
+	body := []byte(`{"tools":[` +
+		`{"function":{"name":"chat","parameters":null}},` +
+		`{"name":"flat","parameters":null},` +
+		`{"name":"anthropic","input_schema":null}]}`)
+
+	tools := tsnTools(t, NormalizeToolSchemas(body), 3)
+	fn := tsnMap(t, tsnMap(t, tools[0], "tools[0]")["function"], "tools[0].function")
+	if v, ok := fn["parameters"]; !ok || v != nil {
+		t.Errorf("chat null parameters = %v (present=%v), want preserved null", v, ok)
+	}
+	flat := tsnMap(t, tools[1], "tools[1]")
+	if v, ok := flat["parameters"]; !ok || v != nil {
+		t.Errorf("flat null parameters = %v (present=%v), want preserved null", v, ok)
+	}
+	anthropic := tsnMap(t, tools[2], "tools[2]")
+	if v, ok := anthropic["input_schema"]; !ok || v != nil {
+		t.Errorf("anthropic null input_schema = %v (present=%v), want preserved null", v, ok)
+	}
+}
+
+// Go-specific: idempotency and exact number round-trip (UseNumber) across
+// all three shapes at once.
+func TestNormalizeToolSchemas_AllShapesIdempotentAndNumbersSurvive(t *testing.T) {
+	body := []byte(`{"model":"m","max_tokens":9007199254740993,"tools":[` +
+		`{"type":"function","function":{"name":"chat","parameters":` +
+		`{"properties":{"a":{"type":["integer","null"],"default":9007199254740993}}}}},` +
+		`{"type":"function","name":"flat","parameters":` +
+		`{"properties":{"b":{"enum":["x"],"default":0.30000000000000004}}}},` +
+		`{"name":"anthropic","input_schema":` +
+		`{"properties":{"c":{"type":["number","null"],"default":123456789012345678901234567890.5}}}}]}`)
+
+	once := NormalizeToolSchemas(body)
+	if bytes.Equal(once, body) {
+		t.Fatal("first pass did not normalize")
+	}
+	twice := NormalizeToolSchemas(once)
+	if !bytes.Equal(once, twice) {
+		t.Errorf("not idempotent:\n once: %s\ntwice: %s", once, twice)
+	}
+	// 2^53+1 appears twice (max_tokens + the chat default) and would mangle
+	// to ...992 without UseNumber; the others pin float-precision survival.
+	if got := bytes.Count(once, []byte("9007199254740993")); got != 2 {
+		t.Errorf("exact literal 9007199254740993 appears %d times, want 2\nout: %s", got, once)
+	}
+	for _, literal := range []string{"0.30000000000000004", "123456789012345678901234567890.5"} {
+		if !bytes.Contains(once, []byte(literal)) {
+			t.Errorf("exact literal %s lost\nout: %s", literal, once)
+		}
+	}
+	// And the repairs themselves landed in each shape.
+	tools := tsnTools(t, once, 3)
+	a := tsnMap(t, tsnMap(t, tsnMap(t, tsnMap(t, tsnMap(t, tools[0], "tools[0]")["function"], "function")["parameters"], "chat parameters")["properties"], "chat properties")["a"], "a")
+	if got := tsnType(t, a, "a"); got != "integer" || a["nullable"] != true {
+		t.Errorf("chat a = type %q nullable %v, want integer/true", got, a["nullable"])
+	}
+	b := tsnMap(t, tsnMap(t, tsnMap(t, tsnMap(t, tools[1], "tools[1]")["parameters"], "flat parameters")["properties"], "flat properties")["b"], "b")
+	if got := tsnType(t, b, "b"); got != "string" {
+		t.Errorf("flat b type = %q, want string", got)
+	}
+	c := tsnMap(t, tsnMap(t, tsnMap(t, tsnMap(t, tools[2], "tools[2]")["input_schema"], "input_schema")["properties"], "anthropic properties")["c"], "c")
+	if got := tsnType(t, c, "c"); got != "number" || c["nullable"] != true {
+		t.Errorf("anthropic c = type %q nullable %v, want number/true", got, c["nullable"])
+	}
+}
+
+// Hardening tests (DAR-130 follow-ups): a depth bound on the recursion so a
+// pathological deeply-nested schema can't blow the stack, and a "changed"
+// signal so a body needing no repair is returned byte-identically (skipping
+// the re-encode).
+
+// tsnDeepPropertiesBody builds a tool body whose parameters schema is a chain
+// of `levels` nested objects, each {"properties":{"child": <next> }}, ending
+// in an enum-only leaf that WOULD gain a "string" type if it were reached. The
+// chain is built inside-out as raw JSON so the nesting is real (not a Go data
+// structure the test would have to walk by hand). The outermost object is the
+// parameters node itself (processed at depth 0); its first child sits at depth
+// 1, and so on, so the leaf lands at depth `levels`.
+func tsnDeepPropertiesBody(levels int) []byte {
+	// Leaf: an enum-only schema — a recognized schema node with no type, the
+	// canonical case that injectDefaultTypes repairs to "string".
+	node := `{"enum":["x"]}`
+	for i := 0; i < levels; i++ {
+		node = `{"properties":{"child":` + node + `}}`
+	}
+	return []byte(`{"tools":[{"type":"function","function":{"name":"f","parameters":` + node + `}}]}`)
+}
+
+// (a) A schema nested far deeper than maxToolSchemaDepth must NOT panic or
+// overflow the stack; the shallow part is normalized and the part beyond the
+// depth budget is left exactly as-is (which is safe — see maxToolSchemaDepth).
+func TestNormalizeToolSchemas_DepthLimitStopsRecursionWithoutPanic(t *testing.T) {
+	const levels = maxToolSchemaDepth + 200 // comfortably past the ceiling
+	body := tsnDeepPropertiesBody(levels)
+
+	var out []byte
+	// A naive unbounded recursion on a sufficiently deep input would overflow
+	// the stack and crash the test process; reaching the assertions at all is
+	// the core guarantee. (defer/recover would not catch a fatal stack overflow,
+	// so the real protection is the depth bound under test, not this guard.)
+	func() {
+		defer func() {
+			if r := recover(); r != nil {
+				t.Fatalf("normalization panicked on a deeply-nested schema: %v", r)
+			}
+		}()
+		out = NormalizeToolSchemas(body)
+	}()
+
+	// The shallow part WAS repaired, so the body changed and re-encoded.
+	if bytes.Equal(out, body) {
+		t.Fatal("deeply-nested body was returned unchanged; the shallow part should have normalized")
+	}
+
+	// Walk down the properties chain and confirm: nodes above the limit gained
+	// a structural "object" type, and the node at the depth limit was left
+	// untouched (no type invented beyond the budget).
+	node := tsnParams(t, out) // depth 0
+	depth := 0
+	for {
+		props, ok := node["properties"].(map[string]any)
+		if !ok {
+			break
+		}
+		// A node that carries properties and was reached within the budget must
+		// have been typed "object".
+		if depth < maxToolSchemaDepth {
+			if got, ok := node["type"].(string); !ok || got != "object" {
+				t.Fatalf("node at depth %d type = %v, want object (within budget)", depth, node["type"])
+			}
+		} else {
+			// At or beyond the limit, recursion stopped: no type was injected.
+			if _, ok := node["type"]; ok {
+				t.Fatalf("node at depth %d gained a type %v; recursion should have stopped at the limit",
+					depth, node["type"])
+			}
+		}
+		child, ok := props["child"].(map[string]any)
+		if !ok {
+			t.Fatalf("missing properties.child at depth %d", depth)
+		}
+		node = child
+		depth++
+	}
+
+	// The leaf is the enum-only node; it sits at depth `levels`, far past the
+	// limit, so it must NOT have gained a "string" type — proof the deep part
+	// was left as-is rather than (impossibly) traversed.
+	if _, ok := node["type"]; ok {
+		t.Errorf("enum leaf at depth %d gained a type %v; it is past the depth budget and must be untouched",
+			depth, node["type"])
+	}
+	if enum, ok := node["enum"].([]any); !ok || len(enum) != 1 {
+		t.Errorf("enum leaf content changed: %v", node["enum"])
+	}
+}
+
+// A node sitting exactly at the LAST in-budget depth is still normalized; the
+// first node past it is not. Pins the boundary so an off-by-one in the depth
+// accounting is caught.
+func TestNormalizeToolSchemas_DepthLimitBoundaryIsNormalized(t *testing.T) {
+	// Leaf at depth maxToolSchemaDepth-1 (the deepest in-budget node) must be
+	// repaired; building exactly that many wrapper levels puts the enum leaf
+	// one step inside the budget.
+	body := tsnDeepPropertiesBody(maxToolSchemaDepth - 1)
+	out := NormalizeToolSchemas(body)
+	if bytes.Equal(out, body) {
+		t.Fatal("boundary body was not normalized")
+	}
+
+	node := tsnParams(t, out)
+	for i := 0; i < maxToolSchemaDepth-1; i++ {
+		props, ok := node["properties"].(map[string]any)
+		if !ok {
+			t.Fatalf("missing properties at depth %d", i)
+		}
+		node = tsnMap(t, props["child"], "child")
+	}
+	// node is now the enum leaf at depth maxToolSchemaDepth-1 (in budget).
+	if got, ok := node["type"].(string); !ok || got != "string" {
+		t.Errorf("in-budget leaf type = %v, want string", node["type"])
+	}
+}
+
+// (b) A tools body whose every schema node already carries a string `type`
+// needs NO repair, so NormalizeToolSchemas must return the caller's ORIGINAL
+// bytes verbatim — skipping the JSON re-encode entirely. The input is
+// deliberately written with key order and whitespace the Go encoder would
+// rewrite (keys not alphabetized, a space after a colon), so byte-equality
+// proves the re-marshal path was not taken.
+func TestNormalizeToolSchemas_NoRepairReturnsInputBytesIdentical(t *testing.T) {
+	// "tools" precedes "model" (Go's encoder sorts keys, so it would move
+	// "model" first), and there is a space after the first colon (the encoder
+	// emits none). Every schema node has a string type already.
+	body := []byte(`{"tools": [{"type":"function","function":{"name":"f","parameters":` +
+		`{"type":"object","properties":{` +
+		`"city":{"type":"string","description":"city"},` +
+		`"count":{"type":"integer"},` +
+		`"opts":{"type":"object","properties":{"verbose":{"type":"boolean"}}},` +
+		`"tags":{"type":"array","items":{"type":"string"}}},` +
+		`"required":["city"]}}}],"model":"gemma-4-26b"}`)
+
+	out := NormalizeToolSchemas(body)
+	if !bytes.Equal(out, body) {
+		t.Fatalf("fully-typed body was re-encoded; want byte-identical input.\n in: %s\nout: %s", body, out)
+	}
+	// Sanity: had it re-encoded, the encoder would have sorted "model" ahead of
+	// "tools" and dropped the space — so a byte match really does mean no
+	// re-encode. Confirm the distinguishing bytes survived.
+	if !bytes.HasPrefix(out, []byte(`{"tools": [`)) {
+		t.Errorf("output lost its original key order / spacing: %s", out)
+	}
+}
+
+// (c) A body that DOES need normalization is still corrected (regression for
+// the changed-tracking path — a single missing type must flip `changed` and
+// trigger the re-encode that injects it).
+func TestNormalizeToolSchemas_RepairNeededStillCorrected(t *testing.T) {
+	// The `unit` property is enum-only (no type) — exactly one repair needed.
+	body := []byte(`{"model":"m","tools":[{"type":"function","function":{"name":"f",` +
+		`"parameters":{"type":"object","properties":{` +
+		`"city":{"type":"string"},` +
+		`"unit":{"enum":["c","f"]}}}}}]}`)
+
+	out := NormalizeToolSchemas(body)
+	if bytes.Equal(out, body) {
+		t.Fatal("body needing a repair was returned unchanged")
+	}
+	props := tsnProps(t, out)
+	unit := tsnMap(t, props["unit"], "unit")
+	if got := tsnType(t, unit, "unit"); got != "string" {
+		t.Errorf("unit type = %q, want string (the injected default)", got)
+	}
+	// The already-typed sibling is untouched.
+	city := tsnMap(t, props["city"], "city")
+	if got := tsnType(t, city, "city"); got != "string" {
+		t.Errorf("city type = %q, want string (preserved)", got)
+	}
+}


🔵 [INFO] 🧩 1118-line test file with repetitive test helper patterns

💡 Suggestion: Extract common test setup patterns into shared helpers to reduce duplication

📊 Score: 2×2 = 4 · Category: duplicate logic

ethenotethan · 2026-06-12T01:16:22Z

+package api
+
+// Routing-failover integration tests (WS-E).
+//
+// These tests exercise the reliability contracts being implemented by the
+// routing-failover workstreams against a real coordinator (httptest server,
+// in-memory store, real registry) and fake WebSocket providers that speak the
+// full encrypted protocol:
+//
+//   - [WS-C] Deferred commit: a boilerplate role-only delta chunk no longer
+//     commits the dispatch. A provider error or disconnect BEFORE any
+//     content-bearing chunk results in a transparent retry on another
+//     provider — invisible to the consumer (200, clean stream, one [DONE]).
+//     In-band SSE errors are surfaced ONLY after content has flowed.
+//   - [WS-R] Inference-error cooldown, tools version floors, and the
+//     template_render_ok routing gate (see failover_routing_integration_test.go).
+//   - [WS-T] Tool-schema normalization reaching the provider (see
+//     failover_routing_integration_test.go).
+//
+// The fake-provider harness here follows the established patterns from
+// cancellation_integration_test.go / multi_provider_test.go /
+// load_integration_test.go: register over /ws/provider with an X25519 key from
+// testPublicKeyB64() (keypair cached in testProviderKeys), answer attestation
+// challenges with makeValidChallengeResponse, receive the E2E-encrypted
+// inference_request, and reply with encrypted inference_response_chunk
+// messages (plaintext chunks are rejected by decryptTextResponseChunk) plus
+// plaintext inference_error / inference_complete terminals.
+//
+// Registration is sent as raw JSON (a patched protocol.RegisterMessage map) so
+// tests can set fields that are still landing in sibling workstreams (e.g. the
+// per-model template_render_ok flag) without this file depending on their
+// struct changes to compile.
+//
+// INTEGRATION-NOTE(WS-C): TestPreContentFailover_* encode the deferred-commit
+// contract and FAIL against the pre-workstream coordinator (which commits on
+// the role chunk and surfaces an in-band error). They must pass once WS-C
+// lands. TestPostContentErrorStillSurfaced and TestBoilerplateThenCleanClose
+// pass against the current coordinator and act as regression guards on the
+// correctness boundary WS-C must not move.
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"log/slog"
+	"net/http"
+	"net/http/httptest"
+	"os"
+	"strings"
+	"sync"
+	"sync/atomic"
+	"testing"
+	"time"
+
+	"github.com/eigeninference/d-inference/coordinator/attestation"
+	"github.com/eigeninference/d-inference/coordinator/internal/e2e"
+	"github.com/eigeninference/d-inference/coordinator/protocol"
+	"github.com/eigeninference/d-inference/coordinator/registry"
+	"github.com/eigeninference/d-inference/coordinator/store"
+	"nhooyr.io/websocket"
+)
+
+// ---------------------------------------------------------------------------
+// Server setup
+// ---------------------------------------------------------------------------
+
+// setupFailoverServer creates a coordinator test server for failover tests,
+// mirroring setupTestServer / setupLoadTestServer.
+func setupFailoverServer(t *testing.T) (*registry.Registry, *store.MemoryStore, *httptest.Server) {
+	t.Helper()
+	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelError}))
+	st := store.NewMemory(store.Config{AdminKey: "test-key"})
+	reg := registry.New(logger)
+	srv := NewServer(reg, st, ServerConfig{}, logger)
+	srv.challengeInterval = 500 * time.Millisecond
+	ts := httptest.NewServer(srv.Handler())
+	t.Cleanup(ts.Close)
+	return reg, st, ts
+}
+
+// ---------------------------------------------------------------------------
+// Fake provider actor
+// ---------------------------------------------------------------------------
+
+// failoverModelSpec describes one advertised model for a fake provider.
+// TemplateRenderOK is emitted as the raw wire field "template_render_ok" —
+// the exact bytes a 0.6.5+ Swift provider sends — which keeps these tests
+// independent of the Go struct shape (protocol.ModelInfo.TemplateRenderOK
+// *bool has landed and decodes this field).
+type failoverModelSpec struct {
+	ID               string
+	TemplateRenderOK *bool
+}
+
+// inferenceScript is a fake provider's behavior for one inference dispatch.
+// body is the decrypted request body (nil if decryption failed).
+type inferenceScript func(ctx context.Context, fp *failoverProvider, req protocol.InferenceRequestMessage, body []byte)
+
+// failoverProvider is a scripted fake provider speaking the full WS protocol.
+type failoverProvider struct {
+	t          *testing.T
+	name       string
+	conn       *websocket.Conn
+	pubKey     string
+	privKey    [32]byte
+	registryID string
+	script     inferenceScript
+	dispatches atomic.Int32
+	bodies     chan []byte
+	done       chan struct{}
+	closeOnce  sync.Once
+}
+
+type failoverProviderConfig struct {
+	Name      string
+	Version   string
+	DecodeTPS float64
+	Models    []failoverModelSpec
+	Script    inferenceScript
+	// Serial, when non-empty, is stamped onto the provider's AttestationResult
+	// after registration so a request carrying a provider_serials allowlist can
+	// constrain routing to (or away from) this provider — mirroring the
+	// allowlist-aware capability fast-fails and QuickCapacityCheck.
+	Serial string
+}
+
+// startFailoverProvider dials the provider WebSocket, registers (raw-JSON
+// register message patched from a protocol.RegisterMessage), marks the new
+// provider hardware-trusted + challenge-verified, and starts the read loop.
+func startFailoverProvider(t *testing.T, ctx context.Context, ts *httptest.Server, reg *registry.Registry, cfg failoverProviderConfig) *failoverProvider {
+	t.Helper()
+
+	pubKey := testPublicKeyB64()
+	v, ok := testProviderKeys.Load(pubKey)
+	if !ok {
+		t.Fatalf("provider %s: missing cached keypair for %q", cfg.Name, pubKey)
+	}
+	keypair := v.(testProviderKeyPair)
+
+	before := make(map[string]struct{})
+	for _, id := range reg.ProviderIDs() {
+		before[id] = struct{}{}
+	}
+
+	wsURL := "ws" + strings.TrimPrefix(ts.URL, "http") + "/ws/provider"
+	conn, _, err := websocket.Dial(ctx, wsURL, nil)
+	if err != nil {
+		t.Fatalf("provider %s: websocket dial: %v", cfg.Name, err)
+	}
+
+	// Build the register message from the canonical struct, then patch the
+	// models entries in as raw maps so per-model fields still landing in
+	// sibling workstreams (template_render_ok) can be set by tests.
+	regStruct := protocol.RegisterMessage{
+		Type: protocol.TypeRegister,
+		Hardware: protocol.Hardware{
+			MachineModel: "Mac15,8",
+			ChipName:     "Apple M3 Max",
+			MemoryGB:     64,
+		},
+		Backend:                 "mlx-swift",
+		Version:                 cfg.Version,
+		PublicKey:               pubKey,
+		EncryptedResponseChunks: true,
+		DecodeTPS:               cfg.DecodeTPS,
+		PrivacyCapabilities:     testPrivacyCaps(),
+	}
+	rawReg, err := json.Marshal(regStruct)
+	if err != nil {
+		t.Fatalf("provider %s: marshal register struct: %v", cfg.Name, err)
+	}
+	var regMap map[string]any
+	if err := json.Unmarshal(rawReg, &regMap); err != nil {
+		t.Fatalf("provider %s: unmarshal register struct: %v", cfg.Name, err)
+	}
+	modelEntries := make([]map[string]any, 0, len(cfg.Models))
+	for _, m := range cfg.Models {
+		entry := map[string]any{
+			"id":           m.ID,
+			"size_bytes":   int64(1000),
+			"model_type":   "chat",
+			"quantization": "4bit",
+		}
+		if m.TemplateRenderOK != nil {
+			entry["template_render_ok"] = *m.TemplateRenderOK
+		}
+		modelEntries = append(modelEntries, entry)
+	}
+	regMap["models"] = modelEntries
+	regData, err := json.Marshal(regMap)
+	if err != nil {
+		t.Fatalf("provider %s: marshal register message: %v", cfg.Name, err)
+	}
+	if err := conn.Write(ctx, websocket.MessageText, regData); err != nil {
+		t.Fatalf("provider %s: write register: %v", cfg.Name, err)
+	}
+
+	// Let registration process, then identify and trust the new provider.
+	time.Sleep(200 * time.Millisecond)
+	registryID := ""
+	for _, id := range reg.ProviderIDs() {
+		if _, existed := before[id]; !existed {
+			registryID = id
+			break
+		}
+	}
+	if registryID == "" {
+		t.Fatalf("provider %s: did not appear in registry after register", cfg.Name)
+	}
+	reg.SetTrustLevel(registryID, registry.TrustHardware)
+	reg.RecordChallengeSuccess(registryID)
+
+	// Stamp an attested serial so provider_serials allowlist routing can target
+	// or exclude this provider (providerMatchesAllowedSerial reads
+	// AttestationResult.SerialNumber / MDAResult.DeviceSerial).
+	if cfg.Serial != "" {
+		if p := reg.GetProvider(registryID); p != nil {
+			p.SetAttestationResult(&attestation.VerificationResult{
+				Valid:        true,
+				SerialNumber: cfg.Serial,
+			})
+		}
+	}
+
+	fp := &failoverProvider{
+		t:          t,
+		name:       cfg.Name,
+		conn:       conn,
+		pubKey:     pubKey,
+		privKey:    keypair.private,
+		registryID: registryID,
+		script:     cfg.Script,
+		bodies:     make(chan []byte, 8),
+		done:       make(chan struct{}),
+	}
+	go fp.run(ctx)
+	t.Cleanup(fp.close)
+	return fp
+}
+
+// run reads coordinator messages: answers attestation challenges, counts and
+// dispatches inference requests to the script. Returns on connection close.
+func (fp *failoverProvider) run(ctx context.Context) {
+	defer close(fp.done)
+	for {
+		_, data, err := fp.conn.Read(ctx)
+		if err != nil {
+			return
+		}
+		var env struct {
+			Type string `json:"type"`
+		}
+		if err := json.Unmarshal(data, &env); err != nil {
+			continue
+		}
+		switch env.Type {
+		case protocol.TypeAttestationChallenge:
+			resp := makeValidChallengeResponse(data, fp.pubKey)
+			if err := fp.conn.Write(ctx, websocket.MessageText, resp); err != nil {
+				return
+			}
+		case protocol.TypeInferenceRequest:
+			var req protocol.InferenceRequestMessage
+			if err := json.Unmarshal(data, &req); err != nil {
+				continue
+			}
+			fp.dispatches.Add(1)
+			body := fp.decryptBody(req)
+			select {
+			case fp.bodies <- body:
+			default:
+			}
+			if fp.script != nil {
+				fp.script(ctx, fp, req, body)
+			}
+		}
+	}
+}
+
+// decryptBody decrypts the E2E-encrypted request body with the provider's
+// X25519 private key (nil on failure).
+func (fp *failoverProvider) decryptBody(req protocol.InferenceRequestMessage) []byte {
+	if req.EncryptedBody == nil {
+		fp.t.Logf("provider %s: inference request %s missing encrypted body", fp.name, req.RequestID)
+		return nil
+	}
+	payload := &e2e.EncryptedPayload{
+		EphemeralPublicKey: req.EncryptedBody.EphemeralPublicKey,
+		Ciphertext:         req.EncryptedBody.Ciphertext,
+	}
+	plaintext, err := e2e.DecryptWithPrivateKey(payload, fp.privKey)
+	if err != nil {
+		fp.t.Logf("provider %s: decrypt request body: %v", fp.name, err)
+		return nil
+	}
+	return plaintext
+}
+
+func (fp *failoverProvider) dispatchCount() int {
+	return int(fp.dispatches.Load())
+}
+
+// close shuts the provider WebSocket down (idempotent; safe in t.Cleanup).
+func (fp *failoverProvider) close() {
+	fp.closeOnce.Do(func() {
+		_ = fp.conn.Close(websocket.StatusNormalClosure, "test done")
+	})
+}
+
+// closeNow abruptly drops the provider connection, simulating a crash /
+// network drop mid-request (the OpenRouter partner symptom).
+func (fp *failoverProvider) closeNow() {
+	fp.closeOnce.Do(func() {
+		_ = fp.conn.CloseNow()
+	})
+}
+
+// ---------------------------------------------------------------------------
+// Scripted provider behaviors
+// ---------------------------------------------------------------------------
+
+// roleOnlyChunkSSE is the OpenAI boilerplate role-only delta chunk every
+// backend emits before any content. Per the WS-C contract it must NOT commit
+// the dispatch.
+func roleOnlyChunkSSE(model string) string {
+	return fmt.Sprintf(`data: {"id":"chatcmpl-failover","object":"chat.completion.chunk","created":1700000000,"model":%q,"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}`+"\n\n", model)
+}
+
+func contentChunkSSE(model, text string) string {
+	data, _ := json.Marshal(text)
+	return fmt.Sprintf(`data: {"id":"chatcmpl-failover","object":"chat.completion.chunk","created":1700000000,"model":%q,"choices":[{"index":0,"delta":{"content":%s},"finish_reason":null}]}`+"\n\n", model, data)
+}
+
+func (fp *failoverProvider) sendRoleChunk(ctx context.Context, req protocol.InferenceRequestMessage, model string) {
+	writeEncryptedTestChunk(fp.t, ctx, fp.conn, req, fp.pubKey, roleOnlyChunkSSE(model))
+}
+
+func (fp *failoverProvider) sendContentChunk(ctx context.Context, req protocol.InferenceRequestMessage, model, text string) {
+	writeEncryptedTestChunk(fp.t, ctx, fp.conn, req, fp.pubKey, contentChunkSSE(model, text))
+}
+
+func (fp *failoverProvider) sendInferenceError(ctx context.Context, req protocol.InferenceRequestMessage, errMsg string, statusCode int) {
+	msg := protocol.InferenceErrorMessage{
+		Type:       protocol.TypeInferenceError,
+		RequestID:  req.RequestID,
+		Error:      errMsg,
+		StatusCode: statusCode,
+	}
+	data, _ := json.Marshal(msg)
+	if err := fp.conn.Write(ctx, websocket.MessageText, data); err != nil {
+		fp.t.Logf("provider %s: write inference_error: %v", fp.name, err)
+	}
+}
+
+func (fp *failoverProvider) sendComplete(ctx context.Context, req protocol.InferenceRequestMessage, usage protocol.UsageInfo) {
+	msg := protocol.InferenceCompleteMessage{
+		Type:      protocol.TypeInferenceComplete,
+		RequestID: req.RequestID,
+		Usage:     usage,
+	}
+	data, _ := json.Marshal(msg)
+	if err := fp.conn.Write(ctx, websocket.MessageText, data); err != nil {
+		fp.t.Logf("provider %s: write inference_complete: %v", fp.name, err)
+	}
+}
+
+// serveFull streams role + one content chunk carrying marker, then completes.
+func (fp *failoverProvider) serveFull(ctx context.Context, req protocol.InferenceRequestMessage, model, marker string) {
+	fp.sendRoleChunk(ctx, req, model)
+	fp.sendContentChunk(ctx, req, model, marker)
+	fp.sendComplete(ctx, req, protocol.UsageInfo{PromptTokens: 5, CompletionTokens: 3})
+}
+
+// markerFor returns the content marker a full-serve script emits for a
+// provider, so tests can assert WHICH provider's content reached the consumer.
+func markerFor(name string) string {
+	return "content-from-" + name
+}
+
+// fullServeScript serves every dispatch successfully with the provider's marker.
+func fullServeScript(model string) inferenceScript {
+	return func(ctx context.Context, fp *failoverProvider, req protocol.InferenceRequestMessage, body []byte) {
+		fp.serveFull(ctx, req, model, markerFor(fp.name))
+	}
+}
+
+// dispatchRecorder tracks the global order in which providers received
+// dispatches, so failover tests are independent of which provider the
+// scheduler happens to pick first.
+type dispatchRecorder struct {
+	mu    sync.Mutex
+	order []string
+}
+
+func (d *dispatchRecorder) record(name string) int {
+	d.mu.Lock()
+	defer d.mu.Unlock()
+	d.order = append(d.order, name)
+	return len(d.order)
+}
+
+func (d *dispatchRecorder) sequence() []string {
+	d.mu.Lock()
+	defer d.mu.Unlock()
+	out := make([]string, len(d.order))
+	copy(out, d.order)
+	return out
+}
+
+// failFirstScript makes the provider that receives the globally-FIRST dispatch
+// fail pre-content (role-only chunk, then failMode), while every later
+// dispatch is served fully. failMode is "error" (inference_error 500) or
+// "disconnect" (abrupt WebSocket drop after the role chunk).
+func failFirstScript(rec *dispatchRecorder, model, failMode string) inferenceScript {
+	return func(ctx context.Context, fp *failoverProvider, req protocol.InferenceRequestMessage, body []byte) {
+		seq := rec.record(fp.name)
+		if seq == 1 {
+			fp.sendRoleChunk(ctx, req, model)
+			// Let the role chunk relay through the coordinator before the
+			// failure signal so the "boilerplate already flowed" ordering is
+			// deterministic.
+			time.Sleep(40 * time.Millisecond)
+			switch failMode {
+			case "error":
+				fp.sendInferenceError(ctx, req, "simulated backend failure", http.StatusInternalServerError)
+			case "disconnect":
+				fp.closeNow()
+			default:
+				fp.t.Errorf("unknown failMode %q", failMode)
+			}
+			return
+		}
+		fp.serveFull(ctx, req, model, markerFor(fp.name))
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Consumer helpers
+// ---------------------------------------------------------------------------
+
+// buildChatBody constructs a chat-completions request body. tools (optional)
+// is an OpenAI tools array. max_tokens is set explicitly so the scheduler's
+// cost model (reqMax/decodeTPS) deterministically prefers high-TPS providers.
+func buildChatBody(t *testing.T, model string, stream bool, tools []map[string]any) string {
+	t.Helper()
+	body := map[string]any{
+		"model":      model,
+		"messages":   []map[string]any{{"role": "user", "content": "failover test prompt"}},
+		"stream":     stream,
+		"max_tokens": 64,
+	}
+	if tools != nil {
+		body["tools"] = tools
+	}
+	data, err := json.Marshal(body)
+	if err != nil {
+		t.Fatalf("marshal chat body: %v", err)
+	}
+	return string(data)
+}
+
+// postChat sends a chat-completions request and drains the full response.
+func postChat(ctx context.Context, tsURL, apiKey, body string) (int, string, error) {
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, tsURL+"/v1/chat/completions", strings.NewReader(body))
+	if err != nil {
+		return 0, "", err
+	}
+	req.Header.Set("Authorization", "Bearer "+apiKey)
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		return 0, "", err
+	}
+	defer resp.Body.Close()
+	respBody, _ := io.ReadAll(resp.Body)
+	return resp.StatusCode, string(respBody), nil
+}
+
+// assertCleanFailoverStream asserts the consumer-visible contract of a
+// transparent failover: HTTP 200, the winning provider's content present,
+// exactly one [DONE], and no in-band {"error"} event anywhere.
+func assertCleanFailoverStream(t *testing.T, status int, body, wantMarker string) {
+	t.Helper()
+	if status != http.StatusOK {
+		t.Errorf("status = %d, want 200; body = %s", status, body)
+	}
+	if wantMarker != "" && !strings.Contains(body, wantMarker) {
+		t.Errorf("stream missing failover content %q; body = %s", wantMarker, body)
+	}
+	if n := strings.Count(body, "data: [DONE]"); n != 1 {
+		t.Errorf("stream has %d [DONE] terminators, want exactly 1; body = %s", n, body)
+	}
+	if strings.Contains(body, `"error"`) {
+		t.Errorf("stream contains an in-band error event — provider failure leaked to the consumer; body = %s", body)
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Test 1: pre-content failover on provider error (streaming)
+// ---------------------------------------------------------------------------
+
+// TestPreContentFailover_ErrorAfterRoleChunk: the first-dispatched provider
+// sends ONLY the boilerplate role chunk, then an inference_error (500). The
+// coordinator must NOT commit on the role chunk: it retries transparently on
+// the other provider, and the consumer sees a clean 200 stream with the other
+// provider's content, exactly one [DONE], and no in-band error.
+//
+// INTEGRATION-NOTE(WS-C): fails against the pre-workstream coordinator (role
+// chunk commits; error surfaces in-band). Encodes the deferred-commit contract.
+func TestPreContentFailover_ErrorAfterRoleChunk(t *testing.T) {
+	reg, _, ts := setupFailoverServer(t)
+
+	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
+	defer cancel()
+
+	model := "failover-error-model"
+	rec := &dispatchRecorder{}
+	script := failFirstScript(rec, model, "error")
+
+	// DecodeTPS 200 vs 1 puts the providers ~63s apart in scheduler cost
+	// (max_tokens=64), far outside the 3s near-tie window — provider A is
+	// deterministically dispatched first. The script is order-independent
+	// anyway: whoever is dispatched first fails.
+	pA := startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-a", Version: "0.6.4", DecodeTPS: 200,
+		Models: []failoverModelSpec{{ID: model}}, Script: script,
+	})
+	pB := startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-b", Version: "0.6.4", DecodeTPS: 1,
+		Models: []failoverModelSpec{{ID: model}}, Script: script,
+	})
+
+	status, body, err := postChat(ctx, ts.URL, "test-key", buildChatBody(t, model, true, nil))
+	if err != nil {
+		t.Fatalf("chat request: %v", err)
+	}
+
+	seq := rec.sequence()
+	if len(seq) != 2 {
+		t.Fatalf("dispatch sequence = %v, want exactly 2 dispatches (failed primary + failover winner); status=%d body=%s", seq, status, body)
+	}
+	if seq[0] == seq[1] {
+		t.Errorf("both dispatches went to %q — failover must retry on a DIFFERENT provider", seq[0])
+	}
+	assertCleanFailoverStream(t, status, body, markerFor(seq[1]))
+	if strings.Contains(body, markerFor(seq[0])) {
+		t.Errorf("stream contains content from the failed provider %q; body = %s", seq[0], body)
+	}
+	if got := pA.dispatchCount() + pB.dispatchCount(); got != 2 {
+		t.Errorf("total dispatches = %d, want 2", got)
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Test 1b: pre-content failover on provider error (non-streaming)
+// ---------------------------------------------------------------------------
+
+// TestPreContentFailover_ErrorAfterRoleChunk_NonStreaming is the stream:false
+// variant of Test 1: the assembled JSON response must carry the failover
+// winner's content and no error.
+//
+// INTEGRATION-NOTE(WS-C): same contract dependency as Test 1.
+func TestPreContentFailover_ErrorAfterRoleChunk_NonStreaming(t *testing.T) {
+	reg, _, ts := setupFailoverServer(t)
+
+	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
+	defer cancel()
+
+	model := "failover-error-nonstream-model"
+	rec := &dispatchRecorder{}
+	script := failFirstScript(rec, model, "error")
+
+	startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-a", Version: "0.6.4", DecodeTPS: 200,
+		Models: []failoverModelSpec{{ID: model}}, Script: script,
+	})
+	startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-b", Version: "0.6.4", DecodeTPS: 1,
+		Models: []failoverModelSpec{{ID: model}}, Script: script,
+	})
+
+	status, body, err := postChat(ctx, ts.URL, "test-key", buildChatBody(t, model, false, nil))
+	if err != nil {
+		t.Fatalf("chat request: %v", err)
+	}
+	if status != http.StatusOK {
+		t.Fatalf("status = %d, want 200; body = %s", status, body)
+	}
+
+	seq := rec.sequence()
+	if len(seq) != 2 {
+		t.Fatalf("dispatch sequence = %v, want exactly 2 dispatches; body = %s", seq, body)
+	}
+	if seq[0] == seq[1] {
+		t.Errorf("both dispatches went to %q — failover must retry on a DIFFERENT provider", seq[0])
+	}
+
+	var resp struct {
+		Choices []struct {
+			Message struct {
+				Content string `json:"content"`
+			} `json:"message"`
+		} `json:"choices"`
+		Error any `json:"error"`
+	}
+	if err := json.Unmarshal([]byte(body), &resp); err != nil {
+		t.Fatalf("response is not valid JSON: %v; body = %s", err, body)
+	}
+	if resp.Error != nil {
+		t.Errorf("response contains an error field — provider failure leaked: %s", body)
+	}
+	if len(resp.Choices) == 0 || !strings.Contains(resp.Choices[0].Message.Content, markerFor(seq[1])) {
+		t.Errorf("response content missing failover winner marker %q; body = %s", markerFor(seq[1]), body)
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Test 2: pre-content failover on provider disconnect
+// ---------------------------------------------------------------------------
+
+// TestPreContentFailover_DisconnectAfterRoleChunk: identical to Test 1, but
+// the first-dispatched provider drops its WebSocket after the role chunk
+// instead of sending an error — the exact OpenRouter partner symptom. The
+// registry converts the drop into a "provider disconnected" (502) terminal;
+// pre-content, that must trigger a transparent retry, not an in-band error.
+//
+// INTEGRATION-NOTE(WS-C): fails against the pre-workstream coordinator.
+func TestPreContentFailover_DisconnectAfterRoleChunk(t *testing.T) {
+	reg, _, ts := setupFailoverServer(t)
+
+	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
+	defer cancel()
+
+	model := "failover-disconnect-model"
+	rec := &dispatchRecorder{}
+	script := failFirstScript(rec, model, "disconnect")
+
+	startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-a", Version: "0.6.4", DecodeTPS: 200,
+		Models: []failoverModelSpec{{ID: model}}, Script: script,
+	})
+	startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-b", Version: "0.6.4", DecodeTPS: 1,
+		Models: []failoverModelSpec{{ID: model}}, Script: script,
+	})
+
+	status, body, err := postChat(ctx, ts.URL, "test-key", buildChatBody(t, model, true, nil))
+	if err != nil {
+		t.Fatalf("chat request: %v", err)
+	}
+
+	seq := rec.sequence()
+	if len(seq) != 2 {
+		t.Fatalf("dispatch sequence = %v, want exactly 2 dispatches (dropped primary + failover winner); status=%d body=%s", seq, status, body)
+	}
+	if seq[0] == seq[1] {
+		t.Errorf("both dispatches went to %q — failover must retry on a DIFFERENT provider", seq[0])
+	}
+	assertCleanFailoverStream(t, status, body, markerFor(seq[1]))
+	if strings.Contains(body, markerFor(seq[0])) {
+		t.Errorf("stream contains content from the dropped provider %q; body = %s", seq[0], body)
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Test 3: post-content errors must STILL surface in-band
+// ---------------------------------------------------------------------------
+
+// TestPostContentErrorStillSurfaced guards the correctness boundary of the
+// deferred-commit change: once a CONTENT-bearing chunk has flowed to the
+// consumer, a provider failure must surface as an in-band error — silently
+// retrying on another provider would duplicate/corrupt already-delivered
+// output. Provider B is present and healthy specifically so a (buggy)
+// post-content retry would be detectable: its marker must NOT appear and it
+// must receive no dispatch.
+//
+// Passes against the current coordinator; must keep passing after WS-C.
+func TestPostContentErrorStillSurfaced(t *testing.T) {
+	reg, _, ts := setupFailoverServer(t)
+
+	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
+	defer cancel()
+
+	model := "post-content-error-model"
+	const partialContent = "partial-content-before-failure"
+
+	failAfterContent := func(ctx context.Context, fp *failoverProvider, req protocol.InferenceRequestMessage, body []byte) {
+		fp.sendRoleChunk(ctx, req, model)
+		fp.sendContentChunk(ctx, req, model, partialContent)
+		// Let the content chunk relay before the error terminal.
+		time.Sleep(40 * time.Millisecond)
+		fp.sendInferenceError(ctx, req, "backend exploded mid-generation", http.StatusInternalServerError)
+	}
+
+	startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-a", Version: "0.6.4", DecodeTPS: 200,
+		Models: []failoverModelSpec{{ID: model}}, Script: failAfterContent,
+	})
+	pB := startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-b", Version: "0.6.4", DecodeTPS: 1,
+		Models: []failoverModelSpec{{ID: model}}, Script: fullServeScript(model),
+	})
+
+	status, body, err := postChat(ctx, ts.URL, "test-key", buildChatBody(t, model, true, nil))
+	if err != nil {
+		t.Fatalf("chat request: %v", err)
+	}
+	if status != http.StatusOK {
+		t.Fatalf("status = %d, want 200 (stream committed by the content chunk); body = %s", status, body)
+	}
+
+	idxContent := strings.Index(body, partialContent)
+	idxErr := strings.Index(body, `"provider_error"`)
+	if idxContent < 0 {
+		t.Errorf("stream missing the pre-failure content chunk %q; body = %s", partialContent, body)
+	}
+	if idxErr < 0 {
+		t.Errorf("stream did NOT surface an in-band error after content had flowed — silent post-content retry is a correctness bug; body = %s", body)
+	}
+	if idxContent >= 0 && idxErr >= 0 && idxContent > idxErr {
+		t.Errorf("in-band error appeared BEFORE the content chunk (content@%d, error@%d); body = %s", idxContent, idxErr, body)
+	}
+	if strings.Contains(body, markerFor("provider-b")) {
+		t.Errorf("stream contains provider-b content — coordinator silently retried AFTER content had flowed; body = %s", body)
+	}
+	if got := pB.dispatchCount(); got != 0 {
+		t.Errorf("provider-b received %d dispatch(es), want 0 — no retry after content has flowed", got)
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Test 8: boilerplate role chunk then clean close
+// ---------------------------------------------------------------------------
+
+// TestBoilerplateThenCleanClose guards the held-chunks-then-clean-close edge
+// of deferred commit: a provider that sends only the boilerplate role chunk
+// and then a clean inference_complete (zero content) must yield a well-formed,
+// empty-ish 200 completion — no hang, no in-band error, exactly one [DONE].
+//
+// Passes against the current coordinator; must keep passing after WS-C (the
+// held boilerplate must be committed/flushed on clean close, not retried and
+// not abandoned).
+func TestBoilerplateThenCleanClose(t *testing.T) {
+	reg, _, ts := setupFailoverServer(t)
+
+	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
+	defer cancel()
+
+	model := "boilerplate-clean-close-model"
+
+	roleThenComplete := func(ctx context.Context, fp *failoverProvider, req protocol.InferenceRequestMessage, body []byte) {
+		fp.sendRoleChunk(ctx, req, model)
+		time.Sleep(30 * time.Millisecond)
+		fp.sendComplete(ctx, req, protocol.UsageInfo{PromptTokens: 5, CompletionTokens: 0})
+	}
+
+	pA := startFailoverProvider(t, ctx, ts, reg, failoverProviderConfig{
+		Name: "provider-a", Version: "0.6.4", DecodeTPS: 100,
+		Models: []failoverModelSpec{{ID: model}}, Script: roleThenComplete,
+	})
+
+	start := time.Now()
+	status, body, err := postChat(ctx, ts.URL, "test-key", buildChatBody(t, model, true, nil))
+	if err != nil {
+		t.Fatalf("chat request: %v", err)
+	}
+	elapsed := time.Since(start)
+
+	if status != http.StatusOK {
+		t.Fatalf("status = %d, want 200; body = %s", status, body)
+	}
+	if n := strings.Count(body, "data: [DONE]"); n != 1 {
+		t.Errorf("stream has %d [DONE] terminators, want exactly 1; body = %s", n, body)
+	}
+	if strings.Contains(body, `"error"`) {
+		t.Errorf("clean zero-content completion surfaced an error; body = %s", body)
+	}
+	if got := pA.dispatchCount(); got != 1 {
+		t.Errorf("provider received %d dispatch(es), want 1 (clean close must not trigger a retry)", got)
+	}
+	// Guard the no-hang property well inside the per-test wall budget.
+	if elapsed > 8*time.Second {
+		t.Errorf("empty completion took %s — held-chunks-then-clean-close is hanging", elapsed)
+	}
+}


🟡 [MEDIUM] 🧩 Large integration test file duplicates provider setup patterns

💡 Suggestion: Extract fake provider harness into reusable test utilities shared across test files

📊 Score: 3×3 = 9 · Category: duplicate logic

ethenotethan · 2026-06-12T01:16:22Z

+}
+
+// CooldownShape returns the inference-error circuit-breaker dimension for the
+// request shape. The breaker is keyed by (provider, model, shape) so that a
+// deterministic failure that affects ONLY one shape — e.g. a chat-template
+// crash on tool schemas — accumulates strikes in its own bucket and is not
+// reset by interleaved clean text successes. Today only "tools" splits off the
+// "base" bucket; vision (and any future template-rendered shape) can extend
+// this. Keep in sync with the shapes the consumer stamps onto a request.
+//
+// EXPORTED (vs the contract's lowercase cooldownShape): the consumer handler in
+// the coordinator/api package derives the shape for RecordInferenceError /
+// RecordInferenceSuccess by calling this on a registry.RequestTraits, which is
+// impossible across packages with an unexported method. The capitalized name is
+// the only viable cross-package signature; the integrator should ensure the
+// consumer calls CooldownShape().
+func (t RequestTraits) CooldownShape() string {
+	if t.HasTools {
+		return "tools"
+	}
+	return "base"
+}


🔵 [INFO] 🧩 RequestTraits struct with multiple capability flags could grow complex

💡 Suggestion: Consider a more extensible trait system if more request attributes will be added

📊 Score: 2×3 = 6 · Category: over-configuration

ethenotethan · 2026-06-12T01:16:22Z

 	//
 	// No HTTP response is written until a provider starts generating, so
 	// retries and speculative dispatch are invisible to the consumer.
-	var (
-		provider    *registry.Provider
-		pr          *registry.PendingRequest
-		requestID   string
-		firstChunk  string
-		lastErr     string
-		lastErrCode int
-		committed   bool
-	)
-
+	// Dispatch is driven by the per-request state machine in dispatch.go: it
+	// picks a provider (or queues), runs the speculative TTFT-aware first-chunk
+	// wait with an invisible backup race + failover up to maxDispatchAttempts,
+	// commits exactly once, then writes attestation/timing headers and streams.
 	consumerKey := consumerKeyFromContext(r.Context())
 	consumerLocation := s.requestLocation(r)
-
-	// Track providers that failed during retry so we don't dispatch to them again.
-	excludeProviders := make(map[string]struct{})
-
 	deadline := ttftDeadline(estimatedPromptTokens)
-	speculativeAt := time.Duration(float64(deadline) * speculativeTimerRatio)
-
-	for attempt := range maxDispatchAttempts {
-		// Dispatch the primary provider.
-		var dispatchErr string
-		var dispatchErrCode int
-		provider, pr, _, dispatchErr, dispatchErrCode = s.dispatchOneProvider(
-			r, model, publicModel, rawBody, consumerKey, consumerLocation, reservedMicroUSD,
-			estimatedPromptTokens, requestedMaxTokens, requiresVision, allowedProviderSerials,
-			isResponsesAPI, policy, timing, excludeProviders,
-		)
-		if provider == nil {
-			// No online provider has enough memory to ever fit this model.
-			// Retrying and queueing are both pointless — reject immediately
-			// with a clear, non-retryable error.
-			if dispatchErr == errModelTooLarge {
-				s.ddIncr("routing.decisions", []string{"model:" + model, "model_type:" + s.registry.ModelType(model), "outcome:model_too_large"})
-				lastErr = dispatchErr
-				lastErrCode = dispatchErrCode
-				break
-			}
-
-			// dispatchOneProvider may have found a provider but rejected it
-			// (payout destination missing, insufficient funds, encryption
-			// missing). In that case it already added the provider to
-			// excludeProviders. If there may be more providers to try,
-			// continue to the next attempt.
-			providerWasRejected := dispatchErr != "no provider available"
-			if providerWasRejected {
-				lastErr = dispatchErr
-				lastErrCode = dispatchErrCode
-				continue
-			}
-
-			// On retry attempts, don't queue — if the only available
-			// providers already failed, waiting 120s for one of them
-			// to come back won't help. Break and return the last error.
-			// Don't overwrite lastErr/lastErrCode from the real provider
-			// error — preserve the original status code.
-			if attempt > 0 {
-				if lastErr == "" {
-					lastErr = dispatchErr
-					lastErrCode = dispatchErrCode
-				}
-				break
-			}
-			// No idle provider — try queueing.
-			requestID = uuid.New().String()
-			queuePR := &registry.PendingRequest{
-				RequestID:              requestID,
-				Model:                  model,
-				PublicModel:            publicModel,
-				ConsumerKey:            consumerKey,
-				KeyID:                  keyIDFromContext(r.Context()),
-				KeyLimitMicroUSD:       keyLimitMicroFromContext(r.Context()),
-				KeyLimitReset:          keyLimitResetFromContext(r.Context()),
-				ConsumerLocation:       consumerLocation,
-				IsResponsesAPI:         isResponsesAPI,
-				EstimatedPromptTokens:  estimatedPromptTokens,
-				RequiresVision:         requiresVision,
-				RequestedMaxTokens:     requestedMaxTokens,
-				ReservedMicroUSD:       reservedMicroUSD,
-				BaseReservedMicroUSD:   reservedMicroUSD,
-				AllowedProviderSerials: allowedProviderSerials,
-				SelfRouteOnly:          policy.enabled,
-				PreferOwner:            policy.prefer,
-				OwnerAccountID:         policy.ownerAccountID,
-				FreeSelfRoute:          policy.enabled,
-				AcceptedCh:             make(chan struct{}, 1),
-				ChunkCh:                make(chan string, chunkBufferSize),
-				CompleteCh:             make(chan protocol.UsageInfo, 1),
-				ErrorCh:                make(chan protocol.InferenceErrorMessage, 1),
-				Timing:                 timing,
-			}
-			queuedReq := &registry.QueuedRequest{
-				RequestID:  requestID,
-				Model:      model,
-				Pending:    queuePR,
-				ResponseCh: make(chan *registry.Provider, 1),
-			}
-			queuePR.Timing.QueuedAt = time.Now()
-			if err := s.registry.Queue().Enqueue(queuedReq); err != nil {
-				s.ddIncr("routing.decisions", []string{"model:" + model, "model_type:" + s.registry.ModelType(model), "outcome:over_capacity"})
-				retryAfter := s.estimateRetryAfter(model)
-				w.Header().Set("Retry-After", strconv.Itoa(retryAfter))
-				refundReservation()
-				if policy.enabled {
-					writeJSON(w, http.StatusTooManyRequests, errorResponse("machine_busy",
-						"your machine is at capacity — retry shortly", withCode("machine_busy")))
-				} else {
-					writeJSON(w, http.StatusTooManyRequests, errorResponse("rate_limit_exceeded",
-						fmt.Sprintf("all providers for model %q are at capacity and queue is full", publicModel),
-						withCode("rate_limit_exceeded")))
-				}
-				return
-			}
-			s.ddIncr("routing.decisions", []string{"model:" + model, "model_type:" + s.registry.ModelType(model), "outcome:queued"})
-
-			s.logger.Info("request queued, waiting for provider",
-				"model", model,
-				"attempt", attempt+1,
-			)
-
-			var err error
-			provider, err = s.registry.Queue().WaitForProviderContext(r.Context(), queuedReq)
-			if err != nil {
-				if errors.Is(err, context.Canceled) {
-					refundReservation()
-					return
-				}
-				refundReservation()
-				s.ddIncr("request_queue.timeout", []string{"model:" + model, "model_type:" + s.registry.ModelType(model)})
-				retryAfter := s.estimateRetryAfter(model)
-				w.Header().Set("Retry-After", strconv.Itoa(retryAfter))
-				if policy.enabled {
-					writeJSON(w, http.StatusTooManyRequests, errorResponse("machine_busy",
-						"your machine is at capacity (timed out waiting for a free slot) — retry shortly", withCode("machine_busy")))
-				} else {
-					writeJSON(w, http.StatusTooManyRequests, errorResponse("rate_limit_exceeded",
-						fmt.Sprintf("all providers for model %q are at capacity (queue timeout)", publicModel),
-						withCode("rate_limit_exceeded")))
-				}
-				return
-			}
-			// Queue assigned a provider; still need to dispatch.
-			// Use the queue PR's channels.
-			pr = queuePR
-			requestID = pr.RequestID
-			timing.RoutedAt = time.Now()
-
-			// Log missing payout destination but don't skip — earnings
-			// are credited to the provider's internal ledger and can be
-			// withdrawn once they complete Stripe Connect onboarding.
-			// A queued request settles FREE when its drained provider is the
-			// caller's own machine: exclusive self-route always, OR a prefer
-			// request whose selected provider is owned (settlement refunds to
-			// zero). Skip the payout warning and the custom-price top-up then
-			// (the top-up could otherwise 429 the free owned route).
-			queuedSettlesFree := policy.enabled
-			if !queuedSettlesFree && policy.prefer {
-				provider.Mu().Lock()
-				queuedSettlesFree = policy.ownerAccountID != "" && provider.AccountID == policy.ownerAccountID
-				provider.Mu().Unlock()
-			}
-
-			if s.billing != nil && !queuedSettlesFree && !providerHasPayoutDestination(provider) {
-				s.logger.Warn("queued provider missing payout destination, crediting to internal ledger",
-					"request_id", requestID,
-					"provider_id", provider.ID,
-				)
-			}
-
-			// Custom pricing check — provider may charge more than the
-			// platform rate. Reserve the additional amount now. Skipped for
-			// free self-route, which settles at zero cost.
-			if s.billing != nil && !queuedSettlesFree {
-				if _, err := s.reserveAdditionalForProvider(pr, provider); err != nil {
-					provider.RemovePending(requestID)
-					s.registry.SetProviderIdle(provider.ID)
-					excludeProviders[provider.ID] = struct{}{}
-					if errors.Is(err, store.ErrInsufficientBalance) {
-						s.logger.Warn("queued provider pricing exceeds balance, skipping",
-							"request_id", requestID,
-							"provider_id", provider.ID,
-							"error", err,
-						)
-						lastErr = "insufficient funds for provider price"
-						lastErrCode = http.StatusPaymentRequired
-					} else {
-						s.logger.Error("queued provider reservation failed (DB error)",
-							"request_id", requestID,
-							"provider_id", provider.ID,
-							"error", err,
-						)
-						lastErr = "service temporarily unavailable — please retry"
-						lastErrCode = http.StatusServiceUnavailable
-					}
-					continue
-				}
-			}
-			// Perform E2E encryption and send the request.
-			if provider.PublicKey == "" {
-				provider.RemovePending(requestID)
-				s.registry.SetProviderIdle(provider.ID)
-				s.refundProviderExtra(pr)
-				excludeProviders[provider.ID] = struct{}{}
-				lastErr = "no provider with E2E encryption"
-				continue
-			}
-			providerPubKey, err := e2e.ParsePublicKey(provider.PublicKey)
-			if err != nil {
-				provider.RemovePending(requestID)
-				s.registry.SetProviderIdle(provider.ID)
-				s.refundProviderExtra(pr)
-				excludeProviders[provider.ID] = struct{}{}
-				lastErr = "provider public key invalid"
-				continue
-			}
-			sessionKeys, err := e2e.GenerateSessionKeys()
-			if err != nil {
-				provider.RemovePending(requestID)
-				s.registry.SetProviderIdle(provider.ID)
-				s.refundProviderExtra(pr)
-				lastErr = "failed to generate session keys"
-				continue
-			}
-			encrypted, err := e2e.Encrypt(rawBody, providerPubKey, sessionKeys)
-			if err != nil {
-				provider.RemovePending(requestID)
-				s.registry.SetProviderIdle(provider.ID)
-				s.refundProviderExtra(pr)
-				lastErr = "failed to encrypt request"
-				continue
-			}
-			timing.EncryptedAt = time.Now()
-			wireMsg := map[string]any{
-				"type":       protocol.TypeInferenceRequest,
-				"request_id": requestID,
-				"encrypted_body": map[string]string{
-					"ephemeral_public_key": encrypted.EphemeralPublicKey,
-					"ciphertext":           encrypted.Ciphertext,
-				},
-			}
-			pr.SessionPrivKey = &sessionKeys.PrivateKey
-			// pr.ReservedMicroUSD was already set in the struct literal and may
-			// have been increased by reserveAdditionalForProvider. Don't overwrite.
-			data, _ := json.Marshal(wireMsg)
-			if err := provider.Conn.Write(r.Context(), websocket.MessageText, data); err != nil {
-				provider.RemovePending(requestID)
-				s.registry.SetProviderIdle(provider.ID)
-				s.refundProviderExtra(pr)
-				excludeProviders[provider.ID] = struct{}{}
-				lastErr = "failed to send request to provider"
-				continue
-			}
-			pr.Timing.DispatchedAt = time.Now()
-		}
-		requestID = pr.RequestID
-		if timing.RoutedAt.IsZero() {
-			timing.RoutedAt = time.Now()
-		}
-		s.ddIncr("routing.decisions", []string{"model:" + model, "outcome:selected"})
-		s.ddIncr("routing.provider_selected", []string{"provider_id:" + provider.ID, "model:" + model})
-
-		s.logger.Info("inference request dispatched",
-			"trace_id", requestIDFromContext(r.Context()),
-			"request_id", requestID,
-			"model", model,
-			"provider_id", provider.ID,
-			"stream", stream,
-			"attempt", attempt+1,
-		)
-
-		s.logger.Info("dispatch_pool",
-			"model", model,
-			"ttft_deadline_ms", deadline.Milliseconds(),
-			"speculative_at_ms", speculativeAt.Milliseconds(),
-		)
-
-		// ---- Speculative TTFT-aware first-chunk wait ----
-		//
-		// Phase 1: Wait for first chunk with speculative timer.
-		// - If primary sends first chunk → commit.
-		// - If primary sends accepted → extend to inferenceTimeout (model reload).
-		// - If primary errors → retry immediately (sequential fallback).
-		// - If speculative timer fires → dispatch backup and race.
-		// - If full deadline expires → fail.
-
-		speculativeTimer := time.NewTimer(speculativeAt)
-		deadlineTimer := time.NewTimer(deadline)
-		accepted := false
-
-		select {
-		case chunk, ok := <-pr.ChunkCh:
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			if ok {
-				firstChunk = chunk
-				pr.Timing.FirstChunkAt = time.Now()
-				committed = true
-			} else {
-				select {
-				case errMsg := <-pr.ErrorCh:
-					excludeProviders[provider.ID] = struct{}{}
-					s.cancelDispatch(provider, pr)
-					lastErr = errMsg.Error
-					lastErrCode = errMsg.StatusCode
-					provider = nil
-					pr = nil
-					continue
-				default:
-					committed = true
-				}
-			}
-
-		case <-pr.AcceptedCh:
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			accepted = true
-
-		case errMsg := <-pr.ErrorCh:
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			excludeProviders[provider.ID] = struct{}{}
-			s.cancelDispatch(provider, pr)
-			lastErr = errMsg.Error
-			lastErrCode = errMsg.StatusCode
-			s.logger.Warn("provider failed, retrying",
-				"request_id", requestID,
-				"provider_id", provider.ID,
-				"attempt", attempt+1,
-				"error", errMsg.Error,
-			)
-			s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-				"provider failed, retrying",
-				map[string]any{
-					"provider_id": provider.ID,
-					"attempt":     attempt + 1,
-					"reason":      "provider_error",
-					"status_code": errMsg.StatusCode,
-				})
-			if s.metrics != nil {
-				s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-			}
-			s.ddIncr("inference.dispatches", []string{"status:retry"})
-			provider = nil
-			pr = nil
-			continue
-
-		case <-speculativeTimer.C:
-			deadlineTimer.Stop()
-			// Primary is slow. Attempt speculative backup dispatch.
-			s.ddIncr("inference.speculative_dispatch", []string{"model:" + model})
-
-			var backupProvider *registry.Provider
-			var backupPR *registry.PendingRequest
-
-			// Do NOT speculatively race a paid PUBLIC backup against a prefer
-			// request that is being served by the caller's OWN machine: the user
-			// opted into "prefer my machine (free)", so a slow owned machine must
-			// be waited on, not raced (and billed) by the public fleet. (Exclusive
-			// self-route is already safe — its backup selection is owned-only and
-			// returns nil when there's no other owned machine.) When the prefer
-			// primary is itself a public provider (the owner owns nothing / fell
-			// back), normal speculative behaviour applies.
-			skipBackup := false
-			if policy.prefer {
-				provider.Mu().Lock()
-				skipBackup = policy.ownerAccountID != "" && provider.AccountID == policy.ownerAccountID
-				provider.Mu().Unlock()
-			}
-
-			if !skipBackup {
-				backupExclude := make(map[string]struct{}, len(excludeProviders)+1)
-				for id := range excludeProviders {
-					backupExclude[id] = struct{}{}
-				}
-				backupExclude[provider.ID] = struct{}{}
-
-				backupProvider, backupPR, _, _, _ = s.dispatchOneProvider(
-					r, model, publicModel, rawBody, consumerKey, consumerLocation, reservedMicroUSD,
-					estimatedPromptTokens, requestedMaxTokens, requiresVision, allowedProviderSerials,
-					isResponsesAPI, policy, &registry.RequestTiming{ReceivedAt: timing.ReceivedAt},
-					backupExclude,
-				)
-			}
-
-			if backupProvider == nil {
-				// No backup available. Keep waiting for primary with remaining deadline.
-				s.logger.Info("speculative_dispatch_no_backup",
-					"request_id", requestID,
-					"primary_provider", provider.ID,
-				)
-				remainingDeadline := time.NewTimer(deadline - speculativeAt)
-				select {
-				case chunk, ok := <-pr.ChunkCh:
-					remainingDeadline.Stop()
-					if ok {
-						firstChunk = chunk
-						pr.Timing.FirstChunkAt = time.Now()
-						committed = true
-					} else {
-						select {
-						case errMsg := <-pr.ErrorCh:
-							excludeProviders[provider.ID] = struct{}{}
-							s.cancelDispatch(provider, pr)
-							lastErr = errMsg.Error
-							lastErrCode = errMsg.StatusCode
-							provider = nil
-							pr = nil
-							continue
-						default:
-							committed = true
-						}
-					}
-				case <-pr.AcceptedCh:
-					remainingDeadline.Stop()
-					accepted = true
-				case errMsg := <-pr.ErrorCh:
-					remainingDeadline.Stop()
-					excludeProviders[provider.ID] = struct{}{}
-					s.cancelDispatch(provider, pr)
-					lastErr = errMsg.Error
-					lastErrCode = errMsg.StatusCode
-					if s.metrics != nil {
-						s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-					}
-					s.ddIncr("inference.dispatches", []string{"status:retry"})
-					provider = nil
-					pr = nil
-					continue
-				case <-remainingDeadline.C:
-					excludeProviders[provider.ID] = struct{}{}
-					s.cancelDispatch(provider, pr)
-					lastErr = "timeout waiting for first response"
-					lastErrCode = http.StatusGatewayTimeout
-					s.logger.Warn("provider timeout (no backup), retrying",
-						"request_id", requestID,
-						"provider_id", provider.ID,
-						"attempt", attempt+1,
-					)
-					s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-						"provider first-chunk timeout",
-						map[string]any{
-							"provider_id": provider.ID,
-							"attempt":     attempt + 1,
-							"reason":      "first_chunk_timeout",
-						})
-					if s.metrics != nil {
-						s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-					}
-					s.ddIncr("inference.dispatches", []string{"status:timeout"})
-					provider = nil
-					pr = nil
-					continue
-				case <-r.Context().Done():
-					remainingDeadline.Stop()
-					s.cancelDispatch(provider, pr)
-					refundReservation()
-					return
-				}
-			} else {
-				// Backup dispatched — race primary vs backup.
-				s.logger.Info("speculative_dispatch",
-					"request_id", requestID,
-					"primary_provider", provider.ID,
-					"backup_provider", backupProvider.ID,
-					"ttft_deadline_ms", deadline.Milliseconds(),
-					"speculative_at_ms", speculativeAt.Milliseconds(),
-				)
-
-				raceDeadline := time.NewTimer(deadline - speculativeAt)
-
-				select {
-				case chunk, ok := <-pr.ChunkCh:
-					// Primary wins!
-					raceDeadline.Stop()
-					s.cancelDispatch(backupProvider, backupPR)
-					if ok {
-						firstChunk = chunk
-						pr.Timing.FirstChunkAt = time.Now()
-						committed = true
-					} else {
-						select {
-						case errMsg := <-pr.ErrorCh:
-							// Primary failed but we already cancelled backup.
-							excludeProviders[provider.ID] = struct{}{}
-							s.cancelDispatch(provider, pr)
-							lastErr = errMsg.Error
-							lastErrCode = errMsg.StatusCode
-							provider = nil
-							pr = nil
-							continue
-						default:
-							committed = true
-						}
-					}
-
-				case chunk, ok := <-backupPR.ChunkCh:
-					// Backup wins!
-					raceDeadline.Stop()
-					s.cancelDispatch(provider, pr)
-					s.ddIncr("inference.speculative_win", []string{"model:" + model})
-					if ok {
-						provider = backupProvider
-						pr = backupPR
-						requestID = pr.RequestID
-						firstChunk = chunk
-						pr.Timing.FirstChunkAt = time.Now()
-						committed = true
-					} else {
-						select {
-						case errMsg := <-backupPR.ErrorCh:
-							// Backup failed too. Keep primary context for retry.
-							excludeProviders[backupProvider.ID] = struct{}{}
-							// Wait remaining deadline for primary.
-							remainingPrimary := time.NewTimer(deadline - speculativeAt)
-							select {
-							case chunk, ok := <-pr.ChunkCh:
-								remainingPrimary.Stop()
-								if ok {
-									firstChunk = chunk
-									pr.Timing.FirstChunkAt = time.Now()
-									committed = true
-								} else {
-									select {
-									case errMsg2 := <-pr.ErrorCh:
-										excludeProviders[provider.ID] = struct{}{}
-										s.cancelDispatch(provider, pr)
-										lastErr = errMsg2.Error
-										lastErrCode = errMsg2.StatusCode
-										provider = nil
-										pr = nil
-										continue
-									default:
-										committed = true
-									}
-								}
-							case <-pr.AcceptedCh:
-								remainingPrimary.Stop()
-								accepted = true
-							case <-remainingPrimary.C:
-								excludeProviders[provider.ID] = struct{}{}
-								s.cancelDispatch(provider, pr)
-								lastErr = errMsg.Error
-								lastErrCode = errMsg.StatusCode
-								provider = nil
-								pr = nil
-								continue
-							case <-r.Context().Done():
-								remainingPrimary.Stop()
-								s.cancelDispatch(provider, pr)
-								refundReservation()
-								return
-							}
-						default:
-							// Backup channel closed with no error — treat as committed.
-							s.cancelDispatch(provider, pr)
-							provider = backupProvider
-							pr = backupPR
-							requestID = pr.RequestID
-							committed = true
-						}
-					}
-
-				case <-pr.AcceptedCh:
-					// Primary accepted (model reload). Cancel backup, extend deadline.
-					raceDeadline.Stop()
-					s.cancelDispatch(backupProvider, backupPR)
-					accepted = true
-
-				case <-backupPR.AcceptedCh:
-					// Backup accepted (model reload). Cancel primary, extend deadline.
-					raceDeadline.Stop()
-					s.cancelDispatch(provider, pr)
-					provider = backupProvider
-					pr = backupPR
-					requestID = pr.RequestID
-					accepted = true
-
-				case errMsg := <-pr.ErrorCh:
-					// Primary failed. Keep waiting for backup.
-					raceDeadline.Stop()
-					excludeProviders[provider.ID] = struct{}{}
-					s.cancelDispatch(provider, pr)
-					// Wait for backup with remaining deadline.
-					backupDeadline := time.NewTimer(deadline - speculativeAt)
-					select {
-					case chunk, ok := <-backupPR.ChunkCh:
-						backupDeadline.Stop()
-						_ = errMsg // used implicitly via excludeProviders
-						if ok {
-							provider = backupProvider
-							pr = backupPR
-							requestID = pr.RequestID
-							firstChunk = chunk
-							pr.Timing.FirstChunkAt = time.Now()
-							committed = true
-						} else {
-							select {
-							case errMsg2 := <-backupPR.ErrorCh:
-								excludeProviders[backupProvider.ID] = struct{}{}
-								s.cancelDispatch(backupProvider, backupPR)
-								lastErr = errMsg2.Error
-								lastErrCode = errMsg2.StatusCode
-								provider = nil
-								pr = nil
-								continue
-							default:
-								provider = backupProvider
-								pr = backupPR
-								requestID = pr.RequestID
-								committed = true
-							}
-						}
-					case <-backupPR.AcceptedCh:
-						backupDeadline.Stop()
-						provider = backupProvider
-						pr = backupPR
-						requestID = pr.RequestID
-						accepted = true
-					case errMsg2 := <-backupPR.ErrorCh:
-						backupDeadline.Stop()
-						excludeProviders[backupProvider.ID] = struct{}{}
-						s.cancelDispatch(backupProvider, backupPR)
-						lastErr = errMsg2.Error
-						lastErrCode = errMsg2.StatusCode
-						provider = nil
-						pr = nil
-						continue
-					case <-backupDeadline.C:
-						excludeProviders[backupProvider.ID] = struct{}{}
-						s.cancelDispatch(backupProvider, backupPR)
-						lastErr = "timeout waiting for first response (backup)"
-						lastErrCode = http.StatusGatewayTimeout
-						if s.metrics != nil {
-							s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-						}
-						s.ddIncr("inference.dispatches", []string{"status:timeout"})
-						provider = nil
-						pr = nil
-						continue
-					case <-r.Context().Done():
-						backupDeadline.Stop()
-						s.cancelDispatch(backupProvider, backupPR)
-						refundReservation()
-						return
-					}
-
-				case errMsg := <-backupPR.ErrorCh:
-					// Backup failed. Keep waiting for primary.
-					raceDeadline.Stop()
-					excludeProviders[backupProvider.ID] = struct{}{}
-					s.cancelDispatch(backupProvider, backupPR)
-					_ = errMsg
-					primaryDeadline := time.NewTimer(deadline - speculativeAt)
-					select {
-					case chunk, ok := <-pr.ChunkCh:
-						primaryDeadline.Stop()
-						if ok {
-							firstChunk = chunk
-							pr.Timing.FirstChunkAt = time.Now()
-							committed = true
-						} else {
-							select {
-							case errMsg2 := <-pr.ErrorCh:
-								excludeProviders[provider.ID] = struct{}{}
-								s.cancelDispatch(provider, pr)
-								lastErr = errMsg2.Error
-								lastErrCode = errMsg2.StatusCode
-								provider = nil
-								pr = nil
-								continue
-							default:
-								committed = true
-							}
-						}
-					case <-pr.AcceptedCh:
-						primaryDeadline.Stop()
-						accepted = true
-					case errMsg2 := <-pr.ErrorCh:
-						primaryDeadline.Stop()
-						excludeProviders[provider.ID] = struct{}{}
-						s.cancelDispatch(provider, pr)
-						lastErr = errMsg2.Error
-						lastErrCode = errMsg2.StatusCode
-						provider = nil
-						pr = nil
-						continue
-					case <-primaryDeadline.C:
-						excludeProviders[provider.ID] = struct{}{}
-						s.cancelDispatch(provider, pr)
-						lastErr = "timeout waiting for first response"
-						lastErrCode = http.StatusGatewayTimeout
-						if s.metrics != nil {
-							s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-						}
-						s.ddIncr("inference.dispatches", []string{"status:timeout"})
-						provider = nil
-						pr = nil
-						continue
-					case <-r.Context().Done():
-						primaryDeadline.Stop()
-						s.cancelDispatch(provider, pr)
-						refundReservation()
-						return
-					}
-
-				case <-raceDeadline.C:
-					// Both missed deadline.
-					s.cancelDispatch(provider, pr)
-					s.cancelDispatch(backupProvider, backupPR)
-					excludeProviders[provider.ID] = struct{}{}
-					excludeProviders[backupProvider.ID] = struct{}{}
-					lastErr = "timeout waiting for first response (both providers)"
-					lastErrCode = http.StatusGatewayTimeout
-					if s.metrics != nil {
-						s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-					}
-					s.ddIncr("inference.dispatches", []string{"status:timeout"})
-					provider = nil
-					pr = nil
-					continue
-
-				case <-r.Context().Done():
-					raceDeadline.Stop()
-					s.cancelDispatch(provider, pr)
-					s.cancelDispatch(backupProvider, backupPR)
-					refundReservation()
-					return
-				}
-			}
-
-		case <-deadlineTimer.C:
-			speculativeTimer.Stop()
-			excludeProviders[provider.ID] = struct{}{}
-			s.cancelDispatch(provider, pr)
-			lastErr = "timeout waiting for first response"
-			lastErrCode = http.StatusGatewayTimeout
-			s.logger.Warn("provider timeout (full deadline), retrying",
-				"request_id", requestID,
-				"provider_id", provider.ID,
-				"attempt", attempt+1,
-			)
-			s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-				"provider first-chunk timeout",
-				map[string]any{
-					"provider_id": provider.ID,
-					"attempt":     attempt + 1,
-					"reason":      "first_chunk_timeout",
-				})
-			if s.metrics != nil {
-				s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-			}
-			s.ddIncr("inference.dispatches", []string{"status:timeout"})
-			provider = nil
-			pr = nil
-			continue
-
-		case <-r.Context().Done():
-			speculativeTimer.Stop()
-			deadlineTimer.Stop()
-			s.cancelDispatch(provider, pr)
-			refundReservation()
-			return
-		}
-
-		// Provider accepted or sent first chunk — commit to this provider.
-		// If only accepted (no chunk yet), wait for the first chunk with
-		// the full inference timeout since the backend may be reloading.
-		if accepted && !committed {
-			chunkTimer := time.NewTimer(inferenceTimeout)
-			select {
-			case chunk, ok := <-pr.ChunkCh:
-				chunkTimer.Stop()
-				if ok {
-					firstChunk = chunk
-					pr.Timing.FirstChunkAt = time.Now()
-					committed = true
-				} else {
-					// Closed — check for error. Use a short grace
-					// period instead of a non-blocking default to
-					// close the race where Go's select picks the
-					// ChunkCh close before the ErrorCh value (sent
-					// by the provider handler before closing ChunkCh).
-					select {
-					case errMsg := <-pr.ErrorCh:
-						excludeProviders[provider.ID] = struct{}{}
-						s.cancelDispatch(provider, pr)
-						lastErr = errMsg.Error
-						lastErrCode = errMsg.StatusCode
-						s.logger.Warn("provider failed after accepting request, retrying",
-							"request_id", requestID,
-							"provider_id", provider.ID,
-							"attempt", attempt+1,
-							"error", errMsg.Error,
-						)
-						s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-							"provider failed after accepting request, retrying",
-							map[string]any{
-								"provider_id": provider.ID,
-								"attempt":     attempt + 1,
-								"reason":      "provider_error",
-								"status_code": errMsg.StatusCode,
-							})
-						if s.metrics != nil {
-							s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-						}
-						s.ddIncr("inference.dispatches", []string{"status:retry"})
-						provider = nil
-						pr = nil
-						continue
-					case <-time.After(50 * time.Millisecond):
-						committed = true
-					}
-				}
-			case errMsg := <-pr.ErrorCh:
-				chunkTimer.Stop()
-				excludeProviders[provider.ID] = struct{}{}
-				s.cancelDispatch(provider, pr)
-				lastErr = errMsg.Error
-				lastErrCode = errMsg.StatusCode
-				s.logger.Warn("provider failed after accepting request, retrying",
-					"request_id", requestID,
-					"provider_id", provider.ID,
-					"attempt", attempt+1,
-					"error", errMsg.Error,
-				)
-				s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-					"provider failed after accepting request, retrying",
-					map[string]any{
-						"provider_id": provider.ID,
-						"attempt":     attempt + 1,
-						"reason":      "provider_error",
-						"status_code": errMsg.StatusCode,
-					})
-				if s.metrics != nil {
-					s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "retry"})
-				}
-				s.ddIncr("inference.dispatches", []string{"status:retry"})
-				provider = nil
-				pr = nil
-				continue
-			case <-chunkTimer.C:
-				excludeProviders[provider.ID] = struct{}{}
-				s.cancelDispatch(provider, pr)
-				lastErr = "provider accepted but timed out before first chunk"
-				lastErrCode = http.StatusGatewayTimeout
-				s.logger.Warn("provider timed out after accepting request, retrying",
-					"request_id", requestID,
-					"provider_id", provider.ID,
-					"attempt", attempt+1,
-				)
-				s.emitRequest(r.Context(), protocol.SeverityWarn, requestID,
-					"provider accepted timeout",
-					map[string]any{
-						"provider_id": provider.ID,
-						"attempt":     attempt + 1,
-						"reason":      "accepted_timeout",
-					})
-				if s.metrics != nil {
-					s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "timeout"})
-				}
-				s.ddIncr("inference.dispatches", []string{"status:timeout"})
-				provider = nil
-				pr = nil
-				continue
-			case <-r.Context().Done():
-				s.cancelDispatch(provider, pr)
-				refundReservation()
-				return
-			}
-		}
-
-		break
-	}
-
-	if !committed {
-		refundReservation()
-		statusCode := lastErrCode
-		if statusCode == 0 {
-			// Distinguish capacity exhaustion (429) from genuine unavailability (503).
-			// A quick capacity check tells us if providers exist but are full.
-			_, capRej, _ := s.registry.QuickCapacityCheck(model, estimatedPromptTokens, requestedMaxTokens, allowedProviderSerials...)
-			if capRej > 0 {
-				statusCode = http.StatusTooManyRequests
-			} else {
-				statusCode = http.StatusServiceUnavailable
-			}
-		}
-		s.emitRequest(r.Context(), protocol.SeverityError, requestID,
-			fmt.Sprintf("inference failed after %d attempt(s)", maxDispatchAttempts),
-			map[string]any{
-				"reason":      "dispatch_exhausted",
-				"attempt":     maxDispatchAttempts,
-				"status_code": statusCode,
-				"last_error":  lastErr,
-			})
-		if s.metrics != nil {
-			s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "failure"})
-		}
-		s.ddIncr("inference.dispatches", []string{"status:failure"})
-		if statusCode == http.StatusTooManyRequests || statusCode == http.StatusServiceUnavailable {
-			w.Header().Set("Retry-After", strconv.Itoa(s.estimateRetryAfter(model)))
-		}
-		if statusCode == http.StatusTooManyRequests {
-			writeJSON(w, statusCode, errorResponse("rate_limit_exceeded",
-				fmt.Sprintf("all providers at capacity after %d attempt(s): %s", maxDispatchAttempts, lastErr),
-				withCode("rate_limit_exceeded")))
-		} else {
-			writeJSON(w, statusCode, errorResponse("provider_error",
-				fmt.Sprintf("inference failed after %d attempt(s): %s", maxDispatchAttempts, lastErr)))
-		}
-		return
-	}
-	if s.metrics != nil {
-		s.metrics.IncCounter("inference_dispatches_total", MetricLabel{"result", "success"})
-	}
-	s.ddIncr("inference.dispatches", []string{"status:success"})
-
-	// Write provider attestation headers now that we're committed.
-	provider.Mu().Lock()
-	pubKey := provider.PublicKey
-	attested := provider.Attested
-	trustLevel := provider.TrustLevel
-	attestResult := provider.AttestationResult
-	mdaVerified := provider.MDAVerified
-	provider.Mu().Unlock()
-
-	providerID := provider.ID
-	chipName := provider.Hardware.ChipName
-	machineModel := provider.Hardware.MachineModel
-
-	if pubKey != "" {
-		w.Header().Set("X-Provider-Encrypted", "true")
-	}
-	if attested {
-		w.Header().Set("X-Provider-Attested", "true")
-	} else {
-		w.Header().Set("X-Provider-Attested", "false")
-	}
-	w.Header().Set("X-Provider-Trust-Level", string(trustLevel))
-	w.Header().Set("X-Provider-Id", providerID)
-	w.Header().Set("X-Provider-Chip", chipName)
-	w.Header().Set("X-Provider-Model", machineModel)
-	if attestResult != nil {
-		w.Header().Set("X-Provider-Serial", attestResult.SerialNumber)
-		if attestResult.SecureEnclaveAvailable {
-			w.Header().Set("X-Provider-Secure-Enclave", "true")
-		} else {
-			w.Header().Set("X-Provider-Secure-Enclave", "false")
-		}
-	}
-	if mdaVerified {
-		w.Header().Set("X-Provider-Mda-Verified", "true")
-	}
-	// SE public key for attestation receipt verification.
-	// Consumers can use this to verify SE signatures on response hashes.
-	if attestResult != nil && attestResult.PublicKey != "" {
-		w.Header().Set("X-Attestation-Se-Public-Key", attestResult.PublicKey)
-		w.Header().Set("X-Attestation-Device-Serial", attestResult.SerialNumber)
-	}
-
-	// Latency decomposition header for observability.
-	if timing := pr.Timing; timing != nil {
-		type timingJSON struct {
-			ParseUs    int64 `json:"parse_us"`
-			ReserveUs  int64 `json:"reserve_us"`
-			RouteUs    int64 `json:"route_us"`
-			QueueUs    int64 `json:"queue_us"`
-			EncryptUs  int64 `json:"encrypt_us"`
-			DispatchUs int64 `json:"dispatch_us"`
-			ProviderUs int64 `json:"provider_us"`
-		}
-		tj := timingJSON{}
-		if !timing.ParsedAt.IsZero() {
-			tj.ParseUs = timing.ParsedAt.Sub(timing.ReceivedAt).Microseconds()
-		}
-		if !timing.ReservedAt.IsZero() && !timing.ParsedAt.IsZero() {
-			tj.ReserveUs = timing.ReservedAt.Sub(timing.ParsedAt).Microseconds()
-		}
-		if !timing.RoutedAt.IsZero() && !timing.ReservedAt.IsZero() {
-			tj.RouteUs = timing.RoutedAt.Sub(timing.ReservedAt).Microseconds()
-		}
-		if !timing.QueuedAt.IsZero() && !timing.DispatchedAt.IsZero() {
-			tj.QueueUs = timing.DispatchedAt.Sub(timing.QueuedAt).Microseconds()
-		}
-		if !timing.EncryptedAt.IsZero() && !timing.RoutedAt.IsZero() {
-			tj.EncryptUs = timing.EncryptedAt.Sub(timing.RoutedAt).Microseconds()
-		}
-		if !timing.DispatchedAt.IsZero() && !timing.EncryptedAt.IsZero() {
-			tj.DispatchUs = timing.DispatchedAt.Sub(timing.EncryptedAt).Microseconds()
-		}
-		if !timing.FirstChunkAt.IsZero() && !timing.DispatchedAt.IsZero() {
-			tj.ProviderUs = timing.FirstChunkAt.Sub(timing.DispatchedAt).Microseconds()
-		}
-		if tjJSON, err := json.Marshal(tj); err == nil {
-			w.Header().Set("X-Timing", string(tjJSON))
-		}
-	}

-	// On return (disconnect/timeout/completion): free the slot, tell the
-	// provider to stop, and preserve billing for a mid-stream disconnect.
-	// Park BEFORE RemovePending so a racing provider terminal always finds the
-	// record in pending or the holder — never neither (which would drop it and
-	// mis-refund). GetPending is nil if a terminal already settled it (normal
-	// completion), so nothing is parked then. Both settle paths are
-	// FinalizeReservation-guarded, so the park-then-remove overlap can't double-bill.
-	defer func() {
-		if stale := provider.GetPending(requestID); stale != nil {
-			s.holdForSettlement(stale)
-		} else {
-			// A terminal already claimed the pending. In every normal path the
-			// reservation is finalized by now (completion billed it, the relay
-			// error/timeout branches refunded it) and this is a no-op. The one
-			// exception is a provider error landing in the gap between this
-			// handler abandoning its channels and this defer running: that
-			// terminal pushed into an unread ErrorCh and nobody settled — sweep
-			// it here. Post-commit only, so it can never finalize a reservation
-			// the dispatch loop still needs for a retry attempt.
-			refundPr := pr
-			saferun.Go(s.logger, "api.postTerminalSweep", func() {
-				s.refundReservedBalance(refundPr, "post_terminal_sweep:"+requestID)
-			})
-		}
-		provider.RemovePending(requestID) // then remove so SetProviderIdle frees the slot
-		s.registry.SetProviderIdle(provider.ID)
-		s.sendProviderCancel(provider, requestID)
-	}()
-
-	if stream {
-		s.handleStreamingResponseWithFirstChunk(w, r, pr, firstChunk)
-	} else {
-		s.handleNonStreamingResponseWithFirstChunk(w, r, pr, firstChunk)
-	}
+	d := &dispatchState{
+		s:                      s,
+		w:                      w,
+		r:                      r,
+		model:                  model,
+		publicModel:            publicModel,
+		rawBody:                rawBody,
+		consumerKey:            consumerKey,
+		consumerLocation:       consumerLocation,
+		reservedMicroUSD:       reservedMicroUSD,
+		estimatedPromptTokens:  estimatedPromptTokens,
+		requestedMaxTokens:     requestedMaxTokens,
+		requiresVision:         requiresVision,
+		hasTools:               hasTools,
+		isResponsesAPI:         isResponsesAPI,


🟡 [MEDIUM] 🧩 Consumer handler delegates entire dispatch logic to external state machine

💡 Suggestion: Keep core dispatch logic in the consumer handler and extract only reusable components

📊 Score: 3×4 = 12 · Category: misplaced responsibility

ethenotethan · 2026-06-12T01:16:22Z

+)
+
+// inferenceErrorKey identifies a circuit-breaker bucket. Shape is the request
+// dimension from RequestTraits.CooldownShape ("tools" / "base"), so a failure
+// that only affects one shape quarantines only that shape. A struct key (vs a
+// delimiter-joined string) cannot alias across ids that contain the delimiter.
+type inferenceErrorKey struct {
+	ProviderID string
+	ModelID    string
+	Shape      string
+}


🔵 [INFO] 🧩 Complex struct key for circuit breaker when string key might suffice

💡 Suggestion: Consider if a well-formatted string key could replace the struct key while avoiding collisions

📊 Score: 2×2 = 4 · Category: over-abstraction

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 757d90313a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T01:25:56Z

 				withCode("rate_limit_exceeded")))
 			return
 		}
+		if candidateCount == 0 && capacityRejections == 0 && modelTooLarge == 0 {


Try previous alias build before no-eligible 503

When an alias resolves to its Desired build, this new no-eligible branch now returns 503 for trait/cooldown rejections without first checking the alias Previous build. ResolveModel/ResolveModelConstrained are trait-blind for unconstrained requests, so a tool request can pick a Desired build that has base-routable providers but zero tool-eligible providers, while maybeFallbackAliasCapacity would find immediate capacity on Previous; because that helper is only called in the capacity-rejection branch above, the request fails even though the alias has a usable fallback. The same preflight pattern in handleGenericInference needs the same fallback check.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T01:25:56Z

+func (t RequestTraits) CooldownShape() string {
+	if t.HasTools {
+		return "tools"
+	}
+	return "base"


Key vision failures separately from base traffic

For image/video requests, RequiresVision is not part of RequestTraits, so CooldownShape() returns base for multimodal traffic. If a provider/model repeatedly 500s or 504s only on the VLM path (media decode, multimodal template, or vision prefill), the new breaker quarantines that provider for ordinary non-tool text too, and a clean text success also clears the vision strikes; that defeats the shape-keying this breaker relies on for deterministic request-shape failures. Please carry the vision trait into the cooldown key separately from base text.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T01:25:56Z

 				withCode("rate_limit_exceeded")))
 			return
 		}
+		if candidateCount == 0 && capacityRejections == 0 && modelTooLarge == 0 {


Keep offline-capable requests queueable

This new no-eligible branch also catches the case where the model exists but all providers are currently offline/unregistered, because QuickCapacityCheck reports 0/0/0 for that state. Before this change the request reached the existing 120s queue and could be drained by DrainQueuedRequestsForProvider when a provider reconnected or re-attested; now it fails immediately with 503 even though the queue is specifically wired to be drained on provider availability. Please distinguish transiently offline providers from trait/cooldown-ineligible providers before bypassing the queue.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T01:25:56Z

+        // template throws at request time. nil = no template found (key omitted
+        // on the wire); false = some fixture threw (the routing signal).
+        // `renderOK` never throws — the startup scan must stay crash-free.
+        let templateRenderOK = TemplateRenderCheck.renderOK(at: snapshotDir)


Bump provider version for template-render capability

This starts advertising the new template_render_ok capability, but the provider and coordinator fallback release versions still report 0.6.4, so old 0.6.4 binaries that never run this scan are indistinguishable from binaries with the new field. In rollout/update paths that compare provider versions, the fleet will not be forced onto the build that emits false for broken templates, while the coordinator continues treating nil as allowed; please bump the provider/coordinator release version with this protocol capability.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T01:25:56Z

+		// sets are matched by ownerAccountID (not expressible as serials here),
+		// so the fail-fast is skipped for them — those paths enforce their own
+		// availability and we must never wrongly block them.
+		if !policy.enabled && !policy.prefer && !s.registry.HasVisionProviderForModel(model, allowedProviderSerials...) {


Gate self-route media on owned vision support

For exclusive self-route image/video requests, this condition skips the only vision-capability preflight, but the self-route preflight it relies on only checks that an owned provider serves the model, not that the owned build advertises is_vision. If a user routes media to their own text-only Mac, dispatch rejects the provider on RequiresVision and then queues until timeout as machine_busy instead of failing immediately with the vision-specific error; the owned-provider check needs to include the same vision gate before skipping the global fail-fast.

Useful? React with 👍 / 👎.

Gajesh2007 requested a deployment to benchmarks June 11, 2026 22:56 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference June 11, 2026 22:56 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 11, 2026 22:56 View deployment

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

Gajesh2007 requested a deployment to benchmarks June 12, 2026 00:04 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference-landing June 12, 2026 00:04 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 12, 2026 00:05 View deployment

vercel Bot deployed to Preview – d-inference June 12, 2026 00:05 View deployment

Gajesh2007 added 2 commits June 11, 2026 17:07

Gajesh2007 force-pushed the feat/routing-failover branch from 5747ba0 to b9cc676 Compare June 12, 2026 00:12

Gajesh2007 requested a deployment to benchmarks June 12, 2026 00:12 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference-landing June 12, 2026 00:12 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 12, 2026 00:13 View deployment

vercel Bot deployed to Preview – d-inference June 12, 2026 00:13 View deployment

ethenotethan requested changes Jun 12, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

Gajesh2007 requested a deployment to benchmarks June 12, 2026 01:15 — with GitHub Actions Waiting

Gajesh2007 requested a review from ethenotethan June 12, 2026 01:15

vercel Bot deployed to Preview – d-inference-landing June 12, 2026 01:16 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 12, 2026 01:16 View deployment

vercel Bot deployed to Preview – d-inference June 12, 2026 01:16 View deployment

ethenotethan requested changes Jun 12, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

Gajesh2007 merged commit 9692598 into master Jun 12, 2026
83 of 86 checks passed

		if strings.Contains(chunk, `"response.created"`) \|\| strings.Contains(chunk, `"response.in_progress"`) {
		return true

		if r.inferenceErrorCooldownActiveLocked(p.ID, model, now) {
		return routingSnapshot{}, false

		@@ -1500,14 +1664,18 @@ func (s Server) handleChatCompletions(w http.ResponseWriter, r http.Request) {
		deadline := ttftDeadline(estimatedPromptTokens)

		// Channel closed — inference complete.
		s.noteInferenceSuccess(pr)

		@@ -1481,1047 +1575,49 @@ func (s Server) handleChatCompletions(w http.ResponseWriter, r http.Request) {
		//

		}

		// normalizeToolEntry rewrites one element of the "tools" array. A tool's

Conversation

Gajesh2007 commented Jun 11, 2026 • edited by blacksmith-sh Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Incident

Fixes

Protocol changes

Verification

Follow-ups (not in this PR)

Uh oh!

vercel Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Trust Boundaries Touched

Threat Analysis

New Attack Surface

SEC-* Findings Resolved

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

ethenotethan left a comment

Choose a reason for hiding this comment

Automated Code Review — Layr-Labs/d-inference#

Security — ⚠️ SKIPPED

Performance — 6 finding(s) (5 blocking)

Type_diligence — ✅ No issues found

Additive_complexity — 8 finding(s) (5 blocking)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

blacksmith-sh Bot commented Jun 12, 2026

Failure

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Gajesh2007 commented Jun 11, 2026 •

edited by blacksmith-sh Bot

Loading

vercel Bot commented Jun 11, 2026 •

edited

Loading

github-actions Bot commented Jun 11, 2026 •

edited

Loading

Addressed the review (commit `757d9031`)