-
-
Notifications
You must be signed in to change notification settings - Fork 359
fix(proxy): limit provider circuit breaker failures #1134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -94,7 +94,7 @@ DASHBOARD_LOGS_POLL_INTERVAL_MS=5000 # 日志页自动刷新轮询间隔( | |
| # 使用场景: | ||
| # - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器 | ||
| # - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商 | ||
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false | ||
| ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This change sets Useful? React with 👍 / 👎. |
||
|
|
||
| # 端点级别熔断器 | ||
| # 功能说明:控制是否启用端点级别的熔断器 | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -344,7 +344,7 @@ See **[docs/k8s-deployment.md](docs/k8s-deployment.md)** for full options, place | |
| | `API_KEY_AUTH_CACHE_TTL_SECONDS` | `60` | API Key auth cache TTL in seconds (default 60, max 3600). | | ||
| | `SESSION_TTL` | `300` | Session cache window (seconds) that drives vendor reuse. | | ||
| | `ENABLE_SECURE_COOKIES` | `true` | Browsers require HTTPS for Secure cookies; set to `false` when serving plain HTTP outside localhost. | | ||
| | `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `false` | When `true`, network errors also trip the circuit breaker for quicker isolation. | | ||
| | `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `true` | When `true`, only connectivity failures and timeouts affect provider circuit breaker state. | | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The new description "When Consider wording like: "When Prompt To Fix With AIThis is a comment left during a code review.
Path: README.en.md
Line: 347
Comment:
**Description ambiguous when variable is `false`**
The new description "When `true`, only connectivity failures and timeouts affect provider circuit breaker state" implies that when `false`, other things also affect provider circuit state. In practice, HTTP 524 still triggers `recordFailure` regardless of this variable — it purely controls whether *network-layer* errors are additionally included.
Consider wording like: "When `true`, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."
How can I resolve this? If you propose a fix, please make it concise.
coderabbitai[bot] marked this conversation as resolved.
|
||
| | `APP_PORT` | `23000` | Production port (override via container or process manager). | | ||
| | `APP_URL` | empty | Populate to expose correct `servers` entries in OpenAPI docs. | | ||
| | `API_TEST_TIMEOUT_MS` | `15000` | Timeout (ms) for provider API connectivity tests. Accepts 5000-120000 for regional tuning. | | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1733,7 +1733,7 @@ export class ProxyForwarder { | |
| } | ||
| } else { | ||
| logger.debug( | ||
| "ProxyForwarder: Network error not counted towards circuit breaker (disabled by default)", | ||
| "ProxyForwarder: Network error not counted towards circuit breaker (disabled by config)", | ||
| { | ||
| providerId: currentProvider.id, | ||
| providerName: currentProvider.name, | ||
|
|
@@ -1816,7 +1816,7 @@ export class ProxyForwarder { | |
| break; // ⭐ 跳出内层循环,进入供应商切换逻辑 | ||
| } | ||
|
|
||
| // ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,计入熔断器,重试耗尽后切换) | ||
| // ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,重试耗尽后切换) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 请移除新增注释中的 emoji 字符。 Line 1819、Line 2015 的注释含有 emoji,不符合仓库规则,建议改成纯文本注释。 As per coding guidelines " Also applies to: 2015-2016 🤖 Prompt for AI Agents |
||
| if (errorCategory === ErrorCategory.PROVIDER_ERROR) { | ||
| // 🆕 空响应错误特殊处理(EmptyResponseError 不是 ProxyError) | ||
| if (isEmptyResponseError(lastError)) { | ||
|
|
@@ -1844,7 +1844,7 @@ export class ProxyForwarder { | |
| attemptNumber: attemptCount, | ||
| rawCrossProviderFallbackEnabled: rawCrossProviderFallbackEnabled || undefined, | ||
| errorMessage: emptyError.message, | ||
| circuitFailureCount: health.failureCount + 1, | ||
| circuitFailureCount: health.failureCount, | ||
| circuitFailureThreshold: config.failureThreshold, | ||
| statusCode: 520, // Web Server Returned an Unknown Error | ||
| errorDetails: { | ||
|
|
@@ -1877,12 +1877,7 @@ export class ProxyForwarder { | |
| continue; | ||
| } | ||
|
|
||
| // 重试耗尽:计入熔断器并切换供应商 | ||
| if (!session.isProbeRequest()) { | ||
| if (shouldAccountCircuitBreaker) { | ||
| await recordFailure(currentProvider.id, lastError); | ||
| } | ||
| } | ||
| // 重试耗尽:空响应说明上游可达但本次响应不可用,不累计 provider 熔断。 | ||
|
|
||
| ProxyForwarder.markProviderFailed(session, failedProviderIds, currentProvider.id); | ||
| break; // 跳出内层循环,进入供应商切换逻辑 | ||
|
|
@@ -1892,6 +1887,7 @@ export class ProxyForwarder { | |
| const proxyError = lastError as ProxyError; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [High] [TEST-MISSING-CRITICAL] No test verifies that 524 status codes still trigger Why this is a problem: This PR introduces a critical behavioral gate -- only HTTP 524 (timeout) errors now trigger provider circuit breaker recording in the PROVIDER_ERROR path. The existing tests were updated to assert The hedge path (~line 3782, changed from Per CLAUDE.md: "All new features must have unit test coverage of at least 80%" Suggested fix -- add a test in // Verify 524 timeout still triggers circuit breaker recording
it("should record circuit breaker failure for 524 timeout", async () => {
// Setup: mock provider that throws ProxyError with statusCode 524
doForward.mockRejectedValueOnce(
new ProxyError("provider timeout", 524, { /* ... */ })
);
await ProxyForwarder.send(session);
expect(mocks.recordFailure).toHaveBeenCalledWith(
expect.any(Number),
expect.objectContaining({ message: expect.stringContaining("timeout") })
);
}); |
||
| const statusCode = proxyError.statusCode; | ||
| const willRetry = attemptCount < maxAttemptsPerProvider; | ||
| const shouldRecordProviderCircuitFailure = statusCode === 524; | ||
|
|
||
| if ( | ||
| !isMcpRequest && | ||
|
|
@@ -1975,7 +1971,8 @@ export class ProxyForwarder { | |
| attemptNumber: attemptCount, | ||
| rawCrossProviderFallbackEnabled: rawCrossProviderFallbackEnabled || undefined, | ||
| errorMessage: errorMessage, | ||
| circuitFailureCount: health.failureCount + 1, // 包含本次失败 | ||
| circuitFailureCount: | ||
| health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0), | ||
|
Comment on lines
+1974
to
+1975
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Line 1974-1975 只按 建议修复 const statusCode = proxyError.statusCode;
const willRetry = attemptCount < maxAttemptsPerProvider;
const shouldRecordProviderCircuitFailure = statusCode === 524;
+const isProbeRequest = session.isProbeRequest();
+const willRecordProviderCircuitFailure =
+ shouldRecordProviderCircuitFailure &&
+ shouldAccountCircuitBreaker &&
+ !isProbeRequest &&
+ !willRetry;
...
- circuitFailureCount:
- health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),
+ circuitFailureCount:
+ health.failureCount + (willRecordProviderCircuitFailure ? 1 : 0),
...
- if (session.isProbeRequest()) {
+ if (isProbeRequest) {
logger.debug("ProxyForwarder: Probe request error, skipping circuit breaker", {
providerId: currentProvider.id,
providerName: currentProvider.name,
messagesCount: session.getMessagesLength(),
});
} else {
if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
await recordFailure(currentProvider.id, lastError);
}
}Also applies to: 2017-2025 🤖 Prompt for AI Agents |
||
| circuitFailureThreshold: config.failureThreshold, | ||
| statusCode: statusCode, | ||
| statusCodeInferred: proxyError.upstreamError?.statusCodeInferred ?? false, | ||
|
|
@@ -2015,15 +2012,16 @@ export class ProxyForwarder { | |
| continue; | ||
| } | ||
|
|
||
| // ⭐ 重试耗尽:只有非探测请求才计入熔断器 | ||
| // ⭐ 重试耗尽:只有连接/超时类失败才累计 provider 熔断。 | ||
| // HTTP 500/502 等说明上游可达,可能只是部分请求或模型错误,不应影响整个供应商。 | ||
| if (session.isProbeRequest()) { | ||
| logger.debug("ProxyForwarder: Probe request error, skipping circuit breaker", { | ||
| providerId: currentProvider.id, | ||
| providerName: currentProvider.name, | ||
| messagesCount: session.getMessagesLength(), | ||
| }); | ||
| } else { | ||
| if (shouldAccountCircuitBreaker) { | ||
| if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) { | ||
| await recordFailure(currentProvider.id, lastError); | ||
| } | ||
| } | ||
|
|
@@ -3781,7 +3779,7 @@ export class ProxyForwarder { | |
| attempts.delete(attempt); | ||
| ProxyForwarder.markProviderFailed(session, failedProviderIds, attempt.provider.id); | ||
|
|
||
| if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode !== 404) { | ||
| if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) { | ||
| await recordFailure(attempt.provider.id, error); | ||
| } | ||
|
Comment on lines
+3782
to
3784
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The circuit breaker logic in the streaming hedge path is inconsistent with the non-hedge path and the PR description. Connectivity failures (SYSTEM_ERROR) should also affect the provider circuit state when ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is enabled. Additionally, the allowCircuitBreakerAccounting endpoint policy should be respected. Note: In performance-sensitive paths like this, the recordFailure I/O operation should be fire-and-forget to avoid blocking the main logic. const shouldAccountCircuitBreaker = session.getEndpointPolicy().allowCircuitBreakerAccounting;
const shouldRecordProviderCircuitFailure =
(errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) ||
(errorCategory === ErrorCategory.SYSTEM_ERROR && getEnvConfig().ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS);
if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
recordFailure(attempt.provider.id, error);
}References
Comment on lines
+3782
to
3784
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The sequential path gates its 524-failure recording on This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
await recordFailure(attempt.provider.id, error);
}Prompt To Fix With AIThis is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 3782-3784
Comment:
**Concurrent mode skips `shouldAccountCircuitBreaker` guard**
The sequential path gates its 524-failure recording on `shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker` (line ~2024), but the concurrent/hedging path at this line does not check `shouldAccountCircuitBreaker` (`endpointPolicy.allowCircuitBreakerAccounting`). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.
This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only `524`, the missing guard becomes more visible. Consider adding the same guard for consistency:
```typescript
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
await recordFailure(attempt.provider.id, error);
}
```
How can I resolve this? If you propose a fix, please make it concise.
Comment on lines
+3782
to
3784
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hedge 分支缺少熔断记账开关保护。 Line 3782-3784 在 524 时直接 建议修复 // in sendStreamingWithHedge(...) near function start
+const shouldAccountCircuitBreaker =
+ ProxyForwarder.getEndpointPolicy(session).allowCircuitBreakerAccounting;
+const isProbeRequest = session.isProbeRequest();
// ...
-if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
+if (
+ errorCategory === ErrorCategory.PROVIDER_ERROR &&
+ statusCode === 524 &&
+ shouldAccountCircuitBreaker &&
+ !isProbeRequest
+) {
await recordFailure(attempt.provider.id, error);
}🤖 Prompt for AI Agents |
||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment block still reads
- 默认关闭:适用于网络不稳定环境…(disabled by default: for unstable networks), but the default was just flipped totrue. Operators who set up a fresh deployment will read "disabled by default" and assumefalseis the initial state, leading to misconfigured circuit breakers.Prompt To Fix With AI