Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .env.example
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ DASHBOARD_LOGS_POLL_INTERVAL_MS=5000 # 日志页自动刷新轮询间隔(
# 使用场景:
# - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
Comment on lines 94 to +97

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Stale "默认关闭" comment after default change

The comment block still reads - 默认关闭:适用于网络不稳定环境… (disabled by default: for unstable networks), but the default was just flipped to true. Operators who set up a fresh deployment will read "disabled by default" and assume false is the initial state, leading to misconfigured circuit breakers.

Suggested change
# 使用场景:
# - 默认关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用:适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=false
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
# 使用场景:
# - 关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
# - 启用(默认):适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
Prompt To Fix With AI
This is a comment left during a code review.
Path: .env.example
Line: 94-97

Comment:
**Stale "默认关闭" comment after default change**

The comment block still reads `- 默认关闭:适用于网络不稳定环境…` (disabled by default: for unstable networks), but the default was just flipped to `true`. Operators who set up a fresh deployment will read "disabled by default" and assume `false` is the initial state, leading to misconfigured circuit breakers.

```suggestion
 # 使用场景:
 # - 关闭:适用于网络不稳定环境(如使用代理),避免因临时网络抖动触发熔断器
 # - 启用(默认):适用于网络稳定环境,连续网络错误也应触发熔断保护,避免持续请求不可达的供应商
ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct circuit-breaker env comments after default flip

This change sets ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS=true, but the surrounding .env.example guidance still describes false as the default and says provider HTTP errors are counted when disabled/enabled. After this commit, provider 4xx/5xx are generally not counted (except timeout-specific handling), so operators who rely on this example can deploy with an incorrect mental model and mis-tune production circuit-breaker behavior.

Useful? React with 👍 / 👎.


# 端点级别熔断器
# 功能说明:控制是否启用端点级别的熔断器
Expand Down
2 changes: 1 addition & 1 deletion README.en.md
Original file line number Diff line number Diff line change
Expand Up @@ -344,7 +344,7 @@ See **[docs/k8s-deployment.md](docs/k8s-deployment.md)** for full options, place
| `API_KEY_AUTH_CACHE_TTL_SECONDS` | `60` | API Key auth cache TTL in seconds (default 60, max 3600). |
| `SESSION_TTL` | `300` | Session cache window (seconds) that drives vendor reuse. |
| `ENABLE_SECURE_COOKIES` | `true` | Browsers require HTTPS for Secure cookies; set to `false` when serving plain HTTP outside localhost. |
| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `false` | When `true`, network errors also trip the circuit breaker for quicker isolation. |
| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `true` | When `true`, only connectivity failures and timeouts affect provider circuit breaker state. |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Description ambiguous when variable is false

The new description "When true, only connectivity failures and timeouts affect provider circuit breaker state" implies that when false, other things also affect provider circuit state. In practice, HTTP 524 still triggers recordFailure regardless of this variable — it purely controls whether network-layer errors are additionally included.

Consider wording like: "When true, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."

Prompt To Fix With AI
This is a comment left during a code review.
Path: README.en.md
Line: 347

Comment:
**Description ambiguous when variable is `false`**

The new description "When `true`, only connectivity failures and timeouts affect provider circuit breaker state" implies that when `false`, other things also affect provider circuit state. In practice, HTTP 524 still triggers `recordFailure` regardless of this variable — it purely controls whether *network-layer* errors are additionally included.

Consider wording like: "When `true`, network-layer errors (DNS, connection refused, etc.) also trip the provider circuit breaker in addition to 524 timeouts."

How can I resolve this? If you propose a fix, please make it concise.

Comment thread
coderabbitai[bot] marked this conversation as resolved.
| `APP_PORT` | `23000` | Production port (override via container or process manager). |
| `APP_URL` | empty | Populate to expose correct `servers` entries in OpenAPI docs. |
| `API_TEST_TIMEOUT_MS` | `15000` | Timeout (ms) for provider API connectivity tests. Accepts 5000-120000 for regional tuning. |
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -357,7 +357,7 @@ cch doctor # 诊断集群与部署状态
| `API_KEY_AUTH_CACHE_TTL_SECONDS` | `60` | API Key 鉴权缓存 TTL(秒,默认 60,最大 3600)。 |
| `SESSION_TTL` | `300` | Session 缓存时间(秒),影响供应商复用策略。 |
| `ENABLE_SECURE_COOKIES` | `true` | 仅 HTTPS 场景能设置 Secure Cookie;HTTP 访问(非 localhost)需改为 `false`。 |
| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `false` | 是否将网络错误计入熔断器;开启后能更激进地阻断异常线路。 |
| `ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS` | `true` | 是否将网络错误计入熔断器;开启后只让连接失败、超时这类不可达问题影响供应商熔断。 |
Comment thread
coderabbitai[bot] marked this conversation as resolved.
| `APP_PORT` | `23000` | 生产端口,可被容器或进程管理器覆盖。 |
| `APP_URL` | 空 | 设置后 OpenAPI 文档 `servers` 将展示正确域名/端口。 |
| `API_TEST_TIMEOUT_MS` | `15000` | 供应商 API 测试超时时间(毫秒,范围 5000-120000),跨境网络可适当提高。 |
Expand Down
18 changes: 9 additions & 9 deletions src/app/v1/_lib/proxy/errors.ts
Original file line number Diff line number Diff line change
Expand Up @@ -549,8 +549,8 @@ export class ProxyError extends Error {
* 错误分类:区分供应商错误和系统错误
*/
export enum ErrorCategory {
PROVIDER_ERROR, // 供应商问题(所有 4xx/5xx HTTP 错误)→ 计入熔断器 + 直接切换
SYSTEM_ERROR, // 系统/网络问题(fetch 网络异常)→ 不计入熔断器 + 先重试1次
PROVIDER_ERROR, // 上游已响应但请求失败(HTTP 4xx/5xx、空响应等)→ 重试/切换,但不累计 provider 熔断
SYSTEM_ERROR, // 系统/网络问题(fetch 网络异常)→ 按网络错误配置决定是否累计 provider 熔断
CLIENT_ABORT, // 客户端主动中断 → 不计入熔断器 + 不重试 + 直接返回
NON_RETRYABLE_CLIENT_ERROR, // 客户端输入错误(Prompt 超限、内容过滤、PDF 限制、Thinking 格式、参数缺失/额外参数、非法请求)→ 不计入熔断器 + 不重试 + 直接返回
RESOURCE_NOT_FOUND, // 上游 404 错误 → 不计入熔断器 + 直接切换供应商
Expand Down Expand Up @@ -844,7 +844,7 @@ export function isRateLimitError(error: unknown): error is RateLimitError {
*
* 设计原则:
* 1. 结构化错误:携带供应商信息和失败原因
* 2. 计入熔断器:空响应视为供应商问题
* 2. 不累计 provider 熔断:空响应可能只影响当前请求
* 3. 触发故障切换:尝试其他供应商
*/
export class EmptyResponseError extends Error {
Expand Down Expand Up @@ -912,14 +912,14 @@ export function isEmptyResponseError(error: unknown): error is EmptyResponseErro
* → 不应重试(重试也会失败)
* → 应立即返回错误,提示用户修正输入
*
* 3. 供应商问题(ProxyError - 所有 4xx/5xx HTTP 错误)
* 3. 上游已响应但请求失败(ProxyError - 所有 4xx/5xx HTTP 错误)
* → 说明请求到达供应商并得到响应,但供应商无法正常处理
* → 应计入熔断器,连续失败时触发熔断保护
* → 应直接切换到其他供应商
* → 不累计 provider 熔断,避免部分模型/部分请求错误影响整个供应商
* → 应按重试策略尝试当前供应商或切换到其他供应商
*
* 4. 系统/网络问题(fetch 网络异常)
* → 包括:DNS 解析失败、连接被拒绝、连接超时、网络中断等
* → 不应计入供应商熔断器(不是供应商服务不可用)
* → 可按 ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS 配置累计 provider 熔断
* → 应先重试1次当前供应商(可能是临时网络抖动)
*
* 此函数会确保错误规则已加载后再进行检测
Expand Down Expand Up @@ -954,9 +954,9 @@ export async function categorizeErrorAsync(error: Error): Promise<ErrorCategory>
return ErrorCategory.PROVIDER_ERROR; // 其他 HTTP 错误都是供应商问题
}

// 优先级 3.2: 空响应错误 - 计入熔断器 + 触发故障切换
// 优先级 3.2: 空响应错误 - 触发故障切换,但不累计 provider 熔断
if (error instanceof EmptyResponseError) {
return ErrorCategory.PROVIDER_ERROR; // 空响应视为供应商问题
return ErrorCategory.PROVIDER_ERROR; // 空响应视为本次供应商请求失败
}

// 优先级 4: 其他所有错误都是系统错误
Expand Down
24 changes: 11 additions & 13 deletions src/app/v1/_lib/proxy/forwarder.ts
Original file line number Diff line number Diff line change
Expand Up @@ -1733,7 +1733,7 @@ export class ProxyForwarder {
}
} else {
logger.debug(
"ProxyForwarder: Network error not counted towards circuit breaker (disabled by default)",
"ProxyForwarder: Network error not counted towards circuit breaker (disabled by config)",
{
providerId: currentProvider.id,
providerName: currentProvider.name,
Expand Down Expand Up @@ -1816,7 +1816,7 @@ export class ProxyForwarder {
break; // ⭐ 跳出内层循环,进入供应商切换逻辑
}

// ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,计入熔断器,重试耗尽后切换)
// ⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 + 空响应错误,重试耗尽后切换)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

请移除新增注释中的 emoji 字符。

Line 1819、Line 2015 的注释含有 emoji,不符合仓库规则,建议改成纯文本注释。

As per coding guidelines "**/*.{ts,tsx,js,jsx,json,css,scss,sql,py,go,sh,bash}: Never use emoji characters in any code, comments, or string literals".

Also applies to: 2015-2016

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` at line 1819, Remove emoji characters
from the inline comments in src/app/v1/_lib/proxy/forwarder.ts — specifically
replace the comment containing "⭐ 6. 供应商错误处理(所有 4xx/5xx HTTP 错误 +
空响应错误,重试耗尽后切换)" and the comments at the nearby block around lines 2015-2016 with
equivalent plain-text comments (e.g., remove "⭐" and any other emoji), and scan
the forwarder.ts file for any other emoji in comments or string literals and
convert them to plain text to comply with the repository rule.

if (errorCategory === ErrorCategory.PROVIDER_ERROR) {
// 🆕 空响应错误特殊处理(EmptyResponseError 不是 ProxyError)
if (isEmptyResponseError(lastError)) {
Expand Down Expand Up @@ -1844,7 +1844,7 @@ export class ProxyForwarder {
attemptNumber: attemptCount,
rawCrossProviderFallbackEnabled: rawCrossProviderFallbackEnabled || undefined,
errorMessage: emptyError.message,
circuitFailureCount: health.failureCount + 1,
circuitFailureCount: health.failureCount,
circuitFailureThreshold: config.failureThreshold,
statusCode: 520, // Web Server Returned an Unknown Error
errorDetails: {
Expand Down Expand Up @@ -1877,12 +1877,7 @@ export class ProxyForwarder {
continue;
}

// 重试耗尽:计入熔断器并切换供应商
if (!session.isProbeRequest()) {
if (shouldAccountCircuitBreaker) {
await recordFailure(currentProvider.id, lastError);
}
}
// 重试耗尽:空响应说明上游可达但本次响应不可用,不累计 provider 熔断。

ProxyForwarder.markProviderFailed(session, failedProviderIds, currentProvider.id);
break; // 跳出内层循环,进入供应商切换逻辑
Expand All @@ -1892,6 +1887,7 @@ export class ProxyForwarder {
const proxyError = lastError as ProxyError;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[High] [TEST-MISSING-CRITICAL] No test verifies that 524 status codes still trigger recordFailure

Why this is a problem: This PR introduces a critical behavioral gate -- only HTTP 524 (timeout) errors now trigger provider circuit breaker recording in the PROVIDER_ERROR path. The existing tests were updated to assert recordFailure.not.toHaveBeenCalled() for non-524 errors, but no test verifies the positive case: that a 524 error does call recordFailure. This is the sole remaining path for provider errors to affect circuit breaker state. If this conditional breaks, the circuit breaker becomes silently non-functional for all provider errors.

The hedge path (~line 3782, changed from statusCode \!== 404 to statusCode === 524) has the same gap.

Per CLAUDE.md: "All new features must have unit test coverage of at least 80%"

Suggested fix -- add a test in tests/unit/proxy/ that:

// Verify 524 timeout still triggers circuit breaker recording
it("should record circuit breaker failure for 524 timeout", async () => {
  // Setup: mock provider that throws ProxyError with statusCode 524
  doForward.mockRejectedValueOnce(
    new ProxyError("provider timeout", 524, { /* ... */ })
  );

  await ProxyForwarder.send(session);

  expect(mocks.recordFailure).toHaveBeenCalledWith(
    expect.any(Number),
    expect.objectContaining({ message: expect.stringContaining("timeout") })
  );
});

const statusCode = proxyError.statusCode;
const willRetry = attemptCount < maxAttemptsPerProvider;
const shouldRecordProviderCircuitFailure = statusCode === 524;

if (
!isMcpRequest &&
Expand Down Expand Up @@ -1975,7 +1971,8 @@ export class ProxyForwarder {
attemptNumber: attemptCount,
rawCrossProviderFallbackEnabled: rawCrossProviderFallbackEnabled || undefined,
errorMessage: errorMessage,
circuitFailureCount: health.failureCount + 1, // 包含本次失败
circuitFailureCount:
health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),
Comment on lines +1974 to +1975

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

circuitFailureCount 会在未实际 recordFailure() 时被提前加 1。

Line 1974-1975 只按 statusCode===524 预加计数,但真正写入还受 probe、shouldAccountCircuitBreaker、以及是否重试耗尽控制(Line 2017-2025)。这会让决策链计数与真实熔断状态不一致。

建议修复
 const statusCode = proxyError.statusCode;
 const willRetry = attemptCount < maxAttemptsPerProvider;
 const shouldRecordProviderCircuitFailure = statusCode === 524;
+const isProbeRequest = session.isProbeRequest();
+const willRecordProviderCircuitFailure =
+  shouldRecordProviderCircuitFailure &&
+  shouldAccountCircuitBreaker &&
+  !isProbeRequest &&
+  !willRetry;
 ...
-  circuitFailureCount:
-    health.failureCount + (shouldRecordProviderCircuitFailure ? 1 : 0),
+  circuitFailureCount:
+    health.failureCount + (willRecordProviderCircuitFailure ? 1 : 0),
 ...
- if (session.isProbeRequest()) {
+ if (isProbeRequest) {
   logger.debug("ProxyForwarder: Probe request error, skipping circuit breaker", {
     providerId: currentProvider.id,
     providerName: currentProvider.name,
     messagesCount: session.getMessagesLength(),
   });
 } else {
   if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
     await recordFailure(currentProvider.id, lastError);
   }
 }

Also applies to: 2017-2025

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 1974 - 1975, The interim
increment of circuitFailureCount (using health.failureCount +
(shouldRecordProviderCircuitFailure ? 1 : 0)) is applied even when
recordFailure() may not actually run, causing mismatch with the real circuit
state; change the logic so circuitFailureCount is only incremented under the
exact same conditions used when calling recordFailure() — i.e., evaluate probe
status, shouldAccountCircuitBreaker, retry exhaustion, and
shouldRecordProviderCircuitFailure together (the same predicates used around
recordFailure()) or simply defer computing/setting circuitFailureCount until
after recordFailure() actually executes; update references in the forwarder code
where circuitFailureCount, shouldRecordProviderCircuitFailure,
shouldAccountCircuitBreaker, probe, and recordFailure() are used so the
displayed count always matches real writes.

circuitFailureThreshold: config.failureThreshold,
statusCode: statusCode,
statusCodeInferred: proxyError.upstreamError?.statusCodeInferred ?? false,
Expand Down Expand Up @@ -2015,15 +2012,16 @@ export class ProxyForwarder {
continue;
}

// ⭐ 重试耗尽:只有非探测请求才计入熔断器
// ⭐ 重试耗尽:只有连接/超时类失败才累计 provider 熔断。
// HTTP 500/502 等说明上游可达,可能只是部分请求或模型错误,不应影响整个供应商。
if (session.isProbeRequest()) {
logger.debug("ProxyForwarder: Probe request error, skipping circuit breaker", {
providerId: currentProvider.id,
providerName: currentProvider.name,
messagesCount: session.getMessagesLength(),
});
} else {
if (shouldAccountCircuitBreaker) {
if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
await recordFailure(currentProvider.id, lastError);
}
}
Expand Down Expand Up @@ -3781,7 +3779,7 @@ export class ProxyForwarder {
attempts.delete(attempt);
ProxyForwarder.markProviderFailed(session, failedProviderIds, attempt.provider.id);

if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode !== 404) {
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
await recordFailure(attempt.provider.id, error);
}
Comment on lines +3782 to 3784

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The circuit breaker logic in the streaming hedge path is inconsistent with the non-hedge path and the PR description. Connectivity failures (SYSTEM_ERROR) should also affect the provider circuit state when ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS is enabled. Additionally, the allowCircuitBreakerAccounting endpoint policy should be respected. Note: In performance-sensitive paths like this, the recordFailure I/O operation should be fire-and-forget to avoid blocking the main logic.

      const shouldAccountCircuitBreaker = session.getEndpointPolicy().allowCircuitBreakerAccounting;
      const shouldRecordProviderCircuitFailure =
        (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) ||
        (errorCategory === ErrorCategory.SYSTEM_ERROR && getEnvConfig().ENABLE_CIRCUIT_BREAKER_ON_NETWORK_ERRORS);

      if (shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker) {
        recordFailure(attempt.provider.id, error);
      }
References
  1. In performance-sensitive code paths like provider failover, non-critical I/O operations (e.g., releasing a session in Redis) should be executed as fire-and-forget tasks to avoid blocking the main logic.

Comment on lines +3782 to 3784

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Concurrent mode skips shouldAccountCircuitBreaker guard

The sequential path gates its 524-failure recording on shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker (line ~2024), but the concurrent/hedging path at this line does not check shouldAccountCircuitBreaker (endpointPolicy.allowCircuitBreakerAccounting). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.

This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only 524, the missing guard becomes more visible. Consider adding the same guard for consistency:

if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
  await recordFailure(attempt.provider.id, error);
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/v1/_lib/proxy/forwarder.ts
Line: 3782-3784

Comment:
**Concurrent mode skips `shouldAccountCircuitBreaker` guard**

The sequential path gates its 524-failure recording on `shouldRecordProviderCircuitFailure && shouldAccountCircuitBreaker` (line ~2024), but the concurrent/hedging path at this line does not check `shouldAccountCircuitBreaker` (`endpointPolicy.allowCircuitBreakerAccounting`). This means a 524 error in a hedge attempt will always record a circuit-breaker failure, even for endpoints whose policy explicitly opts out of circuit-breaker accounting.

This asymmetry pre-dates this PR, but since this PR now narrows the trigger to only `524`, the missing guard becomes more visible. Consider adding the same guard for consistency:

```typescript
if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524 && shouldAccountCircuitBreaker) {
  await recordFailure(attempt.provider.id, error);
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +3782 to 3784

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Hedge 分支缺少熔断记账开关保护。

Line 3782-3784 在 524 时直接 recordFailure();但普通分支会受 allowCircuitBreakerAccounting 与 probe 请求保护。这里会导致同一策略在 hedge 路径下被绕过。

建议修复
 // in sendStreamingWithHedge(...) near function start
+const shouldAccountCircuitBreaker =
+  ProxyForwarder.getEndpointPolicy(session).allowCircuitBreakerAccounting;
+const isProbeRequest = session.isProbeRequest();

 // ...
-if (errorCategory === ErrorCategory.PROVIDER_ERROR && statusCode === 524) {
+if (
+  errorCategory === ErrorCategory.PROVIDER_ERROR &&
+  statusCode === 524 &&
+  shouldAccountCircuitBreaker &&
+  !isProbeRequest
+) {
   await recordFailure(attempt.provider.id, error);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/v1/_lib/proxy/forwarder.ts` around lines 3782 - 3784, The hedge
branch calls recordFailure(attempt.provider.id, error) on a 524 provider error
without honoring the same circuit-breaker accounting guards used elsewhere;
update the hedge-path check to only call recordFailure when
allowCircuitBreakerAccounting is enabled and the request is not a probe (mirror
the ordinary branch's conditions), i.e. wrap the existing if (errorCategory ===
ErrorCategory.PROVIDER_ERROR && statusCode === 524) { recordFailure(...) } with
the same allowCircuitBreakerAccounting and probe-request guards (using the
existing probe flag/check and allowCircuitBreakerAccounting variable) so hedge
failures are accounted consistently.


Expand Down
95 changes: 13 additions & 82 deletions src/app/v1/_lib/proxy/response-handler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -461,7 +461,7 @@ type FinalizeDeferredStreamingResult = {
* - Forwarder 收到 Response 且识别为 SSE 时,会在 session 上挂载 DeferredStreamingFinalization 元信息。
* - ResponseHandler 在后台读取完整 SSE 内容后,调用本函数:
* - 如果内容看起来是上游错误 JSON(假 200),则:
* - 计入熔断器失败
* - 记录本次请求失败,但不累计 provider 熔断
* - 不更新 session 智能绑定(避免把会话粘到坏 provider);
* - 内部状态码改为“推断得到的 4xx/5xx”(未命中则回退 502),
* 仅影响统计与后续重试选择,不影响本次客户端响应。
Expand Down Expand Up @@ -587,30 +587,16 @@ async function finalizeDeferredStreamingFinalizationIfNeeded(

// 未自然结束:不更新 session 绑定(避免把会话粘到不稳定 provider),但要避免把它误记为 200 completed。
//
// 同时,为了让故障转移/熔断能正确工作
// 同时,为了让故障转移能正确工作
// - 客户端主动中断:不计入熔断器(这通常不是供应商问题)
// - 非客户端中断:计入 provider/endpoint 熔断失败(与 timeout 路径保持一致)
// - 非客户端中断:记录为本次请求失败,但不累计 provider 熔断;
// 已经建立上游响应流的错误可能只是部分请求/模型异常,不应影响整个供应商。
if (!streamEndedNormally) {
await clearSessionBinding();

if (!clientAborted && session.getEndpointPolicy().allowCircuitBreakerAccounting) {
try {
// 动态导入:避免 proxy 模块与熔断器模块之间潜在的循环依赖。
const { recordFailure } = await import("@/lib/circuit-breaker");
await recordFailure(meta.providerId, new Error(errorMessage ?? "STREAM_ABORTED"));
} catch (cbError) {
logger.warn("[ResponseHandler] Failed to record streaming failure in circuit breaker", {
providerId: meta.providerId,
sessionId: session.sessionId ?? null,
error: cbError,
});
}

// NOTE: Do NOT call recordEndpointFailure here. Stream aborts are key-level
// errors (auth, rate limit, bad key). The endpoint itself delivered HTTP 200
// successfully. Only forwarder-level failures (timeout, network error) and
// probe failures should penalize the endpoint circuit breaker.
}
// NOTE: Do NOT call recordFailure or recordEndpointFailure here. Stream aborts
// happen after the endpoint has already responded; forwarder-level connection
// failures remain the path that can affect provider/endpoint circuit state.

session.addProviderToChain(providerForChain, {
endpointId: meta.endpointId,
Expand Down Expand Up @@ -640,22 +626,8 @@ async function finalizeDeferredStreamingFinalizationIfNeeded(

const chainReason = effectiveStatusCode === 404 ? "resource_not_found" : "retry_failed";

// 计入熔断器:让后续请求能正确触发故障转移/熔断。
//
// 注意:404 语义在 forwarder 中属于 RESOURCE_NOT_FOUND,不计入熔断器(避免把“资源/模型不存在”当作供应商故障)。
if (effectiveStatusCode !== 404 && session.getEndpointPolicy().allowCircuitBreakerAccounting) {
try {
// 动态导入:避免 proxy 模块与熔断器模块之间潜在的循环依赖。
const { recordFailure } = await import("@/lib/circuit-breaker");
await recordFailure(meta.providerId, new Error(detected.code));
} catch (cbError) {
logger.warn("[ResponseHandler] Failed to record fake-200 error in circuit breaker", {
providerId: meta.providerId,
sessionId: session.sessionId ?? null,
error: cbError,
});
}
}
// 假 200 代表请求已到达上游并拿到响应,可能只是部分请求或模型返回错误。
// 这里保留本次失败和供应商切换信息,但不累计 provider 熔断。

// NOTE: Do NOT call recordEndpointFailure here. Fake-200 errors are key-level
// issues (invalid key, auth failure). The endpoint returned HTTP 200 successfully;
Expand Down Expand Up @@ -692,20 +664,8 @@ async function finalizeDeferredStreamingFinalizationIfNeeded(

const chainReason = effectiveStatusCode === 404 ? "resource_not_found" : "retry_failed";

// 计入熔断器:让后续请求能正确触发故障转移/熔断。
// 注意:与 forwarder 口径保持一致:404 不计入熔断器(资源不存在不是供应商故障)。
if (effectiveStatusCode !== 404 && session.getEndpointPolicy().allowCircuitBreakerAccounting) {
try {
const { recordFailure } = await import("@/lib/circuit-breaker");
await recordFailure(meta.providerId, new Error(errorMessage));
} catch (cbError) {
logger.warn("[ResponseHandler] Failed to record non-200 error in circuit breaker", {
providerId: meta.providerId,
sessionId: session.sessionId ?? null,
error: cbError,
});
}
}
// 非 200 HTTP 响应说明上游可达;例如 500/502 可能只影响当前请求。
// 这里不累计 provider 熔断,避免健康供应商因个别请求被整体排除。

// NOTE: Do NOT call recordEndpointFailure here. Non-200 HTTP errors (401, 429,
// etc.) are typically key/auth-level errors. The endpoint was reachable and
Expand Down Expand Up @@ -963,28 +923,12 @@ export class ProxyResponseHandler {
});
}

// 非200状态码处理:解析错误响应并计入熔断器
// 非200状态码处理:解析错误响应并记录到决策链。
let errorMessageForFinalize: string | undefined;
if (statusCode >= 400) {
const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
errorMessageForFinalize = detected.isError ? detected.code : `HTTP ${statusCode}`;

// 计入熔断器
if (session.getEndpointPolicy().allowCircuitBreakerAccounting) {
try {
const { recordFailure } = await import("@/lib/circuit-breaker");
await recordFailure(provider.id, new Error(errorMessageForFinalize));
} catch (cbError) {
logger.warn(
"ResponseHandler: Failed to record non-200 error in circuit breaker (passthrough)",
{
providerId: provider.id,
error: cbError,
}
);
}
}

// 记录到决策链
session.addProviderToChain(provider, {
reason: "retry_failed",
Expand Down Expand Up @@ -1358,24 +1302,11 @@ export class ProxyResponseHandler {
});
}

// 非200状态码处理:解析错误响应并计入熔断器
// 非200状态码处理:解析错误响应并记录到决策链。
if (statusCode >= 400) {
const detected = detectUpstreamErrorFromSseOrJsonText(responseText);
const errorMessageForDb = detected.isError ? detected.code : `HTTP ${statusCode}`;

// 计入熔断器
if (session.getEndpointPolicy().allowCircuitBreakerAccounting) {
try {
const { recordFailure } = await import("@/lib/circuit-breaker");
await recordFailure(provider.id, new Error(errorMessageForDb));
} catch (cbError) {
logger.warn("ResponseHandler: Failed to record non-200 error in circuit breaker", {
providerId: provider.id,
error: cbError,
});
}
}

// 记录到决策链
session.addProviderToChain(provider, {
reason: "retry_failed",
Expand Down
4 changes: 2 additions & 2 deletions src/app/v1/_lib/proxy/session.ts
Original file line number Diff line number Diff line change
Expand Up @@ -562,8 +562,8 @@ export class ProxySession {
| "concurrent_limit_failed"
| "request_success" // 修复:添加 request_success
| "retry_success"
| "retry_failed" // 供应商错误(已计入熔断器
| "system_error" // 系统/网络错误(不计入熔断器
| "retry_failed" // 供应商错误(记录本次失败,通常不累计 provider 熔断
| "system_error" // 系统/网络错误(按网络错误配置决定是否累计 provider 熔断
| "resource_not_found" // 上游 404 错误(不计入熔断器,仅切换供应商)
| "retry_with_official_instructions" // Codex instructions 自动重试(官方)
| "retry_with_cached_instructions" // Codex instructions 智能重试(缓存)
Expand Down
Loading