Improve responses stream startup recovery during temporary channel outages #3392
MisonL wants to merge 15 commits into QuantumNous:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings, or use review commands to manage reviews.
Walkthrough
Adds a Responses Bootstrap Recovery feature for `/v1/responses` streaming requests.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant Distributor as Distributor (middleware)
    participant Service as Recovery Service
    participant ChannelStore as Channel Cache/DB
    participant Relay as Relay Controller
    participant SSE as SSE Helper
    Client->>Distributor: POST /v1/responses (stream: true)
    Distributor->>Service: EnsureResponsesBootstrapRecoveryState(c)
    Service-->>Distributor: state (enabled, deadline, intervals)
    Distributor->>ChannelStore: getChannelWithBootstrapGrace(model)
    ChannelStore-->>Distributor: no channel
    loop within grace window
        Distributor->>Service: NextResponsesBootstrapWait(state)
        Service-->>Distributor: waitDuration, sendPing
        alt sendPing
            Distributor->>SSE: SetEventStreamHeaders + PingData
            SSE-->>Client: ": PING"
        end
        Distributor->>ChannelStore: try get channel again
        ChannelStore-->>Distributor: channel available? / nil
    end
    alt channel found
        Distributor->>Relay: forward request (with channel)
        Relay->>Relay: processing (billing/retries, recovery checks)
        Relay-->>Client: success stream
    else grace expired or unrecoverable
        Relay->>Service: CanContinueResponsesBootstrapRecovery(err)
        Service-->>Relay: false
        Relay->>SSE: OpenAIErrorEvent(err)
        SSE-->>Client: "event: error" + data
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (1)
setting/operation_setting/general_setting.go (1)
39-48: Consider documenting the rationale for 401/403 in retryable codes.
Including 401 and 403 in `ResponsesStreamBootstrapRetryableStatusCodes` is non-obvious since these are typically permanent auth failures. In a multi-channel context, retrying makes sense if another channel might succeed, but this deserves a brief comment explaining the reasoning.
📝 Suggested documentation
```diff
 ResponsesStreamBootstrapRetryableStatusCodes: []int{
+	// 401/403 included because in multi-channel scenarios, another channel may succeed
 	401, 403, 408, 429, 500, 502, 503, 504,
 },
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@setting/operation_setting/general_setting.go` around lines 39 - 48, add a brief explanatory comment above the ResponsesStreamBootstrapRetryableStatusCodes constant that documents why 401 and 403 are considered retryable: explain that in multi-channel delivery the error may be channel-specific (transient or credential-scoped) and retrying allows fallback to other channels or refreshed credentials, and note any assumptions (e.g., presence of per-channel auth/rotation or downstream retry/fallback logic) so future readers understand the non-obvious choice. Reference the ResponsesStreamBootstrapRetryableStatusCodes symbol when adding the comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controller/relay.go`:
- Around line 248-254: The handler currently pre-consumes quota too early
causing reserved quota to hang while in the bootstrap-recovery wait path; update
the flow around CanContinueResponsesBootstrapRecovery and
waitForResponsesBootstrapRecoveryProbe so PreConsumeBilling is not held while
sleeping: either move the PreConsumeBilling invocation so it runs immediately
before the actual dispatch attempt (after CanContinueResponsesBootstrapRecovery
returns true and any wait probe completes) or, if pre-consume must occur
earlier, explicitly call the matching release/refund routine before calling
waitForResponsesBootstrapRecoveryProbe and then reacquire/pre-consume again when
the probe completes; update references around PreConsumeBilling,
CanContinueResponsesBootstrapRecovery, waitForResponsesBootstrapRecoveryProbe
and newAPIError to ensure quota is refunded on each break/sleep path and
re-acquired only when about to dispatch.
In `@middleware/distributor_bootstrap_integration_test.go`:
- Around line 53-67: The cleanup currently only calls model.InitChannelCache()
when oldDB != nil causing stale channel cache if the original model.DB was nil;
change the t.Cleanup block to always clear or rebuild the channel cache after
restoring globals by calling model.InitChannelCache() unconditionally (remove
the oldDB != nil guard) or explicitly clear the cache via the same cache-reset
path used by InitChannelCache, ensuring model.DB, model.LOG_DB and related flags
are restored before invoking InitChannelCache to repopulate or clear cached
channels.
In `@service/responses_bootstrap_recovery.go`:
- Around line 141-159: Change the deadline boundary so both recovery checks
treat now==Deadline as expired: in CanContinueResponsesBootstrapRecovery replace
the time comparison (currently time.Now().After(state.Deadline)) with a
non-before check (e.g., !time.Now().Before(state.Deadline)) so that now ==
state.Deadline returns false; also review NextResponsesBootstrapWait to ensure
it uses the same boundary semantics with state.Deadline (use the same
!time.Now().Before logic or equivalent remaining <= 0 check consistently) so
both functions (CanContinueResponsesBootstrapRecovery and
NextResponsesBootstrapWait) agree on the cutoff.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f0a9146c-5253-4e6b-8d0c-db0ba63c4f78
📒 Files selected for processing (12)
- constant/context_key.go
- controller/relay.go
- middleware/distributor.go
- middleware/distributor_bootstrap_integration_test.go
- middleware/utils.go
- middleware/utils_test.go
- relay/helper/common.go
- relay/helper/stream_scanner.go
- relay/helper/stream_scanner_test.go
- service/responses_bootstrap_recovery.go
- service/responses_bootstrap_recovery_test.go
- setting/operation_setting/general_setting.go
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@web/src/i18n/locales/ru.json`:
- Around line 3319-3329: The Russian locale file currently contains English
values for the new "Responses 流启动恢复" settings; replace the English strings with
proper Russian translations for the keys such as "启用 Responses 流启动恢复", "仅在
/v1/responses 流式请求首包前生效,用于短时故障恢复。", "启动恢复等待窗口(秒)", "在该时间窗口内持续探测可用渠道,超时后返回真实错误。",
"渠道探测间隔(毫秒)", "每次重新探测可用渠道之间的等待时间。", "SSE 保活间隔(秒)", "在等待恢复期间发送 :
PING,帮助客户端保持连接。", "可触发启动恢复的状态码", "填写 JSON 数组,例如 [401,403,429,500,502,503,504]。",
and the long description string — update their values to correct Russian text so
the UI shows Russian for these new settings.
- Around line 3318-3329: The JSON locale has several keys (e.g. "启用 Responses
流启动恢复", "仅在 /v1/responses 流式请求首包前生效,用于短时故障恢复。", "启动恢复等待窗口(秒)", etc.) that were
accidentally placed at the root because the "translation" object was closed too
early; move these keys back inside the existing "translation" object (i.e.,
remove the stray closing brace that ends "translation" before these entries) so
all new keys remain under "translation" and restore the proper JSON structure
for t('中文key') lookups.
In `@web/src/i18n/locales/zh-CN.json`:
- Around line 2960-2971: The new i18n keys were placed at the root because the
"translation" object was closed too early; reopen/move the closing brace so the
entries like "启用 Responses 流启动恢复", "启动恢复等待窗口(秒)", "渠道探测间隔(毫秒)", "SSE 保活间隔(秒)",
"可触发启动恢复的状态码" and the descriptive sentence are nested under the "translation"
namespace again (so they are available to useTranslation() / t('...')); ensure
the JSON structure has a single top-level "translation" object containing those
keys and a valid final closing brace for the file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dae50967-78c2-493a-918a-9f292896fbe9
📒 Files selected for processing (10)
- docs/installation/responses-stream-bootstrap-recovery.md
- web/src/components/settings/OperationSetting.jsx
- web/src/i18n/locales/en.json
- web/src/i18n/locales/fr.json
- web/src/i18n/locales/ja.json
- web/src/i18n/locales/ru.json
- web/src/i18n/locales/vi.json
- web/src/i18n/locales/zh-CN.json
- web/src/i18n/locales/zh-TW.json
- web/src/pages/Setting/Operation/SettingsGeneral.jsx
✅ Files skipped from review due to trivial changes (7)
- web/src/components/settings/OperationSetting.jsx
- web/src/i18n/locales/zh-TW.json
- web/src/i18n/locales/ja.json
- web/src/i18n/locales/en.json
- web/src/i18n/locales/fr.json
- web/src/i18n/locales/vi.json
- docs/installation/responses-stream-bootstrap-recovery.md
Actionable comments posted: 2
🧹 Nitpick comments (2)
setting/operation_setting/general_setting.go (1)
35-50: Define the retryable-status default in one place.
This list now exists both here and in `service.GetResponsesBootstrapRecoveryConfig`. If one side changes and the other does not, an explicit default config and an empty/invalid config will recover on different status codes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@setting/operation_setting/general_setting.go` around lines 35 - 50, ResponsesStreamBootstrapRetryableStatusCodes is duplicated between this config and service.GetResponsesBootstrapRecoveryConfig causing drift; centralize the default by moving the slice into a single source of truth (e.g., define a package-level constant or a shared function like DefaultResponsesBootstrapRetryableStatusCodes) and have both general_setting.go (the ResponsesStreamBootstrapRetryableStatusCodes field) and service.GetResponsesBootstrapRecoveryConfig reference that single symbol so changes only occur in one place.
controller/relay.go (1)
269-284: Centralize billing cleanup instead of mirroring it here.
This helper hard-codes a controller-side list of `relayInfo` fields that must stay in sync with whatever `PreConsumeBilling` populates. The next billing field added there can silently leak stale metadata into the next probe.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controller/relay.go` around lines 269 - 284, the function releaseBootstrapRecoveryBilling currently duplicates field-by-field billing cleanup, which risks falling out of sync with PreConsumeBilling; instead add a single centralized cleanup method (either RelayInfo.ResetBillingMetadata() on the relaycommon.RelayInfo type or a helper like relaycommon.ClearBilling(relayInfo *RelayInfo)) that performs Refund(c), nils the Billing pointer and clears all billing-related fields, update releaseBootstrapRecoveryBilling to call that new method, and ensure PreConsumeBilling and any other code paths rely on the same centralized Reset/Clear function so new billing fields won't be missed.
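A sketch of what such a centralized reset could look like. The struct and field names below are assumptions for illustration, not the project's actual `relaycommon.RelayInfo`:

```go
package main

import "fmt"

// relayInfo stands in for the real RelayInfo; billingState stands in
// for whatever PreConsumeBilling attaches. With a single
// ResetBillingMetadata method, release paths cannot drift from the
// list of fields the pre-consume path populates.
type billingState struct{ reserved int }

type relayInfo struct {
	billing        *billingState
	preConsumed    int
	billingTraceID string
}

// ResetBillingMetadata is the single owner of billing-field cleanup.
func (r *relayInfo) ResetBillingMetadata() {
	r.billing = nil
	r.preConsumed = 0
	r.billingTraceID = ""
}

func main() {
	info := &relayInfo{billing: &billingState{reserved: 100}, preConsumed: 100, billingTraceID: "t-1"}
	info.ResetBillingMetadata()
	fmt.Println(info.billing == nil, info.preConsumed, info.billingTraceID == "") // true 0 true
}
```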
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controller/relay.go`:
- Around line 242-253: processChannelError is currently called unconditionally
and can trigger auto-disable for status 401 during bootstrap recovery; before
calling processChannelError in the bootstrap retry loop, detect the bootstrap
grace window via service.CanContinueResponsesBootstrapRecovery(c, newAPIError)
(or an explicit bootstrap flag derived from relayInfo) and suppress
channelError.AutoBan (or clear the AutoBan flag) when inside that grace period
so 401 does not auto-disable the channel; keep the rest of the retry logic
intact and only alter the AutoBan behavior for bootstrap recovery rather than
changing AutomaticDisableStatusCodeRanges globally.
- Around line 253-255: The bootstrap recovery probe currently treats client
context cancellation as a server error; update
waitForResponsesBootstrapRecoveryProbe to differentiate cancellation from real
errors by returning a cancellation flag (e.g., change signature to return
(error, bool) or return a sentinel ErrClientCanceled) and ensure the caller
checks c.Request.Context().Err() / errors.Is(ctx.Err(), context.Canceled) or the
returned flag; if cancelled, do not set newAPIError or produce a 503—return
early and avoid treating cancelation as failure (mirror the graceful handling in
relay/helper/stream_scanner.go), and update the call site where you currently do
newAPIError = waitErr to handle the cancelled case separately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c930f281-f254-4fda-bdac-f6f97d55c4d5
📒 Files selected for processing (5)
- controller/relay.go
- middleware/distributor_bootstrap_integration_test.go
- service/responses_bootstrap_recovery.go
- service/responses_bootstrap_recovery_test.go
- setting/operation_setting/general_setting.go
✅ Files skipped from review due to trivial changes (3)
- middleware/distributor_bootstrap_integration_test.go
- service/responses_bootstrap_recovery_test.go
- service/responses_bootstrap_recovery.go
🧹 Nitpick comments (1)
service/responses_bootstrap_recovery.go (1)
163-171: Consider explicit handling of zero status codes for network-level errors.
When `newAPIError.StatusCode` is 0 (common for network timeouts, connection refused, DNS failures), the lookup at line 169 will return `false` since 0 is not in the retryable codes map. This is likely the desired behavior for true network failures during bootstrap, but worth documenting. Per context snippet 4 (`types/error.go:90-99`), `StatusCode` defaults to 500 in `NewError()` but can be 0 for raw network errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/responses_bootstrap_recovery.go` around lines 163 - 171, the code should explicitly handle network-level errors where newAPIError.StatusCode == 0 instead of relying on the map lookup; update the function (the block using types.IsChannelError, types.IsSkipRetryError and GetResponsesBootstrapRecoveryConfig().RetryableStatusCodes) to add a clear branch "if newAPIError.StatusCode == 0 { return false }" (optionally with a short log message) before the map lookup so zero-status network failures are deterministically treated as non-retryable.
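The suggested guard as a standalone sketch; the function name and the status-code set are illustrative, not the project's real code:

```go
package main

import "fmt"

// canRetryStatus makes the zero-status case explicit: a StatusCode of 0
// means a raw network error with no HTTP response, which is
// deterministically non-retryable rather than relying on the map
// lookup happening to miss.
func canRetryStatus(status int, retryable map[int]struct{}) bool {
	if status == 0 {
		return false // network-level failure, no HTTP status to match
	}
	_, ok := retryable[status]
	return ok
}

func main() {
	codes := map[int]struct{}{429: {}, 503: {}}
	fmt.Println(canRetryStatus(0, codes))   // false
	fmt.Println(canRetryStatus(429, codes)) // true
	fmt.Println(canRetryStatus(404, codes)) // false
}
```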
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 014eb939-84c5-49ac-96d1-88975b09d90f
📒 Files selected for processing (5)
- controller/relay.go
- relay/common/relay_info.go
- relay/helper/common.go
- service/responses_bootstrap_recovery.go
- setting/operation_setting/general_setting.go
🚧 Files skipped from review as they are similar to previous changes (1)
- setting/operation_setting/general_setting.go
🧹 Nitpick comments (1)
service/responses_bootstrap_recovery.go (1)
169-173: Consider caching retryable status codes in state for consistency.
The `ProbeInterval` and `PingInterval` are stored in the state at creation time (lines 107-108), but `RetryableStatusCodes` is re-fetched from config on each call. This creates an inconsistency: intervals are frozen per-request while status codes can change mid-recovery. If dynamic updates are intentional, consider documenting this. Otherwise, store the retryable codes in the state for consistency.
♻️ Suggested fix to cache status codes in state
```diff
 // ResponsesBootstrapRecoveryState tracks a single request's bootstrap recovery window.
 type ResponsesBootstrapRecoveryState struct {
 	Enabled        bool
 	StartedAt      time.Time
 	Deadline       time.Time
 	ProbeInterval  time.Duration
 	PingInterval   time.Duration
+	RetryableStatusCodes map[int]struct{}
 	LastPingAt     time.Time
 	HeadersSent    bool
 	PayloadStarted bool
 	WaitAttempts   int
 	WaitDuration   time.Duration
 }
```
Then in `EnsureResponsesBootstrapRecoveryState`:
```diff
 state := &ResponsesBootstrapRecoveryState{
-	Enabled:       true,
-	StartedAt:     now,
-	Deadline:      now.Add(cfg.GracePeriod),
-	ProbeInterval: cfg.ProbeInterval,
-	PingInterval:  cfg.PingInterval,
+	Enabled:              true,
+	StartedAt:            now,
+	Deadline:             now.Add(cfg.GracePeriod),
+	ProbeInterval:        cfg.ProbeInterval,
+	PingInterval:         cfg.PingInterval,
+	RetryableStatusCodes: cfg.RetryableStatusCodes,
 }
```
And in `CanContinueResponsesBootstrapRecovery`:
```diff
-	_, ok = GetResponsesBootstrapRecoveryConfig().RetryableStatusCodes[newAPIError.StatusCode]
+	_, ok = state.RetryableStatusCodes[newAPIError.StatusCode]
 	return ok
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/responses_bootstrap_recovery.go` around lines 169 - 173, the code currently reads RetryableStatusCodes from GetResponsesBootstrapRecoveryConfig() on every call to CanContinueResponsesBootstrapRecovery while ProbeInterval and PingInterval are captured in state at creation, causing inconsistent behavior; update EnsureResponsesBootstrapRecoveryState to capture and store the current RetryableStatusCodes in the state struct (alongside ProbeInterval and PingInterval) and change CanContinueResponsesBootstrapRecovery to consult that cached state field instead of calling GetResponsesBootstrapRecoveryConfig() each time, keeping references to RetryableStatusCodes, EnsureResponsesBootstrapRecoveryState and CanContinueResponsesBootstrapRecovery consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ae96c3b0-a24e-4af5-beb7-a0130482ca2d
📒 Files selected for processing (3)
- relay/common/relay_info.go
- service/responses_bootstrap_recovery.go
- setting/operation_setting/general_setting.go
🚧 Files skipped from review as they are similar to previous changes (1)
- relay/common/relay_info.go
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
service/responses_bootstrap_recovery.go (1)
78-80: Tighten path matching to avoid false-positive prefixes.
Line 79 currently matches any prefix starting with `/v1/responses` (e.g. `/v1/responsesX`). Consider exact path-segment matching.
Proposed refactor
```diff
 func IsResponsesBootstrapRecoveryPath(path string) bool {
-	return strings.HasPrefix(path, "/v1/responses") &&
-		!strings.HasPrefix(path, "/v1/responses/compact")
+	if path != "/v1/responses" && !strings.HasPrefix(path, "/v1/responses/") {
+		return false
+	}
+	return !strings.HasPrefix(path, "/v1/responses/compact")
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/responses_bootstrap_recovery.go` around lines 78 - 80, the current IsResponsesBootstrapRecoveryPath function uses strings.HasPrefix(path, "/v1/responses") which incorrectly matches paths like "/v1/responsesX"; change the logic to only accept the exact segment "/v1/responses" or paths under that segment (e.g. "/v1/responses/" prefix) while still excluding the "/v1/responses/compact" subtree; update IsResponsesBootstrapRecoveryPath to check for path == "/v1/responses" OR strings.HasPrefix(path, "/v1/responses/") and then ensure it does not match "/v1/responses/compact" (or its subtree) to avoid false positives.
service/responses_bootstrap_recovery_test.go (1)
18-37: Guard global settings mutation in tests to prevent future parallel-test flakes.
This helper mutates shared global settings. If any package test uses `t.Parallel()`, these tests can interfere. Add a test mutex around override/restore.
Proposed refactor
```diff
 import (
 	"errors"
 	"net/http"
 	"net/http/httptest"
+	"sync"
 	"strings"
 	"testing"
 	"time"
 )

+var responsesBootstrapTestSettingsMu sync.Mutex
+
 func withResponsesBootstrapRecoverySetting(t *testing.T) {
 	t.Helper()
+	responsesBootstrapTestSettingsMu.Lock()
 	settings := operation_setting.GetGeneralSetting()
 	// ...
 	t.Cleanup(func() {
 		settings.ResponsesStreamBootstrapRecoveryEnabled = oldEnabled
 		settings.ResponsesStreamBootstrapGracePeriodSeconds = oldGrace
 		settings.ResponsesStreamBootstrapProbeIntervalMilliseconds = oldProbe
 		settings.ResponsesStreamBootstrapPingIntervalSeconds = oldPing
 		settings.ResponsesStreamBootstrapRetryableStatusCodes = oldCodes
+		responsesBootstrapTestSettingsMu.Unlock()
 	})
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@service/responses_bootstrap_recovery_test.go` around lines 18 - 37, withResponsesBootstrapRecoverySetting mutates global settings from operation_setting.GetGeneralSetting(), which can cause flakiness when tests run in parallel; protect the override/restore by introducing a package-level sync.Mutex (e.g., testSettingsMu) and acquire it at the start of withResponsesBootstrapRecoverySetting and release it in the t.Cleanup restore closure so the mutation and restoration of settings.ResponsesStreamBootstrap* are serialized; reference the function name withResponsesBootstrapRecoverySetting, the settings variable, and t.Cleanup when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@service/responses_bootstrap_recovery.go`:
- Around line 47-74: The code reads the shared mutable singleton via
operation_setting.GetGeneralSetting() without synchronization which risks data
races if settings ever change at runtime; replace the unsynchronized access by
adding a mutex-protected accessor in the operation_setting package (e.g., a
GetGeneralSettingCopy or GetGeneralSettingLocked) that returns an immutable copy
of the fields used by ResponsesBootstrapRecoveryConfig and use that accessor
here instead of GetGeneralSetting(); also ensure LoadFromDB (or any writer)
acquires the same mutex when mutating the singleton so readers and writers
synchronize properly (alternatively, if you prefer not to add locking, add a
clear comment near operation_setting.GetGeneralSetting/LoadFromDB and above
ResponsesBootstrapRecoveryConfig construction documenting that bootstrap
settings are intentionally immutable after startup and any future runtime
mutator must add synchronization).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 76c2fb06-3904-4745-9293-6b4c96f5b958
📒 Files selected for processing (4)
- relay/common/relay_info.go
- service/responses_bootstrap_recovery.go
- service/responses_bootstrap_recovery_test.go
- setting/operation_setting/general_setting.go
🚧 Files skipped from review as they are similar to previous changes (2)
- relay/common/relay_info.go
- setting/operation_setting/general_setting.go
@coderabbitai review
✅ Actions performed: Review triggered.
🧹 Nitpick comments (1)
setting/config/config.go (1)
42-45: Constrain callback behavior while read lock is held.
At line 45, `fn` executes under `RLock`. If callback code (directly or indirectly) calls a write path like `Register`/`LoadFromDB`, it can deadlock on lock upgrade. Consider renaming to `ReadLocked` and documenting that callback paths must not acquire `cm.mutex.Lock()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@setting/config/config.go` around lines 42 - 45, the Read method executes the user callback fn while holding cm.mutex.RLock(), which allows the callback to call write paths like Register or LoadFromDB and risk deadlock; change Read to either (A) take and return a copy of cm.configs[name] then release the RLock before invoking fn, or (B) rename the method to ReadLocked and update its doc comment to explicitly state that callers must not call cm.mutex.Lock() or any write paths (e.g., Register, LoadFromDB) from inside fn; apply the chosen approach to the method named Read on ConfigManager and update any call sites or documentation accordingly.
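Option (A) from the comment, sketched with an illustrative value-type config rather than the project's real `ConfigManager`: the snapshot is copied under `RLock` and the callback runs lock-free, so a callback that reaches a write path cannot deadlock on lock upgrade.

```go
package main

import (
	"fmt"
	"sync"
)

// config is an illustrative value type; copying it under the lock
// yields an independent snapshot.
type config struct{ GracePeriodSeconds int }

type manager struct {
	mu      sync.RWMutex
	configs map[string]config
}

// Read copies the named config under RLock, releases the lock, then
// invokes fn with the snapshot, so fn may safely acquire mu.Lock().
func (m *manager) Read(name string, fn func(config)) {
	m.mu.RLock()
	snapshot := m.configs[name] // value copy taken under the lock
	m.mu.RUnlock()
	if fn != nil {
		fn(snapshot)
	}
}

func main() {
	m := &manager{configs: map[string]config{"general": {GracePeriodSeconds: 10}}}
	m.Read("general", func(c config) {
		m.mu.Lock() // would deadlock if Read still held the RLock
		m.mu.Unlock()
		fmt.Println(c.GracePeriodSeconds) // 10
	})
}
```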
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 054473d5-a9d0-4439-94b1-2a5a070af6b0
📒 Files selected for processing (4)
- service/responses_bootstrap_recovery.go
- service/responses_bootstrap_recovery_test.go
- setting/config/config.go
- setting/operation_setting/general_setting.go
🚧 Files skipped from review as they are similar to previous changes (1)
- setting/operation_setting/general_setting.go
Per the latest reorganization, this round no longer keeps a separate …
@coderabbitai review
✅ Actions performed: Review triggered.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@setting/config/config.go`:
- Around line 41-47: The Read method currently releases the read lock before
calling fn(config), exposing a live pointer in cm.configs to concurrent writers
(e.g., LoadFromDB -> updateConfigFromMap) and also panics if fn is nil; fix it
by taking a snapshot/copy of the config under lock and validating fn != nil
while still holding the read lock, then release the lock and invoke fn with the
snapshot. Specifically, in ConfigManager.Read: acquire cm.mutex.RLock(),
retrieve cfg := cm.configs[name], create a safe deep/shallow copy (depending on
config mutability) of cfg while holding the lock, release the lock, then if fn
!= nil call fn(snapshot); reference ConfigManager.Read, cm.configs, LoadFromDB
and updateConfigFromMap to ensure the copy prevents races.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ebfde3fd-0c1e-4ec9-8a51-19340236e847
📒 Files selected for processing (1)
setting/config/config.go
Additional note: on top of the existing global switch for Responses bootstrap recovery, this change adds a layer of channel-level opt-in control. What's new:
Verified locally:
This avoids mistakenly applying the recovery window to all channels, and makes the permission and behavior boundaries clearer.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model/channel_satisfy.go`:
- Around line 59-70: The loop over channelsIDM should skip disabled channels
before checking other settings; add an enabled-status check at the top of the
loop (e.g. if !channel.IsEnabled() { continue } or the equivalent field/getter
on the Channel type) before using
channel.GetOtherSettings().ResponsesStreamBootstrapRecoveryEnabled so that
channelMatchesAnyGroup and channelSupportsModel are only evaluated for enabled
channels.
- Around line 92-109: The DB fallback function
hasResponsesBootstrapRecoveryEnabledChannelDB currently loads all channels via
DB.Find(&channels) without filtering by status; update it to only consider
enabled channels by adding a WHERE clause to the DB query (e.g., filter on
channel.Status == <enabled value>) or, if you prefer not to change the query,
check channel.Status inside the for loop before evaluating GetOtherSettings and
other checks. Modify the DB.Find call for the channels slice or add a guard
referencing Channel.Status in hasResponsesBootstrapRecoveryEnabledChannelDB so
only enabled channels are considered when evaluating
channel.GetOtherSettings().ResponsesStreamBootstrapRecoveryEnabled and
channelSupportsModel.
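The enabled-status guard both comments ask for can be sketched as follows; Channel, channelStatusEnabled, and hasRecoveryEnabledChannel are simplified hypothetical stand-ins for the real model types, not the actual code in model/channel_satisfy.go:

```go
package main

import "fmt"

// channelStatusEnabled is an assumed stand-in for the project's real
// enabled-status constant.
const channelStatusEnabled = 1

// Channel keeps only the fields needed to illustrate the guard.
type Channel struct {
	Status        int
	RecoveryOptIn bool
}

func (c *Channel) IsEnabled() bool { return c.Status == channelStatusEnabled }

// hasRecoveryEnabledChannel skips disabled channels before evaluating any
// other per-channel settings, mirroring the fix suggested for both the
// cache loop and the DB fallback.
func hasRecoveryEnabledChannel(channels []*Channel) bool {
	for _, ch := range channels {
		if !ch.IsEnabled() {
			continue // a disabled channel must never satisfy recovery
		}
		if ch.RecoveryOptIn {
			return true
		}
	}
	return false
}

func main() {
	channels := []*Channel{
		{Status: 2, RecoveryOptIn: true},                     // disabled but opted in
		{Status: channelStatusEnabled, RecoveryOptIn: false}, // enabled, not opted in
	}
	fmt.Println(hasRecoveryEnabledChannel(channels)) // prints false
}
```

Checking status first also keeps the loop cheap: the settings JSON only has to be parsed for channels that could actually serve traffic.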
In `@setting/config/config.go`:
- Around line 52-62: Replace direct encoding/json usage with the common JSON
wrappers: use common.Marshal instead of json.Marshal when creating raw, and use
common.Unmarshal (or a new common.UnmarshalWithNumber if numeric precision via
decoder.UseNumber is required) to decode into target; update the block where
raw, err := json.Marshal(config) and decoder.Decode(target) are used
(referencing variables raw, target, valueType, config and the decoder.UseNumber
behavior) to call common.Marshal and common.Unmarshal/UnmarshalWithNumber and
return config on error as before.
In `@web/src/components/table/channels/modals/EditChannelModal.jsx`:
- Around line 2968-2984: The switch for
responses_stream_bootstrap_recovery_enabled is currently rendered only when
inputs.type === 3 (Azure-only); move it out of that Azure-only branch and render
it for any channel that supports /v1/responses bootstrap recovery (e.g., include
constant.ChannelTypeOpenAI and any other types that support it) so OpenAI
channels can opt in too; update the render condition around the Form.Switch (and
keep using the same field name responses_stream_bootstrap_recovery_enabled and
handler handleChannelOtherSettingsChange) to check for a capability (e.g.,
inputs.type === constant.ChannelTypeOpenAI || inputs.type === <other supported
types> or a new helper like supportsResponsesBootstrap(inputs.type)) instead of
the Azure-only check.
In `@web/src/i18n/locales/ja.json`:
- Around line 3285-3298: The new Japanese translation entries (e.g., "启用
Responses 流启动恢复", "仅在 /v1/responses 流式请求首包前生效,用于短时故障恢复。", "启动恢复等待窗口(秒)", etc.)
were added at the JSON root instead of inside the existing "translation" object;
move all these keys and their values into the existing "translation" object so
i18next can resolve them, ensuring you merge without duplicating existing keys
and keep the file as valid JSON.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4c475562-37ef-4c64-8398-09c7176d67a2
📒 Files selected for processing (16)
- dto/channel_settings.go
- middleware/distributor.go
- middleware/distributor_bootstrap_integration_test.go
- model/channel_satisfy.go
- service/responses_bootstrap_recovery.go
- service/responses_bootstrap_recovery_test.go
- setting/config/config.go
- web/src/components/table/channels/modals/EditChannelModal.jsx
- web/src/i18n/locales/en.json
- web/src/i18n/locales/fr.json
- web/src/i18n/locales/ja.json
- web/src/i18n/locales/ru.json
- web/src/i18n/locales/vi.json
- web/src/i18n/locales/zh-CN.json
- web/src/i18n/locales/zh-TW.json
- web/src/pages/Setting/Operation/SettingsGeneral.jsx
✅ Files skipped from review due to trivial changes (3)
- web/src/i18n/locales/zh-CN.json
- web/src/i18n/locales/fr.json
- web/src/i18n/locales/vi.json
🚧 Files skipped from review as they are similar to previous changes (4)
- middleware/distributor.go
- web/src/i18n/locales/zh-TW.json
- web/src/i18n/locales/en.json
- web/src/i18n/locales/ru.json
@coderabbitai review
Addressed the latest CodeRabbit findings in 7da17cc. Changes included:
Validation:
✅ Actions performed: Review triggered.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
web/src/components/table/channels/modals/EditChannelModal.jsx (1)
947-963: ⚠️ Potential issue | 🟡 Minor
Missing default for responses_stream_bootstrap_recovery_enabled in legacy data handling.
For consistency with other boolean settings, add the default value for responses_stream_bootstrap_recovery_enabled in the else block that handles legacy channels without settings. Without this, the value will be undefined for such channels instead of false.
🛠️ Proposed fix

```diff
 } else {
   // Backward compatibility: legacy channels without settings default to json display
   data.vertex_key_type = 'json';
   data.aws_key_type = 'ak_sk';
   data.is_enterprise_account = false;
   data.allow_service_tier = false;
   data.disable_store = false;
   data.allow_safety_identifier = false;
   data.allow_include_obfuscation = false;
   data.allow_inference_geo = false;
   data.claude_beta_query = false;
+  data.responses_stream_bootstrap_recovery_enabled = false;
   data.upstream_model_update_check_enabled = false;
   data.upstream_model_update_auto_sync_enabled = false;
   data.upstream_model_update_last_check_time = 0;
   data.upstream_model_update_last_detected_models = [];
   data.upstream_model_update_ignored_models = '';
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@web/src/components/table/channels/modals/EditChannelModal.jsx` around lines 947 - 963, In the legacy-channel fallback else block where the code sets defaults on the data object (e.g., vertex_key_type, aws_key_type, etc.) ensure you also set data.responses_stream_bootstrap_recovery_enabled = false so legacy channels without settings get a boolean false instead of undefined; update the else branch in EditChannelModal.jsx where the data object defaults are applied to include this property.
♻️ Duplicate comments (1)
setting/config/config.go (1)
40-61: ⚠️ Potential issue | 🟠 Major
Avoid fallback to live config pointer when snapshot cloning fails.
On Line 52 and Line 57, clone failures return the original config; then Line 71 may execute fn with a live mutable pointer after unlock, reintroducing the concurrency hazard this change is meant to prevent.
🔧 Suggested fix

```diff
-func cloneConfigValue(config interface{}) interface{} {
+func cloneConfigValue(config interface{}) (interface{}, error) {
 	if config == nil {
-		return nil
+		return nil, nil
 	}
 	valueType := reflect.TypeOf(config)
 	if valueType.Kind() != reflect.Ptr || valueType.Elem().Kind() != reflect.Struct {
-		return config
+		return config, nil
 	}
 	raw, err := common.Marshal(config)
 	if err != nil {
-		return config
+		return nil, err
 	}
 	target := reflect.New(valueType.Elem()).Interface()
 	if err := common.UnmarshalWithNumber(raw, target); err != nil {
-		return config
+		return nil, err
 	}
-	return target
+	return target, nil
 }

 // Read executes fn with a snapshot of the named config after releasing the manager read lock.
 func (cm *ConfigManager) Read(name string, fn func(config interface{})) {
+	if fn == nil {
+		return
+	}
 	cm.mutex.RLock()
-	config := cloneConfigValue(cm.configs[name])
+	config, err := cloneConfigValue(cm.configs[name])
 	cm.mutex.RUnlock()
-	if fn == nil {
+	if err != nil {
+		common.SysError("failed to clone config " + name + ": " + err.Error())
 		return
 	}
 	fn(config)
 }
```

Also applies to: 64-71
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@setting/config/config.go` around lines 40 - 61, cloneConfigValue currently returns the original live config pointer when JSON marshal/unmarshal fails (calls to common.Marshal and common.UnmarshalWithNumber), which re-exposes the mutable pointer; change the function so that on any error path (nil input aside) it returns nil instead of returning config to avoid handing out the live pointer — ensure the code paths after common.Marshal and common.UnmarshalWithNumber return nil on error and that the successful path still returns the newly created target (reflect.New(...).Interface()).
🧹 Nitpick comments (1)
model/channel_satisfy_test.go (1)
10-38: Consider extracting shared fixture setup/teardown for globals.
Both tests repeat the same global-state save/restore pattern. A small helper would reduce drift risk and keep future cases cleaner.
♻️ Optional refactor sketch

```diff
+func withResponsesBootstrapCacheFixture(
+	t *testing.T,
+	seed map[int]*Channel,
+	run func(),
+) {
+	originalMemoryCacheEnabled := common.MemoryCacheEnabled
+	originalChannelsIDM := channelsIDM
+
+	common.MemoryCacheEnabled = true
+	channelSyncLock.Lock()
+	channelsIDM = seed
+	channelSyncLock.Unlock()
+
+	t.Cleanup(func() {
+		common.MemoryCacheEnabled = originalMemoryCacheEnabled
+		channelSyncLock.Lock()
+		channelsIDM = originalChannelsIDM
+		channelSyncLock.Unlock()
+	})
+
+	run()
+}
+
 func TestHasResponsesBootstrapRecoveryEnabledChannelSkipsDisabledChannels(t *testing.T) {
-	originalMemoryCacheEnabled := common.MemoryCacheEnabled
-	originalChannelsIDM := channelsIDM
-	...
-	if HasResponsesBootstrapRecoveryEnabledChannel([]string{"default"}, "gpt-5") {
-		t.Fatal("disabled opted-in channel should not satisfy responses bootstrap recovery")
-	}
+	withResponsesBootstrapCacheFixture(t, map[int]*Channel{
+		1: {Id: 1, Status: common.ChannelStatusManuallyDisabled, Group: "default", Models: "gpt-5", OtherSettings: `{"responses_stream_bootstrap_recovery_enabled":true}`},
+		2: {Id: 2, Status: common.ChannelStatusEnabled, Group: "default", Models: "gpt-5", OtherSettings: `{"responses_stream_bootstrap_recovery_enabled":false}`},
+	}, func() {
+		if HasResponsesBootstrapRecoveryEnabledChannel([]string{"default"}, "gpt-5") {
+			t.Fatal("disabled opted-in channel should not satisfy responses bootstrap recovery")
+		}
+	})
 }
```

Also applies to: 46-67
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model/channel_satisfy_test.go` around lines 10 - 38, Extract the repeated save/modify/restore logic into a small helper (e.g., withTempChannels or setupTestChannels) that accepts t *testing.T and the temporary channels map; inside the helper capture original values of common.MemoryCacheEnabled and channelsIDM, set common.MemoryCacheEnabled = true, lock channelSyncLock and assign channelsIDM to the provided map, unlock, then register a t.Cleanup closure that restores common.MemoryCacheEnabled and (under channelSyncLock) resets channelsIDM to the captured original; update the tests to call this helper instead of duplicating the setup/teardown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@web/src/components/table/channels/modals/EditChannelModal.jsx`:
- Around line 947-963: In the legacy-channel fallback else block where the code
sets defaults on the data object (e.g., vertex_key_type, aws_key_type, etc.)
ensure you also set data.responses_stream_bootstrap_recovery_enabled = false so
legacy channels without settings get a boolean false instead of undefined;
update the else branch in EditChannelModal.jsx where the data object defaults
are applied to include this property.
---
Duplicate comments:
In `@setting/config/config.go`:
- Around line 40-61: cloneConfigValue currently returns the original live config
pointer when JSON marshal/unmarshal fails (calls to common.Marshal and
common.UnmarshalWithNumber), which re-exposes the mutable pointer; change the
function so that on any error path (nil input aside) it returns nil instead of
returning config to avoid handing out the live pointer — ensure the code paths
after common.Marshal and common.UnmarshalWithNumber return nil on error and that
the successful path still returns the newly created target
(reflect.New(...).Interface()).
---
Nitpick comments:
In `@model/channel_satisfy_test.go`:
- Around line 10-38: Extract the repeated save/modify/restore logic into a small
helper (e.g., withTempChannels or setupTestChannels) that accepts t *testing.T
and the temporary channels map; inside the helper capture original values of
common.MemoryCacheEnabled and channelsIDM, set common.MemoryCacheEnabled = true,
lock channelSyncLock and assign channelsIDM to the provided map, unlock, then
register a t.Cleanup closure that restores common.MemoryCacheEnabled and (under
channelSyncLock) resets channelsIDM to the captured original; update the tests
to call this helper instead of duplicating the setup/teardown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 74b3a192-22e9-4009-8d70-a5590f8bc152
📒 Files selected for processing (8)
- common/json.go
- model/channel_satisfy.go
- model/channel_satisfy_test.go
- setting/config/config.go
- setting/config/config_test.go
- web/src/components/table/channels/modals/EditChannelModal.jsx
- web/src/constants/channel.constants.js
- web/src/i18n/locales/ja.json
✅ Files skipped from review due to trivial changes (1)
- web/src/i18n/locales/ja.json
🚧 Files skipped from review as they are similar to previous changes (1)
- model/channel_satisfy.go
Follow-up fixes are in e8437fa. Handled:
Validation:
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@web/src/components/table/channels/modals/EditChannelModal.jsx`:
- Around line 1767-1768: The code unconditionally persists
settings.responses_stream_bootstrap_recovery_enabled from localInputs, which can
leave a stale opt-in for channels that don't support it; update the assignment
around settings.responses_stream_bootstrap_recovery_enabled to only set it from
localInputs.responses_stream_bootstrap_recovery_enabled when
supportsResponsesBootstrapRecovery(localInputs.type) returns true, and otherwise
set it to false (or delete it) so unsupported channel types cannot retain the
flag; locate the change near the existing assignment in EditChannelModal.jsx and
reference the supportsResponsesBootstrapRecovery helper to gate persistence.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9ab8047b-56cb-491e-9a8c-ec47efdf3cdd
📒 Files selected for processing (4)
- model/channel_satisfy_test.go
- setting/config/config.go
- setting/config/config_test.go
- web/src/components/table/channels/modals/EditChannelModal.jsx
✅ Files skipped from review due to trivial changes (1)
- model/channel_satisfy_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- setting/config/config.go
- setting/config/config_test.go
CodeRabbit follow-up fixes: the remaining review comments from this round have been addressed:
Verified:
commit:
2a13c2d to 6c305a4
📝 Description
Add a bootstrap recovery window before the first byte of /v1/responses streaming requests. Failures after the stream has started are delivered as event: error rather than switching back to JSON mid-stream. Integration tests with httptest + gin + sqlite cover the real request path.
🚀 Type of change
🔗 Related Issue
✅ Pre-submission Checklist
📸 Proof of Work
go test ./...
go vet ./service ./middleware ./controller ./relay/helper
go test ./middleware -run 'TestDistributeResponsesBootstrap' -count=1 -v