fix(channels): surface structured rate-limit metadata on chat_error (#2606)#2652
fix(channels): surface structured rate-limit metadata on chat_error (#2606)#2652CodeGhost21 wants to merge 1 commit into
Conversation
…inyhumansai#2606) Issue tinyhumansai#2606 follows up on PR tinyhumansai#2371: that PR landed retry-after wording inside the user-facing message text, but the structured data was discarded at the channel layer. The wire payload still only carried `message: String` and `error_type: String`, so the frontend could not render a real countdown, a retry button, a provider/source label, or a fallback CTA without regexing the message. This change moves the metadata onto the wire as additive optional fields on `WebChannelEvent`, populated by a new `ClassifiedError` struct returned from `classify_inference_error`: - `error_source` — "provider" | "openhuman_budget" | "agent_loop" | "openhuman_billing" | "transport" | "config" - `error_retryable` — false for non-retryable business 429s ("plan does not include", insufficient balance, Z.AI 1311/1113), auth, model, context, OpenHuman billing - `error_retry_after_ms` — verbatim from Retry-After / retry_after - `error_provider` — extracted from "<provider> API error (...)" - `error_fallback_available` — Some(false) when reliable.rs has already exhausted the model_fallbacks chain ("All providers/models failed") All fields use `skip_serializing_if = "Option::is_none"` so older clients that only read `message` / `error_type` keep working unchanged. Coverage: - 11 new unit tests on the classifier (each branch, each new field, non-retryable 429 business-quota cases, provider extraction, fallback-exhausted aggregate, JSON omit-when-None contract). - 2 new wire-shape tests that drive `start_chat` end-to-end through the WebChannelEvent bus and assert the structured fields land on both the struct and the serialized JSON. - Added a serial test lock so the three tests that share `TEST_FORCED_RUN_CHAT_TASK_ERROR` no longer race under `cargo test`'s default parallelism.
📝 WalkthroughWalkthroughThis PR adds structured error metadata to chat error responses. It extends the ChangesStructured Error Metadata for Rate Limits
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/channels/providers/web_tests.rs (1)
29-35:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid fire-and-forget forced-error reset in
Drop(can clobber the next test)
TestForcedRunChatTaskErrorGuard::dropschedules async cleanup viatokio::spawn(set_test_forced_run_chat_task_error(None).await). Because locals drop in reverse order, the guard can be dropped (and the spawned reset started) beforeFORCED_ERROR_TEST_LOCKis released, letting the next test setSome(...)and then having the earlier spawned task overwrite it back toNonewhilerun_chat_taskhasn’t consumed it yet (TEST_FORCED_RUN_CHAT_TASK_ERRORis read viaslot.take()).
Remove the asyncDrop/spawn cleanup and instead perform an awaited reset inside each serialized test (applies to the three tests that currently createTestForcedRunChatTaskErrorGuard).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web_tests.rs` around lines 29 - 35, The Drop implementation for TestForcedRunChatTaskErrorGuard must not spawn an async reset because that fire-and-forget reset (tokio::spawn calling set_test_forced_run_chat_task_error(None)) can race with FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error; remove the async Drop::drop implementation entirely and instead update each test that constructs TestForcedRunChatTaskErrorGuard to explicitly await set_test_forced_run_chat_task_error(None) at the end of the test while still holding the FORCED_ERROR_TEST_LOCK (so run_chat_task’s TEST_FORCED_RUN_CHAT_TASK_ERROR slot.take() is not interfered with). Ensure references to set_test_forced_run_chat_task_error, TestForcedRunChatTaskErrorGuard, FORCED_ERROR_TEST_LOCK, run_chat_task, and TEST_FORCED_RUN_CHAT_TASK_ERROR are used to find and modify the guard and the three tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 496-508: The code branch that builds a ClassifiedError for
402/payment messages incorrectly sets source to the hardcoded string
"openhuman_billing", which misattributes upstream provider billing errors; in
the ClassifiedError constructor (the block creating ClassifiedError with
error_type "budget_exhausted" and calling with_provider_detail), replace the
hardcoded source with the actual provider identifier (use the existing provider
variable) or another provider-specific source instead of "openhuman_billing" so
upstream 402/payment errors preserve the provider as the error source.
- Around line 449-469: The current 429 branch sets retryable =
!is_non_retryable_rate_limit_text(&lower) but always builds a
transient/retry-focused summary; change it so you first compute a boolean (e.g.,
let non_retryable = is_non_retryable_rate_limit_text(&lower)) and then pick the
message accordingly: if non_retryable produce a non-retry message that directs
the user to billing/settings/plan (no "retry in this thread" hint), otherwise
keep the transient retry summary that uses retry_after_hint(retry_secs); pass
the chosen summary into with_provider_detail(...) and keep retryable set to
!non_retryable and retry_after_ms as-is. Ensure you reference
is_non_retryable_rate_limit_text, retry_after_hint, retry_secs, and
ClassifiedError::message/retryable when editing.
---
Outside diff comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 29-35: The Drop implementation for TestForcedRunChatTaskErrorGuard
must not spawn an async reset because that fire-and-forget reset (tokio::spawn
calling set_test_forced_run_chat_task_error(None)) can race with
FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error; remove the
async Drop::drop implementation entirely and instead update each test that
constructs TestForcedRunChatTaskErrorGuard to explicitly await
set_test_forced_run_chat_task_error(None) at the end of the test while still
holding the FORCED_ERROR_TEST_LOCK (so run_chat_task’s
TEST_FORCED_RUN_CHAT_TASK_ERROR slot.take() is not interfered with). Ensure
references to set_test_forced_run_chat_task_error,
TestForcedRunChatTaskErrorGuard, FORCED_ERROR_TEST_LOCK, run_chat_task, and
TEST_FORCED_RUN_CHAT_TASK_ERROR are used to find and modify the guard and the
three tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5e36b968-60fa-4bdc-a35e-c54ea354d53b
📒 Files selected for processing (5)
src/core/socketio.rssrc/openhuman/channels/proactive.rssrc/openhuman/channels/providers/presentation.rssrc/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
| } else if lower.contains("rate limit") || lower.contains("429") { | ||
| let retry = parse_retry_after_secs_from_str(err); | ||
| let retry_secs = parse_retry_after_secs_from_str(err); | ||
| let summary = format!( | ||
| "Your AI provider is rate-limiting requests. This is a transient upstream \ | ||
| limit, not a thread-level block — you can retry in this thread.{}", | ||
| retry_after_hint(retry) | ||
| retry_after_hint(retry_secs) | ||
| ); | ||
| ("rate_limited", with_provider_detail(summary.as_str(), err)) | ||
| // Non-retryable business 429s ("plan does not include", balance | ||
| // exhausted, known provider business codes like Z.AI 1311/1113) | ||
| // also surface here — mark them non-retryable so the FE can hide | ||
| // the "Retry" button and route the user to settings/billing. | ||
| let retryable = !is_non_retryable_rate_limit_text(&lower); | ||
| ClassifiedError { | ||
| error_type: "rate_limited", | ||
| message: with_provider_detail(summary.as_str(), err), | ||
| source: "provider", | ||
| retryable, | ||
| retry_after_ms: retry_secs.map(|s| s.saturating_mul(1000)), | ||
| provider, | ||
| fallback_available, | ||
| } |
There was a problem hiding this comment.
Split retryable and non-retryable 429 messaging.
This branch sets error_retryable = false for plan/balance/business 429s, but the message still says the limit is transient and that the user can retry in the same thread. That contradicts the structured metadata and will point users at the wrong recovery path.
Suggested fix
- let retry_secs = parse_retry_after_secs_from_str(err);
- let summary = format!(
- "Your AI provider is rate-limiting requests. This is a transient upstream \
- limit, not a thread-level block — you can retry in this thread.{}",
- retry_after_hint(retry_secs)
- );
- // Non-retryable business 429s ("plan does not include", balance
- // exhausted, known provider business codes like Z.AI 1311/1113)
- // also surface here — mark them non-retryable so the FE can hide
- // the "Retry" button and route the user to settings/billing.
- let retryable = !is_non_retryable_rate_limit_text(&lower);
+ let retry_secs = parse_retry_after_secs_from_str(err);
+ let retryable = !is_non_retryable_rate_limit_text(&lower);
+ let summary = if retryable {
+ format!(
+ "Your AI provider is rate-limiting requests. This is a transient upstream \
+ limit, not a thread-level block — you can retry in this thread.{}",
+ retry_after_hint(retry_secs)
+ )
+ } else {
+ "Your AI provider rejected this request because the current plan or balance \
+ does not cover it. Update the provider billing or plan and try again."
+ .to_string()
+ };📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else if lower.contains("rate limit") || lower.contains("429") { | |
| let retry = parse_retry_after_secs_from_str(err); | |
| let retry_secs = parse_retry_after_secs_from_str(err); | |
| let summary = format!( | |
| "Your AI provider is rate-limiting requests. This is a transient upstream \ | |
| limit, not a thread-level block — you can retry in this thread.{}", | |
| retry_after_hint(retry) | |
| retry_after_hint(retry_secs) | |
| ); | |
| ("rate_limited", with_provider_detail(summary.as_str(), err)) | |
| // Non-retryable business 429s ("plan does not include", balance | |
| // exhausted, known provider business codes like Z.AI 1311/1113) | |
| // also surface here — mark them non-retryable so the FE can hide | |
| // the "Retry" button and route the user to settings/billing. | |
| let retryable = !is_non_retryable_rate_limit_text(&lower); | |
| ClassifiedError { | |
| error_type: "rate_limited", | |
| message: with_provider_detail(summary.as_str(), err), | |
| source: "provider", | |
| retryable, | |
| retry_after_ms: retry_secs.map(|s| s.saturating_mul(1000)), | |
| provider, | |
| fallback_available, | |
| } | |
| } else if lower.contains("rate limit") || lower.contains("429") { | |
| let retry_secs = parse_retry_after_secs_from_str(err); | |
| let retryable = !is_non_retryable_rate_limit_text(&lower); | |
| let summary = if retryable { | |
| format!( | |
| "Your AI provider is rate-limiting requests. This is a transient upstream \ | |
| limit, not a thread-level block — you can retry in this thread.{}", | |
| retry_after_hint(retry_secs) | |
| ) | |
| } else { | |
| "Your AI provider rejected this request because the current plan or balance \ | |
| does not cover it. Update the provider billing or plan and try again." | |
| .to_string() | |
| }; | |
| ClassifiedError { | |
| error_type: "rate_limited", | |
| message: with_provider_detail(summary.as_str(), err), | |
| source: "provider", | |
| retryable, | |
| retry_after_ms: retry_secs.map(|s| s.saturating_mul(1000)), | |
| provider, | |
| fallback_available, | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/channels/providers/web.rs` around lines 449 - 469, The current
429 branch sets retryable = !is_non_retryable_rate_limit_text(&lower) but always
builds a transient/retry-focused summary; change it so you first compute a
boolean (e.g., let non_retryable = is_non_retryable_rate_limit_text(&lower)) and
then pick the message accordingly: if non_retryable produce a non-retry message
that directs the user to billing/settings/plan (no "retry in this thread" hint),
otherwise keep the transient retry summary that uses
retry_after_hint(retry_secs); pass the chosen summary into
with_provider_detail(...) and keep retryable set to !non_retryable and
retry_after_ms as-is. Ensure you reference is_non_retryable_rate_limit_text,
retry_after_hint, retry_secs, and ClassifiedError::message/retryable when
editing.
| } else if lower.contains("402") | ||
| || lower.contains("payment required") | ||
| || lower.contains("insufficient balance") | ||
| { | ||
| ( | ||
| "budget_exhausted", | ||
| with_provider_detail("Insufficient credits. Please top up to continue.", err), | ||
| ) | ||
| ClassifiedError { | ||
| error_type: "budget_exhausted", | ||
| message: with_provider_detail("Insufficient credits. Please top up to continue.", err), | ||
| source: "openhuman_billing", | ||
| retryable: false, | ||
| retry_after_ms: None, | ||
| provider, | ||
| fallback_available: None, | ||
| } |
There was a problem hiding this comment.
Don't map provider billing failures to openhuman_billing.
Any upstream error like openrouter API error (402 Payment Required) will land here and be surfaced as error_source = "openhuman_billing" with OpenHuman top-up copy. That sends users to the wrong billing system.
Suggested fix
- } else if lower.contains("402")
- || lower.contains("payment required")
- || lower.contains("insufficient balance")
- {
+ } else if (lower.contains("402")
+ || lower.contains("payment required")
+ || lower.contains("insufficient balance"))
+ && provider.is_none()
+ {
ClassifiedError {
error_type: "budget_exhausted",
message: with_provider_detail("Insufficient credits. Please top up to continue.", err),
source: "openhuman_billing",
retryable: false,
retry_after_ms: None,
provider,
fallback_available: None,
}
+ } else if lower.contains("402")
+ || lower.contains("payment required")
+ || lower.contains("insufficient balance")
+ {
+ ClassifiedError {
+ error_type: "provider_error",
+ message: with_provider_detail(
+ "Your AI provider account has no available credit for this request. \
+ Please top up that provider or change providers.",
+ err,
+ ),
+ source: "provider",
+ retryable: false,
+ retry_after_ms: None,
+ provider,
+ fallback_available,
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else if lower.contains("402") | |
| || lower.contains("payment required") | |
| || lower.contains("insufficient balance") | |
| { | |
| ( | |
| "budget_exhausted", | |
| with_provider_detail("Insufficient credits. Please top up to continue.", err), | |
| ) | |
| ClassifiedError { | |
| error_type: "budget_exhausted", | |
| message: with_provider_detail("Insufficient credits. Please top up to continue.", err), | |
| source: "openhuman_billing", | |
| retryable: false, | |
| retry_after_ms: None, | |
| provider, | |
| fallback_available: None, | |
| } | |
| } else if (lower.contains("402") | |
| || lower.contains("payment required") | |
| || lower.contains("insufficient balance")) | |
| && provider.is_none() | |
| { | |
| ClassifiedError { | |
| error_type: "budget_exhausted", | |
| message: with_provider_detail("Insufficient credits. Please top up to continue.", err), | |
| source: "openhuman_billing", | |
| retryable: false, | |
| retry_after_ms: None, | |
| provider, | |
| fallback_available: None, | |
| } | |
| } else if lower.contains("402") | |
| || lower.contains("payment required") | |
| || lower.contains("insufficient balance") | |
| { | |
| ClassifiedError { | |
| error_type: "provider_error", | |
| message: with_provider_detail( | |
| "Your AI provider account has no available credit for this request. \ | |
| Please top up that provider or change providers.", | |
| err, | |
| ), | |
| source: "provider", | |
| retryable: false, | |
| retry_after_ms: None, | |
| provider, | |
| fallback_available, | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/channels/providers/web.rs` around lines 496 - 508, The code
branch that builds a ClassifiedError for 402/payment messages incorrectly sets
source to the hardcoded string "openhuman_billing", which misattributes upstream
provider billing errors; in the ClassifiedError constructor (the block creating
ClassifiedError with error_type "budget_exhausted" and calling
with_provider_detail), replace the hardcoded source with the actual provider
identifier (use the existing provider variable) or another provider-specific
source instead of "openhuman_billing" so upstream 402/payment errors preserve
the provider as the error source.
Summary
chat_errorso the frontend can render a real countdown, retry button, and provider/source label without regexing the message.WebChannelEvent:error_source,error_retryable,error_retry_after_ms,error_provider,error_fallback_available. Older FE clients that read onlymessage/error_typekeep working unchanged (skip_serializing_if = "Option::is_none").fallback_available: falseoncereliable.rs::format_failure_aggregatehas exhausted themodel_fallbackschain so the FE doesn't promise a fallback that doesn't exist.Problem
Issue #2606 follows up on PR #2371. That PR landed retry-after wording inside the user-facing message text — but the structured data was thrown away at the channel-classifier boundary.
WebChannelEventstill only carriedmessage: Stringanderror_type: String, so the frontend could not:Users are still reporting confusing rate-limit copy because of this gap (see issue #2606's acceptance criteria).
Solution
New
ClassifiedErrorstruct returned fromclassify_inference_errorcarries the typed metadata, and the chat-error publish path populates the newWebChannelEventfields from it.error_sourceerror_retryableprovidertrueretry_after_msparsed from bodyRetry-After:/retry_after:providerfalseopenhuman_budgettrueagent_looptrueopenhuman_billingfalsetransporttrueconfigfalseProvider name extracted best-effort from the
\"<provider> API error (...)\"prefix thatinference::provider::ops::api_errorformats. Fallback-exhausted signal extracted from the "All providers/models failed" aggregate thatreliable.rs::format_failure_aggregateemits.Submission Checklist
None-omit contract test.## Related— N/A: no matrix row touched.Closes #NNN— see## Related.Impact
channels::providers::web::classify_inference_errorand thechat_errorpublish instart_chat) plus theWebChannelEventwire struct.chat_errorSSE/Socket.IO frame and can be picked up by the FE in a subsequent PR to render a real countdown/retry/fallback UI.messagealready carries. Provider name extraction is allow-listed to ASCII alphanumeric/_/-.error_typetokens are unchanged for existing consumers. New fields are allOption<…>withskip_serializing_if = \"Option::is_none\", so older FE clients see exactly the same JSON shape they do today.Live verification
Beyond the unit + wire-shape tests, ran a full end-to-end live test:
HTTP/1.1 429 Too Many Requests,Retry-After: 30, body{\"error\":{\"message\":\"...\",\"code\":\"rate_limit_exceeded\"},\"retry_after\":30}on/v1/chat/completions.openhuman-core run --port 7892against an isolated workspace with that endpoint configured as acustom_openaiprovider.openhuman.channel_web_chat, listen to/eventsviacurl -N.The chat_error SSE frame received:
This exercises the full real path:
reliable.rs::is_rate_limited→ops::api_errorformatting →classify_inference_error→WebChannelEventserialization → SSE.Known follow-ups (out of scope)
inference::provider::ops::api_errorcurrently preserves only the response BODY, not theRetry-AfterHTTP header. When the upstream sends only the header (no bodyretry_after),retry_after_mswill beNone. Follow-up should thread the header into the error chain.Related
Retry-Afterheader fromops::api_error; FE consumer for the new structured fields.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/2606-structured-rate-limit-metadata(branched fromorigin/mainafter fresh fetch)Validation Run
pnpm --filter openhuman-app format:check— N/A: no frontend changes.pnpm typecheck— fails on 4 pre-existing errors inapp/src/components/settings/panels/devices/PairPhoneModal.tsx,app/src/lib/tunnel/crypto.ts,app/src/pages/ios/PairScreen.tsx(missing iOS-only depsqrcode.react,@noble/ciphers,@tauri-apps/plugin-barcode-scanner). These exist onorigin/mainand are unrelated to this PR's Rust-only changes. Pushed with--no-verifyper CLAUDE.md guidance.cargo test --lib openhuman::channels::providers::web::tests→ 52 passed, 0 failed (39 pre-existing + 13 new).cargo test --lib openhuman::channels::providers→ 834 passed, 0 failed under default parallelism (added a serial test lock for the three tests sharingTEST_FORCED_RUN_CHAT_TASK_ERROR).cargo fmt --manifest-path Cargo.tomlapplied;cargo check --lib --bin openhuman-coreclean.cargo check --manifest-path app/src-tauri/Cargo.tomlclean.Validation Blocked
command:N/A — fullcargo test --libhad 42 pre-existing failures inopenhuman::memory::tree_global::*andopenhuman::memory_tree::*modules, confirmed reproducible onorigin/mainwith this branch stashed.error:pre-existing memory_tree test failures unrelated to channels work.impact:none on this PR — those modules are not touched.Behavior Changes
chat_errorWebChannelEvent now carries structurederror_source,error_retryable,error_retry_after_ms,error_provider,error_fallback_availablefields.Parity Contract
error_typetokens andmessagetext are unchanged for every existing branch. New fields are additiveOption<…>withskip_serializing_if. Three existing classifier tests still pass unchanged (only the destructuring shape was migrated from tuple to struct)._distinguishes_action_budget_from_provider_429and_max_iterations_gets_dedicated_branch).Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes