Chat: add retry-before-timeout strategy for transient network errors #723
Description
Problem
When the user's internet connection is patchy, LLM chat requests fail immediately with no retry attempt:
```
Error: Network error: Connection failed: error sending request for url
(https://api.anthropic.com/v1/messages). Please check your internet connection.
Please check your settings and try again.
```
The user must manually resend their message, losing context of any in-progress agentic loop.
Root Cause
The LLMProvider::chat() and chat_stream() implementations in all 5 providers (Anthropic, OpenAI, OpenRouter, LiteLLM, Gemini) make a single reqwest call with no retry logic. When the request fails due to a transient network error:
- `reqwest::Error` is caught by `From<reqwest::Error> for LlmError` (conductor-gui/src-tauri/src/llm/error.rs:72-82)
- `err.is_connect()` maps to `LlmError::Network(format!("Connection failed: {}", err))`
- `error_to_string()` in `llm_commands.rs:489-493` formats the user-facing message
- The error propagates to the frontend `chat.js:871-882` catch block, which displays it and sets `SESSION_STATES.ERROR`
There is retry infrastructure in streaming.rs (StreamConfig.max_retries, StreamState.retry_count, can_retry()/record_retry()) — but this only tracks stream-level retries within an already-established SSE connection. It does not retry the initial HTTP request that establishes the connection.
Errors That Should Be Retried
These LlmError variants represent transient failures worth retrying:
| Variant | Retryable | Rationale |
|---|---|---|
| `Network(_)` | Yes | Connection blip, DNS timeout, TCP reset |
| `Timeout(_)` | Yes | Server slow but may recover |
| `RateLimitExceeded(_)` | Yes (with backoff) | Provider will accept after cooldown |
| `StreamInterrupted(_)` | Yes | Mid-stream disconnect |
| `ProviderError` (5xx) | Yes | Server-side transient error |
| `AuthenticationFailed` | No | Bad credentials won't self-heal |
| `ApiKeyNotFound` | No | Missing config |
| `InvalidRequest` | No | Bad payload |
| `ContextLengthExceeded` | No | Deterministic rejection |
Recommended Fix
1. Add is_retryable() method to LlmError (error.rs)
```rust
impl LlmError {
    pub fn is_retryable(&self) -> bool {
        // Note: ProviderError with a 5xx status (listed as retryable above)
        // would also need a match arm here that inspects the status code.
        matches!(
            self,
            LlmError::Network(_)
                | LlmError::Timeout(_)
                | LlmError::RateLimitExceeded(_)
                | LlmError::StreamInterrupted(_)
        )
    }
}
```
2. Add retry wrapper in llm_commands.rs
Insert a retry loop around provider.chat() calls with exponential backoff:
```rust
use std::time::Duration;
use log::warn; // assuming the crate already uses the `log` macros

async fn chat_with_retry(
    provider: &dyn LLMProvider,
    request: &ChatRequest, // must be Clone
    max_retries: u32,      // default: 3
) -> LlmResult<ChatResponse> {
    let delays = [1000u64, 2000, 4000]; // ms, exponential backoff
    let mut last_err = None;
    for attempt in 0..=max_retries {
        match provider.chat(request.clone()).await {
            Ok(response) => return Ok(response),
            Err(err) if err.is_retryable() && attempt < max_retries => {
                warn!("LLM request attempt {} failed (retryable): {}", attempt + 1, err);
                // Clamp to the last delay so a configured max_retries > 3
                // cannot index past the end of the table.
                let delay = delays.get(attempt as usize).copied().unwrap_or(4000);
                tokio::time::sleep(Duration::from_millis(delay)).await;
                last_err = Some(err);
            }
            Err(err) => return Err(err),
        }
    }
    // Unreachable in practice: the final attempt always returns above.
    Err(last_err.expect("at least one attempt must have run"))
}
```
3. Apply same pattern to chat_stream() initial connection
The SSE connection establishment (provider.chat_stream()) should also retry before returning an error. The existing StreamConfig.max_retries can be repurposed or a separate connection-level retry can wrap it.
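The connection-level retry can be factored into a generic helper so `chat()` and `chat_stream()` share one backoff policy. The sketch below is synchronous for brevity (the real code would be async and use `tokio::time::sleep`); the `retry_with_backoff` name and the `is_retryable` predicate parameter are illustrative, not existing code in the repo.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Generic retry-with-backoff sketch: `op` is attempted up to
/// `max_retries + 1` times; non-retryable errors abort immediately.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    max_retries: u32,
    base_delay: Duration,
) -> Result<T, E> {
    for attempt in 0..=max_retries {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if is_retryable(&e) && attempt < max_retries => {
                // Exponential backoff: base, 2*base, 4*base, ...
                sleep(base_delay * 2u32.pow(attempt));
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("the final attempt always returns");
}

fn main() {
    // Simulate an operation that fails twice transiently, then succeeds.
    let mut calls = 0;
    let result = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("transient") } else { Ok(calls) }
        },
        |_| true,
        3,
        Duration::from_millis(1),
    );
    println!("{:?} after {} calls", result, calls); // prints "Ok(3) after 3 calls"
}
```

With this shape, the existing `StreamConfig.max_retries` stays responsible for mid-stream recovery while the wrapper covers connection establishment only.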
4. Frontend: show retry status in chat UI (chat.js)
When retries are happening, the backend should emit a Tauri event (e.g., llm_retry) so the frontend can show "Retrying... (attempt 2/3)" instead of silence followed by a delayed response.
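The event payload could look like the sketch below; the `LlmRetryEvent` struct and `retry_message` helper are hypothetical names, and the actual emit call depends on the Tauri version in use (`app_handle.emit_all(...)` in Tauri v1, `app_handle.emit(...)` in v2).

```rust
/// Hypothetical payload for the proposed `llm_retry` Tauri event.
/// In the real code this would derive serde::Serialize for emitting.
#[derive(Debug)]
struct LlmRetryEvent {
    attempt: u32,     // 1-based attempt number
    max_retries: u32, // total retries allowed
    delay_ms: u64,    // backoff before the next attempt
}

/// Builds the user-facing status string the frontend would display.
fn retry_message(ev: &LlmRetryEvent) -> String {
    format!("Retrying... (attempt {}/{})", ev.attempt, ev.max_retries)
}

fn main() {
    let ev = LlmRetryEvent { attempt: 2, max_retries: 3, delay_ms: 2000 };
    println!("{}", retry_message(&ev)); // prints "Retrying... (attempt 2/3)"
}
```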
5. Make retry count configurable
Add max_retries: Option<u32> to ChatRequestPayload or a global LLM setting, defaulting to 3.
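The resolution order could be: per-request override wins, otherwise the global default applies. A minimal sketch, assuming the `effective_max_retries` helper name (the issue only proposes the `max_retries` field itself):

```rust
/// Sketch of the proposed payload field; fields other than
/// `max_retries` are elided.
struct ChatRequestPayload {
    max_retries: Option<u32>,
    // ...existing fields elided...
}

const DEFAULT_MAX_RETRIES: u32 = 3;

impl ChatRequestPayload {
    /// Per-request override wins; otherwise fall back to the default.
    fn effective_max_retries(&self) -> u32 {
        self.max_retries.unwrap_or(DEFAULT_MAX_RETRIES)
    }
}

fn main() {
    let default_req = ChatRequestPayload { max_retries: None };
    let custom_req = ChatRequestPayload { max_retries: Some(5) };
    println!(
        "{} {}",
        default_req.effective_max_retries(),
        custom_req.effective_max_retries()
    ); // prints "3 5"
}
```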
Scope
- Backend: `error.rs`, `llm_commands.rs` (retry wrapper), optionally provider-level
- Frontend: `chat.js` (retry status display), `ChatView.svelte` (retry indicator)
- Tests: Unit test for `is_retryable()`, integration test for retry loop with mock failures
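The `is_retryable()` unit test could follow the sketch below; the enum here is a minimal stand-in for the real `LlmError` (variant payloads simplified to `String`), not the actual definition in error.rs.

```rust
/// Minimal stand-in for the real LlmError; payloads simplified.
#[derive(Debug)]
enum LlmError {
    Network(String),
    Timeout(String),
    RateLimitExceeded(String),
    StreamInterrupted(String),
    AuthenticationFailed,
    InvalidRequest(String),
}

impl LlmError {
    fn is_retryable(&self) -> bool {
        matches!(
            self,
            LlmError::Network(_)
                | LlmError::Timeout(_)
                | LlmError::RateLimitExceeded(_)
                | LlmError::StreamInterrupted(_)
        )
    }
}

fn main() {
    // Transient failures retry; configuration/payload errors do not.
    assert!(LlmError::Network("reset".into()).is_retryable());
    assert!(LlmError::Timeout("slow".into()).is_retryable());
    assert!(LlmError::RateLimitExceeded("429".into()).is_retryable());
    assert!(LlmError::StreamInterrupted("eof".into()).is_retryable());
    assert!(!LlmError::AuthenticationFailed.is_retryable());
    assert!(!LlmError::InvalidRequest("bad".into()).is_retryable());
    println!("all is_retryable checks passed");
}
```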
Acceptance Criteria
- Transient network errors retry up to 3 times with exponential backoff (1s, 2s, 4s)
- Non-retryable errors (auth, invalid request) fail immediately — no wasted retries
- Rate limit errors retry with appropriate backoff (respect the `Retry-After` header if present)
- User sees retry status in chat UI ("Retrying..." indicator)
- Retry count is configurable (default: 3)
- Agentic loop resumes correctly after a successful retry mid-loop
- All existing LLM tests pass