
Chat: add retry-before-timeout strategy for transient network errors #723

@amiable-dev

Description


Problem

When the user's internet connection is patchy, LLM chat requests fail immediately with no retry attempt:

Error: Network error: Connection failed: error sending request for url
(https://api.anthropic.com/v1/messages). Please check your internet connection.

Please check your settings and try again.

The user must manually resend their message, losing context of any in-progress agentic loop.

Root Cause

The LLMProvider::chat() and chat_stream() implementations in all 5 providers (Anthropic, OpenAI, OpenRouter, LiteLLM, Gemini) make a single reqwest call with no retry logic. When the request fails due to a transient network error:

  1. reqwest::Error is caught by From<reqwest::Error> for LlmError (conductor-gui/src-tauri/src/llm/error.rs:72-82)
  2. err.is_connect() maps to LlmError::Network(format!("Connection failed: {}", err))
  3. error_to_string() in llm_commands.rs:489-493 formats the user-facing message
  4. The error propagates to the frontend chat.js:871-882 catch block, which displays it and sets SESSION_STATES.ERROR

There is retry infrastructure in streaming.rs (StreamConfig.max_retries, StreamState.retry_count, can_retry()/record_retry()) — but this only tracks stream-level retries within an already-established SSE connection. It does not retry the initial HTTP request that establishes the connection.

Errors That Should Be Retried

These LlmError variants represent transient failures worth retrying:

| Variant | Retryable | Rationale |
|---|---|---|
| `Network(_)` | Yes | Connection blip, DNS timeout, TCP reset |
| `Timeout(_)` | Yes | Server slow but may recover |
| `RateLimitExceeded(_)` | Yes (with backoff) | Provider will accept after cooldown |
| `StreamInterrupted(_)` | Yes | Mid-stream disconnect |
| `ProviderError` (5xx) | Yes | Server-side transient error |
| `AuthenticationFailed` | No | Bad credentials won't self-heal |
| `ApiKeyNotFound` | No | Missing config |
| `InvalidRequest` | No | Bad payload |
| `ContextLengthExceeded` | No | Deterministic rejection |
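The classification above can be checked end to end with a stand-in enum. Note the 5xx guard on `ProviderError`, which a plain `matches!` cannot express without access to the variant's status code. The payload shapes here are assumptions; the real `LlmError` in error.rs carries `String` payloads and more variants:

```rust
// Simplified stand-in for LlmError, mirroring the retryability table.
#[derive(Debug)]
enum MockLlmError {
    Network,
    Timeout,
    RateLimitExceeded,
    StreamInterrupted,
    ProviderError { status: u16 }, // assumed shape; real variant may differ
    AuthenticationFailed,
    ApiKeyNotFound,
    InvalidRequest,
    ContextLengthExceeded,
}

impl MockLlmError {
    /// Transient failures worth retrying; everything else fails fast.
    fn is_retryable(&self) -> bool {
        match self {
            MockLlmError::Network
            | MockLlmError::Timeout
            | MockLlmError::RateLimitExceeded
            | MockLlmError::StreamInterrupted => true,
            // Only server-side (5xx) provider errors are transient.
            MockLlmError::ProviderError { status } => (500..600).contains(status),
            _ => false,
        }
    }
}
```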

Recommended Fix

1. Add is_retryable() method to LlmError (error.rs)

```rust
impl LlmError {
    pub fn is_retryable(&self) -> bool {
        matches!(self,
            LlmError::Network(_)
            | LlmError::Timeout(_)
            | LlmError::RateLimitExceeded(_)
            | LlmError::StreamInterrupted(_)
        )
        // TODO: ProviderError should also return true when the upstream
        // status is 5xx (see the table above); add that arm once the
        // variant's payload exposes the status code.
    }
}
```

2. Add retry wrapper in llm_commands.rs

Insert a retry loop around provider.chat() calls with exponential backoff:

```rust
use std::time::Duration;
use log::warn; // or tracing::warn — whichever the crate already uses

async fn chat_with_retry(
    provider: &dyn LLMProvider,
    request: &ChatRequest,  // must be Clone
    max_retries: u32,       // default: 3
) -> LlmResult<ChatResponse> {
    let mut last_err = None;

    for attempt in 0..=max_retries {
        match provider.chat(request.clone()).await {
            Ok(response) => return Ok(response),
            Err(err) if err.is_retryable() && attempt < max_retries => {
                warn!("LLM request attempt {} failed (retryable): {}", attempt + 1, err);
                // Exponential backoff: 1s, 2s, 4s, ... capped at 30s so a
                // configurable max_retries can never index past a fixed table.
                let delay_ms = (1000u64 << attempt.min(5)).min(30_000);
                tokio::time::sleep(Duration::from_millis(delay_ms)).await;
                last_err = Some(err);
            }
            Err(err) => return Err(err),
        }
    }

    // Unreachable in practice: the final attempt returns Ok or Err above.
    Err(last_err.expect("retry loop exited without an error"))
}
```
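The integration test mentioned under Scope can exercise this loop without a live provider. A synchronous stand-in (`std::thread::sleep` in place of `tokio::time::sleep`, a closure in place of the provider; all names here are hypothetical) shows the behavior that test should assert:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Generic retry-with-backoff over any fallible operation.
/// `is_retryable` plays the role of LlmError::is_retryable.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    max_retries: u32,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if is_retryable(&e) && attempt < max_retries => {
                // Exponential backoff (milliseconds shortened so the
                // example runs fast; the real loop uses 1s, 2s, 4s).
                sleep(Duration::from_millis(1 << attempt));
                attempt += 1;
            }
            // Non-retryable, or retries exhausted: fail immediately.
            Err(e) => return Err(e),
        }
    }
}
```

A mock that fails twice with a transient error and then succeeds should be called exactly three times; a non-retryable error should surface on the first call.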

3. Apply same pattern to chat_stream() initial connection

The SSE connection establishment (provider.chat_stream()) should also retry before returning an error. The existing StreamConfig.max_retries can be repurposed or a separate connection-level retry can wrap it.

4. Frontend: show retry status in chat UI (chat.js)

When retries are happening, the backend should emit a Tauri event (e.g., llm_retry) so the frontend can show "Retrying... (attempt 2/3)" instead of silence followed by a delayed response.
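The exact event shape is open; one possible payload is sketched below. The struct and field names are hypothetical, and the actual emission would go through the Tauri app handle's event API rather than the formatter shown here:

```rust
/// Hypothetical payload for an `llm_retry` Tauri event.
struct LlmRetryEvent {
    attempt: u32,     // 1-based attempt that just failed
    max_retries: u32, // total attempts allowed
    delay_ms: u64,    // backoff before the next attempt
}

impl LlmRetryEvent {
    /// Status line the frontend could render verbatim.
    fn status_text(&self) -> String {
        format!("Retrying... (attempt {}/{})", self.attempt + 1, self.max_retries)
    }
}
```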

5. Make retry count configurable

Add max_retries: Option<u32> to ChatRequestPayload or a global LLM setting, defaulting to 3.
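Resolution of the optional field could look like this (the field name and helper are assumptions, not existing code):

```rust
/// Sketch of the payload extension; existing fields elided.
struct ChatRequestPayload {
    max_retries: Option<u32>, // None = use the global default
}

const DEFAULT_MAX_RETRIES: u32 = 3;

/// Per-request override wins; otherwise fall back to the default.
fn effective_max_retries(payload: &ChatRequestPayload) -> u32 {
    payload.max_retries.unwrap_or(DEFAULT_MAX_RETRIES)
}
```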

Scope

  • Backend: error.rs, llm_commands.rs (retry wrapper), optionally provider-level
  • Frontend: chat.js (retry status display), ChatView.svelte (retry indicator)
  • Tests: Unit test for is_retryable(), integration test for retry loop with mock failures

Acceptance Criteria

  • Transient network errors retry up to 3 times with exponential backoff (1s, 2s, 4s)
  • Non-retryable errors (auth, invalid request) fail immediately — no wasted retries
  • Rate limit errors retry with appropriate backoff (respect Retry-After header if present)
  • User sees retry status in chat UI ("Retrying..." indicator)
  • Retry count is configurable (default: 3)
  • Agentic loop resumes correctly after a successful retry mid-loop
  • All existing LLM tests pass
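For the Retry-After criterion, a minimal parser for the delta-seconds form is enough to start with (RFC 9110 also allows an HTTP-date form, omitted here; the function name is an assumption):

```rust
use std::time::Duration;

/// Parse a Retry-After header value given as delta-seconds.
/// Returns None for the HTTP-date form or garbage, in which case
/// the caller should fall back to normal exponential backoff.
fn parse_retry_after_secs(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}
```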
