Skip to content

[Tracker] Async LLM support (#3284): correctness bugs + maintainability issues #3341

@VascoSch92

Description

@VascoSch92

Summary

Tracker for follow-up work on the async LLM path introduced in #3284 (feat: add async LLM completion support across the SDK, merged). The agent server now drives real LocalConversations through arun() by default (event_service.py:285-294), so these affect production server runs, not just opt-in async callers.

Findings below were cross-checked against merged main by two independent reviews; the two High-severity items were reproduced.

Checklist

Correctness

Maintainability


B1 — Async error path bypasses fallback / mapping / telemetry (High)

In acompletion (llm.py:996-1054) and aresponses (llm.py:1338-1460), the retry loop is outside the try/except that calls _ahandle_error:

async for attempt in self.async_retry(...):   # reraise=True
    with attempt:
        resp = await self._atransport_call(...)   # provider errors raised HERE
try:                                              # only wraps response parsing
    message = Message.from_llm_chat_message(...)
    return LLMResponse(...)
except Exception as e:
    return await self._ahandle_error(e, ...)

The sync completion puts _one_attempt() inside the try (llm.py:900). map_provider_exception is called only in _handle_error/_ahandle_error (llm.py:718, 751). So on the common failure mode (retry exhaustion / non-retryable provider error), async:

  • never invokes fallback_strategy (backup LLM silently skipped),
  • never maps litellm exceptions to SDK types,
  • never calls telemetry.on_error.

Most damaging consequence: astep recovers from context overflow via except LLMContextWindowExceedError → CondensationRequest. That SDK type is produced only by map_provider_exception (mapping.py:35). In async the raw litellm.ContextWindowExceededError escapes unmapped, the handler misses it, and the conversation errors out instead of auto-condensing. Same for LLMMalformedConversationHistoryError.

Repro: primary litellm_acompletionAPIConnectionError, fallback LLM configured; acompletion() re-raised APIConnectionError and the fallback call count stayed 0.

Fix: wrap the async retry loop in the same try/except scope as sync, for both acompletion and aresponses.

B2 — Async retry loses temperature→1.0 on LLMNoResponseError (High)

before_sleep bumps temperature on LLMNoResponseError by mutating retry_state.kwargs (retry_mixin.py:50-59). Sync reads it back via _one_attempt(**retry_kwargs)final_kwargs = {**call_kwargs, **retry_kwargs} (llm.py:870). The async async for attempt: with attempt: idiom has no retried function and uses a call_kwargs local that is never updated, so the bump is lost.

Repro (pinned tenacity 9.1.2): temperatures per attempt — sync [0, 1.0, 1.0], async [0, 0, 0]. With temperature=0 the model tends to repeat the empty response, so all retries fail identically (and then compound with B1).

Fix: thread the retry-adjusted kwargs into the per-attempt call in the async loop (cleanest if combined with B1 via a shared per-attempt setup helper). Affects acompletion and aresponses.

B3 — cancel_token cleared while interrupted tool threads run (Medium)

arun()'s finally sets self._cancel_token = None (local_conversation.py:1067). run_in_executor worker threads can't be force-cancelled and may outlive arun(). A long-running tool that polls the documented conversation.cancel_token pattern after arun() exits sees None and loses the cancellation signal. (Terminal tools also get executor.interrupt(), so they're usually fine; cooperative-polling tools are not.)

Fix: keep the cancelled token reachable until the interrupted batch drains, or pass the token directly into tool execution rather than reading it back from the conversation.

B4 — RemoteConversation.interrupt() degrades to pause() (Medium)

The PR adds POST /conversations/{id}/interrupt and LocalConversation.interrupt(), but RemoteConversation (remote_conversation.py) doesn't override interrupt(), so it inherits BaseConversation.interrupt()self.pause(). SDK callers using a remote conversation get pause semantics from a method named interrupt().

Fix: add a RemoteConversation.interrupt() that POSTs to the /interrupt endpoint; add a test asserting it hits /interrupt, not /pause.

B5 — Sync-only/custom agents routed through threaded astep (Medium)

EventService.run() selects arun() whenever the conversation overrides BaseConversation.arun (event_service.py:285-294) — always true for LocalConversation. But only Agent overrides astep; ACPAgent and other AgentBase subclasses inherit the default astep (base.py:613) that runs sync step() via run_in_executor. So step() runs in a worker thread while arun() holds the FIFOLock on the event-loop thread.

Verified this is not a deadlock (step() never acquires the state lock), but it changes the threading/ordering model that ACP and custom agents were written against (callbacks now fire from a worker thread, not the lock-owning thread).

Fix: gate the async path on the agent's capability, e.g. type(conversation.agent).astep is not AgentBase.astep; otherwise run sync run() in the executor.

Q1 — Dead code

  • Tool.acall (tool.py:392) — zero callers; also a false affordance, since aexecute_batch/_arun_safe always wrap the sync __call__ in a thread (parallel_executor.py:761) and never call acall.
  • agenerate_title_with_llm / agenerate_title_from_message (title_utils.py, ~78 lines) — only call each other; server still uses sync generate_title.
  • _return_metrics — never read in any of the 4 LLM methods; propagated into the new async ones.

Q2 — Metrics.get_snapshot() not reused

MetricsSnapshot(...) is hand-built at llm.py:907, 1032, 1256, 1450; Metrics.get_snapshot() (metrics.py:136) already does this and additionally deepcopys accumulated_token_usage. The hand-built copies alias the live object, so they aren't true snapshots — reusing the helper removes ~20 lines and fixes the aliasing.

Q3 — Parallel-tool throughput regression

Sync execute_batch uses a dedicated ThreadPoolExecutor(max_workers) per batch; async aexecute_batch uses the shared default run_in_executor(None, …) pool gated by Semaphore(max_workers) (parallel_executor.py:729-737). Under concurrency or when max_workers exceeds the default pool size, previously-parallel tool calls can serialize. Correctness intact (order + locking preserved).

Q4 — FIFOLock held across await

arun() holds the thread-reentrant FIFOLock across await astep (local_conversation.py:953-1007). Safe only because the server routes every state mutator through run_in_executor worker threads. Any future state-mutating coroutine on the event-loop thread would silently re-enter the lock and corrupt history. Worth documenting or releasing the lock across the await.

Q5 — Sync/async duplication

After normalizing await/async, the twins are ~90%+ identical (astep/step 96/126 lines; arun/run loop 75/80; plus acompletion/aresponses/_atransport_call/_ahandle_error/condenser). ~600 of the ~1,740 added production lines are reclaimable via shared color-agnostic helpers (_prepare_chat_request/_finalize_chat_response, _classify_step_error, _map_and_raise). This duplication is the root cause of B1/B2 diverging from the sync path.

Suggested test additions

  • async fallback success; all-fallbacks-fail → mapped SDK exception; responses-API async fallback; telemetry.on_error fires on async failure (covers B1)
  • LLMNoResponseError → temperature escalates on async retry (covers B2)
  • tool still polling cancel_token after arun() enters finally still sees is_cancelled (covers B3)
  • RemoteConversation.interrupt() posts to /interrupt (covers B4)
  • server conversation with a sync-only AgentBase subclass runs via run() in executor (covers B5)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingenhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions