[Tracker] Async LLM support (#3284): correctness bugs + maintainability issues

## Summary

Tracker for follow-up work on the async LLM path introduced in #3284 (`feat: add async LLM completion support across the SDK`, merged). The agent server now drives real `LocalConversation`s through `arun()` by default (`event_service.py:285-294`), so these affect production server runs, not just opt-in async callers.

Findings below were cross-checked against merged `main` by two independent reviews; the two High-severity items were reproduced.

## Checklist

**Correctness**
- [x] **B1** — Async error path bypasses fallback strategy, provider-exception mapping, and `telemetry.on_error` (High) #3355
- [x] **B2** — Async retry loses the `temperature→1.0` escalation on `LLMNoResponseError` (High) #3356
- [ ] **B3** — `cancel_token` cleared in `arun()`'s `finally` while interrupted tool threads may still be running (Medium) #3395 
- [x] **B4** — `RemoteConversation.interrupt()` not overridden → silently degrades to `pause()` (Medium) #3398 
- [x] **B5** — ACP/custom agents (no `astep` override) run sync `step()` in a worker thread while `arun()` holds the state lock on the event-loop thread (Medium) #3404 

**Maintainability**
- [ ] **Q1** — Dead code: `Tool.acall`, `agenerate_title_with_llm`/`agenerate_title_from_message`, dead `_return_metrics` param
- [x] **Q2** — `Metrics.get_snapshot()` not reused (4 hand-built copies + token-usage aliasing) #3397 
- [ ] **Q3** — Parallel-tool throughput regression (shared default executor vs dedicated pool)
- [ ] **Q4** — `FIFOLock` held across `await` — undocumented "mutators must run off the event-loop thread" invariant #3400
- [ ] **Q5** — ~90% sync/async twin duplication (~600 reclaimable lines); extract shared helpers

---

## B1 — Async error path bypasses fallback / mapping / telemetry (High)

In `acompletion` (`llm.py:996-1054`) and `aresponses` (`llm.py:1338-1460`), the retry loop is **outside** the `try/except` that calls `_ahandle_error`:

```python
async for attempt in self.async_retry(...):   # reraise=True
    with attempt:
        resp = await self._atransport_call(...)   # provider errors raised HERE
try:                                              # only wraps response parsing
    message = Message.from_llm_chat_message(...)
    return LLMResponse(...)
except Exception as e:
    return await self._ahandle_error(e, ...)
```

The sync `completion` puts `_one_attempt()` inside the `try` (`llm.py:900`). `map_provider_exception` is called **only** in `_handle_error`/`_ahandle_error` (`llm.py:718, 751`). So on the common failure mode (retry exhaustion / non-retryable provider error), async:
- never invokes `fallback_strategy` (backup LLM silently skipped),
- never maps litellm exceptions to SDK types,
- never calls `telemetry.on_error`.

**Most damaging consequence:** `astep` recovers from context overflow via `except LLMContextWindowExceedError → CondensationRequest`. That SDK type is produced *only* by `map_provider_exception` (`mapping.py:35`). In async the raw `litellm.ContextWindowExceededError` escapes unmapped, the handler misses it, and **the conversation errors out instead of auto-condensing.** Same for `LLMMalformedConversationHistoryError`.

**Repro:** primary `litellm_acompletion` → `APIConnectionError`, fallback LLM configured; `acompletion()` re-raised `APIConnectionError` and the fallback call count stayed `0`.

**Fix:** wrap the async retry loop in the same `try/except` scope as sync, for both `acompletion` and `aresponses`.

## B2 — Async retry loses `temperature→1.0` on `LLMNoResponseError` (High)

`before_sleep` bumps temperature on `LLMNoResponseError` by mutating `retry_state.kwargs` (`retry_mixin.py:50-59`). Sync reads it back via `_one_attempt(**retry_kwargs)` → `final_kwargs = {**call_kwargs, **retry_kwargs}` (`llm.py:870`). The async `async for attempt: with attempt:` idiom has no retried function and uses a `call_kwargs` local that is never updated, so the bump is lost.

**Repro (pinned tenacity 9.1.2):** temperatures per attempt — sync `[0, 1.0, 1.0]`, async `[0, 0, 0]`. With `temperature=0` the model tends to repeat the empty response, so all retries fail identically (and then compound with B1).

**Fix:** thread the retry-adjusted kwargs into the per-attempt call in the async loop (cleanest if combined with B1 via a shared per-attempt setup helper). Affects `acompletion` and `aresponses`.

## B3 — `cancel_token` cleared while interrupted tool threads run (Medium)

`arun()`'s `finally` sets `self._cancel_token = None` (`local_conversation.py:1067`). `run_in_executor` worker threads can't be force-cancelled and may outlive `arun()`. A long-running tool that polls the documented `conversation.cancel_token` pattern after `arun()` exits sees `None` and loses the cancellation signal. (Terminal tools also get `executor.interrupt()`, so they're usually fine; cooperative-polling tools are not.)

**Fix:** keep the cancelled token reachable until the interrupted batch drains, or pass the token directly into tool execution rather than reading it back from the conversation.

## B4 — `RemoteConversation.interrupt()` degrades to `pause()` (Medium)

The PR adds `POST /conversations/{id}/interrupt` and `LocalConversation.interrupt()`, but `RemoteConversation` (`remote_conversation.py`) doesn't override `interrupt()`, so it inherits `BaseConversation.interrupt()` → `self.pause()`. SDK callers using a remote conversation get pause semantics from a method named `interrupt()`.

**Fix:** add a `RemoteConversation.interrupt()` that POSTs to the `/interrupt` endpoint; add a test asserting it hits `/interrupt`, not `/pause`.

## B5 — Sync-only/custom agents routed through threaded `astep` (Medium)

`EventService.run()` selects `arun()` whenever the *conversation* overrides `BaseConversation.arun` (`event_service.py:285-294`) — always true for `LocalConversation`. But only `Agent` overrides `astep`; `ACPAgent` and other `AgentBase` subclasses inherit the default `astep` (`base.py:613`) that runs sync `step()` via `run_in_executor`. So `step()` runs in a worker thread while `arun()` holds the `FIFOLock` on the event-loop thread.

Verified this is **not a deadlock** (`step()` never acquires the state lock), but it changes the threading/ordering model that ACP and custom agents were written against (callbacks now fire from a worker thread, not the lock-owning thread).

**Fix:** gate the async path on the *agent's* capability, e.g. `type(conversation.agent).astep is not AgentBase.astep`; otherwise run sync `run()` in the executor.

## Q1 — Dead code
- `Tool.acall` (`tool.py:392`) — zero callers; also a false affordance, since `aexecute_batch`/`_arun_safe` always wrap the sync `__call__` in a thread (`parallel_executor.py:761`) and never call `acall`.
- `agenerate_title_with_llm` / `agenerate_title_from_message` (`title_utils.py`, ~78 lines) — only call each other; server still uses sync `generate_title`.
- `_return_metrics` — never read in any of the 4 LLM methods; propagated into the new async ones.

## Q2 — `Metrics.get_snapshot()` not reused
`MetricsSnapshot(...)` is hand-built at `llm.py:907, 1032, 1256, 1450`; `Metrics.get_snapshot()` (`metrics.py:136`) already does this and additionally `deepcopy`s `accumulated_token_usage`. The hand-built copies alias the live object, so they aren't true snapshots — reusing the helper removes ~20 lines and fixes the aliasing.

## Q3 — Parallel-tool throughput regression
Sync `execute_batch` uses a dedicated `ThreadPoolExecutor(max_workers)` per batch; async `aexecute_batch` uses the shared default `run_in_executor(None, …)` pool gated by `Semaphore(max_workers)` (`parallel_executor.py:729-737`). Under concurrency or when `max_workers` exceeds the default pool size, previously-parallel tool calls can serialize. Correctness intact (order + locking preserved).

## Q4 — `FIFOLock` held across `await`
`arun()` holds the thread-reentrant `FIFOLock` across `await astep` (`local_conversation.py:953-1007`). Safe *only* because the server routes every state mutator through `run_in_executor` worker threads. Any future state-mutating coroutine on the event-loop thread would silently re-enter the lock and corrupt history. Worth documenting or releasing the lock across the `await`.

## Q5 — Sync/async duplication
After normalizing `await`/`async`, the twins are ~90%+ identical (`astep`/`step` 96/126 lines; `arun`/`run` loop 75/80; plus `acompletion`/`aresponses`/`_atransport_call`/`_ahandle_error`/condenser). ~600 of the ~1,740 added production lines are reclaimable via shared color-agnostic helpers (`_prepare_chat_request`/`_finalize_chat_response`, `_classify_step_error`, `_map_and_raise`). This duplication is the root cause of B1/B2 diverging from the sync path.

## Suggested test additions
- async fallback success; all-fallbacks-fail → mapped SDK exception; responses-API async fallback; `telemetry.on_error` fires on async failure (covers B1)
- `LLMNoResponseError` → temperature escalates on async retry (covers B2)
- tool still polling `cancel_token` after `arun()` enters `finally` still sees `is_cancelled` (covers B3)
- `RemoteConversation.interrupt()` posts to `/interrupt` (covers B4)
- server conversation with a sync-only `AgentBase` subclass runs via `run()` in executor (covers B5)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Tracker] Async LLM support (#3284): correctness bugs + maintainability issues #3341

Summary

Checklist

B1 — Async error path bypasses fallback / mapping / telemetry (High)

B2 — Async retry loses `temperature→1.0` on `LLMNoResponseError` (High)

B3 — `cancel_token` cleared while interrupted tool threads run (Medium)

B4 — `RemoteConversation.interrupt()` degrades to `pause()` (Medium)

B5 — Sync-only/custom agents routed through threaded `astep` (Medium)

Q1 — Dead code

Q2 — `Metrics.get_snapshot()` not reused

Q3 — Parallel-tool throughput regression

Q4 — `FIFOLock` held across `await`

Q5 — Sync/async duplication

Suggested test additions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Tracker] Async LLM support (#3284): correctness bugs + maintainability issues #3341

Description

Summary

Checklist

B1 — Async error path bypasses fallback / mapping / telemetry (High)

B2 — Async retry loses temperature→1.0 on LLMNoResponseError (High)

B3 — cancel_token cleared while interrupted tool threads run (Medium)

B4 — RemoteConversation.interrupt() degrades to pause() (Medium)

B5 — Sync-only/custom agents routed through threaded astep (Medium)

Q1 — Dead code

Q2 — Metrics.get_snapshot() not reused

Q3 — Parallel-tool throughput regression

Q4 — FIFOLock held across await

Q5 — Sync/async duplication

Suggested test additions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

B2 — Async retry loses `temperature→1.0` on `LLMNoResponseError` (High)

B3 — `cancel_token` cleared while interrupted tool threads run (Medium)

B4 — `RemoteConversation.interrupt()` degrades to `pause()` (Medium)

B5 — Sync-only/custom agents routed through threaded `astep` (Medium)

Q2 — `Metrics.get_snapshot()` not reused

Q4 — `FIFOLock` held across `await`