Skip to content

fix(adapter): surface error on agent EOF without final response#1198

Open
howie wants to merge 2 commits into
openabdev:mainfrom
howie:worktree-fix+acp-eof-error
Open

fix(adapter): surface error on agent EOF without final response#1198
howie wants to merge 2 commits into
openabdev:mainfrom
howie:worktree-fix+acp-eof-error

Conversation

@howie

@howie howie commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

What problem does this solve?

When a bridged (non-ACP-native) agent crashes on a backend error, OpenAB shows the user a vague _(no response)_ (or, worse, a partially-streamed buffer presented as a complete answer) instead of the real failure reason.

agy does not natively speak ACP — we bridge it. So when an AI backend (e.g. Gemini) returns HTTP 500 or a quota-exhausted error, agy can crash/exit without first emitting an ACP final response carrying that error. The failure chain:

  1. AI backend returns HTTP 500 / quota exhausted.
  2. agy crashes/exits — never sends an ACP error notification.
  3. OpenAB's ACP stdout pipe reaches EOF.
  4. The recv loop's rx.recv() -> None branch breaks out, but response_error is still None.
  5. final_content is empty → falls through to the _(no response)_ sentinel.

The user is left with no idea their request hit a quota limit or server error.

This was surfaced in the OpenAB community discussion. Because agy itself doesn't support ACP, the bridging layer (OpenAB side) must supply a sensible error on pipe EOF rather than letting it fall through.

Closes #

Discord Discussion URL:

At a Glance

Before:
  backend 500/quota  ->  agy crashes (no ACP error)  ->  pipe EOF
        ->  recv() == None  ->  break (response_error stays None)
        ->  final_content empty  ->  "_(no response)_"   [user confused]

After:
  backend 500/quota  ->  agy crashes (no ACP error)  ->  pipe EOF
        ->  recv() == None  ->  response_error = "Agent process exited unexpectedly"
        ->  final_content: "⚠️ Agent process exited unexpectedly"   [empty buffer]
                        or "⚠️ Agent process exited unexpectedly\n\n<partial text>"  [partial stream]

Termination paths:
  liveness check sees dead process   -> "⚠️ Agent process died"               (already handled)
  pipe EOF, no final response        -> "⚠️ Agent process exited unexpectedly" (THIS FIX)
  agent sends proper ACP error       -> "⚠️ Server Error ... quota exceeded"   (already handled)

Prior Art & Industry Research

Not applicable — defensive bug fix on the ACP bridge. Adds error surfacing on an existing break
path; no new architecture, dependency, or protocol. It makes the EOF branch consistent with the
sibling !conn.alive() branch, which already sets response_error = "Agent process died".

Proposed Solution

In crates/openab-core/src/adapter.rs, the recv loop EOF branch records an explicit error when the
pipe closes without a final response and no error was already recorded:

None => {
    if response_error.is_none() {
        response_error = Some("Agent process exited unexpectedly".into());
    }
    break;
}

The downstream final_content builder already renders response_error as ⚠️ {err} (standalone
when there is no body, or prepended to any partial content), so no other change is needed.

Why this approach?

  • Smallest correct fix on the bridge side. agy can't be relied on to emit an ACP error before
    crashing, so the bridge detects the abnormal EOF itself, mirroring the existing
    !conn.alive() branch.
  • No buffer-content guard. An earlier revision gated this on text_buf.is_empty(). Multi-model
    mob review (Codex + a Claude multi-finder review) showed that guard is unsound: it is defeated by
    the session-reset pre-seed (text_buf already holds the expiry notice), by send-once mode (the
    delivered body is a post-answer_start slice, not the full buffer), and by native streaming
    (partial text shown as a complete answer). The guard is also unnecessary: a successful turn is
    always signalled by the id-bearing JSON-RPC response to session/prompt, which breaks the recv
    loop before EOF — so the EOF branch is reachable only on abnormal termination, regardless of
    buffer contents. When partial text was streamed, final_content prepends the warning to it,
    preserving the output while flagging the truncation.

Alternatives Considered

  • Gate on text_buf.is_empty() (earlier revision): rejected after mob review — masks crashes on
    reset turns, send-once tool turns, and native-streaming partials (see above).
  • Synthetic final-response timeout instead of acting on EOF: rejected — EOF already definitively
    signals the pipe closed; a timeout only adds latency for the same outcome.
  • Fix only agy (emit ACP error before exit): rejected as the sole fix — agy isn't guaranteed to
    emit an error before crashing, so the bridge must be robust to silent EOF. A complementary agy-side
    fix is still worthwhile but out of scope for this repo.

Known Issues / Follow-ups

  • Reaction emoji on crashed turns (pre-existing, deferred): the EOF-injected response_error
    drives the ⚠️ message banner but not delivery_failed, so dispatch still reacts with success
    (🆗) rather than error (❌). This gap is shared with the existing Agent process died /
    hard-timeout / coded-error branches — this PR only joins the pattern. Recommended as a separate PR
    that wires response_error → ❌ uniformly for all error branches.

Validation

Rust changes:

  • cargo build -p openab-core passes
  • cargo test -p openab-core --lib adapter — 43 passed, 0 failed
  • adapter.rs clippy-clean
  • Multi-model mob review (Codex + Claude multi-finder): converged to LGTM after the
    text_buf.is_empty() guard was dropped

CI note:

All PRs:

  • Manual testing — reproduce by forcing a bridged agent to exit on a backend 500/quota error and
    confirm the user sees ⚠️ Agent process exited unexpectedly (with any partial text preserved)
    instead of _(no response)_. No automated test seam exists for this deeply-nested recv loop;
    the sibling error paths are likewise covered only by manual repro against a live agent.

@howie howie requested a review from thepagent as a code owner June 25, 2026 06:16
@openab-app openab-app Bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 25, 2026
@chaodu-agent

This comment has been minimized.

@openab-app openab-app Bot removed the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 25, 2026
The ACP recv loop's EOF branch (rx.recv() -> None) broke out of the
loop without setting response_error. When a bridged agent crashes on a
backend error (e.g. Gemini HTTP 500 / quota exhausted) it exits without
ever emitting an ACP error notification, so the pipe closes, recv()
returns None, and final_content falls through to the "_(no response)_"
sentinel -- hiding the real failure from the user.

Mark unexpected termination explicitly on EOF when no error was recorded
and no text was streamed, so the user sees a concrete error instead of a
vague sentinel. The text_buf.is_empty() guard preserves any partial text
that did arrive before the connection closed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01X6Vv8eNtc4T9ESSSzZ2KgM
@howie howie force-pushed the worktree-fix+acp-eof-error branch from dc9dfee to 513c523 Compare June 25, 2026 06:52
Mob review (Codex + Claude multi-finder) found the original
text_buf.is_empty() guard predicates error surfacing on buffer contents,
which is not a reliable proxy for turn completion. Three crash classes
still slipped through silently:

- session-reset turns pre-seed text_buf with the expiry notice, so the
  guard is always false and a crash is masked;
- send-once mode slices off inter-tool narration before delivery, so a
  crash after a tool (non-empty buffer, empty delivered body) shows no
  error;
- native-streaming partial-text crashes were shown as complete answers.

Drop the text_buf.is_empty() conjunct: on EOF, set response_error
whenever none was recorded. This is safe because a successful turn is
always signalled by the id-bearing JSON-RPC response to session/prompt,
which breaks the loop before EOF — so the EOF arm is reachable only on
abnormal termination, regardless of buffer contents. When partial text
was streamed, final_content prepends the warning to it, preserving the
output while flagging the truncation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01X6Vv8eNtc4T9ESSSzZ2KgM
@howie

howie commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Final Aggregated Review — PR #1198

Mode

group-review (2/2 voices active: Claude 8-finder/14-verified workflow + Codex)

Outcome (after fix + re-review)

  • Consensus fix applied (commit 6697c21): dropped text_buf.is_empty(); EOF now always sets
    response_error when none was recorded. Resolves findings 1–3.
  • Codex re-review: LGTM (no findings). Claude voice (lead): LGTM — findings 1–3 resolved,
    false-positive concern was refuted at review time, fix: pull --rebase before push in bump-chart #4 deferred (user decision), fix: use PR for chart bump instead of direct push #5 kept distinct.
  • All 43 adapter unit tests pass; cargo build -p openab-core clean.
  • Merge blocked only by the pre-existing pre_seed.rs:289 clippy lint on main (see Out-of-PR
    note) — user decision: leave for maintainers.

Consensus Critical (must fix)

None.

Consensus Important (must fix)

Both voices converge on a single root cause: the text_buf.is_empty() guard predicates error
surfacing on buffer contents, but buffer contents are not a reliable proxy for "did the turn
complete".
Three confirmed crash classes still slip through silently:

  1. Session-reset turn masks the crash (adapter.rs:705-707 pre-seeds text_buf with
    "⚠️ _Session expired, starting fresh..._"). On a reset turn text_buf.is_empty() is always
    false, so a bridged-agent crash at EOF never sets response_error — the user sees only the
    reset notice. Confirmed by 6 Claude finders + verifiers.

  2. Guard measures the wrong buffer (send-once mode). The guard reads the full pre-split
    text_buf, but the delivered body is the post-answer_start slice. A turn that streams
    pre-tool narration, runs a tool (answer_start advances to buffer end at adapter.rs:955),
    then crashes with no post-tool text → text_buf non-empty but delivered body empty → no error.
    Confirmed by Codex + multiple Claude verifiers.

  3. Native-streaming partial-text crash shown as complete. Partial text flushed live, then
    backend 500/quota closes stdout → response_error stays None by the guard → truncated answer
    shown unprefixed, indistinguishable from a deliberately short reply. (Codex's primary finding +
    Claude finding.)

Unifying fix (resolves all three): drop the text_buf.is_empty() conjunct; on EOF set
response_error whenever none was recorded. This is provably safe — both voices' verifiers
independently REFUTED
the "successful turn closes stdout without a final response → false
positive" concern: a successful ACP turn is always signaled by the id-bearing JSON-RPC response to
session/prompt, which breaks the loop at the id-branch before EOF. The EOF branch is therefore
reachable only on abnormal termination, regardless of buffer contents. (Independently corroborated
by an Explore trace of acp/connection.rs:170-172,255-257 + agy-acp/src/main.rs:46-52,149-172.)

The original spec's text_buf.is_empty() rationale ("agent streamed all text then closed
normally") describes a path that does not exist in this ACP/JSON-RPC implementation.

Disputed / scope decision (user decides)

  1. EOF error is display-only; does not set delivery_failed (Claude single-voice, CONFIRMED at
    verify index 29). A crashed turn shows the ⚠️ banner but dispatch still reacts with success
    (🆗) because the closure return is governed by delivery_failed, never by response_error.
    Pre-existing & shared: the existing Agent process died (833), hard-timeout (838), and
    coded-error branches have the same gap — my branch only joins the existing pattern, it does not
    introduce it. Wiring response_error → ❌ reaction is a broader change to the shared
    error/delivery contract affecting all four branches.
    • Lead recommendation: OUT OF SCOPE for this focused EOF fix. Defer to a separate PR that
      fixes the reaction signal for all error branches uniformly; note as Known Issue here.

Actionable NIT

  1. recv-vs-timeout wording race (PLAUSIBLE). The same crash can surface as
    "Agent process exited unexpectedly" (EOF arm wins) or "Agent process died" (liveness arm
    wins) depending on a tokio::select! race.
    • Lead decision: keep distinct wording. Both messages are accurate and reflect two genuine
      detection mechanisms (stdout EOF vs liveness probe); the distinction aids triage rather than
      harming it. Documented here rather than forced into a single string. Not a defect.

Voices unavailable

  • agy (Gemini): not installed (cache GEMINI_OK=0). 2-voice mob.

Out-of-PR note

@openab-app openab-app Bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. label Jun 25, 2026
@openab-app

openab-app Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Caution

This PR is missing a Discord Discussion URL in the body.
This PR will be automatically closed in 24 hours if the link is not added.

All PRs must reference a prior Discord discussion to ensure community alignment before implementation.

Please edit the PR description to include a link like:

Discord Discussion URL: https://discord.com/channels/...

@chaodu-agent

Copy link
Copy Markdown
Collaborator

LGTM ✅ — Minimal, correct defensive fix for the silent-EOF-on-bridged-agent-crash path.

What This PR Does

When a bridged (non-ACP-native) agent crashes on a backend error (e.g. HTTP 500, quota exhausted), it exits without emitting an ACP error notification. Previously the recv loop's EOF branch simply broke out, leaving response_error as None and the user seeing a vague _(no response)_. This PR surfaces an explicit "Agent process exited unexpectedly" error on that path, consistent with the existing !conn.alive() ("Agent process died") branch.

How It Works

In the rx.recv() => None branch (pipe EOF without a prior final response), sets response_error = Some("Agent process exited unexpectedly".into()) when no error was already recorded. The downstream final_content builder already renders this as ⚠️ {err}. No other code changes needed.

Findings

# Severity Finding Location
1 🟢 Correct pattern — mirrors !conn.alive() branch exactly adapter.rs:815-835
2 🟢 Sound reasoning for dropping the text_buf.is_empty() guard — well-documented in comments adapter.rs:823-831
3 🟢 response_error.is_none() guard avoids overwriting errors already set by other paths adapter.rs:833
Finding Details

🟢 F1: Correct pattern reuse

The fix follows the exact same response_error = Some(...) + break pattern used by the sibling !conn.alive() and hard-timeout branches. Minimal and consistent.

🟢 F2: Thorough design rationale in comments

The inline comments explain why the text_buf.is_empty() guard was dropped (pre-seed buffer, send-once mode, native streaming partials) — this is excellent documentation for future maintainers and avoids re-litigating the same design question.

🟢 F3: Defensive is_none() guard

Protects against double-setting response_error if a future code path sets it before reaching EOF. Good defensive practice.

Baseline Check
  • PR opened: 2026-06-25
  • Main already has: response_error handling for !conn.alive() ("Agent process died") and hard timeout; the EOF branch was a bare break
  • Net-new value: Surfaces an explicit error on the EOF-without-final-response path, eliminating the silent _(no response)_ failure mode for bridged agents
CI Status
What's Good (🟢)
  • Single-line behavioral fix with extensive, well-reasoned inline documentation
  • Consistent with existing error-surfacing patterns in the same recv loop
  • PR description is exemplary: clear before/after diagrams, alternatives considered, known issues documented
  • Multi-model mob review already performed and converged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. pending-contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants