fix(agent): resumable checkpoint at tool-call cap + tolerant tool parsing/resolution#2683
Conversation
…sing/resolution Three agent-reliability bugs from tester logs (bug-report-2026-05-26): A1 (HIGH): hitting the tool-call iteration cap aborted the turn, leaving the transcript on an unterminated tool cycle — the next message then silently returned empty and wedged the whole thread. The loop now emits a model-written "done so far / next steps" checkpoint (native tools disabled, streamed) as the final assistant message and keeps full tool history, so the user's next message resumes naturally — no heuristic "continue" detection. Deterministic fallback if the summary is empty, plus a guard that surfaces a 0-text/0-tool completion as an error instead of a silent blank reply. A2 (MED): integrations_agent dropped valid Gmail/Slides tools when the model emitted a near-miss slug. LazyToolkitResolver now resolves exact -> case-insensitive -> unique normalized-slug; fabricated slugs still route to the "available tools" error so the model self-corrects. A3 (MED): a tool_call with a correct name but an aliased argument key (args/parameters/input) was dropped, wasting an iteration. parse_tool_call_value now accepts argument-key aliases; the tool name stays strict to avoid false-positive tool calls in the whole-response parse path. Tests: checkpoint emitted at cap + transcript well-formed; empty-response error; deterministic fallback; arg-alias recovery; near-miss resolve + normalize_slug. Note: the legacy run_tool_call_loop (tool_loop.rs, superseded for chat dispatch) still returns MaxIterationsExceeded; only the live session turn path is changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughNormalizes tool-call argument-key aliases while forbidding generic aliases for whole-response singletons, adds resumable assistant checkpoints at max-tool-iteration caps (with deterministic fallback and usage accounting), and implements fuzzy toolkit slug resolution. Tests added/updated across parsing, turn handling, and toolkit resolution. ChangesModel Output Resilience
Sequence Diagram(s)sequenceDiagram
participant Client
participant Agent
participant Provider
participant Transcript
Client->>Agent: request turn
Agent->>Provider: model completions (may include tool calls)
Provider-->>Agent: tool-call markup + text
Agent->>Provider: summarize_iteration_checkpoint (tools: None) -- when iteration cap hit
Provider-->>Agent: streaming prose checkpoint
Agent->>Transcript: persist assistant checkpoint and usage
Agent-->>Client: return checkpoint text
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/openhuman/agent/harness/subagent_runner/ops.rs (1)
213-258: 💤 Low valueConsider logging ambiguous normalized matches for diagnostics.
The logic correctly returns
Nonewhen multiple actions normalize to the same slug (preventing mis-dispatch), but the silent fall-through at line 247 might make debugging harder if a toolkit has actions likeread_fileandReadFile(both normalize to"readfile"). A debug-level log would aid diagnosis without changing behavior.📊 Optional diagnostic log
if let Some(action) = matches.next() { if matches.next().is_none() { tracing::info!( requested = %name, matched = %action.name, "[subagent_runner] resolved tool by normalized-slug match" ); return Some(action); + } else { + tracing::debug!( + requested = %name, + norm = %norm, + "[subagent_runner] ambiguous normalized-slug match; multiple actions resolve to same slug" + ); } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/subagent_runner/ops.rs` around lines 213 - 258, In find_action, when normalizing the requested name into norm and iterating matches (the iterator bound to matches), add a debug-level log when multiple actions share the same normalized slug so ambiguous normalized matches are visible; specifically, detect the branch where matches.next() yields an action but a subsequent matches.next().is_some() is true (the ambiguity case) and emit a tracing::debug (or tracing::info at debug level) with requested = %name, norm = %norm, and known_slugs or the colliding action names to aid diagnosis without changing the return behavior.src/openhuman/agent/harness/parse.rs (1)
412-418: ⚡ Quick winPrefix this warning consistently.
New error-path logs in Rust are expected to use a stable grep-friendly prefix; this message will be harder to aggregate next to the existing bracketed log families. As per coding guidelines "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with verbose diagnostics using stable grep-friendly prefixes and correlation fields".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/parse.rs` around lines 412 - 418, The tracing::warn message in parse.rs that reports malformed <tool_call> JSON should be changed to include a stable, grep-friendly prefix (e.g. "[parse:malformed-tool-call]" or similar) at the start of the log string so it aligns with other structured logs; update the tracing::warn call in the function that emits this warning (the block that currently logs "Malformed <tool_call> JSON: expected tool-call object in tag body (body_chars={})" and uses inner.chars().count()) to prepend the chosen prefix while keeping the existing diagnostic text and the body_chars value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/harness/parse.rs`:
- Around line 79-80: The singleton top-level-object path in parse_tool_calls is
incorrectly using first_args_by_keys(value) which applies broad alias widening
(e.g., "input") and causes normal replies to be interpreted as tool calls;
instead, for the singleton whole-response fallback (the branch that constructs
Some(ParsedToolCall { ... } from value) only accept a canonical "arguments"
field or apply a strict key check (e.g., require the exact "arguments" key or a
narrow whitelist) rather than calling first_args_by_keys; update the code around
the ParsedToolCall construction to only extract arguments from
value.get("arguments") (or validate explicit canonical keys) and leave
first_args_by_keys usage limited to tagged/tool_calls entries.
In `@src/openhuman/agent/harness/session/turn.rs`:
- Around line 1011-1052: The checkpoint path currently calls
summarize_iteration_checkpoint() to get checkpoint text but ignores any
token/usage info, causing undercounting; change summarize_iteration_checkpoint
(and its other occurrence around the 1991-2073 range) to return both the
checkpoint string and a UsageSummary (or similar struct), then when checkpoint
is produced (or when falling back to build_deterministic_checkpoint) fold that
returned usage into cumulative_input_tokens, cumulative_output_tokens,
cumulative_cached_input_tokens, cumulative_charged_usd and merge into
last_turn_usage before calling persist_session_transcript; update the call sites
that push ChatMessage::assistant(checkpoint.clone()) and
persist_session_transcript to use the adjusted cumulative values and updated
last_turn_usage so the checkpoint provider’s usage is included in the saved
transcript.
- Around line 2057-2066: The current match in turn.rs returns resp.text when
parse_response yields empty prose, which can re-emit raw <tool_call> markup;
update the branch after calling self.tool_dispatcher.parse_response(&resp) so
that if text.trim().is_empty() you return an empty string (or only fall back to
resp.text after verifying it contains no tool-call markup, e.g., no "<tool_call"
substring). Modify the block around tool_dispatcher.parse_response, the Ok(resp)
match arm, and the conditional that selects between text and resp.text to ensure
parsed-empty responses do not persist raw tool-call XML.
---
Nitpick comments:
In `@src/openhuman/agent/harness/parse.rs`:
- Around line 412-418: The tracing::warn message in parse.rs that reports
malformed <tool_call> JSON should be changed to include a stable, grep-friendly
prefix (e.g. "[parse:malformed-tool-call]" or similar) at the start of the log
string so it aligns with other structured logs; update the tracing::warn call in
the function that emits this warning (the block that currently logs "Malformed
<tool_call> JSON: expected tool-call object in tag body (body_chars={})" and
uses inner.chars().count()) to prepend the chosen prefix while keeping the
existing diagnostic text and the body_chars value.
In `@src/openhuman/agent/harness/subagent_runner/ops.rs`:
- Around line 213-258: In find_action, when normalizing the requested name into
norm and iterating matches (the iterator bound to matches), add a debug-level
log when multiple actions share the same normalized slug so ambiguous normalized
matches are visible; specifically, detect the branch where matches.next() yields
an action but a subsequent matches.next().is_some() is true (the ambiguity case)
and emit a tracing::debug (or tracing::info at debug level) with requested =
%name, norm = %norm, and known_slugs or the colliding action names to aid
diagnosis without changing the return behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1f0acf74-04da-456f-bdaf-5e511c818847
📒 Files selected for processing (6)
src/openhuman/agent/harness/parse.rssrc/openhuman/agent/harness/parse_tests.rssrc/openhuman/agent/harness/session/turn.rssrc/openhuman/agent/harness/session/turn_tests.rssrc/openhuman/agent/harness/subagent_runner/ops.rssrc/openhuman/agent/harness/subagent_runner/ops_tests.rs
CodeRabbit review fixes:
- parse.rs: don't apply generic arg-key aliases (notably the very generic
'input') on the singleton whole-response path. A bare {"name":..,"input":..}
model reply must not be dressed up as a tool call with arbitrary args routed
to execution. Alias widening stays for tagged contexts (tool_calls array,
<tool_call>/<invoke> tags, function-wrapped) via a threaded allow_arg_aliases
flag; the whole-response parse passes false. New regression test.
- turn.rs: summarize_iteration_checkpoint now returns its provider usage so the
caller folds it into cumulative_input/output/cached tokens, charged_usd and
last_turn_usage (mirrors the normal final-response accounting). The extra
checkpoint call no longer undercounts the turn / mis-attributes prior usage.
- turn.rs: when parse_response strips tool-call markup and leaves no prose,
return empty (-> deterministic fallback) instead of re-emitting resp.text,
which could persist raw <tool_call> markup verbatim as the checkpoint.
CI test reconciliation (agent/tests.rs): the resumable-checkpoint + empty-as-
error behaviour from this PR changed three pre-existing contracts. Updated:
- turn_bails_out_at_max_iterations -> turn_emits_checkpoint_at_max_iterations
(asserts Ok(checkpoint) + well-formed transcript tail, not Err).
- turn_handles_empty/none_text_response -> turn_errors_on_empty/none_text_response
(assert the empty-response error, matching the A1 guard).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agnostics (tinyhumansai#2683) - parse.rs: give the malformed-<tool_call> warn a stable grep-friendly '[agent_parse]' prefix and structured body_chars field, matching the other harness log families (per CLAUDE.md logging conventions). - subagent_runner/ops.rs: log a debug line when 2+ actions normalize to the same slug in find_action — the ambiguous case deliberately refuses to guess, so surface the collision for diagnosis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed
The 3 actionable comments were addressed in |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/agent/harness/session/turn.rs (1)
2071-2109: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winAdd success-path diagnostics around the checkpoint provider call.
This adds a second
provider.chat(...)on the max-iteration path, but only failures are logged. When the checkpoint comes back empty or markup-only, there’s no grep-friendly trace of the external call itself, which makes capped-turn debugging much harder.📝 Example logging shape
+ let checkpoint_started = std::time::Instant::now(); + log::info!( + "[agent_loop] checkpoint summary request iteration={} model={} messages={}", + iteration_for_stream, + effective_model, + messages.len() + ); let result = self .provider .chat( ChatRequest { messages: &messages, @@ match result { Ok(resp) => { + log::info!( + "[agent_loop] checkpoint summary response iteration={} elapsed_ms={} text_chars={} native_tool_calls={} usage_present={}", + iteration_for_stream, + checkpoint_started.elapsed().as_millis(), + resp.text.as_ref().map_or(0, |t| t.chars().count()), + resp.tool_calls.len(), + resp.usage.is_some() + ); let usage = resp.usage.clone(); // Strip any stray tool-call XML a text-mode model may have // emitted; keep only the prose.As per coding guidelines
**/*.rs:Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with verbose diagnostics using stable grep-friendly prefixes and correlation fields.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/session/turn.rs` around lines 2071 - 2109, Wrap the successful provider.chat call (the call that produces resp used to compute checkpoint) with grep-friendly entry/exit logs and a branch log when checkpoint is empty or markup-only: before calling provider.chat log an entry like "CHECKPOINT_CALL:start" with correlation fields (turn/session/request id from self if available), model (effective_model) and temperature; after it returns log "CHECKPOINT_CALL:ok" with resp.usage, resp.text (or resp.text.is_some()), the parsed (tool_dispatcher.parse_response) text and calls, and the derived checkpoint value; if checkpoint is empty or was dropped due to markup-only emit "CHECKPOINT_CALL:empty_or_markup_only" including resp.raw/text/calls/usage to aid debugging. Use existing logging facility in this module and keep messages stable and grep-friendly.
🧹 Nitpick comments (1)
src/openhuman/agent/tests.rs (1)
443-490: ⚡ Quick winPin the cap with an execution-count assertion.
This proves a checkpoint is eventually returned, but it would still pass if the loop executed
max_tool_iterations + 1tools before checkpointing. UsingCountingToolhere and asserting the count is exactlymax_iterswould lock down the off-by-one behavior this regression is meant to guard.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/tests.rs` around lines 443 - 490, Update the test turn_emits_checkpoint_at_max_iterations to use the CountingTool instead of EchoTool and assert its execution count equals max_iters; specifically, create and pass a CountingTool instance in the build_agent_with_config tools vector (replace Box::new(EchoTool) with the CountingTool instance), run agent.turn as before, then after the reply add an assertion that the CountingTool's recorded count (e.g., counting_tool.count() or its public counter) is exactly max_iters to pin the off-by-one behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/harness/parse.rs`:
- Around line 100-106: The fallback branch in parse_tool_calls currently treats
any top-level JSON as a tool call when allow_arg_aliases is false; change the
logic in the block that calls parse_arguments_value(value.get("arguments")) so
it first checks that the canonical "arguments" key is actually present (e.g.,
value.get("arguments").is_some() or value.contains_key("arguments")) before
constructing a ParsedToolCall, returning None/plain-text otherwise; apply the
same presence check to the similar fallback at the other occurrence (lines
~415-420) so only responses with an explicit "arguments" key become tool calls.
---
Outside diff comments:
In `@src/openhuman/agent/harness/session/turn.rs`:
- Around line 2071-2109: Wrap the successful provider.chat call (the call that
produces resp used to compute checkpoint) with grep-friendly entry/exit logs and
a branch log when checkpoint is empty or markup-only: before calling
provider.chat log an entry like "CHECKPOINT_CALL:start" with correlation fields
(turn/session/request id from self if available), model (effective_model) and
temperature; after it returns log "CHECKPOINT_CALL:ok" with resp.usage,
resp.text (or resp.text.is_some()), the parsed (tool_dispatcher.parse_response)
text and calls, and the derived checkpoint value; if checkpoint is empty or was
dropped due to markup-only emit "CHECKPOINT_CALL:empty_or_markup_only" including
resp.raw/text/calls/usage to aid debugging. Use existing logging facility in
this module and keep messages stable and grep-friendly.
---
Nitpick comments:
In `@src/openhuman/agent/tests.rs`:
- Around line 443-490: Update the test turn_emits_checkpoint_at_max_iterations
to use the CountingTool instead of EchoTool and assert its execution count
equals max_iters; specifically, create and pass a CountingTool instance in the
build_agent_with_config tools vector (replace Box::new(EchoTool) with the
CountingTool instance), run agent.turn as before, then after the reply add an
assertion that the CountingTool's recorded count (e.g., counting_tool.count() or
its public counter) is exactly max_iters to pin the off-by-one behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 49ab3ba0-6f40-434f-aec3-ee3957d14bbb
📒 Files selected for processing (5)
src/openhuman/agent/harness/parse.rssrc/openhuman/agent/harness/parse_tests.rssrc/openhuman/agent/harness/session/turn.rssrc/openhuman/agent/harness/session/turn_tests.rssrc/openhuman/agent/tests.rs
… parse (tinyhumansai#2683) Address CodeRabbit (major): on the bare whole-response JSON path (`parse_tool_call_value_aliased(.., false)`), a plain reply like `{"name":"Alice","input":{…}}` still returned a `ParsedToolCall` with empty arguments — so a normal JSON answer whose `name` happens to match a tool could be dispatched. Require the canonical `arguments` key as an explicit tool-call marker on that path; without it, return None (plain text). Tagged contexts (`<tool_call>`/`<invoke>`, `tool_calls` array, `function` wrapper) reach the parser with aliases enabled and keep their permissive behaviour. Test updated: a bare `{"name","input"}` whole-response object is no longer a tool call; a bare object WITH canonical `arguments` still is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckpoint (tinyhumansai#2683) Main agent: the max-iteration checkpoint now summarizes ONLY this turn's work — a trim-robust digest of its tool calls + results compiled as the loop runs — instead of the full conversation (no prior turns, no system prompt). The digest is decoupled from persistence: the transcript is still saved with the full conversation, so the persisted record is unchanged. The digest is immune to history trimming (which drops/reorders from the front). Sub-agent (run_inner_loop): on the iteration cap, return a graceful partial-progress checkpoint (model-written over the run's tool-call digest, tools disabled, with a deterministic fallback) as the result instead of Err(MaxIterationsExceeded). The delegating agent now receives what the sub-agent accomplished and can continue, rather than a bare "delegate failed". Tests: sub-agent returns a checkpoint at the cap (model-written + deterministic fallback); existing main-agent checkpoint tests still pass under turn-scoping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/openhuman/agent/harness/subagent_runner/ops.rs (1)
1870-1929:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPersist the synthesized checkpoint before returning.
This branch folds the summary-call usage into
usage, but it returns without pushingcheckpointintohistoryor re-runningpersist_transcript(). The sub-agent transcript therefore still ends on the last tool-result turn, the checkpoint call's tokens/cost never reach disk, andworker_thread_idruns miss the final agent message for this path. Please mirror the normal terminal-response path here beforeOk(...).Suggested fix
let checkpoint = match provider .chat( ChatRequest { messages: &summary_input, tools: None, @@ } }; + history.push(ChatMessage::assistant(checkpoint.clone())); + append_worker_message( + checkpoint.clone(), + "agent".to_string(), + serde_json::json!({ + "scope": "worker_thread", + "agent_id": agent_id, + "task_id": task_id, + "iteration": max_iterations, + "final": true, + "checkpoint": true, + }), + ); + persist_transcript(history, &usage); Ok((checkpoint, max_iterations, usage)) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/subagent_runner/ops.rs` around lines 1870 - 1929, The checkpoint branch currently returns after synthesizing `checkpoint` (from the `provider.chat(...).await` block) and updating `usage`, but it never append the final checkpoint message into the sub-agent `history` nor calls `persist_transcript()` (so the tokens/cost and final message are not saved); update this path to mirror the normal terminal-response flow by creating the final ChatMessage/turn containing `checkpoint` (use the same message shape used elsewhere), push it into `history`, call `persist_transcript()` (and any code that sets `worker_thread_id`/final metadata) before returning `Ok((checkpoint, max_iterations, usage))`, ensuring the `usage` updates are preserved and any errors from persisting are handled the same way as the regular terminal-response path.
🧹 Nitpick comments (1)
src/openhuman/agent/harness/session/turn.rs (1)
2100-2142: ⚡ Quick winAdd start/success logs around the checkpoint summary call.
This helper adds a new provider round-trip on the turn hot path, but only the failure path is logged. Please emit request/response diagnostics here with the same stable prefix/correlation fields as the main loop so checkpoint stalls, empty-summary fallbacks, and provider latency are traceable.
As per coding guidelines, "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with verbose diagnostics using stable grep-friendly prefixes and correlation fields".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/agent/harness/session/turn.rs` around lines 2100 - 2142, Before calling self.provider.chat in the checkpoint summary block, add a stable, grep-friendly entry log (e.g., "[agent_loop::checkpoint] request" plus the same correlation fields used by the main loop) that records the outgoing request diagnostics (effective_model, temperature, tools present, and a safe summary of messages); on Ok(resp) add a matching success log (e.g., "[agent_loop::checkpoint] response") that records response diagnostics (resp.text presence/length, resp.usage, parsed calls count and the checkpoint decision) so both request/response and latency are traceable; use the existing identifiers (self.provider.chat, ChatRequest, delta_tx_opt, result, resp, self.tool_dispatcher.parse_response) to locate where to insert the logs and ensure the log lines mirror the main loop's stable prefix/correlation field format.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/openhuman/agent/harness/subagent_runner/ops.rs`:
- Around line 1870-1929: The checkpoint branch currently returns after
synthesizing `checkpoint` (from the `provider.chat(...).await` block) and
updating `usage`, but it never append the final checkpoint message into the
sub-agent `history` nor calls `persist_transcript()` (so the tokens/cost and
final message are not saved); update this path to mirror the normal
terminal-response flow by creating the final ChatMessage/turn containing
`checkpoint` (use the same message shape used elsewhere), push it into
`history`, call `persist_transcript()` (and any code that sets
`worker_thread_id`/final metadata) before returning `Ok((checkpoint,
max_iterations, usage))`, ensuring the `usage` updates are preserved and any
errors from persisting are handled the same way as the regular terminal-response
path.
---
Nitpick comments:
In `@src/openhuman/agent/harness/session/turn.rs`:
- Around line 2100-2142: Before calling self.provider.chat in the checkpoint
summary block, add a stable, grep-friendly entry log (e.g.,
"[agent_loop::checkpoint] request" plus the same correlation fields used by the
main loop) that records the outgoing request diagnostics (effective_model,
temperature, tools present, and a safe summary of messages); on Ok(resp) add a
matching success log (e.g., "[agent_loop::checkpoint] response") that records
response diagnostics (resp.text presence/length, resp.usage, parsed calls count
and the checkpoint decision) so both request/response and latency are traceable;
use the existing identifiers (self.provider.chat, ChatRequest, delta_tx_opt,
result, resp, self.tool_dispatcher.parse_response) to locate where to insert the
logs and ensure the log lines mirror the main loop's stable prefix/correlation
field format.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8ce5691a-5506-4086-a35b-c45f90a7a8ae
📒 Files selected for processing (3)
src/openhuman/agent/harness/session/turn.rssrc/openhuman/agent/harness/subagent_runner/ops.rssrc/openhuman/agent/harness/subagent_runner/ops_tests.rs
M3gA-Mind
left a comment
There was a problem hiding this comment.
Review
Summary: Well-engineered fix for the three production bugs from bug-report-2026-05-26. The two-layer safety net on A1 (model checkpoint → deterministic fallback) is the right design and the test coverage is thorough — happy path, empty-summary fallback, and usage-accounting regression guard are all covered. A2's uniqueness check on normalized slugs is correctly conservative. A3's allow_arg_aliases boolean is a clean lever.
A few inline notes below — main asks before merge:
- Promote the ambiguous-slug-collision log to
warn(it's a configuration anomaly, not routine noise) - Add a comment at the checkpoint
persist_session_transcriptcall noting the last iteration's tool results are intentionally absent from the base messages - Note the subagent path intentionally skips transcript persist (so a future reader doesn't 'fix' it)
Everything else is minor / informational.
| // the prior conversation or system prompt bleeding in — and it's | ||
| // immune to history trimming (which drops/reorders from the front). | ||
| // The persisted transcript is unaffected (bug-report-2026-05-26 A1). | ||
| let mut turn_tool_digest = String::new(); |
There was a problem hiding this comment.
turn_tool_digest is unbounded — each entry is capped at 800 chars, so with the default cap of 10 iterations the practical max is ~8 KB. Fine as-is, but worth a short comment so a future increase to max_tool_iterations comes with an obvious prompt to revisit this.
| // conversation. `base_messages` above is still used for the | ||
| // transcript persist below, so the saved transcript is unchanged | ||
| // (bug-report-2026-05-26 A1). | ||
| let turn_summary_input = vec![ChatMessage::user(format!( |
There was a problem hiding this comment.
user_message here is captured from the outer turn(&mut self, message: &str) parameter — correct, but implicit enough to confuse a future reader. A one-line comment like // user_message captured from turn() parameter would help.
| // well-formed assistant message (never a dangling tool cycle). | ||
| let mut checkpoint_messages = base_messages; | ||
| checkpoint_messages.push(ChatMessage::assistant(checkpoint.clone())); | ||
| self.persist_session_transcript( |
There was a problem hiding this comment.
Worth a comment here: base_messages (and therefore the persisted transcript) does NOT include the tool results from the final capped iteration — those were processed after the last last_provider_messages snapshot and before the cap fired. The transcript is structurally correct (ends on an assistant message) and the checkpoint prose covers the work done, but the omission is non-obvious to a future reader. Something like: // base_messages ends before the final iteration's tool results; checkpoint prose covers what the model did.
| tracing::debug!( | ||
| requested = %name, | ||
| norm = %norm, | ||
| "[subagent_runner] ambiguous normalized-slug match — multiple actions resolve to the same slug; not resolving" |
There was a problem hiding this comment.
This should be tracing::warn! rather than tracing::debug!. Two registered actions normalizing to the same slug is a configuration anomaly (silent ambiguity in the toolkit), not routine noise — it won't surface in production unless someone is actively tailing debug logs. At warn level it would show up in normal operator log streams and be diagnosable without special filtering.
| deterministic | ||
| } | ||
| }; | ||
| Ok((checkpoint, max_iterations, usage)) |
There was a problem hiding this comment.
Unlike the main-agent path, the subagent checkpoint is intentionally NOT persisted to a transcript here — the parent agent's transcript captures the delegated result, so there's no data loss. Worth a short comment to that effect so a future reader doesn't treat the missing persist_subagent_transcript call as a bug to fix.
- subagent_runner: promote the ambiguous normalized-slug collision log debug→warn — a toolkit config anomaly should surface in normal operator logs, not stay behind debug filtering. - turn.rs: document the turn_tool_digest size bound (~8 KB at the default 10-iteration cap; revisit if raised), that `user_message` is the turn() parameter, and that base_messages / the persisted transcript intentionally omits the final capped iteration's tool results (checkpoint prose covers them; transcript stays structurally correct). - subagent_runner: note the sub-agent checkpoint is intentionally NOT transcript-persisted (the parent's transcript captures the delegated result). All from M3gA-Mind's non-blocking review notes; no behavior change beyond the log level. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @M3gA-Mind — all addressed in 6e545f2:
No behavior change beyond the log level. |
|
All good — reviewed and approved. CI failures are systemic (main is also red), not caused by this PR. Ready to merge once the infrastructure recovers. |
Summary
0-text / 0-tool-callcompletion is surfaced as an error instead of being accepted as a blank reply.LazyToolkitResolvertolerates near-miss tool slugs (case / separator drift) sointegrations_agentstops dropping valid Gmail/Slides calls.parse_tool_call_valueaccepts argument-key aliases (args/parameters/input); tool name kept strict.Problem
From tester logs (
bug-report-2026-05-26, M-chip tester):max_tool_iterations(10) returnedErr(MaxIterationsExceeded)after the transcript had been persisted mid-tool-cycle. The thread was left ending on an unterminated assistant tool-call; the next message replayed that history, the gateway returned an empty completion, the loop accepted it as the final answer (final_chars=0), and the thread silently wedged — every subsequent message returned blank. (In the log,thread-d6f29915: 4 healthy turns, then permanent silence.)integrations_agentrepeatedly loggedtool not in allowlistfor valid actions (7 occurrences) because the lazy resolver matched slugs exact-only; each drop burned an iteration, accelerating A1.<tool_call>dropped the call and wasted an iteration.Solution
turn.rscalls a newsummarize_iteration_checkpoint(provider call withtools: None, streamed) to produce the checkpoint, persists it as the final assistant message, and returnsOk(checkpoint). Full tool-call history is retained, so resumption needs no heuristic — the model continues from the checkpoint on the next user message.build_deterministic_checkpoint(from the turn'sToolCallRecords) is the fallback if the summary call returns empty, guaranteeing the thread can never be left unterminated. Separately, an empty completion in the normal loop now returns an error rather thanOk("").LazyToolkitResolver::find_actionresolves exact → case-insensitive → unique normalized-slug (alphanumeric, lowercased); fabricated slugs stay unresolved and route to the existing "available tools" error so the model self-corrects.first_args_by_keys;nameresolution unchanged to avoid misreading a plain JSON answer as a tool call in the whole-response parse path.Scope note: the legacy
run_tool_call_loop(tool_loop.rs, superseded for chat dispatch and now test-only) still returnsMaxIterationsExceeded; only the live session-turn path changed.Submission Checklist
turn_tests.rs); arg-alias recovery (parse_tests.rs); near-miss resolve +normalize_slug(ops_tests.rs). The priorturn_errors_when_max_tool_iterations_are_exceededtest was updated to assert the new checkpoint behaviour.diff-coverenforces the exact gate.N/A: behaviour-only bug fix; no feature row added/removed/renamed.N/A: no matrix feature touched.N/A: no release-cut surface changed.Closes #NNN—N/A: no tracking issue filed; sourced from bug-report-2026-05-26 (private tester logs).Impact
Related
<tool_call>bodies (now logged viabody_chars) before any further parser widening.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/agent-tool-loop-resilience5f458c73🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests