Skip to content

fix(agent): resumable checkpoint at tool-call cap + tolerant tool parsing/resolution#2683

Merged
M3gA-Mind merged 6 commits into
tinyhumansai:mainfrom
sanil-23:fix/agent-tool-loop-resilience
May 26, 2026
Merged

fix(agent): resumable checkpoint at tool-call cap + tolerant tool parsing/resolution#2683
M3gA-Mind merged 6 commits into
tinyhumansai:mainfrom
sanil-23:fix/agent-tool-loop-resilience

Conversation

@sanil-23
Copy link
Copy Markdown
Contributor

@sanil-23 sanil-23 commented May 26, 2026

Summary

  • A1 (HIGH): at the tool-call iteration cap, the live chat turn now emits a resumable checkpoint (model-written "done so far / next steps", native tools disabled, streamed) as the final assistant message instead of aborting — the transcript stays well-formed and the user's next message resumes the task.
  • A1 guard: a 0-text / 0-tool-call completion is surfaced as an error instead of being accepted as a blank reply.
  • A2 (MED): LazyToolkitResolver tolerates near-miss tool slugs (case / separator drift) so integrations_agent stops dropping valid Gmail/Slides calls.
  • A3 (MED): parse_tool_call_value accepts argument-key aliases (args/parameters/input); tool name kept strict.

Problem

From tester logs (bug-report-2026-05-26, M-chip tester):

  • A1 — hitting max_tool_iterations (10) returned Err(MaxIterationsExceeded) after the transcript had been persisted mid-tool-cycle. The thread was left ending on an unterminated assistant tool-call; the next message replayed that history, the gateway returned an empty completion, the loop accepted it as the final answer (final_chars=0), and the thread silently wedged — every subsequent message returned blank. (In the log, thread-d6f29915: 4 healthy turns, then permanent silence.)
  • A2integrations_agent repeatedly logged tool not in allowlist for valid actions (7 occurrences) because the lazy resolver matched slugs exact-only; each drop burned an iteration, accelerating A1.
  • A3 — a malformed <tool_call> dropped the call and wasted an iteration.

Solution

  • A1 — on reaching the cap, turn.rs calls a new summarize_iteration_checkpoint (provider call with tools: None, streamed) to produce the checkpoint, persists it as the final assistant message, and returns Ok(checkpoint). Full tool-call history is retained, so resumption needs no heuristic — the model continues from the checkpoint on the next user message. build_deterministic_checkpoint (from the turn's ToolCallRecords) is the fallback if the summary call returns empty, guaranteeing the thread can never be left unterminated. Separately, an empty completion in the normal loop now returns an error rather than Ok("").
  • A2LazyToolkitResolver::find_action resolves exact → case-insensitive → unique normalized-slug (alphanumeric, lowercased); fabricated slugs stay unresolved and route to the existing "available tools" error so the model self-corrects.
  • A3 — argument-key aliases via first_args_by_keys; name resolution unchanged to avoid misreading a plain JSON answer as a tool call in the whole-response parse path.

Scope note: the legacy run_tool_call_loop (tool_loop.rs, superseded for chat dispatch and now test-only) still returns MaxIterationsExceeded; only the live session-turn path changed.

Submission Checklist

If a section does not apply, marked N/A with a one-line reason.

  • Tests added or updated (happy path + failure/edge) — checkpoint-at-cap + well-formed transcript, empty-response error, deterministic fallback (turn_tests.rs); arg-alias recovery (parse_tests.rs); near-miss resolve + normalize_slug (ops_tests.rs). The prior turn_errors_when_max_tool_iterations_are_exceeded test was updated to assert the new checkpoint behaviour.
  • Diff coverage ≥ 80% — added unit tests target the changed lines (checkpoint path, fallback, empty guard, resolver, arg-alias); CI diff-cover enforces the exact gate.
  • Coverage matrix updated — N/A: behaviour-only bug fix; no feature row added/removed/renamed.
  • All affected feature IDs listed under Related — N/A: no matrix feature touched.
  • No new external network dependencies — none added.
  • Manual smoke checklist updated — N/A: no release-cut surface changed.
  • Linked issue closed via Closes #NNNN/A: no tracking issue filed; sourced from bug-report-2026-05-26 (private tester logs).

Impact

  • Desktop chat: hitting the tool-call cap now yields a usable checkpoint and a resumable thread instead of a dead conversation. No API/schema changes, no migration. Rust-core only (no frontend changes).

Related

  • Closes: N/A (no linked issue; bug-report-2026-05-26 A1/A2/A3)
  • Follow-up PR(s)/TODOs: A3 — capture real malformed <tool_call> bodies (now logged via body_chars) before any further parser widening.

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/agent-tool-loop-resilience
  • Commit SHA: 5f458c73

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Tool-call argument keys now accept common aliases while generic aliases are rejected for bare top-level responses; malformed tool-call warnings include tag body length.
    • Agent turns that hit iteration caps now produce resumable assistant checkpoints (with deterministic fallback) and persist a final assistant entry instead of failing.
    • Empty provider responses with no tool calls now surface an error rather than returning empty text.
    • Tool name resolution tolerates common slug variations when unambiguous.
  • Tests

    • Added tests for argument-key aliasing, bare-response restrictions, slug normalization, checkpoint fallback, and empty-response errors.

Review Change Stack

…sing/resolution

Three agent-reliability bugs from tester logs (bug-report-2026-05-26):

A1 (HIGH): hitting the tool-call iteration cap aborted the turn, leaving the
transcript on an unterminated tool cycle — the next message then silently
returned empty and wedged the whole thread. The loop now emits a model-written
"done so far / next steps" checkpoint (native tools disabled, streamed) as the
final assistant message and keeps full tool history, so the user's next message
resumes naturally — no heuristic "continue" detection. Deterministic fallback if
the summary is empty, plus a guard that surfaces a 0-text/0-tool completion as an
error instead of a silent blank reply.

A2 (MED): integrations_agent dropped valid Gmail/Slides tools when the model
emitted a near-miss slug. LazyToolkitResolver now resolves exact ->
case-insensitive -> unique normalized-slug; fabricated slugs still route to the
"available tools" error so the model self-corrects.

A3 (MED): a tool_call with a correct name but an aliased argument key
(args/parameters/input) was dropped, wasting an iteration. parse_tool_call_value
now accepts argument-key aliases; the tool name stays strict to avoid
false-positive tool calls in the whole-response parse path.

Tests: checkpoint emitted at cap + transcript well-formed; empty-response error;
deterministic fallback; arg-alias recovery; near-miss resolve + normalize_slug.

Note: the legacy run_tool_call_loop (tool_loop.rs, superseded for chat dispatch)
still returns MaxIterationsExceeded; only the live session turn path is changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 requested a review from a team May 26, 2026 10:34
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Normalizes tool-call argument-key aliases while forbidding generic aliases for whole-response singletons, adds resumable assistant checkpoints at max-tool-iteration caps (with deterministic fallback and usage accounting), and implements fuzzy toolkit slug resolution. Tests added/updated across parsing, turn handling, and toolkit resolution.

Changes

Model Output Resilience

Layer / File(s) Summary
Argument key normalization for tool-call parsing
src/openhuman/agent/harness/parse.rs, src/openhuman/agent/harness/parse_tests.rs
Introduces prioritized argument key lookup and a helper to extract arguments from the first present alias (handles stringified JSON and object JSON). Adds alias-aware parsing (parse_tool_call_value_aliased) with an allow_arg_aliases flag, threads alias allowance through whole-response parsing (forbids generic aliases for bare top-level objects/arrays), and enhances malformed <tool_call> JSON warnings with body character counts. Tests verify aliased keys are recognized in intended contexts and that whole-response singleton parsing ignores generic aliases.
Turn safety checks and max-iterations checkpoint
src/openhuman/agent/harness/session/turn.rs, src/openhuman/agent/harness/session/turn_tests.rs, src/openhuman/agent/tests.rs
Adds defense against empty provider responses (text empty + no tool calls now errors). Replaces max-iteration abort with resumable checkpoint emission: appends a synthetic instruction, requests a prose-only checkpoint from the provider with native tools disabled and streaming enabled, forwards text deltas to progress sink, folds checkpoint usage into transcript accounting, falls back to a deterministic summary built from ToolCallRecords when the model returns empty, persists the checkpoint as the final assistant message, and returns the checkpoint text. Tests updated/added to verify checkpoint path, deterministic fallback, empty-response error, and checkpoint usage accounting.
Tool-name slug normalization and fuzzy resolution
src/openhuman/agent/harness/subagent_runner/ops.rs, src/openhuman/agent/harness/subagent_runner/ops_tests.rs
LazyToolkitResolver::find_action now tries exact, ASCII case-insensitive, and normalized (lowercase + non-alphanumeric stripped) matches for tool/action slugs; normalized matches only apply when unique. Adds normalize_slug, run digest recording for sub-agent runs, and tests ensuring near-miss slugs resolve while fabricated/ambiguous slugs remain unresolved.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Agent
  participant Provider
  participant Transcript
  Client->>Agent: request turn
  Agent->>Provider: model completions (may include tool calls)
  Provider-->>Agent: tool-call markup + text
  Agent->>Provider: summarize_iteration_checkpoint (tools: None)  -- when iteration cap hit
  Provider-->>Agent: streaming prose checkpoint
  Agent->>Transcript: persist assistant checkpoint and usage
  Agent-->>Client: return checkpoint text
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

working

Suggested reviewers

  • senamakel
  • M3gA-Mind

Poem

🐰 I tidy keys and tame stray slugs,
I stitch up loops when iteration hugs.
When models pause and tools are done,
I offer “Next steps” — the checkpoint won.
A tiny rabbit hums: all’s well, hop on.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically summarizes the main changes: adding resumable checkpoints at tool-call iteration caps and improving tool parsing/resolution tolerance.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug labels May 26, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
src/openhuman/agent/harness/subagent_runner/ops.rs (1)

213-258: 💤 Low value

Consider logging ambiguous normalized matches for diagnostics.

The logic correctly returns None when multiple actions normalize to the same slug (preventing mis-dispatch), but the silent fall-through at line 247 might make debugging harder if a toolkit has actions like read_file and ReadFile (both normalize to "readfile"). A debug-level log would aid diagnosis without changing behavior.

📊 Optional diagnostic log
             if let Some(action) = matches.next() {
                 if matches.next().is_none() {
                     tracing::info!(
                         requested = %name,
                         matched = %action.name,
                         "[subagent_runner] resolved tool by normalized-slug match"
                     );
                     return Some(action);
+                } else {
+                    tracing::debug!(
+                        requested = %name,
+                        norm = %norm,
+                        "[subagent_runner] ambiguous normalized-slug match; multiple actions resolve to same slug"
+                    );
                 }
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/subagent_runner/ops.rs` around lines 213 - 258,
In find_action, when normalizing the requested name into norm and iterating
matches (the iterator bound to matches), add a debug-level log when multiple
actions share the same normalized slug so ambiguous normalized matches are
visible; specifically, detect the branch where matches.next() yields an action
but a subsequent matches.next().is_some() is true (the ambiguity case) and emit
a tracing::debug (or tracing::info at debug level) with requested = %name, norm
= %norm, and known_slugs or the colliding action names to aid diagnosis without
changing the return behavior.
src/openhuman/agent/harness/parse.rs (1)

412-418: ⚡ Quick win

Prefix this warning consistently.

New error-path logs in Rust are expected to use a stable grep-friendly prefix; this message will be harder to aggregate next to the existing bracketed log families. As per coding guidelines "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with verbose diagnostics using stable grep-friendly prefixes and correlation fields".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/parse.rs` around lines 412 - 418, The
tracing::warn message in parse.rs that reports malformed <tool_call> JSON should
be changed to include a stable, grep-friendly prefix (e.g.
"[parse:malformed-tool-call]" or similar) at the start of the log string so it
aligns with other structured logs; update the tracing::warn call in the function
that emits this warning (the block that currently logs "Malformed <tool_call>
JSON: expected tool-call object in tag body (body_chars={})" and uses
inner.chars().count()) to prepend the chosen prefix while keeping the existing
diagnostic text and the body_chars value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/agent/harness/parse.rs`:
- Around line 79-80: The singleton top-level-object path in parse_tool_calls is
incorrectly using first_args_by_keys(value) which applies broad alias widening
(e.g., "input") and causes normal replies to be interpreted as tool calls;
instead, for the singleton whole-response fallback (the branch that constructs
Some(ParsedToolCall { ... } from value) only accept a canonical "arguments"
field or apply a strict key check (e.g., require the exact "arguments" key or a
narrow whitelist) rather than calling first_args_by_keys; update the code around
the ParsedToolCall construction to only extract arguments from
value.get("arguments") (or validate explicit canonical keys) and leave
first_args_by_keys usage limited to tagged/tool_calls entries.

In `@src/openhuman/agent/harness/session/turn.rs`:
- Around line 1011-1052: The checkpoint path currently calls
summarize_iteration_checkpoint() to get checkpoint text but ignores any
token/usage info, causing undercounting; change summarize_iteration_checkpoint
(and its other occurrence around the 1991-2073 range) to return both the
checkpoint string and a UsageSummary (or similar struct), then when checkpoint
is produced (or when falling back to build_deterministic_checkpoint) fold that
returned usage into cumulative_input_tokens, cumulative_output_tokens,
cumulative_cached_input_tokens, cumulative_charged_usd and merge into
last_turn_usage before calling persist_session_transcript; update the call sites
that push ChatMessage::assistant(checkpoint.clone()) and
persist_session_transcript to use the adjusted cumulative values and updated
last_turn_usage so the checkpoint provider’s usage is included in the saved
transcript.
- Around line 2057-2066: The current match in turn.rs returns resp.text when
parse_response yields empty prose, which can re-emit raw <tool_call> markup;
update the branch after calling self.tool_dispatcher.parse_response(&resp) so
that if text.trim().is_empty() you return an empty string (or only fall back to
resp.text after verifying it contains no tool-call markup, e.g., no "<tool_call"
substring). Modify the block around tool_dispatcher.parse_response, the Ok(resp)
match arm, and the conditional that selects between text and resp.text to ensure
parsed-empty responses do not persist raw tool-call XML.

---

Nitpick comments:
In `@src/openhuman/agent/harness/parse.rs`:
- Around line 412-418: The tracing::warn message in parse.rs that reports
malformed <tool_call> JSON should be changed to include a stable, grep-friendly
prefix (e.g. "[parse:malformed-tool-call]" or similar) at the start of the log
string so it aligns with other structured logs; update the tracing::warn call in
the function that emits this warning (the block that currently logs "Malformed
<tool_call> JSON: expected tool-call object in tag body (body_chars={})" and
uses inner.chars().count()) to prepend the chosen prefix while keeping the
existing diagnostic text and the body_chars value.

In `@src/openhuman/agent/harness/subagent_runner/ops.rs`:
- Around line 213-258: In find_action, when normalizing the requested name into
norm and iterating matches (the iterator bound to matches), add a debug-level
log when multiple actions share the same normalized slug so ambiguous normalized
matches are visible; specifically, detect the branch where matches.next() yields
an action but a subsequent matches.next().is_some() is true (the ambiguity case)
and emit a tracing::debug (or tracing::info at debug level) with requested =
%name, norm = %norm, and known_slugs or the colliding action names to aid
diagnosis without changing the return behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f0acf74-04da-456f-bdaf-5e511c818847

📥 Commits

Reviewing files that changed from the base of the PR and between 844a26c and 5f458c7.

📒 Files selected for processing (6)
  • src/openhuman/agent/harness/parse.rs
  • src/openhuman/agent/harness/parse_tests.rs
  • src/openhuman/agent/harness/session/turn.rs
  • src/openhuman/agent/harness/session/turn_tests.rs
  • src/openhuman/agent/harness/subagent_runner/ops.rs
  • src/openhuman/agent/harness/subagent_runner/ops_tests.rs

Comment thread src/openhuman/agent/harness/parse.rs Outdated
Comment thread src/openhuman/agent/harness/session/turn.rs Outdated
Comment thread src/openhuman/agent/harness/session/turn.rs Outdated
sanil-23 and others added 2 commits May 26, 2026 13:03
CodeRabbit review fixes:
- parse.rs: don't apply generic arg-key aliases (notably the very generic
  'input') on the singleton whole-response path. A bare {"name":..,"input":..}
  model reply must not be dressed up as a tool call with arbitrary args routed
  to execution. Alias widening stays for tagged contexts (tool_calls array,
  <tool_call>/<invoke> tags, function-wrapped) via a threaded allow_arg_aliases
  flag; the whole-response parse passes false. New regression test.
- turn.rs: summarize_iteration_checkpoint now returns its provider usage so the
  caller folds it into cumulative_input/output/cached tokens, charged_usd and
  last_turn_usage (mirrors the normal final-response accounting). The extra
  checkpoint call no longer undercounts the turn / mis-attributes prior usage.
- turn.rs: when parse_response strips tool-call markup and leaves no prose,
  return empty (-> deterministic fallback) instead of re-emitting resp.text,
  which could persist raw <tool_call> markup verbatim as the checkpoint.

CI test reconciliation (agent/tests.rs): the resumable-checkpoint + empty-as-
error behaviour from this PR changed three pre-existing contracts. Updated:
- turn_bails_out_at_max_iterations -> turn_emits_checkpoint_at_max_iterations
  (asserts Ok(checkpoint) + well-formed transcript tail, not Err).
- turn_handles_empty/none_text_response -> turn_errors_on_empty/none_text_response
  (assert the empty-response error, matching the A1 guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agnostics (tinyhumansai#2683)

- parse.rs: give the malformed-<tool_call> warn a stable grep-friendly
  '[agent_parse]' prefix and structured body_chars field, matching the
  other harness log families (per CLAUDE.md logging conventions).
- subagent_runner/ops.rs: log a debug line when 2+ actions normalize to
  the same slug in find_action — the ambiguous case deliberately refuses
  to guess, so surface the collision for diagnosis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23
Copy link
Copy Markdown
Contributor Author

Pushed e7c55c9c addressing the 2 nitpick comments from the last review:

  • subagent_runner/ops.rsfind_action now emits a tracing::debug line when 2+ actions normalize to the same slug (the ambiguous case that deliberately refuses to guess), so the collision is diagnosable.
  • parse.rs — the malformed-<tool_call> warning now uses the stable grep-friendly [agent_parse] prefix with a structured body_chars field, consistent with the other harness log families.

The 3 actionable comments were addressed in a28523a0 (replied inline): singleton-path alias restriction, checkpoint usage accounting, and not re-emitting raw tool-call markup as the checkpoint.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/openhuman/agent/harness/session/turn.rs (1)

2071-2109: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add success-path diagnostics around the checkpoint provider call.

This adds a second provider.chat(...) on the max-iteration path, but only failures are logged. When the checkpoint comes back empty or markup-only, there’s no grep-friendly trace of the external call itself, which makes capped-turn debugging much harder.

📝 Example logging shape
+        let checkpoint_started = std::time::Instant::now();
+        log::info!(
+            "[agent_loop] checkpoint summary request iteration={} model={} messages={}",
+            iteration_for_stream,
+            effective_model,
+            messages.len()
+        );
         let result = self
             .provider
             .chat(
                 ChatRequest {
                     messages: &messages,
@@
         match result {
             Ok(resp) => {
+                log::info!(
+                    "[agent_loop] checkpoint summary response iteration={} elapsed_ms={} text_chars={} native_tool_calls={} usage_present={}",
+                    iteration_for_stream,
+                    checkpoint_started.elapsed().as_millis(),
+                    resp.text.as_ref().map_or(0, |t| t.chars().count()),
+                    resp.tool_calls.len(),
+                    resp.usage.is_some()
+                );
                 let usage = resp.usage.clone();
                 // Strip any stray tool-call XML a text-mode model may have
                 // emitted; keep only the prose.

As per coding guidelines **/*.rs: Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with verbose diagnostics using stable grep-friendly prefixes and correlation fields.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/session/turn.rs` around lines 2071 - 2109, Wrap
the successful provider.chat call (the call that produces resp used to compute
checkpoint) with grep-friendly entry/exit logs and a branch log when checkpoint
is empty or markup-only: before calling provider.chat log an entry like
"CHECKPOINT_CALL:start" with correlation fields (turn/session/request id from
self if available), model (effective_model) and temperature; after it returns
log "CHECKPOINT_CALL:ok" with resp.usage, resp.text (or resp.text.is_some()),
the parsed (tool_dispatcher.parse_response) text and calls, and the derived
checkpoint value; if checkpoint is empty or was dropped due to markup-only emit
"CHECKPOINT_CALL:empty_or_markup_only" including resp.raw/text/calls/usage to
aid debugging. Use existing logging facility in this module and keep messages
stable and grep-friendly.
🧹 Nitpick comments (1)
src/openhuman/agent/tests.rs (1)

443-490: ⚡ Quick win

Pin the cap with an execution-count assertion.

This proves a checkpoint is eventually returned, but it would still pass if the loop executed max_tool_iterations + 1 tools before checkpointing. Using CountingTool here and asserting the count is exactly max_iters would lock down the off-by-one behavior this regression is meant to guard.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/tests.rs` around lines 443 - 490, Update the test
turn_emits_checkpoint_at_max_iterations to use the CountingTool instead of
EchoTool and assert its execution count equals max_iters; specifically, create
and pass a CountingTool instance in the build_agent_with_config tools vector
(replace Box::new(EchoTool) with the CountingTool instance), run agent.turn as
before, then after the reply add an assertion that the CountingTool's recorded
count (e.g., counting_tool.count() or its public counter) is exactly max_iters
to pin the off-by-one behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/agent/harness/parse.rs`:
- Around line 100-106: The fallback branch in parse_tool_calls currently treats
any top-level JSON as a tool call when allow_arg_aliases is false; change the
logic in the block that calls parse_arguments_value(value.get("arguments")) so
it first checks that the canonical "arguments" key is actually present (e.g.,
value.get("arguments").is_some() or value.contains_key("arguments")) before
constructing a ParsedToolCall, returning None/plain-text otherwise; apply the
same presence check to the similar fallback at the other occurrence (lines
~415-420) so only responses with an explicit "arguments" key become tool calls.

---

Outside diff comments:
In `@src/openhuman/agent/harness/session/turn.rs`:
- Around line 2071-2109: Wrap the successful provider.chat call (the call that
produces resp used to compute checkpoint) with grep-friendly entry/exit logs and
a branch log when checkpoint is empty or markup-only: before calling
provider.chat log an entry like "CHECKPOINT_CALL:start" with correlation fields
(turn/session/request id from self if available), model (effective_model) and
temperature; after it returns log "CHECKPOINT_CALL:ok" with resp.usage,
resp.text (or resp.text.is_some()), the parsed (tool_dispatcher.parse_response)
text and calls, and the derived checkpoint value; if checkpoint is empty or was
dropped due to markup-only emit "CHECKPOINT_CALL:empty_or_markup_only" including
resp.raw/text/calls/usage to aid debugging. Use existing logging facility in
this module and keep messages stable and grep-friendly.

---

Nitpick comments:
In `@src/openhuman/agent/tests.rs`:
- Around line 443-490: Update the test turn_emits_checkpoint_at_max_iterations
to use the CountingTool instead of EchoTool and assert its execution count
equals max_iters; specifically, create and pass a CountingTool instance in the
build_agent_with_config tools vector (replace Box::new(EchoTool) with the
CountingTool instance), run agent.turn as before, then after the reply add an
assertion that the CountingTool's recorded count (e.g., counting_tool.count() or
its public counter) is exactly max_iters to pin the off-by-one behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 49ab3ba0-6f40-434f-aec3-ee3957d14bbb

📥 Commits

Reviewing files that changed from the base of the PR and between 5f458c7 and a28523a.

📒 Files selected for processing (5)
  • src/openhuman/agent/harness/parse.rs
  • src/openhuman/agent/harness/parse_tests.rs
  • src/openhuman/agent/harness/session/turn.rs
  • src/openhuman/agent/harness/session/turn_tests.rs
  • src/openhuman/agent/tests.rs

Comment thread src/openhuman/agent/harness/parse.rs
… parse (tinyhumansai#2683)

Address CodeRabbit (major): on the bare whole-response JSON path
(`parse_tool_call_value_aliased(.., false)`), a plain reply like
`{"name":"Alice","input":{…}}` still returned a `ParsedToolCall` with empty
arguments — so a normal JSON answer whose `name` happens to match a tool
could be dispatched. Require the canonical `arguments` key as an explicit
tool-call marker on that path; without it, return None (plain text). Tagged
contexts (`<tool_call>`/`<invoke>`, `tool_calls` array, `function` wrapper)
reach the parser with aliases enabled and keep their permissive behaviour.

Test updated: a bare `{"name","input"}` whole-response object is no longer a
tool call; a bare object WITH canonical `arguments` still is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 26, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 26, 2026
…ckpoint (tinyhumansai#2683)

Main agent: the max-iteration checkpoint now summarizes ONLY this turn's work
— a trim-robust digest of its tool calls + results compiled as the loop runs —
instead of the full conversation (no prior turns, no system prompt). The
digest is decoupled from persistence: the transcript is still saved with the
full conversation, so the persisted record is unchanged. The digest is immune
to history trimming (which drops/reorders from the front).

Sub-agent (run_inner_loop): on the iteration cap, return a graceful
partial-progress checkpoint (model-written over the run's tool-call digest,
tools disabled, with a deterministic fallback) as the result instead of
Err(MaxIterationsExceeded). The delegating agent now receives what the
sub-agent accomplished and can continue, rather than a bare "delegate failed".

Tests: sub-agent returns a checkpoint at the cap (model-written + deterministic
fallback); existing main-agent checkpoint tests still pass under turn-scoping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/openhuman/agent/harness/subagent_runner/ops.rs (1)

1870-1929: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist the synthesized checkpoint before returning.

This branch folds the summary-call usage into usage, but it returns without pushing checkpoint into history or re-running persist_transcript(). The sub-agent transcript therefore still ends on the last tool-result turn, the checkpoint call's tokens/cost never reach disk, and worker_thread_id runs miss the final agent message for this path. Please mirror the normal terminal-response path here before Ok(...).

Suggested fix
     let checkpoint = match provider
         .chat(
             ChatRequest {
                 messages: &summary_input,
                 tools: None,
@@
         }
     };
+    history.push(ChatMessage::assistant(checkpoint.clone()));
+    append_worker_message(
+        checkpoint.clone(),
+        "agent".to_string(),
+        serde_json::json!({
+            "scope": "worker_thread",
+            "agent_id": agent_id,
+            "task_id": task_id,
+            "iteration": max_iterations,
+            "final": true,
+            "checkpoint": true,
+        }),
+    );
+    persist_transcript(history, &usage);
     Ok((checkpoint, max_iterations, usage))
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/subagent_runner/ops.rs` around lines 1870 - 1929,
The checkpoint branch currently returns after synthesizing `checkpoint` (from
the `provider.chat(...).await` block) and updating `usage`, but it never append
the final checkpoint message into the sub-agent `history` nor calls
`persist_transcript()` (so the tokens/cost and final message are not saved);
update this path to mirror the normal terminal-response flow by creating the
final ChatMessage/turn containing `checkpoint` (use the same message shape used
elsewhere), push it into `history`, call `persist_transcript()` (and any code
that sets `worker_thread_id`/final metadata) before returning `Ok((checkpoint,
max_iterations, usage))`, ensuring the `usage` updates are preserved and any
errors from persisting are handled the same way as the regular terminal-response
path.
🧹 Nitpick comments (1)
src/openhuman/agent/harness/session/turn.rs (1)

2100-2142: ⚡ Quick win

Add start/success logs around the checkpoint summary call.

This helper adds a new provider round-trip on the turn hot path, but only the failure path is logged. Please emit request/response diagnostics here with the same stable prefix/correlation fields as the main loop so checkpoint stalls, empty-summary fallbacks, and provider latency are traceable.

As per coding guidelines, "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with verbose diagnostics using stable grep-friendly prefixes and correlation fields".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/agent/harness/session/turn.rs` around lines 2100 - 2142, Before
calling self.provider.chat in the checkpoint summary block, add a stable,
grep-friendly entry log (e.g., "[agent_loop::checkpoint] request" plus the same
correlation fields used by the main loop) that records the outgoing request
diagnostics (effective_model, temperature, tools present, and a safe summary of
messages); on Ok(resp) add a matching success log (e.g.,
"[agent_loop::checkpoint] response") that records response diagnostics
(resp.text presence/length, resp.usage, parsed calls count and the checkpoint
decision) so both request/response and latency are traceable; use the existing
identifiers (self.provider.chat, ChatRequest, delta_tx_opt, result, resp,
self.tool_dispatcher.parse_response) to locate where to insert the logs and
ensure the log lines mirror the main loop's stable prefix/correlation field
format.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/openhuman/agent/harness/subagent_runner/ops.rs`:
- Around line 1870-1929: The checkpoint branch currently returns after
synthesizing `checkpoint` (from the `provider.chat(...).await` block) and
updating `usage`, but it never append the final checkpoint message into the
sub-agent `history` nor calls `persist_transcript()` (so the tokens/cost and
final message are not saved); update this path to mirror the normal
terminal-response flow by creating the final ChatMessage/turn containing
`checkpoint` (use the same message shape used elsewhere), push it into
`history`, call `persist_transcript()` (and any code that sets
`worker_thread_id`/final metadata) before returning `Ok((checkpoint,
max_iterations, usage))`, ensuring the `usage` updates are preserved and any
errors from persisting are handled the same way as the regular terminal-response
path.

---

Nitpick comments:
In `@src/openhuman/agent/harness/session/turn.rs`:
- Around line 2100-2142: Before calling self.provider.chat in the checkpoint
summary block, add a stable, grep-friendly entry log (e.g.,
"[agent_loop::checkpoint] request" plus the same correlation fields used by the
main loop) that records the outgoing request diagnostics (effective_model,
temperature, tools present, and a safe summary of messages); on Ok(resp) add a
matching success log (e.g., "[agent_loop::checkpoint] response") that records
response diagnostics (resp.text presence/length, resp.usage, parsed calls count
and the checkpoint decision) so both request/response and latency are traceable;
use the existing identifiers (self.provider.chat, ChatRequest, delta_tx_opt,
result, resp, self.tool_dispatcher.parse_response) to locate where to insert the
logs and ensure the log lines mirror the main loop's stable prefix/correlation
field format.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8ce5691a-5506-4086-a35b-c45f90a7a8ae

📥 Commits

Reviewing files that changed from the base of the PR and between 879a970 and 8fb981f.

📒 Files selected for processing (3)
  • src/openhuman/agent/harness/session/turn.rs
  • src/openhuman/agent/harness/subagent_runner/ops.rs
  • src/openhuman/agent/harness/subagent_runner/ops_tests.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 26, 2026
Copy link
Copy Markdown
Contributor

@M3gA-Mind M3gA-Mind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Summary: Well-engineered fix for the three production bugs from bug-report-2026-05-26. The two-layer safety net on A1 (model checkpoint → deterministic fallback) is the right design and the test coverage is thorough — happy path, empty-summary fallback, and usage-accounting regression guard are all covered. A2's uniqueness check on normalized slugs is correctly conservative. A3's allow_arg_aliases boolean is a clean lever.

A few inline notes below — main asks before merge:

  • Promote the ambiguous-slug-collision log to warn (it's a configuration anomaly, not routine noise)
  • Add a comment at the checkpoint persist_session_transcript call noting the last iteration's tool results are intentionally absent from the base messages
  • Note the subagent path intentionally skips transcript persist (so a future reader doesn't 'fix' it)

Everything else is minor / informational.

// the prior conversation or system prompt bleeding in — and it's
// immune to history trimming (which drops/reorders from the front).
// The persisted transcript is unaffected (bug-report-2026-05-26 A1).
let mut turn_tool_digest = String::new();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turn_tool_digest is unbounded — each entry is capped at 800 chars, so with the default cap of 10 iterations the practical max is ~8 KB. Fine as-is, but worth a short comment so a future increase to max_tool_iterations comes with an obvious prompt to revisit this.

// conversation. `base_messages` above is still used for the
// transcript persist below, so the saved transcript is unchanged
// (bug-report-2026-05-26 A1).
let turn_summary_input = vec![ChatMessage::user(format!(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user_message here is captured from the outer turn(&mut self, message: &str) parameter — correct, but implicit enough to confuse a future reader. A one-line comment like // user_message captured from turn() parameter would help.

// well-formed assistant message (never a dangling tool cycle).
let mut checkpoint_messages = base_messages;
checkpoint_messages.push(ChatMessage::assistant(checkpoint.clone()));
self.persist_session_transcript(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment here: base_messages (and therefore the persisted transcript) does NOT include the tool results from the final capped iteration — those were processed after the last last_provider_messages snapshot and before the cap fired. The transcript is structurally correct (ends on an assistant message) and the checkpoint prose covers the work done, but the omission is non-obvious to a future reader. Something like: // base_messages ends before the final iteration's tool results; checkpoint prose covers what the model did.

tracing::debug!(
requested = %name,
norm = %norm,
"[subagent_runner] ambiguous normalized-slug match — multiple actions resolve to the same slug; not resolving"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be tracing::warn! rather than tracing::debug!. Two registered actions normalizing to the same slug is a configuration anomaly (silent ambiguity in the toolkit), not routine noise — it won't surface in production unless someone is actively tailing debug logs. At warn level it would show up in normal operator log streams and be diagnosable without special filtering.

deterministic
}
};
Ok((checkpoint, max_iterations, usage))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike the main-agent path, the subagent checkpoint is intentionally NOT persisted to a transcript here — the parent agent's transcript captures the delegated result, so there's no data loss. Worth a short comment to that effect so a future reader doesn't treat the missing persist_subagent_transcript call as a bug to fix.

- subagent_runner: promote the ambiguous normalized-slug collision log
  debug→warn — a toolkit config anomaly should surface in normal operator
  logs, not stay behind debug filtering.
- turn.rs: document the turn_tool_digest size bound (~8 KB at the default
  10-iteration cap; revisit if raised), that `user_message` is the turn()
  parameter, and that base_messages / the persisted transcript intentionally
  omits the final capped iteration's tool results (checkpoint prose covers
  them; transcript stays structurally correct).
- subagent_runner: note the sub-agent checkpoint is intentionally NOT
  transcript-persisted (the parent's transcript captures the delegated result).

All from M3gA-Mind's non-blocking review notes; no behavior change beyond the
log level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23
Copy link
Copy Markdown
Contributor Author

Thanks @M3gA-Mind — all addressed in 6e545f2:

  • ops.rs slug collision → promoted debugtracing::warn! (toolkit config anomaly; now surfaces in normal operator logs).
  • turn.rs turn_tool_digest → comment noting the ~8 KB bound at the default 10-iteration cap (revisit if max_tool_iterations is raised).
  • turn.rs user_message → comment noting it is the turn(&mut self, message) parameter.
  • turn.rs persist → comment noting base_messages (and thus the persisted transcript) intentionally omits the final capped iteration's tool results — the checkpoint prose covers them and the transcript stays structurally correct.
  • subagent checkpoint → comment noting it is intentionally not transcript-persisted (the parent agent's transcript captures the delegated result), so it isn't mistaken for a missing-persist bug.

No behavior change beyond the log level.

@M3gA-Mind
Copy link
Copy Markdown
Contributor

All good — reviewed and approved. CI failures are systemic (main is also red), not caused by this PR. Ready to merge once the infrastructure recovers.

@M3gA-Mind M3gA-Mind merged commit 6736467 into tinyhumansai:main May 26, 2026
10 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Built-in agents, prompts, orchestration, and agent runtime in src/openhuman/agent/. bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants