Skip to content

fix(cuga_lite): cap bind_tools count and shortlist over the limit#203

Open
haroldship wants to merge 8 commits into
mainfrom
fix/bind-tools-cap-shortlist-202
Open

fix(cuga_lite): cap bind_tools count and shortlist over the limit#203
haroldship wants to merge 8 commits into
mainfrom
fix/bind-tools-cap-shortlist-202

Conversation

@haroldship
Copy link
Copy Markdown
Collaborator

@haroldship haroldship commented May 11, 2026

Summary

  • Adds a provider-safe cap on the number of tools sent to LLM.bind_tools() from cuga_lite, fixing a hard 400 from Groq/OpenAI when the candidate set exceeds the provider's per-request limit (e.g. m3 hockey: 206 tools, mode=all).
  • When the cap is exceeded, the existing LLM shortlister (PromptUtils.find_tools machinery) ranks the candidates against the first user message and the top-K tools are bound (reserving 1 slot for find_tools when include_find_tools=True).
  • When shortlisting is impossible (no query, shortlister failure, empty ranking, or hallucinated names), raise an actionable RuntimeError. Silent truncation would corrupt benchmark results comparing native tool-calling vs text-mode — failing loudly is intentional.

Closes #202

New settings

advanced_features.cuga_lite_bind_tools_max_count

  • Default: 128 (matches Groq/OpenAI).
  • Set to 0 (or negative) to disable the cap for permissive backends (WatsonX, Anthropic via LiteLLM) — also avoids the shortlister LLM round-trip entirely.
  • Override via DYNACONF_ADVANCED_FEATURES__CUGA_LITE_BIND_TOOLS_MAX_COUNT.

advanced_features.cuga_lite_bind_tools_pad_to_cap

  • Default: false. When the shortlister returns fewer tools than the cap allows, the default behavior is to bind only those tools (often 1–4 on the existing system prompt).
  • Set to true to pad the shortlister output with the remaining candidates so the cap is fully utilized. Useful for research scenarios that explicitly want mode=all to bind as many tools as the provider will accept.
  • Off by default because cuga_lite is a code-execution agent: padding pushes the model toward native tool_calls mode, which the code-mode flow doesn't fully exercise (measured on m3 hockey: 0 tool calls with padding vs 5–7 without).
  • Override via DYNACONF_ADVANCED_FEATURES__CUGA_LITE_BIND_TOOLS_PAD_TO_CAP.

Operational notes

  • Latency / cost: When the candidate count exceeds max_count, applying the cap incurs one extra LLM round-trip (the shortlister) per call_model invocation. To avoid the round-trip on permissive backends, set cuga_lite_bind_tools_max_count=0.
  • Loud failures (intentional): When shortlisting cannot run safely — empty user query, shortlister exception, empty ranking, or hallucinated tool names that don't match any candidate — a RuntimeError is raised rather than silently truncating. Benchmark/research runs comparing native tool-calling vs text-mode therefore can't degrade silently; downstream callers should expect to see this error and either widen the cap, fix the shortlister LLM, or refine the user query.

Where the change lives

  • src/cuga/backend/cuga_graph/nodes/cuga_lite/bind_tools/ (new subpackage)
    • cap.pybind_tools_max_count_from_settings, bind_tools_pad_to_cap_from_settings, apply_bind_tools_cap_and_merge, plus private helpers (_resolve_find_tools_overlay, _build_ranking_pool, _run_shortlister, _materialize_shortlist, _maybe_pad_to_cap).
    • __init__.py — re-exports the three public names used by cuga_lite_graph.py.
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
    • resolve_model_with_bind_tools stays as orchestration — it imports from bind_tools and uses a local _cap_merge_bound(bound) closure to DRY the four mode={all,apps,tools,apps_and_tools} call sites.
    • resolve_model_with_bind_tools gains an optional query: Optional[str] kwarg.
    • The top-level except Exception no longer swallows the actionable RuntimeError from the cap helper.
    • call_model passes _first_user_message_text(state.chat_messages) through as query.
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
    • New PromptUtils.shortlist_tool_names(query, all_tools, all_apps, llm, top_k, instructions) — returns ranked tool names. Internally reuses the existing ShortListerOutputLite chain (./prompts/shortlister/system.jinja2); injects a top_k instruction.
    • New private PromptUtils._build_shortlister_payload(...) — shared payload builder between find_tools and shortlist_tool_names so the two callers can't drift.
  • src/cuga/config.py
    • New validators: cuga_lite_bind_tools_max_count (default 128) and cuga_lite_bind_tools_pad_to_cap (default False).
  • src/cuga/settings.toml
    • Documents both new settings inline with the existing cuga_lite_bind_tools_* block.

Backward compatibility

  • Users with ≤128 tools see no behavior change (cap path is a no-op, no shortlister call).
  • Users previously hitting the Groq 400 now get a working shortlisted bind.
  • Users explicitly wanting old behavior on permissive backends: set CUGA_LITE_BIND_TOOLS_MAX_COUNT=0.
  • Users explicitly wanting padding on top of shortlisting: set CUGA_LITE_BIND_TOOLS_PAD_TO_CAP=true.

Test plan

  • pytest tests/unit/test_cuga_lite_bind_tools.py -q — 18 passed (existing + new):
    • shortlist runs when over cap (mode=all)
    • actionable RuntimeError raised when over cap and no query
    • actionable RuntimeError raised when shortlister hallucinates names
    • no shortlist call when under cap
    • cap=0 disables the cap and binds everything
    • include_find_tools reserves a slot (top_k = cap − 1) and find_tools is appended after shortlisting
    • pad_to_cap=true pads with remaining candidates up to the cap
  • ruff check + ruff format --check on all touched files: clean
  • End-to-end m3 hockey eval (./scripts/eval.sh --benchmark m3 --capability m3_task_2 --domain hockey --max-samples 1 with CUGA_LITE_BIND_TOOLS_MODE=all) — to be exercised by reporter

Summary by CodeRabbit

  • New Features

    • Provider-safe configurable cap on bound tools (default 128, 0 disables).
    • LLM-driven shortlisting of top-K tools when candidates exceed the cap; can reserve and always include a discovery tool.
    • Shortlisting uses the initial user message for relevance and supports optional padding to fill to the cap.
  • Bug Fixes

    • Replaces silent truncation with clear errors when safe shortlisting cannot run.
  • Tests

    • Expanded tests for cap behavior, shortlisting, padding, discovery-tool inclusion, and shortlister failures.

Review Change Stack

bind_tools cap + pad_to_cap cross-provider verification

m3 hockey task (task_2, --max-samples 1) with CUGA_LITE_BIND_TOOLS_MODE=all, cap=128. pad_to_cap per m3.env at run time. Updated by scripts/m3_pad_to_cap_verify.sh.

provider ran no exceptions cap path exercised padded pass/total tool_calls duration cuga-agent HEAD when (UTC)
groq yes yes yes yes (127 padded → 128 bound) 0 / 1 (0.0%) 7/1 47.9s 6af34316 2026-05-11T16:24Z
litellm yes yes yes yes (127 padded → 128 bound) 0 / 1 (0.0%) 0/1 -- acdbf18b 2026-05-11T16:43Z
watsonx yes yes yes yes (127 padded → 128 bound) 0 / 1 (0.0%) 1/1 184.2s acdbf18b 2026-05-11T17:14Z

`resolve_model_with_bind_tools` blindly passed every candidate tool to
`LLM.bind_tools()`, which fails hard on Groq/OpenAI (max 128 tools) for
benchmarks like m3-hockey (206 tools).

Add `cuga_lite_bind_tools_max_count` (default 128). When the candidate list
exceeds the cap, run the existing LLM shortlister against the first user
message and bind the top-K most relevant tools (reserving 1 slot for
find_tools when `include_find_tools=True`). When shortlisting is impossible
(no query / shortlister failure / empty ranking), raise an actionable
RuntimeError naming the count, cap, and remedy — silent truncation would
corrupt research comparing native tool-calling against text-mode.

WatsonX / LiteLLM-to-Anthropic users on permissive backends can disable the
cap via `DYNACONF_ADVANCED_FEATURES__CUGA_LITE_BIND_TOOLS_MAX_COUNT=0`.

Closes #202

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a configurable provider-safe cap on native tool binding (default 128). When candidates exceed the cap, an LLM shortlister ranks tools using the first user message as query to pick a top-K subset; the code reserves/merges find_tools when configured and raises errors when safe shortlisting cannot run.

Changes

Tool Binding Cap & LLM Shortlisting

Layer / File(s) Summary
Configuration Schema
src/cuga/config.py
Dynaconf validator adds advanced_features.cuga_lite_bind_tools_max_count with default value of 128.
Tool Shortlisting Utility
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
New PromptUtils._build_shortlister_payload serializer and PromptUtils.shortlist_tool_names async method rank tools for a query using the shortlister LLM chain, returning deduplicated top-K tool names with support for custom instructions and optional LLM override. find_tools now reuses the shared payload builder.
Settings Reader Helpers
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
_bind_tools_max_count_from_settings() and _bind_tools_pad_to_cap_from_settings() read cap and padding opt-in from settings with defensive parsing/fallbacks.
Capped Tool Binding & Merge
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
_apply_bind_tools_cap_and_merge(...) enforces max_count, invokes PromptUtils.shortlist_tool_names when over-cap (reserving capacity for find_tools if enabled), optionally pads to the cap, merges find_tools, and raises RuntimeError on unsafe shortlisting.
resolve_model_with_bind_tools Signature & Doc
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
Function signature extended to accept optional query: Optional[str] parameter; max_count computed via settings helper; internal logic delegates capping/merging to _apply_bind_tools_cap_and_merge.
RuntimeError Re-raising
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
Exception handler in resolve_model_with_bind_tools explicitly re-raises RuntimeError from capping/shortlisting logic so cap failures propagate as actionable errors; other exceptions continue to be warned and fallbacked.
call_model Node Integration
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
call_model node now extracts first user message text via _first_user_message_text() and passes it as query argument to resolve_model_with_bind_tools, enabling shortlisting context when cap is exceeded.
Unit Tests & Imports
tests/unit/test_cuga_lite_bind_tools.py
New async unit tests validate over-cap shortlisting, over-cap without query raising, under-cap behaviour, disabled-cap binding-all, padding opt-in behavior, and reserved slot for find_tools. Import consolidation includes patch and other mocks.

Sequence Diagram(s)

sequenceDiagram
  participant CallModel
  participant ResolveModel as resolve_model_with_bind_tools
  participant ApplyCap as _apply_bind_tools_cap_and_merge
  participant Shortlist as PromptUtils.shortlist_tool_names
  participant LLM
  participant Model
  CallModel->>ResolveModel: call with query (first user message)
  ResolveModel->>ApplyCap: candidate_tools, max_count, query
  alt candidates exceed cap
    ApplyCap->>Shortlist: query, all_tools, all_apps, top_k
    Shortlist->>LLM: shortlister prompt + serialized payload
    LLM-->>Shortlist: ranked tool names
    Shortlist-->>ApplyCap: top-K tool names
    ApplyCap->>ApplyCap: reserve slot for find_tools, pad to cap if enabled
    ApplyCap-->>ResolveModel: capped + merged tool list
  else candidates within cap
    ApplyCap-->>ResolveModel: all tools
  end
  ResolveModel->>Model: bind_tools(selected_tools)
  Model-->>ResolveModel: bound model
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • sami-marreed
  • gjt-prog

Poem

🐰 I’m a rabbit with a ranking hat,
When tools overflow I pick just that,
A query guides which names to keep,
I merge find_tools and guard the heap,
Provider-safe and snug in my lap.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main change: adding a provider-safe cap on bind_tools count and shortlisting tools when the cap is exceeded.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bind-tools-cap-shortlist-202

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py (1)

434-477: ⚡ Quick win

Share the shortlister payload builder with find_tools.

This method already diverges from PromptUtils.find_tools: it rebuilds the same prompt/serialization flow but drops _response_schemas and _param_constraints. That means bind-time shortlisting can rank a different top-K than the existing find_tools path, and the two implementations will keep drifting.

♻️ Suggested direction
+    `@staticmethod`
+    def _build_shortlister_payload(
+        all_tools: List[StructuredTool],
+        all_apps: List[AppDefinition],
+    ) -> tuple[Dict[str, Any], Dict[str, Any]]:
+        tools_as_dict: Dict[str, Any] = {}
+        for tool in all_tools:
+            tool_dict = tool.model_dump()
+            ...
+            if hasattr(tool, "func"):
+                if hasattr(tool.func, "_response_schemas"):
+                    tool_dict["_response_schemas"] = tool.func._response_schemas
+                if hasattr(tool.func, "_param_constraints"):
+                    tool_dict["_param_constraints"] = tool.func._param_constraints
+            tools_as_dict[tool.name] = tool_dict
+        apps_as_dict = {app.name: app.model_dump() for app in all_apps}
+        return tools_as_dict, apps_as_dict
+
     async def find_tools(...):
-        tools_as_dict = {}
-        ...
-        apps_as_dict = {app.name: app.model_dump() for app in all_apps}
+        tools_as_dict, apps_as_dict = PromptUtils._build_shortlister_payload(all_tools, all_apps)

     async def shortlist_tool_names(...):
-        tools_as_dict: Dict[str, Any] = {}
-        ...
-        apps_as_dict = {app.name: app.model_dump() for app in all_apps}
+        tools_as_dict, apps_as_dict = PromptUtils._build_shortlister_payload(all_tools, all_apps)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py` around lines 434
- 477, This shortlister duplicates the payload-building logic from
PromptUtils.find_tools but omits _response_schemas and _param_constraints
causing divergence; refactor to reuse the same builder (or call the existing
PromptUtils.find_tools payload/serialization helper) instead of reserializing
inline: locate the tools serialization block that builds tools_as_dict (and the
prompt created via create_chat_prompt_from_templates / ShortListerOutputLite /
BaseAgent.get_chain) and replace it with the shared payload builder used by
find_tools so the produced payload includes _response_schemas and
_param_constraints and both paths remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 327-361: The code treats find_tools_tool inconsistently for
counting vs reserving and can produce max_count+1 tools; fix by treating
find_tools_tool as already counted if bound already contains it: when computing
cap checks (the branch with cap_disabled or len(bound) <= max_count) and when
computing reserve, first derive ft_name from find_tools_tool (ft_name =
getattr(find_tools_tool,"name","") ) and test membership against
{getattr(t,"name","") for t in bound}; only append via _append_find_tools when
ft_name is non-empty and not already present in bound, and set reserve = 1 only
if find_tools_tool is non-None and ft_name not in bound so target_k correctly
reflects available shortlist slots.

---

Nitpick comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py`:
- Around line 434-477: This shortlister duplicates the payload-building logic
from PromptUtils.find_tools but omits _response_schemas and _param_constraints
causing divergence; refactor to reuse the same builder (or call the existing
PromptUtils.find_tools payload/serialization helper) instead of reserializing
inline: locate the tools serialization block that builds tools_as_dict (and the
prompt created via create_chat_prompt_from_templates / ShortListerOutputLite /
BaseAgent.get_chain) and replace it with the shared payload builder used by
find_tools so the produced payload includes _response_schemas and
_param_constraints and both paths remain consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 90c0edbd-8f6e-4bd0-9b35-3aeba1c92a7d

📥 Commits

Reviewing files that changed from the base of the PR and between b5b72b2 and 83fd812.

📒 Files selected for processing (4)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
  • src/cuga/config.py
  • tests/unit/test_cuga_lite_bind_tools.py

Comment thread src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py Outdated
Two follow-ups discovered while exercising the cap path on the m3 hockey
benchmark (206 tools, Groq llama-3, mode=all):

1. Loguru log strings used %-style placeholders so the cap exceeded /
   shortlisted-to messages printed literal "%s" / "%d". Switch to {}.

2. The existing shortlister LLM system prompt is biased toward "the most
   relevant" tools and typically returns 1-4 entries even when asked for
   K=128. Padding the result with the remaining tools (provider order) does
   fill the cap, but binding ~128 native tools pushes the Groq llama model
   into native tool_calls mode, which cuga_lite's code-execution flow does
   not fully exercise — measured regression from 5-7 tool calls / pass to
   0 tool calls / fail on the same task.

   Add `cuga_lite_bind_tools_pad_to_cap` (default False) and keep the
   shortlister's natural ranked prefix as the bound set. Result on the m3
   hockey task: 0/1 to 1/1 pass, 17s, 4 tool calls in code-mode.

Users explicitly chasing "bind every slot the provider allows" can flip the
setting on for that experiment.
Per coderabbit review on #203: the under-cap fast path and the over-cap
reserve logic treated find_tools inconsistently.

- When len(bound) == max_count and include_find_tools=True with a
  non-overlapping find_tools, the fast path returned max_count+1 tools and
  the provider rejected the request.
- When the overlay path had already placed find_tools into bound and
  include_find_tools=True, reserve=1 dropped a real shortlist slot even
  though find_tools was already counted in bound.

Track whether find_tools is already in bound, compute an effective_count
for the under-cap check, and only set reserve=1 when find_tools genuinely
needs to be appended.

Add two regression tests for the boundary cases.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py (1)

356-390: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep find_tools guaranteed when include_find_tools=True.

When find_tools comes in via the overlay, need_to_append_find_tools becomes False. In the over-cap path that means reserve=0 and _append_find_tools() is a no-op, so the shortlister can rank find_tools out of the final list entirely. That breaks the include_find_tools=True contract and also makes max_count=1 fail even though binding only find_tools would still fit.

Proposed fix
-    find_tools_already_in_bound = bool(find_tools_name) and any(
+    find_tools_already_in_bound = bool(find_tools_name) and any(
         getattr(t, "name", "") == find_tools_name for t in bound
     )
-    need_to_append_find_tools = find_tools_tool is not None and not find_tools_already_in_bound
+    keep_find_tools = include_find_tools and find_tools_tool is not None
+    ranking_pool = (
+        [t for t in bound if getattr(t, "name", "") != find_tools_name]
+        if keep_find_tools and find_tools_already_in_bound
+        else bound
+    )

     def _append_find_tools(tools: List[StructuredTool]) -> List[StructuredTool]:
-        if not need_to_append_find_tools:
+        if not keep_find_tools or find_tools_tool is None:
             return tools
         if find_tools_name in {getattr(t, "name", "") for t in tools}:
             return tools
         return [*tools, find_tools_tool]

     cap_disabled = max_count <= 0
-    effective_count = len(bound) + (1 if need_to_append_find_tools else 0)
+    effective_count = len(ranking_pool) + (1 if keep_find_tools else 0)
     if cap_disabled or effective_count <= max_count:
-        return _append_find_tools(bound)
+        return _append_find_tools(ranking_pool)

-    reserve = 1 if need_to_append_find_tools else 0
+    reserve = 1 if keep_find_tools else 0
     target_k = max_count - reserve
     if target_k <= 0:
-        raise RuntimeError(
-            f"cuga_lite_bind_tools_max_count={max_count} is too small to fit even find_tools "
-            f"(reserve={reserve}). Raise the cap."
-        )
+        return _append_find_tools([])

     ...
-            all_tools=bound,
+            all_tools=ranking_pool,
             ...
-    by_name = {getattr(t, "name", ""): t for t in bound}
+    by_name = {getattr(t, "name", ""): t for t in ranking_pool}

Also applies to: 414-460

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py` around lines
356 - 390, The shortlister can drop find_tools when it's already present in
bound because reserve is only set based on need_to_append_find_tools; to
guarantee include_find_tools=True you must treat that tool as reserved even if
it was placed in-band. Change the reserve logic to reserve a slot whenever
include_find_tools is true (i.e. set reserve = 1 if include_find_tools else 0)
and ensure effective_count/shortlist decisions continue to reflect
need_to_append_find_tools for counting new appends; keep _append_find_tools and
find_tools_already_in_bound as-is so the append only happens when needed but the
shortlister cannot evict the in-band find_tools thanks to the new reserve
behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 397-403: The logger.warning call inside the block where all_apps
is populated from tool_provider.get_apps() uses printf-style "%s" formatting
which doesn't work with Loguru; change the call in the try/except that catches
Exception e to use Loguru's {} style (e.g., logger.warning("bind_tools cap:
tool_provider.get_apps() failed: {}", e)) so the exception message renders
correctly; update the logging invocation near the
all_apps/tool_provider/get_apps handling to use {} formatting and keep the same
message content.

---

Duplicate comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 356-390: The shortlister can drop find_tools when it's already
present in bound because reserve is only set based on need_to_append_find_tools;
to guarantee include_find_tools=True you must treat that tool as reserved even
if it was placed in-band. Change the reserve logic to reserve a slot whenever
include_find_tools is true (i.e. set reserve = 1 if include_find_tools else 0)
and ensure effective_count/shortlist decisions continue to reflect
need_to_append_find_tools for counting new appends; keep _append_find_tools and
find_tools_already_in_bound as-is so the append only happens when needed but the
shortlister cannot evict the in-band find_tools thanks to the new reserve
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cde90c26-803b-4a75-9706-57b02fca574f

📥 Commits

Reviewing files that changed from the base of the PR and between b4b154c and 6af3431.

📒 Files selected for processing (2)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
  • tests/unit/test_cuga_lite_bind_tools.py

Comment thread src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py Outdated
Per coderabbit nitpick on #203: shortlist_tool_names duplicated the tool
serialization from find_tools but dropped _response_schemas and
_param_constraints, so bind-time ranking could diverge from runtime find_tools
ranking on the same tool set.

Extract PromptUtils._build_shortlister_payload as the single source of truth
for serializing (all_tools, all_apps) into the prompt-friendly dicts. Both
callers now produce identical payloads including the response-schema and
param-constraint fields the LLM uses to rank.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py (1)

225-234: ⚡ Quick win

Narrow the blind exception in schema serialization.

On Line 232, except Exception can mask unexpected bugs and silently strip tool schema from ranking input. Catch expected serialization errors explicitly so unexpected failures surface.

Proposed diff
             if hasattr(tool, 'args_schema') and tool.args_schema:
                 try:
                     if hasattr(tool.args_schema, 'schema'):
                         tool_dict['args_schema'] = tool.args_schema.schema()
                     elif hasattr(tool.args_schema, 'model_json_schema'):
                         tool_dict['args_schema'] = tool.args_schema.model_json_schema()
                     else:
                         tool_dict['args_schema'] = {}
-                except Exception as e:
+                except (AttributeError, TypeError, ValueError) as e:
                     logger.debug(f"Failed to serialize args_schema for tool {tool.name}: {e}")
                     tool_dict['args_schema'] = {}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py` around lines 225
- 234, The try/except around args_schema serialization in the prompt_utils code
is too broad (it catches Exception) and can hide unexpected bugs; change the
except to only catch expected serialization-related errors (for example: except
(AttributeError, TypeError, ValueError) as e:) when accessing
tool.args_schema.schema() or model_json_schema(), log the error with
logger.debug (including the exception) and set tool_dict['args_schema'] = {},
and let any other exceptions propagate so unexpected failures surface; refer to
the block that checks hasattr(tool.args_schema, 'schema') / 'model_json_schema'
and the logger.debug call that mentions tool.name to locate the code to modify.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py`:
- Around line 427-428: The guard that returns [] currently checks only top_k and
all_tools but misses whitespace-only queries; update the shortlist entry point
(the function containing the if top_k <= 0 or not all_tools) to treat a query of
None or a query that is all whitespace as empty by trimming (e.g., query =
(query or "").strip()) and then checking if not query before proceeding to call
the LLM; use the existing variables query, top_k, and all_tools so the function
returns [] when the trimmed query is empty, preserving the current behavior for
top_k and all_tools.

---

Nitpick comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py`:
- Around line 225-234: The try/except around args_schema serialization in the
prompt_utils code is too broad (it catches Exception) and can hide unexpected
bugs; change the except to only catch expected serialization-related errors (for
example: except (AttributeError, TypeError, ValueError) as e:) when accessing
tool.args_schema.schema() or model_json_schema(), log the error with
logger.debug (including the exception) and set tool_dict['args_schema'] = {},
and let any other exceptions propagate so unexpected failures surface; refer to
the block that checks hasattr(tool.args_schema, 'schema') / 'model_json_schema'
and the logger.debug call that mentions tool.name to locate the code to modify.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 424758d9-3d74-4181-97e2-5287f364951e

📥 Commits

Reviewing files that changed from the base of the PR and between 6af3431 and acdbf18.

📒 Files selected for processing (1)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py

Comment thread src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
@haroldship haroldship requested a review from sami-marreed May 11, 2026 17:22
@haroldship
Copy link
Copy Markdown
Collaborator Author

This does NOT seem to improve M3, just fixes an issue with bind_tools where too many tools causes errors from LLM invoke.

Here are some results for hockey domain in capabilty 2, which are similar to previous:

============================== FINAL SUMMARY ==============================

================================================================================
EVALUATION COMPLETE — Vakra LLM judges (correctness × groundedness × exact-match)
================================================================================
Scoring mode:          live-mcp
Scored samples:        10
Pass / total:          7 / 10  (70.0%)
Mean dialogue score:   0.7000
Min  dialogue score:   0.0000
Max  dialogue score:   1.0000

Per-task results:
  ✓ 308738b8195d-5bd16a8893c5       score=1.00  tool_calls=5/1
  ✓ 308738b8195d-18bbc5dc131d       score=1.00  tool_calls=1/1
  ✓ 308738b8195d-05ec22ea67ac       score=1.00  tool_calls=3/1
  ✗ 308738b8195d-3e3d6db9ab3d       score=0.00  tool_calls=5/1
    turn 0 score=0.0 pred="The 9th oldest hockey player is **Cully Wilson**, and his position is **D**."
      ✗ exactmatch    score=0.0  Missing expected tool_responses. Missing: ['{\n  "positions": [\n    [\n      null\n    ],\n    [\n      null\n    ],\n    [\n      null\n    ],\n    [\n      null\n    ],\n    [\n      null\n    ],\n    [\n      null\n    ],\n    [\n      null\n    ],\n    [\n      "R"\n    ]...
      ✗ answer        score=0.0  The ground truth indicates the position is "r", but the prediction says the position is "d", which does not match. Therefore the prediction is incorrect.
  ✓ 308738b8195d-b11eeb38eded       score=1.00  tool_calls=1/1
  ✓ 308738b8195d-91d22b875555       score=1.00  tool_calls=1/1
  ✓ 308738b8195d-02960c8c0d16       score=1.00  tool_calls=1/1
  ✓ 308738b8195d-8b4b06bc3398       score=1.00  tool_calls=2/1
  ✗ 308738b8195d-9296bdde9ace       score=0.00  tool_calls=0/1
    turn 0 score=0.0 pred="I’ll run a series of broader `find_tools` queries to see if any hidden event‑level endpoint exists."
      ✗ exactmatch    score=0.0  Missing expected tool_responses. Missing: ['{\n  "count": 0\n}']
      ✗ answer        score=0.0  The prediction does not answer the question about the number of teams; it is unrelated text, so it receives a score of 0.
  ✗ 308738b8195d-56faa9f6bbd2       score=0.00  tool_calls=5/1
    turn 0 score=0.0 pred="One coach was recorded as having a temporary (interim) term in 2007. This count comes from the `task_2_hockey_get_count_coaches_by_year_and_notes` endpoint using the note “interim”; alternative not..."
      ✓ exactmatch    score=1.0  All expected tool_responses are present in actual.
      ✗ groundedness  score=0.0  The response mentions a specific API endpoint and maps zero counts to particular note types, neither of which are stated in the document, making those claims ungrounded.

Full judge details: /Users/harold/workspace/cuga-internal-evaluation/benchmarks/m3/results/_vakra/results.json
================================================================================

Four findings addressed in one commit:

1. (major) include_find_tools=True did not guarantee find_tools survived
   shortlisting when the overlay had already placed it into bound. The
   shortlister LLM is free to drop any tool from its ranking input. Fix:
   pull find_tools out of the ranking pool when present, always reserve a
   cap slot when include_find_tools=True, and append find_tools back after
   ranking. As a bonus, max_count=1 + include_find_tools=True now binds
   just find_tools instead of raising.

2. (minor) logger.warning("...get_apps() failed: %s", e) used printf-style
   formatting, which loguru does not interpolate — the actual error reason
   was lost. Switch to "{}".

3. (minor) shortlist_tool_names did not guard whitespace-only queries; the
   guard at the top only checked top_k and all_tools. A whitespace query
   would still hit the LLM and produce arbitrary rankings, bypassing the
   "no query" failure path in the caller. Add `if not query.strip(): return []`.

4. (nit) _build_shortlister_payload caught blind Exception when
   serializing args_schema, masking unexpected bugs. Narrow to
   (AttributeError, TypeError, ValueError).

New regression tests:
- find_tools survives shortlisting when in overlay bound (cap path)
- max_count=1 + include_find_tools=True binds only find_tools
@haroldship
Copy link
Copy Markdown
Collaborator Author

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 440-465: If ranked_names is non-empty but none match ranking_pool,
treat this as a hard failure: after building shortlisted from ranked_names (the
loop that uses by_name and seen_short), check if ranked_names and shortlisted is
empty and raise an exception (or return a clear error) before running the
padding logic (_bind_tools_pad_to_cap_from_settings) or calling
_append_find_tools; update the block that constructs shortlisted (and references
ranking_pool, ranked_names, by_name, seen_short) to perform this validation and
throw a descriptive error so we never silently proceed with padding or find-tool
append when the shortlist filtering produced no valid tools.
- Around line 350-371: The overlay-injected find_tools must be detected and
removed from the bound list regardless of include_find_tools; update
_indexed_tools_for_native_bind() / the block that sets find_tools_tool so that
you always extract candidate =
tools_context_ref.get("_lc_bind_tools_find_tools") and compute find_tools_name
and find_tools_already_in_bound unconditionally, then modify bound to strip any
tool whose name == find_tools_name unless keep_find_tools is True; after that,
build ranking_pool from the (possibly stripped) bound and only reserve/append
the find_tools_tool when keep_find_tools is True so it does not consume a capped
slot during ranking (refer to symbols: include_find_tools, find_tools_tool,
find_tools_name, find_tools_already_in_bound, keep_find_tools, bound,
ranking_pool).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1a9fd89c-6a30-42d6-a4e7-c80283d67033

📥 Commits

Reviewing files that changed from the base of the PR and between acdbf18 and ded870c.

📒 Files selected for processing (3)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
  • tests/unit/test_cuga_lite_bind_tools.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py

Comment thread src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py Outdated
Comment thread src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py Outdated
A. Overlay-injected find_tools must respect include_find_tools=False.
   `_indexed_tools_for_native_bind` can place find_tools into `bound`
   independently of config. Previously, when `include_find_tools=False`,
   the code never even *read* the find_tools candidate from
   tools_context_ref, so the overlay-injected copy slipped through —
   consuming a capped slot, leaking through the shortlister, and
   defeating the user's explicit opt-out. Fix: resolve the find_tools
   candidate unconditionally; if `include_find_tools=False` and it's
   already in `bound`, strip it before any cap math runs.

B. LLM-hallucinated shortlist names must fail loudly.
   The existing guard catches `ranked_names == []` but not the case
   where the LLM returns names that don't exist in `ranking_pool` (zero
   matches after the dictionary lookup). Without the new check, we'd
   silently pad with arbitrary `ranking_pool` prefix or return just
   find_tools — recreating exactly the silent degradation this cap path
   exists to prevent. Fix: raise RuntimeError when `shortlisted` is
   empty after the lookup loop, with diagnostic context.

Two regression tests added:
- overlay-injected find_tools stripped when include_find_tools=False
- shortlister returning hallucinated names raises before pad / append

16/16 tests pass.
@haroldship
Copy link
Copy Markdown
Collaborator Author

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unit/test_cuga_lite_bind_tools.py (1)

472-579: ⚡ Quick win

Add regression test for overlong shortlister output.

Please add one case where PromptUtils.shortlist_tool_names returns more than top_k valid names, then assert the final bound list is still capped. This guards the provider-safe cap contract against non-compliant LLM output.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_cuga_lite_bind_tools.py` around lines 472 - 579, Add a
regression test that simulates PromptUtils.shortlist_tool_names returning more
valid tool names than the requested top_k and assert the binding still respects
the cap; specifically, create an async fake_shortlist that returns a list longer
than top_k, patch "PromptUtils.shortlist_tool_names" to use it, call
resolve_model_with_bind_tools with a configured
_bind_tools_max_count_from_settings (e.g., 4) and include_find_tools as needed,
then assert model.bind_tools was called and that the bound list (from
model.bind_tools.call_args) contains at most the capped number of tools (and
reserves a slot for "find_tools" if include_find_tools=True), referencing
PromptUtils.shortlist_tool_names and resolve_model_with_bind_tools to locate the
behavior to test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 449-457: Shortlister collects tools from ranked_names into
shortlisted (using by_name, seen_short and StructuredTool) but never enforces
target_k/max_count; update the loop or post-process shortlisted to limit its
length to target_k (or max_count) before returning so it never contains more
than the requested number of tools, e.g., stop adding once shortlisted reaches
target_k or truncate shortlisted to target_k after the loop to ensure
provider-side bind limits are respected.

---

Nitpick comments:
In `@tests/unit/test_cuga_lite_bind_tools.py`:
- Around line 472-579: Add a regression test that simulates
PromptUtils.shortlist_tool_names returning more valid tool names than the
requested top_k and assert the binding still respects the cap; specifically,
create an async fake_shortlist that returns a list longer than top_k, patch
"PromptUtils.shortlist_tool_names" to use it, call resolve_model_with_bind_tools
with a configured _bind_tools_max_count_from_settings (e.g., 4) and
include_find_tools as needed, then assert model.bind_tools was called and that
the bound list (from model.bind_tools.call_args) contains at most the capped
number of tools (and reserves a slot for "find_tools" if
include_find_tools=True), referencing PromptUtils.shortlist_tool_names and
resolve_model_with_bind_tools to locate the behavior to test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d3f39a6d-7c3d-4910-bd5f-0450fff229fc

📥 Commits

Reviewing files that changed from the base of the PR and between ded870c and 11feace.

📒 Files selected for processing (2)
  • src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
  • tests/unit/test_cuga_lite_bind_tools.py

Comment thread src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py Outdated
)

Defense-in-depth: the build-shortlisted loop in _apply_bind_tools_cap_and_merge
trusted PromptUtils.shortlist_tool_names to truncate its own output at top_k.
That trust is fragile — a custom shortlister, a future refactor, or a mocked
path could return more names than requested, in which case shortlisted grew
past target_k, the pad block was skipped (len already ≥ target_k), and the
final bound list exceeded max_count, re-triggering the provider 400 the cap
exists to prevent.

Add an explicit `if len(shortlisted) >= target_k: break` so the call site
enforces its own cap contract regardless of the shortlister's compliance.

Two regression tests added:
- overlong shortlister output clamped to target_k (mode=all, no find_tools)
- same clamp with include_find_tools=True reserving 1 slot
@sami-marreed
Copy link
Copy Markdown
Contributor

why tests failing here

@haroldship
Copy link
Copy Markdown
Collaborator Author

why tests failing here

Reran and they pass. They are unreliable as the LLM doesn't return consistent responses. All I did was rerun them. And the failure was in a totally different part of cuga (policies)

@sami-marreed
Copy link
Copy Markdown
Contributor

Modularity (cuga_lite_graph)

The bind-tools cap fix looks right for strict provider limits, but cuga_lite_graph.py picks up a lot of logic (~180 lines in _apply_bind_tools_cap_and_merge plus settings helpers). Please pull this into a dedicated package so the graph file stays readable:

Placement: Under src/cuga/backend/cuga_graph/nodes/cuga_lite/ you already have subfolders (executors/, reflection/, prompts/, tests/). None of those fit “cap + shortlist + merge for bind_tools” well (executors/ is runtime/sandbox). Add a new subpackage, e.g.:

src/cuga/backend/cuga_graph/nodes/cuga_lite/bind_tools/

  • bind_tools/__init__.py — re-export only what cuga_lite_graph.py needs.
  • bind_tools/cap.py (split further if it grows) — _bind_tools_max_count_from_settings, _bind_tools_pad_to_cap_from_settings, _apply_bind_tools_cap_and_merge, and small private helpers (resolve/strip find_tools overlay, build ranking_pool, run shortlist + validate names, optional pad-to-cap).

Keep resolve_model_with_bind_tools in cuga_lite_graph.py as orchestration (call into bind_tools).

DRY: The four _apply_bind_tools_cap_and_merge(...) call sites repeat the same kwargs — factor a tiny closure or _cap_merge_bound(bound) that closes over query, tool_provider, llm, max_count, include_find_tools, tools_context_ref, mode.

prompt_utils.py already factored _build_shortlister_payload / shortlist_tool_names; matching that structure on the graph side (via bind_tools/) keeps responsibilities aligned.


Other notes (worth fixing, non-blocking)

  1. Docs vs code: advanced_features.cuga_lite_bind_tools_pad_to_cap is in code + validators but the PR blurb mostly highlights cuga_lite_bind_tools_max_count. Document pad_to_cap (what it does, default off, env override) and add settings.toml / sample config if that's how operators discover advanced_features.*.

  2. Operational: Over-cap paths add a shortlister LLM round-trip on bind — intentional; a one-line docstring or ops note on latency/cost helps.

  3. Loud failures: RuntimeError when shortlisting can't run is intentional; just ensure benchmark/user docs mention it.

…203)

Addresses Sami's review on #203 (modularity + DRY + docs).

- New subpackage src/cuga/backend/cuga_graph/nodes/cuga_lite/bind_tools/
  - cap.py: holds bind_tools_max_count_from_settings,
    bind_tools_pad_to_cap_from_settings, apply_bind_tools_cap_and_merge,
    plus small private helpers (_resolve_find_tools_overlay,
    _build_ranking_pool, _run_shortlister, _materialize_shortlist,
    _maybe_pad_to_cap).
  - __init__.py: re-exports only what cuga_lite_graph.py needs.
- cuga_lite_graph.py: resolve_model_with_bind_tools stays as orchestration.
  The four mode={all,apps,tools,apps_and_tools} branches now share a local
  _cap_merge_bound(bound) closure (DRY ask), and import the helpers from
  bind_tools/.
- Docstring on resolve_model_with_bind_tools now documents
  cuga_lite_bind_tools_pad_to_cap and adds an Operational cost paragraph
  (one extra LLM round-trip per call_model when cap exceeded; intentional
  RuntimeError when shortlisting cannot run safely).
- settings.toml: documents both cuga_lite_bind_tools_max_count and
  cuga_lite_bind_tools_pad_to_cap inline with the other bind_tools_* knobs
  (env overrides + latency/RuntimeError semantics).
- Tests: patch paths updated for the new module layout; pad_to_cap patches
  now target bind_tools.cap directly since the call site moved with the
  function. 18 tests pass.
@haroldship
Copy link
Copy Markdown
Collaborator Author

Thanks @sami-marreed — addressed all of this in c20968b (just pushed).

Modularity + DRY (your main asks)

New subpackage at the exact path you suggested:

src/cuga/backend/cuga_graph/nodes/cuga_lite/bind_tools/
├── __init__.py   # re-exports only what cuga_lite_graph.py needs
└── cap.py        # all the moved logic

cap.py holds the three module-level helpers (bind_tools_max_count_from_settings, bind_tools_pad_to_cap_from_settings, apply_bind_tools_cap_and_merge) and decomposes the previously-monolithic _apply_bind_tools_cap_and_merge into the small private helpers you outlined:

  • _resolve_find_tools_overlay — find/strip the overlay-injected find_tools candidate
  • _build_ranking_pool — pull find_tools out of the pool when we must guarantee it survives
  • _run_shortlister — invoke PromptUtils.shortlist_tool_names and raise on failure / empty ranking
  • _materialize_shortlist — map ranker output back to StructuredTool, defense-in-depth clamp to target_k, raise on hallucinated names
  • _maybe_pad_to_cap — opt-in padding

cuga_lite_graph.py:resolve_model_with_bind_tools now stays as orchestration. The four mode={all,apps,tools,apps_and_tools} call sites repeated the same kwargs, so I factored a local _cap_merge_bound(bound) closure (your DRY suggestion) — each branch is now one await _cap_merge_bound(bound).

Other notes — all done

  1. Docs vs code (pad_to_cap):

    • PR body now has a dedicated "New settings" subsection for cuga_lite_bind_tools_pad_to_cap (default, env override, why off by default with the m3-hockey measurement).
    • src/cuga/settings.toml now documents both cuga_lite_bind_tools_max_count and cuga_lite_bind_tools_pad_to_cap inline next to the other cuga_lite_bind_tools_* knobs, including the env vars.
  2. Operational (latency/cost):

    • Added an "Operational notes" section to the PR body: when the candidate count exceeds max_count, applying the cap incurs one extra LLM round-trip (the shortlister) per call_model. Permissive backends can skip the round-trip with cuga_lite_bind_tools_max_count=0.
    • Same note added to the resolve_model_with_bind_tools docstring and to cap.py's module docstring.
  3. Loud failures (RuntimeError):

    • PR body now explicitly documents the RuntimeError contract — when it fires (empty query / shortlister exception / empty ranking / hallucinated names) and how callers should react. Mentioned in both the bullet summary and "Operational notes".

Verification

  • pytest tests/unit/test_cuga_lite_bind_tools.py → 18 passed
  • ruff check + ruff format --check on all touched files → clean

(Test patch paths needed two adjustments: bind_tools_max_count_from_settings is still patchable at cuga_lite_graph.* since the call site lives there, but bind_tools_pad_to_cap_from_settings patches now target bind_tools.cap.* because that call site moved with the helper.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] cuga_lite_bind_tools_mode=all crashes on Groq when tool count > 128 (provider 400)

2 participants