fix(cuga_lite): cap bind_tools count and shortlist over the limit#203
fix(cuga_lite): cap bind_tools count and shortlist over the limit#203haroldship wants to merge 8 commits into
Conversation
`resolve_model_with_bind_tools` blindly passed every candidate tool to `LLM.bind_tools()`, which fails hard on Groq/OpenAI (max 128 tools) for benchmarks like m3-hockey (206 tools). Add `cuga_lite_bind_tools_max_count` (default 128). When the candidate list exceeds the cap, run the existing LLM shortlister against the first user message and bind the top-K most relevant tools (reserving 1 slot for find_tools when `include_find_tools=True`). When shortlisting is impossible (no query / shortlister failure / empty ranking), raise an actionable RuntimeError naming the count, cap, and remedy — silent truncation would corrupt research comparing native tool-calling against text-mode. WatsonX / LiteLLM-to-Anthropic users on permissive backends can disable the cap via `DYNACONF_ADVANCED_FEATURES__CUGA_LITE_BIND_TOOLS_MAX_COUNT=0`. Closes #202 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a configurable provider-safe cap on native tool binding (default 128). When candidates exceed the cap, an LLM shortlister ranks tools using the first user message as query to pick a top-K subset; the code reserves/merges ChangesTool Binding Cap & LLM Shortlisting
Sequence Diagram(s)sequenceDiagram
participant CallModel
participant ResolveModel as resolve_model_with_bind_tools
participant ApplyCap as _apply_bind_tools_cap_and_merge
participant Shortlist as PromptUtils.shortlist_tool_names
participant LLM
participant Model
CallModel->>ResolveModel: call with query (first user message)
ResolveModel->>ApplyCap: candidate_tools, max_count, query
alt candidates exceed cap
ApplyCap->>Shortlist: query, all_tools, all_apps, top_k
Shortlist->>LLM: shortlister prompt + serialized payload
LLM-->>Shortlist: ranked tool names
Shortlist-->>ApplyCap: top-K tool names
ApplyCap->>ApplyCap: reserve slot for find_tools, pad to cap if enabled
ApplyCap-->>ResolveModel: capped + merged tool list
else candidates within cap
ApplyCap-->>ResolveModel: all tools
end
ResolveModel->>Model: bind_tools(selected_tools)
Model-->>ResolveModel: bound model
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py (1)
434-477: ⚡ Quick winShare the shortlister payload builder with
find_tools.This method already diverges from
PromptUtils.find_tools: it rebuilds the same prompt/serialization flow but drops_response_schemasand_param_constraints. That means bind-time shortlisting can rank a different top-K than the existingfind_toolspath, and the two implementations will keep drifting.♻️ Suggested direction
+ `@staticmethod` + def _build_shortlister_payload( + all_tools: List[StructuredTool], + all_apps: List[AppDefinition], + ) -> tuple[Dict[str, Any], Dict[str, Any]]: + tools_as_dict: Dict[str, Any] = {} + for tool in all_tools: + tool_dict = tool.model_dump() + ... + if hasattr(tool, "func"): + if hasattr(tool.func, "_response_schemas"): + tool_dict["_response_schemas"] = tool.func._response_schemas + if hasattr(tool.func, "_param_constraints"): + tool_dict["_param_constraints"] = tool.func._param_constraints + tools_as_dict[tool.name] = tool_dict + apps_as_dict = {app.name: app.model_dump() for app in all_apps} + return tools_as_dict, apps_as_dict + async def find_tools(...): - tools_as_dict = {} - ... - apps_as_dict = {app.name: app.model_dump() for app in all_apps} + tools_as_dict, apps_as_dict = PromptUtils._build_shortlister_payload(all_tools, all_apps) async def shortlist_tool_names(...): - tools_as_dict: Dict[str, Any] = {} - ... - apps_as_dict = {app.name: app.model_dump() for app in all_apps} + tools_as_dict, apps_as_dict = PromptUtils._build_shortlister_payload(all_tools, all_apps)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py` around lines 434 - 477, This shortlister duplicates the payload-building logic from PromptUtils.find_tools but omits _response_schemas and _param_constraints causing divergence; refactor to reuse the same builder (or call the existing PromptUtils.find_tools payload/serialization helper) instead of reserializing inline: locate the tools serialization block that builds tools_as_dict (and the prompt created via create_chat_prompt_from_templates / ShortListerOutputLite / BaseAgent.get_chain) and replace it with the shared payload builder used by find_tools so the produced payload includes _response_schemas and _param_constraints and both paths remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 327-361: The code treats find_tools_tool inconsistently for
counting vs reserving and can produce max_count+1 tools; fix by treating
find_tools_tool as already counted if bound already contains it: when computing
cap checks (the branch with cap_disabled or len(bound) <= max_count) and when
computing reserve, first derive ft_name from find_tools_tool (ft_name =
getattr(find_tools_tool,"name","") ) and test membership against
{getattr(t,"name","") for t in bound}; only append via _append_find_tools when
ft_name is non-empty and not already present in bound, and set reserve = 1 only
if find_tools_tool is non-None and ft_name not in bound so target_k correctly
reflects available shortlist slots.
---
Nitpick comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py`:
- Around line 434-477: This shortlister duplicates the payload-building logic
from PromptUtils.find_tools but omits _response_schemas and _param_constraints
causing divergence; refactor to reuse the same builder (or call the existing
PromptUtils.find_tools payload/serialization helper) instead of reserializing
inline: locate the tools serialization block that builds tools_as_dict (and the
prompt created via create_chat_prompt_from_templates / ShortListerOutputLite /
BaseAgent.get_chain) and replace it with the shared payload builder used by
find_tools so the produced payload includes _response_schemas and
_param_constraints and both paths remain consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 90c0edbd-8f6e-4bd0-9b35-3aeba1c92a7d
📒 Files selected for processing (4)
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.pysrc/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.pysrc/cuga/config.pytests/unit/test_cuga_lite_bind_tools.py
Two follow-ups discovered while exercising the cap path on the m3 hockey
benchmark (206 tools, Groq llama-3, mode=all):
1. Loguru log strings used %-style placeholders so the cap exceeded /
shortlisted-to messages printed literal "%s" / "%d". Switch to {}.
2. The existing shortlister LLM system prompt is biased toward "the most
relevant" tools and typically returns 1-4 entries even when asked for
K=128. Padding the result with the remaining tools (provider order) does
fill the cap, but binding ~128 native tools pushes the Groq llama model
into native tool_calls mode, which cuga_lite's code-execution flow does
not fully exercise — measured regression from 5-7 tool calls / pass to
0 tool calls / fail on the same task.
Add `cuga_lite_bind_tools_pad_to_cap` (default False) and keep the
shortlister's natural ranked prefix as the bound set. Result on the m3
hockey task: 0/1 to 1/1 pass, 17s, 4 tool calls in code-mode.
Users explicitly chasing "bind every slot the provider allows" can flip the
setting on for that experiment.
Per coderabbit review on #203: the under-cap fast path and the over-cap reserve logic treated find_tools inconsistently. - When len(bound) == max_count and include_find_tools=True with a non-overlapping find_tools, the fast path returned max_count+1 tools and the provider rejected the request. - When the overlay path had already placed find_tools into bound and include_find_tools=True, reserve=1 dropped a real shortlist slot even though find_tools was already counted in bound. Track whether find_tools is already in bound, compute an effective_count for the under-cap check, and only set reserve=1 when find_tools genuinely needs to be appended. Add two regression tests for the boundary cases.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py (1)
356-390:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
find_toolsguaranteed wheninclude_find_tools=True.When
find_toolscomes in via the overlay,need_to_append_find_toolsbecomesFalse. In the over-cap path that meansreserve=0and_append_find_tools()is a no-op, so the shortlister can rankfind_toolsout of the final list entirely. That breaks theinclude_find_tools=Truecontract and also makesmax_count=1fail even though binding onlyfind_toolswould still fit.Proposed fix
- find_tools_already_in_bound = bool(find_tools_name) and any( + find_tools_already_in_bound = bool(find_tools_name) and any( getattr(t, "name", "") == find_tools_name for t in bound ) - need_to_append_find_tools = find_tools_tool is not None and not find_tools_already_in_bound + keep_find_tools = include_find_tools and find_tools_tool is not None + ranking_pool = ( + [t for t in bound if getattr(t, "name", "") != find_tools_name] + if keep_find_tools and find_tools_already_in_bound + else bound + ) def _append_find_tools(tools: List[StructuredTool]) -> List[StructuredTool]: - if not need_to_append_find_tools: + if not keep_find_tools or find_tools_tool is None: return tools if find_tools_name in {getattr(t, "name", "") for t in tools}: return tools return [*tools, find_tools_tool] cap_disabled = max_count <= 0 - effective_count = len(bound) + (1 if need_to_append_find_tools else 0) + effective_count = len(ranking_pool) + (1 if keep_find_tools else 0) if cap_disabled or effective_count <= max_count: - return _append_find_tools(bound) + return _append_find_tools(ranking_pool) - reserve = 1 if need_to_append_find_tools else 0 + reserve = 1 if keep_find_tools else 0 target_k = max_count - reserve if target_k <= 0: - raise RuntimeError( - f"cuga_lite_bind_tools_max_count={max_count} is too small to fit even find_tools " - f"(reserve={reserve}). Raise the cap." - ) + return _append_find_tools([]) ... - all_tools=bound, + all_tools=ranking_pool, ... - by_name = {getattr(t, "name", ""): t for t in bound} + by_name = {getattr(t, "name", ""): t for t in ranking_pool}Also applies to: 414-460
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py` around lines 356 - 390, The shortlister can drop find_tools when it's already present in bound because reserve is only set based on need_to_append_find_tools; to guarantee include_find_tools=True you must treat that tool as reserved even if it was placed in-band. Change the reserve logic to reserve a slot whenever include_find_tools is true (i.e. set reserve = 1 if include_find_tools else 0) and ensure effective_count/shortlist decisions continue to reflect need_to_append_find_tools for counting new appends; keep _append_find_tools and find_tools_already_in_bound as-is so the append only happens when needed but the shortlister cannot evict the in-band find_tools thanks to the new reserve behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 397-403: The logger.warning call inside the block where all_apps
is populated from tool_provider.get_apps() uses printf-style "%s" formatting
which doesn't work with Loguru; change the call in the try/except that catches
Exception e to use Loguru's {} style (e.g., logger.warning("bind_tools cap:
tool_provider.get_apps() failed: {}", e)) so the exception message renders
correctly; update the logging invocation near the
all_apps/tool_provider/get_apps handling to use {} formatting and keep the same
message content.
---
Duplicate comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 356-390: The shortlister can drop find_tools when it's already
present in bound because reserve is only set based on need_to_append_find_tools;
to guarantee include_find_tools=True you must treat that tool as reserved even
if it was placed in-band. Change the reserve logic to reserve a slot whenever
include_find_tools is true (i.e. set reserve = 1 if include_find_tools else 0)
and ensure effective_count/shortlist decisions continue to reflect
need_to_append_find_tools for counting new appends; keep _append_find_tools and
find_tools_already_in_bound as-is so the append only happens when needed but the
shortlister cannot evict the in-band find_tools thanks to the new reserve
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cde90c26-803b-4a75-9706-57b02fca574f
📒 Files selected for processing (2)
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.pytests/unit/test_cuga_lite_bind_tools.py
Per coderabbit nitpick on #203: shortlist_tool_names duplicated the tool serialization from find_tools but dropped _response_schemas and _param_constraints, so bind-time ranking could diverge from runtime find_tools ranking on the same tool set. Extract PromptUtils._build_shortlister_payload as the single source of truth for serializing (all_tools, all_apps) into the prompt-friendly dicts. Both callers now produce identical payloads including the response-schema and param-constraint fields the LLM uses to rank.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py (1)
225-234: ⚡ Quick winNarrow the blind exception in schema serialization.
On Line 232,
except Exceptioncan mask unexpected bugs and silently strip tool schema from ranking input. Catch expected serialization errors explicitly so unexpected failures surface.Proposed diff
if hasattr(tool, 'args_schema') and tool.args_schema: try: if hasattr(tool.args_schema, 'schema'): tool_dict['args_schema'] = tool.args_schema.schema() elif hasattr(tool.args_schema, 'model_json_schema'): tool_dict['args_schema'] = tool.args_schema.model_json_schema() else: tool_dict['args_schema'] = {} - except Exception as e: + except (AttributeError, TypeError, ValueError) as e: logger.debug(f"Failed to serialize args_schema for tool {tool.name}: {e}") tool_dict['args_schema'] = {}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py` around lines 225 - 234, The try/except around args_schema serialization in the prompt_utils code is too broad (it catches Exception) and can hide unexpected bugs; change the except to only catch expected serialization-related errors (for example: except (AttributeError, TypeError, ValueError) as e:) when accessing tool.args_schema.schema() or model_json_schema(), log the error with logger.debug (including the exception) and set tool_dict['args_schema'] = {}, and let any other exceptions propagate so unexpected failures surface; refer to the block that checks hasattr(tool.args_schema, 'schema') / 'model_json_schema' and the logger.debug call that mentions tool.name to locate the code to modify.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py`:
- Around line 427-428: The guard that returns [] currently checks only top_k and
all_tools but misses whitespace-only queries; update the shortlist entry point
(the function containing the if top_k <= 0 or not all_tools) to treat a query of
None or a query that is all whitespace as empty by trimming (e.g., query =
(query or "").strip()) and then checking if not query before proceeding to call
the LLM; use the existing variables query, top_k, and all_tools so the function
returns [] when the trimmed query is empty, preserving the current behavior for
top_k and all_tools.
---
Nitpick comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py`:
- Around line 225-234: The try/except around args_schema serialization in the
prompt_utils code is too broad (it catches Exception) and can hide unexpected
bugs; change the except to only catch expected serialization-related errors (for
example: except (AttributeError, TypeError, ValueError) as e:) when accessing
tool.args_schema.schema() or model_json_schema(), log the error with
logger.debug (including the exception) and set tool_dict['args_schema'] = {},
and let any other exceptions propagate so unexpected failures surface; refer to
the block that checks hasattr(tool.args_schema, 'schema') / 'model_json_schema'
and the logger.debug call that mentions tool.name to locate the code to modify.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 424758d9-3d74-4181-97e2-5287f364951e
📒 Files selected for processing (1)
src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
|
This does NOT seem to improve M3, just fixes an issue with Here are some results for hockey domain in capabilty 2, which are similar to previous: |
Four findings addressed in one commit:
1. (major) include_find_tools=True did not guarantee find_tools survived
shortlisting when the overlay had already placed it into bound. The
shortlister LLM is free to drop any tool from its ranking input. Fix:
pull find_tools out of the ranking pool when present, always reserve a
cap slot when include_find_tools=True, and append find_tools back after
ranking. As a bonus, max_count=1 + include_find_tools=True now binds
just find_tools instead of raising.
2. (minor) logger.warning("...get_apps() failed: %s", e) used printf-style
formatting, which loguru does not interpolate — the actual error reason
was lost. Switch to "{}".
3. (minor) shortlist_tool_names did not guard whitespace-only queries; the
guard at the top only checked top_k and all_tools. A whitespace query
would still hit the LLM and produce arbitrary rankings, bypassing the
"no query" failure path in the caller. Add `if not query.strip(): return []`.
4. (nit) _build_shortlister_payload caught blind Exception when
serializing args_schema, masking unexpected bugs. Narrow to
(AttributeError, TypeError, ValueError).
New regression tests:
- find_tools survives shortlisting when in overlay bound (cap path)
- max_count=1 + include_find_tools=True binds only find_tools
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 440-465: If ranked_names is non-empty but none match ranking_pool,
treat this as a hard failure: after building shortlisted from ranked_names (the
loop that uses by_name and seen_short), check if ranked_names and shortlisted is
empty and raise an exception (or return a clear error) before running the
padding logic (_bind_tools_pad_to_cap_from_settings) or calling
_append_find_tools; update the block that constructs shortlisted (and references
ranking_pool, ranked_names, by_name, seen_short) to perform this validation and
throw a descriptive error so we never silently proceed with padding or find-tool
append when the shortlist filtering produced no valid tools.
- Around line 350-371: The overlay-injected find_tools must be detected and
removed from the bound list regardless of include_find_tools; update
_indexed_tools_for_native_bind() / the block that sets find_tools_tool so that
you always extract candidate =
tools_context_ref.get("_lc_bind_tools_find_tools") and compute find_tools_name
and find_tools_already_in_bound unconditionally, then modify bound to strip any
tool whose name == find_tools_name unless keep_find_tools is True; after that,
build ranking_pool from the (possibly stripped) bound and only reserve/append
the find_tools_tool when keep_find_tools is True so it does not consume a capped
slot during ranking (refer to symbols: include_find_tools, find_tools_tool,
find_tools_name, find_tools_already_in_bound, keep_find_tools, bound,
ranking_pool).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1a9fd89c-6a30-42d6-a4e7-c80283d67033
📒 Files selected for processing (3)
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.pysrc/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.pytests/unit/test_cuga_lite_bind_tools.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.py
A. Overlay-injected find_tools must respect include_find_tools=False. `_indexed_tools_for_native_bind` can place find_tools into `bound` independently of config. Previously, when `include_find_tools=False`, the code never even *read* the find_tools candidate from tools_context_ref, so the overlay-injected copy slipped through — consuming a capped slot, leaking through the shortlister, and defeating the user's explicit opt-out. Fix: resolve the find_tools candidate unconditionally; if `include_find_tools=False` and it's already in `bound`, strip it before any cap math runs. B. LLM-hallucinated shortlist names must fail loudly. The existing guard catches `ranked_names == []` but not the case where the LLM returns names that don't exist in `ranking_pool` (zero matches after the dictionary lookup). Without the new check, we'd silently pad with arbitrary `ranking_pool` prefix or return just find_tools — recreating exactly the silent degradation this cap path exists to prevent. Fix: raise RuntimeError when `shortlisted` is empty after the lookup loop, with diagnostic context. Two regression tests added: - overlay-injected find_tools stripped when include_find_tools=False - shortlister returning hallucinated names raises before pad / append 16/16 tests pass.
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/test_cuga_lite_bind_tools.py (1)
472-579: ⚡ Quick winAdd regression test for overlong shortlister output.
Please add one case where
PromptUtils.shortlist_tool_namesreturns more thantop_kvalid names, then assert the final bound list is still capped. This guards the provider-safe cap contract against non-compliant LLM output.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_cuga_lite_bind_tools.py` around lines 472 - 579, Add a regression test that simulates PromptUtils.shortlist_tool_names returning more valid tool names than the requested top_k and assert the binding still respects the cap; specifically, create an async fake_shortlist that returns a list longer than top_k, patch "PromptUtils.shortlist_tool_names" to use it, call resolve_model_with_bind_tools with a configured _bind_tools_max_count_from_settings (e.g., 4) and include_find_tools as needed, then assert model.bind_tools was called and that the bound list (from model.bind_tools.call_args) contains at most the capped number of tools (and reserves a slot for "find_tools" if include_find_tools=True), referencing PromptUtils.shortlist_tool_names and resolve_model_with_bind_tools to locate the behavior to test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py`:
- Around line 449-457: Shortlister collects tools from ranked_names into
shortlisted (using by_name, seen_short and StructuredTool) but never enforces
target_k/max_count; update the loop or post-process shortlisted to limit its
length to target_k (or max_count) before returning so it never contains more
than the requested number of tools, e.g., stop adding once shortlisted reaches
target_k or truncate shortlisted to target_k after the loop to ensure
provider-side bind limits are respected.
---
Nitpick comments:
In `@tests/unit/test_cuga_lite_bind_tools.py`:
- Around line 472-579: Add a regression test that simulates
PromptUtils.shortlist_tool_names returning more valid tool names than the
requested top_k and assert the binding still respects the cap; specifically,
create an async fake_shortlist that returns a list longer than top_k, patch
"PromptUtils.shortlist_tool_names" to use it, call resolve_model_with_bind_tools
with a configured _bind_tools_max_count_from_settings (e.g., 4) and
include_find_tools as needed, then assert model.bind_tools was called and that
the bound list (from model.bind_tools.call_args) contains at most the capped
number of tools (and reserves a slot for "find_tools" if
include_find_tools=True), referencing PromptUtils.shortlist_tool_names and
resolve_model_with_bind_tools to locate the behavior to test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d3f39a6d-7c3d-4910-bd5f-0450fff229fc
📒 Files selected for processing (2)
src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.pytests/unit/test_cuga_lite_bind_tools.py
) Defense-in-depth: the build-shortlisted loop in _apply_bind_tools_cap_and_merge trusted PromptUtils.shortlist_tool_names to truncate its own output at top_k. That trust is fragile — a custom shortlister, a future refactor, or a mocked path could return more names than requested, in which case shortlisted grew past target_k, the pad block was skipped (len already ≥ target_k), and the final bound list exceeded max_count, re-triggering the provider 400 the cap exists to prevent. Add an explicit `if len(shortlisted) >= target_k: break` so the call site enforces its own cap contract regardless of the shortlister's compliance. Two regression tests added: - overlong shortlister output clamped to target_k (mode=all, no find_tools) - same clamp with include_find_tools=True reserving 1 slot
|
why tests failing here |
Reran and they pass. They are unreliable as the LLM doesn't return consistent responses. All I did was rerun them. And the failure was in a totally different part of cuga (policies) |
Modularity (
|
…203) Addresses Sami's review on #203 (modularity + DRY + docs). - New subpackage src/cuga/backend/cuga_graph/nodes/cuga_lite/bind_tools/ - cap.py: holds bind_tools_max_count_from_settings, bind_tools_pad_to_cap_from_settings, apply_bind_tools_cap_and_merge, plus small private helpers (_resolve_find_tools_overlay, _build_ranking_pool, _run_shortlister, _materialize_shortlist, _maybe_pad_to_cap). - __init__.py: re-exports only what cuga_lite_graph.py needs. - cuga_lite_graph.py: resolve_model_with_bind_tools stays as orchestration. The four mode={all,apps,tools,apps_and_tools} branches now share a local _cap_merge_bound(bound) closure (DRY ask), and import the helpers from bind_tools/. - Docstring on resolve_model_with_bind_tools now documents cuga_lite_bind_tools_pad_to_cap and adds an Operational cost paragraph (one extra LLM round-trip per call_model when cap exceeded; intentional RuntimeError when shortlisting cannot run safely). - settings.toml: documents both cuga_lite_bind_tools_max_count and cuga_lite_bind_tools_pad_to_cap inline with the other bind_tools_* knobs (env overrides + latency/RuntimeError semantics). - Tests: patch paths updated for the new module layout; pad_to_cap patches now target bind_tools.cap directly since the call site moved with the function. 18 tests pass.
|
Thanks @sami-marreed — addressed all of this in c20968b (just pushed). Modularity + DRY (your main asks)New subpackage at the exact path you suggested:
Other notes — all done
Verification
(Test patch paths needed two adjustments: |
Summary
LLM.bind_tools()from cuga_lite, fixing a hard 400 from Groq/OpenAI when the candidate set exceeds the provider's per-request limit (e.g. m3 hockey: 206 tools, mode=all).PromptUtils.find_toolsmachinery) ranks the candidates against the first user message and the top-K tools are bound (reserving 1 slot forfind_toolswheninclude_find_tools=True).RuntimeError. Silent truncation would corrupt benchmark results comparing native tool-calling vs text-mode — failing loudly is intentional.Closes #202
New settings
advanced_features.cuga_lite_bind_tools_max_count128(matches Groq/OpenAI).0(or negative) to disable the cap for permissive backends (WatsonX, Anthropic via LiteLLM) — also avoids the shortlister LLM round-trip entirely.DYNACONF_ADVANCED_FEATURES__CUGA_LITE_BIND_TOOLS_MAX_COUNT.advanced_features.cuga_lite_bind_tools_pad_to_capfalse. When the shortlister returns fewer tools than the cap allows, the default behavior is to bind only those tools (often 1–4 on the existing system prompt).trueto pad the shortlister output with the remaining candidates so the cap is fully utilized. Useful for research scenarios that explicitly wantmode=allto bind as many tools as the provider will accept.tool_callsmode, which the code-mode flow doesn't fully exercise (measured on m3 hockey: 0 tool calls with padding vs 5–7 without).DYNACONF_ADVANCED_FEATURES__CUGA_LITE_BIND_TOOLS_PAD_TO_CAP.Operational notes
max_count, applying the cap incurs one extra LLM round-trip (the shortlister) percall_modelinvocation. To avoid the round-trip on permissive backends, setcuga_lite_bind_tools_max_count=0.RuntimeErroris raised rather than silently truncating. Benchmark/research runs comparing native tool-calling vs text-mode therefore can't degrade silently; downstream callers should expect to see this error and either widen the cap, fix the shortlister LLM, or refine the user query.Where the change lives
src/cuga/backend/cuga_graph/nodes/cuga_lite/bind_tools/(new subpackage)cap.py—bind_tools_max_count_from_settings,bind_tools_pad_to_cap_from_settings,apply_bind_tools_cap_and_merge, plus private helpers (_resolve_find_tools_overlay,_build_ranking_pool,_run_shortlister,_materialize_shortlist,_maybe_pad_to_cap).__init__.py— re-exports the three public names used bycuga_lite_graph.py.src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.pyresolve_model_with_bind_toolsstays as orchestration — it imports frombind_toolsand uses a local_cap_merge_bound(bound)closure to DRY the fourmode={all,apps,tools,apps_and_tools}call sites.resolve_model_with_bind_toolsgains an optionalquery: Optional[str]kwarg.except Exceptionno longer swallows the actionableRuntimeErrorfrom the cap helper.call_modelpasses_first_user_message_text(state.chat_messages)through asquery.src/cuga/backend/cuga_graph/nodes/cuga_lite/prompt_utils.pyPromptUtils.shortlist_tool_names(query, all_tools, all_apps, llm, top_k, instructions)— returns ranked tool names. Internally reuses the existingShortListerOutputLitechain (./prompts/shortlister/system.jinja2); injects atop_kinstruction.PromptUtils._build_shortlister_payload(...)— shared payload builder betweenfind_toolsandshortlist_tool_namesso the two callers can't drift.src/cuga/config.pycuga_lite_bind_tools_max_count(default 128) andcuga_lite_bind_tools_pad_to_cap(defaultFalse).src/cuga/settings.tomlcuga_lite_bind_tools_*block.Backward compatibility
CUGA_LITE_BIND_TOOLS_MAX_COUNT=0.CUGA_LITE_BIND_TOOLS_PAD_TO_CAP=true.Test plan
pytest tests/unit/test_cuga_lite_bind_tools.py -q— 18 passed (existing + new):RuntimeErrorraised when over cap and no queryRuntimeErrorraised when shortlister hallucinates namesinclude_find_toolsreserves a slot (top_k = cap − 1) and find_tools is appended after shortlistingpad_to_cap=truepads with remaining candidates up to the capruff check+ruff format --checkon all touched files: clean./scripts/eval.sh --benchmark m3 --capability m3_task_2 --domain hockey --max-samples 1withCUGA_LITE_BIND_TOOLS_MODE=all) — to be exercised by reporterSummary by CodeRabbit
New Features
Bug Fixes
Tests
bind_toolscap +pad_to_capcross-provider verificationm3 hockey task (
task_2,--max-samples 1) withCUGA_LITE_BIND_TOOLS_MODE=all, cap=128.pad_to_capperm3.envat run time. Updated byscripts/m3_pad_to_cap_verify.sh.groq6af34316litellmacdbf18bwatsonxacdbf18b