fix(eval): include function-call events in invocation_events when skip_summarization is set#5417
Conversation
|
Hi @Koushik-Salammagari , Thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix formatting errors by running autoformat.sh |
7e902ee to
709d857
Compare
|
Hi @rohityan — rebased onto latest |
|
Hi @Jacksunwei , can you please review this. |
…in thread pool When RunConfig.tool_thread_pool_config is enabled, _call_tool_in_thread_pool used None as a sentinel to distinguish "FunctionTool ran in thread pool" from "non-FunctionTool sync tool, needs async fallback". Because None is also a valid return value from any FunctionTool whose underlying function has no explicit return statement (implicit None), the sentinel check failed and execution fell through to tool.run_async(), invoking the function a second time silently. Replace the None sentinel with a dedicated _SYNC_TOOL_RESULT_UNSET object so that a legitimate None result from a FunctionTool is correctly returned on the first execution, without triggering the async fallback path. Fixes google#5284
…ases
Per reviewer feedback: collapse the two near-identical None tests into a
single @pytest.mark.parametrize test, and add falsy-but-not-None cases
(0, '', {}, False) to prove the sentinel is identity-based and does not
mishandle any falsy return value from a FunctionTool.
…p_summarization is set EvaluationGenerator.convert_events_to_eval_invocations builds invocation_events by excluding the final_event from intermediate steps. However, is_final_response() returns True for any event with skip_summarization=True, even when that event contains function calls (e.g. tools using skip_summarization to bypass LLM summarization). Such events were incorrectly excluded from invocation_events, causing get_all_tool_calls() to return an empty list and tool_trajectory_avg_score to always be 0.0 despite matching tool calls. Fix: keep an event in invocation_events even if it is the final_event when it contains function calls. Fixes google#5410
4d7739c to
f860460
Compare
|
Hi @rohityan @Jacksunwei — rebased onto latest |
_call_tool_in_thread_pool uses early returns for each tool category (sync FunctionTool, async tool, and the sync non-FunctionTool fallback), so a sync FunctionTool that returns None exits immediately via its own return and never falls through to the fallback path. The _SYNC_TOOL_RESULT_UNSET sentinel added to guard that case is never referenced anywhere, and its comment describes a fallthrough the code structure already prevents. Remove the dead definition and comment.
|
I have conducted a thorough, read-only analysis of PR #5417 in accordance with the I have generated a premium PR Analysis Report which is now saved and available: Key Takeaways & Recommendations:
Next Steps for You:
|
Link to Issue or Description of Change
Fixes #5410
Description
EvaluationGenerator.convert_events_to_eval_invocationsbuildsinvocation_events(the intermediate tool-call record used byTrajectoryEvaluator) by collecting all qualifying events and then excludingthe
final_eventfrom the list.The final event is identified via
event.is_final_response(), butis_final_response()returnsTruefor any event withskip_summarization=True— even events that containfunction_callparts(e.g. tools that use
skip_summarizationto surface their result directlywithout an LLM summarization step). Those events were silently dropped from
invocation_events, causingget_all_tool_calls()to return[]for theactual invocation. The result:
tool_trajectory_avg_scorewas always 0.0even when the tool name and args matched the expected exactly.
Root cause:
is_final_response()conflates "final user-visible response"with "should be excluded from tool trajectory". When
skip_summarization=Truethe function-call event is both the final response and an intermediate step
that must appear in the trajectory.
Fix: in the list comprehension that builds
invocation_events, keep anevent even when it equals
final_eventif it contains function calls:Changes
src/google/adk/evaluation/evaluation_generator.py: one-line fixtests/unittests/evaluation/test_evaluation_generator.py: regression test that verifies tool calls are preserved whenskip_summarization=Truetests/unittests/evaluation/test_trajectory_evaluator.py: end-to-end tests forInvocationEventsintermediate_data format (exact match → 1.0, mismatch → 0.0)Testing Plan