Please make sure you read the contribution guide and file the issues in the right place.
Contribution guide.
🔴 Required Information
Please ensure all items in this section are completed to allow for efficient
triaging. Requests without complete information may be rejected / deprioritized.
If an item is not applicable to you - please mark it as N/A
Is your feature request related to a specific problem?
Yes. In v1.24.1, evaluation behavior differs between `adk eval` and `AgentEvaluator.evaluate`, which makes it harder to run and compare evals consistently across CLI and API/test workflows.
Concrete gaps I hit:
- `adk eval` has no `--num_runs`, so I need external loops to repeat runs (see the workaround sketch after this list).
- `AgentEvaluator.evaluate` does not provide built-in result persistence, so historical comparison is less convenient than CLI-based runs.
- In `adk eval`, `--config_file_path` is optional, but when omitted it falls back to in-code default criteria and does not auto-discover a per-test `test_config.json` next to each eval/test file.
- `adk eval` resolves only `agent_module.agent.root_agent` and does not look for `get_agent_async`.
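For concreteness, the repetition gap currently forces glue like the following. This is a minimal sketch: the agent directory and eval set path are placeholders, and aggregation across the repeated runs is still manual.

```python
# Today's workaround (sketch): loop `adk eval` externally because the CLI
# has no --num_runs flag. The agent directory and eval set path below are
# placeholders; per-run results still have to be collated by hand.
import subprocess

for run in range(3):
    subprocess.run(
        ["adk", "eval", "agents/my_agent", "evals/my_cases.evalset.json"],
        check=True,
    )
```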
Describe the Solution You'd Like
I would like feature parity/consistency between both entry points:
- Add `--num_runs` to `adk eval` (default `1`) and aggregate results per eval case across runs.
- Add optional result persistence support to `AgentEvaluator.evaluate` via an `EvalSetResultsManager` (local by default).
- In `adk eval`, when `--config_file_path` is omitted, auto-discover `test_config.json` adjacent to each input eval/test file, then fall back to default criteria if no file is found.
- Align agent resolution behavior between `adk eval` and `AgentEvaluator`: support both `root_agent` and `get_agent_async` consistently.
Impact on your work
This impacts reproducibility and operational efficiency in day-to-day evals.
- Repeated runs are needed to reduce nondeterminism, but currently require custom scripting for CLI.
- Comparing historical outcomes is easier on one path (the CLI, which persists results) than the other.
- The same dataset/config setup behaves differently depending on whether I use CLI or programmatic evaluation.
- Agent modules that are valid in one evaluation path can fail in the other, forcing workflow-specific agent definitions.
Willingness to contribute
Yes
🟡 Recommended Information
Describe Alternatives You've Considered
- Keeping current differences and documenting them: this still leaves manual repetition/persistence work and split workflows.
- Wrapper scripts outside ADK: works short-term, but duplicates logic and reduces maintainability.
Proposed API / Implementation
Pseudo-proposal:
```sh
# CLI
adk eval <agent_dir> <eval_set...> --num_runs=3 [--config_file_path=...]
```
```python
# Internals (conceptual)
inference_results = []
for _ in range(num_runs):
    inference_results += run_inference(...)
aggregated = aggregate_eval_case_results(inference_results)
save_if_configured(aggregated)
```
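To make the aggregation step concrete, here is one possible shape for it. This is a sketch only: the result-dict keys (`eval_case_id`, `score`) and the mean reduction are assumptions, not existing ADK structures.

```python
# Hypothetical aggregation across runs: group raw per-run results by eval
# case and average their scores. The dict keys and the mean reduction are
# assumptions about what per-run results could carry.
from collections import defaultdict
from statistics import mean

def aggregate_eval_case_results(inference_results: list[dict]) -> dict[str, float]:
    by_case: dict[str, list[float]] = defaultdict(list)
    for result in inference_results:
        by_case[result["eval_case_id"]].append(result["score"])
    return {case_id: mean(scores) for case_id, scores in by_case.items()}
```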
```python
# AgentEvaluator
await AgentEvaluator.evaluate(
    agent_module=...,
    eval_dataset_file_path_or_dir=...,
    num_runs=3,
    eval_set_results_manager=LocalEvalSetResultsManager(...),  # optional
)
```
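Once persistence is wired in, historical comparison could look roughly like this. The import path and method names follow the existing local results manager as I understand it; treat the exact signatures as assumptions.

```python
# Hypothetical usage after the change: list and reload persisted results for
# comparison. Import path and signatures are assumptions based on the
# existing local results manager.
from google.adk.evaluation.local_eval_set_results_manager import (
    LocalEvalSetResultsManager,
)

manager = LocalEvalSetResultsManager(agents_dir="agents")  # placeholder path
for result_id in manager.list_eval_set_results(app_name="my_agent"):
    print(manager.get_eval_set_result("my_agent", result_id))
```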
```python
# Config resolution for adk eval (conceptual)
if config_file_path:
    eval_config = load(config_file_path)
else:
    eval_config = discover_test_config_near_input_or_default(...)
```
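And a minimal sketch of the discovery helper named above; everything beyond the function name (JSON parsing, the default-criteria fallback shape) is an assumption about how it could work.

```python
# Hypothetical discovery helper: look for test_config.json next to the
# eval/test file, falling back to default criteria. The config dict's
# shape is an assumption.
import json
from pathlib import Path

def discover_test_config_near_input_or_default(
    eval_file: str, default_criteria: dict
) -> dict:
    candidate = Path(eval_file).parent / "test_config.json"
    if candidate.exists():
        return json.loads(candidate.read_text())
    return default_criteria
```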
```python
# Conceptual shared loader used by both entry points.
agent_module = _get_agent_module(...)
if hasattr(agent_module.agent, "root_agent"):
    agent = agent_module.agent.root_agent
elif hasattr(agent_module.agent, "get_agent_async"):
    agent = await agent_module.agent.get_agent_async()
else:
    raise ValueError("Expected `root_agent` or `get_agent_async` in agent module.")
```
Additional Context
Goal: make eval behavior consistent regardless of entry point (`adk eval` vs `AgentEvaluator.evaluate`) so that the same eval assets produce comparable outcomes with less custom glue code.