Description:
Our framework needs to support evaluation on a wider range of benchmarks, especially those that require tool-based evaluation (e.g., SWE-bench, WebArena).
Proposed Benchmarks to Support:
- GAIA
- SWE-bench
- WebArena
- HotPotQA
Considerations:
- Agent performance can vary considerably across benchmarks, so developers should select benchmarks that match the characteristics of their agent system.
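
One possible shape for this, offered only as a rough sketch: a small registry that records whether each benchmark needs tool-based evaluation, so developers can filter for the ones that suit their agent. `BenchmarkSpec`, `register`, and `select` are hypothetical names, not part of the existing framework. The `tool_based` flags for SWE-bench and WebArena follow the description above; the flags for GAIA and HotPotQA are placeholder assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BenchmarkSpec:
    """Hypothetical description of one benchmark; not an existing framework API."""
    name: str
    tool_based: bool  # True if evaluation needs a tool environment


# Hypothetical registry keyed by lowercase benchmark name.
_REGISTRY: Dict[str, BenchmarkSpec] = {}


def register(spec: BenchmarkSpec) -> None:
    """Add a benchmark to the registry, rejecting duplicate names."""
    key = spec.name.lower()
    if key in _REGISTRY:
        raise ValueError(f"benchmark already registered: {spec.name}")
    _REGISTRY[key] = spec


def select(tool_based: bool) -> List[str]:
    """Return sorted names of registered benchmarks with the given tool_based flag."""
    return sorted(s.name for s in _REGISTRY.values() if s.tool_based == tool_based)


# SWE-bench and WebArena are the tool-based examples named in this issue.
# The flags for GAIA and HotPotQA are placeholder assumptions.
for _spec in (
    BenchmarkSpec("GAIA", tool_based=False),
    BenchmarkSpec("SWE-bench", tool_based=True),
    BenchmarkSpec("WebArena", tool_based=True),
    BenchmarkSpec("HotPotQA", tool_based=False),
):
    register(_spec)
```

With this layout, `select(tool_based=True)` would list the benchmarks that need a tool environment, and a new benchmark could be added with a single `register` call. A real implementation would also need per-benchmark dataset loading and scoring, which this sketch leaves out.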