-
Notifications
You must be signed in to change notification settings - Fork 0
feat: add agent evaluation CLI #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Implement evaluation runner to execute agent test suites - Create report generator for JSON and Markdown outputs - Add assertion evaluation utilities for various assertion types - Introduce suite loader for loading evaluation suites from JSON files - Update README and add checklist for post-scaffold tasks
- Implement fileExists, fileContains, and fileJsonPath assertions - Update assertion evaluation to handle async file checks - Create tool registry for managing file-related tools
…work - Implement delete file tool with path validation - Add fileNotExists assertion schema and evaluation logic - Update tools suite to include delete file tests
…ll paths - Modify JSON and Markdown report saving methods to log relative paths - Introduce toDisplayPath method for consistent path formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c20e80cc64
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| export const evaluateFileExistsAssertion = async ( | ||
| assertion: FileExistsAssertion | ||
| ): Promise<AssertionResult> => { | ||
| const fullPath = path.join(TMP_ROOT, assertion.path); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enforce tmp-only paths for file assertions
The new file assertions build fullPath via path.join(TMP_ROOT, assertion.path) with no traversal/absolute-path checks. If a suite provides path like ../secrets.env or an absolute path, path.join will resolve outside tmp/, allowing these assertions to read or probe arbitrary files. This breaks the repo’s tmp-only sandbox guarantee and can leak sensitive data during evals. Use the same validation helpers as the tools (e.g., resolveTmpPathForRead) or explicitly reject absolute/traversal paths before reading.
Useful? React with 👍 / 👎.
| description: | ||
| "Relative path within the repo tmp directory (optional, defaults to tmp root)", | ||
| 'Relative path within the repo tmp directory. Use empty string "" to list tmp root.', | ||
| }, | ||
| }, | ||
| required: [], | ||
| required: ["path"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don’t require listFiles.path when root listing is supported
The tool schema now marks path as required, which means any tool call that omits it will be rejected by schema validation. That contradicts the stated behavior (“If no path provided, lists root of tmp”) and breaks callers that rely on omitting path to list the tmp root. Either make path optional again or update all callers/docs to pass an explicit empty string.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a comprehensive agent evaluation framework that enables automated testing of AI agents with PASS/FAIL results and detailed reporting. The framework includes assertion types for validating agent outputs and file system side effects, a suite-based evaluation runner, and report generation in JSON and Markdown formats.
Changes:
- Added a complete evaluation CLI with runner, suite loader, report generator, and assertion framework
- Introduced
deleteFiletool with path validation and safety checks - Made
AgentRunner.outputTypeoptional to support plain text responses - Enhanced
parse-argsutility to handle argument parsing with configurable input - Updated
listFilestool to require thepathparameter for consistency
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/utils/parse-args.ts | Added support for custom raw args and sanitization of standalone "--" separator |
| src/utils/parse-args.test.ts | Comprehensive test coverage for argument parsing with double-dash separator handling |
| src/tools/utils/fs.ts | Added resolveTmpPathForDelete with security validation for file deletion |
| src/tools/read-file/read-file-tool.ts | Updated logging to display relative paths for consistency |
| src/tools/list-files/list-files-tool.ts | Changed path parameter from optional to required with empty string support |
| src/tools/delete-file/delete-file-tool.ts | New tool for safe file deletion under tmp directory |
| src/tools/delete-file/delete-file-tool.test.ts | Complete test coverage for delete file operations and security validations |
| src/clients/agent-runner.ts | Made outputType optional to support agents without structured output |
| src/cli/agent-evals/utils/file-assertions.ts | Assertion evaluators for file existence, content, and JSON path checks |
| src/cli/agent-evals/utils/assertions.ts | Core assertion evaluation logic for output validation |
| src/cli/agent-evals/utils/assertions.test.ts | Test coverage for assertion types |
| src/cli/agent-evals/suites/tools.json | Evaluation suite for testing file operation tools |
| src/cli/agent-evals/suites/example.json | Example evaluation suite demonstrating assertion patterns |
| src/cli/agent-evals/schemas.ts | Zod schemas for CLI args, assertions, suites, and results |
| src/cli/agent-evals/main.ts | CLI entry point orchestrating suite loading, execution, and reporting |
| src/cli/agent-evals/constants.ts | Configuration constants for the evaluation framework |
| src/cli/agent-evals/clients/tool-registry.ts | Tool factory registry for creating tool instances from names |
| src/cli/agent-evals/clients/suite-loader.ts | Suite definition loader from JSON files |
| src/cli/agent-evals/clients/report-generator.ts | Report generation in JSON and Markdown formats |
| src/cli/agent-evals/clients/eval-runner.ts | Evaluation orchestrator executing cases and collecting results |
| src/cli/agent-evals/README.md | Documentation for the evaluation framework |
| package.json | Added run:agent-evals script |
| README.md | Updated with agent evals documentation and tools table |
| AGENTS.md | Updated tool documentation and conventions |
| .claude/settings.json | Added scaffold-cli script to allow list and reordered fields |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| }, | ||
| }, | ||
| required: [], | ||
| required: ["path"], |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path parameter was changed from optional to required. While the implementation handles empty strings gracefully (converting to undefined), this is a breaking API change. The test on line 57 that calls the tool with an empty object {} will likely fail during parameter validation since path is now marked as required. Consider either keeping path optional or updating the test to pass an empty string explicitly.
| const deepEquals = (a: unknown, b: unknown): boolean => { | ||
| return JSON.stringify(a) === JSON.stringify(b); |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deepEquals function uses JSON.stringify for equality comparison, which may produce inconsistent results for objects with different key ordering. For example, {a: 1, b: 2} and {b: 2, a: 1} are semantically equal but would fail this comparison. Consider using a deep equality library or sorting object keys before stringification to ensure reliable comparisons.
| const deepEquals = (a: unknown, b: unknown): boolean => { | ||
| return JSON.stringify(a) === JSON.stringify(b); |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deepEquals function uses JSON.stringify for equality comparison, which may produce inconsistent results for objects with different key ordering. For example, {a: 1, b: 2} and {b: 2, a: 1} are semantically equal but would fail this comparison. Consider using a deep equality library or sorting object keys before stringification to ensure reliable comparisons.
| const deepEquals = (a: unknown, b: unknown): boolean => { | ||
| return JSON.stringify(a) === JSON.stringify(b); | ||
| }; |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deepEquals and getJsonPath helper functions are duplicated between this file and assertions.ts. Consider extracting these shared utilities into a separate helper file to follow the DRY principle and make maintenance easier.
| - Use Zod schemas for CLI args and tool IO. | ||
| - Keep object field names in `camelCase` (e.g., `trainSamples`), not `snake_case`. | ||
| - Keep Zod schemas in a dedicated `schemas.ts` file for each CLI (avoid inline schemas in `main.ts`). | ||
| - Keep Zod schemas in a dedicated `types/schemas.ts` file for each CLI (avoid inline schemas in `main.ts`). |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation states schemas should be in types/schemas.ts, but this PR and other CLIs (like etf-backtest) place schemas directly as schemas.ts in the CLI directory. The codebase is inconsistent - some CLIs use types/ subdirectories (name-explorer, scrape-publications) while others don't. Either update the documentation to reflect actual practice or establish a consistent pattern across all CLIs.
| describe("file assertions", () => { | ||
| const TEST_DIR = path.join(TMP_ROOT, "assertion-tests"); | ||
| const TEST_FILE = path.join(TEST_DIR, "test-file.txt"); | ||
| const TEST_JSON_FILE = path.join(TEST_DIR, "test-data.json"); | ||
|
|
||
| beforeAll(async () => { | ||
| await fs.mkdir(TEST_DIR, { recursive: true }); | ||
| await fs.writeFile(TEST_FILE, "Hello World\nThis is test content."); | ||
| await fs.writeFile( | ||
| TEST_JSON_FILE, | ||
| JSON.stringify({ name: "test", value: 42, nested: { key: "value" } }) | ||
| ); | ||
| }); | ||
|
|
||
| afterAll(async () => { | ||
| await fs.rm(TEST_DIR, { recursive: true, force: true }); | ||
| }); | ||
|
|
||
| describe("fileExists", () => { | ||
| it("passes when file exists", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileExists", | ||
| path: "assertion-tests/test-file.txt", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(true); | ||
| expect(result.message).toContain("exists"); | ||
| }); | ||
|
|
||
| it("fails when file does not exist", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileExists", | ||
| path: "assertion-tests/nonexistent.txt", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(false); | ||
| expect(result.message).toContain("does not exist"); | ||
| }); | ||
| }); | ||
|
|
||
| describe("fileContains", () => { | ||
| it("passes when file contains the value", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileContains", | ||
| path: "assertion-tests/test-file.txt", | ||
| value: "Hello World", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(true); | ||
| expect(result.message).toContain("contains"); | ||
| }); | ||
|
|
||
| it("fails when file does not contain the value", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileContains", | ||
| path: "assertion-tests/test-file.txt", | ||
| value: "Goodbye", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(false); | ||
| expect(result.message).toContain("does not contain"); | ||
| }); | ||
|
|
||
| it("is case sensitive by default", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileContains", | ||
| path: "assertion-tests/test-file.txt", | ||
| value: "HELLO WORLD", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(false); | ||
| }); | ||
|
|
||
| it("respects caseSensitive: false", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileContains", | ||
| path: "assertion-tests/test-file.txt", | ||
| value: "HELLO WORLD", | ||
| caseSensitive: false, | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(true); | ||
| }); | ||
|
|
||
| it("fails gracefully when file does not exist", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileContains", | ||
| path: "assertion-tests/nonexistent.txt", | ||
| value: "test", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(false); | ||
| expect(result.message).toContain("Failed to read file"); | ||
| }); | ||
| }); | ||
|
|
||
| describe("fileJsonPath", () => { | ||
| it("extracts and compares JSON values", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileJsonPath", | ||
| path: "assertion-tests/test-data.json", | ||
| jsonPath: "name", | ||
| expected: "test", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(true); | ||
| }); | ||
|
|
||
| it("supports $. prefix in jsonPath", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileJsonPath", | ||
| path: "assertion-tests/test-data.json", | ||
| jsonPath: "$.value", | ||
| expected: 42, | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(true); | ||
| }); | ||
|
|
||
| it("handles nested paths", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileJsonPath", | ||
| path: "assertion-tests/test-data.json", | ||
| jsonPath: "nested.key", | ||
| expected: "value", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(true); | ||
| }); | ||
|
|
||
| it("fails when value does not match", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileJsonPath", | ||
| path: "assertion-tests/test-data.json", | ||
| jsonPath: "value", | ||
| expected: 100, | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(false); | ||
| }); | ||
|
|
||
| it("fails gracefully for missing path", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileJsonPath", | ||
| path: "assertion-tests/test-data.json", | ||
| jsonPath: "missing.path", | ||
| expected: "value", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(false); | ||
| expect(result.message).toContain("Failed to evaluate"); | ||
| }); | ||
|
|
||
| it("fails gracefully when file does not exist", async () => { | ||
| const assertion: Assertion = { | ||
| type: "fileJsonPath", | ||
| path: "assertion-tests/nonexistent.json", | ||
| jsonPath: "key", | ||
| expected: "value", | ||
| }; | ||
| const result = await evaluateAssertion(assertion, null); | ||
| expect(result.passed).toBe(false); | ||
| expect(result.message).toContain("Failed to evaluate"); | ||
| }); | ||
| }); | ||
| }); | ||
| }); |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fileNotExists assertion is implemented and included in the schemas but is not covered by any test cases in this test file. Consider adding test coverage for the fileNotExists assertion type to ensure it works correctly and maintain consistent test coverage across all assertion types.
- Implement path traversal rejection in file assertions - Add resolveTmpPathForAccess utility to validate paths
- Replace custom deep equality function with lodash's isEqual - Update assertions to handle objects with different key ordering - Add lodash-es type definitions to package.json and pnpm-lock.yaml
- Move schemas from the root to a dedicated types directory - Update import paths across multiple files to reflect new structure
What
Introduce a new CLI for agent evaluation that includes an evaluation runner, reporting framework, and various assertion types for file operations. This update enhances the evaluation framework by adding tools for reading, listing, and deleting files, along with corresponding assertions. Documentation has been updated to reflect these changes, including a checklist for post-scaffold tasks.
Added CLI command for agent evaluations.
Implemented evaluation runner and report generator for JSON and Markdown outputs.
Introduced file assertion types:
fileExists,fileContains, andfileJsonPath.Added a delete file tool with validation and logging.
Updated README and AGENTS.md to include new features and usage instructions.
How to test
Run
pnpm startto execute the new agent evaluation CLI.Use
pnpm testto run unit tests, ensuring all new features are covered.Validate the output of the evaluation reports in both JSON and Markdown formats.
Security review
Secrets / env vars: not changed.
Auth / session: not changed.
Network / API calls: not changed.
Data handling / PII: not changed.
Dependencies: not changed.
No security-impacting changes identified.
No new dependencies and no network calls.
No env var changes and no auth/session logic touched.