feat: add agent evaluation CLI #9

valuecodes · 2026-01-27T13:15:19Z

What

Introduce a new CLI for agent evaluation that includes an evaluation runner, reporting framework, and various assertion types for file operations. This update enhances the evaluation framework by adding tools for reading, listing, and deleting files, along with corresponding assertions. Documentation has been updated to reflect these changes, including a checklist for post-scaffold tasks.

Added CLI command for agent evaluations.
Implemented evaluation runner and report generator for JSON and Markdown outputs.
Introduced file assertion types: fileExists, fileContains, and fileJsonPath.
Added a delete file tool with validation and logging.
Updated README and AGENTS.md to include new features and usage instructions.

How to test

Run pnpm start to execute the new agent evaluation CLI.
Use pnpm test to run unit tests, ensuring all new features are covered.
Validate the output of the evaluation reports in both JSON and Markdown formats.

Security review

Secrets / env vars: not changed.
Auth / session: not changed.
Network / API calls: not changed.
Data handling / PII: not changed.
Dependencies: not changed.

No security-impacting changes identified.

No new dependencies and no network calls.
No env var changes and no auth/session logic touched.

- Implement evaluation runner to execute agent test suites - Create report generator for JSON and Markdown outputs - Add assertion evaluation utilities for various assertion types - Introduce suite loader for loading evaluation suites from JSON files - Update README and add checklist for post-scaffold tasks

- Implement fileExists, fileContains, and fileJsonPath assertions - Update assertion evaluation to handle async file checks - Create tool registry for managing file-related tools

…work - Implement delete file tool with path validation - Add fileNotExists assertion schema and evaluation logic - Update tools suite to include delete file tests

…ll paths - Modify JSON and Markdown report saving methods to log relative paths - Introduce toDisplayPath method for consistent path formatting

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c20e80cc64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-01-27T13:19:10Z

src/cli/agent-evals/utils/file-assertions.ts

+export const evaluateFileExistsAssertion = async (
+  assertion: FileExistsAssertion
+): Promise<AssertionResult> => {
+  const fullPath = path.join(TMP_ROOT, assertion.path);
+


Enforce tmp-only paths for file assertions

The new file assertions build fullPath via path.join(TMP_ROOT, assertion.path) with no traversal/absolute-path checks. If a suite provides path like ../secrets.env or an absolute path, path.join will resolve outside tmp/, allowing these assertions to read or probe arbitrary files. This breaks the repo’s tmp-only sandbox guarantee and can leak sensitive data during evals. Use the same validation helpers as the tools (e.g., resolveTmpPathForRead) or explicitly reject absolute/traversal paths before reading.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-01-27T13:19:10Z

src/tools/list-files/list-files-tool.ts

          description:
-            "Relative path within the repo tmp directory (optional, defaults to tmp root)",
+            'Relative path within the repo tmp directory. Use empty string "" to list tmp root.',
        },
      },
-      required: [],
+      required: ["path"],


Don’t require listFiles.path when root listing is supported

The tool schema now marks path as required, which means any tool call that omits it will be rejected by schema validation. That contradicts the stated behavior (“If no path provided, lists root of tmp”) and breaks callers that rely on omitting path to list the tmp root. Either make path optional again or update all callers/docs to pass an explicit empty string.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

This PR introduces a comprehensive agent evaluation framework that enables automated testing of AI agents with PASS/FAIL results and detailed reporting. The framework includes assertion types for validating agent outputs and file system side effects, a suite-based evaluation runner, and report generation in JSON and Markdown formats.

Changes:

Added a complete evaluation CLI with runner, suite loader, report generator, and assertion framework
Introduced deleteFile tool with path validation and safety checks
Made AgentRunner.outputType optional to support plain text responses
Enhanced parse-args utility to handle argument parsing with configurable input
Updated listFiles tool to require the path parameter for consistency

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/utils/parse-args.ts	Added support for custom raw args and sanitization of standalone "--" separator
src/utils/parse-args.test.ts	Comprehensive test coverage for argument parsing with double-dash separator handling
src/tools/utils/fs.ts	Added resolveTmpPathForDelete with security validation for file deletion
src/tools/read-file/read-file-tool.ts	Updated logging to display relative paths for consistency
src/tools/list-files/list-files-tool.ts	Changed path parameter from optional to required with empty string support
src/tools/delete-file/delete-file-tool.ts	New tool for safe file deletion under tmp directory
src/tools/delete-file/delete-file-tool.test.ts	Complete test coverage for delete file operations and security validations
src/clients/agent-runner.ts	Made outputType optional to support agents without structured output
src/cli/agent-evals/utils/file-assertions.ts	Assertion evaluators for file existence, content, and JSON path checks
src/cli/agent-evals/utils/assertions.ts	Core assertion evaluation logic for output validation
src/cli/agent-evals/utils/assertions.test.ts	Test coverage for assertion types
src/cli/agent-evals/suites/tools.json	Evaluation suite for testing file operation tools
src/cli/agent-evals/suites/example.json	Example evaluation suite demonstrating assertion patterns
src/cli/agent-evals/schemas.ts	Zod schemas for CLI args, assertions, suites, and results
src/cli/agent-evals/main.ts	CLI entry point orchestrating suite loading, execution, and reporting
src/cli/agent-evals/constants.ts	Configuration constants for the evaluation framework
src/cli/agent-evals/clients/tool-registry.ts	Tool factory registry for creating tool instances from names
src/cli/agent-evals/clients/suite-loader.ts	Suite definition loader from JSON files
src/cli/agent-evals/clients/report-generator.ts	Report generation in JSON and Markdown formats
src/cli/agent-evals/clients/eval-runner.ts	Evaluation orchestrator executing cases and collecting results
src/cli/agent-evals/README.md	Documentation for the evaluation framework
package.json	Added run:agent-evals script
README.md	Updated with agent evals documentation and tools table
AGENTS.md	Updated tool documentation and conventions
.claude/settings.json	Added scaffold-cli script to allow list and reordered fields

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-27T13:30:20Z

src/tools/list-files/list-files-tool.ts

        },
      },
-      required: [],
+      required: ["path"],


The path parameter was changed from optional to required. While the implementation handles empty strings gracefully (converting to undefined), this is a breaking API change. The test on line 57 that calls the tool with an empty object {} will likely fail during parameter validation since path is now marked as required. Consider either keeping path optional or updating the test to pass an empty string explicitly.

Copilot · 2026-01-27T13:30:20Z

src/cli/agent-evals/utils/assertions.ts

+const deepEquals = (a: unknown, b: unknown): boolean => {
+  return JSON.stringify(a) === JSON.stringify(b);


The deepEquals function uses JSON.stringify for equality comparison, which may produce inconsistent results for objects with different key ordering. For example, {a: 1, b: 2} and {b: 2, a: 1} are semantically equal but would fail this comparison. Consider using a deep equality library or sorting object keys before stringification to ensure reliable comparisons.

Copilot · 2026-01-27T13:30:20Z

src/cli/agent-evals/utils/file-assertions.ts

+const deepEquals = (a: unknown, b: unknown): boolean => {
+  return JSON.stringify(a) === JSON.stringify(b);


The deepEquals function uses JSON.stringify for equality comparison, which may produce inconsistent results for objects with different key ordering. For example, {a: 1, b: 2} and {b: 2, a: 1} are semantically equal but would fail this comparison. Consider using a deep equality library or sorting object keys before stringification to ensure reliable comparisons.

Copilot · 2026-01-27T13:30:21Z

src/cli/agent-evals/utils/file-assertions.ts

+const deepEquals = (a: unknown, b: unknown): boolean => {
+  return JSON.stringify(a) === JSON.stringify(b);
+};


The deepEquals and getJsonPath helper functions are duplicated between this file and assertions.ts. Consider extracting these shared utilities into a separate helper file to follow the DRY principle and make maintenance easier.

Copilot · 2026-01-27T13:30:21Z

AGENTS.md

 - Use Zod schemas for CLI args and tool IO.
 - Keep object field names in `camelCase` (e.g., `trainSamples`), not `snake_case`.
- Keep Zod schemas in a dedicated `schemas.ts` file for each CLI (avoid inline schemas in `main.ts`).
+- Keep Zod schemas in a dedicated `types/schemas.ts` file for each CLI (avoid inline schemas in `main.ts`).


The documentation states schemas should be in types/schemas.ts, but this PR and other CLIs (like etf-backtest) place schemas directly as schemas.ts in the CLI directory. The codebase is inconsistent - some CLIs use types/ subdirectories (name-explorer, scrape-publications) while others don't. Either update the documentation to reflect actual practice or establish a consistent pattern across all CLIs.

Copilot · 2026-01-27T13:30:21Z

src/cli/agent-evals/utils/assertions.test.ts

+  describe("file assertions", () => {
+    const TEST_DIR = path.join(TMP_ROOT, "assertion-tests");
+    const TEST_FILE = path.join(TEST_DIR, "test-file.txt");
+    const TEST_JSON_FILE = path.join(TEST_DIR, "test-data.json");
+
+    beforeAll(async () => {
+      await fs.mkdir(TEST_DIR, { recursive: true });
+      await fs.writeFile(TEST_FILE, "Hello World\nThis is test content.");
+      await fs.writeFile(
+        TEST_JSON_FILE,
+        JSON.stringify({ name: "test", value: 42, nested: { key: "value" } })
+      );
+    });
+
+    afterAll(async () => {
+      await fs.rm(TEST_DIR, { recursive: true, force: true });
+    });
+
+    describe("fileExists", () => {
+      it("passes when file exists", async () => {
+        const assertion: Assertion = {
+          type: "fileExists",
+          path: "assertion-tests/test-file.txt",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(true);
+        expect(result.message).toContain("exists");
+      });
+
+      it("fails when file does not exist", async () => {
+        const assertion: Assertion = {
+          type: "fileExists",
+          path: "assertion-tests/nonexistent.txt",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(false);
+        expect(result.message).toContain("does not exist");
+      });
+    });
+
+    describe("fileContains", () => {
+      it("passes when file contains the value", async () => {
+        const assertion: Assertion = {
+          type: "fileContains",
+          path: "assertion-tests/test-file.txt",
+          value: "Hello World",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(true);
+        expect(result.message).toContain("contains");
+      });
+
+      it("fails when file does not contain the value", async () => {
+        const assertion: Assertion = {
+          type: "fileContains",
+          path: "assertion-tests/test-file.txt",
+          value: "Goodbye",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(false);
+        expect(result.message).toContain("does not contain");
+      });
+
+      it("is case sensitive by default", async () => {
+        const assertion: Assertion = {
+          type: "fileContains",
+          path: "assertion-tests/test-file.txt",
+          value: "HELLO WORLD",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(false);
+      });
+
+      it("respects caseSensitive: false", async () => {
+        const assertion: Assertion = {
+          type: "fileContains",
+          path: "assertion-tests/test-file.txt",
+          value: "HELLO WORLD",
+          caseSensitive: false,
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(true);
+      });
+
+      it("fails gracefully when file does not exist", async () => {
+        const assertion: Assertion = {
+          type: "fileContains",
+          path: "assertion-tests/nonexistent.txt",
+          value: "test",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(false);
+        expect(result.message).toContain("Failed to read file");
+      });
+    });
+
+    describe("fileJsonPath", () => {
+      it("extracts and compares JSON values", async () => {
+        const assertion: Assertion = {
+          type: "fileJsonPath",
+          path: "assertion-tests/test-data.json",
+          jsonPath: "name",
+          expected: "test",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(true);
+      });
+
+      it("supports $. prefix in jsonPath", async () => {
+        const assertion: Assertion = {
+          type: "fileJsonPath",
+          path: "assertion-tests/test-data.json",
+          jsonPath: "$.value",
+          expected: 42,
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(true);
+      });
+
+      it("handles nested paths", async () => {
+        const assertion: Assertion = {
+          type: "fileJsonPath",
+          path: "assertion-tests/test-data.json",
+          jsonPath: "nested.key",
+          expected: "value",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(true);
+      });
+
+      it("fails when value does not match", async () => {
+        const assertion: Assertion = {
+          type: "fileJsonPath",
+          path: "assertion-tests/test-data.json",
+          jsonPath: "value",
+          expected: 100,
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(false);
+      });
+
+      it("fails gracefully for missing path", async () => {
+        const assertion: Assertion = {
+          type: "fileJsonPath",
+          path: "assertion-tests/test-data.json",
+          jsonPath: "missing.path",
+          expected: "value",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(false);
+        expect(result.message).toContain("Failed to evaluate");
+      });
+
+      it("fails gracefully when file does not exist", async () => {
+        const assertion: Assertion = {
+          type: "fileJsonPath",
+          path: "assertion-tests/nonexistent.json",
+          jsonPath: "key",
+          expected: "value",
+        };
+        const result = await evaluateAssertion(assertion, null);
+        expect(result.passed).toBe(false);
+        expect(result.message).toContain("Failed to evaluate");
+      });
+    });
+  });
+});


The fileNotExists assertion is implemented and included in the schemas but is not covered by any test cases in this test file. Consider adding test coverage for the fileNotExists assertion type to ensure it works correctly and maintain consistent test coverage across all assertion types.

- Implement path traversal rejection in file assertions - Add resolveTmpPathForAccess utility to validate paths

- Replace custom deep equality function with lodash's isEqual - Update assertions to handle objects with different key ordering - Add lodash-es type definitions to package.json and pnpm-lock.yaml

- Move schemas from the root to a dedicated types directory - Update import paths across multiple files to reflect new structure

valuecodes added 10 commits January 27, 2026 11:03

docs: update AGENTS.md to specify schemas.ts for pipeline classes

6ef7d0c

test: add unit tests for parseArgs function with Zod validation

01afdc9

feat: add file assertion types and enhance evaluation framework

4e16806

- Implement fileExists, fileContains, and fileJsonPath assertions - Update assertion evaluation to handle async file checks - Create tool registry for managing file-related tools

docs: update README with new suite field notes and file assertion types

7a1e10d

feat: add delete file tool and related assertions to evaluation frame…

d3d07b0

…work - Implement delete file tool with path validation - Add fileNotExists assertion schema and evaluation logic - Update tools suite to include delete file tests

refactor: update report generation to log display paths instead of fu…

a66279f

…ll paths - Modify JSON and Markdown report saving methods to log relative paths - Introduce toDisplayPath method for consistent path formatting

feat: add deleteFile tool and update documentation for agent evals

ae95f75

chore: remove outdated CHECKLIST.md file

19e0283

docs: add comment to clarify purpose of sanitizeArgs function

c20e80c

Copilot AI review requested due to automatic review settings January 27, 2026 13:15

Copilot started reviewing on behalf of valuecodes January 27, 2026 13:15 View session

chatgpt-codex-connector bot reviewed Jan 27, 2026

View reviewed changes

Copilot AI reviewed Jan 27, 2026

View reviewed changes

valuecodes added 4 commits January 27, 2026 17:05

feat: add path traversal checks to file assertions and utilities

c9a6fd9

- Implement path traversal rejection in file assertions - Add resolveTmpPathForAccess utility to validate paths

refactor: update description and test case for empty path handling

880f072

feat: add lodash-es for deep equality checks in assertions

8659764

- Replace custom deep equality function with lodash's isEqual - Update assertions to handle objects with different key ordering - Add lodash-es type definitions to package.json and pnpm-lock.yaml

refactor: reorganize schemas into separate types directory

384b9e2

- Move schemas from the root to a dedicated types directory - Update import paths across multiple files to reflect new structure

valuecodes merged commit 05abbb3 into main Jan 27, 2026
4 checks passed

valuecodes deleted the agent-evals branch January 27, 2026 19:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add agent evaluation CLI #9

feat: add agent evaluation CLI #9

Uh oh!

valuecodes commented Jan 27, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Jan 27, 2026

Uh oh!

chatgpt-codex-connector bot Jan 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Copilot AI Jan 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		const deepEquals = (a: unknown, b: unknown): boolean => {
		return JSON.stringify(a) === JSON.stringify(b);

feat: add agent evaluation CLI #9

feat: add agent evaluation CLI #9

Uh oh!

Conversation

valuecodes commented Jan 27, 2026

What

How to test

Security review

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants