Skip to content

Conversation

@valuecodes
Copy link
Owner

What

Introduce a new CLI for agent evaluation that includes an evaluation runner, reporting framework, and various assertion types for file operations. This update enhances the evaluation framework by adding tools for reading, listing, and deleting files, along with corresponding assertions. Documentation has been updated to reflect these changes, including a checklist for post-scaffold tasks.

  • Added CLI command for agent evaluations.

  • Implemented evaluation runner and report generator for JSON and Markdown outputs.

  • Introduced file assertion types: fileExists, fileContains, and fileJsonPath.

  • Added a delete file tool with validation and logging.

  • Updated README and AGENTS.md to include new features and usage instructions.

How to test

  • Run pnpm start to execute the new agent evaluation CLI.

  • Use pnpm test to run unit tests, ensuring all new features are covered.

  • Validate the output of the evaluation reports in both JSON and Markdown formats.

Security review

  • Secrets / env vars: not changed.

  • Auth / session: not changed.

  • Network / API calls: not changed.

  • Data handling / PII: not changed.

  • Dependencies: not changed.

No security-impacting changes identified.

  • No new dependencies and no network calls.

  • No env var changes and no auth/session logic touched.

- Implement evaluation runner to execute agent test suites
- Create report generator for JSON and Markdown outputs
- Add assertion evaluation utilities for various assertion types
- Introduce suite loader for loading evaluation suites from JSON files
- Update README and add checklist for post-scaffold tasks
- Implement fileExists, fileContains, and fileJsonPath assertions
- Update assertion evaluation to handle async file checks
- Create tool registry for managing file-related tools
…work

- Implement delete file tool with path validation
- Add fileNotExists assertion schema and evaluation logic
- Update tools suite to include delete file tests
…ll paths

- Modify JSON and Markdown report saving methods to log relative paths
- Introduce toDisplayPath method for consistent path formatting
Copilot AI review requested due to automatic review settings January 27, 2026 13:15
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c20e80cc64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 16 to 20
export const evaluateFileExistsAssertion = async (
assertion: FileExistsAssertion
): Promise<AssertionResult> => {
const fullPath = path.join(TMP_ROOT, assertion.path);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Enforce tmp-only paths for file assertions

The new file assertions build fullPath via path.join(TMP_ROOT, assertion.path) with no traversal/absolute-path checks. If a suite provides path like ../secrets.env or an absolute path, path.join will resolve outside tmp/, allowing these assertions to read or probe arbitrary files. This breaks the repo’s tmp-only sandbox guarantee and can leak sensitive data during evals. Use the same validation helpers as the tools (e.g., resolveTmpPathForRead) or explicitly reject absolute/traversal paths before reading.

Useful? React with 👍 / 👎.

Comment on lines 21 to 25
description:
"Relative path within the repo tmp directory (optional, defaults to tmp root)",
'Relative path within the repo tmp directory. Use empty string "" to list tmp root.',
},
},
required: [],
required: ["path"],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don’t require listFiles.path when root listing is supported

The tool schema now marks path as required, which means any tool call that omits it will be rejected by schema validation. That contradicts the stated behavior (“If no path provided, lists root of tmp”) and breaks callers that rely on omitting path to list the tmp root. Either make path optional again or update all callers/docs to pass an explicit empty string.

Useful? React with 👍 / 👎.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a comprehensive agent evaluation framework that enables automated testing of AI agents with PASS/FAIL results and detailed reporting. The framework includes assertion types for validating agent outputs and file system side effects, a suite-based evaluation runner, and report generation in JSON and Markdown formats.

Changes:

  • Added a complete evaluation CLI with runner, suite loader, report generator, and assertion framework
  • Introduced deleteFile tool with path validation and safety checks
  • Made AgentRunner.outputType optional to support plain text responses
  • Enhanced parse-args utility to handle argument parsing with configurable input
  • Updated listFiles tool to require the path parameter for consistency

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/utils/parse-args.ts Added support for custom raw args and sanitization of standalone "--" separator
src/utils/parse-args.test.ts Comprehensive test coverage for argument parsing with double-dash separator handling
src/tools/utils/fs.ts Added resolveTmpPathForDelete with security validation for file deletion
src/tools/read-file/read-file-tool.ts Updated logging to display relative paths for consistency
src/tools/list-files/list-files-tool.ts Changed path parameter from optional to required with empty string support
src/tools/delete-file/delete-file-tool.ts New tool for safe file deletion under tmp directory
src/tools/delete-file/delete-file-tool.test.ts Complete test coverage for delete file operations and security validations
src/clients/agent-runner.ts Made outputType optional to support agents without structured output
src/cli/agent-evals/utils/file-assertions.ts Assertion evaluators for file existence, content, and JSON path checks
src/cli/agent-evals/utils/assertions.ts Core assertion evaluation logic for output validation
src/cli/agent-evals/utils/assertions.test.ts Test coverage for assertion types
src/cli/agent-evals/suites/tools.json Evaluation suite for testing file operation tools
src/cli/agent-evals/suites/example.json Example evaluation suite demonstrating assertion patterns
src/cli/agent-evals/schemas.ts Zod schemas for CLI args, assertions, suites, and results
src/cli/agent-evals/main.ts CLI entry point orchestrating suite loading, execution, and reporting
src/cli/agent-evals/constants.ts Configuration constants for the evaluation framework
src/cli/agent-evals/clients/tool-registry.ts Tool factory registry for creating tool instances from names
src/cli/agent-evals/clients/suite-loader.ts Suite definition loader from JSON files
src/cli/agent-evals/clients/report-generator.ts Report generation in JSON and Markdown formats
src/cli/agent-evals/clients/eval-runner.ts Evaluation orchestrator executing cases and collecting results
src/cli/agent-evals/README.md Documentation for the evaluation framework
package.json Added run:agent-evals script
README.md Updated with agent evals documentation and tools table
AGENTS.md Updated tool documentation and conventions
.claude/settings.json Added scaffold-cli script to allow list and reordered fields

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

},
},
required: [],
required: ["path"],
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path parameter was changed from optional to required. While the implementation handles empty strings gracefully (converting to undefined), this is a breaking API change. The test on line 57 that calls the tool with an empty object {} will likely fail during parameter validation since path is now marked as required. Consider either keeping path optional or updating the test to pass an empty string explicitly.

Copilot uses AI. Check for mistakes.
Comment on lines 155 to 156
const deepEquals = (a: unknown, b: unknown): boolean => {
return JSON.stringify(a) === JSON.stringify(b);
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deepEquals function uses JSON.stringify for equality comparison, which may produce inconsistent results for objects with different key ordering. For example, {a: 1, b: 2} and {b: 2, a: 1} are semantically equal but would fail this comparison. Consider using a deep equality library or sorting object keys before stringification to ensure reliable comparisons.

Copilot uses AI. Check for mistakes.
Comment on lines +143 to +144
const deepEquals = (a: unknown, b: unknown): boolean => {
return JSON.stringify(a) === JSON.stringify(b);
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deepEquals function uses JSON.stringify for equality comparison, which may produce inconsistent results for objects with different key ordering. For example, {a: 1, b: 2} and {b: 2, a: 1} are semantically equal but would fail this comparison. Consider using a deep equality library or sorting object keys before stringification to ensure reliable comparisons.

Copilot uses AI. Check for mistakes.
Comment on lines +143 to +145
const deepEquals = (a: unknown, b: unknown): boolean => {
return JSON.stringify(a) === JSON.stringify(b);
};
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deepEquals and getJsonPath helper functions are duplicated between this file and assertions.ts. Consider extracting these shared utilities into a separate helper file to follow the DRY principle and make maintenance easier.

Copilot uses AI. Check for mistakes.
- Use Zod schemas for CLI args and tool IO.
- Keep object field names in `camelCase` (e.g., `trainSamples`), not `snake_case`.
- Keep Zod schemas in a dedicated `schemas.ts` file for each CLI (avoid inline schemas in `main.ts`).
- Keep Zod schemas in a dedicated `types/schemas.ts` file for each CLI (avoid inline schemas in `main.ts`).
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states schemas should be in types/schemas.ts, but this PR and other CLIs (like etf-backtest) place schemas directly as schemas.ts in the CLI directory. The codebase is inconsistent - some CLIs use types/ subdirectories (name-explorer, scrape-publications) while others don't. Either update the documentation to reflect actual practice or establish a consistent pattern across all CLIs.

Copilot uses AI. Check for mistakes.
Comment on lines 225 to 391
describe("file assertions", () => {
const TEST_DIR = path.join(TMP_ROOT, "assertion-tests");
const TEST_FILE = path.join(TEST_DIR, "test-file.txt");
const TEST_JSON_FILE = path.join(TEST_DIR, "test-data.json");

beforeAll(async () => {
await fs.mkdir(TEST_DIR, { recursive: true });
await fs.writeFile(TEST_FILE, "Hello World\nThis is test content.");
await fs.writeFile(
TEST_JSON_FILE,
JSON.stringify({ name: "test", value: 42, nested: { key: "value" } })
);
});

afterAll(async () => {
await fs.rm(TEST_DIR, { recursive: true, force: true });
});

describe("fileExists", () => {
it("passes when file exists", async () => {
const assertion: Assertion = {
type: "fileExists",
path: "assertion-tests/test-file.txt",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(true);
expect(result.message).toContain("exists");
});

it("fails when file does not exist", async () => {
const assertion: Assertion = {
type: "fileExists",
path: "assertion-tests/nonexistent.txt",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(false);
expect(result.message).toContain("does not exist");
});
});

describe("fileContains", () => {
it("passes when file contains the value", async () => {
const assertion: Assertion = {
type: "fileContains",
path: "assertion-tests/test-file.txt",
value: "Hello World",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(true);
expect(result.message).toContain("contains");
});

it("fails when file does not contain the value", async () => {
const assertion: Assertion = {
type: "fileContains",
path: "assertion-tests/test-file.txt",
value: "Goodbye",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(false);
expect(result.message).toContain("does not contain");
});

it("is case sensitive by default", async () => {
const assertion: Assertion = {
type: "fileContains",
path: "assertion-tests/test-file.txt",
value: "HELLO WORLD",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(false);
});

it("respects caseSensitive: false", async () => {
const assertion: Assertion = {
type: "fileContains",
path: "assertion-tests/test-file.txt",
value: "HELLO WORLD",
caseSensitive: false,
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(true);
});

it("fails gracefully when file does not exist", async () => {
const assertion: Assertion = {
type: "fileContains",
path: "assertion-tests/nonexistent.txt",
value: "test",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(false);
expect(result.message).toContain("Failed to read file");
});
});

describe("fileJsonPath", () => {
it("extracts and compares JSON values", async () => {
const assertion: Assertion = {
type: "fileJsonPath",
path: "assertion-tests/test-data.json",
jsonPath: "name",
expected: "test",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(true);
});

it("supports $. prefix in jsonPath", async () => {
const assertion: Assertion = {
type: "fileJsonPath",
path: "assertion-tests/test-data.json",
jsonPath: "$.value",
expected: 42,
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(true);
});

it("handles nested paths", async () => {
const assertion: Assertion = {
type: "fileJsonPath",
path: "assertion-tests/test-data.json",
jsonPath: "nested.key",
expected: "value",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(true);
});

it("fails when value does not match", async () => {
const assertion: Assertion = {
type: "fileJsonPath",
path: "assertion-tests/test-data.json",
jsonPath: "value",
expected: 100,
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(false);
});

it("fails gracefully for missing path", async () => {
const assertion: Assertion = {
type: "fileJsonPath",
path: "assertion-tests/test-data.json",
jsonPath: "missing.path",
expected: "value",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(false);
expect(result.message).toContain("Failed to evaluate");
});

it("fails gracefully when file does not exist", async () => {
const assertion: Assertion = {
type: "fileJsonPath",
path: "assertion-tests/nonexistent.json",
jsonPath: "key",
expected: "value",
};
const result = await evaluateAssertion(assertion, null);
expect(result.passed).toBe(false);
expect(result.message).toContain("Failed to evaluate");
});
});
});
});
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fileNotExists assertion is implemented and included in the schemas but is not covered by any test cases in this test file. Consider adding test coverage for the fileNotExists assertion type to ensure it works correctly and maintain consistent test coverage across all assertion types.

Copilot uses AI. Check for mistakes.
- Implement path traversal rejection in file assertions
- Add resolveTmpPathForAccess utility to validate paths
- Replace custom deep equality function with lodash's isEqual
- Update assertions to handle objects with different key ordering
- Add lodash-es type definitions to package.json and pnpm-lock.yaml
- Move schemas from the root to a dedicated types directory
- Update import paths across multiple files to reflect new structure
@valuecodes valuecodes merged commit 05abbb3 into main Jan 27, 2026
4 checks passed
@valuecodes valuecodes deleted the agent-evals branch January 27, 2026 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants