
Add LLM evaluation suite support with config validation and reporting#1772

Open
prathmeshpatel wants to merge 3 commits into main from claude/refine-local-plan-JtxuU

Conversation

@prathmeshpatel
Collaborator

Summary

This PR adds comprehensive LLM evaluation suite functionality to the CLI, enabling users to define and run test cases against MCP servers with configurable agents, multiple output formats, and detailed performance metrics.

Key Changes

  • Evals Configuration System (evals-config.ts):

    • New EvalsConfig interface supporting servers, agent configuration, test cases, and run options
    • Environment variable interpolation with ${VAR} syntax throughout config files
    • Comprehensive validation of config structure, required fields, and value constraints
    • Support for test case matching modes (exact vs. subset tool call matching)
    • loadEvalsConfig() function to load, interpolate, and validate JSON config files
  • Evals Runner (evals-runner.ts):

    • runEvals() function orchestrating test execution against MCP servers
    • Integration with @mcpjam/sdk for agent-based testing and tool matching
    • Support for configurable iterations, concurrency, timeouts, and retries
    • Proper resource cleanup with server disconnection
  • Output Formatting (evals-output.ts):

    • Human-readable format with tabular test results, pass rates, and latency metrics
    • JUnit XML format for CI/CD integration with proper XML escaping
    • Aggregate statistics and per-test performance summaries
  • CLI Commands (evals.ts):

    • evals run command to execute test suites with JSON/human/JUnit-XML output options
    • evals validate command to validate config files without running tests
    • Support for CLI overrides of iterations and concurrency settings
    • Proper exit code handling for test failures
  • Comprehensive Test Coverage (evals-config.test.ts, evals-output.test.ts):

    • 20+ unit tests covering config validation, environment variable interpolation, and output formatting
    • Tests for error cases, edge cases, and XML special character escaping
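
The recursive `${VAR}` interpolation described above could be sketched roughly as follows. This is a hypothetical standalone version; the actual `interpolateEnvVars` in `evals-config.ts` may differ in signature and error handling:

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Replace every ${VAR} occurrence in strings, recursing through
// arrays and objects; numbers, booleans, and null pass through.
function interpolateEnvVars(
  value: Json,
  env: Record<string, string | undefined> = process.env,
): Json {
  if (typeof value === "string") {
    return value.replace(/\$\{([A-Z0-9_]+)\}/gi, (_match, name: string) => {
      const resolved = env[name];
      if (resolved === undefined) {
        throw new Error(`Environment variable "${name}" is not set.`);
      }
      return resolved;
    });
  }
  if (Array.isArray(value)) {
    return value.map((item) => interpolateEnvVars(item, env));
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [key, interpolateEnvVars(child, env)]),
    );
  }
  return value;
}
```

Taking `env` as a parameter keeps the helper unit-testable; the real implementation presumably reads `process.env` directly.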

Notable Implementation Details

  • Config validation provides detailed error messages with field paths (e.g., tests[0].prompt)
  • Environment variable interpolation is recursive, supporting nested objects and arrays
  • Model string validation uses parseLLMString() from SDK to ensure valid format
  • Temperature range validation (0-2) and maxSteps positive number validation
  • JUnit XML output includes per-iteration test cases for detailed failure tracking
  • Human output includes p50/p95 latency percentiles and token usage statistics
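
The p50/p95 latency percentiles mentioned above could be computed with a nearest-rank helper along these lines (a sketch; the formatter's actual method is not shown in this PR):

```typescript
// Nearest-rank percentile: sort ascending, take the value at
// the ceil(p/100 * n)-th position (1-based rank).
function percentile(latenciesMs: number[], p: number): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}
```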

https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P

Adds `mcpjam evals run` and `mcpjam evals validate` commands backed by
the SDK's EvalSuite, EvalTest, TestAgent, and MCPClientManager. Users
define servers, agent config, and test cases in a single JSON config
file with ${ENV_VAR} interpolation. Supports json, human, and junit-xml
output formats with CLI flag overrides for iterations and concurrency.

https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P
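
The exact vs. subset matching modes could look roughly like this. These helpers are illustrative only; the SDK's `matchToolCalls` / `matchToolCallsSubset` may differ in semantics (e.g., whether exact mode is order-sensitive):

```typescript
// Exact mode: the calls must be the expected list, in order, with nothing extra.
function matchExact(expected: string[], called: string[]): boolean {
  return (
    expected.length === called.length &&
    expected.every((name, i) => called[i] === name)
  );
}

// Subset mode: every expected tool must appear among the calls;
// additional calls are allowed.
function matchSubset(expected: string[], called: string[]): boolean {
  return expected.every((name) => called.includes(name));
}
```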
@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Apr 12, 2026
@chelojimenez
Contributor

chelojimenez commented Apr 12, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Passed | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |


@dosubot dosubot Bot added the enhancement New feature or request label Apr 12, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b35f66a5-13d0-45f7-ad41-52ead96a5fb2

📥 Commits

Reviewing files that changed from the base of the PR and between 01a6b07 and 7768f1a.

📒 Files selected for processing (1)
  • cli/excalidraw-evals.json
✅ Files skipped from review due to trivial changes (1)
  • cli/excalidraw-evals.json

Walkthrough

Adds an evals CLI command group with run and validate subcommands. New modules provide config loading/interpolation (loadEvalsConfig, interpolateEnvVars, validateEvalsConfig), test execution (runEvals) using MCP clients and a TestAgent, and output formatting (json, human, junit-xml). run accepts iteration/concurrency overrides and writes results in the chosen format, setting exit code 1 on test failures; validate checks configs and reports validation errors. Unit tests cover config validation, env interpolation, and output formatting.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
cli/tests/evals-config.test.ts (1)

188-211: Add tests for options validation and non-finite numeric inputs.

The suite currently skips the options.iterations/concurrency/timeoutMs/retries validation paths and edge values like NaN/Infinity, which are worth pinning once validation is tightened.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/tests/evals-config.test.ts` around lines 188 - 211, Add unit tests
covering validation of the options object and non-finite numeric inputs for
validateEvalsConfig: create tests that assert validateEvalsConfig rejects
out-of-range or missing options.iterations, options.concurrency,
options.timeoutMs, and options.retries (and accepts valid edge values), and
separate tests that pass NaN/Infinity for numeric fields (e.g.,
agent.temperature, options.*) and assert a CliError is thrown mentioning the
field; use the existing test patterns (assert.throws with CliError checks) and
the function validateEvalsConfig plus the test names like "validateEvalsConfig
rejects temperature out of range" as references to add these new tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e44006f-da7a-4082-ab10-eea7d52feb27

📥 Commits

Reviewing files that changed from the base of the PR and between f01908e and cee1943.

📒 Files selected for processing (7)
  • cli/src/commands/evals.ts
  • cli/src/index.ts
  • cli/src/lib/evals-config.ts
  • cli/src/lib/evals-output.ts
  • cli/src/lib/evals-runner.ts
  • cli/tests/evals-config.test.ts
  • cli/tests/evals-output.test.ts

Comment thread cli/src/commands/evals.ts
Comment on lines +93 to +110
const format = parseEvalsFormat(command);

try {
  const config = loadEvalsConfig(options.config);
  const summary = {
    valid: true,
    servers: Object.keys(config.servers).length,
    tests: config.tests.length,
    model: config.agent.model,
  };

  if (format === "human") {
    process.stdout.write(
      `Config valid: ${summary.servers} server(s), ${summary.tests} test(s), model ${summary.model}\n`,
    );
  } else {
    writeResult(summary);
  }
Contributor


⚠️ Potential issue | 🟡 Minor

evals validate --format junit-xml is accepted but not actually produced.

The validate path parses junit-xml as valid but emits JSON/human only. Either reject junit-xml for validate, or implement XML output for this command.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/commands/evals.ts` around lines 93 - 110, The validate flow accepts
"junit-xml" but never emits XML; update the validate branch in evals.ts (the
block using parseEvalsFormat, loadEvalsConfig, summary, and writeResult) to
handle format === "junit-xml": either (preferred) implement a small serializer
(e.g., writeJUnitXml(summary)) that converts the summary into a minimal JUnit
XML testsuite/testcase structure and writes it to stdout, then call that when
format === "junit-xml", or change parseEvalsFormat to reject "junit-xml" for the
validate command so it errors earlier; reference the parseEvalsFormat,
loadEvalsConfig, writeResult, and the local format/summary variables when making
the change.

Comment on lines +126 to +135
if (agent.maxSteps !== undefined && (typeof agent.maxSteps !== "number" || agent.maxSteps <= 0)) {
  throw usageError('"agent.maxSteps" must be a positive number.');
}

if (
  agent.temperature !== undefined &&
  (typeof agent.temperature !== "number" || agent.temperature < 0 || agent.temperature > 2)
) {
  throw usageError('"agent.temperature" must be a number between 0 and 2.');
}
Contributor


⚠️ Potential issue | 🟡 Minor

Guard against non-finite numeric values in agent settings.

NaN/Infinity currently pass these checks and can propagate into agent runtime settings.

Suggested fix
-  if (agent.maxSteps !== undefined && (typeof agent.maxSteps !== "number" || agent.maxSteps <= 0)) {
+  if (
+    agent.maxSteps !== undefined &&
+    (!Number.isFinite(agent.maxSteps) || agent.maxSteps <= 0)
+  ) {
     throw usageError('"agent.maxSteps" must be a positive number.');
   }
@@
-    (typeof agent.temperature !== "number" || agent.temperature < 0 || agent.temperature > 2)
+    (!Number.isFinite(agent.temperature) || agent.temperature < 0 || agent.temperature > 2)
   ) {
     throw usageError('"agent.temperature" must be a number between 0 and 2.');
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/lib/evals-config.ts` around lines 126 - 135, The validation for
agent.maxSteps and agent.temperature currently allows NaN and Infinity; update
the checks in the validation logic that throws usageError to also require
numeric finiteness (use Number.isFinite or isFinite) for both agent.maxSteps and
agent.temperature so that non-finite values are rejected—specifically modify the
conditions around agent.maxSteps and agent.temperature (the same blocks that
call usageError) to include a finiteness check before accepting the value.

Comment on lines +177 to +182
// options (optional)
if (config.options !== undefined) {
  if (typeof config.options !== "object" || Array.isArray(config.options)) {
    throw usageError('"options" must be an object.');
  }
}
Contributor


⚠️ Potential issue | 🟠 Major

Validate options fields, not only options shape.

options is accepted as any object, so invalid values like iterations: 0, concurrency: "3", or negative/NaN timeouts can reach runtime execution.

Suggested fix
   // options (optional)
   if (config.options !== undefined) {
     if (typeof config.options !== "object" || Array.isArray(config.options)) {
       throw usageError('"options" must be an object.');
     }
+    const options = config.options as Record<string, unknown>;
+    if (
+      options.iterations !== undefined &&
+      (!Number.isInteger(options.iterations) || (options.iterations as number) <= 0)
+    ) {
+      throw usageError('"options.iterations" must be a positive integer.');
+    }
+    if (
+      options.concurrency !== undefined &&
+      (!Number.isInteger(options.concurrency) || (options.concurrency as number) <= 0)
+    ) {
+      throw usageError('"options.concurrency" must be a positive integer.');
+    }
+    if (
+      options.timeoutMs !== undefined &&
+      (!Number.isFinite(options.timeoutMs) || (options.timeoutMs as number) <= 0)
+    ) {
+      throw usageError('"options.timeoutMs" must be a positive number.');
+    }
+    if (
+      options.retries !== undefined &&
+      (!Number.isInteger(options.retries) || (options.retries as number) < 0)
+    ) {
+      throw usageError('"options.retries" must be a non-negative integer.');
+    }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/lib/evals-config.ts` around lines 177 - 182, The current validation
only checks that config.options is an object; enhance validation inside the same
block (where config.options is referenced) to validate individual fields (e.g.,
config.options.iterations, config.options.concurrency, config.options.timeout,
etc.): ensure iterations is a positive integer (>0), concurrency is a positive
integer (>0), timeouts/durations are finite non-negative numbers, and reject
wrong types (e.g., string "3") by throwing usageError with a clear message;
implement these checks directly after the existing typeof/Array.isArray guard in
evals-config.ts so invalid option values are caught early and rejected before
runtime.

@prathmeshpatel
Collaborator Author

[Screenshot 2026-04-11 at 11:51:43 PM]

will keep iterating tmr

claude added 2 commits April 12, 2026 06:53
When tests fail, the runner now throws descriptive errors showing
expected vs actual tool calls and any prompt-level errors (model
failures, connection issues). The human formatter shows a deduplicated
Failures section so users can immediately see what went wrong.

https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P
Example config for running evals against the Excalidraw MCP server.
Tests cover simple shapes, flowcharts, colored elements, and
architecture diagrams — all expecting read_me + create_view calls.

https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
cli/tests/evals-output.test.ts (1)

143-149: Strengthen XML-escaping assertion for quotes.

The input includes ", but the assertion only checks <, >, and &. Add a check for &quot; to catch regressions in attribute escaping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/tests/evals-output.test.ts` around lines 143 - 149, The test
"formatEvalsJUnit escapes XML special characters" currently asserts that
formatEvalsJUnit(result, 'Suite <"test"> & more') escapes <, > and &, but
doesn't check for escaped quotes; update the assertion in
cli/tests/evals-output.test.ts to also verify that the output includes &quot;
(i.e., ensure output.includes("&quot;") is checked) so that formatEvalsJUnit's
attribute-quoting behavior is covered; reference the test helper
makeSuiteResult() and the function under test formatEvalsJUnit when making the
change.
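
For context, an XML escaper that also covers attribute quoting (the behavior this nitpick asks to pin down) might look like the following. This is a generic sketch, not the formatter's actual `escapeXml`:

```typescript
// Escape the five predefined XML entities. Ampersand must be
// replaced first to avoid double-escaping the other entities.
// &quot; and &apos; matter when the text lands in attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}
```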

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 890c8bbe-cd41-4e31-868e-cb10f173de8a

📥 Commits

Reviewing files that changed from the base of the PR and between cee1943 and 01a6b07.

📒 Files selected for processing (3)
  • cli/src/lib/evals-output.ts
  • cli/src/lib/evals-runner.ts
  • cli/tests/evals-output.test.ts
✅ Files skipped from review due to trivial changes (1)
  • cli/src/lib/evals-output.ts

Comment on lines +91 to +93
} finally {
  await manager.disconnectAllServers();
}
Contributor


⚠️ Potential issue | 🟠 Major


Preserve the original eval failure when cleanup also fails.

If disconnectAllServers() throws, it will overwrite the original error from the try block due to JavaScript finally semantics, masking critical diagnostics from the eval run. Wrap the cleanup in its own try/catch, capture the original error, and aggregate both if cleanup also fails.

Proposed fix
 export async function runEvals(
   config: EvalsConfig,
   overrides: RunOverrides,
 ): Promise<EvalSuiteResult> {
   const manager = new MCPClientManager(config.servers);
+  let runError: unknown;

   try {
     const tools = await manager.getTools();

     const agent = new TestAgent({
       tools,
       model: config.agent.model,
       apiKey: config.agent.apiKey,
       systemPrompt: config.agent.systemPrompt,
       temperature: config.agent.temperature,
       maxSteps: config.agent.maxSteps,
       mcpClientManager: manager,
     });

     const suite = new EvalSuite({
       name: config.mcpjam?.suiteName ?? "CLI Eval Suite",
       mcpjam: config.mcpjam,
     });

     for (const testCase of config.tests) {
       const matchMode = testCase.matchMode ?? "subset";
       const expected = testCase.expectedToolCalls;

       const evalTest = new EvalTest({
         name: testCase.name,
         test: async (evalAgent) => {
           const result = await evalAgent.prompt(testCase.prompt);

           // Surface prompt-level errors (model failures, connection issues)
           if (result.hasError()) {
             throw new Error(`Prompt error: ${result.getError()}`);
           }

           const called = result.toolsCalled();
           const matched = matchMode === "exact"
             ? matchToolCalls(expected, called)
             : matchToolCallsSubset(expected, called);

           if (!matched) {
             const actualStr = called.length > 0 ? called.join(", ") : "(none)";
             const expectedStr = expected.join(", ");
             throw new Error(
               `Tool call mismatch (${matchMode}): expected [${expectedStr}], got [${actualStr}]`,
             );
           }

           return true;
         },
         expectedToolCalls: expected.map((toolName) => ({ toolName })),
       });

       suite.add(evalTest);
     }

     const iterations =
       overrides.iterations ?? config.options?.iterations ?? 5;
     const concurrency =
       overrides.concurrency ?? config.options?.concurrency ?? 3;
     const timeoutMs = config.options?.timeoutMs ?? 30_000;
     const retries = config.options?.retries ?? 0;

     const result = await suite.run(agent, {
       iterations,
       concurrency,
       timeoutMs,
       retries,
     });

     return result;
+  } catch (error) {
+    runError = error;
+    throw error;
   } finally {
-    await manager.disconnectAllServers();
+    try {
+      await manager.disconnectAllServers();
+    } catch (disconnectError) {
+      if (runError) {
+        throw new AggregateError(
+          [runError, disconnectError],
+          "Eval run failed and cleanup also failed",
+        );
+      }
+      throw disconnectError;
+    }
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/lib/evals-runner.ts` around lines 91 - 93, The finally block
currently awaits manager.disconnectAllServers() which can throw and overwrite
the original error from the try; modify the finally in the eval runner (where
manager.disconnectAllServers() is called) to wrap the disconnect in its own
try/catch, store the original error thrown from the try block (if any), and if
the cleanup throws as well, aggregate both errors (e.g., attach cleanup error as
a cause or append its message) and rethrow the original error with the cleanup
failure attached so the original eval failure is preserved; ensure this logic
references manager.disconnectAllServers() and the surrounding eval invocation
scope so the original exception is rethrown (with cleanup details) rather than
being replaced by the cleanup exception.

Comment on lines +51 to +52
tests.set("echo", makeTestResult({ successes: 5, failures: 0 }));

Contributor


⚠️ Potential issue | 🟡 Minor

Keep mocked pass/fail counters consistent with iteration-level data.

At Line 51 and Line 114, the overrides set successes/failures to all-pass, but results and iterationDetails still contain a failing iteration. This can let tests pass for the wrong reason.

💡 Proposed fix
+function makeAllPassTestResult(): EvalRunResult {
+  const base = makeTestResult();
+  return {
+    ...base,
+    successes: base.iterations,
+    failures: 0,
+    results: Array(base.iterations).fill(true),
+    iterationDetails: base.iterationDetails.map(({ latencies, tokens }) => ({
+      passed: true,
+      latencies,
+      tokens,
+    })),
+  };
+}
@@
-  tests.set("echo", makeTestResult({ successes: 5, failures: 0 }));
+  tests.set("echo", makeAllPassTestResult());
@@
-  result.tests.set("addition", makeTestResult({ successes: 5, failures: 0 }));
+  result.tests.set("addition", makeAllPassTestResult());

Also applies to: 111-115

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/tests/evals-output.test.ts` around lines 51 - 52, The mocked test
summaries for the "echo" entry have successes/failures counters that don't match
the per-iteration data in results/iterationDetails; update the makeTestResult
calls (the tests.set("echo", makeTestResult(...)) entries) so the successes and
failures values reflect the actual counts derived from results/iterationDetails
(e.g., if one iteration in iterationDetails is failing, set failures: 1 and
successes accordingly). Repeat the same correction for the second occurrence
around the other makeTestResult call (the block referenced in the comment).


Labels

enhancement New feature or request size:XL This PR changes 500-999 lines, ignoring generated files.


3 participants