Add LLM evaluation suite support with config validation and reporting #1772
prathmeshpatel wants to merge 3 commits into main from
Conversation
Adds `mcpjam evals run` and `mcpjam evals validate` commands backed by
the SDK's EvalSuite, EvalTest, TestAgent, and MCPClientManager. Users
define servers, agent config, and test cases in a single JSON config
file with ${ENV_VAR} interpolation. Supports json, human, and junit-xml
output formats with CLI flag overrides for iterations and concurrency.
https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P
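Based on the fields exercised in the code under review (`servers`, `agent`, `tests`, `options`), a config might look like the following sketch. The server entry shape and the model string format are assumptions, not a documented schema:

```json
{
  "servers": {
    "echo": { "command": "npx", "args": ["my-echo-mcp-server"] }
  },
  "agent": {
    "model": "anthropic/claude-sonnet-4",
    "apiKey": "${ANTHROPIC_API_KEY}",
    "temperature": 0.2,
    "maxSteps": 10
  },
  "tests": [
    {
      "name": "echoes input",
      "prompt": "Echo the word hello",
      "expectedToolCalls": ["echo"],
      "matchMode": "subset"
    }
  ],
  "options": { "iterations": 5, "concurrency": 3, "timeoutMs": 30000, "retries": 0 }
}
```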
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)

Walkthrough: Adds an

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🧹 Nitpick comments (1)
cli/tests/evals-config.test.ts (1)
188-211: Add tests for `options` validation and non-finite numeric inputs. The suite currently skips the `options.iterations`/`concurrency`/`timeoutMs`/`retries` validation paths and edge values like `NaN`/`Infinity`, which are worth pinning once validation is tightened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/tests/evals-config.test.ts` around lines 188 - 211, Add unit tests covering validation of the options object and non-finite numeric inputs for validateEvalsConfig: create tests that assert validateEvalsConfig rejects out-of-range or missing options.iterations, options.concurrency, options.timeoutMs, and options.retries (and accepts valid edge values), and separate tests that pass NaN/Infinity for numeric fields (e.g., agent.temperature, options.*) and assert a CliError is thrown mentioning the field; use the existing test patterns (assert.throws with CliError checks) and the function validateEvalsConfig plus the test names like "validateEvalsConfig rejects temperature out of range" as references to add these new tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/src/commands/evals.ts`:
- Around line 93-110: The validate flow accepts "junit-xml" but never emits XML;
update the validate branch in evals.ts (the block using parseEvalsFormat,
loadEvalsConfig, summary, and writeResult) to handle format === "junit-xml":
either (preferred) implement a small serializer (e.g., writeJUnitXml(summary))
that converts the summary into a minimal JUnit XML testsuite/testcase structure
and writes it to stdout, then call that when format === "junit-xml", or change
parseEvalsFormat to reject "junit-xml" for the validate command so it errors
earlier; reference the parseEvalsFormat, loadEvalsConfig, writeResult, and the
local format/summary variables when making the change.
In `@cli/src/lib/evals-config.ts`:
- Around line 177-182: The current validation only checks that config.options is
an object; enhance validation inside the same block (where config.options is
referenced) to validate individual fields (e.g., config.options.iterations,
config.options.concurrency, config.options.timeout, etc.): ensure iterations is
a positive integer (>0), concurrency is a positive integer (>0),
timeouts/durations are finite non-negative numbers, and reject wrong types
(e.g., string "3") by throwing usageError with a clear message; implement these
checks directly after the existing typeof/Array.isArray guard in evals-config.ts
so invalid option values are caught early and rejected before runtime.
- Around line 126-135: The validation for agent.maxSteps and agent.temperature
currently allows NaN and Infinity; update the checks in the validation logic
that throws usageError to also require numeric finiteness (use Number.isFinite
or isFinite) for both agent.maxSteps and agent.temperature so that non-finite
values are rejected—specifically modify the conditions around agent.maxSteps and
agent.temperature (the same blocks that call usageError) to include a finiteness
check before accepting the value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0e44006f-da7a-4082-ab10-eea7d52feb27
📒 Files selected for processing (7)
- cli/src/commands/evals.ts
- cli/src/index.ts
- cli/src/lib/evals-config.ts
- cli/src/lib/evals-output.ts
- cli/src/lib/evals-runner.ts
- cli/tests/evals-config.test.ts
- cli/tests/evals-output.test.ts
```ts
const format = parseEvalsFormat(command);

try {
  const config = loadEvalsConfig(options.config);
  const summary = {
    valid: true,
    servers: Object.keys(config.servers).length,
    tests: config.tests.length,
    model: config.agent.model,
  };

  if (format === "human") {
    process.stdout.write(
      `Config valid: ${summary.servers} server(s), ${summary.tests} test(s), model ${summary.model}\n`,
    );
  } else {
    writeResult(summary);
  }
```
`evals validate --format junit-xml` is accepted but not actually produced.
The validate path parses `junit-xml` as valid but emits JSON/human only. Either reject `junit-xml` for validate, or implement XML output for this command.
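A minimal serializer along the lines the comment suggests could look like this sketch. The `writeJUnitXml` name comes from the review comment, but the `ValidateSummary` shape shown here and the exact attribute layout are assumptions, not the project's actual API:

```typescript
// Hypothetical sketch: serialize the validate summary as minimal JUnit XML.
interface ValidateSummary {
  valid: boolean;
  servers: number;
  tests: number;
  model: string;
}

// Escape the characters that are unsafe inside XML text and attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function writeJUnitXml(summary: ValidateSummary): string {
  const failures = summary.valid ? 0 : 1;
  const name = escapeXml(`config valid (model ${summary.model})`);
  const testcase = summary.valid
    ? `  <testcase name="${name}" />`
    : `  <testcase name="${name}"><failure message="invalid config" /></testcase>`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<testsuite name="evals validate" tests="1" failures="${failures}">`,
    testcase,
    "</testsuite>",
  ].join("\n");
}

console.log(writeJUnitXml({ valid: true, servers: 2, tests: 4, model: "gpt-4o" }));
```

The alternative the reviewer offers — rejecting `junit-xml` in `parseEvalsFormat` for the validate command — avoids maintaining a second XML writer alongside the one the `run` command already needs.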
```ts
if (agent.maxSteps !== undefined && (typeof agent.maxSteps !== "number" || agent.maxSteps <= 0)) {
  throw usageError('"agent.maxSteps" must be a positive number.');
}

if (
  agent.temperature !== undefined &&
  (typeof agent.temperature !== "number" || agent.temperature < 0 || agent.temperature > 2)
) {
  throw usageError('"agent.temperature" must be a number between 0 and 2.');
}
```
Guard against non-finite numeric values in agent settings.
`NaN`/`Infinity` currently pass these checks and can propagate into agent runtime settings.
Suggested fix
```diff
- if (agent.maxSteps !== undefined && (typeof agent.maxSteps !== "number" || agent.maxSteps <= 0)) {
+ if (
+   agent.maxSteps !== undefined &&
+   (!Number.isFinite(agent.maxSteps) || agent.maxSteps <= 0)
+ ) {
    throw usageError('"agent.maxSteps" must be a positive number.');
  }
@@
-   (typeof agent.temperature !== "number" || agent.temperature < 0 || agent.temperature > 2)
+   (!Number.isFinite(agent.temperature) || agent.temperature < 0 || agent.temperature > 2)
  ) {
    throw usageError('"agent.temperature" must be a number between 0 and 2.');
  }
```
```ts
// options (optional)
if (config.options !== undefined) {
  if (typeof config.options !== "object" || Array.isArray(config.options)) {
    throw usageError('"options" must be an object.');
  }
}
```
Validate `options` fields, not only the `options` shape.
`options` is accepted as any object, so invalid values like `iterations: 0`, `concurrency: "3"`, or negative/`NaN` timeouts can reach runtime execution.
Suggested fix
```diff
  // options (optional)
  if (config.options !== undefined) {
    if (typeof config.options !== "object" || Array.isArray(config.options)) {
      throw usageError('"options" must be an object.');
    }
+   const options = config.options as Record<string, unknown>;
+   if (
+     options.iterations !== undefined &&
+     (!Number.isInteger(options.iterations) || (options.iterations as number) <= 0)
+   ) {
+     throw usageError('"options.iterations" must be a positive integer.');
+   }
+   if (
+     options.concurrency !== undefined &&
+     (!Number.isInteger(options.concurrency) || (options.concurrency as number) <= 0)
+   ) {
+     throw usageError('"options.concurrency" must be a positive integer.');
+   }
+   if (
+     options.timeoutMs !== undefined &&
+     (!Number.isFinite(options.timeoutMs) || (options.timeoutMs as number) <= 0)
+   ) {
+     throw usageError('"options.timeoutMs" must be a positive number.');
+   }
+   if (
+     options.retries !== undefined &&
+     (!Number.isInteger(options.retries) || (options.retries as number) < 0)
+   ) {
+     throw usageError('"options.retries" must be a non-negative integer.');
+   }
  }
```
When tests fail, the runner now throws descriptive errors showing expected vs actual tool calls and any prompt-level errors (model failures, connection issues). The human formatter shows a deduplicated Failures section so users can immediately see what went wrong. https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P
Example config for running evals against the Excalidraw MCP server. Tests cover simple shapes, flowcharts, colored elements, and architecture diagrams — all expecting read_me + create_view calls. https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P
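Given the tool names mentioned above, one test entry in that config might look like the following sketch (the prompt text and field values here are illustrative, not the committed config):

```json
{
  "name": "architecture diagram",
  "prompt": "Draw a three-tier web architecture diagram",
  "expectedToolCalls": ["read_me", "create_view"],
  "matchMode": "subset"
}
```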
Actionable comments posted: 2
🧹 Nitpick comments (1)
cli/tests/evals-output.test.ts (1)
143-149: Strengthen XML-escaping assertion for quotes. The input includes `"`, but the assertion only checks `<`, `>`, and `&`. Add a check for `"` to catch regressions in attribute escaping.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/tests/evals-output.test.ts` around lines 143 - 149, the test "formatEvalsJUnit escapes XML special characters" currently asserts that formatEvalsJUnit(result, 'Suite <"test"> & more') escapes <, > and &, but doesn't check for escaped quotes; update the assertion in cli/tests/evals-output.test.ts to also verify that the output includes &quot; (i.e., ensure output.includes("&quot;") is checked) so that formatEvalsJUnit's attribute-quoting behavior is covered; reference the test helper makeSuiteResult() and the function under test formatEvalsJUnit when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/src/lib/evals-runner.ts`:
- Around line 91-93: The finally block currently awaits
manager.disconnectAllServers() which can throw and overwrite the original error
from the try; modify the finally in the eval runner (where
manager.disconnectAllServers() is called) to wrap the disconnect in its own
try/catch, store the original error thrown from the try block (if any), and if
the cleanup throws as well, aggregate both errors (e.g., attach cleanup error as
a cause or append its message) and rethrow the original error with the cleanup
failure attached so the original eval failure is preserved; ensure this logic
references manager.disconnectAllServers() and the surrounding eval invocation
scope so the original exception is rethrown (with cleanup details) rather than
being replaced by the cleanup exception.
In `@cli/tests/evals-output.test.ts`:
- Around line 51-52: The mocked test summaries for the "echo" entry have
successes/failures counters that don't match the per-iteration data in
results/iterationDetails; update the makeTestResult calls (the tests.set("echo",
makeTestResult(...)) entries) so the successes and failures values reflect the
actual counts derived from results/iterationDetails (e.g., if one iteration in
iterationDetails is failing, set failures: 1 and successes accordingly). Repeat
the same correction for the second occurrence around the other makeTestResult
call (the block referenced in the comment).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 890c8bbe-cd41-4e31-868e-cb10f173de8a
📒 Files selected for processing (3)
- cli/src/lib/evals-output.ts
- cli/src/lib/evals-runner.ts
- cli/tests/evals-output.test.ts
✅ Files skipped from review due to trivial changes (1)
- cli/src/lib/evals-output.ts
```ts
} finally {
  await manager.disconnectAllServers();
}
```
🧩 Analysis chain
🏁 Script executed: `fd -t f "evals-runner.ts" --exec cat -n {} \;`
Repository: MCPJam/inspector
Length of output: 50373
🏁 Script executed: `rg -n "runEvals|disconnectAllServers" cli/src/lib/evals-runner.ts`
Repository: MCPJam/inspector
Length of output: 139
🏁 Script executed: `sed -n '17,100p' cli/src/lib/evals-runner.ts`
Repository: MCPJam/inspector
Length of output: 2362
Preserve the original eval failure when cleanup also fails.
If `disconnectAllServers()` throws, it will overwrite the original error from the try block due to JavaScript `finally` semantics, masking critical diagnostics from the eval run. Wrap the cleanup in its own try/catch, capture the original error, and aggregate both if cleanup also fails.
Proposed fix
```diff
 export async function runEvals(
   config: EvalsConfig,
   overrides: RunOverrides,
 ): Promise<EvalSuiteResult> {
   const manager = new MCPClientManager(config.servers);
+  let runError: unknown;
   try {
     const tools = await manager.getTools();
     const agent = new TestAgent({
       tools,
       model: config.agent.model,
       apiKey: config.agent.apiKey,
       systemPrompt: config.agent.systemPrompt,
       temperature: config.agent.temperature,
       maxSteps: config.agent.maxSteps,
       mcpClientManager: manager,
     });
     const suite = new EvalSuite({
       name: config.mcpjam?.suiteName ?? "CLI Eval Suite",
       mcpjam: config.mcpjam,
     });
     for (const testCase of config.tests) {
       const matchMode = testCase.matchMode ?? "subset";
       const expected = testCase.expectedToolCalls;
       const evalTest = new EvalTest({
         name: testCase.name,
         test: async (evalAgent) => {
           const result = await evalAgent.prompt(testCase.prompt);
           // Surface prompt-level errors (model failures, connection issues)
           if (result.hasError()) {
             throw new Error(`Prompt error: ${result.getError()}`);
           }
           const called = result.toolsCalled();
           const matched = matchMode === "exact"
             ? matchToolCalls(expected, called)
             : matchToolCallsSubset(expected, called);
           if (!matched) {
             const actualStr = called.length > 0 ? called.join(", ") : "(none)";
             const expectedStr = expected.join(", ");
             throw new Error(
               `Tool call mismatch (${matchMode}): expected [${expectedStr}], got [${actualStr}]`,
             );
           }
           return true;
         },
         expectedToolCalls: expected.map((toolName) => ({ toolName })),
       });
       suite.add(evalTest);
     }
     const iterations = overrides.iterations ?? config.options?.iterations ?? 5;
     const concurrency = overrides.concurrency ?? config.options?.concurrency ?? 3;
     const timeoutMs = config.options?.timeoutMs ?? 30_000;
     const retries = config.options?.retries ?? 0;
     const result = await suite.run(agent, {
       iterations,
       concurrency,
       timeoutMs,
       retries,
     });
     return result;
+  } catch (error) {
+    runError = error;
+    throw error;
   } finally {
-    await manager.disconnectAllServers();
+    try {
+      await manager.disconnectAllServers();
+    } catch (disconnectError) {
+      if (runError) {
+        throw new AggregateError(
+          [runError, disconnectError],
+          "Eval run failed and cleanup also failed",
+        );
+      }
+      throw disconnectError;
+    }
   }
 }
```
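The masking behavior this fix addresses can be reproduced in isolation — a throw inside `finally` replaces whatever exception is already in flight. A standalone illustration (not project code):

```typescript
function cleanupMasksError(): void {
  try {
    throw new Error("original failure");
  } finally {
    // This throw wins: the "original failure" above is silently discarded.
    throw new Error("cleanup failure");
  }
}

let seen = "";
try {
  cleanupMasksError();
} catch (err) {
  seen = (err as Error).message;
}
console.log(seen); // "cleanup failure" — the original error is gone
```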
```ts
tests.set("echo", makeTestResult({ successes: 5, failures: 0 }));
```
Keep mocked pass/fail counters consistent with iteration-level data.
At Line 51 and Line 114, the overrides set `successes`/`failures` to all-pass, but `results` and `iterationDetails` still contain a failing iteration. This can let tests pass for the wrong reason.
💡 Proposed fix
```diff
+function makeAllPassTestResult(): EvalRunResult {
+  const base = makeTestResult();
+  return {
+    ...base,
+    successes: base.iterations,
+    failures: 0,
+    results: Array(base.iterations).fill(true),
+    iterationDetails: base.iterationDetails.map(({ latencies, tokens }) => ({
+      passed: true,
+      latencies,
+      tokens,
+    })),
+  };
+}
@@
-  tests.set("echo", makeTestResult({ successes: 5, failures: 0 }));
+  tests.set("echo", makeAllPassTestResult());
@@
-  result.tests.set("addition", makeTestResult({ successes: 5, failures: 0 }));
+  result.tests.set("addition", makeAllPassTestResult());
```

Also applies to: 111-115

Summary
This PR adds comprehensive LLM evaluation suite functionality to the CLI, enabling users to define and run test cases against MCP servers with configurable agents, multiple output formats, and detailed performance metrics.
Key Changes
Evals Configuration System (`evals-config.ts`):
- `EvalsConfig` interface supporting servers, agent configuration, test cases, and run options
- `${VAR}` environment-variable interpolation syntax throughout config files
- `loadEvalsConfig()` function to load, interpolate, and validate JSON config files

Evals Runner (`evals-runner.ts`):
- `runEvals()` function orchestrating test execution against MCP servers
- `@mcpjam/sdk` for agent-based testing and tool matching

Output Formatting (`evals-output.ts`)

CLI Commands (`evals.ts`):
- `evals run` command to execute test suites with JSON/human/JUnit-XML output options
- `evals validate` command to validate config files without running tests

Comprehensive Test Coverage (`evals-config.test.ts`, `evals-output.test.ts`)

Notable Implementation Details
- … (`tests[0].prompt`)
- … `parseLLMString()` from SDK to ensure valid format

https://claude.ai/code/session_01AKW9UqD6M3RBdPdMXkEf4P
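The `${VAR}` interpolation described above could be sketched as follows. This is a hypothetical illustration — `interpolateEnv` is not the actual helper name, and the real `loadEvalsConfig` may handle missing variables differently:

```typescript
// Hypothetical sketch of ${VAR} interpolation over a raw JSON config string.
function interpolateEnv(
  raw: string,
  env: Record<string, string | undefined>,
): string {
  return raw.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const value = env[name];
    if (value === undefined) {
      // Failing fast keeps a typo'd variable from silently becoming "".
      throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
  });
}

console.log(
  interpolateEnv('{"apiKey": "${ANTHROPIC_API_KEY}"}', {
    ANTHROPIC_API_KEY: "sk-test",
  }),
); // {"apiKey": "sk-test"}
```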