Improve LLM tool parameter guidance and add E2E testing framework #26

janisz · 2026-01-15T16:51:40Z

Description

Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving 87.5% pass rate with GPT-5 models.

Validation

./scripts/run-tests.sh
══════════════════════════════════════════════════════════
  StackRox MCP E2E Testing with Gevals
══════════════════════════════════════════════════════════

Loading environment variables from .env...
Configuration:
  Agent Model: gpt-4o
  Judge Model: gpt-4o
  MCP Server: stackrox-mcp (via go run)

Running gevals tests...


=== Starting Evaluation ===

Task: list-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-workloads
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-nonexistent
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-scooby
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-maria
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-clusters-general
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-list
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

=== Evaluation Complete ===

📄 Results saved to: gevals-stackrox-mcp-e2e-out.json

=== Results Summary ===

Task: list-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/list-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-workloads
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-workloads.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-nonexistent
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-nonexistent.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-scooby
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-scooby.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-maria
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-maria.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-clusters-general
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-clusters-general.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-list
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-list.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

=== Overall Statistics ===
Total Tasks: 8
Tasks Passed: 8/8
Assertions Passed: 24/24

=== Statistics by Difficulty ===

easy:
  Tasks: 8/8
  Assertions: 24/24

══════════════════════════════════════════════════════════
  Tests Completed Successfully!
══════════════════════════════════════════════════════════

codecov-commenter · 2026-01-15T16:56:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.94%. Comparing base (df15807) to head (b830585).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #26      +/-   ##
==========================================
+ Coverage   77.58%   77.94%   +0.35%     
==========================================
  Files          26       26              
  Lines        1120     1138      +18     
==========================================
+ Hits          869      887      +18     
  Misses        216      216              
  Partials       35       35

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mtodor · 2026-01-19T15:35:52Z

e2e-tests/gevals/tasks/cve-affecting-clusters.yaml

+  difficulty: easy
+steps:
+  prompt:
+    inline: "is this CVE-2016-1000031 affecting me?"


nitpick: Let's capitalize first letter.

Suggested change

inline: "is this CVE-2016-1000031 affecting me?"

inline: "Is this CVE-2016-1000031 detected in my clusters?"

Please check other prompts for start of a sentence capitalization.

We would like to shift statement from affected to detected. Please adjust other prompts to not us affect.

mtodor · 2026-01-19T15:41:01Z

internal/toolsets/config/tools.go

 	schema.Properties["limit"].Minimum = jsonschema.Ptr(0.0)
 	schema.Properties["limit"].Default = toolsets.MustJSONMarshal(defaultLimit)
-	schema.Properties["limit"].Description = "Maximum number of clusters to return (default: 0 - unlimited)"
+	schema.Properties["limit"].Description = "Maximum number of clusters to return. Use 0 for unlimited (default). When using pagination, always provide both limit and offset together. Default: 0."


nitpick:

Suggested change

schema.Properties["limit"].Description = "Maximum number of clusters to return. Use 0 for unlimited (default). When using pagination, always provide both limit and offset together. Default: 0."

schema.Properties["limit"].Description = "Maximum number of clusters to return. When using pagination, always provide both limit and offset together. Use 0 for unlimited (default)."

mtodor · 2026-01-19T15:43:40Z

internal/toolsets/vulnerability/clusters.go

 	schema.Properties["cveName"].Description = "CVE name to filter clusters (e.g., CVE-2021-44228)"
-	schema.Properties["filterClusterId"].Description = "Optional cluster ID to verify if a specific cluster is affected"
+	schema.Properties["filterClusterId"].Description =
+		"Optional cluster ID or cluster name to verify if a specific cluster is affected. " +


This is not correct. Only cluster ID is supported. We are not accepting cluster name. Cluster name can be resolved from list clusters tool.

mtodor · 2026-01-19T15:46:16Z

internal/toolsets/vulnerability/clusters.go

 	schema.Properties["cveName"].Description = "CVE name to filter clusters (e.g., CVE-2021-44228)"
-	schema.Properties["filterClusterId"].Description = "Optional cluster ID to verify if a specific cluster is affected"
+	schema.Properties["filterClusterId"].Description =
+		"Optional cluster ID or cluster name to verify if a specific cluster is affected. " +


We don't want to use word affected - because we can not say with confidence: "it is not affected". We can only say, if CVE is detected.

Suggested change

"Optional cluster ID or cluster name to verify if a specific cluster is affected. " +

"Optional cluster ID to verify if CVE is detected in a specific cluster. " +

mtodor · 2026-01-19T15:49:57Z

e2e-tests/mcp-testing-framework.yaml

This file looks like leftover from the initial implementation.

mtodor · 2026-01-19T16:58:22Z

e2e-tests/gevals/tasks/cve-nonexistent.yaml

+  prompt:
+    inline: "Is CVE-2099-00001 affecting my clusters?"
+  verify:
+    contains: "A response indicating whether CVE-2099-00001 is affecting clusters, or stating it is not affecting any clusters"


I know, it's about calling correct MCP tool, but results generated by LLM are also relevant. If we have everywhere "find something", or "nothing" - then anything returned by LLM will be acceptable.

At least in this case we know it should be "nothing". And it would be good if we could change other tests to have expectations that correct data is returned by LLM.

mtodor · 2026-01-19T17:00:01Z

e2e-tests/gevals/tasks/list-clusters.yaml

+    inline: "list my clusters"
+  verify:
+    contains: "cluster names"


nitpick:

Suggested change

inline: "list my clusters"

verify:

contains: "cluster names"

inline: "List my clusters"

verify:

contains: "A response contains list of cluster names"

mtodor · 2026-01-19T17:02:41Z

e2e-tests/gevals/eval.yaml

+
+    # Test 2: CVE affecting workloads
+    - path: tasks/cve-affecting-workloads.yaml
+      assertions:


Is it possible to define assertions in tasks file? i.e. in for this case tasks/cve-affecting-workloads.yaml

mtodor · 2026-01-19T17:15:12Z

e2e-tests/gevals/eval.yaml

+            argumentsMatch:
+              cveName: "CVE-2016-1000031"
+        minToolCalls: 1
+        maxToolCalls: 3


What maxToolCalls means? Tools defined in toolsUsed can be called up to 3 times?

mtodor · 2026-01-19T17:16:06Z

e2e-tests/gevals/eval.yaml

+          - server: stackrox-mcp
+            toolPattern: "list_clusters"
+        minToolCalls: 1
+        maxToolCalls: 2


LLM should not fetch list of clusters twice:

Suggested change

maxToolCalls: 2

maxToolCalls: 1

Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving 87.5% pass rate with GPT-5 models. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

Fix E2E test assertion failures by improving tool descriptions with smart usage pattern guidance. Tool descriptions now clearly indicate: - When to call all three CVE tools for comprehensive coverage ("Is CVE-X detected in my clusters?" without specific cluster name) - When to call only specific tools for targeted queries ("Is CVE-X detected in cluster staging-central-cluster?") Changes: - Update vulnerability tool descriptions (clusters, deployments, nodes) to use directive language and clear usage patterns - Adjust cve-nonexistent test maxToolCalls from 2 to 3 to match comprehensive check pattern - Update cve-cluster-does-not-exist verification to accept both "CVE not detected" and "cluster doesn't exist" responses Results: All 24/24 E2E test assertions now pass (improved from 21/24). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Split long description strings in tool definitions to comply with the 120-character line limit by breaking at natural sentence boundaries. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…criptions Changes: - Switch E2E agent from GPT-4o to Claude Sonnet 4.5 via Vertex AI - Add enableAllTools: true to MCP config for auto-approval - Configure gpt-5-nano as LLM judge for cost efficiency - Improve CVE tool descriptions with clear WHEN TO USE/WHEN NOT TO USE sections - Update test assertions to account for Claude's comprehensive CVE checking behavior - Update run-tests.sh to export Vertex AI environment variables The tool descriptions now explicitly guide when to use each CVE detection tool: - General "clusters" queries → comprehensive check (all 3 tools) - Specific component queries → single relevant tool only - Single cluster queries → orchestrator tool with cluster filter All 8 E2E tests passing with 24/24 assertions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

janisz force-pushed the e2e-tests branch from 5b19c98 to 2868f53 Compare January 15, 2026 16:58

janisz mentioned this pull request Jan 16, 2026

chore(tweak): Tweak tool name and description #25

Merged

1 task

mtodor reviewed Jan 19, 2026

View reviewed changes

janisz marked this pull request as draft January 19, 2026 17:39

janisz and others added 3 commits January 19, 2026 18:41

use gevals

ca7c21e

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

janisz force-pushed the e2e-tests branch from bcaaa07 to 6e0ec3d Compare January 19, 2026 18:40

janisz and others added 2 commits January 20, 2026 11:30

Fix golangci-lint line length errors in tool descriptions

4f27b92

Split long description strings in tool definitions to comply with the 120-character line limit by breaking at natural sentence boundaries. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve LLM tool parameter guidance and add E2E testing framework #26

Improve LLM tool parameter guidance and add E2E testing framework #26

janisz commented Jan 15, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jan 15, 2026 •

edited

Loading

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

mtodor Jan 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	inline: "is this CVE-2016-1000031 affecting me?"
	inline: "Is this CVE-2016-1000031 detected in my clusters?"

	schema.Properties["limit"].Description = "Maximum number of clusters to return. Use 0 for unlimited (default). When using pagination, always provide both limit and offset together. Default: 0."
	schema.Properties["limit"].Description = "Maximum number of clusters to return. When using pagination, always provide both limit and offset together. Use 0 for unlimited (default)."

	"Optional cluster ID or cluster name to verify if a specific cluster is affected. " +
	"Optional cluster ID to verify if CVE is detected in a specific cluster. " +

Improve LLM tool parameter guidance and add E2E testing framework #26

Are you sure you want to change the base?

Improve LLM tool parameter guidance and add E2E testing framework #26

Conversation

janisz commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Validation

Uh oh!

codecov-commenter commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

janisz commented Jan 15, 2026 •

edited

Loading

codecov-commenter commented Jan 15, 2026 •

edited

Loading