Skip to content

Conversation

@janisz
Copy link
Contributor

@janisz janisz commented Jan 15, 2026

Description

Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving 87.5% pass rate with GPT-5 models.

Validation

./scripts/run-tests.sh
══════════════════════════════════════════════════════════
  StackRox MCP E2E Testing with Gevals
══════════════════════════════════════════════════════════

Loading environment variables from .env...
Configuration:
  Agent Model: gpt-4o
  Judge Model: gpt-4o
  MCP Server: stackrox-mcp (via go run)

Running gevals tests...


=== Starting Evaluation ===

Task: list-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-workloads
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-nonexistent
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-scooby
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-maria
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-clusters-general
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-list
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

=== Evaluation Complete ===

📄 Results saved to: gevals-stackrox-mcp-e2e-out.json

=== Results Summary ===

Task: list-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/list-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-workloads
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-workloads.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-nonexistent
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-nonexistent.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-scooby
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-scooby.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-maria
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-maria.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-clusters-general
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-clusters-general.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-list
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-list.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

=== Overall Statistics ===
Total Tasks: 8
Tasks Passed: 8/8
Assertions Passed: 24/24

=== Statistics by Difficulty ===

easy:
  Tasks: 8/8
  Assertions: 24/24

══════════════════════════════════════════════════════════
  Tests Completed Successfully!
══════════════════════════════════════════════════════════

@codecov-commenter
Copy link

codecov-commenter commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.94%. Comparing base (df15807) to head (b830585).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #26      +/-   ##
==========================================
+ Coverage   77.58%   77.94%   +0.35%     
==========================================
  Files          26       26              
  Lines        1120     1138      +18     
==========================================
+ Hits          869      887      +18     
  Misses        216      216              
  Partials       35       35              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

difficulty: easy
steps:
prompt:
inline: "is this CVE-2016-1000031 affecting me?"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Let's capitalize first letter.

Suggested change
inline: "is this CVE-2016-1000031 affecting me?"
inline: "Is this CVE-2016-1000031 detected in my clusters?"

Please check other prompts for start of a sentence capitalization.

We would like to shift statement from affected to detected. Please adjust other prompts to not us affect.

schema.Properties["limit"].Minimum = jsonschema.Ptr(0.0)
schema.Properties["limit"].Default = toolsets.MustJSONMarshal(defaultLimit)
schema.Properties["limit"].Description = "Maximum number of clusters to return (default: 0 - unlimited)"
schema.Properties["limit"].Description = "Maximum number of clusters to return. Use 0 for unlimited (default). When using pagination, always provide both limit and offset together. Default: 0."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick:

Suggested change
schema.Properties["limit"].Description = "Maximum number of clusters to return. Use 0 for unlimited (default). When using pagination, always provide both limit and offset together. Default: 0."
schema.Properties["limit"].Description = "Maximum number of clusters to return. When using pagination, always provide both limit and offset together. Use 0 for unlimited (default)."

schema.Properties["cveName"].Description = "CVE name to filter clusters (e.g., CVE-2021-44228)"
schema.Properties["filterClusterId"].Description = "Optional cluster ID to verify if a specific cluster is affected"
schema.Properties["filterClusterId"].Description =
"Optional cluster ID or cluster name to verify if a specific cluster is affected. " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. Only cluster ID is supported. We are not accepting cluster name. Cluster name can be resolved from list clusters tool.

schema.Properties["cveName"].Description = "CVE name to filter clusters (e.g., CVE-2021-44228)"
schema.Properties["filterClusterId"].Description = "Optional cluster ID to verify if a specific cluster is affected"
schema.Properties["filterClusterId"].Description =
"Optional cluster ID or cluster name to verify if a specific cluster is affected. " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to use word affected - because we can not say with confidence: "it is not affected". We can only say, if CVE is detected.

Suggested change
"Optional cluster ID or cluster name to verify if a specific cluster is affected. " +
"Optional cluster ID to verify if CVE is detected in a specific cluster. " +

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file looks like leftover from the initial implementation.

prompt:
inline: "Is CVE-2099-00001 affecting my clusters?"
verify:
contains: "A response indicating whether CVE-2099-00001 is affecting clusters, or stating it is not affecting any clusters"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, it's about calling correct MCP tool, but results generated by LLM are also relevant. If we have everywhere "find something", or "nothing" - then anything returned by LLM will be acceptable.

At least in this case we know it should be "nothing". And it would be good if we could change other tests to have expectations that correct data is returned by LLM.

Comment on lines 7 to 9
inline: "list my clusters"
verify:
contains: "cluster names"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick:

Suggested change
inline: "list my clusters"
verify:
contains: "cluster names"
inline: "List my clusters"
verify:
contains: "A response contains list of cluster names"


# Test 2: CVE affecting workloads
- path: tasks/cve-affecting-workloads.yaml
assertions:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to define assertions in tasks file? i.e. in for this case tasks/cve-affecting-workloads.yaml

argumentsMatch:
cveName: "CVE-2016-1000031"
minToolCalls: 1
maxToolCalls: 3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What maxToolCalls means? Tools defined in toolsUsed can be called up to 3 times?

- server: stackrox-mcp
toolPattern: "list_clusters"
minToolCalls: 1
maxToolCalls: 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLM should not fetch list of clusters twice:

Suggested change
maxToolCalls: 2
maxToolCalls: 1

@janisz janisz marked this pull request as draft January 19, 2026 17:39
janisz and others added 3 commits January 19, 2026 18:41
Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving 87.5% pass rate with GPT-5 models.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>
Fix E2E test assertion failures by improving tool descriptions with
smart usage pattern guidance. Tool descriptions now clearly indicate:

- When to call all three CVE tools for comprehensive coverage
  ("Is CVE-X detected in my clusters?" without specific cluster name)
- When to call only specific tools for targeted queries
  ("Is CVE-X detected in cluster staging-central-cluster?")

Changes:
- Update vulnerability tool descriptions (clusters, deployments, nodes)
  to use directive language and clear usage patterns
- Adjust cve-nonexistent test maxToolCalls from 2 to 3 to match
  comprehensive check pattern
- Update cve-cluster-does-not-exist verification to accept both
  "CVE not detected" and "cluster doesn't exist" responses

Results: All 24/24 E2E test assertions now pass (improved from 21/24).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
janisz and others added 2 commits January 20, 2026 11:30
Split long description strings in tool definitions to comply with the
120-character line limit by breaking at natural sentence boundaries.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…criptions

Changes:
- Switch E2E agent from GPT-4o to Claude Sonnet 4.5 via Vertex AI
- Add enableAllTools: true to MCP config for auto-approval
- Configure gpt-5-nano as LLM judge for cost efficiency
- Improve CVE tool descriptions with clear WHEN TO USE/WHEN NOT TO USE sections
- Update test assertions to account for Claude's comprehensive CVE checking behavior
- Update run-tests.sh to export Vertex AI environment variables

The tool descriptions now explicitly guide when to use each CVE detection tool:
- General "clusters" queries → comprehensive check (all 3 tools)
- Specific component queries → single relevant tool only
- Single cluster queries → orchestrator tool with cluster filter

All 8 E2E tests passing with 24/24 assertions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants