
test(uipath-mcp-servers): fix e2e-allkinds grading flake + create-task save-location flake #1826

Open: cosmyo wants to merge 5 commits into main from fix/mcp-allkinds-behavioral-grading

Conversation


@cosmyo cosmyo commented Jul 2, 2026

Collaborator

Problem

The skill-mcp-servers-e2e-uipath-allkinds task scored 0.933 / FAILURE on an otherwise-successful run. All 6 tools landed on the live tenant and the Jira curated activity was authored correctly — but one criterion failed:

command_executed / tool_name: Read / integration-service/(resources|reference-resolution).md → 0/1

The agent did load integration-service/resources.md (for the Jira cascade) — via cat in a Bash command rather than the Read tool:

# Read the resources.md reference for cascade / Parent-Field-Driven Custom Fields
cat .../uipath-platform/references/integration-service/resources.md

The criterion graded the tool mechanism (which tool opened the file), not whether the reference was consulted. Both tool_name: Read criteria in this task had this flaw.

Root cause (confirmed in coder_eval source)

coder_eval/criteria/command_executed.py:

if criterion.tool_name is not None and cmd.tool_name != criterion.tool_name:
    continue                                   # tool filter only applies when set
...
if cmd.tool_name == "Bash" and cmd.parameters.get("command"):
    cmd_text = cmd.parameters["command"]       # Bash → the command string
else:
    cmd_text = json.dumps(cmd.parameters)      # Read → {"file_path": "..."}

So with tool_name omitted, the pattern is matched against every tool call: the Read tool's serialized parameters (which include file_path) as well as a Bash cat/grep/head command string. The filename can stay in the check; it just shouldn't be pinned to the Read tool.
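The effect of the tool filter can be reproduced with a minimal sketch. This is not the coder_eval implementation: the `Command` dataclass, the `matches` helper, the use of `re.search`, and the `refs/...` paths are all assumptions made for illustration; only the filter and the Bash/Read text selection mirror the excerpt above.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    """Hypothetical stand-in for one recorded tool call."""
    tool_name: str
    parameters: dict = field(default_factory=dict)


def matches(cmd: Command, pattern: str, tool_name: Optional[str] = None) -> bool:
    """Return True if the call passes the optional tool filter and the pattern."""
    # The tool filter only applies when tool_name is set.
    if tool_name is not None and cmd.tool_name != tool_name:
        return False
    if cmd.tool_name == "Bash" and cmd.parameters.get("command"):
        text = cmd.parameters["command"]      # Bash: the command string
    else:
        text = json.dumps(cmd.parameters)     # e.g. Read: {"file_path": "..."}
    # Assumption: the pattern is applied as a regex search.
    return re.search(pattern, text) is not None


PATTERN = r"integration-service/(resources|reference-resolution)\.md"
# Illustrative paths only; the real transcript paths are not reproduced here.
read_call = Command("Read", {"file_path": "refs/integration-service/resources.md"})
cat_call = Command("Bash", {"command": "cat refs/integration-service/resources.md"})

pinned = [matches(c, PATTERN, tool_name="Read") for c in (read_call, cat_call)]
unpinned = [matches(c, PATTERN) for c in (read_call, cat_call)]
print(pinned)    # [True, False]
print(unpinned)  # [True, True]
```

With the pin in place the Bash `cat` call is filtered out before the pattern is ever tested, which is the false failure described above; without it, both calls satisfy the same pattern.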

Fix (criteria 13 → 15)

  • Keep the reference-load process signals, drop tool_name: Read — now they match whether the agent used Read or cat/grep/head. Verified against the real transcript: is-activity-workflow.md matched via the Read tool, integration-service/resources.md matched via the Bash cat.
  • Add two live-tenant OUTCOME checks the 6-tool count can't give:
    • Gmail get-labels authored for the right connector/operation (Metadata.connector.key == uipath-google-gmail, object method == GET)
    • Jira curated tool resolved its -f cascade — InputSchema.properties has fields.summary and fields.description

Both jq programs verified green (exit 0) against the captured live mcp-tools list output from the failing run.
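For readers who prefer to see the two outcome checks spelled out, here is a rough Python equivalent. It is a sketch under assumptions, not the jq programs from the task file: the tool records are invented sample data, `Metadata` and `InputSchema` are assumed to be JSON-encoded strings, and `fields.summary` / `fields.description` are assumed to appear as flat dotted keys under `properties`.

```python
import json


def gmail_tool_ok(tool: dict) -> bool:
    """Check connector key and object method in the string-encoded Metadata."""
    meta = json.loads(tool["Metadata"])
    return (
        meta.get("connector", {}).get("key") == "uipath-google-gmail"
        and meta.get("object", {}).get("method") == "GET"
    )


def jira_cascade_ok(tool: dict) -> bool:
    """Check that the cascade exposed summary and description as inputs."""
    props = json.loads(tool["InputSchema"]).get("properties", {})
    # Assumption: the properties are keyed by flat dotted names.
    return {"fields.summary", "fields.description"}.issubset(props)


# Invented sample records; real tenant output may be shaped differently.
gmail_tool = {
    "Metadata": json.dumps(
        {"connector": {"key": "uipath-google-gmail"}, "object": {"method": "GET"}}
    )
}
jira_tool = {
    "InputSchema": json.dumps(
        {"properties": {"fields.summary": {}, "fields.description": {}}}
    )
}
unresolved_jira_tool = {"InputSchema": json.dumps({"properties": {}})}

print(gmail_tool_ok(gmail_tool))              # True
print(jira_cascade_ok(jira_tool))             # True
print(jira_cascade_ok(unresolved_jira_tool))  # False
```

The point of both checks is that a tool can exist (and count toward the 6-tool total) while still being authored for the wrong operation or with an unresolved cascade.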

Result: the check is now both robust (no dependency on which tool opens a file) and more accurate (grades actual tool correctness, not just that a file was opened).

Validation

  • YAML parses (15 criteria)
  • Both file-load patterns match the real transcript (Read + cat); both outcome jq programs return true
  • lint-task: OK; CLI-verb check: 0 High/Medium (all Info)
  • No skill change — the skill already steered the agent to the reference correctly

Follow-up (not in this PR)

The tool_name: Read reference-load pattern is used in ~10 other task files (uipath-review, uipath-api-workflow) and carries the same latent flake. Candidate for a repo-wide sweep to drop the tool_name pin on file-load criteria.
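A sweep like this could start from a simple text scan. The sketch below is only one possible approach: the `find_read_pins` helper, the regex, and the demo directory layout are hypothetical, and a text match cannot tell file-load criteria apart from other legitimate uses of the pin, so each hit would still need manual review.

```python
import re
import tempfile
from pathlib import Path

# Matches an uncommented `tool_name: Read` line, optionally starting a list item.
PIN = re.compile(r"^\s*(?:-\s+)?tool_name:\s*Read\s*$", re.MULTILINE)


def find_read_pins(tasks_root: Path) -> dict[str, int]:
    """Map each task.yaml under tasks_root to its count of Read pins."""
    hits = {}
    for path in sorted(tasks_root.rglob("task.yaml")):
        count = len(PIN.findall(path.read_text(encoding="utf-8")))
        if count:
            hits[path.relative_to(tasks_root).as_posix()] = count
    return hits


# Demo on a throwaway tree; directory and task names are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pinned_dir = root / "uipath-review" / "demo-task"
    clean_dir = root / "uipath-mcp-servers" / "demo-task"
    pinned_dir.mkdir(parents=True)
    clean_dir.mkdir(parents=True)
    (pinned_dir / "task.yaml").write_text(
        "- type: command_executed\n  tool_name: Read\n  command_pattern: 'x\\.md'\n",
        encoding="utf-8",
    )
    (clean_dir / "task.yaml").write_text(
        "# tool_name: Read was dropped here\n- type: command_executed\n",
        encoding="utf-8",
    )
    hits = find_read_pins(root)

print(hits)  # {'uipath-review/demo-task/task.yaml': 1}
```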

🤖 Generated with Claude Code


Commit 2 — create-task artifact save location (outlook/gmail/slack/jira)

skill-mcp-servers-outlook-create scored 0.000 (0/4) with "File 'metadata.json' does not exist" — yet the agent built the Outlook tool correctly (right connector/resource, baked UTC timezone, dry-run OK) and produced every artifact. It just wrote them to a self-created output/ subdirectory (mkdir -p .../output) while the criteria check root paths.

Root cause: the prompt said Save: metadata.json, ... with no location, so the agent invented output/. Prompt-ambiguity/test issue, not a skill issue.

All four create smoke tasks shared the ambiguity (none forbade a subdir; slack's "to cwd" wouldn't have helped — output/ was under cwd). Fixed all four prompts to pin the save location to the working-directory root and explicitly forbid a subdirectory. Prompts only — no skill or success_criteria changes.
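The mismatch between where the agent wrote and where the criteria looked is easy to reproduce. The following is an illustrative sketch only: `file_exists_check` is a hypothetical stand-in for a root-relative file-exists criterion, not the grader's actual code.

```python
import tempfile
from pathlib import Path

ARTIFACT = "metadata.json"


def file_exists_check(workdir: Path, rel_path: str) -> bool:
    """Hypothetical file-exists criterion resolved against the working directory."""
    return (workdir / rel_path).is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # What the agent did: create output/ and save the artifact there.
    (root / "output").mkdir()
    (root / "output" / ARTIFACT).write_text("{}", encoding="utf-8")
    in_subdir = file_exists_check(root, ARTIFACT)
    # What the fixed prompt asks for: save at the working-directory root.
    (root / ARTIFACT).write_text("{}", encoding="utf-8")
    at_root = file_exists_check(root, ARTIFACT)

print(in_subdir, at_root)  # False True
```

Because `output/` sits under the working directory, an instruction like "to cwd" is still satisfied by the subdirectory; only an explicit "no subdirectory" clause removes the ambiguity.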

@cosmyo cosmyo requested a review from a team as a code owner July 2, 2026 14:31

github-actions Bot commented Jul 2, 2026

Contributor

Claude finished @cosmyo's task in 2m 8s


PR Review: test(uipath-mcp-servers): grade e2e-allkinds is-activity by outcome, not Read-tool

  • Read review criteria and project conventions
  • Read full diff
  • Read full changed file for context
  • Check PR conversation for existing feedback
  • Verify YAML structure and success criteria
  • Post review findings

Summary

Replaces two command_executed / tool_name: Read criteria in the e2e-uipath-allkinds task with run_command checks that verify actual tool state on the live tenant. This fixes a flake where the agent opened references via cat/grep instead of the Read tool, causing a false failure on an otherwise-correct run.

Change-by-Change Review

1. tests/tasks/uipath-mcp-servers/e2e-uipath-allkinds/task.yaml

Severity: OK — well-executed change.

What changed:

  • Two command_executed criteria (lines 99–112 old) that checked for tool_name: Read with command_pattern matching is-activity-workflow.md and integration-service/(resources|reference-resolution).md are replaced by two run_command criteria that verify behavioral outcomes on the live tenant.
  • The Gmail check (line 105–111) verifies the tool has Metadata.connector.key == "uipath-google-gmail" and object.method == "GET".
  • The Jira check (line 119–125) verifies the tool's InputSchema.properties has fields.summary and fields.description (proving the -f cascade resolved).
  • Both use jq -e (exit 1 on false/null) with expected_exit_code: 0, timeout: 60, weight 1.5 — consistent with the existing run_command pattern at line 149.
  • Criterion count stays at 13; weights unchanged at 1.5 each.
  • Comment block updated to explain the grading philosophy.

Verification:

  • YAML is valid (confirmed via grep-based count: 13 - type: entries).
  • No tool_name: Read criteria remain (only in explanatory comments at lines 103, 116).
  • The run_command structure matches the sibling pattern in e2e-platform-server/task.yaml (same fields: command, timeout, expected_exit_code).
  • jq programs look correct: fromjson on the string-typed Metadata/InputSchema fields, then field access and boolean assertions via -e.
  • The $(jq -r .slug report.json) subshell is consistent with the existing check at line 151.

One minor note: The comments at lines 99–104 and 113–118 are longer than strictly needed (6 lines each explaining the old proxies). The repo's token-optimization rules say to strip words that don't carry information — the "replaces a former…" sentences are changelog context that belongs in the PR description (which already has it) rather than the YAML. This is cosmetic; not flagging as an issue since comments inside test YAMLs don't burn agent tokens at runtime.

What's Missing

Nothing identified.

Area Ratings

| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | No skill files changed |
| E2E Tests | OK | Criteria correctly grade behavior over mechanism |
| Skill Body | N/A | No skill files changed |
| References & Assets | N/A | No references changed |
| Repo Hygiene | OK | Scoped change, no secrets, no unrelated modifications |

Issues for Manual Review

  • The jq programs were verified against captured output from the failing run (per the PR description). Worth confirming on a fresh nightly run that the Metadata and InputSchema JSON shapes haven't changed since that capture.

Conclusion

Clean change that aligns the test with the repo's "Grade Behavior, Not Self-Reports" principle. The new run_command criteria are structurally consistent with existing patterns, the jq assertions are sound, and criterion count/weights are preserved. Approve.


github-actions Bot commented Jul 2, 2026

Contributor

Claude finished @cosmyo's task in 3m 6s


Coder-eval task lint (advisory)

1 task YAML changed; verdict counts: 0 Critical, 0 High, 0 Medium, 0 Low, 1 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

⚠️ Borderline — PR body does not explicitly claim the modified task was re-run and passed end-to-end, but does state: "Both jq programs were verified against the captured live output from the failing run (each returns true, exit 0)." The author verified the new criteria against real tenant output from the run that exposed the flake (0.933 / FAILURE), demonstrating the modified task would have passed. This is semantically close to a passing-run claim, but consider adding a line like: Ran skill-mcp-servers-e2e-uipath-allkinds and it passed. after the next nightly to make it explicit.

Per-task lint

tests/tasks/uipath-mcp-servers/e2e-uipath-allkinds/task.yaml — verdict: OK

The change replaces two command_executed / tool_name: Read proxies (lines 99–125 old) with two run_command criteria that validate live tenant outcomes — Gmail tool connector/operation metadata and Jira cascade field resolution. This is a textbook improvement per the repo rule "Grade Behavior, Not Self-Reports":

  • Self-report anti-pattern: Clean. report.json is an operational fixture for cleanup/slug lookup, not a self-assessment.
  • Prompt over-specification: Clean. Prompt supplies ground-truth anchors (entity names, connection IDs, folder keys) required for the live tenant — no procedure leakage.
  • Meaningful coverage: Strong. 13 criteria span command execution (7), skill trigger (1), live tenant validation (3), and file existence (1). The new run_command criteria with jq -e assertions on metadata fields are more meaningful than the replaced read-proxies.
  • Could pass for the wrong reason: Clean. run_command criteria query the live tenant — can't be faked without actually creating the tools.
  • Near-duplicate: Clean. Only e2e task exercising all tool kinds on a single server; neighbors test individual server types or smoke-tier mocks.
  • CLI verb reachability: Skipped (sandbox did not permit python3 scripts/check-cli-verbs.py). PR body reports lint-task CLI-verb check as all Info.
  • Redundant uip CLI in sandbox: Clean — no sandbox block; inherits from experiment.
  • Run-limit fields under agent: Clean — run_limits: at top level (lines 12–14), no agent: block.

Conclusion

✅ All changed tasks pass the rubric. Evidence of passing run is borderline-acceptable (verified against captured live output, not a full re-run). Consider confirming after the next nightly.


@cosmyo cosmyo force-pushed the fix/mcp-allkinds-behavioral-grading branch from d8dcaec to 356e8e3 Compare July 2, 2026 14:38
…) + add outcome checks for e2e-allkinds

Reading the mandated references is a HARD requirement for this task; the
test must fail if the agent doesn't read them. The two reference-read
criteria enforce that (min_count 1, pass_threshold 1.0, weight 1.5) — any
sub-threshold criterion marks the task FAILURE (the original run failed at
0.933 for exactly this reason).

The bug was only in HOW the read was detected: `tool_name: Read` matched
the Read tool only, so a legitimate `cat .../resources.md` in Bash scored
0 and failed the task even though the reference WAS read (confirmed in the
real transcript).

Root cause in coder_eval/criteria/command_executed.py: `tool_name` filters
to that tool only when set; the pattern is matched against the Read tool's
params JSON (`{"file_path": ...}`) OR a Bash command string. Dropping the
`tool_name` pin therefore requires the read while accepting Read tool OR
cat/grep/head.

Changes (criteria 13 -> 15):
- Keep BOTH reference reads as required hard gates, drop `tool_name: Read`
  so they don't flake on cat/grep vs Read. Verified against the real
  transcript: is-activity-workflow.md matched via Read, resources.md via cat.
- Add two live-tenant OUTCOME checks (the 6-tool count can't verify these):
  * Gmail get-labels authored for the right connector/operation
    (Metadata.connector.key = uipath-google-gmail, object method GET)
  * Jira curated tool resolved its `-f` cascade to expose
    fields.summary + fields.description as runtime inputs
  Both jq programs verified green against the captured live output.

The reference reads (process) and the tool outcomes are complementary —
the reads prove the workflow was consulted, the outcomes prove it was
applied correctly. Neither replaces the other.

Note: the `tool_name: Read` reference-read pattern is used in ~10 other
task files (uipath-review, uipath-api-workflow) with the same latent flake
— follow-up candidate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
@cosmyo cosmyo force-pushed the fix/mcp-allkinds-behavioral-grading branch from 356e8e3 to a92f076 Compare July 2, 2026 14:41
…wd root (no output/ subdir)

skill-mcp-servers-outlook-create scored 0.000 — all 4 criteria failed with
"File 'metadata.json' does not exist". The agent actually did the task
correctly (right connector/resource, baked UTC timezone, dry-run
succeeded) and produced every artifact — but wrote them to a self-created
`output/` subdirectory (`mkdir -p .../output`), while the criteria check
root-relative paths (`path: metadata.json`).

Root cause: the prompt said "Save: metadata.json, ..." with no location,
so the agent invented `output/`. This is a prompt-ambiguity/test issue,
not a skill issue — artifact-saving is a test-grading construct, and the
skill's tool-building guidance was followed correctly.

All four create smoke tasks shared the same latent ambiguity (none forbade
a subdirectory; "to cwd" wouldn't have helped since `output/` was under
cwd). Make the save location explicit and forbid subdirectories in all
four so the same flake can't recur:
- outlook-create, gmail-create, slack-create: "Save these files in the
  current working directory (the sandbox root) — do NOT create a
  subdirectory such as `output/`: ..."
- jira-create: added the same no-subdir clause to its "Save to the sandbox
  cwd:" header.

No skill or success_criteria changes; prompts only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
@cosmyo cosmyo changed the title from "test(uipath-mcp-servers): grade e2e-allkinds is-activity by outcome, not Read-tool" to "test(uipath-mcp-servers): fix e2e-allkinds grading flake + create-task save-location flake" Jul 2, 2026
cosmyo and others added 3 commits July 2, 2026 18:06
…ence for curated activities

A live e2e-allkinds run (2026-06-19) built the Jira curated_create_issue
tool correctly (summary+description resolved) but NEVER read
integration-service/resources.md — it worked the -f cascade from prior
knowledge of Jira's fields. The e2e test requires reading that reference
(the read is imperative for correctness on connectors the agent does NOT
already know), so the run failed the required-read gate at 0.933.

Root cause in the skill: is-activity-workflow.md pointed at resources.md
via a soft "first read" table row, but Critical Rule 3 inlined the Jira
cascade specifics (omit --action, project.key + issuetype.id), giving a
knowledgeable agent enough to skip the reference. The skill undercut its
own read mandate.

Strengthen the read into a hard precondition for curated/cascade
activities (the exact shortcut the agent took):
- Rule 3: "MUST read resources.md §Parent-Field-Driven Custom Fields
  before the first -f cascade re-run — knowing a connector's cascade
  fields from memory (e.g. Jira project.key + issuetype.id) does NOT
  exempt you; -f/--operation/--action rules and shapes vary per connector."
- "read by action" table: mark the cascade row mandatory incl. curated
  activities you "know".
- Workflow step 3 comment: [READ FIRST, mandatory even if you know the
  fields], matching the existing [READ: ...] marker style.

Pairs with the e2e-allkinds test's required-read gate (kept as a hard gate,
now mechanism-agnostic) — the test enforces the read, the skill makes the
agent actually do it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
…ck-fixture exploration

skill-mcp-servers-gmail-create hit MAX_TURNS_EXHAUSTED (score 0.000). The
agent actually COMPLETED the task — it wrote all six artifacts and marked
its todo done at tool call 43 — but needed 45 turns (62 assistant turns)
against max_turns 40, so the run was cut off (and, pre-fix, wrote to
`output/` too, which is addressed separately).

Turn budget went to exploration, not the real work (~16 of 44 calls):
- The Skill tool failed to load (`Execute skill: uipath:uipath-mcp-servers`
  errored), so the agent ran `find /` twice to locate the reference.
- It read all 13 mock fixture files under `mocks/responses/` individually
  to understand the sandbox before acting (~15 turns). A real run has no
  such files; the offline harness leaves them visible and the agent
  explores them.
- The reference reads the skill mandates for is-activity authoring
  legitimately add turns.

40 was also inconsistent with slack-create (60) for the same task shape.

Fixes (test setup):
- Raise max_turns 40 -> 60 on gmail/jira/outlook-create (matches slack).
- Add to all four create prompts: the mocked `uip` returns realistic
  responses — run `uip` directly; do NOT read/inspect the `mocks/`
  fixtures — to cut the ~15-turn exploration that caused the overrun.

Not a skill-content problem: skill length doesn't change turn count (one
Read per reference); the overrun was environment exploration + a too-low
limit. The Skill-tool load failure is a harness/sandbox limitation noted
for follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
…t cause of skill-load failure)

ROOT CAUSE of the gmail-create max-turns exhaustion (and the mock-run
disorientation generally): all 7 uipath-mcp-servers smoke/mock tasks
override `agent.allowed_tools: [Bash, Read, Write]`, which OMITS the
`Skill` tool. So the agent's `Skill(uipath:uipath-mcp-servers)` call is
permission-denied — it cannot load the skill it is meant to test.

Evidence:
- Run config shows allowed_tools = [Bash, Read, Write]; the Skill call
  returns result_status=error ("Execute skill: ...") in all 3 tempdir/mock
  runs, vs "Launching skill: ..." success in both e2e/docker runs (which do
  NOT override allowed_tools and inherit Skill from the experiment).
- After the denial the agent falls back to `find /` for the reference file
  and reads the mock fixtures to orient — burning turns. gmail did two
  whole-filesystem finds + 12 mock reads and overran max_turns 40.
- PR #1518 removed the "load the skill" directive from prompts to test
  auto-triggering — which ASSUMES the Skill tool is available. With Skill
  stripped, the agent can't auto-trigger even when it correctly identifies
  the skill.
- The repo's own correct pattern: uipath-platform traces/licensing tasks
  override with `[Skill, Bash, Read, Write]`. The mcp-servers tasks simply
  forgot Skill.

Fix: add `Skill` to allowed_tools on all 7 tasks (gmail, jira,
jira-update-fix-lookups, outlook, remote, resource, slack). With the Skill
tool available the agent loads the skill in one call and no longer hunts
the filesystem — which is the actual fix; the earlier max_turns bump and
mock-inspection guidance were mitigations for this root cause.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
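
The allowed_tools omission described in the last commit is the kind of thing a small lint could catch. The sketch below is an assumption-laden illustration, not part of this PR: `missing_skill` is a hypothetical helper, the task configs are hand-written dicts rather than parsed task files, and a missing override is treated as inheriting the experiment's tools, as the commit message describes for the e2e runs.

```python
def missing_skill(tasks: dict[str, dict]) -> list[str]:
    """Return names of tasks whose allowed_tools override omits Skill."""
    flagged = []
    for name, config in tasks.items():
        allowed = config.get("agent", {}).get("allowed_tools")
        # No override: the task inherits its tools from the experiment.
        if allowed is not None and "Skill" not in allowed:
            flagged.append(name)
    return sorted(flagged)


# Hand-written sample configs; names and shapes are illustrative.
tasks = {
    "gmail-create": {"agent": {"allowed_tools": ["Bash", "Read", "Write"]}},
    "platform-traces": {
        "agent": {"allowed_tools": ["Skill", "Bash", "Read", "Write"]}
    },
    "e2e-uipath-allkinds": {},  # no agent block, so nothing is overridden
}

flagged = missing_skill(tasks)
print(flagged)  # ['gmail-create']
```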