test(uipath-mcp-servers): fix e2e-allkinds grading flake + create-task save-location flake by cosmyo · Pull Request #1826 · UiPath/skills

cosmyo · 2026-07-02T14:31:40Z

Problem

The skill-mcp-servers-e2e-uipath-allkinds task scored 0.933 / FAILURE on an otherwise-successful run. All 6 tools landed on the live tenant and the Jira curated activity was authored correctly — but one criterion failed:

command_executed / tool_name: Read / integration-service/(resources|reference-resolution).md → 0/1

The agent did load integration-service/resources.md (for the Jira cascade) — via cat in a Bash command rather than the Read tool:

# Read the resources.md reference for cascade / Parent-Field-Driven Custom Fields
cat .../uipath-platform/references/integration-service/resources.md

The criterion graded the tool mechanism (which tool opened the file), not whether the reference was consulted. Both tool_name: Read criteria in this task had this flaw.

Root cause (confirmed in coder_eval source)

coder_eval/criteria/command_executed.py:

if criterion.tool_name is not None and cmd.tool_name != criterion.tool_name:
    continue                                   # tool filter only applies when set
...
if cmd.tool_name == "Bash" and cmd.parameters.get("command"):
    cmd_text = cmd.parameters["command"]       # Bash → the command string
else:
    cmd_text = json.dumps(cmd.parameters)      # Read → {"file_path": "..."}

So omitting tool_name matches the pattern against every tool — the Read tool's file_path and a Bash cat/grep/head command string. The filename can stay in the check; it just shouldn't be pinned to the Read tool.
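The filter behaviour described above can be reproduced with a minimal sketch. The `Command` and `Criterion` classes, the `count_matches` helper, the use of `re.search`, and the `refs/...` paths are illustrative assumptions, not the actual coder_eval types; only the two `if` branches mirror the excerpt quoted from command_executed.py.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    """Hypothetical stand-in for one tool call recorded in a transcript."""
    tool_name: str
    parameters: dict = field(default_factory=dict)


@dataclass
class Criterion:
    """Hypothetical stand-in for a command_executed criterion."""
    command_pattern: str
    tool_name: Optional[str] = None


def count_matches(criterion: Criterion, commands: list[Command]) -> int:
    """Count tool calls whose text matches the pattern, honoring the optional tool filter."""
    hits = 0
    for cmd in commands:
        # The tool filter only applies when tool_name is set on the criterion.
        if criterion.tool_name is not None and cmd.tool_name != criterion.tool_name:
            continue
        if cmd.tool_name == "Bash" and cmd.parameters.get("command"):
            cmd_text = cmd.parameters["command"]    # Bash: the command string
        else:
            cmd_text = json.dumps(cmd.parameters)   # Read: {"file_path": "..."}
        if re.search(criterion.command_pattern, cmd_text):
            hits += 1
    return hits


PATTERN = r"integration-service/(resources|reference-resolution)\.md"

# Hypothetical transcript: one reference opened with cat, one with the Read tool.
transcript = [
    Command("Bash", {"command": "cat refs/integration-service/resources.md"}),
    Command("Read", {"file_path": "refs/is-activity-workflow.md"}),
]

pinned = count_matches(Criterion(PATTERN, tool_name="Read"), transcript)
unpinned = count_matches(Criterion(PATTERN), transcript)
```

With the pin, the Bash call is skipped before the pattern is ever tried, so the criterion sees no match; without it, the same `cat` command satisfies the pattern.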

Fix (criteria 13 → 15)

- Keep the reference-load process signals but drop tool_name: Read, so they now match whether the agent used Read or cat/grep/head. Verified against the real transcript: is-activity-workflow.md matched via the Read tool, integration-service/resources.md matched via the Bash cat.
- Add two live-tenant OUTCOME checks the 6-tool count can't give:
  - Gmail get-labels authored for the right connector/operation (Metadata.connector.key == uipath-google-gmail, object method == GET)
  - Jira curated tool resolved its -f cascade: InputSchema.properties has fields.summary and fields.description

Both jq programs verified green (exit 0) against the captured live mcp-tools list output from the failing run.
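For readers who want to see what the two outcome checks assert, here is a rough Python equivalent. It is a sketch under stated assumptions, not the jq programs from the task: it assumes `Metadata` and `InputSchema` arrive as JSON-encoded strings, that the method lives at `Metadata.object.method`, and that the resolved cascade fields appear as flat keys named `fields.summary` and `fields.description`. The helper names and sample payloads are hypothetical.

```python
import json


def gmail_tool_ok(tool: dict) -> bool:
    """True if the tool targets the Gmail connector with a GET operation."""
    meta = json.loads(tool["Metadata"])  # assumed to be a JSON-encoded string
    return (
        meta.get("connector", {}).get("key") == "uipath-google-gmail"
        and meta.get("object", {}).get("method") == "GET"
    )


def jira_cascade_resolved(tool: dict) -> bool:
    """True if the -f cascade exposed both issue fields as runtime inputs."""
    props = json.loads(tool["InputSchema"]).get("properties", {})
    # Assumption: the cascade fields show up as flat, dotted property names.
    return {"fields.summary", "fields.description"} <= set(props)


# Hypothetical payloads shaped like entries from a tools listing.
gmail_tool = {
    "Metadata": json.dumps(
        {"connector": {"key": "uipath-google-gmail"}, "object": {"method": "GET"}}
    )
}
jira_tool = {
    "InputSchema": json.dumps(
        {"properties": {"fields.summary": {}, "fields.description": {}}}
    )
}
unresolved_jira_tool = {"InputSchema": json.dumps({"properties": {"project.key": {}}})}

gmail_ok = gmail_tool_ok(gmail_tool)
jira_ok = jira_cascade_resolved(jira_tool)
jira_unresolved_ok = jira_cascade_resolved(unresolved_jira_tool)
```

Both helpers return a plain boolean, which plays the same role as a boolean jq expression paired with an expected exit code of 0.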

Result: the check is now both robust (no dependency on which tool opens a file) and more accurate (grades actual tool correctness, not just that a file was opened).

Validation

YAML parses (15 criteria)
Both file-load patterns match the real transcript (Read + cat); both outcome jq programs return true
lint-task: OK; CLI-verb check: 0 High/Medium (all Info)
No skill change — the skill already steered the agent to the reference correctly

Follow-up (not in this PR)

The tool_name: Read reference-load pattern is used in ~10 other task files (uipath-review, uipath-api-workflow) and carries the same latent flake. Candidate for a repo-wide sweep to drop the tool_name pin on file-load criteria.
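One way such a sweep could start is a small scanner that lists every task.yaml still pinning a criterion to the Read tool. This is a sketch only: the `find_read_pins` helper, the regular expression, and the fixture tree are assumptions for illustration, and a line-based scan ignores YAML structure, so hits would still need a manual look.

```python
import re
import tempfile
from pathlib import Path

# Matches an active `tool_name: Read` key (optionally a list item), not a "#" comment.
PIN = re.compile(r"^\s*(-\s+)?tool_name:\s*Read\s*$")


def find_read_pins(tasks_root: Path) -> list[tuple[str, int]]:
    """Return (relative task.yaml path, 1-based line number) for each pinned line."""
    hits = []
    for task_file in sorted(tasks_root.rglob("task.yaml")):
        lines = task_file.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if PIN.match(line):
                hits.append((task_file.relative_to(tasks_root).as_posix(), lineno))
    return hits


# Tiny hypothetical fixture tree standing in for tests/tasks/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pinned_task = root / "uipath-review" / "example-task" / "task.yaml"
    clean_task = root / "uipath-mcp-servers" / "e2e-uipath-allkinds" / "task.yaml"
    for path in (pinned_task, clean_task):
        path.parent.mkdir(parents=True)
    pinned_task.write_text(
        "- type: command_executed\n  tool_name: Read\n  command_pattern: 'foo\\.md'\n",
        encoding="utf-8",
    )
    clean_task.write_text(
        "# tool_name: Read was dropped here\n"
        "- type: command_executed\n  command_pattern: 'bar\\.md'\n",
        encoding="utf-8",
    )
    read_pins = find_read_pins(root)
```

Explanatory comments that mention the old pin are deliberately not reported, since only an active key carries the flake.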

🤖 Generated with Claude Code

Commit 2 — create-task artifact save location (outlook/gmail/slack/jira)

skill-mcp-servers-outlook-create scored 0.000 (0/4) with "File 'metadata.json' does not exist" — yet the agent built the Outlook tool correctly (right connector/resource, baked UTC timezone, dry-run OK) and produced every artifact. It just wrote them to a self-created output/ subdirectory (mkdir -p .../output) while the criteria check root paths.

Root cause: the prompt said Save: metadata.json, ... with no location, so the agent invented output/. Prompt-ambiguity/test issue, not a skill issue.

All four create smoke tasks shared the ambiguity (none forbade a subdir; slack's "to cwd" wouldn't have helped — output/ was under cwd). Fixed all four prompts to pin the save location to the working-directory root and explicitly forbid a subdirectory. Prompts only — no skill or success_criteria changes.
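The failure mode is easy to reproduce in isolation. The sketch below assumes a file-existence criterion resolves its `path` relative to the sandbox root; `file_exists` is a hypothetical helper, not the coder_eval implementation.

```python
import tempfile
from pathlib import Path


def file_exists(sandbox_root: Path, relative_path: str) -> bool:
    """Hypothetical file-existence check: the path is resolved against the sandbox root."""
    return (sandbox_root / relative_path).is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # The agent's behaviour in the failing run: create output/ and save there.
    (root / "output").mkdir()
    (root / "output" / "metadata.json").write_text("{}", encoding="utf-8")

    found_at_root = file_exists(root, "metadata.json")           # what the criterion checks
    found_in_subdir = file_exists(root, "output/metadata.json")  # where the file really is
```

The artifact exists under the working directory, yet the root-relative check cannot see it, which is why a "to cwd" instruction alone would not have prevented the miss.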

github-actions · 2026-07-02T14:31:56Z

Claude finished @cosmyo's task in 2m 8s

PR Review: test(uipath-mcp-servers): grade e2e-allkinds is-activity by outcome, not Read-tool

Read review criteria and project conventions
Read full diff
Read full changed file for context
Check PR conversation for existing feedback
Verify YAML structure and success criteria
Post review findings

Summary

Replaces two command_executed / tool_name: Read criteria in the e2e-uipath-allkinds task with run_command checks that verify actual tool state on the live tenant. This fixes a flake where the agent opened references via cat/grep instead of the Read tool, causing a false failure on an otherwise-correct run.

Change-by-Change Review

1. `tests/tasks/uipath-mcp-servers/e2e-uipath-allkinds/task.yaml`

Severity: OK — well-executed change.

What changed:

Two command_executed criteria (lines 99–112 old) that checked for tool_name: Read with command_pattern matching is-activity-workflow.md and integration-service/(resources|reference-resolution).md are replaced by two run_command criteria that verify behavioral outcomes on the live tenant.
The Gmail check (line 105–111) verifies the tool has Metadata.connector.key == "uipath-google-gmail" and object.method == "GET".
The Jira check (line 119–125) verifies the tool's InputSchema.properties has fields.summary and fields.description (proving the -f cascade resolved).
Both use jq -e (exit 1 on false/null) with expected_exit_code: 0, timeout: 60, weight 1.5 — consistent with the existing run_command pattern at line 149.
Criterion count stays at 13; weights unchanged at 1.5 each.
Comment block updated to explain the grading philosophy.

Verification:

YAML is valid (confirmed via grep-based count: 13 - type: entries).
No tool_name: Read criteria remain (only in explanatory comments at lines 103, 116).
The run_command structure matches the sibling pattern in e2e-platform-server/task.yaml (same fields: command, timeout, expected_exit_code).
jq programs look correct: fromjson on the string-typed Metadata/InputSchema fields, then field access and boolean assertions via -e.
The $(jq -r .slug report.json) subshell is consistent with the existing check at line 151.

One minor note: The comments at lines 99–104 and 113–118 are longer than strictly needed (6 lines each explaining the old proxies). The repo's token-optimization rules say to strip words that don't carry information — the "replaces a former…" sentences are changelog context that belongs in the PR description (which already has it) rather than the YAML. This is cosmetic; not flagging as an issue since comments inside test YAMLs don't burn agent tokens at runtime.

What's Missing

Nothing identified.

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | N/A | No skill files changed |
| E2E Tests | OK | Criteria correctly grade behavior over mechanism |
| Skill Body | N/A | No skill files changed |
| References & Assets | N/A | No references changed |
| Repo Hygiene | OK | Scoped change, no secrets, no unrelated modifications |

Issues for Manual Review

The jq programs were verified against captured output from the failing run (per the PR description). Worth confirming on a fresh nightly run that the Metadata and InputSchema JSON shapes haven't changed since that capture.

Conclusion

Clean change that aligns the test with the repo's "Grade Behavior, Not Self-Reports" principle. The new run_command criteria are structurally consistent with existing patterns, the jq assertions are sound, and criterion count/weights are preserved. Approve.

github-actions · 2026-07-02T14:32:01Z

Claude finished @cosmyo's task in 3m 6s

Coder-eval task lint (advisory)

1 task YAML changed; verdict counts: 0 Critical, 0 High, 0 Medium, 0 Low, 1 OK.

Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.

Evidence of passing run

⚠️ Borderline: the PR body does not explicitly claim the modified task was re-run and passed end-to-end, but it does state: "Both jq programs were verified against the captured live output from the failing run (each returns true, exit 0)." The author verified the new criteria against real tenant output from the run that exposed the flake (0.933 / FAILURE), demonstrating the modified task would have passed. This is semantically close to a passing-run claim, but consider adding a line such as `Ran skill-mcp-servers-e2e-uipath-allkinds and it passed.` after the next nightly to make it explicit.

Per-task lint

`tests/tasks/uipath-mcp-servers/e2e-uipath-allkinds/task.yaml` — verdict: OK

The change replaces two command_executed / tool_name: Read proxies (lines 99–125 old) with two run_command criteria that validate live tenant outcomes — Gmail tool connector/operation metadata and Jira cascade field resolution. This is a textbook improvement per the repo rule "Grade Behavior, Not Self-Reports":

Self-report anti-pattern: Clean. report.json is an operational fixture for cleanup/slug lookup, not a self-assessment.
Prompt over-specification: Clean. Prompt supplies ground-truth anchors (entity names, connection IDs, folder keys) required for the live tenant — no procedure leakage.
Meaningful coverage: Strong. 13 criteria span command execution (7), skill trigger (1), live tenant validation (3), and file existence (1). The new run_command criteria with jq -e assertions on metadata fields are more meaningful than the replaced read-proxies.
Could pass for the wrong reason: Clean. run_command criteria query the live tenant — can't be faked without actually creating the tools.
Near-duplicate: Clean. Only e2e task exercising all tool kinds on a single server; neighbors test individual server types or smoke-tier mocks.
CLI verb reachability: Skipped (sandbox did not permit python3 scripts/check-cli-verbs.py). PR body reports lint-task CLI-verb check as all Info.
Redundant uip CLI in sandbox: Clean — no sandbox block; inherits from experiment.
Run-limit fields under agent: Clean — run_limits: at top level (lines 12–14), no agent: block.

Conclusion

✅ All changed tasks pass the rubric. Evidence of passing run is borderline-acceptable (verified against captured live output, not a full re-run). Consider confirming after the next nightly.

…) + add outcome checks for e2e-allkinds

Reading the mandated references is a HARD requirement for this task; the test must fail if the agent doesn't read them. The two reference-read criteria enforce that (min_count 1, pass_threshold 1.0, weight 1.5): any sub-threshold criterion marks the task FAILURE (the original run failed at 0.933 for exactly this reason).

The bug was only in HOW the read was detected: `tool_name: Read` matched the Read tool only, so a legitimate `cat .../resources.md` in Bash scored 0 and failed the task even though the reference WAS read (confirmed in the real transcript).

Root cause in coder_eval/criteria/command_executed.py: `tool_name` filters to that tool only when set; the pattern is matched against the Read tool's params JSON (`{"file_path": ...}`) OR a Bash command string. Dropping the `tool_name` pin therefore requires the read while accepting Read tool OR cat/grep/head.

Changes (criteria 13 -> 15):
- Keep BOTH reference reads as required hard gates, drop `tool_name: Read` so they don't flake on cat/grep vs Read. Verified against the real transcript: is-activity-workflow.md matched via Read, resources.md via cat.
- Add two live-tenant OUTCOME checks (the 6-tool count can't verify these):
  * Gmail get-labels authored for the right connector/operation (Metadata.connector.key = uipath-google-gmail, object method GET)
  * Jira curated tool resolved its `-f` cascade to expose fields.summary + fields.description as runtime inputs

Both jq programs verified green against the captured live output.

The reference reads (process) and the tool outcomes are complementary: the reads prove the workflow was consulted, the outcomes prove it was applied correctly. Neither replaces the other.

Note: the `tool_name: Read` reference-read pattern is used in ~10 other task files (uipath-review, uipath-api-workflow) with the same latent flake; follow-up candidate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j

…wd root (no output/ subdir)

skill-mcp-servers-outlook-create scored 0.000: all 4 criteria failed with "File 'metadata.json' does not exist". The agent actually did the task correctly (right connector/resource, baked UTC timezone, dry-run succeeded) and produced every artifact, but wrote them to a self-created `output/` subdirectory (`mkdir -p .../output`), while the criteria check root-relative paths (`path: metadata.json`).

Root cause: the prompt said "Save: metadata.json, ..." with no location, so the agent invented `output/`. This is a prompt-ambiguity/test issue, not a skill issue: artifact-saving is a test-grading construct, and the skill's tool-building guidance was followed correctly.

All four create smoke tasks shared the same latent ambiguity (none forbade a subdirectory; "to cwd" wouldn't have helped since `output/` was under cwd). Make the save location explicit and forbid subdirectories in all four so the same flake can't recur:
- outlook-create, gmail-create, slack-create: "Save these files in the current working directory (the sandbox root); do NOT create a subdirectory such as `output/`: ..."
- jira-create: added the same no-subdir clause to its "Save to the sandbox cwd:" header.

No skill or success_criteria changes; prompts only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j

…ence for curated activities

A live e2e-allkinds run (2026-06-19) built the Jira curated_create_issue tool correctly (summary+description resolved) but NEVER read integration-service/resources.md; it worked the -f cascade from prior knowledge of Jira's fields. The e2e test requires reading that reference (the read is imperative for correctness on connectors the agent does NOT already know), so the run failed the required-read gate at 0.933.

Root cause in the skill: is-activity-workflow.md pointed at resources.md via a soft "first read" table row, but Critical Rule 3 inlined the Jira cascade specifics (omit --action, project.key + issuetype.id), giving a knowledgeable agent enough to skip the reference. The skill undercut its own read mandate.

Strengthen the read into a hard precondition for curated/cascade activities (the exact shortcut the agent took):
- Rule 3: "MUST read resources.md §Parent-Field-Driven Custom Fields before the first -f cascade re-run; knowing a connector's cascade fields from memory (e.g. Jira project.key + issuetype.id) does NOT exempt you; -f/--operation/--action rules and shapes vary per connector."
- "read by action" table: mark the cascade row mandatory incl. curated activities you "know".
- Workflow step 3 comment: [READ FIRST, mandatory even if you know the fields], matching the existing [READ: ...] marker style.

Pairs with the e2e-allkinds test's required-read gate (kept as a hard gate, now mechanism-agnostic): the test enforces the read, the skill makes the agent actually do it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j

…ck-fixture exploration

skill-mcp-servers-gmail-create hit MAX_TURNS_EXHAUSTED (score 0.000). The agent actually COMPLETED the task (it wrote all six artifacts and marked its todo done at tool call 43) but needed 45 turns (62 assistant turns) against max_turns 40, so the run was cut off (and, pre-fix, wrote to `output/` too, which is addressed separately).

Turn budget went to exploration, not the real work (~16 of 44 calls):
- The Skill tool failed to load (`Execute skill: uipath:uipath-mcp-servers` errored), so the agent ran `find /` twice to locate the reference.
- It read all 13 mock fixture files under `mocks/responses/` individually to understand the sandbox before acting (~15 turns). A real run has no such files; the offline harness leaves them visible and the agent explores them.
- The reference reads the skill mandates for is-activity authoring legitimately add turns. 40 was also inconsistent with slack-create (60) for the same task shape.

Fixes (test setup):
- Raise max_turns 40 -> 60 on gmail/jira/outlook-create (matches slack).
- Add to all four create prompts: the mocked `uip` returns realistic responses, so run `uip` directly; do NOT read/inspect the `mocks/` fixtures. This cuts the ~15-turn exploration that caused the overrun.

Not a skill-content problem: skill length doesn't change turn count (one Read per reference); the overrun was environment exploration + a too-low limit. The Skill-tool load failure is a harness/sandbox limitation noted for follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j

…t cause of skill-load failure)

ROOT CAUSE of the gmail-create max-turns exhaustion (and the mock-run disorientation generally): all 7 uipath-mcp-servers smoke/mock tasks override `agent.allowed_tools: [Bash, Read, Write]`, which OMITS the `Skill` tool. So the agent's `Skill(uipath:uipath-mcp-servers)` call is permission-denied: it cannot load the skill it is meant to test.

Evidence:
- Run config shows allowed_tools = [Bash, Read, Write]; the Skill call returns result_status=error ("Execute skill: ...") in all 3 tempdir/mock runs, vs "Launching skill: ..." success in both e2e/docker runs (which do NOT override allowed_tools and inherit Skill from the experiment).
- After the denial the agent falls back to `find /` for the reference file and reads the mock fixtures to orient, burning turns. gmail did two whole-filesystem finds + 12 mock reads and overran max_turns 40.
- PR #1518 removed the "load the skill" directive from prompts to test auto-triggering, which ASSUMES the Skill tool is available. With Skill stripped, the agent can't auto-trigger even when it correctly identifies the skill.
- The repo's own correct pattern: uipath-platform traces/licensing tasks override with `[Skill, Bash, Read, Write]`. The mcp-servers tasks simply forgot Skill.

Fix: add `Skill` to allowed_tools on all 7 tasks (gmail, jira, jira-update-fix-lookups, outlook, remote, resource, slack). With the Skill tool available the agent loads the skill in one call and no longer hunts the filesystem, which is the actual fix; the earlier max_turns bump and mock-inspection guidance were mitigations for this root cause.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
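A guard against this class of omission could look roughly like the sketch below. It assumes task configs have already been parsed into dictionaries with an optional `agent.allowed_tools` list; the `overrides_without_skill` helper and the sample configs are hypothetical, and tasks that do not override the list are treated as inheriting `Skill` from the experiment, as the commit message describes.

```python
def overrides_without_skill(task: dict) -> bool:
    """True when a task overrides allowed_tools but leaves out the Skill tool."""
    allowed = task.get("agent", {}).get("allowed_tools")
    # No override means the task inherits its tools (including Skill) from the experiment.
    return allowed is not None and "Skill" not in allowed


# Hypothetical parsed task configs keyed by task name.
tasks = {
    "gmail-create": {"agent": {"allowed_tools": ["Bash", "Read", "Write"]}},
    "traces-example": {"agent": {"allowed_tools": ["Skill", "Bash", "Read", "Write"]}},
    "e2e-uipath-allkinds": {},  # no agent block: inherits from the experiment
}

flagged = sorted(name for name, cfg in tasks.items() if overrides_without_skill(cfg))
```

Only the override that drops `Skill` is flagged; the inheriting task and the correctly overridden one pass.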

cosmyo requested a review from a team as a code owner July 2, 2026 14:31

cosmyo force-pushed the fix/mcp-allkinds-behavioral-grading branch from d8dcaec to 356e8e3 Compare July 2, 2026 14:38

cosmyo force-pushed the fix/mcp-allkinds-behavioral-grading branch from 356e8e3 to a92f076 Compare July 2, 2026 14:41

cosmyo changed the title ~~test(uipath-mcp-servers): grade e2e-allkinds is-activity by outcome, not Read-tool~~ test(uipath-mcp-servers): fix e2e-allkinds grading flake + create-task save-location flake Jul 2, 2026

cosmyo and others added 3 commits July 2, 2026 18:06

cosmyo wants to merge 5 commits into main from fix/mcp-allkinds-behavioral-grading
