test(uipath-mcp-servers): fix e2e-allkinds grading flake + create-task save-location flake #1826
cosmyo wants to merge 5 commits into
Claude finished @cosmyo's task in 2m 8s
PR Review: test(uipath-mcp-servers): grade e2e-allkinds is-activity by outcome, not Read-tool
SummaryReplaces two Change-by-Change Review1.
| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | No skill files changed |
| E2E Tests | OK | Criteria correctly grade behavior over mechanism |
| Skill Body | N/A | No skill files changed |
| References & Assets | N/A | No references changed |
| Repo Hygiene | OK | Scoped change, no secrets, no unrelated modifications |
Issues for Manual Review
- The jq programs were verified against captured output from the failing run (per the PR description). Worth confirming on a fresh nightly run that the `Metadata` and `InputSchema` JSON shapes haven't changed since that capture.
Conclusion
Clean change that aligns the test with the repo's "Grade Behavior, Not Self-Reports" principle. The new run_command criteria are structurally consistent with existing patterns, the jq assertions are sound, and criterion count/weights are preserved. Approve.
Claude finished @cosmyo's task in 3m 6s
Coder-eval task lint (advisory)
1 task YAML changed; verdict counts: 0 Critical, 0 High, 0 Medium, 0 Low, 1 OK. Rubric: .claude/commands/lint-task.md. This check is advisory and never blocks merge.
Evidence of passing run
Per-task lint
d8dcaec to 356e8e3
…) + add outcome checks for e2e-allkinds
Reading the mandated references is a HARD requirement for this task; the
test must fail if the agent doesn't read them. The two reference-read
criteria enforce that (min_count 1, pass_threshold 1.0, weight 1.5) — any
sub-threshold criterion marks the task FAILURE (the original run failed at
0.933 for exactly this reason).
The bug was only in HOW the read was detected: `tool_name: Read` matched
the Read tool only, so a legitimate `cat .../resources.md` in Bash scored
0 and failed the task even though the reference WAS read (confirmed in the
real transcript).
Root cause in coder_eval/criteria/command_executed.py: `tool_name` filters
to that tool only when set; the pattern is matched against the Read tool's
params JSON (`{"file_path": ...}`) OR a Bash command string. Dropping the
`tool_name` pin therefore requires the read while accepting Read tool OR
cat/grep/head.
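The filtering behaviour described above can be illustrated with a minimal sketch. This is not the actual `coder_eval` source: `ToolCall`, `searchable_text`, and `count_matches` are hypothetical names, and the `/skill/...` paths are invented for the example because the real transcript paths are elided.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    tool_name: str  # e.g. "Read" or "Bash"
    params: dict    # e.g. {"file_path": ...} or {"command": ...}


def searchable_text(call: ToolCall) -> str:
    """Bash calls are matched on the command string, other tools on params JSON."""
    if call.tool_name == "Bash":
        return call.params.get("command", "")
    return json.dumps(call.params)


def count_matches(calls, pattern: str, tool_name: Optional[str] = None) -> int:
    """Count calls matching `pattern`; restrict to one tool only when tool_name is set."""
    regex = re.compile(pattern)
    return sum(
        1
        for call in calls
        if (tool_name is None or call.tool_name == tool_name)
        and regex.search(searchable_text(call))
    )


# Hypothetical transcript: one reference opened with Read, the other with cat.
transcript = [
    ToolCall("Read", {"file_path": "/skill/references/is-activity-workflow.md"}),
    ToolCall("Bash", {"command": "cat /skill/references/integration-service/resources.md"}),
]

pattern = r"integration-service/resources\.md"
pinned = count_matches(transcript, pattern, tool_name="Read")  # Read calls only
unpinned = count_matches(transcript, pattern)                  # any tool
print(pinned, unpinned)
```

With `min_count: 1`, the pinned variant fails on this transcript while the unpinned one passes, which is the flake in miniature.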
Changes (criteria 13 -> 15):
- Keep BOTH reference reads as required hard gates, drop `tool_name: Read`
so they don't flake on cat/grep vs Read. Verified against the real
transcript: is-activity-workflow.md matched via Read, resources.md via cat.
- Add two live-tenant OUTCOME checks (the 6-tool count can't verify these):
* Gmail get-labels authored for the right connector/operation
(Metadata.connector.key = uipath-google-gmail, object method GET)
* Jira curated tool resolved its `-f` cascade to expose
fields.summary + fields.description as runtime inputs
Both jq programs verified green against the captured live output.
The reference reads (process) and the tool outcomes are complementary —
the reads prove the workflow was consulted, the outcomes prove it was
applied correctly. Neither replaces the other.
Note: the `tool_name: Read` reference-read pattern is used in ~10 other
task files (uipath-review, uipath-api-workflow) with the same latent flake
— follow-up candidate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
356e8e3 to a92f076
…wd root (no output/ subdir)
skill-mcp-servers-outlook-create scored 0.000: all 4 criteria failed with
"File 'metadata.json' does not exist". The agent actually did the task
correctly (right connector/resource, baked UTC timezone, dry-run
succeeded) and produced every artifact, but wrote them to a self-created
`output/` subdirectory (`mkdir -p .../output`), while the criteria check
root-relative paths (`path: metadata.json`).
Root cause: the prompt said "Save: metadata.json, ..." with no location,
so the agent invented `output/`. This is a prompt-ambiguity/test issue,
not a skill issue: artifact-saving is a test-grading construct, and the
skill's tool-building guidance was followed correctly.
All four create smoke tasks shared the same latent ambiguity (none forbade
a subdirectory; "to cwd" wouldn't have helped since `output/` was under
cwd). Make the save location explicit and forbid subdirectories in all
four so the same flake can't recur:
- outlook-create, gmail-create, slack-create: "Save these files in the
  current working directory (the sandbox root) — do NOT create a
  subdirectory such as `output/`: ..."
- jira-create: added the same no-subdir clause to its "Save to the sandbox
  cwd:" header.
No skill or success_criteria changes; prompts only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
…ence for curated activities
A live e2e-allkinds run (2026-06-19) built the Jira curated_create_issue
tool correctly (summary+description resolved) but NEVER read
integration-service/resources.md; it worked the -f cascade from prior
knowledge of Jira's fields. The e2e test requires reading that reference
(the read is imperative for correctness on connectors the agent does NOT
already know), so the run failed the required-read gate at 0.933.
Root cause in the skill: is-activity-workflow.md pointed at resources.md
via a soft "first read" table row, but Critical Rule 3 inlined the Jira
cascade specifics (omit --action, project.key + issuetype.id), giving a
knowledgeable agent enough to skip the reference. The skill undercut its
own read mandate.
Strengthen the read into a hard precondition for curated/cascade
activities (the exact shortcut the agent took):
- Rule 3: "MUST read resources.md §Parent-Field-Driven Custom Fields
  before the first -f cascade re-run — knowing a connector's cascade
  fields from memory (e.g. Jira project.key + issuetype.id) does NOT
  exempt you; -f/--operation/--action rules and shapes vary per
  connector."
- "read by action" table: mark the cascade row mandatory incl. curated
  activities you "know".
- Workflow step 3 comment: [READ FIRST, mandatory even if you know the
  fields], matching the existing [READ: ...] marker style.
Pairs with the e2e-allkinds test's required-read gate (kept as a hard
gate, now mechanism-agnostic): the test enforces the read, the skill
makes the agent actually do it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
…ck-fixture exploration
skill-mcp-servers-gmail-create hit MAX_TURNS_EXHAUSTED (score 0.000). The
agent actually COMPLETED the task (it wrote all six artifacts and marked
its todo done at tool call 43) but needed 45 turns (62 assistant turns)
against max_turns 40, so the run was cut off (and, pre-fix, wrote to
`output/` too, which is addressed separately).
Turn budget went to exploration, not the real work (~16 of 44 calls):
- The Skill tool failed to load (`Execute skill: uipath:uipath-mcp-servers`
  errored), so the agent ran `find /` twice to locate the reference.
- It read all 13 mock fixture files under `mocks/responses/` individually
  to understand the sandbox before acting (~15 turns). A real run has no
  such files; the offline harness leaves them visible and the agent
  explores them.
- The reference reads the skill mandates for is-activity authoring
  legitimately add turns.
40 was also inconsistent with slack-create (60) for the same task shape.
Fixes (test setup):
- Raise max_turns 40 -> 60 on gmail/jira/outlook-create (matches slack).
- Add to all four create prompts a note that the mocked `uip` returns
  realistic responses (run `uip` directly; do NOT read/inspect the
  `mocks/` fixtures), to cut the ~15-turn exploration that caused the
  overrun.
Not a skill-content problem: skill length doesn't change turn count (one
Read per reference); the overrun was environment exploration + a too-low
limit. The Skill-tool load failure is a harness/sandbox limitation noted
for follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
…t cause of skill-load failure)
ROOT CAUSE of the gmail-create max-turns exhaustion (and the mock-run
disorientation generally): all 7 uipath-mcp-servers smoke/mock tasks
override `agent.allowed_tools: [Bash, Read, Write]`, which OMITS the
`Skill` tool. So the agent's `Skill(uipath:uipath-mcp-servers)` call is
permission-denied — it cannot load the skill it is meant to test.
Evidence:
- Run config shows allowed_tools = [Bash, Read, Write]; the Skill call
returns result_status=error ("Execute skill: ...") in all 3 tempdir/mock
runs, vs "Launching skill: ..." success in both e2e/docker runs (which do
NOT override allowed_tools and inherit Skill from the experiment).
- After the denial the agent falls back to `find /` for the reference file
and reads the mock fixtures to orient — burning turns. gmail did two
whole-filesystem finds + 12 mock reads and overran max_turns 40.
- PR #1518 removed the "load the skill" directive from prompts to test
auto-triggering — which ASSUMES the Skill tool is available. With Skill
stripped, the agent can't auto-trigger even when it correctly identifies
the skill.
- The repo's own correct pattern: uipath-platform traces/licensing tasks
override with `[Skill, Bash, Read, Write]`. The mcp-servers tasks simply
forgot Skill.
Fix: add `Skill` to allowed_tools on all 7 tasks (gmail, jira,
jira-update-fix-lookups, outlook, remote, resource, slack). With the Skill
tool available the agent loads the skill in one call and no longer hunts
the filesystem — which is the actual fix; the earlier max_turns bump and
mock-inspection guidance were mitigations for this root cause.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Uu9ToFK6t5JCFP3g4kyt9j
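The omission described in the commit above could be caught mechanically. The following is a minimal sketch, assuming each task YAML has already been parsed into a dict with an optional `agent.allowed_tools` list; the function name, the `fixed-example` entry, and the dict layout around `agent` are hypothetical, not the harness's actual data model.

```python
from typing import Any


def overrides_without_skill(task: dict[str, Any]) -> bool:
    """True when a task overrides allowed_tools and the override omits Skill."""
    allowed = task.get("agent", {}).get("allowed_tools")
    # No override means the task inherits the experiment's tools, Skill included.
    return allowed is not None and "Skill" not in allowed


# Illustrative task configs only; real task files carry many more fields.
tasks = {
    "skill-mcp-servers-gmail-create": {
        "agent": {"allowed_tools": ["Bash", "Read", "Write"]}
    },
    "skill-mcp-servers-e2e-uipath-allkinds": {"agent": {}},  # no override
    "fixed-example": {
        "agent": {"allowed_tools": ["Skill", "Bash", "Read", "Write"]}
    },
}

flagged = sorted(name for name, task in tasks.items() if overrides_without_skill(task))
print(flagged)
```

Only the task with the `[Bash, Read, Write]` override is flagged; a task with no override is left alone because it inherits `Skill`.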
Problem
The `skill-mcp-servers-e2e-uipath-allkinds` task scored 0.933 / FAILURE on an otherwise-successful run. All 6 tools landed on the live tenant and the Jira curated activity was authored correctly, but one criterion failed.
The agent did load `integration-service/resources.md` (for the Jira cascade), via `cat` in a Bash command rather than the Read tool.
The criterion graded the tool mechanism (which tool opened the file), not whether the reference was consulted. Both `tool_name: Read` criteria in this task had this flaw.
Root cause (confirmed in coder_eval source)
In `coder_eval/criteria/command_executed.py`, `tool_name` filters to that tool only when it is set. So omitting `tool_name` matches the pattern against every tool: the Read tool's `file_path` and a Bash `cat`/`grep`/`head` command string. The filename can stay in the check; it just shouldn't be pinned to the Read tool.
Fix (criteria 13 → 15)
- Both reference-read criteria stay as required hard gates but drop `tool_name: Read`; they now match whether the agent used Read or `cat`/`grep`/`head`. Verified against the real transcript: `is-activity-workflow.md` matched via the Read tool, `integration-service/resources.md` matched via the Bash `cat`.
- Added two live-tenant outcome checks:
  - Gmail get-labels authored for the right connector/operation (`Metadata.connector.key == uipath-google-gmail`, object `method == GET`)
  - Jira curated tool resolved its `-f` cascade: `InputSchema.properties` has `fields.summary` and `fields.description`
Both jq programs verified green (exit 0) against the captured live `mcp-tools list` output from the failing run.
Result: the check is now both robust (no dependency on which tool opens a file) and more accurate (grades actual tool correctness, not just that a file was opened).
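For readers who want to see what the two outcome checks assert, here is a rough Python equivalent. The jq programs themselves are not reproduced here, and the JSON shape is an assumption: the nesting under `Metadata` (in particular `object.method`), the dotted property keys, and the sample tool dicts are hypothetical stand-ins for the real `mcp-tools list` output.

```python
def gmail_get_labels_ok(tool: dict) -> bool:
    """Right connector and a GET object method (assumed Metadata layout)."""
    meta = tool.get("Metadata", {})
    return (
        meta.get("connector", {}).get("key") == "uipath-google-gmail"
        and meta.get("object", {}).get("method") == "GET"
    )


def jira_cascade_resolved(tool: dict) -> bool:
    """Both cascade-resolved fields are exposed as runtime inputs."""
    props = tool.get("InputSchema", {}).get("properties", {})
    return {"fields.summary", "fields.description"}.issubset(props)


# Hypothetical tool entries, shaped only to exercise the two checks.
gmail_tool = {
    "Metadata": {
        "connector": {"key": "uipath-google-gmail"},
        "object": {"method": "GET"},
    }
}
jira_tool = {
    "InputSchema": {
        "properties": {
            "fields.summary": {"type": "string"},
            "fields.description": {"type": "string"},
        }
    }
}
jira_unresolved = {"InputSchema": {"properties": {"fields.summary": {"type": "string"}}}}

print(gmail_get_labels_ok(gmail_tool), jira_cascade_resolved(jira_tool))
```

Unlike the reference-read criteria, these checks look only at what landed on the tenant, so they are indifferent to how the agent got there.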
Validation
- lint-task: OK; CLI-verb check: 0 High/Medium (all Info)
Follow-up (not in this PR)
The `tool_name: Read` reference-load pattern is used in ~10 other task files (`uipath-review`, `uipath-api-workflow`) and carries the same latent flake. Candidate for a repo-wide sweep to drop the `tool_name` pin on file-load criteria.
🤖 Generated with Claude Code
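The suggested sweep could start from a simple text scan. This is a sketch under assumptions: it takes task file contents as strings rather than walking the repo, the file names and the surrounding YAML keys in the samples are invented, and it flags every `tool_name: Read` line without checking whether the criterion is a file-load check.

```python
import re

# Matches a YAML line that pins a criterion to the Read tool.
PINNED_READ = re.compile(r"^[ \t]*tool_name:[ \t]*Read[ \t]*$", re.MULTILINE)


def find_pinned_reads(task_sources: dict[str, str]) -> dict[str, int]:
    """Map each task file name to its number of `tool_name: Read` pins."""
    hits = {}
    for name, text in task_sources.items():
        count = len(PINNED_READ.findall(text))
        if count:
            hits[name] = count
    return hits


# Hypothetical task file contents.
samples = {
    "tasks/pinned.yaml": (
        "success_criteria:\n"
        "  - type: command_executed\n"
        "    tool_name: Read\n"
        "    min_count: 1\n"
    ),
    "tasks/unpinned.yaml": (
        "success_criteria:\n"
        "  - type: command_executed\n"
        "    min_count: 1\n"
    ),
}

hits = find_pinned_reads(samples)
print(hits)
```

Each hit would still need a manual look, since some criteria may pin the Read tool on purpose.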
Commit 2 — create-task artifact save location (outlook/gmail/slack/jira)
`skill-mcp-servers-outlook-create` scored 0.000 (0/4) with "File 'metadata.json' does not exist", yet the agent built the Outlook tool correctly (right connector/resource, baked UTC timezone, dry-run OK) and produced every artifact. It just wrote them to a self-created `output/` subdirectory (`mkdir -p .../output`) while the criteria check root paths.
Root cause: the prompt said `Save: metadata.json, ...` with no location, so the agent invented `output/`. Prompt-ambiguity/test issue, not a skill issue.
All four create smoke tasks shared the ambiguity (none forbade a subdir; slack's "to cwd" wouldn't have helped, since `output/` was under cwd). Fixed all four prompts to pin the save location to the working-directory root and explicitly forbid a subdirectory. Prompts only: no skill or `success_criteria` changes.
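The mismatch between where the agent wrote and where the criteria looked is easy to reproduce in isolation. The sketch below assumes a file-exists criterion resolves `path` against the sandbox root; `file_exists_criterion` is a hypothetical stand-in, not the harness's implementation.

```python
import tempfile
from pathlib import Path


def file_exists_criterion(sandbox_root: Path, relative_path: str) -> bool:
    """Pass only if the file exists at the given path relative to the sandbox root."""
    return (sandbox_root / relative_path).is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # What the agent did: create output/ and save the artifact there.
    (root / "output").mkdir()
    (root / "output" / "metadata.json").write_text("{}")
    in_subdir = file_exists_criterion(root, "metadata.json")

    # What the criteria expect: the artifact at the sandbox root.
    (root / "metadata.json").write_text("{}")
    at_root = file_exists_criterion(root, "metadata.json")

print(in_subdir, at_root)
```

The artifact exists in both cases; only the root-level copy satisfies a root-relative `path: metadata.json`, which is why the prompt now pins the location.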