test(uipath-test): coverage for testcases steps + execution rerun/wait/report [TMHUB-32229] #1761
ganeshborle wants to merge 1 commit into
|
Claude finished @ganeshborle's task in 3m 7s
Coder-eval task lint (advisory)
4 task YAMLs changed; verdict counts: 0 Critical, 0 High, 0 Medium, 4 Low, 0 OK. Rubric:
Evidence of passing run
✅ Author confirms passing run: "All four run green locally via coder-eval against the alpha tenant: testcase_steps_authoring_smoke: 4/4 (1.000) … release_signoff_wait_report_e2e: 3/3 (1.000)"
Per-task lint
|
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ee20733c4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Thanks — re: Theme 1 (output-content assertions). Acknowledged; keeping the
Net: the advisory is well-taken, but a deterministic output assertion would be either fragile or self-report-based here. Happy to revisit if there's a preferred pattern for live, self-cleaning TM tasks. |
|
@claude please review this PR — prior automated review run was cancelled by the workflow's concurrency group (not by findings). All other checks (smoke/T1, lint, schema) are green. |
|
Claude finished @ganeshborle's task in 5m 10s
PR Review: test(uipath-test) coverage for testcases steps + execution rerun/wait/report
Summary
Adds 4 coder-eval task YAMLs covering the
Change-by-Change Review
1.
| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | Test tasks, not skill files |
| E2E Tests | OK | Smoke + e2e + 2 integration — full tier coverage |
| Task Structure | Issue | sandbox.driver: tempdir overrides experiment default in all 4 tasks |
| Repo Hygiene | OK | CODEOWNERS present, no secrets, no cross-skill deps |
Issues for Manual Review
- `sandbox.driver: tempdir` CI behavior: The author reports CI passed ("smoke/T1 green"), but the override contradicts the experiment's `driver: docker`. Verify whether coder-eval's task-level sandbox merge actually replaces the experiment driver, or if the experiment driver takes precedence. If the task override does win, clarify why tempdir is needed (e.g., `post_run` shell commands that require host-level access) and whether `uip` is available on the CI host outside Docker.
- `executions run` in description: `execution_rerun_failed_integration.yaml` line 6 lists `executions run` as a covered verb, but no criterion asserts it. Cosmetic, but could mislead coverage audits.
Conclusion
The tasks are well-crafted — precise regex patterns, clean verb partitioning, goal-oriented prompts, and proper self-cleaning post_run hooks. The one substantive concern is the sandbox.driver: tempdir override present in all 4 tasks, which breaks from the established convention of inheriting the driver from the experiment config. This was already flagged by the Codex review (comment). Recommend either removing the sandbox: block from all 4 tasks (to match the other 14 uipath-test tasks) or adding a comment explaining why tempdir is required.
|
Claude finished @ganeshborle's task in 3m 51s
PR Review: test(uipath-test) coverage for testcases steps + execution rerun/wait/report
Summary
Adds 4 coder-eval task YAMLs covering
Change-by-Change Review
1.
| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | Test tasks, not skill files |
| E2E Tests | OK | Full tier coverage (smoke + 2 integration + e2e), clean verb partitioning, all 4 validated green |
| Task Structure | OK | Tags, task_id, weights, prompts, post_run all follow conventions |
| Repo Hygiene | OK | CODEOWNERS present, no secrets, changes scoped to test tasks only |
Issues for Manual Review
- `sandbox.driver: tempdir`: convention break, not a functional issue. All 4 new tasks explicitly set `driver: tempdir`, while all 14 existing `uipath-test` tasks inherit from the experiment (which defaults to `driver: docker`). The prior Codex review (comment) claimed this would cause `uip: command not found`, but this is incorrect: the GH smoke runner installs `@uipath/cli` globally on the host before any task runs (`tests/README.md:25`), and `driver: tempdir` is widely used across the repo (uipath-troubleshoot, uipath-ixp, uipath-review, uipath-agents, uipath-mcp-servers, uipath-maestro-bpmn, including smoke-tagged tasks like `uipath-maestro-bpmn/smoke/registry_discovery.yaml`). The author confirms CI passed. Still, it is a divergence from the `uipath-test`-specific convention; a one-line comment explaining why tempdir is preferred here (e.g., `post_run` host access, no benefit from Docker isolation for live-tenant tasks) would prevent the same question on future PRs.
- `executions run` in description (Low). See Change 3 above. Cosmetic, but may mislead coverage audits that scrape `description` fields.
Conclusion
This PR is well-crafted. Precise regex patterns, clean verb partitioning with explicit ownership comments, goal-oriented prompts, appropriate run_limits for live-tenant latency, and proper self-cleaning post_run hooks. The only actionable finding is the minor executions run mention in the description parenthetical (Low). The sandbox.driver: tempdir override works correctly and is widely used across the repo, though a rationale comment would be helpful for consistency within the uipath-test task set.
Recommendation: approve. The Low finding is non-blocking.
|
Addressed the review findings in
Both were Low/cosmetic; logic and criteria are unchanged. Smoke (T1) + lint re-running on the new commit. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb5101f3d8
Force-pushed from eb5101f to ac03f8d
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac03f8d944
Force-pushed from ac03f8d to 990a7d0
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 990a7d01de
Force-pushed from 990a7d0 to 1117c43
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1117c4386f
Force-pushed from 1117c43 to 6228db9
|
Pushed |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6228db9d3d
Force-pushed from 6228db9 to 02e2691
|
Scope correction ( |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02e2691b20
Force-pushed from 02e2691 to 2a66f9e
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a66f9efc2
…t/report [TMHUB-32229]

Add 3 coder-eval tasks covering uip tm commands with no prior coverage:
- testcase_steps_lifecycle_integration (integration): full testcases steps lifecycle in one session: flag-mode add (one at a time) + JSON-mode add (one shot) + list + get + update + move + delete + case cleanup
- execution_rerun_failed_integration (integration): executions get-stats / retry
- release_signoff_wait_report_e2e (e2e): testsets run -> tm wait -> report get

Every uip-command criterion enforces --output json (Critical Rule #3) and project-scoping per SKILL.md signatures; graded cleanup deletes seeded cases so no tenant junk is left. Prompts read like a real user, no skill-load hint. Validated green via coder-eval (steps 10/10, rerun 3/3, sign-off 3/3).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Force-pushed from 2a66f9e to 85f5d0a
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 85f5d0a942
- type: command_executed
  description: "Agent added a step individually via flag mode (`steps add ... --description`)"
  tool_name: "Bash"
  command_pattern: '(?=[\s\S]*--project-key\s+\S)(?=[\s\S]*--test-case-id\s+\S)(?=[\s\S]*--description\s+\S)(?=[\s\S]*--output\s+json)uip\s+tm\s+testcases?\s+steps\s+add\b'
Keep flag checks on the target command
When the agent batches multiple uip calls in one Bash block, the [\s\S]* lookaheads in this pattern can be satisfied by flags on later commands before the regex matches a bare steps add; for example, an add missing --project-key/--output json followed by a scoped steps list still passes. The same structure is used throughout the new command criteria, so these evals can accept the unscoped or non-JSON calls they are intended to catch; constrain the flag checks to the same shell command/line or use order-insensitive alternatives that stop at command separators.
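The false positive described above can be reproduced with a small Python sketch. The pattern is copied from the criterion under review; the project key `DEMO`, the test case id `TC-1`, and the `STRICT` variant are assumptions made for illustration only, and the sketch assumes the grader uses search semantics similar to Python's `re.search` over the whole Bash block.

```python
import re

# Pattern copied verbatim from the criterion under review.
ORIGINAL = re.compile(
    r"(?=[\s\S]*--project-key\s+\S)"
    r"(?=[\s\S]*--test-case-id\s+\S)"
    r"(?=[\s\S]*--description\s+\S)"
    r"(?=[\s\S]*--output\s+json)"
    r"uip\s+tm\s+testcases?\s+steps\s+add\b"
)

# Hypothetical tightening (not the author's fix): each lookahead stops at a
# newline or a shell separator, so it can only see flags on the same command.
SAME_COMMAND = r"[^\n;&|]*"
STRICT = re.compile(
    rf"(?={SAME_COMMAND}--project-key\s+\S)"
    rf"(?={SAME_COMMAND}--test-case-id\s+\S)"
    rf"(?={SAME_COMMAND}--description\s+\S)"
    rf"(?={SAME_COMMAND}--output\s+json)"
    r"uip\s+tm\s+testcases?\s+steps\s+add\b"
)

# One Bash block: the add is missing --project-key and --output json,
# but the later list command carries both flags.
batched = (
    'uip tm testcases steps add --test-case-id TC-1 --description "Open the payments screen"\n'
    "uip tm testcases steps list --project-key DEMO --test-case-id TC-1 --output json\n"
)

# A correctly scoped add on a single line.
scoped = (
    "uip tm testcases steps add --project-key DEMO --test-case-id TC-1 "
    '--description "Open the payments screen" --output json'
)

for name, pattern in (("original", ORIGINAL), ("strict", STRICT)):
    print(name, "batched:", bool(pattern.search(batched)), "scoped:", bool(pattern.search(scoped)))
```

The character class in `SAME_COMMAND` is deliberately crude: a `;`, `&`, or `|` inside a quoted `--description` value would cut the lookahead short, so a production criterion would likely need a more careful notion of command boundaries.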
- Move "Enter valid card details" so it runs right after "Open the payments screen".
Make the move step change the order
At this point the initial adds already leave the steps ordered as Open the payments screen followed by Enter valid card details, so asking to move that step right after Open is a no-op. A passing run can issue a move that does not change the list and still get full credit, which means this lifecycle task no longer validates real reordering or catches target-position mistakes; ask for a step to move to a different position or assert the final order.
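To make the no-op concrete, here is a minimal Python sketch that models the step list as a plain list. It does not model the actual `uip tm` move semantics; the helper `move_after` and the third step, "Submit the payment", are hypothetical and exist only for illustration.

```python
def move_after(steps, step, anchor):
    """Return a new list with `step` placed immediately after `anchor`."""
    remaining = [s for s in steps if s != step]
    idx = remaining.index(anchor) + 1
    return remaining[:idx] + [step] + remaining[idx:]


# The first two steps come from the prompt; the third is a hypothetical extra.
initial = [
    "Open the payments screen",
    "Enter valid card details",
    "Submit the payment",
]

# The move requested by the prompt leaves the order unchanged.
requested = move_after(initial, "Enter valid card details", "Open the payments screen")
print("requested move changes order:", requested != initial)

# A move to a different position does change the order, so a final-order
# assertion could tell a correct move from a wrong or missing one.
alternative = move_after(initial, "Enter valid card details", "Submit the payment")
print("alternative move changes order:", alternative != initial)
```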

What

Adds 4 coder-eval task YAMLs under `tests/tasks/uipath-test/` covering recently shipped `uip tm` commands that previously had zero test coverage.

Jira: TMHUB-32229 (epic CA-5: Coding Agents for Test · Sprint S198)

| Task | Commands |
|---|---|
| `testcase_steps_authoring_smoke` | `uip tm testcases steps add / list / get` |
| `testcase_steps_refine_integration` | `steps update / move / delete` |
| `execution_rerun_failed_integration` | `executions get-stats / retry` |
| `release_signoff_wait_report_e2e` | `uip tm wait / report get` |

The `uip tm testcases steps` subgroup shipped in `cli/main` via PR #2625 (so the `@alpha` CLI the smoke runner builds has it).

No overlap

Verified (script) against all existing `uipath-test` tasks: each new/primary command is asserted in exactly one task. Shared supporting commands (e.g. `executions list`, `testsets run`, `testcases create`) are left to their existing owners and intentionally not re-asserted here. The only command shared across tasks is `uip login status`, the universal Critical-Rule-#1 precondition present in every uipath-test task.

Validation

All four run green locally via coder-eval against the alpha tenant:
- `testcase_steps_authoring_smoke`: 4/4 (1.000)
- `testcase_steps_refine_integration`: 4/4 (1.000)
- `execution_rerun_failed_integration`: 3/3 (1.000)
- `release_signoff_wait_report_e2e`: 3/3 (1.000)

Live-TM latency headroom: the slow multi-step tasks carry explicit `run_limits` (`turn_timeout` / `task_timeout`) so they don't false-fail on tenant 504/latency under the experiment defaults.

CI

`smoke-skills.yml` runs `--tags smoke`, so `testcase_steps_authoring_smoke` executes on this PR's smoke gate. The integration/e2e tasks run via the nightly/dashboard suites (not the PR smoke gate) and were validated locally as above.

🤖 Generated with Claude Code
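The "No overlap" check in the PR description refers to a script that is not included in the PR. A minimal sketch of what such a check might look like follows, assuming the asserted commands have already been extracted per task. The function `find_overlaps` and the in-memory mapping are hypothetical; a real check would also need to read the existing `uipath-test` task YAMLs, which this sketch does not do.

```python
from collections import defaultdict

# Task -> commands it asserts, taken from the PR description's table.
asserted = {
    "testcase_steps_authoring_smoke": {
        "testcases steps add", "testcases steps list", "testcases steps get",
    },
    "testcase_steps_refine_integration": {
        "testcases steps update", "testcases steps move", "testcases steps delete",
    },
    "execution_rerun_failed_integration": {"executions get-stats", "executions retry"},
    "release_signoff_wait_report_e2e": {"wait", "report get"},
}

# Commands allowed to appear in more than one task (the login precondition).
ALLOWED_SHARED = frozenset({"login status"})


def find_overlaps(asserted_by_task, allowed_shared=frozenset()):
    """Return {command: [tasks]} for commands asserted by more than one task."""
    owners = defaultdict(list)
    for task, commands in asserted_by_task.items():
        for command in commands:
            owners[command].append(task)
    return {
        command: sorted(tasks)
        for command, tasks in owners.items()
        if len(tasks) > 1 and command not in allowed_shared
    }


overlaps = find_overlaps(asserted, ALLOWED_SHARED)
print("overlapping commands:", overlaps)
```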