Skip to content

tools: require persistent Agent-verifiable DoD evidence in implement-task skill#32

Merged
dognose24 merged 5 commits into
trunkfrom
tools/implement-task-dod-evidence
May 15, 2026
Merged

tools: require persistent Agent-verifiable DoD evidence in implement-task skill#32
dognose24 merged 5 commits into
trunkfrom
tools/implement-task-dod-evidence

Conversation

@dognose24
Copy link
Copy Markdown
Owner

Fixes RSM-3269

Proposed changes

Step 5 of /premium-analytics-implement-task (DoD-driven extra verification, e.g. local-only regression injection) shipped in RSM-3217 without a durable-evidence requirement. Result: the post-revert filesystem state is identical whether Step 5 ran clean or was silently skipped, so a reviewer looking at the resulting PR cannot tell whether the verification mechanism actually fired.

Observed in PR #30 (dashboard-pie-chart sandbox run): Step 5 actually ran correctly — injection fired, the expected spec failed on toContainText, revert restored the implementation, suite went green again — but the only evidence was in the agent's session log. PR #30 itself shows nothing about it.

This change closes the gap.

  • Step 5 sub-step 8 — append a structured outcome block to /tmp/dod-report.md for each DoD item executed: which edit was applied, which spec was expected to fail, the actual failure observation, status of other specs, and the revert + re-run summary. /tmp/ matches the RSM-3254 state-file pattern (outside .claude/, no commit pollution, no sensitive-file gate).

  • Step 8 — after git push + PR open/update, if /tmp/dod-report.md exists, post its contents to the PR as a comment under a ## DoD verification heading, then rm the buffer. Reviewers see verification results in the PR conversation thread, surviving the session.

  • HARD rules — added: "If Step 5 ran (task md DoD has Agent-verifiable items beyond build + UI verification), the PR must contain a ## DoD verification comment posted in Step 8. Missing comment when Step 5 was in scope = treat the task as failed."

Why this matters beyond one skill

The pattern generalizes: any local-only verification step the agent performs should leave a persistent trace. Without that, the verification is unverifiable from the outside — defeating its purpose. The whole point of regression injection (per RSM-3182) is to prove the suite catches regressions, not just that it doesn't false-alarm on green code; the proof has to outlive the session.

Does this pull request change what data or activity we track or use?

No.

Testing instructions

  1. After merge, run /premium-analytics-implement-task against a task md whose DoD includes a regression-injection item (e.g. projects/packages/premium-analytics/tasks/dashboard-pie-chart.md).
  2. Confirm /tmp/dod-report.md is populated during Step 5 and cleaned up after Step 8.
  3. Confirm the resulting PR has a ## DoD verification comment listing the injection outcome.
  4. (Negative test) For a task md with no DoD items beyond build + UI verification, confirm Step 5 + the Step 8 comment are skipped without the HARD rule firing.

Related

…task skill

Step 5's local-only regression injection leaves no durable trace by design — the post-revert filesystem state is identical whether Step 5 ran clean or was silently skipped. RSM-3269 / the dashboard-pie-chart sandbox run surfaced this: Step 5 actually ran correctly but PR #30 had no way to prove it; the only evidence lived in the agent's session log.

This change closes the evidence gap:

* Step 5 sub-step 8 — append a structured outcome block to `/tmp/dod-report.md` for each DoD item executed: edit applied, expected failing spec, actual failure observation, other specs status, revert + re-run summary. `/tmp/` location matches the RSM-3254 state-file pattern (outside `.claude/`, no commit pollution).

* Step 8 — after push + PR open/update, if `/tmp/dod-report.md` exists post its contents as a PR comment under the `## DoD verification` heading, then delete the buffer. Reviewers see the verification result in the PR conversation thread.

* HARD rule — Step 5 in scope without a corresponding `## DoD verification` comment = task failed. Forces agent to leave evidence; no longer optional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the Docs label May 14, 2026
@github-actions
Copy link
Copy Markdown

Thank you for your PR!

When contributing to Jetpack, we have a few suggestions that can help us test and review your patch:

  • ✅ Include a description of your PR changes.
  • 🔴 Add a "[Status]" label (In Progress, Needs Review, ...).
  • ✅ Add testing instructions.
  • ✅ Specify whether this PR includes any changes to data or privacy.
  • ✅ Add changelog entries to affected projects

This comment will be updated as you work on your PR and make changes. If you think that some of those checks are not needed for your PR, please explain why you think so. Thanks for cooperation 🤖


Follow this PR Review Process:

  1. Ensure all required checks appearing at the bottom of this PR are passing.
  2. Make sure to test your changes on all platforms that it applies to. You're responsible for the quality of the code you ship.
  3. You can use GitHub's Reviewers functionality to request a review.
  4. When it's reviewed and merged, you will be pinged in Slack to deploy the changes to WordPress.com simple once the build is done.

If you have questions about anything, reach out in #jetpack-developers for guidance!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the premium analytics implementation skill to require persistent evidence for Agent-verifiable Definition of Done checks, especially local regression-injection verification that otherwise leaves no trace after revert.

Changes:

  • Adds /tmp/dod-report.md as a temporary DoD outcome buffer during Step 5.
  • Posts the collected DoD verification report as a PR comment in Step 8.
  • Adds a HARD rule requiring a ## DoD verification PR comment when Step 5 is in scope.
Comments suppressed due to low confidence (3)

.agents/skills/premium-analytics-implement-task.md:203

  • The cleanup runs unconditionally after the gh pr comment pipeline. If gh pr view/gh pr comment fails (auth, network, missing PR, invalid selector), this still deletes /tmp/dod-report.md, losing the only durable evidence that the new hard rule depends on; only remove the buffer after a successful comment post.
  } | gh pr comment "$PR_NUM" --body-file -
  rm /tmp/dod-report.md

.agents/skills/premium-analytics-implement-task.md:203

  • This new command is not permitted by the skill's allowed-tools frontmatter: line 6 allows cat, gh pr comment, etc., but there is no Bash(rm:*) entry. As written, the automated Step 8 cleanup may be blocked by the tool allowlist unless the allowlist is updated or cleanup uses an already-allowed mechanism.
  rm /tmp/dod-report.md

.agents/skills/premium-analytics-implement-task.md:193

  • Using file existence as the only gate means Step 8 silently skips the PR comment when Step 5 was required but the report was never written, which is the failure mode this change is trying to make visible. The instructions should explicitly fail when Agent-verifiable DoD items are in scope and /tmp/dod-report.md is missing, instead of treating absence as “Step 5 did not run.”
If Step 5 ran (i.e. `/tmp/dod-report.md` exists), post the accumulated outcomes as a
PR comment so reviewers can verify what was actually executed — the post-revert state
is identical whether Step 5 ran clean or was skipped, so durable evidence is the only
way to tell from the PR alone:

Comment on lines +119 to +122
8. **Record outcome durably** — append a structured block to `/tmp/dod-report.md`. This file is consumed in Step 8 and posted to the PR, so the verification leaves a trace that survives the session:

```bash
cat >> /tmp/dod-report.md << 'EOF'
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 3f149b6. Step 5 now opens with rm -f /tmp/dod-report.md, wiping any leftover state from a prior interrupted run before this cycle's first append. Each run claims exclusive ownership of the buffer for the duration of Step 5–8, so the posted ## DoD verification comment reflects only that run's outcomes.

Concurrent runs in the same sandbox would still race on the shared path, but that's an unusual situation (sandbox is single-tenant by design); the simple wipe handles the realistic case (one run, possibly interrupted earlier).

Addresses Copilot review on PR #32: /tmp/dod-report.md was appended to without first wiping leftover content. A previous interrupted run, or another concurrent run in the same sandbox, would mix stale verification outcomes into this cycle's PR comment.

Step 5 now starts with `rm -f /tmp/dod-report.md`, claiming exclusive ownership of the buffer for the current run. Each cycle's appends represent only that cycle's DoD outcomes; Step 8's posted `## DoD verification` comment is accurate to the run that produced it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dognose24 dognose24 requested a review from Copilot May 14, 2026 16:15
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

.agents/skills/premium-analytics-implement-task.md:211

  • This newly added cleanup uses a bare rm, but rm is not listed in this skill's allowed-tools frontmatter. If the comment posts successfully, the subsequent cleanup can still be blocked, leaving /tmp/dod-report.md behind for the next run despite the stale-state safeguards.
  rm /tmp/dod-report.md

.agents/skills/premium-analytics-implement-task.md:211

  • This command sequence deletes the report even if gh pr comment fails: because rm /tmp/dod-report.md is the last command in the if body, the snippet can exit successfully after losing the only buffered evidence. That can leave the PR without the required ## DoD verification comment and no local report to retry; gate the cleanup on a successful comment post.
  } | gh pr comment "$PR_NUM" --body-file -
  rm /tmp/dod-report.md

reflects only this run:

```bash
rm -f /tmp/dod-report.md
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in dd8ee35. Both rm -f calls replaced with : > /tmp/dod-report.md — the shell no-op : redirected to the file truncates it (or creates it empty) without needing rm permission. No allow-list change required. Applies to both the Step 5 setup and the Step 8 post-comment cleanup.

Comment on lines +198 to +204
If Step 5 ran (i.e. `/tmp/dod-report.md` exists), post the accumulated outcomes as a
PR comment so reviewers can verify what was actually executed — the post-revert state
is identical whether Step 5 ran clean or was skipped, so durable evidence is the only
way to tell from the PR alone:

```bash
if [ -f /tmp/dod-report.md ]; then
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — fixed in dd8ee35 by moving the cleanup out of the Step-5-body and into a setup block at the very top of Step 5 that runs unconditionally, before the skip decision. Now even when a task md has no Agent-verifiable items beyond build/UI verification, the buffer is truncated to empty on entry, so any leftover from a prior interrupted run is wiped before Step 8 ever looks at it.

Step 8's check also changed from -f (file exists — true after the truncate too) to -s (file non-empty — only true if Step 5 actually appended outcomes). So the signal is now "any DoD items ran this cycle", not just "the file exists".

the index, dropping the unstaged injection while keeping the staged
implementation intact.
7. Rebuild and re-run `/premium-analytics-verify-ui`; the suite must be green again.
8. **Record outcome durably** — append a structured block to `/tmp/dod-report.md`. This file is consumed in Step 8 and posted to the PR, so the verification leaves a trace that survives the session:
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in dd8ee35. The "record outcome" instruction is no longer nested as sub-step 8 of the regression-injection numbered list. It's now its own section after the pattern, titled "Record outcome — required for every Agent-verifiable item," with template fields for both the regression-injection case (Edit applied / Expected failing spec / Actual failure / Other specs / Revert + re-run) and generic items ("what was checked, what the expected outcome was, what was observed"). Any non-regression DoD item now still produces an entry in /tmp/dod-report.md, so Step 8's HARD-rule comment posts correctly.

…y DoD item, no rm needed

Addresses three Copilot review comments on PR #32:

1. `rm -f /tmp/dod-report.md` requires `Bash(rm:*)` which is intentionally not in the skill's allow-list (only `git rm` is). Replace with `: > /tmp/dod-report.md` — shell no-op + redirect truncates the file or creates it empty, no `rm` permission required. Same change applies to Step 8's post-comment cleanup.

2. The cleanup was inside Step 5 after the skip-check, so when a task md has no Agent-verifiable items beyond build/UI verification, stale leftover content from a prior interrupted run would survive and Step 8 would post it as DoD evidence on an unrelated PR. Move the cleanup to a setup block at the very top of Step 5 — runs unconditionally, before any skip decision.

3. The "record outcome to /tmp/dod-report.md" instruction was nested as sub-step 8 of the "local-only regression injection" common pattern, so an Agent-verifiable DoD item using a different pattern (deterministic CLI output check, specific feature assertion, etc.) would execute without appending anything. Step 8's posted comment would be empty/missing, conflicting with the HARD rule. Lift the recording step out of the regression-injection numbered list into its own "Record outcome — required for every item" section after the pattern, with template fields for both the regression-injection case and generic items.

Also change Step 8's existence check from `-f` (file exists) to `-s` (file non-empty): since the buffer is now always truncated empty on Step 5 entry, file existence no longer distinguishes "Step 5 ran items" from "Step 5 was a no-op"; non-empty content is the real signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

.agents/skills/premium-analytics-implement-task.md:233

  • This cleanup uses the : shell builtin, but the skill frontmatter allow-list does not include Bash(:*). If the command is denied, the just-posted buffer remains non-empty and a later retry can post duplicate/stale DoD evidence; use an allowed command or add the matching allow-list entry.
  : > /tmp/dod-report.md   # truncate (no `rm` needed)

For other Agent-verifiable items (deterministic CLI output, specific feature
behavior assertion, etc.), record the equivalent: what was checked, what the
expected outcome was, what was observed.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 71717e7. Step 7 now calls git add <implementation-files> unconditionally before staging the changelog. git add is idempotent (already-staged files are no-ops), so the regression-injection path stays correct, and the non-regression DoD path (or a skipped Step 5) now also has the implementation staged before commit. Removed the conditional language about "Step 5 already staged it" since the unconditional add covers both cases.

swapping a mock-data label in `stage.tsx`), so a naive `git checkout -- <file>`
would discard the implementation along with the injection.
```bash
: > /tmp/dod-report.md
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — : is also not on the allow-list. Fixed in 71717e7 by switching both truncate calls (Step 5 setup + Step 8 cleanup) to cp /dev/null /tmp/dod-report.md. cp is already in the allow-list (Bash(cp:*)); copying /dev/null writes a zero-byte file, same end-result as the redirect, portable across GNU/BSD coreutils.

…ementation unconditionally in Step 7

Addresses two Copilot review comments on PR #32:

1. `:` (shell no-op) isn't in the skill's allow-list either (`Bash(:*)` not listed). Replace both truncate calls with `cp /dev/null /tmp/dod-report.md` — `cp` is already in the allow-list, the semantics are identical (writes a 0-byte file), and it's portable across GNU/BSD coreutils.

2. The unconditional record-outcome rule for Step 5 (added last commit) means non-regression DoD items can reach Step 7 without `git add` having been called on the implementation, because only the regression-injection pattern stages the implementation as part of its baseline-snapshot trick. Step 7 now calls `git add <implementation-files>` unconditionally before staging the changelog — `git add` is idempotent so already-staged files are no-ops, and previously-unstaged files (from a non-regression DoD path or a skipped Step 5) get picked up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +230 to +235
{
echo "## DoD verification"
echo ""
cat /tmp/dod-report.md
} | gh pr comment "$PR_NUM" --body-file -
cp /dev/null /tmp/dod-report.md # truncate using `cp` (allow-list friendly, no `rm` needed)
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fd30279. Wrapped the post in if { … } | gh pr comment …; then truncate; else echo error + exit 1; fi. The pipe's exit status is gh pr comment's, so cp /dev/null only runs after a successful post. On failure the buffer survives at /tmp/dod-report.md (so the post can be retried) and the script exits 1, signaling failure to the agent runner so the skill doesn't proceed to Step 9 with the HARD rule violated.

Addresses Copilot review on PR #32: previously the `cp /dev/null /tmp/dod-report.md` truncate ran unconditionally after `gh pr comment`. If posting failed (auth, network, PR doesn't exist), the shell continued, wiped the buffer, and left the run without the required `## DoD verification` comment — violating the HARD rule without leaving evidence to retry from.

Wrap the post in `if … then truncate else exit 1`. The pipe's exit status is `gh pr comment`'s, so `cp /dev/null` only runs on a successful post. On failure the buffer survives (so the agent or user can retry the post) and the script exits 1, signaling failure to the agent runner so it doesn't proceed to Step 9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@dognose24 dognose24 merged commit 88ff3f9 into trunk May 15, 2026
62 of 63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants