
Audit Agent Experience Skill #79

Open
jay-sahnan wants to merge 4 commits into main from audit-agent-experience

Conversation

@jay-sahnan (Contributor) commented Apr 25, 2026

Spawns parallel Claude subagents against a target docs/SDK/SKILL.md from a one-sentence prompt, captures structured traces, and renders a graded HTML report scoring Setup Friction, Speed, Efficiency, Error Recovery, and Doc Quality. Includes narrative cross-agent review to surface convergent hallucinations and silent workarounds the JSON self-report misses.
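For orientation, a minimal sketch of the per-agent JSON trace this workflow implies. Only the onboarding_status vocabulary is taken from the schema discussed in the review comments below; every other field name here is an illustrative assumption, not the skill's actual schema:

```ts
// Hypothetical per-agent trace shape. Only the onboarding_status values
// come from the skill's schema (see the Bugbot comments below); all other
// field names are assumptions for illustration.
interface AgentTrace {
  agentId: string;
  persona: string; // e.g. "Skeptical", one of the prompt variants
  onboardingStatus:
    | "completed"
    | "partial"
    | "stuck"
    | "blocked-on-credentials"
    | "errored"; // errored covers trace parse failures
  toolCalls: number;          // feeds the Efficiency dimension
  completedSubtasks: number;  // numerator of the Efficiency ratio
  wallTimeSeconds: number;    // feeds the Speed dimension
  confusions: string[];       // doc passages the agent flagged
}
```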


Note

Medium Risk
Adds a new skill that orchestrates parallel subagents, optional shell execution, and credential auto-discovery guidance; mis-specification could lead to unsafe tool usage or accidental secret exposure if implemented incorrectly.

Overview
Adds a new audit-agent-experience skill definition (SKILL.md) that specifies an end-to-end workflow for benchmarking agent onboarding against a target docs/SDK/SKILL.md using multiple parallel subagents, minimal task prompts, and structured trace capture/scoring across DX dimensions.

Includes supporting reference docs (references/*.md) for prompt variants, subagent brief + JSON trace schema, and an evaluation rubric with score caps and cross-agent narrative review guidance, plus an assets/report-template.html for rendering the final graded HTML report and an MIT LICENSE.
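As a rough illustration of how the report template's placeholders might be filled (the {{COMPLETED_COUNT}}-style tokens appear in the template excerpt quoted later in this thread; the renderer itself is an assumption, not code from this PR):

```ts
// Hypothetical renderer: substitute {{PLACEHOLDER}} tokens with computed
// values. Placeholder names match the template excerpt below; the helper
// is illustrative only.
const templateSource =
  '<div class="stat"><div class="label">Completed</div><div class="value ok">{{COMPLETED_COUNT}}</div></div>';

function renderReport(template: string, values: Record<string, string | number>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in values ? String(values[key]) : match, // leave unknown tokens visible
  );
}

console.log(renderReport(templateSource, { COMPLETED_COUNT: 3 }));
// <div class="stat"><div class="label">Completed</div><div class="value ok">3</div></div>
```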

Reviewed by Cursor Bugbot for commit 485b946. Bugbot is set up for automated code reviews on this repo.

jay-sahnan and others added 2 commits April 25, 2026 09:07
Spawns parallel Claude subagents against a target docs/SDK/SKILL.md from a
one-sentence prompt, captures structured traces, and renders a graded HTML
report scoring Setup Friction, Speed, Efficiency, Error Recovery, and Doc
Quality. Includes narrative cross-agent review to surface convergent
hallucinations and silent workarounds the JSON self-report misses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline comment threads:
- skills/audit-agent-experience/assets/report-template.html (2 threads)
- skills/audit-agent-experience/references/prompt-variants.md
- skills/audit-agent-experience/SKILL.md (3 threads, outdated)

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.


<div class="stat"><div class="label">Completed</div><div class="value ok">{{COMPLETED_COUNT}}</div></div>
<div class="stat"><div class="label">Stuck</div><div class="value warn">{{STUCK_COUNT}}</div></div>
<div class="stat"><div class="label">Errored</div><div class="value bad">{{ERRORED_COUNT}}</div></div>
</div>

Stat grid omits partial and blocked-on-credentials statuses

Medium Severity

The stat grid only defines counters for Completed, Stuck, and Errored, but the `onboarding_status` schema supports five values: `completed`, `partial`, `stuck`, `blocked-on-credentials`, plus `errored` for parse failures. Agents ending in `partial` or `blocked-on-credentials` status won't be reflected in any status-specific counter, so the three sub-counts won't sum to `{{AGENT_COUNT}}`, producing a confusing report.
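A hedged sketch of the fix: derive one counter per schema status so the sub-counts always sum to the agent total. The five status names are from the schema; the trace shape and helper are assumptions:

```ts
// Count traces per onboarding_status. Enumerating all five schema values
// guarantees the per-status counters sum to {{AGENT_COUNT}}.
const STATUSES = ["completed", "partial", "stuck", "blocked-on-credentials", "errored"] as const;
type Status = (typeof STATUSES)[number];

function countByStatus(traces: { onboardingStatus: Status }[]): Record<Status, number> {
  const counts = Object.fromEntries(STATUSES.map((s) => [s, 0])) as Record<Status, number>;
  for (const t of traces) counts[t.onboardingStatus] += 1;
  return counts;
}
```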


```css
.agent-card .body .confusion .section-tag { font-family: 'Geist Mono', 'SF Mono', monospace; font-size: 0.75rem; color: var(--grade-f); font-weight: 500; }
.agent-card .body .confusion .issue { margin-top: 0.25rem; color: var(--text); }
.agent-card .body .positive { padding: 0.5rem 0.75rem; background: rgba(34,197,94,0.05); border: 1px solid rgba(34,197,94,0.2); border-radius: 3px; color: var(--text); }
.agent-card .body .suggestion { padding: 0.5rem 0.75rem; background: rgba(77,169,228,0.06); border: 1px solid rgba(77,169,228,0.2); border-radius: 3px; color: #2a7ab5; }
```

Unused agent-card CSS with mismatched status vocabulary

Low Severity

The `.agent-card` CSS block (~30 rules) is never referenced by any template placeholder or in SKILL.md — only `.trace-card` and `.agent-results-table` are used for per-agent rendering. Worse, it defines a `wrong-result` status class that doesn't exist in the `onboarding_status` schema, while missing `partial` and `blocked-on-credentials` classes that the trace-card and status-pill CSS correctly include. This dead CSS with a stale status vocabulary could mislead the LLM into using incorrect class names.
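One way to keep the stylesheet's class names in lockstep with the schema, sketched under the assumption that classes are derived per status (only the status vocabulary is from the schema; the mapping is illustrative):

```ts
// Derive class names from the schema's status vocabulary so the CSS can
// never drift to undefined classes like "wrong-result". Illustrative only.
type Status = "completed" | "partial" | "stuck" | "blocked-on-credentials" | "errored";

const statusClass: Record<Status, string> = {
  completed: "status-completed",
  partial: "status-partial",
  stuck: "status-stuck",
  "blocked-on-credentials": "status-blocked-on-credentials",
  errored: "status-errored",
};
```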


```md
Behavioural hint: reads end-to-end before coding. Surfaces ambiguity. Catches docs that don't survive a close read.

### Skeptical
> Follow — note anything in the docs that seems wrong or unclear as you go while following
```

Skeptical prefix doesn't compose with prompt template

Medium Severity

The Skeptical persona prefix ends with "while following", so applying the template `{persona_prefix} {product}'s getting-started guide…` produces "Follow — note anything… while following Acme's getting-started guide…" — an awkward double-verb sentence. The worked example on line 72 restructures the clause order entirely, placing "note anything…" after the product name instead of before it. The template and example are incompatible, creating ambiguity about which prompt format to generate.
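To make the composition failure concrete, a small sketch: the prefix and template are quoted from this comment, while the helper and the product name "Acme" are illustrative:

```ts
// Compose the prompt exactly as the template specifies; the output shows
// the double-verb sentence the review describes. "Acme" is a placeholder.
const skepticalPrefix =
  "Follow — note anything in the docs that seems wrong or unclear as you go while following";

function buildPrompt(personaPrefix: string, product: string): string {
  return `${personaPrefix} ${product}'s getting-started guide…`;
}

console.log(buildPrompt(skepticalPrefix, "Acme"));
// "Follow — note anything in the docs that seems wrong or unclear as you
//  go while following Acme's getting-started guide…"
```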


```md
- **Speed (20%)** — total wall time, time-to-first-working-code.
- **Efficiency (20%)** — tool calls per passed goal item, wasted calls.
- **Error Recovery (15%)** — did errors block goal items, or did agents route around?
- **Doc Quality (20%)** — did docs supply what was needed to pass the checklist?
```

Scoring summaries reference nonexistent "checklist" and "goal items"

Medium Severity

The Step 7 dimension summaries reference "goal items" and "checklist" — concepts the entire skill explicitly forbids. Line 278 says "tool calls per passed goal item" and line 280 says "pass the checklist," but the actual evaluation-rubric.md uses the `completed_subtasks` / total `tool_calls` ratio for Efficiency and "Did the docs provide what agents needed?" for Doc Quality — no checklist anywhere. This stale language could cause the executing LLM to construct a scoring checklist, directly contradicting the core principle on lines 37 and 179.
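For reference, the rubric's Efficiency measure as this comment characterizes it, sketched with assumed field names (the ratio of completed_subtasks to total tool_calls is from the comment above; nothing here implies a checklist):

```ts
// Efficiency per evaluation-rubric.md as described in this review:
// completed subtasks over total tool calls. No goal items, no checklist.
function efficiencyRatio(completedSubtasks: number, toolCalls: number): number {
  return toolCalls === 0 ? 0 : completedSubtasks / toolCalls;
}
```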

