2 changes: 1 addition & 1 deletion .agents/skills/skill-creator/LICENSE.txt
@@ -187,7 +187,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2026 Anthropic, PBC.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
37 changes: 11 additions & 26 deletions .agents/skills/skill-creator/SKILL.md
@@ -40,7 +40,7 @@ So please pay attention to context cues to understand how to phrase your communi

It's OK to briefly explain terms if you're in doubt, and feel free to clarify terms with a short definition if you're unsure if the user will get it.

______________________________________________________________________
---

## Creating a skill

@@ -86,21 +86,18 @@ skill-name/
#### Progressive Disclosure

Skills use a three-level loading system:

1. **Metadata** (name + description) - Always in context (~100 words)
2. **SKILL.md body** - In context whenever skill triggers (\<500 lines ideal)
2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)
3. **Bundled resources** - As needed (unlimited, scripts can execute without loading)

These word counts are approximate and you can feel free to go longer if needed.

**Key patterns:**

- Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
- Reference files clearly from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
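
The size guidelines above can be checked mechanically. The following is a minimal sketch, not part of the skill itself: the function names are hypothetical, and the table-of-contents test is an assumed heuristic (any Markdown heading containing the word "contents").

```python
SKILL_BODY_LIMIT = 500         # "Keep SKILL.md under 500 lines"
REFERENCE_TOC_THRESHOLD = 300  # reference files longer than this should have a table of contents


def count_lines(text: str) -> int:
    """Return the number of lines in a string."""
    return len(text.splitlines())


def check_skill_text(skill_md: str, references: dict[str, str]) -> list[str]:
    """Return warnings for a SKILL.md body and its reference files.

    `references` maps a file name to its contents. The table-of-contents
    check is a heuristic assumed for this sketch: it looks for a Markdown
    heading whose text contains "contents".
    """
    warnings = []
    body_lines = count_lines(skill_md)
    if body_lines >= SKILL_BODY_LIMIT:
        warnings.append(f"SKILL.md has {body_lines} lines; consider adding a layer of hierarchy")
    for name, body in references.items():
        ref_lines = count_lines(body)
        if ref_lines > REFERENCE_TOC_THRESHOLD:
            has_toc = any(
                line.lstrip().startswith("#") and "contents" in line.lower()
                for line in body.splitlines()
            )
            if not has_toc:
                warnings.append(f"{name} has {ref_lines} lines but no table of contents heading")
    return warnings
```

Because the text describes these limits as approximate, the sketch returns warnings rather than raising errors.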

**Domain organization**: When a skill supports multiple domains/frameworks, organize by variant:

```
cloud-deploy/
├── SKILL.md (workflow + selection)
@@ -109,7 +106,6 @@ cloud-deploy/
├── gcp.md
└── azure.md
```

Claude reads only the relevant reference file.

#### Principle of Lack of Surprise
@@ -121,7 +117,6 @@ This goes without saying, but skills must not contain malware, exploit code, or
Prefer using the imperative form in instructions.

**Defining output formats** - You can do it like this:

```markdown
## Report structure
ALWAYS use this exact template:
@@ -132,7 +127,6 @@ ALWAYS use this exact template:
```

**Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):

```markdown
## Commit message format
**Example 1:**
@@ -188,7 +182,6 @@ Execute this task:
```

**Baseline run** (same prompt, but the baseline depends on context):

- **Creating a new skill**: no skill at all. Same prompt, no skill path, save to `without_skill/outputs/`.
- **Improving an existing skill**: the old version. Before editing, snapshot the skill (`cp -r <skill-path> <workspace>/skill-snapshot/`), then point the baseline subagent at the snapshot. Save to `old_skill/outputs/`.
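
For readers scripting the baseline setup, the two cases above might look like the sketch below. The function names and the `run_dir` parameter are hypothetical. Unlike `cp -r` into an existing directory, this version refuses to overwrite an existing snapshot.

```python
import shutil
from pathlib import Path


def snapshot_skill(skill_path: Path, workspace: Path) -> Path:
    """Copy the skill directory to <workspace>/skill-snapshot and return the copy's path."""
    snapshot = workspace / "skill-snapshot"
    if snapshot.exists():
        # Refuse to overwrite: an old snapshot would silently become the wrong baseline.
        raise FileExistsError(f"{snapshot} already exists; remove it or use a fresh workspace")
    shutil.copytree(skill_path, snapshot)
    return snapshot


def baseline_outputs_dir(run_dir: Path, improving_existing: bool) -> Path:
    """Pick the baseline output directory named in the text for each case."""
    name = "old_skill" if improving_existing else "without_skill"
    return run_dir / name / "outputs"
```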

@@ -232,18 +225,15 @@ Once all runs are done:
1. **Grade each run** — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and evaluates each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `evidence` (not `name`/`met`/`details` or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.

2. **Aggregate into benchmark** — run the aggregation script from the skill-creator directory:

```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```

This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects.
Put each with_skill version before its baseline counterpart.
Put each with_skill version before its baseline counterpart.

3. **Do an analyst pass** — read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.

4. **Launch the viewer** with both qualitative outputs and quantitative data:

```bash
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
<workspace>/iteration-N \
@@ -252,7 +242,6 @@ Once all runs are done:
> /dev/null 2>&1 &
VIEWER_PID=$!
```

For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.

**Cowork / headless environments:** If `webbrowser.open()` is not available or the environment has no display, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up.
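
Since the viewer depends on the exact field names `text`, `passed`, and `evidence`, a quick check before launching it can catch variants such as `name`/`met`/`details`. This is a minimal sketch: it assumes `expectations` is a top-level key of `grading.json` (see `references/schemas.md` for the actual schema), and the function name is hypothetical.

```python
REQUIRED_FIELDS = {"text", "passed", "evidence"}


def find_field_problems(grading: dict) -> list[str]:
    """Return one message per expectation entry that lacks a required field.

    Assumes the parsed grading.json has a top-level "expectations" list;
    that layout is an assumption of this sketch.
    """
    problems = []
    for index, expectation in enumerate(grading.get("expectations", [])):
        missing = REQUIRED_FIELDS - expectation.keys()
        if missing:
            problems.append(f"expectations[{index}] is missing {sorted(missing)}")
    return problems
```

A typical use would be to load each run's `grading.json` with `json.load` and print any problems before aggregating.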
@@ -264,7 +253,6 @@ Note: please use generate_review.py to create the viewer; there's no need to wri
### What the user sees in the viewer

The "Outputs" tab shows one test case at a time:

- **Prompt**: the task that was given
- **Output**: the files the skill produced, rendered inline where possible
- **Previous Output** (iteration 2+): collapsed section showing last iteration's output
@@ -299,7 +287,7 @@ Kill the viewer server when you're done with it:
kill $VIEWER_PID 2>/dev/null
```

______________________________________________________________________
---

## Improving the skill

@@ -328,20 +316,19 @@ After improving the skill:
5. Read the new feedback, improve again, repeat

Keep going until:

- The user says they're happy
- The feedback is all empty (everything looks good)
- You're not making meaningful progress

______________________________________________________________________
---

## Advanced: Blind comparison

For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read `agents/comparator.md` and `agents/analyzer.md` for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.

This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
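
The blinding step described above can be sketched as follows. This is an illustration only: the real procedure lives in `agents/comparator.md`, which is not shown here, and the function names and the "A"/"B" labels are hypothetical.

```python
import random


def blind_pair(output_new: str, output_old: str, rng: random.Random):
    """Shuffle two outputs behind neutral labels.

    Returns (labeled, key): `labeled` maps "A"/"B" to the output text that the
    judge sees; `key` maps the same labels back to "new"/"old" and is kept
    away from the judge until a winner is chosen.
    """
    versions = [("new", output_new), ("old", output_old)]
    rng.shuffle(versions)
    labeled = {label: text for label, (_, text) in zip("AB", versions)}
    key = {label: name for label, (name, _) in zip("AB", versions)}
    return labeled, key


def reveal_winner(winner_label: str, key: dict) -> str:
    """Translate the judge's chosen label back into the version name."""
    return key[winner_label]
```

Passing an explicit `random.Random` instance makes the shuffle reproducible when a comparison needs to be re-run.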

______________________________________________________________________
---

## Description Optimization

@@ -416,7 +403,7 @@ This means your eval queries should be substantive enough that Claude would actu

Take `best_description` from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.

______________________________________________________________________
---

### Package and Present (only if `present_files` tool is available)

@@ -428,7 +415,7 @@ python -m scripts.package_skill <path/to/skill-folder>

After packaging, direct the user to the resulting `.skill` file path so they can install it.

______________________________________________________________________
---

## Claude.ai-specific instructions

@@ -449,12 +436,11 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro
**Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.

**Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:

- **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
- **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
- **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions.

______________________________________________________________________
---

## Cowork-Specific Instructions

@@ -468,7 +454,7 @@ If you're in Cowork, the main things to know are:
- Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
- **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.

______________________________________________________________________
---

## Reference files

@@ -479,10 +465,9 @@ The agents/ directory contains instructions for specialized subagents. Read them
- `agents/analyzer.md` — How to analyze why one version beat another

The references/ directory has additional documentation:

- `references/schemas.md` — JSON structures for evals.json, grading.json, etc.

______________________________________________________________________
---

Repeating one more time the core loop here for emphasis:

29 changes: 9 additions & 20 deletions .agents/skills/skill-creator/agents/analyzer.md
@@ -49,7 +49,6 @@ You receive these parameters in your prompt:
### Step 4: Analyze Instruction Following

For each transcript, evaluate:

- Did the agent follow the skill's explicit instructions?
- Did the agent use the skill's provided tools/scripts?
- Were there missed opportunities to leverage skill content?
@@ -60,7 +59,6 @@ Score instruction following 1-10 and note specific issues.
### Step 5: Identify Winner Strengths

Determine what made the winner better:

- Clearer instructions that led to better behavior?
- Better scripts/tools that produced better output?
- More comprehensive examples that guided edge cases?
@@ -71,7 +69,6 @@ Be specific. Quote from skills/transcripts where relevant.
### Step 6: Identify Loser Weaknesses

Determine what held the loser back:

- Ambiguous instructions that led to suboptimal choices?
- Missing tools/scripts that forced workarounds?
- Gaps in edge case coverage?
@@ -80,7 +77,6 @@ Determine what held the loser back:
### Step 7: Generate Improvement Suggestions

Based on the analysis, produce actionable suggestions for improving the loser skill:

- Specific instruction changes to make
- Tools/scripts to add or modify
- Examples to include
@@ -171,22 +167,22 @@ Write a JSON file with this structure:

Use these categories to organize improvement suggestions:

| Category | Description |
| ---------------- | ---------------------------------------------- |
| `instructions` | Changes to the skill's prose instructions |
| `tools` | Scripts, templates, or utilities to add/modify |
| `examples` | Example inputs/outputs to include |
| `error_handling` | Guidance for handling failures |
| `structure` | Reorganization of skill content |
| `references` | External docs or resources to add |
| Category | Description |
|----------|-------------|
| `instructions` | Changes to the skill's prose instructions |
| `tools` | Scripts, templates, or utilities to add/modify |
| `examples` | Example inputs/outputs to include |
| `error_handling` | Guidance for handling failures |
| `structure` | Reorganization of skill content |
| `references` | External docs or resources to add |

## Priority Levels

- **high**: Would likely change the outcome of this comparison
- **medium**: Would improve quality but may not change win/loss
- **low**: Nice to have, marginal improvement

______________________________________________________________________
---

# Analyzing Benchmark Results

@@ -215,7 +211,6 @@ You receive these parameters in your prompt:
### Step 2: Analyze Per-Assertion Patterns

For each expectation across all runs:

- Does it **always pass** in both configurations? (may not differentiate skill value)
- Does it **always fail** in both configurations? (may be broken or beyond capability)
- Does it **always pass with skill but fail without**? (skill clearly adds value here)
@@ -225,29 +220,25 @@ For each expectation across all runs:
### Step 3: Analyze Cross-Eval Patterns

Look for patterns across evals:

- Are certain eval types consistently harder/easier?
- Do some evals show high variance while others are stable?
- Are there surprising results that contradict expectations?
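
The per-assertion patterns from Step 2 lend themselves to a small classifier. The sketch below covers only the three patterns visible in this excerpt and lumps everything else under a catch-all. The function name and the returned labels are hypothetical, not a format the analyzer prescribes.

```python
def classify_assertion(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Classify one expectation from its pass/fail results in each configuration.

    Each list holds one boolean per run. The returned labels are hypothetical
    names chosen for this sketch.
    """
    if not with_skill or not without_skill:
        raise ValueError("need at least one run per configuration")
    if all(with_skill) and all(without_skill):
        return "non_discriminating"  # always passes in both configurations
    if not any(with_skill) and not any(without_skill):
        return "always_fails"        # may be broken or beyond capability
    if all(with_skill) and not any(without_skill):
        return "skill_adds_value"    # passes with the skill, fails without it
    return "mixed"                   # anything else, including possibly flaky results
```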

### Step 4: Analyze Metrics Patterns

Look at time_seconds, tokens, tool_calls:

- Does the skill significantly increase execution time?
- Is there high variance in resource usage?
- Are there outlier runs that skew the aggregates?
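
The questions above reduce to a few summary statistics per configuration. The following is a minimal sketch under stated assumptions: it uses the sample standard deviation (the aggregation script may compute it differently), and the outlier rule of `k` standard deviations from the mean is an arbitrary choice for illustration.

```python
from statistics import mean, stdev


def summarize(values: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation); stddev is 0.0 for a single run."""
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


def delta(with_skill: list[float], baseline: list[float]) -> float:
    """Difference of means: positive when the with-skill runs score higher."""
    return mean(with_skill) - mean(baseline)


def outliers(values: list[float], k: float = 2.0) -> list[float]:
    """Return values more than k standard deviations from the mean (k is an assumed threshold)."""
    center, spread = summarize(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - center) > k * spread]
```

With only a handful of runs per configuration no value can lie far from the mean in standard-deviation terms, so an empty result from `outliers` does not show that the runs are consistent.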

### Step 5: Generate Notes

Write freeform observations as a list of strings. Each note should:

- State a specific observation
- Be grounded in the data (not speculation)
- Help the user understand something the aggregate metrics don't show

Examples:

- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
@@ -271,14 +262,12 @@ Save notes to `{output_path}` as a JSON array of strings:
## Guidelines

**DO:**

- Report what you observe in the data
- Be specific about which evals, expectations, or runs you're referring to
- Note patterns that aggregate metrics would hide
- Provide context that helps interpret the numbers

**DO NOT:**

- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
- Make subjective quality judgments ("the output was good/bad")
- Speculate about causes without evidence
23 changes: 10 additions & 13 deletions .agents/skills/skill-creator/agents/comparator.md
@@ -39,23 +39,20 @@ You receive these parameters in your prompt:
Based on the task, generate a rubric with two dimensions:

**Content Rubric** (what the output contains):

| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
| ------------ | ------------------------ | ------------------ | -------------------- |
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |
| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |
| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |

**Structure Rubric** (how the output is organized):

| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
| ------------ | ------------------- | -------------------- | ------------------------ |
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
| Usability | Difficult to use | Usable with effort | Easy to use |
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
| Usability | Difficult to use | Usable with effort | Easy to use |

Adapt criteria to the specific task. For example:

- PDF form → "Field alignment", "Text readability", "Data placement"
- Document → "Section structure", "Heading hierarchy", "Paragraph flow"
- Data output → "Schema correctness", "Data types", "Completeness"
6 changes: 0 additions & 6 deletions .agents/skills/skill-creator/agents/grader.md
@@ -45,13 +45,11 @@ For each expectation:
Beyond the predefined expectations, extract implicit claims from the outputs and verify them:

1. **Extract claims** from the transcript and outputs:

- Factual statements ("The form has 12 fields")
- Process claims ("Used pypdf to fill the form")
- Quality claims ("All fields were filled correctly")

2. **Verify each claim**:

- **Factual claims**: Can be checked against the outputs or external sources
- **Process claims**: Can be verified from the transcript
- **Quality claims**: Evaluate whether the claim is justified
@@ -63,7 +61,6 @@ This catches issues that predefined expectations might miss.
### Step 5: Read User Notes

If `{outputs_dir}/user_notes.md` exists:

1. Read it and note any uncertainties or issues flagged by the executor
2. Include relevant concerns in the grading output
3. These may reveal problems even when expectations pass
@@ -75,7 +72,6 @@ After grading, consider whether the evals themselves could be improved. Only sur
Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion *discriminating*: it passes when the skill genuinely succeeds and fails when it doesn't.

Suggestions worth raising:

- An assertion that passed but would also pass for a clearly wrong output (e.g., checking filename existence but not file content)
- An important outcome you observed — good or bad — that no assertion covers at all
- An assertion that can't actually be verified from the available outputs
@@ -89,13 +85,11 @@ Save results to `{outputs_dir}/../grading.json` (sibling to outputs_dir).
## Grading Criteria

**PASS when**:

- The transcript or outputs clearly demonstrate the expectation is true
- Specific evidence can be cited
- The evidence reflects genuine substance, not just surface compliance (e.g., a file exists AND contains correct content, not just the right filename)

**FAIL when**:

- No evidence found for the expectation
- Evidence contradicts the expectation
- The expectation cannot be verified from available information