LayeredCraft · j-d-ha · May 7, 2026 · May 9, 2026 · May 9, 2026
diff --git a/.agents/skills/skill-creator/LICENSE.txt b/.agents/skills/skill-creator/LICENSE.txt
@@ -187,7 +187,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright [yyyy] [name of copyright owner]
+   Copyright 2026 Anthropic, PBC.
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

diff --git a/.agents/skills/skill-creator/SKILL.md b/.agents/skills/skill-creator/SKILL.md
@@ -40,7 +40,7 @@ So please pay attention to context cues to understand how to phrase your communi
 
 It's OK to briefly explain terms if you're in doubt, and feel free to clarify terms with a short definition if you're unsure if the user will get it.
 
-______________________________________________________________________
+---
 
 ## Creating a skill
 
@@ -86,21 +86,18 @@ skill-name/
 #### Progressive Disclosure
 
 Skills use a three-level loading system:
-
 1. **Metadata** (name + description) - Always in context (~100 words)
-2. **SKILL.md body** - In context whenever skill triggers (\<500 lines ideal)
+2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal)
 3. **Bundled resources** - As needed (unlimited, scripts can execute without loading)
 
 These word counts are approximate and you can feel free to go longer if needed.
 
 **Key patterns:**
-
 - Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
 - Reference files clearly from SKILL.md with guidance on when to read them
 - For large reference files (>300 lines), include a table of contents
 
 **Domain organization**: When a skill supports multiple domains/frameworks, organize by variant:
-
 ```
 cloud-deploy/
 ├── SKILL.md (workflow + selection)
@@ -109,7 +106,6 @@ cloud-deploy/
     ├── gcp.md
     └── azure.md
 ```
-
 Claude reads only the relevant reference file.
 
 #### Principle of Lack of Surprise
@@ -121,7 +117,6 @@ This goes without saying, but skills must not contain malware, exploit code, or
 Prefer using the imperative form in instructions.
 
 **Defining output formats** - You can do it like this:
-
 ```markdown
 ## Report structure
 ALWAYS use this exact template:
@@ -132,7 +127,6 @@ ALWAYS use this exact template:
 ```
 
 **Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):
-
 ```markdown
 ## Commit message format
 **Example 1:**
@@ -188,7 +182,6 @@ Execute this task:
 ```
 
 **Baseline run** (same prompt, but the baseline depends on context):
-
 - **Creating a new skill**: no skill at all. Same prompt, no skill path, save to `without_skill/outputs/`.
 - **Improving an existing skill**: the old version. Before editing, snapshot the skill (`cp -r <skill-path> <workspace>/skill-snapshot/`), then point the baseline subagent at the snapshot. Save to `old_skill/outputs/`.
 
@@ -232,18 +225,15 @@ Once all runs are done:
 1. **Grade each run** — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and evaluates each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `evidence` (not `name`/`met`/`details` or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations.
 
 2. **Aggregate into benchmark** — run the aggregation script from the skill-creator directory:
-
    ```bash
    python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
    ```
-
    This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects.
-   Put each with_skill version before its baseline counterpart.
+Put each with_skill version before its baseline counterpart.
 
 3. **Do an analyst pass** — read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.
 
 4. **Launch the viewer** with both qualitative outputs and quantitative data:
-
    ```bash
    nohup python <skill-creator-path>/eval-viewer/generate_review.py \
      <workspace>/iteration-N \
@@ -252,7 +242,6 @@ Once all runs are done:
      > /dev/null 2>&1 &
    VIEWER_PID=$!
    ```
-
    For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
 
    **Cowork / headless environments:** If `webbrowser.open()` is not available or the environment has no display, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up.
@@ -264,7 +253,6 @@ Note: please use generate_review.py to create the viewer; there's no need to wri
 ### What the user sees in the viewer
 
 The "Outputs" tab shows one test case at a time:
-
 - **Prompt**: the task that was given
 - **Output**: the files the skill produced, rendered inline where possible
 - **Previous Output** (iteration 2+): collapsed section showing last iteration's output
@@ -299,7 +287,7 @@ Kill the viewer server when you're done with it:
 kill $VIEWER_PID 2>/dev/null
 ```
 
-______________________________________________________________________
+---
 
 ## Improving the skill
 
@@ -328,20 +316,19 @@ After improving the skill:
 5. Read the new feedback, improve again, repeat
 
 Keep going until:
-
 - The user says they're happy
 - The feedback is all empty (everything looks good)
 - You're not making meaningful progress
 
-______________________________________________________________________
+---
 
 ## Advanced: Blind comparison
 
 For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read `agents/comparator.md` and `agents/analyzer.md` for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.
 
 This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
 
-______________________________________________________________________
+---
 
 ## Description Optimization
 
@@ -416,7 +403,7 @@ This means your eval queries should be substantive enough that Claude would actu
 
 Take `best_description` from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
 
-______________________________________________________________________
+---
 
 ### Package and Present (only if `present_files` tool is available)
 
@@ -428,7 +415,7 @@ python -m scripts.package_skill <path/to/skill-folder>
 
 After packaging, direct the user to the resulting `.skill` file path so they can install it.
 
-______________________________________________________________________
+---
 
 ## Claude.ai-specific instructions
 
@@ -449,12 +436,11 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro
 **Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.
 
 **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
-
 - **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
 - **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
 - **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions.
 
-______________________________________________________________________
+---
 
 ## Cowork-Specific Instructions
 
@@ -468,7 +454,7 @@ If you're in Cowork, the main things to know are:
 - Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
 - **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.
 
-______________________________________________________________________
+---
 
 ## Reference files
 
@@ -479,10 +465,9 @@ The agents/ directory contains instructions for specialized subagents. Read them
 - `agents/analyzer.md` — How to analyze why one version beat another
 
 The references/ directory has additional documentation:
-
 - `references/schemas.md` — JSON structures for evals.json, grading.json, etc.
 
-______________________________________________________________________
+---
 
 Repeating one more time the core loop here for emphasis:
 
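Note on the SKILL.md changes above: the "Key patterns" list sets two size thresholds (SKILL.md under 500 lines, and a table of contents for reference files over 300 lines). A minimal sketch of how those could be checked mechanically follows. The function name `check_skill_sizes`, the `references/*.md` glob, and the "table of contents" text heuristic are all assumptions made for illustration; nothing like this ships in the PR.

```python
from pathlib import Path

SKILL_MD_LIMIT = 500       # "Keep SKILL.md under 500 lines"
REFERENCE_TOC_LIMIT = 300  # "For large reference files (>300 lines), include a table of contents"


def check_skill_sizes(skill_dir: Path) -> list[str]:
    """Return human-readable warnings for a skill directory (hypothetical helper)."""
    warnings: list[str] = []

    skill_md = skill_dir / "SKILL.md"
    if skill_md.is_file():
        n = len(skill_md.read_text(encoding="utf-8").splitlines())
        if n >= SKILL_MD_LIMIT:
            warnings.append(f"SKILL.md has {n} lines; consider another layer of hierarchy")

    refs = skill_dir / "references"
    if refs.is_dir():
        for ref in sorted(refs.glob("*.md")):
            text = ref.read_text(encoding="utf-8")
            n = len(text.splitlines())
            # Assumption: a TOC is detected by a literal "table of contents" phrase.
            if n > REFERENCE_TOC_LIMIT and "table of contents" not in text.lower():
                warnings.append(f"{ref.name} has {n} lines and no table of contents")

    return warnings


if __name__ == "__main__":
    # Prints nothing if the directory does not exist.
    for warning in check_skill_sizes(Path(".agents/skills/skill-creator")):
        print(warning)
```

The thresholds are taken from the text as written; the skill itself calls these counts approximate, so the output is best read as a hint rather than a hard failure.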

diff --git a/.agents/skills/skill-creator/agents/analyzer.md b/.agents/skills/skill-creator/agents/analyzer.md
@@ -49,7 +49,6 @@ You receive these parameters in your prompt:
 ### Step 4: Analyze Instruction Following
 
 For each transcript, evaluate:
-
 - Did the agent follow the skill's explicit instructions?
 - Did the agent use the skill's provided tools/scripts?
 - Were there missed opportunities to leverage skill content?
@@ -60,7 +59,6 @@ Score instruction following 1-10 and note specific issues.
 ### Step 5: Identify Winner Strengths
 
 Determine what made the winner better:
-
 - Clearer instructions that led to better behavior?
 - Better scripts/tools that produced better output?
 - More comprehensive examples that guided edge cases?
@@ -71,7 +69,6 @@ Be specific. Quote from skills/transcripts where relevant.
 ### Step 6: Identify Loser Weaknesses
 
 Determine what held the loser back:
-
 - Ambiguous instructions that led to suboptimal choices?
 - Missing tools/scripts that forced workarounds?
 - Gaps in edge case coverage?
@@ -80,7 +77,6 @@ Determine what held the loser back:
 ### Step 7: Generate Improvement Suggestions
 
 Based on the analysis, produce actionable suggestions for improving the loser skill:
-
 - Specific instruction changes to make
 - Tools/scripts to add or modify
 - Examples to include
@@ -171,22 +167,22 @@ Write a JSON file with this structure:
 
 Use these categories to organize improvement suggestions:
 
-| Category         | Description                                    |
-| ---------------- | ---------------------------------------------- |
-| `instructions`   | Changes to the skill's prose instructions      |
-| `tools`          | Scripts, templates, or utilities to add/modify |
-| `examples`       | Example inputs/outputs to include              |
-| `error_handling` | Guidance for handling failures                 |
-| `structure`      | Reorganization of skill content                |
-| `references`     | External docs or resources to add              |
+| Category | Description |
+|----------|-------------|
+| `instructions` | Changes to the skill's prose instructions |
+| `tools` | Scripts, templates, or utilities to add/modify |
+| `examples` | Example inputs/outputs to include |
+| `error_handling` | Guidance for handling failures |
+| `structure` | Reorganization of skill content |
+| `references` | External docs or resources to add |
 
 ## Priority Levels
 
 - **high**: Would likely change the outcome of this comparison
 - **medium**: Would improve quality but may not change win/loss
 - **low**: Nice to have, marginal improvement
 
-______________________________________________________________________
+---
 
 # Analyzing Benchmark Results
 
@@ -215,7 +211,6 @@ You receive these parameters in your prompt:
 ### Step 2: Analyze Per-Assertion Patterns
 
 For each expectation across all runs:
-
 - Does it **always pass** in both configurations? (may not differentiate skill value)
 - Does it **always fail** in both configurations? (may be broken or beyond capability)
 - Does it **always pass with skill but fail without**? (skill clearly adds value here)
@@ -225,29 +220,25 @@ For each expectation across all runs:
 ### Step 3: Analyze Cross-Eval Patterns
 
 Look for patterns across evals:
-
 - Are certain eval types consistently harder/easier?
 - Do some evals show high variance while others are stable?
 - Are there surprising results that contradict expectations?
 
 ### Step 4: Analyze Metrics Patterns
 
 Look at time_seconds, tokens, tool_calls:
-
 - Does the skill significantly increase execution time?
 - Is there high variance in resource usage?
 - Are there outlier runs that skew the aggregates?
 
 ### Step 5: Generate Notes
 
 Write freeform observations as a list of strings. Each note should:
-
 - State a specific observation
 - Be grounded in the data (not speculation)
 - Help the user understand something the aggregate metrics don't show
 
 Examples:
-
 - "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
 - "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
 - "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
@@ -271,14 +262,12 @@ Save notes to `{output_path}` as a JSON array of strings:
 ## Guidelines
 
 **DO:**
-
 - Report what you observe in the data
 - Be specific about which evals, expectations, or runs you're referring to
 - Note patterns that aggregate metrics would hide
 - Provide context that helps interpret the numbers
 
 **DO NOT:**
-
 - Suggest improvements to the skill (that's for the improvement step, not benchmarking)
 - Make subjective quality judgments ("the output was good/bad")
 - Speculate about causes without evidence

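Note on the analyzer.md changes above: Step 2 ("Analyze Per-Assertion Patterns") describes three clear-cut cases for an expectation across runs. The sketch below shows one way that classification could look, assuming each expectation's results are available as two lists of booleans, one per configuration. The function name, the label strings, and the input shape are hypothetical and are not taken from `benchmark.json`.

```python
def classify_expectation(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Label one expectation from its pass/fail results in each configuration."""
    if not with_skill or not without_skill:
        return "insufficient_data"
    if all(with_skill) and all(without_skill):
        # Always passes in both: may not differentiate skill value.
        return "non_discriminating"
    if not any(with_skill) and not any(without_skill):
        # Always fails in both: may be broken or beyond capability.
        return "always_fails"
    if all(with_skill) and not any(without_skill):
        # Always passes with the skill but fails without it.
        return "skill_adds_value"
    # Anything else (including flaky results) needs a closer look at the runs.
    return "mixed"


if __name__ == "__main__":
    print(classify_expectation([True, True, True], [True, True, True]))
    print(classify_expectation([True, True, True], [False, False, False]))
    print(classify_expectation([True, False, True], [False, False, True]))
```

The "mixed" bucket is deliberately coarse; the variance and flakiness questions in Steps 2 and 3 still call for reading the individual runs.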
diff --git a/.agents/skills/skill-creator/agents/comparator.md b/.agents/skills/skill-creator/agents/comparator.md
@@ -39,23 +39,20 @@ You receive these parameters in your prompt:
 Based on the task, generate a rubric with two dimensions:
 
 **Content Rubric** (what the output contains):
-
-| Criterion    | 1 (Poor)                 | 3 (Acceptable)     | 5 (Excellent)        |
-| ------------ | ------------------------ | ------------------ | -------------------- |
-| Correctness  | Major errors             | Minor errors       | Fully correct        |
-| Completeness | Missing key elements     | Mostly complete    | All elements present |
-| Accuracy     | Significant inaccuracies | Minor inaccuracies | Accurate throughout  |
+| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
+|-----------|----------|----------------|---------------|
+| Correctness | Major errors | Minor errors | Fully correct |
+| Completeness | Missing key elements | Mostly complete | All elements present |
+| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
 
 **Structure Rubric** (how the output is organized):
-
-| Criterion    | 1 (Poor)            | 3 (Acceptable)       | 5 (Excellent)            |
-| ------------ | ------------------- | -------------------- | ------------------------ |
-| Organization | Disorganized        | Reasonably organized | Clear, logical structure |
-| Formatting   | Inconsistent/broken | Mostly consistent    | Professional, polished   |
-| Usability    | Difficult to use    | Usable with effort   | Easy to use              |
+| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
+|-----------|----------|----------------|---------------|
+| Organization | Disorganized | Reasonably organized | Clear, logical structure |
+| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
+| Usability | Difficult to use | Usable with effort | Easy to use |
 
 Adapt criteria to the specific task. For example:
-
 - PDF form → "Field alignment", "Text readability", "Data placement"
 - Document → "Section structure", "Heading hierarchy", "Paragraph flow"
 - Data output → "Schema correctness", "Data types", "Completeness"

diff --git a/.agents/skills/skill-creator/agents/grader.md b/.agents/skills/skill-creator/agents/grader.md
@@ -45,13 +45,11 @@ For each expectation:
 Beyond the predefined expectations, extract implicit claims from the outputs and verify them:
 
 1. **Extract claims** from the transcript and outputs:
-
    - Factual statements ("The form has 12 fields")
    - Process claims ("Used pypdf to fill the form")
    - Quality claims ("All fields were filled correctly")
 
 2. **Verify each claim**:
-
    - **Factual claims**: Can be checked against the outputs or external sources
    - **Process claims**: Can be verified from the transcript
    - **Quality claims**: Evaluate whether the claim is justified
@@ -63,7 +61,6 @@ This catches issues that predefined expectations might miss.
 ### Step 5: Read User Notes
 
 If `{outputs_dir}/user_notes.md` exists:
-
 1. Read it and note any uncertainties or issues flagged by the executor
 2. Include relevant concerns in the grading output
 3. These may reveal problems even when expectations pass
@@ -75,7 +72,6 @@ After grading, consider whether the evals themselves could be improved. Only sur
 Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion *discriminating*: it passes when the skill genuinely succeeds and fails when it doesn't.
 
 Suggestions worth raising:
-
 - An assertion that passed but would also pass for a clearly wrong output (e.g., checking filename existence but not file content)
 - An important outcome you observed — good or bad — that no assertion covers at all
 - An assertion that can't actually be verified from the available outputs
@@ -89,13 +85,11 @@ Save results to `{outputs_dir}/../grading.json` (sibling to outputs_dir).
 ## Grading Criteria
 
 **PASS when**:
-
 - The transcript or outputs clearly demonstrate the expectation is true
 - Specific evidence can be cited
 - The evidence reflects genuine substance, not just surface compliance (e.g., a file exists AND contains correct content, not just the right filename)
 
 **FAIL when**:
-
 - No evidence found for the expectation
 - Evidence contradicts the expectation
 - The expectation cannot be verified from available information
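
Note on grading output: SKILL.md states that the `grading.json` expectations array must use the fields `text`, `passed`, and `evidence`, because the viewer depends on those exact names. A minimal sketch of a check for that rule follows. It assumes the array sits under a top-level `expectations` key; the full schema lives in `references/schemas.md`, which is not shown in this diff, and the function name is hypothetical.

```python
import json

REQUIRED_FIELDS = ("text", "passed", "evidence")


def find_expectation_problems(grading: dict) -> list[str]:
    """Return a list of problems with the expectations array (empty if none)."""
    expectations = grading.get("expectations")  # assumed top-level key
    if not isinstance(expectations, list):
        return ["'expectations' is missing or not a list"]

    problems: list[str] = []
    for i, item in enumerate(expectations):
        if not isinstance(item, dict):
            problems.append(f"expectations[{i}] is not an object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            problems.append(f"expectations[{i}] is missing: {', '.join(missing)}")
    return problems


if __name__ == "__main__":
    # The second entry uses the name/met/details variant that SKILL.md warns against.
    sample = json.loads(
        '{"expectations": ['
        '{"text": "Output is a PDF file", "passed": true, "evidence": "report.pdf exists"},'
        '{"name": "Has 12 fields", "met": false, "details": "found 10"}'
        "]}"
    )
    for problem in find_expectation_problems(sample):
        print(problem)
```

Running a check like this before launching the viewer would surface field-name drift early, though it only covers field presence, not value types.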