Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 6 additions & 2 deletions cmd/claude-setup/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
//
// go run github.com/zerosixty/samurai/cmd/claude-setup@latest
//
// It creates .claude/skills/samurai/ with SKILL.md, api.md and pitfalls.md
// It creates .claude/skills/samurai/ with SKILL.md, api.md, naming.md and pitfalls.md
// so Claude Code understands the samurai API when writing or modifying tests.
package main

Expand All @@ -23,6 +23,9 @@ var skillContent string
//go:embed skill/api.md
var apiContent string

//go:embed skill/naming.md
var namingContent string

//go:embed skill/pitfalls.md
var pitfallsContent string

Expand Down Expand Up @@ -74,6 +77,7 @@ func main() {
files := map[string]string{
"SKILL.md": fmt.Sprintf("%s\n%s\n", skillContent, skillMarker(version)),
"api.md": apiContent,
"naming.md": namingContent,
"pitfalls.md": pitfallsContent,
}

Expand All @@ -88,7 +92,7 @@ func main() {
oldSamurai := filepath.Join(skillDir, "samurai.md")
os.Remove(oldSamurai) // ignore error — may not exist

fmt.Println("Created .claude/skills/samurai/ (SKILL.md, api.md, pitfalls.md)")
fmt.Println("Created .claude/skills/samurai/ (SKILL.md, api.md, naming.md, pitfalls.md)")
fmt.Println("Claude Code will use samurai context when working with tests.")
fmt.Println("Invoke /samurai to load the reference manually.")
}
Expand Down
14 changes: 14 additions & 0 deletions cmd/claude-setup/skill/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,19 @@ description: "Samurai scoped testing framework for Go (github.com/zerosixty/samu
- Parallel by default; `samurai.Sequential()` forces order; `go test -parallel N` controls concurrency
- Assertion-agnostic: use `w.Testing()` with any library (testify, is, stdlib)

## Naming

Every `s.Test("name", ...)` — at every tree level — names a business action or outcome from the actor's POV. Not the function under test, the assertion, or the input shape.

Forbidden patterns:

- **Identifier leak** — CamelCase tokens matching Go identifiers in the Test body. `"UpdateQuote calls..."` leaks `UpdateQuote`.
- **HTTP code leak** — bare 3-digit codes (`\b[1-5]\d{2}\b`) or `4xx/5xx` shorthand. `"returns 409"` → `"rejects duplicate id"`.
- **Assertion phrasing** — `is / equals / returns / not nil / has length / contains`. `"err is not nil"` → `"rejects invalid signature"`.
- **Structure phrasing** — `empty / nil / with N / map / slice / struct`. `"with empty slice"` → `"borrower has no past loans"`.

Risky-looking words used in a domain-natural sense (`borrower returns the equipment`, `has insufficient collateral`, `no past loans`) are clean — only flag when the word describes the assertion, the data shape, or the identifier. Flag matches, read the body, propose 1–2 replacements. See [naming.md](naming.md) for the protocol.

## File layout

For samurai test files, top-to-bottom:
Expand Down Expand Up @@ -60,4 +73,5 @@ Cost: samurai re-executes the full path per leaf — same cost as manually dupli
## Additional resources (lazy-load — read only when triggered)

- **Read [api.md](api.md) only when** you need the worked-out `samurai.Run` / `RunWith` API example (e.g. wiring a new test scope, writing a custom context type for `RunWith`). The "When to Use" / "Detection protocol" sections above are sufficient for review and detection work.
- **Read [naming.md](naming.md) only when** authoring a new `s.Test(...)` name and the right phrasing isn't obvious, or reviewing a samurai test where at least one name matches a forbidden pattern above. The inline "Naming" section is enough for clearly-good or clearly-bad names; load this when you need the actor/verb/outcome replacement protocol.
- **Read [pitfalls.md](pitfalls.md) only when** debugging a panic, a validation error, or a wrong pattern (e.g. asserting on a parent's result inside a child Test, declaring vars where they should be assigned). Skip otherwise — the rules above already cover the common authoring decisions.
4 changes: 2 additions & 2 deletions cmd/claude-setup/skill/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ func TestFeature(t *testing.T) {
assert.NotEmpty(w.Testing(), user.Role) // 6. one action + assertions = one Test
})

s.Test("list empty", func(ctx context.Context, w samurai.W) {
s.Test("lists no users initially", func(ctx context.Context, w samurai.W) {
users, err := db.ListUsers(ctx) // 7. own db — isolated from "create user" paths
assert.NoError(w.Testing(), err)
assert.Empty(w.Testing(), users)
Expand Down Expand Up @@ -59,7 +59,7 @@ func TestWithAssertions(t *testing.T) {
samurai.RunWith(t, func(w samurai.W) *MyCtx {
return &MyCtx{BaseContext: w, Assertions: assert.New(w.Testing())}
}, func(s *S) {
s.Test("check", func(_ context.Context, c *MyCtx) {
s.Test("the answer matches the spec", func(_ context.Context, c *MyCtx) {
c.Equal(42, value) // assertion methods directly on context
})
})
Expand Down
70 changes: 70 additions & 0 deletions cmd/claude-setup/skill/naming.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
# Samurai Test Naming

The "Naming" section in SKILL.md states the bar and the four forbidden patterns. This file is the replacement protocol and worked examples — load it when you have a bad name in hand and need to propose a good one.

## Replacement protocol

Do not look at the diff hunk header, the parent test function name, or the function under test. **Open the `Test()` callback body.** Identify three things from the code in that body:

1. **Actor** — who acts. Usually a domain noun: `borrower`, `admin`, `provider`, `API caller`, `counterparty`. If there is no clear actor, the system itself (`the service`, `the API`) is acceptable but rarely the best choice.
2. **Verb** — the action call that mutates state or makes a request: `db.CreateUser`, `client.UpdateQuote`, `svc.Settle`. Express it as a domain verb (`creates`, `updates`, `settles`), not the Go method name.
3. **Outcome** — what the assertions in the same callback verify: a state change, a rejection, a returned value's domain meaning.

Write a 3–7 word, lowercase, present-tense phrase combining (1)+(2)+(3) — or (1)+(2) when the outcome is implied by success. Propose 1–2 alternatives so the author can pick.

If you cannot identify the actor or the outcome from the body, the `Test()` is doing too much or too little — flag this before renaming. Either the action is missing (a setup-only test) or many actions are bundled (split it).

## Detection heuristics

| Pattern | Why it's bad | What to look at in the body instead |
|---|---|---|
| CamelCase token matching a Go identifier in the Test body | Couples the test name to the SUT symbol; rename of the function breaks the name | The effect of that call in domain terms |
| Bare HTTP status code (`200 → accepts`, `404 → reports missing`, `409 → rejects duplicate id`) | Leaks the HTTP layer; the same behaviour at gRPC would be unnamable | The outcome verb the code stands for, not the code itself |
| `is / equals / returns / not nil / has length / contains` describing a value | Describes the assertion, not the step | The action that produced the value; the assertion is implementation |
| `empty / nil / with N keys / map / slice / struct` | Leaks the input data structure | The domain meaning of the input: `no past loans`, `missing preferences`, `unset filter` |

## Bad → good (grounded in real samurai trees)

- `"UpdateQuote calls produce audit records with matching client quote ids"` — identifier leak (`UpdateQuote`)
→ `"updating a quote writes an audit record"`
→ `"the audit record links to the originating client quote"`

- `"err is not nil"` — assertion phrasing
→ `"rejects request with invalid signature"`
→ `"refuses unauthenticated callers"`

- `"returns 409"` — HTTP leak
→ `"rejects duplicate id"`
→ `"refuses to create a provider that already exists"`

- `"with empty slice"` — structure leak
→ `"returns no matches when the filter is unset"`
→ `"borrower has no past loans"`

- `"CreateProvider returns nil"` — identifier leak + assertion phrasing
→ `"creates a new provider successfully"`
→ `"a fresh provider id is issued"`

## Uniformly across tree levels

The bar is the same at top, intermediate, and leaf. Example:

```
"borrower opens a credit line" // top — business setup, not "CreateCreditLine returns ok"
├─ "borrower draws on the credit line" // intermediate — action, not "Draw=true"
│ ├─ "balance reflects the drawdown" // leaf — outcome, not "balance equals 100"
│ └─ "draw is rejected when limit is exceeded"
└─ "counterparty accepts KYB access"
```

No level mentions the Go type (`CreditLine`), the method (`Draw`), the HTTP code, the assertion verb (`equals`), or the data shape (`with balance > 0`).

## During review

When reviewing a samurai test, scan every `s.Test(...)` string literal against the heuristics above. For each hit:

1. Quote the offending name and identify which forbidden pattern it matches.
2. Read the `Test()` body and derive actor + verb + outcome.
3. Propose 1–2 concrete replacements.

Do not silently rewrite — surface the issue and the proposal so the author confirms domain accuracy.
91 changes: 91 additions & 0 deletions eval/naming-length/RESULTS-HARD.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
# Naming-section length A/B — hardened eval (no lazy load + mixed fixtures + N=3)

## Why a second eval

The first eval (`RESULTS.md`) was non-discriminating — every variant scored 20/20 because all variants pointed to the same lazy-loaded `naming.md` (681 words) and subagents loaded it. The inline section's own carrying capacity was never tested.

This eval fixes three things:

1. **No lazy load.** Variants live in `variants/no-load/v{1..5}.md`, stripped of the `[naming.md]` pointer. The directory has no `naming.md` sibling. The inline text is the only guidance.
2. **Mixed fixtures.** New `F{1..5}_hard.go.txt` mix textbook violations + borderline cases (passive voice, "has X" used in domain phrasing, "rejected" as domain outcome, status-range shorthand `4xx`) + clean controls with risky-looking words.
3. **N=3 per cell.** 5 variants × 5 fixtures × 3 runs = 75 subagent invocations, clean context each.

Ground truth: 21 verdicts per run, 7 FLAG + 14 PASS — see `fixtures/expectations-hard.json`. Each name labelled `textbook`, `borderline`, or `clean` for diagnostic slicing.

## Variant sizes

| Variant | Words | What's in it |
|---|---|---|
| v1 | 206 | All 4 forbidden patterns + worked examples + safe-examples carve-out paragraph. |
| v2 | 129 | All 4 patterns with bulleted examples + safe-examples one-liner. |
| v3 | 49 | All 4 patterns with parenthesized examples + brief safe-words clause. |
| v4 | 31 | All 4 patterns in one sentence, no safe-examples clause. |
| v5 | 16 | "business actions, never function names, status codes, or assertions" — no structure clause. |

## Scores

| Variant | Total correct | Score | Per-run (run 1 / 2 / 3) | Failure mode |
|---|---|---|---|---|
| v1 (206w) | 58/63 | 92.1% | 20 / 18 / 20 | Over-flags passive voice (`"first loan is added to the history"` — all 3 runs FLAGged for "is") |
| **v2 (129w)** | **62/63** | **98.4%** | 21 / 20 / 21 | One stray false-FLAG; everything else stable. |
| v3 (49w) | 60/63 | 95.2% | 20 / 20 / 20 | Consistently false-FLAGs `"the request is rejected for missing auth"` (no carve-out for passive domain outcomes). |
| v4 (31w) | 59/63 | 93.7% | 20 / 19 / 20 | Consistently false-FLAGs `"borrower has insufficient collateral"` — v4 lists `has X` as forbidden but drops the safe-examples clause. |
| v5 (16w) | 54/63 | 85.7% | 18 / 16 / 18 | **Structure leak detection collapses.** All 3 runs PASS `"borrower with empty loan history"` (textbook structure leak — should FLAG). Also misses `"the new provider has the requested name"`. |

## Per-fixture breakdown

| | F1 (id) | F2 (assert) | F3 (struct) | F4 (HTTP) | F5 (control) |
|---|---|---|---|---|---|
| v1 | 12/12 | 11/12 | 9/12 | 11/12 | 15/15 |
| **v2** | **12/12** | **12/12** | **12/12** | **11/12** | **15/15** |
| v3 | 12/12 | 12/12 | 12/12 | 9/12 | 15/15 |
| v4 | 12/12 | 12/12 | 12/12 | 11/12 | 12/15 |
| v5 | 12/12 | 9/12 | 8/12 | 10/12 | 15/15 |

## Where each variant breaks

- **v1's over-detail backfires.** Adding `is` to the assertion-verb list trips passive constructions like `"is added"`. The longer prose gives the subagent more rules to misapply.
- **v2 is the sweet spot.** Lists all 4 patterns with concrete examples *and* has the safe-examples disclaimer. Enough constraint, not too much.
- **v3 holds on 3/4 categories** but its briefer safe-words clause isn't enough to protect `"rejected"` as a domain outcome.
- **v4 loses the safe-examples clause.** "has X" is listed as forbidden but `"borrower has insufficient collateral"` is exactly the canonical domain phrasing. Without the carve-out, the variant punishes correct names.
- **v5 collapses on structure-leak detection.** The 16-word one-liner has no `structure / data-shape / empty / nil` cue. Three textbook leaks pass through (`"with empty loan history"`, `"has the requested name"`), and `"empty"` slips entirely (all 3 runs).

The degradation isn't monotonic in length — it's monotonic in **rule coverage**. v1 has too much detail, v5 has too little, v2–v4 trade precision against verbosity.

## Why v2 beats v1

v1 (Variant A's current naming section) is the section originally drafted as "good enough". The eval shows it's *worse* than a 129-word version. v1's extra detail — explicit `is` in the assertion-verb list, regex hints, repeated examples — makes the rule too aggressive on perfectly fine passive constructions. Trimming detail to v2 actually raises score.

## Recommendation

**Ship v2** as the inline `## Naming` section in `cmd/claude-setup/skill/SKILL.md`, plus the full `naming.md` lazy-loaded.

Rationale:
- v2 scored highest in the hardened eval (98.4% vs 92.1% for v1).
- v2 holds all 5 categories with one stray miss; no consistent failure mode.
- The `naming.md` lazy load remains net positive on borderline cases (it adds the actor/verb/outcome protocol), but is no longer load-bearing for basic rule application.
- Cost: 129 words inline, loaded with every samurai-skill activation. Acceptable.

Concrete v2 to ship (back-edit the naming.md pointer in for production):

```markdown
## Naming

Every `s.Test("name", ...)` — at every tree level — names a business action or outcome from the actor's POV. Not the function under test, the assertion, or the input shape.

Forbidden patterns:

- **Identifier leak** — CamelCase tokens matching Go identifiers in the Test body. `"UpdateQuote calls..."` leaks `UpdateQuote`.
- **HTTP code leak** — bare 3-digit codes (`\b[1-5]\d{2}\b`) or `4xx/5xx` shorthand. `"returns 409"` → `"rejects duplicate id"`.
- **Assertion phrasing** — `is / equals / returns / not nil / has length / contains / has`. `"err is not nil"` → `"rejects invalid signature"`.
- **Structure phrasing** — `empty / nil / with N / map / slice / struct`. `"with empty slice"` → `"borrower has no past loans"`.

Risky-looking words used in a domain-natural sense (`borrower returns the equipment`, `has insufficient collateral`, `no past loans`) are clean — only flag when the word describes the assertion, the data shape, or the identifier. Flag matches, read the body, propose 1–2 replacements. See [naming.md](naming.md) for the protocol.
```

## Eval quality notes

- N=3 was enough to expose consistent failure modes (v3, v4, v5 each had a 3/3 false-FLAG or false-PASS). For ranking purposes more runs aren't needed.
- The 7-FLAG / 14-PASS asymmetry biases the eval toward variants that PASS too eagerly (v5 still scored 86% by passing everything in F5). Real samurai trees have closer to 50/50 — re-run on production samples to confirm v2 holds.
- Ground-truth for `"first loan is added to the history"` is a judgement call. The naming.md examples use passive forms (`"first loan is added"` is the spirit of `"a fresh provider id is issued"`). If you treat this as FLAG instead, v1 jumps from 92.1% to ~96.8% — still behind v2.
- Cost: 75 subagents × ~21k tokens ≈ 1.6M tokens. The original suspect eval was 25 × 23k ≈ 575k. Worth the 2.8× spend to actually answer the question.
Loading