zerosixty · stepan-romankov · May 25, 2026 · May 25, 2026 · May 25, 2026 · May 25, 2026
diff --git a/cmd/claude-setup/main.go b/cmd/claude-setup/main.go
@@ -4,7 +4,7 @@
 //
 //	go run github.com/zerosixty/samurai/cmd/claude-setup@latest
 //
-// It creates .claude/skills/samurai/ with SKILL.md, api.md and pitfalls.md
+// It creates .claude/skills/samurai/ with SKILL.md, api.md, naming.md and pitfalls.md
 // so Claude Code understands the samurai API when writing or modifying tests.
 package main
 
@@ -23,6 +23,9 @@ var skillContent string
 //go:embed skill/api.md
 var apiContent string
 
+//go:embed skill/naming.md
+var namingContent string
+
 //go:embed skill/pitfalls.md
 var pitfallsContent string
 
@@ -74,6 +77,7 @@ func main() {
 	files := map[string]string{
 		"SKILL.md":    fmt.Sprintf("%s\n%s\n", skillContent, skillMarker(version)),
 		"api.md":      apiContent,
+		"naming.md":   namingContent,
 		"pitfalls.md": pitfallsContent,
 	}
 
@@ -88,7 +92,7 @@ func main() {
 	oldSamurai := filepath.Join(skillDir, "samurai.md")
 	os.Remove(oldSamurai) // ignore error — may not exist
 
-	fmt.Println("Created .claude/skills/samurai/ (SKILL.md, api.md, pitfalls.md)")
+	fmt.Println("Created .claude/skills/samurai/ (SKILL.md, api.md, naming.md, pitfalls.md)")
 	fmt.Println("Claude Code will use samurai context when working with tests.")
 	fmt.Println("Invoke /samurai to load the reference manually.")
 }

diff --git a/cmd/claude-setup/skill/SKILL.md b/cmd/claude-setup/skill/SKILL.md
@@ -20,6 +20,19 @@ description: "Samurai scoped testing framework for Go (github.com/zerosixty/samu
 - Parallel by default; `samurai.Sequential()` forces order; `go test -parallel N` controls concurrency
 - Assertion-agnostic: use `w.Testing()` with any library (testify, is, stdlib)
 
+## Naming
+
+Every `s.Test("name", ...)` — at every tree level — names a business action or outcome from the actor's POV. Not the function under test, the assertion, or the input shape.
+
+Forbidden patterns:
+
+- **Identifier leak** — CamelCase tokens matching Go identifiers in the Test body. `"UpdateQuote calls..."` leaks `UpdateQuote`.
+- **HTTP code leak** — bare 3-digit codes (`\b[1-5]\d{2}\b`) or `4xx/5xx` shorthand. `"returns 409"` → `"rejects duplicate id"`.
+- **Assertion phrasing** — `is / equals / returns / not nil / has length / contains`. `"err is not nil"` → `"rejects invalid signature"`.
+- **Structure phrasing** — `empty / nil / with N / map / slice / struct`. `"with empty slice"` → `"borrower has no past loans"`.
+
+Risky-looking words used in a domain-natural sense (`borrower returns the equipment`, `has insufficient collateral`, `no past loans`) are clean — only flag when the word describes the assertion, the data shape, or the identifier. Flag matches, read the body, propose 1–2 replacements. See [naming.md](naming.md) for the protocol.
+
 ## File layout
 
 For samurai test files, top-to-bottom:
@@ -60,4 +73,5 @@ Cost: samurai re-executes the full path per leaf — same cost as manually dupli
 ## Additional resources (lazy-load — read only when triggered)
 
 - **Read [api.md](api.md) only when** you need the worked-out `samurai.Run` / `RunWith` API example (e.g. wiring a new test scope, writing a custom context type for `RunWith`). The "When to Use" / "Detection protocol" sections above are sufficient for review and detection work.
+- **Read [naming.md](naming.md) only when** authoring a new `s.Test(...)` name and the right phrasing isn't obvious, or reviewing a samurai test where at least one name matches a forbidden pattern above. The inline "Naming" section is enough for clearly-good or clearly-bad names; load this when you need the actor/verb/outcome replacement protocol.
 - **Read [pitfalls.md](pitfalls.md) only when** debugging a panic, a validation error, or a wrong pattern (e.g. asserting on a parent's result inside a child Test, declaring vars where they should be assigned). Skip otherwise — the rules above already cover the common authoring decisions.
diff --git a/cmd/claude-setup/skill/api.md b/cmd/claude-setup/skill/api.md
@@ -25,7 +25,7 @@ func TestFeature(t *testing.T) {
                 assert.NotEmpty(w.Testing(), user.Role)  // 6. one action + assertions = one Test
             })
 
-            s.Test("list empty", func(ctx context.Context, w samurai.W) {
+            s.Test("lists no users initially", func(ctx context.Context, w samurai.W) {
                 users, err := db.ListUsers(ctx)  // 7. own db — isolated from "create user" paths
                 assert.NoError(w.Testing(), err)
                 assert.Empty(w.Testing(), users)
@@ -59,7 +59,7 @@ func TestWithAssertions(t *testing.T) {
     samurai.RunWith(t, func(w samurai.W) *MyCtx {
         return &MyCtx{BaseContext: w, Assertions: assert.New(w.Testing())}
     }, func(s *S) {
-        s.Test("check", func(_ context.Context, c *MyCtx) {
+        s.Test("the answer matches the spec", func(_ context.Context, c *MyCtx) {
             c.Equal(42, value)  // assertion methods directly on context
         })
     })

diff --git a/cmd/claude-setup/skill/naming.md b/cmd/claude-setup/skill/naming.md
@@ -0,0 +1,70 @@
+# Samurai Test Naming
+
+The "Naming" section in SKILL.md states the bar and the four forbidden patterns. This file is the replacement protocol and worked examples — load it when you have a bad name in hand and need to propose a good one.
+
+## Replacement protocol
+
+Do not look at the diff hunk header, the parent test function name, or the function under test. **Open the `Test()` callback body.** Identify three things from the code in that body:
+
+1. **Actor** — who acts. Usually a domain noun: `borrower`, `admin`, `provider`, `API caller`, `counterparty`. If there is no clear actor, the system itself (`the service`, `the API`) is acceptable but rarely the best choice.
+2. **Verb** — the action call that mutates state or makes a request: `db.CreateUser`, `client.UpdateQuote`, `svc.Settle`. Express it as a domain verb (`creates`, `updates`, `settles`), not the Go method name.
+3. **Outcome** — what the assertions in the same callback verify: a state change, a rejection, a returned value's domain meaning.
+
+Write a 3–7 word, lowercase, present-tense phrase combining (1)+(2)+(3) — or (1)+(2) when the outcome is implied by success. Propose 1–2 alternatives so the author can pick.
+
+If you cannot identify the actor or the outcome from the body, the `Test()` is doing too much or too little — flag this before renaming. Either the action is missing (a setup-only test) or many actions are bundled (split it).
+
+## Detection heuristics
+
+| Pattern | Why it's bad | What to look at in the body instead |
+|---|---|---|
+| CamelCase token matching a Go identifier in the Test body | Couples the test name to the SUT symbol; rename of the function breaks the name | The effect of that call in domain terms |
+| Bare HTTP status code (`200 → accepts`, `404 → reports missing`, `409 → rejects duplicate id`) | Leaks the HTTP layer; the same behaviour at gRPC would be unnamable | The outcome verb the code stands for, not the code itself |
+| `is / equals / returns / not nil / has length / contains` describing a value | Describes the assertion, not the step | The action that produced the value; the assertion is implementation |
+| `empty / nil / with N keys / map / slice / struct` | Leaks the input data structure | The domain meaning of the input: `no past loans`, `missing preferences`, `unset filter` |
+
+## Bad → good (grounded in real samurai trees)
+
+- `"UpdateQuote calls produce audit records with matching client quote ids"` — identifier leak (`UpdateQuote`)
+  → `"updating a quote writes an audit record"`
+  → `"the audit record links to the originating client quote"`
+
+- `"err is not nil"` — assertion phrasing
+  → `"rejects request with invalid signature"`
+  → `"refuses unauthenticated callers"`
+
+- `"returns 409"` — HTTP leak
+  → `"rejects duplicate id"`
+  → `"refuses to create a provider that already exists"`
+
+- `"with empty slice"` — structure leak
+  → `"returns no matches when the filter is unset"`
+  → `"borrower has no past loans"`
+
+- `"CreateProvider returns nil"` — identifier leak + assertion phrasing
+  → `"creates a new provider successfully"`
+  → `"a fresh provider id is issued"`
+
+## Uniformly across tree levels
+
+The bar is the same at top, intermediate, and leaf. Example:
+
+```
+"borrower opens a credit line"           // top — business setup, not "CreateCreditLine returns ok"
+  ├─ "borrower draws on the credit line" // intermediate — action, not "Draw=true"
+  │    ├─ "balance reflects the drawdown"  // leaf — outcome, not "balance equals 100"
+  │    └─ "draw is rejected when limit is exceeded"
+  └─ "counterparty accepts KYB access"
+```
+
+No level mentions the Go type (`CreditLine`), the method (`Draw`), the HTTP code, the assertion verb (`equals`), or the data shape (`with balance > 0`).
+
+## During review
+
+When reviewing a samurai test, scan every `s.Test(...)` string literal against the heuristics above. For each hit:
+
+1. Quote the offending name and identify which forbidden pattern it matches.
+2. Read the `Test()` body and derive actor + verb + outcome.
+3. Propose 1–2 concrete replacements.
+
+Do not silently rewrite — surface the issue and the proposal so the author confirms domain accuracy.
diff --git a/eval/naming-length/RESULTS-HARD.md b/eval/naming-length/RESULTS-HARD.md
@@ -0,0 +1,91 @@
+# Naming-section length A/B — hardened eval (no lazy load + mixed fixtures + N=3)
+
+## Why a second eval
+
+The first eval (`RESULTS.md`) was non-discriminating — every variant scored 20/20 because all variants pointed to the same lazy-loaded `naming.md` (681 words) and subagents loaded it. The inline section's own carrying capacity was never tested.
+
+This eval fixes three things:
+
+1. **No lazy load.** Variants live in `variants/no-load/v{1..5}.md`, stripped of the `[naming.md]` pointer. The directory has no `naming.md` sibling. The inline text is the only guidance.
+2. **Mixed fixtures.** New `F{1..5}_hard.go.txt` mix textbook violations + borderline cases (passive voice, "has X" used in domain phrasing, "rejected" as domain outcome, status-range shorthand `4xx`) + clean controls with risky-looking words.
+3. **N=3 per cell.** 5 variants × 5 fixtures × 3 runs = 75 subagent invocations, clean context each.
+
+Ground truth: 21 verdicts per run, 7 FLAG + 14 PASS — see `fixtures/expectations-hard.json`. Each name labelled `textbook`, `borderline`, or `clean` for diagnostic slicing.
+
+## Variant sizes
+
+| Variant | Words | What's in it |
+|---|---|---|
+| v1 | 206 | All 4 forbidden patterns + worked examples + safe-examples carve-out paragraph. |
+| v2 | 129 | All 4 patterns with bulleted examples + safe-examples one-liner. |
+| v3 | 49 | All 4 patterns with parenthesized examples + brief safe-words clause. |
+| v4 | 31 | All 4 patterns in one sentence, no safe-examples clause. |
+| v5 | 16 | "business actions, never function names, status codes, or assertions" — no structure clause. |
+
+## Scores
+
+| Variant | Total correct | Score | Per-run (run 1 / 2 / 3) | Failure mode |
+|---|---|---|---|---|
+| v1 (206w) | 58/63 | 92.1% | 20 / 18 / 20 | Over-flags passive voice (`"first loan is added to the history"` — all 3 runs FLAGged for "is") |
+| **v2 (129w)** | **62/63** | **98.4%** | 21 / 20 / 21 | One stray false-FLAG; everything else stable. |
+| v3 (49w) | 60/63 | 95.2% | 20 / 20 / 20 | Consistently false-FLAGs `"the request is rejected for missing auth"` (no carve-out for passive domain outcomes). |
+| v4 (31w) | 59/63 | 93.7% | 20 / 19 / 20 | Consistently false-FLAGs `"borrower has insufficient collateral"` — v4 lists `has X` as forbidden but drops the safe-examples clause. |
+| v5 (16w) | 54/63 | 85.7% | 18 / 16 / 18 | **Structure leak detection collapses.** All 3 runs PASS `"borrower with empty loan history"` (textbook structure leak — should FLAG). Also misses `"the new provider has the requested name"`. |
+
+## Per-fixture breakdown
+
+| | F1 (id) | F2 (assert) | F3 (struct) | F4 (HTTP) | F5 (control) |
+|---|---|---|---|---|---|
+| v1 | 12/12 | 11/12 | 9/12 | 11/12 | 15/15 |
+| **v2** | **12/12** | **12/12** | **12/12** | **11/12** | **15/15** |
+| v3 | 12/12 | 12/12 | 12/12 | 9/12 | 15/15 |
+| v4 | 12/12 | 12/12 | 12/12 | 11/12 | 12/15 |
+| v5 | 12/12 | 9/12 | 8/12 | 10/12 | 15/15 |
+
+## Where each variant breaks
+
+- **v1's over-detail backfires.** Adding `is` to the assertion-verb list trips passive constructions like `"is added"`. The longer prose gives the subagent more rules to misapply.
+- **v2 is the sweet spot.** Lists all 4 patterns with concrete examples *and* has the safe-examples disclaimer. Enough constraint, not too much.
+- **v3 holds on 3/4 categories** but its briefer safe-words clause isn't enough to protect `"rejected"` as a domain outcome.
+- **v4 loses the safe-examples clause.** "has X" is listed as forbidden but `"borrower has insufficient collateral"` is exactly the canonical domain phrasing. Without the carve-out, the variant punishes correct names.
+- **v5 collapses on structure-leak detection.** The 16-word one-liner has no `structure / data-shape / empty / nil` cue. Three textbook leaks pass through (`"with empty loan history"`, `"has the requested name"`), and `"empty"` slips entirely (all 3 runs).
+
+The degradation isn't monotonic in length — it's monotonic in **rule coverage**. v1 has too much detail, v5 has too little, v2–v4 trade precision against verbosity.
+
+## Why v2 beats v1
+
+v1 (Variant A's current naming section) is the section originally drafted as "good enough". The eval shows it's *worse* than a 129-word version. v1's extra detail — explicit `is` in the assertion-verb list, regex hints, repeated examples — makes the rule too aggressive on perfectly fine passive constructions. Trimming detail to v2 actually raises score.
+
+## Recommendation
+
+**Ship v2** as the inline `## Naming` section in `cmd/claude-setup/skill/SKILL.md`, plus the full `naming.md` lazy-loaded.
+
+Rationale:
+- v2 scored highest in the hardened eval (98.4% vs 92.1% for v1).
+- v2 holds all 5 categories with one stray miss; no consistent failure mode.
+- The `naming.md` lazy load remains net positive on borderline cases (it adds the actor/verb/outcome protocol), but is no longer load-bearing for basic rule application.
+- Cost: 129 words inline, loaded with every samurai-skill activation. Acceptable.
+
+Concrete v2 to ship (back-edit the naming.md pointer in for production):
+
+```markdown
+## Naming
+
+Every `s.Test("name", ...)` — at every tree level — names a business action or outcome from the actor's POV. Not the function under test, the assertion, or the input shape.
+
+Forbidden patterns:
+
+- **Identifier leak** — CamelCase tokens matching Go identifiers in the Test body. `"UpdateQuote calls..."` leaks `UpdateQuote`.
+- **HTTP code leak** — bare 3-digit codes (`\b[1-5]\d{2}\b`) or `4xx/5xx` shorthand. `"returns 409"` → `"rejects duplicate id"`.
+- **Assertion phrasing** — `is / equals / returns / not nil / has length / contains / has`. `"err is not nil"` → `"rejects invalid signature"`.
+- **Structure phrasing** — `empty / nil / with N / map / slice / struct`. `"with empty slice"` → `"borrower has no past loans"`.
+
+Risky-looking words used in a domain-natural sense (`borrower returns the equipment`, `has insufficient collateral`, `no past loans`) are clean — only flag when the word describes the assertion, the data shape, or the identifier. Flag matches, read the body, propose 1–2 replacements. See [naming.md](naming.md) for the protocol.
+```
+
+## Eval quality notes
+
+- N=3 was enough to expose consistent failure modes (v3, v4, v5 each had a 3/3 false-FLAG or false-PASS). For ranking purposes more runs aren't needed.
+- The 7-FLAG / 14-PASS asymmetry biases the eval toward variants that PASS too eagerly (v5 still scored 86% by passing everything in F5). Real samurai trees have closer to 50/50 — re-run on production samples to confirm v2 holds.
+- Ground-truth for `"first loan is added to the history"` is a judgement call. The naming.md examples use passive forms (`"first loan is added"` is the spirit of `"a fresh provider id is issued"`). If you treat this as FLAG instead, v1 jumps from 92.1% to ~96.8% — still behind v2.
+- Cost: 75 subagents × ~21k tokens ≈ 1.6M tokens. The original suspect eval was 25 × 23k ≈ 575k. Worth the 2.8× spend to actually answer the question.