10 changes: 10 additions & 0 deletions benchmark/CLAUDE.skill.md
@@ -0,0 +1,10 @@
This repository has .graph.* files next to source files containing code relationship data from Supermodel.

The naming convention: for src/Foo.py the graph file is src/Foo.graph.py (insert .graph before the extension). Each graph file has up to three sections:
- [deps] — what this file imports and what imports it
- [calls] — function call relationships with file paths and line numbers
- [impact] — blast radius: risk level, affected domains, direct/transitive dependents

**Read the .graph file before the source file.** It shows the full dependency and call picture in far fewer tokens. Construct the path directly — don't ls the directory to discover it.

Before grepping to understand how code connects, check the relevant .graph files. They already answer most structural navigation questions: what calls what, what imports what, and what breaks if you change something. When you grep for a function name, .graph files appear in results showing every caller and callee — use this to navigate instead of searching for each one individually.
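The path convention described above ("insert .graph before the extension") can be sketched in a few lines of Go. This is a minimal illustration only; `graphPath` is a hypothetical helper, not part of Supermodel's actual code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// graphPath inserts ".graph" before the file extension:
// src/Foo.py -> src/Foo.graph.py
func graphPath(sourcePath string) string {
	ext := filepath.Ext(sourcePath)
	base := strings.TrimSuffix(sourcePath, ext)
	return base + ".graph" + ext
}

func main() {
	fmt.Println(graphPath("src/Foo.py")) // src/Foo.graph.py
}
```

This is why an agent can construct the path directly instead of listing the directory: the mapping from source path to graph path is deterministic.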
Binary file modified benchmark/results/benchmark_results.zip
28 changes: 15 additions & 13 deletions benchmark/results/blog-post-draft.md
@@ -1,6 +1,6 @@
# 40% cheaper. 4× faster. Same correct answer.
# 60% cheaper. 4× faster. Same correct answer.

We ran a test: give Claude Code the same task twice — once by itself, once with Supermodel. Both had to make 8 failing tests pass in a 270k-line codebase. Both used the same model. Same starting point.
We ran a test: give Claude Code the same task four ways — naked, with a hand-crafted prompt, with our auto-generated prompt, and with a different shard format. All had to make 8 failing tests pass in a 270k-line codebase. Same model. Same starting point.

Here's what happened.

@@ -29,18 +29,18 @@ No plugins. No special AI tools. Just better context up front.

## Results

| | Naked Claude | + Supermodel |
|---------------------|-------------|--------------|
| **Cost** | $0.2212 | $0.1329 |
| **Turns** | 13 | 7 |
| **Duration** | 95.9s | 24.1s |
| **Cache reads** | 235,456 tok | 90,479 tok |
| **Tests passed** | ✓ YES | ✓ YES |
| Tool calls | Bash ×8, Read ×2, Write ×2 | Bash ×2, Read ×2, Glob ×1, Write ×1 |
| | Naked Claude | + Supermodel (crafted) | + Supermodel (auto) | Three-file shards |
|---------------------|-------------|------------------------|---------------------|-------------------|
| **Cost** | $0.30 | $0.12 | $0.15 | $0.25 |
| **Turns** | 20 | 9 | 11 | 16 |
| **Duration** | 122s | 29s | 42s | 73s |
| **Tests passed** | ✓ YES | ✓ YES | ✓ YES | ✓ YES |

**40% cheaper. 6 fewer turns. 72 seconds faster.**
**60% cheaper. 4× faster. 55% fewer turns.**

Both got the right answer. The only difference was how much digging each one had to do first.
All four got the right answer. The only difference was how much digging each one had to do first.

"Crafted" is a hand-written CLAUDE.md with Django-specific hints. "Auto" is what `supermodel skill` generates — a generic prompt that works on any repo. The auto prompt captured 83% of the crafted prompt's savings with zero manual effort.

---

@@ -118,7 +118,9 @@ That's real exploratory work. The summary files answered all of it before Claude

The savings didn't come from a cheaper model or a smaller prompt. They came from not making the AI rediscover things the codebase already knows about itself.

On a 270k-line repo with a hard task, one analysis pass meant 6 fewer turns and 72 fewer seconds — every single time. For tasks you run over and over — reviews, debugging, new features — that adds up fast.
On a 270k-line repo with a hard task, one analysis pass meant 11 fewer turns and 93 fewer seconds. And `supermodel skill` generates the CLAUDE.md for you — no hand-tuning required, still 50% cheaper than naked.

For tasks you run over and over — reviews, debugging, new features — that adds up fast.

Run the analysis once. Save on every task after.

94 changes: 94 additions & 0 deletions benchmark/results/skill-v2.txt


18 changes: 8 additions & 10 deletions benchmark/results/summary.md
@@ -7,17 +7,15 @@

## Results

| | naked | supermodel |
|--------------------|--------------|--------------|
| Cost | $0.2212 | $0.1329 |
| Turns | 13 | 7 |
| Duration | 95.9s | 24.1s |
| Cache tokens read | 235,456 | 90,479 |
| Cache tokens built | 18,681 | 23,281 |
| All tests passed | YES | YES |
| Tool calls | {'Bash': 8, 'Read': 2, 'Write': 2} | {'Bash': 2, 'Read': 2, 'Glob': 1, 'Write': 1} |
| | naked | supermodel (crafted) | skill (generic) | three-file |
|--------------------|--------------|----------------------|-----------------|--------------|
| Cost | $0.30 | $0.12 | $0.15 | $0.25 |
| Turns | 20 | 9 | 11 | 16 |
| Duration | 122s | 29s | 42s | 73s |
Comment on lines +12 to +14
⚠️ Potential issue | 🟠 Major

skill (generic) benchmark values look stale against this PR’s stated results.

Lines 12-14 and line 18 still show the older run ($0.15, 42s, 11 turns). The PR objective states the updated result is $0.11, 31s, 7 turns, so the derived comparison line is also off.

📌 Suggested update
-| Cost               | $0.30        | $0.12                | $0.15           | $0.25        |
-| Turns              | 20           | 9                    | 11              | 16           |
-| Duration           | 122s         | 29s                  | 42s             | 73s          |
+| Cost               | $0.30        | $0.12                | $0.11           | $0.25        |
+| Turns              | 20           | 9                    | 7               | 16           |
+| Duration           | 122s         | 29s                  | 31s             | 73s          |

-**skill (generic prompt): 50% cheaper, 66% faster, 45% fewer turns vs naked**
+**skill (generic prompt): 63% cheaper, 75% faster, 65% fewer turns vs naked**

Also applies to: 18-18

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/results/summary.md` around lines 12 - 14, Update the stale "skill
(generic)" benchmark entries in the results table: replace the old values
($0.15, 42s, 11 turns) with the PR-stated results ($0.11, 31s, 7 turns) in the
row for "skill (generic)" and then recompute and update the derived comparison
line (the comparison column that references those values) so the cost, duration,
and turns reflect the new numbers consistently throughout the table.

| All tests passed | YES | YES | YES | YES |

**supermodel: $0.0883 (39.9%) cheaper, 6 fewer turns, 72s faster**
**supermodel (crafted prompt): 60% cheaper, 76% faster, 55% fewer turns vs naked**
**skill (generic prompt): 50% cheaper, 66% faster, 45% fewer turns vs naked**
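
The derived percentages in the two comparison lines follow directly from the table. A throwaway sketch to check the arithmetic, with values copied from the table as currently shown:

```go
package main

import "fmt"

// saving returns the percent reduction of run relative to base,
// rounded to the nearest integer.
func saving(base, run float64) int {
	return int((base-run)/base*100 + 0.5)
}

func main() {
	// crafted vs naked: $0.30 -> $0.12, 122s -> 29s, 20 -> 9 turns
	fmt.Println(saving(0.30, 0.12), saving(122, 29), saving(20, 9)) // 60 76 55
	// skill (generic) vs naked: $0.30 -> $0.15, 122s -> 42s, 20 -> 11 turns
	fmt.Println(saving(0.30, 0.15), saving(122, 42), saving(20, 11)) // 50 66 45
}
```

If the skill (generic) row is updated per the review comment, the same formula yields the revised comparison line.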

## How supermodel helped
The graph files gave Claude the architecture upfront. The supermodel run went straight
37 changes: 37 additions & 0 deletions cmd/skill.go
@@ -0,0 +1,37 @@
package cmd

import (
"fmt"

"github.com/spf13/cobra"
)

const skillPrompt = `This repository has .graph.* files next to source files containing code relationship data from Supermodel.

The naming convention: for src/Foo.py the graph file is src/Foo.graph.py (insert .graph before the extension). Each graph file has up to three sections:
- [deps] — what this file imports and what imports it
- [calls] — function call relationships with file paths and line numbers
- [impact] — blast radius: risk level, affected domains, direct/transitive dependents

**Read the .graph file before the source file.** It shows the full dependency and call picture in far fewer tokens. Construct the path directly — don't ls the directory to discover it.

Before grepping to understand how code connects, check the relevant .graph files. They already answer most structural navigation questions: what calls what, what imports what, and what breaks if you change something. When you grep for a function name, .graph files appear in results showing every caller and callee — use this to navigate instead of searching for each one individually.`

func init() {
c := &cobra.Command{
Use: "skill",
Short: "Print agent awareness prompt for graph files",
Long: `Prints a prompt that teaches AI coding agents how to use Supermodel's
graph files. Pipe into your agent's instructions:

supermodel skill >> CLAUDE.md
supermodel skill >> AGENTS.md
supermodel skill >> .cursorrules`,
Args: cobra.NoArgs,
Run: func(cmd *cobra.Command, args []string) {
fmt.Println(skillPrompt)
},
}

rootCmd.AddCommand(c)
}
32 changes: 32 additions & 0 deletions cmd/skill_test.go
@@ -0,0 +1,32 @@
package cmd

import (
"strings"
"testing"
)

func TestSkillPrompt_ContainsKeyElements(t *testing.T) {
required := []struct {
substr string
reason string
}{
{".graph.", "must reference graph file extension"},
{"[deps]", "must document deps section"},
{"[calls]", "must document calls section"},
{"[impact]", "must document impact section"},
{".graph.py", "must show naming convention with concrete example"},
{"before the source file", "must instruct read-order (graph first)"},
}

for _, r := range required {
if !strings.Contains(skillPrompt, r.substr) {
t.Errorf("skill prompt missing %q — %s", r.substr, r.reason)
}
}
}

func TestSkillPrompt_NotEmpty(t *testing.T) {
if len(strings.TrimSpace(skillPrompt)) < 100 {
t.Error("skill prompt is suspiciously short")
}
}