thousandflowers/skillreaper

Half of what your AI agent loads, it never uses.

skillreaper proves which — from your own transcripts — and prunes the dead weight, so your agent stops wading through tools it never picks.


brew install thousandflowers/tap/skillreaper
reap

One command. Zero config. Read-only. It reads your real session transcripts, finds every skill / MCP / agent your AI loads but never fires, and shows you exactly what it costs you.


Why I built this

I was running out of context budget on every session. Over months I had accumulated skills, MCP servers, and agents, most of them experiments I'd forgotten about or been too busy to clean up. Every new session loaded all of them, burning tokens before I'd typed a single message.

I needed to know which ones were actually firing and which were just dead weight. Nothing existing told me that from transcript evidence. So I built it.

The problem turned out to be common. The first Reddit post hit 29K views in a week. The tool now supports six platforms and ships on Homebrew, npm, and as a static binary for every major OS.


Two problems, one cause

Wrong-tool picks. Buried in a wall of irrelevant options, your agent wastes turns reaching for the wrong tool. More turns = slower, costlier, sloppier runs. This isn't about pennies — it's about work quality.

Wasted tokens. Dead instructions eat context every session and hurt prompt-cache hit rate. A typical setup:

  • 187 items loaded
  • 142 never used (76 %)
  • 8 000 tok/session dead
  • ~2 160 000 tok/month burned on irrelevant instructions

Numbers above are from a real session — run reap to see yours.
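
The arithmetic behind those figures can be sketched in a few lines of Go. This is an illustration, not skillreaper's code: the function names are hypothetical, and the 270 sessions per month is an assumption inferred from 2 160 000 / 8 000 rather than a number stated here.

```go
package main

import "fmt"

// unusedShare returns the share of loaded items that never fired,
// as a whole percent rounded to the nearest integer.
func unusedShare(loaded, neverUsed int) int {
	if loaded == 0 {
		return 0
	}
	return (neverUsed*100 + loaded/2) / loaded
}

// monthlyDeadTokens multiplies the dead tokens carried per session by the
// number of sessions in a month.
func monthlyDeadTokens(deadPerSession, sessionsPerMonth int) int {
	return deadPerSession * sessionsPerMonth
}

func main() {
	// Assumption: 270 sessions per month, the value implied by the
	// 8 000 tok/session and ~2 160 000 tok/month figures above.
	const sessionsPerMonth = 270

	fmt.Printf("%d%% never used\n", unusedShare(187, 142))
	fmt.Printf("%d tok/month dead\n", monthlyDeadTokens(8000, sessionsPerMonth))
}
```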

skillreaper measures both, from evidence — no guessing.

reap points at the waste. You decide what goes.


Privacy

100 % local. Zero telemetry, zero network, zero uploads. Reads config files and session transcripts on disk — your data never leaves your machine.


Before → After

Before skillreaper                                   After skillreaper
187 items loaded every session                       45 items, all actively used
Wrong tool 1 in 5 turns                              Right tool on first try
8 000 tok/session dead                               Full context budget for real work
~30 pages of irrelevant instructions read monthly    Zero
Lower cache hit rate = higher latency                Smaller prompt fits in cache

If this looks useful → ⭐ star the repo

Install

# macOS — Homebrew
brew install thousandflowers/tap/skillreaper

# Any platform — npm (downloads the matching prebuilt, checksum-verified)
npm install -g skillreaper

# No install — one-shot via npx
npx skillreaper

# Any platform — Go (Go ≥ 1.26)
go install github.com/thousandflowers/skillreaper/cmd/reap@latest

Brew and Go install the command as reap; npm/npx expose it as skillreaper. Same tool — every reap … example below works under either name.

Binary downloads — macOS (Intel + Apple Silicon), Linux (amd64 + arm64), Windows (amd64 + arm64) — all on the releases page. Single static binary, no dependencies.

Upgrading, uninstalling, and platform-specific tips → INSTALL.md.


Usage

💬 Curious what reap finds in other setups? Share your output →

reap                          # scan + report (read-only)
reap gap                      # loaded-vs-fired utilization breakdown
reap prune                    # quarantine REAP items (reversible)
reap mute <name>              # strip description, keep skill available
reap unmute <name>            # restore description from backup
reap unmute --all             # restore all muted skills
reap keep <name>              # protect an item from pruning
reap restore --all            # undo every prune
reap why <name>               # explain in detail why an item got its verdict
reap by-project               # skills bucketed by the project that fired them
reap route                    # propose a usage-informed lazy-load routing plan (opt-in)
reap apm                      # emit a proposed APM apm.yml from this repo's firing
reap apm --diff apm.yml       # reconcile: what to add (fired, undeclared) / drop (declared, cold)
reap gap                      # also scores MCP payload quality (fires-but-noise)
reap manifest <name>          # emit a release manifest for one skill
reap install-hook             # install weekly nudge (SessionStart hook)
reap install-hook --dry-run   # preview without writing
reap uninstall-hook           # remove hook, other hooks untouched
reap --json                   # structured JSON output
reap --md                     # markdown report
reap --days 7                 # shorter evidence window
reap --mute-threshold 0.20    # firing rate below which MUTE triggers (default 20%)
reap version                  # print version

Everything is reversible. reap prune moves files to a reaped/ directory with a versioned manifest. Nothing is ever deleted. Run reap restore --all and everything goes back exactly where it was.

Every write is atomic (temp file + rename) and confined to your Claude directory, so an interrupted prune, mute, or hook edit leaves the original file intact — never a half-written mix.
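
A minimal sketch of the temp file + rename pattern, assuming a POSIX-style filesystem where a rename within one directory replaces the target in a single step. The function name, the temp-file pattern, and the fsync before the rename are illustrative choices, not a copy of what internal/atomicfile/ does.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAtomic writes data to a temp file next to path and then renames it
// into place, so readers see either the old file or the new one.
func writeAtomic(path string, data []byte, perm os.FileMode) error {
	// The temp file lives in the target directory so the rename never
	// crosses a filesystem boundary. The name pattern is hypothetical.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".reap-*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Clean up on any failure path; after a successful rename the temp
	// name no longer exists and this call fails harmlessly.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush contents before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmpName, perm); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

func main() {
	path := filepath.Join(os.TempDir(), "reap-atomic-demo.json")
	if err := writeAtomic(path, []byte(`{"muted":[]}`), 0o644); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("wrote", path)
}
```

If the process dies before the rename, the original file is untouched and only a stray temp file is left behind.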


Verdicts

Label         Meaning
REAP(broken)  Invoked but errored (broken, not just cold)
REAP          Zero uses, safe to quarantine
MUTE          Used rarely + heavy; description stripped, skill stays available
KEEP          Used, tiny, or manually protected
REVIEW        Too new or not enough sessions

Every verdict includes a reason suffix explaining why.
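
One way to read the table as a decision procedure is sketched below. Only the 0.20 mute rate mirrors a documented default (--mute-threshold). The Item fields, the other cutoffs, the order of the checks, and the rule that "broken" means every invocation errored are all assumptions made for illustration.

```go
package main

import "fmt"

// Item is a hypothetical summary of one skill, agent, or MCP server.
type Item struct {
	Name      string
	Tokens    int  // estimated context weight
	Uses      int  // invocations seen in the evidence window
	Errors    int  // invocations that ended in an error
	Sessions  int  // sessions in the evidence window
	FiredIn   int  // sessions in which the item fired at least once
	AgeDays   int  // days since the item was installed
	Protected bool // set by `reap keep`
}

// Only muteRate comes from the documented default; the rest are assumed.
const (
	muteRate    = 0.20
	minSessions = 5
	minAgeDays  = 7
	tinyTokens  = 50
	heavyTokens = 500
)

// verdict returns a label plus the reason suffix for one item.
func verdict(it Item) (label, reason string) {
	switch {
	case it.Protected:
		return "KEEP", "manually protected"
	case it.Sessions < minSessions:
		return "REVIEW", "not enough sessions"
	case it.AgeDays < minAgeDays:
		return "REVIEW", "too new"
	case it.Uses > 0 && it.Errors == it.Uses:
		return "REAP(broken)", "every invocation errored"
	case it.Tokens <= tinyTokens:
		return "KEEP", "tiny"
	case it.Uses == 0:
		return "REAP", "zero uses"
	case it.Tokens >= heavyTokens &&
		float64(it.FiredIn)/float64(it.Sessions) < muteRate:
		return "MUTE", "used rarely and heavy"
	default:
		return "KEEP", "used"
	}
}

func main() {
	items := []Item{
		{Name: "cold-skill", Tokens: 900, Sessions: 30, AgeDays: 60},
		{Name: "rare-heavy", Tokens: 900, Uses: 2, FiredIn: 2, Sessions: 30, AgeDays: 60},
		{Name: "brand-new", Tokens: 900, Sessions: 30, AgeDays: 1},
	}
	for _, it := range items {
		label, reason := verdict(it)
		fmt.Printf("%-12s %s (%s)\n", it.Name, label, reason)
	}
}
```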


Loaded vs fired

Beyond the prune verdicts, reap gap shows your utilization rate — how much of what you load you actually use.

⟡ loaded vs fired — last 30 days · 142 sessions

CATEGORY   LOADED  FIRED   UTIL   ────────────       TOKENS
skills        187      4    2%    ▰▱▱▱▱▱▱▱▱▱     ~8 000 →   210
mcp            12      3   25%    ▰▰▱▱▱▱▱▱▱▱          ? →     ?
agents         30      2    7%    ▰▱▱▱▱▱▱▱▱▱     ~1 200 →    90
───────────────────────────────────────────────────────────────
total         229      9    4%    ▰▱▱▱▱▱▱▱▱▱     ~9 200 →   300

Each row breaks down by category (skill, MCP, agent) with item count, token weight, and a 10-segment utilization bar. Low utilization (<10 %) is red, medium (<50 %) yellow, high (≥50 %) green.
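
The percentage, bar, and color band for a row can be sketched as follows. The rounding rules are assumptions: rounding the percentage to the nearest integer and filling the bar by rounding down (with a one-segment minimum for any non-zero value) happen to reproduce the sample rows above, but the real renderer may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// utilization returns fired/loaded as a whole percent, rounded to nearest.
func utilization(fired, loaded int) int {
	if loaded == 0 {
		return 0
	}
	return (fired*100 + loaded/2) / loaded
}

// bar renders a 10-segment bar for a percentage.
// Assumption: segments round down, but any non-zero value shows one segment.
func bar(pct int) string {
	filled := pct / 10
	if pct > 0 && filled == 0 {
		filled = 1
	}
	if filled < 0 {
		filled = 0
	}
	if filled > 10 {
		filled = 10
	}
	return strings.Repeat("▰", filled) + strings.Repeat("▱", 10-filled)
}

// band maps a percentage to the color band described above.
func band(pct int) string {
	switch {
	case pct < 10:
		return "red"
	case pct < 50:
		return "yellow"
	default:
		return "green"
	}
}

func main() {
	rows := []struct {
		name          string
		loaded, fired int
	}{
		{"skills", 187, 4},
		{"mcp", 12, 3},
		{"agents", 30, 2},
	}
	for _, r := range rows {
		pct := utilization(r.fired, r.loaded)
		fmt.Printf("%-8s %3d%%  %s  %s\n", r.name, pct, bar(pct), band(pct))
	}
}
```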

The default reap report also includes a compact utilization summary line:

⟡ utilization 4%  —  9/229 items fired · ~300/9 200 tok touched (30d)

This is the real gap between what your agent carries and what it fires — complementary to the shock box (which only counts items that are safe to prune right now).

reap gap          # text breakdown
reap gap --json   # JSON output
reap gap --md     # markdown table

The gap view also scores payload quality for MCP tools: when a tool fires, is the result signal or noise? A fetch/screenshot tool can fire 80× and return mostly base64 or boilerplate every call — green under load utilization, but context burned on each call. Tools that fire often and return mostly noise are flagged ⚑ noisy. This is the second utilization axis (load is the first), and mute does not catch it.


route — usage-informed lazy-load plan (opt-in)

After pruning, a library of hundreds of legit skills still grows resident context linearly. reap route proposes a category-router organization driven by real firing evidence, not text similarity: frequently-fired skills stay exposed; the rare long tail is pushed behind leaf routers (grouped by namespace, else dominant firing project) loaded on demand. It is strictly opt-in and secondary to pruning — and below ~150 skills, native loading is usually enough, so the plan says so. The output is a plan: proposed, never auto-applied.

reap route                      # text plan
reap route --json               # JSON
reap route --md                 # markdown
reap route --route-threshold 0.05   # route skills firing in <5% of sessions
reap route --route-min-skills 200   # only show a plan past 200 surviving skills
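
The split between exposed and routed skills might look roughly like this. It is a sketch under stated assumptions: the Skill and Plan types are hypothetical, firing rate is taken as sessions-fired divided by total sessions, and the two numeric parameters stand in for --route-threshold and --route-min-skills.

```go
package main

import (
	"fmt"
	"sort"
)

// Skill is a hypothetical record of one surviving skill.
type Skill struct {
	Name       string
	Namespace  string // empty when the skill has no namespace
	TopProject string // project that fired it most often
	FiredIn    int    // sessions in which it fired
}

// Plan lists skills that stay exposed and the long tail grouped per router.
type Plan struct {
	Exposed []string
	Routed  map[string][]string // router key -> skill names
}

// planRoutes returns ok=false when the library is below minSkills, meaning
// native loading is considered enough and no plan is proposed.
func planRoutes(skills []Skill, sessions int, threshold float64, minSkills int) (Plan, bool) {
	if len(skills) < minSkills || sessions == 0 {
		return Plan{}, false
	}
	p := Plan{Routed: map[string][]string{}}
	for _, s := range skills {
		rate := float64(s.FiredIn) / float64(sessions)
		if rate >= threshold {
			p.Exposed = append(p.Exposed, s.Name)
			continue
		}
		key := s.Namespace
		if key == "" {
			key = s.TopProject // fall back to the dominant firing project
		}
		p.Routed[key] = append(p.Routed[key], s.Name)
	}
	sort.Strings(p.Exposed)
	for _, names := range p.Routed {
		sort.Strings(names)
	}
	return p, true
}

func main() {
	skills := []Skill{
		{Name: "commit-msg", Namespace: "git", FiredIn: 60},
		{Name: "bisect-helper", Namespace: "git", FiredIn: 1},
		{Name: "svg-tidy", TopProject: "website", FiredIn: 2},
	}
	if plan, ok := planRoutes(skills, 100, 0.05, 3); ok {
		fmt.Println("exposed:", plan.Exposed)
		fmt.Println("routed: ", plan.Routed)
	}
}
```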

apm — emit a proposed APM manifest

reap apm turns this repo's firing evidence into a proposed APM apm.yml (skills only, first cut). Read-only: it prints YAML, never edits the repo or runs apm install. KEEP → include, REAP → omit, REVIEW → never auto-omit. Upstream coordinates are recovered from apm.lock.yaml when present; otherwise the skill becomes a clearly marked TODO comment rather than an invented coordinate.

reap apm                        # propose apm.yml (yaml)
reap apm --json                 # JSON
reap apm --md                   # markdown
reap apm --diff apm.yml         # reconcile: add fired-but-undeclared, drop declared-but-cold
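
The reconcile step is essentially two set differences. The sketch below assumes skills are identified by plain names and that REVIEW items are exempt from the drop list, by analogy with the "never auto-omit" rule; the function and the sample names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// toSet turns a slice of names into a lookup set.
func toSet(names []string) map[string]bool {
	set := make(map[string]bool, len(names))
	for _, n := range names {
		set[n] = true
	}
	return set
}

// diffManifest compares declared skills with fired skills.
// add  = fired but not declared
// drop = declared but never fired, excluding items still under REVIEW
func diffManifest(declared, fired, review []string) (add, drop []string) {
	declaredSet, firedSet, reviewSet := toSet(declared), toSet(fired), toSet(review)
	for name := range firedSet {
		if !declaredSet[name] {
			add = append(add, name)
		}
	}
	for name := range declaredSet {
		if !firedSet[name] && !reviewSet[name] {
			drop = append(drop, name)
		}
	}
	sort.Strings(add)
	sort.Strings(drop)
	return add, drop
}

func main() {
	declared := []string{"pdf-export", "lint-fix", "new-experiment"}
	fired := []string{"lint-fix", "changelog"}
	review := []string{"new-experiment"}

	add, drop := diffManifest(declared, fired, review)
	fmt.Println("add: ", add)
	fmt.Println("drop:", drop)
}
```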

Weekly nudge

reap install-hook

Installs a SessionStart hook that runs a passive audit at the start of each Claude Code session. If 7 days have passed and the REAP or MUTE count has grown since the last check, it prints a single line to stderr:

skillreaper: 3 skills flagged for pruning since last check. Run reap to review.

Nothing else. No blocking. State stored at ~/.claude/reaped/nudge-state.json.

reap uninstall-hook removes only the skillreaper entry — other hooks untouched.

Platform support

Platform     Full support
Claude Code  ✅
Codex CLI    ✅
Hermes       ✅
OpenCode     ✅ (usage evidence needs the sqlite3 CLI; inventory-only without it)
Cursor       Inventory only (no local transcripts)
OpenClaw     Inventory only (no session history)

How it works

  1. Auto-detect — probes every known config directory. Only installed platforms are scanned. No flags needed.
  2. Inventory — scans skills, agents, MCP servers, hooks, and prose files across all detected platforms.
  3. Evidence — parses JSONL session transcripts (Claude Code, Codex CLI, Hermes). Counts tool_use blocks and command invocations with timestamps. OpenCode's SQLite history is read via the sqlite3 CLI (read-only) when it is on PATH; without it, OpenCode stays inventory-only.
  4. Cost — character weight (ceil(chars / 3.7)) + init parser tool declarations. Model pricing auto-resolves by model name.
  5. Verdict — REAP / MUTE / KEEP / REVIEW with machine-readable reason.
  6. Act — reap prune quarantines. reap restore --all undoes.
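
The evidence step can be illustrated with a small JSONL counter. The record shape used here (a message object whose content is an array of typed blocks) is an assumption about the transcript format, not a documented schema, and the function name is hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// record models only the fields this sketch needs (assumed shape).
type record struct {
	Message struct {
		Content json.RawMessage `json:"content"`
	} `json:"message"`
}

// block is one entry of message.content (assumed shape).
type block struct {
	Type string `json:"type"`
	Name string `json:"name"`
}

// countToolUses tallies tool_use blocks per tool name across JSONL lines.
// Lines that do not parse are skipped rather than treated as fatal.
func countToolUses(jsonl string) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		var rec record
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			continue
		}
		var blocks []block
		// content can also be a plain string; that is skipped, not an error.
		if err := json.Unmarshal(rec.Message.Content, &blocks); err != nil {
			continue
		}
		for _, b := range blocks {
			if b.Type == "tool_use" {
				counts[b.Name]++
			}
		}
	}
	return counts
}

func main() {
	transcript := `{"message":{"content":[{"type":"tool_use","name":"Skill"}]}}
{"message":{"content":"plain user text"}}
{"message":{"content":[{"type":"tool_use","name":"Skill"},{"type":"text"}]}}`
	fmt.Println(countToolUses(transcript))
}
```

An item that appears in the inventory but never in these counts is a candidate for the zero-uses verdict.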

Limitations (transparency)

Token counts are approximate. The tool estimates tokens as ceil(chars / 3.7), based on the average English BPE tokenizer rate. Real token counts vary by tokenizer (Claude vs GPT vs Gemini) and content (more code ≈ more tokens per char). This is a documented approximation — the relative ranking matters more than the absolute number.
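
The estimate itself is a one-liner. In this sketch "chars" is taken to mean Unicode code points rather than bytes, which is an assumption; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf8"
)

// charsPerToken is the average rate quoted above.
const charsPerToken = 3.7

// tokensForChars applies ceil(chars / 3.7).
func tokensForChars(chars int) int {
	return int(math.Ceil(float64(chars) / charsPerToken))
}

// estimateTokens counts code points (assumption) and applies the formula.
func estimateTokens(text string) int {
	return tokensForChars(utf8.RuneCountInString(text))
}

func main() {
	fmt.Println(tokensForChars(248)) // a 248-character description
	fmt.Println(estimateTokens("Use this skill when the user asks for a changelog."))
}
```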

Platform format stability. Each supported platform has its own config layout and transcript format, and both change as the platforms evolve, so parser updates are ongoing maintenance. The project is architected for easy fixes (one struct per platform in internal/platform/), but a parser fix can lag a platform update by days to weeks.

OpenCode evidence needs the sqlite3 CLI. OpenCode stores session history in a SQLite database. skillreaper reads it through the system sqlite3 binary in read-only mode — the real engine, so WAL-mode databases and overflow pages are handled correctly (a hand-rolled parser would not). No Go dependency is added. When sqlite3 is not on PATH, OpenCode items have no usage evidence: they stay REVIEW (never REAP) with a warning at scan time. The same safety net applies to any platform with no readable session transcripts.

Incomplete evidence never flags an item. The scanner caps how much it reads per transcript record. If a record is oversized or unreadable, that platform's evidence is marked incomplete and its items stay REVIEW (never REAP/MUTE), with a warning naming the platform — partial evidence can never mistakenly mark a tool as dead.
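
A sketch of that safety net, assuming newline-delimited records read through bufio.Scanner with a fixed buffer cap. The cap value, the function names, and the choice of Scanner are illustrative; the real scanner may enforce its limit differently.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readRecords scans newline-delimited records with a hard per-record cap.
// It reports how many records were read and whether evidence is complete.
func readRecords(transcript string, maxRecord int) (records int, complete bool) {
	sc := bufio.NewScanner(strings.NewReader(transcript))
	sc.Buffer(make([]byte, 0, maxRecord), maxRecord)
	for sc.Scan() {
		records++
	}
	// Err is bufio.ErrTooLong when a record exceeds the cap; a clean
	// end of input leaves it nil.
	return records, sc.Err() == nil
}

// finalVerdict downgrades prune verdicts to REVIEW on incomplete evidence.
func finalVerdict(v string, complete bool) string {
	if !complete && (v == "REAP" || v == "MUTE") {
		return "REVIEW"
	}
	return v
}

func main() {
	oversized := "short\n" + strings.Repeat("x", 64) + "\nnever reached\n"
	n, complete := readRecords(oversized, 16)
	fmt.Println("records:", n, "complete:", complete)
	fmt.Println("verdict:", finalVerdict("REAP", complete))
}
```

Because scanning stops at the oversized record, anything after it is unseen, which is why the whole platform is treated as incomplete rather than just that one record.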

Not a tool declaration fix. Claude Code's deferred tools reduce the init-time tool declaration overhead. skillreaper addresses a different problem: always-loaded skill/agent/prose files. If a skill description is 248 characters, it is read into context every session, regardless of lazy tool loading. These two optimizations are complementary, not competing.


Design

  • 100 % local, zero dependencies, single static binary (Go ≥ 1.26)
  • Multi-platform — adding a new platform is one struct in internal/platform/
  • Reversible quarantine — never deletes, never destructive
  • MIT licensed

cmd/reap/       CLI entry point
internal/
  platform/     platform definitions + auto-detection
  scan/         inventory scanners (claudemd.go: CLAUDE.md protection)
  usage/        transcript parser — tool_use + error tracking
  report/       verdict logic (REAP/MUTE/KEEP/REVIEW) + ANSI/JSON/MD renderers
  prune/        reversible quarantine
  mute/         description strip + backup/restore
  safepath/     shared path-confinement boundary (prune/mute/scan)
  atomicfile/   crash-safe writes (temp file + rename)
  hook/         SessionStart install/uninstall + nudge state
  cost/         model pricing
docs/           demo assets


Acknowledgements

v0.2.0 ideas were inspired by work from the r/claudeskills community:

  • groundskeeper — SessionStart weekly nudge pattern and live usage tracking approach
  • optimize — name-only middle state (implemented as MUTE) and CLAUDE.md reference protection
  • Broken-vs-cold distinction direction inspired by discussion on r/claudeskills

Issues · Discussions · Releases · MIT
