149 changes: 149 additions & 0 deletions .claude/nightly_skill-lifecycle-integration-spec.md
@@ -0,0 +1,149 @@
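# SPEC: skill-lifecycle (NIGHTLY × skill-router/manage.py integration)

## Background

`manage.py` archives and recalls skills based on how often sessions use them:
- **Archive**: a skill with zero activity in the last N days → `_archive_skills/`
- **Recall**: an archived skill still referenced ≥ min_hits times in sessions → moved back to active
- **Protect**: skills marked as never-archive (`protected_skills.json`)
- **Signal sources**: `~/.claude/projects/` session transcripts + `route_log.jsonl`

None of NIGHTLY's 5 current strategies (rule-rewrite, hook-tighten, memory-add, skill-description-tighten, rule-reorder) enables or disables skills. This spec wires manage.py's archive/recall logic into the NIGHTLY loop as a 6th strategy.

## Goal

Add a `skill-lifecycle` strategy: each night, based on session usage data, propose archiving or recalling **one** skill, then decide keep/revert after replay-score validation.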

## Design

### Strategy definition

```
| Strategy | When to use |
| skill-lifecycle | manage.py --status shows (a) an active skill with zero use in 30d, or (b) an archived skill with ≥3 hits in 14d. Propose a single archive or recall. |
```

### Proposal structure

```json
{
  "run_id": "...",
  "baseline_commit": "...",
  "strategy": "skill-lifecycle",
  "target_file": "~/.claude/skills/<name>/SKILL.md",
  "action": "archive" | "recall",
  "skill_name": "some-skill",
  "skill_source": "claude" | "agents",
  "change_summary": "Archive skill 'X' (0 hits in 30d) to reduce prompt noise",
  "evidence": {
    "days_analyzed": 30,
    "hit_count": 0,
    "total_sessions_scanned": 142
  },
  "motivating_corrections": [],
  "proposed_at": "<iso8601>"
}
```

### Execution flow

```
1. Preflight
   python3 ~/.agents/skills/skill-router/scripts/manage.py --status --days 30
   → get the active/archived overview + usage

2. Propose
   IF a non-protected skill has zero activity in 30d → pick the one with the largest token footprint → action=archive
   ELIF manage.py --recall --days 14 --min-hits 3 returns recommendations → pick the one with the most hits → action=recall
   ELSE → skip and choose another strategy

3. Apply
   IF archive: python3 manage.py --archive --days 30 (move only the target skill, not a batch)
   IF recall: python3 manage.py --recall --days 14 --min-hits 3 --apply (target only)
   Note: manage.py currently operates in batch; it must be extended with a single-target mode via --name <skill>

4. Safety check
   - skill is in ALWAYS_KEEP / protected_skills.json → exit 3
   - skill is a symlink → exit 3 (symlinked skills are left alone)
   - fewer than 5 active skills would remain after the archive → exit 3 (prevents emptying the set)

5. Replay + Score
   Standard NIGHTLY flow: replay the benchmark and compare against the baseline score

6. Decide
   Same keep/revert rules as the other strategies:
   - score ≥ baseline → keep (archiving the skill reduced noise with a net gain)
   - score < baseline - threshold → revert (the skill was an implicit dependency)

7. Revert mechanism
   IF revert:
     for an archive: move_skill(name, source, 'archive', 'active')
     for a recall: move_skill(name, source, 'active', 'archive')
     + write (skill-lifecycle, skill_name) to the dead-letter
```

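### Changes required in manage.py

| Change | Reason |
|---|---|
| Add a `--name <skill>` argument | NIGHTLY operates on one skill per run, never a batch |
| `--json` output mode | The agent parses structured data instead of the Chinese `print` output |
| Single-target support in `cmd_archive` / `cmd_recall` | Pairs with `--name` |
| Distinct exit codes: 0=success, 1=no-op, 3=safety rejection | Aligns with the NIGHTLY safety_check protocol |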
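
The CLI surface in the table could look roughly like the sketch below. It assumes `argparse`; the helper names `build_parser` and `run`, the default values, and the stubbed results are hypothetical and not taken from `manage.py`. Only the flag names and the exit codes come from this spec.

```python
import argparse

# Exit codes from the table above.
EXIT_OK, EXIT_NOOP, EXIT_SAFETY = 0, 1, 3


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; --name and --json are the additions this spec asks for."""
    p = argparse.ArgumentParser(prog="manage.py")
    p.add_argument("--status", action="store_true")
    p.add_argument("--archive", action="store_true")
    p.add_argument("--recall", action="store_true")
    p.add_argument("--apply", action="store_true")
    p.add_argument("--days", type=int, default=30)      # default is an assumption
    p.add_argument("--min-hits", type=int, default=3)   # default is an assumption
    p.add_argument("--name", help="operate on a single skill instead of a batch")
    p.add_argument("--json", action="store_true", help="emit machine-readable output")
    return p


def run(argv, protected=frozenset()):
    """Return (exit_code, payload). Stub only: nothing is moved on disk."""
    args = build_parser().parse_args(argv)
    if not (args.archive or args.recall) or not args.name:
        return EXIT_NOOP, {"result": "no-op"}
    if args.archive and args.name in protected:
        return EXIT_SAFETY, {"result": "rejected", "name": args.name}
    action = "archive" if args.archive else "recall"
    return EXIT_OK, {"result": "ok", "action": action, "name": args.name}
```

Returning the exit code together with a payload keeps the decision testable; a real entry point would presumably serialize the payload when `--json` is set and pass the code to `sys.exit`.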

### Changes to safety_check.py

Allow move operations on skill directories (currently `plugins/` is off-limits):

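```python
# New allowlist rule (fragment; `strategy` comes from the proposal under review)
if strategy == 'skill-lifecycle':
    allowed_paths = [
        '~/.claude/skills/',
        '~/.claude/_archive_skills/',
        '~/.agents/skills/',
        '~/.agents/_archive_skills/',
    ]
    # Only moves between skill directories are allowed; no deletions or content edits
```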
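
The allowlist fragment above could be turned into a standalone check along these lines. This is a sketch under assumptions: the function name `is_allowed_move` is hypothetical, and how `safety_check.py` actually receives the source and destination of a move is not described in this spec.

```python
from pathlib import Path

# The four roots listed in the allowlist above.
ALLOWED_ROOTS = (
    "~/.claude/skills",
    "~/.claude/_archive_skills",
    "~/.agents/skills",
    "~/.agents/_archive_skills",
)


def is_allowed_move(src: str, dst: str, strategy: str) -> bool:
    """True only when both ends of a move sit under an allowlisted root."""
    if strategy != "skill-lifecycle":
        return False
    roots = [Path(r).expanduser() for r in ALLOWED_ROOTS]

    def under_root(p: str) -> bool:
        path = Path(p).expanduser()
        if ".." in path.parts:
            return False  # refuse traversal instead of trying to resolve it
        return any(root in path.parents for root in roots)

    return under_root(src) and under_root(dst)
```

Checking `path.parents` rather than string prefixes avoids false matches such as `~/.claude/skills_backup/`, and it also rejects a move of an allowlisted root itself.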

### Additional evaluation metrics

Beyond the standard replay score, also record:

```json
{
  "prompt_token_delta": -1847,
  "active_skill_count_before": 52,
  "active_skill_count_after": 51
}
```

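The token delta (`prompt_token_delta`) serves as an auxiliary signal: even when the replay score is flat, a significant token saving (>1000) can count as a positive result.

### Integration with strategy_stats.py

`skill-lifecycle` takes part in effectiveness tracking as an independent strategy:
- promising/neutral/avoid is computed from the usual kept/tried ratio
- sub-types (archive vs recall) are not tracked separately; they share one strategy bucket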
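
A rough sketch of single-bucket tracking follows. The thresholds (`promising_at`, `avoid_below`, `min_tried`), the log field names, and both function names are assumptions for illustration; this spec does not state how `strategy_stats.py` draws the promising/neutral/avoid lines.

```python
from collections import Counter


def classify(kept: int, tried: int, promising_at: float = 0.5,
             avoid_below: float = 0.2, min_tried: int = 3) -> str:
    """Map a kept/tried ratio to a label. All thresholds are hypothetical."""
    if tried < min_tried:
        return "neutral"  # too few samples to judge either way
    ratio = kept / tried
    if ratio >= promising_at:
        return "promising"
    if ratio < avoid_below:
        return "avoid"
    return "neutral"


def bucket_stats(log_entries):
    """Count tried/kept per strategy; the archive/recall action is ignored on purpose."""
    tried, kept = Counter(), Counter()
    for entry in log_entries:
        tried[entry["strategy"]] += 1
        if entry["decision"] == "keep":
            kept[entry["strategy"]] += 1
    return {s: classify(kept[s], tried[s]) for s in tried}
```

Keying the counters on `strategy` alone is what makes archive and recall share one bucket.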

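## Non-goals

- No changes to manage.py's core scan_usage logic (signal quality is skill-router's concern)
- No archiving or recalling several skills at once (NIGHTLY principle: one change per run)
- No changes to `settings.json` (a skill's active/archive state is a filesystem-level operation and does not go through settings)
- No automatic rebuild via `build_index.py` (after an archive or recall, it is triggered automatically the next time skill-router is used)

## Dependencies

- `~/.agents/skills/skill-router/scripts/manage.py` is installed and executable
- Python 3.10+ (already available)
- session transcripts exist under `~/.claude/projects/`

## Acceptance criteria

1. `nightly --observation` can generate a proposal of type `skill-lifecycle`
2. dry-run mode correctly identifies archive/recall candidates
3. revert fully restores the skill's location (including symlink repair)
4. the dead-letter prevents repeating an operation on the same skill
5. strategy_stats correctly tracks kept/tried for skill-lifecycle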
1 change: 1 addition & 0 deletions .gitignore
@@ -19,3 +19,4 @@ __pycache__/
**/dead-letter.jsonl
**/reports/
**/logs/
.omc/
18 changes: 16 additions & 2 deletions agents/nightly-optimizer.md
@@ -19,7 +19,7 @@ The substrate you're improving is `~/.claude/` itself. The eval suite is `~/.cla
2. **`~/.claude/` must be a clean git repo at start.** If `git status` shows uncommitted changes, abort with a clear message — never destroy the user's in-flight work.
3. **All state goes to disk immediately.** Every measurement, every decision. The conversation is not durable storage.
4. **Always include regressions in the report.** Top 3 regressions are a guardrail against silent overfit.
5. **Never touch `~/.claude/projects/`, `~/.claude/plugins/`, `~/.claude/statsig/`, or `~/.claude/ide/`** — those are session/cache state, not substrate.
5. **Never touch `~/.claude/projects/`, `~/.claude/plugins/`, `~/.claude/statsig/`, or `~/.claude/ide/`** — those are session/cache state, not substrate. Exception: `skill-lifecycle` strategy moves skills between `~/.claude/skills/` ↔ `~/.claude/_archive_skills/` and `~/.agents/skills/` ↔ `~/.agents/_archive_skills/` via `skill_lifecycle.py`.
6. **Budget cap: $3 of Haiku tokens.** If you've spent more, stop and log a partial result.
7. **Wall-clock cap: 30 minutes total run time.** Record the run's start time. If 30 min elapses before the loop completes, stop immediately, revert any partially-applied change, and log `decision: "timeout"`. Don't try to "finish" past the cap — the next cron fire will start fresh.
8. **Sanity floor on score: 0.5.** If the experiment scores below 0.5, the loop is broken (not the substrate). Revert, log `decision: "sanity-floor-rejected"`, and write a report that flags the failure. Three consecutive sanity-floor rejections → abort future runs until the user investigates.
@@ -68,6 +68,7 @@ Pick the highest-leverage change from this menu. Bias by:
| **memory-add** | Two or more recent corrections share a `root_cause`, OR a `proposed_rule` is mechanical enough to live in a SKILL.md. Create a feedback memory or skill file. |
| **skill-description-tighten** | A skill's description is generic enough that wrong skills trigger. Tighten. |
| **rule-reorder** | An anti-pattern rule appears below a less-critical one in operating-mode docs. Move it up. |
| **skill-lifecycle** | `skill_lifecycle.py --propose` returns a candidate. Archive a 30d-unused skill (reduce prompt noise) or recall an archived skill with ≥3 recent hits. |

Write your proposal to `proposal.json` BEFORE applying — this is the audit trail.
```json
@@ -82,9 +83,21 @@ Write your proposal to `proposal.json` BEFORE applying — this is the audit tra
}
```

**skill-lifecycle specific proposal flow:**
```bash
python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --propose
```
If exit 0, the JSON output has `name`, `source`, `action`, `hits`, `size_bytes`. Use these to fill `proposal.json` with `target_file` = the skill's SKILL.md path. If exit 1, no candidate — pick another strategy.
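
A minimal sketch of that mapping, assuming the `--propose` output is a single JSON object with exactly the five keys named above. The function name `build_proposal` is hypothetical, and pointing `target_file` at the active path for both actions follows the spec's example rather than any documented rule.

```python
import json
from datetime import datetime, timezone


def build_proposal(propose_stdout: str, run_id: str, baseline_commit: str) -> dict:
    """Turn the JSON printed by `--propose` into a proposal.json payload."""
    cand = json.loads(propose_stdout)
    root = "~/.claude" if cand["source"] == "claude" else "~/.agents"
    return {
        "run_id": run_id,
        "baseline_commit": baseline_commit,
        "strategy": "skill-lifecycle",
        # Assumption: always the active-location path, as in the spec's example.
        "target_file": f"{root}/skills/{cand['name']}/SKILL.md",
        "action": cand["action"],
        "skill_name": cand["name"],
        "skill_source": cand["source"],
        "evidence": {"hit_count": cand["hits"], "size_bytes": cand["size_bytes"]},
        "motivating_corrections": [],
        "proposed_at": datetime.now(timezone.utc).isoformat(),
    }
```

The result can be written with `json.dump` before the apply step, which keeps the audit trail intact even if the apply later exits 3.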

### 3. Apply
Edit the file(s). Stage the change with `git add -A` but do NOT commit yet. The commit only happens if the experiment is kept.

**skill-lifecycle apply:** instead of editing files, run:
```bash
python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --apply --name <name> --action <archive|recall> --source <claude|agents>
```
Exit 0 = applied. Exit 3 = safety rejected (protected/min-active-count) — treat as safety_check failure.

### 3b. Safety check (mandatory)
Run:
```
@@ -191,12 +204,13 @@ Before applying decision, read `~/.claude/nightly/dead-letter.jsonl` if it exist

**Default — observation mode** (auto-commit marker file absent):
- Regardless of decision (`keep`, `revert`, `held`), **always revert** the change with `git reset --hard <baseline_commit>`. NIGHTLY never mutates substrate without user review while in this mode.
- **skill-lifecycle revert:** also run `python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --revert --name <name> --action <action> --source <source>` to move the skill back before `git reset`.
- Write the proposal, diff, and score to `~/.claude/nightly/proposed/<run_id>.md` so the user can review and manually approve via `/nightly approve <run_id>` (which re-applies the change and commits with the correct author email).
- Mark the experiment-log `decision: "proposed-<original_decision>"` (e.g. `proposed-kept`, `proposed-reverted`) so the audit trail shows what the loop WOULD have done.

**Auto-commit mode** (user explicitly opted in by creating `~/.claude/nightly/auto-commit.yes`):
- **Keep**: `cd ~/.claude && git commit -m "nightly <run_id>: <strategy> — score <baseline> → <new> (+<delta>)"`.
- **Revert / hold**: `cd ~/.claude && git reset --hard <baseline_commit>`.
- **Revert / hold**: `cd ~/.claude && git reset --hard <baseline_commit>`. For **skill-lifecycle**, also run `python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --revert --name <name> --action <action> --source <source>` before the git reset.

**Why observation mode is the default:** v0.2 scoring uses six regex heuristics over historical replay. The signals are gameable (e.g. a CLAUDE.md edit that forbids "feels balanced" trivially scores higher without improving reasoning), the Δ ≥ +0.02 threshold is below noise without variance estimation, and ground truth is "what the historical assistant did", not "what should have happened". Until v0.3 adds LLM-as-judge + multi-trial variance + correction-weighted scoring, NIGHTLY should propose changes, not commit them.

4 changes: 3 additions & 1 deletion commands/nightly.md
@@ -1,5 +1,5 @@
---
description: NIGHTLY autoresearch loop against ~/.claude/. Default = run one experiment in OBSERVATION mode (no auto-commit). Subcommands: status, diff, approve, reject, disapprove, list-proposals.
description: "NIGHTLY autoresearch loop against ~/.claude/. Default = run one experiment in OBSERVATION mode (no auto-commit). Subcommands: status, diff, approve, reject, disapprove, list-proposals."
---

Arguments: `$ARGUMENTS`
@@ -23,6 +23,8 @@ Recognized flags (pass-through):
- `--dry-run` — skip benchmark replay, use corpus ground-truth as synthetic substitute
- `--budget <usd>` — override default $3 cap
- `--n <count>` — override default 10 replayable tasks
- `--since <YYYY-MM-DD>` — only replay tasks with first_message_at >= this date
- `--until <YYYY-MM-DD>` — only replay tasks with first_message_at <= this date (inclusive)

### `status`
Print a one-screen status:
Empty file added corrections.jsonl
158 changes: 158 additions & 0 deletions docs/spec-skill-lifecycle.md
@@ -0,0 +1,158 @@
# SPEC: `skill-lifecycle` Strategy

> Add a 6th mutation strategy to the nightly optimizer that archives unused skills and recalls demanded archived skills, using session transcript analysis as the usage signal.

## Motivation

Skills loaded into the Claude Code system prompt consume tokens on every turn. An unused skill wastes prompt budget without providing value. Conversely, a skill archived too aggressively may be needed again — session transcripts reveal latent demand via keyword hits and Skill tool invocations.

The existing 5 strategies (rule-rewrite, hook-tighten, memory-add, skill-description-tighten, rule-reorder) optimize substrate *content*. `skill-lifecycle` optimizes substrate *composition* — which skills are active at all.

## Design

### Usage Signal Detection

The adapter scans `.jsonl` session transcripts from multiple sources:

```
~/.claude/projects/ # Claude Code sessions
~/.codex/sessions/ # Codex sessions (if present)
~/.agents/sessions/ # Other agent harnesses
```

Three signal types per skill:
1. **Skill tool call**: `"tool": "Skill"` with `"input": {"skill": "<name>"}` in transcript
2. **SKILL.md loaded**: `"Base directory for this skill:"` string in assistant messages
3. **Keyword match**: skill name appears in assistant content
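
The three signal types could be counted along these lines. This is a sketch under a strong assumption: each transcript line is a flat JSON object with optional `tool`, `input`, `role`, and `content` keys. Real session records may nest these differently, and `count_signals` is a hypothetical name.

```python
import json

LOADED_MARKER = "Base directory for this skill:"


def count_signals(jsonl_lines, skill_names):
    """Return per-skill counts for the tool_call / loaded / keyword signals."""
    hits = {name: {"tool_call": 0, "loaded": 0, "keyword": 0} for name in skill_names}
    for line in jsonl_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if not isinstance(rec, dict):
            continue
        # Signal 1: explicit Skill tool call
        tool_input = rec.get("input")
        if rec.get("tool") == "Skill" and isinstance(tool_input, dict):
            called = tool_input.get("skill")
            if isinstance(called, str) and called in hits:
                hits[called]["tool_call"] += 1
        # Signals 2 and 3 only look at assistant text
        content = rec.get("content")
        if rec.get("role") != "assistant" or not isinstance(content, str):
            continue
        for name in skill_names:
            if name not in content:
                continue
            # Load marker present: signal 2. Otherwise a plain keyword match: signal 3.
            kind = "loaded" if LOADED_MARKER in content else "keyword"
            hits[name][kind] += 1
    return hits
```

Keeping the three counters separate lets a caller weight an explicit tool call above a bare keyword match, since short skill names are likely to produce false keyword hits.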

### Adapter Script: `src/skill_lifecycle.py`

Self-contained Python script with three modes:

```
python3 src/skill_lifecycle.py --propose
python3 src/skill_lifecycle.py --apply --name X --action archive|recall --source claude|agents
python3 src/skill_lifecycle.py --revert --name X --action archive|recall --source claude|agents
```

**Exit codes** (aligned with `safety_check.py` protocol):
- `0` — action taken / candidate found
- `1` — no candidate available / revert failed
- `3` — safety rejected (protected skill, minimum active count, symlink)

**Directory layout:**

| Source | Active path | Archive path |
|---|---|---|
| `claude` | `~/.claude/skills/` | `~/.claude/_archive_skills/` |
| `agents` | `~/.agents/skills/` | `~/.agents/_archive_skills/` |

### `--propose` Logic

1. Scan sessions from last 30 days, count per-skill usage
2. **Archive candidates**: active skills with 0 hits in 30d, excluding protected set
3. **Recall candidates**: archived skills with ≥3 hits in last 14d
4. **Selection priority**: recall (highest hits) > archive (largest file size = most token savings)
5. Output: single JSON object with `name`, `source`, `action`, `hits`, `size_bytes`
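
The selection priority in steps 2 to 4 can be sketched as follows. The input shape (lists of dicts carrying `name`, `source`, `hits`, `size_bytes`) and the function name `pick_candidate` are assumptions; the caller is expected to have counted `hits` over 30 days for active skills and 14 days for archived ones.

```python
def pick_candidate(active, archived, protected=frozenset(), min_recall_hits=3):
    """Return one candidate dict with an added `action` key, or None."""
    # Recall wins over archive: pick the archived skill with the most recent hits.
    recalls = [s for s in archived if s["hits"] >= min_recall_hits]
    if recalls:
        best = max(recalls, key=lambda s: s["hits"])
        return {**best, "action": "recall"}
    # Otherwise archive the largest unused, unprotected skill (most token savings).
    archives = [s for s in active if s["hits"] == 0 and s["name"] not in protected]
    if archives:
        best = max(archives, key=lambda s: s["size_bytes"])
        return {**best, "action": "archive"}
    return None  # maps to exit code 1: no candidate
```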

### Safety Guards

Built into `--apply`:
- Skills in the hardcoded `ALWAYS_KEEP` set (`skill-router`, `context-mode`, `oh-my-claudecode`) are never archived
- User-defined `protected_skills.json` entries are never archived
- Symlink skills are never moved (they point to canonical locations)
- Archive is rejected if total active skill count would drop below 5
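
As a sketch, the four guards might be combined into one pre-move check. The function name `check_archive` and its parameters are hypothetical; the `ALWAYS_KEEP` members, the minimum of 5, and exit code 3 come from this spec.

```python
import os

ALWAYS_KEEP = frozenset({"skill-router", "context-mode", "oh-my-claudecode"})
MIN_ACTIVE = 5
EXIT_OK, EXIT_SAFETY = 0, 3


def check_archive(name, skill_path, active_count, user_protected=frozenset()):
    """Return EXIT_SAFETY if archiving `name` would violate a guard, else EXIT_OK."""
    if name in ALWAYS_KEEP or name in user_protected:
        return EXIT_SAFETY  # hardcoded or user-defined protection
    if os.path.islink(skill_path):
        return EXIT_SAFETY  # symlinked skills point at canonical locations
    if active_count - 1 < MIN_ACTIVE:
        return EXIT_SAFETY  # archive would drop the active set below the floor
    return EXIT_OK
```

Running every guard before touching the filesystem means a rejection never leaves a half-moved directory behind.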

### Proposal Structure

```json
{
  "run_id": "2026-05-27-0300",
  "baseline_commit": "abc123",
  "strategy": "skill-lifecycle",
  "target_file": "~/.claude/skills/<name>/SKILL.md",
  "action": "archive",
  "skill_name": "some-skill",
  "skill_source": "claude",
  "change_summary": "Archive skill 'some-skill' (0 hits in 30d, 14KB) to reduce prompt noise",
  "evidence": {
    "days_analyzed": 30,
    "hit_count": 0,
    "size_bytes": 14497,
    "archive_pool": 3,
    "recall_pool": 0
  },
  "motivating_corrections": [],
  "proposed_at": "2026-05-27T03:00:00Z"
}
```

### Integration with Agent Workflow

**Step 2 (Propose):**
```bash
python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --propose
```
If exit 0 → use output to fill `proposal.json`. If exit 1 → no candidate, pick another strategy.

**Step 3 (Apply):**
```bash
python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py \
--apply --name <name> --action <archive|recall> --source <claude|agents>
```
Exit 3 → treat as safety_check failure (revert, dead-letter, stop).

**Step 7 (Revert — both observation mode and auto-commit revert):**
```bash
python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py \
--revert --name <name> --action <archive|recall> --source <claude|agents>
```
Must run BEFORE `git reset --hard` since the filesystem move is not tracked by git.
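
Because the move is a plain filesystem operation, a revert is the same move with source and destination swapped. The sketch below illustrates that symmetry; `skill_dirs` and `move_skill` here are hypothetical helpers with an explicit `base` directory for testability, not the actual `skill_lifecycle.py` internals, and symlink handling is left out.

```python
import shutil
from pathlib import Path


def skill_dirs(base: Path, source: str):
    """Active and archive directories for a source, per the directory layout table."""
    root = base / (".claude" if source == "claude" else ".agents")
    return root / "skills", root / "_archive_skills"


def move_skill(base: Path, name: str, source: str, action: str, revert: bool = False) -> bool:
    """Apply (or undo) an archive/recall move. Returns False if nothing was moved."""
    active, archive = skill_dirs(base, source)
    src, dst = (active, archive) if action == "archive" else (archive, active)
    if revert:
        src, dst = dst, src  # undo: walk the same edge backwards
    if not (src / name).is_dir() or (dst / name).exists():
        return False  # missing skill or name collision: refuse rather than overwrite
    dst.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src / name), str(dst / name))
    return True
```

In real use `base` would presumably be `Path.home()`, and a `False` result from a revert would correspond to exit code 1.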

### Scoring Considerations

Standard replay + mechanical scorer applies. Additional context for the report:

```json
{
  "prompt_token_delta": -1847,
  "active_skill_count_before": 52,
  "active_skill_count_after": 51
}
```

A token saving of more than 1000 with score parity (Δ within the noise floor) can be treated as a positive signal: less prompt noise without quality regression.
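
One way to read this rule is sketched below. The function name `decide`, the `threshold` default of 0.02, and the `"hold"` outcome inside the noise band are assumptions; the spec only says that parity plus a large token saving may count as positive.

```python
def decide(score: float, baseline: float, prompt_token_delta: int,
           threshold: float = 0.02, min_token_saving: int = 1000) -> str:
    """Combine the replay score with the token delta into keep / revert / hold."""
    if score >= baseline:
        return "keep"
    if score < baseline - threshold:
        return "revert"
    # Inside the noise band the spec gives no fixed rule.
    # Assumption: a large token saving tips the result to keep, otherwise hold.
    if -prompt_token_delta > min_token_saving:
        return "keep"
    return "hold"
```

For example, `decide(0.79, 0.80, -1847)` yields `"keep"` under these assumptions, while the same score with a delta of `-200` yields `"hold"`.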

### Strategy Stats Integration

`skill-lifecycle` participates in `strategy_stats.py` effectiveness tracking as a single strategy bucket. Sub-types (archive vs recall) are not tracked separately — the sample size would be too small for meaningful signal.

## Hard Rule 5 Exception

The existing hard rule "Never touch `~/.claude/plugins/`" remains. `skill-lifecycle` operates on `~/.claude/skills/` and `~/.agents/skills/`, which are substrate, not plugin/cache state. The `safety_check.py` path allowlist should include:

```
~/.claude/skills/
~/.claude/_archive_skills/
~/.agents/skills/
~/.agents/_archive_skills/
```

These paths are allowed only when `strategy == "skill-lifecycle"`, and only for directory moves (not content edits or deletions).

## Out of Scope

- Batch archive/recall (violates one-change-per-run principle)
- Modifying `settings.json` (skill activation is filesystem-level, not settings-level)
- Rebuilding search indexes after moves (handled lazily on next skill invocation)
- Changing the usage detection heuristics (signal quality is independent of this strategy)

## Acceptance Criteria

1. `--propose` correctly identifies archive candidates (30d zero usage) and recall candidates (≥3 hits in 14d)
2. `--apply` with a protected skill exits 3
3. `--apply` with a non-existent skill exits 3
4. `--revert` restores original filesystem state (including symlink repointing)
5. Dead-letter blocks re-trying the same `(skill-lifecycle, skill_name)` pair
6. `strategy_stats.py` tracks `skill-lifecycle` kept/tried ratios correctly
7. Observation mode always reverts the filesystem move after scoring
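
Criterion 5 could be checked with a lookup like the one below. It assumes each dead-letter line is a JSON object carrying `strategy` and `skill_name` keys; those field names and the function name `is_dead_lettered` are hypothetical, since the spec does not define the `dead-letter.jsonl` schema.

```python
import json


def is_dead_lettered(jsonl_text: str, strategy: str, skill_name: str) -> bool:
    """True if the (strategy, skill_name) pair already appears in the dead-letter log."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore corrupt lines rather than blocking the run
        if (isinstance(entry, dict)
                and entry.get("strategy") == strategy
                and entry.get("skill_name") == skill_name):
            return True
    return False
```

Matching on the pair rather than on the skill name alone leaves other strategies, such as `skill-description-tighten`, free to target the same skill.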
7 changes: 7 additions & 0 deletions plugin.json
@@ -0,0 +1,7 @@
{
  "name": "nightly",
  "version": "1.0.0",
  "description": "Nightly self-improvement loop for Claude Code",
  "commands": ["commands/nightly.md"],
  "hooks": []
}