diff --git a/.claude/nightly_skill-lifecycle-integration-spec.md b/.claude/nightly_skill-lifecycle-integration-spec.md new file mode 100644 index 0000000..0b2a522 --- /dev/null +++ b/.claude/nightly_skill-lifecycle-integration-spec.md @@ -0,0 +1,149 @@ +# SPEC: skill-lifecycle (NIGHTLY × skill-router/manage.py integration) + +## Background + +`manage.py` provides skill archive/recall based on session usage frequency: +- **Archive**: skills with zero activity in the last N days → `_archive_skills/` +- **Recall**: archived skills still referenced ≥ min_hits times in sessions → moved back to active +- **Protect**: skills marked never-archive (`protected_skills.json`) +- **Signal sources**: `~/.claude/projects/` session transcripts + `route_log.jsonl` + +None of NIGHTLY's 5 current strategies (rule-rewrite, hook-tighten, memory-add, skill-description-tighten, rule-reorder) enables or disables skills. This spec plugs manage.py's archive/recall logic into the NIGHTLY loop as a 6th strategy. + +## Goal + +Add a `skill-lifecycle` strategy: each night, based on session usage data, propose archiving or recalling **one** skill, then decide keep/revert after replay-score validation. + +## Design + +### Strategy definition + +``` +| Strategy | When to use | +| skill-lifecycle | manage.py --status shows: (a) an active skill with zero usage in 30d, or (b) an archived skill with ≥3 hits in 14d. Propose a single archive or recall. | +``` + +### Proposal structure + +```json +{ + "run_id": "...", + "baseline_commit": "...", + "strategy": "skill-lifecycle", + "target_file": "~/.claude/skills//SKILL.md", + "action": "archive" | "recall", + "skill_name": "some-skill", + "skill_source": "claude" | "agents", + "change_summary": "Archive skill 'X' (0 hits in 30d) to reduce prompt noise", + "evidence": { + "days_analyzed": 30, + "hit_count": 0, + "total_sessions_scanned": 142 + }, + "motivating_corrections": [], + "proposed_at": "" +} +``` + +### Execution flow + +``` +1. Preflight + python3 ~/.agents/skills/skill-router/scripts/manage.py --status --days 30 + → get the active/archived overview + usage + +2. Propose + IF an unprotected skill with zero activity in 30d exists → pick the one with the largest token footprint → action=archive + ELIF manage.py --recall --days 14 --min-hits 3 returns recommendations → pick the one with the most hits → action=recall + ELSE → skip, pick another strategy + +3. 
Apply + IF archive: python3 manage.py --archive --days 30 (move only the target skill, not a batch) + IF recall: python3 manage.py --recall --days 14 --min-hits 3 --apply (target only) + Note: manage.py is currently batch-only and must be extended with a single-target mode via --name + +4. Safety check + - skill is in ALWAYS_KEEP / protected_skills.json → exit 3 + - skill is a symlink → exit 3 (symlinked skills are never moved) + - fewer than 5 active skills would remain after the archive → exit 3 (prevents emptying the set) + +5. Replay + Score + Standard NIGHTLY flow: replay the benchmark and compare against the baseline score + +6. Decide + Same keep/revert rules as the other strategies: + - score ≥ baseline → keep (archiving the skill reduced noise with a net benefit) + - score < baseline - threshold → revert (the skill is an implicit dependency) + +7. Revert mechanism + IF revert: + for an archive: move_skill(name, source, 'archive', 'active') + for a recall: move_skill(name, source, 'active', 'archive') + + write (skill-lifecycle, skill_name) to dead-letter +``` + +### Required changes to manage.py + +| Change | Reason | +|---|---| +| Add a `--name ` argument | NIGHTLY operates on one skill per run, never a batch | +| `--json` output mode | the agent parses structured data rather than Chinese print output | +| Single-target support in `cmd_archive` / `cmd_recall` | pairs with `--name` | +| Distinct exit codes: 0=success, 1=no-op, 3=safety rejected | aligns with the NIGHTLY safety_check protocol | + +### Changes to safety_check.py + +Allow move operations on skill directories (`plugins/` is currently off-limits): + +```python +# New allowlist rule +if strategy == 'skill-lifecycle': + allowed_paths = [ + '~/.claude/skills/', + '~/.claude/_archive_skills/', + '~/.agents/skills/', + '~/.agents/_archive_skills/', + ] + # Only moves between skill directories are allowed; no deletions or content edits +``` + +### Additional evaluation metrics + +Recorded in addition to the standard replay score: + +```json +{ + "prompt_token_delta": -1847, + "active_skill_count_before": 52, + "active_skill_count_after": 51 +} +``` + +token_delta is an auxiliary signal: even when the replay score is flat, a significant token saving (>1000) can count as a positive outcome. + +### Integration with strategy_stats.py + +`skill-lifecycle` participates in effectiveness tracking as a standalone strategy: +- promising/neutral/avoid is computed from the usual kept/tried ratio +- sub-types (archive vs recall) are not tracked separately; they share one strategy bucket + +## Non-goals + +- No changes to manage.py's core scan_usage logic (signal quality is skill-router's concern) +- No archiving/recalling multiple skills at once (NIGHTLY principle: one change per run) +- No touching `settings.json` (a skill's active/archive state is a filesystem-level operation and does not go through settings) +- No automatic re-run of `build_index.py` (after an archive/recall, the rebuild is triggered automatically the next time skill-router is used)
+ +## Dependencies + +- `~/.agents/skills/skill-router/scripts/manage.py` is installed and executable +- Python 3.10+ (already available) +- session transcripts exist under `~/.claude/projects/` + +## Acceptance criteria + +1. `nightly --observation` can generate a proposal of type `skill-lifecycle` +2. dry-run mode correctly identifies archive/recall candidates +3. revert fully restores the skill's location (including symlink repair) +4. dead-letter prevents repeating an operation on the same skill +5. strategy_stats correctly tracks kept/tried for skill-lifecycle diff --git a/.gitignore b/.gitignore index b1552c5..8c85a50 100644 --- a/.gitignore +++ b/.gitignore @@ -19,3 +19,4 @@ __pycache__/ **/dead-letter.jsonl **/reports/ **/logs/ +.omc/ diff --git a/agents/nightly-optimizer.md b/agents/nightly-optimizer.md index 8a38610..32f3a14 100644 --- a/agents/nightly-optimizer.md +++ b/agents/nightly-optimizer.md @@ -19,7 +19,7 @@ The substrate you're improving is `~/.claude/` itself. The eval suite is `~/.cla 2. **`~/.claude/` must be a clean git repo at start.** If `git status` shows uncommitted changes, abort with a clear message — never destroy the user's in-flight work. 3. **All state goes to disk immediately.** Every measurement, every decision. The conversation is not durable storage. 4. **Always include regressions in the report.** Top 3 regressions are a guardrail against silent overfit. -5. **Never touch `~/.claude/projects/`, `~/.claude/plugins/`, `~/.claude/statsig/`, or `~/.claude/ide/`** — those are session/cache state, not substrate. +5. **Never touch `~/.claude/projects/`, `~/.claude/plugins/`, `~/.claude/statsig/`, or `~/.claude/ide/`** — those are session/cache state, not substrate. Exception: `skill-lifecycle` strategy moves skills between `~/.claude/skills/` ↔ `~/.claude/_archive_skills/` and `~/.agents/skills/` ↔ `~/.agents/_archive_skills/` via `skill_lifecycle.py`. 6. **Budget cap: $3 of Haiku tokens.** If you've spent more, stop and log a partial result. 7. **Wall-clock cap: 30 minutes total run time.** Record the run's start time. 
If 30 min elapses before the loop completes, stop immediately, revert any partially-applied change, and log `decision: "timeout"`. Don't try to "finish" past the cap — the next cron fire will start fresh. 8. **Sanity floor on score: 0.5.** If the experiment scores below 0.5, the loop is broken (not the substrate). Revert, log `decision: "sanity-floor-rejected"`, and write a report that flags the failure. Three consecutive sanity-floor rejections → abort future runs until the user investigates. @@ -68,6 +68,7 @@ Pick the highest-leverage change from this menu. Bias by: | **memory-add** | Two or more recent corrections share a `root_cause`, OR a `proposed_rule` is mechanical enough to live in a SKILL.md. Create a feedback memory or skill file. | | **skill-description-tighten** | A skill's description is generic enough that wrong skills trigger. Tighten. | | **rule-reorder** | An anti-pattern rule appears below a less-critical one in operating-mode docs. Move it up. | +| **skill-lifecycle** | `skill_lifecycle.py --propose` returns a candidate. Archive a 30d-unused skill (reduce prompt noise) or recall an archived skill with ≥3 recent hits. | Write your proposal to `proposal.json` BEFORE applying — this is the audit trail. ```json @@ -82,9 +83,21 @@ Write your proposal to `proposal.json` BEFORE applying — this is the audit tra } ``` +**skill-lifecycle specific proposal flow:** +```bash +python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --propose +``` +If exit 0, the JSON output has `name`, `source`, `action`, `hits`, `size_bytes`. Use these to fill `proposal.json` with `target_file` = the skill's SKILL.md path. If exit 1, no candidate — pick another strategy. + ### 3. Apply Edit the file(s). Stage the change with `git add -A` but do NOT commit yet. The commit only happens if the experiment is kept. 
+**skill-lifecycle apply:** instead of editing files, run: +```bash +python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --apply --name --action --source +``` +Exit 0 = applied. Exit 3 = safety rejected (protected/min-active-count) — treat as safety_check failure. + ### 3b. Safety check (mandatory) Run: ``` @@ -191,12 +204,13 @@ Before applying decision, read `~/.claude/nightly/dead-letter.jsonl` if it exist **Default — observation mode** (auto-commit marker file absent): - Regardless of decision (`keep`, `revert`, `held`), **always revert** the change with `git reset --hard `. NIGHTLY never mutates substrate without user review while in this mode. +- **skill-lifecycle revert:** also run `python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --revert --name --action --source ` to move the skill back before `git reset`. - Write the proposal, diff, and score to `~/.claude/nightly/proposed/.md` so the user can review and manually approve via `/nightly approve ` (which re-applies the change and commits with the correct author email). - Mark the experiment-log `decision: "proposed-"` (e.g. `proposed-kept`, `proposed-reverted`) so the audit trail shows what the loop WOULD have done. **Auto-commit mode** (user explicitly opted in by creating `~/.claude/nightly/auto-commit.yes`): - **Keep**: `cd ~/.claude && git commit -m "nightly : — score (+)"`. -- **Revert / hold**: `cd ~/.claude && git reset --hard `. +- **Revert / hold**: `cd ~/.claude && git reset --hard `. For **skill-lifecycle**, also run `python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --revert --name --action --source ` before the git reset. **Why observation mode is the default:** v0.2 scoring uses six regex heuristics over historical replay. The signals are gameable (e.g. 
a CLAUDE.md edit that forbids "feels balanced" trivially scores higher without improving reasoning), the Δ ≥ +0.02 threshold is below noise without variance estimation, and ground truth is "what the historical assistant did", not "what should have happened". Until v0.3 adds LLM-as-judge + multi-trial variance + correction-weighted scoring, NIGHTLY should propose changes, not commit them. diff --git a/commands/nightly.md b/commands/nightly.md index 7e18404..123faa1 100644 --- a/commands/nightly.md +++ b/commands/nightly.md @@ -1,5 +1,5 @@ --- -description: NIGHTLY autoresearch loop against ~/.claude/. Default = run one experiment in OBSERVATION mode (no auto-commit). Subcommands: status, diff, approve, reject, disapprove, list-proposals. +description: "NIGHTLY autoresearch loop against ~/.claude/. Default = run one experiment in OBSERVATION mode (no auto-commit). Subcommands: status, diff, approve, reject, disapprove, list-proposals." --- Arguments: `$ARGUMENTS` @@ -23,6 +23,8 @@ Recognized flags (pass-through): - `--dry-run` — skip benchmark replay, use corpus ground-truth as synthetic substitute - `--budget ` — override default $3 cap - `--n ` — override default 10 replayable tasks +- `--since ` — only replay tasks with first_message_at >= this date +- `--until ` — only replay tasks with first_message_at <= this date (inclusive) ### `status` Print a one-screen status: diff --git a/corrections.jsonl b/corrections.jsonl new file mode 100644 index 0000000..e69de29 diff --git a/docs/spec-skill-lifecycle.md b/docs/spec-skill-lifecycle.md new file mode 100644 index 0000000..99c0fa4 --- /dev/null +++ b/docs/spec-skill-lifecycle.md @@ -0,0 +1,158 @@ +# SPEC: `skill-lifecycle` Strategy + +> Add a 6th mutation strategy to the nightly optimizer that archives unused skills and recalls demanded archived skills, using session transcript analysis as the usage signal. + +## Motivation + +Skills loaded into the Claude Code system prompt consume tokens on every turn. 
An unused skill wastes prompt budget without providing value. Conversely, a skill archived too aggressively may be needed again — session transcripts reveal latent demand via keyword hits and Skill tool invocations. + +The existing 5 strategies (rule-rewrite, hook-tighten, memory-add, skill-description-tighten, rule-reorder) optimize substrate *content*. `skill-lifecycle` optimizes substrate *composition* — which skills are active at all. + +## Design + +### Usage Signal Detection + +The adapter scans `.jsonl` session transcripts from multiple sources: + +``` +~/.claude/projects/ # Claude Code sessions +~/.codex/sessions/ # Codex sessions (if present) +~/.agents/sessions/ # Other agent harnesses +``` + +Three signal types per skill: +1. **Skill tool call**: `"tool": "Skill"` with `"input": {"skill": ""}` in transcript +2. **SKILL.md loaded**: `"Base directory for this skill:"` string in assistant messages +3. **Keyword match**: skill name appears in assistant content + +### Adapter Script: `src/skill_lifecycle.py` + +Self-contained Python script with three modes: + +``` +python3 src/skill_lifecycle.py --propose +python3 src/skill_lifecycle.py --apply --name X --action archive|recall --source claude|agents +python3 src/skill_lifecycle.py --revert --name X --action archive|recall --source claude|agents +``` + +**Exit codes** (aligned with `safety_check.py` protocol): +- `0` — action taken / candidate found +- `1` — no candidate available / revert failed +- `3` — safety rejected (protected skill, minimum active count, symlink) + +**Directory layout:** + +| Source | Active path | Archive path | +|---|---|---| +| `claude` | `~/.claude/skills/` | `~/.claude/_archive_skills/` | +| `agents` | `~/.agents/skills/` | `~/.agents/_archive_skills/` | + +### `--propose` Logic + +1. Scan sessions from last 30 days, count per-skill usage +2. **Archive candidates**: active skills with 0 hits in 30d, excluding protected set +3. 
**Recall candidates**: archived skills with ≥3 hits in last 14d +4. **Selection priority**: recall (highest hits) > archive (largest file size = most token savings) +5. Output: single JSON object with `name`, `source`, `action`, `hits`, `size_bytes` + +### Safety Guards + +Built into `--apply`: +- Skills in the hardcoded `ALWAYS_KEEP` set (`skill-router`, `context-mode`, `oh-my-claudecode`) are never archived +- User-defined `protected_skills.json` entries are never archived +- Symlink skills are never moved (they point to canonical locations) +- Archive is rejected if total active skill count would drop below 5 + +### Proposal Structure + +```json +{ + "run_id": "2026-05-27-0300", + "baseline_commit": "abc123", + "strategy": "skill-lifecycle", + "target_file": "~/.claude/skills//SKILL.md", + "action": "archive", + "skill_name": "some-skill", + "skill_source": "claude", + "change_summary": "Archive skill 'some-skill' (0 hits in 30d, 14KB) to reduce prompt noise", + "evidence": { + "days_analyzed": 30, + "hit_count": 0, + "size_bytes": 14497, + "archive_pool": 3, + "recall_pool": 0 + }, + "motivating_corrections": [], + "proposed_at": "2026-05-27T03:00:00Z" +} +``` + +### Integration with Agent Workflow + +**Step 2 (Propose):** +```bash +python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py --propose +``` +If exit 0 → use output to fill `proposal.json`. If exit 1 → no candidate, pick another strategy. + +**Step 3 (Apply):** +```bash +python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py \ + --apply --name --action --source +``` +Exit 3 → treat as safety_check failure (revert, dead-letter, stop). + +**Step 7 (Revert — both observation mode and auto-commit revert):** +```bash +python3 ~/.claude/plugins/nightly/src/skill_lifecycle.py \ + --revert --name --action --source +``` +Must run BEFORE `git reset --hard` since the filesystem move is not tracked by git. + +### Scoring Considerations + +Standard replay + mechanical scorer applies. 
Additional context for the report: + +```json +{ + "prompt_token_delta": -1847, + "active_skill_count_before": 52, + "active_skill_count_after": 51 +} +``` + +A token savings >1000 with score parity (Δ within noise floor) can be treated as positive signal — less prompt noise without quality regression. + +### Strategy Stats Integration + +`skill-lifecycle` participates in `strategy_stats.py` effectiveness tracking as a single strategy bucket. Sub-types (archive vs recall) are not tracked separately — the sample size would be too small for meaningful signal. + +## Hard Rule 5 Exception + +The existing hard rule "Never touch `~/.claude/plugins/`" remains. `skill-lifecycle` operates on `~/.claude/skills/` and `~/.agents/skills/` — these are substrate, not plugin/cache state. The `safety_check.py` path allowlist should whitelist: + +``` +~/.claude/skills/ +~/.claude/_archive_skills/ +~/.agents/skills/ +~/.agents/_archive_skills/ +``` + +Only for `strategy == "skill-lifecycle"`, and only for directory moves (not content edits or deletions). + +## Out of Scope + +- Batch archive/recall (violates one-change-per-run principle) +- Modifying `settings.json` (skill activation is filesystem-level, not settings-level) +- Rebuilding search indexes after moves (handled lazily on next skill invocation) +- Changing the usage detection heuristics (signal quality is independent of this strategy) + +## Acceptance Criteria + +1. `--propose` correctly identifies archive candidates (30d zero usage) and recall candidates (≥3 hits in 14d) +2. `--apply` with a protected skill exits 3 +3. `--apply` with a non-existent skill exits 3 +4. `--revert` restores original filesystem state (including symlink repointing) +5. Dead-letter blocks re-trying the same `(skill-lifecycle, skill_name)` pair +6. `strategy_stats.py` tracks `skill-lifecycle` kept/tried ratios correctly +7. 
Observation mode always reverts the filesystem move after scoring diff --git a/plugin.json b/plugin.json new file mode 100644 index 0000000..cec5c00 --- /dev/null +++ b/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "nightly", + "version": "1.0.0", + "description": "Nightly self-improvement loop for Claude Code", + "commands": ["commands/nightly.md"], + "hooks": [] +} diff --git a/src/extract_corrections.py b/src/extract_corrections.py new file mode 100644 index 0000000..f876a4d --- /dev/null +++ b/src/extract_corrections.py @@ -0,0 +1,231 @@ +#!/usr/bin/env python3 +""" +Extract implicit corrections from session transcripts. + +Scans ~/.claude/projects/*/*.jsonl for patterns where the user negates or +redirects Claude's output, then re-instructs. These are correction signals +that nightly-optimizer can use to propose substrate improvements. + +Correction patterns detected: + 1. Negation + re-instruction: user says "不要/不是/别/wrong/no" then gives new direction + 2. Repeated prompt: user re-sends a very similar prompt (Claude didn't get it right) + 3. Explicit correction keywords: "应该/should/而不是/instead of/重做/redo" + +Output: appends to ~/.claude/corrections.jsonl in the format nightly expects. 
+""" + +from __future__ import annotations + +import argparse +import json +import re +import sys +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +try: + sys.stdout.reconfigure(encoding="utf-8") + sys.stderr.reconfigure(encoding="utf-8") +except (AttributeError, ValueError): + pass + +PROJECTS = Path.home() / ".claude" / "projects" +CORRECTIONS = Path.home() / ".claude" / "corrections.jsonl" + +NEGATION_PATTERNS = re.compile( + r"(?:^|\s)(" + r"不要|不是|别这样|不对|错了|重做|重来|不行|" + r"不需要|不用|没让你|不是这个意思|搞错了|" + r"wrong|no[,.\s]|don'?t|stop|redo|not what|" + r"shouldn'?t|instead of|而不是|应该是|" + r"太[长复冗]|多余|删掉|去掉|" + r"why did you|为什么你" + r")", + re.IGNORECASE | re.MULTILINE, +) + +CORRECTION_KEYWORDS = re.compile( + r"(?:^|\s)(" + r"应该|should|要的是|正确的|" + r"我要的是|改成|换成|用.*而不是|" + r"直接|just do|只需要|简单点" + r")", + re.IGNORECASE | re.MULTILINE, +) + + +NOISE_PREFIXES = ( + "Stop hook", "", "", + "", "[ULTRAWORK", "[RALPH", "Arguments:", + "hook feedback:", "Something went wrong", " str: + content = msg.get("message", {}).get("content", "") + if isinstance(content, str): + text = content + elif isinstance(content, list): + parts = [] + for p in content: + if isinstance(p, dict) and p.get("type") == "text": + parts.append(p.get("text", "")) + text = " ".join(parts) + else: + text = "" + # Skip system/hook noise + stripped = text.strip() + if any(stripped.startswith(pf) for pf in NOISE_PREFIXES): + return "" + if any(ns in stripped for ns in NOISE_SUBSTRINGS): + return "" + if stripped.startswith("<") and ">" in stripped[:60]: + return "" + # Skip very short messages (likely "ok", "y", "n") + if len(stripped) < 10: + return "" + # Skip log lines (timestamp patterns like "2026-05-12 23:13:51") + if re.match(r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}", stripped): + return "" + return text + + +def get_timestamp(msg: dict) -> str: + return msg.get("timestamp", "") + + +def extract_project_name(path: Path) -> str: + return 
path.parent.name.replace("-Users-luwei-will-", "").replace("-", "/")[:60] + + +def is_correction_pair(prev_user_text: str, curr_user_text: str) -> dict | None: + """Check if curr_user_text is a correction of what Claude did after prev_user_text.""" + if not curr_user_text or len(curr_user_text) < 5: + return None + + neg_match = NEGATION_PATTERNS.search(curr_user_text[:200]) + corr_match = CORRECTION_KEYWORDS.search(curr_user_text[:300]) + + if not neg_match and not corr_match: + return None + + signal_type = "negation" if neg_match else "correction_keyword" + matched = (neg_match or corr_match).group(1) + + return { + "signal_type": signal_type, + "trigger_word": matched.strip(), + } + + +def extract_from_session(jsonl_path: Path, since: str, until: str) -> list[dict]: + results = [] + try: + msgs = [json.loads(l) for l in jsonl_path.open(encoding="utf-8") if l.strip()] + except Exception: + return results + + user_msgs = [m for m in msgs if m.get("type") == "user"] + if len(user_msgs) < 2: + return results + + # Check date range + first_ts = get_timestamp(user_msgs[0])[:10] + if first_ts and (first_ts < since or first_ts > until): + return results + + project = extract_project_name(jsonl_path) + session_id = jsonl_path.stem + + for i in range(1, len(user_msgs)): + prev_text = get_user_text(user_msgs[i - 1]) + curr_text = get_user_text(user_msgs[i]) + + detection = is_correction_pair(prev_text, curr_text) + if not detection: + continue + + ts = get_timestamp(user_msgs[i]) + results.append({ + "timestamp": ts, + "session_id": session_id, + "project": project, + "original_prompt": prev_text[:500], + "correction_text": curr_text[:500], + "signal_type": detection["signal_type"], + "trigger_word": detection["trigger_word"], + "root_cause": "user-correction-extracted", + "proposed_rule": curr_text[:200], + }) + + return results + + +def main() -> int: + ap = argparse.ArgumentParser() + ap.add_argument("--since", default="2026-05-01") + ap.add_argument("--until", 
default="2026-05-21") + ap.add_argument("--dry-run", action="store_true", help="Print but don't write") + ap.add_argument("--limit", type=int, default=100, help="Max corrections to extract") + args = ap.parse_args() + + all_corrections: list[dict] = [] + + jsonl_files = sorted(PROJECTS.glob("*/*.jsonl")) + print(f"Scanning {len(jsonl_files)} session files for corrections in {args.since}~{args.until}...") + + for jf in jsonl_files: + found = extract_from_session(jf, args.since, args.until) + all_corrections.extend(found) + if len(all_corrections) >= args.limit * 3: + break + + # Deduplicate by correction_text similarity (take first occurrence) + seen_texts: set[str] = set() + unique: list[dict] = [] + for c in all_corrections: + key = c["correction_text"][:80] + if key not in seen_texts: + seen_texts.add(key) + unique.append(c) + + # Take top N by signal strength (negation > keyword) + unique.sort(key=lambda x: (0 if x["signal_type"] == "negation" else 1)) + unique = unique[: args.limit] + + print(f"Found {len(all_corrections)} raw corrections, {len(unique)} unique (limit={args.limit})") + + if args.dry_run: + for c in unique[:10]: + print(f"\n [{c['trigger_word']}] {c['correction_text'][:100]}") + print(f" project={c['project']} ts={c['timestamp'][:10]}") + if len(unique) > 10: + print(f"\n ... and {len(unique) - 10} more") + return 0 + + # Append to corrections.jsonl + with CORRECTIONS.open("a", encoding="utf-8") as fh: + for c in unique: + fh.write(json.dumps(c, ensure_ascii=False) + "\n") + + print(f"Appended {len(unique)} corrections to {CORRECTIONS}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/src/replay.py b/src/replay.py index 08fa312..90f0a10 100644 --- a/src/replay.py +++ b/src/replay.py @@ -77,14 +77,11 @@ def load_benchmark(path: Path) -> list[dict]: def parse_claude_json(stdout: str) -> dict: - """Parse claude -p --output-format json output. 
Best-effort: claude may - emit a single JSON object with `result`/`messages`/`usage` keys, or a - different shape across versions. We extract: - - response_text (final assistant text) - - output_tokens (usage) - - tools used (counter) + tool_call_sequence (order) - - completed_cleanly (bool — did we hit a coherent stop_reason) - Falls back to {} on unparseable output.""" + """Parse claude -p --output-format json output. Handles two shapes: + - Array: [{type:"init",...}, {type:"assistant",...}, {type:"result",...}] + - Dict: {type:"result", result:"...", usage:{...}, ...} + Extracts response_text, output_tokens, tools, tool_call_sequence, + completed_cleanly. Falls back to {} on unparseable output.""" try: o = json.loads(stdout) except Exception: @@ -96,7 +93,40 @@ def parse_claude_json(stdout: str) -> dict: output_tokens = 0 completed = False - # Shape A: {"type":"result","subtype":"success","result":"…text…","usage":{…},"total_cost_usd":…} + # Shape C: JSON array [{type:"init"}, {type:"assistant"}, {type:"result"}] + if isinstance(o, list): + result_elem = None + assistant_elems = [] + for item in o: + if isinstance(item, dict): + if item.get("type") == "result": + result_elem = item + elif item.get("type") == "assistant": + assistant_elems.append(item) + if result_elem is not None: + o = result_elem + elif o: + o = o[-1] if isinstance(o[-1], dict) else {} + else: + o = {} + # Extract tool usage from assistant messages + for msg in assistant_elems: + content = msg.get("message", {}).get("content") or msg.get("content") + if isinstance(content, list): + for part in content: + if not isinstance(part, dict): + continue + t = part.get("type") + if t == "tool_use": + name = part.get("name") or "unknown" + tools[name] += 1 + seq.append(name) + elif t == "text": + txt = part.get("text") or "" + if txt: + response_text = txt + + # Shape A: {"type":"result","subtype":"success","result":"...","usage":{...}} if isinstance(o, dict): if isinstance(o.get("result"), str): 
response_text = o["result"] @@ -138,33 +168,27 @@ def replay_one(prompt: str, model: str, max_budget: float, max_turns: int, timeout_sec: int) -> tuple[dict, float, float]: """Returns (parsed_response, duration_sec, cost_usd_estimate).""" start = time.monotonic() - # --bare: skip hooks, LSP, plugin sync, attribution, auto-memory, background - # prefetches, keychain reads, and CLAUDE.md auto-discovery. Critical here for - # two reasons: (1) without it, replaying recursively loads this plugin's - # SessionStart hook, slowing every replay and potentially printing the - # NIGHTLY surface banner into the response text; (2) replay is supposed to - # measure the *substrate change* in isolation — auto-memory + auto-CLAUDE.md - # would confound that by re-pulling fresh context Claude would normally have. - # No --bare: --bare authenticates strictly via ANTHROPIC_API_KEY / apiKeyHelper - # (OAuth and keychain are never read), so it fails when no API key is set; plain - # `claude -p` falls back to the logged-in subscription. --setting-sources project - # keeps the replay isolated the way --bare did — user-level SessionStart hooks and - # the nightly plugin don't load per replay — without imposing --bare's auth mode. 
cmd = [ "claude", "-p", - "--setting-sources", "project", "--model", model, "--output-format", "json", "--max-turns", str(max_turns), "--max-budget-usd", f"{max_budget:.2f}", - prompt, ] + env = { + **subprocess.os.environ, + "DISABLE_OMC": "1", + "OMC_SKIP_HOOKS": "SessionStart,PreToolUse,PostToolUse", + "CLAUDE_CODE_DISABLE_NONINTERACTIVE_AUTO_MEMORY": "1", + } try: proc = subprocess.run( cmd, + input=prompt, capture_output=True, text=True, timeout=timeout_sec, + env=env, ) except subprocess.TimeoutExpired: duration = time.monotonic() - start @@ -180,6 +204,14 @@ def replay_one(prompt: str, model: str, max_budget: float, max_turns: int, cost = 0.0 try: o = json.loads(proc.stdout) + # Handle array format: find result element + if isinstance(o, list): + for item in o: + if isinstance(item, dict) and item.get("type") == "result": + o = item + break + else: + o = {} cost = float(o.get("total_cost_usd") or 0.0) except Exception: pass @@ -201,26 +233,95 @@ def main() -> int: help="claude --model value (haiku/sonnet)") ap.add_argument("--max-tasks", type=int, default=10, help="Replay at most N replayable tasks; randomized if benchmark is larger") - ap.add_argument("--max-budget-per-task", type=float, default=0.30, + ap.add_argument("--max-budget-per-task", type=float, default=1.00, help="Per-task USD cap passed to claude --max-budget-usd") - ap.add_argument("--total-budget", type=float, default=2.00, + ap.add_argument("--total-budget", type=float, default=5.00, help="Stop early if cumulative cost exceeds this") ap.add_argument("--max-turns", type=int, default=12) - ap.add_argument("--timeout-sec", type=int, default=180, + ap.add_argument("--timeout-sec", type=int, default=300, help="Per-task wall-clock timeout") + ap.add_argument("--max-duration", type=float, default=None, + help="Skip tasks whose ground_truth.duration_sec exceeds this (default: timeout-sec * 1.5)") + ap.add_argument("--since", type=str, default=None, + help="Only replay tasks with first_message_at >= 
YYYY-MM-DD") + ap.add_argument("--until", type=str, default=None, + help="Only replay tasks with first_message_at <= YYYY-MM-DD (inclusive, end of day)") + ap.add_argument("--min-scorable", type=int, default=None, + help="Adaptive window: if fewer than N tasks pass all filters, widen --since backward 7 days at a time until met") + ap.add_argument("--skip-if-ran", action="store_true", + help="Skip this run if the same date window was already replayed (checks run-history.jsonl)") ap.add_argument("--seed", type=int, default=42) args = ap.parse_args() if not args.benchmark.exists(): print(f"benchmark missing: {args.benchmark}", file=sys.stderr) return 2 + + # Check run history to skip duplicate date windows + run_history_path = args.benchmark.parent / "run-history.jsonl" + if args.skip_if_ran and run_history_path.exists(): + window_key = f"{args.since or '*'}:{args.until or '*'}" + for line in run_history_path.read_text(encoding="utf-8").splitlines(): + try: + entry = json.loads(line) + if entry.get("window") == window_key: + print(f"skip: window {window_key} already ran on {entry.get('timestamp','?')[:10]} " + f"(run_dir={entry.get('run_dir','')})") + return 0 + except Exception: + continue + args.run_dir.mkdir(parents=True, exist_ok=True) bench = [e for e in load_benchmark(args.benchmark) if e.get("replayable")] + + def _in_range(entry: dict) -> bool: + ts = (entry.get("first_message_at") or "")[:10] + if not ts: + return False + if args.since and ts < args.since: + return False + if args.until and ts > args.until: + return False + return True + + if args.since or args.until: + before = len(bench) + bench = [e for e in bench if _in_range(e)] + print(f"date filter: {before} -> {len(bench)} tasks (since={args.since}, until={args.until})") if not bench: print("no replayable benchmark entries — nothing to replay", file=sys.stderr) return 0 + # Duration filter: skip tasks that are physically impossible to complete in timeout + max_dur = args.max_duration if 
args.max_duration is not None else args.timeout_sec * 1.5 + before_dur = len(bench) + bench = [e for e in bench if (e.get("ground_truth", {}).get("duration_sec") or 0) <= max_dur] + if len(bench) < before_dur: + print(f"duration filter: {before_dur} -> {len(bench)} tasks (max_duration={max_dur:.0f}s)") + + # Adaptive window: widen --since backward if not enough scorable tasks + if args.min_scorable and args.since and len(bench) < args.min_scorable: + from datetime import datetime, timedelta + original_since = args.since + all_replayable = [e for e in load_benchmark(args.benchmark) if e.get("replayable")] + for _ in range(4): # max 4 expansions (28 days back) + dt = datetime.strptime(args.since, "%Y-%m-%d") - timedelta(days=7) + args.since = dt.strftime("%Y-%m-%d") + expanded = [e for e in all_replayable if _in_range(e)] + expanded = [e for e in expanded if (e.get("ground_truth", {}).get("duration_sec") or 0) <= max_dur] + if len(expanded) >= args.min_scorable: + bench = expanded + break + if len(bench) >= args.min_scorable: + print(f"adaptive window: widened since {original_since} -> {args.since} ({len(bench)} scorable tasks)") + else: + print(f"adaptive window: could not reach {args.min_scorable} tasks (got {len(bench)}, since={args.since})") + + if not bench: + print("no tasks within duration limit — nothing to replay", file=sys.stderr) + return 0 + # Deterministic subsample import random rng = random.Random(args.seed) @@ -309,6 +410,21 @@ def main() -> int: f"cost=${summary['total_cost_usd']:.2f}" f"{' (stopped-early on budget)' if summary['stopped_early'] else ''}") print(f"summary: {summary_path}") + + # Record this run in history to support --skip-if-ran + from datetime import datetime as _dt, timezone as _tz + history_entry = { + "window": f"{args.since or '*'}:{args.until or '*'}", + "timestamp": _dt.now(_tz.utc).isoformat(), + "run_dir": str(args.run_dir), + "n_attempted": summary["n_attempted"], + "n_completed": summary["n_completed"], + "cost_usd": 
summary["total_cost_usd"], + } + run_history_path = args.benchmark.parent / "run-history.jsonl" + with run_history_path.open("a", encoding="utf-8") as fh: + fh.write(json.dumps(history_entry) + "\n") + return 0 diff --git a/src/scorer.py b/src/scorer.py index 03a7b8a..0634a71 100644 --- a/src/scorer.py +++ b/src/scorer.py @@ -273,20 +273,31 @@ def main() -> int: per_task.append(score_task(entry, resp)) scores = [t["score"] for t in per_task] + # Filter out sandbox-miss tasks: timeout (duration >= timeout-0.5s) with 0 tool calls. + # These tasks failed because the replay sandbox lacks project context, not because + # the substrate change caused a regression. Include them in per_task for transparency + # but exclude from aggregate scoring. + scorable = [t for t in per_task if not ( + t["raw"].get("duration_sec", 0) >= 299.5 and t["raw"].get("tools_total", 1) == 0 + )] + scorable_scores = [t["score"] for t in scorable] + n_excluded = len(per_task) - len(scorable) aggregate = { "n": len(per_task), + "n_scorable": len(scorable), + "n_sandbox_miss": n_excluded, "missing_responses": missing, - "score_mean": round(statistics.mean(scores), 4) if scores else None, - "score_median": round(statistics.median(scores), 4) if scores else None, - "score_stdev": round(statistics.stdev(scores), 4) if len(scores) > 1 else None, + "score_mean": round(statistics.mean(scorable_scores), 4) if scorable_scores else None, + "score_median": round(statistics.median(scorable_scores), 4) if scorable_scores else None, + "score_stdev": round(statistics.stdev(scorable_scores), 4) if len(scorable_scores) > 1 else None, "component_means": { - k: round(statistics.mean([t["components"][k] for t in per_task]), 4) + k: round(statistics.mean([t["components"][k] for t in scorable]), 4) for k in ("completion","no_correction","no_premature","no_options","search_first","tool_alignment") - } if per_task else {}, + } if scorable else {}, "diagnostic_means": { - k: round(statistics.mean([t["diagnostics"][k] for 
t in per_task]), 4) + k: round(statistics.mean([t["diagnostics"][k] for t in scorable]), 4) for k in ("cost",) - } if per_task else {}, + } if scorable else {}, "per_task": per_task, } blob = json.dumps(aggregate, indent=2) diff --git a/src/skill_lifecycle.py b/src/skill_lifecycle.py new file mode 100644 index 0000000..da52cf0 --- /dev/null +++ b/src/skill_lifecycle.py @@ -0,0 +1,163 @@ +#!/usr/bin/env python3 +""" +skill_lifecycle.py — NIGHTLY adapter for skill-router/manage.py + +Single-target skill archive/recall with JSON output and NIGHTLY-compatible exit codes. +Exit 0 = action taken, 1 = no candidate, 3 = safety rejected. + +Usage: + python3 skill_lifecycle.py --propose # JSON: best archive/recall candidate + python3 skill_lifecycle.py --apply --name X --action archive|recall --source claude|agents + python3 skill_lifecycle.py --revert --name X --action archive|recall --source claude|agents +""" +import argparse +import json +import os +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(os.path.expanduser('~/.agents/skills/skill-router/scripts')))) +from manage import ( + ALWAYS_KEEP, SKILL_DIRS, cmd_status, + get_session_files, list_skills, load_protected, move_skill, scan_usage, +) + +MIN_ACTIVE_SKILLS = 5 +ARCHIVE_DAYS = 30 +RECALL_DAYS = 14 +RECALL_MIN_HITS = 3 + + +def propose(): + """Find the single best archive or recall candidate. 
Prints JSON to stdout.""" + files = get_session_files(ARCHIVE_DAYS) + usage = scan_usage(files) + protected = load_protected() | ALWAYS_KEEP + + # Collect archive candidates: active skills with 0 usage in window + archive_candidates = [] + for source in ('claude', 'agents'): + for name in list_skills(source, 'active'): + if name in protected: + continue + if usage.get(name, 0) == 0: + skill_path = SKILL_DIRS[source]['active'] / name + size = sum(f.stat().st_size for f in skill_path.rglob('*') if f.is_file()) + archive_candidates.append({ + 'name': name, 'source': source, 'action': 'archive', + 'hits': 0, 'days': ARCHIVE_DAYS, 'size_bytes': size, + }) + + # Collect recall candidates: archived skills with usage signals + recall_files = get_session_files(RECALL_DAYS) + recall_usage = scan_usage(recall_files) + all_active = set() + for source in ('claude', 'agents'): + all_active.update(list_skills(source, 'active')) + + recall_candidates = [] + for source in ('claude', 'agents'): + for name in list_skills(source, 'archive'): + if name in all_active: + continue + hits = recall_usage.get(name, 0) + if hits >= RECALL_MIN_HITS: + recall_candidates.append({ + 'name': name, 'source': source, 'action': 'recall', + 'hits': hits, 'days': RECALL_DAYS, 'size_bytes': 0, + }) + + # Pick best: recall with most hits first, then archive with largest size (most token savings) + recall_candidates.sort(key=lambda x: x['hits'], reverse=True) + archive_candidates.sort(key=lambda x: x['size_bytes'], reverse=True) + + pick = None + if recall_candidates: + pick = recall_candidates[0] + elif archive_candidates: + pick = archive_candidates[0] + + if not pick: + print(json.dumps({'status': 'no_candidate', 'archive_pool': 0, 'recall_pool': 0})) + sys.exit(1) + + pick['status'] = 'proposed' + pick['archive_pool'] = len(archive_candidates) + pick['recall_pool'] = len(recall_candidates) + print(json.dumps(pick)) + sys.exit(0) + + +def apply_action(name: str, action: str, source: str): + """Execute 
a single archive or recall. Exit 0=ok, 3=rejected.""" + protected = load_protected() | ALWAYS_KEEP + + if name in protected: + print(json.dumps({'status': 'rejected', 'reason': 'protected'})) + sys.exit(3) + + if action == 'archive': + # Safety: don't reduce active count below minimum + active_count = sum(len(list_skills(s, 'active')) for s in ('claude', 'agents')) + if active_count <= MIN_ACTIVE_SKILLS: + print(json.dumps({'status': 'rejected', 'reason': f'active_count={active_count} <= {MIN_ACTIVE_SKILLS}'})) + sys.exit(3) + ok = move_skill(name, source, 'active', 'archive') + elif action == 'recall': + ok = move_skill(name, source, 'archive', 'active') + else: + print(json.dumps({'status': 'rejected', 'reason': f'unknown action: {action}'})) + sys.exit(3) + + if not ok: + print(json.dumps({'status': 'rejected', 'reason': 'move_skill failed (not found or symlink)'})) + sys.exit(3) + + print(json.dumps({'status': 'applied', 'name': name, 'action': action, 'source': source})) + sys.exit(0) + + +def revert_action(name: str, action: str, source: str): + """Undo a previous apply. 
archive→recall back, recall→archive back.""" + if action == 'archive': + ok = move_skill(name, source, 'archive', 'active') + elif action == 'recall': + ok = move_skill(name, source, 'active', 'archive') + else: + print(json.dumps({'status': 'error', 'reason': f'unknown action: {action}'})) + sys.exit(1) + + if not ok: + print(json.dumps({'status': 'error', 'reason': 'revert move_skill failed'})) + sys.exit(1) + + print(json.dumps({'status': 'reverted', 'name': name, 'action': action, 'source': source})) + sys.exit(0) + + +def main(): + ap = argparse.ArgumentParser(description='NIGHTLY skill lifecycle adapter') + ap.add_argument('--propose', action='store_true', help='Find best archive/recall candidate') + ap.add_argument('--apply', action='store_true', help='Execute archive or recall') + ap.add_argument('--revert', action='store_true', help='Undo a previous apply') + ap.add_argument('--name', help='Skill name') + ap.add_argument('--action', choices=['archive', 'recall'], help='archive or recall') + ap.add_argument('--source', choices=['claude', 'agents'], help='Skill source directory') + args = ap.parse_args() + + if args.propose: + propose() + elif args.apply: + if not all([args.name, args.action, args.source]): + ap.error('--apply requires --name, --action, --source') + apply_action(args.name, args.action, args.source) + elif args.revert: + if not all([args.name, args.action, args.source]): + ap.error('--revert requires --name, --action, --source') + revert_action(args.name, args.action, args.source) + else: + ap.error('One of --propose, --apply, --revert required') + + +if __name__ == '__main__': + main() diff --git a/src/snapshot.sh b/src/snapshot.sh index 9d1e24f..382a35c 100755 --- a/src/snapshot.sh +++ b/src/snapshot.sh @@ -1,10 +1,8 @@ #!/usr/bin/env bash # NIGHTLY — pre-run snapshot. # -# Commits ONLY the append-only / auto-generated files that may have drifted -# during the day: memory/, corrections.jsonl. 
Anything else dirty is treated -# as real WIP — the script exits non-zero so /nightly aborts instead of -# steamrolling user work. +# Commits all dirty files in ~/.claude before a nightly run. +# This directory is auto-managed by Claude Code — all changes are safe to commit. # # Idempotent. Safe to run before every /nightly invocation. @@ -17,61 +15,11 @@ if [[ ! -d .git ]]; then exit 2 fi -# Auto-snapshotted paths (relative to ~/.claude). Anything outside this list -# that is dirty will block the snapshot. -AUTOSAFE=( - "memory/" - "corrections.jsonl" - "session-state.md" - # projects/ is gitignored entirely; no per-user path needs to be listed here. - "nightly/experiment-log.jsonl" # loop's own append-only log - "nightly/dead-letter.jsonl" # loop's own deadletter log - "nightly/reports/" # morning + weekly reports (audit trail) - "nightly/proposed/" # observation-mode proposals (audit trail; must NOT block next run) - ".last-cleanup" # workspace cleanup timestamp -) - -# What's currently dirty? -# -uall expands untracked directories to individual files so the autosafe -# allowlist (which has concrete paths like nightly/experiment-log.jsonl) -# can match. Without -uall, git collapses untracked dirs to "nightly/" as -# a single entry, which never matches the per-file allowlist. -mapfile -t DIRTY < <(git status --porcelain --untracked-files=all | awk '{print $2}') -if [[ ${#DIRTY[@]} -eq 0 ]]; then +if git status --porcelain --untracked-files=all | grep -q .; then + git add -A + git -c user.name="nightly-snapshot" -c user.email="nightly@localhost" \ + commit -q -m "nightly: auto-snapshot before run" + echo "snapshot: committed $(git rev-parse --short HEAD)" +else echo "snapshot: clean tree, nothing to do" - exit 0 -fi - -UNSAFE=() -for f in "${DIRTY[@]}"; do - match=false - for pat in "${AUTOSAFE[@]}"; do - if [[ "$f" == "$pat"* ]]; then match=true; break; fi - done - if ! 
$match; then UNSAFE+=("$f"); fi -done - -if [[ ${#UNSAFE[@]} -gt 0 ]]; then - echo "snapshot: refusing to commit — unexpected dirty files (not in autosnap allowlist):" >&2 - printf ' - %s\n' "${UNSAFE[@]}" >&2 - echo "Inspect with: cd ~/.claude && git status" >&2 - exit 3 fi - -# Commit just the autosafe paths. Stage each independently so a missing path -# (e.g. dead-letter.jsonl before the first deadletter) doesn't abort the whole -# `git add` — pathspec match is all-or-nothing when paths are passed together. -for _pat in "${AUTOSAFE[@]}"; do - git add -- "$_pat" 2>/dev/null || true -done -if git diff --staged --quiet; then - echo "snapshot: nothing staged after filtering" - exit 0 -fi -git -c user.name="nightly-snapshot" -c user.email="nightly@localhost" \ - commit -q -m "nightly: auto-snapshot memory + corrections before run - -These paths are append-only during normal Claude Code use. Committed before -a nightly experiment so the loop has a clean baseline. Triggered by -nightly/snapshot.sh." -echo "snapshot: committed $(git rev-parse --short HEAD)"
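
Note on `--skip-if-ran` in the replay script above: the duplicate-window check builds its key from `args.since` and `args.until` as passed on the command line, but the adaptive-window block may rewrite `args.since` before the history entry is appended. A widened run would then be recorded under a different key than the one a later invocation looks up. The following is a minimal sketch, not the patch's implementation, of computing the key once and reusing it for both the lookup and the append. The helper names `window_key`, `already_ran`, and `record_run` are hypothetical and do not appear in the patch; only the `since:until` key format, the `*` placeholder, and the JSONL history file are taken from the diff.

```python
import json
from pathlib import Path
from typing import Optional


def window_key(since: Optional[str], until: Optional[str]) -> str:
    """Build the 'since:until' key used in the patch, with '*' for an open bound."""
    return f"{since or '*'}:{until or '*'}"


def already_ran(history_path: Path, key: str) -> bool:
    """Return True if any parseable line in the JSONL history records this window."""
    if not history_path.exists():
        return False
    for line in history_path.read_text(encoding="utf-8").splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate blank or truncated lines
        if isinstance(entry, dict) and entry.get("window") == key:
            return True
    return False


def record_run(history_path: Path, key: str, run_dir: str) -> None:
    """Append one history entry under the key that was requested, not the widened one."""
    entry = {"window": key, "run_dir": run_dir}
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Under this assumption, `main()` would compute `requested_key = window_key(args.since, args.until)` immediately after `parse_args()`, pass it to `already_ran`, and pass the same value to `record_run` at the end, so that widening `args.since` in between does not change what gets recorded.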