From 9c5307b98bf7a52c0fd63ef9fe9478c57139dd6f Mon Sep 17 00:00:00 2001 From: Jin Choi Date: Wed, 27 May 2026 22:15:01 -0700 Subject: [PATCH 1/2] =?UTF-8?q?feat(sensor):=20Zuhn-native=20session-captu?= =?UTF-8?q?re=20hook=20(Option=20A=20=E2=80=94=20credential-free)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A Claude Code SessionEnd hook that captures the session transcript as a Zuhn "session" source. MECHANICAL only — no LLM in the hook. Insight extraction stays the deliberate, gated `npm run autoknowledge` step, so Phase 8's "no auto-extraction" still holds; only capture is automated. Why this and not claude-mem: claude-mem digests via a paid LLM API (no key in this env, install is interactive). Zuhn's autoknowledge already uses the authenticated `claude` CLI — so this path has zero new credentials, zero extra API cost, no background daemon. Components: - scripts/lib/transcript.ts — parseTranscript (JSONL → user-prompt + assistant-text turns; drops isSidechain, tool_use/tool_result, assistant thinking, non-conversation types, and framework-injection user turns matching /); renderConversation (caps to 60K chars keeping the TAIL with a truncation note); buildTranscriptSource (title=first user prompt; slug/id salted by session_id → idempotent; null when there's no usable conversation). - scripts/capture-session.ts — the hook script. Reads hook JSON on stdin (guards process.stdin.isTTY to avoid blocking on manual runs), graceful no-op on missing transcript, idempotent file-exists skip, --dry-run, and NEVER throws out of the hook (try/catch around every I/O; always exit 0). - templates/hooks/session-capture.sh — bash wrapper. Errors are LOGGED to ~/.claude/session-capture.log instead of silenced (a hook you can't watch fire is one where silent failure looks like "it works"). - docs/session-pipeline-setup.md — records the design supersession: the Phase-8 "explicit intent only" constraint existed because there was no automatic quality control; the gate now provides it, so passive transcript capture is safe. Frontmatter: "session" was already added to SourceFrontmatter.type / the health glob / the autoknowledge discovery glob in the claude-mem PR (#4), so session sources flow through the existing extract→gate path unchanged. Tests: 8 (noise-stripping edge cases, cap/tail, idempotency, null on empty). Verified end-to-end on the real 38MB transcript (→ 8831-word digest), on bogus input (graceful no-op), and via the logged hook chain with a synthetic transcript + temp KB. Local codex review stalled twice (the harness backgrounded `codex exec`, which hangs on stdin/tty in that mode). Relying on the GitHub codex App for PR review. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/session-pipeline-setup.md | 51 +++++++++++ package.json | 1 + scripts/capture-session.ts | 102 +++++++++++++++++++++ scripts/lib/transcript.test.ts | 107 ++++++++++++++++++++++ scripts/lib/transcript.ts | 138 +++++++++++++++++++++++++++++ templates/hooks/session-capture.sh | 24 +++++ 6 files changed, 423 insertions(+) create mode 100644 scripts/capture-session.ts create mode 100644 scripts/lib/transcript.test.ts create mode 100644 scripts/lib/transcript.ts create mode 100644 templates/hooks/session-capture.sh diff --git a/docs/session-pipeline-setup.md b/docs/session-pipeline-setup.md index a577ff2b18..62c2a06ae8 100644 --- a/docs/session-pipeline-setup.md +++ b/docs/session-pipeline-setup.md @@ -54,3 +54,54 @@ During any Claude Code session working with Zuhn: - **Explicit intent only** — the hook only reads `/tmp/`, never scrapes other directories - **No auto-extraction** — `extract-session` must be run manually - **`.claude/` files are local** — never commit personal `.claude/` config to the public repo + +--- + +## Automatic capture (SessionEnd hook) + +Phase 8 above is *explicit-intent* capture (Claude manually queues insights). The +**session-capture hook** adds *automatic* capture — it records the whole session +at exit, so nothing has to be queued by hand. This is safe now because the +**quality gate exists**: the gate (not manual intent) is the filter that keeps +junk out, so capture can be automatic while quality stays enforced. + +### How it works + +1. `capture-session.ts` runs on `SessionEnd`. It reads the session transcript, + strips noise (tool output, `thinking`, subagent sidechains, framework + injections) down to user prompts + assistant prose, and writes one `session` + source to `sources/session/`. **No LLM, no extraction** — just capture. +2. `npm run autoknowledge` (manual or scheduled) extracts stanced insights from + those sources — point it at a stricter gate so the noisier session input + clears a higher bar: + `ZUHN_GATE_BLOCKING_CHECKS=stance_present,stance_directional npm run autoknowledge` +3. The gate admits only sharp, novel session-insights; dedup prevents re-capture. + +### Setup + +```bash +cp templates/hooks/session-capture.sh .claude/hooks/session-capture.sh +chmod +x .claude/hooks/session-capture.sh +``` + +Register a `SessionEnd` hook in `.claude/settings.json`: + +```json +"SessionEnd": [ + { + "matcher": "", + "hooks": [ + { "type": "command", "command": "bash \"$CLAUDE_PROJECT_DIR/.claude/hooks/session-capture.sh\"" } + ] + } +] +``` + +### Design note — supersedes Phase 8's constraint *for capture only* + +- **Capture is automated; extraction is not.** The hook only *captures* (writes a + source); insight extraction stays the deliberate, gated `autoknowledge` step — + so Phase 8's "no auto-extraction" still holds. +- **The gate replaces manual intent as the quality filter.** Phase 8 required + explicit intent because there was no automatic quality control; the gate now + provides it, so passive transcript capture is safe. diff --git a/package.json b/package.json index 9c7e4fd345..4f7948cb97 100644 --- a/package.json +++ b/package.json @@ -24,6 +24,7 @@ "quality-score": "tsx scripts/quality-score.ts", "gate": "tsx scripts/insight-gate.ts", "sync-claude-mem": "tsx scripts/sync-claude-mem.ts", + "capture-session": "tsx scripts/capture-session.ts", "ask": "tsx scripts/ask.ts", "autoknowledge": "tsx scripts/autoknowledge.ts", "wake": "tsx scripts/wake.ts", diff --git a/scripts/capture-session.ts b/scripts/capture-session.ts new file mode 100644 index 0000000000..bd1b82c028 --- /dev/null +++ b/scripts/capture-session.ts @@ -0,0 +1,102 @@ +#!/usr/bin/env npx tsx +/** + * capture-session.ts — Claude Code SessionEnd hook → Zuhn "session" source. + * + * Reads the hook's stdin JSON ({ session_id, transcript_path, ... }), parses the + * transcript into a clean conversation, and writes one Zuhn session source. It + * does NOT extract insights — that stays the deliberate, gated autoknowledge + * step (so Phase 8's "no auto-extraction" holds; only capture is automated). + * + * Designed to be a safe hook: never throws, never blocks the session, exits 0. + * + * Hook usage (settings.json SessionEnd): piped the hook JSON on stdin. + * Manual/testing: + * npx tsx scripts/capture-session.ts --transcript --session --dry-run + */ + +import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs"; +import { join } from "node:path"; +import matter from "gray-matter"; + +import { buildTranscriptSource } from "./lib/transcript"; +import { KB_ROOT } from "./lib/kb-root"; + +const SESSION_DIR = join(KB_ROOT, "sources", "session"); + +interface HookInput { + session_id?: string; + transcript_path?: string; +} + +function readStdin(): string { + try { + return readFileSync(0, "utf-8"); + } catch { + return ""; + } +} + +function resolveInput(argv: string[]): { transcriptPath?: string; sessionId?: string } { + const tIdx = argv.indexOf("--transcript"); + if (tIdx !== -1) { + const sIdx = argv.indexOf("--session"); + return { transcriptPath: argv[tIdx + 1], sessionId: sIdx !== -1 ? argv[sIdx + 1] : "manual" }; + } + // Hook mode: JSON on stdin. Guard against a TTY (no pipe) so manual runs + // without --transcript don't block waiting on fd 0. + if (process.stdin.isTTY) return {}; + const raw = readStdin().trim(); + if (!raw) return {}; + try { + const input = JSON.parse(raw) as HookInput; + return { transcriptPath: input.transcript_path, sessionId: input.session_id }; + } catch { + return {}; + } +} + +function main(): void { + const dryRun = process.argv.includes("--dry-run"); + const { transcriptPath, sessionId } = resolveInput(process.argv); + + // Graceful no-op on missing/odd input — a hook must never break the session. + if (!transcriptPath || !existsSync(transcriptPath)) { + if (dryRun) console.log(`No transcript (${transcriptPath ?? "none"}) — nothing to capture.`); + return; + } + + let src; + try { + const jsonl = readFileSync(transcriptPath, "utf-8"); + src = buildTranscriptSource(jsonl, sessionId ?? transcriptPath); + } catch (err) { + if (dryRun) console.log(`Could not read/parse transcript: ${(err as Error).message}`); + return; // never throw out of a hook + } + + if (!src) { + if (dryRun) console.log("Transcript had no usable conversation — skipping."); + return; + } + + const path = join(SESSION_DIR, `${src.slug}.md`); + if (existsSync(path)) { + if (dryRun) console.log(`Already captured: ${src.id} → ${path}`); + return; // idempotent + } + + if (dryRun) { + console.log(`WOULD write ${src.id} "${src.title}" (${src.data.word_count} words) → ${path}`); + return; + } + + try { + mkdirSync(SESSION_DIR, { recursive: true }); + writeFileSync(path, matter.stringify(src.body, src.data), "utf-8"); + console.log(`Zuhn: captured session ${src.id} — run 'npm run autoknowledge' to extract + gate.`); + } catch { + // Swallow — a capture failure must not surface as a session error. + } +} + +main(); diff --git a/scripts/lib/transcript.test.ts b/scripts/lib/transcript.test.ts new file mode 100644 index 0000000000..02daeba589 --- /dev/null +++ b/scripts/lib/transcript.test.ts @@ -0,0 +1,107 @@ +import { describe, expect, it } from "vitest"; + +import { + buildTranscriptSource, + parseTranscript, + renderConversation, + DEFAULT_BODY_CAP, + type Turn, +} from "./transcript"; + +// Build a transcript JSONL string from line objects. +function jsonl(objs: unknown[]): string { + return objs.map((o) => JSON.stringify(o)).join("\n"); +} + +const FIXTURE = jsonl([ + { type: "user", isSidechain: false, message: { role: "user", content: "First real prompt" } }, + { + type: "assistant", + message: { + role: "assistant", + content: [ + { type: "thinking", thinking: "internal reasoning" }, + { type: "text", text: "Assistant reply" }, + { type: "tool_use", name: "Bash", input: {} }, + ], + }, + }, + // tool_result (user, array content, no text block) → dropped + { type: "user", message: { role: "user", content: [{ type: "tool_result", content: "huge output" }] } }, + // subagent → dropped + { type: "user", isSidechain: true, message: { role: "user", content: "sidechain prompt" } }, + // non-conversation line → dropped + { type: "attachment", foo: 1 }, + // framework injection → dropped + { type: "user", message: { role: "user", content: "noise" } }, + "not valid json", +]); + +describe("parseTranscript", () => { + it("keeps only real user prompts + assistant text, dropping all noise", () => { + const turns = parseTranscript(FIXTURE); + expect(turns).toEqual([ + { role: "user", text: "First real prompt" }, + { role: "assistant", text: "Assistant reply" }, + ]); + }); + + it("extracts text blocks from array content but skips tool/thinking blocks", () => { + const t = parseTranscript( + jsonl([ + { type: "user", message: { role: "user", content: [{ type: "text", text: "typed with attachment" }] } }, + ]) + ); + expect(t).toEqual([{ role: "user", text: "typed with attachment" }]); + }); + + it("tolerates malformed lines and empty input", () => { + expect(parseTranscript("not json\n\n{bad")).toEqual([]); + expect(parseTranscript("")).toEqual([]); + }); +}); + +describe("renderConversation", () => { + it("labels turns and caps to the tail with a truncation note", () => { + const long: Turn[] = [ + { role: "user", text: "X".repeat(100) }, + { role: "assistant", text: "Y".repeat(100) }, + ]; + const out = renderConversation(long, 50); + expect(out.length).toBeLessThanOrEqual(50 + "_[earlier turns truncated]_\n\n".length); + expect(out).toContain("earlier turns truncated"); + expect(out.endsWith("Y")).toBe(true); // tail kept + }); + + it("formats user/assistant labels when under the cap", () => { + const out = renderConversation([{ role: "user", text: "hi" }, { role: "assistant", text: "yo" }]); + expect(out).toBe("**User:** hi\n\n**Assistant:** yo"); + expect(DEFAULT_BODY_CAP).toBeGreaterThan(0); + }); +}); + +describe("buildTranscriptSource", () => { + it("builds a 'session' source titled by the first user prompt, salted by session id", () => { + const src = buildTranscriptSource(FIXTURE, "sess-xyz12345", "2026-05-26"); + expect(src).not.toBeNull(); + expect(src!.title).toBe("First real prompt"); + expect(src!.id).toMatch(/^SRC-\d{6}-[A-F0-9]{4}$/); + expect(src!.slug).toContain("sessxyz1"); // first 8 alphanumerics of session id + expect(src!.data.type).toBe("session"); + expect(src!.data.date_ingested).toBe("2026-05-26"); + expect(src!.body).toContain("**User:** First real prompt"); + expect(src!.body).toContain("**Assistant:** Assistant reply"); + }); + + it("returns null when there's no usable conversation", () => { + expect(buildTranscriptSource("", "s")).toBeNull(); + expect(buildTranscriptSource(jsonl([{ type: "attachment" }, { type: "system" }]), "s")).toBeNull(); + }); + + it("is deterministic for a given session id", () => { + const a = buildTranscriptSource(FIXTURE, "sess-1", "2026-05-26"); + const b = buildTranscriptSource(FIXTURE, "sess-1", "2026-05-26"); + expect(a!.id).toBe(b!.id); + expect(a!.slug).toBe(b!.slug); + }); +}); diff --git a/scripts/lib/transcript.ts b/scripts/lib/transcript.ts new file mode 100644 index 0000000000..9e45675a88 --- /dev/null +++ b/scripts/lib/transcript.ts @@ -0,0 +1,138 @@ +/** + * transcript.ts (lib) — Turn a Claude Code session transcript (.jsonl) into a + * clean Zuhn "session" source. MECHANICAL only — no LLM. The heavy lifting + * (authoring stances) stays with autoknowledge's claude-CLI extraction, and the + * gate filters quality, so this is just aggressive noise-stripping + framing. + * + * Transcript reality (verified against a real 38MB/11.8k-line transcript): + * - Each line is a JSON object with a top-level `type`. + * - Signal: type "user" with STRING message.content (real prompts) and type + * "assistant" with `text` content-blocks (response prose). + * - Noise to drop: attachment/system/custom-title/etc. lines, tool_use and + * tool_result blocks, assistant `thinking` blocks, subagent isSidechain + * entries, and framework injections (, , ...). + */ + +import { generateSourceId } from "./generate-id"; +import { slugify } from "./ingest/slug"; + +export interface Turn { + role: "user" | "assistant"; + text: string; +} + +export interface TranscriptSource { + id: string; + slug: string; + title: string; + body: string; + data: Record; +} + +/** Default body cap (~chars). Long sessions are truncated to keep extraction + * tractable; the keep-tail keeps the most recent reasoning. */ +export const DEFAULT_BODY_CAP = 60_000; + +// Framework-injected user turns (not the human's words) — system reminders, +// slash-command scaffolding, local command output, hook stdout. These are +// noise for a "what did I reason about" digest. +const FRAMEWORK_NOISE = /^<(system-reminder|command-name|command-message|command-args|local-command-stdout|local-command-stderr)>/; + +function textFromContent(content: unknown): string { + if (typeof content === "string") return content; + if (Array.isArray(content)) { + return content + .filter( + (b): b is { type: string; text: string } => + !!b && typeof b === "object" && (b as { type?: unknown }).type === "text" && + typeof (b as { text?: unknown }).text === "string" + ) + .map((b) => b.text) + .join("\n"); + } + return ""; +} + +/** Parse a transcript JSONL string into clean user/assistant turns. */ +export function parseTranscript(jsonl: string): Turn[] { + const turns: Turn[] = []; + for (const line of jsonl.split("\n")) { + if (!line.trim()) continue; + + let obj: Record; + try { + obj = JSON.parse(line) as Record; + } catch { + continue; // tolerate a partially-flushed final line + } + + if (obj.isSidechain === true) continue; // subagent transcript — not the main thread + const type = obj.type; + if (type !== "user" && type !== "assistant") continue; + + const message = obj.message as { content?: unknown } | undefined; + const text = textFromContent(message?.content).trim(); + if (!text) continue; // tool_result / tool_use / thinking-only → nothing to keep + + if (type === "user" && FRAMEWORK_NOISE.test(text)) continue; // injected, not human + + turns.push({ role: type, text }); + } + return turns; +} + +function firstSentence(text: string, max = 70): string { + const line = text.split("\n").map((l) => l.trim()).find(Boolean) ?? ""; + return line.length > max ? line.slice(0, max - 1) + "…" : line; +} + +/** + * Render turns into a readable markdown conversation, capped to `cap` chars by + * keeping the TAIL (most-recent reasoning) and noting the truncation. + */ +export function renderConversation(turns: Turn[], cap = DEFAULT_BODY_CAP): string { + const blocks = turns.map((t) => `**${t.role === "user" ? "User" : "Assistant"}:** ${t.text}`); + let body = blocks.join("\n\n"); + if (body.length > cap) { + body = "_[earlier turns truncated]_\n\n" + body.slice(body.length - cap); + } + return body; +} + +/** + * Build a Zuhn "session" source from a transcript. Title = first real user + * prompt; slug/id salted by sessionId (date-independent → idempotent re-capture). + * Returns null when the transcript has no usable conversation. + */ +export function buildTranscriptSource( + jsonl: string, + sessionId: string, + date = new Date().toISOString().slice(0, 10), + cap = DEFAULT_BODY_CAP +): TranscriptSource | null { + const turns = parseTranscript(jsonl); + if (turns.length === 0) return null; + + const firstUser = turns.find((t) => t.role === "user"); + const title = firstUser ? firstSentence(firstUser.text) : "Claude Code session"; + const body = renderConversation(turns, cap); + + const tag = sessionId.replace(/[^a-zA-Z0-9]/g, "").slice(0, 8) || "session"; + const slug = `${slugify(title)}-${tag}`; + const id = generateSourceId(slug, sessionId); + + return { + id, + slug, + title, + body, + data: { + id, + type: "session", + title, + date_ingested: date, + insight_count: 0, + word_count: body.split(/\s+/).filter(Boolean).length, + }, + }; +} diff --git a/templates/hooks/session-capture.sh b/templates/hooks/session-capture.sh new file mode 100644 index 0000000000..b739bc6cca --- /dev/null +++ b/templates/hooks/session-capture.sh @@ -0,0 +1,24 @@ +#!/bin/bash +# SessionEnd hook — capture the Claude Code session transcript as a Zuhn +# "session" source. MECHANICAL only (no LLM): the transcript is parsed into +# a clean conversation; insight extraction stays the separate, gated step +# (`npm run autoknowledge`), so quality is enforced by the gate, not by hand. +# +# Safe + non-blocking by design: always exits 0. Errors are LOGGED (not +# silenced) to ~/.claude/session-capture.log — a hook you can't watch fire +# is one where silent failure means the feature looks fine and never works. +# +# The Claude Code hook JSON ({ session_id, transcript_path, ... }) arrives on +# stdin and is forwarded to the script via inherited fd 0. +# +# Install: copy to .claude/hooks/, `chmod +x`, and register a SessionEnd hook +# in .claude/settings.json (see docs/session-pipeline-setup.md). +LOG="$HOME/.claude/session-capture.log" +exec >> "$LOG" 2>&1 +echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] SessionEnd fire" +ROOT="${CLAUDE_PROJECT_DIR:-$(cd "$(dirname "$0")/../.." && pwd)}" +echo " root: $ROOT" +cd "$ROOT" || { echo " cd failed"; exit 0; } +npx tsx scripts/capture-session.ts +echo " exit: $?" +exit 0 From 79b2e9d094da3f98a261ad69e7fb3bf83f6c3a2e Mon Sep 17 00:00:00 2001 From: Jin Choi Date: Wed, 27 May 2026 22:16:55 -0700 Subject: [PATCH 2/2] chore: make templates/hooks/session-capture.sh executable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI's "Verify hook templates are executable" check caught that the Write tool created the template at 0644. The local copy installed in .claude/hooks/ was already chmod +x'd, but the template itself wasn't — so a fresh clone would copy a non-executable template into .claude/hooks/. Co-Authored-By: Claude Opus 4.7 (1M context) --- templates/hooks/session-capture.sh | 0 1 file changed, 0 insertions(+), 0 deletions(-) mode change 100644 => 100755 templates/hooks/session-capture.sh diff --git a/templates/hooks/session-capture.sh b/templates/hooks/session-capture.sh old mode 100644 new mode 100755