From 9c5307b98bf7a52c0fd63ef9fe9478c57139dd6f Mon Sep 17 00:00:00 2001
From: Jin Choi <jinchoi@u.northwestern.edu>
Date: Wed, 27 May 2026 22:15:01 -0700
Subject: [PATCH 1/2] =?UTF-8?q?feat(sensor):=20Zuhn-native=20session-captu?=
 =?UTF-8?q?re=20hook=20(Option=20A=20=E2=80=94=20credential-free)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A Claude Code SessionEnd hook that captures the session transcript as a
Zuhn "session" source. MECHANICAL only — no LLM in the hook. Insight
extraction stays the deliberate, gated `npm run autoknowledge` step, so
Phase 8's "no auto-extraction" still holds; only capture is automated.

Why this and not claude-mem: claude-mem digests via a paid LLM API (no key
in this env, install is interactive). Zuhn's autoknowledge already uses the
authenticated `claude` CLI — so this path has zero new credentials, zero
extra API cost, no background daemon.

Components:
- scripts/lib/transcript.ts — parseTranscript (JSONL → user-prompt +
  assistant-text turns; drops isSidechain, tool_use/tool_result, assistant
  thinking, non-conversation types, and framework-injection user turns
  matching <system-reminder>/<command-*>); renderConversation (caps to
  60K chars keeping the TAIL with a truncation note); buildTranscriptSource
  (title=first user prompt; slug/id salted by session_id → idempotent;
  null when there's no usable conversation).
- scripts/capture-session.ts — the hook script. Reads hook JSON on stdin
  (guards process.stdin.isTTY to avoid blocking on manual runs), graceful
  no-op on missing transcript, idempotent file-exists skip, --dry-run, and
  NEVER throws out of the hook (try/catch around every I/O; always exit 0).
- templates/hooks/session-capture.sh — bash wrapper. Errors are LOGGED to
  ~/.claude/session-capture.log instead of silenced (a hook you can't watch
  fire is one where silent failure looks like "it works").
- docs/session-pipeline-setup.md — records the design supersession: the
  Phase-8 "explicit intent only" constraint existed because there was no
  automatic quality control; the gate now provides it, so passive transcript
  capture is safe.

Frontmatter: "session" was already added to SourceFrontmatter.type / the
health glob / the autoknowledge discovery glob in the claude-mem PR (#4),
so session sources flow through the existing extract→gate path unchanged.

Tests: 8 (noise-stripping edge cases, cap/tail, idempotency, null on empty).
Verified end-to-end on the real 38MB transcript (→ 8831-word digest), on
bogus input (graceful no-op), and via the logged hook chain with a
synthetic transcript + temp KB.

Local codex review stalled twice (the harness backgrounded `codex exec`,
which hangs on stdin/tty in that mode). Relying on the GitHub codex App
for PR review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/session-pipeline-setup.md     |  51 +++++++++++
 package.json                       |   1 +
 scripts/capture-session.ts         | 102 +++++++++++++++++++++
 scripts/lib/transcript.test.ts     | 107 ++++++++++++++++++++++
 scripts/lib/transcript.ts          | 138 +++++++++++++++++++++++++++++
 templates/hooks/session-capture.sh |  24 +++++
 6 files changed, 423 insertions(+)
 create mode 100644 scripts/capture-session.ts
 create mode 100644 scripts/lib/transcript.test.ts
 create mode 100644 scripts/lib/transcript.ts
 create mode 100644 templates/hooks/session-capture.sh
diff --git a/docs/session-pipeline-setup.md b/docs/session-pipeline-setup.md
index a577ff2b18..62c2a06ae8 100644
--- a/docs/session-pipeline-setup.md
+++ b/docs/session-pipeline-setup.md
@@ -54,3 +54,54 @@ During any Claude Code session working with Zuhn:
 - **Explicit intent only** — the hook only reads `/tmp/`, never scrapes other directories
 - **No auto-extraction** — `extract-session` must be run manually
 - **`.claude/` files are local** — never commit personal `.claude/` config to the public repo
+
+---
+
+## Automatic capture (SessionEnd hook)
+
+Phase 8 above is *explicit-intent* capture (Claude manually queues insights). The
+**session-capture hook** adds *automatic* capture — it records the whole session
+at exit, so nothing has to be queued by hand. This is safe now because the
+**quality gate exists**: the gate (not manual intent) is the filter that keeps
+junk out, so capture can be automatic while quality stays enforced.
+
+### How it works
+
+1. `capture-session.ts` runs on `SessionEnd`. It reads the session transcript,
+   strips noise (tool output, `thinking`, subagent sidechains, framework
+   injections) down to user prompts + assistant prose, and writes one `session`
+   source to `sources/session/`. **No LLM, no extraction** — just capture.
+2. `npm run autoknowledge` (manual or scheduled) extracts stanced insights from
+   those sources — point it at a stricter gate so the noisier session input
+   clears a higher bar:
+   `ZUHN_GATE_BLOCKING_CHECKS=stance_present,stance_directional npm run autoknowledge`
+3. The gate admits only sharp, novel session-insights; dedup prevents re-capture.
+
+### Setup
+
+```bash
+cp templates/hooks/session-capture.sh .claude/hooks/session-capture.sh
+chmod +x .claude/hooks/session-capture.sh
+```
+
+Register a `SessionEnd` hook in `.claude/settings.json`:
+
+```json
+"SessionEnd": [
+  {
+    "matcher": "",
+    "hooks": [
+      { "type": "command", "command": "bash \"$CLAUDE_PROJECT_DIR/.claude/hooks/session-capture.sh\"" }
+    ]
+  }
+]
+```
+
+### Design note — supersedes Phase 8's constraint *for capture only*
+
+- **Capture is automated; extraction is not.** The hook only *captures* (writes a
+  source); insight extraction stays the deliberate, gated `autoknowledge` step —
+  so Phase 8's "no auto-extraction" still holds.
+- **The gate replaces manual intent as the quality filter.** Phase 8 required
+  explicit intent because there was no automatic quality control; the gate now
+  provides it, so passive transcript capture is safe.
diff --git a/package.json b/package.json
index 9c7e4fd345..4f7948cb97 100644
--- a/package.json
+++ b/package.json
@@ -24,6 +24,7 @@
     "quality-score": "tsx scripts/quality-score.ts",
     "gate": "tsx scripts/insight-gate.ts",
     "sync-claude-mem": "tsx scripts/sync-claude-mem.ts",
+    "capture-session": "tsx scripts/capture-session.ts",
     "ask": "tsx scripts/ask.ts",
     "autoknowledge": "tsx scripts/autoknowledge.ts",
     "wake": "tsx scripts/wake.ts",
diff --git a/scripts/capture-session.ts b/scripts/capture-session.ts
new file mode 100644
index 0000000000..bd1b82c028
--- /dev/null
+++ b/scripts/capture-session.ts
@@ -0,0 +1,102 @@
+#!/usr/bin/env npx tsx
+/**
+ * capture-session.ts — Claude Code SessionEnd hook → Zuhn "session" source.
+ *
+ * Reads the hook's stdin JSON ({ session_id, transcript_path, ... }), parses the
+ * transcript into a clean conversation, and writes one Zuhn session source. It
+ * does NOT extract insights — that stays the deliberate, gated autoknowledge
+ * step (so Phase 8's "no auto-extraction" holds; only capture is automated).
+ *
+ * Designed to be a safe hook: never throws, never blocks the session, exits 0.
+ *
+ * Hook usage (settings.json SessionEnd): piped the hook JSON on stdin.
+ * Manual/testing:
+ *   npx tsx scripts/capture-session.ts --transcript <path.jsonl> --session <id> --dry-run
+ */
+
+import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
+import { join } from "node:path";
+import matter from "gray-matter";
+
+import { buildTranscriptSource } from "./lib/transcript";
+import { KB_ROOT } from "./lib/kb-root";
+
+const SESSION_DIR = join(KB_ROOT, "sources", "session");
+
+interface HookInput {
+  session_id?: string;
+  transcript_path?: string;
+}
+
+function readStdin(): string {
+  try {
+    return readFileSync(0, "utf-8");
+  } catch {
+    return "";
+  }
+}
+
+function resolveInput(argv: string[]): { transcriptPath?: string; sessionId?: string } {
+  const tIdx = argv.indexOf("--transcript");
+  if (tIdx !== -1) {
+    const sIdx = argv.indexOf("--session");
+    return { transcriptPath: argv[tIdx + 1], sessionId: sIdx !== -1 ? argv[sIdx + 1] : "manual" };
+  }
+  // Hook mode: JSON on stdin. Guard against a TTY (no pipe) so manual runs
+  // without --transcript don't block waiting on fd 0.
+  if (process.stdin.isTTY) return {};
+  const raw = readStdin().trim();
+  if (!raw) return {};
+  try {
+    const input = JSON.parse(raw) as HookInput;
+    return { transcriptPath: input.transcript_path, sessionId: input.session_id };
+  } catch {
+    return {};
+  }
+}
+
+function main(): void {
+  const dryRun = process.argv.includes("--dry-run");
+  const { transcriptPath, sessionId } = resolveInput(process.argv);
+
+  // Graceful no-op on missing/odd input — a hook must never break the session.
+  if (!transcriptPath || !existsSync(transcriptPath)) {
+    if (dryRun) console.log(`No transcript (${transcriptPath ?? "none"}) — nothing to capture.`);
+    return;
+  }
+
+  let src;
+  try {
+    const jsonl = readFileSync(transcriptPath, "utf-8");
+    src = buildTranscriptSource(jsonl, sessionId ?? transcriptPath);
+  } catch (err) {
+    if (dryRun) console.log(`Could not read/parse transcript: ${(err as Error).message}`);
+    return; // never throw out of a hook
+  }
+
+  if (!src) {
+    if (dryRun) console.log("Transcript had no usable conversation — skipping.");
+    return;
+  }
+
+  const path = join(SESSION_DIR, `${src.slug}.md`);
+  if (existsSync(path)) {
+    if (dryRun) console.log(`Already captured: ${src.id} → ${path}`);
+    return; // idempotent
+  }
+
+  if (dryRun) {
+    console.log(`WOULD write ${src.id}  "${src.title}"  (${src.data.word_count} words) → ${path}`);
+    return;
+  }
+
+  try {
+    mkdirSync(SESSION_DIR, { recursive: true });
+    writeFileSync(path, matter.stringify(src.body, src.data), "utf-8");
+    console.log(`Zuhn: captured session ${src.id} — run 'npm run autoknowledge' to extract + gate.`);
+  } catch {
+    // Swallow — a capture failure must not surface as a session error.
+  }
+}
+
+main();
diff --git a/scripts/lib/transcript.test.ts b/scripts/lib/transcript.test.ts
new file mode 100644
index 0000000000..02daeba589
--- /dev/null
+++ b/scripts/lib/transcript.test.ts
@@ -0,0 +1,107 @@
+import { describe, expect, it } from "vitest";
+
+import {
+  buildTranscriptSource,
+  parseTranscript,
+  renderConversation,
+  DEFAULT_BODY_CAP,
+  type Turn,
+} from "./transcript";
+
+// Build a transcript JSONL string from line objects.
+function jsonl(objs: unknown[]): string {
+  return objs.map((o) => JSON.stringify(o)).join("\n");
+}
+
+const FIXTURE = jsonl([
+  { type: "user", isSidechain: false, message: { role: "user", content: "First real prompt" } },
+  {
+    type: "assistant",
+    message: {
+      role: "assistant",
+      content: [
+        { type: "thinking", thinking: "internal reasoning" },
+        { type: "text", text: "Assistant reply" },
+        { type: "tool_use", name: "Bash", input: {} },
+      ],
+    },
+  },
+  // tool_result (user, array content, no text block) → dropped
+  { type: "user", message: { role: "user", content: [{ type: "tool_result", content: "huge output" }] } },
+  // subagent → dropped
+  { type: "user", isSidechain: true, message: { role: "user", content: "sidechain prompt" } },
+  // non-conversation line → dropped
+  { type: "attachment", foo: 1 },
+  // framework injection → dropped
+  { type: "user", message: { role: "user", content: "<system-reminder>noise</system-reminder>" } },
+  "not valid json",
+]);
+
+describe("parseTranscript", () => {
+  it("keeps only real user prompts + assistant text, dropping all noise", () => {
+    const turns = parseTranscript(FIXTURE);
+    expect(turns).toEqual<Turn[]>([
+      { role: "user", text: "First real prompt" },
+      { role: "assistant", text: "Assistant reply" },
+    ]);
+  });
+
+  it("extracts text blocks from array content but skips tool/thinking blocks", () => {
+    const t = parseTranscript(
+      jsonl([
+        { type: "user", message: { role: "user", content: [{ type: "text", text: "typed with attachment" }] } },
+      ])
+    );
+    expect(t).toEqual([{ role: "user", text: "typed with attachment" }]);
+  });
+
+  it("tolerates malformed lines and empty input", () => {
+    expect(parseTranscript("not json\n\n{bad")).toEqual([]);
+    expect(parseTranscript("")).toEqual([]);
+  });
+});
+
+describe("renderConversation", () => {
+  it("labels turns and caps to the tail with a truncation note", () => {
+    const long: Turn[] = [
+      { role: "user", text: "X".repeat(100) },
+      { role: "assistant", text: "Y".repeat(100) },
+    ];
+    const out = renderConversation(long, 50);
+    expect(out.length).toBeLessThanOrEqual(50 + "_[earlier turns truncated]_\n\n".length);
+    expect(out).toContain("earlier turns truncated");
+    expect(out.endsWith("Y")).toBe(true); // tail kept
+  });
+
+  it("formats user/assistant labels when under the cap", () => {
+    const out = renderConversation([{ role: "user", text: "hi" }, { role: "assistant", text: "yo" }]);
+    expect(out).toBe("**User:** hi\n\n**Assistant:** yo");
+    expect(DEFAULT_BODY_CAP).toBeGreaterThan(0);
+  });
+});
+
+describe("buildTranscriptSource", () => {
+  it("builds a 'session' source titled by the first user prompt, salted by session id", () => {
+    const src = buildTranscriptSource(FIXTURE, "sess-xyz12345", "2026-05-26");
+    expect(src).not.toBeNull();
+    expect(src!.title).toBe("First real prompt");
+    expect(src!.id).toMatch(/^SRC-\d{6}-[A-F0-9]{4}$/);
+    expect(src!.slug).toContain("sessxyz1"); // first 8 alphanumerics of session id
+    expect(src!.data.type).toBe("session");
+    expect(src!.data.date_ingested).toBe("2026-05-26");
+    expect(src!.body).toContain("**User:** First real prompt");
+    expect(src!.body).toContain("**Assistant:** Assistant reply");
+  });
+
+  it("returns null when there's no usable conversation", () => {
+    expect(buildTranscriptSource("", "s")).toBeNull();
+    expect(buildTranscriptSource(jsonl([{ type: "attachment" }, { type: "system" }]), "s")).toBeNull();
+  });
+
+  it("is deterministic for a given session id", () => {
+    const a = buildTranscriptSource(FIXTURE, "sess-1", "2026-05-26");
+    const b = buildTranscriptSource(FIXTURE, "sess-1", "2026-05-26");
+    expect(a!.id).toBe(b!.id);
+    expect(a!.slug).toBe(b!.slug);
+  });
+});
diff --git a/scripts/lib/transcript.ts b/scripts/lib/transcript.ts
new file mode 100644
index 0000000000..9e45675a88
--- /dev/null
+++ b/scripts/lib/transcript.ts
@@ -0,0 +1,138 @@
+/**
+ * transcript.ts (lib) — Turn a Claude Code session transcript (.jsonl) into a
+ * clean Zuhn "session" source. MECHANICAL only — no LLM. The heavy lifting
+ * (authoring stances) stays with autoknowledge's claude-CLI extraction, and the
+ * gate filters quality, so this is just aggressive noise-stripping + framing.
+ *
+ * Transcript reality (verified against a real 38MB/11.8k-line transcript):
+ *   - Each line is a JSON object with a top-level `type`.
+ *   - Signal: type "user" with STRING message.content (real prompts) and type
+ *     "assistant" with `text` content-blocks (response prose).
+ *   - Noise to drop: attachment/system/custom-title/etc. lines, tool_use and
+ *     tool_result blocks, assistant `thinking` blocks, subagent isSidechain
+ *     entries, and framework injections (<system-reminder>, <command-*>, ...).
+ */
+
+import { generateSourceId } from "./generate-id";
+import { slugify } from "./ingest/slug";
+
+export interface Turn {
+  role: "user" | "assistant";
+  text: string;
+}
+
+export interface TranscriptSource {
+  id: string;
+  slug: string;
+  title: string;
+  body: string;
+  data: Record<string, string | number>;
+}
+
+/** Default body cap (~chars). Long sessions are truncated to keep extraction
+ *  tractable; the keep-tail keeps the most recent reasoning. */
+export const DEFAULT_BODY_CAP = 60_000;
+
+// Framework-injected user turns (not the human's words) — system reminders,
+// slash-command scaffolding, local command output, hook stdout. These are
+// noise for a "what did I reason about" digest.
+const FRAMEWORK_NOISE = /^<(system-reminder|command-name|command-message|command-args|local-command-stdout|local-command-stderr)>/;
+
+function textFromContent(content: unknown): string {
+  if (typeof content === "string") return content;
+  if (Array.isArray(content)) {
+    return content
+      .filter(
+        (b): b is { type: string; text: string } =>
+          !!b && typeof b === "object" && (b as { type?: unknown }).type === "text" &&
+          typeof (b as { text?: unknown }).text === "string"
+      )
+      .map((b) => b.text)
+      .join("\n");
+  }
+  return "";
+}
+
+/** Parse a transcript JSONL string into clean user/assistant turns. */
+export function parseTranscript(jsonl: string): Turn[] {
+  const turns: Turn[] = [];
+  for (const line of jsonl.split("\n")) {
+    if (!line.trim()) continue;
+
+    let obj: Record<string, unknown>;
+    try {
+      obj = JSON.parse(line) as Record<string, unknown>;
+    } catch {
+      continue; // tolerate a partially-flushed final line
+    }
+
+    if (obj.isSidechain === true) continue; // subagent transcript — not the main thread
+    const type = obj.type;
+    if (type !== "user" && type !== "assistant") continue;
+
+    const message = obj.message as { content?: unknown } | undefined;
+    const text = textFromContent(message?.content).trim();
+    if (!text) continue; // tool_result / tool_use / thinking-only → nothing to keep
+
+    if (type === "user" && FRAMEWORK_NOISE.test(text)) continue; // injected, not human
+
+    turns.push({ role: type, text });
+  }
+  return turns;
+}
+
+function firstSentence(text: string, max = 70): string {
+  const line = text.split("\n").map((l) => l.trim()).find(Boolean) ?? "";
+  return line.length > max ? line.slice(0, max - 1) + "…" : line;
+}
+
+/**
+ * Render turns into a readable markdown conversation, capped to `cap` chars by
+ * keeping the TAIL (most-recent reasoning) and noting the truncation.
+ */
+export function renderConversation(turns: Turn[], cap = DEFAULT_BODY_CAP): string {
+  const blocks = turns.map((t) => `**${t.role === "user" ? "User" : "Assistant"}:** ${t.text}`);
+  let body = blocks.join("\n\n");
+  if (body.length > cap) {
+    body = "_[earlier turns truncated]_\n\n" + body.slice(body.length - cap);
+  }
+  return body;
+}
+
+/**
+ * Build a Zuhn "session" source from a transcript. Title = first real user
+ * prompt; slug/id salted by sessionId (date-independent → idempotent re-capture).
+ * Returns null when the transcript has no usable conversation.
+ */
+export function buildTranscriptSource(
+  jsonl: string,
+  sessionId: string,
+  date = new Date().toISOString().slice(0, 10),
+  cap = DEFAULT_BODY_CAP
+): TranscriptSource | null {
+  const turns = parseTranscript(jsonl);
+  if (turns.length === 0) return null;
+
+  const firstUser = turns.find((t) => t.role === "user");
+  const title = firstUser ? firstSentence(firstUser.text) : "Claude Code session";
+  const body = renderConversation(turns, cap);
+
+  const tag = sessionId.replace(/[^a-zA-Z0-9]/g, "").slice(0, 8) || "session";
+  const slug = `${slugify(title)}-${tag}`;
+  const id = generateSourceId(slug, sessionId);
+
+  return {
+    id,
+    slug,
+    title,
+    body,
+    data: {
+      id,
+      type: "session",
+      title,
+      date_ingested: date,
+      insight_count: 0,
+      word_count: body.split(/\s+/).filter(Boolean).length,
+    },
+  };
+}
diff --git a/templates/hooks/session-capture.sh b/templates/hooks/session-capture.sh
new file mode 100644
index 0000000000..b739bc6cca
--- /dev/null
+++ b/templates/hooks/session-capture.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+# SessionEnd hook — capture the Claude Code session transcript as a Zuhn
+# "session" source. MECHANICAL only (no LLM): the transcript is parsed into
+# a clean conversation; insight extraction stays the separate, gated step
+# (`npm run autoknowledge`), so quality is enforced by the gate, not by hand.
+#
+# Safe + non-blocking by design: always exits 0. Errors are LOGGED (not
+# silenced) to ~/.claude/session-capture.log — a hook you can't watch fire
+# is one where silent failure means the feature looks fine and never works.
+#
+# The Claude Code hook JSON ({ session_id, transcript_path, ... }) arrives on
+# stdin and is forwarded to the script via inherited fd 0.
+#
+# Install: copy to .claude/hooks/, `chmod +x`, and register a SessionEnd hook
+# in .claude/settings.json (see docs/session-pipeline-setup.md).
+LOG="$HOME/.claude/session-capture.log"
+exec >> "$LOG" 2>&1
+echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] SessionEnd fire"
+ROOT="${CLAUDE_PROJECT_DIR:-$(cd "$(dirname "$0")/../.." && pwd)}"
+echo "  root: $ROOT"
+cd "$ROOT" || { echo "  cd failed"; exit 0; }
+npx tsx scripts/capture-session.ts
+echo "  exit: $?"
+exit 0

From 79b2e9d094da3f98a261ad69e7fb3bf83f6c3a2e Mon Sep 17 00:00:00 2001
From: Jin Choi <jinchoi@u.northwestern.edu>
Date: Wed, 27 May 2026 22:16:55 -0700
Subject: [PATCH 2/2] chore: make templates/hooks/session-capture.sh executable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI's "Verify hook templates are executable" check caught that the Write
tool created the template at 0644. The local copy installed in
.claude/hooks/ was already chmod +x'd, but the template itself wasn't —
so a fresh clone would copy a non-executable template into .claude/hooks/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 templates/hooks/session-capture.sh | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 templates/hooks/session-capture.sh

diff --git a/templates/hooks/session-capture.sh b/templates/hooks/session-capture.sh
old mode 100644
new mode 100755