From f74bd4ed381e3baad9e5ef562e574ab0f606707a Mon Sep 17 00:00:00 2001 From: Mher Shahinyan Date: Thu, 11 Jun 2026 22:59:38 +0400 Subject: [PATCH] feat(skill): make explicit self-tagging the primary capture path Rewrite the bundled task-journal skill around agent self-tagging: open a task with a goal, append typed events at the moment of commitment (with alternatives on decisions), and task_close with a written outcome + outcome_tag so resume packs render Goal -> Decisions -> Outcome without relying on the auto-classifier. Auto-capture is reframed as a best-effort backstop that degrades to heuristic-only without an LLM backend. - Drop the stale evidence_strength reference (not an event_add param); document the real goal / alternatives / outcome_tag params. - Tighten task_create MCP tool description to nudge passing goal (expected-always). No behavior change. - README "How it works" leads with self-tagging, demotes auto-capture. - Add crates/tj-mcp/tests/skill_doc.rs guarding the skill against param drift (present, valid frontmatter, no phantom params, self-tagging-first). - Plugin minor bump: plugin.json + marketplace 0.12.0 -> 0.13.0. Co-Authored-By: Claude Opus 4.8 (1M context) --- .claude-plugin/marketplace.json | 4 +- CHANGELOG.md | 15 +++ README.md | 5 +- crates/tj-mcp/src/main.rs | 2 +- crates/tj-mcp/tests/skill_doc.rs | 64 +++++++++ plugin/.claude-plugin/plugin.json | 2 +- plugin/skills/task-journal/SKILL.md | 193 ++++++++++++++++++---------- 7 files changed, 211 insertions(+), 74 deletions(-) create mode 100644 crates/tj-mcp/tests/skill_doc.rs diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index d02fda6..a83d4bf 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -6,14 +6,14 @@ }, "metadata": { "description": "Task Journal — append-only reasoning chain memory for AI-coding tasks", - "version": "0.12.0" + "version": "0.13.0" }, "plugins": [ { "name": "task-journal", "source": "./plugin", "description": "Append-only journal of AI-coding task reasoning chains. Captures hypotheses, decisions, rejections, evidence — renders compact resume packs so an agent can pick up a 2-week-old task with full context.", - "version": "0.12.0", + "version": "0.13.0", "author": { "name": "Digital-Threads" }, diff --git a/CHANGELOG.md b/CHANGELOG.md index 5fb0f5d..2efb606 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Changed +- **Self-tagging is now the primary capture path.** Rewrote the bundled + `task-journal` skill (`plugin/skills/task-journal/SKILL.md`) around explicit + agent self-tagging: open a task with a `goal`, append typed events at the + moment of commitment (with `alternatives` on decisions), and `task_close` + with a written `outcome` + `outcome_tag` so packs render Goal → Decisions → + Outcome without relying on the classifier. Auto-capture is reframed as a + best-effort backstop that degrades to heuristic-only without an LLM backend. + Removed the stale `evidence_strength` reference (no such param) and added the + real `goal` / `alternatives` / `outcome_tag` params. README "How it works" + updated to match. Plugin minor bump (plugin.json + marketplace 0.12.0 → + 0.13.0). +- `task_create` MCP tool description now nudges passing `goal` (expected-always) + instead of only mentioning title + initial context. No behavior change. + ## [0.12.0] ### Added diff --git a/README.md b/README.md index fb2fd0b..57ca0bf 100644 --- a/README.md +++ b/README.md @@ -75,12 +75,13 @@ That's it. Restart Claude Code, start working, and the journal fills itself. ## How it works -- **Auto-capture via Claude Code hooks.** Every prompt, tool call, and Claude reply runs through a two-stage classifier and lands as a typed event (`finding` / `decision` / `evidence` / `rejection` / …). Stage 1 is a fast in-process heuristic — pattern-matches obvious phrasing in EN+RU for zero cost. Stage 2 falls back to the Anthropic API (`ANTHROPIC_API_KEY`) only when the heuristic is uncertain. Hook returns in <100 ms — both stages run in a detached background worker, never blocking your session. +- **Self-tagging is the primary path (recommended).** You — the agent in the live session — record reasoning directly via the five MCP tools: open a task with a `goal`, append a typed `decision` / `finding` / `rejection` / `evidence` event at the moment of commitment, and `task_close` with a written `outcome`. This is free (it rides the interactive session), language-agnostic, and higher-fidelity than any after-the-fact classifier. The bundled `task-journal` skill drives this automatically. See [MCP tools](#mcp-tools). +- **Auto-capture is a best-effort backstop.** Claude Code hooks also run every prompt, tool call, and reply through a two-stage classifier that lands typed events on its own. Stage 1 is a fast in-process heuristic — pattern-matches obvious EN+RU phrasing for zero cost. Stage 2 falls back to an LLM only when the heuristic is uncertain. Without an LLM backend it **degrades to heuristic-only**: it reliably catches keyword-obvious lines but misses real reasoning — especially non-English prose. Treat it as a safety net under your explicit self-tagging, not the main capture mechanism. Hook returns in <100 ms — both stages run in a detached background worker, never blocking your session. - **Artifact extraction.** Each event scans its text for commit hashes, PR URLs, file paths, issue IDs, and branch names. Aggregated artifacts are how Task Journal links related tasks: when you start a new task touching the same issue or file, the prior task is surfaced automatically. - **Resume packs.** `task_pack` (MCP tool or CLI) renders a task into a compact Markdown briefing — Goal, Outcome, decisions, rejections, evidence, artifacts — that fits in a fresh agent's context window without dumping the raw event log. - **Auto-capture boundaries.** Beyond per-event capture, two extra hooks mark *reasoning boundaries* automatically. On `PreCompact`, Task Journal reads the transcript JSONL tail (entries newer than the active task's last event) and enqueues anything the synchronous hooks missed before the compact — then drops a marker decision so the post-compact agent sees a clear cut. A `/rewind`-prefixed prompt appends a single correction event so pack readers see where the user rolled back. No mass-rejection of prior events — the boundary is a sentinel, not a rewrite. -Source of truth is an append-only JSONL log per project. SQLite holds derived state and is fully rebuildable. Nothing is sent off-machine except the classifier prompt to the Anthropic API — and only when the local heuristic is uncertain. With no `ANTHROPIC_API_KEY` set, Task Journal still works: the heuristic handles the obvious cases, and anything it can't classify sits in the local pending queue for later retry. +Source of truth is an append-only JSONL log per project. SQLite holds derived state and is fully rebuildable. Nothing is sent off-machine except the classifier prompt — and only when auto-capture's local heuristic is uncertain. With no LLM backend configured, Task Journal still works fully: explicit self-tagging is unaffected (it never needs a backend), the heuristic backstop handles the obvious cases, and anything it can't classify sits in the local pending queue for later retry. ### Statusline integration diff --git a/crates/tj-mcp/src/main.rs b/crates/tj-mcp/src/main.rs index 12c38a8..55223dc 100644 --- a/crates/tj-mcp/src/main.rs +++ b/crates/tj-mcp/src/main.rs @@ -429,7 +429,7 @@ impl TaskJournalServer { #[tool( name = "task_create", - description = "Open a new task with title and optional initial context." + description = "Open a new task. Always pass `goal` (one sentence: what the user is trying to accomplish) — it is the first line of every resume pack and the anchor for \"why was this done?\" weeks later. `title` is a short label; `initial_context` is optional." )] async fn task_create( &self, diff --git a/crates/tj-mcp/tests/skill_doc.rs b/crates/tj-mcp/tests/skill_doc.rs new file mode 100644 index 0000000..aab2a79 --- /dev/null +++ b/crates/tj-mcp/tests/skill_doc.rs @@ -0,0 +1,64 @@ +//! Doc-consistency guards for the bundled `task-journal` skill. +//! +//! The skill is the agent-facing contract for the five MCP tools. These tests +//! keep it honest against the real tool signatures so it can never again drift +//! into documenting params that do not exist (the `evidence_strength` bug) or +//! drop the ones that do (`goal` / `alternatives` / `outcome_tag`). + +use std::path::PathBuf; + +fn skill_md() -> String { + let path = + PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("../../plugin/skills/task-journal/SKILL.md"); + std::fs::read_to_string(&path) + .unwrap_or_else(|e| panic!("bundled skill missing at {}: {e}", path.display())) +} + +#[test] +fn skill_is_present_with_valid_frontmatter() { + let s = skill_md(); + assert!( + s.starts_with("---\n"), + "skill must open with YAML frontmatter" + ); + assert!( + s.contains("\nname: task-journal\n"), + "frontmatter must declare name: task-journal" + ); + // Frontmatter must be closed. + assert!( + s[4..].contains("\n---\n"), + "frontmatter block must be closed with a --- line" + ); +} + +#[test] +fn skill_documents_real_event_add_params_not_phantom_ones() { + let s = skill_md(); + // `event_add` has no `evidence_strength` param — that field lives only on + // the internal classifier output, never on the MCP tool. The skill must not + // tell agents to pass it. + assert!( + !s.contains("evidence_strength"), + "skill must not reference evidence_strength as an event_add param" + ); + // The params the tools actually expose must be present. + for needle in [ + "goal", + "alternatives", + "outcome_tag", + "task_close", + "task_search", + ] { + assert!(s.contains(needle), "skill must document `{needle}`"); + } +} + +#[test] +fn skill_frames_self_tagging_as_primary() { + let s = skill_md(); + assert!( + s.contains("self-tagging"), + "skill must frame self-tagging as the primary capture path" + ); +} diff --git a/plugin/.claude-plugin/plugin.json b/plugin/.claude-plugin/plugin.json index 2cd165a..647c686 100644 --- a/plugin/.claude-plugin/plugin.json +++ b/plugin/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "task-journal", - "version": "0.12.0", + "version": "0.13.0", "description": "Append-only journal of AI-coding task reasoning chains: hypotheses, decisions, rejections, evidence. Renders compact resume packs so an agent can pick up a 2-week-old task with full context.", "author": { "name": "Mher Shahinyan" diff --git a/plugin/skills/task-journal/SKILL.md b/plugin/skills/task-journal/SKILL.md index 7e4fbf7..d4e4029 100644 --- a/plugin/skills/task-journal/SKILL.md +++ b/plugin/skills/task-journal/SKILL.md @@ -1,73 +1,130 @@ --- name: task-journal description: | - MANDATORY for every coding session. Use task-journal to track ALL tasks, bugs, and investigations. - Create a task at the start, log every significant discovery/decision/error, close when done. - This is NOT optional — without it, context is lost between sessions and compactions. - Triggers: beginning of work, significant finding, decision made, hypothesis formed, error found, - test results obtained, task completed, "what was I working on?", "remind me about X". + MANDATORY for every coding session. You — the agent — are the journal's primary + recorder. Do NOT rely on automatic capture; it only catches keyword-obvious lines + and misses real reasoning, especially in non-English prose. Instead, call the MCP + tools yourself at the decisive moments below. + Open a task with an explicit goal at the start, append a typed event the moment you + decide / reject / discover / prove something, and close with a written outcome so the + resume pack comes out clean: Goal -> Decisions -> Outcome. + Triggers: start of any task/bug/investigation; a choice is committed; an approach is + ruled out; a fact is verified from code or logs; a test/benchmark proves something; an + earlier belief turns out wrong; task finished; "what was I working on?", "remind me + about X", "почему мы сделали так?". --- -# Task Journal — Reasoning Chain Memory - -**MANDATORY WORKFLOW — follow without exceptions:** - -1. **Start any task/bug/investigation** → `task_create` with descriptive title -2. **Every significant discovery** → `event_add` with appropriate type (see below) -3. **Every decision or rejection** → `event_add` (decision/rejection) -4. **Test results, QA outcomes** → `event_add` with `event_type=evidence` -5. **Wrong hypothesis corrected** → `event_add` with `event_type=correction` + `corrects=` -6. **Task done** → `task_close` with reason and outcome - -## Event type guide — choose the RIGHT one - -| Situation | Type | Example | -|-----------|------|---------| -| "I think the bug might be in X" | `hypothesis` | Unverified theory, needs checking | -| "The code shows X does Y at line Z" | `finding` | Verified fact from reading code/logs | -| "Tests pass", "QA verified on staging" | `evidence` | Proof something works or fails | -| "We'll use approach X because Y" | `decision` | Committed choice | -| "Tried X but it won't work because Y" | `rejection` | Explicitly rejected approach | -| "API rate limit is 100/min" | `constraint` | External limitation discovered | -| "Actually, previous finding was wrong" | `correction` | Corrects earlier event (set `corrects` field) | -| "Done, PR merged, verified" | `close` | Task completed | - -**Key distinctions:** -- `hypothesis` = "I think" / "maybe" / "could be" → NOT yet verified -- `finding` = "I see" / "the code shows" / "confirmed" → verified by reading code/logs -- `evidence` = ran a test/experiment that PROVES something (set `evidence_strength`: weak/medium/strong) -- `decision` ≠ `hypothesis`: decision = committed; hypothesis = exploring - -## Tools available - -The plugin's MCP server exposes 5 tools: - -- `task_pack(task_id, mode)` — return Markdown resume pack. `mode`: `compact` (~2KB) or `full` (~10KB). -- `task_create(title, initial_context?)` — open new task, returns `task_id` like `tj-x9rz1f`. -- `event_add(task_id, event_type, text, corrects?, supersedes?)` — append event. -- `task_close(task_id, reason, outcome?)` — close task with reason. -- `task_search(query)` — FTS5 search current project's events; returns task_ids. - -## When to use task_pack - -| User says... | Action | -|--------------|--------| -| "Remind me about task X", "what was I working on?" | `task_pack` | -| "Find the task where I decided about Y" | `task_search` → `task_pack` | -| Session start on existing project | `task_search` for recent open tasks → `task_pack` | - -## Key invariants - -- **Append-only**: events are never edited. To fix a mistake, write a `correction` event with `corrects: `. -- **One task = one logical objective**: don't create a new task every turn. Events accumulate under one task. -- **Always close**: when a task is done, call `task_close`. Don't leave tasks open. -- **Log rejections**: wrong paths are as valuable as correct ones — they prevent repeated mistakes. - -## Auto-capture - -Hooks are installed via `task-journal install-hooks --scope user`. Auto-classification runs through a two-stage hybrid: a local heuristic catches obvious decisions, rejections, evidence, and findings for free; ambiguous chunks fall back to the Anthropic API when `ANTHROPIC_API_KEY` is set. Without the key, the heuristic still works and uncertain chunks queue for later retry. Manual recording via MCP tools always works as a complement. - -## Storage - -Events: `$XDG_DATA_HOME/task-journal/events/.jsonl` -State: `state/.sqlite` (rebuildable via `task-journal rebuild-state`) +# Task Journal — Reasoning Chain Memory (self-tagging first) + +You are the recorder. The point of this journal is the **"why" layer**: weeks later the +code shows *what* changed, not *why*. Capture the decisions and the outcome as they +happen, in your own words, terse and specific. Automatic capture is a weak fallback — +treat it as if it does nothing and record explicitly. + +## Session-start ritual (do this before real work) + +1. `task_search(query=, status="open")` — is there an open + task for this? If yes, `task_pack(task_id)` and continue it. **Do not** open a duplicate. +2. If nothing fits, `task_create(title=, goal=)`. **Always pass `goal`** — it is the first line of every pack and + the anchor for "why was this done?". +3. Hold the returned `task_id` for the whole task. One task = one logical objective. + Events accumulate under it; do not spawn a new task per turn. + +## The decisive moments — when you MUST call `event_add` + +Call `event_add(task_id, event_type, text, ...)` the moment one of these happens. Don't +batch it to the end; record at the point of commitment while the reasoning is fresh. + +| The moment | `event_type` | What goes in `text` (1–2 terse sentences, specifics) | +|------------|--------------|------------------------------------------------------| +| You commit to an approach / architecture choice | `decision` | The choice + the because. Include file/lib/IDs. | +| You rule an approach out | `rejection` | What you tried + why it won't work. Prevents repeat work. | +| You verify a fact from code/logs | `finding` | The fact + where (file:line, config key). | +| A test / benchmark / repro proves something | `evidence` | What ran + the result (numbers, pass/fail). | +| You hit an external limit | `constraint` | The limit (rate, version, platform). | +| An earlier event turns out wrong | `correction` | The correction. Set `corrects=`. | +| This task replaces another | `supersede` | Set `supersedes=`. | +| You form an unverified theory worth tracking | `hypothesis` | "maybe X because Y" — only if you'll act on it. | + +Capture in the user's language; keep it short and concrete. A good event is one line a +human can read cold in two weeks and understand. + +### Decisions: record the alternatives you weighed + +For a `decision`, pass the structured `alternatives` array so the considered options and +the final pick are explicit (this is decision-only; it errors on other types): + +``` +event_add( + task_id="tj-x9rz1f", + event_type="decision", + text="Use fd-lock for the cross-platform JSONL file lock.", + alternatives=[ + {"option": "fd-lock", "chosen": true, "rationale": "single API across Win/Unix, maintained"}, + {"option": "rustix flock", "chosen": false, "rationale": "more code, manual Windows path"}, + {"option": "advisory-only", "chosen": false, "rationale": "races under 5–6 parallel CC instances"} + ] +) +``` + +### Corrections never edit — they append + +Events are append-only. To fix a wrong earlier event, write a `correction` with +`corrects=` (the prior call returned that id). Same for `supersede`. + +## Closing — REQUIRED, and this is what makes the pack clean + +When the objective is met (or abandoned), call `task_close`: + +``` +task_close( + task_id="tj-x9rz1f", + reason="off-by-one fixed, regression test green, PR #1284 merged", + outcome="Token no longer dropped on refresh boundary; covered by test_token_refresh_boundary.", + outcome_tag="done" # done | abandoned | superseded +) +``` + +- `outcome` is the human-readable result line that renders as **Outcome [tag]** in the pack. + Without it the pack has no conclusion and reads like raw log. Always write it. +- `outcome_tag` must be `done`, `abandoned`, or `superseded`. +- Don't leave tasks open. An open task with no close = no Outcome line = the exact "I see + noise, no signal" failure this skill exists to prevent. + +## Reading back — `task_pack` / `task_search` + +| User says | Action | +|-----------|--------| +| "remind me about task X", "what was I working on?" | `task_pack(task_id, mode="compact")` | +| "find where I decided about Y" | `task_search(query="Y", event_type="decision")` → `task_pack` | +| Resuming an existing project | `task_search(status="open")` → `task_pack` on the match | + +`task_pack` mode: `compact` (~2KB, default for resume) or `full` (~10KB, full trail). +`task_search` filters: `status`, `project`, `event_type` (decision/finding/evidence/…). + +## The 5 MCP tools (exact params) + +- `task_create(title, goal?, initial_context?, parent?)` → `task_id` like `tj-x9rz1f`. **Pass `goal`.** +- `event_add(task_id, event_type, text, corrects?, supersedes?, alternatives?)` → `event_id`. + `event_type` ∈ hypothesis | finding | evidence | decision | rejection | constraint | + correction | reopen | supersede | redirect. `alternatives` is decision-only. +- `task_close(task_id, reason, outcome?, outcome_tag?)` — **always pass `outcome` + `outcome_tag`.** +- `task_pack(task_id, mode?)` — `compact` | `full`. +- `task_search(query, status?, project?, event_type?)` — FTS5 over this project's events. + +## Invariants + +- **Append-only.** Never edit; correct via a new `correction`/`supersede` event. +- **One task = one objective.** Don't fragment a single goal across many tasks. +- **Record at the moment, not at the end.** Decisions logged after the fact get lost. +- **Rejections are as valuable as decisions** — they stop you re-walking dead ends. +- **Close with an outcome, every time.** + +## Why explicit (not automatic) + +The hybrid auto-classifier only fires confidently on English keyword patterns; ambiguous +or non-English prose needs an LLM it can't reach without an API key (and post-2026-06-15 +that path consumes a separate credit pool). You, the agent in the live session, already +know what you decided — so record it directly. That is both free (it rides the interactive +session) and higher fidelity than any after-the-fact classifier. Auto-capture stays on as a +backstop; do not depend on it.