From f74bd4ed381e3baad9e5ef562e574ab0f606707a Mon Sep 17 00:00:00 2001
From: Mher Shahinyan <shahinyanm@gmail.com>
Date: Thu, 11 Jun 2026 22:59:38 +0400
Subject: [PATCH] feat(skill): make explicit self-tagging the primary capture
 path

Rewrite the bundled task-journal skill around agent self-tagging: open a
task with a goal, append typed events at the moment of commitment (with
alternatives on decisions), and task_close with a written outcome +
outcome_tag so resume packs render Goal -> Decisions -> Outcome without
relying on the auto-classifier. Auto-capture is reframed as a best-effort
backstop that degrades to heuristic-only without an LLM backend.

- Drop the stale evidence_strength reference (not an event_add param);
  document the real goal / alternatives / outcome_tag params.
- Tighten task_create MCP tool description to nudge passing goal
  (expected-always). No behavior change.
- README "How it works" leads with self-tagging, demotes auto-capture.
- Add crates/tj-mcp/tests/skill_doc.rs guarding the skill against param
  drift (present, valid frontmatter, no phantom params, self-tagging-first).
- Plugin minor bump: plugin.json + marketplace 0.12.0 -> 0.13.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .claude-plugin/marketplace.json     |   4 +-
 CHANGELOG.md                        |  15 +++
 README.md                           |   5 +-
 crates/tj-mcp/src/main.rs           |   2 +-
 crates/tj-mcp/tests/skill_doc.rs    |  64 +++++++++
 plugin/.claude-plugin/plugin.json   |   2 +-
 plugin/skills/task-journal/SKILL.md | 193 ++++++++++++++++++----------
 7 files changed, 211 insertions(+), 74 deletions(-)
 create mode 100644 crates/tj-mcp/tests/skill_doc.rs

diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index d02fda6..a83d4bf 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -6,14 +6,14 @@
   },
   "metadata": {
     "description": "Task Journal — append-only reasoning chain memory for AI-coding tasks",
-    "version": "0.12.0"
+    "version": "0.13.0"
   },
   "plugins": [
     {
       "name": "task-journal",
       "source": "./plugin",
       "description": "Append-only journal of AI-coding task reasoning chains. Captures hypotheses, decisions, rejections, evidence — renders compact resume packs so an agent can pick up a 2-week-old task with full context.",
-      "version": "0.12.0",
+      "version": "0.13.0",
       "author": {
         "name": "Digital-Threads"
       },
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5fb0f5d..2efb606 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Changed
+- **Self-tagging is now the primary capture path.** Rewrote the bundled
+  `task-journal` skill (`plugin/skills/task-journal/SKILL.md`) around explicit
+  agent self-tagging: open a task with a `goal`, append typed events at the
+  moment of commitment (with `alternatives` on decisions), and `task_close`
+  with a written `outcome` + `outcome_tag` so packs render Goal → Decisions →
+  Outcome without relying on the classifier. Auto-capture is reframed as a
+  best-effort backstop that degrades to heuristic-only without an LLM backend.
+  Removed the stale `evidence_strength` reference (no such param) and added the
+  real `goal` / `alternatives` / `outcome_tag` params. README "How it works"
+  updated to match. Plugin minor bump (plugin.json + marketplace 0.12.0 →
+  0.13.0).
+- `task_create` MCP tool description now nudges passing `goal` (expected-always)
+  instead of only mentioning title + initial context. No behavior change.
+
 ## [0.12.0]
 
 ### Added
diff --git a/README.md b/README.md
index fb2fd0b..57ca0bf 100644
--- a/README.md
+++ b/README.md
@@ -75,12 +75,13 @@ That's it. Restart Claude Code, start working, and the journal fills itself.
 
 ## How it works
 
-- **Auto-capture via Claude Code hooks.** Every prompt, tool call, and Claude reply runs through a two-stage classifier and lands as a typed event (`finding` / `decision` / `evidence` / `rejection` / …). Stage 1 is a fast in-process heuristic — pattern-matches obvious phrasing in EN+RU for zero cost. Stage 2 falls back to the Anthropic API (`ANTHROPIC_API_KEY`) only when the heuristic is uncertain. Hook returns in <100 ms — both stages run in a detached background worker, never blocking your session.
+- **Self-tagging is the primary path (recommended).** You — the agent in the live session — record reasoning directly via the five MCP tools: open a task with a `goal`, append a typed `decision` / `finding` / `rejection` / `evidence` event at the moment of commitment, and `task_close` with a written `outcome`. This is free (it rides the interactive session), language-agnostic, and higher-fidelity than any after-the-fact classifier. The bundled `task-journal` skill drives this automatically. See [MCP tools](#mcp-tools).
+- **Auto-capture is a best-effort backstop.** Claude Code hooks also run every prompt, tool call, and reply through a two-stage classifier that lands typed events on its own. Stage 1 is a fast in-process heuristic — pattern-matches obvious EN+RU phrasing for zero cost. Stage 2 falls back to an LLM only when the heuristic is uncertain. Without an LLM backend it **degrades to heuristic-only**: it reliably catches keyword-obvious lines but misses real reasoning — especially non-English prose. Treat it as a safety net under your explicit self-tagging, not the main capture mechanism. Hook returns in <100 ms — both stages run in a detached background worker, never blocking your session.
 - **Artifact extraction.** Each event scans its text for commit hashes, PR URLs, file paths, issue IDs, and branch names. Aggregated artifacts are how Task Journal links related tasks: when you start a new task touching the same issue or file, the prior task is surfaced automatically.
 - **Resume packs.** `task_pack` (MCP tool or CLI) renders a task into a compact Markdown briefing — Goal, Outcome, decisions, rejections, evidence, artifacts — that fits in a fresh agent's context window without dumping the raw event log.
 - **Auto-capture boundaries.** Beyond per-event capture, two extra hooks mark *reasoning boundaries* automatically. On `PreCompact`, Task Journal reads the transcript JSONL tail (entries newer than the active task's last event) and enqueues anything the synchronous hooks missed before the compact — then drops a marker decision so the post-compact agent sees a clear cut. A `/rewind`-prefixed prompt appends a single correction event so pack readers see where the user rolled back. No mass-rejection of prior events — the boundary is a sentinel, not a rewrite.
 
-Source of truth is an append-only JSONL log per project. SQLite holds derived state and is fully rebuildable. Nothing is sent off-machine except the classifier prompt to the Anthropic API — and only when the local heuristic is uncertain. With no `ANTHROPIC_API_KEY` set, Task Journal still works: the heuristic handles the obvious cases, and anything it can't classify sits in the local pending queue for later retry.
+Source of truth is an append-only JSONL log per project. SQLite holds derived state and is fully rebuildable. Nothing is sent off-machine except the classifier prompt — and only when auto-capture's local heuristic is uncertain. With no LLM backend configured, Task Journal still works fully: explicit self-tagging is unaffected (it never needs a backend), the heuristic backstop handles the obvious cases, and anything it can't classify sits in the local pending queue for later retry.
 
 ### Statusline integration
 
diff --git a/crates/tj-mcp/src/main.rs b/crates/tj-mcp/src/main.rs
index 12c38a8..55223dc 100644
--- a/crates/tj-mcp/src/main.rs
+++ b/crates/tj-mcp/src/main.rs
@@ -429,7 +429,7 @@ impl TaskJournalServer {
 
     #[tool(
         name = "task_create",
-        description = "Open a new task with title and optional initial context."
+        description = "Open a new task. Always pass `goal` (one sentence: what the user is trying to accomplish) — it is the first line of every resume pack and the anchor for \"why was this done?\" weeks later. `title` is a short label; `initial_context` is optional."
     )]
     async fn task_create(
         &self,
diff --git a/crates/tj-mcp/tests/skill_doc.rs b/crates/tj-mcp/tests/skill_doc.rs
new file mode 100644
index 0000000..aab2a79
--- /dev/null
+++ b/crates/tj-mcp/tests/skill_doc.rs
@@ -0,0 +1,64 @@
+//! Doc-consistency guards for the bundled `task-journal` skill.
+//!
+//! The skill is the agent-facing contract for the five MCP tools. These tests
+//! keep it honest against the real tool signatures so it can never again drift
+//! into documenting params that do not exist (the `evidence_strength` bug) or
+//! drop the ones that do (`goal` / `alternatives` / `outcome_tag`).
+
+use std::path::PathBuf;
+
+fn skill_md() -> String {
+    let path =
+        PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("../../plugin/skills/task-journal/SKILL.md");
+    std::fs::read_to_string(&path)
+        .unwrap_or_else(|e| panic!("bundled skill missing at {}: {e}", path.display()))
+}
+
+#[test]
+fn skill_is_present_with_valid_frontmatter() {
+    let s = skill_md();
+    assert!(
+        s.starts_with("---\n"),
+        "skill must open with YAML frontmatter"
+    );
+    assert!(
+        s.contains("\nname: task-journal\n"),
+        "frontmatter must declare name: task-journal"
+    );
+    // Frontmatter must be closed.
+    assert!(
+        s[4..].contains("\n---\n"),
+        "frontmatter block must be closed with a --- line"
+    );
+}
+
+#[test]
+fn skill_documents_real_event_add_params_not_phantom_ones() {
+    let s = skill_md();
+    // `event_add` has no `evidence_strength` param — that field lives only on
+    // the internal classifier output, never on the MCP tool. The skill must not
+    // tell agents to pass it.
+    assert!(
+        !s.contains("evidence_strength"),
+        "skill must not reference evidence_strength as an event_add param"
+    );
+    // The params the tools actually expose must be present.
+    for needle in [
+        "goal",
+        "alternatives",
+        "outcome_tag",
+        "task_close",
+        "task_search",
+    ] {
+        assert!(s.contains(needle), "skill must document `{needle}`");
+    }
+}
+
+#[test]
+fn skill_frames_self_tagging_as_primary() {
+    let s = skill_md();
+    assert!(
+        s.contains("self-tagging"),
+        "skill must frame self-tagging as the primary capture path"
+    );
+}
diff --git a/plugin/.claude-plugin/plugin.json b/plugin/.claude-plugin/plugin.json
index 2cd165a..647c686 100644
--- a/plugin/.claude-plugin/plugin.json
+++ b/plugin/.claude-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
   "name": "task-journal",
-  "version": "0.12.0",
+  "version": "0.13.0",
   "description": "Append-only journal of AI-coding task reasoning chains: hypotheses, decisions, rejections, evidence. Renders compact resume packs so an agent can pick up a 2-week-old task with full context.",
   "author": {
     "name": "Mher Shahinyan"
diff --git a/plugin/skills/task-journal/SKILL.md b/plugin/skills/task-journal/SKILL.md
index 7e4fbf7..d4e4029 100644
--- a/plugin/skills/task-journal/SKILL.md
+++ b/plugin/skills/task-journal/SKILL.md
@@ -1,73 +1,130 @@
 ---
 name: task-journal
 description: |
-  MANDATORY for every coding session. Use task-journal to track ALL tasks, bugs, and investigations.
-  Create a task at the start, log every significant discovery/decision/error, close when done.
-  This is NOT optional — without it, context is lost between sessions and compactions.
-  Triggers: beginning of work, significant finding, decision made, hypothesis formed, error found,
-  test results obtained, task completed, "what was I working on?", "remind me about X".
+  MANDATORY for every coding session. You — the agent — are the journal's primary
+  recorder. Do NOT rely on automatic capture; it only catches keyword-obvious lines
+  and misses real reasoning, especially in non-English prose. Instead, call the MCP
+  tools yourself at the decisive moments below.
+  Open a task with an explicit goal at the start, append a typed event the moment you
+  decide / reject / discover / prove something, and close with a written outcome so the
+  resume pack comes out clean: Goal -> Decisions -> Outcome.
+  Triggers: start of any task/bug/investigation; a choice is committed; an approach is
+  ruled out; a fact is verified from code or logs; a test/benchmark proves something; an
+  earlier belief turns out wrong; task finished; "what was I working on?", "remind me
+  about X", "почему мы сделали так?".
 ---
 
-# Task Journal — Reasoning Chain Memory
-
-**MANDATORY WORKFLOW — follow without exceptions:**
-
-1. **Start any task/bug/investigation** → `task_create` with descriptive title
-2. **Every significant discovery** → `event_add` with appropriate type (see below)
-3. **Every decision or rejection** → `event_add` (decision/rejection)
-4. **Test results, QA outcomes** → `event_add` with `event_type=evidence`
-5. **Wrong hypothesis corrected** → `event_add` with `event_type=correction` + `corrects=<event_id>`
-6. **Task done** → `task_close` with reason and outcome
-
-## Event type guide — choose the RIGHT one
-
-| Situation | Type | Example |
-|-----------|------|---------|
-| "I think the bug might be in X" | `hypothesis` | Unverified theory, needs checking |
-| "The code shows X does Y at line Z" | `finding` | Verified fact from reading code/logs |
-| "Tests pass", "QA verified on staging" | `evidence` | Proof something works or fails |
-| "We'll use approach X because Y" | `decision` | Committed choice |
-| "Tried X but it won't work because Y" | `rejection` | Explicitly rejected approach |
-| "API rate limit is 100/min" | `constraint` | External limitation discovered |
-| "Actually, previous finding was wrong" | `correction` | Corrects earlier event (set `corrects` field) |
-| "Done, PR merged, verified" | `close` | Task completed |
-
-**Key distinctions:**
-- `hypothesis` = "I think" / "maybe" / "could be" → NOT yet verified
-- `finding` = "I see" / "the code shows" / "confirmed" → verified by reading code/logs
-- `evidence` = ran a test/experiment that PROVES something (set `evidence_strength`: weak/medium/strong)
-- `decision` ≠ `hypothesis`: decision = committed; hypothesis = exploring
-
-## Tools available
-
-The plugin's MCP server exposes 5 tools:
-
-- `task_pack(task_id, mode)` — return Markdown resume pack. `mode`: `compact` (~2KB) or `full` (~10KB).
-- `task_create(title, initial_context?)` — open new task, returns `task_id` like `tj-x9rz1f`.
-- `event_add(task_id, event_type, text, corrects?, supersedes?)` — append event.
-- `task_close(task_id, reason, outcome?)` — close task with reason.
-- `task_search(query)` — FTS5 search current project's events; returns task_ids.
-
-## When to use task_pack
-
-| User says... | Action |
-|--------------|--------|
-| "Remind me about task X", "what was I working on?" | `task_pack` |
-| "Find the task where I decided about Y" | `task_search` → `task_pack` |
-| Session start on existing project | `task_search` for recent open tasks → `task_pack` |
-
-## Key invariants
-
-- **Append-only**: events are never edited. To fix a mistake, write a `correction` event with `corrects: <event_id>`.
-- **One task = one logical objective**: don't create a new task every turn. Events accumulate under one task.
-- **Always close**: when a task is done, call `task_close`. Don't leave tasks open.
-- **Log rejections**: wrong paths are as valuable as correct ones — they prevent repeated mistakes.
-
-## Auto-capture
-
-Hooks are installed via `task-journal install-hooks --scope user`. Auto-classification runs through a two-stage hybrid: a local heuristic catches obvious decisions, rejections, evidence, and findings for free; ambiguous chunks fall back to the Anthropic API when `ANTHROPIC_API_KEY` is set. Without the key, the heuristic still works and uncertain chunks queue for later retry. Manual recording via MCP tools always works as a complement.
-
-## Storage
-
-Events: `$XDG_DATA_HOME/task-journal/events/<project_hash>.jsonl`
-State: `state/<hash>.sqlite` (rebuildable via `task-journal rebuild-state`)
+# Task Journal — Reasoning Chain Memory (self-tagging first)
+
+You are the recorder. The point of this journal is the **"why" layer**: weeks later the
+code shows *what* changed, not *why*. Capture the decisions and the outcome as they
+happen, in your own words, terse and specific. Automatic capture is a weak fallback —
+treat it as if it does nothing and record explicitly.
+
+## Session-start ritual (do this before real work)
+
+1. `task_search(query=<a few words about the work>, status="open")` — is there an open
+   task for this? If yes, `task_pack(task_id)` and continue it. **Do not** open a duplicate.
+2. If nothing fits, `task_create(title=<short>, goal=<one sentence: what the user is trying to accomplish>)`. **Always pass `goal`** — it is the first line of every pack and
+   the anchor for "why was this done?".
+3. Hold the returned `task_id` for the whole task. One task = one logical objective.
+   Events accumulate under it; do not spawn a new task per turn.
+
+## The decisive moments — when you MUST call `event_add`
+
+Call `event_add(task_id, event_type, text, ...)` the moment one of these happens. Don't
+batch it to the end; record at the point of commitment while the reasoning is fresh.
+
+| The moment | `event_type` | What goes in `text` (1–2 terse sentences, specifics) |
+|------------|--------------|------------------------------------------------------|
+| You commit to an approach / architecture choice | `decision` | The choice + the because. Include file/lib/IDs. |
+| You rule an approach out | `rejection` | What you tried + why it won't work. Prevents repeat work. |
+| You verify a fact from code/logs | `finding` | The fact + where (file:line, config key). |
+| A test / benchmark / repro proves something | `evidence` | What ran + the result (numbers, pass/fail). |
+| You hit an external limit | `constraint` | The limit (rate, version, platform). |
+| An earlier event turns out wrong | `correction` | The correction. Set `corrects=<event_id>`. |
+| This task replaces another | `supersede` | Set `supersedes=<task_id>`. |
+| You form an unverified theory worth tracking | `hypothesis` | "maybe X because Y" — only if you'll act on it. |
+
+Capture in the user's language; keep it short and concrete. A good event is one line a
+human can read cold in two weeks and understand.
+
+### Decisions: record the alternatives you weighed
+
+For a `decision`, pass the structured `alternatives` array so the considered options and
+the final pick are explicit (this is decision-only; it errors on other types):
+
+```
+event_add(
+  task_id="tj-x9rz1f",
+  event_type="decision",
+  text="Use fd-lock for the cross-platform JSONL file lock.",
+  alternatives=[
+    {"option": "fd-lock", "chosen": true,  "rationale": "single API across Win/Unix, maintained"},
+    {"option": "rustix flock", "chosen": false, "rationale": "more code, manual Windows path"},
+    {"option": "advisory-only", "chosen": false, "rationale": "races under 5–6 parallel CC instances"}
+  ]
+)
+```
+
+### Corrections never edit — they append
+
+Events are append-only. To fix a wrong earlier event, write a `correction` with
+`corrects=<event_id>` (the prior call returned that id). Same for `supersede`.
+
+## Closing — REQUIRED, and this is what makes the pack clean
+
+When the objective is met (or abandoned), call `task_close`:
+
+```
+task_close(
+  task_id="tj-x9rz1f",
+  reason="off-by-one fixed, regression test green, PR #1284 merged",
+  outcome="Token no longer dropped on refresh boundary; covered by test_token_refresh_boundary.",
+  outcome_tag="done"   # done | abandoned | superseded
+)
+```
+
+- `outcome` is the human-readable result line that renders as **Outcome [tag]** in the pack.
+  Without it the pack has no conclusion and reads like raw log. Always write it.
+- `outcome_tag` must be `done`, `abandoned`, or `superseded`.
+- Don't leave tasks open. An open task with no close = no Outcome line = the exact "I see
+  noise, no signal" failure this skill exists to prevent.
+
+## Reading back — `task_pack` / `task_search`
+
+| User says | Action |
+|-----------|--------|
+| "remind me about task X", "what was I working on?" | `task_pack(task_id, mode="compact")` |
+| "find where I decided about Y" | `task_search(query="Y", event_type="decision")` → `task_pack` |
+| Resuming an existing project | `task_search(status="open")` → `task_pack` on the match |
+
+`task_pack` mode: `compact` (~2KB, default for resume) or `full` (~10KB, full trail).
+`task_search` filters: `status`, `project`, `event_type` (decision/finding/evidence/…).
+
+## The 5 MCP tools (exact params)
+
+- `task_create(title, goal?, initial_context?, parent?)` → `task_id` like `tj-x9rz1f`. **Pass `goal`.**
+- `event_add(task_id, event_type, text, corrects?, supersedes?, alternatives?)` → `event_id`.
+  `event_type` ∈ hypothesis | finding | evidence | decision | rejection | constraint |
+  correction | reopen | supersede | redirect. `alternatives` is decision-only.
+- `task_close(task_id, reason, outcome?, outcome_tag?)` — **always pass `outcome` + `outcome_tag`.**
+- `task_pack(task_id, mode?)` — `compact` | `full`.
+- `task_search(query, status?, project?, event_type?)` — FTS5 over this project's events.
+
+## Invariants
+
+- **Append-only.** Never edit; correct via a new `correction`/`supersede` event.
+- **One task = one objective.** Don't fragment a single goal across many tasks.
+- **Record at the moment, not at the end.** Decisions logged after the fact get lost.
+- **Rejections are as valuable as decisions** — they stop you re-walking dead ends.
+- **Close with an outcome, every time.**
+
+## Why explicit (not automatic)
+
+The hybrid auto-classifier only fires confidently on English keyword patterns; ambiguous
+or non-English prose needs an LLM it can't reach without an API key (and post-2026-06-15
+that path consumes a separate credit pool). You, the agent in the live session, already
+know what you decided — so record it directly. That is both free (it rides the interactive
+session) and higher fidelity than any after-the-fact classifier. Auto-capture stays on as a
+backstop; do not depend on it.