Merged
4 changes: 2 additions & 2 deletions .claude-plugin/marketplace.json
@@ -6,14 +6,14 @@
},
"metadata": {
"description": "Task Journal — append-only reasoning chain memory for AI-coding tasks",
"version": "0.12.0"
"version": "0.13.0"
},
"plugins": [
{
"name": "task-journal",
"source": "./plugin",
"description": "Append-only journal of AI-coding task reasoning chains. Captures hypotheses, decisions, rejections, evidence — renders compact resume packs so an agent can pick up a 2-week-old task with full context.",
"version": "0.12.0",
"version": "0.13.0",
"author": {
"name": "Digital-Threads"
},
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed
- **Self-tagging is now the primary capture path.** Rewrote the bundled
`task-journal` skill (`plugin/skills/task-journal/SKILL.md`) around explicit
agent self-tagging: open a task with a `goal`, append typed events at the
moment of commitment (with `alternatives` on decisions), and `task_close`
with a written `outcome` + `outcome_tag` so packs render Goal → Decisions →
Outcome without relying on the classifier. Auto-capture is reframed as a
best-effort backstop that degrades to heuristic-only without an LLM backend.
Removed the stale `evidence_strength` reference (no such param) and added the
real `goal` / `alternatives` / `outcome_tag` params. README "How it works"
updated to match. Plugin minor bump (plugin.json + marketplace 0.12.0 →
0.13.0).
- `task_create` MCP tool description now nudges passing `goal` (expected-always)
instead of only mentioning title + initial context. No behavior change.

## [0.12.0]

### Added
5 changes: 3 additions & 2 deletions README.md
@@ -75,12 +75,13 @@ That's it. Restart Claude Code, start working, and the journal fills itself.

## How it works

- **Auto-capture via Claude Code hooks.** Every prompt, tool call, and Claude reply runs through a two-stage classifier and lands as a typed event (`finding` / `decision` / `evidence` / `rejection` / …). Stage 1 is a fast in-process heuristic — pattern-matches obvious phrasing in EN+RU for zero cost. Stage 2 falls back to the Anthropic API (`ANTHROPIC_API_KEY`) only when the heuristic is uncertain. Hook returns in <100 ms — both stages run in a detached background worker, never blocking your session.
- **Self-tagging is the primary path (recommended).** You — the agent in the live session — record reasoning directly via the five MCP tools: open a task with a `goal`, append a typed `decision` / `finding` / `rejection` / `evidence` event at the moment of commitment, and `task_close` with a written `outcome`. This is free (it rides the interactive session), language-agnostic, and higher-fidelity than any after-the-fact classifier. The bundled `task-journal` skill drives this automatically. See [MCP tools](#mcp-tools).
- **Auto-capture is a best-effort backstop.** Claude Code hooks also run every prompt, tool call, and reply through a two-stage classifier that lands typed events on its own. Stage 1 is a fast in-process heuristic — pattern-matches obvious EN+RU phrasing for zero cost. Stage 2 falls back to an LLM only when the heuristic is uncertain. Without an LLM backend it **degrades to heuristic-only**: it reliably catches keyword-obvious lines but misses real reasoning — especially non-English prose. Treat it as a safety net under your explicit self-tagging, not the main capture mechanism. Hook returns in <100 ms — both stages run in a detached background worker, never blocking your session.
- **Artifact extraction.** Each event scans its text for commit hashes, PR URLs, file paths, issue IDs, and branch names. Aggregated artifacts are how Task Journal links related tasks: when you start a new task touching the same issue or file, the prior task is surfaced automatically.
- **Resume packs.** `task_pack` (MCP tool or CLI) renders a task into a compact Markdown briefing — Goal, Outcome, decisions, rejections, evidence, artifacts — that fits in a fresh agent's context window without dumping the raw event log.
- **Auto-capture boundaries.** Beyond per-event capture, two extra hooks mark *reasoning boundaries* automatically. On `PreCompact`, Task Journal reads the transcript JSONL tail (entries newer than the active task's last event) and enqueues anything the synchronous hooks missed before the compact — then drops a marker decision so the post-compact agent sees a clear cut. A `/rewind`-prefixed prompt appends a single correction event so pack readers see where the user rolled back. No mass-rejection of prior events — the boundary is a sentinel, not a rewrite.
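
The two-stage flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Task Journal's actual classifier: the `EventType` and `Classified` enums, the keyword patterns, and the `llm` callback are hypothetical names chosen for the example, and the heuristic here covers English only.

```rust
/// Event types named in the README; this enum is a hypothetical illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventType {
    Decision,
    Rejection,
    Evidence,
    Finding,
}

/// Result of trying to classify one chunk of text.
#[derive(Debug, PartialEq, Eq)]
pub enum Classified {
    /// A stage was confident enough to assign a type.
    Typed(EventType),
    /// No stage could decide; the chunk would wait in the pending queue.
    Pending,
}

/// Stage 1: cheap keyword heuristic (hypothetical patterns, English only).
pub fn heuristic(text: &str) -> Option<EventType> {
    let t = text.to_lowercase();
    if t.contains("we'll use") || t.contains("decided to") {
        Some(EventType::Decision)
    } else if t.contains("won't work") || t.contains("ruled out") {
        Some(EventType::Rejection)
    } else if t.contains("tests pass") || t.contains("benchmark") {
        Some(EventType::Evidence)
    } else {
        None
    }
}

/// Stage 2 is optional: `llm` is `None` when no backend is configured,
/// which models the "degrades to heuristic-only" behaviour.
pub fn classify(text: &str, llm: Option<&dyn Fn(&str) -> Option<EventType>>) -> Classified {
    if let Some(kind) = heuristic(text) {
        return Classified::Typed(kind);
    }
    match llm.and_then(|backend| backend(text)) {
        Some(kind) => Classified::Typed(kind),
        None => Classified::Pending,
    }
}
```

With no backend, a keyword-obvious English line is still typed, while anything the heuristic cannot match ends up pending. That gap is why explicit self-tagging is recommended as the primary path.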

Source of truth is an append-only JSONL log per project. SQLite holds derived state and is fully rebuildable. Nothing is sent off-machine except the classifier prompt to the Anthropic API — and only when the local heuristic is uncertain. With no `ANTHROPIC_API_KEY` set, Task Journal still works: the heuristic handles the obvious cases, and anything it can't classify sits in the local pending queue for later retry.
Source of truth is an append-only JSONL log per project. SQLite holds derived state and is fully rebuildable. Nothing is sent off-machine except the classifier prompt — and only when auto-capture's local heuristic is uncertain. With no LLM backend configured, Task Journal still works fully: explicit self-tagging is unaffected (it never needs a backend), the heuristic backstop handles the obvious cases, and anything it can't classify sits in the local pending queue for later retry.
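
As a rough sketch of the "append-only log, rebuildable derived state" idea, the example below appends one serialized event per line and rebuilds a small piece of derived state purely by replaying the log. The `"task_id":"..."` field layout, the function names, and the per-task event count are assumptions made for illustration; the real schema and the SQLite state are not shown in this change.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Append one already-serialized event as a single line. The writer is
/// expected to be opened in append mode (e.g. `OpenOptions::new().append(true)`).
pub fn append_event<W: Write>(log: &mut W, json_line: &str) -> io::Result<()> {
    debug_assert!(!json_line.contains('\n'), "one event per line");
    writeln!(log, "{json_line}")
}

/// Derived state: number of events per task, rebuilt only from the log.
/// Nothing here is stored separately, so it can always be regenerated.
pub fn rebuild_counts(log: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for line in log.lines().filter(|l| !l.trim().is_empty()) {
        if let Some(id) = extract_task_id(line) {
            *counts.entry(id.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Naive field lookup that keeps the sketch dependency-free; real code
/// would use a JSON parser. The key layout is hypothetical.
fn extract_task_id(line: &str) -> Option<&str> {
    let key = "\"task_id\":\"";
    let start = line.find(key)? + key.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}
```

Because the derived map is a pure function of the log, deleting it loses nothing. That is the property the README relies on when it calls the SQLite state fully rebuildable.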

### Statusline integration

2 changes: 1 addition & 1 deletion crates/tj-mcp/src/main.rs
@@ -429,7 +429,7 @@ impl TaskJournalServer {

#[tool(
name = "task_create",
description = "Open a new task with title and optional initial context."
description = "Open a new task. Always pass `goal` (one sentence: what the user is trying to accomplish) — it is the first line of every resume pack and the anchor for \"why was this done?\" weeks later. `title` is a short label; `initial_context` is optional."
)]
async fn task_create(
&self,
64 changes: 64 additions & 0 deletions crates/tj-mcp/tests/skill_doc.rs
@@ -0,0 +1,64 @@
//! Doc-consistency guards for the bundled `task-journal` skill.
//!
//! The skill is the agent-facing contract for the five MCP tools. These tests
//! keep it honest against the real tool signatures so it can never again drift
//! into documenting params that do not exist (the `evidence_strength` bug) or
//! drop the ones that do (`goal` / `alternatives` / `outcome_tag`).

use std::path::PathBuf;

fn skill_md() -> String {
let path =
PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("../../plugin/skills/task-journal/SKILL.md");
std::fs::read_to_string(&path)
.unwrap_or_else(|e| panic!("bundled skill missing at {}: {e}", path.display()))
}

#[test]
fn skill_is_present_with_valid_frontmatter() {
let s = skill_md();
assert!(
s.starts_with("---\n"),
"skill must open with YAML frontmatter"
);
assert!(
s.contains("\nname: task-journal\n"),
"frontmatter must declare name: task-journal"
);
// Frontmatter must be closed.
assert!(
s[4..].contains("\n---\n"),
"frontmatter block must be closed with a --- line"
);
}

#[test]
fn skill_documents_real_event_add_params_not_phantom_ones() {
let s = skill_md();
// `event_add` has no `evidence_strength` param — that field lives only on
// the internal classifier output, never on the MCP tool. The skill must not
// tell agents to pass it.
assert!(
!s.contains("evidence_strength"),
"skill must not reference evidence_strength as an event_add param"
);
// The params the tools actually expose must be present.
for needle in [
"goal",
"alternatives",
"outcome_tag",
"task_close",
"task_search",
] {
assert!(s.contains(needle), "skill must document `{needle}`");
}
}

#[test]
fn skill_frames_self_tagging_as_primary() {
let s = skill_md();
assert!(
s.contains("self-tagging"),
"skill must frame self-tagging as the primary capture path"
);
}
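
One possible tightening of the frontmatter guard above, offered as a sketch rather than as part of this change: the `contains("\nname: task-journal\n")` check matches anywhere in the file, so a helper that first isolates the frontmatter block would scope the assertion to the YAML header only. The `frontmatter` and `frontmatter_has` helpers are hypothetical names, and the line-based matching is a deliberate simplification rather than a YAML parser.

```rust
/// Return the frontmatter body (the text between the opening and closing
/// `---` lines), or `None` if the document has no closed frontmatter block.
pub fn frontmatter(doc: &str) -> Option<&str> {
    let rest = doc.strip_prefix("---\n")?;
    // The closing fence must sit on its own line.
    let end = rest.find("\n---\n").or_else(|| {
        // Also accept a closing fence that is the very last line of the file.
        rest.strip_suffix("\n---").map(|body| body.len())
    })?;
    Some(&rest[..end])
}

/// True if the frontmatter contains a `key: value` line with this exact pair.
/// Lines without a colon (e.g. block-scalar continuations) are skipped.
pub fn frontmatter_has(doc: &str, key: &str, value: &str) -> bool {
    frontmatter(doc).map_or(false, |fm| {
        fm.lines()
            .filter_map(|line| line.split_once(':'))
            .any(|(k, v)| k == key && v.trim() == value)
    })
}
```

With helpers like these, the test could assert `frontmatter_has(&s, "name", "task-journal")`, so a stray `name:` line in the skill body would not satisfy it.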
2 changes: 1 addition & 1 deletion plugin/.claude-plugin/plugin.json
@@ -1,6 +1,6 @@
{
"name": "task-journal",
"version": "0.12.0",
"version": "0.13.0",
"description": "Append-only journal of AI-coding task reasoning chains: hypotheses, decisions, rejections, evidence. Renders compact resume packs so an agent can pick up a 2-week-old task with full context.",
"author": {
"name": "Mher Shahinyan"
193 changes: 125 additions & 68 deletions plugin/skills/task-journal/SKILL.md
@@ -1,73 +1,130 @@
---
name: task-journal
description: |
MANDATORY for every coding session. Use task-journal to track ALL tasks, bugs, and investigations.
Create a task at the start, log every significant discovery/decision/error, close when done.
This is NOT optional — without it, context is lost between sessions and compactions.
Triggers: beginning of work, significant finding, decision made, hypothesis formed, error found,
test results obtained, task completed, "what was I working on?", "remind me about X".
MANDATORY for every coding session. You — the agent — are the journal's primary
recorder. Do NOT rely on automatic capture; it only catches keyword-obvious lines
and misses real reasoning, especially in non-English prose. Instead, call the MCP
tools yourself at the decisive moments below.
Open a task with an explicit goal at the start, append a typed event the moment you
decide / reject / discover / prove something, and close with a written outcome so the
resume pack comes out clean: Goal -> Decisions -> Outcome.
Triggers: start of any task/bug/investigation; a choice is committed; an approach is
ruled out; a fact is verified from code or logs; a test/benchmark proves something; an
earlier belief turns out wrong; task finished; "what was I working on?", "remind me
about X", "почему мы сделали так?".
---

# Task Journal — Reasoning Chain Memory

**MANDATORY WORKFLOW — follow without exceptions:**

1. **Start any task/bug/investigation** → `task_create` with descriptive title
2. **Every significant discovery** → `event_add` with appropriate type (see below)
3. **Every decision or rejection** → `event_add` (decision/rejection)
4. **Test results, QA outcomes** → `event_add` with `event_type=evidence`
5. **Wrong hypothesis corrected** → `event_add` with `event_type=correction` + `corrects=<event_id>`
6. **Task done** → `task_close` with reason and outcome

## Event type guide — choose the RIGHT one

| Situation | Type | Example |
|-----------|------|---------|
| "I think the bug might be in X" | `hypothesis` | Unverified theory, needs checking |
| "The code shows X does Y at line Z" | `finding` | Verified fact from reading code/logs |
| "Tests pass", "QA verified on staging" | `evidence` | Proof something works or fails |
| "We'll use approach X because Y" | `decision` | Committed choice |
| "Tried X but it won't work because Y" | `rejection` | Explicitly rejected approach |
| "API rate limit is 100/min" | `constraint` | External limitation discovered |
| "Actually, previous finding was wrong" | `correction` | Corrects earlier event (set `corrects` field) |
| "Done, PR merged, verified" | `close` | Task completed |

**Key distinctions:**
- `hypothesis` = "I think" / "maybe" / "could be" → NOT yet verified
- `finding` = "I see" / "the code shows" / "confirmed" → verified by reading code/logs
- `evidence` = ran a test/experiment that PROVES something (set `evidence_strength`: weak/medium/strong)
- `decision` ≠ `hypothesis`: decision = committed; hypothesis = exploring

## Tools available

The plugin's MCP server exposes 5 tools:

- `task_pack(task_id, mode)` — return Markdown resume pack. `mode`: `compact` (~2KB) or `full` (~10KB).
- `task_create(title, initial_context?)` — open new task, returns `task_id` like `tj-x9rz1f`.
- `event_add(task_id, event_type, text, corrects?, supersedes?)` — append event.
- `task_close(task_id, reason, outcome?)` — close task with reason.
- `task_search(query)` — FTS5 search current project's events; returns task_ids.

## When to use task_pack

| User says... | Action |
|--------------|--------|
| "Remind me about task X", "what was I working on?" | `task_pack` |
| "Find the task where I decided about Y" | `task_search` → `task_pack` |
| Session start on existing project | `task_search` for recent open tasks → `task_pack` |

## Key invariants

- **Append-only**: events are never edited. To fix a mistake, write a `correction` event with `corrects: <event_id>`.
- **One task = one logical objective**: don't create a new task every turn. Events accumulate under one task.
- **Always close**: when a task is done, call `task_close`. Don't leave tasks open.
- **Log rejections**: wrong paths are as valuable as correct ones — they prevent repeated mistakes.

## Auto-capture

Hooks are installed via `task-journal install-hooks --scope user`. Auto-classification runs through a two-stage hybrid: a local heuristic catches obvious decisions, rejections, evidence, and findings for free; ambiguous chunks fall back to the Anthropic API when `ANTHROPIC_API_KEY` is set. Without the key, the heuristic still works and uncertain chunks queue for later retry. Manual recording via MCP tools always works as a complement.

## Storage

Events: `$XDG_DATA_HOME/task-journal/events/<project_hash>.jsonl`
State: `state/<hash>.sqlite` (rebuildable via `task-journal rebuild-state`)
# Task Journal — Reasoning Chain Memory (self-tagging first)

You are the recorder. The point of this journal is the **"why" layer**: weeks later the
code shows *what* changed, not *why*. Capture the decisions and the outcome as they
happen, in your own words, terse and specific. Automatic capture is a weak fallback —
treat it as if it does nothing and record explicitly.

## Session-start ritual (do this before real work)

1. `task_search(query=<a few words about the work>, status="open")` — is there an open
task for this? If yes, `task_pack(task_id)` and continue it. **Do not** open a duplicate.
2. If nothing fits, `task_create(title=<short>, goal=<one sentence: what the user is trying to accomplish>)`. **Always pass `goal`** — it is the first line of every pack and
the anchor for "why was this done?".
3. Hold the returned `task_id` for the whole task. One task = one logical objective.
Events accumulate under it; do not spawn a new task per turn.

## The decisive moments — when you MUST call `event_add`

Call `event_add(task_id, event_type, text, ...)` the moment one of these happens. Don't
batch it to the end; record at the point of commitment while the reasoning is fresh.

| The moment | `event_type` | What goes in `text` (1–2 terse sentences, specifics) |
|------------|--------------|------------------------------------------------------|
| You commit to an approach / architecture choice | `decision` | The choice + the because. Include file/lib/IDs. |
| You rule an approach out | `rejection` | What you tried + why it won't work. Prevents repeat work. |
| You verify a fact from code/logs | `finding` | The fact + where (file:line, config key). |
| A test / benchmark / repro proves something | `evidence` | What ran + the result (numbers, pass/fail). |
| You hit an external limit | `constraint` | The limit (rate, version, platform). |
| An earlier event turns out wrong | `correction` | The correction. Set `corrects=<event_id>`. |
| This task replaces another | `supersede` | Set `supersedes=<task_id>`. |
| You form an unverified theory worth tracking | `hypothesis` | "maybe X because Y" — only if you'll act on it. |

Capture in the user's language; keep it short and concrete. A good event is one line a
human can read cold in two weeks and understand.

### Decisions: record the alternatives you weighed

For a `decision`, pass the structured `alternatives` array so the considered options and
the final pick are explicit (this is decision-only; it errors on other types):

```
event_add(
task_id="tj-x9rz1f",
event_type="decision",
text="Use fd-lock for the cross-platform JSONL file lock.",
alternatives=[
{"option": "fd-lock", "chosen": true, "rationale": "single API across Win/Unix, maintained"},
{"option": "rustix flock", "chosen": false, "rationale": "more code, manual Windows path"},
{"option": "advisory-only", "chosen": false, "rationale": "races under 5–6 parallel CC instances"}
]
)
```
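
The "decision-only" rule for `alternatives` could be enforced on the server side along these lines. This is a sketch under assumptions: the `Alternative` struct mirrors the three fields in the example above, but `EventType`, `validate_alternatives`, and `chosen_options` are hypothetical names, and the real tool's error text is not shown in this change.

```rust
/// A subset of event types, enough for the illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventType {
    Decision,
    Rejection,
    Finding,
    Evidence,
}

/// One weighed option, mirroring the fields used in the skill's example.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Alternative {
    pub option: String,
    pub chosen: bool,
    pub rationale: String,
}

/// Reject a non-empty `alternatives` array on anything but a decision.
pub fn validate_alternatives(
    event_type: EventType,
    alternatives: &[Alternative],
) -> Result<(), String> {
    if !alternatives.is_empty() && event_type != EventType::Decision {
        return Err(format!("`alternatives` is decision-only, got {event_type:?}"));
    }
    Ok(())
}

/// The options marked `chosen`, e.g. for rendering the final pick in a pack.
pub fn chosen_options(alternatives: &[Alternative]) -> Vec<&str> {
    alternatives
        .iter()
        .filter(|a| a.chosen)
        .map(|a| a.option.as_str())
        .collect()
}
```

An empty array passes for every event type, so callers that never send `alternatives` are unaffected.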

### Corrections never edit — they append

Events are append-only. To fix a wrong earlier event, write a `correction` with
`corrects=<event_id>` (the prior call returned that id). Same for `supersede`.

## Closing — REQUIRED, and this is what makes the pack clean

When the objective is met (or abandoned), call `task_close`:

```
task_close(
task_id="tj-x9rz1f",
reason="off-by-one fixed, regression test green, PR #1284 merged",
outcome="Token no longer dropped on refresh boundary; covered by test_token_refresh_boundary.",
outcome_tag="done" # done | abandoned | superseded
)
```

- `outcome` is the human-readable result line that renders as **Outcome [tag]** in the pack.
Without it the pack has no conclusion and reads like raw log. Always write it.
- `outcome_tag` must be `done`, `abandoned`, or `superseded`.
- Don't leave tasks open. An open task with no close = no Outcome line = the exact "I see
noise, no signal" failure this skill exists to prevent.
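
The three allowed `outcome_tag` values map naturally onto a closed enum. The sketch below shows one way to parse and validate them; `OutcomeTag` and its methods are hypothetical names for illustration, and the error message is an assumption rather than the tool's actual wording.

```rust
use std::str::FromStr;

/// The three tags the skill allows on `task_close`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutcomeTag {
    Done,
    Abandoned,
    Superseded,
}

impl OutcomeTag {
    /// The wire form of the tag, identical to what the caller passed in.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Done => "done",
            Self::Abandoned => "abandoned",
            Self::Superseded => "superseded",
        }
    }
}

impl FromStr for OutcomeTag {
    type Err = String;

    /// Exact, case-sensitive match; anything else is rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "done" => Ok(Self::Done),
            "abandoned" => Ok(Self::Abandoned),
            "superseded" => Ok(Self::Superseded),
            other => Err(format!(
                "outcome_tag must be done, abandoned, or superseded (got `{other}`)"
            )),
        }
    }
}
```

Parsing at the tool boundary means a typo such as `finished` fails at close time instead of producing a pack with an unrecognized tag.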

## Reading back — `task_pack` / `task_search`

| User says | Action |
|-----------|--------|
| "remind me about task X", "what was I working on?" | `task_pack(task_id, mode="compact")` |
| "find where I decided about Y" | `task_search(query="Y", event_type="decision")` → `task_pack` |
| Resuming an existing project | `task_search(status="open")` → `task_pack` on the match |

`task_pack` mode: `compact` (~2KB, default for resume) or `full` (~10KB, full trail).
`task_search` filters: `status`, `project`, `event_type` (decision/finding/evidence/…).

## The 5 MCP tools (exact params)

- `task_create(title, goal?, initial_context?, parent?)` → `task_id` like `tj-x9rz1f`. **Pass `goal`.**
- `event_add(task_id, event_type, text, corrects?, supersedes?, alternatives?)` → `event_id`.
`event_type` ∈ hypothesis | finding | evidence | decision | rejection | constraint |
correction | reopen | supersede | redirect. `alternatives` is decision-only.
- `task_close(task_id, reason, outcome?, outcome_tag?)` — **always pass `outcome` + `outcome_tag`.**
- `task_pack(task_id, mode?)` — `compact` | `full`.
- `task_search(query, status?, project?, event_type?)` — FTS5 over this project's events.

## Invariants

- **Append-only.** Never edit; correct via a new `correction`/`supersede` event.
- **One task = one objective.** Don't fragment a single goal across many tasks.
- **Record at the moment, not at the end.** Decisions logged after the fact get lost.
- **Rejections are as valuable as decisions** — they stop you re-walking dead ends.
- **Close with an outcome, every time.**

## Why explicit (not automatic)

The hybrid auto-classifier only fires confidently on English keyword patterns; ambiguous
or non-English prose needs an LLM it can't reach without an API key (and post-2026-06-15
that path consumes a separate credit pool). You, the agent in the live session, already
know what you decided — so record it directly. That is both free (it rides the interactive
session) and higher fidelity than any after-the-fact classifier. Auto-capture stays on as a
backstop; do not depend on it.