23 changes: 22 additions & 1 deletion USAGE.md
@@ -230,12 +230,33 @@ uv run tilth --visualize <session_id> # or name one explicitly

Writes a single self-contained file (inline CSS, no JS) to `sessions/<id>/chat.html`. Events are grouped by task — model calls become meta strips with a collapsible reasoning block where the model emitted any, tool calls/results become bubbles, validator runs and judge verdicts become coloured cards. Read-only and runs over the saved `events.jsonl`, so it's safe to invoke against a finished or in-progress session.

### Where the tokens went

Every run prints a summary on exit, including a per-task and per-model breakdown:

```
── run summary ──
session 20260503-101422-a1b2c3
duration 14m32s (12.1% of TILTH_MAX_WALL_CLOCK_MINUTES=120)
tokens 412,008 (20.6% of TILTH_MAX_TOKENS=2,000,000)
tasks done=3 failed=0 pending=0
per task
T2 201,344 (worker=180,200, judge=18,400, self_improve=2,744)
T1 150,612 (worker=132,000, judge=16,800, self_improve=1,812)
T3 60,052 (worker=58,200, judge=1,852)
per model
moonshotai/kimi-k2-thinking 381,956
anthropic/claude-sonnet-4.5 30,052
```

Use the per-task line to spot PRDs that should have been split (one task that ate half the budget is usually a planning failure, not a model failure). Use the per-model line to attribute spend when the worker and judge run on different providers — especially when deciding whether the judge is pulling its weight. The breakdown is computed by replaying `events.jsonl`, so it survives crashes and `--resume` cycles.

## 5. Caveats worth being upfront about

- **It's Python-centric.** `post_edit` lints `.py` files. `validators` runs `pytest` and `ruff`. JavaScript / Rust / Go projects need `tilth/validators.py` and `tilth/hooks/post_edit.py` adapted to your toolchain — not deep work, but not zero.
- **Ruff config matters.** If your project doesn't already use ruff, the validator will fire constantly and the agent will spend iterations fixing things that aren't really broken. Either add a permissive `[tool.ruff]` block to your `pyproject.toml` (a sketch follows this list), or swap the ruff validator for whatever linter you already use.
- **The planner is you.** Writing a good `prd.json` (small enough tasks, sharp acceptance criteria, tests upfront) is where most of the value is. Vague PRDs make the harness fail loudly and burn tokens.
- **Costs are real.** A 2-hour run can mean hundreds of thousands of tokens across worker + judge + self-improvement calls. The `TILTH_MAX_TOKENS` cap exists for a reason — set it on first run. Cost per token varies wildly across providers; pick your worker accordingly. Be careful about reaching for a smaller judge model to cut costs — see ["Picking a judge model"](#picking-a-judge-model) below. The end-of-session summary breaks total spend down per task and per model so you can see exactly where the budget went (see ["Where the tokens went"](#where-the-tokens-went)).
- **AGENTS.md is yours forever.** It accumulates. Prune it periodically — old learnings that the model has clearly internalised should be removed (the ratchet works in both directions).
- **Tools are intentionally narrow.** No web fetch, no MCP, no curl-based downloads. If your tasks require external API access, you add a tool to `tilth/tools/` and register it. Keep tools focused — every tool description ships in the prompt every turn.
- **The harness commits to your repo's git db.** The worktree branch is in your repo, not the harness's. If you delete `{{your projects folder}}/tilth`, the branches in your project's repo remain. Clean up branches the same way you would for a normal feature branch.
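
For the ruff caveat, a permissive starting point could look like the snippet below. The rule selection is an illustrative assumption, not tilth's shipped config; tune it to whatever your codebase already tolerates:

```toml
# Minimal, permissive ruff setup (illustrative values, not a recommendation).
[tool.ruff]
line-length = 120

[tool.ruff.lint]
select = ["E", "F"]  # pycodestyle errors + pyflakes only; no style nits
ignore = ["E501"]    # don't fail the validator on long lines
```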
15 changes: 8 additions & 7 deletions deep-dives.md
@@ -179,19 +179,21 @@ A few things worth noting about this pattern:
- **`or 0` everywhere.** If a provider ships a malformed response, the token count silently falls to zero rather than crashing the run. Defensive choice; the alternative is a 2-hour run dying on one weird `null`.
- **`prompt + completion`, not `total`.** Some providers report `total_tokens` separately; we sum the two we trust. Equivalent to `total_tokens` for every well-formed response. Both choices are visible in the sketch below.
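
For reference, the pattern these bullets describe looks like this. A sketch: the `usage = ...` line is an assumption about the surrounding code, while the three lines after it appear verbatim at each call site in this diff:

```python
usage = resp.get("usage") or {}  # assumed shape: provider usage dict, possibly absent
prompt_tokens = int(usage.get("prompt_tokens") or 0)    # `or 0`: a null never crashes the run
eval_tokens = int(usage.get("completion_tokens") or 0)  # completion, not total
session.add_tokens(prompt_tokens + eval_tokens)         # prompt + completion, summed ourselves
```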

All three sites log a per-call breakdown to `events.jsonl` as a `model_call` event, with a `kind` field tagging which site emitted it:

```python
session.log("model_call", {
    "task_id": task["id"],
    "iter": iter_n + 1,  # worker + judge only; omitted for self_improve
    "kind": "worker",  # or "judge", or "self_improve"
    "model": client.config.worker_model,
    "prompt_tokens": prompt_tokens,
    "eval_tokens": eval_tokens,
    "tokens_used_total": session.tokens_used,
})
```

That's the audit trail. After a run, grep `events.jsonl` for `model_call` and reconstruct exactly when tokens were spent and on which model. `_token_breakdowns()` in `loop.py` does exactly this replay to compute the per-task and per-model figures rendered in the end-of-session summary; the same shape is what `--resume` would compute if you wanted to bolt on cumulative reporting across resume cycles.
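
As a concrete starting point, a minimal standalone replay might look like this. It mirrors what `_token_breakdowns` does rather than calling it, and the session path is a placeholder:

```python
import json
from collections import Counter
from pathlib import Path

per_model: Counter[str] = Counter()
for line in Path("sessions/<id>/events.jsonl").read_text().splitlines():  # placeholder path
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip torn/partial lines, like the harness does
    if rec.get("type") != "model_call":
        continue
    p = rec.get("payload") or {}
    per_model[p.get("model") or "unknown"] += (
        int(p.get("prompt_tokens") or 0) + int(p.get("eval_tokens") or 0)
    )

print(per_model.most_common())
```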

### Enforcement is at the top of each task

@@ -247,10 +249,9 @@ The wall-clock baseline (`started_at`) is treated differently: `Session.wake()`

A few honest gaps worth knowing:

1. **No dollar-cost tracking.** Tokens, not dollars. The cap is provider-agnostic — useful as a coarse safety net, useless for "stop when I've spent $50 on this run." Adding dollar tracking means a per-model price table and a cost lookup at each `add_tokens` site. Not in MVP — but with `model` now on every `model_call` event, a price table is the only missing piece (sketched after this list).
2. **No headroom warning.** The cap is binary — under it, run; at it, stop. No "you're at 80% of your token budget" alert. Easy to add in `_stop_reason` if you want it.
3. **The cap is over the whole session, not per-task.** A 10-task run with a 2M cap means tasks 1–9 might gobble tokens and starve task 10. There's no per-task budget. The iteration cap (default 8 model calls per task) is the proxy; combined with average tokens/call, that approximates a per-task ceiling. The end-of-session per-task breakdown surfaces *where* the spend went after the fact, but doesn't enforce anything during the run.
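
On gap 1, the shape of the fix is small enough to sketch. The rates below are invented for illustration (real pricing also bills prompt and completion tokens at different rates, which this blended version ignores):

```python
# Hypothetical blended $/1M-token rates. NOT real provider pricing.
PRICE_PER_MTOK: dict[str, float] = {
    "moonshotai/kimi-k2-thinking": 1.00,
    "anthropic/claude-sonnet-4.5": 6.00,
}

def estimate_dollars(by_model: dict[str, int]) -> float:
    """Rough spend from the by_model half of _token_breakdowns()."""
    return sum(
        tokens / 1_000_000 * PRICE_PER_MTOK.get(model, 0.0)  # unknown models priced at $0
        for model, tokens in by_model.items()
    )
```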

---

141 changes: 141 additions & 0 deletions tests/test_token_breakdowns.py
@@ -0,0 +1,141 @@
"""Regression tests for `_token_breakdowns`.

The end-of-session summary's per-task / per-model lines are computed by replaying
events.jsonl, not from in-memory state. Pin the contract: aggregate every
`model_call` event by `task_id` and `model`, split per-task by `kind`, and
gracefully degrade for older events that pre-date the `kind`/`model` fields.
"""

from __future__ import annotations

import json
from pathlib import Path

from tilth.loop import _token_breakdowns
from tilth.session import Session


def _make_session(tmp_path: Path, events: list[dict]) -> Session:
root = tmp_path / "sess"
root.mkdir()
events_path = root / "events.jsonl"
with events_path.open("w") as f:
for e in events:
f.write(json.dumps(e) + "\n")
return Session(
session_id="test",
root=root,
events_path=events_path,
checkpoint_path=root / "checkpoint.json",
)


def test_aggregates_by_model_and_task_with_kind_split(tmp_path: Path) -> None:
session = _make_session(
tmp_path,
[
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "worker-x",
"prompt_tokens": 100, "eval_tokens": 50,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "worker-x",
"prompt_tokens": 200, "eval_tokens": 0,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "judge", "model": "judge-y",
"prompt_tokens": 80, "eval_tokens": 20,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "self_improve", "model": "worker-x",
"prompt_tokens": 30, "eval_tokens": 10,
}},
{"type": "model_call", "payload": {
"task_id": "T2", "kind": "worker", "model": "worker-x",
"prompt_tokens": 5, "eval_tokens": 5,
}},
{"type": "tool_call", "payload": {"task_id": "T1", "tool": "bash"}},
],
)

by_model, by_task = _token_breakdowns(session)

assert by_model == {"worker-x": 100 + 50 + 200 + 30 + 10 + 5 + 5, "judge-y": 100}
assert by_task["T1"]["total"] == 100 + 50 + 200 + 80 + 20 + 30 + 10
assert by_task["T1"]["worker"] == 100 + 50 + 200
assert by_task["T1"]["judge"] == 80 + 20
assert by_task["T1"]["self_improve"] == 30 + 10
assert by_task["T2"]["total"] == 10
assert by_task["T2"]["worker"] == 10


def test_skips_zero_token_events(tmp_path: Path) -> None:
"""Defensive: providers occasionally return 0/null usage; don't pollute the breakdown."""
session = _make_session(
tmp_path,
[
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "m",
"prompt_tokens": 0, "eval_tokens": 0,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "m",
"prompt_tokens": None, "eval_tokens": None,
}},
],
)
by_model, by_task = _token_breakdowns(session)
assert by_model == {}
assert by_task == {}


def test_legacy_events_without_kind_or_model_bucketed_safely(tmp_path: Path) -> None:
"""Pre-breakdown events lack `kind` and `model`. Replay must not crash."""
session = _make_session(
tmp_path,
[
{"type": "model_call", "payload": {
"task_id": "T1",
"prompt_tokens": 10, "eval_tokens": 5,
}},
],
)
by_model, by_task = _token_breakdowns(session)
assert by_model == {"unknown": 15}
assert by_task["T1"]["total"] == 15
assert by_task["T1"]["worker"] == 15


def test_missing_events_file_returns_empty(tmp_path: Path) -> None:
root = tmp_path / "no-log"
root.mkdir()
session = Session(
session_id="test",
root=root,
events_path=root / "events.jsonl",
checkpoint_path=root / "checkpoint.json",
)
assert _token_breakdowns(session) == ({}, {})


def test_skips_malformed_jsonl_lines(tmp_path: Path) -> None:
"""Match `_last_stop_reason`'s tolerance: skip lines that aren't valid JSON."""
root = tmp_path / "sess"
root.mkdir()
events_path = root / "events.jsonl"
events_path.write_text(
json.dumps({"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "m",
"prompt_tokens": 10, "eval_tokens": 5,
}}) + "\n"
"not valid json\n"
)
session = Session(
session_id="test",
root=root,
events_path=events_path,
checkpoint_path=root / "checkpoint.json",
)
by_model, by_task = _token_breakdowns(session)
assert by_model == {"m": 15}
assert by_task["T1"]["total"] == 15
85 changes: 85 additions & 0 deletions tilth/loop.py
@@ -122,6 +122,18 @@ def _judge_task(
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    eval_tokens = int(usage.get("completion_tokens") or 0)
    session.add_tokens(prompt_tokens + eval_tokens)
    session.log(
        "model_call",
        {
            "task_id": task["id"],
            "iter": iter_n + 1,
            "kind": "judge",
            "model": client.config.judge_model,
            "prompt_tokens": prompt_tokens,
            "eval_tokens": eval_tokens,
            "tokens_used_total": session.tokens_used,
        },
    )

    content = ((resp.get("message") or {}).get("content") or "").strip()
    parsed = _parse_json_lenient(content)
@@ -180,6 +192,17 @@ def _self_improve(
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    eval_tokens = int(usage.get("completion_tokens") or 0)
    session.add_tokens(prompt_tokens + eval_tokens)
    session.log(
        "model_call",
        {
            "task_id": task["id"],
            "kind": "self_improve",
            "model": client.config.worker_model,
            "prompt_tokens": prompt_tokens,
            "eval_tokens": eval_tokens,
            "tokens_used_total": session.tokens_used,
        },
    )

    content = ((resp.get("message") or {}).get("content") or "").strip()
    parsed = _parse_json_lenient(content)
@@ -480,6 +503,8 @@ def _run_task(
    model_call_payload: dict[str, Any] = {
        "task_id": task["id"],
        "iter": iter_n + 1,
        "kind": "worker",
        "model": client.config.worker_model,
        "prompt_tokens": prompt_tokens,
        "eval_tokens": eval_tokens,
        "tokens_used_total": session.tokens_used,
@@ -667,6 +692,46 @@ def _format_duration(seconds: float) -> str:
    return f"{s}s"


def _token_breakdowns(session: Session) -> tuple[dict[str, int], dict[str, dict[str, int]]]:
    """Replay events.jsonl and aggregate model_call tokens.

    Returns (by_model, by_task) where:
      by_model: {model_name: total_tokens}
      by_task: {task_id: {"total": int, "worker": int, "judge": int, "self_improve": int}}

    Reads from disk so the breakdown survives crashes and `--resume` cycles. Older
    events without `kind`/`model` fields are bucketed under "unknown" / "worker"
    respectively (the worker site has emitted model_call events since v0; only
    judge and self_improve started emitting them when this breakdown landed).
    """
    by_model: dict[str, int] = {}
    by_task: dict[str, dict[str, int]] = {}
    if not session.events_path.is_file():
        return by_model, by_task
    with session.events_path.open() as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            if rec.get("type") != "model_call":
                continue
            p = rec.get("payload") or {}
            n = int(p.get("prompt_tokens") or 0) + int(p.get("eval_tokens") or 0)
            if n <= 0:
                continue
            model = p.get("model") or "unknown"
            task_id = p.get("task_id") or "unknown"
            kind = p.get("kind") or "worker"
            by_model[model] = by_model.get(model, 0) + n
            bucket = by_task.setdefault(
                task_id, {"total": 0, "worker": 0, "judge": 0, "self_improve": 0}
            )
            bucket["total"] += n
            bucket[kind] = bucket.get(kind, 0) + n
    return by_model, by_task


def _print_summary(session: Session, client: LLMClient, worktree: Path | None) -> None:
    elapsed = time.time() - session.started_at
    cfg = client.config
@@ -706,6 +771,26 @@ def _print_summary(session: Session, client: LLMClient, worktree: Path | None) -
    console.print(f" tokens {session.tokens_used:,} [dim]{tokens_dim}[/dim]")
    console.print(f" tasks {' '.join(task_bits + extras)}")

    by_model, by_task = _token_breakdowns(session)

    if by_task:
        width = max(len(tid) for tid in by_task)
        console.print(" per task")
        for tid, bucket in sorted(by_task.items(), key=lambda kv: (-kv[1]["total"], kv[0])):
            split_bits = [
                f"{kind}={bucket[kind]:,}"
                for kind in ("worker", "judge", "self_improve")
                if bucket.get(kind)
            ]
            split = f" [dim]({', '.join(split_bits)})[/dim]" if split_bits else ""
            console.print(f" {tid.ljust(width)} {bucket['total']:>9,}{split}")

    if by_model:
        width = max(len(m) for m in by_model)
        console.print(" per model")
        for model, n in sorted(by_model.items(), key=lambda kv: (-kv[1], kv[0])):
            console.print(f" {model.ljust(width)} {n:>9,}")


# --- reset ------------------------------------------------------------------

11 changes: 9 additions & 2 deletions tilth/session.py
@@ -4,8 +4,15 @@
just enough state (last completed task, worktree branch) to resume on a fresh process.

Event types:
  model_call — request/response metadata for any model call. Emitted
               from all three sites (worker iteration, judge, AGENTS.md
               self-improvement) so per-model and per-task token
               breakdowns can be reconstructed by replaying the log.
               Payload: `task_id`, `kind` ∈ {"worker","judge","self_improve"},
               `model`, `prompt_tokens`, `eval_tokens`, `tokens_used_total`.
               Worker and judge entries also carry `iter` (the worker
               iteration that triggered them); self_improve omits it.
               Carries `reasoning_details` (the OpenRouter-normalised
               structured form) when the model emitted any, falling
               back to a flat `reasoning` string. Either is omitted
               when absent so non-thinking models keep slim events.