diff --git a/USAGE.md b/USAGE.md
index d19b151..ec78fb6 100644
--- a/USAGE.md
+++ b/USAGE.md
@@ -230,12 +230,56 @@ uv run tilth --visualize   # or name one explicitly
 
 Writes a single self-contained file (inline CSS, no JS) to `sessions//chat.html`. Events are grouped by task — model calls become meta strips with a collapsible reasoning block where the model emitted any, tool calls/results become bubbles, validator runs and judge verdicts become coloured cards. Read-only and runs over the saved `events.jsonl`, so it's safe to invoke against a finished or in-progress session.
 
+### Where the tokens went
+
+Every run prints a summary on exit, including a per-task and per-model breakdown:
+
+```
+── run summary ──
+  session   20260503-101422-a1b2c3
+  duration  14m32s   (12.1% of TILTH_MAX_WALL_CLOCK_MINUTES=120)
+  tokens    412,008  (20.6% of TILTH_MAX_TOKENS=2,000,000)
+  tasks     done=3 failed=0 pending=0
+  per task
+    T2  201,344  (worker=180,200, judge=18,400, self_improve=2,744)
+    T1  150,612  (worker=132,000, judge=16,800, self_improve=1,812)
+    T3   60,052  (worker=58,200, judge=1,852)
+  per model
+    moonshotai/kimi-k2-thinking  374,956
+    anthropic/claude-sonnet-4.5   37,052
+```
+
+Use the per-task line to spot PRDs that should have been split (one task that ate half the budget is usually a planning failure, not a model failure). Use the per-model line to attribute spend when the worker and judge run on different providers — especially when deciding whether the judge is pulling its weight. The breakdown is computed by replaying `events.jsonl`, so it survives crashes and `--resume` cycles.
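+
+If you want to slice the spend differently (per `kind`, per iteration, whatever), the
+events are plain JSONL. A minimal sketch of the per-model replay — the field names are
+the ones the harness logs; swap in your own session path:
+
+```python
+import json
+from collections import Counter
+
+by_model: Counter[str] = Counter()
+with open("sessions/20260503-101422-a1b2c3/events.jsonl") as f:
+    for line in f:
+        rec = json.loads(line)
+        if rec.get("type") != "model_call":
+            continue
+        p = rec.get("payload") or {}
+        n = int(p.get("prompt_tokens") or 0) + int(p.get("eval_tokens") or 0)
+        if n:  # skip malformed/zero-usage events, same as the built-in breakdown
+            by_model[p.get("model") or "unknown"] += n
+
+for model, n in by_model.most_common():
+    print(f"{model}  {n:,}")
+```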
+
 ## 5. Caveats worth being upfront about
 
 - **It's Python-centric.** `post_edit` lints `.py` files. `validators` runs `pytest` and `ruff`. JavaScript / Rust / Go projects need `tilth/validators.py` and `tilth/hooks/post_edit.py` adapted to your toolchain — not deep work, but not zero.
 - **Ruff config matters.** If your project doesn't already use ruff, the validator will fire constantly and the agent will spend iterations fixing things that aren't really broken. Either add a permissive `[tool.ruff]` block to your `pyproject.toml`, or swap the ruff validator for whatever linter you already use.
 - **The planner is you.** Writing a good `prd.json` (small enough tasks, sharp acceptance criteria, tests upfront) is where most of the value is. Vague PRDs make the harness fail loudly and burn tokens.
-- **Costs are real.** A 2-hour run can mean hundreds of thousands of tokens across worker + judge + self-improvement calls. The `TILTH_MAX_TOKENS` cap exists for a reason — set it on first run. Cost per token varies wildly across providers; pick your worker accordingly. Be careful about reaching for a smaller judge model to cut costs — see ["Picking a judge model"](#picking-a-judge-model) below.
+- **Costs are real.** A 2-hour run can mean hundreds of thousands of tokens across worker + judge + self-improvement calls. The `TILTH_MAX_TOKENS` cap exists for a reason — set it on first run. Cost per token varies wildly across providers; pick your worker accordingly. Be careful about reaching for a smaller judge model to cut costs — see ["Picking a judge model"](#picking-a-judge-model) below. The end-of-session summary breaks total spend down per task and per model so you can see exactly where the budget went (see ["Where the tokens went"](#where-the-tokens-went)).
 - **AGENTS.md is yours forever.** It accumulates. Prune it periodically — old learnings that the model has clearly internalised should be removed (the ratchet works in both directions).
 - **Tools are intentionally narrow.** No web fetch, no MCP, no curl-based downloads. If your tasks require external API access, you add a tool to `tilth/tools/` and register it. Keep tools focused — every tool description ships in the prompt every turn.
 - **The harness commits to your repo's git db.** The worktree branch is in your repo, not the harness's. If you delete `{{your projects folder}}/tilth`, the branches in your project's repo remain. Clean up branches the same way you would for a normal feature branch.
diff --git a/deep-dives.md b/deep-dives.md
index cf0c7bb..5319b23 100644
--- a/deep-dives.md
+++ b/deep-dives.md
@@ -179,19 +179,28 @@ A few things worth noting about this pattern:
 
 - **`or 0` everywhere.** If a provider ships a malformed response, the token count silently falls to zero rather than crashing the run. Defensive choice; the alternative is a 2-hour run dying on one weird `null`.
 - **`prompt + completion`, not `total`.** Some providers report `total_tokens` separately; we sum the two we trust. Equivalent to `total_tokens` for every well-formed response.
 
-The third site (in `_run_task`) also logs the per-call breakdown to `events.jsonl` as a `model_call` event:
+All three sites log a per-call breakdown to `events.jsonl` as a `model_call` event, with a `kind` field tagging which site emitted it:
 
 ```python
 session.log("model_call", {
     "task_id": task["id"],
-    "iter": iter_n + 1,
+    "iter": iter_n + 1,  # worker + judge only; omitted for self_improve
+    "kind": "worker",    # or "judge", or "self_improve"
+    "model": client.config.worker_model,
     "prompt_tokens": prompt_tokens,
     "eval_tokens": eval_tokens,
     "tokens_used_total": session.tokens_used,
 })
 ```
 
-That's the audit trail. After a run, grep `events.jsonl` for `model_call` and reconstruct exactly when tokens were spent. The judge and self-improve sites *don't* currently log a per-call event — they update the running total but skip the per-call detail. Symmetry would make the audit cleaner; small TODO.
+That's the audit trail. After a run, grep `events.jsonl` for `model_call` and reconstruct exactly when tokens were spent and on which model. `_token_breakdowns()` in `loop.py` does exactly this replay to compute the per-task and per-model figures rendered in the end-of-session summary; the same shape is what `--resume` would compute if you wanted to bolt on cumulative reporting across resume cycles.
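+
+On disk each of those is a single JSON object per line. Trimmed to the fields the
+breakdown reads, a worker entry looks roughly like this (values invented for illustration):
+
+```
+{"type": "model_call", "payload": {"task_id": "T2", "iter": 3, "kind": "worker", "model": "moonshotai/kimi-k2-thinking", "prompt_tokens": 41200, "eval_tokens": 1850, "tokens_used_total": 255401}}
+```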
 
 ### Enforcement is at the top of each task
@@ -247,10 +256,27 @@ The wall-clock baseline (`started_at`) is treated differently: `Session.wake()`
 
 A few honest gaps worth knowing:
 
-1. **No dollar-cost tracking.** Tokens, not dollars. The cap is provider-agnostic — useful as a coarse safety net, useless for "stop when I've spent $50 on this run." Adding dollar tracking means a per-model price table and a cost lookup at each `add_tokens` site. Not in MVP.
-2. **No per-model breakdown.** If worker and judge are different models on different providers, the running total mashes them together. Splitting `tokens_used` into `worker_tokens_used` and `judge_tokens_used` is ~10 lines if it ever matters.
-3. **No headroom warning.** The cap is binary — under it, run; at it, stop. No "you're at 80% of your token budget" alert. Easy to add in `_stop_reason` if you want it.
-4. **The cap is over the whole session, not per-task.** A 10-task run with a 2M cap means tasks 1–9 might gobble tokens and starve task 10. There's no per-task budget. The iteration cap (default 8 model calls per task) is the proxy; combined with average tokens/call, that approximates a per-task ceiling.
+1. **No dollar-cost tracking.** Tokens, not dollars. The cap is provider-agnostic — useful as a coarse safety net, useless for "stop when I've spent $50 on this run." Adding dollar tracking means a per-model price table and a cost lookup at each `add_tokens` site. Not in MVP — but with `model` now on every `model_call` event, a price table is the only missing piece (see the sketch after this list).
+2. **No headroom warning.** The cap is binary — under it, run; at it, stop. No "you're at 80% of your token budget" alert. Easy to add in `_stop_reason` if you want it.
+3. **The cap is over the whole session, not per-task.** A 10-task run with a 2M cap means tasks 1–9 might gobble tokens and starve task 10. There's no per-task budget. The iteration cap (default 8 model calls per task) is the proxy; combined with average tokens/call, that approximates a per-task ceiling. The end-of-session per-task breakdown surfaces *where* the spend went after the fact, but doesn't enforce anything during the run.
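+
+Gap 1 is the easy one to close offline. A sketch, using the `_token_breakdowns` helper
+this change adds — the rates below are placeholders, and since the breakdown lumps
+prompt and completion tokens together, treat the result as a ballpark:
+
+```python
+from tilth.loop import _token_breakdowns
+
+# $ per 1M tokens — placeholder rates, look up your provider's real pricing
+PRICES = {
+    "moonshotai/kimi-k2-thinking": 1.00,
+    "anthropic/claude-sonnet-4.5": 6.00,
+}
+
+def dollars(session) -> float:
+    by_model, _ = _token_breakdowns(session)
+    return sum(n / 1_000_000 * PRICES.get(m, 0.0) for m, n in by_model.items())
+```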
by_task["T2"]["worker"] == 10 + + +def test_skips_zero_token_events(tmp_path: Path) -> None: + """Defensive: providers occasionally return 0/null usage; don't pollute the breakdown.""" + session = _make_session( + tmp_path, + [ + {"type": "model_call", "payload": { + "task_id": "T1", "kind": "worker", "model": "m", + "prompt_tokens": 0, "eval_tokens": 0, + }}, + {"type": "model_call", "payload": { + "task_id": "T1", "kind": "worker", "model": "m", + "prompt_tokens": None, "eval_tokens": None, + }}, + ], + ) + by_model, by_task = _token_breakdowns(session) + assert by_model == {} + assert by_task == {} + + +def test_legacy_events_without_kind_or_model_bucketed_safely(tmp_path: Path) -> None: + """Pre-breakdown events lack `kind` and `model`. Replay must not crash.""" + session = _make_session( + tmp_path, + [ + {"type": "model_call", "payload": { + "task_id": "T1", + "prompt_tokens": 10, "eval_tokens": 5, + }}, + ], + ) + by_model, by_task = _token_breakdowns(session) + assert by_model == {"unknown": 15} + assert by_task["T1"]["total"] == 15 + assert by_task["T1"]["worker"] == 15 + + +def test_missing_events_file_returns_empty(tmp_path: Path) -> None: + root = tmp_path / "no-log" + root.mkdir() + session = Session( + session_id="test", + root=root, + events_path=root / "events.jsonl", + checkpoint_path=root / "checkpoint.json", + ) + assert _token_breakdowns(session) == ({}, {}) + + +def test_skips_malformed_jsonl_lines(tmp_path: Path) -> None: + """Match `_last_stop_reason`'s tolerance: skip lines that aren't valid JSON.""" + root = tmp_path / "sess" + root.mkdir() + events_path = root / "events.jsonl" + events_path.write_text( + json.dumps({"type": "model_call", "payload": { + "task_id": "T1", "kind": "worker", "model": "m", + "prompt_tokens": 10, "eval_tokens": 5, + }}) + "\n" + "not valid json\n" + ) + session = Session( + session_id="test", + root=root, + events_path=events_path, + checkpoint_path=root / "checkpoint.json", + ) + by_model, by_task = _token_breakdowns(session) + assert by_model == {"m": 15} + assert by_task["T1"]["total"] == 15 diff --git a/tilth/loop.py b/tilth/loop.py index c28209a..f02c1aa 100644 --- a/tilth/loop.py +++ b/tilth/loop.py @@ -122,6 +122,18 @@ def _judge_task( prompt_tokens = int(usage.get("prompt_tokens") or 0) eval_tokens = int(usage.get("completion_tokens") or 0) session.add_tokens(prompt_tokens + eval_tokens) + session.log( + "model_call", + { + "task_id": task["id"], + "iter": iter_n + 1, + "kind": "judge", + "model": client.config.judge_model, + "prompt_tokens": prompt_tokens, + "eval_tokens": eval_tokens, + "tokens_used_total": session.tokens_used, + }, + ) content = ((resp.get("message") or {}).get("content") or "").strip() parsed = _parse_json_lenient(content) @@ -180,6 +192,17 @@ def _self_improve( prompt_tokens = int(usage.get("prompt_tokens") or 0) eval_tokens = int(usage.get("completion_tokens") or 0) session.add_tokens(prompt_tokens + eval_tokens) + session.log( + "model_call", + { + "task_id": task["id"], + "kind": "self_improve", + "model": client.config.worker_model, + "prompt_tokens": prompt_tokens, + "eval_tokens": eval_tokens, + "tokens_used_total": session.tokens_used, + }, + ) content = ((resp.get("message") or {}).get("content") or "").strip() parsed = _parse_json_lenient(content) @@ -480,6 +503,8 @@ def _run_task( model_call_payload: dict[str, Any] = { "task_id": task["id"], "iter": iter_n + 1, + "kind": "worker", + "model": client.config.worker_model, "prompt_tokens": prompt_tokens, "eval_tokens": eval_tokens, "tokens_used_total": 
         "tokens_used_total": session.tokens_used,
@@ -667,6 +692,46 @@ def _format_duration(seconds: float) -> str:
     return f"{s}s"
 
 
+def _token_breakdowns(session: Session) -> tuple[dict[str, int], dict[str, dict[str, int]]]:
+    """Replay events.jsonl and aggregate model_call tokens.
+
+    Returns (by_model, by_task) where:
+      by_model: {model_name: total_tokens}
+      by_task:  {task_id: {"total": int, "worker": int, "judge": int, "self_improve": int}}
+
+    Reads from disk so the breakdown survives crashes and `--resume` cycles. Older
+    events without `kind`/`model` fields are bucketed under "unknown" / "worker"
+    respectively (the worker site has emitted model_call events since v0; only
+    judge and self_improve started emitting them when this breakdown landed).
+    """
+    by_model: dict[str, int] = {}
+    by_task: dict[str, dict[str, int]] = {}
+    if not session.events_path.is_file():
+        return by_model, by_task
+    with session.events_path.open() as f:
+        for line in f:
+            try:
+                rec = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+            if rec.get("type") != "model_call":
+                continue
+            p = rec.get("payload") or {}
+            n = int(p.get("prompt_tokens") or 0) + int(p.get("eval_tokens") or 0)
+            if n <= 0:
+                continue
+            model = p.get("model") or "unknown"
+            task_id = p.get("task_id") or "unknown"
+            kind = p.get("kind") or "worker"
+            by_model[model] = by_model.get(model, 0) + n
+            bucket = by_task.setdefault(
+                task_id, {"total": 0, "worker": 0, "judge": 0, "self_improve": 0}
+            )
+            bucket["total"] += n
+            bucket[kind] = bucket.get(kind, 0) + n
+    return by_model, by_task
+
+
 def _print_summary(session: Session, client: LLMClient, worktree: Path | None) -> None:
     elapsed = time.time() - session.started_at
     cfg = client.config
@@ -706,6 +771,26 @@ def _print_summary(session: Session, client: LLMClient, worktree: Path | None) -> None:
     console.print(f"  tokens    {session.tokens_used:,} [dim]{tokens_dim}[/dim]")
     console.print(f"  tasks     {' '.join(task_bits + extras)}")
 
+    by_model, by_task = _token_breakdowns(session)
+
+    if by_task:
+        width = max(len(tid) for tid in by_task)
+        console.print("  per task")
+        for tid, bucket in sorted(by_task.items(), key=lambda kv: (-kv[1]["total"], kv[0])):
+            split_bits = [
+                f"{kind}={bucket[kind]:,}"
+                for kind in ("worker", "judge", "self_improve")
+                if bucket.get(kind)
+            ]
+            split = f" [dim]({', '.join(split_bits)})[/dim]" if split_bits else ""
+            console.print(f"    {tid.ljust(width)}  {bucket['total']:>9,}{split}")
+
+    if by_model:
+        width = max(len(m) for m in by_model)
+        console.print("  per model")
+        for model, n in sorted(by_model.items(), key=lambda kv: (-kv[1], kv[0])):
+            console.print(f"    {model.ljust(width)}  {n:>9,}")
+
 
 # --- reset ------------------------------------------------------------------
diff --git a/tilth/session.py b/tilth/session.py
index fea5763..4bf05f4 100644
--- a/tilth/session.py
+++ b/tilth/session.py
@@ -4,8 +4,15 @@ just enough state (last completed task, worktree branch) to resume on a fresh
 process.
 
 Event types:
-    model_call — request/response metadata for a worker call. Carries
-                 `reasoning_details` (the OpenRouter-normalised
+    model_call — request/response metadata for any model call. Emitted
+                 from all three sites (worker iteration, judge, AGENTS.md
+                 self-improvement) so per-model and per-task token
+                 breakdowns can be reconstructed by replaying the log.
+                 Payload: `task_id`, `kind` ∈ {"worker","judge","self_improve"},
+                 `model`, `prompt_tokens`, `eval_tokens`, `tokens_used_total`.
+                 Worker and judge entries also carry `iter` (the worker
+                 iteration that triggered them); self_improve omits it.
+                 Carries `reasoning_details` (the OpenRouter-normalised
                  structured form) when the model emitted any, falling back
                  to a flat `reasoning` string. Either is omitted when absent
                  so non-thinking models keep slim events.
diff --git a/tilth/visualize/render.py b/tilth/visualize/render.py
index 6dcc990..54ac14a 100644
--- a/tilth/visualize/render.py
+++ b/tilth/visualize/render.py
@@ -96,14 +96,25 @@ def _render_context_reset(_typ: str, ts: str, _p: dict[str, Any]) -> str:
 
 
 def _render_model_call(_typ: str, ts: str, p: dict[str, Any]) -> str:
-    iter_n = p.get("iter", "?")
+    kind = p.get("kind") or "worker"
+    iter_n = p.get("iter")
+    model = p.get("model") or ""
     pt = int(p.get("prompt_tokens", 0) or 0)
     et = int(p.get("eval_tokens", 0) or 0)
     total = int(p.get("tokens_used_total", 0) or 0)
+    if kind == "worker":
+        badge_label = f"iter {iter_n}" if iter_n is not None else "worker"
+    elif kind == "judge":
+        badge_label = f"judge (iter {iter_n})" if iter_n is not None else "judge"
+    else:
+        badge_label = kind
+    meta_bits = [f"prompt {pt:,}", f"eval {et:,}", f"total {total:,}"]
+    if model:
+        meta_bits.append(html.escape(model))
     strip = (
         ''
-        f'iter {html.escape(str(iter_n))}'
-        f'prompt {pt:,} · eval {et:,} · total {total:,}'
+        f'{html.escape(badge_label)}'
+        f'{" · ".join(meta_bits)}'
         f'{html.escape(ts)}'
         ''
    )