23 changes: 22 additions & 1 deletion USAGE.md
@@ -230,12 +230,33 @@ uv run tilth --visualize <session_id> # or name one explicitly

Writes a single self-contained file (inline CSS, no JS) to `sessions/<id>/chat.html`. Events are grouped by task — model calls become meta strips with a collapsible reasoning block where the model emitted any, tool calls/results become bubbles, validator runs and judge verdicts become coloured cards. Read-only and runs over the saved `events.jsonl`, so it's safe to invoke against a finished or in-progress session.

### Where the tokens went

Every run prints a summary on exit, including a per-task and per-model breakdown:

```
── run summary ──
session 20260503-101422-a1b2c3
duration 14m32s (12.1% of TILTH_MAX_WALL_CLOCK_MINUTES=120)
tokens 412,008 (20.6% of TILTH_MAX_TOKENS=2,000,000)
tasks done=3 failed=0 pending=0
per task
T2 201,344 (worker=180,200, judge=18,400, self_improve=2,744)
T1 150,612 (worker=132,000, judge=16,800, self_improve=1,812)
T3 60,052 (worker=58,200, judge=1,852)
per model
moonshotai/kimi-k2-thinking 381,956
anthropic/claude-sonnet-4.5 30,052
```

Use the per-task line to spot PRDs that should have been split (one task that ate half the budget is usually a planning failure, not a model failure). Use the per-model line to attribute spend when the worker and judge run on different providers — especially when deciding whether the judge is pulling its weight. The breakdown is computed by replaying `events.jsonl`, so it survives crashes and `--resume` cycles.

## 5. Caveats worth being upfront about

- **It's Python-centric.** `post_edit` lints `.py` files. `validators` runs `pytest` and `ruff`. JavaScript / Rust / Go projects need `tilth/validators.py` and `tilth/hooks/post_edit.py` adapted to your toolchain — not deep work, but not zero.
- **Ruff config matters.** If your project doesn't already use ruff, the validator will fire constantly and the agent will spend iterations fixing things that aren't really broken. Either add a permissive `[tool.ruff]` block to your `pyproject.toml` (a sketch follows this list), or swap the ruff validator for whatever linter you already use.
- **The planner is you.** Writing a good `prd.json` (small enough tasks, sharp acceptance criteria, tests upfront) is where most of the value is. Vague PRDs make the harness fail loudly and burn tokens.
- **Costs are real.** A 2-hour run can mean hundreds of thousands of tokens across worker + judge + self-improvement calls. The `TILTH_MAX_TOKENS` cap exists for a reason — set it on first run. Cost per token varies wildly across providers; pick your worker accordingly. Be careful about reaching for a smaller judge model to cut costs — see ["Picking a judge model"](#picking-a-judge-model) below. The end-of-session summary breaks total spend down per task and per model so you can see exactly where the budget went (see ["Where the tokens went"](#where-the-tokens-went)).
- **AGENTS.md is yours forever.** It accumulates. Prune it periodically — old learnings that the model has clearly internalised should be removed (the ratchet works in both directions).
- **Tools are intentionally narrow.** No web fetch, no MCP, no curl-based downloads. If your tasks require external API access, you add a tool to `tilth/tools/` and register it. Keep tools focused — every tool description ships in the prompt every turn.
- **The harness commits to your repo's git db.** The worktree branch is in your repo, not the harness's. If you delete `{{your projects folder}}/tilth`, the branches in your project's repo remain. Clean up branches the same way you would for a normal feature branch.
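
For the ruff caveat, a permissive starting point could look like the snippet below. The rule selection is an illustrative assumption, not tilth's shipped config; tune it to whatever your codebase already tolerates:

```toml
# Minimal, permissive ruff setup (illustrative values, not a recommendation).
[tool.ruff]
line-length = 120

[tool.ruff.lint]
select = ["E", "F"]  # pycodestyle errors + pyflakes only; no style nits
ignore = ["E501"]    # don't fail the validator on long lines
```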
15 changes: 8 additions & 7 deletions deep-dives.md
@@ -179,19 +179,21 @@ A few things worth noting about this pattern:
- **`or 0` everywhere.** If a provider ships a malformed response, the token count silently falls to zero rather than crashing the run. Defensive choice; the alternative is a 2-hour run dying on one weird `null`.
- **`prompt + completion`, not `total`.** Some providers report `total_tokens` separately; we sum the two we trust. Equivalent to `total_tokens` for every well-formed response. Both choices are visible in the sketch below.
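
For reference, the pattern these bullets describe looks like this. A sketch: the `usage = ...` line is an assumption about the surrounding code, while the three lines after it appear verbatim at each call site in this diff:

```python
usage = resp.get("usage") or {}  # assumed shape: provider usage dict, possibly absent
prompt_tokens = int(usage.get("prompt_tokens") or 0)    # `or 0`: a null never crashes the run
eval_tokens = int(usage.get("completion_tokens") or 0)  # completion, not total
session.add_tokens(prompt_tokens + eval_tokens)         # prompt + completion, summed ourselves
```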

All three sites log a per-call breakdown to `events.jsonl` as a `model_call` event, with a `kind` field tagging which site emitted it:

```python
session.log("model_call", {
    "task_id": task["id"],
    "iter": iter_n + 1,  # worker + judge only; omitted for self_improve
    "kind": "worker",  # or "judge", or "self_improve"
    "model": client.config.worker_model,
    "prompt_tokens": prompt_tokens,
    "eval_tokens": eval_tokens,
    "tokens_used_total": session.tokens_used,
})
```

That's the audit trail. After a run, grep `events.jsonl` for `model_call` and reconstruct exactly when tokens were spent and on which model. `_token_breakdowns()` in `loop.py` does exactly this replay to compute the per-task and per-model figures rendered in the end-of-session summary; the same shape is what `--resume` would compute if you wanted to bolt on cumulative reporting across resume cycles.
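
As a concrete starting point, a minimal standalone replay might look like this. It mirrors what `_token_breakdowns` does rather than calling it, and the session path is a placeholder:

```python
import json
from collections import Counter
from pathlib import Path

per_model: Counter[str] = Counter()
for line in Path("sessions/<id>/events.jsonl").read_text().splitlines():  # placeholder path
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip torn/partial lines, like the harness does
    if rec.get("type") != "model_call":
        continue
    p = rec.get("payload") or {}
    per_model[p.get("model") or "unknown"] += (
        int(p.get("prompt_tokens") or 0) + int(p.get("eval_tokens") or 0)
    )

print(per_model.most_common())
```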

### Enforcement is at the top of each task

@@ -247,10 +249,9 @@ The wall-clock baseline (`started_at`) is treated differently: `Session.wake()`

A few honest gaps worth knowing:

1. **No dollar-cost tracking.** Tokens, not dollars. The cap is provider-agnostic — useful as a coarse safety net, useless for "stop when I've spent $50 on this run." Adding dollar tracking means a per-model price table and a cost lookup at each `add_tokens` site. Not in MVP — but with `model` now on every `model_call` event, a price table is the only missing piece (sketched after this list).
2. **No headroom warning.** The cap is binary — under it, run; at it, stop. No "you're at 80% of your token budget" alert. Easy to add in `_stop_reason` if you want it.
3. **The cap is over the whole session, not per-task.** A 10-task run with a 2M cap means tasks 1–9 might gobble tokens and starve task 10. There's no per-task budget. The iteration cap (default 8 model calls per task) is the proxy; combined with average tokens/call, that approximates a per-task ceiling. The end-of-session per-task breakdown surfaces *where* the spend went after the fact, but doesn't enforce anything during the run.
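
On gap 1, the shape of the fix is small enough to sketch. The rates below are invented for illustration (real pricing also bills prompt and completion tokens at different rates, which this blended version ignores):

```python
# Hypothetical blended $/1M-token rates. NOT real provider pricing.
PRICE_PER_MTOK: dict[str, float] = {
    "moonshotai/kimi-k2-thinking": 1.00,
    "anthropic/claude-sonnet-4.5": 6.00,
}

def estimate_dollars(by_model: dict[str, int]) -> float:
    """Rough spend from the by_model half of _token_breakdowns()."""
    return sum(
        tokens / 1_000_000 * PRICE_PER_MTOK.get(model, 0.0)  # unknown models priced at $0
        for model, tokens in by_model.items()
    )
```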

---

141 changes: 141 additions & 0 deletions tests/test_token_breakdowns.py
@@ -0,0 +1,141 @@
"""Regression tests for `_token_breakdowns`.

The end-of-session summary's per-task / per-model lines are computed by replaying
events.jsonl, not from in-memory state. Pin the contract: aggregate every
`model_call` event by `task_id` and `model`, split per-task by `kind`, and
gracefully degrade for older events that pre-date the `kind`/`model` fields.
"""

from __future__ import annotations

import json
from pathlib import Path

from tilth.loop import _token_breakdowns
from tilth.session import Session


def _make_session(tmp_path: Path, events: list[dict]) -> Session:
root = tmp_path / "sess"
root.mkdir()
events_path = root / "events.jsonl"
with events_path.open("w") as f:
for e in events:
f.write(json.dumps(e) + "\n")
return Session(
session_id="test",
root=root,
events_path=events_path,
checkpoint_path=root / "checkpoint.json",
)


def test_aggregates_by_model_and_task_with_kind_split(tmp_path: Path) -> None:
session = _make_session(
tmp_path,
[
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "worker-x",
"prompt_tokens": 100, "eval_tokens": 50,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "worker-x",
"prompt_tokens": 200, "eval_tokens": 0,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "judge", "model": "judge-y",
"prompt_tokens": 80, "eval_tokens": 20,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "self_improve", "model": "worker-x",
"prompt_tokens": 30, "eval_tokens": 10,
}},
{"type": "model_call", "payload": {
"task_id": "T2", "kind": "worker", "model": "worker-x",
"prompt_tokens": 5, "eval_tokens": 5,
}},
{"type": "tool_call", "payload": {"task_id": "T1", "tool": "bash"}},
],
)

by_model, by_task = _token_breakdowns(session)

assert by_model == {"worker-x": 100 + 50 + 200 + 30 + 10 + 5 + 5, "judge-y": 100}
assert by_task["T1"]["total"] == 100 + 50 + 200 + 80 + 20 + 30 + 10
assert by_task["T1"]["worker"] == 100 + 50 + 200
assert by_task["T1"]["judge"] == 80 + 20
assert by_task["T1"]["self_improve"] == 30 + 10
assert by_task["T2"]["total"] == 10
assert by_task["T2"]["worker"] == 10


def test_skips_zero_token_events(tmp_path: Path) -> None:
"""Defensive: providers occasionally return 0/null usage; don't pollute the breakdown."""
session = _make_session(
tmp_path,
[
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "m",
"prompt_tokens": 0, "eval_tokens": 0,
}},
{"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "m",
"prompt_tokens": None, "eval_tokens": None,
}},
],
)
by_model, by_task = _token_breakdowns(session)
assert by_model == {}
assert by_task == {}


def test_legacy_events_without_kind_or_model_bucketed_safely(tmp_path: Path) -> None:
"""Pre-breakdown events lack `kind` and `model`. Replay must not crash."""
session = _make_session(
tmp_path,
[
{"type": "model_call", "payload": {
"task_id": "T1",
"prompt_tokens": 10, "eval_tokens": 5,
}},
],
)
by_model, by_task = _token_breakdowns(session)
assert by_model == {"unknown": 15}
assert by_task["T1"]["total"] == 15
assert by_task["T1"]["worker"] == 15


def test_missing_events_file_returns_empty(tmp_path: Path) -> None:
root = tmp_path / "no-log"
root.mkdir()
session = Session(
session_id="test",
root=root,
events_path=root / "events.jsonl",
checkpoint_path=root / "checkpoint.json",
)
assert _token_breakdowns(session) == ({}, {})


def test_skips_malformed_jsonl_lines(tmp_path: Path) -> None:
"""Match `_last_stop_reason`'s tolerance: skip lines that aren't valid JSON."""
root = tmp_path / "sess"
root.mkdir()
events_path = root / "events.jsonl"
events_path.write_text(
json.dumps({"type": "model_call", "payload": {
"task_id": "T1", "kind": "worker", "model": "m",
"prompt_tokens": 10, "eval_tokens": 5,
}}) + "\n"
"not valid json\n"
)
session = Session(
session_id="test",
root=root,
events_path=events_path,
checkpoint_path=root / "checkpoint.json",
)
by_model, by_task = _token_breakdowns(session)
assert by_model == {"m": 15}
assert by_task["T1"]["total"] == 15
85 changes: 85 additions & 0 deletions tilth/loop.py
@@ -122,6 +122,18 @@ def _judge_task(
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    eval_tokens = int(usage.get("completion_tokens") or 0)
    session.add_tokens(prompt_tokens + eval_tokens)
    session.log(
        "model_call",
        {
            "task_id": task["id"],
            "iter": iter_n + 1,
            "kind": "judge",
            "model": client.config.judge_model,
            "prompt_tokens": prompt_tokens,
            "eval_tokens": eval_tokens,
            "tokens_used_total": session.tokens_used,
        },
    )

    content = ((resp.get("message") or {}).get("content") or "").strip()
    parsed = _parse_json_lenient(content)
@@ -180,6 +192,17 @@ def _self_improve(
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    eval_tokens = int(usage.get("completion_tokens") or 0)
    session.add_tokens(prompt_tokens + eval_tokens)
    session.log(
        "model_call",
        {
            "task_id": task["id"],
            "kind": "self_improve",
            "model": client.config.worker_model,
            "prompt_tokens": prompt_tokens,
            "eval_tokens": eval_tokens,
            "tokens_used_total": session.tokens_used,
        },
    )

    content = ((resp.get("message") or {}).get("content") or "").strip()
    parsed = _parse_json_lenient(content)
@@ -480,6 +503,8 @@ def _run_task(
    model_call_payload: dict[str, Any] = {
        "task_id": task["id"],
        "iter": iter_n + 1,
        "kind": "worker",
        "model": client.config.worker_model,
        "prompt_tokens": prompt_tokens,
        "eval_tokens": eval_tokens,
        "tokens_used_total": session.tokens_used,
@@ -667,6 +692,46 @@ def _format_duration(seconds: float) -> str:
    return f"{s}s"


def _token_breakdowns(session: Session) -> tuple[dict[str, int], dict[str, dict[str, int]]]:
    """Replay events.jsonl and aggregate model_call tokens.

    Returns (by_model, by_task) where:
      by_model: {model_name: total_tokens}
      by_task: {task_id: {"total": int, "worker": int, "judge": int, "self_improve": int}}

    Reads from disk so the breakdown survives crashes and `--resume` cycles. Older
    events without `kind`/`model` fields are bucketed under "unknown" / "worker"
    respectively (the worker site has emitted model_call events since v0; only
    judge and self_improve started emitting them when this breakdown landed).
    """
    by_model: dict[str, int] = {}
    by_task: dict[str, dict[str, int]] = {}
    if not session.events_path.is_file():
        return by_model, by_task
    with session.events_path.open() as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            if rec.get("type") != "model_call":
                continue
            p = rec.get("payload") or {}
            n = int(p.get("prompt_tokens") or 0) + int(p.get("eval_tokens") or 0)
            if n <= 0:
                continue
            model = p.get("model") or "unknown"
            task_id = p.get("task_id") or "unknown"
            kind = p.get("kind") or "worker"
            by_model[model] = by_model.get(model, 0) + n
            bucket = by_task.setdefault(
                task_id, {"total": 0, "worker": 0, "judge": 0, "self_improve": 0}
            )
            bucket["total"] += n
            bucket[kind] = bucket.get(kind, 0) + n
    return by_model, by_task


def _print_summary(session: Session, client: LLMClient, worktree: Path | None) -> None:
    elapsed = time.time() - session.started_at
    cfg = client.config
@@ -706,6 +771,26 @@ def _print_summary(session: Session, client: LLMClient, worktree: Path | None) -
    console.print(f" tokens {session.tokens_used:,} [dim]{tokens_dim}[/dim]")
    console.print(f" tasks {' '.join(task_bits + extras)}")

    by_model, by_task = _token_breakdowns(session)

    if by_task:
        width = max(len(tid) for tid in by_task)
        console.print(" per task")
        for tid, bucket in sorted(by_task.items(), key=lambda kv: (-kv[1]["total"], kv[0])):
            split_bits = [
                f"{kind}={bucket[kind]:,}"
                for kind in ("worker", "judge", "self_improve")
                if bucket.get(kind)
            ]
            split = f" [dim]({', '.join(split_bits)})[/dim]" if split_bits else ""
            console.print(f" {tid.ljust(width)} {bucket['total']:>9,}{split}")

    if by_model:
        width = max(len(m) for m in by_model)
        console.print(" per model")
        for model, n in sorted(by_model.items(), key=lambda kv: (-kv[1], kv[0])):
            console.print(f" {model.ljust(width)} {n:>9,}")


# --- reset ------------------------------------------------------------------

11 changes: 9 additions & 2 deletions tilth/session.py
@@ -4,8 +4,15 @@
just enough state (last completed task, worktree branch) to resume on a fresh process.

Event types:
  model_call — request/response metadata for any model call. Emitted
               from all three sites (worker iteration, judge, AGENTS.md
               self-improvement) so per-model and per-task token
               breakdowns can be reconstructed by replaying the log.
               Payload: `task_id`, `kind` ∈ {"worker","judge","self_improve"},
               `model`, `prompt_tokens`, `eval_tokens`, `tokens_used_total`.
               Worker and judge entries also carry `iter` (the worker
               iteration that triggered them); self_improve omits it.
               Carries `reasoning_details` (the OpenRouter-normalised
               structured form) when the model emitted any, falling
               back to a flat `reasoning` string. Either is omitted
               when absent so non-thinking models keep slim events.