feat(AUR-218): custom Docker sandbox image for agentwatch-daily #29

Closed
mishanefedov wants to merge 120 commits into main from agent/aur-218-sandbox-docker
@mishanefedov
Owner

Implements Linear issue AUR-218.

What changed and why
Added a Dockerfile and a sandbox-runbook.md to .agentwatch-bot/ so that OpenClaw can run the daily agent fully sandboxed while still providing Node.js 22 and the gh CLI. Without this custom image, npm test and gh pr create would fail in the default OpenClaw sandbox.

What you considered and rejected
Considered extending openclaw-sandbox:bookworm-slim, but extending the standard node:22-bookworm-slim image is simpler and provides the exact LTS matrix version we need.

Test evidence

  • docker build -t agentwatch-sandbox -f .agentwatch-bot/Dockerfile . runs successfully.

Misha Nefedov and others added 30 commits April 14, 2026 11:29
Four related fixes from real dogfooding:

1. CLAUDE WAS SILENTLY BROKEN. The chokidar v4 watch pattern
   `${dir}/**/*.jsonl` never fired because v4 dropped glob support
   (we already fixed this in openclaw.ts and cursor.ts, forgot
   claude). Replaced with recursive watch + regex filter. Smoke test
   on dev machine: claude adapter now surfaces thousands of events
   across 9+ projects where before it was 0.

2. PROJECT PREFIX on every event. Claude session paths encode the
   project (`~/.claude/projects/-Users-foo-IdeaProjects-auraqu/...`)
   so we extract the last segment as a `[auraqu]` prefix on every
   summary. OpenClaw tracks cwd per session from `session_start`
   events and tags subsequent events in that session. Cursor uses
   path heuristics. Huge visibility win — you can finally see WHICH
   project each agent is working on.

3. CURSOR STARTUP NOISE REMOVED. Before: Cursor emitted 3 'detected
   at startup' events into the timeline on every launch, which
   looked like agent activity even when you hadn't touched Cursor
   in months. Now: the startup snapshot populates a CursorStatus
   object returned alongside the stop fn; only real file-change
   events (mcp edit, permission edit, .cursorrules edit, new
   recently-viewed file) hit the timeline.

4. TIMELINE TRUNCATION via Ink's wrap='truncate' so long summaries
   don't break rows into multi-line walls of text.

13 tests still passing. Typecheck clean.
Previously events rendered in arrival order — backfill from session
files could arrive out-of-order so an event from 02:00 would appear
above a live event from 09:24. Now each incoming event is
binary-inserted at its correct position based on the event's ts
field so the timeline stays strictly reverse-chronological.
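
A minimal sketch of the binary insert described above, assuming a buffer kept newest-first and a numeric `ts` in epoch milliseconds. The `TimelineEvent` shape and the `insertByTs` name are illustrative, not taken from the repository.

```typescript
// Hypothetical minimal event shape; the real event type carries more fields.
interface TimelineEvent {
  id: string;
  ts: number; // epoch milliseconds (assumed)
}

// Insert `ev` into a buffer sorted newest-first and keep it sorted.
// Events with an equal ts land after the ones already present.
function insertByTs(buffer: TimelineEvent[], ev: TimelineEvent): number {
  let lo = 0;
  let hi = buffer.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (buffer[mid].ts >= ev.ts) {
      lo = mid + 1; // existing event is newer or equal: ev goes further down
    } else {
      hi = mid; // existing event is older: ev goes at or above mid
    }
  }
  buffer.splice(lo, 0, ev);
  return lo; // index where the event was placed
}
```

With this, a backfilled 02:00 event arriving after a live 09:24 event is placed below it rather than on top.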
Two UX fixes from real dogfood:

1. Add a bold/dim header row (TIME / AGENT / TYPE / EVENT) at the
   top of the timeline so it's obvious what each column is.

2. Enter the terminal's alt screen buffer (\x1b[?1049h) on startup
   and leave it (\x1b[?1049l) on exit. Standard TUI behaviour:
   lazygit/k9s/htop do this. While agentwatch runs it takes over
   the viewport; on quit the shell's scrollback is restored as
   though nothing happened. Registered handlers for exit, SIGINT,
   SIGTERM, SIGHUP so Ctrl-C and kill signals still restore the
   terminal cleanly.
chokidar's watcher.close() awaits pending fs handles which made q
feel laggy (2-3s) on exit. We don't care — OS reaps fds. Call
exit() for the Ink unmount then schedule process.exit(0) on the
next tick.
Previously many Claude events rendered as '[auraqu] tool_call' /
'[auraqu] response' with no payload because extractFields only
looked at top-level o.input, but the real tool_use data lives in
o.message.content[i].input. Now we walk message.content properly:

- Bash tool_use → 'Bash: <command>' + cmd field populated +
  classified as shell_exec (risk 9 for destructive cmds)
- Read/Write/Edit/MultiEdit → '<Tool>: <path>' + path field + correct
  event type (file_read / file_write)
- Grep/Glob → '<Tool>: <pattern>'
- Task → 'Task: <description>'
- WebFetch → 'WebFetch: <url>'
- Fallback for unknown tools: '<name>: <first string arg>'

Also suppresses noise:
- Assistant messages with empty content (compaction stubs)
- User turns that are tool_result blocks only (no human text)
- worktree-state / compact / summary entries

Smoke test on dev data: 0 empty-looking events across 2147 events.
Every entry now has readable content or is filtered out.

Updated one test (Bash now classifies as shell_exec) + added a
regression test for empty-message suppression. 14 tests passing.
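
The walk over `message.content` could look roughly like the sketch below. The block shapes and the `summarizeToolUse` / `summarizeMessage` names are assumptions for illustration; the `command` and `file_path` input keys follow the names mentioned elsewhere in this PR, while `pattern`, `description`, and `url` are guesses at the tool input keys.

```typescript
// Hypothetical content-block shapes, modelled on the description above.
interface ToolUseBlock {
  type: "tool_use";
  name: string;
  input: Record<string, unknown>;
}
interface TextBlock {
  type: "text";
  text: string;
}
type ContentBlock = ToolUseBlock | TextBlock;

// Build a one-line summary for a single tool_use block.
function summarizeToolUse(block: ToolUseBlock): string {
  const input = block.input;
  const str = (key: string): string => {
    const v = input[key];
    return typeof v === "string" ? v : "";
  };
  switch (block.name) {
    case "Bash":
      return `Bash: ${str("command")}`;
    case "Read":
    case "Write":
    case "Edit":
    case "MultiEdit":
      return `${block.name}: ${str("file_path")}`;
    case "Grep":
    case "Glob":
      return `${block.name}: ${str("pattern")}`; // key name assumed
    case "Task":
      return `Task: ${str("description")}`; // key name assumed
    case "WebFetch":
      return `WebFetch: ${str("url")}`; // key name assumed
    default: {
      // Fallback for unknown tools: first string argument, if any.
      const first = Object.values(input).find((v) => typeof v === "string");
      return `${block.name}: ${typeof first === "string" ? first : ""}`;
    }
  }
}

// Walk message.content and summarize every tool_use block in it.
function summarizeMessage(content: ContentBlock[]): string[] {
  return content
    .filter((b): b is ToolUseBlock => b.type === "tool_use")
    .map(summarizeToolUse);
}
```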
0.0.1 was published but unusable (claude adapter silently returned
0 events, EMFILE crash on workspace scan, empty timeline rows).
0.0.2 bundles all fixes discovered during same-day dogfood.

Contents documented in CHANGELOG.
Press 'p' now shows three sections:

- Claude Code: as before (allow/deny/defaultMode + risk flags)
- Cursor: approval mode, sandbox state, allow/deny counts, MCP
  server list, discovered .cursorrules paths. Data was already
  collected via CursorStatus — just wasn't rendered.
- OpenClaw: default workspace + per-sub-agent breakdown (name,
  emoji, id, model, workspace). OpenClaw has no allow/deny — scope
  is controlled by the workspace path, so we render that instead.

Gemini CLI exposes only auth (no permission model), so we document
that fact in the footer rather than show an empty section.
Claude Code stores subagent runs at:
  ~/.claude/projects/<proj>/<session>/subagents/agent-<id>.jsonl

Previously invisible — our path regex required the jsonl to be
directly in the project folder and chokidar depth was 3. Every
subagent's inner tool calls (often dozens: Bash, WebFetch, Grep…)
never made it to the timeline; only the parent 'Agent: <task>'
entry showed up.

Now:
- regex matches both main-session and subagents/ files
- depth bumped 3 → 5
- subAgentId extracted from filename, threaded into the translator
- events prefixed with [project / sub:<id8>] so you can see which
  subagent each call came from

Smoke test on dev data: 2217 main-session events + 8513 subagent
events now surfaced (previously 0 from subagents). Closes a huge
observability gap that was invisible until today.

All 14 tests still passing.
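
For illustration, a path matcher covering both layouts might look like this. It is a sketch under assumptions: main sessions are taken to live at `<proj>/<session>.jsonl`, the regex and helper names are hypothetical, and the project label is taken as the last `-`-separated segment of the encoded directory name.

```typescript
// Matches <proj>/<session>.jsonl and <proj>/<session>/subagents/agent-<id>.jsonl.
// The main-session layout is an assumption; the subagent layout is from the text above.
const SESSION_RE =
  /\/projects\/([^/]+)\/([^/]+?)(?:\.jsonl|\/subagents\/agent-([^/]+)\.jsonl)$/;

interface SessionPath {
  project: string;
  session: string;
  subAgentId?: string;
}

function parseSessionPath(path: string): SessionPath | null {
  const m = SESSION_RE.exec(path);
  if (!m) return null;
  const [, project, session, subAgentId] = m;
  return subAgentId ? { project, session, subAgentId } : { project, session };
}

// "[auraqu]" for main sessions, "[auraqu / sub:<id8>]" for subagent files.
function eventPrefix(p: SessionPath): string {
  const label = p.project.split("-").pop() ?? p.project;
  return p.subAgentId
    ? `[${label} / sub:${p.subAgentId.slice(0, 8)}]`
    : `[${label}]`;
}
```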
AUR-99. First M4 ticket. claude-devtools parity win.

- Added EventDetails type to schema (fullText, thinking, toolInput,
  toolUseId) so translators stash the real payload alongside the
  truncated summary
- Claude adapter populates details for prompts, responses, and
  tool_use events. Extracts thinking blocks separately.
- OpenClaw adapter same
- New EventDetail.tsx renders the full content: wraps long text to
  terminal width, shows JSON-pretty tool input, highlights extended
  thinking, scrollable with up/down / j/k
- Reducer gains selectedIdx + detailOpen + detailScroll state
- ↑↓ or j/k moves selection in the timeline (inverse-highlighted row)
- Enter / l opens detail; esc / q closes
- If selection drifts off-screen, Timeline centers the window on it
- When new events arrive above a selected row, the cursor shifts to
  stay on the same event

14 tests still passing. Biggest UX upgrade since the initial scaffold
— every event now has real inspectable content.
AUR-101. Second M4 ticket.

- Press / in the TUI to open a search input; every keystroke
  narrows the timeline to events whose summary, path, cmd, tool,
  agent, fullText, or thinking contain the query (case-insensitive
  substring). Backspace to edit, esc to clear.
- Match count shown below the timeline while searching.
- Query is a sticky filter on top of the existing agent filter and
  the buffer sort order.

Cross-session disk search will come later (AUR-111, M6). This one
covers the in-memory buffer which is instant up to ~10k events.
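
The in-memory filter can be sketched as a plain predicate. The `SearchableEvent` shape below is an assumption; only the list of searched fields comes from the description above.

```typescript
// Hypothetical event shape holding just the fields the search looks at.
interface SearchableEvent {
  summary: string;
  agent: string;
  path?: string;
  cmd?: string;
  tool?: string;
  details?: { fullText?: string; thinking?: string };
}

// Case-insensitive substring match across every searchable field.
function matchesQuery(ev: SearchableEvent, query: string): boolean {
  const q = query.toLowerCase();
  if (q === "") return true; // empty query keeps everything
  const haystacks = [
    ev.summary,
    ev.path,
    ev.cmd,
    ev.tool,
    ev.agent,
    ev.details?.fullText,
    ev.details?.thinking,
  ];
  return haystacks.some((h) => h !== undefined && h.toLowerCase().includes(q));
}
```

Applying it is then a single `events.filter((e) => matchesQuery(e, query))` layered on top of whatever agent filter is already active.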
AUR-118. Third M4 ticket.

- New src/util/cost.ts with per-model pricing tables (opus-4-6,
  sonnet-4-6, haiku-4-5) + default fallback. Parses the usage block
  (input_tokens / cache_creation_input_tokens / cache_read_input_tokens /
  output_tokens) and returns USD cost.
- Cache accounting is critical: cache_read is ~10% of base rate, so
  naively summing every token at the base rate is off by 3-10x on
  Claude. Get this right.
- Claude adapter extracts model + usage from each assistant turn and
  stashes them in event.details.{usage, cost, model}.
- Agent side panel shows per-agent cost total (yellow).
- Event detail pane shows token breakdown + cost for every
  assistant-turn event.

Smoke test on dev data: 10,735 events, 9,707 with cost, total
$861.21 across accumulated backfill. Real pricing math.
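
To make the cache-accounting point concrete, here is a small sketch of the arithmetic. The rate numbers are placeholders chosen for easy mental math, not real provider pricing, and `costUsd` is a hypothetical name; only the usage field names come from the text above.

```typescript
// Usage block fields as listed above; all optional in case a turn omits one.
interface Usage {
  input_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
  output_tokens?: number;
}

// USD per million tokens.
interface Rates {
  input: number;
  cacheCreate: number;
  cacheRead: number;
  output: number;
}

// PLACEHOLDER rates for illustration only; cacheRead is 10% of input.
const EXAMPLE_RATES: Rates = { input: 10, cacheCreate: 12.5, cacheRead: 1, output: 50 };

// Price each token class at its own rate instead of lumping them together.
function costUsd(usage: Usage, rates: Rates): number {
  const part = (tokens: number | undefined, rate: number): number =>
    ((tokens ?? 0) * rate) / 1_000_000;
  return (
    part(usage.input_tokens, rates.input) +
    part(usage.cache_creation_input_tokens, rates.cacheCreate) +
    part(usage.cache_read_input_tokens, rates.cacheRead) +
    part(usage.output_tokens, rates.output)
  );
}
```

Under these placeholder rates, a turn with 1,000 fresh input tokens, 100,000 cache-read tokens, and 500 output tokens costs 0.135 USD, whereas pricing all 101,000 input-side tokens at the base rate would give 1.035 USD.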
…flag

AUR-123. Fourth M4 ticket. Unblocks AUR-100 subagent drilldown and
AUR-122 inline expansion.

- New EventSink type in schema: adapters receive { emit, enrich }
  instead of just an emit callback. emit is unchanged; enrich
  patches an already-emitted event's details in place.
- App reducer: new 'enrich' action walks the buffer and merges the
  patch onto the matching event (O(n) scan, fine for <500 events).
- Claude adapter:
  - pendingToolUses map tracks every tool_use's eventId + ts
  - handleToolResults scans user turns for tool_result blocks,
    pairs by tool_use_id, enriches with:
      * toolResult (stdout / file body / search matches)
      * durationMs (ts delta)
      * toolError (is_error flag)
  - orphanResults cache handles backfill out-of-order (bounded to
    1000 entries, drops oldest on overflow)
  - tool_use emitters check the orphan cache and enrich immediately
    if the result arrived first
- Detail pane now shows duration + tool result content + error flag
  alongside tokens/cost/model
- Openclaw / cursor / fs-watcher: accept either an Emit fn or an
  EventSink — backward compatible

Smoke test on dev data: 10,729 Claude events surfaced, 6,005 got a
durationMs, 5,919 got full toolResult content, 241 errors
flagged. Real pairing across backfill + live.
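
The pairing logic can be sketched as follows. This is an illustration under assumptions, not the adapter's code: the `ToolPairer` class, its method names, and the patch shape are hypothetical, and only the pending map, the bounded orphan cache, and the three enriched fields come from the description above.

```typescript
interface PendingUse { eventId: string; ts: number }
interface ToolResult { toolUseId: string; ts: number; content: string; isError: boolean }
interface Patch { toolResult: string; durationMs: number; toolError: boolean }
type Enrich = (eventId: string, patch: Patch) => void;

const MAX_ORPHANS = 1000; // bound from the description above

class ToolPairer {
  private pending = new Map<string, PendingUse>();
  private orphans = new Map<string, ToolResult>();
  private enrich: Enrich;

  constructor(enrich: Enrich) {
    this.enrich = enrich;
  }

  // Called when a tool_use event has been emitted.
  onToolUse(toolUseId: string, eventId: string, ts: number): void {
    const early = this.orphans.get(toolUseId);
    if (early) {
      // The result arrived first (out-of-order backfill): enrich immediately.
      this.orphans.delete(toolUseId);
      this.enrich(eventId, toPatch(early, ts));
      return;
    }
    this.pending.set(toolUseId, { eventId, ts });
  }

  // Called for every tool_result block found in a user turn.
  onToolResult(result: ToolResult): void {
    const use = this.pending.get(result.toolUseId);
    if (use) {
      this.pending.delete(result.toolUseId);
      this.enrich(use.eventId, toPatch(result, use.ts));
      return;
    }
    this.orphans.set(result.toolUseId, result);
    if (this.orphans.size > MAX_ORPHANS) {
      // Map preserves insertion order, so the first key is the oldest entry.
      const oldest = this.orphans.keys().next().value;
      if (oldest !== undefined) this.orphans.delete(oldest);
    }
  }
}

function toPatch(result: ToolResult, useTs: number): Patch {
  return {
    toolResult: result.content,
    durationMs: result.ts - useTs,
    toolError: result.isError,
  };
}
```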
AUR-100. Fifth M4 ticket. Uses the tool_use/tool_result pairing
from AUR-123.

When a Claude Agent tool_use's tool_result arrives, the result text
contains the spawned subagent's agentId (e.g. 'agentId:
ab3c99fca44a218cb'). We regex-extract it in handleToolResults and
stash on details.subAgentId.

UI:
- Timeline rows with a known subAgentId show a yellow suffix:
    [auraqu] Agent: Multi-agent dev pain research ▸ 52 child events
  where the child count aggregates all events whose sessionId ==
  'agent-<subAgentId>' (the subagent jsonl filename).
- New hotkey 'x' on a focused Agent event scopes the timeline to
  that subagent only — shows every Bash, WebFetch, Grep it made.
- 'X' (shift-x) unscopes back to full timeline.
- Scope banner above the footer makes the filter visible.
- Works with existing search / agent-filter / pause stack.

Smoke test on dev data: 15 parent Agent events tagged with their
subAgentId — matches the count of Task spawns today. Claude-devtools
parity for multi-agent drilldown.
… list

AUR-122. Sixth M4 ticket. Matches claude-devtools' signature UX.

Events now show a ▸ marker in the summary row when they have
expandable content (toolResult, toolInput, fullText, thinking).

Hotkeys:
- right-arrow or 'o' on focused row: toggle inline expansion
- left-arrow: collapse (if expanded)
- Enter still opens the full-screen detail pane for long content

Expanded content rendered indented under the parent row:
- tool_input (JSON-pretty) when there's no tool_result yet
- tool_result (stdout / file body / matches), red if error
- fullText for prompts/responses when no tool context
- capped at 10 lines with '… N more lines (press Enter for full view)'
  so a long Read doesn't blow up the timeline

Summary rows also got duration + ERR flag inline, matching
claude-devtools' compressed tool row format:
  ▸ Bash: git log · 15ms
  ▸ Edit: src/app.tsx · 7ms · ERR

Timeline component extended with expandedIds prop; state.expandedIds
lives in App.tsx as a Set<string>.
AUR-119. Seventh M4 ticket. First of the navigation hierarchy that
matches claude-devtools' left sidebar.

- New src/util/project-index.ts builds a ProjectRow[] from the
  event buffer, aggregating: event count, per-agent count, session
  ids, cost, last activity. Sorted by last ts descending.
- New src/ui/ProjectsView.tsx renders the list with selected row
  highlight, per-project agent counts (claude:484, openclaw:34,
  …), cost, last-active (5m ago / 2d ago).
- Hotkey P opens the view; ↑↓ / j/k navigates; Enter picks a
  project and applies it as a filter to the main timeline.
- Hotkey A clears the project filter (back to all).
- Scope banner above the footer makes the active filter visible
  alongside subagent scope + search query.

Smoke test on dev data: 23 projects surfaced across all Claude
sessions (auraqu / collector / landing / research / …). Real
cross-project aggregation even before AUR-120 sessions view.
…ll-in

Per user feedback: inline ▸/▾ expansion clutters the timeline and
isn't as useful as the full-screen modal. Reverting that piece of
AUR-122 while keeping the good parts (duration + ERR inline in the
summary row, child count marker).

- Removed expandedIds state + toggle-expand action + right/left
  arrow hotkeys
- Removed ExpansionBlock + hasExpandableContent + buildExpansionLines
- Timeline rows no longer carry a ▸/▾ prefix
- Enter opens the full-screen detail pane, esc closes — as before

Net effect: cleaner timeline, single drill-in UX.
Two dogfood bugs reported by user.

1. q wasn't quitting. Root cause: search input's Enter handler did
   nothing, leaving the user stuck in search capture mode where
   every keystroke (including q) went into the query buffer. Added
   a confirm-search action: Enter now exits input mode while
   keeping the query as a sticky filter. Outside input mode, q
   always quits.

2. Permissions screen overflowed the terminal with no way to see
   clipped content. Rewrote PermissionView to build a flat row
   array (h1 / h2 / kv / item / text / blank) and render only a
   slice based on scroll offset. New state.permissionsScroll moves
   with ↑↓ / j/k. Header shows '1-20 of 47' while scrolled. esc or
   p closes. q still quits.

All 14 tests green.
AUR-120 + AUR-121. Two hierarchy levels landed together since they
share state.

Flow: P opens projects → Enter picks a project → sessions list for
that project appears, grouped by date (Today / Yesterday / Last 7
days / Older) → Enter on a session scopes the main timeline to only
that session's events.

- src/util/project-index.ts: new buildSessionRows + dateBucket +
  SessionRow type. First user prompt per session is cached.
- src/ui/SessionsView.tsx: bucketed list with scroll window,
  selection highlight, per-row agent tag (claude-code /
  openclaw:content / cursor), event count + relative time + cost +
  ERR flag.
- App.tsx: new reducer actions for open/close-sessions,
  sessions-move, sessions-scroll, sessions-open-selected. New
  sessionFilter state applies to the timeline filter chain.
- Footer hint reflects the currently-active view.
- Banner shows 'session <short-id> (A to clear)' when scoped.

Keyboard: ↑↓ / j/k navigates; enter opens; esc back to projects
list; q still quits from anywhere.
AUR-125. The live TUI was showing 'unknown file_change' events
that duplicated Claude's own file_write events from the jsonl. Every
Claude Edit / Write / MultiEdit was counted twice — once with full
diff + project tag, once as bare unattributed noise.

- New src/util/recent-writes.ts module with a cross-adapter cache:
  markAgentWrite(path, ts) + wasRecentlyWrittenByAgent(path).
- 5s dedupe window, 30s TTL. Entries auto-sweep when the module is
  touched (no timers, no background work).
- Claude adapter calls markAgentWrite after emitting any file_write
  or file_read with a path.
- fs-watcher skips emission when the path matches a recent agent
  write.

Keeps fs-watcher as a safety net for truly manual edits (Cursor
with its broken activity log, terminal edits, non-instrumented
agents) while silencing the Claude double-count.
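
A sketch of what such a cache might look like, assuming timestamps in epoch milliseconds. The two function names follow the description above; the optional `now` parameter is an addition here purely to make the behaviour easy to exercise, and the sweep strategy is a guess.

```typescript
const DEDUPE_WINDOW_MS = 5_000; // fs events within 5s of an agent write are duplicates
const TTL_MS = 30_000; // entries older than 30s are dropped on the next touch

const recentWrites = new Map<string, number>();

// Lazy sweep: runs whenever the module is touched, so no timers are needed.
function sweep(now: number): void {
  recentWrites.forEach((ts, path) => {
    if (now - ts > TTL_MS) recentWrites.delete(path);
  });
}

function markAgentWrite(path: string, ts: number): void {
  sweep(ts);
  recentWrites.set(path, ts);
}

// `now` is injectable for testing (an assumption of this sketch).
function wasRecentlyWrittenByAgent(path: string, now: number = Date.now()): boolean {
  sweep(now);
  const ts = recentWrites.get(path);
  return ts !== undefined && Math.abs(now - ts) <= DEDUPE_WINDOW_MS;
}
```

The fs-watcher side then reduces to an early return when `wasRecentlyWrittenByAgent(path)` is true.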
AUR-124. Parity with claude-devtools' per-tool copy button.

- New src/util/clipboard.ts — zero-dep, platform-native: pbcopy on
  macOS; wl-copy/xclip/xsel on Linux (first match wins); clip on
  Windows. Returns structured {ok,reason} so the caller can
  surface 'install xclip' instead of crashing.
- eventToYankText picks the most useful payload: tool result >
  fullText > cmd > path > summary.
- App.tsx: 'y' on a focused event yanks; dispatches a transient
  flash message (green '✓ copied N chars' or red '✗ reason') that
  auto-clears after 2s.
- Footer hint updated: 'y yank' added.

Verified on dev machine: pbcopy works end-to-end.
AUR-102. Final M4 ticket — M4 complete.

Fires desktop notifications for:
- .env read/write from any agent
- ~/.ssh, ~/.aws, ~/.gnupg paths touched
- rm -rf / sudo / curl | sh in shell_exec
- tool_result is_error

Rate-limited: one alert per rule-key per 60s (keyed on path or
cmd prefix) so a looping agent doesn't spam notifications.
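
The per-key rate limit amounts to remembering when each rule-key last fired. A minimal sketch, with a hypothetical `AlertLimiter` class and the clock passed in explicitly:

```typescript
const ALERT_WINDOW_MS = 60_000; // one alert per rule-key per 60s

class AlertLimiter {
  private lastFired = new Map<string, number>();

  // Returns true if an alert for this key may fire now, and records it.
  shouldFire(ruleKey: string, now: number): boolean {
    const last = this.lastFired.get(ruleKey);
    if (last !== undefined && now - last < ALERT_WINDOW_MS) return false;
    this.lastFired.set(ruleKey, now);
    return true;
  }
}
```

A looping agent that reads the same `.env` every second would trigger one notification and then stay quiet until the window elapses, while a different path or command prefix still alerts immediately.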

Platform dispatch:
- macOS: osascript 'display notification' (no deps)
- Linux: notify-send
- Windows: PowerShell MessageBox fallback

Only fires for events whose ts is AFTER TUI launch time —
backfill from historical sessions is silent. Notifier self-
disables on the first platform error so a missing notify-send
doesn't spam stderr.
Ink's raw-mode stdin breaks default stdio inheritance on spawnSync,
surfacing as 'EBADF' when pbcopy (or notify-send / osascript) is
invoked from inside the running TUI.

Fix: explicit stdio for every spawnSync — pipe stdin where we're
supplying input, ignore child stdout/stderr to stay out of the
TUI's renderer.

Clipboard now works from inside the TUI instead of only from
scripts.
AUR-129. Ship before publish so hotkeys are discoverable.

Press ? from anywhere to open a grouped keybindings reference:
Navigate / Filter & scope / Actions / Info views / Detail pane /
Help. Press ? or esc to close.

Footer hint now leads with [?] help so first-time users notice
immediately. Also trimmed the footer hint down by removing
redundant hotkeys already covered in the help screen.
…DUCT + templates + UX polish

Launch-grade README (250+ lines):
- Centered hero with badges (npm, CI, license, Node)
- Table of contents
- Why / Install / First-60-seconds walkthroughs
- Per-feature sections with real examples: timeline, detail pane,
  subagent drilldown, project/session nav, search, permissions,
  cost-with-cache-accounting, notifications, clipboard yank
- Full keyboard reference as tables
- What agentwatch reads (paths table)
- Configuration (env vars)
- How it compares (claude-devtools / Unfucked / Langfuse / Phoenix)
- Limitations (honest: Cursor config-only, Gemini + Codex not
  instrumented, macOS+Linux only, 40-row timeline window)
- Non-goals (hard scope boundaries)
- Roadmap (v0.4 / v0.5 / v1.0 highlights)
- Architecture diagram with layer mental model
- Development / Security / License

Revived + written:
- CONTRIBUTING.md — dev workflow, PR checklist, what's in-scope
- SECURITY.md — responsible disclosure, scope, full path list,
  'what it does NOT do' invariants
- CODE_OF_CONDUCT.md — Contributor Covenant v2.1 pointer
- .github/ISSUE_TEMPLATE/{bug_report,adapter_request,feature_request}.md
- .github/PULL_REQUEST_TEMPLATE.md

UX polish (AUR-126):
- Breadcrumb component surfaces active view + every active filter/scope
  (project, session, sub-agent, agent, search)
- '0' = home — reset all filters/scopes, close modals
- 'Z' = clear filters (replaces confusing 'A' case-variant)
- 'esc' = go back one level consistently (sessions→projects→timeline)
- Removed the per-banner scope hints (breadcrumb covers them)
- Footer hint tightened + reordered

Added LAUNCH_POSTS.md to .gitignore (local-only draft doc).

All 14 tests green. tsup build clean.
Drafts live in the private Linear ticket / chat, not the repo.
AUR-76. Third agent live (after Claude + OpenClaw). v0 surfaces
517 events across 4 projects on the dev machine.

- Watches ~/.gemini/tmp recursively (depth 4) for session JSON
  files matching chats/session-*.json
- Each session is a single JSON document (not JSONL); re-read +
  diff-against-emitted-ids on every change
- Translates messages with type 'user'→prompt, 'gemini'→response,
  'error'→response (risk 6), skips 'info'
- Honors session kind='subagent' in the project prefix + tool tag
- Project extracted from tmp/<project>/chats/ path segment, with
  fallback for unusual layouts

Gemini's text doesn't contain structured tool_use like Claude's —
the model narrates its actions in prose. We surface them as
response events verbatim rather than applying brittle regex.
… status

Added detection for 6 more agents: Codex, Aider, Cline (VS Code
extension), Continue.dev, Windsurf (via .codeium), Goose.

- AgentName union expanded in schema.ts
- detect.ts grew per-agent detection paths with OS awareness
  (Cline has different locations on macOS vs Linux)
- New 'instrumented' boolean on DetectedAgent marks whether we
  actually parse events or just recognize the install
- AgentPanel renders a yellow dot + 'detected (events TBD)' label
  for detected-but-not-instrumented agents
- doctor output suggests opening an issue with a redacted session
  file so we can ship the adapter

Discovered Windsurf is installed on the dev machine (~/.codeium).
Flagged as not-yet-instrumented so users see honest status rather
than silent 'works with X' marketing claims.

Design choice: rather than stub 4 unverified parsers (Aider, Codex,
Cline, Windsurf) from documented specs without local data to test
against, we detect + tell the truth + ask for sample sessions.
Shipping a broken adapter is worse than not shipping one.
mishanefedov and others added 26 commits April 22, 2026 11:37
chore: executable bin + dynamic version read
docs(directives): stop backlog growth; ambiguity is not a blocker
Adds the persist-everywhere rule to §11: every repo touched in a run
must be committed and pushed before session-end, including partial work
that ends in a [BLOCKED] exit. A [BLOCKED] ping with a dirty working
tree is now itself defined as a failure — the agent must clean the tree
(commit+push, or stash) before pinging Telegram.

Also: include the KB commit SHA in the Telegram summary when the run
wrote to the knowledge base.

Motivation: today the agentwatch-daily cron sent "[BLOCKED] dirty main"
because a previous run had left the directives file uncommitted. The
fix is to make leaving a dirty tree a session-end failure, not a
session-start abort.
docs(directives): require commit+push before [BLOCKED] exit
The readline-based read loop in claude-code, codex, and openclaw advanced
the cursor by Buffer.byteLength(line) + 1 for every emitted line — including
the trailing line of a chunk that had not yet been newline-terminated by
the producing agent. JSON.parse failed on the partial line, the catch
block swallowed it, and the rest of the line, when later flushed, was
read as a fresh line and also failed. Result: silent permanent loss of
otherwise-valid events.

Replaced with a small synchronous helper readNewlineTerminatedLines that
returns only complete (\n-terminated) lines and a consumed count that
points at the last \n. Adapters now advance their cursor by exactly
that count, so any unterminated tail stays unread until the next pass.

Added jsonl-stream.test.ts plus claude-code.test.ts cases covering the
split-chunk write scenario.
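
A sketch of what a helper with this contract could look like. The function name is from the commit message; the signature and return shape are assumptions. It takes a `Uint8Array` (a Node `Buffer` qualifies) so the consumed count is in bytes, matching a byte-offset cursor.

```typescript
interface LineRead {
  lines: string[]; // complete lines only, without the trailing \n
  consumed: number; // bytes up to and including the last \n
}

function readNewlineTerminatedLines(chunk: Uint8Array): LineRead {
  const lastNewline = chunk.lastIndexOf(0x0a); // byte value of "\n"
  if (lastNewline === -1) {
    // No complete line yet: consume nothing so the tail is re-read later.
    return { lines: [], consumed: 0 };
  }
  const complete = new TextDecoder("utf-8").decode(chunk.subarray(0, lastNewline));
  const lines = complete.split("\n").filter((line) => line.length > 0);
  return { lines, consumed: lastNewline + 1 };
}
```

The adapter advances its cursor by `consumed` rather than by the chunk length, so a half-written JSON line is left in place and parsed whole on the next pass.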
…-241)

Run 981bbbf1-e4bc-442b-9d87-39b65e169039 burned 17 minutes on
2026-04-21 before the cron killed it. Per-run logs only show the
"timeout" terminator, with no command-level trace, so the offending
command is unrecoverable. Most likely culprits: a wedged
`openclaw status`, a hung `gh` call (auth flow / paginated
GraphQL), or an unbounded `curl` to Linear/Telegram.

Mitigation, since the cron backstop is too late:

1. New section in .agentwatch-bot/prompt.md, "Timeouts on hang-prone
   commands", defines a portable `wt <secs> <cmd>` helper using
   perl's `alarm` (no coreutils dependency, works on stock macOS +
   Linux). Returns 124 on timeout to match GNU `timeout`.

2. Table of hang-prone commands and suggested timeouts: openclaw
   status (30s), gh (60s), curl (30s), git fetch (60s), git push
   (120s), npm test (300s).

3. STEP 0's `openclaw status` invocation already updated to use the
   helper inline.

4. New AGENT_DIRECTIVES.md §7 hard red line: do not run hang-prone
   commands without an explicit timeout.

This is preventive, not retroactive — we can't tell which command
hung on April 21 without per-command logs. AUR-241 closes here; if
the next timeout repeats, file a follow-up that adds command-level
trace logging to the cron runner.
fix(directives): require timeout wrappers on hang-prone commands (AUR-241)
fix(adapters): preserve partial JSONL lines across reads (AUR-227)
The claude-code, codex, and openclaw adapters used to swallow JSON.parse
failures with an empty catch block. If an agent silently changed its
session-file format or a line was corrupted, operators would lose events
and never know — the TUI just showed a shorter timeline.

Added a per-session ParseErrorTracker that emits one synthetic
parse_error event the first time a session fails to parse a
newline-terminated line, then enriches the same event with the running
count + a truncated sample of the latest offending line. The event
shows up in the timeline at low risk (1) with summary "⚠ unparseable
line — context loss possible", so the operator sees a clear marker
that they're missing context.

New schema EventType "parse_error" plus details.parseErrorCount /
parseErrorSample. Wired into all three JSONL adapters.
…16) (#11)

Provider rates change between releases of the agentwatch CLI. Until
this commit the Claude / Gemini / GPT rates were hardcoded in
src/util/cost.ts, so any provider price change silently produced wrong
cost math for every operator until a new CLI version shipped.

loadRates() now merges the baked-in DEFAULT_RATES with any entries the
user wrote to ~/.agentwatch/pricing.json (overridable via the
AGENTWATCH_PRICING_PATH env var). Each entry must carry all four
fields (input / cacheCreate / cacheRead / output) as non-negative
numbers — partial entries are dropped so we never silently mix a stale
field with a fresh one. Validation rejections are logged when
AGENTWATCH_PRICING_DEBUG=1.

Schema documented in docs/features/cost-accounting.md, including the
normalized-model-name rule and the "all-or-nothing" override
semantics.
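
The all-or-nothing merge can be sketched as a pure function over already-parsed JSON. File reading, the env-var lookup, and debug logging are left out, and `mergeRates` / `isValidEntry` are hypothetical names rather than the ones in src/util/cost.ts.

```typescript
interface ModelRates {
  input: number;
  cacheCreate: number;
  cacheRead: number;
  output: number;
}
type RateTable = Record<string, ModelRates>;

const RATE_FIELDS = ["input", "cacheCreate", "cacheRead", "output"] as const;

// An override entry is valid only if all four fields are non-negative numbers.
function isValidEntry(value: unknown): value is ModelRates {
  if (typeof value !== "object" || value === null) return false;
  const rec = value as Record<string, unknown>;
  return RATE_FIELDS.every((field) => {
    const n = rec[field];
    return typeof n === "number" && Number.isFinite(n) && n >= 0;
  });
}

// Valid entries replace the default for that model wholesale; partial or
// malformed entries are dropped so stale and fresh fields never mix.
function mergeRates(defaults: RateTable, overrides: unknown): RateTable {
  const merged: RateTable = { ...defaults };
  if (typeof overrides !== "object" || overrides === null) return merged;
  for (const [model, entry] of Object.entries(overrides as Record<string, unknown>)) {
    if (isValidEntry(entry)) merged[model] = entry;
  }
  return merged;
}
```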
…AUR-217) (#10)

The Claude, Codex, and Gemini adapters all enrich their tool_use events
with the matching toolResult / exit code / duration from the next turn.
The OpenClaw adapter was missing this entirely — every tool_use event
landed in the timeline without stdout, error state, or runtime, making
the OpenClaw view nearly useless for postmortem.

Two changes here:

1. extractToolUse now matches OpenClaw's native shape (type:"toolCall",
   arguments:{...}) in addition to the Anthropic-style (type:"tool_use",
   input:{...}) it previously assumed. Real ~/.openclaw sessions use
   the former, so most tool_use events were silently being dropped at
   the translateSession stage too. Captures the toolCall id as
   details.toolUseId.

2. New handleOpenClawToolResult harvester recognizes
   message.role:"toolResult" turns, looks up the pending tool_use by
   toolCallId, and enriches with toolResult / toolError / durationMs.
   Mirrors the Claude adapter's pendingToolUses + orphanResults
   pattern, including bounded maps to survive crashes mid-turn.

Also handles file/cmd field synonyms (file, cmd) because OpenClaw's
toolCalls aren't strictly aligned with the file_path / command names.
…age.txt (AUR-242) (#12)

The TRIAGE mode in AGENT_DIRECTIVES.md §5 referenced
~/.agentwatch-bot/last-triage.txt with the parenthetical "create it on
first run with `now`" — too vague. On a fresh dev machine, after a
manual edit, or if the file gets wiped between runs, the
`gh search "created:>$LAST_TRIAGE"` query silently breaks (empty
$LAST_TRIAGE → query parses but returns nothing useful, OR `cat`
errors and the agent never recovers).

Replaced the vague hint with an explicit defensive-init bash block
that:
- mkdir -p the dotdir
- writes a now-minus-24h ISO timestamp if the file is missing
- validates the file content matches an ISO-8601 UTC pattern, rewrites
  the default if not
- splits the trailing Z for the gh search variant that needs it

Also added a session-start checklist note in
.agentwatch-bot/prompt.md pointing the bot at this block before the
first gh search of any TRIAGE run.
…-214) (#13)

The acceptance criterion for AUR-214 was: research whether Gemini CLI or
OpenClaw persist any compaction marker; if not, document the
structural limitation in the relevant feature contracts.

Result of the research: neither does.

Gemini chat JSON (~/.gemini/tmp/<proj>/chats/session-*.json) carries
only user/gemini/error/info message types. The CLI's /compress command
rewrites context in-place but writes nothing distinguishable into the
file. Survey of every active session on this dev machine — no
compaction-shaped record.

OpenClaw session JSONL records session, message, model_change,
thinking_level_change, custom, custom_message. The custom subtypes in
the wild are model-snapshot, openclaw:bootstrap-context:full,
openclaw:prompt-error, plus openclaw.sessions_yield as a
parent→child handoff. None of these are context resets.

Added docs/features/compaction-visualizer.md with the per-agent
support matrix and a note on what shape a future marker would need to
take to be wired through. Explicitly warned against synthesizing
compaction from indirect signals (cacheRead drop, model swap) — false
positives are worse than honest blanks here.

No source changes; the visualizer already supports compaction events
when an adapter emits them.
Cuts the version + CHANGELOG for the eight fixes that have been
sitting on main since v0.0.4 (AUR-214, AUR-216, AUR-217, AUR-227,
AUR-228, AUR-241, AUR-242, plus version-from-package.json + chmod
on bin). No new features.
* feat(store): SQLite event store with FTS5 (AUR-263)

Adds src/store/sqlite.ts as the persistent source of truth for every
event the adapters emit. Replaces the 4 MB rolling backfill with an
indexed, queryable, FTS-searchable store at ~/.agentwatch/events.db.

Three tables — events (canonical AgentEvent), sessions (auto-aggregated
via insert trigger: cost, ts range, count, project), tool_calls (tool,
duration, error). FTS5 virtual table over prompt/response/thinking/
tool_result/summary with porter+unicode61 tokenization. Versioned
migrations (schema_version). WAL + synchronous=NORMAL.

Wires the store into both TUI and serve-mode EventSinks via a write-
through wrapper (src/store/wire.ts) — failures are logged once and
never propagated. Adds 'agentwatch prune --older-than-days N' CLI
subcommand and a new 'history' mode on POST /api/search backed by FTS5.

Out of scope (filed as follow-ups): TUI reducer becoming a thin cache
over the store, full /sessions /projects route migration, schema
migration tooling for end users.

Bench: ingests 10k events in ~430ms on M1 air. 279/279 tests pass.

* feat(daemon): background capture daemon with launchd + systemd (AUR-262) (#17)

Closes the largest stated limitation — agentwatch was a viewer, not a
daemon. `agentwatch daemon start | stop | status | logs` now installs
a user-level service that runs the adapter pipeline 24/7 and writes
every event into the SQLite store at ~/.agentwatch/events.db.

src/daemon/install.ts renders the launchd plist (macOS) or systemd
user unit (Linux); the unit invokes `agentwatch daemon run`, the
internal foreground subcommand that:

  1. Acquires a PID lock at ~/.agentwatch/daemon.pid (stale-pid aware
     via a process-alive probe; auto-rotates the lock if the holder
     died without cleaning up).
  2. Opens the SQLite store and starts every adapter wired through
     wrapSinkWithStore so events persist on disk.
  3. Writes its start time at ~/.agentwatch/daemon.started_at so
     `status` can compute uptime.
  4. Drains cleanly on SIGTERM / SIGINT / SIGHUP — closes adapters,
     closes the store, releases the lock.
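
Step 1 above (a PID lock with a process-alive probe) could look roughly like the sketch below. The function names are hypothetical, and the check-then-write sequence is not atomic, so treat it as an illustration of the stale-pid idea rather than the code in src/daemon/.

```typescript
import * as fs from "node:fs";

// Signal 0 checks that the process exists without delivering a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Returns true if the lock was acquired, false if a live holder exists.
function acquirePidLock(lockPath: string): boolean {
  if (fs.existsSync(lockPath)) {
    const holder = Number.parseInt(fs.readFileSync(lockPath, "utf8").trim(), 10);
    const valid = Number.isInteger(holder) && holder > 0;
    if (valid && holder !== process.pid && isProcessAlive(holder)) {
      return false;
    }
    // Stale or unreadable lock: fall through and take it over.
  }
  fs.writeFileSync(lockPath, String(process.pid));
  return true;
}

function releasePidLock(lockPath: string): void {
  fs.rmSync(lockPath, { force: true });
}
```

A shutdown handler for SIGTERM / SIGINT / SIGHUP would call `releasePidLock` last, after the adapters and the store are closed.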

src/daemon/log-rotate.ts is an append-only writer with a single
rotation slot at 10 MB (the file rolls to .log.1; older history is
out of scope on purpose).
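
A single-slot rotation like the one described can be sketched in a few lines. The `appendWithRotation` name and the choice to rotate just before a write would cross the threshold are assumptions; the source only states the 10 MB limit and the `.log.1` slot.

```typescript
import * as fs from "node:fs";

const MAX_BYTES = 10 * 1024 * 1024; // 10 MB threshold from the description above

function appendWithRotation(logPath: string, line: string, maxBytes: number = MAX_BYTES): void {
  let size = 0;
  try {
    size = fs.statSync(logPath).size;
  } catch {
    // No file yet, so there is nothing to rotate.
  }
  if (size > 0 && size + Buffer.byteLength(line) > maxBytes) {
    // Single slot: the rename replaces any previous .1 file.
    fs.renameSync(logPath, `${logPath}.1`);
  }
  fs.appendFileSync(logPath, line);
}
```

With one slot, disk usage stays bounded at roughly twice the threshold, at the cost of losing anything older than the previous file.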

src/daemon/index.ts is the controller that dispatches start / stop /
status / logs / run. `status` reports running yes/no, PID, uptime,
events captured, last event ts, and DB size by querying the same
SQLite store the daemon writes to.

The TUI and `agentwatch serve` keep working as clients of the same
store, so events captured overnight are visible the moment you open
them. Stacked on PR #16 (AUR-263 SQLite store) — the daemon depends
on `src/store/`. 10 new tests cover log rotation, plist + systemd
unit rendering, and the process-liveness probe; full suite 289/289.

* feat(classify): per-event activity classifier (AUR-264) (#18)

Adds a 12-category activity classifier that answers the CodeBurn-viral
question 'where is my spend going?' Categories: coding, debugging,
exploration, planning, refactor, testing, docs, chat, config, review,
devops, research.

src/classify/activity.ts is a heuristic ladder — no ML dep. Rules
combine file-extension signals (.test.ts → testing, .md → docs,
package.json → config), tool-name signals (Grep → exploration,
WebFetch → research, kubectl/docker/terraform → devops), shell-
command signals (npm test, pytest, git diff, eslint), and prompt/
response keyword signals (refactor, error, audit, plan). Each rule
contributes a weighted score; argmax wins. Empty input falls through
to chat.
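
To make the "weighted score, argmax wins" idea concrete, here is a reduced sketch with a handful of the signals listed above. The weights, the `Signals` shape and the rule table are invented for illustration; the real ladder in src/classify/activity.ts covers all 12 categories and may score differently.

```typescript
type Category = "testing" | "docs" | "config" | "exploration" | "research" | "devops" | "chat";

interface Signals { files?: string[]; tool?: string; command?: string }
interface Rule { category: Category; weight: number; matches(s: Signals): boolean }

// Illustrative rules and weights only.
const RULES: Rule[] = [
  { category: "testing", weight: 3, matches: (s) => (s.files ?? []).some((f) => f.endsWith(".test.ts")) },
  { category: "docs", weight: 2, matches: (s) => (s.files ?? []).some((f) => f.endsWith(".md")) },
  { category: "config", weight: 2, matches: (s) => (s.files ?? []).some((f) => f.endsWith("package.json")) },
  { category: "exploration", weight: 2, matches: (s) => s.tool === "Grep" },
  { category: "research", weight: 2, matches: (s) => s.tool === "WebFetch" },
  { category: "devops", weight: 3, matches: (s) => /^(kubectl|docker|terraform)\b/.test(s.command ?? "") },
];

function classify(signals: Signals): Category {
  // Sum the weights of every matching rule per category.
  const scores = new Map<Category, number>();
  for (const rule of RULES) {
    if (rule.matches(signals)) {
      scores.set(rule.category, (scores.get(rule.category) ?? 0) + rule.weight);
    }
  }
  // Argmax over the scores; with no matches the default "chat" survives.
  let best: Category = "chat";
  let bestScore = 0;
  scores.forEach((score, category) => {
    if (score > bestScore) {
      best = category;
      bestScore = score;
    }
  });
  return best;
}
```

For example, an event touching both `a.test.ts` and `README.md` scores 3 for testing and 2 for docs under these made-up weights, so it lands in testing.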

src/classify/sink.ts wraps an EventSink so events land with
details.category attached before they reach the store or the TUI
reducer. Idempotent — won't overwrite an already-set category.
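
The idempotent wrapper can be pictured like this. It is a sketch under assumed types (`CategorizedEvent`, `CategorySink`) and an assumed function name, not the actual src/classify/sink.ts.

```typescript
// Hypothetical event and sink shapes for illustration.
interface CategorizedEvent { id: string; details?: { category?: string; [key: string]: unknown } }
interface CategorySink { emit(event: CategorizedEvent): void }

function withClassifierSketch(
  inner: CategorySink,
  classifyEvent: (event: CategorizedEvent) => string,
): CategorySink {
  return {
    emit(event) {
      if (event.details?.category !== undefined) {
        // Already classified upstream: pass through untouched.
        inner.emit(event);
        return;
      }
      // Attach the category on a copy so the caller's object is not mutated.
      inner.emit({ ...event, details: { ...event.details, category: classifyEvent(event) } });
    },
  };
}
```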

Wired into both TUI and serve mode in front of the store wrapper:
adapters → withClassifier → wrapSinkWithStore → innerSink.

Schema v2 migration: ALTER TABLE events ADD COLUMN category TEXT,
new index idx_events_category, store has activityBySession() and
activityByProject() that GROUP BY category.

New API: GET /api/sessions/:id/activity, GET /api/projects/:name/activity.

36 classifier tests (per-rule cases + sink wrapper + 75% top-1
agreement on the synthetic dataset) + 3 store activity-rollup tests.
Full suite 315/315 pass.

Out of scope (defer to follow-up): React activity views in the web
UI, TUI EventDetail surface for category, the full 200-turn hand-
labelled validation harness — synthetic dataset substitutes for v0.1
and we re-evaluate on real-data accuracy after dogfood.

Stacked on PR #16 (AUR-263 SQLite store) — uses the v2 migration.
…get) (#21)

* feat(yield): git-correlation $/commit + $/line views (AUR-265)

src/git/correlate.ts walks each project's git log via spawnSync and
pairs commits with sessions whose [first_ts, last_ts + 30min] window
contains the commit's author date. Read-only — git verbs are
allow-listed (log / rev-parse / worktree / show / diff / blame /
status / config / branch / remote); any other verb throws.
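
The allow-list guard might look like the following sketch, which would run before any `spawnSync("git", args)` call. The `assertReadOnlyGit` name is an assumption; the verb list is taken from the paragraph above.

```typescript
const READ_ONLY_VERBS = new Set([
  "log", "rev-parse", "worktree", "show", "diff",
  "blame", "status", "config", "branch", "remote",
]);

// Throws unless the first argument is one of the allow-listed git verbs.
function assertReadOnlyGit(args: readonly string[]): void {
  const verb = args[0];
  if (verb === undefined || !READ_ONLY_VERBS.has(verb)) {
    throw new Error(`git verb not allowed: ${verb ?? "(none)"}`);
  }
}
```

Note that a check on the verb alone does not rule out write forms of an allowed verb (for instance `git branch -d`), so callers still need to control the remaining arguments.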

Per-session yield: cost-per-commit, cost-per-line-changed, total
insertions/deletions/files. Per-project yield: weekly cost-per-commit
trend + a sorted 'spend without commit' list of sessions that burned
dollars but produced no commits in window.
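
As a rough illustration of the pairing rule and the per-session numbers, assuming timestamps in epoch milliseconds and an inclusive window (both assumptions, as are the type and function names):

```typescript
interface SessionWindow { id: string; firstTs: number; lastTs: number; costUsd: number }
interface CommitInfo { hash: string; authorTs: number; insertions: number; deletions: number }

const GRACE_MS = 30 * 60 * 1000; // the 30-minute tail after the session's last event

// A commit belongs to a session if its author date falls in [firstTs, lastTs + 30min].
function commitsInWindow(session: SessionWindow, commits: CommitInfo[]): CommitInfo[] {
  return commits.filter(
    (c) => c.authorTs >= session.firstTs && c.authorTs <= session.lastTs + GRACE_MS,
  );
}

function sessionYield(session: SessionWindow, commits: CommitInfo[]) {
  const matched = commitsInWindow(session, commits);
  const linesChanged = matched.reduce((n, c) => n + c.insertions + c.deletions, 0);
  return {
    commits: matched.length,
    linesChanged,
    // null rather than Infinity when there is nothing to divide by
    costPerCommit: matched.length > 0 ? session.costUsd / matched.length : null,
    costPerLine: linesChanged > 0 ? session.costUsd / linesChanged : null,
  };
}
```

A session whose `costPerCommit` comes back `null` while `costUsd` is positive is exactly the kind of entry the 'spend without commit' list would surface.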

Worktree de-dup via gitCommonDir() so two checkouts of the same repo
share a backing repo. Project-name → git-root resolution by walking
WORKSPACE_ROOT one level looking for a basename match with a .git
entry.

New routes: GET /api/sessions/:id/yield, GET /api/projects/:name/yield.
Both return ok:false with a reason when there's no store, no project
tag, or no git repo under WORKSPACE_ROOT.

14 vitest tests using real git repos via execSync (gitInit + commit
helpers). Full suite 293/293 pass.

Out of scope (filed as follow-up): React /sessions/:id/yield and
/projects/:name/yield views in the web UI; the API ships now and the
visualization is its own ticket. Stacked on PR #16 (AUR-263 SQLite
store) — sessions/projects come from the store.

* feat(adapters): Claude Code native hooks adapter (AUR-266)

Anthropic's hooks API delivers events about 1–2 seconds faster than
the JSONL transcript and never misses a sub-event. agentwatch can now
register itself as a hook and have Claude POST every event into our
pipeline in real time.

src/adapters/claude-hooks.ts registers POST /api/hooks/:event on the
fastify app. translateHook() maps the 10 known Claude event types
(SessionStart, SessionEnd, UserPromptSubmit, PreToolUse, PostToolUse,
Stop, SubagentStop, PreCompact, PostCompact, Notification) into our
canonical AgentEvent shape. Unknown future events fall through to a
generic tool_call so a Claude release adding new hook types doesn't
silently drop data.

src/adapters/claude-hooks-install.ts owns ~/.claude/settings.json
round-trips. Stanzas are tagged with the marker comment
'[agentwatch-managed]' so uninstall only touches our stanzas; any
user-configured hooks are preserved.

src/adapters/hooks-dedup.ts is a 5-second-window registry shared
between the hooks adapter (writes) and the JSONL adapter sink wrapper
(reads). withClaudeHookDedup() drops claude-code events that arrive
without details.source === "hooks" when their (sessionId, toolUseId)
signature was marked in the last 5s. Hook events bypass the check
because they're stamped 'hooks' as the source.
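
The dedup window can be sketched as a small registry plus a sink wrapper. Everything below is an illustration under assumptions: the `HookEvent` shape, the class and function names, and the injectable clock are not taken from src/adapters/hooks-dedup.ts.

```typescript
// Registry the hooks adapter writes to and the JSONL sink wrapper reads from.
class HookDedupRegistry {
  private readonly seen = new Map<string, number>();
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(windowMs: number = 5000, now: () => number = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
  }

  private key(sessionId: string, toolUseId: string): string {
    return `${sessionId}\u0000${toolUseId}`;
  }

  mark(sessionId: string, toolUseId: string): void {
    this.seen.set(this.key(sessionId, toolUseId), this.now());
  }

  wasMarkedRecently(sessionId: string, toolUseId: string): boolean {
    const k = this.key(sessionId, toolUseId);
    const ts = this.seen.get(k);
    if (ts === undefined) return false;
    if (this.now() - ts > this.windowMs) {
      this.seen.delete(k); // expired entry, clean up lazily
      return false;
    }
    return true;
  }
}

// Hypothetical event and sink shapes for illustration.
interface HookEvent { agent: string; sessionId: string; toolUseId?: string; details?: { source?: string } }
interface HookSink { emit(event: HookEvent): void }

function withClaudeHookDedupSketch(inner: HookSink, registry: HookDedupRegistry): HookSink {
  return {
    emit(event) {
      const fromHooks = event.details?.source === "hooks";
      if (
        event.agent === "claude-code" && !fromHooks && event.toolUseId !== undefined &&
        registry.wasMarkedRecently(event.sessionId, event.toolUseId)
      ) {
        return; // JSONL copy of an event the hook path already delivered
      }
      inner.emit(event);
    },
  };
}
```

Passing the clock in as a parameter keeps the 5-second window testable without real sleeps.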

Wiring:
  adapters → withClaudeHookDedup → wrapSinkWithStore → innerSink
  hook route emit → withClaudeHookDedup (bypass) → ... → innerSink

ServerHandle gains setHookSink(sink) so the route looks up the sink
lazily — that lets the TUI start the server first and wire the sink
once adapters are up.

CLI: agentwatch hooks {install | uninstall | status}.
Doctor reports 'claude code hooks: installed | not-installed | partial'.

19 vitest tests cover the dedup registry, the dedup sink wrapper,
hook payload translation for every known event type, settings.json
install/uninstall round-trips with user-hook preservation, and status
reporting. Full suite 298/298 pass.

docs/features/claude-hooks.md documents the install flow + dedup
semantics + why both hook and JSONL paths run together.

Out of scope: hook-based blocking (PreToolUse decision: block) — the
v0.3 control-plane bet. Hooks for non-Claude agents (none of them
ship a hooks API).

Stacked on PR #16 (AUR-263 SQLite store).

Adds the minimal glama.json metadata file ($schema + maintainers)
that Glama's profile completion check requires, and embeds the score
+ card badges in the README so the registry status is visible from
the repo front page.

Triggered by https://github.com/punkpeye/awesome-mcp-servers#5665 —
the awesome-mcp-servers PR is gated on Glama having a quality score
for the server, which can only happen once Glama is able to find and
parse glama.json.

Pairs with the v0.0.5 GitHub release I just cut (Glama also requires
at least one tagged release to count the project as 'maintained').
…#23)

Routes now read from the SQLite store if passed via StartServerOptions, falling back to the in-memory ring buffer if not. This enables the UI to query events that have fallen out of the live tail window.
#26)

Budget rollups, anomaly histories, and sub-agent child counts now query
the SQLite store via the new EventStore.listRecentEvents() instead of
iterating the 500-event React buffer. The live tail still drives the
timeline view; only the derived passes change source.

App.tsx also seeds the live tail at launch from store.listRecentEvents(
{ limit: 500 }) so the timeline isn't empty until adapter JSONL backfill
re-emits. The wrapSinkWithStore dedup guards against id collisions when
adapters do re-emit.

Adds 4 vitest cases covering desc/asc order, sinceTs filter, and limit
clamp. Full suite 368/368 (was 364).

Visualizes the per-event activity classifier (AUR-264) data:

- /sessions/:id/activity — stacked bar of events × category in 1-min
  buckets across the session's lifetime, plus a per-category cost +
  count + percentage table. Bucketed client-side from the session's
  events so we don't need a new time-series API.
- /projects/:name/activity — pie of events-by-category + pie of
  cost-by-category + per-category breakdown table including the new
  sessionsTouched column. Empty-state for unclassified projects.
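
The client-side bucketing for the stacked bar could be done along these lines. This is a sketch assuming each event carries an epoch-millisecond `ts` and an optional `category`; the `TimedEvent` type, the `bucketByMinute` name and the "unclassified" label are assumptions.

```typescript
interface TimedEvent { ts: number; category?: string }
interface MinuteBucket { bucketStart: number; counts: Record<string, number> }

const BUCKET_MS = 60000; // 1-minute buckets

function bucketByMinute(events: TimedEvent[]): MinuteBucket[] {
  const buckets = new Map<number, Record<string, number>>();
  for (const e of events) {
    // Floor the timestamp to the start of its minute.
    const start = Math.floor(e.ts / BUCKET_MS) * BUCKET_MS;
    const counts = buckets.get(start) ?? {};
    const cat = e.category ?? "unclassified";
    counts[cat] = (counts[cat] ?? 0) + 1;
    buckets.set(start, counts);
  }
  // Return buckets in chronological order, ready for a stacked bar chart.
  return Array.from(buckets.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([bucketStart, counts]) => ({ bucketStart, counts }));
}
```

Minutes with no events produce no bucket here; a chart that needs a continuous axis would have to fill those gaps itself.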

Both routes are lazy-loaded behind the existing Suspense fallback,
matching SessionTokens/Trends/SessionGraph. Nav links surface from
Session.tsx and ProjectDetail.tsx.

Extends activityByProject SQL with COUNT(DISTINCT session_id) and
adds sessionsTouched to ActivityBucket. EventDetails type in the web
client now exposes details.category. +1 vitest case for the new column.
Full suite 369/369 (was 368).

Visualizes the git-correlation $/commit + $/line endpoints (AUR-265).

- /sessions/:id/yield — commits-in-window table (hash, author, subject,
  files, +/-) with totals row, plus $/commit + $/line + cost +
  lines-changed callouts. Empty state when no commits landed during
  the session window.
- /projects/:name/yield — weekly composed chart (cost bars + commit
  bars + $/commit line overlay), plus a sortable 'spend without commit'
  session list (cost / lines / files). Sortable column buttons.

Both views handle the API's ok:false reasons (no store, no project tag,
not a git repo under WORKSPACE_ROOT) with a helpful explainer.

Lazy-loaded behind Suspense in main.tsx; nav links from Session.tsx
and ProjectDetail.tsx.

Full suite 369/369; typecheck clean (one pre-existing Logs.tsx unused
var unrelated to this PR).

Adds Dockerfile with Node 22 + gh CLI over Debian bookworm-slim to support
running the autonomous agent fully containerized via OpenClaw sandbox.mode.
Includes runbook for human setup and updates.
@mishanefedov
Owner Author

Closing — this PR only added the internal autonomous-bot sandbox (.agentwatch-bot/), which has been removed from the repo. Not part of agentwatch itself.

@mishanefedov mishanefedov deleted the agent/aur-218-sandbox-docker branch May 25, 2026 22:27