Skip to content

feat: improve Antigravity token extraction and sync metrics#661

Open
vikram wants to merge 1 commit into
kenn-io:mainfrom
vikram:fix/antigravity-session-tools-analysis
Open

feat: improve Antigravity token extraction and sync metrics#661
vikram wants to merge 1 commit into
kenn-io:mainfrom
vikram:fix/antigravity-session-tools-analysis

Conversation

@vikram

@vikram vikram commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Improve Antigravity session parser to extract richer tool-call metadata and token usage from session transcripts
  • Expand test coverage for the Antigravity parser with new edge cases and structured assertions
  • Surface sync progress metrics in the sync engine for observability

Changed

  • internal/parser/antigravity.go: Enhanced token extraction logic and tool-call analysis
  • internal/parser/antigravity_test.go: 333 lines of new/improved test cases
  • internal/sync/engine.go / progress.go: Added sync metric hooks
  • Minor doc and config updates (AGENTS.md, SECURITY.md, Makefile, huma routes)

@roborev-ci

roborev-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown

roborev: Combined Review (a3f544c)

Summary verdict: one medium correctness issue remains; no security issues were found.

Medium

  • internal/parser/antigravity.go:564 - extractAntigravityToolCalls scans every short string in the assistant payload against the global tool taxonomy and fabricates an ID when no UUID/input is nearby. Generic Antigravity payload strings that match tools from other agents, such as read, write, message, process, or strings containing subagent, can be persisted as real tool calls and set HasToolUse, skewing tool and timing analytics.
    • Fix: Scope matching to observed Antigravity tool names/structures and require stronger evidence, such as an adjacent tool ID/input object or known tool-call parent shape. Add a regression test for standalone generic strings.

Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 5m48s), codex_security (codex/security, done, 3m6s) | Total: 9m3s

@vikram vikram force-pushed the fix/antigravity-session-tools-analysis branch from a3f544c to adc0b8e Compare June 12, 2026 20:14
@roborev-ci

roborev-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown

roborev: Combined Review (adc0b8e)

Summary verdict: one medium parser correctness issue should be fixed before merging.

Medium

  • Location: internal/parser/antigravity.go:506
    • Problem: Tool-only assistant steps are dropped before tool extraction runs. agProtoCollectStrings(fields, 20) filters out short tool names/IDs/short JSON args, then len(strs) == 0 returns false, so an otherwise valid assistant tool call with empty prose is never emitted.
    • Fix: Determine role and extract assistant tool calls before the displayable-content guard, and allow a message when either content or tool calls are present.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 4m17s), codex_security (codex/security, done, 2m55s) | Total: 7m18s

@vikram vikram force-pushed the fix/antigravity-session-tools-analysis branch 2 times, most recently from ec97a0c to 794a891 Compare June 12, 2026 20:55
@mjacobs

mjacobs commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Thanks for digging into this area — the SidecarRejected metric is a welcome observability addition, and the Makefile GOLANGCI_LINT resolution fix is handy. I have one substantive concern about the token-extraction change, backed by data from real archives, plus a few smaller points.

Token block semantics: what real archives show

The PR changes billableOutput from output + reasoning to output on the grounds that candidates already include reasoning, and reads field 4 as cache-read. I wanted to verify this, because #647 gave us ground truth that didn't exist when #619 landed: agy-reader trajectory sidecars carry the daemon's own per-generation accounting (generatorMetadata[].chatModel.usage).

I cross-checked 15 real CLI sessions (June 2–11, spanning several agy releases) that have both a .db and a .trajectory.json: decode every gen_metadata token block with the same wire-walk, then look for a same-session sidecar generation with matching counts. (gen_metadata row idx doesn't align with sidecar stepIndices, so matching is by value; the matches are exact on all three fields simultaneously, which makes coincidence implausible.) Results across 550/550 decodable blocks:

block field current main (#619) reads it as this PR reads it as matches sidecar ground truth as
f2 candidates (output excl. thinking) output incl. thinking uncached input (inputTokens)
f3 thinking thinking total output incl. thinking (outputTokens)
f4 cache-read always 0 / absent
f5 input input cache-read (cacheReadTokens)

Example: block (f2=5529, f3=72, f5=38885) ↔ sidecar (inputTokens=5529, outputTokens=72, cacheReadTokens=38885). 0/550 blocks matched the current reading, 0/550 matched this PR's reading, 0 unexplained.

Two corollaries:

  • This explains sessions where extractTokenUsage finds no block at all: the heuristic requires f5, but f5 is cache-read, which is omitted from the wire when a fresh session has no cache hits yet. All three of my June-11 single-generation sessions hit exactly that.
  • The fix this area needs is a remap (input=f2, output=f3, cacheRead=f5, context=f2+f5; no per-field reasoning available in the block), rather than an adjustment to the fold. As written, the PR's test expectations would bake the f4/f5 reading in.

Happy to share the verification test (~150 lines against the internal/parser helpers) or collaborate on the remap — I have the validation data on hand.

Smaller points

  • dataVersion: this changes parser output for existing sessions (token numbers, new tool calls), so it needs a dataVersion bump (main is at 40) to trigger re-parse.
  • Peak context with caching: applyUsageEventTokenTotals (types.go) derives PeakContextTokens from ev.InputTokens and overwrites message-derived totals whenever events exist. With InputTokens = input − cached, peak context collapses to the uncached portion for any session with cache hits. The new tests all pass cached=0, so the cached>0 path isn't exercised end-to-end.
  • Tool-name matching scope: matching payload strings against the global NormalizeToolCategory taxonomy means generic lowercase strings that are tool names for other agents (read, write, task, view, exec, process, …) become Antigravity tool calls when they appear standalone. Scoping the match to the Antigravity tool-name set would remove most of the false-positive surface (roborev flagged this too). Two adjacent mechanics worth a look: the ±2 UUID window lets call N pick up call N+1's UUID when its own is missing, and (per roborev) tool-only steps are dropped by the displayable-content guard before extraction runs.
  • Sidecar-covered sessions already get structured tool calls from the trajectory's own toolCalls JSON, so the heuristic only serves the .db-fallback and IDE paths — worth weighing how much heuristic surface that's worth, given Prefer agy-reader trajectory sidecars for Antigravity CLI sessions #620 moved the CLI path sidecar-first for exactly this reason.
  • Lint is failing on antigravity.go:539 (staticcheck QF1001).

Since the metric is independent of the parser questions, splitting SidecarRejected (+ the Makefile fix) into its own PR would let that part land quickly.

@vikram vikram force-pushed the fix/antigravity-session-tools-analysis branch from 794a891 to 1bc395e Compare June 12, 2026 21:12
@vikram

vikram commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Thanks I'm looking into it. agree can separate the sidecarRejected piece into a separate PR.

@roborev-ci

roborev-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown

roborev: Combined Review (1bc395e)

The PR has one medium-severity issue: Antigravity parser changes need a data-version bump before merge.

Medium

  • internal/db/db.go:186: The Antigravity parser now changes persisted output totals, cache-read usage events, and extracted tool calls, but dataVersion remains 40. Existing databases with Antigravity rows already stamped at the current version will not be resynced, so corrected usage/cost data and new tool calls only appear for sessions whose source files change later.

    Fix: Bump dataVersion and add a note describing the Antigravity token/cache/tool-call parser change so existing rows are reparsed.


Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 3m27s), codex_security (codex/security, done, 1m57s) | Total: 5m32s

@vikram vikram force-pushed the fix/antigravity-session-tools-analysis branch from 1bc395e to a5d67dd Compare June 12, 2026 21:34
@roborev-ci

roborev-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown

roborev: Combined Review (a5d67dd)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 5m4s), codex_security (codex/security, done, 1m55s) | Total: 6m59s

@vikram vikram force-pushed the fix/antigravity-session-tools-analysis branch from a5d67dd to 54d2c5a Compare June 12, 2026 22:07
- Limit tool matching to a known whitelist to prevent false positives.
- Preserve tool-only steps instead of dropping them.
- Correct protobuf token field mapping based on sidecar verification.
- Update CLI usage pricing and tests accordingly.
@vikram vikram force-pushed the fix/antigravity-session-tools-analysis branch from 54d2c5a to 29170a8 Compare June 12, 2026 22:19
@roborev-ci

roborev-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown

roborev: Combined Review (29170a8)

Summary verdict: One medium parser correctness issue needs fixing; no security issues were found.

Medium

  • Location: internal/parser/antigravity.go:420
  • Problem: tokenBlockFrom requires field 2 even though the new mapping treats it as uncached input, which can legitimately be zero and omitted by proto3 when all context comes from cache. Those cache-only generations would reject the whole token block and drop usage.
  • Fix: Treat missing field 2 as zero when the rest of the token block is valid, and add a regression test with field 2 omitted and field 5 present.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 6m26s), codex_security (codex/security, done, 1m26s) | Total: 7m59s

@mjacobs

mjacobs commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Re-ran the sidecar cross-validation harness against 29170a87, this time through the PR's own extractTokenUsage rather than a standalone decoder: 645/645 blocks exact-match the sidecar ground truth (uncachedInput, totalOutput, cacheRead all three values simultaneously), zero mismatches. That's 95 more blocks than the old f5-required matcher could find — the optional-f5 change recovers every cache-miss generation, and all three of my June-11 single-generation sessions that previously produced no usage at all now decode correctly.

The tool-call allowlist, pre-guard extraction for tool-only steps, intervening-tool checks on the ID/JSON windows, dataVersion 41, and the cached>0 end-to-end tests all look right to me as well. Thanks for the quick and thorough turnaround.

On roborev's remaining note (f2 required, rejecting a fully-cache-served generation where uncached input is 0): I'd treat that as an accepted tradeoff rather than a change. Every real turn carries at least the new user message uncached — all 645 observed blocks have f2 present — and with f5 now optional, f2 is the anchor that keeps decoy blocks (latency counters etc.) from passing the plausibility check. Loosening it would trade a theoretical gap for a real false-positive surface. Might be worth a one-line comment in tokenBlockFrom recording that decision.

Thanks, Vikram!

@vikram

vikram commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Thank you for your help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants