Skip to content

cli: add parse-diff report mode to vet parser changes against a real archive#662

Open
mjacobs wants to merge 15 commits into
kenn-io:mainfrom
mjacobs:feat/reparse-diff
Open

cli: add parse-diff report mode to vet parser changes against a real archive#662
mjacobs wants to merge 15 commits into
kenn-io:mainfrom
mjacobs:feat/reparse-diff

Conversation

@mjacobs

@mjacobs mjacobs commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Closes #649.

Adds agentsview parse-diff, a report-only CLI mode that re-parses session
source files with the current binary, runs the result through the same
normalization the sync engine applies, and compares it against the stored
SQLite rows. It writes no session, skip-cache, or sync-state data; per the
issue it turns each parser PR's "I hope this is right" into a reviewable
artifact and surfaces upstream format drift the day it lands.

agentsview parse-diff [--agent X ...] [--limit N] [--fail-on-change] [--json] [-v]

The report classifies every session into one bucket — identical, changed,
pending_resync, new_on_disk, parse_error, excluded_by_parser,
transient_needs_retry, or skipped (with a reason) — so genuine parser drift
is separated from pipeline history. Human output is a summary plus a
field-affected histogram plus a changed-session drill-down; --json emits the
same model for scripting and PR attachment. --fail-on-change exits non-zero
on real changes or parse errors.

How it works

The comparison runs through the engine's real write-path normalization
(prepareSessionWrite -> toDBMessages/pairAndFilter, toDBSession,
postFilterCounts, toDBUsageEvents), so an unchanged parser produces zero
diffs against rows it wrote itself. Because that pipeline is unexported, the
differ lives in internal/sync (Engine.ParseDiff, NewDiffEngine) and the
CLI is a thin renderer over a renderer-agnostic report model in
parsediff_report.go.

Re-parse reuses processFile and the production worker pool. A new unexported
forceParse flag (set only by NewDiffEngine, which also forces Ephemeral)
disables the three skip layers — the in-memory/persisted skip cache, the
stored size/mtime/data_version short-circuits, and the incremental JSONL path
— so every discovered file is fully re-parsed against the populated archive.
Normal sync never sets the flag, so its behavior is unchanged.

Comparison is two-tier for speed: a per-session token fingerprint and a
content-length fingerprint gate the cheap identical path; only on a mismatch
are message rows loaded and the difference attributed to a specific field.
Usage events compare as an order-insensitive multiset. Known false-diff
sources are classified rather than reported as drift: pending-resync rows
(data version behind the binary), incremental-append termination_status
clears for Claude/Codex, orphaned/source-missing sessions, remote and
import-only sessions, and user-trashed rows.

Coverage and limitations

  • Covers every file-based agent, including Claude DAG forks, and the shared
    SQLite stores (Kiro data.sqlite3, db-mode OpenCode opencode.db) that
    discovery does not emit. Database-backed and import-only agents (Warp,
    Forge, Piebald, claude-ai, ChatGPT) have no on-disk source and are rejected.
  • Best run against a quiescent archive. Sessions still being written, and
    sessions last written through the incremental-append path, can show benign
    drift against a full re-parse; a freshly resynced archive is the cleanest
    baseline. --help and the report say so, and a run whose data version is
    ahead of the whole archive prints a warning that no drift can be detected.

Where to look

internal/sync/parsediff.go (loop + classification), parsediff_compare.go
(field-by-field comparison and the fingerprint tiers), parsediff_report.go
(the shared model), the forceParse guards in internal/sync/engine.go, and
cmd/agentsview/parse_diff.go (command + renderer). The feature was vetted
against a real ~726-session local archive across the supported agents.

mjacobs added 9 commits June 12, 2026 12:59
Report-only re-parse comparison (kenn-io#649): classification buckets, stable
JSON field names, per-field aggregation, and the ParseDiff/NewDiffEngine
signatures. Implementation follows.
…mode

Fill in the ParseDiff/NewDiffEngine contract (kenn-io#649): a report-only
re-parse that discovers files like syncAllLocked, force-parses them
through the normal worker pool and prepareSessionWrite normalization,
and compares the results against the stored archive without writing.

- engine.go: add Engine.forceParse and guard every stored-state skip
  (skip cache consult, shouldSkipFile/ByPath/OpenCodeByPath/Copilot,
  the OpenCode/Zed/Kiro per-meta SQLite skip loops, and incremental
  JSONL deltas). Normal sync never sets the flag, so behavior is
  unchanged when false.
- parsediff.go: agent resolution and validation, discovery, newest-
  first --limit sampling, a full stored-session index, per-file
  classification with parse-error attribution and exclusion handling,
  a deferred presence sweep for sessions the parser no longer emits,
  and a final skipped-session sweep with per-reason attribution.
- parsediff_compare.go: contract-field comparator with nil/empty and
  SanitizeUTF8 normalization, a two-tier message comparison gated on
  an in-memory twin of db.MessageTokenFingerprint, order-insensitive
  usage-event multiset comparison keyed by dedup key with token-total
  attribution, and the informational termination_status rule for
  incremental-append history.
- parsediff_compare_test.go: table-driven comparator tests, the
  fingerprint-twin parity test through the real bulk write pipeline,
  classification precedence, and the HasFailures truth table.
Adds 'agentsview parse-diff', a report-only data command that re-parses
session source files with the current binary via sync.NewDiffEngine and
renders the resulting ParseDiffReport. Supports --agent (validated
against the parser registry, rejecting database-backed and import-only
agents), --limit, --json, --verbose, and --fail-on-change (exit 1 on
report.HasFailures()). Human output is a tabwriter summary with a
changed-field histogram and a capped changed-sessions drill-down;
progress goes to stderr so stdout owns the report.
…gress

The empty-archive and JSON CLI tests redirected only AGENTSVIEW_DATA_DIR,
so the real engine discovered the developer machine's session files via
the HOME-relative default agent dirs. Isolate HOME and every per-agent
env override in end-to-end tests. The progress ticker now only renders
on an interactive stderr, and the report header derives its agent label
from the user's --agent selection instead of the engine's resolved
list.
Dogfooding against a real archive surfaced 21 presence false positives:
fork rows with data_version 0 from interrupted live syncs and orphan-
copied rows from earlier resyncs carry IDs the current parser no longer
derives. Those are pipeline history, not parser drift. The presence
sweep now routes rows below the current data version to pending_resync
(presence field still attached for drill-down); only a current-version
session that vanished from its file's parse output reports as changed.
…and metadata

Review against the real archive found the comparator could report a
session identical after its own tier-1 fingerprint proved the messages
differ. Three gaps closed:

- Message body changes (the most common parser drift) were in neither
  the token fingerprint nor the session compare. Add a content tier
  (db.MessageContentFingerprint vs the prepared messages' content-length
  aggregate) and surface per-message content drift as message_content.
- A token-fingerprint mismatch confined to role/timestamp/source ids or
  the sidechain/compact flags produced zero field diffs and classified
  identical. Tier 2 now attributes those via message_metadata, with a
  guaranteed fallback so a fingerprint inequality is never silently
  identical.
- started_at/ended_at were never compared; add them to the session-row
  fields.

Also: scope the termination_status stored-NULL informational downgrade
to the incremental-append agents (Claude, Codex) — for full-replace
agents a newly emitted status is real new-field detection; and fold
source+model into the dedup-keyed usage-event multiset so re-attribution
under a stable dedup key surfaces instead of passing silently.
…f loop

- Kiro data.sqlite3 and db-mode OpenCode opencode.db sessions are never
  emitted by DiscoverFunc (normal sync reaches them through dedicated
  phases), so parse-diff was reporting zero of them as 'not discovered'
  and an --agent kiro/opencode run could pass while comparing nothing.
  Synthesize discovery entries for those db paths; processKiro and
  processOpenCode already fan one db path out to every contained session
  under force-parse. Add a Kiro-SQLite coverage integration test.
- Reclassify trashed-but-reemitted sessions as skipped ('trashed in
  archive') instead of excluded_by_parser: the user trashed them and
  sync preserves trash, so they are neither a parser exclusion nor drift.
- Initialize report.Sessions to an empty slice so a clean JSON run
  serializes [] rather than null.
- Cancel the worker pool on error-return paths instead of parsing every
  remaining file just to drain it.
- Assert report.Agents unconditionally in the scope test (was a vacuous
  len-guarded check).
Review found several output gaps that undercut the report's scripted use:

- Parse errors trip --fail-on-change but were invisible in text output,
  even with -v: only the summary count rendered. List each failing file
  and its error so a failed run is actionable.
- When the binary's data version is ahead of the whole archive every
  session is pending_resync and no drift can be detected, yet the run
  read as a clean pass. Warn on a vacuous run (the headline trap for a
  parser PR that bumps dataVersion in the same commit), and render
  pending_resync field diffs under -v so the operator can still see what
  would change.
- The JSON report now carries db_path so an attached report names the
  archive it vetted.
- Qualify the help text: opening the archive still creates the DB and
  runs migrations, and parse-diff wants a quiescent (ideally resynced)
  archive because live and incrementally-synced sessions show benign
  drift against a full re-parse.
- Reword the excluded-by-parser summary note now that user-trashed rows
  are bucketed as skipped.

Tests cover parse-error rendering, the vacuous-resync warning, the
verbose pending-resync drill-down, JSON sessions:[]/db_path, and both
directions of the --fail-on-change exit conjunction.
@roborev-ci

roborev-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown

roborev: Combined Review (828c955)

Medium findings remain after deduplication.

Medium

  • internal/sync/parsediff_compare.go:64
    The comparator only loads stored messages when token/content fingerprints differ, but the token fingerprint does not include per-message role or timestamp while messageMetadataDiff claims to compare them. Role-only or timestamp-only parser drift can therefore be reported as identical.
    Fix: Use an exact ordered fingerprint covering every per-message field in the parse-diff contract, or load stored rows for exact metadata comparison. Add regression tests for role-only and timestamp-only drift.

  • internal/sync/parsediff.go:279
    OpenCode opencode.db is synthesized only when ResolveOpenCodeSource selects SQLite mode, but normal sync still reads opencode.db in storage-mode roots for DB-only legacy sessions. Mixed storage/SQLite OpenCode roots will have those sessions skipped as “not discovered” instead of compared, so --fail-on-change can pass without vetting them.
    Fix: Add any regular opencode.db under the OpenCode root regardless of source mode, relying on the existing storage-ID filtering to avoid duplicates. Add a mixed storage-plus-SQLite test.

  • internal/sync/engine.go:3732
    The shared-DB loops for OpenCode, Zed, and Kiro log per-session parse errors and continue. Since parse-diff now routes those databases through these loops, malformed/new sessions inside the DB can produce no DiffParseError and may be silently omitted or misclassified as presence drift.
    Fix: In force-parse/parse-diff mode, surface per-session parse errors through processResult so the report emits DiffParseError entries and --fail-on-change fails actionably.

  • cmd/agentsview/parse_diff.go:421
    The human parse-diff renderer prints session-derived values directly to the terminal, including f.Stored, f.Parsed, and f.Detail. These values can contain terminal control sequences from session files or synced archive rows, enabling output spoofing, phishing hyperlinks, cursor movement, or OSC 52 clipboard writes when running agentsview parse-diff --verbose. The same unsanitized pattern also appears for parse-error paths/reasons and session IDs in the human renderer.
    Fix: Apply the existing sanitizeTerminal helper to all human-mode parse-diff output derived from sessions, paths, errors, or report values before formatting. Leave --json output unchanged. Add a regression test with ESC/OSC bytes in a diffed field and assert the human output contains no control sequences.


Panel: ci_default_security | Synthesis: codex, 22s | Members: codex_default (codex/default, done, 8m42s), codex_security (codex/security, done, 3m46s) | Total: 12m50s

Address the four medium findings from the roborev review of PR kenn-io#662:

- Add a role/timestamp tier-1 fingerprint (MessageRoleTimeFingerprint
  plus its in-memory twin) so role-only or timestamp-only parser drift
  triggers the tier-2 row comparison instead of reporting identical.
  The token fingerprint shape is shared with the PG push fast-path and
  stays untouched.
- Synthesize opencode.db for parse-diff whenever it exists, not only
  in SQLite source mode. Normal sync reads DB-only legacy sessions in
  storage-mode roots too; mode-gating left them "not discovered".
- Surface per-session parse errors from the shared-DB fan-out loops
  (OpenCode, Zed, Kiro) through processResult.sessionErrs in
  force-parse mode, so parse-diff emits DiffParseError entries and
  --fail-on-change fails actionably instead of silently dropping or
  misclassifying the session as presence drift. Normal sync behavior
  is unchanged (log and continue).
- Sanitize all session-derived values in the human parse-diff renderer
  (agents, IDs, paths, field values, details, error reasons) with the
  existing sanitizeTerminal helper; --json stays raw.
@roborev-ci

roborev-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown

roborev: Combined Review (28f6b90)

Summary verdict: One medium correctness issue remains; no high or critical findings were reported.

Medium

  • internal/sync/parsediff_compare.go:59 - Message content comparison can false-pass. The fast path only compares content_length sum/max/min, and tier 2 only checks ContentLength, never Content; equal-length text rewrites or per-message length changes with the same aggregate are reported identical and do not trip --fail-on-change.
    • Fix: Use an ordered per-message content fingerprint, for example ordinal plus sanitized content hash and length, then compare the stored and parsed content when it differs. Add regression coverage for equal-length body changes and aggregate length collisions.

Panel: ci_default_security | Synthesis: codex, 8s | Members: codex_default (codex/default, done, 7m17s), codex_security (codex/security, done, 2m57s) | Total: 10m22s

The content fast-path compared only sum/max/min of content_length and
tier 2 only compared the content_length column, so equal-length body
rewrites — and per-message length changes whose aggregates collide,
such as two messages swapping lengths — were reported identical and
never tripped --fail-on-change.

Replace the aggregate check in the comparator with an exact ordered
fingerprint (MessageContentHashFingerprint: ordinal, content_length,
SHA-256 of the sanitized body) plus an in-memory twin pinned by the
existing twin-parity test. Tier 2 now compares the body as well as the
length column, attributing equal-length rewrites to message_content;
diff details report sizes only and never echo bodies. The aggregate
MessageContentFingerprint stays untouched for the PG push fast-path.
@roborev-ci

roborev-ci Bot commented Jun 13, 2026

Copy link
Copy Markdown

roborev: Combined Review (8cb306f)

Summary verdict: The PR has two medium correctness issues; no security vulnerabilities were reported.

Medium

  • internal/sync/parsediff_compare.go:37
    The comparator only checks a subset of parser-owned persisted data. It omits fields such as project/cwd/parent relationships, malformed/truncated flags, thinking/system/tool-use message flags, and tool call/result rows. A parser change affecting only those columns can still report the session as identical.
    Fix: Extend the parse-diff contract and fingerprints/diffs to cover all parser-owned session, message, and tool-call rows, or explicitly narrow the command contract and tests to the supported subset.

  • cmd/agentsview/parse_diff.go:159
    --fail-on-change exits successfully when the run is vacuous because every examined row is pending_resync, even though VacuousResync() says no drift can be detected and the run should not be treated as a passing vet.
    Fix: Include report.VacuousResync() in the fail condition for --fail-on-change, or provide a distinct non-zero outcome for unverifiable runs.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 6m51s), codex_security (codex/security, done, 1m58s) | Total: 8m58s

mjacobs added 4 commits June 12, 2026 21:58
A vacuous run -- where this binary's data version is ahead of every
examined session, so all of them are pending_resync -- detects no drift
by construction, so a clean result is not evidence the parser is
unchanged. Under --fail-on-change that previously exited 0, giving a
green CI gate that verified nothing.

Fold report.VacuousResync() into the --fail-on-change exit decision and
explain the non-zero exit on stderr, which also reaches --json callers
that never see the stdout warning. The decision is extracted into a pure
parseDiffExitFailure helper so the exit contract is unit-tested.
…, and tool calls

The comparator only checked a subset of parser-owned persisted data, so
a parser change confined to the omitted columns reported the session
identical. Close that blind spot:

- Session columns: compare cwd, git_branch, parent_session_id,
  relationship_type, source_session_id, source_version,
  parser_malformed_lines, and is_truncated. UpdateSessionIncremental
  leaves these frozen, so for the incremental-append agents (Claude,
  Codex) a difference is benign pipeline history and is marked
  informational, mirroring termination_status. The project column is
  deliberately not compared: it is rewritten from the mutable
  worktree_project_mappings table, so its parser-owned input cwd is
  compared instead.
- Message columns: fold is_system, has_thinking, has_tool_use, and
  thinking_text into the message_metadata comparison, backed by a new
  MessageFlagsFingerprint tier-1 signal.
- tool_calls rows: a new ToolCallFingerprint tier-1 signal plus tier-2
  attribution over tool_name, category, tool_use_id, input_json,
  skill_name, subagent_session_id, and result_content_length. The
  database-assigned ids and the possibly-blocked result body are not
  compared; result content is represented only by its length.

New fingerprints are parse-diff-only and never alter the message
fingerprints shared with the PG push fast-path. In-memory twins are
pinned against their DB queries through the real write pipeline.
Exercise the new session-metadata, message-flag, and tool-call
comparisons through real JSONL parsing and the full ParseDiff
orchestration:

- The clean-archive acid test now syncs a Claude session carrying a
  thinking block, a tool_use/tool_result pair, and a system message, so
  the flag and tool-call fingerprints round-trip against real parsed
  data instead of empty strings.
- A new drift test confirms tool_call and per-message flag changes
  classify as changed, a full-replace-agent git_branch change is real
  drift, and an incremental-append cwd change stays identical with an
  informational diff.
…cument tool_result_events exclusion

Adversarial review of the comparator extension surfaced two issues:

- session_name was strict-compared, but the incremental-append path
  freezes it (UpdateSessionIncremental never rewrites it) while Claude's
  /rename mutates it mid-session. A rename followed by an incremental
  sync would therefore report a false change and trip --fail-on-change on
  an unchanged parser. Mark it informational for the incremental-append
  agents, the same rule termination_status and the cwd/source metadata
  already follow; first_message and started_at derive from the session
  head and stay strict. markIncrementalHistory now preserves any existing
  Detail (e.g. a long-value rune count) instead of overwriting it.

- The tool_result_events table was omitted without disclosure while the
  comparator docs claimed full coverage. It is excluded deliberately: the
  blocked-category config clears those rows wholesale, so a strict
  comparison would be config-sensitive, and their dominant signal (the
  summarized content size) is already captured by the config-stable
  result_content_length. Document the exclusion alongside the existing
  project and result-body exclusions.

Also add attribution tests for the remaining toolCallDiff sub-fields
(category, tool_use_id, input_json, skill_name, subagent_session_id).
@mjacobs

mjacobs commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

Both findings addressed in feat/reparse-diff.

Vacuous --fail-on-change: the gate now also fails when VacuousResync() is true, with a stderr reason that reaches --json callers (which see no stdout warning). HasFailures() stays scoped to real drift / parse errors; vacuity is folded in at the CLI decision point (parseDiffExitFailure).

Comparator coverage: extended to the omitted parser-owned data — session metadata (cwd, git_branch, parent_session_id, relationship_type, source_session_id, source_version, parser_malformed_lines, is_truncated), per-message flags (is_system, has_thinking, has_tool_use, thinking_text, folded into message_metadata), and tool_calls rows (new MessageFlagsFingerprint / ToolCallFingerprint tier-1 signals plus tier-2 attribution; these are parse-diff-only and don't alter the message fingerprints shared with the PG-push fast path).

Fields the incremental-append path freezes are compared but marked informational for Claude/Codex, matching the existing termination_status rule, so they never produce a false --fail-on-change failure on a live archive. While extending this I found session_name had the same latent problem (frozen by UpdateSessionIncremental, yet mutable mid-session via Claude's /rename) and gave it the same treatment; first_message and started_at derive from the session head and stay strict.

Three values are excluded by documented design:

  • project — rewritten from the mutable worktree_project_mappings table, so its parser-owned input cwd is compared instead.
  • the tool_call result body and the tool_result_events rows — both redacted to empty by the blocked-category config (the events slice is cleared wholesale) and unbounded in size, so only the config-stable result_content_length (set before the redaction check) is compared.

Out of scope here, but worth flagging: the PG-push tool_calls fast path checks only count + result_content_length sum, with no per-field identity fingerprint, so a tool_name/input_json change with an unchanged count and sum could be skipped on push. Pre-existing; not touched by this PR.

In-memory fingerprint twins are pinned against their DB queries through the real write pipeline, the clean-archive acid test now exercises a session with thinking, a tool_use/tool_result pair, and a system message, and the new tier-1 fingerprints are proven to fire end-to-end.

@roborev-ci

roborev-ci Bot commented Jun 13, 2026

Copy link
Copy Markdown

roborev: Combined Review (1195b09)

Summary verdict: One medium correctness issue remains; no security issues were found.

Medium

  • internal/sync/parsediff_compare.go:1056: Dedup-keyed usage events are keyed only by dedup key/source/model, while token fields are compared only as aggregate totals. If a parser change reassigns token counts between two stable dedup-keyed events, the per-event data changes but count, totals, and multiset keys all remain equal, so parse-diff reports no usage drift and --fail-on-change can pass incorrectly.
    • Fix: Include parser-owned token fields in the dedup-keyed comparison, or compare per-dedup-key payloads rather than only aggregate totals.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 8m41s), codex_security (codex/security, done, 1m9s) | Total: 9m56s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

cli: re-parse diff report mode to vet parser changes against a real archive

1 participant