cli: add parse-diff report mode to vet parser changes against a real archive#662
cli: add parse-diff report mode to vet parser changes against a real archive#662mjacobs wants to merge 15 commits into
Conversation
Report-only re-parse comparison (kenn-io#649): classification buckets, stable JSON field names, per-field aggregation, and the ParseDiff/NewDiffEngine signatures. Implementation follows.
…mode Fill in the ParseDiff/NewDiffEngine contract (kenn-io#649): a report-only re-parse that discovers files like syncAllLocked, force-parses them through the normal worker pool and prepareSessionWrite normalization, and compares the results against the stored archive without writing. - engine.go: add Engine.forceParse and guard every stored-state skip (skip cache consult, shouldSkipFile/ByPath/OpenCodeByPath/Copilot, the OpenCode/Zed/Kiro per-meta SQLite skip loops, and incremental JSONL deltas). Normal sync never sets the flag, so behavior is unchanged when false. - parsediff.go: agent resolution and validation, discovery, newest- first --limit sampling, a full stored-session index, per-file classification with parse-error attribution and exclusion handling, a deferred presence sweep for sessions the parser no longer emits, and a final skipped-session sweep with per-reason attribution. - parsediff_compare.go: contract-field comparator with nil/empty and SanitizeUTF8 normalization, a two-tier message comparison gated on an in-memory twin of db.MessageTokenFingerprint, order-insensitive usage-event multiset comparison keyed by dedup key with token-total attribution, and the informational termination_status rule for incremental-append history. - parsediff_compare_test.go: table-driven comparator tests, the fingerprint-twin parity test through the real bulk write pipeline, classification precedence, and the HasFailures truth table.
Adds 'agentsview parse-diff', a report-only data command that re-parses session source files with the current binary via sync.NewDiffEngine and renders the resulting ParseDiffReport. Supports --agent (validated against the parser registry, rejecting database-backed and import-only agents), --limit, --json, --verbose, and --fail-on-change (exit 1 on report.HasFailures()). Human output is a tabwriter summary with a changed-field histogram and a capped changed-sessions drill-down; progress goes to stderr so stdout owns the report.
…gress The empty-archive and JSON CLI tests redirected only AGENTSVIEW_DATA_DIR, so the real engine discovered the developer machine's session files via the HOME-relative default agent dirs. Isolate HOME and every per-agent env override in end-to-end tests. The progress ticker now only renders on an interactive stderr, and the report header derives its agent label from the user's --agent selection instead of the engine's resolved list.
Dogfooding against a real archive surfaced 21 presence false positives: fork rows with data_version 0 from interrupted live syncs and orphan- copied rows from earlier resyncs carry IDs the current parser no longer derives. Those are pipeline history, not parser drift. The presence sweep now routes rows below the current data version to pending_resync (presence field still attached for drill-down); only a current-version session that vanished from its file's parse output reports as changed.
…and metadata Review against the real archive found the comparator could report a session identical after its own tier-1 fingerprint proved the messages differ. Three gaps closed: - Message body changes (the most common parser drift) were in neither the token fingerprint nor the session compare. Add a content tier (db.MessageContentFingerprint vs the prepared messages' content-length aggregate) and surface per-message content drift as message_content. - A token-fingerprint mismatch confined to role/timestamp/source ids or the sidechain/compact flags produced zero field diffs and classified identical. Tier 2 now attributes those via message_metadata, with a guaranteed fallback so a fingerprint inequality is never silently identical. - started_at/ended_at were never compared; add them to the session-row fields. Also: scope the termination_status stored-NULL informational downgrade to the incremental-append agents (Claude, Codex) — for full-replace agents a newly emitted status is real new-field detection; and fold source+model into the dedup-keyed usage-event multiset so re-attribution under a stable dedup key surfaces instead of passing silently.
…f loop
- Kiro data.sqlite3 and db-mode OpenCode opencode.db sessions are never
emitted by DiscoverFunc (normal sync reaches them through dedicated
phases), so parse-diff was reporting zero of them as 'not discovered'
and an --agent kiro/opencode run could pass while comparing nothing.
Synthesize discovery entries for those db paths; processKiro and
processOpenCode already fan one db path out to every contained session
under force-parse. Add a Kiro-SQLite coverage integration test.
- Reclassify trashed-but-reemitted sessions as skipped ('trashed in
archive') instead of excluded_by_parser: the user trashed them and
sync preserves trash, so they are neither a parser exclusion nor drift.
- Initialize report.Sessions to an empty slice so a clean JSON run
serializes [] rather than null.
- Cancel the worker pool on error-return paths instead of parsing every
remaining file just to drain it.
- Assert report.Agents unconditionally in the scope test (was a vacuous
len-guarded check).
Review found several output gaps that undercut the report's scripted use: - Parse errors trip --fail-on-change but were invisible in text output, even with -v: only the summary count rendered. List each failing file and its error so a failed run is actionable. - When the binary's data version is ahead of the whole archive every session is pending_resync and no drift can be detected, yet the run read as a clean pass. Warn on a vacuous run (the headline trap for a parser PR that bumps dataVersion in the same commit), and render pending_resync field diffs under -v so the operator can still see what would change. - The JSON report now carries db_path so an attached report names the archive it vetted. - Qualify the help text: opening the archive still creates the DB and runs migrations, and parse-diff wants a quiescent (ideally resynced) archive because live and incrementally-synced sessions show benign drift against a full re-parse. - Reword the excluded-by-parser summary note now that user-trashed rows are bucketed as skipped. Tests cover parse-error rendering, the vacuous-resync warning, the verbose pending-resync drill-down, JSON sessions:[]/db_path, and both directions of the --fail-on-change exit conjunction.
roborev: Combined Review (
|
Address the four medium findings from the roborev review of PR kenn-io#662: - Add a role/timestamp tier-1 fingerprint (MessageRoleTimeFingerprint plus its in-memory twin) so role-only or timestamp-only parser drift triggers the tier-2 row comparison instead of reporting identical. The token fingerprint shape is shared with the PG push fast-path and stays untouched. - Synthesize opencode.db for parse-diff whenever it exists, not only in SQLite source mode. Normal sync reads DB-only legacy sessions in storage-mode roots too; mode-gating left them "not discovered". - Surface per-session parse errors from the shared-DB fan-out loops (OpenCode, Zed, Kiro) through processResult.sessionErrs in force-parse mode, so parse-diff emits DiffParseError entries and --fail-on-change fails actionably instead of silently dropping or misclassifying the session as presence drift. Normal sync behavior is unchanged (log and continue). - Sanitize all session-derived values in the human parse-diff renderer (agents, IDs, paths, field values, details, error reasons) with the existing sanitizeTerminal helper; --json stays raw.
roborev: Combined Review (
|
The content fast-path compared only sum/max/min of content_length and tier 2 only compared the content_length column, so equal-length body rewrites — and per-message length changes whose aggregates collide, such as two messages swapping lengths — were reported identical and never tripped --fail-on-change. Replace the aggregate check in the comparator with an exact ordered fingerprint (MessageContentHashFingerprint: ordinal, content_length, SHA-256 of the sanitized body) plus an in-memory twin pinned by the existing twin-parity test. Tier 2 now compares the body as well as the length column, attributing equal-length rewrites to message_content; diff details report sizes only and never echo bodies. The aggregate MessageContentFingerprint stays untouched for the PG push fast-path.
roborev: Combined Review (
|
A vacuous run -- where this binary's data version is ahead of every examined session, so all of them are pending_resync -- detects no drift by construction, so a clean result is not evidence the parser is unchanged. Under --fail-on-change that previously exited 0, giving a green CI gate that verified nothing. Fold report.VacuousResync() into the --fail-on-change exit decision and explain the non-zero exit on stderr, which also reaches --json callers that never see the stdout warning. The decision is extracted into a pure parseDiffExitFailure helper so the exit contract is unit-tested.
…, and tool calls The comparator only checked a subset of parser-owned persisted data, so a parser change confined to the omitted columns reported the session identical. Close that blind spot: - Session columns: compare cwd, git_branch, parent_session_id, relationship_type, source_session_id, source_version, parser_malformed_lines, and is_truncated. UpdateSessionIncremental leaves these frozen, so for the incremental-append agents (Claude, Codex) a difference is benign pipeline history and is marked informational, mirroring termination_status. The project column is deliberately not compared: it is rewritten from the mutable worktree_project_mappings table, so its parser-owned input cwd is compared instead. - Message columns: fold is_system, has_thinking, has_tool_use, and thinking_text into the message_metadata comparison, backed by a new MessageFlagsFingerprint tier-1 signal. - tool_calls rows: a new ToolCallFingerprint tier-1 signal plus tier-2 attribution over tool_name, category, tool_use_id, input_json, skill_name, subagent_session_id, and result_content_length. The database-assigned ids and the possibly-blocked result body are not compared; result content is represented only by its length. New fingerprints are parse-diff-only and never alter the message fingerprints shared with the PG push fast-path. In-memory twins are pinned against their DB queries through the real write pipeline.
Exercise the new session-metadata, message-flag, and tool-call comparisons through real JSONL parsing and the full ParseDiff orchestration: - The clean-archive acid test now syncs a Claude session carrying a thinking block, a tool_use/tool_result pair, and a system message, so the flag and tool-call fingerprints round-trip against real parsed data instead of empty strings. - A new drift test confirms tool_call and per-message flag changes classify as changed, a full-replace-agent git_branch change is real drift, and an incremental-append cwd change stays identical with an informational diff.
…cument tool_result_events exclusion Adversarial review of the comparator extension surfaced two issues: - session_name was strict-compared, but the incremental-append path freezes it (UpdateSessionIncremental never rewrites it) while Claude's /rename mutates it mid-session. A rename followed by an incremental sync would therefore report a false change and trip --fail-on-change on an unchanged parser. Mark it informational for the incremental-append agents, the same rule termination_status and the cwd/source metadata already follow; first_message and started_at derive from the session head and stay strict. markIncrementalHistory now preserves any existing Detail (e.g. a long-value rune count) instead of overwriting it. - The tool_result_events table was omitted without disclosure while the comparator docs claimed full coverage. It is excluded deliberately: the blocked-category config clears those rows wholesale, so a strict comparison would be config-sensitive, and their dominant signal (the summarized content size) is already captured by the config-stable result_content_length. Document the exclusion alongside the existing project and result-body exclusions. Also add attribution tests for the remaining toolCallDiff sub-fields (category, tool_use_id, input_json, skill_name, subagent_session_id).
|
Both findings addressed in Vacuous Comparator coverage: extended to the omitted parser-owned data — session metadata ( Fields the incremental-append path freezes are compared but marked informational for Claude/Codex, matching the existing Three values are excluded by documented design:
Out of scope here, but worth flagging: the PG-push In-memory fingerprint twins are pinned against their DB queries through the real write pipeline, the clean-archive acid test now exercises a session with thinking, a tool_use/tool_result pair, and a system message, and the new tier-1 fingerprints are proven to fire end-to-end. |
roborev: Combined Review (
|
Closes #649.
Adds
agentsview parse-diff, a report-only CLI mode that re-parses sessionsource files with the current binary, runs the result through the same
normalization the sync engine applies, and compares it against the stored
SQLite rows. It writes no session, skip-cache, or sync-state data; per the
issue it turns each parser PR's "I hope this is right" into a reviewable
artifact and surfaces upstream format drift the day it lands.
The report classifies every session into one bucket — identical, changed,
pending_resync, new_on_disk, parse_error, excluded_by_parser,
transient_needs_retry, or skipped (with a reason) — so genuine parser drift
is separated from pipeline history. Human output is a summary plus a
field-affected histogram plus a changed-session drill-down;
--jsonemits thesame model for scripting and PR attachment.
--fail-on-changeexits non-zeroon real changes or parse errors.
How it works
The comparison runs through the engine's real write-path normalization
(
prepareSessionWrite->toDBMessages/pairAndFilter,toDBSession,postFilterCounts,toDBUsageEvents), so an unchanged parser produces zerodiffs against rows it wrote itself. Because that pipeline is unexported, the
differ lives in
internal/sync(Engine.ParseDiff,NewDiffEngine) and theCLI is a thin renderer over a renderer-agnostic report model in
parsediff_report.go.Re-parse reuses
processFileand the production worker pool. A new unexportedforceParseflag (set only byNewDiffEngine, which also forcesEphemeral)disables the three skip layers — the in-memory/persisted skip cache, the
stored size/mtime/data_version short-circuits, and the incremental JSONL path
— so every discovered file is fully re-parsed against the populated archive.
Normal sync never sets the flag, so its behavior is unchanged.
Comparison is two-tier for speed: a per-session token fingerprint and a
content-length fingerprint gate the cheap identical path; only on a mismatch
are message rows loaded and the difference attributed to a specific field.
Usage events compare as an order-insensitive multiset. Known false-diff
sources are classified rather than reported as drift: pending-resync rows
(data version behind the binary), incremental-append
termination_statusclears for Claude/Codex, orphaned/source-missing sessions, remote and
import-only sessions, and user-trashed rows.
Coverage and limitations
SQLite stores (Kiro
data.sqlite3, db-mode OpenCodeopencode.db) thatdiscovery does not emit. Database-backed and import-only agents (Warp,
Forge, Piebald, claude-ai, ChatGPT) have no on-disk source and are rejected.
sessions last written through the incremental-append path, can show benign
drift against a full re-parse; a freshly resynced archive is the cleanest
baseline.
--helpand the report say so, and a run whose data version isahead of the whole archive prints a warning that no drift can be detected.
Where to look
internal/sync/parsediff.go(loop + classification),parsediff_compare.go(field-by-field comparison and the fingerprint tiers),
parsediff_report.go(the shared model), the
forceParseguards ininternal/sync/engine.go, andcmd/agentsview/parse_diff.go(command + renderer). The feature was vettedagainst a real ~726-session local archive across the supported agents.