feat(platform-integrations): add provenance usage audits#251
Conversation
Restores and extends the PR 239 usage-provenance flow on top of the unified plugin source. Adds offline provenance analysis for recalled guidelines, stores trajectories for the supported harnesses, and adds Docker e2e coverage for Claude and Codex learn/recall/provenance flows. The audit path now writes recall and influence events under the configured EVOLVE_DIR instead of deriving a parent project root, so custom evolve data directories keep recall, entities, and provenance together. Influence writes are also idempotent per session/entity so rerunning provenance does not double-count usage.
|
Warning Rate limit exceeded
To continue reviewing without waiting, purchase usage credits in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (25)
📝 WalkthroughWalkthroughThis PR implements provenance analysis across evolve-lite: evolve_dir-aware audit.append; stable entity IDs on recall; session-id-aware trajectory filenames; provenance SKILL and log_influence scripts; learn/save-trajectory workflow updates to use saved trajectories; subscription rollback on audit failure; and expanded tests and docs. ChangesProvenance Analysis System
Sequence DiagramsequenceDiagram
participant User
participant Session1 as Session 1: Learn
participant Session2 as Session 2: Recall
participant Session3 as Session 3: Analyze
participant AuditLog as Audit Log
participant Trajectories as Saved Trajectories
participant Influence as Influence Log
User->>Session1: Issue task & guidance
Session1->>Trajectories: Save conversation trajectory
Session1->>AuditLog: Append learned entity
User->>Session2: Ask related question
Session2->>Session2: Recall learned guidelines
Session2->>AuditLog: Append recall event (entity IDs)
Session2->>Trajectories: Reference saved trajectory
User->>Session3: Run provenance analysis
Session3->>AuditLog: Load recall events
Session3->>Trajectories: Locate matching trajectories
Session3->>Session3: Read recalled entity content
Session3->>Session3: Assess influence verdicts
Session3->>Influence: Log influence assessment (verdict + evidence)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
|
coderabbitai can you review? |
SummaryThis PR adds provenance logging and offline provenance-analysis guidance across the generated plugin source and rendered platform integrations, along with new Claude/Codex sandbox coverage. The overall direction makes sense, but the branch is red locally and I found a few concrete regressions in the current implementation. Findings
Testing
|
visahak
left a comment
There was a problem hiding this comment.
Please check the comments.
Addresses review feedback from visahak Restores the all-or-nothing contract: if audit_append raises after a successful clone + config save, the new repo entry and cloned directory are removed before exiting non-zero with a "failed to record subscription" diagnostic on stderr. Previously the command swallowed the failure and reported success, leaving the clone and config mutation in place even though tests and callers expect rollback on partial failure.
Addresses review feedback from visahak The previous commit popped the repo entry from the in-memory list but did not call set_repos(cfg, repos), so the subsequent save_config call re-wrote the same state to disk and the config still contained the new subscription. Update the in-memory cfg before the compensating save_config and align the codex-sharing test, which previously expected the old warn-and-succeed behavior, with the all-or-nothing contract asserted by test_rolls_back_clone_if_audit_write_fails.
Addresses review feedback from visahak Rejected subscription entries (invalid names, missing remotes, unknown scopes) and path-traversal guard trips were being folded into the stdout "Synced N repo(s):" summary, making diagnostics indistinguishable from normal sync output and breaking the platform-integration tests that assert diagnostics appear on stderr. Route each rejection through stderr instead and exclude rejected entries from the stdout summary count.
…venance Addresses review feedback from visahak The provenance skill advertised deterministic session-to-trajectory matching via session_id, but save-trajectory wrote files as trajectory_<timestamp>.json with no persisted session identifier, so two back-to-back codex sessions produced indistinguishable files and the e2e test had to pass the target session id in through the prompt. Extend the envelope to include session_id (Step 4 of save-trajectory) and thread that id into the output filename as trajectory_<timestamp>_<session-id>.json so provenance can resolve a recall event to exactly one trajectory by inspecting the filename. The filename slice is sanitised to filesystem-safe characters and capped at 64 chars. When no session id is available the filename falls back to the original trajectory_<timestamp>.json form. Provenance SKILL Step 2 now documents a three-step matching strategy (claude transcript filename; session-id suffix on trajectory files; envelope session_id field as a last resort) so agents do not have to guess from content alone.
…tly swallowing Addresses CodeRabbit review finding: Silent rollback-save failure leaves config and filesystem inconsistent The compensating save_config call in the audit-failure rollback path previously swallowed every exception with a bare `except Exception: pass`, which could leave the on-disk evolve.config.yaml still listing the freshly-added repo even though the clone was removed. Print a clear stderr warning that names the affected project_root, the caught exception, and the subscription entry that may still need manual removal so the user can repair the config themselves.
Addresses CodeRabbit review finding: Harden type validation before verdict checks and dedupe keying Require session_id to be a non-empty string at the payload level, and require each assessment's entity field to be a non-empty string inside the loop; malformed items are logged and skipped instead of risking a TypeError when the (session_id, entity) dedupe key is built. Coerce evidence to a string for the same reason so the audit schema stays stable even if callers hand us a numeric or null evidence field.
Addresses CodeRabbit review finding: transcript_path priority silently shadows the explicit session_id for non-Claude platforms removeprefix is a no-op on stems that do not start with claude-transcript_, so on non-Claude platforms that pass both transcript_path and session_id the raw filename stem was winning and the explicit session_id was ignored. Only consume transcript_path when the stem actually carries the Claude prefix; otherwise fall through to input_data["session_id"] so Codex/Bob/Claw Code get the session id their hook actually provided.
…sion_id guidance Addresses CodeRabbit review finding: CLAUDE_SESSION_ID is a Claude-specific env var referenced in a Codex SKILL.md CLAUDE_SESSION_ID does not exist on Codex, Claw Code, or Bob, so the rendered SKILL.md on those platforms invited agents to chase an identifier that would never resolve. Replace the concrete vendor env var with generic guidance — "whatever the harness exposes" plus the existing fallback behavior — so every platform gets the same instruction and no platform-specific symbol leaks into the others' rendered output.
… assertion Addresses CodeRabbit review findings: Use task-scoped recalled IDs in the learned-ID check; The hard followed requirement is flaky for stochastic e2e model behavior The learned-vs-recalled intersection was computed over the aggregated recalled_ids across every recall event in the log, which could let the assertion pass even when the task session itself never actually recalled a learned id. Intersect task_recalled_ids (just the final recall event) with learned_ids instead so we verify the specific task session recalled what it learned. Separately, "followed" is only one of three valid influence verdicts and the real model can legitimately pick "contradicted" or "not_applicable" on any given run. Relax the hard-followed assertion to "any verdict in the allowed set" — the test now guards the shape of the influence audit rather than pinning a stochastic outcome.
…fail test Addresses CodeRabbit review finding: Guard against FileNotFoundError when rollback deletes the newly-created config Once the rollback path removes the subscription entry and rewrites the config, a future implementation could reasonably end up with the config file absent (e.g., after removing the only repo). Guard the read_text() call with an exists() check so the assertion continues to verify that 'name: alice' is not present regardless of whether the file is empty or gone.
…rder assertion Addresses CodeRabbit review findings: read_audit missing encoding=utf-8 may fail in non-UTF-8 locales; Entity list order assertion is fragile — rglob/os.walk order is not guaranteed read_text() falls back to the platform default encoding, which is not utf-8 on every CI host. Pin the decoder to utf-8 for both the recall audit parser in test_retrieve and the read_audit helper in test_log_influence so non-ASCII audit entries decode reliably. The recall test also asserted strict list equality on the entities field, but retrieve_entities orders them via rglob which is not guaranteed across platforms. Switch to a set comparison so we assert on membership rather than traversal order.
Addresses CodeRabbit review finding: Count only trajectory files in the codex learn gate
glob("*") on .evolve/trajectories/ counts any directory or stray
artifact (e.g., a lock dir from a previous run) the harness happens
to leave behind. Restrict the count to files whose name matches
trajectory_*.json so the codex branch only passes when the learn
flow actually produced a saved trajectory.
|
@visahak thanks for the detailed review — all three findings addressed on this branch:
CI green; full platform_integrations suite (152 e2e + 196 unit) passes locally. |
Summary
Related
Tests
Notes
Repeated Docker e2e runs are intentionally stochastic because the agents may learn different guidelines from the seed task. The audit/provenance plumbing is deterministic; the usefulness verdicts reflect each run's learned guidance.
Summary by CodeRabbit
Release Notes
New Features
Improvements
Tests