
feat(platform-integrations): add provenance usage audits #251

Merged
visahak merged 13 commits into AgentToolkit:main from vinodmut:provenance-usage
May 7, 2026

Conversation

@vinodmut
Contributor

@vinodmut vinodmut commented May 6, 2026

Summary

  • add offline provenance analysis for recalled evolve-lite guidelines
  • record recall and influence audit events across Claude, Codex, Claw, and Bob platform integrations
  • add Docker e2e coverage for Claude and Codex learn/recall/provenance flows
  • make audit writes respect custom EVOLVE_DIR and de-duplicate influence events on reruns

Related

Tests

  • uv run pytest -v tests/platform_integrations/test_retrieve.py
  • uv run pytest -v --run-e2e tests/platform_integrations/test_log_influence.py
  • uv run pytest -v -rs --run-e2e -m e2e tests/e2e/test_codex_sandbox_learn_recall.py --log-cli-level=INFO
  • uv run pytest -v -rs --run-e2e -m e2e tests/e2e/test_claude_sandbox_learn_recall.py --log-cli-level=INFO

Notes

Repeated Docker e2e runs are intentionally stochastic because the agents may learn different guidelines from the seed task. The audit/provenance plumbing is deterministic; the usefulness verdicts reflect each run's learned guidance.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added provenance analysis to evaluate whether guidelines influenced completed sessions by analyzing saved trajectories offline.
  • Improvements

    • Enhanced trajectory saving to embed session IDs for deterministic offline correlation.
    • Updated learn workflow to require saving and analyzing saved trajectories instead of live context.
    • Made audit logging required for subscriptions with proper rollback on failure.
  • Tests

    • Added comprehensive end-to-end tests validating the complete learn→recall→provenance workflow across platforms.

Restores and extends the PR 239 usage-provenance flow on top of the unified plugin source.

Adds offline provenance analysis for recalled guidelines, stores trajectories for the supported harnesses, and adds Docker e2e coverage for Claude and Codex learn/recall/provenance flows.

The audit path now writes recall and influence events under the configured EVOLVE_DIR instead of deriving a parent project root, so custom evolve data directories keep recall, entities, and provenance together. Influence writes are also idempotent per session/entity so rerunning provenance does not double-count usage.
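A minimal sketch of the path rule described above, for readers who want to see it concretely. The helper names `resolve_audit_log` and `append_event`, and the JSON-lines record format, are assumptions for illustration only; the PR itself only states that the log lives at evolve_dir/audit.log when evolve_dir is given and at project_root/.evolve/audit.log otherwise.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_audit_log(project_root: Path, evolve_dir: Optional[Path] = None) -> Path:
    """Hypothetical helper: choose the audit log location."""
    if evolve_dir is not None:
        # Custom data directory: keep the log next to entities and trajectories.
        return Path(evolve_dir) / "audit.log"
    # Default layout under the project root.
    return Path(project_root) / ".evolve" / "audit.log"


def append_event(event: dict, project_root: Path, evolve_dir: Optional[Path] = None) -> Path:
    """Append one event as a JSON line (on-disk format is an assumption)."""
    log_path = resolve_audit_log(project_root, evolve_dir)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return log_path
```

Resolving the path in one place is what keeps recall and influence events together when EVOLVE_DIR points somewhere other than the project root.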
@coderabbitai

coderabbitai Bot commented May 6, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 00d24bf0-4cb7-446d-86e1-3fde23c58b30

📥 Commits

Reviewing files that changed from the base of the PR and between 0068873 and 2052a45.

📒 Files selected for processing (25)
  • platform-integrations/bob/evolve-lite/skills/evolve-lite-provenance/scripts/log_influence.py
  • platform-integrations/bob/evolve-lite/skills/evolve-lite-recall/scripts/retrieve_entities.py
  • platform-integrations/bob/evolve-lite/skills/evolve-lite-save-trajectory/SKILL.md
  • platform-integrations/bob/evolve-lite/skills/evolve-lite-subscribe/scripts/subscribe.py
  • platform-integrations/claude/plugins/evolve-lite/skills/evolve-lite/provenance/scripts/log_influence.py
  • platform-integrations/claude/plugins/evolve-lite/skills/evolve-lite/recall/scripts/retrieve_entities.py
  • platform-integrations/claude/plugins/evolve-lite/skills/evolve-lite/save-trajectory/SKILL.md
  • platform-integrations/claude/plugins/evolve-lite/skills/evolve-lite/subscribe/scripts/subscribe.py
  • platform-integrations/claw-code/plugins/evolve-lite/skills/evolve-lite/provenance/scripts/log_influence.py
  • platform-integrations/claw-code/plugins/evolve-lite/skills/evolve-lite/recall/scripts/retrieve_entities.py
  • platform-integrations/claw-code/plugins/evolve-lite/skills/evolve-lite/save-trajectory/SKILL.md
  • platform-integrations/claw-code/plugins/evolve-lite/skills/evolve-lite/subscribe/scripts/subscribe.py
  • platform-integrations/codex/plugins/evolve-lite/skills/evolve-lite/provenance/scripts/log_influence.py
  • platform-integrations/codex/plugins/evolve-lite/skills/evolve-lite/recall/scripts/retrieve_entities.py
  • platform-integrations/codex/plugins/evolve-lite/skills/evolve-lite/save-trajectory/SKILL.md
  • platform-integrations/codex/plugins/evolve-lite/skills/evolve-lite/subscribe/scripts/subscribe.py
  • plugin-source/skills/evolve-lite/provenance/scripts/log_influence.py
  • plugin-source/skills/evolve-lite/recall/scripts/retrieve_entities.py
  • plugin-source/skills/evolve-lite/save-trajectory/SKILL.md.j2
  • plugin-source/skills/evolve-lite/subscribe/scripts/subscribe.py
  • tests/e2e/test_codex_sandbox_learn_recall.py
  • tests/platform_integrations/test_codex_sharing.py
  • tests/platform_integrations/test_log_influence.py
  • tests/platform_integrations/test_retrieve.py
  • tests/smoke_skills.py
📝 Walkthrough

Walkthrough

This PR implements provenance analysis across evolve-lite: evolve_dir-aware audit.append; stable entity IDs on recall; session-id-aware trajectory filenames; provenance SKILL and log_influence scripts; learn/save-trajectory workflow updates to use saved trajectories; subscription rollback on audit failure; and expanded tests and docs.

Changes

Provenance Analysis System

| Layer | File(s) | Summary |
|---|---|---|
| Audit API Enhancement | plugin-source/lib/audit.py, platform-integrations/*/plugins/evolve-lite/lib/audit.py, platform-integrations/*/evolve-lite/lib/audit.py | append() function signature updated to accept optional evolve_dir parameter; log path computed from evolve_dir/audit.log when provided, else defaults to project_root/.evolve/audit.log. Docstrings updated accordingly. |
| Recall Auditing & Entity Identification | platform-integrations/*/recall/scripts/retrieve_entities.py, plugin-source/skills/evolve-lite/recall/scripts/retrieve_entities.py | Each loaded entity assigned a stable _id derived from markdown path relative to entities directory. Recall events audited via new import of get_evolve_dir and audit module; session_id extracted from transcript_path or input, and audit record appended with evolve_dir, entity IDs, and event metadata (non-fatal error handling). |
| Provenance Logging Script | platform-integrations/*/provenance/scripts/log_influence.py, plugin-source/skills/evolve-lite/provenance/scripts/log_influence.py | New script reads JSON from stdin containing session_id and influence assessments; locates plugin library dynamically; deduplicates assessments against existing audit.log; validates verdict values (followed, contradicted, not_applicable); appends new influence events to audit log and emits summary. |
| Provenance Skill Documentation | platform-integrations/*/provenance/SKILL.md, plugin-source/skills/evolve-lite/provenance/SKILL.md.j2, platform-integrations/bob/evolve-lite/commands/evolve-lite-provenance.md | New documentation defining Provenance Analyzer workflow: load recall events, locate matching trajectories, read recalled entities, assess influence, write influence events. Covers verdict semantics, file layout, payload structure, and edge cases. |
| Learn Skill Trajectory Workflow | platform-integrations/*/learn/SKILL.md, plugin-source/skills/evolve-lite/learn/SKILL.md.j2 | Added Step 0: Save and Load conversation trajectory. Updated Step 1 to analyze from saved trajectory rather than live conversation. Added guidance on generalizing artifacts by removing incidental inputs and documenting purpose and usage for reuse. |
| Save Trajectory Script Paths & Filenames | platform-integrations/*/save-trajectory/SKILL.md, */save_trajectory.py | Updated helper script invocation paths to use evolve-lite variants across platforms; save_trajectory scripts sanitize and optionally embed session_id into filenames via new helper and updated open_trajectory_file(session_id=None). |
| Subscription / Sync Behavior | platform-integrations/*/subscribe/scripts/subscribe.py, */sync/scripts/sync.py | Subscription audit failures now roll back the added repo and exit non-zero; sync scripts print skip diagnostics to stderr instead of aggregating them into summaries. |
| Plugin Configuration | platform-integrations/codex/plugins/evolve-lite/.codex-plugin/plugin.json, plugin-source/plugin.toml | Added new default prompt entry: "Analyze saved trajectories for Evolve guideline provenance." to interface configuration. |
| Integration Tests | tests/platform_integrations/test_log_influence.py, tests/platform_integrations/test_retrieve.py | New tests for log_influence script validating single/multiple assessments, custom evolve_dir, deduplication, verdict validation, and error handling. New recall auditing tests validating entity IDs, session_id handling, custom evolve_dir isolation, and transcript_path fallback. |
| E2E Tests & Documentation | tests/e2e/test_claude_sandbox_learn_recall.py, tests/e2e/test_codex_sandbox_learn_recall.py, tests/platform_integrations/test_codex.py, tests/platform_integrations/test_plugin_structure.py, tests/platform_integrations/test_skill_directory_names.py, sandbox/README.md | Refactored Claude E2E test with improved command quoting; added comprehensive Codex sandbox E2E test covering three-session EXIF/learn/recall/provenance flow. Updated Codex install tests to verify new provenance and save-trajectory directories/scripts. Updated sandbox README with platform-specific test references and dotenv invocation patterns. |
| Smoke Test Updates | tests/smoke_skills.py | Added save_trajectory_cmd field to PlatformPlan dataclass; plumbed platform-specific save-trajectory invocations (Claude, Codex, Bob) into seed-and-learn flow; added trajectory_count validation for Codex; enhanced failure reporting to include trajectory presence checks. |
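For the "stable _id derived from markdown path relative to entities directory" item in the walkthrough, a rough sketch of one way such an ID could be computed. The function name `entity_id` and the exact ID shape (POSIX-style relative path without the .md suffix) are assumptions; the scripts in this PR may format the ID differently.

```python
from pathlib import Path


def entity_id(markdown_path: Path, entities_dir: Path) -> str:
    """Hypothetical: derive an ID from the path relative to the entities directory."""
    relative = markdown_path.relative_to(entities_dir)
    # Drop the extension and normalise separators so the ID does not depend on the OS.
    return relative.with_suffix("").as_posix()
```

Because the ID depends only on where the file sits under the entities directory, the same guideline gets the same ID in the recall event and in later influence events.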

Sequence Diagram

sequenceDiagram
    participant User
    participant Session1 as Session 1: Learn
    participant Session2 as Session 2: Recall
    participant Session3 as Session 3: Analyze
    participant AuditLog as Audit Log
    participant Trajectories as Saved Trajectories
    participant Influence as Influence Log

    User->>Session1: Issue task & guidance
    Session1->>Trajectories: Save conversation trajectory
    Session1->>AuditLog: Append learned entity

    User->>Session2: Ask related question
    Session2->>Session2: Recall learned guidelines
    Session2->>AuditLog: Append recall event (entity IDs)
    Session2->>Trajectories: Reference saved trajectory

    User->>Session3: Run provenance analysis
    Session3->>AuditLog: Load recall events
    Session3->>Trajectories: Locate matching trajectories
    Session3->>Session3: Read recalled entity content
    Session3->>Session3: Assess influence verdicts
    Session3->>Influence: Log influence assessment (verdict + evidence)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • visahak
  • illeatmyhat

"I hop through logs and tidy tracks,
Saved trails stamped with careful acts,
One verdict per recalled little guide,
Evidence short, right there beside,
A rabbit cheers — provenance intact!"


@vinodmut vinodmut requested review from illeatmyhat and visahak May 6, 2026 17:41
@visahak
Collaborator

visahak commented May 6, 2026

coderabbitai can you review?

@visahak
Collaborator

visahak commented May 6, 2026

Summary

This PR adds provenance logging and offline provenance-analysis guidance across the generated plugin source and rendered platform integrations, along with new Claude/Codex sandbox coverage. The overall direction makes sense, but the branch is red locally and I found a few concrete regressions in the current implementation.

Findings

  1. subscribe.py no longer rolls back when audit logging fails (confidence: 100/100)

    • Why it matters: this changes subscribe from an all-or-nothing operation into a partial success. If the audit append fails, the repo clone and config entry are left behind even though the established behavior and tests expect the command to fail and clean up.
    • Evidence: platform-integrations/claude/plugins/evolve-lite/skills/evolve-lite/subscribe/scripts/subscribe.py:125 catches audit_append(...) failures and only prints a warning at :140, then still reports success at :142. That behavior now fails tests/platform_integrations/test_subscribe.py::TestSubscribe::test_rolls_back_clone_if_audit_write_fails.
    • Example: with a read-only .evolve/audit.log, the command now exits 0 and keeps the cloned subscription instead of aborting.
  2. Invalid subscription rejections were moved from stderr into the sync summary on stdout (confidence: 100/100)

    • Why it matters: this breaks the CLI contract already used by the platform-integration tests and makes invalid-config diagnostics indistinguishable from normal sync summaries.
    • Evidence: the sync scripts accumulate invalid-entry rejections into summaries at platform-integrations/claude/plugins/evolve-lite/skills/evolve-lite/sync/scripts/sync.py:161 and then print the whole summary to stdout at :231 instead of emitting the invalid-name diagnostic to stderr. The same pattern is present in the Codex and Bob variants. This is exactly what now breaks tests/platform_integrations/test_sync.py::TestSync::test_skips_invalid_subscription_name, tests/platform_integrations/test_codex_sharing.py::TestCodexSharingScripts::test_sync_skips_invalid_subscription_name, and the corresponding Bob sync tests.
    • Example: syncing a repo named ../evil now returns stdout="Synced 1 repo(s): '../evil' (skipped - invalid subscription name)" with empty stderr.
  3. The new Codex provenance contract cannot reliably do the session-to-trajectory matching it claims (confidence: 90/100)

    • Why it matters: the new provenance skill says it can match recall events to saved trajectories by session_id, but Codex trajectories are still saved as anonymous timestamped trajectory_*.json files with no persisted session_id in the filename or envelope. That makes deterministic offline provenance matching impossible for Codex without extra metadata.
    • Evidence: plugin-source/skills/evolve-lite/provenance/SKILL.md.j2:23 says to “match each recall event to a trajectory by session_id”. But plugin-source/skills/evolve-lite/save-trajectory/scripts/save_trajectory.py:68 creates only trajectory_<timestamp>.json, and the saved payload written at :128-139 does not add a session_id. The Codex e2e at tests/e2e/test_codex_sandbox_learn_recall.py:242-260 works around this by passing the chosen session_id back into the prompt, which does not prove the advertised automatic matching behavior.
    • Example: two Codex sessions can produce multiple trajectory_*.json files with no stable key tying a recall audit event’s session_id to exactly one saved trajectory.

Testing

  • uv run pytest -m e2e -v: 7 failed, 192 passed, 573 deselected, 2 warnings in 748.02s
  • Failure details:
    • tests/platform_integrations/test_subscribe.py::TestSubscribe::test_rolls_back_clone_if_audit_write_fails
    • tests/platform_integrations/test_sync.py::TestSync::test_skips_invalid_subscription_name
    • tests/platform_integrations/test_codex_sharing.py::TestCodexSharingScripts::test_sync_skips_invalid_subscription_name
    • tests/platform_integrations/test_bob_sharing.py::TestBobSync::test_skips_invalid_subscription_name
    • tests/platform_integrations/test_bob_sharing.py::TestBobSync::test_rejects_dot_and_double_dot_names
    • tests/e2e/test_claude_sandbox_learn_recall.py::test_claude_learn_then_recall_flow
    • tests/e2e/test_codex_sandbox_learn_recall.py::test_codex_learn_then_recall_flow

Collaborator

@visahak visahak left a comment

Please check the comments.

vinodmut added 4 commits May 7, 2026 00:08
Addresses review feedback from visahak

Restores the all-or-nothing contract: if audit_append raises after a
successful clone + config save, the new repo entry and cloned directory
are removed before exiting non-zero with a "failed to record subscription"
diagnostic on stderr. Previously the command swallowed the failure and
reported success, leaving the clone and config mutation in place even
though tests and callers expect rollback on partial failure.
Addresses review feedback from visahak

The previous commit popped the repo entry from the in-memory list but
did not call set_repos(cfg, repos), so the subsequent save_config call
re-wrote the same state to disk and the config still contained the new
subscription. Update the in-memory cfg before the compensating
save_config and align the codex-sharing test, which previously expected
the old warn-and-succeed behavior, with the all-or-nothing contract
asserted by test_rolls_back_clone_if_audit_write_fails.
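The all-or-nothing contract described in the two commits above can be sketched roughly as follows. This is not the actual subscribe.py: `subscribe_all_or_nothing`, `save_repos`, and the event shape are hypothetical stand-ins, and only the behavior (remove the entry, persist the reverted list, delete the clone, exit non-zero with a "failed to record subscription" diagnostic on stderr) comes from the commit messages.

```python
import shutil
import sys
from pathlib import Path
from typing import Callable, List


def subscribe_all_or_nothing(
    repos: List[dict],
    entry: dict,
    clone_dir: Path,
    save_repos: Callable[[List[dict]], None],
    audit_append: Callable[[dict], None],
) -> int:
    """Return a process exit code; undo the subscription if auditing fails."""
    repos.append(entry)
    save_repos(repos)
    try:
        audit_append({"event": "subscribe", "name": entry["name"]})
    except Exception as exc:
        # Compensate: drop the entry, persist the reverted list, remove the clone.
        repos.remove(entry)
        save_repos(repos)
        shutil.rmtree(clone_dir, ignore_errors=True)
        print(f"failed to record subscription: {exc}", file=sys.stderr)
        return 1
    return 0
```

The second `save_repos` call is the step the follow-up commit fixed: reverting only the in-memory list leaves the on-disk config still listing the new subscription.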
Addresses review feedback from visahak

Rejected subscription entries (invalid names, missing remotes, unknown
scopes) and path-traversal guard trips were being folded into the
stdout "Synced N repo(s):" summary, making diagnostics indistinguishable
from normal sync output and breaking the platform-integration tests
that assert diagnostics appear on stderr. Route each rejection through
stderr instead and exclude rejected entries from the stdout summary
count.
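As a sketch of the stdout/stderr split this commit describes: rejected entries go to stderr and are left out of the stdout summary count. The validity rule in `is_valid_name` and the exact message wording are assumptions; the real sync scripts keep a legacy diagnostic format that is not reproduced here.

```python
import sys
from typing import Iterable, List, Tuple


def is_valid_name(name: str) -> bool:
    """Assumed rule for illustration: no path separators, not '.' or '..'."""
    return bool(name) and name not in {".", ".."} and "/" not in name and "\\" not in name


def partition_subscriptions(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split names into (accepted, rejected), preserving input order."""
    accepted: List[str] = []
    rejected: List[str] = []
    for name in names:
        (accepted if is_valid_name(name) else rejected).append(name)
    return accepted, rejected


def report(accepted: List[str], rejected: List[str], out=sys.stdout, err=sys.stderr) -> None:
    """Diagnostics go to err; only accepted repos count toward the summary on out."""
    for name in rejected:
        print(f"skipped {name!r}: invalid subscription name", file=err)
    print(f"Synced {len(accepted)} repo(s): {', '.join(accepted)}", file=out)
```

Keeping the two streams separate lets callers and tests parse the summary without having to filter out diagnostics.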
…venance

Addresses review feedback from visahak

The provenance skill advertised deterministic session-to-trajectory
matching via session_id, but save-trajectory wrote files as
trajectory_<timestamp>.json with no persisted session identifier, so
two back-to-back codex sessions produced indistinguishable files and
the e2e test had to pass the target session id in through the prompt.

Extend the envelope to include session_id (Step 4 of save-trajectory)
and thread that id into the output filename as
trajectory_<timestamp>_<session-id>.json so provenance can resolve a
recall event to exactly one trajectory by inspecting the filename.
The filename slice is sanitised to filesystem-safe characters and
capped at 64 chars. When no session id is available the filename
falls back to the original trajectory_<timestamp>.json form.
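A minimal sketch of the filename scheme described above. The helper names and the exact set of characters treated as filesystem-safe are assumptions; the 64-character cap, the trajectory_<timestamp>_<session-id>.json form, and the fallback to trajectory_<timestamp>.json come from the commit message.

```python
import re
from typing import Optional

MAX_SESSION_SLICE = 64  # cap stated in the commit message


def sanitize_session_id(session_id: Optional[str]) -> str:
    """Keep only letters, digits, underscore and hyphen (assumed safe set), then cap."""
    if not session_id:
        return ""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", session_id)
    return cleaned[:MAX_SESSION_SLICE]


def trajectory_filename(timestamp: str, session_id: Optional[str] = None) -> str:
    """Embed the sanitized session id when one is available."""
    safe = sanitize_session_id(session_id)
    if safe:
        return f"trajectory_{timestamp}_{safe}.json"
    # No usable session id: fall back to the original form.
    return f"trajectory_{timestamp}.json"
```

For example, `trajectory_filename("20260507T000800", "abc/../123")` yields `trajectory_20260507T000800_abc123.json`; the timestamp format shown is illustrative only.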

Provenance SKILL Step 2 now documents a three-step matching strategy
(claude transcript filename; session-id suffix on trajectory files;
envelope session_id field as a last resort) so agents do not have to
guess from content alone.
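The three-step matching strategy could look roughly like the sketch below. It assumes, purely for illustration, that Claude transcripts and saved trajectories sit in the same directory and that a transcript's filename stem is claude-transcript_<session-id>; `find_trajectory` is a hypothetical name, not a function from this PR.

```python
import json
from pathlib import Path
from typing import Optional

CLAUDE_PREFIX = "claude-transcript_"


def find_trajectory(trajectories_dir: Path, session_id: str) -> Optional[Path]:
    """Resolve a recall event's session_id to one saved file, or None."""
    files = sorted(p for p in trajectories_dir.iterdir() if p.is_file())

    # 1. Claude transcript whose filename stem carries the session id.
    for path in files:
        if path.stem == f"{CLAUDE_PREFIX}{session_id}":
            return path

    # 2. Saved trajectory with a session-id suffix in the filename.
    saved = [p for p in files if p.name.startswith("trajectory_") and p.suffix == ".json"]
    for path in saved:
        if path.stem.endswith(f"_{session_id}"):
            return path

    # 3. Last resort: read the envelope and compare its session_id field.
    for path in saved:
        try:
            envelope = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed file: skip it
        if isinstance(envelope, dict) and envelope.get("session_id") == session_id:
            return path
    return None
```

Trying the cheap filename checks first means the envelope only has to be parsed for older files saved without a session-id suffix.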
coderabbitai[bot]

This comment was marked as resolved.

vinodmut added 8 commits May 7, 2026 01:10
…tly swallowing

Addresses CodeRabbit review finding: Silent rollback-save failure leaves config and filesystem inconsistent

The compensating save_config call in the audit-failure rollback path
previously swallowed every exception with a bare `except Exception:
pass`, which could leave the on-disk evolve.config.yaml still listing
the freshly-added repo even though the clone was removed. Print a
clear stderr warning that names the affected project_root, the caught
exception, and the subscription entry that may still need manual
removal so the user can repair the config themselves.
Addresses CodeRabbit review finding: Harden type validation before verdict checks and dedupe keying

Require session_id to be a non-empty string at the payload level, and
require each assessment's entity field to be a non-empty string inside
the loop; malformed items are logged and skipped instead of risking a
TypeError when the (session_id, entity) dedupe key is built. Coerce
evidence to a string for the same reason so the audit schema stays
stable even if callers hand us a numeric or null evidence field.
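A hedged sketch of the validation and dedupe order this commit describes. The payload key `assessments`, the event field names, and the choice to turn a null evidence value into an empty string are assumptions; the allowed verdicts and the (session_id, entity) dedupe key come from the PR.

```python
from typing import List, Set, Tuple

ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}


def new_influence_events(payload: dict, seen: Set[Tuple[str, str]]) -> List[dict]:
    """Return events not already logged; `seen` holds existing (session_id, entity) keys."""
    session_id = payload.get("session_id")
    if not isinstance(session_id, str) or not session_id:
        raise ValueError("session_id must be a non-empty string")

    events: List[dict] = []
    for item in payload.get("assessments", []):
        if not isinstance(item, dict):
            continue  # malformed item: skip
        entity = item.get("entity")
        verdict = item.get("verdict")
        if not isinstance(entity, str) or not entity:
            continue  # cannot build a dedupe key without a string entity
        if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
            continue
        key = (session_id, entity)
        if key in seen:
            continue  # already recorded for this session
        seen.add(key)
        evidence = item.get("evidence")
        events.append({
            "event": "influence",
            "session_id": session_id,
            "entity": entity,
            "verdict": verdict,
            # Coerce so the audit schema always carries a string.
            "evidence": "" if evidence is None else str(evidence),
        })
    return events
```

Checking types before building the key is what prevents a TypeError from an unhashable or missing entity, and reusing `seen` across runs is what makes reruns idempotent.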
Addresses CodeRabbit review finding: transcript_path priority silently shadows the explicit session_id for non-Claude platforms

removeprefix is a no-op on stems that do not start with
claude-transcript_, so on non-Claude platforms that pass both
transcript_path and session_id the raw filename stem was winning and
the explicit session_id was ignored. Only consume transcript_path
when the stem actually carries the Claude prefix; otherwise fall
through to input_data["session_id"] so Codex/Bob/Claw Code get the
session id their hook actually provided.
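The priority rule can be sketched as follows. The function name `resolve_session_id` is hypothetical; the claude-transcript_ prefix and the fall-through to input_data["session_id"] are taken from the commit message.

```python
from pathlib import Path
from typing import Optional

CLAUDE_PREFIX = "claude-transcript_"


def resolve_session_id(input_data: dict) -> Optional[str]:
    """Use transcript_path only when its stem carries the Claude prefix."""
    transcript_path = input_data.get("transcript_path")
    if transcript_path:
        stem = Path(transcript_path).stem
        if stem.startswith(CLAUDE_PREFIX):
            candidate = stem[len(CLAUDE_PREFIX):]
            if candidate:
                return candidate
    # Non-Claude platforms: trust the id the hook provided explicitly.
    return input_data.get("session_id")
```

The explicit `startswith` check is the key difference from calling `removeprefix` unconditionally, which returns the raw stem unchanged when the prefix is absent.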
…sion_id guidance

Addresses CodeRabbit review finding: CLAUDE_SESSION_ID is a Claude-specific env var referenced in a Codex SKILL.md

CLAUDE_SESSION_ID does not exist on Codex, Claw Code, or Bob, so the
rendered SKILL.md on those platforms invited agents to chase an
identifier that would never resolve. Replace the concrete vendor env
var with generic guidance — "whatever the harness exposes" plus the
existing fallback behavior — so every platform gets the same
instruction and no platform-specific symbol leaks into the others'
rendered output.
… assertion

Addresses CodeRabbit review findings: Use task-scoped recalled IDs in the learned-ID check; The hard followed requirement is flaky for stochastic e2e model behavior

The learned-vs-recalled intersection was computed over the aggregated
recalled_ids across every recall event in the log, which could let
the assertion pass even when the task session itself never actually
recalled a learned id. Intersect task_recalled_ids (just the final
recall event) with learned_ids instead so we verify the specific
task session recalled what it learned.

Separately, "followed" is only one of three valid influence verdicts
and the real model can legitimately pick "contradicted" or
"not_applicable" on any given run. Relax the hard-followed assertion
to "any verdict in the allowed set" — the test now guards the shape
of the influence audit rather than pinning a stochastic outcome.
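The two relaxed checks might be expressed along these lines. The event field names (`event`, `entities`, `verdict`) and the helper name `check_provenance` are assumptions about the audit record shape, not the actual e2e test code.

```python
ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}


def check_provenance(events: list, learned_ids: set) -> None:
    """Assert the task session recalled a learned id and verdicts are well-formed."""
    recalls = [e for e in events if e.get("event") == "recall"]
    assert recalls, "expected at least one recall event"

    # Task-scoped: only the final recall event, not every recall in the log.
    task_recalled_ids = set(recalls[-1].get("entities", []))
    assert task_recalled_ids & learned_ids, "task session recalled no learned id"

    influences = [e for e in events if e.get("event") == "influence"]
    assert influences, "expected at least one influence event"
    # Any allowed verdict passes; the model's choice is not pinned.
    assert all(e.get("verdict") in ALLOWED_VERDICTS for e in influences)
```

With the aggregated intersection, an earlier session that happened to recall a learned id would have masked a task session that recalled none.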
…fail test

Addresses CodeRabbit review finding: Guard against FileNotFoundError when rollback deletes the newly-created config

Once the rollback path removes the subscription entry and rewrites the
config, a future implementation could reasonably end up with the
config file absent (e.g., after removing the only repo). Guard the
read_text() call with an exists() check so the assertion continues to
verify that 'name: alice' is not present regardless of whether the
file is empty or gone.
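A small sketch of the guarded read, with a hypothetical helper name `config_contains`; the real test inlines the check rather than using a helper.

```python
from pathlib import Path


def config_contains(config_path: Path, needle: str) -> bool:
    """True only if the file exists and its text contains `needle`."""
    if not config_path.exists():
        # A rollback that removes the config entirely also satisfies "not present".
        return False
    return needle in config_path.read_text(encoding="utf-8")
```

The test can then assert `not config_contains(path, "name: alice")` whether the rollback leaves an empty file or no file at all.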
…rder assertion

Addresses CodeRabbit review findings: read_audit missing encoding=utf-8 may fail in non-UTF-8 locales; Entity list order assertion is fragile — rglob/os.walk order is not guaranteed

read_text() falls back to the platform default encoding, which is not
utf-8 on every CI host. Pin the decoder to utf-8 for both the recall
audit parser in test_retrieve and the read_audit helper in
test_log_influence so non-ASCII audit entries decode reliably.

The recall test also asserted strict list equality on the entities
field, but retrieve_entities orders them via rglob which is not
guaranteed across platforms. Switch to a set comparison so we assert
on membership rather than traversal order.
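Both fixes can be illustrated with a short sketch. It assumes the audit log is stored as one JSON object per line, which the PR text does not state explicitly; `read_audit` here is a stand-in for the test helper of the same name.

```python
import json
from pathlib import Path


def read_audit(log_path: Path) -> list:
    """Parse the audit log, decoding as UTF-8 regardless of the host locale."""
    text = log_path.read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def same_entities(event: dict, expected: set) -> bool:
    """Compare membership only; traversal order of the entities is not guaranteed."""
    return set(event.get("entities", [])) == expected
```

A test would then call `same_entities(event, {"a", "b"})` instead of comparing against an ordered list.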
Addresses CodeRabbit review finding: Count only trajectory files in the codex learn gate

glob("*") on .evolve/trajectories/ counts any directory or stray
artifact (e.g., a lock dir from a previous run) the harness happens
to leave behind. Restrict the count to files whose name matches
trajectory_*.json so the codex branch only passes when the learn
flow actually produced a saved trajectory.
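A minimal sketch of the stricter count, using a hypothetical helper name `count_trajectories`:

```python
from pathlib import Path


def count_trajectories(trajectories_dir: Path) -> int:
    """Count only regular files named trajectory_*.json."""
    return sum(
        1 for path in trajectories_dir.glob("trajectory_*.json") if path.is_file()
    )
```

The `is_file()` check goes slightly beyond what the commit message states; it is included so that a directory whose name happens to match the pattern is not counted either.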
@vinodmut
Copy link
Copy Markdown
Contributor Author

vinodmut commented May 7, 2026

@visahak thanks for the detailed review — all three findings addressed on this branch:

  1. subscribe rollback on audit failure — restored to the all-or-nothing contract in 3ee80a4 (re-raise + rmtree), then 9b53d15 added the missing set_repos(cfg, repos) so the compensating save_config actually persists the rollback, and 4e4358b replaced the silent except: pass on the compensating save with a stderr warning (per CodeRabbit follow-up). test_rolls_back_clone_if_audit_write_fails passes; the now-stale test_subscribe_warns_when_audit_write_fails in test_codex_sharing was flipped to assert rollback.
  2. invalid-subscription rejections — routed to stderr in 06c3d57, keeping the exact legacy format the tests expect. test_skips_invalid_subscription_name and test_rejects_dot_and_double_dot_names pass across claude/codex/bob.
  3. Codex session-to-trajectory matching: 0068873 adds session_id to the trajectory envelope and embeds it in the filename as trajectory_<timestamp>_<session-id>.json, and the provenance SKILL now documents a three-step matching strategy (claude transcript filename → session-id suffix on saved trajectories → envelope field as a last resort) so matching is deterministic for codex without relying on the prompt carrying the id.

CI green; full platform_integrations suite (152 e2e + 196 unit) passes locally.

@visahak visahak self-requested a review May 7, 2026 13:48
Collaborator

@visahak visahak left a comment

LGTM

@visahak visahak merged commit d7a3f49 into AgentToolkit:main May 7, 2026
17 checks passed
@vinodmut vinodmut deleted the provenance-usage branch May 12, 2026 16:03