Skip to content

fix: deterministic audit metrics via run_summary.json cache and workflow-logs/ exclusion#26148

Draft
Copilot wants to merge 4 commits intomainfrom
copilot/fix-inconsistent-audit-metrics
Draft

fix: deterministic audit metrics via run_summary.json cache and workflow-logs/ exclusion#26148
Copilot wants to merge 4 commits intomainfrom
copilot/fix-inconsistent-audit-metrics

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 14, 2026

audit returned inconsistent token_usage and turns on repeated calls for the same run (observed: 9 turns / 381k tokens vs 22 turns / 4.7M tokens) due to two compounding issues.

Root Cause

The inconsistency had two causes:

  1. Re-processing on every call: AuditWorkflowRun always re-processed local log files even when a valid run_summary.json was on disk.

  2. Double-counting via workflow-logs/: The .log file walk in extractLogMetrics was not excluding the workflow-logs/ directory. When audit is called with a full URL (owner+repo), downloadWorkflowRunLogs successfully downloads GitHub Actions step-output into workflow-logs/. These step captures include the agent's full stdout, so the same token-usage JSON blocks already present in agent-stdio.log appeared again (~12 times) and were summed into metrics.TokenUsage — causing the ~12× inflation.

Changes

  • Cache-hit early path in AuditWorkflowRun: before any API calls or log processing, check loadRunSummary. On a hit, reconstruct processedRun from the cached summary and return immediately — no re-download, no re-parse, no API calls.

  • renderAuditReport helper: extracted the shared render + finalization logic (comparison snapshot, JSON/console output, --parse handling, success message) into a new function called by both the cache-hit path and the fresh-download path.

  • Save summary before rendering: in the fresh-download path, saveRunSummary is now called before renderAuditReport so the cache is persisted as early as possible.

  • Skip workflow-logs/ in log walk: extractLogMetrics now returns filepath.SkipDir when the walk visits a directory named workflow-logs, preventing GitHub Actions runner captures from being counted alongside the dedicated agent artifact logs.

Cache invalidation is unchanged — loadRunSummary already rejects summaries whose CLIVersion doesn't match the running binary, so improvements in log parsing are picked up automatically after upgrades.

// New early-return in AuditWorkflowRun (cache hit path)
if summary, ok := loadRunSummary(runOutputDir, verbose); ok {
    processedRun := ProcessedRun{
        Run: summary.Run, /* ... all fields from summary */
    }
    return renderAuditReport(processedRun, summary.Metrics, summary.MCPToolUsage, ...)
}
// Fresh-download path: process → saveRunSummary → renderAuditReport

// workflow-logs/ is now skipped in the log walk
if info.IsDir() {
    if info.Name() == "workflow-logs" {
        return filepath.SkipDir
    }
    return nil
}

Testing

Added TestExtractLogMetricsExcludesWorkflowLogsDir as a regression test: creates a run directory with agent.log (1000 tokens) plus copies of the same data under workflow-logs/, and asserts the total remains 1000 — confirming workflow-logs/ files are ignored.

When AuditWorkflowRun is called for a run that already has a valid
run_summary.json on disk, return the cached report immediately without
re-processing local log files. Re-processing the same files can produce
different results (e.g. when the GitHub API aggregates token usage from
a different source than the locally-stored firewall proxy logs), causing
the token_usage and turns metrics to differ across calls for the same run.

The fix:
- Add a cache-check early path in AuditWorkflowRun: if loadRunSummary
  succeeds, reconstruct processedRun from the summary and render
  without touching any API or log files.
- Extract the shared render + finalization logic into renderAuditReport,
  called by both the cache-hit path and the fresh-download path.
- The fresh-download path now saves the summary BEFORE calling
  renderAuditReport, so the cache is written as early as possible.
- The run_summary.json is not overwritten on cached calls, ensuring the
  first-processed (firewall-based) metrics are preserved.

Cache invalidation is already handled: loadRunSummary rejects summaries
whose CLIVersion differs from the current binary version, so bug fixes in
log parsing are automatically picked up after a tool upgrade.

Adds two new unit tests:
- TestAuditUsesRunSummaryCache: verifies that cached token_usage and
  turns are preserved and run_summary.json is not modified.
- TestRenderAuditReportUsesProvidedMetrics: exercises the new helper.

Fixes: Audit shows inconsistent metrics on repeated calls for same run

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/f8e72352-eed9-4bb9-bc4b-7c856bcb8bbf

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix inconsistent metrics in audit tool for same run fix: use cached run_summary.json to ensure deterministic audit metrics Apr 14, 2026
Copilot AI requested a review from pelikhan April 14, 2026 04:15
@pelikhan
Copy link
Copy Markdown
Collaborator

@copilot review why different code paths return different number, refactor and fix, add tests

Copilot AI and others added 2 commits April 14, 2026 04:37
…okens

The log-file walk in extractLogMetrics was not excluding the workflow-logs/
directory, which is populated by downloadWorkflowRunLogs with GitHub Actions
step-output files. These files capture the combined stdout/stderr of each step
including the agent's own log lines, so the same token-usage JSON blocks that
already appear in the agent artifact (agent-stdio.log / agent.log) were being
found again and their token counts were summed a second (or nth) time.

Call 1 of audit (run ID only): downloadWorkflowRunLogs may fail silently because
owner/repo is resolved from the local git remote (may not match), so workflow-logs/
is absent → only agent.log is counted → 381k tokens (correct).

Call 2 of audit (full URL with owner/repo): downloadWorkflowRunLogs succeeds, so
workflow-logs/ is populated with step logs → the same agent data is encountered
~12 more times → 4.7M tokens (12× inflated).

Fix: return filepath.SkipDir when the walk visits a directory named "workflow-logs",
so only the agent artifact files contribute to the metrics.

Adds TestExtractLogMetricsExcludesWorkflowLogsDir as a regression test that
explicitly verifies the fix: agent.log is counted once (1000 tokens), and copies
inside workflow-logs/ are ignored even when they pass the filename filter.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/3c72016d-4028-4648-88f8-6ab62aa01277

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 14, 2026

@copilot review why different code paths return different number, refactor and fix, add tests

Root cause identified and fixed in commit d63e3be:

Why the two paths returned different numbers:

extractLogMetrics has two code paths:

  1. events.jsonl path – single authoritative source (Copilot session events) → deterministic
  2. .log file walk fallback – sums TokenUsage across every matched .log/*log*.txt file

The inconsistency came from workflow-logs/: when audit was called with a full URL (owner+repo), downloadWorkflowRunLogs successfully downloaded GitHub Actions step-output into workflow-logs/. These step captures include the agent's full stdout, so the same token-usage JSON blocks already present in agent-stdio.log appeared again (and again) in the step files — each copy was summed, inflating token_usage ~12×.

Call 1 (run ID only) → downloadWorkflowRunLogs may fail/skip when owner/repo isn't in the URL → workflow-logs/ absent → only agent.log counted → 381k tokens ✅
Call 2 (full URL) → downloadWorkflowRunLogs succeeds → workflow-logs/ populated → same data counted ~12 more times → 4.7M tokens ❌

Fixes:

  • logs_metrics.go: return filepath.SkipDir when the walk visits workflow-logs/, since those are GHA runner logs that duplicate agent artifact content
  • audit.go: serve repeated audit calls from run_summary.json cache (PR commit 1) so even if the walk were non-deterministic, the first correctly-computed result is always reused
  • copilot_metrics_fix_test.go: regression test TestExtractLogMetricsExcludesWorkflowLogsDir that asserts workflow-logs/ copies are ignored

Copilot AI changed the title fix: use cached run_summary.json to ensure deterministic audit metrics fix: deterministic audit metrics via run_summary.json cache and workflow-logs/ exclusion Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[cli-tools-test] Daily CLI Tools Test: Audit shows inconsistent metrics on repeated calls for same run

2 participants