[safe-output-health] Safe Output Health Report - 2026-04-13 #26038
Closed
This discussion has been marked as outdated by Safe Output Health Monitor. A newer discussion is available at Discussion #26222.
Executive Summary
The primary issue today is the API rate limit concurrent burst pattern: 3 create_issue safe output operations failed with HTTP 403 during a 3-minute window (12:17–12:20 UTC) when multiple daily workflows completed and triggered safe output jobs simultaneously. This is the third occurrence of this pattern (also seen 2026-04-02 and 2026-04-07).
Safe Output Job Statistics
Job types: create_issue, create_discussion, add_comment, create_pull_request, upload_asset, noop, add_labels, upload_artifact, update_issue, assign_to_agent, submit_pull_request_review, dispatch_workflow, create_pull_request_review_comment
*Warning issued but counted as non-failure (see Cluster 2)
**Skipped (not in PR context)
Error Clusters
Cluster 1: API Rate Limit Exceeded, create_issue (HIGH severity)
POST /repos/github/gh-aw/issues (403)
Error details (run-24342586738, Workflow Health Manager)
submit_pull_request_review was configured but the run was triggered by schedule, not a PR event. The inline create_pull_request_review_comment messages were also skipped (not in PR context). The overall run completed successfully (7/10 messages succeeded).
Root Cause Analysis
API Rate Limit Pattern
The GitHub App installation has a finite hourly request budget. When many workflows complete concurrently (the noon UTC burst), safe output jobs race to make API calls. The safe_output_handler_manager.cjs has zero retry logic for rate-limited responses (retries: 0, retry-exempt-status-codes: 400,401,403,404,422; notably 403 is exempt). This means rate limit hits are instant, unrecoverable failures.
Historical occurrence frequency:
PR Review Context Warning
The submit_pull_request_review handler checks for a PR review context at finalization time. When called via schedule (not a PR event), inline comments are skipped and the review context is never set. The handler emits a warning but does not count this as a failure. This is expected behavior for smoke tests that run on schedules: the agent submits review-related operations that only make sense in PR context.
Recommendations
Critical Issues (Immediate Action Required)
Add retry logic for rate-limited safe output operations
safe_output_handler_manager.cjs uses actions/github-script with retries: 0 and includes 403 in retry-exempt-status-codes, bypassing all retry logic for rate limit responses.
Stagger noon-UTC schedule triggers to reduce burst
15 12 * * *, 30 12 * * *, etc.) for non-time-critical workflows.
Bug Fixes Required
safe_output_handler_manager.cjs: Rate Limit Retry
actions/safe_output_handler_manager.cjs (or actions/github-script configuration)
create_issue, add_comment, create_discussion, and all GitHub API write operations.
Configuration Changes
daily → noon UTC
5 12, 15 12, 25 12) to spread safe output calls over a wider window.
Process Improvements
Rate limit monitoring and alerting
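As an illustration of this monitoring idea, the sketch below reads the standard GitHub rate limit response headers and decides whether a warning annotation is due. The function names checkRateLimit and warningAnnotation, and the 10% default threshold, are assumptions made for this example; they are not taken from the safe output jobs themselves.

```javascript
// Hypothetical helper (not from the safe output codebase): inspect GitHub's
// rate limit response headers and report whether remaining quota is low.
function checkRateLimit(headers, thresholdFraction = 0.1) {
  const limit = Number(headers["x-ratelimit-limit"]);
  const remaining = Number(headers["x-ratelimit-remaining"]);
  // Missing or malformed headers: report "unknown" rather than guessing.
  if (!Number.isFinite(limit) || !Number.isFinite(remaining) || limit <= 0) {
    return { known: false, low: false };
  }
  return { known: true, low: remaining < limit * thresholdFraction, remaining, limit };
}

// Format a GitHub Actions workflow command that surfaces a warning annotation.
function warningAnnotation(status) {
  return `::warning title=Rate limit low::${status.remaining} of ${status.limit} requests remaining`;
}

// Example with made-up header values: 400 of 5000 is below the 10% threshold.
const status = checkRateLimit({ "x-ratelimit-limit": "5000", "x-ratelimit-remaining": "400" });
if (status.low) {
  console.log(warningAnnotation(status));
}
```

Returning a plain status object keeps the threshold check separate from how the warning is reported, so the same check could also feed a noop or missing_tool report.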
x-ratelimit-remaining) in safe output jobs; emit a warning annotation when remaining drops below a threshold (e.g., 10% of hourly limit); create a noop or missing_tool report when rate limits are the root cause so the issue is surfaced in the health monitor.
PR review context guard
submit_pull_request_review is called in schedule context
create_pull_request_review_comment config could include require_pr_context: true to suppress warnings and cleanly skip when not in PR context.
Work Item Plans
Work Item 1: Retry Logic for Rate-Limited Safe Output Operations
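A minimal sketch of the retry logic this work item describes, assuming errors carry an Octokit-style response.headers object. The names withRateLimitRetry, isRateLimitError, and delayFromError are hypothetical and do not come from safe_output_handler_manager.cjs; the injectable wait function is likewise an assumption added so the helper can be exercised without real delays.

```javascript
// Hypothetical retry helper; names and defaults are illustrative assumptions,
// not the actual contents of safe_output_handler_manager.cjs.
const DEFAULT_DELAYS_MS = [30_000, 60_000, 120_000]; // 30s / 60s / 120s backoff

// Treat an error as a rate limit hit when its message says so.
function isRateLimitError(error) {
  const message = String((error && error.message) || "");
  return message.includes("API rate limit exceeded");
}

// Prefer the x-ratelimit-reset header (epoch seconds) when present;
// otherwise fall back to the fixed delay for this attempt.
function delayFromError(error, fallbackMs, nowMs = Date.now()) {
  const headers = (error && error.response && error.response.headers) || {};
  const resetSeconds = Number(headers["x-ratelimit-reset"]);
  if (Number.isFinite(resetSeconds) && resetSeconds > 0) {
    return Math.max(resetSeconds * 1000 - nowMs, 0);
  }
  return fallbackMs;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `operation`, retrying only on rate limit errors, with at most
// delays.length retries. Any other error is rethrown immediately.
async function withRateLimitRetry(operation, { delays = DEFAULT_DELAYS_MS, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (!isRateLimitError(error) || attempt >= delays.length) throw error;
      await wait(delayFromError(error, delays[attempt]));
    }
  }
}
```

A caller would wrap each write, for example withRateLimitRetry(() => createIssue(payload)), where createIssue stands in for whatever function performs the API call. Because the reset header can point up to an hour ahead, a real job would likely cap the wait to stay within its own timeout.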
In safe_output_handler_manager.cjs, wrap GitHub API calls with a retry helper that checks error.message.includes('API rate limit exceeded'). Use setTimeout with 30s/60s/120s delays. Consider checking the x-ratelimit-reset header for the exact reset time.
Work Item 2: Schedule Stagger for Noon-UTC Burst Workflows
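One way to produce staggered schedules is to spread the workflows evenly over the target window. The sketch below assumes a hypothetical helper, staggeredCrons, and made-up workflow names; sorting by name is an assumed convention so that assignments stay stable between runs, not something the report prescribes.

```javascript
// Hypothetical helper: spread workflows evenly across a UTC window
// (default 11:45 to 12:45) and return one cron expression per workflow.
function staggeredCrons(workflowNames, startMinuteOfDay = 11 * 60 + 45, windowMinutes = 60) {
  // Sort a copy so the same set of names always yields the same assignment.
  const names = [...workflowNames].sort();
  const step = windowMinutes / names.length;
  return names.map((name, index) => {
    const minuteOfDay = startMinuteOfDay + Math.floor(index * step);
    const hour = Math.floor(minuteOfDay / 60) % 24;
    const minute = minuteOfDay % 60;
    return { name, cron: `${minute} ${hour} * * *` };
  });
}

// Example with four placeholder workflow names: slots fall 15 minutes apart.
for (const { name, cron } of staggeredCrons(["d", "a", "c", "b"])) {
  console.log(`${name}: ${cron}`);
}
// a: 45 11 * * *
// b: 0 12 * * *
// c: 15 12 * * *
// d: 30 12 * * *
```

Deriving the slot from a sorted list makes the convention easy to document, though adding or removing a workflow shifts the slots of those after it.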
schedule: daily and assign explicit staggered cron expressions distributed across 11:45–12:45 UTC. Document the assignment convention to prevent future clumping.
Historical Context
Daily health trend (last 7 audited days)
add_comment, create_discussion, create_pull_request, upload_asset: all 100% today.
create_issue: all 3 failures today, consistently the job most affected by rate limiting.
api_rate_limit_concurrent_burst: 3 dates, 10 total failures. No fix has been implemented despite appearing in recommendations on 2026-04-02.
Metrics and KPIs
add_comment, create_discussion, create_pull_request (100%)
create_issue (84.2%)
Next Steps
safe_output_handler_manager.cjs