Report job success immediately after main graph completes, before post steps#4921
Report job success immediately after main graph completes, before post steps#4921deepsm007 wants to merge 2 commits intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughReports main graph success immediately on completion (and records that a success report was sent), runs post-steps concurrently as best-effort cleanup (logging failures without aborting), and records aggregated metrics including main/post-step/total durations and a post-steps-failed flag. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cmd/ci-operator/main.go`:
- Around line 1065-1071: Duplicate success reports are caused by the early
reporter.Report(nil) call in the Run path and the final opt.Report() call in
main(); make reporting idempotent by adding a boolean flag on the options struct
(e.g., earlyReported) and set it when you call reporter.Report(nil) after
o.resultsOptions.Reporter(...) (the block that currently logs loadErr), then
update the Report() method (opt.Report / Report()) to check that flag and skip
calling reporter.Report(nil) if earlyReported is true; reference the Run()
caller flow, main() → opt.Report(), the Report() method, reporter.Report(nil),
and o.resultsOptions.Reporter to locate where to set and check the flag.
…t steps - Report success to users as soon as main graph completes successfully - Post steps (promotion/cleanup) now run as best-effort and don't affect job result - Add metrics tracking: main_graph_duration_seconds, post_steps_duration_seconds, time_saved_seconds - Metrics are extractable from ci-operator-metrics.json in test_platform_insights events - Prevent duplicate success reporting by tracking early-report state in options
8a88f52 to
fd54539
Compare
…riodics, exclude rehearsals
|
/unhold |
|
/test images |
|
Scheduling required tests: |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deepsm007, liangxia The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold |
|
@deepsm007: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| // Main graph completed successfully - report success immediately before post steps | ||
| mainGraphCompletedAt := time.Now() | ||
| mainGraphDuration := mainGraphCompletedAt.Sub(start) | ||
| eventRecorder.Event(runtimeObject, coreapi.EventTypeNormal, "CiJobSucceeded", eventJobDescription(o.jobSpec, o.namespace)) |
There was a problem hiding this comment.
Nothing is consuming this. I had plans to deprecate them all.
|
/cc |
|
I think the best approach will be to report after each step is completed in general and keep the final report state after post steps as it was. |
| reporter.Report(nil) | ||
| o.successReported = true |
There was a problem hiding this comment.
I think this is the part you expect to deliver the early report, but AFAIK this does not do any actual meaningful reporting. It reports results to result-aggregator which AFAIK only powers the ci operator executions fail too much because of a $reason alerts and apparently does something related to pod scaler, too (but its not documented so who knows lol).
The actual result reporting goes strictly through exit codes: ci-operator exits, prow picks up that exit code and propagates the result accordingly (sets Prowjob status, crier reports to GitHub...). So early reporting is likely a much greater architecture change - either we invent APIs on both ci-operator and Prow layers to directly support this (report the final result while the job is still executing), or we'd allow ci-operator and therefore the prowjob to terminate with the final result, and the "remaining" parts of the workload would need to be handed over to some background manager to process and finalize them. Both are IMO non trivial and a kind of effort that needs some design work
/hold
/cc @droslean @openshift/test-platform