Skip to content

Report job success immediately after main graph completes, before post steps#4921

Open
deepsm007 wants to merge 2 commits intoopenshift:mainfrom
deepsm007:post-step-job-success
Open

Report job success immediately after main graph completes, before post steps#4921
deepsm007 wants to merge 2 commits intoopenshift:mainfrom
deepsm007:post-step-job-success

Conversation

@deepsm007
Copy link
Contributor

  • Report success to users as soon as main graph completes successfully
  • Post steps (promotion/cleanup) now run as best-effort and don't affect job result
  • Added metrics tracking: main_graph_duration_seconds, post_steps_duration_seconds, time_saved_seconds

/hold

/cc @droslean @openshift/test-platform

@openshift-ci-robot
Copy link
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot requested review from a team and droslean February 2, 2026 20:06
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 2, 2026

Walkthrough

Reports main graph success immediately on completion (and records that a success report was sent), runs post-steps concurrently as best-effort cleanup (logging failures without aborting), and records aggregated metrics including main/post-step/total durations and a post-steps-failed flag.

Changes

Cohort / File(s) Summary
CI operator main logic
cmd/ci-operator/main.go
Adds successReported bool to options; adds shouldReportEarly(jobSpec *api.JobSpec) bool; emits CiJobSucceeded and attempts an immediate success report after main graph completion; runs promotion/post-steps concurrently as best-effort (log/count failures without aborting); prevents duplicate success reports; records metrics for main_graph_duration_seconds, post_steps_duration_seconds, total duration_seconds, time_saved_seconds, success and conditionally post_steps_failed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@cmd/ci-operator/main.go`:
- Around line 1065-1071: Duplicate success reports are caused by the early
reporter.Report(nil) call in the Run path and the final opt.Report() call in
main(); make reporting idempotent by adding a boolean flag on the options struct
(e.g., earlyReported) and set it when you call reporter.Report(nil) after
o.resultsOptions.Reporter(...) (the block that currently logs loadErr), then
update the Report() method (opt.Report / Report()) to check that flag and skip
calling reporter.Report(nil) if earlyReported is true; reference the Run()
caller flow, main() → opt.Report(), the Report() method, reporter.Report(nil),
and o.resultsOptions.Reporter to locate where to set and check the flag.

…t steps

- Report success to users as soon as main graph completes successfully
- Post steps (promotion/cleanup) now run as best-effort and don't affect job result
- Add metrics tracking: main_graph_duration_seconds, post_steps_duration_seconds, time_saved_seconds
- Metrics are extractable from ci-operator-metrics.json in test_platform_insights events
- Prevent duplicate success reporting by tracking early-report state in options
@deepsm007 deepsm007 force-pushed the post-step-job-success branch from 8a88f52 to fd54539 Compare February 2, 2026 20:19
@deepsm007
Copy link
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2026
@deepsm007
Copy link
Contributor Author

/test images

@openshift-ci-robot
Copy link
Contributor

Scheduling required tests:
/test e2e

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 3, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007, liangxia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4188be3 and 2 for PR HEAD efdce43 in total

@deepsm007
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 3, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 3, 2026

@deepsm007: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images efdce43 link unknown /test images
ci/prow/breaking-changes efdce43 link false /test breaking-changes

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

// Main graph completed successfully - report success immediately before post steps
mainGraphCompletedAt := time.Now()
mainGraphDuration := mainGraphCompletedAt.Sub(start)
eventRecorder.Event(runtimeObject, coreapi.EventTypeNormal, "CiJobSucceeded", eventJobDescription(o.jobSpec, o.namespace))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing is consuming this. I had plans to deprecate them all.

@petr-muller
Copy link
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller February 3, 2026 13:23
@droslean
Copy link
Member

droslean commented Feb 3, 2026

I think the best approach will be to report after each step is completed in general and keep the final report state after post steps as it was.

Comment on lines +1078 to +1079
reporter.Report(nil)
o.successReported = true
Copy link
Member

@petr-muller petr-muller Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the part you expect to deliver the early report, but AFAIK this does not do any actual meaningful reporting. It reports results to result-aggregator which AFAIK only powers the ci operator executions fail too much because of a $reason alerts and apparently does something related to pod scaler, too (but its not documented so who knows lol).

The actual result reporting goes strictly through exit codes: ci-operator exits, prow picks up that exit code and propagates the result accordingly (sets Prowjob status, crier reports to GitHub...). So early reporting is likely a much greater architecture change - either we invent APIs on both ci-operator and Prow layers to directly support this (report the final result while the job is still executing), or we'd allow ci-operator and therefore the prowjob to terminate with the final result, and the "remaining" parts of the workload would need to be handed over to some background manager to process and finalize them. Both are IMO non trivial and a kind of effort that needs some design work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants