Skip to content

ROSAENG-61038 | fix: improve ready-state cluster waiting#3327

Open
olucasfreitas wants to merge 1 commit into
openshift:masterfrom
olucasfreitas:cursor/b2edb6ff
Open

ROSAENG-61038 | fix: improve ready-state cluster waiting#3327
olucasfreitas wants to merge 1 commit into
openshift:masterfrom
olucasfreitas:cursor/b2edb6ff

Conversation

@olucasfreitas

@olucasfreitas olucasfreitas commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

PR Summary

Improve the ready-state waiter used by ROSA e2e helpers so manual HCP cluster flows report actionable status instead of timing out on rosa list cluster alone.

Detailed Description of the Issue

OCP-75536 / ROSAENG-61038 tracks an HCP manual-mode cluster-create failure that timed out waiting for ready. The shared waiter in tests/utils/exec/rosacli/cluster_service.go only polled rosa list cluster, which gave low-signal failures and let provisioning state details get lost. This change keeps the generic waiter for non-ready states, but uses describe cluster state/details when callers wait for ready.

Related Issues and PRs

  • Jira: ROSAENG-61038
  • Fixes: N/A
  • Related PR(s): N/A
  • Related design/docs: N/A

Type of Change

  • fix - resolves an incorrect behavior or bug.
  • feat - adds a new user-facing capability.
  • docs - updates documentation only.
  • style - formatting or naming changes with no logic impact.
  • refactor - code restructuring with no behavior change.
  • test - adds or updates tests only.
  • chore - maintenance work (tooling, housekeeping, non-product code).
  • build - changes build system, packaging, or dependencies for build output.
  • ci - changes CI pipelines, jobs, or automation workflows.
  • perf - improves performance without changing intended behavior.

Previous Behavior

Waiting for ready only checked rosa list cluster. Manual HCP flows could time out without surfacing provisioning errors or the last observed cluster state, and an empty cluster ID was not rejected up front.

Behavior After This Change

When waiting for ready, the helper validates the cluster ID, polls describe cluster state/details, preserves legitimate intermediate states such as resuming, resets runner JSON format correctly after describe failures, and returns timeouts/errors with the last known state and provisioning details. Focused tests cover the ready waiter path and its error handling.

How to Test (Step-by-Step)

Preconditions

  • Checkout branch cursor/b2edb6ff
  • Go toolchain installed
  • Repo hooks installed via make install-hooks

Test Steps

  1. make fmt
  2. go test ./tests/utils/exec/rosacli
  3. go test ./tests/e2e -run TestDoesNotExist
  4. make lint
  5. make rosa
  6. make test
  7. make basic-checks

Expected Results

  • All commands pass.
  • The ready-wait helper reports actionable state details instead of timing out with only rosa list cluster output.

Proof of the Fix

  • Screenshots: N/A
  • Videos: N/A
  • Logs/CLI output: go test ./tests/utils/exec/rosacli, go test ./tests/e2e -run TestDoesNotExist, make lint, make rosa, make test, and make basic-checks all passed locally.
  • Other artifacts: Added focused waiter coverage in tests/utils/exec/rosacli/cluster_service_test.go.

Risk / Follow-up

  • Low residual risk: there is still no single waiter-level test that combines a failed JSON describe with a successful text describe fallback inside the same wait loop. The new direct format-reset coverage protects the shared runner state for that path.

Breaking Changes

  • No breaking changes
  • Yes, this PR introduces a breaking change (describe impact and migration plan below)

Breaking Change Details / Migration Plan

N/A

Developer Verification Checklist

  • Commit subject/title follows [JIRA-TICKET] | [TYPE]: <MESSAGE>.
  • PR description clearly explains both what changed and why.
  • Relevant Jira/GitHub issues and related PRs are linked.
  • make install-hooks has been run in this clone.
  • Tests were added/updated where appropriate.
  • I manually tested the change.
  • make test passes.
  • make lint passes.
  • make rosa passes.
  • Documentation or repo-local agent guidance was added/updated where appropriate.
  • Any risk, limitation, or follow-up work is documented.

Summary by CodeRabbit

  • Bug Fixes
    • Improved cluster status waiting by validating required inputs and using safer polling timing values.
    • Enhanced readiness handling with clearer state evaluation, including better messages for uninstalling, provisioning errors, and “cluster not found” during readiness waits.
    • Ensured JSON formatting settings are always restored after describe operations, even when failures occur.
  • Tests
    • Added a Ginkgo/Gomega test suite covering readiness transitions, timeouts, error formatting, and JSON describe fallback behavior.
    • Added additional test helpers and a suite entrypoint for the rosacli package.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

This PR updates cluster_service.go to reset JSON formatting with deferred cleanup, validate empty cluster IDs, compute wait timing from minute inputs, and add a Ready-specific polling path with new helpers for readiness evaluation and timeout/error formatting. It also adds Ginkgo/Gomega coverage for these behaviors, plus a rosacli test suite entrypoint and a helper for prepending PATH in tests.

Changes

Related Issues: Not specified
Related PRs: Not specified
Suggested labels: tests, waiting, cluster-management
Suggested reviewers: Not specified

Sequence Diagram(s)

Included in the hidden review stack artifact.

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Several specs use bare Expect(...).ToNot(HaveOccurred()), and two tests mutate process-wide PATH, hurting isolation and consistency. Add diagnostic messages to bare expectations and remove the global PATH changes by relying on isolated runner env setup and cleanup.
✅ Passed checks (14 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the required Jira/type format and clearly summarizes the ready-state waiting fix.
Description check ✅ Passed The description follows the template well and includes the issue, changes, related issue, testing, proof, and risk sections.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo titles are static, human-readable strings; no dynamic names, timestamps, IDs, or generated values appear in the new rosacli tests.
Microshift Test Compatibility ✅ Passed New Ginkgo tests only exercise fake rosa CLI behavior and contain no unsupported OpenShift APIs, namespaces, or MicroShift-hostile assumptions.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The new rosacli tests only cover CLI status polling and error formatting; they make no node-count, scheduling, failover, or topology assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed Only rosacli test/helper files changed; no deployment manifests, controllers, or scheduling constraints were introduced.
Ote Binary Stdout Contract ✅ Passed No stdout writes were added in process-level code; the new suite only calls RunSpecs, and all prints are inside normal helper functions or It blocks.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The new Ginkgo tests use a local fake rosa binary and in-memory state; no IPv4 literals, net parsing, or external connectivity requirements were found.
No-Weak-Crypto ✅ Passed Changed files add waiter logic and tests only; no weak algorithms, custom crypto, or secret/token comparisons are present.
Container-Privileges ✅ Passed Touched files are Go helpers/tests only; scans found no privileged/hostPID/hostNetwork/etc. settings or manifest files.
No-Sensitive-Data-In-Logs ✅ Passed No new log statements or sensitive-value logging were added; the patch only returns errors and tests.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olucasfreitas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 2, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
tests/utils/exec/rosacli/cluster_service.go (3)

576-614: 📐 Maintainability & Code Quality | 🔵 Trivial

Duplicated detail-building logic between formatReadyClusterError and formatReadyClusterTimeout.

Both functions build an identical details slice from ProvisioningErrorCode, ProvisioningErrorMessage, and FailedInflightChecks. Extract a shared helper to avoid the two implementations drifting apart over time.

♻️ Extract shared detail builder
+func provisioningDetails(description *ClusterDescription) []string {
+	details := []string{}
+	if description.ProvisioningErrorCode != "" {
+		details = append(details, fmt.Sprintf("provisioning error code: %s", description.ProvisioningErrorCode))
+	}
+	if description.ProvisioningErrorMessage != "" {
+		details = append(details, fmt.Sprintf("provisioning error message: %s", description.ProvisioningErrorMessage))
+	}
+	if description.FailedInflightChecks != "" {
+		details = append(details, fmt.Sprintf("failed inflight checks: %s", description.FailedInflightChecks))
+	}
+	return details
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/exec/rosacli/cluster_service.go` around lines 576 - 614,
`formatReadyClusterError` and `formatReadyClusterTimeout` duplicate the same
detail collection logic for `ProvisioningErrorCode`, `ProvisioningErrorMessage`,
and `FailedInflightChecks`. Extract that shared behavior into a helper in
`cluster_service.go` (for example, a details builder used by both functions) and
have both error formatters call it, keeping `description.State`/timeout-specific
wording in each function. This will keep the `formatReadyClusterError` and
`formatReadyClusterTimeout` implementations aligned and reduce future drift.

514-549: 📐 Maintainability & Code Quality | 🔵 Trivial

Custom polling loop duplicates wait.PollUntilContextTimeout semantics.

waitForClusterReadyStatus reimplements timeout/interval polling manually instead of using wait.PollUntilContextTimeout (already imported and used just below for the non-Ready branch). Consider expressing this as a wait.PollUntilContextTimeout condition function to keep a single polling implementation, reduce duplicated timing logic, and inherit its context-cancellation semantics consistently across both branches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/exec/rosacli/cluster_service.go` around lines 514 - 549,
waitForClusterReadyStatus manually reimplements interval/timeout polling,
duplicating logic that should match the existing wait.PollUntilContextTimeout
usage in the same area. Refactor the Ready-state loop in
waitForClusterReadyStatus to use a wait.PollUntilContextTimeout condition
function, reusing getState, getDescription, and evaluateReadyClusterState so
both branches share the same polling and cancellation semantics. Keep the
lastDescription handling and timeout error formatting intact via
formatReadyClusterTimeout.

469-486: 📐 Maintainability & Code Quality | 🔵 Trivial

Ready-path bypasses wait.PollUntilContextTimeout used elsewhere in this same function.

The non-Ready branch below reuses the apimachinery wait.PollUntilContextTimeout utility, but the Ready branch delegates to a hand-rolled time.Now()/time.Sleep loop in waitForClusterReadyStatus. This duplicates polling/timeout logic that the library already provides in the same file, increasing maintenance surface for two divergent polling mechanisms.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/exec/rosacli/cluster_service.go` around lines 469 - 486, The
Ready branch in waitForClusterReadyStatus uses a custom time.Now()/time.Sleep
polling loop instead of the same wait.PollUntilContextTimeout approach used in
the other branch of cluster_service.go. Refactor waitForClusterReadyStatus (and
its caller in the Ready path) to use the shared apimachinery polling utility for
both status checks and timeout handling, keeping the existing fetch/describe
callbacks intact so there is only one polling mechanism in this code path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/utils/exec/rosacli/cluster_service_test.go`:
- Around line 91-104: Add test coverage for the no-details fallback in
formatReadyClusterError by adding a case in clusterServiceTest around
evaluateReadyClusterState where the ClusterDescription only sets State to
constants.Error and leaves ProvisioningErrorCode, ProvisioningErrorMessage, and
FailedInflightChecks empty. Assert the returned error still includes the basic
“cluster … is in error state” message and that ready is false, so the
len(details) == 0 branch in formatReadyClusterError is exercised.
- Around line 82-122: Add tests for the missing branches in
evaluateReadyClusterState to cover the nil ClusterDescription path and the
empty/whitespace State path, since current cases only exercise waiting, error,
uninstalling, and resuming states. Update cluster_service_test.go by adding
focused assertions around evaluateReadyClusterState(clusterID, nil) and
evaluateReadyClusterState(clusterID, &ClusterDescription{State: "   "}) to
verify the returned ready flag and error behavior match the function’s
nil/empty-state handling.

---

Nitpick comments:
In `@tests/utils/exec/rosacli/cluster_service.go`:
- Around line 576-614: `formatReadyClusterError` and `formatReadyClusterTimeout`
duplicate the same detail collection logic for `ProvisioningErrorCode`,
`ProvisioningErrorMessage`, and `FailedInflightChecks`. Extract that shared
behavior into a helper in `cluster_service.go` (for example, a details builder
used by both functions) and have both error formatters call it, keeping
`description.State`/timeout-specific wording in each function. This will keep
the `formatReadyClusterError` and `formatReadyClusterTimeout` implementations
aligned and reduce future drift.
- Around line 514-549: waitForClusterReadyStatus manually reimplements
interval/timeout polling, duplicating logic that should match the existing
wait.PollUntilContextTimeout usage in the same area. Refactor the Ready-state
loop in waitForClusterReadyStatus to use a wait.PollUntilContextTimeout
condition function, reusing getState, getDescription, and
evaluateReadyClusterState so both branches share the same polling and
cancellation semantics. Keep the lastDescription handling and timeout error
formatting intact via formatReadyClusterTimeout.
- Around line 469-486: The Ready branch in waitForClusterReadyStatus uses a
custom time.Now()/time.Sleep polling loop instead of the same
wait.PollUntilContextTimeout approach used in the other branch of
cluster_service.go. Refactor waitForClusterReadyStatus (and its caller in the
Ready path) to use the shared apimachinery polling utility for both status
checks and timeout handling, keeping the existing fetch/describe callbacks
intact so there is only one polling mechanism in this code path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dcefcb5f-ad43-4d37-bb03-6da5eaa87ac8

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7f60c and 624568a.

📒 Files selected for processing (3)
  • tests/utils/exec/rosacli/cluster_service.go
  • tests/utils/exec/rosacli/cluster_service_test.go
  • tests/utils/exec/rosacli/rosacli_suite_test.go

Comment thread tests/utils/exec/rosacli/cluster_service_test.go
Comment thread tests/utils/exec/rosacli/cluster_service_test.go
Comment thread tests/utils/exec/rosacli/cluster_service.go Outdated
Comment thread tests/utils/exec/rosacli/cluster_service.go Outdated
waitTimeout = 200 * time.Millisecond
}

if status == constants.Ready {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review is for all status/state comparisons: should the code use case-insensitive comparison if OCM state casing can vary? Exact match on "ready" today.

@coderabbitai validate

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd -a 'constants' tests/utils/exec/rosacli tests/utils/constants 2>/dev/null
rg -n "Ready|Uninstalling|Error\b|Waiting\b|Resuming" tests/utils/constants -A1 -B1 2>/dev/null | head -100

Length of output: 6444


@amandahla Confirmed — I checked tests/utils/constants/cluster.go, and all the state constants used in these comparisons (Ready = "ready", Waiting = "waiting", Error = "error", Uninstalling = "uninstalling", etc.) are lowercase, and the OCM state field returned by rosa describe cluster (both JSON and table output) is consistently lowercase as well. So the exact-match comparisons (description.State == constants.Ready, clusterItem.State == status, clusterJSONData.DigString("state"), etc.) work correctly today because both sides follow the same lowercase convention — it's not a correctness bug currently.

That said, since this is test/e2e tooling parsing CLI text output (which is inherently more fragile than a typed API response), I agree it's worth hardening defensively:

  • Use strings.EqualFold(description.State, constants.Ready) instead of == for state comparisons.
  • Use strings.Contains(strings.ToLower(description.State), constants.Uninstalling) (or strings.EqualFold-based helper) for the Uninstalling/Error substring checks.
  • Apply the same to WaitForClusterPassWaiting's != constants.Waiting check.

This would guard against any future casing changes in OCM's state values or unexpected CLI formatting drift, at negligible cost. Let me know if you'd like me to kick off a plan for this change, or if you'd prefer to leave it as exact-match since it matches current OCM behavior.


🧬 Code Graph Analysis Results

Relevant snippet(s)

tests/utils/exec/rosacli/cluster_service.go (approx. lines 230-320): status/state comparisons in WaitClusterStatus

func (c *clusterService) WaitClusterStatus(clusterID string, status string, interval int, duration int) error {
	if strings.TrimSpace(clusterID) == "" {
		return fmt.Errorf("cluster ID is required when waiting for cluster status %s", status)
	}

	waitInterval := time.Duration(interval) * time.Minute
	if waitInterval <= 0 {
		waitInterval = time.Millisecond
	}
	waitTimeout := time.Duration(duration) * time.Minute
	if waitTimeout <= 0 {
		waitTimeout = 200 * time.Millisecond
	}

	if status == constants.Ready {
		return waitForClusterReadyStatus(
			clusterID,
			waitInterval,
			waitTimeout,
			func() (string, error) {
				clusterJSONData, err := c.GetJSONClusterDescription(clusterID)
				if err != nil {
					return "", err
				}
				return clusterJSONData.DigString("state"), nil
			},
			func() (*ClusterDescription, error) {
				return c.DescribeClusterAndReflect(clusterID)
			},
			time.Sleep,
		)
	}

	err := wait.PollUntilContextTimeout(
		context.Background(),
		waitInterval,
		waitTimeout,
		false,
		func(context.Context) (bool, error) {
			clusterListB, err := c.List()
			if err != nil {
				return false, err
			}
			clusterList, err := c.ReflectClusterList(clusterListB)
			if err != nil {
				return false, err
			}
			clusterItem := clusterList.Cluster(clusterID)
			if err != nil {
				return false, err
			}
			if clusterItem.State == status {
				return true, nil
			}
			return false, err
		})
	return err
}

tests/utils/exec/rosacli/cluster_service.go (approx. lines 322-386): constants.Ready exact match in waitForClusterReadyStatus

func waitForClusterReadyStatus(
	clusterID string,
	interval time.Duration,
	timeout time.Duration,
	getState func() (string, error),
	getDescription func() (*ClusterDescription, error),
	sleepFn func(time.Duration),
) error {
	endTime := time.Now().Add(timeout)
	var lastDescription *ClusterDescription

	for time.Now().Before(endTime) {
		state, err := getState()
		if err == nil && state == constants.Ready {
			return nil
		}

		description, err := getDescription()
		if err != nil {
			return err
		}
		lastDescription = description

		ready, err := evaluateReadyClusterState(clusterID, description)
		if err != nil {
			return err
		}
		if ready {
			return nil
		}

		sleepFn(interval)
	}

	return formatReadyClusterTimeout(clusterID, timeout, lastDescription)
}

tests/utils/exec/rosacli/cluster_service.go (approx. lines 388-450): state comparisons in evaluateReadyClusterState

func evaluateReadyClusterState(clusterID string, description *ClusterDescription) (bool, error) {
	if description == nil {
		return false, fmt.Errorf("no cluster description found for %s", clusterID)
	}

	if description.State == constants.Ready {
		return true, nil
	}

	if strings.Contains(description.State, constants.Uninstalling) {
		return false, fmt.Errorf("cluster %s is %s now. Cannot wait for it ready",
			clusterID, constants.Uninstalling)
	}

	if strings.Contains(description.State, constants.Error) {
		return false, formatReadyClusterError(clusterID, description)
	}

	if strings.TrimSpace(description.State) == "" {
		return false, fmt.Errorf("cluster %s returned an empty state", clusterID)
	}

	return false, nil
}

tests/utils/exec/rosacli/cluster_service.go (approx. lines 470-530): constants.Waiting exact match in WaitForClusterPassWaiting

func (c *clusterService) WaitForClusterPassWaiting(clusterID string, interval int, timeoutMin int) error {
	endTime := time.Now().Add(time.Duration(timeoutMin) * time.Minute)
	for time.Now().Before(endTime) {
		clusterJsonData, err := c.GetJSONClusterDescription(clusterID)
		if err != nil {
			return err
		}
		if clusterJsonData.DigString("state") != constants.Waiting {
			log.Logger.Infof("Cluster %s is not in waiting state anymore", clusterID)
			return nil
		}
		time.Sleep(time.Duration(interval) * time.Minute)
	}
	return fmt.Errorf("timeout for cluster stuck waiting after %d mins", timeoutMin)
}

tests/utils/exec/rosacli/cluster_service.go (approx. lines 440-470): uninstalling state matching via strings.Contains in WaitForClusterPassUninstalled

func (c *clusterService) WaitForClusterPassUninstalled(clusterID string, interval int, timeoutMin int) error {
	endTime := time.Now().Add(time.Duration(timeoutMin) * time.Minute)
	for time.Now().Before(endTime) {
		output, err := c.DescribeCluster(clusterID)
		if err != nil {
			if strings.Contains(output.String(),
				fmt.Sprintf("There is no cluster with identifier or name '%s'", clusterID)) ||
				strings.Contains(output.String(),
					fmt.Sprintf("Cluster '%s' not found", clusterID)) {
				log.Logger.Infof("Cluster %s has been deleted.", clusterID)
				return nil
			}
			log.Logger.Warnf("Issue retrieving cluster: %v.", err)
			continue
		}
		desc, err := c.ReflectClusterDescription(output)
		if err != nil {
			return err
		}
		if strings.Contains(desc.State, constants.Uninstalling) {
			time.Sleep(time.Duration(interval))
			continue
		}
		return fmt.Errorf("cluster %s is in status of %s which won't be deleted, stop waiting", clusterID, desc.State)
	}
	return fmt.Errorf("timeout for waiting for cluster deletion finished after %d mins", timeoutMin)
}

tests/utils/constants/cluster.go (lines 34-39): canonical state string constants

Ready        = "ready"
Waiting      = "waiting"
Error        = "error"
Uninstalling = "uninstalling"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked this one against the current test contract before changing it. tests/utils/constants/cluster.go and the CLI/OCM state strings we parse are all lowercase today, so I left the comparisons exact in this patch. If we ever see mixed-case state output, I would rather normalize once at the parse boundary and add a focused regression test than broaden each comparison independently.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Use describe-cluster state and provisioning details when waiting for clusters to become ready so manual HCP flows fail with actionable errors instead of low-signal timeouts. Add focused waiter coverage, including the ready-state polling path used by the failing test scenario.

Signed-off-by: lufreita <lufreita@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/utils/exec/rosacli/cluster_service_test.go (1)

27-99: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicated fake-script setup between the two subprocess tests.

Both tests repeat the temp-dir creation, script writing, PATH manipulation, and client construction boilerplate. Extracting a small helper (e.g., newFakeRosaClient(t, script)) would reduce duplication and make future additions easier.

Also applies to: 242-268

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/exec/rosacli/cluster_service_test.go` around lines 27 - 99, The
fake rosa subprocess setup is duplicated across the two cluster service tests,
including temp-dir creation, script generation, PATH overrides, and
client/service construction. Factor this boilerplate into a small helper such as
newFakeRosaClient that builds the fake script, updates PATH, creates the client
via NewClient, and returns the configured NewClusterService (or client) for both
tests to reuse.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/utils/exec/rosacli/cluster_service_test.go`:
- Around line 27-99: The test in WaitClusterStatus is mutating the process-wide
PATH unnecessarily even though Client.Runner.envs already uses
prependPathEnv(tempDir) to resolve the fake rosa binary. Remove the
os.Setenv("PATH", ...) call and its deferred restore from the
cluster_service_test setup, and rely only on the runner env injection so the
test stays isolated from shared process state.

---

Nitpick comments:
In `@tests/utils/exec/rosacli/cluster_service_test.go`:
- Around line 27-99: The fake rosa subprocess setup is duplicated across the two
cluster service tests, including temp-dir creation, script generation, PATH
overrides, and client/service construction. Factor this boilerplate into a small
helper such as newFakeRosaClient that builds the fake script, updates PATH,
creates the client via NewClient, and returns the configured NewClusterService
(or client) for both tests to reuse.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9d803977-ce21-40d2-b0f5-474f3f95a368

📥 Commits

Reviewing files that changed from the base of the PR and between 624568a and 9aa5d43.

📒 Files selected for processing (3)
  • tests/utils/exec/rosacli/cluster_service.go
  • tests/utils/exec/rosacli/cluster_service_test.go
  • tests/utils/exec/rosacli/rosacli_suite_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/utils/exec/rosacli/rosacli_suite_test.go
  • tests/utils/exec/rosacli/cluster_service.go

Comment on lines +27 to +99
It("waits for ready status using describe output", func() {
tempDir, err := os.MkdirTemp("", "fake-rosa-*")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tempDir)

textCountFile := filepath.Join(tempDir, "describe-count")
jsonCountFile := filepath.Join(tempDir, "json-describe-count")
scriptPath := filepath.Join(tempDir, "rosa")
script := fmt.Sprintf(`#!/usr/bin/env bash
set -euo pipefail
text_count_file=%q
json_count_file=%q

if [[ "$1" == "describe" && "$2" == "cluster" ]]; then
if [[ "$*" == *"--output json"* ]]; then
count=0
if [[ -f "$json_count_file" ]]; then
count=$(<"$json_count_file")
fi
count=$((count + 1))
printf '%%s' "$count" > "$json_count_file"
printf '{"state":"waiting"}\n'
exit 0
fi

count=0
if [[ -f "$text_count_file" ]]; then
count=$(<"$text_count_file")
fi
count=$((count + 1))
printf '%%s' "$count" > "$text_count_file"

if [[ "$count" -eq 1 ]]; then
cat <<'EOF'
Name: fake
ID: 1234abcd
State: resuming
EOF
else
cat <<'EOF'
Name: fake
ID: 1234abcd
State: ready
EOF
fi
exit 0
fi

echo "unexpected args: $*" >&2
exit 1
`, textCountFile, jsonCountFile)
err = os.WriteFile(scriptPath, []byte(script), 0755)
Expect(err).ToNot(HaveOccurred())

originalPath := os.Getenv("PATH")
err = os.Setenv("PATH", tempDir+":"+originalPath)
Expect(err).ToNot(HaveOccurred())
defer os.Setenv("PATH", originalPath)

client := NewClient()
client.Runner.envs = prependPathEnv(tempDir)
service := NewClusterService(client)

err = service.WaitClusterStatus(clusterID, constants.Ready, 0, 0)

Expect(err).ToNot(HaveOccurred())

jsonCalls := "0"
if content, readErr := os.ReadFile(jsonCountFile); readErr == nil {
jsonCalls = string(content)
}
Expect(jsonCalls).To(Equal("0"))
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the ready-wait defaults and confirm the fallback timing
ast-grep run --pattern 'func waitForClusterReadyStatus($$$) $$$' --lang go tests/utils/exec/rosacli/cluster_service.go

Repository: openshift/rosa

Length of output: 2855


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== WaitClusterStatus and timeout defaults =="
ast-grep run --pattern 'func (client *Client) WaitClusterStatus($$$) $$$' --lang go tests/utils/exec/rosacli/cluster_service.go

echo
echo "== Surrounding helper for ready wait =="
sed -n '420,560p' tests/utils/exec/rosacli/cluster_service.go

echo
echo "== Test file around the two subprocess tests =="
sed -n '1,120p' tests/utils/exec/rosacli/cluster_service_test.go
echo
sed -n '220,290p' tests/utils/exec/rosacli/cluster_service_test.go

Repository: openshift/rosa

Length of output: 197


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Find WaitClusterStatus definition =="
rg -n "WaitClusterStatus|waitForClusterReadyStatus|PollUntilContextTimeout|MkdirTemp|Setenv\(\"PATH\"" tests/utils/exec/rosacli/cluster_service.go tests/utils/exec/rosacli/cluster_service_test.go

echo
echo "== Read relevant implementation section =="
sed -n '470,555p' tests/utils/exec/rosacli/cluster_service.go

echo
echo "== Read the first subprocess test =="
sed -n '1,120p' tests/utils/exec/rosacli/cluster_service_test.go

echo
echo "== Read the later subprocess test =="
sed -n '230,285p' tests/utils/exec/rosacli/cluster_service_test.go

Repository: openshift/rosa

Length of output: 8663


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== WaitClusterStatus wrapper =="
sed -n '456,490p' tests/utils/exec/rosacli/cluster_service.go

echo
echo "== Client runner env handling =="
rg -n "type Runner|envs|prependPathEnv|runnerCfg|exec.Command|os.Environ|PATH=" tests/utils/exec/rosacli -g '*.go'

Repository: openshift/rosa

Length of output: 3225


Drop the process-wide PATH mutation here. prependPathEnv(tempDir) already injects the fake rosa binary into the runner env, so os.Setenv("PATH", ...) and the deferred restore are redundant and widen the test’s shared-state surface.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/exec/rosacli/cluster_service_test.go` around lines 27 - 99, The
test in WaitClusterStatus is mutating the process-wide PATH unnecessarily even
though Client.Runner.envs already uses prependPathEnv(tempDir) to resolve the
fake rosa binary. Remove the os.Setenv("PATH", ...) call and its deferred
restore from the cluster_service_test setup, and rely only on the runner env
injection so the test stays isolated from shared process state.

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@olucasfreitas: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-presubmits-pr-rosa-hcp-advanced 9aa5d43 link false /test e2e-presubmits-pr-rosa-hcp-advanced
ci/prow/security 9aa5d43 link false /test security

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants