ROSAENG-61038 | fix: improve ready-state cluster waiting#3327
ROSAENG-61038 | fix: improve ready-state cluster waiting#3327olucasfreitas wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR updates ChangesRelated Issues: Not specified Sequence Diagram(s)Included in the hidden review stack artifact. 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: olucasfreitas The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
tests/utils/exec/rosacli/cluster_service.go (3)
576-614: 📐 Maintainability & Code Quality | 🔵 TrivialDuplicated detail-building logic between
formatReadyClusterErrorandformatReadyClusterTimeout.Both functions build an identical
detailsslice fromProvisioningErrorCode,ProvisioningErrorMessage, andFailedInflightChecks. Extract a shared helper to avoid the two implementations drifting apart over time.♻️ Extract shared detail builder
+func provisioningDetails(description *ClusterDescription) []string { + details := []string{} + if description.ProvisioningErrorCode != "" { + details = append(details, fmt.Sprintf("provisioning error code: %s", description.ProvisioningErrorCode)) + } + if description.ProvisioningErrorMessage != "" { + details = append(details, fmt.Sprintf("provisioning error message: %s", description.ProvisioningErrorMessage)) + } + if description.FailedInflightChecks != "" { + details = append(details, fmt.Sprintf("failed inflight checks: %s", description.FailedInflightChecks)) + } + return details +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/exec/rosacli/cluster_service.go` around lines 576 - 614, `formatReadyClusterError` and `formatReadyClusterTimeout` duplicate the same detail collection logic for `ProvisioningErrorCode`, `ProvisioningErrorMessage`, and `FailedInflightChecks`. Extract that shared behavior into a helper in `cluster_service.go` (for example, a details builder used by both functions) and have both error formatters call it, keeping `description.State`/timeout-specific wording in each function. This will keep the `formatReadyClusterError` and `formatReadyClusterTimeout` implementations aligned and reduce future drift.
514-549: 📐 Maintainability & Code Quality | 🔵 TrivialCustom polling loop duplicates
wait.PollUntilContextTimeoutsemantics.
waitForClusterReadyStatusreimplements timeout/interval polling manually instead of usingwait.PollUntilContextTimeout(already imported and used just below for the non-Ready branch). Consider expressing this as await.PollUntilContextTimeoutcondition function to keep a single polling implementation, reduce duplicated timing logic, and inherit its context-cancellation semantics consistently across both branches.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/exec/rosacli/cluster_service.go` around lines 514 - 549, waitForClusterReadyStatus manually reimplements interval/timeout polling, duplicating logic that should match the existing wait.PollUntilContextTimeout usage in the same area. Refactor the Ready-state loop in waitForClusterReadyStatus to use a wait.PollUntilContextTimeout condition function, reusing getState, getDescription, and evaluateReadyClusterState so both branches share the same polling and cancellation semantics. Keep the lastDescription handling and timeout error formatting intact via formatReadyClusterTimeout.
469-486: 📐 Maintainability & Code Quality | 🔵 TrivialReady-path bypasses
wait.PollUntilContextTimeoutused elsewhere in this same function.The non-Ready branch below reuses the apimachinery
wait.PollUntilContextTimeoututility, but the Ready branch delegates to a hand-rolledtime.Now()/time.Sleeploop inwaitForClusterReadyStatus. This duplicates polling/timeout logic that the library already provides in the same file, increasing maintenance surface for two divergent polling mechanisms.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/exec/rosacli/cluster_service.go` around lines 469 - 486, The Ready branch in waitForClusterReadyStatus uses a custom time.Now()/time.Sleep polling loop instead of the same wait.PollUntilContextTimeout approach used in the other branch of cluster_service.go. Refactor waitForClusterReadyStatus (and its caller in the Ready path) to use the shared apimachinery polling utility for both status checks and timeout handling, keeping the existing fetch/describe callbacks intact so there is only one polling mechanism in this code path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/utils/exec/rosacli/cluster_service_test.go`:
- Around line 91-104: Add test coverage for the no-details fallback in
formatReadyClusterError by adding a case in clusterServiceTest around
evaluateReadyClusterState where the ClusterDescription only sets State to
constants.Error and leaves ProvisioningErrorCode, ProvisioningErrorMessage, and
FailedInflightChecks empty. Assert the returned error still includes the basic
“cluster … is in error state” message and that ready is false, so the
len(details) == 0 branch in formatReadyClusterError is exercised.
- Around line 82-122: Add tests for the missing branches in
evaluateReadyClusterState to cover the nil ClusterDescription path and the
empty/whitespace State path, since current cases only exercise waiting, error,
uninstalling, and resuming states. Update cluster_service_test.go by adding
focused assertions around evaluateReadyClusterState(clusterID, nil) and
evaluateReadyClusterState(clusterID, &ClusterDescription{State: " "}) to
verify the returned ready flag and error behavior match the function’s
nil/empty-state handling.
---
Nitpick comments:
In `@tests/utils/exec/rosacli/cluster_service.go`:
- Around line 576-614: `formatReadyClusterError` and `formatReadyClusterTimeout`
duplicate the same detail collection logic for `ProvisioningErrorCode`,
`ProvisioningErrorMessage`, and `FailedInflightChecks`. Extract that shared
behavior into a helper in `cluster_service.go` (for example, a details builder
used by both functions) and have both error formatters call it, keeping
`description.State`/timeout-specific wording in each function. This will keep
the `formatReadyClusterError` and `formatReadyClusterTimeout` implementations
aligned and reduce future drift.
- Around line 514-549: waitForClusterReadyStatus manually reimplements
interval/timeout polling, duplicating logic that should match the existing
wait.PollUntilContextTimeout usage in the same area. Refactor the Ready-state
loop in waitForClusterReadyStatus to use a wait.PollUntilContextTimeout
condition function, reusing getState, getDescription, and
evaluateReadyClusterState so both branches share the same polling and
cancellation semantics. Keep the lastDescription handling and timeout error
formatting intact via formatReadyClusterTimeout.
- Around line 469-486: The Ready branch in waitForClusterReadyStatus uses a
custom time.Now()/time.Sleep polling loop instead of the same
wait.PollUntilContextTimeout approach used in the other branch of
cluster_service.go. Refactor waitForClusterReadyStatus (and its caller in the
Ready path) to use the shared apimachinery polling utility for both status
checks and timeout handling, keeping the existing fetch/describe callbacks
intact so there is only one polling mechanism in this code path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: dcefcb5f-ad43-4d37-bb03-6da5eaa87ac8
📒 Files selected for processing (3)
tests/utils/exec/rosacli/cluster_service.gotests/utils/exec/rosacli/cluster_service_test.gotests/utils/exec/rosacli/rosacli_suite_test.go
| waitTimeout = 200 * time.Millisecond | ||
| } | ||
|
|
||
| if status == constants.Ready { |
There was a problem hiding this comment.
This review is for all status/state comparisons: should the code use case-insensitive comparison if OCM state casing can vary? Exact match on "ready" today.
@coderabbitai validate
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
fd -a 'constants' tests/utils/exec/rosacli tests/utils/constants 2>/dev/null
rg -n "Ready|Uninstalling|Error\b|Waiting\b|Resuming" tests/utils/constants -A1 -B1 2>/dev/null | head -100Length of output: 6444
@amandahla Confirmed — I checked tests/utils/constants/cluster.go, and all the state constants used in these comparisons (Ready = "ready", Waiting = "waiting", Error = "error", Uninstalling = "uninstalling", etc.) are lowercase, and the OCM state field returned by rosa describe cluster (both JSON and table output) is consistently lowercase as well. So the exact-match comparisons (description.State == constants.Ready, clusterItem.State == status, clusterJSONData.DigString("state"), etc.) work correctly today because both sides follow the same lowercase convention — it's not a correctness bug currently.
That said, since this is test/e2e tooling parsing CLI text output (which is inherently more fragile than a typed API response), I agree it's worth hardening defensively:
- Use
strings.EqualFold(description.State, constants.Ready)instead of==for state comparisons. - Use
strings.Contains(strings.ToLower(description.State), constants.Uninstalling)(orstrings.EqualFold-based helper) for theUninstalling/Errorsubstring checks. - Apply the same to
WaitForClusterPassWaiting's!= constants.Waitingcheck.
This would guard against any future casing changes in OCM's state values or unexpected CLI formatting drift, at negligible cost. Let me know if you'd like me to kick off a plan for this change, or if you'd prefer to leave it as exact-match since it matches current OCM behavior.
🧬 Code Graph Analysis Results
Relevant snippet(s)
tests/utils/exec/rosacli/cluster_service.go (approx. lines 230-320): status/state comparisons in WaitClusterStatus
func (c *clusterService) WaitClusterStatus(clusterID string, status string, interval int, duration int) error {
if strings.TrimSpace(clusterID) == "" {
return fmt.Errorf("cluster ID is required when waiting for cluster status %s", status)
}
waitInterval := time.Duration(interval) * time.Minute
if waitInterval <= 0 {
waitInterval = time.Millisecond
}
waitTimeout := time.Duration(duration) * time.Minute
if waitTimeout <= 0 {
waitTimeout = 200 * time.Millisecond
}
if status == constants.Ready {
return waitForClusterReadyStatus(
clusterID,
waitInterval,
waitTimeout,
func() (string, error) {
clusterJSONData, err := c.GetJSONClusterDescription(clusterID)
if err != nil {
return "", err
}
return clusterJSONData.DigString("state"), nil
},
func() (*ClusterDescription, error) {
return c.DescribeClusterAndReflect(clusterID)
},
time.Sleep,
)
}
err := wait.PollUntilContextTimeout(
context.Background(),
waitInterval,
waitTimeout,
false,
func(context.Context) (bool, error) {
clusterListB, err := c.List()
if err != nil {
return false, err
}
clusterList, err := c.ReflectClusterList(clusterListB)
if err != nil {
return false, err
}
clusterItem := clusterList.Cluster(clusterID)
if err != nil {
return false, err
}
if clusterItem.State == status {
return true, nil
}
return false, err
})
return err
}tests/utils/exec/rosacli/cluster_service.go (approx. lines 322-386): constants.Ready exact match in waitForClusterReadyStatus
func waitForClusterReadyStatus(
clusterID string,
interval time.Duration,
timeout time.Duration,
getState func() (string, error),
getDescription func() (*ClusterDescription, error),
sleepFn func(time.Duration),
) error {
endTime := time.Now().Add(timeout)
var lastDescription *ClusterDescription
for time.Now().Before(endTime) {
state, err := getState()
if err == nil && state == constants.Ready {
return nil
}
description, err := getDescription()
if err != nil {
return err
}
lastDescription = description
ready, err := evaluateReadyClusterState(clusterID, description)
if err != nil {
return err
}
if ready {
return nil
}
sleepFn(interval)
}
return formatReadyClusterTimeout(clusterID, timeout, lastDescription)
}tests/utils/exec/rosacli/cluster_service.go (approx. lines 388-450): state comparisons in evaluateReadyClusterState
func evaluateReadyClusterState(clusterID string, description *ClusterDescription) (bool, error) {
if description == nil {
return false, fmt.Errorf("no cluster description found for %s", clusterID)
}
if description.State == constants.Ready {
return true, nil
}
if strings.Contains(description.State, constants.Uninstalling) {
return false, fmt.Errorf("cluster %s is %s now. Cannot wait for it ready",
clusterID, constants.Uninstalling)
}
if strings.Contains(description.State, constants.Error) {
return false, formatReadyClusterError(clusterID, description)
}
if strings.TrimSpace(description.State) == "" {
return false, fmt.Errorf("cluster %s returned an empty state", clusterID)
}
return false, nil
}tests/utils/exec/rosacli/cluster_service.go (approx. lines 470-530): constants.Waiting exact match in WaitForClusterPassWaiting
func (c *clusterService) WaitForClusterPassWaiting(clusterID string, interval int, timeoutMin int) error {
endTime := time.Now().Add(time.Duration(timeoutMin) * time.Minute)
for time.Now().Before(endTime) {
clusterJsonData, err := c.GetJSONClusterDescription(clusterID)
if err != nil {
return err
}
if clusterJsonData.DigString("state") != constants.Waiting {
log.Logger.Infof("Cluster %s is not in waiting state anymore", clusterID)
return nil
}
time.Sleep(time.Duration(interval) * time.Minute)
}
return fmt.Errorf("timeout for cluster stuck waiting after %d mins", timeoutMin)
}tests/utils/exec/rosacli/cluster_service.go (approx. lines 440-470): uninstalling state matching via strings.Contains in WaitForClusterPassUninstalled
func (c *clusterService) WaitForClusterPassUninstalled(clusterID string, interval int, timeoutMin int) error {
endTime := time.Now().Add(time.Duration(timeoutMin) * time.Minute)
for time.Now().Before(endTime) {
output, err := c.DescribeCluster(clusterID)
if err != nil {
if strings.Contains(output.String(),
fmt.Sprintf("There is no cluster with identifier or name '%s'", clusterID)) ||
strings.Contains(output.String(),
fmt.Sprintf("Cluster '%s' not found", clusterID)) {
log.Logger.Infof("Cluster %s has been deleted.", clusterID)
return nil
}
log.Logger.Warnf("Issue retrieving cluster: %v.", err)
continue
}
desc, err := c.ReflectClusterDescription(output)
if err != nil {
return err
}
if strings.Contains(desc.State, constants.Uninstalling) {
time.Sleep(time.Duration(interval))
continue
}
return fmt.Errorf("cluster %s is in status of %s which won't be deleted, stop waiting", clusterID, desc.State)
}
return fmt.Errorf("timeout for waiting for cluster deletion finished after %d mins", timeoutMin)
}tests/utils/constants/cluster.go (lines 34-39): canonical state string constants
Ready = "ready"
Waiting = "waiting"
Error = "error"
Uninstalling = "uninstalling"There was a problem hiding this comment.
I checked this one against the current test contract before changing it. tests/utils/constants/cluster.go and the CLI/OCM state strings we parse are all lowercase today, so I left the comparisons exact in this patch. If we ever see mixed-case state output, I would rather normalize once at the parse boundary and add a focused regression test than broaden each comparison independently.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Use describe-cluster state and provisioning details when waiting for clusters to become ready so manual HCP flows fail with actionable errors instead of low-signal timeouts. Add focused waiter coverage, including the ready-state polling path used by the failing test scenario. Signed-off-by: lufreita <lufreita@redhat.com>
624568a to
9aa5d43
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/utils/exec/rosacli/cluster_service_test.go (1)
27-99: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDuplicated fake-script setup between the two subprocess tests.
Both tests repeat the temp-dir creation, script writing, PATH manipulation, and client construction boilerplate. Extracting a small helper (e.g.,
newFakeRosaClient(t, script)) would reduce duplication and make future additions easier.Also applies to: 242-268
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/exec/rosacli/cluster_service_test.go` around lines 27 - 99, The fake rosa subprocess setup is duplicated across the two cluster service tests, including temp-dir creation, script generation, PATH overrides, and client/service construction. Factor this boilerplate into a small helper such as newFakeRosaClient that builds the fake script, updates PATH, creates the client via NewClient, and returns the configured NewClusterService (or client) for both tests to reuse.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/utils/exec/rosacli/cluster_service_test.go`:
- Around line 27-99: The test in WaitClusterStatus is mutating the process-wide
PATH unnecessarily even though Client.Runner.envs already uses
prependPathEnv(tempDir) to resolve the fake rosa binary. Remove the
os.Setenv("PATH", ...) call and its deferred restore from the
cluster_service_test setup, and rely only on the runner env injection so the
test stays isolated from shared process state.
---
Nitpick comments:
In `@tests/utils/exec/rosacli/cluster_service_test.go`:
- Around line 27-99: The fake rosa subprocess setup is duplicated across the two
cluster service tests, including temp-dir creation, script generation, PATH
overrides, and client/service construction. Factor this boilerplate into a small
helper such as newFakeRosaClient that builds the fake script, updates PATH,
creates the client via NewClient, and returns the configured NewClusterService
(or client) for both tests to reuse.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9d803977-ce21-40d2-b0f5-474f3f95a368
📒 Files selected for processing (3)
tests/utils/exec/rosacli/cluster_service.gotests/utils/exec/rosacli/cluster_service_test.gotests/utils/exec/rosacli/rosacli_suite_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/utils/exec/rosacli/rosacli_suite_test.go
- tests/utils/exec/rosacli/cluster_service.go
| It("waits for ready status using describe output", func() { | ||
| tempDir, err := os.MkdirTemp("", "fake-rosa-*") | ||
| Expect(err).ToNot(HaveOccurred()) | ||
| defer os.RemoveAll(tempDir) | ||
|
|
||
| textCountFile := filepath.Join(tempDir, "describe-count") | ||
| jsonCountFile := filepath.Join(tempDir, "json-describe-count") | ||
| scriptPath := filepath.Join(tempDir, "rosa") | ||
| script := fmt.Sprintf(`#!/usr/bin/env bash | ||
| set -euo pipefail | ||
| text_count_file=%q | ||
| json_count_file=%q | ||
|
|
||
| if [[ "$1" == "describe" && "$2" == "cluster" ]]; then | ||
| if [[ "$*" == *"--output json"* ]]; then | ||
| count=0 | ||
| if [[ -f "$json_count_file" ]]; then | ||
| count=$(<"$json_count_file") | ||
| fi | ||
| count=$((count + 1)) | ||
| printf '%%s' "$count" > "$json_count_file" | ||
| printf '{"state":"waiting"}\n' | ||
| exit 0 | ||
| fi | ||
|
|
||
| count=0 | ||
| if [[ -f "$text_count_file" ]]; then | ||
| count=$(<"$text_count_file") | ||
| fi | ||
| count=$((count + 1)) | ||
| printf '%%s' "$count" > "$text_count_file" | ||
|
|
||
| if [[ "$count" -eq 1 ]]; then | ||
| cat <<'EOF' | ||
| Name: fake | ||
| ID: 1234abcd | ||
| State: resuming | ||
| EOF | ||
| else | ||
| cat <<'EOF' | ||
| Name: fake | ||
| ID: 1234abcd | ||
| State: ready | ||
| EOF | ||
| fi | ||
| exit 0 | ||
| fi | ||
|
|
||
| echo "unexpected args: $*" >&2 | ||
| exit 1 | ||
| `, textCountFile, jsonCountFile) | ||
| err = os.WriteFile(scriptPath, []byte(script), 0755) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
|
|
||
| originalPath := os.Getenv("PATH") | ||
| err = os.Setenv("PATH", tempDir+":"+originalPath) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
| defer os.Setenv("PATH", originalPath) | ||
|
|
||
| client := NewClient() | ||
| client.Runner.envs = prependPathEnv(tempDir) | ||
| service := NewClusterService(client) | ||
|
|
||
| err = service.WaitClusterStatus(clusterID, constants.Ready, 0, 0) | ||
|
|
||
| Expect(err).ToNot(HaveOccurred()) | ||
|
|
||
| jsonCalls := "0" | ||
| if content, readErr := os.ReadFile(jsonCountFile); readErr == nil { | ||
| jsonCalls = string(content) | ||
| } | ||
| Expect(jsonCalls).To(Equal("0")) | ||
| }) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect the ready-wait defaults and confirm the fallback timing
ast-grep run --pattern 'func waitForClusterReadyStatus($$$) $$$' --lang go tests/utils/exec/rosacli/cluster_service.goRepository: openshift/rosa
Length of output: 2855
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== WaitClusterStatus and timeout defaults =="
ast-grep run --pattern 'func (client *Client) WaitClusterStatus($$$) $$$' --lang go tests/utils/exec/rosacli/cluster_service.go
echo
echo "== Surrounding helper for ready wait =="
sed -n '420,560p' tests/utils/exec/rosacli/cluster_service.go
echo
echo "== Test file around the two subprocess tests =="
sed -n '1,120p' tests/utils/exec/rosacli/cluster_service_test.go
echo
sed -n '220,290p' tests/utils/exec/rosacli/cluster_service_test.goRepository: openshift/rosa
Length of output: 197
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Find WaitClusterStatus definition =="
rg -n "WaitClusterStatus|waitForClusterReadyStatus|PollUntilContextTimeout|MkdirTemp|Setenv\(\"PATH\"" tests/utils/exec/rosacli/cluster_service.go tests/utils/exec/rosacli/cluster_service_test.go
echo
echo "== Read relevant implementation section =="
sed -n '470,555p' tests/utils/exec/rosacli/cluster_service.go
echo
echo "== Read the first subprocess test =="
sed -n '1,120p' tests/utils/exec/rosacli/cluster_service_test.go
echo
echo "== Read the later subprocess test =="
sed -n '230,285p' tests/utils/exec/rosacli/cluster_service_test.goRepository: openshift/rosa
Length of output: 8663
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== WaitClusterStatus wrapper =="
sed -n '456,490p' tests/utils/exec/rosacli/cluster_service.go
echo
echo "== Client runner env handling =="
rg -n "type Runner|envs|prependPathEnv|runnerCfg|exec.Command|os.Environ|PATH=" tests/utils/exec/rosacli -g '*.go'Repository: openshift/rosa
Length of output: 3225
Drop the process-wide PATH mutation here. prependPathEnv(tempDir) already injects the fake rosa binary into the runner env, so os.Setenv("PATH", ...) and the deferred restore are redundant and widen the test’s shared-state surface.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/utils/exec/rosacli/cluster_service_test.go` around lines 27 - 99, The
test in WaitClusterStatus is mutating the process-wide PATH unnecessarily even
though Client.Runner.envs already uses prependPathEnv(tempDir) to resolve the
fake rosa binary. Remove the os.Setenv("PATH", ...) call and its deferred
restore from the cluster_service_test setup, and rely only on the runner env
injection so the test stays isolated from shared process state.
|
@olucasfreitas: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
PR Summary
Improve the ready-state waiter used by ROSA e2e helpers so manual HCP cluster flows report actionable status instead of timing out on
rosa list clusteralone.Detailed Description of the Issue
OCP-75536/ROSAENG-61038tracks an HCP manual-mode cluster-create failure that timed out waiting forready. The shared waiter intests/utils/exec/rosacli/cluster_service.goonly polledrosa list cluster, which gave low-signal failures and let provisioning state details get lost. This change keeps the generic waiter for non-ready states, but usesdescribe clusterstate/details when callers wait forready.Related Issues and PRs
Type of Change
Previous Behavior
Waiting for
readyonly checkedrosa list cluster. Manual HCP flows could time out without surfacing provisioning errors or the last observed cluster state, and an empty cluster ID was not rejected up front.Behavior After This Change
When waiting for
ready, the helper validates the cluster ID, pollsdescribe clusterstate/details, preserves legitimate intermediate states such asresuming, resets runner JSON format correctly after describe failures, and returns timeouts/errors with the last known state and provisioning details. Focused tests cover the ready waiter path and its error handling.How to Test (Step-by-Step)
Preconditions
cursor/b2edb6ffmake install-hooksTest Steps
make fmtgo test ./tests/utils/exec/rosacligo test ./tests/e2e -run TestDoesNotExistmake lintmake rosamake testmake basic-checksExpected Results
rosa list clusteroutput.Proof of the Fix
go test ./tests/utils/exec/rosacli,go test ./tests/e2e -run TestDoesNotExist,make lint,make rosa,make test, andmake basic-checksall passed locally.tests/utils/exec/rosacli/cluster_service_test.go.Risk / Follow-up
describewith a successful textdescribefallback inside the same wait loop. The new direct format-reset coverage protects the shared runner state for that path.Breaking Changes
Breaking Change Details / Migration Plan
N/A
Developer Verification Checklist
[JIRA-TICKET] | [TYPE]: <MESSAGE>.make install-hookshas been run in this clone.make testpasses.make lintpasses.make rosapasses.Summary by CodeRabbit