Fix test failures and enhance retry logic for monitoring and istio by floatingman · Pull Request #701 · rancher/tests

floatingman · 2026-05-27T15:09:26Z

This pull request introduces improvements to the reliability and maintainability of chart installation and validation tests, especially for Istio and Monitoring charts. The main enhancements include adding robust retry logic for transient watch errors during chart operations, improving test environment cleanup, and refining error handling for Prometheus target checks. Additionally, it removes overly broad health checks in Fleet GitRepo validation to prevent unrelated test failures.

Test reliability and retry logic:

Added a new RetryOnWatchError helper in actions/charts/watch_retry.go to automatically retry operations that fail due to transient watch connection errors, reducing test flakiness during chart installs and resource watches. This helper is now used throughout the Istio and Monitoring chart test suites. [1] [2] [3] [4] [5] [6] [7] [8]

Test environment cleanup and setup:

Improved Istio test setup and teardown to proactively clean up Istio resources and the istio-system namespace before and after tests, ensuring a clean test environment and preventing resource leaks between test runs. [1] [2]

Prometheus targets and error handling:

Enhanced error handling for Prometheus target checks in validation/charts/monitoring.go to log and retry on proxy or parsing errors instead of failing immediately, making the monitoring tests more resilient to transient API issues. [1] [2]

Fleet GitRepo validation:

Simplified the VerifyGitRepo logic to rely solely on Fleet's status reporting, removing cluster-wide pod health checks that could cause unrelated test failures. This change addresses issues with false negatives due to unrelated pods in error states.

Miscellaneous improvements:

Refactored deployment listing for Istio to add polling with a timeout, ensuring deployments are reliably detected before proceeding with tests.

These changes collectively make the test suites more robust, reduce false failures, and improve the clarity and maintainability of the test code.

The cluster-wide pods.StatusPods() check in VerifyGitRepo() and fleet test SetupSuite() was causing false failures when unrelated test suites (chart installs) left Error-state helm job pods on the cluster. Fleet's own git repo and cluster status reporting is authoritative for deployment health. Fixes 7 fleet test failures in pit.daily suite. Related: rancher/shepherd#574

- appco/istio: Create fresh session/client in TearDownSuite to avoid 'attempted to register cleanup function to closed test session' error - istio: Add preemptive cleanup of leftover istio resources in SetupSuite to prevent 409 namespace conflicts from appco istio tests - monitoring: Add retryWatchAndWait helper with 3 retries for transient 'error with watch connection' failures - longhorn: Handle 409 namespace creation gracefully, add chart uninstall to TearDownSuite - longhorn/chartinstall: Skip test on 409 namespace conflict when main longhorn suite is running concurrently

Add charts.RetryOnWatchError() to actions/charts/watch_retry.go as a reusable wrapper for WatchAndWait operations that fail with transient "error with watch connection" errors caused by the Rancher server's watcher channel lifecycle bug. - Extract retry logic from monitoring_test.go into shared helper - Apply retry wrapper to all 13 WatchAndWait call sites in istio_test.go - Apply retry wrapper to all 6 WatchAndWait call sites in monitoring_test.go - Uses shepherd's wait.WatchConnectionError constant for matching

The watch connection error also occurs inside InstallRancherMonitoringChart and InstallRancherIstioChart during the internal WatchWait on the App CR status. Wrap these calls with RetryOnWatchError, checking GetChartStatus first to avoid re-installing if a previous attempt succeeded but the watch failed. Also increase DefaultWatchRetries from 3 to 5 for more resilience. Monitoring test now passes the install phase successfully.

…hecks Both waitUnknownPrometheusTargets and checkPrometheusTargets were treating JSON parse errors (e.g. "unexpected end of JSON input") and proxy errors as hard failures that immediately stopped their retry loops. Since these are transient (Prometheus not yet ready, proxy connection flaky), return (false, nil) instead so polling continues. Also guard against nil mapResponse["data"] to prevent panics on incomplete responses. With this fix, TestMonitoringChart passes end-to-end including Prometheus target validation and alerting webhook verification.

listIstioDeployments now polls for up to 5 minutes (5s interval) for Pilot and IngressGateway deployments to appear with the operator.istio.io/version label. The istio operator takes time to reconcile after the chart CR reports 'deployed', causing the upgrade test to fail with 'expected: 2 deployments, actual: 0'.

lscalabrini01 · 2026-05-28T15:01:52Z

+	i.T().Log("Preemptively cleaning up istio resources")
+	charts.DeleteIstioResources(client, cluster.ID)
+	// DeleteIstioResources doesn't remove the namespace itself, so delete it via kubectl
+	kubectl.Command(client, nil, cluster.ID, []string{"sh", "-c", "kubectl delete namespace istio-system --ignore-not-found=true"}, "1MB")


the kubectl.Command return values are being discarded

yeah the error should be logged at least even though it should be ignored

@floatingman

lscalabrini01 · 2026-05-28T15:03:41Z

-	// validate all resources on the cluster are actually in a good state, regardless of what fleet is reporting
-	podErrors := pods.StatusPods(client, k8sClusterID)
-	if len(podErrors) > 0 {
-		for _, err := range podErrors {
-			logrus.Error(err.Error())
-		}
-		return errors.New("pods are not healthy in " + steveClusterID)
-	}
-	return err


Why remove this code?

The StatusPods function fails the test even if an unrelated pod is in an error state. That's what the comment below says.

lscalabrini01 · 2026-05-28T15:10:26Z

+			return lastErr
+		}
+		logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr)
+		time.Sleep(watchRetryDelay)


Could we use a kwait function, instand of time.Sleep?

I'll look into it.

Copilot

Pull request overview

This PR tries to reduce flakiness in Rancher chart tests (Monitoring, Istio) and Fleet GitRepo validation by adding a watch-connection-error retry helper, wrapping chart install/wait calls with it, softening Prometheus target error handling, proactively cleaning up Istio state before/after suites, and removing cluster-wide pod-health gating from Fleet validation. It also includes incidental gofmt/goimports formatting changes across several unrelated files.

Changes:

New charts.RetryOnWatchError helper, applied around Monitoring/Istio install + WatchAndWait{Deployments,DaemonSets,StatefulSets} calls and Istio upgrade.
Prometheus target checks now log-and-retry on proxy/JSON errors; Istio test suites preemptively delete Istio resources/namespace and AppCo Istio uses a fresh session in TearDownSuite.
interoperability/fleet.VerifyGitRepo and Fleet test setup no longer run cluster-wide pods.StatusPods; Longhorn suites tolerate pre-existing longhorn-system namespace.

Reviewed changes

Copilot reviewed 14 out of 20 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
actions/charts/watch_retry.go	New `RetryOnWatchError` helper using `time.Sleep` between attempts.
validation/charts/monitoring.go	Soften proxy/parse errors to retry; add partial `ok` check on `activeTargets` assertion.
validation/charts/monitoring_test.go	Wrap install + workload waits with `RetryOnWatchError` in both test methods.
validation/charts/istio.go	Add `listIstioDeployments` polling wrapper around the renamed `listIstioDeploymentsOnce`.
validation/charts/istio_test.go	Preemptive Istio cleanup in `SetupSuite`; wrap install/upgrade/wait calls with `RetryOnWatchError`.
validation/charts/appco/istio_test.go	Add `TearDownSuite` that opens a fresh session/client to delete Istio resources.
interoperability/fleet/fleet.go	Remove cluster-wide `pods.StatusPods` check from `VerifyGitRepo`.
validation/fleet/public_gitrepo_test.go	Drop `pods.StatusPods` gate in setup; document rationale.
validation/fleet/snapshot/snapshot_restore_existing_gitrepo_test.go	Same `pods.StatusPods` removal as public_gitrepo.
validation/longhorn/longhorn_test.go	Uninstall Longhorn chart in `TearDownSuite`; tolerate existing namespace.
validation/longhorn/chartinstall/installation_test.go	Skip when namespace already exists to avoid 409 conflicts.
validation/rbac/globalroles/deprecated_restrictedadmin_role_test.go	Import order and struct field alignment cleanup.
validation/auth/tokens/legacy_token_manager_ext_compatibility_test.go	Whitespace cleanup in build tag.
validation/auth/kubeconfigs/kubeconfigs_watchlist_test.go	Struct alignment and trailing whitespace cleanup.
interoperability/qainfraautomation/provisioning.go	Reformat const block alignment.
interoperability/qainfraautomation/ansible/ansible.go	Reformat struct field alignment.
actions/useractivity/useractivity.go	Reorder imports and trailing newline.
actions/uiplugins/constants.go	Tab/alignment cleanup for const block.
actions/tokens/exttokens/verify.go	Whitespace and trailing-newline cleanup.
actions/kubeapi/nodes/nodes.go	Trailing newline added.

+			return lastErr
+		}
+		logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr)
+		time.Sleep(watchRetryDelay)



-		activeTargets := mapResponse["data"].(map[string]any)["activeTargets"].([]any)
-		if len(activeTargets) < 1 {
+		activeTargets, ok := mapResponse["data"].(map[string]any)["activeTargets"].([]any)


+	// Create a fresh session and client for cleanup since the test sessions
+	// were already closed by TearDownTest. Without this, kubectl.Command
+	// fails with "attempted to register cleanup function to closed test session".
+	cleanupSession := session.NewSession()
+	cleanupClient, err := rancher.NewClient("", cleanupSession)
+	if err != nil {
+		i.T().Logf("Failed to create cleanup client: %v", err)
+	} else {
+		_, err := charts.DeleteIstioResources(cleanupClient, i.cluster.ID)
+		if err != nil {
+			i.T().Logf("Failed to delete Istio resources: %v", err)
+		}
+	}
+	cleanupSession.Cleanup()
 	i.session.Cleanup()


+		}
+		require.NoError(l.T(), err)


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

hamistao

I really like this! It was about time we handled some of these.

I do have some suggestions and observations but overall looks really good.

hamistao · 2026-05-29T14:22:04Z

+	// which caused false failures when unrelated test suites (e.g., chart installs) left
+	// Error-state helm job pods on the cluster. Fleet's own status reporting (checked above)
+	// is authoritative for determining if the git repo deployment succeeded.
+	// See: https://github.com/rancher/shepherd/issues/574


I am not sure we need comments on the history of the code

hamistao · 2026-05-29T18:47:47Z

+	i.T().Log("Preemptively cleaning up istio resources")
+	charts.DeleteIstioResources(client, cluster.ID)
+	// DeleteIstioResources doesn't remove the namespace itself, so delete it via kubectl
+	kubectl.Command(client, nil, cluster.ID, []string{"sh", "-c", "kubectl delete namespace istio-system --ignore-not-found=true"}, "1MB")


yeah the error should be logged at least even though it should be ignored

@floatingman

hamistao · 2026-05-29T18:50:29Z

+func RetryOnWatchError(maxRetries int, fn func() error) error {
+	var lastErr error
+	for i := 0; i < maxRetries; i++ {
+		lastErr = fn()
+		if lastErr == nil {
+			return nil
+		}
+		if !strings.Contains(lastErr.Error(), wait.WatchConnectionError) {
+			return lastErr
+		}
+		logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr)
+		_ = kwait.PollUntilContextTimeout(context.TODO(), watchRetryDelay, watchRetryDelay+time.Second, false, func(context.Context) (bool, error) {
+			return true, nil
+		})
+	}
+	return lastErr
+}


Just a question: Have you looked into whether there is a specific condition that we should wait on instead of just retrying? I am fine with keeping the retry logic I guess but it would be nice to know in case we should deal with something similar.

hamistao · 2026-05-29T18:51:53Z

+	// NOTE: Skipped cluster-wide StatusPods check here. When running the full pit.daily
+	// suite, chart tests (monitoring, istio, longhorn) leave Error-state helm job pods
+	// that cause false failures. Individual test methods verify fleet deployment health
+	// via fleet.VerifyGitRepo() which checks fleet status directly.
+	// See: https://github.com/rancher/shepherd/issues/574


Same comment on how we may not need to record history here.

hamistao · 2026-05-29T18:53:46Z

+	if err != nil {
+		// If namespace already exists, it's likely the main longhorn test suite is running concurrently.
+		// Skip rather than fail to avoid 409 conflict.
+		if strings.Contains(err.Error(), "409") || strings.Contains(err.Error(), "AlreadyExists") {
+			l.T().Skipf("Skipping: namespace %s already exists (likely from concurrent longhorn test suite): %v", charts.LonghornNamespace, err)
+		} else {
+			require.NoError(l.T(), err)
+		}
+	}


This may be the only change I disagree with. I think it makes more sense to just ignore the error in these cases and move on. Assuming it is a concurrent test suite is far too conjectury. It would me most often something left over from another operation (manual or automated) that already finished.

hamistao · 2026-05-29T18:59:02Z

+		err = charts.RetryOnWatchError(charts.DefaultWatchRetries, func() error {
+			return extencharts.WatchAndWaitDeployments(client, i.project.ClusterID, charts.RancherMonitoringNamespace, metav1.ListOptions{})
+		})


I think we should be careful with applying this retry logic on so many places. Have you actually seen the need for all these retries on waits? If so, do you think we can handle those differentely somehow?

Perhaps doing something like this could help instead:

err = charts.WatchAndWaitDeployments(client, i.project.ClusterID, charts.RancherMonitoringNamespace, metav1.ListOptions{ FieldSelector: "metadata.name=" + desiredDeploymentName })

floatingman added 6 commits May 26, 2026 13:24

floatingman requested review from lscalabrini01 and slickwarren May 27, 2026 15:09

floatingman self-assigned this May 27, 2026

floatingman requested a review from a team as a code owner May 27, 2026 15:09

floatingman added the team/pit-crew slack notifier for pit crew label May 27, 2026

Fix formatting

d04c307

lscalabrini01 reviewed May 28, 2026

View reviewed changes

floatingman requested review from Copilot and hamistao May 28, 2026 15:15

Copilot started reviewing on behalf of floatingman May 28, 2026 15:15 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

floatingman and others added 2 commits May 28, 2026 10:23

Add comment about discarding error.

e98f362

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Address PR comments

3c9729c

hamistao reviewed May 29, 2026

View reviewed changes

floatingman added 2 commits May 29, 2026 15:46

Fix istio tests

64a8ae0

Fix formatting

6644281

Conversation

floatingman commented May 27, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

hamistao left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants