Fix test failures and enhance retry logic for monitoring and istio#701
Fix test failures and enhance retry logic for monitoring and istio#701floatingman wants to merge 11 commits into
Conversation
The cluster-wide pods.StatusPods() check in VerifyGitRepo() and fleet test SetupSuite() was causing false failures when unrelated test suites (chart installs) left Error-state helm job pods on the cluster. Fleet's own git repo and cluster status reporting is authoritative for deployment health. Fixes 7 fleet test failures in pit.daily suite. Related: rancher/shepherd#574
- appco/istio: Create fresh session/client in TearDownSuite to avoid 'attempted to register cleanup function to closed test session' error - istio: Add preemptive cleanup of leftover istio resources in SetupSuite to prevent 409 namespace conflicts from appco istio tests - monitoring: Add retryWatchAndWait helper with 3 retries for transient 'error with watch connection' failures - longhorn: Handle 409 namespace creation gracefully, add chart uninstall to TearDownSuite - longhorn/chartinstall: Skip test on 409 namespace conflict when main longhorn suite is running concurrently
Add charts.RetryOnWatchError() to actions/charts/watch_retry.go as a reusable wrapper for WatchAndWait operations that fail with transient "error with watch connection" errors caused by the Rancher server's watcher channel lifecycle bug. - Extract retry logic from monitoring_test.go into shared helper - Apply retry wrapper to all 13 WatchAndWait call sites in istio_test.go - Apply retry wrapper to all 6 WatchAndWait call sites in monitoring_test.go - Uses shepherd's wait.WatchConnectionError constant for matching
The watch connection error also occurs inside InstallRancherMonitoringChart and InstallRancherIstioChart during the internal WatchWait on the App CR status. Wrap these calls with RetryOnWatchError, checking GetChartStatus first to avoid re-installing if a previous attempt succeeded but the watch failed. Also increase DefaultWatchRetries from 3 to 5 for more resilience. Monitoring test now passes the install phase successfully.
…hecks Both waitUnknownPrometheusTargets and checkPrometheusTargets were treating JSON parse errors (e.g. "unexpected end of JSON input") and proxy errors as hard failures that immediately stopped their retry loops. Since these are transient (Prometheus not yet ready, proxy connection flaky), return (false, nil) instead so polling continues. Also guard against nil mapResponse["data"] to prevent panics on incomplete responses. With this fix, TestMonitoringChart passes end-to-end including Prometheus target validation and alerting webhook verification.
listIstioDeployments now polls for up to 5 minutes (5s interval) for Pilot and IngressGateway deployments to appear with the operator.istio.io/version label. The istio operator takes time to reconcile after the chart CR reports 'deployed', causing the upgrade test to fail with 'expected: 2 deployments, actual: 0'.
| i.T().Log("Preemptively cleaning up istio resources") | ||
| charts.DeleteIstioResources(client, cluster.ID) | ||
| // DeleteIstioResources doesn't remove the namespace itself, so delete it via kubectl | ||
| kubectl.Command(client, nil, cluster.ID, []string{"sh", "-c", "kubectl delete namespace istio-system --ignore-not-found=true"}, "1MB") |
There was a problem hiding this comment.
the kubectl.Command return values are being discarded
There was a problem hiding this comment.
yeah the error should be logged at least even though it should be ignored
| // validate all resources on the cluster are actually in a good state, regardless of what fleet is reporting | ||
| podErrors := pods.StatusPods(client, k8sClusterID) | ||
| if len(podErrors) > 0 { | ||
| for _, err := range podErrors { | ||
| logrus.Error(err.Error()) | ||
| } | ||
| return errors.New("pods are not healthy in " + steveClusterID) | ||
| } | ||
| return err |
There was a problem hiding this comment.
Why remove this code?
There was a problem hiding this comment.
The StatusPods function fails the test even if an unrelated pod is in an error state. That's what the comment below says.
| return lastErr | ||
| } | ||
| logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr) | ||
| time.Sleep(watchRetryDelay) |
There was a problem hiding this comment.
Could we use a kwait function, instand of time.Sleep?
There was a problem hiding this comment.
I'll look into it.
There was a problem hiding this comment.
Pull request overview
This PR tries to reduce flakiness in Rancher chart tests (Monitoring, Istio) and Fleet GitRepo validation by adding a watch-connection-error retry helper, wrapping chart install/wait calls with it, softening Prometheus target error handling, proactively cleaning up Istio state before/after suites, and removing cluster-wide pod-health gating from Fleet validation. It also includes incidental gofmt/goimports formatting changes across several unrelated files.
Changes:
- New
charts.RetryOnWatchErrorhelper, applied around Monitoring/Istio install +WatchAndWait{Deployments,DaemonSets,StatefulSets}calls and Istio upgrade. - Prometheus target checks now log-and-retry on proxy/JSON errors; Istio test suites preemptively delete Istio resources/namespace and AppCo Istio uses a fresh session in
TearDownSuite. interoperability/fleet.VerifyGitRepoand Fleet test setup no longer run cluster-widepods.StatusPods; Longhorn suites tolerate pre-existinglonghorn-systemnamespace.
Reviewed changes
Copilot reviewed 14 out of 20 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| actions/charts/watch_retry.go | New RetryOnWatchError helper using time.Sleep between attempts. |
| validation/charts/monitoring.go | Soften proxy/parse errors to retry; add partial ok check on activeTargets assertion. |
| validation/charts/monitoring_test.go | Wrap install + workload waits with RetryOnWatchError in both test methods. |
| validation/charts/istio.go | Add listIstioDeployments polling wrapper around the renamed listIstioDeploymentsOnce. |
| validation/charts/istio_test.go | Preemptive Istio cleanup in SetupSuite; wrap install/upgrade/wait calls with RetryOnWatchError. |
| validation/charts/appco/istio_test.go | Add TearDownSuite that opens a fresh session/client to delete Istio resources. |
| interoperability/fleet/fleet.go | Remove cluster-wide pods.StatusPods check from VerifyGitRepo. |
| validation/fleet/public_gitrepo_test.go | Drop pods.StatusPods gate in setup; document rationale. |
| validation/fleet/snapshot/snapshot_restore_existing_gitrepo_test.go | Same pods.StatusPods removal as public_gitrepo. |
| validation/longhorn/longhorn_test.go | Uninstall Longhorn chart in TearDownSuite; tolerate existing namespace. |
| validation/longhorn/chartinstall/installation_test.go | Skip when namespace already exists to avoid 409 conflicts. |
| validation/rbac/globalroles/deprecated_restrictedadmin_role_test.go | Import order and struct field alignment cleanup. |
| validation/auth/tokens/legacy_token_manager_ext_compatibility_test.go | Whitespace cleanup in build tag. |
| validation/auth/kubeconfigs/kubeconfigs_watchlist_test.go | Struct alignment and trailing whitespace cleanup. |
| interoperability/qainfraautomation/provisioning.go | Reformat const block alignment. |
| interoperability/qainfraautomation/ansible/ansible.go | Reformat struct field alignment. |
| actions/useractivity/useractivity.go | Reorder imports and trailing newline. |
| actions/uiplugins/constants.go | Tab/alignment cleanup for const block. |
| actions/tokens/exttokens/verify.go | Whitespace and trailing-newline cleanup. |
| actions/kubeapi/nodes/nodes.go | Trailing newline added. |
| return lastErr | ||
| } | ||
| logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr) | ||
| time.Sleep(watchRetryDelay) |
|
|
||
| activeTargets := mapResponse["data"].(map[string]any)["activeTargets"].([]any) | ||
| if len(activeTargets) < 1 { | ||
| activeTargets, ok := mapResponse["data"].(map[string]any)["activeTargets"].([]any) |
| // Create a fresh session and client for cleanup since the test sessions | ||
| // were already closed by TearDownTest. Without this, kubectl.Command | ||
| // fails with "attempted to register cleanup function to closed test session". | ||
| cleanupSession := session.NewSession() | ||
| cleanupClient, err := rancher.NewClient("", cleanupSession) | ||
| if err != nil { | ||
| i.T().Logf("Failed to create cleanup client: %v", err) | ||
| } else { | ||
| _, err := charts.DeleteIstioResources(cleanupClient, i.cluster.ID) | ||
| if err != nil { | ||
| i.T().Logf("Failed to delete Istio resources: %v", err) | ||
| } | ||
| } | ||
| cleanupSession.Cleanup() | ||
| i.session.Cleanup() |
| } | ||
| require.NoError(l.T(), err) |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
hamistao
left a comment
There was a problem hiding this comment.
I really like this! It was about time we handled some of these.
I do have some suggestions and observations but overall looks really good.
| // which caused false failures when unrelated test suites (e.g., chart installs) left | ||
| // Error-state helm job pods on the cluster. Fleet's own status reporting (checked above) | ||
| // is authoritative for determining if the git repo deployment succeeded. | ||
| // See: https://github.com/rancher/shepherd/issues/574 |
There was a problem hiding this comment.
I am not sure we need comments on the history of the code
| i.T().Log("Preemptively cleaning up istio resources") | ||
| charts.DeleteIstioResources(client, cluster.ID) | ||
| // DeleteIstioResources doesn't remove the namespace itself, so delete it via kubectl | ||
| kubectl.Command(client, nil, cluster.ID, []string{"sh", "-c", "kubectl delete namespace istio-system --ignore-not-found=true"}, "1MB") |
There was a problem hiding this comment.
yeah the error should be logged at least even though it should be ignored
| func RetryOnWatchError(maxRetries int, fn func() error) error { | ||
| var lastErr error | ||
| for i := 0; i < maxRetries; i++ { | ||
| lastErr = fn() | ||
| if lastErr == nil { | ||
| return nil | ||
| } | ||
| if !strings.Contains(lastErr.Error(), wait.WatchConnectionError) { | ||
| return lastErr | ||
| } | ||
| logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr) | ||
| _ = kwait.PollUntilContextTimeout(context.TODO(), watchRetryDelay, watchRetryDelay+time.Second, false, func(context.Context) (bool, error) { | ||
| return true, nil | ||
| }) | ||
| } | ||
| return lastErr | ||
| } |
There was a problem hiding this comment.
Just a question: Have you looked into whether there is a specific condition that we should wait on instead of just retrying? I am fine with keeping the retry logic I guess but it would be nice to know in case we should deal with something similar.
| // NOTE: Skipped cluster-wide StatusPods check here. When running the full pit.daily | ||
| // suite, chart tests (monitoring, istio, longhorn) leave Error-state helm job pods | ||
| // that cause false failures. Individual test methods verify fleet deployment health | ||
| // via fleet.VerifyGitRepo() which checks fleet status directly. | ||
| // See: https://github.com/rancher/shepherd/issues/574 |
There was a problem hiding this comment.
Same comment on how we may not need to record history here.
| if err != nil { | ||
| // If namespace already exists, it's likely the main longhorn test suite is running concurrently. | ||
| // Skip rather than fail to avoid 409 conflict. | ||
| if strings.Contains(err.Error(), "409") || strings.Contains(err.Error(), "AlreadyExists") { | ||
| l.T().Skipf("Skipping: namespace %s already exists (likely from concurrent longhorn test suite): %v", charts.LonghornNamespace, err) | ||
| } else { | ||
| require.NoError(l.T(), err) | ||
| } | ||
| } |
There was a problem hiding this comment.
This may be the only change I disagree with. I think it makes more sense to just ignore the error in these cases and move on. Assuming it is a concurrent test suite is far too conjectury. It would me most often something left over from another operation (manual or automated) that already finished.
| err = charts.RetryOnWatchError(charts.DefaultWatchRetries, func() error { | ||
| return extencharts.WatchAndWaitDeployments(client, i.project.ClusterID, charts.RancherMonitoringNamespace, metav1.ListOptions{}) | ||
| }) |
There was a problem hiding this comment.
I think we should be careful with applying this retry logic on so many places. Have you actually seen the need for all these retries on waits? If so, do you think we can handle those differentely somehow?
Perhaps doing something like this could help instead:
err = charts.WatchAndWaitDeployments(client, i.project.ClusterID, charts.RancherMonitoringNamespace, metav1.ListOptions{
FieldSelector: "metadata.name=" + desiredDeploymentName
})
This pull request introduces improvements to the reliability and maintainability of chart installation and validation tests, especially for Istio and Monitoring charts. The main enhancements include adding robust retry logic for transient watch errors during chart operations, improving test environment cleanup, and refining error handling for Prometheus target checks. Additionally, it removes overly broad health checks in Fleet GitRepo validation to prevent unrelated test failures.
Test reliability and retry logic:
RetryOnWatchErrorhelper inactions/charts/watch_retry.goto automatically retry operations that fail due to transient watch connection errors, reducing test flakiness during chart installs and resource watches. This helper is now used throughout the Istio and Monitoring chart test suites. [1] [2] [3] [4] [5] [6] [7] [8]Test environment cleanup and setup:
istio-systemnamespace before and after tests, ensuring a clean test environment and preventing resource leaks between test runs. [1] [2]Prometheus targets and error handling:
validation/charts/monitoring.goto log and retry on proxy or parsing errors instead of failing immediately, making the monitoring tests more resilient to transient API issues. [1] [2]Fleet GitRepo validation:
VerifyGitRepologic to rely solely on Fleet's status reporting, removing cluster-wide pod health checks that could cause unrelated test failures. This change addresses issues with false negatives due to unrelated pods in error states.Miscellaneous improvements:
These changes collectively make the test suites more robust, reduce false failures, and improve the clarity and maintainability of the test code.