Skip to content

Fix test failures and enhance retry logic for monitoring and istio#701

Open
floatingman wants to merge 11 commits into
rancher:mainfrom
floatingman:fix/fix-pit-daily-test
Open

Fix test failures and enhance retry logic for monitoring and istio#701
floatingman wants to merge 11 commits into
rancher:mainfrom
floatingman:fix/fix-pit-daily-test

Conversation

@floatingman
Copy link
Copy Markdown
Contributor

This pull request introduces improvements to the reliability and maintainability of chart installation and validation tests, especially for Istio and Monitoring charts. The main enhancements include adding robust retry logic for transient watch errors during chart operations, improving test environment cleanup, and refining error handling for Prometheus target checks. Additionally, it removes overly broad health checks in Fleet GitRepo validation to prevent unrelated test failures.

Test reliability and retry logic:

  • Added a new RetryOnWatchError helper in actions/charts/watch_retry.go to automatically retry operations that fail due to transient watch connection errors, reducing test flakiness during chart installs and resource watches. This helper is now used throughout the Istio and Monitoring chart test suites. [1] [2] [3] [4] [5] [6] [7] [8]

Test environment cleanup and setup:

  • Improved Istio test setup and teardown to proactively clean up Istio resources and the istio-system namespace before and after tests, ensuring a clean test environment and preventing resource leaks between test runs. [1] [2]

Prometheus targets and error handling:

  • Enhanced error handling for Prometheus target checks in validation/charts/monitoring.go to log and retry on proxy or parsing errors instead of failing immediately, making the monitoring tests more resilient to transient API issues. [1] [2]

Fleet GitRepo validation:

  • Simplified the VerifyGitRepo logic to rely solely on Fleet's status reporting, removing cluster-wide pod health checks that could cause unrelated test failures. This change addresses issues with false negatives due to unrelated pods in error states.

Miscellaneous improvements:

  • Refactored deployment listing for Istio to add polling with a timeout, ensuring deployments are reliably detected before proceeding with tests.

These changes collectively make the test suites more robust, reduce false failures, and improve the clarity and maintainability of the test code.

The cluster-wide pods.StatusPods() check in VerifyGitRepo() and fleet
test SetupSuite() was causing false failures when unrelated test suites
(chart installs) left Error-state helm job pods on the cluster. Fleet's
own git repo and cluster status reporting is authoritative for deployment
health.

Fixes 7 fleet test failures in pit.daily suite.
Related: rancher/shepherd#574
- appco/istio: Create fresh session/client in TearDownSuite to avoid
  'attempted to register cleanup function to closed test session' error
- istio: Add preemptive cleanup of leftover istio resources in SetupSuite
  to prevent 409 namespace conflicts from appco istio tests
- monitoring: Add retryWatchAndWait helper with 3 retries for transient
  'error with watch connection' failures
- longhorn: Handle 409 namespace creation gracefully, add chart uninstall
  to TearDownSuite
- longhorn/chartinstall: Skip test on 409 namespace conflict when main
  longhorn suite is running concurrently
Add charts.RetryOnWatchError() to actions/charts/watch_retry.go as a
reusable wrapper for WatchAndWait operations that fail with transient
"error with watch connection" errors caused by the Rancher server's
watcher channel lifecycle bug.

- Extract retry logic from monitoring_test.go into shared helper
- Apply retry wrapper to all 13 WatchAndWait call sites in istio_test.go
- Apply retry wrapper to all 6 WatchAndWait call sites in monitoring_test.go
- Uses shepherd's wait.WatchConnectionError constant for matching
The watch connection error also occurs inside InstallRancherMonitoringChart
and InstallRancherIstioChart during the internal WatchWait on the App CR
status. Wrap these calls with RetryOnWatchError, checking GetChartStatus
first to avoid re-installing if a previous attempt succeeded but the watch
failed.

Also increase DefaultWatchRetries from 3 to 5 for more resilience.

Monitoring test now passes the install phase successfully.
…hecks

Both waitUnknownPrometheusTargets and checkPrometheusTargets were
treating JSON parse errors (e.g. "unexpected end of JSON input") and
proxy errors as hard failures that immediately stopped their retry
loops. Since these are transient (Prometheus not yet ready, proxy
connection flaky), return (false, nil) instead so polling continues.

Also guard against nil mapResponse["data"] to prevent panics on
incomplete responses.

With this fix, TestMonitoringChart passes end-to-end including
Prometheus target validation and alerting webhook verification.
listIstioDeployments now polls for up to 5 minutes (5s interval) for
Pilot and IngressGateway deployments to appear with the
operator.istio.io/version label. The istio operator takes time to
reconcile after the chart CR reports 'deployed', causing the upgrade
test to fail with 'expected: 2 deployments, actual: 0'.
@floatingman floatingman self-assigned this May 27, 2026
@floatingman floatingman requested a review from a team as a code owner May 27, 2026 15:09
@floatingman floatingman added the team/pit-crew slack notifier for pit crew label May 27, 2026
i.T().Log("Preemptively cleaning up istio resources")
charts.DeleteIstioResources(client, cluster.ID)
// DeleteIstioResources doesn't remove the namespace itself, so delete it via kubectl
kubectl.Command(client, nil, cluster.ID, []string{"sh", "-c", "kubectl delete namespace istio-system --ignore-not-found=true"}, "1MB")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the kubectl.Command return values are being discarded

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the error should be logged at least even though it should be ignored

@floatingman

Comment on lines -150 to -158
// validate all resources on the cluster are actually in a good state, regardless of what fleet is reporting
podErrors := pods.StatusPods(client, k8sClusterID)
if len(podErrors) > 0 {
for _, err := range podErrors {
logrus.Error(err.Error())
}
return errors.New("pods are not healthy in " + steveClusterID)
}
return err
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this code?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StatusPods function fails the test even if an unrelated pod is in an error state. That's what the comment below says.

Comment thread actions/charts/watch_retry.go Outdated
return lastErr
}
logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr)
time.Sleep(watchRetryDelay)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a kwait function, instand of time.Sleep?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into it.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tries to reduce flakiness in Rancher chart tests (Monitoring, Istio) and Fleet GitRepo validation by adding a watch-connection-error retry helper, wrapping chart install/wait calls with it, softening Prometheus target error handling, proactively cleaning up Istio state before/after suites, and removing cluster-wide pod-health gating from Fleet validation. It also includes incidental gofmt/goimports formatting changes across several unrelated files.

Changes:

  • New charts.RetryOnWatchError helper, applied around Monitoring/Istio install + WatchAndWait{Deployments,DaemonSets,StatefulSets} calls and Istio upgrade.
  • Prometheus target checks now log-and-retry on proxy/JSON errors; Istio test suites preemptively delete Istio resources/namespace and AppCo Istio uses a fresh session in TearDownSuite.
  • interoperability/fleet.VerifyGitRepo and Fleet test setup no longer run cluster-wide pods.StatusPods; Longhorn suites tolerate pre-existing longhorn-system namespace.

Reviewed changes

Copilot reviewed 14 out of 20 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
actions/charts/watch_retry.go New RetryOnWatchError helper using time.Sleep between attempts.
validation/charts/monitoring.go Soften proxy/parse errors to retry; add partial ok check on activeTargets assertion.
validation/charts/monitoring_test.go Wrap install + workload waits with RetryOnWatchError in both test methods.
validation/charts/istio.go Add listIstioDeployments polling wrapper around the renamed listIstioDeploymentsOnce.
validation/charts/istio_test.go Preemptive Istio cleanup in SetupSuite; wrap install/upgrade/wait calls with RetryOnWatchError.
validation/charts/appco/istio_test.go Add TearDownSuite that opens a fresh session/client to delete Istio resources.
interoperability/fleet/fleet.go Remove cluster-wide pods.StatusPods check from VerifyGitRepo.
validation/fleet/public_gitrepo_test.go Drop pods.StatusPods gate in setup; document rationale.
validation/fleet/snapshot/snapshot_restore_existing_gitrepo_test.go Same pods.StatusPods removal as public_gitrepo.
validation/longhorn/longhorn_test.go Uninstall Longhorn chart in TearDownSuite; tolerate existing namespace.
validation/longhorn/chartinstall/installation_test.go Skip when namespace already exists to avoid 409 conflicts.
validation/rbac/globalroles/deprecated_restrictedadmin_role_test.go Import order and struct field alignment cleanup.
validation/auth/tokens/legacy_token_manager_ext_compatibility_test.go Whitespace cleanup in build tag.
validation/auth/kubeconfigs/kubeconfigs_watchlist_test.go Struct alignment and trailing whitespace cleanup.
interoperability/qainfraautomation/provisioning.go Reformat const block alignment.
interoperability/qainfraautomation/ansible/ansible.go Reformat struct field alignment.
actions/useractivity/useractivity.go Reorder imports and trailing newline.
actions/uiplugins/constants.go Tab/alignment cleanup for const block.
actions/tokens/exttokens/verify.go Whitespace and trailing-newline cleanup.
actions/kubeapi/nodes/nodes.go Trailing newline added.

Comment thread actions/charts/watch_retry.go Outdated
return lastErr
}
logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr)
time.Sleep(watchRetryDelay)
Comment thread validation/charts/monitoring.go Outdated

activeTargets := mapResponse["data"].(map[string]any)["activeTargets"].([]any)
if len(activeTargets) < 1 {
activeTargets, ok := mapResponse["data"].(map[string]any)["activeTargets"].([]any)
Comment thread validation/charts/istio_test.go Outdated
Comment on lines +37 to 51
// Create a fresh session and client for cleanup since the test sessions
// were already closed by TearDownTest. Without this, kubectl.Command
// fails with "attempted to register cleanup function to closed test session".
cleanupSession := session.NewSession()
cleanupClient, err := rancher.NewClient("", cleanupSession)
if err != nil {
i.T().Logf("Failed to create cleanup client: %v", err)
} else {
_, err := charts.DeleteIstioResources(cleanupClient, i.cluster.ID)
if err != nil {
i.T().Logf("Failed to delete Istio resources: %v", err)
}
}
cleanupSession.Cleanup()
i.session.Cleanup()
Comment on lines +89 to +90
}
require.NoError(l.T(), err)
floatingman and others added 2 commits May 28, 2026 10:23
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@hamistao hamistao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this! It was about time we handled some of these.

I do have some suggestions and observations but overall looks really good.

// which caused false failures when unrelated test suites (e.g., chart installs) left
// Error-state helm job pods on the cluster. Fleet's own status reporting (checked above)
// is authoritative for determining if the git repo deployment succeeded.
// See: https://github.com/rancher/shepherd/issues/574
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we need comments on the history of the code

i.T().Log("Preemptively cleaning up istio resources")
charts.DeleteIstioResources(client, cluster.ID)
// DeleteIstioResources doesn't remove the namespace itself, so delete it via kubectl
kubectl.Command(client, nil, cluster.ID, []string{"sh", "-c", "kubectl delete namespace istio-system --ignore-not-found=true"}, "1MB")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the error should be logged at least even though it should be ignored

@floatingman

Comment on lines +24 to +40
func RetryOnWatchError(maxRetries int, fn func() error) error {
var lastErr error
for i := 0; i < maxRetries; i++ {
lastErr = fn()
if lastErr == nil {
return nil
}
if !strings.Contains(lastErr.Error(), wait.WatchConnectionError) {
return lastErr
}
logrus.Warnf("Watch connection error (attempt %d/%d): %v", i+1, maxRetries, lastErr)
_ = kwait.PollUntilContextTimeout(context.TODO(), watchRetryDelay, watchRetryDelay+time.Second, false, func(context.Context) (bool, error) {
return true, nil
})
}
return lastErr
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question: Have you looked into whether there is a specific condition that we should wait on instead of just retrying? I am fine with keeping the retry logic I guess but it would be nice to know in case we should deal with something similar.

Comment on lines +59 to +63
// NOTE: Skipped cluster-wide StatusPods check here. When running the full pit.daily
// suite, chart tests (monitoring, istio, longhorn) leave Error-state helm job pods
// that cause false failures. Individual test methods verify fleet deployment health
// via fleet.VerifyGitRepo() which checks fleet status directly.
// See: https://github.com/rancher/shepherd/issues/574
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment on how we may not need to record history here.

Comment on lines +84 to +92
if err != nil {
// If namespace already exists, it's likely the main longhorn test suite is running concurrently.
// Skip rather than fail to avoid 409 conflict.
if strings.Contains(err.Error(), "409") || strings.Contains(err.Error(), "AlreadyExists") {
l.T().Skipf("Skipping: namespace %s already exists (likely from concurrent longhorn test suite): %v", charts.LonghornNamespace, err)
} else {
require.NoError(l.T(), err)
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be the only change I disagree with. I think it makes more sense to just ignore the error in these cases and move on. Assuming it is a concurrent test suite is far too conjectury. It would me most often something left over from another operation (manual or automated) that already finished.

Comment on lines +169 to +171
err = charts.RetryOnWatchError(charts.DefaultWatchRetries, func() error {
return extencharts.WatchAndWaitDeployments(client, i.project.ClusterID, charts.RancherMonitoringNamespace, metav1.ListOptions{})
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be careful with applying this retry logic on so many places. Have you actually seen the need for all these retries on waits? If so, do you think we can handle those differentely somehow?

Perhaps doing something like this could help instead:

	err = charts.WatchAndWaitDeployments(client, i.project.ClusterID, charts.RancherMonitoringNamespace, metav1.ListOptions{
		FieldSelector: "metadata.name=" + desiredDeploymentName
	})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team/pit-crew slack notifier for pit crew

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants