test: fix flaky CI failures in workapplier suite and e2e cost property tests #1270
Open
ryanzhang-oss wants to merge 1 commit into Azure:main
Two separate CI flakiness fixes:

1. pkg/controllers/workapplier/suite_test.go: Increase GracefulShutdownTimeout from the default 30s to 2 minutes for all four controller managers in the integration test suite. With four managers running concurrently (each with multiple controllers), the default 30s grace period is insufficient to drain all runnables on a loaded CI runner, causing AfterSuite teardown to fail with 'context deadline exceeded' even though all 290 specs pass.

2. test/e2e/utils_test.go: Widen the per-CPU-core and per-GB-memory cost property tolerance from 0.002 to 0.005. The Azure Retail Prices API can return values that differ from the locally-computed expected value by exactly 0.002 (e.g. got=0.141, want=0.143), which hits the strict boundary of the original threshold and causes BeforeSuite to fail, aborting the entire custom e2e suite. A margin of 0.005 provides sufficient headroom while still catching genuine property provider bugs.
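The two mechanisms described above can be sketched in plain Go using only the standard library. This is a minimal illustration under stated assumptions, not the code changed in this PR: `withinTolerance`, `drainWithTimeout`, and `slowRunnable` are hypothetical names, the inclusive `<=` comparison is an illustrative choice, and the durations are scaled down so the example runs quickly. `drainWithTimeout` only mimics the idea of a manager waiting for its runnables within a grace period; it does not reproduce controller-runtime's internals.

```go
package main

import (
	"context"
	"fmt"
	"math"
	"sync"
	"time"
)

// withinTolerance reports whether got is within tol of want (inclusive).
// Hypothetical helper; the real check lives in test/e2e/utils_test.go.
func withinTolerance(got, want, tol float64) bool {
	return math.Abs(got-want) <= tol
}

// drainWithTimeout runs every runnable concurrently and waits for all of
// them, returning context.DeadlineExceeded if they do not finish within
// the grace period. It is a stand-in for a manager draining its runnables
// on shutdown, not controller-runtime's actual implementation.
func drainWithTimeout(runnables []func(), grace time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()

	var wg sync.WaitGroup
	for _, r := range runnables {
		wg.Add(1)
		go func(run func()) {
			defer wg.Done()
			run()
		}(r)
	}

	// Close done once every runnable has returned.
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// slowRunnable simulates a controller that needs d to stop (hypothetical).
func slowRunnable(d time.Duration) func() {
	return func() { time.Sleep(d) }
}

func main() {
	// got=0.141, want=0.143 are the example values from the description.
	fmt.Println(withinTolerance(0.141, 0.143, 0.005))

	rs := []func(){
		slowRunnable(100 * time.Millisecond),
		slowRunnable(100 * time.Millisecond),
	}
	// Grace period shorter than the runnables need: the drain times out.
	fmt.Println(drainWithTimeout(rs, 10*time.Millisecond))
	// Generous grace period: the same runnables drain cleanly.
	fmt.Println(drainWithTimeout(rs, 2*time.Second))
}
```

With a grace period shorter than the slowest runnable, the drain returns a `context deadline exceeded` error even though every runnable would eventually finish, which mirrors the teardown failure; a longer grace period lets the same runnables drain cleanly.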
Summary
Two separate CI flakiness fixes identified from recent CI runs.
Fix 1: `pkg/controllers/workapplier/suite_test.go`: workapplier AfterSuite teardown timeout

Symptom: All 290 specs pass, but the suite still reports `FAIL!` because `AfterSuite` times out with a `context deadline exceeded` error.

Root cause: The workapplier integration test suite runs 4 controller managers concurrently, each with multiple controllers. The default `GracefulShutdownTimeout` in controller-runtime is 30s. On a loaded CI runner, 30s is not enough to drain all runnables across the 4 managers, so the manager's `Start()` returns an error, which propagates to the AfterSuite `wg.Wait()`.

Fix: Set `GracefulShutdownTimeout` to 2 minutes for all 4 managers.

Fix 2: `test/e2e/utils_test.go`: cost property tolerance boundary failure

Symptom: The `e2e-tests (custom)` BeforeSuite fails on the cost property check (for example, got=0.141, want=0.143). This aborts the entire custom e2e suite (all tests skipped).

Root cause: The Azure Retail Prices API can return per-CPU-core cost values that differ from the locally-computed reference value by exactly `0.002`. The original threshold check (`diff > 0.002`) sits right at that boundary, so a diff of nominally `0.002` can cause a spurious test abort (floating-point rounding can leave the computed difference marginally above the threshold) despite being within the intended acceptable range.

Fix: Widen the tolerance from `0.002` to `0.005` for both the per-CPU-core cost and per-GB-memory cost checks. This provides headroom for minor API price fluctuations while still catching genuine property provider bugs (large divergences).

Test plan