test: fix flaky CI failures in workapplier suite and e2e cost property tests #1270

Open
ryanzhang-oss wants to merge 1 commit into Azure:main from ryanzhang-oss:fix/ci-flaky-tests

Conversation

@ryanzhang-oss
Contributor

Summary

Two separate CI flakiness fixes identified from recent CI runs.

Fix 1: pkg/controllers/workapplier/suite_test.go — workapplier AfterSuite teardown timeout

Symptom: All 290 specs pass, but the suite still reports FAIL! because AfterSuite times out with:

failed waiting for all runnables to end within grace period of 30s: context deadline exceeded

Root cause: The workapplier integration test suite runs 4 controller managers concurrently, each with multiple controllers. The default GracefulShutdownTimeout in controller-runtime is 30s. On a loaded CI runner, 30s is not always enough to drain all runnables across the 4 managers. When the grace period expires, the manager's Start() returns an error, which surfaces in AfterSuite once wg.Wait() returns.

Fix: Set GracefulShutdownTimeout to 2 minutes for all 4 managers.


Fix 2: test/e2e/utils_test.go — cost property tolerance boundary failure

Symptom: The e2e-tests (custom) BeforeSuite fails with:

member cluster per CPU core cost property diff: got=0.141000, want=0.143000, diff=0.002000

This aborts the entire custom e2e suite (all tests skipped).

Root cause: The Azure Retail Prices API can return per-CPU-core cost values that differ from the locally computed reference value by 0.002 (here got=0.141, want=0.143). The original threshold check (diff > 0.002) leaves no margin at that boundary. The costs are floating-point values, so a diff that prints as 0.002000 can still compare as greater than 0.002, which causes a spurious test abort even though the diff is within the intended acceptable range.

Fix: Widen the tolerance from 0.002 to 0.005 for both per-CPU-core cost and per-GB-memory cost checks. This provides headroom for minor API price fluctuations while still catching genuine property provider bugs (large divergences).

Test plan

  • Existing tests cover both code paths; no new test cases needed
  • CI should be re-run to validate the fixes

Two separate CI flakiness fixes:

1. pkg/controllers/workapplier/suite_test.go:
   Increase GracefulShutdownTimeout from the default 30s to 2 minutes
   for all four controller managers in the integration test suite. With
   four managers running concurrently (each with multiple controllers),
   the default 30s grace period is insufficient to drain all runnables
   on a loaded CI runner, causing AfterSuite teardown to fail with
   'context deadline exceeded' even though all 290 specs pass.

2. test/e2e/utils_test.go:
   Widen the per-CPU-core and per-GB-memory cost property tolerance
   from 0.002 to 0.005. The Azure Retail Prices API can return values
   that differ from the locally-computed expected value by exactly
   0.002 (e.g. got=0.141, want=0.143), which hits the strict boundary
   of the original threshold and causes BeforeSuite to fail, aborting
   the entire custom e2e suite. A margin of 0.005 provides sufficient
   headroom while still catching genuine property provider bugs.
