pkg/election: add lease keepalive metrics#10622
JmPotato wants to merge 9 commits into tikv:master
Conversation
Signed-off-by: JmPotato <github@ipotato.me>
Fold `keepAliveWorker` into `Lease.KeepAlive` so lease renewal runs in a single goroutine whose select multiplexes ctx cancellation, the etcd keepalive channel, and the `leaseTimeout` watchdog. The previous worker goroutine and its 1-slot output channel existed only to funnel responses into the outer loop; consuming the etcd channel directly achieves the same thing without the extra hop.

- Extract `processKeepAliveResponse`, `loadLeaseID` and `resetTimer` so each helper carries a single responsibility.
- Bake purpose and lease-id into a scoped logger (`log.L().With`) so every keepalive log line carries them without repeating fields at the call site. Existing log strings and field names are preserved for operator alerts.
- Document the `closed` field: `Close()` permanently halts the etcd lessor, so `Grant()` rebuilds `l.lease` on reuse to avoid native KeepAlive returning `ErrKeepAliveHalted` immediately.
- Rewrite tests that referenced `keepAliveWorker` to exercise `Lease.KeepAlive` directly, and add `TestLeaseKeepAliveKeepsExpireMonotonic` to pin the `maxExpire` guard.

Signed-off-by: JmPotato <github@ipotato.me>
- Use `t.Context()` for tests whose only `cancel()` was the defer cleanup, and `context.WithCancel(t.Context())` where the test still needs to cancel mid-run, so the ctx is tied to the test lifecycle.
- Replace direct `lease.expireTime.Load().(time.Time)` accesses with the existing `getExpireTime()` helper so tests go through the same nil-safe path as production callers.
- Drop the ad-hoc `Eventually` timeouts in favour of the repo-standard `testutil.Eventually` (20s waitFor / 100ms tick); the custom 1s/10ms and 2s/50ms values only narrowed the failure window and provided no extra signal.

Signed-off-by: JmPotato <github@ipotato.me>
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
📝 Walkthrough

Refactors lease keepalive from a worker/ticker model to a single streaming loop over the etcd keepalive channel.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Leader as Leader (pkg/election.Lease)
    participant Lessor as Etcd Lessor (l.lease)
    participant Stream as KeepAlive Stream
    participant Watchdog as Watchdog Timer
    participant Metrics as Prometheus Metrics
    rect rgba(200,220,255,0.5)
        Leader->>Lessor: KeepAlive(ctx, leaseID) -> stream
    end
    rect rgba(220,255,200,0.5)
        Stream->>Leader: keepalive response (TTL, leaseID)
        Leader->>Leader: processKeepAliveResponse -> compute expireTime
        Leader->>Metrics: observe interval, update localTTLRemaining
    end
    rect rgba(255,230,200,0.5)
        Watchdog->>Leader: tick -> check local estimate
        Leader-->>Leader: if expired -> step down, inc failure reason
    end
    Stream-->>Leader: close / stream error
    Leader->>Metrics: increment renewalFailureTotal (reason)
    Leader->>Lessor: Close / revoke
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🧹 Nitpick comments (3)
pkg/election/metrics.go (1)
19-72: LGTM — clean metrics declarations.

The collectors follow Prometheus conventions (namespace + subsystem + unit suffix), labels are low-cardinality (`purpose` and `reason` are bounded enums), and the help texts are descriptive — particularly the `localTTLRemaining` help text, which explicitly disclaims that it isn't the etcd-authoritative TTL. The reason constants are exported with proper GoDoc and reused at the call sites in `lease.go`.

One small ergonomic suggestion (optional): the three `prometheus.MustRegister` calls in `init()` can be collapsed into a single variadic call.

♻️ Optional: combine MustRegister calls

```diff
 func init() {
-	prometheus.MustRegister(keepAliveResponseInterval)
-	prometheus.MustRegister(renewalFailureTotal)
-	prometheus.MustRegister(localTTLRemaining)
+	prometheus.MustRegister(
+		keepAliveResponseInterval,
+		renewalFailureTotal,
+		localTTLRemaining,
+	)
 }
```

Based on learnings: "Use Prometheus-style metrics with subsystem + unit; avoid high-cardinality labels".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/metrics.go` around lines 19 - 72, The init() function currently calls prometheus.MustRegister three times; replace those three calls with a single variadic call to prometheus.MustRegister(keepAliveResponseInterval, renewalFailureTotal, localTTLRemaining) to make registration concise and idiomatic while preserving the same collectors (refer to the init function and the collector variables keepAliveResponseInterval, renewalFailureTotal, and localTTLRemaining).

pkg/election/lease_test.go (2)
361-370: Minor: thread `purpose` through the helper.

Most callers of `newTestLeaseWithKeepAliveCh` immediately overwrite `lease.Purpose = purpose` afterwards (lines 215, 241, 263, 283, 302, 316, 342). Threading the purpose into the helper would remove the seven-line repetition and keep the test setup atomic.

♻️ Suggested helper change

```diff
-func newTestLeaseWithKeepAliveCh(keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
+func newTestLeaseWithKeepAliveCh(purpose string, keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
 	lease := &Lease{
-		Purpose:      "test_lease",
+		Purpose:      purpose,
 		lease:        &fakeLease{keepAliveCh: keepAliveCh},
 		leaseTimeout: leaseTimeout,
 	}
 	lease.ID.Store(clientv3.LeaseID(1))
 	lease.expireTime.Store(time.Now().Add(-time.Second))
 	return lease
 }
```

Then at each call site:

```diff
-	lease := newTestLeaseWithKeepAliveCh(keepAliveCh, time.Hour)
-	lease.Purpose = purpose
+	lease := newTestLeaseWithKeepAliveCh(purpose, keepAliveCh, time.Hour)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 361 - 370, Change newTestLeaseWithKeepAliveCh to accept a purpose string and assign it to lease.Purpose so callers don't need to overwrite it; specifically, modify the helper signature newTestLeaseWithKeepAliveCh(keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) to include a purpose parameter and set lease.Purpose = purpose inside the function, then update all call sites that currently set lease.Purpose after creation to pass the desired purpose into the newTestLeaseWithKeepAliveCh call (e.g., replace post-construction assignments like lease.Purpose = purpose with passing purpose into the helper).
200-208: Sleep-based monotonicity probe is acceptable but worth noting.

The `time.Sleep(100 * time.Millisecond)` at line 205 is the only nondeterministic synchronization in the new tests. It's pragmatic — the test fails only if the monotonicity guard regresses, and a too-short sleep trivially passes — so a missing guard wouldn't cause spurious failures, only spurious passes. Fine to keep, but if you'd like a more deterministic approach, a follow-up could send a third response with a larger TTL and use `Eventually` to wait for it to take effect, which proves the loop has drained the smaller TTL.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 200 - 208, The test currently uses time.Sleep(100 * time.Millisecond) to wait for the keep-alive processing loop; replace this nondeterministic sleep with a deterministic probe: after sendKeepAliveResponse(t, keepAliveCh, 1) send a third keep-alive with a larger TTL (e.g., via sendKeepAliveResponse with TTL > previous), then use a retry/assertion helper (like re.Eventually or require.Eventually) to wait until lease.getExpireTime() reflects the later larger TTL, which guarantees the loop drained the smaller TTL rather than relying on an arbitrary sleep.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b252437b-213b-4a52-9752-a592ac7ce8bb
📒 Files selected for processing (3)
pkg/election/lease.go, pkg/election/lease_test.go, pkg/election/metrics.go
```go
case <-watchdog.C:
	// Timer delivery can race with a keepalive response that already
	// extended expireTime. Only step down when the local lease
	// estimate is actually exhausted.
	actualExpire := l.getExpireTime()
	remaining := time.Until(actualExpire)
	localTTLRemaining.WithLabelValues(l.Purpose).Set(remaining.Seconds())
	if remaining > 0 {
		watchdog.Reset(remaining)
		continue
	}
	logger.Info("keep alive lease too slow",
		zap.Duration("timeout-duration", l.leaseTimeout),
		zap.Time("actual-expire", actualExpire))
	renewalFailureTotal.WithLabelValues(l.Purpose, ReasonWatchdogTimeout).Inc()
	return
```
Watchdog branch can over-count ReasonWatchdogTimeout during shutdown races.
When ctx is cancelled at almost the same time as watchdog.C fires, Go's select may pick the watchdog case even though the caller has already abandoned the loop. Unlike the channel-closed branch (line 194), this branch unconditionally increments renewalFailureTotal{reason="watchdog_timeout"}, so a clean shutdown of an already-expired lease can produce a phantom failure event.
The contrast is intentional in TestNoFailureOnContextCancel, but that test only exercises the early-cancel path — once leaseTimeout has elapsed without responses, the race becomes observable. Consider mirroring the ctx.Err() == nil guard from the channel-closed branch:
🛡️ Suggested guard
```diff
 case <-watchdog.C:
 	// Timer delivery can race with a keepalive response that already
 	// extended expireTime. Only step down when the local lease
 	// estimate is actually exhausted.
 	actualExpire := l.getExpireTime()
 	remaining := time.Until(actualExpire)
 	localTTLRemaining.WithLabelValues(l.Purpose).Set(remaining.Seconds())
 	if remaining > 0 {
 		watchdog.Reset(remaining)
 		continue
 	}
+	if ctx.Err() != nil {
+		return
+	}
 	logger.Info("keep alive lease too slow",
 		zap.Duration("timeout-duration", l.leaseTimeout),
 		zap.Time("actual-expire", actualExpire))
 	renewalFailureTotal.WithLabelValues(l.Purpose, ReasonWatchdogTimeout).Inc()
 	return
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/election/lease.go` around lines 226 - 241, The watchdog case can
increment renewalFailureTotal spuriously during shutdown races; modify the
watchdog.C branch in pkg/election/lease.go so after computing remaining :=
time.Until(actualExpire) and before logging/incrementing you check the context
(ctx.Err() == nil) — mirror the guard used in the channel-closed branch — and
only call logger.Info, renewalFailureTotal.WithLabelValues(...).Inc(), and
return when ctx is still active; otherwise treat it as a clean shutdown (reset
or continue as appropriate). Ensure you reference watchdog.C, l.getExpireTime(),
localTTLRemaining, renewalFailureTotal, ReasonWatchdogTimeout and l.Purpose in
the change.
Force-pushed 02b6b75 to b0c7b7f (Compare)
Actionable comments posted: 2
🧹 Nitpick comments (2)
pkg/election/lease_test.go (1)
392-401: Histogram introspection LGTM, but worth a brief comment.

The `observer.(prometheus.Metric)` assertion works because `client_golang` histogram instances implement both `Observer` and `Metric`. This is intentional but not obvious; a one-line comment explaining the assertion would help future readers and guard against regressions if someone replaces the histogram with a wrapped observer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 392 - 401, In keepAliveResponseIntervalCount, add a one-line comment above the observer.(prometheus.Metric) type assertion explaining that the current prometheus histogram implementation also implements prometheus.Metric, so casting the Observer to prometheus.Metric is safe here; also note that this relies on the concrete histogram implementation and would break if the metric were replaced by a wrapped Observer, to warn future maintainers and prevent accidental regressions.

pkg/election/lease.go (1)
86-96: Consider initializing metrics in `NewLease`, not `Grant`.

`l.metrics` is currently initialized only at the end of `Grant()`. If a programmer ever invokes `KeepAlive()` before a successful `Grant()` (or after `Grant()` returns an error), the very first metric write — e.g. `l.metrics.streamFailed.Inc()` at line 178 — will deref a nil `prometheus.Counter` and panic.

Since `newLeaseMetrics()` only depends on `l.Purpose` (set in `NewLease`), initializing there makes the type zero-value-friendlier and matches the guideline "Keep structs zero-value friendly":

🛡️ Suggested move

```diff
 func NewLease(client *clientv3.Client, purpose string) *Lease {
-	return &Lease{
+	l := &Lease{
 		Purpose: purpose,
 		client:  client,
 		lease:   clientv3.NewLease(client),
 	}
+	l.initMetrics()
+	return l
 }
@@
 	l.expireTime.Store(start.Add(time.Duration(leaseResp.TTL) * time.Second))
-	l.initMetrics()
 	return nil
```

As per coding guidelines: "Keep structs zero-value friendly; init maps/slices before use".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease.go` around lines 86 - 96, The metrics field l.metrics should be initialized in NewLease (where l.Purpose is already set) instead of only in Grant to avoid nil derefs when methods like KeepAlive access l.metrics before a successful Grant; call newLeaseMetrics(l.Purpose) during NewLease construction to set l.metrics and remove or keep a no-op initMetrics call from Grant (or guard uses) so Grant no longer is the sole initializer; update references to l.metrics (e.g., streamFailed) to assume it is non-nil after NewLease.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b76876a9-1227-4569-9e60-6dc4d576f1b4
📒 Files selected for processing (3)
pkg/election/lease.go, pkg/election/lease_test.go, pkg/election/metrics.go
```go
func TestLeaseKeepAliveKeepsExpireMonotonic(t *testing.T) {
	re := require.New(t)
	keepAliveCh := make(chan *clientv3.LeaseKeepAliveResponse, 1)
	lease := newTestLeaseWithKeepAliveCh("test_lease", keepAliveCh, time.Hour)
	go lease.KeepAlive(t.Context())

	// The large TTL lands first and should push expireTime well into the future.
	sendKeepAliveResponse(t, keepAliveCh, 30)
	testutil.Eventually(re, func() bool {
		return time.Until(lease.getExpireTime()) > 20*time.Second
	})
	beforeSmall := lease.getExpireTime()

	// A subsequent smaller TTL must not regress expireTime, because its absolute
	// deadline is earlier than the stored maxExpire.
	sendKeepAliveResponse(t, keepAliveCh, 1)
	// Give the loop a beat to process and (incorrectly) store the smaller expire
	// if the monotonicity guard were missing.
	time.Sleep(100 * time.Millisecond)
	afterSmall := lease.getExpireTime()
	re.False(afterSmall.Before(beforeSmall))
}
```
Potentially flaky: 100 ms sleep used as a synchronization barrier.
The test relies on time.Sleep(100ms) for the loop to consume and (potentially incorrectly) apply the smaller TTL response. On a slow/loaded CI runner the loop may not have run yet, in which case the assertion would pass even if the monotonicity guard regressed — i.e., a bug in lease.go could go undetected.
Consider draining the smaller-TTL response deterministically by following it with an unbuffered "marker" send that only completes once the consumer is back at the select, or by polling a metric that increments per processed response (e.g., keepAliveResponseIntervalCount(t, purpose) >= prev+1) before checking getExpireTime().
♻️ Sketch using histogram count as the sync signal

```diff
-	sendKeepAliveResponse(t, keepAliveCh, 30)
-	testutil.Eventually(re, func() bool {
-		return time.Until(lease.getExpireTime()) > 20*time.Second
-	})
-	beforeSmall := lease.getExpireTime()
-
-	// A subsequent smaller TTL must not regress expireTime, because its absolute
-	// deadline is earlier than the stored maxExpire.
-	sendKeepAliveResponse(t, keepAliveCh, 1)
-	// Give the loop a beat to process and (incorrectly) store the smaller expire
-	// if the monotonicity guard were missing.
-	time.Sleep(100 * time.Millisecond)
-	afterSmall := lease.getExpireTime()
-	re.False(afterSmall.Before(beforeSmall))
+	purpose := "TestLeaseKeepAliveKeepsExpireMonotonic"
+	// (use newTestLeaseWithKeepAliveCh(purpose, ...) above)
+	sendKeepAliveResponse(t, keepAliveCh, 30)
+	testutil.Eventually(re, func() bool {
+		return time.Until(lease.getExpireTime()) > 20*time.Second
+	})
+	beforeSmall := lease.getExpireTime()
+	beforeCount := keepAliveResponseIntervalCount(t, purpose)
+
+	sendKeepAliveResponse(t, keepAliveCh, 1)
+	testutil.Eventually(re, func() bool {
+		return keepAliveResponseIntervalCount(t, purpose) > beforeCount
+	})
+	re.False(lease.getExpireTime().Before(beforeSmall))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/election/lease_test.go` around lines 187 - 208, The test uses
time.Sleep(100*time.Millisecond) as a race-prone sync; replace it with a
deterministic synchronization so the KeepAlive consumer has processed the
small-TTL response before asserting monotonicity: after sendKeepAliveResponse(t,
keepAliveCh, 1) either (a) send a marker over an unbuffered channel that the
KeepAlive loop reads when it finishes processing each response (add a test-only
opt to lease.KeepAlive or a hook callback) and wait for that marker, or (b) poll
with testutil.Eventually until lease.getExpireTime() stops changing or until a
processed-response counter increments (expose/read a test metric from the
KeepAlive loop) before reading afterSmall; update
TestLeaseKeepAliveKeepsExpireMonotonic to use the chosen deterministic wait
instead of time.Sleep.
```go
func TestRenewalFailureWatchdogTimeout(t *testing.T) {
	re := require.New(t)
	purpose := t.Name()
	keepAliveCh := make(chan *clientv3.LeaseKeepAliveResponse)
	lease := newTestLeaseWithKeepAliveCh(purpose, keepAliveCh, 20*time.Millisecond)
	beforeFailure := renewalFailureValue(purpose, reasonWatchdogTimeout)
	done := make(chan struct{})
	go func() {
		lease.KeepAlive(t.Context())
		close(done)
	}()

	testutil.Eventually(re, func() bool {
		return isClosed(done)
	})
	re.Equal(beforeFailure+1, renewalFailureValue(purpose, reasonWatchdogTimeout))
	re.LessOrEqual(localTTLRemainingValue(purpose), float64(0))
}
```
Tight 20 ms watchdog timeout may flake under load.
20 ms is enough for the deterministic timer firing, but on a heavily loaded CI runner the goroutine startup + scheduler latency can push the actual <-watchdog.C delivery much later, and a concurrent GC pause can extend time.Until(actualExpire) past zero (since expireTime is initialized to now-1s). Consider bumping to 100–200 ms to keep the test stable without slowing it materially.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/election/lease_test.go` around lines 275 - 292, Test uses a brittle 20ms
watchdog timeout in TestRenewalFailureWatchdogTimeout which can flake under CI;
update the watchdog duration passed to newTestLeaseWithKeepAliveCh from
20*time.Millisecond to a more robust value (e.g., 100–200*time.Millisecond) so
the test has enough time for goroutine startup/scheduling without materially
slowing it, leaving the rest of the test (done channel, KeepAlive call, and
assertions on renewalFailureValue and localTTLRemainingValue) unchanged.
@JmPotato: The following tests failed.
What problem does this PR solve?
Issue Number: ref #9389
Depends on #10618. This PR is based on #10618 head `f5109eb9cbb4633ce134b127d972dcd55dcbe27e` and should be reviewed/merged after the streaming keepalive PR.

What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
Verification
```shell
make gotest GOTEST_ARGS='./pkg/election -run TestKeepAliveResponseIntervalRecorded -count=1'
make gotest GOTEST_ARGS='./pkg/election -count=1'
make gotest GOTEST_ARGS="./pkg/election -run 'Test(KeepAliveResponseIntervalRecorded|RenewalFailure|NoFailureOnContextCancel|LocalTTLGaugeUpdates)' -count=2"
```

Release note
Summary by CodeRabbit
Tests
Chores