
pkg/election: add lease keepalive metrics #10622

Open

JmPotato wants to merge 9 commits into tikv:master from JmPotato:bob/lease-keepalive-metrics

Conversation

@JmPotato
Member

@JmPotato JmPotato commented Apr 25, 2026

What problem does this PR solve?

Issue Number: ref #9389

Depends on #10618. This PR is based on #10618 head f5109eb9cbb4633ce134b127d972dcd55dcbe27e and should be reviewed/merged after the streaming keepalive PR.

What is changed and how does it work?

pkg/election: add lease keepalive metrics

Add Prometheus metrics for the streaming lease keepalive loop:
- `pd_lease_keepalive_response_interval_seconds{purpose}` for valid response intervals.
- `pd_lease_renewal_failure_total{purpose,reason}` for stream start failures, unexpected channel closes, invalid TTL responses, and watchdog timeouts.
- `pd_lease_local_ttl_remaining_seconds{purpose}` as an event-sampled local TTL estimate, not an etcd authoritative TTL.

The instrumentation is added only to existing `Lease.KeepAlive` branches and does not change the keepalive control flow or goroutine topology. Metric children are cached on `Lease` after a successful `Grant()`, so the keepalive hot path does not call `.WithLabelValues`.

Check List

Tests

  • Unit test

Code changes

  • None

Side effects

  • None

Related changes

Verification

  • make gotest GOTEST_ARGS='./pkg/election -run TestKeepAliveResponseIntervalRecorded -count=1'
  • make gotest GOTEST_ARGS='./pkg/election -count=1'
  • make gotest GOTEST_ARGS="./pkg/election -run 'Test(KeepAliveResponseIntervalRecorded|RenewalFailure|NoFailureOnContextCancel|LocalTTLGaugeUpdates)' -count=2"

Release note

None.

Summary by CodeRabbit

  • Tests

    • Expanded coverage for lease keepalive behavior, validating renewal, failure modes, watchdog timeouts, TTL handling, and metric side effects.
  • Chores

    • Redesigned lease renewal for more robust and deterministic keepalive handling and expiry tracking.
    • Added Prometheus metrics to observe keepalive intervals, renewal failure reasons, and local TTL estimates for improved troubleshooting.

Signed-off-by: JmPotato <github@ipotato.me>
Fold keepAliveWorker into Lease.KeepAlive so lease renewal runs in a
single goroutine whose select multiplexes ctx cancellation, the etcd
keepalive channel, and the leaseTimeout watchdog. The previous worker
goroutine and its 1-slot output channel existed only to funnel responses
into the outer loop; consuming the etcd channel directly achieves the
same thing without the extra hop.

- Extract processKeepAliveResponse, loadLeaseID and resetTimer so each
  helper carries a single responsibility.
- Bake purpose and lease-id into a scoped logger (log.L().With) so every
  keepalive log line carries them without repeating fields at the call
  site. Existing log strings and field names are preserved for operator
  alerts.
- Document the closed field: Close() permanently halts the etcd lessor,
  so Grant() rebuilds l.lease on reuse to avoid native KeepAlive
  returning ErrKeepAliveHalted immediately.
- Rewrite tests that referenced keepAliveWorker to exercise
  Lease.KeepAlive directly, and add TestLeaseKeepAliveKeepsExpireMonotonic
  to pin the maxExpire guard.

Signed-off-by: JmPotato <github@ipotato.me>
- Use t.Context() for tests whose only cancel() was the defer cleanup,
  and context.WithCancel(t.Context()) where the test still needs to
  cancel mid-run so the ctx is tied to the test lifecycle.
- Replace direct lease.expireTime.Load().(time.Time) accesses with the
  existing getExpireTime() helper so tests go through the same nil-safe
  path as production callers.
- Drop the ad-hoc Eventually timeouts in favour of the repo-standard
  testutil.Eventually (20s waitFor / 100ms tick); the custom 1s/10ms
  and 2s/50ms values only narrowed the failure window and provided no
  extra signal.

Signed-off-by: JmPotato <github@ipotato.me>
@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 25, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign okjiang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Apr 25, 2026

📝 Walkthrough

Refactors lease keepalive from a worker/ticker model to a single streaming loop over l.lease.KeepAlive(), adds internal lifecycle state (closed) and metrics, refactors expiry logic, and hardens keepalive control flow with watchdogs, monotonic expiry updates, and detailed failure handling.

Changes

  • Lease Core Implementation (pkg/election/lease.go): Rewrote keepalive from worker/ticker to a streaming KeepAlive loop; added closed flag and metrics wiring; introduced helpers: getExpireTime, processKeepAliveResponse, loadLeaseID, resetTimer, initMetrics; refactored expiry access and Close/Grant control paths; added watchdog handling and monotonic expire updates.
  • Lease Testing (pkg/election/lease_test.go): Expanded tests with real-etcd and deterministic fake keepalive stream scenarios; added tests for expire-time advancement, stream-close and invalid-TTL termination, monotonic protection, watchdog behavior, and Prometheus metric assertions; added helpers and a fakeLease implementation.
  • Metrics (pkg/election/metrics.go): New Prometheus metrics for the lease subsystem: a keepalive response interval histogram, a renewal failure counter (labeled by reason), and a local TTL gauge; defines reason constants ReasonStreamFailed, ReasonChannelClosed, ReasonInvalidTTL, ReasonWatchdogTimeout, and provides newLeaseMetrics.

Sequence Diagram(s)

sequenceDiagram
  participant Leader as Leader (pkg/election.Lease)
  participant Lessor as Etcd Lessor (`l.lease`)
  participant Stream as KeepAlive Stream
  participant Watchdog as Watchdog Timer
  participant Metrics as Prometheus Metrics

  rect rgba(200,220,255,0.5)
  Leader->>Lessor: KeepAlive(ctx, leaseID) -> stream
  end

  rect rgba(220,255,200,0.5)
  Stream->>Leader: keepalive response (TTL, leaseID)
  Leader->>Leader: processKeepAliveResponse -> compute expireTime
  Leader->>Metrics: observe interval, update localTTLRemaining
  end

  rect rgba(255,230,200,0.5)
  Watchdog->>Leader: tick -> check local estimate
  Leader-->>Leader: if expired -> step down, inc failure reason
  end

  Stream-->>Leader: close / stream error
  Leader->>Metrics: increment renewalFailureTotal (reason)
  Leader->>Lessor: Close / revoke

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped where keepalives now stream and flow,
Timers watch me, metrics help me grow,
Expiry climbs upward, never slides back,
Streams close or fail — I log every track,
Tiny rabbit cheers: renew, renew, let's go!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 21.74%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title 'pkg/election: add lease keepalive metrics' directly and clearly describes the main change: adding metrics to the lease keepalive mechanism in pkg/election, which aligns with the raw summary and PR objectives.
  • Description check ✅ Passed: The PR description includes the required sections: problem statement with issue reference, detailed explanation of changes with commit message, comprehensive checklist, verification commands, and release notes; it fully addresses the template requirements.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.


@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 25, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
pkg/election/metrics.go (1)

19-72: LGTM — clean metrics declarations.

The collectors follow Prometheus conventions (namespace + subsystem + unit suffix), labels are low-cardinality (purpose, reason are bounded enums), and the help texts are descriptive — particularly the localTTLRemaining help text which explicitly disclaims it isn't the etcd-authoritative TTL. The reason constants are exported with proper GoDoc and reused at the call sites in lease.go.

One small ergonomic suggestion (optional): the three prometheus.MustRegister calls in init() can be collapsed into a single varargs call.

♻️ Optional: combine MustRegister calls
 func init() {
-	prometheus.MustRegister(keepAliveResponseInterval)
-	prometheus.MustRegister(renewalFailureTotal)
-	prometheus.MustRegister(localTTLRemaining)
+	prometheus.MustRegister(
+		keepAliveResponseInterval,
+		renewalFailureTotal,
+		localTTLRemaining,
+	)
 }

Based on learnings: "Use Prometheus-style metrics with subsystem + unit; avoid high-cardinality labels".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/metrics.go` around lines 19 - 72, The init() function currently
calls prometheus.MustRegister three times; replace those three calls with a
single variadic call to prometheus.MustRegister(keepAliveResponseInterval,
renewalFailureTotal, localTTLRemaining) to make registration concise and
idiomatic while preserving the same collectors (refer to the init function and
the collector variables keepAliveResponseInterval, renewalFailureTotal, and
localTTLRemaining).
pkg/election/lease_test.go (2)

361-370: Minor: thread purpose through the helper.

Most callers of newTestLeaseWithKeepAliveCh immediately overwrite lease.Purpose = purpose afterwards (lines 215, 241, 263, 283, 302, 316, 342). Threading the purpose into the helper would remove the seven-line repetition and keep the test setup atomic.

♻️ Suggested helper change
-func newTestLeaseWithKeepAliveCh(keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
+func newTestLeaseWithKeepAliveCh(purpose string, keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
 	lease := &Lease{
-		Purpose:      "test_lease",
+		Purpose:      purpose,
 		lease:        &fakeLease{keepAliveCh: keepAliveCh},
 		leaseTimeout: leaseTimeout,
 	}
 	lease.ID.Store(clientv3.LeaseID(1))
 	lease.expireTime.Store(time.Now().Add(-time.Second))
 	return lease
 }

Then at each call site:

-	lease := newTestLeaseWithKeepAliveCh(keepAliveCh, time.Hour)
-	lease.Purpose = purpose
+	lease := newTestLeaseWithKeepAliveCh(purpose, keepAliveCh, time.Hour)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 361 - 370, Change
newTestLeaseWithKeepAliveCh to accept a purpose string and assign it to
lease.Purpose so callers don't need to overwrite it; specifically, modify the
helper signature newTestLeaseWithKeepAliveCh(keepAliveCh chan
*clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) to include a
purpose parameter and set lease.Purpose = purpose inside the function, then
update all call sites that currently set lease.Purpose after creation to pass
the desired purpose into the newTestLeaseWithKeepAliveCh call (e.g., replace
post-construction assignments like lease.Purpose = purpose with passing purpose
into the helper).

200-208: Sleep-based monotonicity probe is acceptable but worth noting.

time.Sleep(100 * time.Millisecond) at line 205 is the only nondeterministic synchronization in the new tests. It's pragmatic — the test fails only if the monotonicity guard regresses, and a too-short sleep trivially passes — so a missing guard wouldn't cause spurious failures, only spurious passes. Fine to keep, but if you'd like a more deterministic approach, a follow-up could send a third response with a larger TTL and use Eventually to wait for it to take effect, which proves the loop has drained the smaller TTL.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 200 - 208, The test currently uses
time.Sleep(100 * time.Millisecond) to wait for the keep-alive processing loop;
replace this nondeterministic sleep with a deterministic probe: after
sendKeepAliveResponse(t, keepAliveCh, 1) send a third keep-alive with a larger
TTL (e.g., via sendKeepAliveResponse with TTL > previous), then use a
retry/assertion helper (like re.Eventually or require.Eventually) to wait until
lease.getExpireTime() reflects the later larger TTL, which guarantees the loop
drained the smaller TTL rather than relying on an arbitrary sleep.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b252437b-213b-4a52-9752-a592ac7ce8bb

📥 Commits

Reviewing files that changed from the base of the PR and between ee0fa30 and 02b6b75.

📒 Files selected for processing (3)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
  • pkg/election/metrics.go

Comment thread pkg/election/lease.go
Comment on lines +226 to 241
case <-watchdog.C:
	// Timer delivery can race with a keepalive response that already
	// extended expireTime. Only step down when the local lease
	// estimate is actually exhausted.
	actualExpire := l.getExpireTime()
	remaining := time.Until(actualExpire)
	localTTLRemaining.WithLabelValues(l.Purpose).Set(remaining.Seconds())
	if remaining > 0 {
		watchdog.Reset(remaining)
		continue
	}
	logger.Info("keep alive lease too slow",
		zap.Duration("timeout-duration", l.leaseTimeout),
		zap.Time("actual-expire", actualExpire))
	renewalFailureTotal.WithLabelValues(l.Purpose, ReasonWatchdogTimeout).Inc()
	return


⚠️ Potential issue | 🟡 Minor

Watchdog branch can over-count ReasonWatchdogTimeout during shutdown races.

When ctx is cancelled at almost the same time as watchdog.C fires, Go's select may pick the watchdog case even though the caller has already abandoned the loop. Unlike the channel-closed branch (line 194), this branch unconditionally increments renewalFailureTotal{reason="watchdog_timeout"}, so a clean shutdown of an already-expired lease can produce a phantom failure event.

The contrast is intentional in TestNoFailureOnContextCancel, but that test only exercises the early-cancel path — once leaseTimeout has elapsed without responses, the race becomes observable. Consider mirroring the ctx.Err() == nil guard from the channel-closed branch:

🛡️ Suggested guard
 		case <-watchdog.C:
 			// Timer delivery can race with a keepalive response that already
 			// extended expireTime. Only step down when the local lease
 			// estimate is actually exhausted.
 			actualExpire := l.getExpireTime()
 			remaining := time.Until(actualExpire)
 			localTTLRemaining.WithLabelValues(l.Purpose).Set(remaining.Seconds())
 			if remaining > 0 {
 				watchdog.Reset(remaining)
 				continue
 			}
+			if ctx.Err() != nil {
+				return
+			}
 			logger.Info("keep alive lease too slow",
 				zap.Duration("timeout-duration", l.leaseTimeout),
 				zap.Time("actual-expire", actualExpire))
 			renewalFailureTotal.WithLabelValues(l.Purpose, ReasonWatchdogTimeout).Inc()
 			return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease.go` around lines 226 - 241, The watchdog case can
increment renewalFailureTotal spuriously during shutdown races; modify the
watchdog.C branch in pkg/election/lease.go so after computing remaining :=
time.Until(actualExpire) and before logging/incrementing you check the context
(ctx.Err() == nil) — mirror the guard used in the channel-closed branch — and
only call logger.Info, renewalFailureTotal.WithLabelValues(...).Inc(), and
return when ctx is still active; otherwise treat it as a clean shutdown (reset
or continue as appropriate). Ensure you reference watchdog.C, l.getExpireTime(),
localTTLRemaining, renewalFailureTotal, ReasonWatchdogTimeout and l.Purpose in
the change.

Signed-off-by: JmPotato <github@ipotato.me>
@JmPotato JmPotato force-pushed the bob/lease-keepalive-metrics branch from 02b6b75 to b0c7b7f on April 25, 2026 at 14:33

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/election/lease_test.go (1)

392-401: Histogram introspection LGTM, but worth a brief comment.

The observer.(prometheus.Metric) assertion works because client_golang histogram instances implement both Observer and Metric. This is intentional but not obvious; a one-line comment explaining the assertion would help future readers and guard against regressions if someone replaces the histogram with a wrapped observer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 392 - 401, In
keepAliveResponseIntervalCount, add a one-line comment above the
observer.(prometheus.Metric) type assertion explaining that the current
prometheus histogram implementation also implements prometheus.Metric, so
casting the Observer to prometheus.Metric is safe here; also note that this
relies on the concrete histogram implementation and would break if the metric
were replaced by a wrapped Observer, to warn future maintainers and prevent
accidental regressions.
pkg/election/lease.go (1)

86-96: Consider initializing metrics in NewLease, not Grant.

l.metrics is currently initialized only at the end of Grant(). If a programmer ever invokes KeepAlive() before a successful Grant() (or after Grant() returns an error), the very first metric write — e.g. l.metrics.streamFailed.Inc() at line 178 — will deref a nil prometheus.Counter and panic.

Since newLeaseMetrics() only depends on l.Purpose (set in NewLease), initializing there makes the type zero-value-friendlier and matches the guideline "Keep structs zero-value friendly":

🛡️ Suggested move
 func NewLease(client *clientv3.Client, purpose string) *Lease {
-	return &Lease{
+	l := &Lease{
 		Purpose: purpose,
 		client:  client,
 		lease:   clientv3.NewLease(client),
 	}
+	l.initMetrics()
+	return l
 }
@@
 	l.expireTime.Store(start.Add(time.Duration(leaseResp.TTL) * time.Second))
-	l.initMetrics()
 	return nil

As per coding guidelines: "Keep structs zero-value friendly; init maps/slices before use".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease.go` around lines 86 - 96, The metrics field l.metrics
should be initialized in NewLease (where l.Purpose is already set) instead of
only in Grant to avoid nil derefs when methods like KeepAlive access l.metrics
before a successful Grant; call newLeaseMetrics(l.Purpose) during NewLease
construction to set l.metrics and remove or keep a no-op initMetrics call from
Grant (or guard uses) so Grant no longer is the sole initializer; update
references to l.metrics (e.g., streamFailed) to assume it is non-nil after
NewLease.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b76876a9-1227-4569-9e60-6dc4d576f1b4

📥 Commits

Reviewing files that changed from the base of the PR and between 02b6b75 and b0c7b7f.

📒 Files selected for processing (3)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
  • pkg/election/metrics.go

Comment on lines +187 to +208
func TestLeaseKeepAliveKeepsExpireMonotonic(t *testing.T) {
	re := require.New(t)
	keepAliveCh := make(chan *clientv3.LeaseKeepAliveResponse, 1)
	lease := newTestLeaseWithKeepAliveCh("test_lease", keepAliveCh, time.Hour)
	go lease.KeepAlive(t.Context())

	// The large TTL lands first and should push expireTime well into the future.
	sendKeepAliveResponse(t, keepAliveCh, 30)
	testutil.Eventually(re, func() bool {
		return time.Until(lease.getExpireTime()) > 20*time.Second
	})
	beforeSmall := lease.getExpireTime()

	// A subsequent smaller TTL must not regress expireTime, because its absolute
	// deadline is earlier than the stored maxExpire.
	sendKeepAliveResponse(t, keepAliveCh, 1)
	// Give the loop a beat to process and (incorrectly) store the smaller expire
	// if the monotonicity guard were missing.
	time.Sleep(100 * time.Millisecond)
	afterSmall := lease.getExpireTime()
	re.False(afterSmall.Before(beforeSmall))
}


⚠️ Potential issue | 🟡 Minor

Potentially flaky: 100 ms sleep used as a synchronization barrier.

The test relies on time.Sleep(100ms) for the loop to consume and (potentially incorrectly) apply the smaller TTL response. On a slow/loaded CI runner the loop may not have run yet, in which case the assertion would pass even if the monotonicity guard regressed — i.e., a bug in lease.go could go undetected.

Consider draining the smaller-TTL response deterministically by following it with an unbuffered "marker" send that only completes once the consumer is back at the select, or by polling a metric that increments per processed response (e.g., keepAliveResponseIntervalCount(t, purpose) >= prev+1) before checking getExpireTime().

♻️ Sketch using histogram count as the sync signal
-	sendKeepAliveResponse(t, keepAliveCh, 30)
-	testutil.Eventually(re, func() bool {
-		return time.Until(lease.getExpireTime()) > 20*time.Second
-	})
-	beforeSmall := lease.getExpireTime()
-
-	// A subsequent smaller TTL must not regress expireTime, because its absolute
-	// deadline is earlier than the stored maxExpire.
-	sendKeepAliveResponse(t, keepAliveCh, 1)
-	// Give the loop a beat to process and (incorrectly) store the smaller expire
-	// if the monotonicity guard were missing.
-	time.Sleep(100 * time.Millisecond)
-	afterSmall := lease.getExpireTime()
-	re.False(afterSmall.Before(beforeSmall))
+	purpose := "TestLeaseKeepAliveKeepsExpireMonotonic"
+	// (use newTestLeaseWithKeepAliveCh(purpose, ...) above)
+	sendKeepAliveResponse(t, keepAliveCh, 30)
+	testutil.Eventually(re, func() bool {
+		return time.Until(lease.getExpireTime()) > 20*time.Second
+	})
+	beforeSmall := lease.getExpireTime()
+	beforeCount := keepAliveResponseIntervalCount(t, purpose)
+
+	sendKeepAliveResponse(t, keepAliveCh, 1)
+	testutil.Eventually(re, func() bool {
+		return keepAliveResponseIntervalCount(t, purpose) > beforeCount
+	})
+	re.False(lease.getExpireTime().Before(beforeSmall))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 187 - 208, The test uses
time.Sleep(100*time.Millisecond) as a race-prone sync; replace it with a
deterministic synchronization so the KeepAlive consumer has processed the
small-TTL response before asserting monotonicity: after sendKeepAliveResponse(t,
keepAliveCh, 1) either (a) send a marker over an unbuffered channel that the
KeepAlive loop reads when it finishes processing each response (add a test-only
opt to lease.KeepAlive or a hook callback) and wait for that marker, or (b) poll
with testutil.Eventually until lease.getExpireTime() stops changing or until a
processed-response counter increments (expose/read a test metric from the
KeepAlive loop) before reading afterSmall; update
TestLeaseKeepAliveKeepsExpireMonotonic to use the chosen deterministic wait
instead of time.Sleep.

Comment on lines +275 to +292
func TestRenewalFailureWatchdogTimeout(t *testing.T) {
	re := require.New(t)
	purpose := t.Name()
	keepAliveCh := make(chan *clientv3.LeaseKeepAliveResponse)
	lease := newTestLeaseWithKeepAliveCh(purpose, keepAliveCh, 20*time.Millisecond)
	beforeFailure := renewalFailureValue(purpose, reasonWatchdogTimeout)
	done := make(chan struct{})
	go func() {
		lease.KeepAlive(t.Context())
		close(done)
	}()

	testutil.Eventually(re, func() bool {
		return isClosed(done)
	})
	re.Equal(beforeFailure+1, renewalFailureValue(purpose, reasonWatchdogTimeout))
	re.LessOrEqual(localTTLRemainingValue(purpose), float64(0))
}


⚠️ Potential issue | 🟡 Minor

Tight 20 ms watchdog timeout may flake under load.

20 ms is enough for the deterministic timer firing, but on a heavily loaded CI runner the goroutine startup + scheduler latency can push the actual <-watchdog.C delivery much later, and a concurrent GC pause can extend time.Until(actualExpire) past zero (since expireTime is initialized to now-1s). Consider bumping to 100–200 ms to keep the test stable without slowing it materially.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 275 - 292, Test uses a brittle 20ms
watchdog timeout in TestRenewalFailureWatchdogTimeout which can flake under CI;
update the watchdog duration passed to newTestLeaseWithKeepAliveCh from
20*time.Millisecond to a more robust value (e.g., 100–200*time.Millisecond) so
the test has enough time for goroutine startup/scheduling without materially
slowing it, leaving the rest of the test (done channel, KeepAlive call, and
assertions on renewalFailureValue and localTTLRemainingValue) unchanged.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 25, 2026

@JmPotato: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • pull-unit-test-next-gen-3 (commit b0c7b7f, required): /test pull-unit-test-next-gen-3
  • pull-unit-test-next-gen-2 (commit b0c7b7f, required): /test pull-unit-test-next-gen-2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

