server: fix silent store heartbeat forwarding failure by liyishuai · Pull Request #10620 · tikv/pd

liyishuai · 2026-04-24T12:01:16Z

What problem does this PR solve?

Issue Number: Close #10619

When PD is deployed in microservice mode, StoreHeartbeat requests arriving at the monolithic PD server are handled locally and then forwarded to the scheduling server.

However, during a PD leader switch, the servicePrimaryMap in PD temporarily loses the primary address of the scheduling service until the etcd watcher catches up. During this window, GetServicePrimaryAddr returns an empty string, causing the scheduling gRPC client to evaluate to nil in updateSchedulingClient.

When the scheduling client was nil, the forwarded heartbeat was silently dropped. Because the scheduling service was starved for StoreHeartbeats, the stores' LastHeartbeatTS aged out locally on the scheduling server, eventually causing the stores to transition from disconnected -> unhealthy -> down. This triggered a false SchedulingServerDiscoverDownStore alert, even though the stores were perfectly healthy and successfully heartbeating to PD.

What is changed and how does it work?

This PR fixes the silent failure by adding an explicit check for a nil client when forwarding store heartbeats. If the client is nil, it increments a new Prometheus metric pd_server_forward_fail_total with labels ("store_heartbeat", "client"). This makes the forwarding failure observable and aligns the error handling with the existing RegionHeartbeat forwarding logic.

server: fix silent store heartbeat forwarding failure

When PD is in microservice mode but the scheduling primary's address
is not yet available (e.g. during a leader switch where the primary
watcher hasn't caught up), the scheduling client evaluates to nil.

Previously, this resulted in the StoreHeartbeat forward being silently
dropped, which starved the scheduling server of heartbeat information
and caused false 'SchedulingServerDiscoverDownStore' alerts as the
stores aged out locally on the scheduling server.

This commit adds an explicit check and increments a new Prometheus
metric 'pd_server_forward_fail_total' with labels
('store_heartbeat', 'client') when the client is nil, making the
forwarding failure observable and consistent with the region heartbeat
forwarding error handling.

Check List

Tests

Integration test

Code changes

Has HTTP APIs changed (Don't forget to add the declarative for the new API)

Side effects

Possible performance regression
Increased code complexity
Breaking backward compatibility

Related changes

PR to update pingcap/docs/pingcap/docs-cn:
PR to update pingcap/tiup:
Need to cherry-pick to the release branch

Release note

Fix an issue where store heartbeats were silently dropped when forwarded from PD to the scheduling service during a leader switch, which could cause false down-store alerts.

Summary by CodeRabbit

New Features
- Added a metric counter that distinguishes heartbeat-forwarding failures when the scheduling service client is unavailable.
Tests
- Added an integration test that verifies the new metric increases when the scheduling primary is unavailable, making heartbeat-forwarding failures observable.

When PD is in microservice mode but the scheduling primary's address is not yet available (e.g. during a leader switch where the primary watcher hasn't caught up), the scheduling client evaluates to nil. Previously, this resulted in the StoreHeartbeat forward being silently dropped, which starved the scheduling server of heartbeat information and caused false 'SchedulingServerDiscoverDownStore' alerts as the stores aged out locally on the scheduling server. This commit adds an explicit check and increments a new Prometheus metric 'pd_server_forward_fail_total' with labels ('store_heartbeat', 'client') when the client is nil, making the forwarding failure observable and consistent with the region heartbeat forwarding error handling. Issue Number: close tikv#10620 Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>

ti-chi-bot · 2026-04-24T12:01:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign connor1996 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-04-24T12:01:27Z

Hi @liyishuai. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-04-24T12:01:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e3afa76e-348d-445c-bdb4-77ee954334fe

📥 Commits

Reviewing files that changed from the base of the PR and between ee0fa30 and 214dfb8.

📒 Files selected for processing (2)

server/grpc_service.go
tests/integrations/mcs/scheduling/server_test.go

📝 Walkthrough

Walkthrough

Adds a labeled Prometheus counter increment when StoreHeartbeat forwarding is skipped because the scheduling gRPC client is nil, and adds an integration test that clears the scheduling primary address to assert the forward-failure metric increases.

Changes

Cohort / File(s)	Summary
Server metric update `server/grpc_service.go`	Increment a distinct `pd_server_forward_fail_total` label when `StoreHeartbeat` forwarding is skipped due to missing scheduling client (cli == nil).
Integration test `tests/integrations/mcs/scheduling/server_test.go`	Add `TestStoreHeartbeatForwardFailureWithNoSchedulingPrimary` to simulate missing scheduling primary, record the `store_heartbeat,client` forward-fail metric before/after, and assert it increases.

Sequence Diagram(s)

sequenceDiagram
    participant Store as Store
    participant PD as PD Server
    participant Scheduling as Scheduling Service
    participant Metrics as Prometheus

    Store->>PD: Send StoreHeartbeat
    PD->>PD: Handle heartbeat locally
    alt Scheduling client available
        PD->>Scheduling: Forward StoreHeartbeat
        Scheduling-->>PD: Ack / Error
        PD->>Metrics: inc(pd_server_forward_fail_total) on send error
    else Scheduling client nil
        PD-->>Metrics: inc(pd_server_forward_fail_total) for missing client
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

rleungx
disksing
lhy1024

Poem

🐰 I hopped where heartbeats whisper and dwell,

When the client is gone I toll the bell.
A counter now counts each lost little plea,
So schedulers notice—no more mystery. 🌿🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'server: fix silent store heartbeat forwarding failure' accurately summarizes the main change: fixing a bug where StoreHeartbeat forwarding failures were silently dropped.
Description check	✅ Passed	The PR description comprehensively addresses the template requirements: clearly states the problem (Issue `#10619`), explains the changes with a detailed commit message, includes an integration test in the checklist, and provides a release note.
Linked Issues check	✅ Passed	The PR successfully implements all objectives from `#10619`: makes StoreHeartbeat forwarding failures observable via a new Prometheus metric, prevents silent dropping of requests when scheduling client is nil, provides parity with RegionHeartbeat error handling, and addresses the root cause of false down-store alerts.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the objectives: the code change adds the nil-client check with metric increment, and the test validates the fix. No extraneous modifications are present.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Fixes an observability gap in PD microservice mode where StoreHeartbeat forwarding to the scheduling service could be silently dropped when the scheduling primary address is temporarily unavailable (e.g., during leader switch).

Changes:

Add a new forward_fail_total{request="store_heartbeat", type="client"} counter for the “nil scheduling client” forwarding case.
Add an integration test to reproduce the missing-primary scenario and assert the metric increments.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
server/grpc_service.go	Increments a forward-failure metric when the scheduling client is nil during store-heartbeat forwarding.
tests/integrations/mcs/scheduling/server_test.go	Adds an integration test that simulates missing scheduling primary and checks the forwarding-failure metric.

+
+	// Simulate the bug condition: clear the scheduling primary address
+	// so GetServicePrimaryAddr returns "", mimicking a leader switch where
+	// the primary watcher hasn't caught up.


+	// Wait until PD recognizes scheduling as independent and forwards heartbeats.
+	testutil.Eventually(re, func() bool {
+		_, resp1Err := s.StoreHeartbeat(
+			context.Background(), &pdpb.StoreHeartbeatRequest{
+				Header: &pdpb.RequestHeader{ClusterId: suite.pdLeader.GetClusterID()},
+				Stats: &pdpb.StoreStats{
+					StoreId:  2,
+					Capacity: 100,
+				},
+			},
+		)
+		return resp1Err == nil
+	})


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/scheduling/server_test.go`:
- Around line 398-432: Snapshot the current value of
pd_server_forward_fail_total{request="store_heartbeat",type="client"} from
prometheus.DefaultGatherer before calling
suite.pdLeader.GetServer().SetServicePrimaryAddr(constant.SchedulingServiceName,
""); after sending StoreHeartbeat, assert the counter increased by comparing to
that baseline (i.e. assert delta > 0) instead of checking absolute >0; also
ensure you restore the primary override (undo the SetServicePrimaryAddr call) in
a defer so the empty primary doesn't leak into other tests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4414744d-02c6-435b-9da2-8a90306a08ac

📥 Commits

Reviewing files that changed from the base of the PR and between ee0fa30 and 4de5104.

📒 Files selected for processing (2)

server/grpc_service.go
tests/integrations/mcs/scheduling/server_test.go

Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>

liyishuai · 2026-04-24T12:18:15Z

@coderabbitai full review

coderabbitai · 2026-04-24T12:19:17Z

✅ Actions performed

Full review triggered.

Copilot AI review requested due to automatic review settings April 24, 2026 12:01

ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 24, 2026

ti-chi-bot Bot added contribution This PR is from a community contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 24, 2026

ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 24, 2026

Copilot started reviewing on behalf of liyishuai April 24, 2026 12:01 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread tests/integrations/mcs/scheduling/server_test.go

ti-chi-bot Bot added dco-signoff: no Indicates the PR's author has not signed dco. and removed dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 24, 2026

tests: address review comments for forward failure metric snapshot

214dfb8

Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>

liyishuai force-pushed the fix-silent-store-heartbeat-forwarding-failure branch from ae33e59 to 214dfb8 Compare April 24, 2026 12:13

ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. and removed dco-signoff: no Indicates the PR's author has not signed dco. labels Apr 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: fix silent store heartbeat forwarding failure#10620

server: fix silent store heartbeat forwarding failure#10620
liyishuai wants to merge 2 commits intotikv:masterfrom
liyishuai:fix-silent-store-heartbeat-forwarding-failure

liyishuai commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

ti-chi-bot Bot commented Apr 24, 2026

Uh oh!

ti-chi-bot Bot commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

liyishuai commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

liyishuai commented Apr 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot Bot commented Apr 24, 2026

Uh oh!

ti-chi-bot Bot commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

liyishuai commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

liyishuai commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading