server: fix silent store heartbeat forwarding failure#10620
server: fix silent store heartbeat forwarding failure#10620liyishuai wants to merge 2 commits intotikv:masterfrom
Conversation
When PD is in microservice mode but the scheduling primary's address
is not yet available (e.g. during a leader switch where the primary
watcher hasn't caught up), the scheduling client evaluates to nil.
Previously, this resulted in the StoreHeartbeat forward being silently
dropped, which starved the scheduling server of heartbeat information
and caused false 'SchedulingServerDiscoverDownStore' alerts as the
stores aged out locally on the scheduling server.
This commit adds an explicit check and increments a new Prometheus
metric 'pd_server_forward_fail_total' with labels
('store_heartbeat', 'client') when the client is nil, making the
forwarding failure observable and consistent with the region heartbeat
forwarding error handling.
Issue Number: close tikv#10620
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @liyishuai. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds a labeled Prometheus counter increment when Changes
Sequence Diagram(s)sequenceDiagram
participant Store as Store
participant PD as PD Server
participant Scheduling as Scheduling Service
participant Metrics as Prometheus
Store->>PD: Send StoreHeartbeat
PD->>PD: Handle heartbeat locally
alt Scheduling client available
PD->>Scheduling: Forward StoreHeartbeat
Scheduling-->>PD: Ack / Error
PD->>Metrics: inc(pd_server_forward_fail_total) on send error
else Scheduling client nil
PD-->>Metrics: inc(pd_server_forward_fail_total) for missing client
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Fixes an observability gap in PD microservice mode where StoreHeartbeat forwarding to the scheduling service could be silently dropped when the scheduling primary address is temporarily unavailable (e.g., during leader switch).
Changes:
- Add a new
forward_fail_total{request="store_heartbeat", type="client"}counter for the “nil scheduling client” forwarding case. - Add an integration test to reproduce the missing-primary scenario and assert the metric increments.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| server/grpc_service.go | Increments a forward-failure metric when the scheduling client is nil during store-heartbeat forwarding. |
| tests/integrations/mcs/scheduling/server_test.go | Adds an integration test that simulates missing scheduling primary and checks the forwarding-failure metric. |
|
|
||
| // Simulate the bug condition: clear the scheduling primary address | ||
| // so GetServicePrimaryAddr returns "", mimicking a leader switch where | ||
| // the primary watcher hasn't caught up. |
| // Wait until PD recognizes scheduling as independent and forwards heartbeats. | ||
| testutil.Eventually(re, func() bool { | ||
| _, resp1Err := s.StoreHeartbeat( | ||
| context.Background(), &pdpb.StoreHeartbeatRequest{ | ||
| Header: &pdpb.RequestHeader{ClusterId: suite.pdLeader.GetClusterID()}, | ||
| Stats: &pdpb.StoreStats{ | ||
| StoreId: 2, | ||
| Capacity: 100, | ||
| }, | ||
| }, | ||
| ) | ||
| return resp1Err == nil | ||
| }) |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integrations/mcs/scheduling/server_test.go`:
- Around line 398-432: Snapshot the current value of
pd_server_forward_fail_total{request="store_heartbeat",type="client"} from
prometheus.DefaultGatherer before calling
suite.pdLeader.GetServer().SetServicePrimaryAddr(constant.SchedulingServiceName,
""); after sending StoreHeartbeat, assert the counter increased by comparing to
that baseline (i.e. assert delta > 0) instead of checking absolute >0; also
ensure you restore the primary override (undo the SetServicePrimaryAddr call) in
a defer so the empty primary doesn't leak into other tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4414744d-02c6-435b-9da2-8a90306a08ac
📒 Files selected for processing (2)
server/grpc_service.gotests/integrations/mcs/scheduling/server_test.go
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
ae33e59 to
214dfb8
Compare
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
What problem does this PR solve?
Issue Number: Close #10619
When PD is deployed in microservice mode,
StoreHeartbeatrequests arriving at the monolithic PD server are handled locally and then forwarded to the scheduling server.However, during a PD leader switch, the
servicePrimaryMapin PD temporarily loses the primary address of the scheduling service until the etcd watcher catches up. During this window,GetServicePrimaryAddrreturns an empty string, causing the scheduling gRPC client to evaluate tonilinupdateSchedulingClient.When the scheduling client was
nil, the forwarded heartbeat was silently dropped. Because the scheduling service was starved forStoreHeartbeats, the stores'LastHeartbeatTSaged out locally on the scheduling server, eventually causing the stores to transition fromdisconnected->unhealthy->down. This triggered a falseSchedulingServerDiscoverDownStorealert, even though the stores were perfectly healthy and successfully heartbeating to PD.What is changed and how does it work?
This PR fixes the silent failure by adding an explicit check for a
nilclient when forwarding store heartbeats. If the client isnil, it increments a new Prometheus metricpd_server_forward_fail_totalwith labels("store_heartbeat", "client"). This makes the forwarding failure observable and aligns the error handling with the existingRegionHeartbeatforwarding logic.Check List
Tests
Code changes
Side effects
Related changes
pingcap/docs/pingcap/docs-cn:pingcap/tiup:Release note
Summary by CodeRabbit
New Features
Tests