Skip to content

server: fix silent store heartbeat forwarding failure#10620

Open
liyishuai wants to merge 2 commits intotikv:masterfrom
liyishuai:fix-silent-store-heartbeat-forwarding-failure
Open

server: fix silent store heartbeat forwarding failure#10620
liyishuai wants to merge 2 commits intotikv:masterfrom
liyishuai:fix-silent-store-heartbeat-forwarding-failure

Conversation

@liyishuai
Copy link
Copy Markdown
Contributor

@liyishuai liyishuai commented Apr 24, 2026

What problem does this PR solve?

Issue Number: Close #10619

When PD is deployed in microservice mode, StoreHeartbeat requests arriving at the monolithic PD server are handled locally and then forwarded to the scheduling server.

However, during a PD leader switch, the servicePrimaryMap in PD temporarily loses the primary address of the scheduling service until the etcd watcher catches up. During this window, GetServicePrimaryAddr returns an empty string, causing the scheduling gRPC client to evaluate to nil in updateSchedulingClient.

When the scheduling client was nil, the forwarded heartbeat was silently dropped. Because the scheduling service was starved for StoreHeartbeats, the stores' LastHeartbeatTS aged out locally on the scheduling server, eventually causing the stores to transition from disconnected -> unhealthy -> down. This triggered a false SchedulingServerDiscoverDownStore alert, even though the stores were perfectly healthy and successfully heartbeating to PD.

What is changed and how does it work?

This PR fixes the silent failure by adding an explicit check for a nil client when forwarding store heartbeats. If the client is nil, it increments a new Prometheus metric pd_server_forward_fail_total with labels ("store_heartbeat", "client"). This makes the forwarding failure observable and aligns the error handling with the existing RegionHeartbeat forwarding logic.

server: fix silent store heartbeat forwarding failure

When PD is in microservice mode but the scheduling primary's address
is not yet available (e.g. during a leader switch where the primary
watcher hasn't caught up), the scheduling client evaluates to nil.

Previously, this resulted in the StoreHeartbeat forward being silently
dropped, which starved the scheduling server of heartbeat information
and caused false 'SchedulingServerDiscoverDownStore' alerts as the
stores aged out locally on the scheduling server.

This commit adds an explicit check and increments a new Prometheus
metric 'pd_server_forward_fail_total' with labels
('store_heartbeat', 'client') when the client is nil, making the
forwarding failure observable and consistent with the region heartbeat
forwarding error handling.

Check List

Tests

  • Integration test

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

Fix an issue where store heartbeats were silently dropped when forwarded from PD to the scheduling service during a leader switch, which could cause false down-store alerts.

Summary by CodeRabbit

  • New Features

    • Added a metric counter that distinguishes heartbeat-forwarding failures when the scheduling service client is unavailable.
  • Tests

    • Added an integration test that verifies the new metric increases when the scheduling primary is unavailable, making heartbeat-forwarding failures observable.

When PD is in microservice mode but the scheduling primary's address
is not yet available (e.g. during a leader switch where the primary
watcher hasn't caught up), the scheduling client evaluates to nil.

Previously, this resulted in the StoreHeartbeat forward being silently
dropped, which starved the scheduling server of heartbeat information
and caused false 'SchedulingServerDiscoverDownStore' alerts as the
stores aged out locally on the scheduling server.

This commit adds an explicit check and increments a new Prometheus
metric 'pd_server_forward_fail_total' with labels
('store_heartbeat', 'client') when the client is nil, making the
forwarding failure observable and consistent with the region heartbeat
forwarding error handling.

Issue Number: close tikv#10620

Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
Copilot AI review requested due to automatic review settings April 24, 2026 12:01
@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 24, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign connor1996 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added contribution This PR is from a community contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 24, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 24, 2026

Hi @liyishuai. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e3afa76e-348d-445c-bdb4-77ee954334fe

📥 Commits

Reviewing files that changed from the base of the PR and between ee0fa30 and 214dfb8.

📒 Files selected for processing (2)
  • server/grpc_service.go
  • tests/integrations/mcs/scheduling/server_test.go

📝 Walkthrough

Walkthrough

Adds a labeled Prometheus counter increment when StoreHeartbeat forwarding is skipped because the scheduling gRPC client is nil, and adds an integration test that clears the scheduling primary address to assert the forward-failure metric increases.

Changes

Cohort / File(s) Summary
Server metric update
server/grpc_service.go
Increment a distinct pd_server_forward_fail_total label when StoreHeartbeat forwarding is skipped due to missing scheduling client (cli == nil).
Integration test
tests/integrations/mcs/scheduling/server_test.go
Add TestStoreHeartbeatForwardFailureWithNoSchedulingPrimary to simulate missing scheduling primary, record the store_heartbeat,client forward-fail metric before/after, and assert it increases.

Sequence Diagram(s)

sequenceDiagram
    participant Store as Store
    participant PD as PD Server
    participant Scheduling as Scheduling Service
    participant Metrics as Prometheus

    Store->>PD: Send StoreHeartbeat
    PD->>PD: Handle heartbeat locally
    alt Scheduling client available
        PD->>Scheduling: Forward StoreHeartbeat
        Scheduling-->>PD: Ack / Error
        PD->>Metrics: inc(pd_server_forward_fail_total) on send error
    else Scheduling client nil
        PD-->>Metrics: inc(pd_server_forward_fail_total) for missing client
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • rleungx
  • disksing
  • lhy1024

Poem

🐰 I hopped where heartbeats whisper and dwell,

When the client is gone I toll the bell.
A counter now counts each lost little plea,
So schedulers notice—no more mystery. 🌿🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'server: fix silent store heartbeat forwarding failure' accurately summarizes the main change: fixing a bug where StoreHeartbeat forwarding failures were silently dropped.
Description check ✅ Passed The PR description comprehensively addresses the template requirements: clearly states the problem (Issue #10619), explains the changes with a detailed commit message, includes an integration test in the checklist, and provides a release note.
Linked Issues check ✅ Passed The PR successfully implements all objectives from #10619: makes StoreHeartbeat forwarding failures observable via a new Prometheus metric, prevents silent dropping of requests when scheduling client is nil, provides parity with RegionHeartbeat error handling, and addresses the root cause of false down-store alerts.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the objectives: the code change adds the nil-client check with metric increment, and the test validates the fix. No extraneous modifications are present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 24, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an observability gap in PD microservice mode where StoreHeartbeat forwarding to the scheduling service could be silently dropped when the scheduling primary address is temporarily unavailable (e.g., during leader switch).

Changes:

  • Add a new forward_fail_total{request="store_heartbeat", type="client"} counter for the “nil scheduling client” forwarding case.
  • Add an integration test to reproduce the missing-primary scenario and assert the metric increments.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
server/grpc_service.go Increments a forward-failure metric when the scheduling client is nil during store-heartbeat forwarding.
tests/integrations/mcs/scheduling/server_test.go Adds an integration test that simulates missing scheduling primary and checks the forwarding-failure metric.


// Simulate the bug condition: clear the scheduling primary address
// so GetServicePrimaryAddr returns "", mimicking a leader switch where
// the primary watcher hasn't caught up.
Comment thread tests/integrations/mcs/scheduling/server_test.go Outdated
Comment on lines +384 to +396
// Wait until PD recognizes scheduling as independent and forwards heartbeats.
testutil.Eventually(re, func() bool {
_, resp1Err := s.StoreHeartbeat(
context.Background(), &pdpb.StoreHeartbeatRequest{
Header: &pdpb.RequestHeader{ClusterId: suite.pdLeader.GetClusterID()},
Stats: &pdpb.StoreStats{
StoreId: 2,
Capacity: 100,
},
},
)
return resp1Err == nil
})
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/scheduling/server_test.go`:
- Around line 398-432: Snapshot the current value of
pd_server_forward_fail_total{request="store_heartbeat",type="client"} from
prometheus.DefaultGatherer before calling
suite.pdLeader.GetServer().SetServicePrimaryAddr(constant.SchedulingServiceName,
""); after sending StoreHeartbeat, assert the counter increased by comparing to
that baseline (i.e. assert delta > 0) instead of checking absolute >0; also
ensure you restore the primary override (undo the SetServicePrimaryAddr call) in
a defer so the empty primary doesn't leak into other tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4414744d-02c6-435b-9da2-8a90306a08ac

📥 Commits

Reviewing files that changed from the base of the PR and between ee0fa30 and 4de5104.

📒 Files selected for processing (2)
  • server/grpc_service.go
  • tests/integrations/mcs/scheduling/server_test.go

Comment thread tests/integrations/mcs/scheduling/server_test.go
@ti-chi-bot ti-chi-bot Bot added dco-signoff: no Indicates the PR's author has not signed dco. and removed dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 24, 2026
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
@liyishuai liyishuai force-pushed the fix-silent-store-heartbeat-forwarding-failure branch from ae33e59 to 214dfb8 Compare April 24, 2026 12:13
@ti-chi-bot ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. and removed dco-signoff: no Indicates the PR's author has not signed dco. labels Apr 24, 2026
@liyishuai
Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

✅ Actions performed

Full review triggered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-triage-completed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mcs: silent store heartbeat forwarding failure causes false down-store alerts

2 participants