server, core, tests: preserve region cpu stats in scheduling heartbeat #10609
server, core, tests: preserve region cpu stats in scheduling heartbeat #10609lhy1024 wants to merge 10 commits intotikv:masterfrom
Conversation
Signed-off-by: lhy1024 <admin@liudos.us>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughForwards CPU stats in region heartbeats, exposes CPU getters on heartbeat types, populates RegionInfo.cpuStats and derived CPU metrics from heartbeats, surfaces CPU usage in API responses, updates go.mod replaces for a kvproto fork, and adds/updates unit and integration tests validating CPU propagation and sync triggers. Changes
Sequence Diagram(s)sequenceDiagram
participant Store as Store (store/peer)
participant GRPC as gRPC Server
participant Scheduler as Scheduling Microservice
participant Core as pkg/core (RegionFromHeartbeat)
participant API as mcs HTTP API
Store->>GRPC: Send RegionHeartbeat / StoreHeartbeat (CpuStats)
GRPC->>Scheduler: Forward RegionHeartbeat (CpuStats)
Scheduler->>Core: RegionFromHeartbeat(heartbeat)
Core-->>Scheduler: RegionInfo (cpuStats populated)
Scheduler->>API: API request for region(s)
API->>Core: Query RegionInfo
Core-->>API: Return CPU usage via GetTotalCPUUsageFromStats()
API-->>Client: JSON with CPUUsage
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
go.mod (1)
8-8: Revert the temporary kvproto fork replacement before landing onmaster.The inline comments in
go.modalready document this fork is for coordinated PR development with kvproto. Before merging, uncomment the replacement directive and runmake tidyto ensurego.modandgo.sumpoint back to upstreamgithub.com/pingcap/kvprotoand remain clean.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@go.mod` at line 8, Remove the temporary replacement of github.com/pingcap/kvproto in go.mod (the replace directive that points to github.com/lhy1024/kvproto v0.0.0-20260421030141-ffd09d80e97a) so the module resolves back to the upstream github.com/pingcap/kvproto, then run make tidy to regenerate go.mod and go.sum; ensure the replace line is deleted or reverted and re-run tests to confirm dependencies are clean before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/core/region.go`:
- Around line 264-265: The RegionInfo update currently sets cpuUsage/cpuStats
from heartbeat.GetCpuUsage()/GetCpuStats() but the cache/sync refresh logic only
reacts to byte-flow changes, so CPU-only heartbeats leave cached CPU load stale;
modify the region heartbeat handling (the code that assigns cpuUsage and
cpuStats on RegionInfo and the associated update method, e.g.,
UpdateRegionHeartbeat or RegionInfo.applyHeartbeat) to detect changes to
cpuUsage or cpuStats and set the cache-dirty/changed flag (the same flag used
for byte-flow changes) so that CPU-only updates trigger the cache refresh/sync;
apply the same change to the other occurrence around the code block using the
same assignment pattern referenced in the diff (the second occurrence at the
region update near the 976-980 area).
In `@tests/integrations/mcs/scheduling/server_test.go`:
- Around line 675-682: The test's Eventually closure indexes GetStoresLoads()[1]
without ensuring the slice is populated, causing a panic when the store loads
are not yet available; modify the closure used in testutil.Eventually (the
anonymous func) to first retrieve stores :=
tc.GetPrimaryServer().GetCluster().GetStoresLoads(), verify len(stores) > 1 and
stores[1] is non-nil before assigning storeLoads or checking
storeLoads[statutils.StoreReadCPU], and only then compare
hotPeer.GetLoad(statutils.CPUDim) == 80 && storeLoads[statutils.StoreReadCPU] ==
200 so the check safely retries instead of panicking.
---
Nitpick comments:
In `@go.mod`:
- Line 8: Remove the temporary replacement of github.com/pingcap/kvproto in
go.mod (the replace directive that points to github.com/lhy1024/kvproto
v0.0.0-20260421030141-ffd09d80e97a) so the module resolves back to the upstream
github.com/pingcap/kvproto, then run make tidy to regenerate go.mod and go.sum;
ensure the replace line is deleted or reverted and re-run tests to confirm
dependencies are clean before merging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e9ae09aa-f930-4932-9f73-f98f6510b97c
⛔ Files ignored due to path filters (2)
go.sumis excluded by!**/*.sumtests/integrations/go.sumis excluded by!**/*.sum
📒 Files selected for processing (5)
go.modpkg/core/region.goserver/grpc_service.gotests/integrations/go.modtests/integrations/mcs/scheduling/server_test.go
Signed-off-by: lhy1024 <admin@liudos.us>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/core/region.go (1)
974-983:⚠️ Potential issue | 🟡 MinorCPU comparisons lack rounding, creating asymmetry with byte-flow handling and amplifying sync traffic.
The code compares CPU metrics (
GetCPUUsage()andGetReadCPUUsage()) with exact!=while byte metrics use rounded comparisons viaGetRoundBytesWritten()/GetRoundBytesRead(), which applyflowRoundDivisorrounding. Since CPU counters change frequently and lack this dampening mechanism, most heartbeats will triggersaveCache, needSync = true, substantially increasing region-sync traffic.Consider either:
- Apply rounding to CPU metrics analogous to byte-flow (using
flowRoundDivisoror a similar threshold)- Treat CPU-only changes as
saveCacheonly (notneedSync), preserving cache staleness fixes without amplifying cross-service sync loadConfirm the intended trade-off before merging, as this path runs on the region heartbeat hot path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/core/region.go` around lines 974 - 983, The CPU comparisons (region.GetCPUUsage(), region.GetReadCPUUsage()) are compared with exact != while bytes use rounding via GetRoundBytesWritten()/GetRoundBytesRead() and flowRoundDivisor, causing excessive needSync; modify the condition in the block that sets saveCache, needSync so CPU differences are either compared using a rounded/quantized value (e.g., introduce/GetRoundCPUUsage or compute CPU values divided by region.flowRoundDivisor or a small threshold) or only trigger saveCache (not needSync) when the only changes are CPU/read-CPU; ensure you update the comparisons involving region.GetCPUUsage(), region.GetReadCPUUsage(), origin.GetCPUUsage(), origin.GetReadCPUUsage() and the logic that assigns saveCache, needSync accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/go.mod`:
- Line 6: Revert the temporary replace pointing "github.com/pingcap/kvproto =>
github.com/lhy1024/kvproto v0.0.0-20260421030141-ffd09d80e97a" back to the
official upstream requirement once pingcap/kvproto#1455 is merged and released:
remove the replace directive (or update the require to the new released version
of github.com/pingcap/kvproto) and then run `make tidy` to update go.mod/go.sum
so CI has no diffs.
---
Outside diff comments:
In `@pkg/core/region.go`:
- Around line 974-983: The CPU comparisons (region.GetCPUUsage(),
region.GetReadCPUUsage()) are compared with exact != while bytes use rounding
via GetRoundBytesWritten()/GetRoundBytesRead() and flowRoundDivisor, causing
excessive needSync; modify the condition in the block that sets saveCache,
needSync so CPU differences are either compared using a rounded/quantized value
(e.g., introduce/GetRoundCPUUsage or compute CPU values divided by
region.flowRoundDivisor or a small threshold) or only trigger saveCache (not
needSync) when the only changes are CPU/read-CPU; ensure you update the
comparisons involving region.GetCPUUsage(), region.GetReadCPUUsage(),
origin.GetCPUUsage(), origin.GetReadCPUUsage() and the logic that assigns
saveCache, needSync accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2ce18ab7-95ad-456c-be1f-e0f0ab73101b
⛔ Files ignored due to path filters (1)
tools/go.sumis excluded by!**/*.sum
📒 Files selected for processing (4)
pkg/core/region.gopkg/core/region_test.gotests/integrations/mcs/scheduling/server_test.gotools/go.mod
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integrations/mcs/scheduling/server_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #10609 +/- ##
==========================================
- Coverage 78.96% 77.31% -1.66%
==========================================
Files 532 532
Lines 71883 85656 +13773
==========================================
+ Hits 56766 66221 +9455
- Misses 11093 15407 +4314
- Partials 4024 4028 +4
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
tests/integrations/mcs/scheduling/server_test.go (1)
759-770:⚠️ Potential issue | 🟡 MinorGuard the store-load slice before indexing.
The map entry can exist before
StoreReadCPUis populated, so theEventuallyclosure can still panic instead of retrying.🧪 Suggested fix
storeLoads, ok := storesLoads[1] - if !ok { + if !ok || len(storeLoads) <= statutils.StoreReadCPU { return false } return hotPeer.GetLoad(statutils.CPUDim) == 80 && storeLoads[statutils.StoreReadCPU] == 200🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integrations/mcs/scheduling/server_test.go` around lines 759 - 770, The closure can panic when accessing storeLoads[statutils.StoreReadCPU] because the storeLoads slice may not be populated yet; in the Eventually closure (around testutil.Eventually, hotPeer, GetStoresLoads, storeLoads) add a guard to verify the slice has enough elements (e.g. if len(storeLoads) <= statutils.StoreReadCPU { return false }) before indexing, so the closure returns false and retries instead of panicking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/response/region.go`:
- Around line 165-171: InitRegion currently dereferences r via r.GetCPUUsage()
which breaks prior nil-safe behavior; ensure you check for nil before calling
GetCPUUsage or calling InitRegionWithCPUUsage. Fix by adding a nil guard at the
start of InitRegion (if r == nil { return nil }) or move the r == nil check into
InitRegionWithCPUUsage so it handles a nil core.RegionInfo safely; also apply
the same nil-check to the provider conversion path that currently calls
InitRegionWithCPUUsage (the code around the provider usage at lines ~302-308) so
you never call GetCPUUsage on a nil pointer — locate InitRegion,
InitRegionWithCPUUsage and the provider conversion call and update them to
return nil when the input region (or provider) is nil before dereferencing.
---
Duplicate comments:
In `@tests/integrations/mcs/scheduling/server_test.go`:
- Around line 759-770: The closure can panic when accessing
storeLoads[statutils.StoreReadCPU] because the storeLoads slice may not be
populated yet; in the Eventually closure (around testutil.Eventually, hotPeer,
GetStoresLoads, storeLoads) add a guard to verify the slice has enough elements
(e.g. if len(storeLoads) <= statutils.StoreReadCPU { return false }) before
indexing, so the closure returns false and retries instead of panicking.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0d20cf63-f363-40bd-96d2-f88654a3647c
⛔ Files ignored due to path filters (3)
go.sumis excluded by!**/*.sumtests/integrations/go.sumis excluded by!**/*.sumtools/go.sumis excluded by!**/*.sum
📒 Files selected for processing (9)
go.modpkg/core/region.gopkg/core/region_test.gopkg/mcs/scheduling/server/apis/v1/api.gopkg/response/region.goserver/grpc_service.gotests/integrations/go.modtests/integrations/mcs/scheduling/server_test.gotools/go.mod
✅ Files skipped from review due to trivial changes (2)
- tests/integrations/go.mod
- go.mod
🚧 Files skipped from review as they are similar to previous changes (3)
- tools/go.mod
- server/grpc_service.go
- pkg/core/region_test.go
Signed-off-by: lhy1024 <admin@liudos.us>
|
Switching this PR back to temporary compatibility: the scheduling service keeps forwarding and consuming the deprecated The plan is to drop this field later together with the repository-wide |
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
|
/retest |
Signed-off-by: lhy1024 <admin@liudos.us>
|
Addressed the remaining CPU-only heartbeat sync feedback in 361aebc. now keeps CPU-only heartbeat changes as without . The reason is that the follower stream still serializes only region meta, leader, buckets, and byte/key ; it does not carry region CPU fields, so CPU-only was adding cross-node sync traffic without actually propagating CPU. Added/updated tests to cover the new cache-vs-sync behavior. |
|
Rechecked the latest feedback on Local validation on this head:
|
|
/test pull-unit-test-next-gen-3 |
What problem does this PR solve?
Issue Number: ref #10608
When PD forwards region heartbeats to a standalone scheduling service, the scheduling heartbeat path drops region CPU data. That makes the scheduling service build a region view that is missing CPU information available in the monolithic PD path.
This PR fixes the CPU half of #10608. Bucket metadata parity remains tracked in the same issue and is not included here.
What is changed and how does it work?
This depends on pingcap/kvproto#1455.
Although
cpu_usageis deprecated, this PR intentionally keeps using it temporarily in the scheduling microservice path. The standalone scheduling service still shares the legacyRegionInfoand HTTP APIcpu_usagebehavior with monolithic PD. Removing it only in MCS would introduce another temporary behavior split and extra response-layer complexity.The follow-up cleanup should remove scheduling-heartbeat
cpu_usagetogether with the repository-wideRegionInfoand APIcpu_usagecleanup.Hot read CPU scheduling remains unchanged: the CPU dimension still comes from
StoreHeartbeatpeercpu_statsand store CPU loads, and the integration test in this PR keeps that covered.Check List
Tests
Release note