store/copr: refine RC paging pre-charge with per-scan EMA#67759
Conversation
When Resource Control is enabled, force-enable paging for eligible coprocessor requests and set a byte budget (paging_size_bytes) per page. This limits the scanned bytes of each page so that RU cost is bounded and pre-charged in Phase 1, preventing concurrent workers from all hitting Phase 2 simultaneously. Signed-off-by: JmPotato <github@ipotato.me>
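The bound described above can be illustrated with a small sketch; `preChargeRU` and the RU-per-byte rate are hypothetical stand-ins for illustration, not TiDB's actual cost-model names or constants:

```go
package main

import "fmt"

// preChargeRU sketches the Phase-1 bound described above: because each
// page scans at most pagingSizeBytes, the RU pre-charged up front for a
// single page is capped, whatever the real row widths turn out to be.
// ruPerReadByte is an illustrative rate, not TiDB's actual RU model.
func preChargeRU(pagingSizeBytes uint64, ruPerReadByte float64) float64 {
	return float64(pagingSizeBytes) * ruPerReadByte
}

func main() {
	// With a 4 MiB page budget and a nominal rate of 1 RU per 64 KiB,
	// each page pre-charges at most 64 RU in Phase 1.
	fmt.Printf("%.0f\n", preChargeRU(4<<20, 1.0/(64*1024))) // prints 64
}
```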
…ation Test rcPagingEligible conditions (RC enabled, resource group, TiKV, DAG) and verify pagingSizeBytes is correctly propagated to copTasks and cleared when a small limit disables paging. Signed-off-by: JmPotato <github@ipotato.me>
Add a new session variable `tidb_rc_paging_size_bytes` to allow dynamically controlling the byte budget per page for RC paging via SQL.

- Default: 4MB (paging.MaxPagingSizeBytes)
- Set to 0 to disable byte-budget paging
- Supports both GLOBAL and SESSION scope

Usage:

SET @@tidb_rc_paging_size_bytes = 4194304; -- 4MB (default)
SET @@tidb_rc_paging_size_bytes = 1048576; -- 1MB
SET @@tidb_rc_paging_size_bytes = 0;       -- disable

Signed-off-by: JmPotato <github@ipotato.me>
Increase the bucket count from 10 (max 512KB) to 19 (max 256MB) for the tidb_distsql_copr_resp_size histogram to improve percentile accuracy when observing large coprocessor responses during paging size tuning. Signed-off-by: JmPotato <github@ipotato.me>
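For reference, the widened layout follows the usual exponential-bucket arithmetic; the helper below reimplements what `prometheus.ExponentialBuckets(start, factor, count)` returns so the sketch needs no external dependency:

```go
package main

import "fmt"

// exponentialBuckets mirrors prometheus.ExponentialBuckets: count bucket
// upper bounds starting at start, each factor times the previous.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	old := exponentialBuckets(1, 2, 10)  // KiB: 1 .. 512
	wide := exponentialBuckets(1, 2, 19) // KiB: 1 .. 262144 (256 MiB)
	fmt.Println(old[len(old)-1], wide[len(wide)-1]) // prints: 512 262144
}
```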
- Rename all RC-prefixed identifiers to generic names (e.g. RCPagingSizeBytes → PagingSizeBytes, rcPagingEligible → pagingBytesEligible) since byte-budget paging is general infrastructure, not RC-specific.
- Remove RC gates (EnableResourceControl, ResourceGroupName) from pagingBytesEligible; only TiKV + DAG checks remain.
- Remove unused MinPagingSizeBytes/MaxPagingSizeBytes constants.
- Change DefPagingSizeBytes default from 4MB to 0 (disabled, opt-in).
- Move byte-budget paging force-enable before checkStoreBatchCopr to preserve the paging-disables-batch-copr invariant.
- Add tidb_paging_size_bytes to SET_VAR hint allowlist.
- Fix copr_resp_size histogram: use float division to preserve sub-KiB precision, extend buckets to 256MB, correct help text unit to KiB.
- Fix anonymous Paging struct literal in copr_test to include the new PagingSizeBytes field.
- Add PagingSizeBytes to TestContextDetach fixture.
- Add integration test for tidb_paging_size_bytes session variable.

Signed-off-by: JmPotato <github@ipotato.me>
Update the PD client replace directive to pick up commit fda21466b565 which adds Limiter.RefundTokens and fixes the pre-charge token leakage in paging byte budget scenarios. Signed-off-by: JmPotato <github@ipotato.me>
Maintain a time-aware EMA of actual per-RPC MVCC read bytes on copTask and, once enough samples are in, pass the prediction to PD via the new tikvrpc.Request.PredictedReadBytes hint so pre-charge uses the learned estimate instead of the 4 MiB worst-case cap. Falls back to PagingSizeBytes on cold start. Propagated across region/lock retries; no-op for non-paging tasks. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Point to the demo/ema-precharge working branches in ~/work/client-go and ~/work/pd for local build + e2e simulation. Drop this commit before publishing as a PR; replace with pseudo-versions of pushed forks. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Adds tidb_distsql_copr_ema_observation_total with a {state=cold|ready}
label to quantify, in aggregate, how many per-logical-scan copTask EMA
Observe() calls happen while the EMA is still below its readiness
threshold vs after. This complements the PD-side paging pre-charge
source/bytes counters by showing how often the TiDB side could emit a
non-zero PredictedReadBytes hint for downstream pre-charge.
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Added one observability metric on the TiDB side for upcoming e2e validation of the per-logical-scan EMA path: `tidb_distsql_copr_ema_observation_total{state=cold|ready}` (see the commit message above).

Paired with the 4 PD-side paging pre-charge metrics (see tikv/pd#10599 comment), this lets the validation report show the cold-start window quantitatively rather than inferring it from QPS alone. Draft status unchanged.
Move ruEMA init from lazy (handleTaskOnce) to eager (copIterator copTask constructors). Guarantees task.ema is non-nil on this path, so callers drop nil-safety in IsReady/Predict and the Observe call site. Trim comments. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Move the RC paging EMA from per-copTask to per-copIterator so every paging RPC under the same kv.Request feeds a single learned estimate. Workers share a pointer to the iterator's EMA; access is now guarded by an internal mutex since multiple workers may Observe/Predict concurrently. Dropped the remain-task ema propagation: remain tasks re-enter through the same worker, which still holds the iterator's EMA, so there is nothing to copy. Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
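The shared-estimate scheme can be sketched as below; `sharedEMA`, the fixed smoothing factor, and the two-sample readiness rule are illustrative assumptions rather than the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// sharedEMA sketches the per-copIterator estimate described above: one
// EMA shared by all paging workers under the same kv.Request, guarded by
// a mutex since Observe/Predict may run concurrently.
type sharedEMA struct {
	mu      sync.Mutex
	value   float64
	samples int
}

func (e *sharedEMA) Observe(bytes float64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.samples == 0 {
		e.value = bytes
	} else {
		const alpha = 0.5 // illustrative fixed smoothing factor
		e.value = alpha*bytes + (1-alpha)*e.value
	}
	e.samples++
}

func (e *sharedEMA) Predict(fallback float64) float64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.samples < 2 {
		return fallback // cold start
	}
	return e.value
}

func main() {
	ema := &sharedEMA{}
	var wg sync.WaitGroup
	// Several workers feed the same estimate, as paging workers would.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ema.Observe(1 << 20)
		}()
	}
	wg.Wait()
	fmt.Printf("%.0f\n", ema.Predict(4<<20)) // prints 1048576
}
```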
…opr_ema_send_total
The previous copr_ema_observation_total counter measured EMA readiness at
response-handling time (inside the paging branch), which is the wrong place
to answer "did this RPC's pre-charge dispatch a hint". The send-time decision
(req.PredictedReadBytes) happens earlier and for every cop RPC, while the
response-side counter's denominator is filtered to paging-branch responses
and its cold/ready flag can flip mid-flight due to concurrent workers'
Observe calls on the shared EMA.
Introduce copr_ema_send_total{type=cold|ready}, incremented at hint dispatch
on every outgoing cop RPC. This is the authoritative TiDB-side basis for
pre-charge coverage ratios and aligns with the PD-side M1/(M1+M6a) precharge
coverage metric.
Drop copr_ema_observation_total entirely: once send-side coverage is the
correct metric, the response-side variant has no independent use that isn't
better served by the raw EMA cold/ready sample counts exposed elsewhere.
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
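The send-time counting the commit describes can be sketched as follows, with a plain map standing in for the prometheus counter vec and hypothetical `request`/`dispatch` names:

```go
package main

import "fmt"

// emaSendTotal is a map-backed stand-in for the real counter vec: one
// increment per outgoing cop RPC, labeled by whether the shared EMA was
// ready to supply a PredictedReadBytes hint at dispatch time.
var emaSendTotal = map[string]int{}

type request struct{ PredictedReadBytes uint64 }

func dispatch(req *request, emaReady bool, predicted uint64) {
	if emaReady {
		req.PredictedReadBytes = predicted // non-zero hint for pre-charge
		emaSendTotal["ready"]++
	} else {
		// Cold start: leave the hint zero so the pre-charge falls back
		// to the worst-case paging_size_bytes charge.
		emaSendTotal["cold"]++
	}
}

func main() {
	var r1, r2 request
	dispatch(&r1, false, 0)     // first RPC: EMA not ready yet
	dispatch(&r2, true, 655360) // later RPC: learned estimate dispatched
	fmt.Println(emaSendTotal["cold"], emaSendTotal["ready"], r2.PredictedReadBytes)
	// prints: 1 1 655360
}
```

Counting here, rather than at response handling, gives a denominator that covers every cop RPC and cannot flip between cold and ready after the send decision is made.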
What problem does this PR solve?
Issue Number: ref #N/A (draft — issue to be filed)
Problem Summary:

Under Resource Control, TiDB pre-charges RUs for each paging coprocessor
RPC using a fixed `paging_size_bytes` (currently 4 MiB, set by #67504).
This worst-case charge typically overshoots the actual scanned bytes by a
large margin, and under a tight resource tier it stalls concurrent workers
at Phase 1 (token acquisition) even when the cluster has spare capacity.
The result is that large `paging_size_bytes` values collapse QPS on
workloads like a 1500-thread JOIN benchmark, while small values (256 KiB)
look fine.
What changed and how does it work?
Maintain a time-aware EMA of the actual per-RPC MVCC read bytes on
`copTask`, and once enough samples are in (>= 2), pass the prediction to
PD via the new `tikvrpc.Request.PredictedReadBytes` hint so the pre-charge
uses the learned estimate instead of the 4 MiB cap. On cold start and for
non-paging tasks the behavior is unchanged (falls back to `PagingSizeBytes`).

The EMA state lives on `copTask` and is naturally scoped to one logical
scan across its paging rounds; no cross-RPC or cross-process sync is needed.
The prediction is propagated across region/lock retries so learning is not
lost.
Stacked with:

- tikv/client-go#1944: adds the `PredictedReadBytes` field on
  `tikvrpc.Request` and the `RequestInfo` getter
- tikv/pd#10599: consumes the hint in `BeforeKVRequest` / `AfterKVRequest`

Check List
Tests
Side effects
Documentation
Release note
Draft note: E2E validation on the simulation cluster is pending. Once
the client-go + PD companion PRs land (or are pointed to via go.mod replace
for CI), this PR's CI will turn green. This is a draft for early visibility
and discussion.