planner: Add memory overhead for HashAgg on TiKV #67953
ti-chi-bot[bot] merged 16 commits into pingcap:master from
Conversation
…ency division

Previously, getPlanCostVer24PhysicalHashAgg divided ALL costs (agg, group, hash-build, hash-probe) by the concurrency factor (default 5). This gave HashAgg a 5x artificial discount over StreamAgg even for single-threaded scenarios where aggregation work is not reduced by parallelism — each input row is processed exactly once regardless of worker count.

Root cause: threads share the same input stream, so aggCost and groupCost represent the total work done, not per-thread work. Only the hash table operations (build + probe) genuinely benefit from parallel execution.

Fix: for TiDB root and TiKV cop tasks, move aggCost and groupCost outside the concurrency division, matching the pattern already used by hash join, where build-side work sits outside /concurrency. The new formula is:

plan-cost = child-cost + agg-cost + group-cost + (hash-build-cost + hash-probe-cost) / concurrency

For TiFlash MPP tasks, data is truly partitioned across nodes so each node handles a disjoint subset of rows — the original formula (divide everything by concurrency) is preserved to avoid disrupting MPP plan choices.

Impact: high-NDV GROUP BY on an indexed column now correctly prefers StreamAgg over HashAgg, matching the optimizer's intent and reducing memory consumption from potentially 100MB-1GB (full hash table) to ~1.5MB (streaming over ordered index output). Low-NDV cases see minimal cost change.

Add regression test TestHashAggMemCostNotDividedByConcurrency that verifies StreamAgg is cheaper than HashAgg for a 100%-NDV GROUP BY with an available index.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
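For intuition, the commit's two branches can be sketched with plain float64 stand-ins for TiDB's costusage.CostVer2 terms (a simplification for illustration; the numbers and the function signature are made up, not the real code):

```go
package main

import "fmt"

// hashAggPlanCost mirrors the commit message's two formulas with scalar costs.
func hashAggPlanCost(childCost, aggCost, groupCost, hashBuildCost, hashProbeCost,
	concurrency float64, isMPP bool) float64 {
	if isMPP {
		// TiFlash MPP: rows are truly partitioned across nodes, so every term
		// shrinks with parallelism (original formula, preserved).
		return childCost + (aggCost+groupCost+hashBuildCost+hashProbeCost)/concurrency
	}
	// TiDB root / TiKV cop: workers share one input stream, so agg/group cost
	// is total work; only the hash table build and probe genuinely parallelize.
	return childCost + aggCost + groupCost + (hashBuildCost+hashProbeCost)/concurrency
}

func main() {
	// Same inputs, default concurrency 5: the root/cop branch no longer grants
	// a blanket 5x discount to the aggregation and grouping work.
	fmt.Println(hashAggPlanCost(1000, 500, 200, 100, 100, 5, false)) // 1740
	fmt.Println(hashAggPlanCost(1000, 500, 200, 100, 100, 5, true))  // 1180
}
```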
@terry1purcell I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI.

Hi @terry1purcell. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the corresponding setting.
📝 Walkthrough

This pull request modifies the hash aggregation cost calculation in the query planner to differentiate memory and CPU cost handling between MPP and non-MPP task types, causing plan optimization decisions to shift from hash aggregation to stream aggregation in certain scenarios. Test fixtures across multiple test suites are updated to reflect the new expected plan outputs.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

✅ Passed checks (4 passed)
🧹 Nitpick comments (2)
pkg/planner/core/plan_cost_ver2_test.go (1)
815-860: LGTM — test directly validates the cost-model fix.

The test correctly targets the claim: with 100% NDV on an indexed column, HashAgg's hash-table memory cost (now not divided by concurrency) makes StreamAgg on the ordered index scan cheaper. Running under `RunTestUnderCascadesWithDomain` also confirms the behavior holds under both cascades off and on, and the per-iteration store/domain isolates session state — so no explicit cost-factor reset is needed.

Optional nit: the hot-loop `buf.WriteString(fmt.Sprintf(...))` on lines 836–841 can be `fmt.Fprintf(&buf, "(%d,%d)", i, i)` to avoid the intermediate string per row. Trivial, purely stylistic.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/planner/core/plan_cost_ver2_test.go` around lines 815-860: Replace the hot-loop string allocation inside TestHashAggMemCostNotDividedByConcurrency: locate the loop that appends each row using `buf.WriteString(fmt.Sprintf("(%d,%d)", i, i))` and change it to use `fmt.Fprintf(&buf, "(%d,%d)", i, i)` to avoid an intermediate fmt.Sprintf allocation per iteration while preserving exact output.
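For reference, a tiny standalone illustration of the suggested swap (the loop body here is a stand-in for the test's insert-building loop):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	var buf strings.Builder
	for i := 0; i < 3; i++ {
		// buf.WriteString(fmt.Sprintf("(%d,%d)", i, i)) builds an intermediate
		// string per iteration; Fprintf writes the formatted bytes directly
		// into the builder (strings.Builder implements io.Writer).
		fmt.Fprintf(&buf, "(%d,%d)", i, i)
	}
	fmt.Println(buf.String()) // (0,0)(1,1)(2,2)
}
```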
pkg/planner/core/plan_cost_ver2.go (1)

622-637: Align the explanatory comments with the implemented formula.

Line 631 and lines 673-674 say the hash-join/build-side work sits outside `/concurrency`, but lines 675-676 still divide `hashBuildCost`. The formulas also omit `startCost`, which is added in both branches.

Proposed comment cleanup

```diff
-// plan-cost = child-cost + agg-cost + group-cost + (hash-build-cost + hash-probe-cost) / concurrency
+// plan-cost = start-cost + child-cost + agg-cost + group-cost + (hash-build-cost + hash-probe-cost) / concurrency
@@
-// Only the hash table operations (build + probe) benefit from parallel execution.
-// This matches the hash join pattern where build-side work is placed outside /concurrency.
+// Hash table build/probe costs remain concurrency-scaled by this model.
@@
-// For TiFlash MPP tasks, data is truly partitioned across nodes so each node handles a
-// disjoint subset of rows. All costs are divided by the MPP concurrency:
+// For TiFlash MPP tasks, data is partitioned across TiFlash nodes, so keep
+// concurrency scaling for the aggregation/grouping/hash portion:
@@
-// plan-cost = child-cost + (agg-cost + group-cost + hash-build-cost + hash-probe-cost) / mppConcurrency
+// plan-cost = start-cost + child-cost + (agg-cost + group-cost + hash-build-cost + hash-probe-cost) / mppConcurrency
@@
-	// Root (TiDB) and cop (TiKV) tasks: threads share the same input so aggregation
-	// work is not reduced by concurrency — each row is still processed once regardless
-	// of how many workers run in parallel. Only hash table operations benefit from
-	// parallel execution. This matches the hash join pattern where build-side work
-	// (including aggregation) sits outside /concurrency.
+	// Root (TiDB) and cop (TiKV) tasks: aggregation/grouping work is charged
+	// outside concurrency, while hash build/probe costs remain concurrency-scaled.
```

As per coding guidelines, "Comments SHOULD explain non-obvious intent, constraints, invariants, concurrency guarantees, SQL/compatibility contracts, or important performance trade-offs, and SHOULD NOT restate what the code already makes clear."
Also applies to: 663-677
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/planner/core/plan_cost_ver2.go` around lines 622 - 637, The comment block above getPlanCostVer24PhysicalHashAgg is inconsistent with the implemented calculation: update the explanatory comments to accurately reflect the implemented formulas (including that startCost is added in both TiDB/TiKV and TiFlash/MPP branches) and correct the statement about hash-build/hash-probe treatment — if hashBuildCost (and hashProbeCost) are divided by concurrency in the code they must be described as benefiting from parallelism, otherwise note they sit outside the /concurrency division; ensure the text around the two branches and the earlier comment section (the area covering the rationale currently spanning lines ~663-677) explicitly matches the code’s use of hashBuildCost, hashProbeCost and startCost so the comment no longer restates code but explains the intended concurrency semantics implemented by getPlanCostVer24PhysicalHashAgg.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c741fa5e-db51-419f-9123-c7966eddc28c
📒 Files selected for processing (30)
- pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_out.json
- pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_xut.json
- pkg/planner/core/casetest/cbotest/testdata/analyze_suite_out.json
- pkg/planner/core/casetest/cbotest/testdata/analyze_suite_xut.json
- pkg/planner/core/casetest/enforcempp/testdata/enforce_mpp_suite_out.json
- pkg/planner/core/casetest/enforcempp/testdata/enforce_mpp_suite_xut.json
- pkg/planner/core/casetest/hint/testdata/integration_suite_out.json
- pkg/planner/core/casetest/hint/testdata/integration_suite_xut.json
- pkg/planner/core/casetest/mpp/testdata/integration_suite_out.json
- pkg/planner/core/casetest/mpp/testdata/integration_suite_xut.json
- pkg/planner/core/casetest/partition/testdata/partition_pruner_out.json
- pkg/planner/core/casetest/partition/testdata/partition_pruner_xut.json
- pkg/planner/core/casetest/physicalplantest/testdata/plan_suite_out.json
- pkg/planner/core/casetest/physicalplantest/testdata/plan_suite_xut.json
- pkg/planner/core/casetest/rule/testdata/join_reorder_suite_out.json
- pkg/planner/core/casetest/rule/testdata/join_reorder_suite_xut.json
- pkg/planner/core/casetest/rule/testdata/outer2inner_out.json
- pkg/planner/core/casetest/rule/testdata/outer2inner_xut.json
- pkg/planner/core/casetest/scalarsubquery/testdata/plan_suite_out.json
- pkg/planner/core/casetest/scalarsubquery/testdata/plan_suite_xut.json
- pkg/planner/core/casetest/testdata/integration_suite_out.json
- pkg/planner/core/casetest/testdata/integration_suite_xut.json
- pkg/planner/core/casetest/testdata/json_plan_suite_out.json
- pkg/planner/core/casetest/testdata/json_plan_suite_xut.json
- pkg/planner/core/casetest/testdata/plan_normalized_suite_out.json
- pkg/planner/core/casetest/testdata/plan_normalized_suite_xut.json
- pkg/planner/core/casetest/tpch/testdata/tpch_suite_out.json
- pkg/planner/core/casetest/tpch/testdata/tpch_suite_xut.json
- pkg/planner/core/plan_cost_ver2.go
- pkg/planner/core/plan_cost_ver2_test.go
```go
//
// For TiDB root and TiKV cop tasks:
//
// plan-cost = child-cost + agg-cost + group-cost + (hash-build-cost + hash-probe-cost) / concurrency
```
Clear comment, makes sense.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/planner/core/plan_cost_ver2.go`:
- Around line 664-686: The code only checks the immediate child for
*physicalop.PhysicalTableReader when setting childIsTiFlash, so plans with unary
wrappers (e.g., Projection/Selection) hide a TiFlash TableReader; change the
detection to walk down unary single-child operators starting from
p.Children()[0] until you either find a *physicalop.PhysicalTableReader (set
childIsTiFlash = true) or hit a node with zero or multiple children (leave
false). Implement this by replacing the direct type assertion with a small loop
that examines p.Children()[0], repeatedly assigns child = child.Children()[0]
while len(child.Children())==1 and the child is one of the expected unary
wrapper types (e.g., physicalop.PhysicalProjection,
physicalop.PhysicalSelection, etc.), and then checks for
*physicalop.PhysicalTableReader to set childIsTiFlash.
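Rendered as code, the suggested walk could look like this fragment (a sketch against the plan interfaces named in the prompt; the wrapper set shown is only the two types mentioned there):

```go
// Walk down through known unary pass-through wrappers before testing for a
// TiFlash TableReader hidden underneath them.
child := p.Children()[0]
walk:
for len(child.Children()) == 1 {
	switch child.(type) {
	case *physicalop.PhysicalProjection, *physicalop.PhysicalSelection:
		child = child.Children()[0] // descend through the unary wrapper
	default:
		break walk // unrecognized operator: stop at this node
	}
}
_, childIsTiFlash := child.(*physicalop.PhysicalTableReader)
_ = childIsTiFlash // consumed by the cost branch in the real code
```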
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f65c5efc-0223-4cf8-b4f0-f002e2ddcb15
📒 Files selected for processing (9)
- pkg/planner/core/casetest/enforcempp/testdata/enforce_mpp_suite_out.json
- pkg/planner/core/casetest/enforcempp/testdata/enforce_mpp_suite_xut.json
- pkg/planner/core/casetest/tpch/testdata/tpch_suite_out.json
- pkg/planner/core/casetest/tpch/testdata/tpch_suite_xut.json
- pkg/planner/core/plan_cost_ver2.go
- tests/integrationtest/r/globalindex/aggregate.result
- tests/integrationtest/r/index_merge.result
- tests/integrationtest/r/infoschema/infoschema.result
- tests/integrationtest/r/planner/cascades/integration.result
✅ Files skipped from review due to trivial changes (1)
- tests/integrationtest/r/infoschema/infoschema.result
🧹 Nitpick comments (3)
pkg/planner/core/casetest/testdata/json_plan_suite_out.json (1)
108-153: Consider excluding volatile runtime fields from recorded JSON fixtures.

`executeInfo` and related timing/memory values in lines 108-153 are runtime-volatile, but the current verifier (pkg/planner/core/casetest/plan_test.go:365-396) only asserts stable fields (`ID`, `EstRows`, `ActRows`, `TaskType`, `AccessObject`, `OperatorInfo`). Stripping or normalizing volatile fields during record would reduce testdata churn.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/planner/core/casetest/testdata/json_plan_suite_out.json` around lines 108-153: The recorded JSON fixture json_plan_suite_out.json contains volatile runtime fields (executeInfo, memoryInfo, diskInfo and similar timing/memory strings) that cause churn; update the recorder to strip or normalize these fields when writing fixtures so only stable fields (ID, EstRows, ActRows, TaskType, AccessObject, OperatorInfo, etc.) are kept, i.e., remove or replace executeInfo/memoryInfo/diskInfo (and nested equivalents under subOperators/IndexFullScan_*/IndexReader_*) with a stable sentinel (empty string or null) during serialization so the existing verifier that asserts stable fields continues to pass without noisy runtime differences.

pkg/planner/core/plan_cost_ver2.go (1)
686-695: Extract the CPU-only hash-build term instead of rebuilding it inline.

This branch now carries a second copy of the hash-build formula, while the full `hashBuildCost` is still created earlier and ignored here. Pulling the CPU-only part into a helper would keep the traced expressions in one place and avoid branch drift the next time `hashBuildCostVer2` changes.

Refactor sketch

```diff
-func hashBuildCostVer2(option *costusage.PlanCostOption, buildRows, buildRowSize, nKeys float64, cpuFactor, memFactor costusage.CostVer2Factor) costusage.CostVer2 {
-	// TODO: 1) consider types of keys, 2) dedicated factor for build-probe hash table
+func hashBuildCPUCostVer2(option *costusage.PlanCostOption, buildRows, nKeys float64, cpuFactor costusage.CostVer2Factor) costusage.CostVer2 {
 	hashKeyCost := costusage.NewCostVer2(option, cpuFactor,
 		buildRows*nKeys*cpuFactor.Value,
 		func() string { return fmt.Sprintf("hashkey(%v*%v*%v)", buildRows, nKeys, cpuFactor) })
-	hashMemCost := costusage.NewCostVer2(option, memFactor,
-		buildRows*buildRowSize*memFactor.Value,
-		func() string { return fmt.Sprintf("hashmem(%v*%v*%v)", buildRows, buildRowSize, memFactor) })
 	hashBuildCost := costusage.NewCostVer2(option, cpuFactor,
 		buildRows*cpuFactor.Value,
 		func() string { return fmt.Sprintf("hashbuild(%v*%v)", buildRows, cpuFactor) })
+	return costusage.SumCostVer2(hashKeyCost, hashBuildCost)
+}
+
+func hashBuildCostVer2(option *costusage.PlanCostOption, buildRows, buildRowSize, nKeys float64, cpuFactor, memFactor costusage.CostVer2Factor) costusage.CostVer2 {
+	// TODO: 1) consider types of keys, 2) dedicated factor for build-probe hash table
+	hashMemCost := costusage.NewCostVer2(option, memFactor,
+		buildRows*buildRowSize*memFactor.Value,
+		func() string { return fmt.Sprintf("hashmem(%v*%v*%v)", buildRows, buildRowSize, memFactor) })
-	return costusage.SumCostVer2(hashKeyCost, hashMemCost, hashBuildCost)
+	return costusage.SumCostVer2(hashBuildCPUCostVer2(option, buildRows, nKeys, cpuFactor), hashMemCost)
 }
```

As per coding guidelines, "Code SHOULD remain maintainable for future readers with basic TiDB familiarity" and "Code SHOULD be self-documenting through clear naming and structure."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/planner/core/plan_cost_ver2.go` around lines 686 - 695, The branch duplicates the hash-build CPU formula by recreating hashBuildCPUCost inline while the full hashBuildCost created earlier is ignored; extract the CPU-only portion into a helper (e.g., cpuOnlyHashBuildCost or a method on costusage like hashBuildCostCPUOnly) that returns the CPU-only CostVer2 for the hash build and replace the inline construction of hashBuildCPUCost in this file with a call to that helper, and update any other places that should use the same CPU-only term (and keep the original hashBuildCost creation intact so traced expressions remain consistent).pkg/planner/core/plan_cost_ver2_test.go (1)
832-845: Make the “wide row” setup less margin-sensitive.

This regression depends on `outputRowSize` being large enough to keep HashAgg above StreamAgg, but the inserted literals are still fairly short relative to the declared `varchar(200)` columns. Filling `c`/`d` closer to their declared width would give the assertion more headroom against future factor tuning.

Suggested adjustment

```diff
 	// Insert rows where every b value is unique (100% NDV).
+	wideC := strings.Repeat("c", 180)
+	wideD := strings.Repeat("d", 180)
 	var buf strings.Builder
 	buf.WriteString("insert into t_high_ndv values ")
 	for i := 0; i < 1000; i++ {
 		if i > 0 {
 			buf.WriteString(",")
 		}
-		buf.WriteString(fmt.Sprintf("(%d,%d,'%s','%s')", i, i, "padding-data-for-wide-rows", "more-padding-data-here"))
+		buf.WriteString(fmt.Sprintf("(%d,%d,'%s','%s')", i, i, wideC, wideD))
 	}
```

As per coding guidelines, "Keep test changes minimal and deterministic; avoid broad golden/testdata churn unless required."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/planner/core/plan_cost_ver2_test.go` around lines 832 - 845, The test's "wide row" insert into table t_high_ndv is too short to reliably bias cost toward HashAgg; modify the value generation used in the insert builder inside plan_cost_ver2_test.go so the c and d literals are filled close to their declared varchar(200) length (e.g., use strings.Repeat or a precomputed longPad variable and insert longPad[:200] or similar) when building the loop that writes values into t_high_ndv; update the loop that appends fmt.Sprintf("(%d,%d,'%s','%s')", ...) to use those padded strings for columns c and d so outputRowSize is clearly large and the test becomes less margin-sensitive.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: ca39a838-8117-43ea-a3a1-e0cb05eb8a5a
📒 Files selected for processing (11)
- pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_out.json
- pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_xut.json
- pkg/planner/core/casetest/physicalplantest/testdata/plan_suite_out.json
- pkg/planner/core/casetest/physicalplantest/testdata/plan_suite_xut.json
- pkg/planner/core/casetest/testdata/json_plan_suite_out.json
- pkg/planner/core/casetest/testdata/json_plan_suite_xut.json
- pkg/planner/core/casetest/tpch/testdata/tpch_suite_out.json
- pkg/planner/core/casetest/tpch/testdata/tpch_suite_xut.json
- pkg/planner/core/plan_cost_ver2.go
- pkg/planner/core/plan_cost_ver2_test.go
- tests/integrationtest/r/planner/cascades/integration.result
✅ Files skipped from review due to trivial changes (2)
- pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_xut.json
- pkg/planner/core/casetest/binaryplan/testdata/binary_plan_suite_out.json
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/planner/core/casetest/physicalplantest/testdata/plan_suite_xut.json
Restore mpp integration_suite testdata to match master — the stream_count removal was picked up as collateral during re-recording and is unrelated to the HashAgg cost model change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##             master     #67953       +/-   ##
================================================
- Coverage   77.7565%   77.1269%   -0.6297%
================================================
  Files          1990       1972        -18
  Lines        551624     554928      +3304
================================================
- Hits         428924     427999       -925
- Misses       121780     126755      +4975
+ Partials        920        174       -746
```
Flags with carried forward coverage won't be shown.
The hashMemCost term added by the HashAgg cost-model fix was applied unconditionally for root tasks, which over-penalized HashAgg whenever a StreamAgg alternative would itself need an explicit Sort to consume the child rows in GROUP BY order. In those cases the planner switched to Sort+StreamAgg even though paying for a sort over the join/agg output is strictly more expensive than HashAgg's hash table.

Gate hashMemCost on a structural check of the HashAgg's child plan: only charge the memory penalty when the child reaches a base-table reader (IndexReader / IndexLookUpReader / IndexMergeReader / TableReader) through pass-through operators. Joins, aggregations, and other order-breaking operators skip the penalty since the StreamAgg alternative there carries its own (correctly-costed) Sort.

Add a companion unit test that exercises the gate on a hash-join input to ensure HashAgg stays cheaper than Sort+StreamAgg in that scenario, alongside the existing test that keeps StreamAgg cheaper for the indexed high-NDV path.

Re-record planner casetest fixtures (binary_plan, json_plan, tpch); the previously-recorded TPCH MPP plan flip reverts to its master shape, confirming the gate's effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
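One hypothetical rendering of that structural gate (reader and wrapper type names follow the list above; the exact pass-through set and helper name in the merged code may differ):

```go
// reachesBaseTableReader reports whether a HashAgg child reaches a base-table
// reader through order-preserving pass-through operators, i.e. the situation
// where a StreamAgg alternative could consume ordering for free.
func reachesBaseTableReader(p base.PhysicalPlan) bool {
	for {
		switch p.(type) {
		case *physicalop.PhysicalTableReader, *physicalop.PhysicalIndexReader,
			*physicalop.PhysicalIndexLookUpReader, *physicalop.PhysicalIndexMergeReader:
			return true // base-table access: ordering may come for free
		case *physicalop.PhysicalProjection, *physicalop.PhysicalSelection:
			if len(p.Children()) != 1 {
				return false
			}
			p = p.Children()[0] // pass-through: keep walking down
		default:
			return false // joins, aggs, applies: order-breaking, skip the penalty
		}
	}
}
```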
Append two cases to plan_cost_ver2.test that capture the plan-shape outcomes of the HashAgg memory-penalty gating as visible result diffs, making the optimizer behavior change easy to review.

Case 1 (intended win): high-NDV GROUP BY on a clustered-PK table with wide payload columns. The TableFullScan delivers rows in PK prefix order for free, so the planner picks StreamAgg over HashAgg.

Case 2 (gating prevents regression): high-NDV GROUP BY over a hash-join output. No free ordering on the group key, so the gating skips the HashAgg memory penalty and the planner picks HashAgg rather than Sort+StreamAgg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/retest-required

@terry1purcell: PRs from untrusted users cannot be marked as trusted with

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/ok-to-test
henrybw left a comment
LGTM. Just had a few nitpicks about making the comments more concise and less redundant, but those are nitpicks and thus nonblocking.
```go
// requiring an explicit Sort. The check is structural: walk through
// order-preserving operators and accept only base-table access paths
// (IndexReader / IndexLookUpReader / IndexMergeReader / TableReader). Joins,
// aggregations, applies, and other order-breaking operators return false —
// for those, the StreamAgg alternative would need a Sort, and applying the
// HashAgg memory penalty would unfairly favour Sort+StreamAgg over HashAgg.
```
Nit: I don't think comments that just paraphrase what the code does add much value. This particular point about the memory cost also seems to be repeated three times (once in the comment above `getPlanCostVer24PhysicalHashAgg` and another time inside the `if taskType == property.RootTaskType {` block).
```go
// Root (TiDB) tasks: partial workers each process a disjoint subset of input
// rows, so all CPU work (agg, group, hash key, probe) is genuinely parallelized
// and divided by concurrency. However, each partial worker maintains its own
// hash table, so total memory scales with the number of workers. We model this
// as concurrency * outputRows * rowSize * memFactor, placed outside the
// concurrency division. The penalty is gated on whether the child plan can
// provide ordering on the GROUP BY keys naturally — only then is the StreamAgg
// alternative free of additional Sort cost and able to benefit from HashAgg's
// memory penalty. When no free ordering is available, we skip the penalty so
// the optimizer doesn't get steered into a Sort+StreamAgg plan that costs more
// than HashAgg in practice.
```
Nit: I'm not sure this comment is adding anything more than what the comment above this function has already said. At the very least, I think it could be made more concise.
/retest-required

1 similar comment

/retest-required
0xPoe left a comment

Thanks!
/hold
for https://github.com/pingcap/tidb/pull/67953/changes#r3178508370
```go
	})
}

func TestHashAggMemCostGatedOnFreeOrdering(t *testing.T) {
```
This test appears to pass without your fix. Perhaps we should redesign the case to ensure it can verify that the gate functions correctly. I typically verify it by running these regression tests without fixes to check if it truly serves as the regression test.
Please correct me if I misunderstood this case.
/retest-required

0xPoe left a comment

Thanks!

Feel free to unhold. Thank you for addressing my comments.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 0xPoe, henrybw

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing

/unhold

/retest
What problem does this PR solve?
Issue Number: close #68083
Problem Summary:
What changed and how does it work?
planner: fix HashAgg cost model — place hash table memory cost outside concurrency division
Previously, getPlanCostVer24PhysicalHashAgg divided ALL costs (agg, group, hash-build, hash-probe) by the concurrency factor (default 5). This gave HashAgg an artificial
discount on hash table memory, even though each partial worker maintains its own hash table and total memory does not decrease with more workers.
Root cause
The cost model treated hash table memory the same as CPU work — dividing it by concurrency. But while CPU work (aggregation, grouping, hash key computation, probing) is
genuinely parallelized across partial workers that each process a disjoint subset of input rows, memory is not: each partial worker builds its own hash table, so total memory
scales with concurrency * outputRows * rowSize.
Fix
For TiDB root and TiKV cop tasks, split the hash build cost into its CPU and memory components:
New formula (root/cop):
plan-cost = child-cost + concurrency * hashMemCost + (aggCost + groupCost + hashBuildCPUCost + hashProbeCost) / concurrency
MPP (unchanged):
plan-cost = child-cost + (aggCost + groupCost + hashBuildCost + hashProbeCost) / concurrency
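To see the effect numerically, here is a toy comparison of the old and new root/cop formulas with made-up magnitudes (default concurrency 5; real values come from costusage factors and cardinality estimates, so only the shape of the change is meaningful):

```go
package main

import "fmt"

func main() {
	const (
		childCost        = 1000.0
		aggCost          = 500.0
		groupCost        = 200.0
		hashBuildCPUCost = 100.0
		hashProbeCost    = 100.0
		hashMemCost      = 300.0
		concurrency      = 5.0
	)

	// Old model: memory was divided by concurrency along with the CPU work.
	oldCost := childCost +
		(aggCost+groupCost+hashBuildCPUCost+hashMemCost+hashProbeCost)/concurrency

	// New model (root/cop): each partial worker keeps its own hash table, so
	// the memory term is multiplied by concurrency instead of divided by it.
	newCost := childCost + concurrency*hashMemCost +
		(aggCost+groupCost+hashBuildCPUCost+hashProbeCost)/concurrency

	fmt.Printf("old=%.0f new=%.0f\n", oldCost, newCost) // old=1240 new=2680
}
```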
Impact
Regression test
TestHashAggMemCostNotDividedByConcurrency verifies that StreamAgg is cheaper than HashAgg for a 100%-NDV GROUP BY with wide output rows and an available index.
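A rough skeleton of such a test under common TiDB testkit conventions (the review thread indicates the real test runs under RunTestUnderCascadesWithDomain; the helper calls and assertions here are illustrative, not a copy of the merged test):

```go
func TestHashAggMemCostNotDividedByConcurrency(t *testing.T) {
	store := testkit.CreateMockStore(t)
	tk := testkit.NewTestKit(t, store)
	tk.MustExec("use test")
	tk.MustExec("create table t_high_ndv (a int, b int, c varchar(200), d varchar(200), key idx_b (b))")
	// ... insert ~1000 rows with unique b values (100% NDV), then:
	tk.MustExec("analyze table t_high_ndv")

	// With the hash-table memory term no longer discounted by concurrency,
	// the planner should prefer StreamAgg over the ordered idx_b scan.
	plan := tk.MustQuery("explain format='brief' select b, count(*) from t_high_ndv group by b").Rows()
	require.Contains(t, fmt.Sprintf("%v", plan), "StreamAgg")
}
```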
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Bug Fixes
Tests