chore(tesseract): Collapse logical plan into Query + LogicalPlan root#10921
chore(tesseract): Collapse logical plan into Query + LogicalPlan root#10921waralexrom wants to merge 21 commits into
Conversation
Reshapes the cubesqlplanner logical plan around two ideas:
- collapse specialised body types (KeysSubQuery, MeasureSubquery,
AggregateMultipliedSubquery, DimensionSubQuery, multi-stage
Calculation/Aggregation) into a single `Query` over a uniform
`QuerySource` (`LogicalJoin` | `FullKeyAggregate` | `PreAggregation`);
- decompose Query's role via a `QueryKind` enum
(`TopLevelOverCtes` | `Stage(StageKind)` | `AggregateMultiplied`
| `LeafOverJoin` | `InternalFact(FactKind)` | `PreAggregationLeaf`);
- introduce a `LogicalPlan { ctes, root: PlanNode }` root container —
not part of `PlanNode` — that owns the WITH-clause pool. Every CTE
body (multi-stage stages, multiplied keys/measures, ex-DSQ bodies)
surfaces as a `LogicalMultiStageMember` whose `.body` is always a
`Rc<LogicalPlan>`; tree walkers cross the boundary explicitly.
Migration touchpoints:
- DimensionSubQuery is gone; ex-DSQ dimensions flow as
`MultiStageDimensionRef` (`OnPrimaryKeys` / `OnOuterDimensions`).
- Per-Query CTE refs live on `Query.multi_stage_dimensions`; CTE
bodies live on the surrounding `LogicalPlan.ctes`.
- New `PlanProcessor` renders the WITH clause; `QueryProcessor`
no longer emits CTEs.
- `PreAggregationOptimizer.try_optimize` and `OriginalSqlCollector`
accept `Rc<LogicalPlan>` and walk it via `collect_cube_names_from_plan`.
Re-enables `test_propagated_with_category_filter` (previously ignored
due to DSQ ungrouped-measure wrapping). All 900 active tests pass.
`LogicalPlan.root` is now `Rc<Query>` (strict) — the previous
`PlanNode` root forced runtime guards every time a caller wanted
`.schema()` / `.modifers()` off the root.
`LogicalMultiStageMember.body` is now a `MultiStageMemberBody` enum:
- `Plan(Rc<LogicalPlan>)` — Query-rooted bodies (leaf, DSQ, MS
inode, multiplied keys/measures/agg-multiplied).
- `TimeSeries(Rc<MultiStageTimeSeries>)` — time-axis scaffold.
- `RollingWindow(Rc<MultiStageRollingWindow>)` — window CTE.
TimeSeries and RollingWindow no longer need an empty-CTE-pool
wrapper. Pre-agg and `PlanProcessor` dispatch on the enum directly;
`expect_query_root`-style runtime guards disappear.
`QueryPlanner` now exposes two entry points:
- `plan() -> Rc<LogicalPlan>` — top-level, owns a fresh `CteState`,
packs the result.
- `plan_into(cte_state) -> Rc<Query>` — sub-plan that flattens
every CTE it produces into the caller's pool, returns just the
root Query.
DSQ and multi-stage leaf bodies switch to `plan_into`. Inner CTEs
(multiplied-measure keys/measure/agg-multiplied bodies for cross-cube
leaves) now live alongside outer CTEs in a single top-level pool —
no more nested `LogicalPlan` inside member bodies.
`MultiStageMemberBody::Plan(Rc<LogicalPlan>)` is gone — bodies are
`Query(Rc<Query>)` / `TimeSeries(...)` / `RollingWindow(...)`.
`PreAggregationOptimizer` becomes a graph-walk: starts at `plan.root`
FK refs, follows them by name through `plan.ctes`, dispatching by
each member body's role (Leaf → simple-match, WholeSubtree →
schema+root-filter match, PassThrough → keep + recurse, NoRewrite →
keep). After the walk, unreachable members (orphans of replaced
bodies, e.g. keys/measure CTEs of a replaced agg-multiplied) are
pruned from the result. This makes flattening safe — pre-agg no
longer double-counts inner CTEs as separate matches.
CTE naming is now uniformly `cte_N` via `CteState.next_cte_name()`.
Previously DSQ bodies used `"{cube}_{dim}_dimension_subquery"` and
multiplied KS/MS bodies used `"{cube}_keys_subquery_{seq}"` etc.; the
semantic names were neither read by consumers nor deduped reliably,
and the cross-planner `QueryTools.next_cte_seq_id` counter is gone.
DSQ body now projects its dimension as a synthetic `MeasureSymbol`
built off the dim's own `compiled_path` — the new helper
`MeasureSymbol::new_synthetic_from_dimension(dim)` returns a
Calculated measure whose `full_name`/`alias` match the dim's. Errors
if the dim has no `sql` (e.g. Geo) or its type doesn't map to a
`CalculatedMeasureType`.
That kills the `MemberExpression`-with-`expr:`-prefix round-trip the
DSQ planner used to build, which in turn collapses
`MultiStageDimensionRef.exposed` into `body_column`: now the body's
projection alias and the outer-scope full_name are the same symbol,
so substitution needs only one ref.
Dedup of identical CTE bodies (by `kind` + `eq_as_state`) deferred —
see phase-B backlog.
Cleaner field name now that the body's projection symbol and the
outer-scope reference are the same `MemberSymbol` by construction —
no need to distinguish "the column the body projects" from "the
member the outer scope reads".
Also drops the stale FIXME and commented-out DSQ-planning block in
`aggregate_subquery_plan` — DSQ for outer dims is correctly handled
inside `keys_query` (via `collect_sub_query_dimensions_from_symbols`
over the outer dim set); the outer Query{AggregateMultiplied} itself
never needs DSQ refs of its own.
Replaces the manual `for pk in dimensions { if !already_in { push }}`
with `outer_dims.chain(pk_dims).unique_by(|d| d.full_name())`.
Same semantics, fewer moving parts.
`Query.filter` and `Query.modifers` are `#[builder(default)]` now — when a caller doesn't set them, the builder constructs an empty `LogicalFilter` / `LogicalQueryModifiers` instead of failing. The MeasureSubquery (`InternalFact(Measures)`) and the outer `AggregateMultiplied` bodies were always passing `Rc::new(LogicalFilter::default())` + `Rc::new(LogicalQueryModifiers::default())` through explicitly — matching the old `MeasureSubquery` struct, which had no such fields. Same for the time-series-get-range, Stage and DimensionCalc bodies (modifers are set, filter is default), and for the pre-agg-rewrite `schema_and_filter` shape (filter is set, modifers is default). Drop those redundant setters at the call sites; builder semantics stay identical.
The classic-compiler idiom `measures: { foo: { type: 'number', sql:
'count(*)' } }` previously fell through to `MeasureKind::Calculated`
in the tesseract planner, so under a multiplied join the planner
spun up a full aggregate-multiplied pipeline (keys/measure CTEs,
fk_aggregate). The measure CTE ended up projecting `pk` next to
`count(*)` without a GROUP BY, which Postgres rejects with
`column "..." must appear in the GROUP BY clause`.
Mirror the classic-compiler hack: detect the `count(*)` literal and
promote the measure to `MeasureKind::Count` with the cube's
primary-key sqls behind it. From that point on the regular Count
path applies — additive in multiplied joins, rendered as
`COUNT(DISTINCT pk)` when needed.
- New `SqlCall::is_count_star()` — true only for an empty-dependency,
empty-filter-param/group, empty-security-context `SqlCall` whose
template is the literal `count(*)` (case-insensitive, trim-safe).
- `MeasureKind::from_type_str` promotes a `number` + `count(*)` shape
to `Count(Auto(pk_sqls))`.
- `MeasureSymbolFactory.build` now collects `pk_sqls` when the SQL is
absent OR is `count(*)`, so the promoted Count has the keys it
needs for `COUNT(DISTINCT pk)`.
Repro test: `tests::integration::chained_subquery::
test_or_and_filter_with_cross_cube_sub_query_dim` — mirrors the JS
sql-generation-logic.test.ts "where filter with operators OR & AND"
test on a visitors → visitor_checkins → cards chain with a
sub_query dim and a count(*) measure.
…er_references The OnPrimaryKeys DSQ join was built as `cte_ref.pk_alias = Expr::new_member(pk_dim)`. RHS, a `MemberExpression`, renders via `render_references`. By the time the join condition was rendered, the outer SELECT had already resolved the same PK dim through `ReferencesBuilder` and — since `SingleSource::Cube` returns no column ref — picked it up from the joined CTE's `TableReference` schema, registering `pk_dim → cte.pk_alias` in `render_references`. The ON then collapsed to `cte.pk_alias = cte.pk_alias`, a self-equal that turned the LEFT JOIN into a CTE × cube cartesian and broke any subsequent join condition referencing the DSQ dim. Build each side per-source instead — same pattern the full-key-aggregate strategies use: - CTE side: `Expr::Reference(cte_alias, cte_schema.resolve_member_alias(dim))`. - Cube side: `Expr::Reference(cube_alias, dim.name())` — direct column reference, bypassing `render_references` entirely. Cube alias is threaded through `add_multi_stage_dimension_pk_join` from `LogicalJoinProcessor`, where it's already known via `default_alias_with_prefix`. Repro test: `tests::integration::subquery_in_join:: test_sub_query_dim_in_join_condition` — mirrors the JS sub-query-dimensions.test.ts "inserted at the right place of a join" test with B → C join filtered by a sub_query dim from A.
When a measure is queried through a view, the view's `join_path:` is baked into the underlying member's path during construction (e.g. a view that exposes `child_val_avg` via `join_path: many_to_one_root.many_to_one_child` produces an underlying measure symbol with `path=[root, child]`, even though `is_view=false`). The view-wrapper sits one level above and carries `is_view=true`; `resolve_reference_chain` drops to the underlying, picking up its inflated path on the way. `collect_join_hints_for_measures` then pushes `Vector([root, child])`, and `aggregate_subquery_measure` built the MeasureSubquery's source `FROM root LEFT JOIN child` — multiplying child rows back through root and breaking the very dedup the subquery exists to do (e.g. `child.val_avg` over (foo, foo) came back as 166.66 instead of 200). Strip the join-chain prefix from each measure before collecting its hints. Same helper as in `assert_measures_not_multiplied`. After that the MeasureSubquery source is `FROM child` alone, matching what the classic compiler emits for the same shape. Repro test: `test_many_to_one_view_one_sum_multiplied` in `member_expressions_on_views.rs` — mirrors the JS member-expressions-on-views.test.ts "many_to_one_view > one_sum" case with a multiplied seed.
When a multiplied measure list contains a member-expression with an
inner aggregate (e.g. `COUNT(DISTINCT {view.child_test_dim})`), the
MeasureSubquery CTE used to emit the expression verbatim, producing
invalid SQL (aggregate without GROUP BY) and breaking outer
re-aggregation by row-multiplying through the join chain.
The MeasureSubquery now splits member-expressions into their
dim/measure deps via `get_dependencies` + `resolve_reference_chain`,
projecting dim-deps as plain dimension columns and measure-deps
alongside the native measures. The member-expression itself stays on
the outer `Query{AggregateMultiplied}` schema; `QueryProcessor`
resolves render-references for it explicitly so its inner dim refs
bind to the CTE columns (`q_N.cube__col`) instead of the raw source
table.
Replaces the flat `Vec<LogicalMultiStageMember>` in `CteState` with
typed `CteEntry { role, members, state, body }` records. Each CTE
now carries:
- `CteRole` (Keys / MultiStageDimension / MultiStageMeasure /
FactMeasure / MultipliedMeasureSubquery) — DSQ folds into
MultiStageDimension as a degenerate one-level case;
- the members it projects (Vec, since fact/multiplied CTEs cover
several measures at once);
- the `Rc<QueryProperties>` state it was planned under.
`next_cte_name(role)` takes a role and pulls from a per-kind counter,
so generated aliases are `keys_0`, `fact_measure_0`,
`multi_stage_measure_0`, etc. `add_member(...)` returns the final
CTE name — `body.name` for fresh entries, the existing entry's name
on same-body-name dedup — so callers can't pin a name that won't
appear in the WITH clause. `find_matching(role, members, state)`
exposes a `(role, members, state)` lookup over
`QueryProperties::eq_as_state` + member-set comparison; no callers
yet, wired up for future planner-level dedup.
All ten call-sites (multiplied x4, dim_subq, multi_stage tree x5) pass
role + members + state and use the returned name for downstream
refs.
Stores the consumer-facing `Rc<MultiStageSubqueryRef>` alongside each CTE in `CteState` so future dedup hits (via `find_matching`) can hand back the same ref a fresh builder would have produced. Required for the upcoming bottom-up rewrite of the multi-stage planner where the recursion returns refs directly instead of going through a description tree + second-pass render. All six `add_member` call-sites (multiplied x4, dim_subq, multi_stage descr loop) build the ref up-front and pass it in. Existing behaviour is unchanged — `cte_ref` is not consumed by anything yet.
Replaces the two-pass multi-stage planner (build a tree of `MultiStageQueryDescription`s, then render each into a `LogicalMultiStageMember` in a second loop) with a single bottom-up recursion in `MultiStageQueryPlanner::build_multi_stage_cte` that returns `Rc<MultiStageSubqueryRef>` directly. The body is rendered and registered in `cte_state` on the way back up each stack frame; dedup is now `cte_state.find_matching(role, [member], state)`. `MultiStageMemberQueryPlanner` is gone — its render methods are now free `pub(crate) fn build_*` in `member_query_planner.rs`, taking explicit parameters and `&[Rc<MultiStageSubqueryRef>]` for children. `MultiStageQueryDescription` (and `query_description.rs`) is deleted along with the description tree; the time-series / time-series-range literal-alias cache (`descriptions.iter().find(d.alias() == ...)`) migrates to a local `TimeSeriesCache` on the planner. Three bucketing tests with chained / multi-dim multi-stage dimension setups are `#[ignore]`'d: the bottom-up `child.schema().dimensions` union in `build_for_cte_dimension_query` doesn't yet match the old `collect_all_non_multi_stage_dimension` walk. Tracked in FIXME comments at the test sites; everything else (917 tests) stays green.
…nner dispatcher Collector now surfaces both `sub_query: true` and `multi_stage: true` dimensions; DimensionSubqueryPlanner dispatches on the flavour and currently only builds DSQ — multi-stage path will follow. No behavior change: top-level MultiStageQueryPlanner still owns the multi-stage dim build, and the dispatcher silently skips them.
DimensionSubqueryPlanner now owns the multi-stage dim build path via the dispatcher introduced in the previous commit: when a dim is flagged multi_stage, it delegates to MultiStageQueryPlanner::build_multi_stage_dim_ref and returns a MultiStageDimensionRef with the OnOuterDimensions join placeholder. Top-level MultiStageQueryPlanner.plan_queries now skips dimension- flavoured multi-stage members; only measures are walked there. Physical builder is unchanged — multi-stage dim joins still go through the schema().multi_stage_dimensions() walk.
…ns JOIN Multi-stage dim CTEs now LEFT-JOIN onto the cube-join chain through the MultiStageDimensionRef::OnOuterDimensions descriptor — populated in build_multi_stage_dim_ref via the existing LogicalSchema::multi_stage_join_dimensions. LogicalJoinProcessor partitions ms-dim refs by join shape and applies pk vs outer accordingly. Removes the legacy combined-key context lookup: - MultiStageDimensionContext + the multi_stage_dimension_schemas / multi_stage_dimensions slots on PushDownBuilderContext. - remove_multi_stage_dimensions / add_multi_stage_dimension / get_multi_stage_dimensions / add_multi_stage_dimension_schema. - The schema-walk hack in QueryProcessor + DimensionCalc registration in PlanProcessor. QueryProcessor still drops ms-dim refs already exposed by the first FullKeyAggregate data input to avoid double-joining. Three bucketing tests with chained multi-stage dims remain ignored — separate recursive-walk regression in build_for_cte_dimension_query.
…tate ref_for_member rebuilt the LogicalSchema from the upper state, so extension dims added by deeper `add_group_by` directives never reached outer consumers — chained multi-stage dim CTEs lost their join keys. Replace with ref_for_body, which takes the just-rendered LogicalMultiStageMember and copies its body's own output schema (via new MultiStageMemberBody::schema accessor). Knocks out two of the three bucketing regressions (with_dimension_over_complex_dimension, with_two_dimensions). with_two_dims_concated remains ignored — separate pk_aggregate_keys_source UNION ALL column-set mismatch when chained ms-dim CTEs project disjoint dim sets.
…hape When a multi-stage dimension's only multi-stage dependencies are other multi-stage *dimensions* (no measure children), build its body as a leaf-style query over the cube join with ms-dim CTEs attached as OnOuterDimensions LEFT JOINs, rather than a FullKeyAggregate UNION over heterogeneous dim-child schemas (which can't merge disjoint dim sets across cube boundaries). `build_for_cte_dim_only_query` in member_query_planner; routing in build_multi_stage_cte via `dim_only_deps`/`build_dim_only_child_refs`. Third bucketing test (`with_two_dims_concated`) advances past the UNION mismatch but trips on a render-references issue (`customer_id` resolved from LEFT-joined ms-dim CTE pollutes cube JOIN ON). Remains ignored with FIXME describing the new failure.
…-only DimensionCalc For StageKind::DimensionCalc with a cube-join source (the new dim-only shape), only resolve render-references for the stage's own multi_stage_dimension. Extension/payload dims (e.g. `orders.customer_id`) projected for consumer JOIN keys are projected without resolving: otherwise `references_builder` finds them in the LEFT-joined ms-dim CTE first (cube sources return None), sets a render_reference, and the inner cube JOIN ON ends up rendered as `ms_dim_X.col = ms_dim_Y.col` instead of `cube.col = cube.col`. Unignores `test_bucketing_with_two_dims_concated`.
…into FullKeyAggregate path Mirrors the JS `Calc-Groups > source product_category cross join` case as a Rust integration test (currently ignored — see FIXME). Plumbing to advance the case beyond "no FROM clause at all": - FullKeyAggregateQueryPlanner now runs DimensionSubqueryPlanner over the outer query's dimensions and populates Query.multi_stage_dimensions. Without this, queries that have only multi-stage dimensions and no measures (top-level MultiStageQueryPlanner filters dims out) never triggered DSQ planner anywhere. - YAML test-fixture view parser now supports inline `dimensions:` and `measures:` blocks, matching the JS view shape (cubes: becomes optional). Switch dims / multi-stage case-switch dims can now live directly on a view without an artificial host cube. Test advances from "column 'source' does not exist" (no SQL) to "column 'source_b__product_category' does not exist" (disjoint column sets in case-switch branch CTEs hitting the FullKeyAggregate UNION). FIXME describes the remaining gap.
|
Claude finished @waralexrom's task in 3m 13s —— View job Review: Collapse logical plan into Query + LogicalPlan root
This is a well-structured collapse-then-decompose refactoring that consolidates the former zoo of logical plan body types ( Architecture & Design — Positive
Issues & Suggestions1. Typo:
|
| Priority | Item |
|---|---|
| Low | Fix modifers → modifiers typo (widespread) |
| Low | Split CteRole::MultipliedMeasureSubquery into two roles |
| Trivial | Remove dead let _ = me; binding |
| Trivial | Update stale find_matching doc comment |
| Trivial | Fix PhysycalNode → PhysicalNode typo |
None of these are blocking. The PR is well-structured and ready for merge from a code quality perspective.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #10921 +/- ##
==========================================
+ Coverage 78.87% 83.53% +4.65%
==========================================
Files 470 254 -216
Lines 92483 75869 -16614
Branches 3449 0 -3449
==========================================
- Hits 72948 63374 -9574
+ Misses 19031 12495 -6536
+ Partials 504 0 -504
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
Collapse-then-decompose refactor of the Tesseract logical plan: the
former zoo of body types (
AggregateMultipliedSubquery,KeysSubQuery,MeasureSubquery,DimensionSubQuery) merges into a singleQuerynode with
QueryKindrole marker and an explicitLogicalPlanrootthat owns the flat CTE pool. Multi-stage dimension planning moves to
the leaves and goes through a unified
DimensionSubqueryPlannerdispatcher; the physical builder switches off its context-based
multi-stage-dim lookup in favour of explicit
MultiStageDimensionRefjoins.