expose SpjNormalForm::predicate() + Predicate Display impl#51
Closed
zhuqi-lucas wants to merge 23 commits into
Closed
expose SpjNormalForm::predicate() + Predicate Display impl#51zhuqi-lucas wants to merge 23 commits into
zhuqi-lucas wants to merge 23 commits into
Conversation
* Upgrade to DF48 * fix bug * update * update more
* Upgrade to DF49 * fix licenses * use 49 * resolve comments
* Upgrade DF to 49.0.2 * fix clippy and upgrade rust * upgrade version for Cargo Deny
* Support static partition columns for MV * runtime checks * unit test for dynamic partition columns * lint
* inline mermaid diagrams * additional comment * mention duplicates, not cardinality * more documentation
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Upgrade DF51.0.0 * udpate
…te types (#21) (datafusion-contrib#108) Co-authored-by: Matt Friede <7852262+Friede80@users.noreply.github.com>
* Upgrade DF52 * update test * use 52
…n-contrib#112) * Expose a get_mv_candidates_for_table API for ViewMatcher * refine comments
…b#113) * perf: single-pass plan traversal in Predicate::new * address comments * add join error test
* Optimize rewrite performanc * Add test * fmt * Address comments to add details
`SpjNormalForm` already exposes public getters for `output_schema`,
`output_exprs`, and `referenced_tables`, but the `predicate` field is
private. Downstream consumers that want to surface the SPJ normal form
(e.g. atlas catalog tables exposing `mv_normal_form` for debugging why
view matching does or does not engage) currently have to walk the
original `LogicalPlan` and re-collect Filter / TableScan filters as a
string, which loses the normalization the analyzer applied.
This change:
* Promotes `Predicate` and `ColumnEquivalenceClass` from private to
`pub` so they can be referenced from public APIs.
* Adds `pub fn predicate(&self) -> &Predicate` on `SpjNormalForm`.
* Adds a `Display` impl on `Predicate` that AND-joins:
- pairwise equalities derived from each multi-column equivalence
class (singletons emit nothing),
- narrowed range intervals per equivalence class (classes whose
interval is still the default unbounded interval are skipped, so
only meaningful constraints surface),
- residual filter expressions, sorted for deterministic output.
The internal fields stay private, so this does not lock in any
particular representation; only the read-only view is published.
A unit test exercises the new getter + Display on a predicate that
populates all three buckets (eq class, range, residual).
There was a problem hiding this comment.
Pull request overview
This PR updates the SPJ normal form surface area to make normalized predicates externally accessible (via a new SpjNormalForm::predicate() accessor and Display for Predicate) so downstream consumers can stringify/inspect predicates without re-walking the original LogicalPlan. In addition, it includes a broader upstream sync: DataFusion/Arrow dependency upgrades, optimizer/execution API adjustments, and incremental-view-maintenance dependency-analysis improvements and documentation.
Changes:
- Expose predicate-related types/APIs:
SpjNormalForm::predicate(),Display for Predicate, and makePredicate/ColumnEquivalenceClasspublic (fields remain private). - Refactor/optimize normal-form construction and rewriting (single-pass predicate collection; rewrite fast paths).
- Upgrade ecosystem + tooling: DataFusion 52 / Arrow 57, cargo-deny action v2, expanded docs/tests, and add a Criterion benchmark.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/materialized_listing_table.rs | Updates for DataFusion API changes around partition values and ListingTableConfig setup. |
| src/rewrite/normal_form.rs | Adds predicate() accessor + Predicate display; refactors predicate collection and rewrite performance paths; expands tests. |
| src/rewrite/exploitation.rs | Adapts view exploitation/matching to updated DataFusion APIs; changes CostFn signature and ordering/statistics interfaces; adds MV candidate helper APIs. |
| src/rewrite.rs | Removes redundant top-level module comment (moved into module docs). |
| src/materialized/util.rs | Adds clarifying doc comment + link to upstream issue for sync/async catalog lookup. |
| src/materialized/row_metadata.rs | Minor error-message formatting update. |
| src/materialized/hive_partition.rs | Updates Arrow/DataFusion type imports and derives for UDF struct. |
| src/materialized/file_metadata.rs | Updates schema/DFSchema conversions and Arrow error wrapping for newer DataFusion. |
| src/materialized/dependencies.rs | Improves docs + stale-files logic; introduces static partition columns support in dependency analysis; adds multiple new tests/guards. |
| src/materialized.rs | Introduces static_partition_columns() on Materialized; changes cast_to_materialized to return Result and validate invariants. |
| src/lib.rs | Expands crate-level documentation (features, workflow, references). |
| deny.toml | Allows additional licenses used by upgraded dependencies. |
| CHANGELOG.md | Adds 0.2.0 release section and entries. |
| Cargo.toml | Bumps crate version + Rust/MSRV; upgrades Arrow/DataFusion deps; adds aquamarine + criterion + bench target. |
| benches/materialized_views_benchmark.rs | Adds Criterion benchmark for normal-form creation and rewrite performance. |
| .github/workflows/ci.yml | Updates cargo-deny GitHub Action to v2. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+154
to
+162
| let static_partition_cols = materialized.static_partition_columns(); | ||
| let all_partition_cols = materialized.partition_columns(); | ||
|
|
||
| if materialized.partition_columns()[..static_partition_cols.len()] != static_partition_cols[..] | ||
| { | ||
| return Err(DataFusionError::Plan(format!( | ||
| "Materialized view's static partition columns ({static_partition_cols:?}) must be a prefix of all partition columns ({all_partition_cols:?})" | ||
| ))); | ||
| } |
Comment on lines
+280
to
+286
| } else { | ||
| // Slow path: complex expressions need full transform | ||
| let new_output_expr = other | ||
| .predicate | ||
| .normalize_expr(output_expr.clone()) | ||
| .rewrite(&mut &*other)? | ||
| .data; |
Comment on lines
325
to
+329
| let all_filters = eq_filters | ||
| .into_iter() | ||
| .chain(range_filters) | ||
| .chain(residual_filters) | ||
| .map(|expr| expr.rewrite(&mut other).unwrap().data) | ||
| .map(|expr| expr.rewrite(&mut &*other).unwrap().data) |
Comment on lines
416
to
425
| if candidates.is_empty() { | ||
| return Err(DataFusionError::Plan( | ||
| "can't create OneOfExec with empty children".to_string(), | ||
| )); | ||
| } | ||
| let best = candidates | ||
|
|
||
| let best = cost(Box::new(candidates.iter().map(|c| c.as_ref()))) | ||
| .iter() | ||
| .position_min_by_key(|candidate| OrderedFloat(cost(candidate.as_ref()))) | ||
| .position_min_by_key(|&cost| OrderedFloat(*cost)) | ||
| .unwrap(); |
Comment on lines
+2066
to
+2070
| // Test case: Cross join where only columns from left table (t1) are selected | ||
| // The cross join with t3 affects cardinality but we don't select any t3 columns | ||
| // Expected: Only files from t1 should be in dependencies, not from t3 | ||
| // BUG: Currently t3 files are incorrectly included in dependencies | ||
| let query = "SELECT t1.column1, t1.column2 FROM t1 CROSS JOIN t3"; |
Comment on lines
+2074
to
+2080
| println!("Original plan:\n{}", plan.display_indent()); | ||
|
|
||
| // We're partitioning on column1 which only comes from t1 | ||
| let partition_col_indices: HashSet<usize> = [0].into_iter().collect(); // column1 is at index 0 | ||
|
|
||
| let analyzed = pushdown_projection_inexact(plan.clone(), &partition_col_indices)?; | ||
| println!("After pushdown:\n{}", analyzed.display_indent()); |
Comment on lines
+2108
to
+2110
| // Print the actual dependencies for debugging | ||
| println!("Actual dependencies:"); | ||
| println!("{}", pretty_format_batches(&batches)?); |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cherry-pick of upstream PR datafusion-contrib#120 so atlas can pick up the new
predicate()accessor +Display for Predicatewithout waiting for the upstream review cycle.Changes:
PredicateandColumnEquivalenceClassfrom private topubso they can be referenced from public APIs.pub fn predicate(&self) -> &PredicateonSpjNormalForm.Displayimpl onPredicatethat AND-joins pairwise eq-class equalities, narrowed range intervals (default unbounded ones are skipped), and sorted residuals.Internal fields stay private; only the read-only view is published.
Motivation
Atlas-side
meta.mv_normal_form/meta.mv_candidate_mapcatalog tables (X-2975, atlas PR #5343) currently walk the originalLogicalPlanto populate thepredicatescolumn becauseSpjNormalForm::predicateis private. With this change, that workaround drops tonormal_form.predicate().to_string().Tests
22/22 lib tests pass. Adds
test_predicate_getter_and_displaycovering eq class + range + residual buckets.