Skip to content

expose SpjNormalForm::predicate() + Predicate Display impl#51

Closed
zhuqi-lucas wants to merge 23 commits into
mainfrom
qizhu/expose-spj-predicate
Closed

expose SpjNormalForm::predicate() + Predicate Display impl#51
zhuqi-lucas wants to merge 23 commits into
mainfrom
qizhu/expose-spj-predicate

Conversation

@zhuqi-lucas

Copy link
Copy Markdown
Collaborator

Summary

Cherry-pick of upstream PR datafusion-contrib#120 so atlas can pick up the new predicate() accessor + Display for Predicate without waiting for the upstream review cycle.

Changes:

  • Promotes Predicate and ColumnEquivalenceClass from private to pub so they can be referenced from public APIs.
  • Adds pub fn predicate(&self) -> &Predicate on SpjNormalForm.
  • Adds a Display impl on Predicate that AND-joins pairwise eq-class equalities, narrowed range intervals (default unbounded ones are skipped), and sorted residuals.

Internal fields stay private; only the read-only view is published.

Motivation

Atlas-side meta.mv_normal_form / meta.mv_candidate_map catalog tables (X-2975, atlas PR #5343) currently walk the original LogicalPlan to populate the predicates column because SpjNormalForm::predicate is private. With this change, that workaround drops to normal_form.predicate().to_string().

Tests

22/22 lib tests pass. Adds test_predicate_getter_and_display covering eq class + range + residual buckets.

xudong963 and others added 23 commits June 26, 2025 09:27
* Upgrade to DF48

* fix bug

* update

* update more
* Upgrade to DF49

* fix licenses

* use 49

* resolve comments
* Upgrade DF to 49.0.2

* fix clippy and upgrade rust

* upgrade version for Cargo Deny
* Support static partition columns for MV

* runtime checks

* unit test for dynamic partition columns

* lint
* inline mermaid diagrams

* additional comment

* mention duplicates, not cardinality

* more documentation
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Upgrade DF51.0.0

* udpate
…te types (#21) (datafusion-contrib#108)

Co-authored-by: Matt Friede <7852262+Friede80@users.noreply.github.com>
* Upgrade DF52

* update test

* use 52
…n-contrib#112)

* Expose a get_mv_candidates_for_table API for ViewMatcher

* refine comments
…b#113)

* perf: single-pass plan traversal in Predicate::new

* address comments

* add join error test
* Optimize rewrite performanc

* Add test

* fmt

* Address comments to add details
`SpjNormalForm` already exposes public getters for `output_schema`,
`output_exprs`, and `referenced_tables`, but the `predicate` field is
private. Downstream consumers that want to surface the SPJ normal form
(e.g. atlas catalog tables exposing `mv_normal_form` for debugging why
view matching does or does not engage) currently have to walk the
original `LogicalPlan` and re-collect Filter / TableScan filters as a
string, which loses the normalization the analyzer applied.

This change:

* Promotes `Predicate` and `ColumnEquivalenceClass` from private to
  `pub` so they can be referenced from public APIs.
* Adds `pub fn predicate(&self) -> &Predicate` on `SpjNormalForm`.
* Adds a `Display` impl on `Predicate` that AND-joins:
  - pairwise equalities derived from each multi-column equivalence
    class (singletons emit nothing),
  - narrowed range intervals per equivalence class (classes whose
    interval is still the default unbounded interval are skipped, so
    only meaningful constraints surface),
  - residual filter expressions, sorted for deterministic output.

The internal fields stay private, so this does not lock in any
particular representation; only the read-only view is published.

A unit test exercises the new getter + Display on a predicate that
populates all three buckets (eq class, range, residual).
Copilot AI review requested due to automatic review settings June 23, 2026 06:21

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the SPJ normal form surface area to make normalized predicates externally accessible (via a new SpjNormalForm::predicate() accessor and Display for Predicate) so downstream consumers can stringify/inspect predicates without re-walking the original LogicalPlan. In addition, it includes a broader upstream sync: DataFusion/Arrow dependency upgrades, optimizer/execution API adjustments, and incremental-view-maintenance dependency-analysis improvements and documentation.

Changes:

  • Expose predicate-related types/APIs: SpjNormalForm::predicate(), Display for Predicate, and make Predicate / ColumnEquivalenceClass public (fields remain private).
  • Refactor/optimize normal-form construction and rewriting (single-pass predicate collection; rewrite fast paths).
  • Upgrade ecosystem + tooling: DataFusion 52 / Arrow 57, cargo-deny action v2, expanded docs/tests, and add a Criterion benchmark.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/materialized_listing_table.rs Updates for DataFusion API changes around partition values and ListingTableConfig setup.
src/rewrite/normal_form.rs Adds predicate() accessor + Predicate display; refactors predicate collection and rewrite performance paths; expands tests.
src/rewrite/exploitation.rs Adapts view exploitation/matching to updated DataFusion APIs; changes CostFn signature and ordering/statistics interfaces; adds MV candidate helper APIs.
src/rewrite.rs Removes redundant top-level module comment (moved into module docs).
src/materialized/util.rs Adds clarifying doc comment + link to upstream issue for sync/async catalog lookup.
src/materialized/row_metadata.rs Minor error-message formatting update.
src/materialized/hive_partition.rs Updates Arrow/DataFusion type imports and derives for UDF struct.
src/materialized/file_metadata.rs Updates schema/DFSchema conversions and Arrow error wrapping for newer DataFusion.
src/materialized/dependencies.rs Improves docs + stale-files logic; introduces static partition columns support in dependency analysis; adds multiple new tests/guards.
src/materialized.rs Introduces static_partition_columns() on Materialized; changes cast_to_materialized to return Result and validate invariants.
src/lib.rs Expands crate-level documentation (features, workflow, references).
deny.toml Allows additional licenses used by upgraded dependencies.
CHANGELOG.md Adds 0.2.0 release section and entries.
Cargo.toml Bumps crate version + Rust/MSRV; upgrades Arrow/DataFusion deps; adds aquamarine + criterion + bench target.
benches/materialized_views_benchmark.rs Adds Criterion benchmark for normal-form creation and rewrite performance.
.github/workflows/ci.yml Updates cargo-deny GitHub Action to v2.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/materialized.rs
Comment on lines +154 to +162
let static_partition_cols = materialized.static_partition_columns();
let all_partition_cols = materialized.partition_columns();

if materialized.partition_columns()[..static_partition_cols.len()] != static_partition_cols[..]
{
return Err(DataFusionError::Plan(format!(
"Materialized view's static partition columns ({static_partition_cols:?}) must be a prefix of all partition columns ({all_partition_cols:?})"
)));
}
Comment on lines +280 to +286
} else {
// Slow path: complex expressions need full transform
let new_output_expr = other
.predicate
.normalize_expr(output_expr.clone())
.rewrite(&mut &*other)?
.data;
Comment on lines 325 to +329
let all_filters = eq_filters
.into_iter()
.chain(range_filters)
.chain(residual_filters)
.map(|expr| expr.rewrite(&mut other).unwrap().data)
.map(|expr| expr.rewrite(&mut &*other).unwrap().data)
Comment on lines 416 to 425
if candidates.is_empty() {
return Err(DataFusionError::Plan(
"can't create OneOfExec with empty children".to_string(),
));
}
let best = candidates

let best = cost(Box::new(candidates.iter().map(|c| c.as_ref())))
.iter()
.position_min_by_key(|candidate| OrderedFloat(cost(candidate.as_ref())))
.position_min_by_key(|&cost| OrderedFloat(*cost))
.unwrap();
Comment on lines +2066 to +2070
// Test case: Cross join where only columns from left table (t1) are selected
// The cross join with t3 affects cardinality but we don't select any t3 columns
// Expected: Only files from t1 should be in dependencies, not from t3
// BUG: Currently t3 files are incorrectly included in dependencies
let query = "SELECT t1.column1, t1.column2 FROM t1 CROSS JOIN t3";
Comment on lines +2074 to +2080
println!("Original plan:\n{}", plan.display_indent());

// We're partitioning on column1 which only comes from t1
let partition_col_indices: HashSet<usize> = [0].into_iter().collect(); // column1 is at index 0

let analyzed = pushdown_projection_inexact(plan.clone(), &partition_col_indices)?;
println!("After pushdown:\n{}", analyzed.display_indent());
Comment on lines +2108 to +2110
// Print the actual dependencies for debugging
println!("Actual dependencies:");
println!("{}", pretty_format_batches(&batches)?);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants