Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
d4cc10f
Upgrade DataFusion 48.0.0 (#61)
xudong963 Jun 26, 2025
25e5ccc
Upgrade to DF49 (#75)
xudong963 Aug 1, 2025
3026895
Upgrade DF to 49.0.2 (#86)
zhuqi-lucas Sep 3, 2025
d8364fb
make cost fn accept candidates (#83)
xudong963 Sep 13, 2025
540f29e
Fix empty unnest columns handling when pushdown_projection_inexact (#88)
zhuqi-lucas Sep 15, 2025
169eb66
upgrade to DF50 (#87)
xudong963 Sep 16, 2025
5915f4c
Support static partition columns for MV (#89)
suremarc Sep 25, 2025
0c408a7
Improved documentation on IVM algorithm (#90)
suremarc Oct 10, 2025
3910e12
Support limit pushdown for OneOfExec (#94)
xudong963 Oct 11, 2025
f3d5eb1
Improve the doc (#95)
xudong963 Oct 24, 2025
6162aea
Chore: remove useless lines in changelog (#97)
xudong963 Oct 24, 2025
e620594
chore: release v0.2.0 (#96)
github-actions[bot] Oct 24, 2025
ec7e88a
Add benchmark for heavy operation for datafusion-materialized-views (…
zhuqi-lucas Oct 28, 2025
f1f7ad8
Upgrade DF51.0.0 (#104)
xudong963 Nov 19, 2025
9c91580
Fix mv dependencies involving unrelated files (#107)
xudong963 Dec 12, 2025
4539acf
prevent rewriting strict inequality to closed interval for non-discre…
xudong963 Dec 15, 2025
0d1aefa
Expose mv_plans for ViewMatcher (#22) (#109)
xudong963 Dec 19, 2025
547aa4d
Upgrade DF52 (#111)
xudong963 Jan 12, 2026
4071577
Expose a `get_mv_candidates_for_table` API for ViewMatcher (#112)
xudong963 Jan 12, 2026
02bad14
perf: single-pass plan traversal in Predicate::new (#113)
zhuqi-lucas Jan 20, 2026
429e52a
Optimize rewrite performance (#115)
zhuqi-lucas Feb 3, 2026
152c885
Support view matcher with boolean binary operation (#117)
zhuqi-lucas Mar 9, 2026
99ff72d
expose SpjNormalForm::predicate() + Predicate Display impl
zhuqi-lucas Jun 23, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -132,7 +132,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: EmbarkStudios/cargo-deny-action@v1
- uses: EmbarkStudios/cargo-deny-action@v2
with:
command: check license

Expand Down
42 changes: 41 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,47 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
## [0.2.0](https://github.com/datafusion-contrib/datafusion-materialized-views/compare/v0.1.1...v0.2.0) - 2025-10-24

### Added
- `Decorator` trait ([#26](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/26)) (by @suremarc) - #26

### Other
- remove useless lines in changelog ([#97](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/97)) (by @xudong963) - #97
- Improve the doc ([#95](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/95)) (by @xudong963) - #95
- Support limit pushdown for OneOfExec ([#94](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/94)) (by @xudong963) - #94
- Improved documentation on IVM algorithm ([#90](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/90)) (by @suremarc) - #90
- Support static partition columns for MV ([#89](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/89)) (by @suremarc) - #89
- upgrade to DF50 ([#87](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/87)) (by @xudong963) - #87
- Fix empty unnest columns handling when pushdown_projection_inexact ([#88](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/88)) (by @zhuqi-lucas) - #88
- make cost fn accept candidates ([#83](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/83)) (by @xudong963) - #83
- Upgrade DF to 49.0.2 ([#86](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/86)) (by @zhuqi-lucas) - #86
- Upgrade to DF49 ([#75](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/75)) (by @xudong963) - #75
- Upgrade DataFusion 48.0.0 ([#61](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/61)) (by @xudong963) - #61
- Allow customization of `list_all_files` function. ([#69](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/69)) (by @jared-m-combs) - #69
- Allow for 'special' partitions that are omitted in the staleness check. ([#68](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/68)) (by @jared-m-combs) - #68
- don't panic if eq class is not found ([#60](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/60)) (by @suremarc) - #60
- Handle table scan filters that reference dropped columns ([#59](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/59)) (by @suremarc) - #59
- exclude some materialized views from query rewriting ([#57](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/57)) (by @suremarc) - #57
- Optimize performance bottleneck if projection is large ([#56](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/56)) (by @xudong963) - #56
- Upgrade df47 ([#55](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/55)) (by @xudong963) - #55
- Update itertools requirement from 0.13 to 0.14 ([#32](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/32)) (by @dependabot[bot]) - #32
- Update ordered-float requirement from 4.6.0 to 5.0.0 ([#49](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/49)) (by @dependabot[bot]) - #49
- Upgrade DF46 ([#48](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/48)) (by @xudong963) - #48
- Update extension ([#45](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/45)) (by @matthewmturner) - #45
- make explain output stable ([#44](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/44)) (by @suremarc) - #44
- Add alternate analysis for MVs with no partition columns ([#39](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/39)) (by @suremarc) - #39
- upgrade to datafusion 45 ([#38](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/38)) (by @suremarc) - #38
- use nanosecond timestamps in file metadata ([#28](https://github.com/datafusion-contrib/datafusion-materialized-views/pull/28)) (by @suremarc) - #28

### Contributors

* @xudong963
* @suremarc
* @zhuqi-lucas
* @jared-m-combs
* @dependabot[bot]
* @matthewmturner

## [0.1.1](https://github.com/datafusion-contrib/datafusion-materialized-views/compare/v0.1.0...v0.1.1) - 2025-01-07

Expand Down
37 changes: 22 additions & 15 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -17,39 +17,46 @@

[package]
name = "datafusion-materialized-views"
version = "0.1.1"
version = "0.2.0"
edition = "2021"
homepage = "https://github.com/datafusion-contrib/datafusion-materialized-views"
repository = "https://github.com/datafusion-contrib/datafusion-materialized-views"
authors = ["Matthew Cramerus <matt@polygon.io>"]
license = "Apache-2.0"
description = "Materialized Views & Query Rewriting in DataFusion"
keywords = ["arrow", "arrow-rs", "datafusion"]
rust-version = "1.80"
rust-version = "1.85.1"

[dependencies]
arrow = "55"
arrow-schema = "55"
async-trait = "0.1"
aquamarine = "0.6.0"
arrow = "57.1.0"
arrow-schema = "57.1.0"
async-trait = "0.1.89"
dashmap = "6"
datafusion = "47"
datafusion-common = "47"
datafusion-expr = "47"
datafusion-functions = "47"
datafusion-functions-aggregate = "47"
datafusion-optimizer = "47"
datafusion-physical-expr = "47"
datafusion-physical-plan = "47"
datafusion-sql = "47"
datafusion = "52"
datafusion-common = "52"
datafusion-expr = "52"
datafusion-functions = "52"
datafusion-functions-aggregate = "52"
datafusion-optimizer = "52"
datafusion-physical-expr = "52"
datafusion-physical-plan = "52"
datafusion-sql = "52"
futures = "0.3"
itertools = "0.14"
log = "0.4"
object_store = "0.12"
object_store = "0.12.4"
ordered-float = "5.0.0"

[dev-dependencies]
anyhow = "1.0.95"
criterion = "0.4"
env_logger = "0.11.6"
tempfile = "3.14.0"
tokio = "1.42.0"
url = "2.5.4"

[[bench]]
name = "materialized_views_benchmark"
harness = false
path = "benches/materialized_views_benchmark.rs"
182 changes: 182 additions & 0 deletions benches/materialized_views_benchmark.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use std::sync::Arc;
use std::time::Duration;

use datafusion::datasource::provider_as_source;
use datafusion::datasource::TableProvider;
use datafusion::prelude::SessionContext;
use datafusion_common::Result as DfResult;
use datafusion_expr::LogicalPlan;
use datafusion_materialized_views::rewrite::normal_form::SpjNormalForm;
use datafusion_sql::TableReference;
use tokio::runtime::Builder;

// Utility: generate CREATE TABLE SQL with n columns named c0..c{n-1}
fn make_create_table_sql(table_name: &str, ncols: usize) -> String {
let cols = (0..ncols)
.map(|i| format!("c{} INT", i))
.collect::<Vec<_>>()
.join(", ");
format!(
"CREATE TABLE {table} ({cols})",
table = table_name,
cols = cols
)
}

// Utility: generate a base SELECT that projects all columns and has a couple filters
fn make_base_sql(table_name: &str, ncols: usize) -> String {
let cols = (0..ncols)
.map(|i| format!("c{}", i))
.collect::<Vec<_>>()
.join(", ");
let mut where_clauses = vec![];
if ncols > 0 {
where_clauses.push("c0 >= 0".to_string());
}
if ncols > 1 {
where_clauses.push("c0 + c1 >= 0".to_string());
}
let where_part = if where_clauses.is_empty() {
"".to_string()
} else {
format!(" WHERE {}", where_clauses.join(" AND "))
};
format!("SELECT {cols} FROM {table}{where}", cols = cols, table = table_name, where = where_part)
}

// Utility: generate a query that is stricter and selects subset (so rewrite_from has a chance)
fn make_query_sql(table_name: &str, ncols: usize) -> String {
let take = std::cmp::max(1, ncols / 2);
let cols = (0..take)
.map(|i| format!("c{}", i))
.collect::<Vec<_>>()
.join(", ");

let mut where_clauses = vec![];
if ncols > 0 {
where_clauses.push("c0 >= 10".to_string());
}
if ncols > 1 {
where_clauses.push("c0 * c1 > 100".to_string());
}
if ncols > 10 {
where_clauses.push(format!("c{} >= 0", ncols - 1));
}

let where_part = if where_clauses.is_empty() {
"".to_string()
} else {
format!(" WHERE {}", where_clauses.join(" AND "))
};

format!("SELECT {cols} FROM {table}{where}", cols = cols, table = table_name, where = where_part)
}

// Build fixture: create SessionContext, the table, then return LogicalPlans for base & query and table provider
fn build_fixture_for_cols(
rt: &tokio::runtime::Runtime,
ncols: usize,
) -> DfResult<(LogicalPlan, LogicalPlan, Arc<dyn TableProvider>)> {
rt.block_on(async move {
let ctx = SessionContext::new();

// create table
let table_name = "t";
let create_sql = make_create_table_sql(table_name, ncols);
ctx.sql(&create_sql).await?.collect().await?; // create table in catalog

// base and query plans (optimize to normalize)
let base_sql = make_base_sql(table_name, ncols);
let query_sql = make_query_sql(table_name, ncols);

let base_df = ctx.sql(&base_sql).await?;
let base_plan = base_df.into_optimized_plan()?;

let query_df = ctx.sql(&query_sql).await?;
let query_plan = query_df.into_optimized_plan()?;

// get table provider (Arc<dyn TableProvider>)
let table_ref = TableReference::bare(table_name);
let provider: Arc<dyn TableProvider> = ctx.table_provider(table_ref.clone()).await?;

Ok((base_plan, query_plan, provider))
})
}

// Criterion benchmark
fn criterion_benchmark(c: &mut Criterion) {
// columns to test
let col_cases = vec![1usize, 10, 20, 40, 80, 160, 320];

// build a tokio runtime that's broadly compatible
let rt = Builder::new_current_thread()
.enable_all()
.build()
.expect("tokio runtime");

let mut group = c.benchmark_group("view_matcher_spj");
group.warm_up_time(Duration::from_secs(1));
group.measurement_time(Duration::from_secs(5));
group.sample_size(30);

for &ncols in &col_cases {
// Build fixture
let (base_plan, query_plan, provider) =
build_fixture_for_cols(&rt, ncols).expect("fixture");

// Measure SpjNormalForm::new for base_plan and query_plan separately
let id_base = BenchmarkId::new("spj_normal_form_new", format!("cols={}", ncols));
group.throughput(Throughput::Elements(1));
group.bench_with_input(id_base, &base_plan, |b, plan| {
b.iter(|| {
let _nf = SpjNormalForm::new(plan).unwrap();
});
});

let id_query_nf = BenchmarkId::new("spj_normal_form_new_query", format!("cols={}", ncols));
group.bench_with_input(id_query_nf, &query_plan, |b, plan| {
b.iter(|| {
let _nf = SpjNormalForm::new(plan).unwrap();
});
});

// Precompute normal forms once (to measure rewrite_from cost only)
let base_nf = SpjNormalForm::new(&base_plan).expect("base_nf");
let query_nf = SpjNormalForm::new(&query_plan).expect("query_nf");

// qualifier for rewrite_from and a source created from the provider
let qualifier = TableReference::bare("mv");
let source = provider_as_source(Arc::clone(&provider));

// Benchmark rewrite_from (this is the heavy check)
let id_rewrite = BenchmarkId::new("rewrite_from", format!("cols={}", ncols));
group.bench_with_input(id_rewrite, &ncols, |b, &_n| {
b.iter(|| {
let _res = query_nf.rewrite_from(&base_nf, qualifier.clone(), Arc::clone(&source));
});
});
}

group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
4 changes: 3 additions & 1 deletion deny.toml
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@ allow = [
"BSD-3-Clause",
"CC0-1.0",
"Unicode-3.0",
"Zlib"
"Zlib",
"ISC",
"bzip2-1.0.6"
]
version = 2
35 changes: 34 additions & 1 deletion src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,37 @@

#![deny(missing_docs)]

//! `datafusion-materialized-views` implements algorithms and functionality for materialized views in DataFusion.
//! # datafusion-materialized-views
//!
//! `datafusion-materialized-views` provides robust algorithms and core functionality for working with materialized views in [DataFusion](https://arrow.apache.org/datafusion/).
//!
//! ## Key Features
//!
//! - **Incremental View Maintenance**: Efficiently tracks dependencies between Hive-partitioned tables and their materialized views, allowing users to determine which partitions need to be refreshed when source data changes. This is achieved via UDTFs such as `mv_dependencies` and `stale_files`.
//! - **Query Rewriting**: Implements a view matching optimizer that rewrites queries to automatically leverage materialized views when beneficial, based on the techniques described in the [paper](https://dsg.uwaterloo.ca/seminars/notes/larson-paper.pdf).
//! - **Pluggable Metadata Sources**: Supports custom metadata sources for incremental view maintenance, with default support for object store metadata via the `FileMetadata` and `RowMetadataRegistry` components.
//! - **Extensible Table Abstractions**: Defines traits such as `ListingTableLike` and `Materialized` to abstract over Hive-partitioned tables and materialized views, enabling custom implementations and easy registration for use in the maintenance and rewriting logic.
//!
//! ## Typical Workflow
//!
//! 1. **Define and Register Views**: Implement a custom table type that implements the `Materialized` trait, and register it using `register_materialized`.
//! 2. **Metadata Initialization**: Set up `FileMetadata` and `RowMetadataRegistry` to track file-level and row-level metadata.
//! 3. **Dependency Tracking**: Use the `mv_dependencies` UDTF to generate build graphs for materialized views, and `stale_files` to identify partitions that require recomputation.
//! 4. **Query Optimization**: Enable the query rewriting optimizer to transparently rewrite queries to use materialized views where possible.
//!
//! ## Example
//!
//! See the README and integration tests for a full walkthrough of setting up and maintaining a materialized view, including dependency tracking and query rewriting.
//!
//! ## Limitations
//!
//! - Currently supports only Hive-partitioned tables in object storage, with the smallest update unit being a file.
//! - Future work may generalize to other storage backends and partitioning schemes.
//!
//! ## References
//!
//! - [Optimizing Queries Using Materialized Views: A Practical, Scalable Solution](https://dsg.uwaterloo.ca/seminars/notes/larson-paper.pdf)
//! - [DataFusion documentation](https://datafusion.apache.org/)

/// Code for incremental view maintenance against Hive-partitioned tables.
///
Expand All @@ -42,6 +72,9 @@
pub mod materialized;

/// An implementation of Query Rewriting, an optimization that rewrites queries to make use of materialized views.
///
/// The implementation is based heavily on [this paper](https://dsg.uwaterloo.ca/seminars/notes/larson-paper.pdf),
/// *Optimizing Queries Using Materialized Views: A Practical, Scalable Solution*.
pub mod rewrite;

/// Configuration options for materialized view related features.
Expand Down
Loading
Loading