diff --git a/content/blog/2026-06-12-datafusion-54.0.0.md b/content/blog/2026-06-12-datafusion-54.0.0.md new file mode 100644 index 00000000..6febb8d2 --- /dev/null +++ b/content/blog/2026-06-12-datafusion-54.0.0.md @@ -0,0 +1,442 @@ +--- +layout: post +title: Apache DataFusion 54.0.0 Released +date: 2026-06-12 +author: pmc +categories: [release] +--- + + + +[TOC] + +We are proud to announce the release of [DataFusion 54.0.0]. This post highlights +some of the major improvements since [DataFusion 53.0.0]. Notable additions +include `LATERAL` joins, SQL lambda functions, and a new Avro reader, alongside +significant join, scan, and planning performance improvements. The complete list +of changes is available in the [changelog]. This release represents roughly 11 +weeks of development and 740 commits. Thanks to the [139 contributors] +(a new record!) for making it possible. + +[DataFusion 54.0.0]: https://crates.io/crates/datafusion/54.0.0 +[DataFusion 53.0.0]: https://datafusion.apache.org/blog/2026/04/02/datafusion-53.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-54/dev/changelog/54.0.0.md +[139 contributors]: https://github.com/apache/datafusion/blob/branch-54/dev/changelog/54.0.0.md#credits + +## Performance Improvements 🚀 + + + +**Figure 1**: Average and median normalized execution times for DataFusion 54.0.0 on ClickBench queries, compared to previous releases. +Query times are normalized using the ClickBench definition. See the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/) +for more details. + +We continue to make significant performance improvements in DataFusion, as +explained below. This release prunes more redundant work out of plans and +makes joins, repartitioning, scans, and many built-in functions faster. + +### Execution Operator Improvements + +**Physical Execution of Uncorrelated Scalar Subqueries**: +DataFusion previously executed an uncorrelated scalar subquery (one that doesn't +depend on the outer query) by rewriting it into a join. DataFusion 54 instead +evaluates it once with a new physical operator. This lets functions use their +specialized scalar code paths, and allows uncorrelated scalar subqueries in +`ORDER BY`, `JOIN ON`, and as arguments to aggregate functions. +Thanks to [@neilconway] for implementing this feature, with reviews from +[@Dandandan], [@alamb], and [@timsaucer]. Related PRs: [#21240] + +**Faster Sort-Merge Joins**: +Semi, anti, and mark joins now track matches with a per-row bitset instead of +materializing `(outer, inner)` pairs. Batched deferred filtering makes +near-unique `LEFT` and `FULL` joins 20-50x faster. Finally, join-key comparisons +now use a `DynComparator` that resolves the column type once rather than per row, +making microbenchmarks up to 12% faster and TPC-H ~5% faster overall. +Thanks to [@mbutrovich] for this work, with reviews from [@Dandandan], +[@comphead], and [@rluvaton]. Related PRs: [#20806], [#21184], [#21484], [#21517] + +**Faster Repartitioning**: +`RepartitionExec` now coalesces batches before sending them to distributor +channels, cutting per-batch overhead for up to 50% faster execution on some +repartition-heavy queries. +Thanks to [@gabotechs] for this work, with reviews from [@Dandandan] and +[@alamb]. Related PRs: [#22010] + +**Faster Functions and Hashing**: +DataFusion ships hundreds of built-in functions, so speeding them up pays off +across many workloads. This release optimizes many, including [array_to_string], +[array_concat], [array_sort], [split_part], [substr], [strpos], [left], [right], +[string_agg], and [approx_distinct], plus better `NULL` handling across many +array and datetime functions. The `first_value` and `last_value` aggregates are +also substantially faster over `Utf8` and `Binary` columns thanks to a new +`GroupsAccumulator` ([#21090]). DataFusion 54 also swaps `ahash` for [foldhash] in +`datafusion-common`, and optimizes [regexp_replace] by stripping trailing `.*` +from anchored patterns. +Thanks to the many contributors who drove this work, especially [@UBarney], +[@neilconway], [@Dandandan], [@zhangxffff], [@lyne7-sc], [@CuteChuanChuan], +[@kumarUjjawal], and [@coderfender]. + +### Planner Improvements + +**Pruning Functionally Redundant Sort Keys**: +Sorting is expensive, so it pays to sort by as few columns as possible. +DataFusion 54 now drops functionally redundant `ORDER BY` keys: when an earlier +key determines a later one, the later key can't change the ordering, so removing +it cuts sorting cost without affecting results. +Thanks to [@xiedeyantu] for implementing this feature, with reviews from +[@alamb] and [@neilconway]. Related PRs: [#21362] + +**Skip Redundant Parquet Filters**: +When statistics prove a filter matches every row in a Parquet row group, +DataFusion now skips evaluating it — both row filters and page-level pruning — +for that row group instead of re-checking each row. +Thanks to [@xudong963] for implementing this, building on a suggestion from +[@crepererum]. Related issues and PRs: [#19028], [#21637] + +**Statistics-Driven Sort Pushdown and TopK**: +Files and Parquet row groups are now ordered using statistics, which can avoid +sorting entirely and improve dynamic filtering and early stopping for TopK +(`ORDER BY ... LIMIT`) queries. The most promising data is read first, often +satisfying the `LIMIT` before scanning the rest. +Thanks to [@zhuqi-lucas] for driving this work, with reviews from [@adriangb]. +Related PRs: [#21182], [#21426], [#21956] + +**Improved Statistics and Cardinality Estimation**: +Good plans depend on good statistics. This release extracts NDV (number of +distinct values) statistics from Parquet metadata, uses NDV for equality-filter +selectivity, adds a pluggable [StatisticsRegistry] for operator-level statistics +propagation, and improves cardinality estimation for semi and anti joins. +Thanks to [@asolimando], [@jonathanc-n], and [@buraksenn] for driving this work. +Related PRs: [#19957], [#20789], [#21077], [#21081], [#21483], [#20904] + +### Scan Improvements + +**Morsel-Driven Parquet Scans**: +Parquet scan parallelism was previously bounded by the slowest scan thread, so +data skew (large row groups, less-selective filters, or variable object store +latency) left cores underutilized. DataFusion 54 reworks the Parquet scan around +a [morsel-driven design], where idle threads dynamically pull small units of work +("morsels") instead of each being assigned a fixed partition up front. This +spreads work more evenly and can be up to ~2x faster for skewed scans such as +ClickBench. +Thanks to [@Dandandan], [@alamb], [@adriangb], [@xudong963], and [@zhuqi-lucas] +for collaborating on this substantial effort. Related issues and PRs: [#20529], +[#21327], [#21342], [#21351] + +**Struct Field Filter Pushdown and Leaf-Level Projection**: +Filters on struct fields (e.g. `WHERE s['foo'] > 67`) are now pushed down into the +Parquet decoder rather than evaluated after a full scan, and both filtering and +projection read only the struct leaves they actually access, +significantly improving performance for nested and `Variant` data in large +Parquet files. +Thanks to [@friendlymatthew] for this work, with reviews from [@adriangb], +[@cetra3], and [@AdamGS]. Related PRs: [#20822], [#20854], [#20925] + +## New Features ✨ + +### `LATERAL` Joins + +Lateral joins have been long requested ([#10048]). DataFusion 54 adds basic +support for `CROSS JOIN LATERAL`, `INNER JOIN LATERAL`, and `LEFT JOIN LATERAL` +([#21202], [#21352]). A lateral subquery in the `FROM` clause can reference +columns from preceding tables — handy for expanding a per-row series or +correlating against a set-returning function. It uses decorrelation, so the +subquery is evaluated once rather than re-executed per outer row. + +```sql +-- For each row in t1, expand a series 1..t1_int and join the values back +SELECT t1_id, t1_name, i +FROM join_t1 t1 +CROSS JOIN LATERAL ( + SELECT * FROM unnest(generate_series(1, t1_int)) +) AS series(i); +``` + +Thanks to [@neilconway] for implementing this feature, with reviews from +[@Dandandan], [@alamb], and [@crm26]. + +### Lambda Functions + +DataFusion now supports lambda expressions (`x -> expr`) with column capture, +plus new higher-order array UDFs like [array_transform], [array_filter], and +[array_any_match] ([#21323], [#21679]). Lambdas express per-element computation +directly in SQL: + +```sql +-- Apply `x * 10` to every element +SELECT array_transform([1, 2, 3, 4, 5], x -> x * 10); +-- [10, 20, 30, 40, 50] + +-- Keep only elements where `x > 2` +SELECT array_filter([1, 2, 3, 4, 5], x -> x > 2); +-- [3, 4, 5] + +-- True if any element satisfies `x > 2` +SELECT array_any_match([1, 2, 3], x -> x > 2); +-- true +``` + +Lambdas compose, so you can filter then transform in one expression: + +```sql +-- Keep elements > 2, then multiply each survivor by 10 +SELECT array_transform(array_filter([1, 2, 3, 4, 5], x -> x > 2), x -> x * 10); +-- [30, 40, 50] +``` + +Thanks to [@gstvg] and [@rluvaton] for leading this effort (first prototyped in +[#18921]), [@ologlogn] and [@LiaCastaneda] for the [array_filter] and +[array_any_match] functions, and [@benbellick], [@comphead], [@martin-g], +[@pepijnve], and [@shehabgamin] for reviews. + +### Spilling Nested Loop Joins + +`NestedLoopJoinExec` previously failed with an out-of-memory error when the build +side exceeded the memory budget. DataFusion 54 adds a memory-limited path that +transparently spills to disk and completes the query instead ([#21448]), with +zero overhead when memory is sufficient. It currently covers `INNER`, `LEFT`, +`LEFT SEMI`, `LEFT ANTI`, and `LEFT MARK` joins. Thanks to [@viirya] for +implementing this feature, with reviews from [@2010YOUY01]. + +### New Avro Reader + +The Avro reader now uses the `arrow-avro` crate ([#17861]), replacing internal +conversion code with a faster, better-maintained implementation shared with the +Arrow ecosystem (see [Introducing arrow-avro]). Thanks to [@getChan] for this +work, with reviews from [@adriangb], [@alamb], and [@jecsand838]. + +[Introducing arrow-avro]: https://arrow.apache.org/blog/2025/10/23/introducing-arrow-avro/ + +### Extension Type Registry + +Arrow extension types let users layer their own semantics on top of a physical +storage type. DataFusion 54 adds a registry for registering their behavior +([#18223], [#20312]), several more canonical extension types ([#21291]), and the +ability to cast to an extension type in logical expressions ([#18136]). Thanks to +[@tschwarzinger] and [@paleolimbot] for driving this work, with reviews from +[@adriangb] and [@cetra3]. + +### Content-Defined Chunking for Parquet + +DataFusion's Parquet writer can now use content-defined chunking (CDC) +([#21110]), which aligns data page boundaries with the data rather than fixed row +counts. This improves deduplication and incremental storage, since inserting or +editing a few rows no longer shifts every later page boundary. For background, +see [Parquet Content-Defined Chunking] and [Improving Parquet Dedupe on Hugging +Face Hub]. Thanks to [@kszucs] for this feature, with reviews from [@alamb]. + +[Parquet Content-Defined Chunking]: https://huggingface.co/blog/parquet-cdc +[Improving Parquet Dedupe on Hugging Face Hub]: https://huggingface.co/blog/improve_parquet_dedupe + +### New Functions + +**SQL and Scalar Functions**: +DataFusion 54 adds new scalar functions including [array_compact], +[cosine_distance], [inner_product], [array_normalize], [cast_to_type], and +[with_metadata], plus nanosecond `date_part` support ([#20674]) and the `:` JSON +access operator ([#20628]). The [cosine_distance], [inner_product], and +[array_normalize] additions round out DataFusion's vector-search building blocks. Thanks to [@comphead], [@crm26], +[@adriangb], [@mhilton], and [@Samyak2] for these contributions. + +**Spark-Compatible Functions**: +The [datafusion-spark crate] gains many new or improved Spark-compatible +functions, including [round], [floor], [ceil], [soundex], [xxhash64], +[array_contains], [array_compact], int/float-to-timestamp casts, and UTF-8 +validation functions. +Thanks to the contributors who drove this work, especially [@comphead], +[@coderfender], [@SubhamSinghal], [@kazantsev-maksim], [@andygrove], [@buraksenn], +[@davidlghellin], [@athlcode], and [@shivbhatia10]. + +## Upgrade Guide and Changelog 📖 + +Upgrading to 54.0.0 should be straightforward for most users, though there are +some breaking changes. See the [Upgrade Guide] for details and +migration snippets, and the [changelog] for the full list of changes. + +## About DataFusion + +[Apache DataFusion] is an extensible query engine, written in [Rust], that uses +[Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast, data-centric systems such as databases, dataframe libraries, +and machine learning and streaming applications. While [DataFusion's primary +design goal] is to accelerate the creation of other data-centric systems, it +provides a reasonable experience directly out of the box as a [dataframe +library], [Python library], and [command-line SQL tool]. + +DataFusion's core thesis is that, as a community, together we can build much +more advanced technology than any of us as individuals or companies could build +alone. Without DataFusion, highly performant vectorized query engines would +remain the domain of a few large companies and world-class research +institutions. With DataFusion, we can all build on top of a shared foundation +and focus on what makes our projects unique. + +## How to Get Involved + +DataFusion is not a project built or driven by a single person, company, or +foundation. Rather, our community of users and contributors works together to +build a shared technology that none of us could have built alone. + +If you are interested in joining us, we would love to have you. You can try out +DataFusion on some of your own data and projects and let us know how it goes, +contribute suggestions, documentation, bug reports, or a PR with documentation, +tests, or code. A list of open issues suitable for beginners is [here], and you +can find out how to reach us on the [communication doc]. + +[Apache DataFusion]: https://datafusion.apache.org/ +[Rust]: https://www.rust-lang.org/ +[Apache Arrow]: https://arrow.apache.org +[DataFusion's primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html +[Python library]: https://datafusion.apache.org/python/ +[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ +[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html + +[@xiedeyantu]: https://github.com/xiedeyantu +[@zhuqi-lucas]: https://github.com/zhuqi-lucas +[@neilconway]: https://github.com/neilconway +[@mbutrovich]: https://github.com/mbutrovich +[@gabotechs]: https://github.com/gabotechs +[@Dandandan]: https://github.com/Dandandan +[@zhangxffff]: https://github.com/zhangxffff +[@lyne7-sc]: https://github.com/lyne7-sc +[@CuteChuanChuan]: https://github.com/CuteChuanChuan +[@kumarUjjawal]: https://github.com/kumarUjjawal +[@coderfender]: https://github.com/coderfender +[@gstvg]: https://github.com/gstvg +[@rluvaton]: https://github.com/rluvaton +[@getChan]: https://github.com/getChan +[@tschwarzinger]: https://github.com/tschwarzinger +[@paleolimbot]: https://github.com/paleolimbot +[@kszucs]: https://github.com/kszucs +[@comphead]: https://github.com/comphead +[@crm26]: https://github.com/crm26 +[@adriangb]: https://github.com/adriangb +[@mhilton]: https://github.com/mhilton +[@Samyak2]: https://github.com/Samyak2 +[@SubhamSinghal]: https://github.com/SubhamSinghal +[@kazantsev-maksim]: https://github.com/kazantsev-maksim +[@andygrove]: https://github.com/andygrove +[@buraksenn]: https://github.com/buraksenn +[@davidlghellin]: https://github.com/davidlghellin +[@asolimando]: https://github.com/asolimando +[@jonathanc-n]: https://github.com/jonathanc-n +[@alamb]: https://github.com/alamb +[@timsaucer]: https://github.com/timsaucer +[@martin-g]: https://github.com/martin-g +[@jecsand838]: https://github.com/jecsand838 +[@cetra3]: https://github.com/cetra3 +[@xudong963]: https://github.com/xudong963 +[@crepererum]: https://github.com/crepererum +[@viirya]: https://github.com/viirya +[@2010YOUY01]: https://github.com/2010YOUY01 +[@UBarney]: https://github.com/UBarney +[@friendlymatthew]: https://github.com/friendlymatthew +[@AdamGS]: https://github.com/AdamGS +[@athlcode]: https://github.com/athlcode +[@shivbhatia10]: https://github.com/shivbhatia10 +[@ologlogn]: https://github.com/ologlogn +[@LiaCastaneda]: https://github.com/LiaCastaneda +[@benbellick]: https://github.com/benbellick +[@pepijnve]: https://github.com/pepijnve +[@shehabgamin]: https://github.com/shehabgamin + +[array_to_string]: https://github.com/apache/datafusion/pull/20639 +[array_concat]: https://github.com/apache/datafusion/pull/20620 +[array_sort]: https://github.com/apache/datafusion/pull/21083 +[split_part]: https://github.com/apache/datafusion/pull/21119 +[substr]: https://github.com/apache/datafusion/pull/21366 +[strpos]: https://github.com/apache/datafusion/pull/20754 +[left]: https://github.com/apache/datafusion/pull/21442 +[right]: https://github.com/apache/datafusion/pull/21442 +[string_agg]: https://github.com/apache/datafusion/pull/21154 +[approx_distinct]: https://github.com/apache/datafusion/pull/21037 +[foldhash]: https://github.com/apache/datafusion/pull/20958 +[regexp_replace]: https://github.com/apache/datafusion/pull/21379 +[array_transform]: https://github.com/apache/datafusion/pull/21679 +[array_filter]: https://github.com/apache/datafusion/pull/21895 +[array_any_match]: https://github.com/apache/datafusion/pull/21903 +[array_compact]: https://github.com/apache/datafusion/pull/21522 +[cosine_distance]: https://github.com/apache/datafusion/pull/21542 +[inner_product]: https://github.com/apache/datafusion/pull/21861 +[array_normalize]: https://github.com/apache/datafusion/pull/22013 +[cast_to_type]: https://github.com/apache/datafusion/pull/21322 +[with_metadata]: https://github.com/apache/datafusion/pull/21509 + +[round]: https://github.com/apache/datafusion/pull/21062 +[floor]: https://github.com/apache/datafusion/pull/21933 +[ceil]: https://github.com/apache/datafusion/pull/20593 +[soundex]: https://github.com/apache/datafusion/pull/20725 +[xxhash64]: https://github.com/apache/datafusion/pull/21967 +[array_contains]: https://github.com/apache/datafusion/pull/20685 + +[morsel-driven design]: https://db.in.tum.de/~leis/papers/morsels.pdf +[StatisticsRegistry]: https://docs.rs/datafusion/latest/datafusion/physical_plan/operator_statistics/struct.StatisticsRegistry.html + +[#17861]: https://github.com/apache/datafusion/pull/17861 +[#18921]: https://github.com/apache/datafusion/pull/18921 +[#10048]: https://github.com/apache/datafusion/issues/10048 +[#20674]: https://github.com/apache/datafusion/pull/20674 +[#20628]: https://github.com/apache/datafusion/pull/20628 +[#19028]: https://github.com/apache/datafusion/issues/19028 +[#20529]: https://github.com/apache/datafusion/issues/20529 +[#21327]: https://github.com/apache/datafusion/pull/21327 +[#21342]: https://github.com/apache/datafusion/pull/21342 +[#21351]: https://github.com/apache/datafusion/pull/21351 +[#21448]: https://github.com/apache/datafusion/pull/21448 +[#21637]: https://github.com/apache/datafusion/pull/21637 +[#21090]: https://github.com/apache/datafusion/pull/21090 +[#20822]: https://github.com/apache/datafusion/pull/20822 +[#20854]: https://github.com/apache/datafusion/pull/20854 +[#20925]: https://github.com/apache/datafusion/pull/20925 +[#18223]: https://github.com/apache/datafusion/issues/18223 +[#18136]: https://github.com/apache/datafusion/pull/18136 +[#20312]: https://github.com/apache/datafusion/pull/20312 +[#20789]: https://github.com/apache/datafusion/pull/20789 +[#20806]: https://github.com/apache/datafusion/pull/20806 +[#20904]: https://github.com/apache/datafusion/pull/20904 +[#19957]: https://github.com/apache/datafusion/pull/19957 +[#21077]: https://github.com/apache/datafusion/pull/21077 +[#21081]: https://github.com/apache/datafusion/pull/21081 +[#21110]: https://github.com/apache/datafusion/pull/21110 +[#21182]: https://github.com/apache/datafusion/pull/21182 +[#21184]: https://github.com/apache/datafusion/pull/21184 +[#21202]: https://github.com/apache/datafusion/pull/21202 +[#21240]: https://github.com/apache/datafusion/pull/21240 +[#21291]: https://github.com/apache/datafusion/pull/21291 +[#21323]: https://github.com/apache/datafusion/pull/21323 +[#21352]: https://github.com/apache/datafusion/pull/21352 +[#21362]: https://github.com/apache/datafusion/pull/21362 +[#21426]: https://github.com/apache/datafusion/pull/21426 +[#21483]: https://github.com/apache/datafusion/pull/21483 +[#21484]: https://github.com/apache/datafusion/pull/21484 +[#21517]: https://github.com/apache/datafusion/pull/21517 +[#21679]: https://github.com/apache/datafusion/pull/21679 +[#21956]: https://github.com/apache/datafusion/pull/21956 +[#22010]: https://github.com/apache/datafusion/pull/22010 + +[datafusion-spark crate]: https://docs.rs/datafusion-spark/latest/datafusion_spark/index.html diff --git a/content/images/datafusion-54.0.0/performance_over_time_clickbench.png b/content/images/datafusion-54.0.0/performance_over_time_clickbench.png new file mode 100644 index 00000000..d2f039fd Binary files /dev/null and b/content/images/datafusion-54.0.0/performance_over_time_clickbench.png differ