From 20e3d3cdf5e103de10c65df286c723d4024bc333 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 09:59:35 -0600 Subject: [PATCH 01/20] blog: add Apache DataFusion Comet 0.17.0 release post Draft release announcement for Comet 0.17.0, focused on the JVM codegen dispatcher, expanded expression coverage, and the native-shuffle FFI round-trip removal, told through the Arrow-native framing. Stats and date are placeholders to finalize at release. --- .../2026-06-16-datafusion-comet-0.17.0.md | 205 ++++++++++++++++++ 1 file changed, 205 insertions(+) create mode 100644 content/blog/2026-06-16-datafusion-comet-0.17.0.md diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md new file mode 100644 index 00000000..1266628d --- /dev/null +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -0,0 +1,205 @@ +--- +layout: post +title: Apache DataFusion Comet 0.17.0 Release +date: 2026-06-16 +author: pmc +categories: [subprojects] +--- + + + +[TOC] + + + +The Apache DataFusion PMC is pleased to announce version 0.17.0 of the [Comet](https://datafusion.apache.org/comet/) subproject. + +Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for +improved performance and efficiency without requiring any code changes. + +This release covers approximately five weeks of development work and is the result of merging XXX PRs from 19 +contributors. See the [change log] for more information. + +[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.17.0.md + +## Arrow-Native, End to End + +Comet's value proposition has always been "keep your Spark data columnar and skip the per-row overhead of +Spark's row-based engine," but the way we describe it has not always kept pace with how Comet actually works. +This release leans into a clearer framing: Comet keeps Spark queries **Arrow-native end to end**. Operators, +expressions, shuffle, and broadcast all stay in Apache Arrow columnar format, avoiding the cost of +materializing and transitioning row-by-row data. + +Within that Arrow-native pipeline, the work of an operator or expression runs in one of two ways: + +- **Rust-implemented**: native Rust code, executed through Apache DataFusion. This is what most people picture + when they think of Comet. +- **JVM-implemented**: Scala or Java code that operates directly on Arrow batches, including expressions + produced by Spark's own code generation. + +The implementation language is an internal detail. From the query's perspective the data never leaves Arrow +columnar format, so there is no per-row materialization cost at the boundary between a Rust-implemented and a +JVM-implemented step. This distinction matters for 0.17.0 because the JVM-implemented path, driven by Comet's +codegen dispatcher, is where a large share of this release's new capability lands. + +The framing is now reflected in Comet's documentation: the README leads with the Arrow-native value +proposition, and the older internal scan names (`native_datafusion`, `native_iceberg_compat`) have been +retired in favor of clearer terminology. + +## JVM Codegen Dispatch + +The headline feature of 0.17.0 is the maturation of Comet's **JVM codegen dispatcher**. + +Comet has long fallen back to Spark whenever an expression had no native Rust implementation, or where the +Rust implementation could diverge from Spark on edge cases. A fallback is correct, but it is expensive: the +surrounding project, exchange, and sort operators drop out of the Comet pipeline, and the data takes a +columnar-to-row round trip into Spark and back. + +The codegen dispatcher offers a third option. Instead of falling back, Comet runs the Spark expression's own +generated code (`doGenCode`) inside the Comet pipeline, operating directly on Arrow batches. The result is a +JVM-implemented Arrow-native expression: the query stays in the pipeline, and because the expression is +evaluated by Spark's own code, the result is guaranteed to match Spark exactly across every supported Spark +version. When the dispatcher is disabled, Comet falls back cleanly as before. + +This release puts the dispatcher to work across a wide surface: + +- **Scala and Java UDFs, now enabled by default.** Eligible Spark `ScalaUDF` expressions are routed through + the dispatcher and executed inside the Comet pipeline, so a project around a UDF no longer forces a + fallback and a columnar-to-row round trip. The path has broad type coverage (scalars, arbitrarily nested + complex types, and higher-order functions) and is backed by end-to-end, fuzz, and Iceberg test coverage. + It can be disabled with `spark.comet.exec.scalaUDF.codegen.enabled=false`. +- **100% Spark-compatible regular expressions.** Regex expressions now dispatch to Spark's own + implementation, eliminating the long-standing compatibility gaps of a separate native regex engine. +- **100% Spark-compatible JSON functions.** JSON expression handling follows the same approach, matching + Spark's behavior precisely. +- **More scalar and structured-text functions**, including a batch of math and string functions, AES + encryption and decryption, `Upper` / `Lower` / `InitCap`, `GetTimestamp`, and an expanded set of date and + time expressions. + +Just as importantly, 0.17.0 changes how Comet treats expressions whose native Rust path is known to diverge +from Spark (marked `Incompatible`). Previously such an expression forced the entire projection back to Spark +unless the user opted into the divergent native behavior. Now, when `spark.comet.expr.allowIncompatible` is +left at its default of `false`, the expression is routed through the codegen dispatcher and evaluated +correctly inside Comet rather than triggering a fallback. `allowIncompatible=true` becomes a pure performance +knob for users who accept the faster native path's divergence. Expressions such as `from_unixtime`, and the +`TimestampNTZ` branches of `hour`, `minute`, and `second`, now stay in the pipeline by default. + +## Expanded Expression Coverage + +Partly through the codegen dispatcher and partly through new Rust implementations, Comet's expression coverage +grew substantially in this cycle. More than 120 Spark expressions have gained support since 0.16.0, spanning +nearly every function family: + +- **Date and time** (~25): `convert_timezone`, `make_date`, `months_between`, `next_day`, + `from_utc_timestamp` / `to_utc_timestamp`, `date_from_unix_date`, the `timestamp_*` and `unix_*` second / + milli / micro conversions, and the current date/time/timezone functions. +- **Math** (~24): `acosh`, `asinh`, `atanh`, `cbrt`, `csc`, `sec`, `hypot`, `log1p`, `bin`, `conv`, + `factorial`, `pmod`, `width_bucket`, `rint`, and more. +- **String** (~16): `elt`, `find_in_set`, `format_number`, `format_string`, `levenshtein`, `locate`, + `overlay`, `soundex`, `split`, `substring_index`, `unbase64`, `to_char`, and `to_number`. +- **XPath** (9): the full `xpath`, `xpath_boolean`, `xpath_double`, `xpath_int`, `xpath_long`, `xpath_string` + family. +- **JSON, CSV, and XML** (9): `from_csv`, `to_csv`, `schema_of_csv`, `schema_of_json`, `json_object_keys`, + `json_array_length`, `from_xml`, `to_xml`, and `schema_of_xml`. +- **Array and map** (11): `array_position`, `array_size`, `arrays_zip`, `slice`, `sort_array`, `sequence`, + `map_concat`, `map_contains_key`, and `map_from_entries`. +- **Aggregate and window** (7): `any_value`, `count_if`, the `regr_*` regression aggregates, plus `lag` and + `lead`. +- **Conditional and null handling** (7): `greatest`, `least`, `nullif`, `ifnull` / `nvl`, `nvl2`, and + `equal_null`. + +For the authoritative, always-current status of every Spark built-in expression, see the +[Spark Expression Support](https://datafusion.apache.org/comet/user-guide/latest/expressions.html) reference. + +## Performance + +### Removing an FFI Round Trip from Native Shuffle + +The most significant performance change in 0.17.0 targets the shuffle write path. When a native subtree feeds +a Comet shuffle, Comet previously ran two separate native iterators per partition: one for the upstream +subtree, and a second rooted at a synthetic `Scan("ShuffleWriterInput") -> ShuffleWriter` that consumed the +batches back. The JVM never actually read this data, so the native-to-JVM-and-back hop was pure overhead. + +Although the Arrow C Data Interface is zero-copy, this particular round trip was not. The synthetic scan left +its batches marked as not FFI-safe, so on import every batch was deep-copied into freshly allocated buffers. +0.17.0 collapses the two iterators into a single native plan rooted at the shuffle writer, with the upstream +subtree as a direct child. That removes the Arrow FFI export and import, the per-batch deep copy, and one +`createPlan` / `releasePlan` pair per partition, while preserving all of the existing per-partition setup +(broadcast alignment, subqueries, encryption) and input metrics reporting. This is the Arrow-native principle +in practice: the win came not from faster copying, but from removing a data-format boundary that should never +have cost anything. + +### Lower Per-Batch Overhead in Arrow Vectors + +Two changes reduce repeated work when reading Arrow columns across the JNI boundary. Comet now caches the +validity buffer address on `CometDecodedVector` and the offset buffer address for variable-width vectors on +`CometPlainVector`, so these addresses are resolved once per vector rather than on every access. + +### Faster Statistical Aggregates + +The variance, standard deviation, covariance, and correlation aggregates now use DataFusion's +`GroupsAccumulator` interface, which is substantially more efficient for grouped aggregation than the +row-accumulator path they used previously. + +Additional smaller improvements include bulk-NULL handling in `split` and `substring` (skipping a per-row +allocation), and a new `interleave_time` shuffle metric with tuned output buffer sizing to make shuffle cost +easier to attribute. + +## Correctness and Test Coverage + +Much of the correctness work this cycle is part of a deliberate push toward an eventual **1.0.0 release**. +Reaching 1.0.0 means being able to state precisely, and stand behind, exactly which Spark expressions Comet +accelerates and how faithfully it matches Spark on each one. Two efforts in 0.17.0 move directly toward that +goal. + +First, this cycle included a systematic audit of Comet's expression implementations against Apache Spark +3.4.3, 3.5.8, 4.0.1, and 4.1.1, covering the hash, JSON, collection, map, predicate, bitwise, conditional, +array, struct, math, and cast expression families, along with cast behavior. Each audit compared Comet's +behavior to Spark across all four versions and expanded test coverage where gaps were found. + +Second, the codegen dispatcher raises the correctness floor structurally: for every dispatched expression, +results are guaranteed to match Spark exactly because Spark's own generated code does the evaluation. Together +with the audits, this gives a meaningful increase in confidence that Comet matches Spark across the supported +version matrix, and it sharpens the Spark Expression Support reference into a status page the project can +commit to as it approaches 1.0.0. + +As always, most cross-version behavior differences were caught because Comet runs the full Apache Spark SQL +test suite against each supported Spark version as part of CI. + +## Compatibility + +Supported platforms include: + +- **Spark 3.4.3** with Java 11/17 and Scala 2.12/2.13 +- **Spark 3.5.8** with Java 11/17 and Scala 2.12/2.13 +- **Spark 4.0.2** with Java 17 and Scala 2.13 +- **Spark 4.1.1** with Java 17 and Scala 2.13 + +See the [Spark Version Compatibility] page for known limitations specific to each version. + +[Spark Version Compatibility]: https://datafusion.apache.org/comet/user-guide/latest/compatibility/spark-versions.html + +This release builds on **DataFusion 53.1** and **Arrow 58.3**. + +## Get Started with Comet 0.17.0 + +Ready to try it out? Follow the [Comet 0.17.0 Installation Guide](https://datafusion.apache.org/comet/user-guide/0.17/installation.html) +to get up and running, then point Comet at your existing Spark workloads, including Scala and Java UDFs and +Spark 4 with ANSI mode enabled, and see the speedup for yourself. From dd7ab9a40aeec0e92ef7790e82c40c72a12a6c52 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 10:05:10 -0600 Subject: [PATCH 02/20] blog: focus arrow-native section on capabilities, drop framing meta-commentary --- .../2026-06-16-datafusion-comet-0.17.0.md | 20 +++++++------------ 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 1266628d..97022bc9 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -41,11 +41,9 @@ contributors. See the [change log] for more information. ## Arrow-Native, End to End -Comet's value proposition has always been "keep your Spark data columnar and skip the per-row overhead of -Spark's row-based engine," but the way we describe it has not always kept pace with how Comet actually works. -This release leans into a clearer framing: Comet keeps Spark queries **Arrow-native end to end**. Operators, -expressions, shuffle, and broadcast all stay in Apache Arrow columnar format, avoiding the cost of -materializing and transitioning row-by-row data. +Comet keeps Spark queries **Arrow-native end to end**: operators, expressions, shuffle, and broadcast all stay +in Apache Arrow columnar format, avoiding the per-row overhead that Spark's row-based engine incurs from +materializing and transitioning data one row at a time. Within that Arrow-native pipeline, the work of an operator or expression runs in one of two ways: @@ -54,14 +52,10 @@ Within that Arrow-native pipeline, the work of an operator or expression runs in - **JVM-implemented**: Scala or Java code that operates directly on Arrow batches, including expressions produced by Spark's own code generation. -The implementation language is an internal detail. From the query's perspective the data never leaves Arrow -columnar format, so there is no per-row materialization cost at the boundary between a Rust-implemented and a -JVM-implemented step. This distinction matters for 0.17.0 because the JVM-implemented path, driven by Comet's -codegen dispatcher, is where a large share of this release's new capability lands. - -The framing is now reflected in Comet's documentation: the README leads with the Arrow-native value -proposition, and the older internal scan names (`native_datafusion`, `native_iceberg_compat`) have been -retired in favor of clearer terminology. +Because the data never leaves Arrow columnar format, there is no per-row materialization cost at the boundary +between a Rust-implemented and a JVM-implemented step. 0.17.0 extends both paths, but it builds especially +heavily on the JVM-implemented one: its codegen dispatcher brings a large set of new expressions, UDF support, +and exact Spark compatibility into the Arrow-native pipeline, as the following sections describe. ## JVM Codegen Dispatch From 751546d9c16d15f1364b5df53d9b8b1afc7064b7 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 10:05:54 -0600 Subject: [PATCH 03/20] blog: update Comet description to reflect Arrow-native dual-path execution --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 97022bc9..a26885bd 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -31,8 +31,10 @@ limitations under the License. The Apache DataFusion PMC is pleased to announce version 0.17.0 of the [Comet](https://datafusion.apache.org/comet/) subproject. -Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for -improved performance and efficiency without requiring any code changes. +Comet is an accelerator for Apache Spark that keeps query execution in Apache Arrow columnar format end to +end, running operators and expressions as native Rust code (via Apache DataFusion) or as JVM code that +operates directly on Arrow batches, for improved performance and efficiency without requiring any code +changes. This release covers approximately five weeks of development work and is the result of merging XXX PRs from 19 contributors. See the [change log] for more information. From 4b15d595878845197dc6efce6262c874776c81fc Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 10:06:49 -0600 Subject: [PATCH 04/20] blog: drop boilerplate Comet description from intro --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index a26885bd..21029a8f 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -31,11 +31,6 @@ limitations under the License. The Apache DataFusion PMC is pleased to announce version 0.17.0 of the [Comet](https://datafusion.apache.org/comet/) subproject. -Comet is an accelerator for Apache Spark that keeps query execution in Apache Arrow columnar format end to -end, running operators and expressions as native Rust code (via Apache DataFusion) or as JVM code that -operates directly on Arrow batches, for improved performance and efficiency without requiring any code -changes. - This release covers approximately five weeks of development work and is the result of merging XXX PRs from 19 contributors. See the [change log] for more information. From 4e83203672e89cf7754fab5af821a92ba84dad6a Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 10:09:36 -0600 Subject: [PATCH 05/20] blog: lead with JVM codegen dispatch, follow with arrow-native framing --- .../2026-06-16-datafusion-comet-0.17.0.md | 38 ++++++++++--------- 1 file changed, 20 insertions(+), 18 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 21029a8f..88a2efa9 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -36,24 +36,6 @@ contributors. See the [change log] for more information. [change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.17.0.md -## Arrow-Native, End to End - -Comet keeps Spark queries **Arrow-native end to end**: operators, expressions, shuffle, and broadcast all stay -in Apache Arrow columnar format, avoiding the per-row overhead that Spark's row-based engine incurs from -materializing and transitioning data one row at a time. - -Within that Arrow-native pipeline, the work of an operator or expression runs in one of two ways: - -- **Rust-implemented**: native Rust code, executed through Apache DataFusion. This is what most people picture - when they think of Comet. -- **JVM-implemented**: Scala or Java code that operates directly on Arrow batches, including expressions - produced by Spark's own code generation. - -Because the data never leaves Arrow columnar format, there is no per-row materialization cost at the boundary -between a Rust-implemented and a JVM-implemented step. 0.17.0 extends both paths, but it builds especially -heavily on the JVM-implemented one: its codegen dispatcher brings a large set of new expressions, UDF support, -and exact Spark compatibility into the Arrow-native pipeline, as the following sections describe. - ## JVM Codegen Dispatch The headline feature of 0.17.0 is the maturation of Comet's **JVM codegen dispatcher**. @@ -92,6 +74,26 @@ correctly inside Comet rather than triggering a fallback. `allowIncompatible=tru knob for users who accept the faster native path's divergence. Expressions such as `from_unixtime`, and the `TimestampNTZ` branches of `hour`, `minute`, and `second`, now stay in the pipeline by default. +## Arrow-Native, End to End + +The codegen dispatcher works because of a property that runs through all of Comet: queries stay **Arrow-native +end to end**. Operators, expressions, shuffle, and broadcast all remain in Apache Arrow columnar format, +avoiding the per-row overhead that Spark's row-based engine incurs from materializing and transitioning data +one row at a time. + +Within that Arrow-native pipeline, the work of an operator or expression runs in one of two ways: + +- **Rust-implemented**: native Rust code, executed through Apache DataFusion. This is what most people picture + when they think of Comet. +- **JVM-implemented**: Scala or Java code that operates directly on Arrow batches, including the + codegen-dispatched expressions and UDFs described above. + +Because the data never leaves Arrow columnar format, there is no per-row materialization cost at the boundary +between a Rust-implemented and a JVM-implemented step. That is what makes dispatch worthwhile: a dispatched +expression sits directly in the pipeline alongside Rust-implemented operators, with no columnar-to-row +transition between them. The expansion of the JVM-implemented path in 0.17.0 widens what Comet can keep +Arrow-native rather than hand back to Spark. + ## Expanded Expression Coverage Partly through the codegen dispatcher and partly through new Rust implementations, Comet's expression coverage From 02be106b6911d57a123a23055a5131d278a5ec7b Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 10:23:12 -0600 Subject: [PATCH 06/20] blog: correct codegen dispatcher as new in 0.17.0 and fix one-way c2r description --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 88a2efa9..ca8180d0 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -38,12 +38,12 @@ contributors. See the [change log] for more information. ## JVM Codegen Dispatch -The headline feature of 0.17.0 is the maturation of Comet's **JVM codegen dispatcher**. +The headline feature of 0.17.0 is a new mechanism introduced this cycle: Comet's **JVM codegen dispatcher**. -Comet has long fallen back to Spark whenever an expression had no native Rust implementation, or where the +Comet has always fallen back to Spark whenever an expression had no native Rust implementation, or where the Rust implementation could diverge from Spark on edge cases. A fallback is correct, but it is expensive: the -surrounding project, exchange, and sort operators drop out of the Comet pipeline, and the data takes a -columnar-to-row round trip into Spark and back. +surrounding project, exchange, and sort operators drop out of the Comet pipeline, and a columnar-to-row +conversion is needed to feed the data into Spark's row-based operators. The codegen dispatcher offers a third option. Instead of falling back, Comet runs the Spark expression's own generated code (`doGenCode`) inside the Comet pipeline, operating directly on Arrow batches. The result is a @@ -53,9 +53,9 @@ version. When the dispatcher is disabled, Comet falls back cleanly as before. This release puts the dispatcher to work across a wide surface: -- **Scala and Java UDFs, now enabled by default.** Eligible Spark `ScalaUDF` expressions are routed through +- **Scala and Java UDFs, enabled by default.** Eligible Spark `ScalaUDF` expressions are routed through the dispatcher and executed inside the Comet pipeline, so a project around a UDF no longer forces a - fallback and a columnar-to-row round trip. The path has broad type coverage (scalars, arbitrarily nested + fallback and a columnar-to-row conversion. The path has broad type coverage (scalars, arbitrarily nested complex types, and higher-order functions) and is backed by end-to-end, fuzz, and Iceberg test coverage. It can be disabled with `spark.comet.exec.scalaUDF.codegen.enabled=false`. - **100% Spark-compatible regular expressions.** Regex expressions now dispatch to Spark's own From f7c454caf4ce6048c908b8893b47c5138adee74d Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 10:24:25 -0600 Subject: [PATCH 07/20] blog: clarify dispatched expressions match Spark speed; benefit is avoiding stage fallback --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index ca8180d0..08b9d168 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -51,6 +51,12 @@ JVM-implemented Arrow-native expression: the query stays in the pipeline, and be evaluated by Spark's own code, the result is guaranteed to match Spark exactly across every supported Spark version. When the dispatcher is disabled, Comet falls back cleanly as before. +It is worth being clear about where the benefit comes from. A dispatched expression runs Spark's own generated +code, so the expression itself is not faster than it would be in Spark; expect roughly equivalent per-expression +performance. The win is structural: a single unsupported expression no longer forces an entire query stage out +of the Comet pipeline. The surrounding operators stay Arrow-native, and the stage as a whole avoids the +columnar-to-row conversion and row-based Spark execution that a fallback would otherwise impose. + This release puts the dispatcher to work across a wide surface: - **Scala and Java UDFs, enabled by default.** Eligible Spark `ScalaUDF` expressions are routed through From f71a1e57ed183ba58da250ca11309cc89680250b Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 11 Jun 2026 10:28:53 -0600 Subject: [PATCH 08/20] blog: replace 'this cycle' with 'this release' --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 08b9d168..6099824b 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -38,7 +38,7 @@ contributors. See the [change log] for more information. ## JVM Codegen Dispatch -The headline feature of 0.17.0 is a new mechanism introduced this cycle: Comet's **JVM codegen dispatcher**. +The headline feature of 0.17.0 is a new mechanism introduced in this release: Comet's **JVM codegen dispatcher**. Comet has always fallen back to Spark whenever an expression had no native Rust implementation, or where the Rust implementation could diverge from Spark on edge cases. A fallback is correct, but it is expensive: the @@ -103,7 +103,7 @@ Arrow-native rather than hand back to Spark. ## Expanded Expression Coverage Partly through the codegen dispatcher and partly through new Rust implementations, Comet's expression coverage -grew substantially in this cycle. More than 120 Spark expressions have gained support since 0.16.0, spanning +grew substantially in this release. More than 120 Spark expressions have gained support since 0.16.0, spanning nearly every function family: - **Date and time** (~25): `convert_timezone`, `make_date`, `months_between`, `next_day`, @@ -163,12 +163,12 @@ easier to attribute. ## Correctness and Test Coverage -Much of the correctness work this cycle is part of a deliberate push toward an eventual **1.0.0 release**. +Much of the correctness work in this release is part of a deliberate push toward an eventual **1.0.0 release**. Reaching 1.0.0 means being able to state precisely, and stand behind, exactly which Spark expressions Comet accelerates and how faithfully it matches Spark on each one. Two efforts in 0.17.0 move directly toward that goal. -First, this cycle included a systematic audit of Comet's expression implementations against Apache Spark +First, this release included a systematic audit of Comet's expression implementations against Apache Spark 3.4.3, 3.5.8, 4.0.1, and 4.1.1, covering the hash, JSON, collection, map, predicate, bitwise, conditional, array, struct, math, and cast expression families, along with cast behavior. Each audit compared Comet's behavior to Spark across all four versions and expanded test coverage where gaps were found. From 3cea0a42a0e243171de46eb6eaa185ab86c97e9f Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 12 Jun 2026 07:34:49 -0600 Subject: [PATCH 09/20] blog: promote 1.0.0 prep to top-level section and soften opening title --- .../2026-06-16-datafusion-comet-0.17.0.md | 30 +++++++++++-------- 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 6099824b..030b2933 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -36,9 +36,10 @@ contributors. See the [change log] for more information. [change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.17.0.md -## JVM Codegen Dispatch +## Fewer Fallbacks to Spark -The headline feature of 0.17.0 is a new mechanism introduced in this release: Comet's **JVM codegen dispatcher**. +The headline feature of 0.17.0 is a new mechanism that keeps more of your query running inside Comet instead +of falling back to Spark: the **JVM codegen dispatcher**. Comet has always fallen back to Spark whenever an expression had no native Rust implementation, or where the Rust implementation could diverge from Spark on edge cases. A fallback is correct, but it is expensive: the @@ -161,23 +162,28 @@ Additional smaller improvements include bulk-NULL handling in `split` and `subst allocation), and a new `interleave_time` shuffle metric with tuned output buffer sizing to make shuffle cost easier to attribute. -## Correctness and Test Coverage +## Preparing for the 1.0.0 Release Much of the correctness work in this release is part of a deliberate push toward an eventual **1.0.0 release**. Reaching 1.0.0 means being able to state precisely, and stand behind, exactly which Spark expressions Comet accelerates and how faithfully it matches Spark on each one. Two efforts in 0.17.0 move directly toward that goal. -First, this release included a systematic audit of Comet's expression implementations against Apache Spark -3.4.3, 3.5.8, 4.0.1, and 4.1.1, covering the hash, JSON, collection, map, predicate, bitwise, conditional, -array, struct, math, and cast expression families, along with cast behavior. Each audit compared Comet's -behavior to Spark across all four versions and expanded test coverage where gaps were found. +First, the codegen dispatcher raises the correctness floor structurally: for every dispatched expression, +results are guaranteed to match Spark exactly because Spark's own generated code does the evaluation. This +gives a meaningful increase in confidence that Comet matches Spark across the supported version matrix, and it +sharpens the Spark Expression Support reference into a status page the project can commit to as it approaches +1.0.0. -Second, the codegen dispatcher raises the correctness floor structurally: for every dispatched expression, -results are guaranteed to match Spark exactly because Spark's own generated code does the evaluation. Together -with the audits, this gives a meaningful increase in confidence that Comet matches Spark across the supported -version matrix, and it sharpens the Spark Expression Support reference into a status page the project can -commit to as it approaches 1.0.0. +Second, this release included a systematic audit of Comet's existing expression implementations against Spark, +detailed below. + +### Expression Audit + +The audit compared Comet's expression implementations against Apache Spark 3.4.3, 3.5.8, 4.0.1, and 4.1.1, +covering the hash, JSON, collection, map, predicate, bitwise, conditional, array, struct, math, and cast +expression families, along with cast behavior. Each audit compared Comet's behavior to Spark across all four +versions and expanded test coverage where gaps were found. As always, most cross-version behavior differences were caught because Comet runs the full Apache Spark SQL test suite against each supported Spark version as part of CI. From 23ee2695cf9e916e933454d42c7160228ad89774 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 12 Jun 2026 07:36:21 -0600 Subject: [PATCH 10/20] blog: split Java/Scala UDF support into its own section --- .../blog/2026-06-16-datafusion-comet-0.17.0.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 030b2933..e487f5e2 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -60,11 +60,6 @@ columnar-to-row conversion and row-based Spark execution that a fallback would o This release puts the dispatcher to work across a wide surface: -- **Scala and Java UDFs, enabled by default.** Eligible Spark `ScalaUDF` expressions are routed through - the dispatcher and executed inside the Comet pipeline, so a project around a UDF no longer forces a - fallback and a columnar-to-row conversion. The path has broad type coverage (scalars, arbitrarily nested - complex types, and higher-order functions) and is backed by end-to-end, fuzz, and Iceberg test coverage. - It can be disabled with `spark.comet.exec.scalaUDF.codegen.enabled=false`. - **100% Spark-compatible regular expressions.** Regex expressions now dispatch to Spark's own implementation, eliminating the long-standing compatibility gaps of a separate native regex engine. - **100% Spark-compatible JSON functions.** JSON expression handling follows the same approach, matching @@ -81,6 +76,17 @@ correctly inside Comet rather than triggering a fallback. `allowIncompatible=tru knob for users who accept the faster native path's divergence. Expressions such as `from_unixtime`, and the `TimestampNTZ` branches of `hour`, `minute`, and `second`, now stay in the pipeline by default. +## User-Defined Functions in Java and Scala + +Building on the codegen dispatcher, 0.17.0 adds support for arbitrary user-defined functions written in Java +and Scala, enabled by default. Eligible Spark `ScalaUDF` expressions are routed through the dispatcher and +executed inside the Comet pipeline, so a project around a UDF no longer forces a fallback and a +columnar-to-row conversion. + +The path has broad type coverage — scalars, arbitrarily nested complex types, and higher-order functions — and +is backed by end-to-end, fuzz, and Iceberg test coverage. It can be disabled with +`spark.comet.exec.scalaUDF.codegen.enabled=false`. + ## Arrow-Native, End to End The codegen dispatcher works because of a property that runs through all of Comet: queries stay **Arrow-native From ec374aafa4ec4bc24d840be848eee1c1695fd133 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 12 Jun 2026 07:38:21 -0600 Subject: [PATCH 11/20] blog: credit the audit, not the Spark SQL suite, for most fixes --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index e487f5e2..ab8279bc 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -191,8 +191,11 @@ covering the hash, JSON, collection, map, predicate, bitwise, conditional, array expression families, along with cast behavior. Each audit compared Comet's behavior to Spark across all four versions and expanded test coverage where gaps were found. -As always, most cross-version behavior differences were caught because Comet runs the full Apache Spark SQL -test suite against each supported Spark version as part of CI. +This deliberate, edge-case-by-edge-case study surfaced the bulk of the behavior differences fixed in this +release. By working through each expression's corner cases and writing targeted Comet SQL tests for them, the +audit caught divergences that broader testing does not reliably exercise. Comet also runs the full Apache +Spark SQL test suite against each supported Spark version as part of CI, which continues to catch some +cross-version differences, but the audit's focused approach found considerably more. ## Compatibility From c8bc1dc8326fde1f6ad8834e9061982f8d2a6826 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 12 Jun 2026 07:51:22 -0600 Subject: [PATCH 12/20] blog: reference 1.0.0 tracking issue and list release criteria --- .../2026-06-16-datafusion-comet-0.17.0.md | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index ab8279bc..10044094 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -171,9 +171,22 @@ easier to attribute. ## Preparing for the 1.0.0 Release Much of the correctness work in this release is part of a deliberate push toward an eventual **1.0.0 release**. -Reaching 1.0.0 means being able to state precisely, and stand behind, exactly which Spark expressions Comet -accelerates and how faithfully it matches Spark on each one. Two efforts in 0.17.0 move directly toward that -goal. +Planning for 1.0.0 is being tracked in [issue #4082], where the community is discussing what the milestone +should mean. The criteria under consideration go well beyond correctness and include: + +- Demonstrated cost savings for TPC-H and TPC-DS at SF1000 (1TB) — already met +- Thorough documentation of compatibility status +- A review of all configuration options, renaming some for consistency +- Consistent logging +- A documented policy for how long each Spark version will be supported +- A documented process for preventing major performance regressions +- A documented semantic-versioning policy and what it means for Comet going forward — already agreed + +[issue #4082]: https://github.com/apache/datafusion-comet/issues/4082 + +Correctness sits at the center of that list: reaching 1.0.0 means being able to state precisely, and stand +behind, exactly which Spark expressions Comet accelerates and how faithfully it matches Spark on each one. Two +efforts in 0.17.0 move directly toward that goal. First, the codegen dispatcher raises the correctness floor structurally: for every dispatched expression, results are guaranteed to match Spark exactly because Spark's own generated code does the evaluation. This From 472021d4a238794739bf2a3cbeb981a5920ddd58 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 18 Jun 2026 10:45:24 -0600 Subject: [PATCH 13/20] blog: update 0.17.0 post for rc1 tag and tighten prose Fill in PR count (192) from the generated changelog and remove the pre-publish TODO. Expand the codegen dispatch coverage to reflect the expressions wired up since the draft (collection and higher-order functions, AES, mask, try_to_number, timezone conversions). Add the Arrow C Stream Interface input-path change and the native broadcast nested loop join. Correct Spark 4.1 to 4.1.2. Tighten editorializing prose throughout. --- .../2026-06-16-datafusion-comet-0.17.0.md | 79 +++++++++++-------- 1 file changed, 48 insertions(+), 31 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 10044094..32348f13 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -27,11 +27,9 @@ limitations under the License. [TOC] - - The Apache DataFusion PMC is pleased to announce version 0.17.0 of the [Comet](https://datafusion.apache.org/comet/) subproject. -This release covers approximately five weeks of development work and is the result of merging XXX PRs from 19 +This release covers approximately five weeks of development work and is the result of merging 192 PRs from 19 contributors. See the [change log] for more information. [change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.17.0.md @@ -52,11 +50,11 @@ JVM-implemented Arrow-native expression: the query stays in the pipeline, and be evaluated by Spark's own code, the result is guaranteed to match Spark exactly across every supported Spark version. When the dispatcher is disabled, Comet falls back cleanly as before. -It is worth being clear about where the benefit comes from. A dispatched expression runs Spark's own generated -code, so the expression itself is not faster than it would be in Spark; expect roughly equivalent per-expression -performance. The win is structural: a single unsupported expression no longer forces an entire query stage out -of the Comet pipeline. The surrounding operators stay Arrow-native, and the stage as a whole avoids the -columnar-to-row conversion and row-based Spark execution that a fallback would otherwise impose. +A dispatched expression is no faster than it would be in Spark, since it runs the same generated code. The +benefit is structural rather than per-expression: a single unsupported expression no longer forces an entire +query stage out of the Comet pipeline. The surrounding operators stay Arrow-native, and the +stage avoids the columnar-to-row conversion and row-based Spark execution that a fallback would otherwise +impose. This release puts the dispatcher to work across a wide surface: @@ -65,10 +63,17 @@ This release puts the dispatcher to work across a wide surface: - **100% Spark-compatible JSON functions.** JSON expression handling follows the same approach, matching Spark's behavior precisely. - **More scalar and structured-text functions**, including a batch of math and string functions, AES - encryption and decryption, `Upper` / `Lower` / `InitCap`, `GetTimestamp`, and an expanded set of date and - time expressions. + encryption and decryption (`aes_encrypt` / `aes_decrypt` / `try_aes_decrypt`), `Upper` / `Lower` / + `InitCap`, `GetTimestamp`, `mask`, `try_to_number`, and an expanded set of date, time, and + timezone expressions. +- **Collection and higher-order expressions**, including `array_intersect`, `array_except`, `array_join`, + `create_map`, and lambda-based higher-order functions such as `filter`. + +Several expressions that have native Rust paths are also routed through the dispatcher where it yields exact +Spark behavior, including `StringReplace`, `decode`, and the `from_utc_timestamp` / `to_utc_timestamp` / +`convert_timezone` family, which now honor Spark 4.0 legacy flags. -Just as importantly, 0.17.0 changes how Comet treats expressions whose native Rust path is known to diverge +0.17.0 also changes how Comet treats expressions whose native Rust path is known to diverge from Spark (marked `Incompatible`). Previously such an expression forced the entire projection back to Spark unless the user opted into the divergent native behavior. Now, when `spark.comet.expr.allowIncompatible` is left at its default of `false`, the expression is routed through the codegen dispatcher and evaluated @@ -102,16 +107,15 @@ Within that Arrow-native pipeline, the work of an operator or expression runs in codegen-dispatched expressions and UDFs described above. Because the data never leaves Arrow columnar format, there is no per-row materialization cost at the boundary -between a Rust-implemented and a JVM-implemented step. That is what makes dispatch worthwhile: a dispatched -expression sits directly in the pipeline alongside Rust-implemented operators, with no columnar-to-row -transition between them. The expansion of the JVM-implemented path in 0.17.0 widens what Comet can keep -Arrow-native rather than hand back to Spark. +between a Rust-implemented and a JVM-implemented step. A dispatched expression sits in the pipeline alongside +Rust-implemented operators with no columnar-to-row transition between them. Expanding the JVM-implemented path +in 0.17.0 lets Comet keep more work Arrow-native instead of handing it back to Spark. ## Expanded Expression Coverage Partly through the codegen dispatcher and partly through new Rust implementations, Comet's expression coverage -grew substantially in this release. More than 120 Spark expressions have gained support since 0.16.0, spanning -nearly every function family: +grew substantially in this release. More than 120 Spark expressions have gained support since 0.16.0, across +most function families: - **Date and time** (~25): `convert_timezone`, `make_date`, `months_between`, `next_day`, `from_utc_timestamp` / `to_utc_timestamp`, `date_from_unix_date`, the `timestamp_*` and `unix_*` second / @@ -134,6 +138,9 @@ nearly every function family: For the authoritative, always-current status of every Spark built-in expression, see the [Spark Expression Support](https://datafusion.apache.org/comet/user-guide/latest/expressions.html) reference. +Operator coverage grew too: 0.17.0 adds a native **broadcast nested loop join**, so queries with non-equi or +cross join conditions can now stay in the Comet pipeline rather than falling back to Spark. + ## Performance ### Removing an FFI Round Trip from Native Shuffle @@ -148,9 +155,20 @@ its batches marked as not FFI-safe, so on import every batch was deep-copied int 0.17.0 collapses the two iterators into a single native plan rooted at the shuffle writer, with the upstream subtree as a direct child. That removes the Arrow FFI export and import, the per-batch deep copy, and one `createPlan` / `releasePlan` pair per partition, while preserving all of the existing per-partition setup -(broadcast alignment, subqueries, encryption) and input metrics reporting. This is the Arrow-native principle -in practice: the win came not from faster copying, but from removing a data-format boundary that should never -have cost anything. +(broadcast alignment, subqueries, encryption) and input metrics reporting. The gain comes from removing the +round trip, not from copying its data more efficiently. + +### A Single Arrow Stream on the Input Path + +0.17.0 also reworks the other side of the JVM/native boundary — the path that feeds JVM-sourced data *into* +native execution. Previously each batch crossed via a bespoke `CometBatchIterator`, with a `hasNext` / `next` +JNI pair per batch and every column imported through its own Arrow FFI array and schema. This release replaces +that path with the **Arrow C Stream Interface**: the JVM exports each per-partition iterator once, and native +imports the schema once and pulls each batch through a single C callback, taking ownership by reference count. +This removes the per-batch, per-column FFI export and JNI round trips for all JVM-sourced inputs, lets the +old `CometBatchIterator` and a now-unnecessary deep copy be deleted, and is the input-side counterpart to the +shuffle-write change above. On TPC-H SF1000 (Spark 3.5.8), total runtime dropped from 475.3s on the previous +baseline to 450.9s with the shuffle-write change and 435.1s with this one. ### Lower Per-Batch Overhead in Arrow Vectors @@ -170,7 +188,7 @@ easier to attribute. ## Preparing for the 1.0.0 Release -Much of the correctness work in this release is part of a deliberate push toward an eventual **1.0.0 release**. +Much of the correctness work in this release is part of an ongoing push toward a **1.0.0 release**. Planning for 1.0.0 is being tracked in [issue #4082], where the community is discussing what the milestone should mean. The criteria under consideration go well beyond correctness and include: @@ -184,15 +202,14 @@ should mean. The criteria under consideration go well beyond correctness and inc [issue #4082]: https://github.com/apache/datafusion-comet/issues/4082 -Correctness sits at the center of that list: reaching 1.0.0 means being able to state precisely, and stand -behind, exactly which Spark expressions Comet accelerates and how faithfully it matches Spark on each one. Two -efforts in 0.17.0 move directly toward that goal. +Correctness is central to that list: reaching 1.0.0 means being able to state precisely which Spark +expressions Comet accelerates, and how faithfully it matches Spark on each one. Two efforts in 0.17.0 move +toward that goal. -First, the codegen dispatcher raises the correctness floor structurally: for every dispatched expression, -results are guaranteed to match Spark exactly because Spark's own generated code does the evaluation. This -gives a meaningful increase in confidence that Comet matches Spark across the supported version matrix, and it -sharpens the Spark Expression Support reference into a status page the project can commit to as it approaches -1.0.0. +First, the codegen dispatcher guarantees an exact match for every dispatched expression, since Spark's own +generated code does the evaluation. That raises our confidence that Comet matches Spark across the supported +versions, and it makes the Spark Expression Support reference something the project can stand behind as it +approaches 1.0.0. Second, this release included a systematic audit of Comet's existing expression implementations against Spark, detailed below. @@ -204,7 +221,7 @@ covering the hash, JSON, collection, map, predicate, bitwise, conditional, array expression families, along with cast behavior. Each audit compared Comet's behavior to Spark across all four versions and expanded test coverage where gaps were found. -This deliberate, edge-case-by-edge-case study surfaced the bulk of the behavior differences fixed in this +This edge-case-by-edge-case comparison surfaced most of the behavior differences fixed in this release. By working through each expression's corner cases and writing targeted Comet SQL tests for them, the audit caught divergences that broader testing does not reliably exercise. Comet also runs the full Apache Spark SQL test suite against each supported Spark version as part of CI, which continues to catch some @@ -217,7 +234,7 @@ Supported platforms include: - **Spark 3.4.3** with Java 11/17 and Scala 2.12/2.13 - **Spark 3.5.8** with Java 11/17 and Scala 2.12/2.13 - **Spark 4.0.2** with Java 17 and Scala 2.13 -- **Spark 4.1.1** with Java 17 and Scala 2.13 +- **Spark 4.1.2** with Java 17 and Scala 2.13 See the [Spark Version Compatibility] page for known limitations specific to each version. From 7522c26f0b1ce4ea293b582d8a8694588657ce09 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 18 Jun 2026 10:52:10 -0600 Subject: [PATCH 14/20] blog: correct allowIncompatible config name in 0.17.0 post The post referenced a global spark.comet.expr.allowIncompatible, which does not exist. The flag is per-expression and lives under the spark.comet.expression prefix: spark.comet.expression..allowIncompatible. --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 32348f13..006857d4 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -75,11 +75,12 @@ Spark behavior, including `StringReplace`, `decode`, and the `from_utc_timestamp 0.17.0 also changes how Comet treats expressions whose native Rust path is known to diverge from Spark (marked `Incompatible`). Previously such an expression forced the entire projection back to Spark -unless the user opted into the divergent native behavior. Now, when `spark.comet.expr.allowIncompatible` is -left at its default of `false`, the expression is routed through the codegen dispatcher and evaluated -correctly inside Comet rather than triggering a fallback. `allowIncompatible=true` becomes a pure performance -knob for users who accept the faster native path's divergence. Expressions such as `from_unixtime`, and the -`TimestampNTZ` branches of `hour`, `minute`, and `second`, now stay in the pipeline by default. +unless the user opted into the divergent native behavior with the per-expression +`spark.comet.expression..allowIncompatible` flag. By default that flag is unset, and the expression is +now routed through the codegen dispatcher and evaluated correctly inside Comet rather than triggering a +fallback. Setting it to `true` becomes a performance knob for users who accept the faster native path's +divergence. Expressions such as `from_unixtime`, and the `TimestampNTZ` branches of `hour`, `minute`, and +`second`, now stay in the pipeline by default. ## User-Defined Functions in Java and Scala From b7a18fcfab4e7f593db5ac34434fa40ca292ef29 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 18 Jun 2026 10:57:21 -0600 Subject: [PATCH 15/20] blog: state combined FFI gain as TPC-DS 1TB percentage Replace raw TPC-H runtimes with the combined improvement: the two FFI changes improve TPC-DS at 1TB by around 9%. --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 006857d4..0003c0d3 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -168,8 +168,7 @@ that path with the **Arrow C Stream Interface**: the JVM exports each per-partit imports the schema once and pulls each batch through a single C callback, taking ownership by reference count. This removes the per-batch, per-column FFI export and JNI round trips for all JVM-sourced inputs, lets the old `CometBatchIterator` and a now-unnecessary deep copy be deleted, and is the input-side counterpart to the -shuffle-write change above. On TPC-H SF1000 (Spark 3.5.8), total runtime dropped from 475.3s on the previous -baseline to 450.9s with the shuffle-write change and 435.1s with this one. +shuffle-write change above. Together, the two FFI changes improve TPC-DS performance at 1TB by around 9%. ### Lower Per-Batch Overhead in Arrow Vectors From 2113edaff9aeae1c26ab2492b0be60cb8ac6d256 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 18 Jun 2026 11:01:27 -0600 Subject: [PATCH 16/20] blog: attribute 9% TPC-DS gain to the release, not FFI alone The 9% is the 0.16.0-to-0.17.0 improvement at 1TB; the FFI changes are the largest contributor but not the sole cause. --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 0003c0d3..3d96902f 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -168,7 +168,8 @@ that path with the **Arrow C Stream Interface**: the JVM exports each per-partit imports the schema once and pulls each batch through a single C callback, taking ownership by reference count. This removes the per-batch, per-column FFI export and JNI round trips for all JVM-sourced inputs, lets the old `CometBatchIterator` and a now-unnecessary deep copy be deleted, and is the input-side counterpart to the -shuffle-write change above. Together, the two FFI changes improve TPC-DS performance at 1TB by around 9%. +shuffle-write change above. TPC-DS at 1TB is about 9% faster in 0.17.0 than in 0.16.0, and these two FFI +changes are the largest contributor to that gain. ### Lower Per-Batch Overhead in Arrow Vectors From 5c95c6788c09c52e82fff52fd2c379731300b2a6 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Fri, 19 Jun 2026 07:25:21 -0600 Subject: [PATCH 17/20] blog: note ongoing native expression work in 0.17.0 post --- content/blog/2026-06-16-datafusion-comet-0.17.0.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-16-datafusion-comet-0.17.0.md index 3d96902f..2e7a8feb 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-16-datafusion-comet-0.17.0.md @@ -73,6 +73,10 @@ Several expressions that have native Rust paths are also routed through the disp Spark behavior, including `StringReplace`, `decode`, and the `from_utc_timestamp` / `to_utc_timestamp` / `convert_timezone` family, which now honor Spark 4.0 legacy flags. +The codegen dispatcher closes the compatibility gap, but a dispatched expression runs Spark's own generated +code and is therefore no faster than Spark itself. Providing native Rust implementations of many of these +expressions remains ongoing work, and is where the additional acceleration will come from in future releases. + 0.17.0 also changes how Comet treats expressions whose native Rust path is known to diverge from Spark (marked `Incompatible`). Previously such an expression forced the entire projection back to Spark unless the user opted into the divergent native behavior with the per-expression From a0d604d9b12b1c595563419806dea6922ea60095 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Sun, 21 Jun 2026 07:20:20 -0600 Subject: [PATCH 18/20] more edits --- ... => 2026-06-20-datafusion-comet-0.17.0.md} | 51 ++++++++----------- 1 file changed, 21 insertions(+), 30 deletions(-) rename content/blog/{2026-06-16-datafusion-comet-0.17.0.md => 2026-06-20-datafusion-comet-0.17.0.md} (85%) diff --git a/content/blog/2026-06-16-datafusion-comet-0.17.0.md b/content/blog/2026-06-20-datafusion-comet-0.17.0.md similarity index 85% rename from content/blog/2026-06-16-datafusion-comet-0.17.0.md rename to content/blog/2026-06-20-datafusion-comet-0.17.0.md index 2e7a8feb..0f15d50b 100644 --- a/content/blog/2026-06-16-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-20-datafusion-comet-0.17.0.md @@ -1,7 +1,7 @@ --- layout: post title: Apache DataFusion Comet 0.17.0 Release -date: 2026-06-16 +date: 2026-06-20 author: pmc categories: [subprojects] --- @@ -39,20 +39,19 @@ contributors. See the [change log] for more information. The headline feature of 0.17.0 is a new mechanism that keeps more of your query running inside Comet instead of falling back to Spark: the **JVM codegen dispatcher**. -Comet has always fallen back to Spark whenever an expression had no native Rust implementation, or where the -Rust implementation could diverge from Spark on edge cases. A fallback is correct, but it is expensive: the -surrounding project, exchange, and sort operators drop out of the Comet pipeline, and a columnar-to-row -conversion is needed to feed the data into Spark's row-based operators. +Comet has always fallen back to Spark row-based execution whenever an expression had no native Rust implementation, or where the +Rust implementation could diverge from Spark on edge cases. A fallback is correct, a columnar-to-row +conversion is needed to feed the data into Spark's row-based operators and this adds overhead when processing billions of rows of data. -The codegen dispatcher offers a third option. Instead of falling back, Comet runs the Spark expression's own +The codegen dispatcher avoids the fallback to row-based processing by running Spark's own generated code (`doGenCode`) inside the Comet pipeline, operating directly on Arrow batches. The result is a -JVM-implemented Arrow-native expression: the query stays in the pipeline, and because the expression is +JVM-implemented Arrow-native expression: the data stays in Arrow format, and because the expression is evaluated by Spark's own code, the result is guaranteed to match Spark exactly across every supported Spark version. When the dispatcher is disabled, Comet falls back cleanly as before. A dispatched expression is no faster than it would be in Spark, since it runs the same generated code. The -benefit is structural rather than per-expression: a single unsupported expression no longer forces an entire -query stage out of the Comet pipeline. The surrounding operators stay Arrow-native, and the +benefit is that a single unsupported expression no longer forces an entire +query stage back to row-based execution. The surrounding operators stay Arrow-native, and the stage avoids the columnar-to-row conversion and row-based Spark execution that a fallback would otherwise impose. @@ -69,22 +68,14 @@ This release puts the dispatcher to work across a wide surface: - **Collection and higher-order expressions**, including `array_intersect`, `array_except`, `array_join`, `create_map`, and lambda-based higher-order functions such as `filter`. -Several expressions that have native Rust paths are also routed through the dispatcher where it yields exact -Spark behavior, including `StringReplace`, `decode`, and the `from_utc_timestamp` / `to_utc_timestamp` / -`convert_timezone` family, which now honor Spark 4.0 legacy flags. - -The codegen dispatcher closes the compatibility gap, but a dispatched expression runs Spark's own generated -code and is therefore no faster than Spark itself. Providing native Rust implementations of many of these -expressions remains ongoing work, and is where the additional acceleration will come from in future releases. - -0.17.0 also changes how Comet treats expressions whose native Rust path is known to diverge -from Spark (marked `Incompatible`). Previously such an expression forced the entire projection back to Spark -unless the user opted into the divergent native behavior with the per-expression -`spark.comet.expression..allowIncompatible` flag. By default that flag is unset, and the expression is -now routed through the codegen dispatcher and evaluated correctly inside Comet rather than triggering a -fallback. Setting it to `true` becomes a performance knob for users who accept the faster native path's -divergence. Expressions such as `from_unixtime`, and the `TimestampNTZ` branches of `hour`, `minute`, and -`second`, now stay in the pipeline by default. + 0.17.0 also changes how Comet treats expressions whose native Rust path is known to diverge + from Spark (marked `Incompatible`). Previously such an expression forced the entire projection back to Spark + unless the user opted into the divergent native behavior with the per-expression + `spark.comet.expression..allowIncompatible` flag. By default, that flag is unset, and the expression is + now routed through the codegen dispatcher and evaluated correctly inside Comet rather than triggering a + fallback. Setting it to `true` becomes a performance knob for users who accept the faster native path's + divergence. Expressions such as `from_unixtime`, and the `TimestampNTZ` branches of `hour`, `minute`, and + `second`, now stay in the pipeline by default. ## User-Defined Functions in Java and Scala @@ -140,8 +131,8 @@ most function families: - **Conditional and null handling** (7): `greatest`, `least`, `nullif`, `ifnull` / `nvl`, `nvl2`, and `equal_null`. -For the authoritative, always-current status of every Spark built-in expression, see the -[Spark Expression Support](https://datafusion.apache.org/comet/user-guide/latest/expressions.html) reference. +For the full list of supported expressions in this release, see the +[Spark Expression Support](https://datafusion.apache.org/comet/user-guide/0.17/expressions.html) reference. Operator coverage grew too: 0.17.0 adds a native **broadcast nested loop join**, so queries with non-equi or cross join conditions can now stay in the Comet pipeline rather than falling back to Spark. @@ -165,7 +156,7 @@ round trip, not from copying its data more efficiently. ### A Single Arrow Stream on the Input Path -0.17.0 also reworks the other side of the JVM/native boundary — the path that feeds JVM-sourced data *into* +0.17.0 also reworks the other side of the JVM/native boundary — the path that feeds JVM-sourced data _into_ native execution. Previously each batch crossed via a bespoke `CometBatchIterator`, with a `hasNext` / `next` JNI pair per batch and every column imported through its own Arrow FFI array and schema. This release replaces that path with the **Arrow C Stream Interface**: the JVM exports each per-partition iterator once, and native @@ -197,13 +188,13 @@ Much of the correctness work in this release is part of an ongoing push toward a Planning for 1.0.0 is being tracked in [issue #4082], where the community is discussing what the milestone should mean. The criteria under consideration go well beyond correctness and include: -- Demonstrated cost savings for TPC-H and TPC-DS at SF1000 (1TB) — already met +- Demonstrated cost savings for TPC-H and TPC-DS at SF1000 (1TB) - Thorough documentation of compatibility status - A review of all configuration options, renaming some for consistency - Consistent logging - A documented policy for how long each Spark version will be supported - A documented process for preventing major performance regressions -- A documented semantic-versioning policy and what it means for Comet going forward — already agreed +- A documented semantic-versioning policy and what it means for Comet going forward [issue #4082]: https://github.com/apache/datafusion-comet/issues/4082 From e7c9b7da00473cc960377ae3af8128ff65f59272 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Mon, 22 Jun 2026 07:31:03 -0600 Subject: [PATCH 19/20] Update 2026-06-20-datafusion-comet-0.17.0.md Co-authored-by: Matt Butrovich --- content/blog/2026-06-20-datafusion-comet-0.17.0.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/blog/2026-06-20-datafusion-comet-0.17.0.md b/content/blog/2026-06-20-datafusion-comet-0.17.0.md index 0f15d50b..d6b19cae 100644 --- a/content/blog/2026-06-20-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-20-datafusion-comet-0.17.0.md @@ -47,7 +47,7 @@ The codegen dispatcher avoids the fallback to row-based processing by running Sp generated code (`doGenCode`) inside the Comet pipeline, operating directly on Arrow batches. The result is a JVM-implemented Arrow-native expression: the data stays in Arrow format, and because the expression is evaluated by Spark's own code, the result is guaranteed to match Spark exactly across every supported Spark -version. When the dispatcher is disabled, Comet falls back cleanly as before. +version. When the dispatcher is disabled, Comet falls back as before. A dispatched expression is no faster than it would be in Spark, since it runs the same generated code. The benefit is that a single unsupported expression no longer forces an entire From 1236a51c9836fa8290b530c040546f016526d270 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Mon, 22 Jun 2026 07:31:13 -0600 Subject: [PATCH 20/20] Update 2026-06-20-datafusion-comet-0.17.0.md Co-authored-by: Matt Butrovich --- content/blog/2026-06-20-datafusion-comet-0.17.0.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/blog/2026-06-20-datafusion-comet-0.17.0.md b/content/blog/2026-06-20-datafusion-comet-0.17.0.md index d6b19cae..677650fc 100644 --- a/content/blog/2026-06-20-datafusion-comet-0.17.0.md +++ b/content/blog/2026-06-20-datafusion-comet-0.17.0.md @@ -40,8 +40,8 @@ The headline feature of 0.17.0 is a new mechanism that keeps more of your query of falling back to Spark: the **JVM codegen dispatcher**. Comet has always fallen back to Spark row-based execution whenever an expression had no native Rust implementation, or where the -Rust implementation could diverge from Spark on edge cases. A fallback is correct, a columnar-to-row -conversion is needed to feed the data into Spark's row-based operators and this adds overhead when processing billions of rows of data. +Rust implementation could diverge from Spark on edge cases. A fallback is correct, but a columnar-to-row +conversion is needed to feed the data into Spark's row-based operators, which adds overhead when processing billions of rows of data. The codegen dispatcher avoids the fallback to row-based processing by running Spark's own generated code (`doGenCode`) inside the Comet pipeline, operating directly on Arrow batches. The result is a