diff --git a/content/blog/2026-06-20-datafusion-comet-0.17.0.md b/content/blog/2026-06-20-datafusion-comet-0.17.0.md new file mode 100644 index 00000000..677650fc --- /dev/null +++ b/content/blog/2026-06-20-datafusion-comet-0.17.0.md @@ -0,0 +1,245 @@ +--- +layout: post +title: Apache DataFusion Comet 0.17.0 Release +date: 2026-06-20 +author: pmc +categories: [subprojects] +--- + + + +[TOC] + +The Apache DataFusion PMC is pleased to announce version 0.17.0 of the [Comet](https://datafusion.apache.org/comet/) subproject. + +This release covers approximately five weeks of development work and is the result of merging 192 PRs from 19 +contributors. See the [change log] for more information. + +[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.17.0.md + +## Fewer Fallbacks to Spark + +The headline feature of 0.17.0 is a new mechanism that keeps more of your query running inside Comet instead +of falling back to Spark: the **JVM codegen dispatcher**. + +Comet has always fallen back to Spark row-based execution whenever an expression had no native Rust implementation, or where the +Rust implementation could diverge from Spark on edge cases. A fallback is correct, but a columnar-to-row +conversion is needed to feed the data into Spark's row-based operators, which adds overhead when processing billions of rows of data. + +The codegen dispatcher avoids the fallback to row-based processing by running Spark's own +generated code (`doGenCode`) inside the Comet pipeline, operating directly on Arrow batches. The result is a +JVM-implemented Arrow-native expression: the data stays in Arrow format, and because the expression is +evaluated by Spark's own code, the result is guaranteed to match Spark exactly across every supported Spark +version. When the dispatcher is disabled, Comet falls back as before. + +A dispatched expression is no faster than it would be in Spark, since it runs the same generated code. The +benefit is that a single unsupported expression no longer forces an entire +query stage back to row-based execution. The surrounding operators stay Arrow-native, and the +stage avoids the columnar-to-row conversion and row-based Spark execution that a fallback would otherwise +impose. + +This release puts the dispatcher to work across a wide surface: + +- **100% Spark-compatible regular expressions.** Regex expressions now dispatch to Spark's own + implementation, eliminating the long-standing compatibility gaps of a separate native regex engine. +- **100% Spark-compatible JSON functions.** JSON expression handling follows the same approach, matching + Spark's behavior precisely. +- **More scalar and structured-text functions**, including a batch of math and string functions, AES + encryption and decryption (`aes_encrypt` / `aes_decrypt` / `try_aes_decrypt`), `Upper` / `Lower` / + `InitCap`, `GetTimestamp`, `mask`, `try_to_number`, and an expanded set of date, time, and + timezone expressions. +- **Collection and higher-order expressions**, including `array_intersect`, `array_except`, `array_join`, + `create_map`, and lambda-based higher-order functions such as `filter`. + + 0.17.0 also changes how Comet treats expressions whose native Rust path is known to diverge + from Spark (marked `Incompatible`). Previously such an expression forced the entire projection back to Spark + unless the user opted into the divergent native behavior with the per-expression + `spark.comet.expression..allowIncompatible` flag. By default, that flag is unset, and the expression is + now routed through the codegen dispatcher and evaluated correctly inside Comet rather than triggering a + fallback. Setting it to `true` becomes a performance knob for users who accept the faster native path's + divergence. Expressions such as `from_unixtime`, and the `TimestampNTZ` branches of `hour`, `minute`, and + `second`, now stay in the pipeline by default. + +## User-Defined Functions in Java and Scala + +Building on the codegen dispatcher, 0.17.0 adds support for arbitrary user-defined functions written in Java +and Scala, enabled by default. Eligible Spark `ScalaUDF` expressions are routed through the dispatcher and +executed inside the Comet pipeline, so a project around a UDF no longer forces a fallback and a +columnar-to-row conversion. + +The path has broad type coverage — scalars, arbitrarily nested complex types, and higher-order functions — and +is backed by end-to-end, fuzz, and Iceberg test coverage. It can be disabled with +`spark.comet.exec.scalaUDF.codegen.enabled=false`. + +## Arrow-Native, End to End + +The codegen dispatcher works because of a property that runs through all of Comet: queries stay **Arrow-native +end to end**. Operators, expressions, shuffle, and broadcast all remain in Apache Arrow columnar format, +avoiding the per-row overhead that Spark's row-based engine incurs from materializing and transitioning data +one row at a time. + +Within that Arrow-native pipeline, the work of an operator or expression runs in one of two ways: + +- **Rust-implemented**: native Rust code, executed through Apache DataFusion. This is what most people picture + when they think of Comet. +- **JVM-implemented**: Scala or Java code that operates directly on Arrow batches, including the + codegen-dispatched expressions and UDFs described above. + +Because the data never leaves Arrow columnar format, there is no per-row materialization cost at the boundary +between a Rust-implemented and a JVM-implemented step. A dispatched expression sits in the pipeline alongside +Rust-implemented operators with no columnar-to-row transition between them. Expanding the JVM-implemented path +in 0.17.0 lets Comet keep more work Arrow-native instead of handing it back to Spark. + +## Expanded Expression Coverage + +Partly through the codegen dispatcher and partly through new Rust implementations, Comet's expression coverage +grew substantially in this release. More than 120 Spark expressions have gained support since 0.16.0, across +most function families: + +- **Date and time** (~25): `convert_timezone`, `make_date`, `months_between`, `next_day`, + `from_utc_timestamp` / `to_utc_timestamp`, `date_from_unix_date`, the `timestamp_*` and `unix_*` second / + milli / micro conversions, and the current date/time/timezone functions. +- **Math** (~24): `acosh`, `asinh`, `atanh`, `cbrt`, `csc`, `sec`, `hypot`, `log1p`, `bin`, `conv`, + `factorial`, `pmod`, `width_bucket`, `rint`, and more. +- **String** (~16): `elt`, `find_in_set`, `format_number`, `format_string`, `levenshtein`, `locate`, + `overlay`, `soundex`, `split`, `substring_index`, `unbase64`, `to_char`, and `to_number`. +- **XPath** (9): the full `xpath`, `xpath_boolean`, `xpath_double`, `xpath_int`, `xpath_long`, `xpath_string` + family. +- **JSON, CSV, and XML** (9): `from_csv`, `to_csv`, `schema_of_csv`, `schema_of_json`, `json_object_keys`, + `json_array_length`, `from_xml`, `to_xml`, and `schema_of_xml`. +- **Array and map** (11): `array_position`, `array_size`, `arrays_zip`, `slice`, `sort_array`, `sequence`, + `map_concat`, `map_contains_key`, and `map_from_entries`. +- **Aggregate and window** (7): `any_value`, `count_if`, the `regr_*` regression aggregates, plus `lag` and + `lead`. +- **Conditional and null handling** (7): `greatest`, `least`, `nullif`, `ifnull` / `nvl`, `nvl2`, and + `equal_null`. + +For the full list of supported expressions in this release, see the +[Spark Expression Support](https://datafusion.apache.org/comet/user-guide/0.17/expressions.html) reference. + +Operator coverage grew too: 0.17.0 adds a native **broadcast nested loop join**, so queries with non-equi or +cross join conditions can now stay in the Comet pipeline rather than falling back to Spark. + +## Performance + +### Removing an FFI Round Trip from Native Shuffle + +The most significant performance change in 0.17.0 targets the shuffle write path. When a native subtree feeds +a Comet shuffle, Comet previously ran two separate native iterators per partition: one for the upstream +subtree, and a second rooted at a synthetic `Scan("ShuffleWriterInput") -> ShuffleWriter` that consumed the +batches back. The JVM never actually read this data, so the native-to-JVM-and-back hop was pure overhead. + +Although the Arrow C Data Interface is zero-copy, this particular round trip was not. The synthetic scan left +its batches marked as not FFI-safe, so on import every batch was deep-copied into freshly allocated buffers. +0.17.0 collapses the two iterators into a single native plan rooted at the shuffle writer, with the upstream +subtree as a direct child. That removes the Arrow FFI export and import, the per-batch deep copy, and one +`createPlan` / `releasePlan` pair per partition, while preserving all of the existing per-partition setup +(broadcast alignment, subqueries, encryption) and input metrics reporting. The gain comes from removing the +round trip, not from copying its data more efficiently. + +### A Single Arrow Stream on the Input Path + +0.17.0 also reworks the other side of the JVM/native boundary — the path that feeds JVM-sourced data _into_ +native execution. Previously each batch crossed via a bespoke `CometBatchIterator`, with a `hasNext` / `next` +JNI pair per batch and every column imported through its own Arrow FFI array and schema. This release replaces +that path with the **Arrow C Stream Interface**: the JVM exports each per-partition iterator once, and native +imports the schema once and pulls each batch through a single C callback, taking ownership by reference count. +This removes the per-batch, per-column FFI export and JNI round trips for all JVM-sourced inputs, lets the +old `CometBatchIterator` and a now-unnecessary deep copy be deleted, and is the input-side counterpart to the +shuffle-write change above. TPC-DS at 1TB is about 9% faster in 0.17.0 than in 0.16.0, and these two FFI +changes are the largest contributor to that gain. + +### Lower Per-Batch Overhead in Arrow Vectors + +Two changes reduce repeated work when reading Arrow columns across the JNI boundary. Comet now caches the +validity buffer address on `CometDecodedVector` and the offset buffer address for variable-width vectors on +`CometPlainVector`, so these addresses are resolved once per vector rather than on every access. + +### Faster Statistical Aggregates + +The variance, standard deviation, covariance, and correlation aggregates now use DataFusion's +`GroupsAccumulator` interface, which is substantially more efficient for grouped aggregation than the +row-accumulator path they used previously. + +Additional smaller improvements include bulk-NULL handling in `split` and `substring` (skipping a per-row +allocation), and a new `interleave_time` shuffle metric with tuned output buffer sizing to make shuffle cost +easier to attribute. + +## Preparing for the 1.0.0 Release + +Much of the correctness work in this release is part of an ongoing push toward a **1.0.0 release**. +Planning for 1.0.0 is being tracked in [issue #4082], where the community is discussing what the milestone +should mean. The criteria under consideration go well beyond correctness and include: + +- Demonstrated cost savings for TPC-H and TPC-DS at SF1000 (1TB) +- Thorough documentation of compatibility status +- A review of all configuration options, renaming some for consistency +- Consistent logging +- A documented policy for how long each Spark version will be supported +- A documented process for preventing major performance regressions +- A documented semantic-versioning policy and what it means for Comet going forward + +[issue #4082]: https://github.com/apache/datafusion-comet/issues/4082 + +Correctness is central to that list: reaching 1.0.0 means being able to state precisely which Spark +expressions Comet accelerates, and how faithfully it matches Spark on each one. Two efforts in 0.17.0 move +toward that goal. + +First, the codegen dispatcher guarantees an exact match for every dispatched expression, since Spark's own +generated code does the evaluation. That raises our confidence that Comet matches Spark across the supported +versions, and it makes the Spark Expression Support reference something the project can stand behind as it +approaches 1.0.0. + +Second, this release included a systematic audit of Comet's existing expression implementations against Spark, +detailed below. + +### Expression Audit + +The audit compared Comet's expression implementations against Apache Spark 3.4.3, 3.5.8, 4.0.1, and 4.1.1, +covering the hash, JSON, collection, map, predicate, bitwise, conditional, array, struct, math, and cast +expression families, along with cast behavior. Each audit compared Comet's behavior to Spark across all four +versions and expanded test coverage where gaps were found. + +This edge-case-by-edge-case comparison surfaced most of the behavior differences fixed in this +release. By working through each expression's corner cases and writing targeted Comet SQL tests for them, the +audit caught divergences that broader testing does not reliably exercise. Comet also runs the full Apache +Spark SQL test suite against each supported Spark version as part of CI, which continues to catch some +cross-version differences, but the audit's focused approach found considerably more. + +## Compatibility + +Supported platforms include: + +- **Spark 3.4.3** with Java 11/17 and Scala 2.12/2.13 +- **Spark 3.5.8** with Java 11/17 and Scala 2.12/2.13 +- **Spark 4.0.2** with Java 17 and Scala 2.13 +- **Spark 4.1.2** with Java 17 and Scala 2.13 + +See the [Spark Version Compatibility] page for known limitations specific to each version. + +[Spark Version Compatibility]: https://datafusion.apache.org/comet/user-guide/latest/compatibility/spark-versions.html + +This release builds on **DataFusion 53.1** and **Arrow 58.3**. + +## Get Started with Comet 0.17.0 + +Ready to try it out? Follow the [Comet 0.17.0 Installation Guide](https://datafusion.apache.org/comet/user-guide/0.17/installation.html) +to get up and running, then point Comet at your existing Spark workloads, including Scala and Java UDFs and +Spark 4 with ANSI mode enabled, and see the speedup for yourself.