diff --git a/dev/changelog/0.13.0.md b/dev/changelog/0.13.0.md
new file mode 100644
index 0000000000..3362e093f9
--- /dev/null
+++ b/dev/changelog/0.13.0.md
@@ -0,0 +1,220 @@
+
+
+# DataFusion Comet 0.13.0 Changelog
+
+This release consists of 159 commits from 15 contributors. See credits at the end of this changelog for more information.
+
+**Fixed bugs:**
+
+- fix: NativeScan count assert firing for no reason [#2850](https://github.com/apache/datafusion-comet/pull/2850) (EmilyMatt)
+- fix: Correct link to tracing guide in CometConf [#2866](https://github.com/apache/datafusion-comet/pull/2866) (manuzhang)
+- fix: Fall back to Spark for MakeDecimal with unsupported input type [#2815](https://github.com/apache/datafusion-comet/pull/2815) (andygrove)
+- fix: Normalize s3 paths for PME key retriever [#2874](https://github.com/apache/datafusion-comet/pull/2874) (mbutrovich)
+- fix: modify CometNativeScan to generate the file partitions without instantiating RDD [#2891](https://github.com/apache/datafusion-comet/pull/2891) (mbutrovich)
+- fix: Modulus on decimal data type mismatch [#2922](https://github.com/apache/datafusion-comet/pull/2922) (andygrove)
+- fix: [iceberg] Mark nativeIcebergScanMetadata @transient [#2930](https://github.com/apache/datafusion-comet/pull/2930) (mbutrovich)
+- fix: enable cast tests for Spark 4.0 [#2919](https://github.com/apache/datafusion-comet/pull/2919) (manuzhang)
+- fix: Remove fallback for maps containing complex types [#2943](https://github.com/apache/datafusion-comet/pull/2943) (andygrove)
+- fix: CometShuffleManager hang by deferring SparkEnv access [#3002](https://github.com/apache/datafusion-comet/pull/3002) (Shekharrajak)
+- fix: format decimal to string when casting to short [#2916](https://github.com/apache/datafusion-comet/pull/2916) (manuzhang)
+- fix: [iceberg] reduce granularity of metrics updates in IcebergFileStream [#3050](https://github.com/apache/datafusion-comet/pull/3050) (mbutrovich)
+- fix: native shuffle now reports spill metrics correctly [#3197](https://github.com/apache/datafusion-comet/pull/3197) (andygrove)
+
+**Performance related:**
+
+- perf: [iceberg] Deduplicate serialized metadata for Iceberg native scan [#2933](https://github.com/apache/datafusion-comet/pull/2933) (mbutrovich)
+- perf: Use await instead of block_on in native shuffle writer [#2937](https://github.com/apache/datafusion-comet/pull/2937) (mbutrovich)
+- perf: refactor executePlan to try to avoid constantly entering Tokio runtime [#2938](https://github.com/apache/datafusion-comet/pull/2938) (mbutrovich)
+- perf: Optimize lpad/rpad to remove unnecessary memory allocations per element [#2963](https://github.com/apache/datafusion-comet/pull/2963) (andygrove)
+- perf: Improve performance of normalize_nan [#2999](https://github.com/apache/datafusion-comet/pull/2999) (andygrove)
+- perf: Improve string expression microbenchmarks [#3012](https://github.com/apache/datafusion-comet/pull/3012) (andygrove)
+- perf: Improve date/time microbenchmarks to avoid redundant/duplicate benchmarks [#3020](https://github.com/apache/datafusion-comet/pull/3020) (andygrove)
+- perf: Improve aggregate expression microbenchmarks [#3021](https://github.com/apache/datafusion-comet/pull/3021) (andygrove)
+- perf: Improve conditional expression microbenchmarks [#3024](https://github.com/apache/datafusion-comet/pull/3024) (andygrove)
+- perf: Improve performance of date truncate [#2997](https://github.com/apache/datafusion-comet/pull/2997) (andygrove)
+- perf: Add microbenchmark for comparison expressions [#3026](https://github.com/apache/datafusion-comet/pull/3026) (andygrove)
+- perf: Implement more microbenchmarks for cast expressions [#3031](https://github.com/apache/datafusion-comet/pull/3031) (andygrove)
+- perf: Add microbenchmark for hash expressions [#3028](https://github.com/apache/datafusion-comet/pull/3028) (andygrove)
+- perf: Improve performance of CAST from string to int [#3017](https://github.com/apache/datafusion-comet/pull/3017) (coderfender)
+- perf: Improve criterion benchmarks for cast string to int [#3049](https://github.com/apache/datafusion-comet/pull/3049) (andygrove)
+- perf: Additional optimizations for cast from string to int [#3048](https://github.com/apache/datafusion-comet/pull/3048) (andygrove)
+- perf: set DataFusion session context's target_partitions to match Spark's spark.task.cpus [#3062](https://github.com/apache/datafusion-comet/pull/3062) (mbutrovich)
+- perf: don't busy-poll Tokio stream for plans without CometScan [#3063](https://github.com/apache/datafusion-comet/pull/3063) (mbutrovich)
+- perf: minor optimizations in `process_sorted_row_partition` [#3059](https://github.com/apache/datafusion-comet/pull/3059) (andygrove)
+- perf: optimize complex-type hash implementations [#3140](https://github.com/apache/datafusion-comet/pull/3140) (mbutrovich)
+- perf: [iceberg] Remove IcebergFileStream, use iceberg-rust's parallelization, bump iceberg-rust to latest, cache SchemaAdapter [#3051](https://github.com/apache/datafusion-comet/pull/3051) (mbutrovich)
+- perf: [iceberg] reduce nativeIcebergScanMetadata serialization points [#3243](https://github.com/apache/datafusion-comet/pull/3243) (mbutrovich)
+- perf: reduce GC pressure in protobuf serialization [#3242](https://github.com/apache/datafusion-comet/pull/3242) (andygrove)
+- perf: cache serialized query plans to avoid per-partition serialization [#3246](https://github.com/apache/datafusion-comet/pull/3246) (andygrove)
+- perf: [iceberg] Use protobuf instead of JSON to serialize Iceberg partition values [#3247](https://github.com/apache/datafusion-comet/pull/3247) (parthchandra)
+
+**Implemented enhancements:**
+
+- feat: Add experimental support for native Parquet writes [#2812](https://github.com/apache/datafusion-comet/pull/2812) (andygrove)
+- feat: Partially implement file commit protocol for native Parquet writes [#2828](https://github.com/apache/datafusion-comet/pull/2828) (andygrove)
+- feat: CometNativeWriteExec support with native scan as a child [#2839](https://github.com/apache/datafusion-comet/pull/2839) (mbutrovich)
+- feat: Add support for `explode` and `explode_outer` for array inputs [#2836](https://github.com/apache/datafusion-comet/pull/2836) (andygrove)
+- feat: Support ANSI mode SUM (Decimal types) [#2826](https://github.com/apache/datafusion-comet/pull/2826) (coderfender)
+- feat: Add expression registry to native planner [#2851](https://github.com/apache/datafusion-comet/pull/2851) (andygrove)
+- feat: Implement native operator registry [#2875](https://github.com/apache/datafusion-comet/pull/2875) (andygrove)
+- feat: Improve fallback reporting for `native_datafusion` scan [#2879](https://github.com/apache/datafusion-comet/pull/2879) (andygrove)
+- feat: Enable bucket pruning with native_datafusion scans [#2888](https://github.com/apache/datafusion-comet/pull/2888) (mbutrovich)
+- feat: support_ansi-mode_aggregated_benchmarking [#2901](https://github.com/apache/datafusion-comet/pull/2901) (coderfender)
+- feat: [iceberg] REST catalog support for CometNativeIcebergScan [#2895](https://github.com/apache/datafusion-comet/pull/2895) (mbutrovich)
+- feat: [iceberg] Support session token in Iceberg Native scan [#2913](https://github.com/apache/datafusion-comet/pull/2913) (hsiang-c)
+- feat: Make shuffle writer buffer size configurable [#2899](https://github.com/apache/datafusion-comet/pull/2899) (andygrove)
+- feat: Add partial support for `from_json` [#2934](https://github.com/apache/datafusion-comet/pull/2934) (andygrove)
+- feat: Create benchmarks comet cast [#2932](https://github.com/apache/datafusion-comet/pull/2932) (coderfender)
+- feat: Support string decimal cast [#2925](https://github.com/apache/datafusion-comet/pull/2925) (coderfender)
+- feat: Remove unnecessary transition for native writes [#2960](https://github.com/apache/datafusion-comet/pull/2960) (comphead)
+- feat: Initial implementation of size for array inputs [#2862](https://github.com/apache/datafusion-comet/pull/2862) (andygrove)
+- feat: Support ANSI mode sum expr (int inputs) [#2600](https://github.com/apache/datafusion-comet/pull/2600) (coderfender)
+- feat: Support casting string float types [#2835](https://github.com/apache/datafusion-comet/pull/2835) (coderfender)
+- feat: Support ANSI mode avg expr (int inputs) [#2817](https://github.com/apache/datafusion-comet/pull/2817) (coderfender)
+- feat: Add support for remote Parquet HDFS writer with openDAL [#2929](https://github.com/apache/datafusion-comet/pull/2929) (comphead)
+- feat: Expand `murmur3` hash support to complex types [#3077](https://github.com/apache/datafusion-comet/pull/3077) (andygrove)
+- feat: Comet Writer should respect object store settings [#3042](https://github.com/apache/datafusion-comet/pull/3042) (comphead)
+- feat: add support for unix_date expression [#3141](https://github.com/apache/datafusion-comet/pull/3141) (andygrove)
+- feat: add partial support for date_format expression [#3201](https://github.com/apache/datafusion-comet/pull/3201) (andygrove)
+- feat: add complex type support to native Parquet writer [#3214](https://github.com/apache/datafusion-comet/pull/3214) (andygrove)
+- feat: implement framework to support multiple pyspark benchmarks [#3080](https://github.com/apache/datafusion-comet/pull/3080) (andygrove)
+- feat: add support for datediff expression [#3145](https://github.com/apache/datafusion-comet/pull/3145) (andygrove)
+- feat: Add support for `unix_timestamp` function [#2936](https://github.com/apache/datafusion-comet/pull/2936) (andygrove)
+- feat: add support for last_day expression [#3143](https://github.com/apache/datafusion-comet/pull/3143) (andygrove)
+- feat: Support left expression [#3206](https://github.com/apache/datafusion-comet/pull/3206) (Shekharrajak)
+
+**Documentation updates:**
+
+- docs: add documentation for fully-native Iceberg scans [#2868](https://github.com/apache/datafusion-comet/pull/2868) (mbutrovich)
+- docs: Add documentation to contributor guide explaining native + JVM shuffle implementation [#3055](https://github.com/apache/datafusion-comet/pull/3055) (andygrove)
+- docs: add guidance on disabling constant folding for literal tests [#3200](https://github.com/apache/datafusion-comet/pull/3200) (andygrove)
+- docs: Add common pitfalls and improve PR checklist in development guide [#3231](https://github.com/apache/datafusion-comet/pull/3231) (andygrove)
+- docs: various documentation updates in preparation for next release [#3254](https://github.com/apache/datafusion-comet/pull/3254) (andygrove)
+
+**Other:**
+
+- chore: Add 0.12.0 changelog [#2811](https://github.com/apache/datafusion-comet/pull/2811) (andygrove)
+- chore: Prepare for 0.13.0 development [#2809](https://github.com/apache/datafusion-comet/pull/2809) (andygrove)
+- minor: Add microbenchmark for integer sum with grouping [#2805](https://github.com/apache/datafusion-comet/pull/2805) (andygrove)
+- test: extract conditional expression tests (`if`, `case_when` and `coalesce`) [#2807](https://github.com/apache/datafusion-comet/pull/2807) (rluvaton)
+- build: Disable caching for macOS PR builds [#2816](https://github.com/apache/datafusion-comet/pull/2816) (andygrove)
+- chore(deps): bump actions/checkout from 5 to 6 [#2818](https://github.com/apache/datafusion-comet/pull/2818) (dependabot[bot])
+- chore(deps): bump object_store_opendal from 0.54.1 to 0.55.0 in /native [#2819](https://github.com/apache/datafusion-comet/pull/2819) (dependabot[bot])
+- chore(deps): bump cc from 1.2.46 to 1.2.47 in /native [#2822](https://github.com/apache/datafusion-comet/pull/2822) (dependabot[bot])
+- chore(deps): bump opendal from 0.54.1 to 0.55.0 in /native [#2821](https://github.com/apache/datafusion-comet/pull/2821) (dependabot[bot])
+- chore: update `Iceberg` install docs [#2824](https://github.com/apache/datafusion-comet/pull/2824) (comphead)
+- chore(deps): bump cc from 1.2.47 to 1.2.48 in /native [#2833](https://github.com/apache/datafusion-comet/pull/2833) (dependabot[bot])
+- chore(deps): bump the proto group in /native with 2 updates [#2832](https://github.com/apache/datafusion-comet/pull/2832) (dependabot[bot])
+- minor: Clean up shuffle transformation code in `CometExecRule` [#2840](https://github.com/apache/datafusion-comet/pull/2840) (andygrove)
+- chore: fix broken link to Apache DataFusion Comet Overview in README [#2846](https://github.com/apache/datafusion-comet/pull/2846) (onestn)
+- chore: Refactor some of the scan and sink handling in `CometExecRule` to reduce duplicate code [#2844](https://github.com/apache/datafusion-comet/pull/2844) (andygrove)
+- deps: bump lz4_flex, downgrade prost from yanked version [#2847](https://github.com/apache/datafusion-comet/pull/2847) (mbutrovich)
+- minor: Move shuffle logic from `CometExecRule` to `CometShuffleExchangeExec` serde implementation [#2853](https://github.com/apache/datafusion-comet/pull/2853) (andygrove)
+- chore: remove coverage file auto generator [#2854](https://github.com/apache/datafusion-comet/pull/2854) (comphead)
+- chore(deps): bump cc from 1.2.48 to 1.2.49 in /native [#2858](https://github.com/apache/datafusion-comet/pull/2858) (dependabot[bot])
+- chore: Refactor `CometExecRule` handling of `BroadcastHashJoin` and fix fallback reporting [#2856](https://github.com/apache/datafusion-comet/pull/2856) (andygrove)
+- chore: update actions/checkout from v4 to v6 in setup-iceberg and set… [#2857](https://github.com/apache/datafusion-comet/pull/2857) (bjornjorgensen)
+- minor: Small refactor in `CometExecRule` to remove confusing code and fix more fallback reporting [#2860](https://github.com/apache/datafusion-comet/pull/2860) (andygrove)
+- chore: Add unit tests for `CometExecRule` [#2863](https://github.com/apache/datafusion-comet/pull/2863) (andygrove)
+- chore: Add unit tests for `CometScanRule` [#2867](https://github.com/apache/datafusion-comet/pull/2867) (andygrove)
+- minor: Pedantic refactoring to move some methods from `CometSparkSessionExtensions` to `CometScanRule` and `CometExecRule` [#2873](https://github.com/apache/datafusion-comet/pull/2873) (andygrove)
+- deps: [iceberg] upgrade DataFusion to 51, Arrow to 57, Iceberg to latest, MSRV to 1.88 [#2729](https://github.com/apache/datafusion-comet/pull/2729) (mbutrovich)
+- chore: Enable plan stability suite for `native_datafusion` scans [#2877](https://github.com/apache/datafusion-comet/pull/2877) (andygrove)
+- chore: `ScanExec::new` no longer fetches data [#2881](https://github.com/apache/datafusion-comet/pull/2881) (andygrove)
+- Chore: refactor bit_not [#2896](https://github.com/apache/datafusion-comet/pull/2896) (kazantsev-maksim)
+- chore(deps): bump actions/cache from 4 to 5 [#2909](https://github.com/apache/datafusion-comet/pull/2909) (dependabot[bot])
+- chore(deps): bump actions/upload-artifact from 5 to 6 [#2910](https://github.com/apache/datafusion-comet/pull/2910) (dependabot[bot])
+- chore: Refactor string benchmarks (~10x reduction in LOC) [#2907](https://github.com/apache/datafusion-comet/pull/2907) (andygrove)
+- chore(deps): bump actions/download-artifact from 6 to 7 [#2908](https://github.com/apache/datafusion-comet/pull/2908) (dependabot[bot])
+- chore: use datafusion impl of hex function [#2915](https://github.com/apache/datafusion-comet/pull/2915) (kazantsev-maksim)
+- chore: Use fixed seed in RNG in tests [#2917](https://github.com/apache/datafusion-comet/pull/2917) (andygrove)
+- chore: Remove `row_step` from `process_sorted_row_partition` [#2920](https://github.com/apache/datafusion-comet/pull/2920) (andygrove)
+- chore: Move string function handling to new expression registry [#2931](https://github.com/apache/datafusion-comet/pull/2931) (andygrove)
+- chore: Reduce syscalls in metrics update logic [#2940](https://github.com/apache/datafusion-comet/pull/2940) (andygrove)
+- chore: Add shuffle benchmark for deeply nested schemas [#2902](https://github.com/apache/datafusion-comet/pull/2902) (andygrove)
+- chore: Reduce timer overhead in native shuffle writer [#2941](https://github.com/apache/datafusion-comet/pull/2941) (andygrove)
+- chore: Remove low-level ffi/jvm timers from native `ScanExec` [#2939](https://github.com/apache/datafusion-comet/pull/2939) (andygrove)
+- build: Skip problematic Spark SQL test for Spark 4.0.x [#2947](https://github.com/apache/datafusion-comet/pull/2947) (andygrove)
+- build: Reinstate macOS CI builds of Comet with Spark 4.0 [#2950](https://github.com/apache/datafusion-comet/pull/2950) (manuzhang)
+- chore(deps): bump reqwest from 0.12.25 to 0.12.26 in /native [#2952](https://github.com/apache/datafusion-comet/pull/2952) (dependabot[bot])
+- chore(deps): bump cc from 1.2.49 to 1.2.50 in /native [#2954](https://github.com/apache/datafusion-comet/pull/2954) (dependabot[bot])
+- chore(deps): bump assertables from 9.8.2 to 9.8.3 in /native [#2953](https://github.com/apache/datafusion-comet/pull/2953) (dependabot[bot])
+- minor: Refactor expression microbenchmarks to remove duplicate code [#2956](https://github.com/apache/datafusion-comet/pull/2956) (andygrove)
+- build: fix missing import in `main` [#2962](https://github.com/apache/datafusion-comet/pull/2962) (andygrove)
+- build: Skip macOS Spark 4 fuzz test [#2966](https://github.com/apache/datafusion-comet/pull/2966) (andygrove)
+- Avoid duplicated writer nodes when AQE enabled [#2982](https://github.com/apache/datafusion-comet/pull/2982) (comphead)
+- build: Set thread thresholds envs for spark test on macOS [#2987](https://github.com/apache/datafusion-comet/pull/2987) (wForget)
+- chore: Add microbenchmark for casting string to temporal types [#2980](https://github.com/apache/datafusion-comet/pull/2980) (andygrove)
+- chore(deps): bump reqwest from 0.12.26 to 0.12.28 in /native [#3009](https://github.com/apache/datafusion-comet/pull/3009) (dependabot[bot])
+- chore(deps): bump tempfile from 3.23.0 to 3.24.0 in /native [#3006](https://github.com/apache/datafusion-comet/pull/3006) (dependabot[bot])
+- chore(deps): bump serde_json from 1.0.145 to 1.0.148 in /native [#3010](https://github.com/apache/datafusion-comet/pull/3010) (dependabot[bot])
+- chore: Add microbenchmark for casting string to numeric [#2979](https://github.com/apache/datafusion-comet/pull/2979) (andygrove)
+- chore: Skip some CI workflows for benchmark changes [#3030](https://github.com/apache/datafusion-comet/pull/3030) (andygrove)
+- chore: Skip more workflows on benchmark PRs [#3034](https://github.com/apache/datafusion-comet/pull/3034) (andygrove)
+- chore: Improve microbenchmark for string expressions [#2964](https://github.com/apache/datafusion-comet/pull/2964) (andygrove)
+- chore(deps): bump tokio from 1.48.0 to 1.49.0 in /native [#3039](https://github.com/apache/datafusion-comet/pull/3039) (dependabot[bot])
+- chore(deps): bump libc from 0.2.178 to 0.2.179 in /native [#3038](https://github.com/apache/datafusion-comet/pull/3038) (dependabot[bot])
+- chore(deps): bump actions/cache from 4 to 5 [#3037](https://github.com/apache/datafusion-comet/pull/3037) (dependabot[bot])
+- Chore: to_json unit/benchmark tests [#3011](https://github.com/apache/datafusion-comet/pull/3011) (kazantsev-maksim)
+- chore: Add checks to microbenchmarks for plan running natively in Comet [#3045](https://github.com/apache/datafusion-comet/pull/3045) (andygrove)
+- chore: Refactor `CometScanRule` to improve scan selection and fallback logic [#2978](https://github.com/apache/datafusion-comet/pull/2978) (andygrove)
+- chore: Respect to legacySizeOfNull option for size function [#3036](https://github.com/apache/datafusion-comet/pull/3036) (kazantsev-maksim)
+- chore: Add PySpark-based benchmarks, starting with ETL example [#3065](https://github.com/apache/datafusion-comet/pull/3065) (andygrove)
+- chore(deps): bump the proto group in /native with 2 updates [#3071](https://github.com/apache/datafusion-comet/pull/3071) (dependabot[bot])
+- chore: add MacOS file and event trace log to gitignore [#3070](https://github.com/apache/datafusion-comet/pull/3070) (manuzhang)
+- chore(deps): bump arrow from 57.1.0 to 57.2.0 in /native [#3073](https://github.com/apache/datafusion-comet/pull/3073) (dependabot[bot])
+- chore(deps): bump parquet from 57.1.0 to 57.2.0 in /native [#3074](https://github.com/apache/datafusion-comet/pull/3074) (dependabot[bot])
+- chore(deps): bump cc from 1.2.50 to 1.2.52 in /native [#3072](https://github.com/apache/datafusion-comet/pull/3072) (dependabot[bot])
+- chore: improve cast documentation to add support per eval mode [#3056](https://github.com/apache/datafusion-comet/pull/3056) (coderfender)
+- chore: Refactor JVM shuffle: Move `SpillSorter` to top level class and add tests [#3081](https://github.com/apache/datafusion-comet/pull/3081) (andygrove)
+- minor: Split CometShuffleExternalSorter into sync/async implementations [#3192](https://github.com/apache/datafusion-comet/pull/3192) (andygrove)
+- chore: Add pending PR shield [#3205](https://github.com/apache/datafusion-comet/pull/3205) (comphead)
+- chore: deprecate native_comet scan in favor of native_iceberg_compat [#2949](https://github.com/apache/datafusion-comet/pull/2949) (Shekharrajak)
+- chore: add script to regenerate golden files for plan stability tests [#3204](https://github.com/apache/datafusion-comet/pull/3204) (andygrove)
+- chore: fix clippy warnings for Rust 1.93 [#3239](https://github.com/apache/datafusion-comet/pull/3239) (andygrove)
+- build: build native library once and share across CI test jobs [#3249](https://github.com/apache/datafusion-comet/pull/3249) (andygrove)
+- Experimental: Native CSV files read [#3044](https://github.com/apache/datafusion-comet/pull/3044) (kazantsev-maksim)
+- build: add missing datafusion-datasource dependency [#3252](https://github.com/apache/datafusion-comet/pull/3252) (andygrove)
+- chore: Auto scan mode no longer falls back to `native_comet` [#3236](https://github.com/apache/datafusion-comet/pull/3236) (andygrove)
+- build: optimize CI cache usage and add fast lint gate [#3251](https://github.com/apache/datafusion-comet/pull/3251) (andygrove)
+
+## Credits
+
+Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
+
+```
+    82  Andy Grove
+    23  dependabot[bot]
+    18  Matt Butrovich
+     9  B Vadlamani
+     7  Oleks V
+     5  Kazantsev Maksim
+     5  Manu Zhang
+     3  Shekhar Prasad Rajak
+     1  Bjørn Jørgensen
+     1  Emily Matheys
+     1  Parth Chandra
+     1  Raz Luvaton
+     1  Wonseok Yang
+     1  Zhen Wang
+     1  hsiang-c
+```
+
+Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.