Avoiding repeated encoding and compression operations for writes with sort columns #21464
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##               main   #21464      +/-   ##
============================================
- Coverage     73.48%   73.42%    -0.07%
+ Complexity    74646    74552       -94
============================================
  Files          5980     5980
  Lines        338777   338777
  Branches      48848    48848
============================================
- Hits         248964   248746      -218
- Misses        70026    70207      +181
- Partials      19787    19824       +37
```

View full report in Codecov by Sentry.
Summary
When sort columns are configured, the writer now stages incoming batches as Arrow IPC instead of Parquet, avoiding redundant Parquet encoding and compression before the sort-on-close step. For large files, batches are individually sorted and written as Parquet chunks, then merged via the existing streaming k-way merge. The per-batch sort uses Arrow's RowConverter for compact byte-comparable row encoding.
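As a rough sketch of the two staging paths: the WriterVariant::Ipc and WriterVariant::Parquet names come from this PR, but the wrapped writer types below are assumptions for illustration only.

```rust
// Sketch only: variant names are from the PR; the payload types are
// assumptions, not the actual definitions in the codebase.
enum WriterVariant {
    // No sort columns: encode straight to Parquet; finalize renames the temp file.
    Parquet(parquet::arrow::ArrowWriter<std::fs::File>),
    // Sort columns configured: stage cheap Arrow IPC; sort and encode once on close.
    Ipc(arrow::ipc::writer::FileWriter<std::fs::File>),
}
```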
Details
Arrow IPC staging for sorted writes
Previously, all data was written as Parquet regardless of whether sort columns were configured. On finalize, the Parquet file was read back (decoded), sorted, and re-written as Parquet: a full encode-decode-encode cycle. Now, when sort columns are present, a WriterVariant::Ipc path writes batches as Arrow IPC (raw in-memory Arrow buffers with minimal framing). On finalize, the IPC file is read back with near-zero deserialization cost, sorted, and written as the final Parquet file: a single encode step. When no sort columns are configured, the WriterVariant::Parquet path is used as before, and finalize simply renames the temp file to the final path.
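A minimal, self-contained sketch of the stage-as-IPC / read-back-on-finalize cycle using arrow-rs; the path, schema, and data here are illustrative, not taken from the PR:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::reader::FileReader;
use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![3, 1, 2])) as ArrayRef],
    )?;

    // Staging write: raw Arrow buffers with minimal framing, no Parquet
    // encoding or compression work.
    let mut writer = FileWriter::try_new(File::create("/tmp/stage.arrow")?, &schema)?;
    writer.write(&batch)?;
    writer.finish()?;

    // Finalize-time read-back: batches come back with near-zero
    // deserialization cost, ready for the sort and the single Parquet encode.
    let reader = FileReader::try_new(File::open("/tmp/stage.arrow")?, None)?;
    for staged in reader {
        let staged = staged?;
        assert_eq!(staged.num_rows(), 3);
    }
    Ok(())
}
```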
Small and large file sort strategies
The sort-on-close path (sort_and_rewrite_parquet) uses a size-based threshold (sort_in_memory_threshold_bytes, default 32 MB) to choose between two strategies, with the chunking step sketched after this list:

- Small files (sort_small_file): all IPC batches are read into memory, concatenated, sorted, row IDs rewritten, and written as a single Parquet file.
- Large files (sort_large_file): each IPC batch is sliced into sort_batch_size chunks (to bound memory, since IPC batches can be arbitrarily large), each chunk is sorted individually and written as a temporary Parquet chunk file, then the existing streaming k-way merge produces the final globally-sorted Parquet output. Chunk files are cleaned up after the merge.
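A hedged sketch of the large-file chunking step; the helper name is hypothetical and sort_batch_size is passed as a plain parameter:

```rust
use arrow::record_batch::RecordBatch;

// Hypothetical helper illustrating how an arbitrarily large staged IPC batch
// can be bounded: `RecordBatch::slice` is zero-copy, so chunking does not
// duplicate the underlying buffers.
fn slice_into_chunks(batch: &RecordBatch, sort_batch_size: usize) -> Vec<RecordBatch> {
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < batch.num_rows() {
        let len = sort_batch_size.min(batch.num_rows() - offset);
        chunks.push(batch.slice(offset, len));
        offset += len;
    }
    chunks
}
```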
RowConverter-based per-batch sort
The sort_batch function now uses Arrow's RowConverter instead of lexsort_to_indices. RowConverter converts the sort columns into compact byte-comparable rows in which sort direction and null ordering are baked into the encoding. Sorting then reduces to comparing opaque byte sequences via sort_unstable_by, followed by a take to reorder all columns. This approach handles all Arrow data types uniformly, with no per-type dispatch in the sort comparator.
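A runnable sketch of the RowConverter pattern described above, assuming a single sort column with default ascending/nulls ordering; the schema and data are illustrative, and the PR's actual sort_batch would bake configured directions in via SortField::new_with_options:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray, UInt32Array};
use arrow::compute::take;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use arrow::row::{RowConverter, SortField};

// Sort a batch by its first column using the byte-comparable Row encoding.
fn sort_batch(batch: &RecordBatch) -> Result<RecordBatch, ArrowError> {
    let sort_col = batch.column(0).clone();
    let converter = RowConverter::new(vec![SortField::new(sort_col.data_type().clone())])?;
    let rows = converter.convert_columns(&[sort_col])?;

    // Opaque byte comparison replaces any per-type comparator dispatch.
    let mut indexed: Vec<_> = rows.iter().enumerate().collect();
    indexed.sort_unstable_by(|(_, a), (_, b)| a.cmp(b));

    // One `take` per column reorders the whole batch by the sorted permutation.
    let indices = UInt32Array::from_iter_values(indexed.iter().map(|(i, _)| *i as u32));
    let columns = batch
        .columns()
        .iter()
        .map(|c| take(c.as_ref(), &indices, None))
        .collect::<Result<Vec<_>, _>>()?;
    RecordBatch::try_new(batch.schema(), columns)
}

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::Int64, false),
        Field::new("val", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![3, 1, 2])) as ArrayRef,
            Arc::new(StringArray::from(vec!["c", "a", "b"])) as ArrayRef,
        ],
    )?;
    let sorted = sort_batch(&batch)?;
    let keys: Vec<i64> = sorted
        .column(0)
        .as_any()
        .downcast_ref::<Int64Array>()
        .unwrap()
        .iter()
        .flatten()
        .collect();
    assert_eq!(keys, vec![1, 2, 3]);
    Ok(())
}
```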
CRC32 integration
The IPC staging path integrates with the existing CrcWriter-based CRC32 computation. Since IPC batches are not Parquet-encoded, the CRC is computed during the final Parquet write step (write_final_file for small files, merge_sorted for large files) rather than during the initial staging write. The WriterState.crc_handle field is Option<CrcHandle>: Some for the Parquet variant, None for the IPC variant.
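A minimal sketch of the CRC-accumulating writer pattern, assuming the crc32fast crate; the real CrcWriter/CrcHandle types live in the codebase and may differ:

```rust
use std::io::{self, Write};

// Sketch of a CRC32-accumulating writer: wraps any `Write` sink and hashes
// every byte that passes through, so the checksum is produced during the
// final Parquet write rather than as a separate pass.
struct CrcWriter<W: Write> {
    inner: W,
    hasher: crc32fast::Hasher,
}

impl<W: Write> CrcWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, hasher: crc32fast::Hasher::new() }
    }

    // Final CRC32 of everything written, taken once the Parquet encode finishes.
    fn finalize(self) -> u32 {
        self.hasher.finalize()
    }
}

impl<W: Write> Write for CrcWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        // Hash only the bytes the inner writer actually accepted.
        self.hasher.update(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```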
Testing
Existing tests cover both sorted (IPC path) and unsorted (Parquet path) writer lifecycles, including concurrent writers, empty data, multi-batch sorting, ascending/descending sort, and mixed IPC+Parquet coexistence. New integration tests validate IPC staging cleanup, sorted output correctness, and concurrent sorted writer behavior.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.