
Improve string reads from CSV files #5397

Open
e-kayrakli wants to merge 1 commit into Bears-R-Us:main from e-kayrakli:string-csv-read-imp

@e-kayrakli
Contributor

While switching to the v2 benchmark suite, we observed that string reads from CSV files are extremely slow. There are multiple reasons for this; this PR contains the first wave of improvements.

  • Stops creating a new string for every element in a string array
  • Stops capturing the result of the .bytes() iterator in a new array
  • Uses reduce intents on an already existing forall instead of firing off a separate reduce expression
  • Stops using a .bytes() capture and array assignment to copy the actual data; instead, copies the data byte by byte. We could consider Communication.get to move the data even more efficiently if needed.
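
For readers who want a concrete picture of the bullets above, here is a rough, language-neutral sketch in Python. It is not the Chapel code in this PR. The function name `pack_column`, the newline-separated input, and the offsets plus null-terminated bytes layout are all assumptions made for illustration. It shows the general shape: the total size is accumulated inside the loop that already scans the fields (rather than in a separate reduction), and bytes are copied straight into one preallocated buffer without building a temporary string or byte array per element.

```python
def pack_column(raw: bytes, sep: int = ord("\n")):
    """Pack separator-delimited fields into (offsets, values).

    Hypothetical helper for illustration only. `values` holds every field
    followed by a single null byte; `offsets` holds each field's start.
    """
    # Pass 1: find field boundaries and accumulate the total buffer size
    # in the same loop, instead of running a second reduction afterwards.
    starts, lengths = [], []
    total = 0
    start = 0
    for i in range(len(raw) + 1):
        if i == len(raw) or raw[i] == sep:
            if i == len(raw) and start == i:
                break  # input was empty or ended with a separator
            starts.append(start)
            lengths.append(i - start)
            total += (i - start) + 1  # +1 for the null terminator
            start = i + 1

    # Pass 2: allocate once and copy byte by byte from the raw input.
    # bytearray(total) is zero-filled, so terminators are already in place.
    values = bytearray(total)
    offsets = []
    pos = 0
    for s, n in zip(starts, lengths):
        offsets.append(pos)
        for k in range(n):
            values[pos + k] = raw[s + k]
        pos += n + 1
    return offsets, bytes(values)
```

In Python a slice assignment would be faster than the inner loop; the explicit loop is kept only to mirror the byte-by-byte copy described above.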

A potential next step is to read all columns together instead of column by column, similar to the recent improvements to Parquet reads.

Performance

The v1 benchmark uses 1M elements per locale. I tested with 10k elements, where this PR outperformed main by 107x. Anything larger takes too long for me to wait for.

Signed-off-by: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
@e-kayrakli
Contributor Author

This needs correctness testing.

@codecov

codecov bot commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@943561a). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #5397   +/-   ##
========================================
  Coverage        ?   100.00%           
========================================
  Files           ?         5           
  Lines           ?       115           
  Branches        ?         0           
========================================
  Hits            ?       115           
  Misses          ?         0           
  Partials        ?         0           
Flag Coverage Δ
python-coverage 100.00% <ø> (?)

