gh-138270: Use PyUnicodeWriter in csv.writer #138271

maurycy wants to merge 11 commits into python:main

Conversation

cc @vstinner
Quoted from the diff under review:

```c
    c == dialect->escapechar ||
    c == dialect->quotechar) {
    if (dialect->escapechar == NOT_SET) {
        PyErr_SetString(self->error_obj, "need to escape, but no escapechar set");
```

```c
bool first_field_was_empty_like = false;
bool first_field_was_none = false;
bool first_field_was_quoted_in_loop = false;
```
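The error path quoted above is observable from pure Python today: with `quoting=csv.QUOTE_NONE` and no `escapechar` configured, `csv.writer` raises `csv.Error` when a field would need escaping. A minimal reproduction:

```python
import csv
import io

f = io.StringIO()
# QUOTE_NONE with no escapechar: a field containing the delimiter
# must be escaped, but there is nothing to escape it with.
writer = csv.writer(f, quoting=csv.QUOTE_NONE)
try:
    writer.writerow(['a,b'])
except csv.Error as exc:
    print(exc)  # need to escape, but no escapechar set
```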
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase:
And yes, it could be meaningful to use PyUnicodeWriter instead of manual buffer construction, but we need to check whether this really improves things before deciding whether we can do it.
Marking this as a draft. There's a performance regression.

The script:

```python
import csv
import io

import pyperf

runner = pyperf.Runner()

INT_ROW = list(range(10))
COMPLEX_STRING_ROW = ['a,b', 'c"d', 'e\nf'] * 3 + ['ghi']


def write_the_rows(rows):
    f = io.StringIO()
    writer = csv.writer(f)
    writer.writerows(rows)


for num_rows in (10, 1_000, 10_000):
    int_rows = [INT_ROW] * num_rows
    complex_rows = [COMPLEX_STRING_ROW] * num_rows
    runner.bench_func(
        f'writerows {num_rows} integer rows',
        write_the_rows,
        int_rows,
    )
    runner.bench_func(
        f'writerows {num_rows} complex string rows',
        write_the_rows,
        complex_rows,
    )
```

There are two obvious issues:

I need some time to address it, perhaps with jumping similar to gh-138214.
Quoted from the diff under review:

```c
/* grow record buffer if necessary */
if (!join_check_rec_size(self, self->rec_len + terminator_len))
    return 0;
```

```c
if (PyUnicodeWriter_WriteChar(writer, c) < 0) {
```
It would be interesting to try calling WriteSubstring() at once rather than writing characters one by one.
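The batching idea can be sketched in Python (the function and its names are hypothetical; the real code operates on C-level buffers): instead of emitting one character per call, scan ahead to the next character that needs escaping and copy the whole clean run in a single write, which is what a WriteSubstring()-based loop would do.

```python
def join_append_field(field, special, escapechar):
    """Copy runs of ordinary characters in one slice instead of
    one character at a time (illustrative sketch only)."""
    out = []
    i = 0
    n = len(field)
    while i < n:
        # Find the next character that needs escaping.
        j = i
        while j < n and field[j] not in special:
            j += 1
        out.append(field[i:j])  # whole clean run at once
        if j < n:
            out.append(escapechar + field[j])
            j += 1
        i = j
    return ''.join(out)


print(join_append_field('a,b"c', {',', '"'}, '\\'))  # a\,b\"c
```

In the common case where few characters need escaping, the loop degenerates into a handful of bulk copies rather than one call per character.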
This was mostly exploratory, and the results aren't exactly promising. There's definitely more to be done around the
The purpose of this PR is not performance but using the modern https://docs.python.org/dev/c-api/unicode.html#c.PyUnicodeWriter API, similarly to gh-125196.
There's a risk that the code is slower, as it turned out in gh-133968. I'd prefer optimizing it after getting an ack that this is the correct direction.
Similarly to #138214 (comment), I'm not sure what the best benchmarking strategy is, beyond a simple snippet. Perhaps we need https://github.com/nineteendo/jsonyx-performance-tests but for CSV.
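For a quick stdlib-only sanity check (a rough sketch, not a replacement for pyperf), `timeit` can be run by hand against each build:

```python
import csv
import io
import timeit

ROWS = [['a,b', 'c"d', 'e\nf'] * 3 + ['ghi']] * 1_000


def write_rows():
    writer = csv.writer(io.StringIO())
    writer.writerows(ROWS)


# Best-of-5 to reduce noise; run the same script on each interpreter
# and compare the printed numbers by eye.
best = min(timeit.repeat(write_rows, number=20, repeat=5))
print(f'{best * 1e3:.2f} ms per 20 iterations')
```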
I believe that csv.reader (ReaderObj) could also use PyUnicodeWriter. If my thinking is sound, there's interest, and this code is OK, I can handle it.

Linked issue: csv module should use PyUnicodeWriter #138270