gh-138213: Make csv.reader up to 2x faster #138214
maurycy wants to merge 14 commits into python:main

Conversation
serhiy-storchaka left a comment
We need to consider not only the gain, but also the cost/benefit ratio. This change significantly complicates already complex code.
The benefit is not unconditional either. In the IN_FIELD state, PyUnicode_FindChar() is called 4 times. This can actually slow down the code for long non-ASCII lines.
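(In rough Python terms, the pattern in question looks like the sketch below; the names and the set of special characters are assumed for illustration, and this is not the actual _csv.c code. With short fields, the four independent scans repeatedly traverse nearly the same region, which is where a regression could come from.)

```python
SPECIALS = (",", '"', "\\", "\n")   # delimiter, quotechar, escapechar, EOL

def next_interesting(line, pos):
    # Four separate scans per chunk, mirroring the four
    # PyUnicode_FindChar() calls the review mentions; when interesting
    # characters are dense, each call re-scans text that a one-pass
    # state machine would touch only once.
    hits = [i for i in (line.find(c, pos) for c in SPECIALS) if i != -1]
    return min(hits, default=len(line))
```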
Thank you for the review!

I agree that it increases complexity; I added an explanatory comment before the main loop. There are only two building blocks: a jump to the next interesting character with PyUnicode_FindChar(), and a copy of the whole slice in between. My thinking is that conceptually it's simple: process the whole field at once.

I fully agree there are scenarios where it worsens performance. My hunch is that these are CSV files where the parser spends significantly more time in states other than IN_FIELD. What's the best way of sharing results here? I'm not sure what the best benchmarking strategy is, but I tried measuring long non-ASCII lines and still observed a significant benefit:
A benchmark similar to the one in the description:

```python
import csv
import io
import os
import pyperf
import random

NUM_ROWS = (1_000, 10_000)
NUM_COLS = (5, 10)
FIELD_LENGTH = (300, 1000)

CASES = [
    # (label, field_chars, delimiter, escapechar)
    (
        "nonascii_no_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        None,
    ),
    (
        "nonascii_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        "\\",
    ),
]

def generate_csv_data(rows, cols, field_len, ch, delim):
    # random.choices() so we're not cache-friendly
    field = "".join(random.choices(ch, k=field_len))
    row = delim.join([field] * cols)
    return os.linesep.join([row] * rows) + os.linesep

def benchmark_csv_reader(csv_data, delim, escapechar):
    rdr = csv.reader(
        io.StringIO(csv_data), delimiter=delim, escapechar=escapechar
    )
    for _ in rdr:
        pass

runner = pyperf.Runner()
for rows in NUM_ROWS:
    for cols in NUM_COLS:
        for field_len in FIELD_LENGTH:
            for label, ch, delim, esc in CASES:
                csv_data = generate_csv_data(rows, cols, field_len, ch, delim)
                runner.bench_func(
                    f"csv_reader({rows},{cols},{field_len})[{label}]",
                    benchmark_csv_reader,
                    csv_data,
                    delim,
                    esc,
                )
```

I updated the benchmark in the description to be more comprehensive.
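(For reference, a standard pyperf workflow for comparing a baseline and a patched build with a script like this; the file names are illustrative:)

```
./python-baseline bench_csv.py -o baseline.json
./python-patched bench_csv.py -o patched.json
python3 -m pyperf compare_to baseline.json patched.json --table
```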
Six months have passed and this is stale. I still believe that gh-138213 is a legitimate issue, but this PR would take a bit too much faith in a new contributor like me. :-)
The basic observation is that there's no need to process input character by character, calling the state machine for each one, while we're inside a field (IN_FIELD, IN_QUOTED_FIELD). Most characters are ordinary (i.e. they're not delimiters, escapes, quotes, etc.), so we can find the next interesting character and copy the whole slice in between.
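A minimal pure-Python sketch of the idea (my illustration, with a simplified grammar: delimiter and escapechar only, no quoting; the actual change is to the C state machine in Modules/_csv.c):

```python
# Simplified model of the optimization. Instead of stepping the state
# machine once per character, jump to the next interesting character
# and bulk-copy the ordinary slice in between.
def split_record(line, delimiter=",", escapechar="\\"):
    fields, buf, pos, n = [], [], 0, len(line)
    while pos < n:
        # One str.find() per special character, analogous to the
        # PyUnicode_FindChar() calls in the C patch.
        hits = [i for i in (line.find(delimiter, pos),
                            line.find(escapechar, pos)) if i != -1]
        nxt = min(hits) if hits else n
        buf.append(line[pos:nxt])          # copy the whole ordinary slice
        if nxt == n:
            break
        if line[nxt] == delimiter:
            fields.append("".join(buf))
            buf = []
            pos = nxt + 1
        else:                              # escapechar: next char is literal
            buf.append(line[nxt + 1:nxt + 2])
            pos = nxt + 2
    fields.append("".join(buf))
    return fields

assert split_record("a\\,b,c") == ["a,b", "c"]
```

The escaped comma is consumed in a single step, and the ordinary characters around it are appended as slices rather than one character at a time.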
This is my very first C change in cpython, so I'm more than happy to pair with someone.

Benchmark
There's no pyperformance benchmark for csv.reader. The script:
The results:
I observe similar results with real CSV files.
The environment: sudo ./python -m pyperf system tune ensured.

Linked issue: csv.reader calls the state machine for every character needlessly (#138213)

📚 Documentation preview 📚: https://cpython-previews--138214.org.readthedocs.build/