Heya,
I've been trying to replicate some of these benchmarks on some real world data we have and I'm finding some pretty different results. I've forked and modified the bench code pretty heavily to reflect our use cases a bit more closely, but when I was doing that I noticed this snippet:
```python
if len(dimensions) == 1:
    t3 = time.perf_counter()
    dataset[:dimensions[0]] = data
elif len(dimensions) == 2:
    t3 = time.perf_counter()
    dataset[:dimensions[0], :dimensions[1]] = data
else:
    t3 = time.perf_counter()
    dataset[:dimensions[0], :dimensions[1], :dimensions[2]] = data
t4 = time.perf_counter()
# Add up the times taken to get the total time taken to create and write all datasets
dataset_creation_time += (t2 - t1)
dataset_population_time += (t4 - t3)
```
coming from `File-Format-Testing/datasets_test/write.py`, line 50 at 5d8161f.
This looks to me like you're just measuring the time to write into a buffer, rather than the time to actually write the files to disk. In a real usage scenario I'm pretty sure disk IO, rather than filling a buffer, will dominate write time, so I don't think this is necessarily benchmarking what you were hoping to?
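For what it's worth, one way to make the timer cover the actual IO is to flush and fsync before stopping the clock. Here's a minimal sketch using a plain binary file as a stand-in for the HDF5 dataset (`timed_write` is a made-up helper, not something from your bench code); for h5py specifically I believe calling `dataset.file.flush()` before `t4` would at least push HDF5's own buffers out, though forcing a real fsync through h5py is fiddlier.

```python
import os
import time


def timed_write(path, data):
    """Write `data` (bytes) to `path`, timing until the bytes reach the disk."""
    t3 = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)             # fills Python's userspace buffer
        f.flush()                 # drain that buffer into the OS page cache
        os.fsync(f.fileno())      # force the page cache out to the device
    t4 = time.perf_counter()
    return t4 - t3


elapsed = timed_write("example.bin", b"\x00" * 1024)
```

Without the `flush`/`fsync` pair, the timed region can finish while the data is still sitting in the page cache, which I suspect is exactly what's happening in the snippet above.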
For example, our loads and writes look something like this:
which is pretty different to the results you guys found - at least at first glance!
(Ignore the high outlier for the HDF5 read; I'm pretty sure that's related to FS block caching from importing the code used to read the files off disk.)
Thanks so much for doing this work - it's something I'd never thought about much until I read the paper and it's definitely got me thinking about serialisation more deeply. I'm on my laptop right now, but when I get the chance I'll link my fork of the benchmarks too.