Heya,
I've been trying to replicate some of these benchmarks on some real world data we have and I'm finding some pretty different results. I've forked and modified the bench code pretty heavily to reflect our use cases a bit more closely, but when I was doing that I noticed this snippet:
```python
if len(dimensions) == 1:
    t3 = time.perf_counter()
    dataset[:dimensions[0]] = data
elif len(dimensions) == 2:
    t3 = time.perf_counter()
    dataset[:dimensions[0], :dimensions[1]] = data
else:
    t3 = time.perf_counter()
    dataset[:dimensions[0], :dimensions[1], :dimensions[2]] = data
t4 = time.perf_counter()
# Add up the times taken to get the total time taken to create and write all datasets
dataset_creation_time += (t2 - t1)
dataset_population_time += (t4 - t3)
```
coming from `File-Format-Testing/datasets_test/write.py`, line 50 at 5d8161f.
This looks to me like you're just measuring the time to write into a buffer, rather than the time to actually write the files to disk. In a real usage scenario I'm pretty sure disk IO, rather than filling a buffer, will dominate write time, so I don't think this is necessarily benchmarking what you were hoping to?
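For what it's worth, one way to make the timer cover the actual IO is to flush and fsync before stopping the clock. Here's a minimal sketch using a plain binary file as a stand-in for the HDF5 dataset (`timed_write` is a made-up helper, not something from your bench code); for h5py specifically I believe calling `dataset.file.flush()` before `t4` would at least push HDF5's own buffers out, though forcing a real fsync through h5py is fiddlier.

```python
import os
import time


def timed_write(path, data):
    """Write `data` (bytes) to `path`, timing until the bytes reach the disk."""
    t3 = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)             # fills Python's userspace buffer
        f.flush()                 # drain that buffer into the OS page cache
        os.fsync(f.fileno())      # force the page cache out to the device
    t4 = time.perf_counter()
    return t4 - t3


elapsed = timed_write("example.bin", b"\x00" * 1024)
```

Without the `flush`/`fsync` pair, the timed region can finish while the data is still sitting in the page cache, which I suspect is exactly what's happening in the snippet above.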
For example, our loads and writes look something like this:
which is pretty different to the results you guys found - at least at first glance!
(Ignore the high outlier for the HDF5 read; I'm pretty sure that's related to FS block caching from importing the code used to read the files off disk.)
Thanks so much for doing this work - it's something I'd never thought about much until I read the paper and it's definitely got me thinking about serialisation more deeply. I'm on my laptop right now, but when I get the chance I'll link my fork of the benchmarks too.