Incorporate I/O component in the compute benchmarks #26

@andersy005

For the compute benchmarks, we've been generating and persisting the data in memory for every combination of chunk_size and chunking_scheme prior to running the computations:

  chunk_size:
    - 32MB
    - 64MB
    - 128MB
    - 256MB
  chunking_scheme:
    - spatial
    - temporal
    - auto
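
For concreteness, here is a minimal sketch of how one such combination might be generated and persisted in memory before the computations. This assumes xarray and dask; the dimension sizes, variable name, and chunking logic are illustrative assumptions, not the repo's actual implementation:

  import dask
  import dask.array as da
  import xarray as xr

  def make_dataset(chunk_size="128MB", chunking_scheme="temporal"):
      """Generate a random SST-like dataset for one benchmark combination.

      Shapes, names, and the chunking logic below are illustrative only.
      """
      shape = (3650, 720, 1440)  # (time, lat, lon)
      if chunking_scheme == "temporal":
          chunks = {0: "auto", 1: -1, 2: -1}  # split along time only
      elif chunking_scheme == "spatial":
          chunks = {0: -1, 1: "auto", 2: "auto"}  # split along lat/lon only
      else:
          chunks = "auto"  # let dask choose a chunking across all axes
      # "array.chunk-size" caps the size of each "auto" chunk
      with dask.config.set({"array.chunk-size": chunk_size}):
          data = da.random.random(shape, chunks=chunks)
      return xr.Dataset({"sst": (("time", "lat", "lon"), data)})

  ds = make_dataset().persist()  # data held in memory prior to the computations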

Per discussions with @rabernat, @kmpaul, @tinaok, and @guillaumeeb, it is crucial to have an I/O component that emulates real use cases: the data will almost always live on a filesystem and be larger than what we can persist in memory.

I/O benchmarks

A few months ago, @kmpaul and @halehawk conducted an IOR-based I/O scaling study (C/MPI-based code) that compared:

  • Z5
  • netCDF4
  • HDF5
  • PnetCDF
  • MPIIO
  • POSIX

In zarr_hdf_benchmarks (Python/mpi4py-based code), @rabernat compared both the write and read components.


How should we go about incorporating an I/O component into the compute benchmarks?

  • Should we focus on the read component by generating a dataset with the same chunking and compression, writing it to both netCDF4 and Zarr for every chunk_size and chunking_scheme combination, and then testing a variety of access approaches?
  • Should the write component be taken into consideration too?
  • One of our long-term goals for this repo is that the benchmarks should be runnable on different platforms (HPC, cloud) and storage systems. Both https://github.com/rabernat/zarr_hdf_benchmarks and https://github.com/NCAR/ior_scaling are MPI-dependent, and I was wondering whether the I/O components for these benchmarks could be Python/Dask-based? (A sketch of what that might look like follows this list.)
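
To make that last question concrete, here is a minimal, MPI-free sketch of what a Python/Dask-based read benchmark could look like: write one dataset with identical chunking to both Zarr and netCDF4, then time the same reduction against each store. The file paths, dataset shape, and choice of reduction are illustrative assumptions:

  import time

  import dask.array as da
  import xarray as xr

  # A small illustrative dataset (see the generation sketch above)
  data = da.random.random((365, 180, 360), chunks=(100, 180, 360))
  ds = xr.Dataset({"sst": (("time", "lat", "lon"), data)})

  # Write the same data, with the same chunking, to both formats
  ds.to_zarr("benchmark_store.zarr", mode="w")
  ds.to_netcdf("benchmark_store.nc", engine="netcdf4")

  def time_read(open_func, path, **kwargs):
      """Time a full read of the variable via a simple reduction."""
      start = time.perf_counter()
      with open_func(path, **kwargs) as store:
          store["sst"].mean().compute()  # forces every chunk to be read
      return time.perf_counter() - start

  t_zarr = time_read(xr.open_zarr, "benchmark_store.zarr")
  t_nc = time_read(xr.open_dataset, "benchmark_store.nc",
                   engine="netcdf4", chunks={"time": 100})
  print(f"zarr: {t_zarr:.2f}s, netcdf4: {t_nc:.2f}s")

Because this relies only on xarray and dask, the same script should run largely unchanged across HPC and cloud storage systems, e.g. by swapping the local paths for fsspec-style URLs.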
