VirtualiZarr was developed to replace the Kerchunk package, to serve as a way to make non-cloud-optimized file formats accessible in a cloud-optimized manner through the Zarr API. It's based on the idea that most binary chunked array file formats can be mapped to the Zarr data model. Both Kerchunk and Zarr are mentioned in your talk abstract.
Our packages seem very similar. As far as I can tell, they both:
Target non-cloud-optimized data sitting in object storage
Pre-process that data to extract metadata and chunk / partition references
Can perform that pre-processing at scale in parallel (VirtualiZarr can use dask or lithops or a general parallel executor, though this functionality hasn't been released yet)
Persist those chunk / partition references to storage in new objects
Allow cloud-optimized parallel access to the original data by having the data user hit the serialized chunk / partition references first.
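To make the shared pattern above concrete, here is a minimal sketch of the "chunk reference" idea. It uses neither VirtualiZarr's nor dataplug's actual API: `build_manifest` and `read_chunk` are hypothetical helpers, a local file stands in for an object in cloud storage, and a seek-and-read stands in for an HTTP range request.

```python
import json
import os
import tempfile


def build_manifest(path, chunk_size):
    """Pre-processing step: record where each fixed-size chunk lives in the file."""
    size = os.path.getsize(path)
    manifest = {}
    for index, offset in enumerate(range(0, size, chunk_size)):
        length = min(chunk_size, size - offset)
        manifest[str(index)] = {"path": path, "offset": offset, "length": length}
    return manifest


def read_chunk(manifest, key):
    """Access step: look up the reference first, then read only that byte range."""
    ref = manifest[key]
    with open(ref["path"], "rb") as f:
        f.seek(ref["offset"])
        return f.read(ref["length"])


# Stand-in for a non-cloud-optimized object: 10 bytes with values 0..9.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "legacy.bin")
with open(data_path, "wb") as f:
    f.write(bytes(range(10)))

# Persist the references as a new, separate object.
manifest_path = os.path.join(workdir, "manifest.json")
with open(manifest_path, "w") as f:
    json.dump(build_manifest(data_path, chunk_size=4), f)

# A data user loads the references first, then fetches a single chunk.
with open(manifest_path) as f:
    loaded = json.load(f)
last_chunk = read_chunk(loaded, "2")
```

Because the original file is never rewritten, each chunk can be fetched independently and in parallel once the manifest is available.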
Some possible differences:
VirtualiZarr targets array data, and maps all data to the Zarr data model (a hierarchy of multidimensional chunked arrays). It seems dataplug is more general but produces less structured output (as Kerchunk could in theory)?
VirtualiZarr can assemble references from a large number of files into one big cloud-optimized "virtual datacube". It's unclear to me if dataplug tries to do that.
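As a rough illustration of what I mean by a "virtual datacube", the sketch below merges per-file chunk references into one manifest by prefixing each chunk key with the file's position along a new leading axis. The function name, the file names, and the dotted key layout are assumptions made for this example, not VirtualiZarr's actual implementation.

```python
def concat_manifests(per_file_manifests):
    """Combine per-file references along a new leading axis.

    Each input maps a chunk key to a (path, offset, length) reference.
    The output keys gain a leading index identifying the source file.
    """
    combined = {}
    for file_index, manifest in enumerate(per_file_manifests):
        for key, ref in manifest.items():
            combined[f"{file_index}.{key}"] = ref
    return combined


# Two hypothetical files, each holding two chunks of the same variable.
day1 = {
    "0": {"path": "day1.nc", "offset": 0, "length": 100},
    "1": {"path": "day1.nc", "offset": 100, "length": 100},
}
day2 = {
    "0": {"path": "day2.nc", "offset": 0, "length": 100},
    "1": {"path": "day2.nc", "offset": 100, "length": 100},
}

cube = concat_manifests([day1, day2])
```

The combined manifest addresses chunks from both files through a single key space, so a reader sees one array with an extra leading dimension.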
Hi, I develop VirtualiZarr, a package that seems to have very similar goals to dataplug. I found out about this library via your SciPy talk announcement - I also have a SciPy talk about VirtualiZarr!
I'm curious if my assessment is correct, and if so whether there is any opportunity to join forces 😀