I want to rechunk a Zarr group in a local Icechunk store into another group in the same store, changing the chunking pattern. cubed works really well compared to other frameworks, but I am running into an issue that causes data corruption in the rechunked array. These are the package versions:
Package Version Editable project location
----------------------- ------------------ ------------------------------------------
affine 2.4.0
aiostream 0.7.1
allpairspy 2.5.1
annotated-types 0.7.0
anywidget 0.9.21
approval-utilities 17.4.1
approvaltests 17.4.1
array-api-compat 1.14.0
arro3-compute 0.8.0
arro3-core 0.8.0
arro3-io 0.8.0
astropy 7.2.0
astropy-iers-data 0.2026.4.1.15.5.49
asttokens 3.0.1
attrs 26.1.0
beautifulsoup4 4.14.3
cdshealpix 0.8.1
certifi 2026.2.25
cftime 1.6.5
charset-normalizer 3.4.7
click 8.3.1
cligj 0.7.2
cloudpickle 3.1.2
comm 0.2.3
contourpy 1.3.3
coverage 7.13.5
cubed 0.26.0
cubed-xarray 0.0.9
cycler 0.12.1
dask 2026.3.0
debugpy 1.8.20
decorator 5.2.1
distributed 2026.3.0
donfig 0.8.1.post1
empty-files 0.0.9
executing 2.2.1
fibgrid 0.0.7
fonttools 4.62.1
fsspec 2026.3.0
geoarrow-rust-core 0.6.1
google-crc32c 1.8.0
graphviz 0.21
h3ronpy 0.22.0
healpix-geo 0.1.1
icechunk 1.1.21
idna 3.11
iniconfig 2.3.0
ipykernel 7.2.0
ipython 9.12.0
ipython-pygments-lexers 1.1.1
ipywidgets 8.1.8
jedi 0.19.2
jinja2 3.1.6
jupyter-client 8.8.0
jupyter-core 5.9.1
jupyterlab-widgets 3.0.16
kiwisolver 1.5.0
llvmlite 0.47.0
locket 1.0.0
lonboard 0.16.0
markdown-it-py 4.0.0
markupsafe 3.0.3
matplotlib 3.10.8
matplotlib-inline 0.2.1
mdurl 0.1.2
mock 5.2.0
msgpack 1.1.2
mypy-extensions 1.1.0
mystmd 1.8.3
ndindex 1.10.1
nest-asyncio 1.6.0
netcdf4 1.7.4
networkx 3.6.1
nodeenv 1.9.1
numba 0.65.0
numcodecs 0.16.5
numpy 2.4.4
packaging 26.0
pandas 3.0.2
parso 0.8.6
partd 1.4.2
pexpect 4.9.0
pillow 12.2.0
platformdirs 4.2.2
pluggy 1.6.0
pooch 1.9.0
prompt-toolkit 3.0.52
psutil 7.2.2
psygnal 0.15.1
ptyprocess 0.7.0
pure-eval 0.2.3
pydantic 2.12.5
pydantic-core 2.41.5
pydot 4.0.1
pyerfa 2.0.1.5
pygeogrids 0.5.3
pygments 2.20.0
pykdtree 1.4.3
pyparsing 3.3.2
pyperclip 1.11.0
pyproj 3.7.2
pytest 9.0.2
pytest-cov 7.1.0
python-dateutil 2.9.0.post0
pyyaml 6.0.3
pyzmq 27.1.0
rasterio 1.5.0
requests 2.33.1
rich 15.0.0
rioxarray 0.22.0
ruff 0.15.9
scipy 1.17.1
seaborn 0.13.2
six 1.17.0
sortedcontainers 2.4.0
soupsieve 2.8.3
stack-data 0.6.3
tblib 3.2.2
tenacity 9.1.4
testfixtures 11.0.0
toolz 1.1.0
tornado 6.5.5
tqdm 4.67.3
traitlets 5.14.3
ty 0.0.27
typing-extensions 4.15.0
typing-inspection 0.4.2
urllib3 2.6.3
wcwidth 0.6.0
widgetsnbextension 4.0.15
xarray 2026.2.0
xdggs 0.6.0
zarr 3.1.6
zict 3.0.0
And here is an MWE:
import zarr
import numpy as np
import xarray as xr
import icechunk as ic
from pathlib import Path
import matplotlib.pyplot as plt
from cubed.diagnostics import ProgressBar
from cubed.icechunk import store_icechunk

STORE_PATH = "/mnt/data/mwe_cubed"
STORE_BRANCH = "main"
ARRAY_SHAPE = (10, 1000, 1000)
ARRAY_DTYPE = "float32"
CHUNK_HSHAPE = (1, 100, 100)
SHARD_HSHAPE = (1, 200, 200)
CHUNK_VSHAPE = (10, 10, 10)
SHARD_VSHAPE = (10, 20, 20)
HORIZONTAL = "H"
VERTICAL = "V"


def initialise_data(orientation):
    store = ic.local_filesystem_storage(STORE_PATH)
    if not ic.Repository.exists(store):
        repo = ic.Repository.create(store)
    else:
        repo = ic.Repository.open(store)
    session = repo.writable_session(STORE_BRANCH)
    root = zarr.open_group(store=session.store, path=orientation, mode="w", zarr_format=3)
    chunk_shape = CHUNK_HSHAPE if orientation == HORIZONTAL else CHUNK_VSHAPE
    shard_shape = SHARD_HSHAPE if orientation == HORIZONTAL else SHARD_VSHAPE
    x_arr = root.create_array(
        name="x",
        shape=ARRAY_SHAPE[1],
        chunks=ARRAY_SHAPE[1],
        dtype="int64",
        dimension_names=["x"],
    )
    y_arr = root.create_array(
        name="y",
        shape=ARRAY_SHAPE[2],
        chunks=ARRAY_SHAPE[2],
        dtype="int64",
        dimension_names=["y"],
    )
    z_arr = root.create_array(
        name="z",
        shape=ARRAY_SHAPE[0],
        chunks=ARRAY_SHAPE[0],
        dtype="int64",
        dimension_names=["z"],
    )
    data_arr = root.create_array(
        name="data",
        shape=ARRAY_SHAPE,
        dtype=ARRAY_DTYPE,
        shards=shard_shape,
        chunks=chunk_shape,
        dimension_names=["z", "x", "y"],
    )
    x_arr[:] = np.arange(ARRAY_SHAPE[1])
    y_arr[:] = np.arange(ARRAY_SHAPE[2])
    z_arr[:] = np.arange(ARRAY_SHAPE[0])
    if orientation == HORIZONTAL:
        data_arr[:] = np.random.random(ARRAY_SHAPE)
    else:
        data_arr[:] = np.zeros(ARRAY_SHAPE, dtype=ARRAY_DTYPE)
    session.commit(f"created {orientation} data")


def rechunk_data():
    store = ic.local_filesystem_storage(STORE_PATH)
    repo = ic.Repository.open(store)
    read_session = repo.readonly_session("main")
    hds = xr.open_zarr(store=read_session.store, group=HORIZONTAL, chunked_array_type="cubed")
    vds = hds.chunk({"z": 10, "x": 10, "y": 10}, chunked_array_type="cubed")
    write_session = repo.writable_session("main")
    fork = write_session.fork()
    zarray = zarr.open_array(fork.store, path=f"{VERTICAL}/data")
    with ProgressBar():
        remote_session = store_icechunk(
            sources=[vds["data"].data],
            targets=[zarray],
        )
    write_session.merge(remote_session)
    print(write_session.commit("rechunked data"))


def plot_data(z):
    store = ic.local_filesystem_storage(STORE_PATH)
    repo = ic.Repository.open(store)
    read_session = repo.readonly_session("main")
    dt = xr.open_datatree(read_session.store, engine="zarr", chunked_array_type="cubed")
    himg = dt[HORIZONTAL].to_dataset().isel(z=z)["data"]
    vimg = dt[VERTICAL].to_dataset().isel(z=z)["data"]
    plt.figure(figsize=(10, 5))
    plt.subplot(121)
    plt.title("Original")
    plt.imshow(himg)
    plt.subplot(122)
    plt.title("Rechunked")
    plt.imshow(vimg)
    plt.show()


def main():
    Path(STORE_PATH).mkdir(exist_ok=True, parents=True)
    initialise_data(HORIZONTAL)
    initialise_data(VERTICAL)
    rechunk_data()
    plot_data(5)


if __name__ == "__main__":
    main()
The output image shows a comparison between expected (original, horizontal) and actual (rechunked, vertical) data:
The size of the blobs matches the chunk size. This is caused by parallel processes writing chunks to the same shard. The bigger you make the shard size of the target array, the bigger the blobs get.
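To illustrate the misalignment: with the MWE's target chunks of (10, 10, 10) inside shards of (10, 20, 20), each shard contains four chunks, so four independent writer tasks rewrite the same shard object and the last writer wins. A minimal sketch of that chunk-to-shard mapping (shard_of is a hypothetical helper for this illustration, not cubed or zarr API):

```python
# Target chunking from the MWE.
CHUNK = (10, 10, 10)
SHARD = (10, 20, 20)


def shard_of(chunk_idx):
    # Map a chunk index to the index of the shard that contains it.
    return tuple(ci * c // s for ci, c, s in zip(chunk_idx, CHUNK, SHARD))


# Four distinct chunks all fall into shard (0, 0, 0), so four parallel
# tasks writing one chunk each all touch the same shard.
chunks = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
assert {shard_of(c) for c in chunks} == {(0, 0, 0)}
```

This is consistent with the observation above: the larger the shard relative to the chunk, the more independent tasks collide on one shard, and the bigger the corrupted blobs get.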
You get a warning that concurrent commits/writes are not safe with Icechunk and local storage. But isn't it possible to catch an ic.ConflictError, as described here: https://icechunk.io/en/latest/understanding/parallel/#uncooperative-distributed-writes, or to block conflicting writes beforehand? (I developed this quick-and-dirty package before starting to use Icechunk: https://github.com/eodcgmbh/zarr-locker.)
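For reference, the uncooperative-writes pattern in the linked docs boils down to a commit/rebase retry loop. Here is a store-free sketch of that loop; ConflictError, commit, and rebase are stand-ins for the icechunk equivalents (icechunk.ConflictError, session.commit, session.rebase), so treat the exact names and signatures as assumptions to be checked against the docs:

```python
class ConflictError(Exception):
    """Stand-in for icechunk.ConflictError."""


def commit_with_rebase(commit, rebase, message, max_attempts=5):
    # Try to commit; on a conflict, rebase onto the new branch tip and retry.
    for _ in range(max_attempts):
        try:
            return commit(message)
        except ConflictError:
            rebase()
    raise RuntimeError("could not commit after repeated rebases")


# Toy demo: the first commit conflicts, the rebase resolves it.
state = {"conflict": True}


def commit(msg):
    if state["conflict"]:
        raise ConflictError(msg)
    return "snapshot-id"


def rebase():
    state["conflict"] = False


assert commit_with_rebase(commit, rebase, "rechunked data") == "snapshot-id"
```

Whether this helps here is unclear to me, since the corruption above happens inside a single shard during parallel chunk writes, before any commit-level conflict detection could kick in.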
Please let me know if you need more information from my side. I tested several other rechunking frameworks, e.g. directly with dask, but no rechunker other than cubed could handle the rechunking for my dataset in time, or at all. And the diagnostics are awesome. Many thanks for developing and maintaining this project! 👏