Local rechunking with Icechunk store causes data corruption #896

@claxn

Description


I want to rechunk a Zarr group in a local Icechunk store into another group in the same store by changing the chunking pattern. Cubed works really well compared to other frameworks, but I am running into an issue that causes data corruption in the rechunked array. These are the package versions:

Package                 Version            Editable project location
----------------------- ------------------ ------------------------------------------
affine                  2.4.0
aiostream               0.7.1
allpairspy              2.5.1
annotated-types         0.7.0
anywidget               0.9.21
approval-utilities      17.4.1
approvaltests           17.4.1
array-api-compat        1.14.0
arro3-compute           0.8.0
arro3-core              0.8.0
arro3-io                0.8.0
astropy                 7.2.0
astropy-iers-data       0.2026.4.1.15.5.49
asttokens               3.0.1
attrs                   26.1.0
beautifulsoup4          4.14.3
cdshealpix              0.8.1
certifi                 2026.2.25
cftime                  1.6.5
charset-normalizer      3.4.7
click                   8.3.1
cligj                   0.7.2
cloudpickle             3.1.2
comm                    0.2.3
contourpy               1.3.3
coverage                7.13.5
cubed                   0.26.0
cubed-xarray            0.0.9
cycler                  0.12.1
dask                    2026.3.0
debugpy                 1.8.20
decorator               5.2.1
distributed             2026.3.0
donfig                  0.8.1.post1
empty-files             0.0.9
executing               2.2.1
fibgrid                 0.0.7
fonttools               4.62.1
fsspec                  2026.3.0
geoarrow-rust-core      0.6.1
google-crc32c           1.8.0
graphviz                0.21
h3ronpy                 0.22.0
healpix-geo             0.1.1
icechunk                1.1.21
idna                    3.11
iniconfig               2.3.0
ipykernel               7.2.0
ipython                 9.12.0
ipython-pygments-lexers 1.1.1
ipywidgets              8.1.8
jedi                    0.19.2
jinja2                  3.1.6
jupyter-client          8.8.0
jupyter-core            5.9.1
jupyterlab-widgets      3.0.16
kiwisolver              1.5.0
llvmlite                0.47.0
locket                  1.0.0
lonboard                0.16.0
markdown-it-py          4.0.0
markupsafe              3.0.3
matplotlib              3.10.8
matplotlib-inline       0.2.1
mdurl                   0.1.2
mock                    5.2.0
msgpack                 1.1.2
mypy-extensions         1.1.0
mystmd                  1.8.3
ndindex                 1.10.1
nest-asyncio            1.6.0
netcdf4                 1.7.4
networkx                3.6.1
nodeenv                 1.9.1
numba                   0.65.0
numcodecs               0.16.5
numpy                   2.4.4
packaging               26.0
pandas                  3.0.2
parso                   0.8.6
partd                   1.4.2
pexpect                 4.9.0
pillow                  12.2.0
platformdirs            4.2.2
pluggy                  1.6.0
pooch                   1.9.0
prompt-toolkit          3.0.52
psutil                  7.2.2
psygnal                 0.15.1
ptyprocess              0.7.0
pure-eval               0.2.3
pydantic                2.12.5
pydantic-core           2.41.5
pydot                   4.0.1
pyerfa                  2.0.1.5
pygeogrids              0.5.3
pygments                2.20.0
pykdtree                1.4.3
pyparsing               3.3.2
pyperclip               1.11.0
pyproj                  3.7.2
pytest                  9.0.2
pytest-cov              7.1.0
python-dateutil         2.9.0.post0
pyyaml                  6.0.3
pyzmq                   27.1.0
rasterio                1.5.0
requests                2.33.1
rich                    15.0.0
rioxarray               0.22.0
ruff                    0.15.9
scipy                   1.17.1
seaborn                 0.13.2
six                     1.17.0
sortedcontainers        2.4.0
soupsieve               2.8.3
stack-data              0.6.3
tblib                   3.2.2
tenacity                9.1.4
testfixtures            11.0.0
toolz                   1.1.0
tornado                 6.5.5
tqdm                    4.67.3
traitlets               5.14.3
ty                      0.0.27
typing-extensions       4.15.0
typing-inspection       0.4.2
urllib3                 2.6.3
wcwidth                 0.6.0
widgetsnbextension      4.0.15
xarray                  2026.2.0
xdggs                   0.6.0
zarr                    3.1.6
zict                    3.0.0

And here is a minimal working example (MWE):

import zarr
import numpy as np
import xarray as xr
import icechunk as ic
from pathlib import Path
import matplotlib.pyplot as plt

from cubed.diagnostics import ProgressBar
from cubed.icechunk import store_icechunk

STORE_PATH = "/mnt/data/mwe_cubed"
STORE_BRANCH = "main"
ARRAY_SHAPE = (10, 1000, 1000)
ARRAY_DTYPE = "float32"
CHUNK_HSHAPE = (1, 100, 100)
SHARD_HSHAPE = (1, 200, 200)
CHUNK_VSHAPE = (10, 10, 10)
SHARD_VSHAPE = (10, 20, 20)
HORIZONTAL = "H"
VERTICAL = "V"

def initialise_data(orientation):
    store = ic.local_filesystem_storage(STORE_PATH)
    if not ic.Repository.exists(store):
        repo = ic.Repository.create(store)
    else:
        repo = ic.Repository.open(store)
    session = repo.writable_session(STORE_BRANCH)
    root = zarr.open_group(store=session.store, path=orientation, mode="w", zarr_format=3)

    # Horizontal: thin slabs of (1, 100, 100); vertical: tall pencils of (10, 10, 10)
    chunk_shape = CHUNK_HSHAPE if orientation == HORIZONTAL else CHUNK_VSHAPE
    shard_shape = SHARD_HSHAPE if orientation == HORIZONTAL else SHARD_VSHAPE

    x_arr = root.create_array(name="x",
            shape=ARRAY_SHAPE[1],
            chunks=ARRAY_SHAPE[1],
            dtype="int64",
            dimension_names=["x"],)

    y_arr = root.create_array(name="y",
            shape=ARRAY_SHAPE[2],
            chunks=ARRAY_SHAPE[2],
            dtype="int64",
            dimension_names=["y"],)
    
    z_arr = root.create_array(name="z",
            shape=ARRAY_SHAPE[0],
            chunks=ARRAY_SHAPE[0],
            dtype="int64",
            dimension_names=["z"],)

    data_arr = root.create_array(
        name="data",
        shape=ARRAY_SHAPE,
        dtype=ARRAY_DTYPE,
        shards=shard_shape,
        chunks=chunk_shape,
        dimension_names=["z", "x", "y"]
    )

    x_arr[:] = np.arange(ARRAY_SHAPE[1])
    y_arr[:] = np.arange(ARRAY_SHAPE[2])
    z_arr[:] = np.arange(ARRAY_SHAPE[0])
    if orientation == HORIZONTAL:
        data_arr[:] = np.random.random(ARRAY_SHAPE)
    else:
        data_arr[:] = np.zeros(ARRAY_SHAPE, dtype=ARRAY_DTYPE)

    session.commit(f"created {orientation} data")

def rechunk_data():
    store = ic.local_filesystem_storage(STORE_PATH)
    repo = ic.Repository.open(store)
    read_session = repo.readonly_session(STORE_BRANCH)
    hds = xr.open_zarr(store=read_session.store, group=HORIZONTAL, chunked_array_type="cubed")
    # Rechunk from horizontal slabs (1, 100, 100) to vertical pencils (10, 10, 10)
    vds = hds.chunk({"z": 10, "x": 10, "y": 10}, chunked_array_type="cubed")

    write_session = repo.writable_session(STORE_BRANCH)
    # Fork the session so the distributed workers can write, then merge back
    fork = write_session.fork()
    zarray = zarr.open_array(fork.store, path=f"{VERTICAL}/data")
    with ProgressBar():
        remote_session = store_icechunk(
            sources=[vds["data"].data],
            targets=[zarray],
        )
    write_session.merge(remote_session)
    print(write_session.commit("rechunked data"))


def plot_data(z):
    store = ic.local_filesystem_storage(STORE_PATH)
    repo = ic.Repository.open(store)
    read_session = repo.readonly_session("main")
    dt = xr.open_datatree(read_session.store, engine="zarr", chunked_array_type="cubed")

    himg = dt[HORIZONTAL].to_dataset().isel(z=z)["data"]
    vimg = dt[VERTICAL].to_dataset().isel(z=z)["data"]
    
    plt.figure(figsize=(10, 5))
    plt.subplot(121)
    plt.title("Original")
    plt.imshow(himg)
    plt.subplot(122)
    plt.title("Rechunked")
    plt.imshow(vimg)
    plt.show()


def main():
    Path(STORE_PATH).mkdir(exist_ok=True, parents=True)
    
    initialise_data(HORIZONTAL)
    initialise_data(VERTICAL)

    rechunk_data()

    plot_data(5)


if __name__ == "__main__":
    main()

The output image shows a comparison between expected (original, horizontal) and actual (rechunked, vertical) data:

[Image: side-by-side comparison of the original and rechunked array]

The size of the blobs matches the chunk size. This is caused by parallel processes writing different chunks into the same shard. The bigger you make the shard size of the target array, the bigger the blobs get.
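To make the collision concrete, here is a small standalone sketch (not part of the MWE) using the MWE's target chunk and shard constants. It counts how many distinct chunk-writing tasks land in each target shard; with chunks of (10, 10, 10) inside shards of (10, 20, 20), four chunks map to every shard, so up to four concurrent writers race on the same shard object:

```python
from collections import defaultdict

ARRAY_SHAPE = (10, 1000, 1000)
CHUNK_VSHAPE = (10, 10, 10)   # target chunk shape
SHARD_VSHAPE = (10, 20, 20)   # target shard shape

# Number of chunks along each axis, and chunks per shard along each axis
n_chunks = tuple(s // c for s, c in zip(ARRAY_SHAPE, CHUNK_VSHAPE))
per_shard = tuple(sh // c for sh, c in zip(SHARD_VSHAPE, CHUNK_VSHAPE))

writers_per_shard = defaultdict(set)
for cz in range(n_chunks[0]):
    for cx in range(n_chunks[1]):
        for cy in range(n_chunks[2]):
            # Shard index that this chunk's write falls into
            shard = (cz // per_shard[0], cx // per_shard[1], cy // per_shard[2])
            writers_per_shard[shard].add((cz, cx, cy))

counts = {len(chunks) for chunks in writers_per_shard.values()}
print(counts)  # {4}: every shard is written by 4 independent chunk writes
```

If the chunk and shard shapes were equal, each shard would have exactly one writer and the race would disappear, which matches the observation that larger shards produce larger blobs.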

You get a warning that concurrent commits/writes are not safe with Icechunk and local storage. But couldn't cubed catch an ic.ConflictError and retry, as described here: https://icechunk.io/en/latest/understanding/parallel/#uncooperative-distributed-writes, or block conflicting writes beforehand? (I developed this quick-and-dirty package before starting to use Icechunk: https://github.com/eodcgmbh/zarr-locker.)
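For illustration, the catch-and-retry idea could be sketched as a generic helper in the spirit of the linked Icechunk docs. Everything below is hypothetical: the helper name, the FakeConflict stub, and the callables are stand-ins; in real code `commit` and `rebase` would wrap the session's commit and the rebase mechanism the Icechunk docs describe:

```python
def commit_with_retries(commit, rebase, conflict_exc, max_attempts=5):
    """Attempt a commit; on a conflict, rebase onto the latest snapshot and retry."""
    for attempt in range(max_attempts):
        try:
            return commit()  # returns the new snapshot id on success
        except conflict_exc:
            if attempt == max_attempts - 1:
                raise  # give up after max_attempts conflicting commits
            rebase()  # bring the session up to date before retrying


# Self-contained demo with stubs standing in for an Icechunk session:
class FakeConflict(Exception):
    pass

state = {"rebased": 0}

def commit():
    # Fail until we have rebased twice, then succeed.
    if state["rebased"] < 2:
        raise FakeConflict()
    return "snapshot-id"

def rebase():
    state["rebased"] += 1

result = commit_with_retries(commit, rebase, FakeConflict)
print(result)  # snapshot-id
```

This only resolves commit-level conflicts, though; it would not by itself fix two tasks overwriting each other's chunks inside one shard before any commit happens.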

Please let me know if you need more information from my side. I tested several other rechunking frameworks (e.g. directly with dask), but no rechunker other than cubed could handle the rechunking for my dataset in time, or at all. And the diagnostics are awesome. Many thanks for developing and maintaining this project! 👏
