Unexpected behavior with MultiZarrToZarr with partial chunks. #400

Description

@sharkinsspatial

While experimenting with kerchunking some ICESat-2 ATL08 data, I noticed an issue: using MultiZarrToZarr with non-dimension coordinates that have partial chunks results in empty values for those variables in the output kerchunk index.

A minimal example

import xarray as xr
import numpy as np
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Set up fake netCDF files with latitude as a non-dimension coordinate.
time_1_3 = xr.Dataset(
    {
        "delta_time": [12, 13, 14]
    },
    coords={
        "latitude": ("delta_time", [1, 2, 3]),
    })
time_15_17 = xr.Dataset(
    {
        "delta_time": [15, 16, 17]
    },
    coords={
        "latitude": ("delta_time", [4, 5, 6]),
    })
time_18_22 = xr.Dataset(
    {
        "delta_time": [18, 19, 20, 21, 22]
    },
    coords={
        "latitude": ("delta_time", [7, 8, 9, 10, 11]),
    })
time_1_3

[Screenshot: notebook output of time_1_3]

# Create netCDFs with chunksize aligned to data size
chunksize = 3
encoding = {"latitude": {"chunksizes": (chunksize,)}, "delta_time": {"chunksizes": (chunksize,)}}
time_1_3.to_netcdf('time_1_3.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

def create_reference(files: list[str]):
    single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=0).translate() for filepath in files]
    mzz = MultiZarrToZarr(
        single_jsons,
        concat_dims=["delta_time"],
    )
    combined_test_json = mzz.translate()

    combined_test = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={
            "storage_options": {
                "fo": combined_test_json,
                },
            "consolidated": False,
        }
    )
    return combined_test

# This works as expected
combined_test_3 = create_reference(["time_1_3.nc", "time_15_17.nc"])
combined_test_3.latitude.data

[Screenshot: notebook output of combined_test_3.latitude.data]

# Create netCDFs with a larger chunksize resulting in partially filled chunks not aligned to data size
chunksize = 10
encoding = {"latitude": {"chunksizes": (chunksize,)}, "delta_time": {"chunksizes": (chunksize,)}}
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_18_22.to_netcdf('time_18_22.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

# This results in empty values for the resulting non-dimension coordinate variable.
combined_test_10 = create_reference(["time_15_17.nc", "time_18_22.nc"])
combined_test_10.latitude.data

[Screenshot: notebook output of combined_test_10.latitude.data]

This seems potentially related to some of the discussion in #305, which also describes the case of data that is not aligned with the chunk size.

If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).

I may be misunderstanding the MultiZarrToZarr logic in this case where we have regularly sized, partially filled chunks. Is it possible to have a non-dimension variable concatenated in a linear fashion in this situation?
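
To make the question concrete, here is a minimal sketch of the chunk arithmetic involved. It uses only the standard library and illustrates the regular chunk grid that Zarr arrays use, where chunk k always covers positions k * chunk to (k + 1) * chunk. It is not a description of what MultiZarrToZarr does internally; the function names `relabelled_chunks` and `is_aligned` are hypothetical helpers made up for this illustration.

```python
from math import ceil


def relabelled_chunks(lengths, chunk):
    """Hypothetical helper: give each input chunk the next free chunk key.

    For every chunk of every input, record where that key sits on a regular
    chunk grid (grid_start) and where the data would need to start for a
    plain linear concatenation (wanted_start).
    """
    rows = []
    next_key = 0   # next unused chunk key in the combined array
    offset = 0     # start position of the current input in the combined array
    for n in lengths:
        for j in range(ceil(n / chunk)):
            rows.append({
                "key": next_key,
                "grid_start": next_key * chunk,        # fixed by the regular grid
                "wanted_start": offset + j * chunk,    # needed for linear concat
                "filled": min(chunk, n - j * chunk),   # valid elements in this chunk
            })
            next_key += 1
        offset += n
    return rows


def is_aligned(lengths, chunk):
    """True if simply renumbering chunk keys yields a linear concatenation."""
    return all(r["grid_start"] == r["wanted_start"]
               for r in relabelled_chunks(lengths, chunk))


# Lengths and chunk sizes taken from the two cases in this issue.
print(is_aligned([3, 3], 3))    # chunksize 3, files of length 3 and 3
print(is_aligned([3, 5], 10))   # chunksize 10, files of length 3 and 5
for row in relabelled_chunks([3, 5], 10):
    print(row)
```

In the chunksize 10 case, the second file's only chunk would need to start at position 3, but the second slot on a regular grid starts at position 10, so renumbering chunk keys alone cannot produce a linear concatenation of partially filled chunks.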
