
Update AWS access instructions for ZARR version 3.0.#100

Merged
adriaat merged 7 commits into main from fix_docs
Jun 16, 2025

Conversation

@simonpf
Contributor

@simonpf simonpf commented Jun 9, 2025

This PR updates the instructions for accessing CCIC data from AWS to ensure compatibility with Zarr version 3.0 and above. The previous approach no longer worked due to changes in the Zarr API.

@adriaat
Contributor

adriaat commented Jun 11, 2025

Two notes:

  • import ccic can't be removed. It's necessary to register the log_bins codec
  • ds = xr.open_zarr('s3://chalmerscloudiceclimatology/record/gridsat/2020/ccic_gridsat_202001010000.zarr') already addresses the changes introduced with Zarr 3. As far as I know, there is no plan that consolidated=True will stop being the default.

I see two options:

  1. Keep the (more verbose) approach you suggest (and keep the import ccic line)
  2. Reformat to indicate ds = xr.open_zarr('s3://chalmerscloudiceclimatology/record/gridsat/2020/ccic_gridsat_202001010000.zarr'), without instantiating an s3fs filesystem.
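A minimal sketch of option 2, passing the s3:// key directly and letting xarray/zarr build the store. The helper name is hypothetical, introduced only for illustration:

```python
def ccic_gridsat_url(year: int, timestamp: str) -> str:
    """Build the s3:// key for a CCIC GridSat granule (layout as in the docs)."""
    return (
        "s3://chalmerscloudiceclimatology/record/gridsat/"
        f"{year}/ccic_gridsat_{timestamp}.zarr"
    )

# With xarray, zarr >= 3, and s3fs installed, the key can be passed directly:
#   import xarray as xr
#   ds = xr.open_zarr(ccic_gridsat_url(2020, "202001010000"))
```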

# Previous approach (broken with Zarr >= 3):
ds = xr.open_zarr(s3.get_mapper('chalmerscloudiceclimatology/record/gridsat/2020/ccic_gridsat_202001010000.zarr'))
# Approach proposed in this PR:
aws_file_path = "chalmerscloudiceclimatology/record/gridsat/2021/ccic_gridsat_202101010000.zarr"
store = zarr.storage.FsspecStore(s3, path=aws_file_path)
ds = xr.open_zarr(store, consolidated=True)
Contributor

@adriaat adriaat Jun 11, 2025


ds = xr.open_zarr('s3://chalmerscloudiceclimatology/record/gridsat/2021/ccic_gridsat_202101010000.zarr') will simply work with a Zarr 3 installation and xarray (xarray will try to guess the store from the S3 key)

@adriaat
Contributor

adriaat commented Jun 11, 2025

Also note that this PR fails the GitHub test_and_install action. I am looking into it.

Update:

Zarr 3 requires at least Python 3.11. We use 3.10 in the environment YAML files.

Python 3.11 is not compatible with the PyTorch packages we specify in the YAML files.

I opened issue #101 for this.

Related: PR #102
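The version constraint above can be made explicit with a small guard. This is a sketch, not part of the PR; only the Zarr 3 requirement of Python >= 3.11 is taken from the discussion:

```python
import sys

def supports_zarr3(version_info=sys.version_info) -> bool:
    """Zarr 3 requires Python >= 3.11; the environment YAML files pin 3.10."""
    return tuple(version_info[:2]) >= (3, 11)
```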

@adriaat adriaat mentioned this pull request Jun 13, 2025
@simonpf
Contributor Author

simonpf commented Jun 14, 2025

Thanks a lot for digging into this. Your suggestion is a lot cleaner.

To make it work I still had to

  • Add s3fs to the dependencies
  • Set the storage options to 'anon'
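Putting both fixes together, the working call looks roughly like this. A sketch assuming the public bucket allows anonymous reads; the helper function is hypothetical:

```python
def ccic_open_kwargs() -> dict:
    """Extra keyword arguments xr.open_zarr needs for the public CCIC bucket."""
    # s3fs must be installed for the s3:// protocol to resolve, and
    # anonymous access must be requested explicitly.
    return {"storage_options": {"anon": True}}

# Usage (requires xarray, zarr >= 3, and s3fs):
#   import xarray as xr
#   ds = xr.open_zarr(
#       "s3://chalmerscloudiceclimatology/record/gridsat/2020/"
#       "ccic_gridsat_202001010000.zarr",
#       **ccic_open_kwargs(),
#   )
```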

@adriaat
Contributor

adriaat commented Jun 16, 2025

Ah, yes, of course, s3fs should be a dependency if you read from S3 buckets. I assumed the user would already have it on their end; when I changed setup.py I had in mind the more local use we make at Chalmers, where we have the data offline.
Good catch that anon needs to be passed in the storage options. Perhaps something in my ~/.aws directory was used when I tested it.

For future reference: I think something has changed in how mamba interacts with our environment YAML files. If mamba env create -f ccic_cpu.yml is used, resolving the dependencies is painfully slow (when debugging locally, the terminal even becomes unresponsive); this explains why the GH action install_and_test takes about 50 minutes. It used to take (less than) a handful of minutes.

Instead, if conda env create -f ccic_cpu.yml is used, the dependencies are resolved quickly and the GH action completes in (less than) a handful of minutes. I tried to find the cause; I think mamba tries to look for a CUDA driver (and install it) even when we use the env file for CPU-only.

@adriaat adriaat merged commit 224c8fc into main Jun 16, 2025
1 check passed