Skip to content

Add zarr support to dataloader#121

Open
hanaol wants to merge 1 commit intomainfrom
hanaol/dataloader-zarr
Open

Add zarr support to dataloader#121
hanaol wants to merge 1 commit intomainfrom
hanaol/dataloader-zarr

Conversation

@hanaol
Copy link
Copy Markdown
Collaborator

@hanaol hanaol commented Apr 13, 2026

Summary

This PR extends the RhoData dataloader to support .zarr format in addition to the existing CHGCAR format, without requiring any changes to the config beyond pointing root at a mp_filelist.txt whose parent directory includes data/*.zarr and label/*.zarr. The format is auto-detected at runtime based on which file extension (.CHGCAR or .zarr) is present in the data/ directory.

Files changed

  • dataloader/utils.py: Added load_zarr() which reads charge_density_total from a local .zarr store and normalizes by lattice volume (from the stored structure JSON attribute), matching the CHGCAR normalization convention. Added a fmt parameter to load_numpy_rho to dispatch to the correct loader.
  • dataloader/dataset.py: RhoData.__init__ now auto-detects format by checking whether the first filelist entry resolves to a .zarr or .CHGCAR file. Raises a clear error if neither is found.

@hanaol
Copy link
Copy Markdown
Collaborator Author

hanaol commented Apr 13, 2026

@claude review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Claude finished @hanaol's task in 1m 8s —— View job


PR Review: Add zarr support to dataloader

Tasks

  • Gather context
  • Read changed files
  • Review implementation
  • Post feedback

Overall, the implementation is clean and minimal. A few bugs and gaps worth addressing before merging.

Bugs

1. IndexError on empty filelist (dataset.py:114)

member_list[0] will raise IndexError if mp_filelist.txt is empty. Should guard:

if not member_list:
    raise ValueError(f"Filelist at {datapath} is empty.")
first = member_list[0]

Fix this →

2. Missing structure attribute raises unhelpful KeyError (utils.py:48)

If a zarr store was written without the structure metadata, z.attrs["structure"] raises KeyError with no context. Consider:

if "structure" not in z.attrs:
    raise KeyError(f"'structure' attribute missing from zarr store at {path}")
volume = json.loads(z.attrs["structure"])["lattice"]["volume"]

Fix this →

Inconsistencies

3. zarr.open vs zarr.open_group (utils.py:46)

The rest of the codebase (e.g., test_mp_zarr_s3_data.py, mp_zarr_s3_data.py) consistently uses zarr.open_group. zarr.open on a Group returns a Group, so this works, but it's inconsistent. Recommend:

z = zarr.open_group(str(path), mode="r")

4. dtype=np.float64 in load_zarr (utils.py:47)

The zarr stores are written as float32 (as seen in the existing test fixtures), but load_zarr casts to float64. This is then re-cast by torch.tensor(..., dtype=dtype_map[precision]) downstream, so it's harmless but wasteful. Consider matching the existing convention (leave at float32 or omit the explicit cast).

Missing tests

5. No tests for load_zarr or format auto-detection

There are tests for ZarrS3Reader in tests/electrai/dataloader/test_mp_zarr_s3_data.py, but nothing covering:

  • load_zarr() in utils.py
  • Format auto-detection (self.fmt) in RhoData.__init__
  • RhoData.__getitem__ with zarr format

Given that the zarr fixture infrastructure already exists in the test file, adding a couple of tests for load_zarr and RhoData would be straightforward and would prevent regressions.

Minor nit

6. isinstance(datapath, str) check (dataset.py:103)

This pre-existing check silently rejects Path objects passed as datapath, falling through to raise ValueError("No filename found."). Not introduced by this PR, but worth noting if you're touching this area.


|

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant