Skip to content

Replace hasattr(data, "__len__") duck typing with ndim-based collection detection#1414

Open
h-mayorquin wants to merge 5 commits intodevfrom
remove_duck_typing_for_array_detection
Open

Replace hasattr(data, "__len__") duck typing with ndim-based collection detection#1414
h-mayorquin wants to merge 5 commits intodevfrom
remove_duck_typing_for_array_detection

Conversation

@h-mayorquin
Copy link
Contributor

@h-mayorquin h-mayorquin commented Mar 3, 2026

Motivated by hdmf-dev/hdmf-zarr#325 (zarr v2 to v3 migration).

hdmf uses hasattr(data, "__len__") across hte library (utils.py, h5tools.py, validator.py) to decide whether data is a collection or a scalar. This has worked so far because numpy implements __len__ and other array libraries (h5py, zarr v2) followed that convention. But __len__ is not part of any array specification (see the array API standard object definition for the required attributes and methods), so we cannot rely on future array libraries continuing to provide it. The check is also implicit about what it is actually trying to detect. The zarr v2 to v3 migration (hdmf-dev/hdmf-zarr#325) exposed this, as zarr v3 arrays no longer implement __len__. The __len__ heuristic also has a bug in the current release: 0-d ndarrays have __len__ defined but len() raises TypeError, so passing one to HDF5IO.get_type or write_dataset crashes. This already happens when converting results from array-API-conforming libraries (which return 0-d arrays from reductions like mean and sum) to numpy via np.asarray(). A regression test for this is included.

This PR centralizes the "is this a collection?" logic into a single helper (_is_collection) that is both more general and more explicit: check ndim > 0 first (which the array API standard guarantees for any conforming array), then fall back to __len__ for plain Python containers. A companion _get_length helper safely gets first-dimension length via shape[0] or len(). This replaces all scattered checks with one clear, documented policy that covers both legacy types and array API standard conforming libraries, and shoudl eliminate the need for monkey-patching in hdmf-zarr.

The tests that are failing are because pynwb has not updated its ceiling on hdmf I think

Related to #1235

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@codecov
Copy link

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 92.50000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.87%. Comparing base (ef5b390) to head (3bf5c8f).

Files with missing lines Patch % Lines
src/hdmf/build/objectmapper.py 75.00% 0 Missing and 1 partial ⚠️
src/hdmf/common/table.py 83.33% 0 Missing and 1 partial ⚠️
src/hdmf/validate/validator.py 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1414      +/-   ##
==========================================
+ Coverage   92.85%   92.87%   +0.02%     
==========================================
  Files          41       41              
  Lines        9989    10007      +18     
  Branches     2054     2060       +6     
==========================================
+ Hits         9275     9294      +19     
+ Misses        436      435       -1     
  Partials      278      278              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

h-mayorquin and others added 3 commits March 3, 2026 03:31
The function returns shape[0], not len(). The new name reflects the
actual semantics: the size of the first dimension, which is what the
array API standard exposes via .shape rather than __len__.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After discussion, _get_length is the clearest name: it maps to what
Python programmers understand as "how many items if I iterate this",
which is what every call site needs. The implementation detail of using
shape[0] vs len() belongs in the docstring, not the name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant