Simplify the data type registry and share data-type JSON handling#3

Open
d-v-b-agent wants to merge 7 commits into main from dtype-registry-as-mapping

@d-v-b-agent

Summary

Simplifies zarr-python's data type machinery: the registry becomes a plain Mapping[str, type], and per-data-type JSON (de)serialization is shared on the ZDType base class, removing a large amount of duplicated boilerplate across the built-in data types. Serialized output is preserved: all on-disk metadata is byte-identical.

Net effect: +390 / −2100 lines across the dtype module, with no change to the metadata written to disk.

What changed (commit by commit)

  1. Registry → plain Mapping[str, type]. Replaces the DataTypeRegistry frozen dataclass (which wrapped a dict plus a lazy-load list and six methods) with a plain dict[str, type[ZDType]]. Resolution (match_dtype/match_json) and lifecycle (register_data_type, unregister_data_type, load_data_type_entrypoints) become free functions over the mapping. Also fixes a latent bug: the old _lazy_load() was never called, so data types advertised via the zarr.data_type entry point group were silently never loaded.

  2. Generic dtype-JSON defaults on the ZDType base. Makes to_json/_from_json_v2/_from_json_v3 concrete (not abstract). Parameter-free data types now inherit their V2 and V3 representations and need not implement any data-type JSON methods.

  3–4. Migrate the built-in data types onto shared handling. A NumpyNativeDTypeV2 mixin (V2 name = the NumPy type string) and an ObjectCodecDTypeV2 mixin (V2 = "|O" + object codec id) replace ~50 lines of near-identical _from_json_*/to_json/_check_json_* boilerplate per class for bool/int/uint/float/complex and the string/bytes/time/structured types. The base to_json dispatches to _to_json_v2/_to_json_v3 hooks.

  5. Declarative, uniform aliasing. A single _aliases class var (default empty) declares alternative Zarr V3 names accepted on input; the canonical _zarr_v3_name is always written out. This replaces scattered hardcoded aliases (VariableLengthBytes → "bytes", Struct → "structured", VariableLengthUTF8 → "str"), with the user-facing VLEN_UTF8_ALIAS derived from the type's declared names.

  6. Accept the Zarr V3 spec raw name r<N> on input. RawBytes now reads the core-spec r<N> form (N bits, a positive multiple of 8) in addition to its raw_bytes configuration form, so raw arrays written by other Zarr V3 implementations can be read. Input-only; output is unchanged.

Testing

  • Full test suite passes (6226 passed; the one deselected test is a pre-existing checkout artifact — the clone has no git tags, so version derivation can't find a v* tag).
  • ruff and mypy clean on the dtype module.
  • Every migrated data type verified to produce byte-identical V2/V3 JSON; new tests cover the alias surface and the r<N> form.

A follow-up issue will track recommendations for the data-type extension API (ergonomics for third-party data type authors).

🤖 Generated with Claude Code

d-v-b-agent and others added 6 commits June 26, 2026 15:39
Replace the DataTypeRegistry frozen dataclass (which wrapped a dict in
`contents` plus a `_lazy_load_list` and six methods) with a plain
`dict[str, type[ZDType]]`. Resolution (`match_dtype`/`match_json`) and
lifecycle (`register_data_type`, `unregister_data_type`,
`load_data_type_entrypoints`) are now free functions over the mapping,
each accepting an optional `registry=` so callers can operate on an
isolated dict.

This also fixes a latent bug: the old `_lazy_load()` was never called in
`src/`, so data types advertised via the `zarr.data_type` entry point
group were silently never loaded. `load_data_type_entrypoints()` is now
invoked when the dtype package is imported.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
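
The registry-as-mapping design in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the stand-in `ZDType` and `Int16` classes, the `DEFAULT_REGISTRY` name, and the exact function signatures are assumptions made for the example.

```python
from __future__ import annotations


class ZDType:
    """Stand-in for the real base class; only the V3 name matters here."""

    _zarr_v3_name: str = ""


class Int16(ZDType):
    _zarr_v3_name = "int16"


# Hypothetical module-level default: an ordinary dict keyed by data type name.
DEFAULT_REGISTRY: dict[str, type[ZDType]] = {}


def register_data_type(
    cls: type[ZDType], registry: dict[str, type[ZDType]] | None = None
) -> None:
    # Compare against None explicitly: an empty dict is falsy, so
    # `registry or DEFAULT_REGISTRY` would silently ignore an isolated dict.
    target = DEFAULT_REGISTRY if registry is None else registry
    target[cls._zarr_v3_name] = cls


def unregister_data_type(
    name: str, registry: dict[str, type[ZDType]] | None = None
) -> None:
    target = DEFAULT_REGISTRY if registry is None else registry
    del target[name]


# Operate on an isolated dict without touching the shared default.
isolated: dict[str, type[ZDType]] = {}
register_data_type(Int16, registry=isolated)
```

Because the registry is a plain dict, tests can pass a throwaway mapping through `registry=` instead of patching global state.
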
Make `to_json`, `_from_json_v2`, and `_from_json_v3` concrete on the
ZDType base class instead of abstract. The defaults handle parameter-free
data types: the Zarr V3 form is just the data type name, and the Zarr V2
form is that name plus an optional `object_codec_id` (a new base class
var, default None, for object-backed types like variable-length strings).
The V2 name defaults to the V3 name, which is what custom data types want.

Built-in data types keep their existing NumPy-typestring-based overrides,
so their behavior is unchanged. The win is for custom data types: a
parameter-free dtype now inherits all dtype-JSON handling and no longer
needs to implement `_check_json_v2`, `_from_json_v2`, `_from_json_v3`, or
`to_json`. The custom_dtype example shrinks by ~90 lines accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
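
To illustrate what the concrete defaults buy a custom data type, here is a simplified sketch under stated assumptions. The dict shape of the V2 form, the error messages, and the `TinyInt` example class are hypothetical; the real base class has more responsibilities than shown.

```python
from __future__ import annotations

from typing import ClassVar


class ZDType:
    """Simplified base with concrete dtype-JSON defaults."""

    _zarr_v3_name: ClassVar[str]
    # Only set by object-backed types such as variable-length strings.
    object_codec_id: ClassVar[str | None] = None

    def to_json(self, zarr_format: int) -> str | dict[str, str | None]:
        if zarr_format == 3:
            # V3 form of a parameter-free type: just the name.
            return self._zarr_v3_name
        if zarr_format == 2:
            # Assumed V2 shape: the name plus an optional object codec id.
            return {
                "name": self._zarr_v3_name,
                "object_codec_id": self.object_codec_id,
            }
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    @classmethod
    def _from_json_v3(cls, data: object) -> ZDType:
        if data != cls._zarr_v3_name:
            raise ValueError(f"{data!r} is not a {cls.__name__}")
        return cls()

    @classmethod
    def _from_json_v2(cls, data: object) -> ZDType:
        expected = {
            "name": cls._zarr_v3_name,
            "object_codec_id": cls.object_codec_id,
        }
        if data != expected:
            raise ValueError(f"{data!r} is not a {cls.__name__}")
        return cls()


class TinyInt(ZDType):
    """Hypothetical custom type: declares a name and inherits the rest."""

    _zarr_v3_name = "tiny_int"
```

With these defaults, `TinyInt().to_json(3)` returns `"tiny_int"` and `TinyInt._from_json_v3("tiny_int")` returns a `TinyInt` instance, with no JSON methods written on the subclass.
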
Split the ZDType base `to_json` into `_to_json_v2`/`_to_json_v3` hooks, and
add a `NumpyNativeDTypeV2` mixin for data types whose Zarr V2 name is the
NumPy type string of the wrapped dtype (which fully determines the dtype,
including byte order). The bool, integer, float, and complex data types now
inherit V2 (de)serialization from the mixin and V3 from the base, instead of
each concrete class repeating ~50 lines of `_from_json_v2`/`_from_json_v3`/
`to_json`/`_check_json_*` boilerplate plus a `_zarr_v2_names` table.

Behavior is byte-identical: `to_native_dtype().str` reproduces each type's
existing V2 name (e.g. "<i2", "|b1"), and reading goes through
`np.dtype(name)` exactly as the per-class methods did. Net ~1075 fewer lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
Extend the shared dtype-JSON machinery to the object, fixed-length, raw,
time, and structured data types:

- Add an `ObjectCodecDTypeV2` mixin for parameter-free object dtypes whose
  Zarr V2 form is `{"name": "|O", "object_codec_id": <id>}`
  (variable-length strings and bytes).
- Fixed-length / raw / time dtypes (FixedLengthUTF32, NullTerminatedBytes,
  RawBytes, DateTime64, TimeDelta64) now inherit V2 from `NumpyNativeDTypeV2`
  (their V2 name is the NumPy type string, which round-trips the length/unit/
  byte order via from_native_dtype) and implement only a small `_to_json_v3`
  hook for their V3 `configuration`. Their `_zarr_v2_names` tables are gone.
- Structured/Struct keep their bespoke field-list V2/V3 logic but drop the
  `to_json` dispatcher in favor of `_to_json_v2`/`_to_json_v3` hooks.

All output is byte-identical (verified per type, incl. the V3 unstable-spec
warnings and the variable_length_bytes "bytes" alias). Net ~645 fewer lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
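
The hook-based dispatch and the object-codec mixin described in the two commits above might look roughly like this. It is a sketch, not the PR's code: the method bodies, the mixin's placement in the class hierarchy, and the `"string"` / `"vlen-utf8"` values used for the example type are assumptions.

```python
from __future__ import annotations

from typing import ClassVar


class ZDType:
    _zarr_v3_name: ClassVar[str] = ""
    object_codec_id: ClassVar[str | None] = None

    def to_json(self, zarr_format: int) -> object:
        # The base class only dispatches; subclasses override the hooks.
        if zarr_format == 2:
            return self._to_json_v2()
        if zarr_format == 3:
            return self._to_json_v3()
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    def _to_json_v2(self) -> object:
        return {"name": self._zarr_v3_name, "object_codec_id": self.object_codec_id}

    def _to_json_v3(self) -> object:
        return self._zarr_v3_name


class ObjectCodecDTypeV2:
    """Mixin: the V2 form is the NumPy object type string plus a codec id."""

    object_codec_id: ClassVar[str | None] = None

    def _to_json_v2(self) -> object:
        return {"name": "|O", "object_codec_id": self.object_codec_id}


# The mixin is listed first so its _to_json_v2 wins in the method resolution order.
class VariableLengthUTF8(ObjectCodecDTypeV2, ZDType):
    _zarr_v3_name = "string"
    object_codec_id = "vlen-utf8"
```

Here `VariableLengthUTF8().to_json(2)` yields `{"name": "|O", "object_codec_id": "vlen-utf8"}`, while `to_json(3)` falls through to the base hook and yields `"string"`.
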
Introduce a single `_aliases` class var on the ZDType base (default empty)
for alternative Zarr V3 names a data type accepts on input, plus
`_zarr_v3_names()`/`_check_zarr_v3_name()` helpers used by the base
`_from_json_v3` and every parametrized `_check_json_v3`. The canonical
`_zarr_v3_name` is always what gets written out -- aliases are input-only.

This replaces the previously scattered, hardcoded aliases with declarations
on the types that have them:
- VariableLengthBytes: "bytes" (deletes its custom `_from_json_v3`)
- Struct: "structured" (legacy name)
- VariableLengthUTF8: "str" (the user-facing `VLEN_UTF8_ALIAS` is now derived
  from the type's declared names, so there is a single source of truth)

Adds tests documenting the complete alias surface across all data types and
verifying that aliases resolve on input while serialization emits the
canonical name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
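
A minimal sketch of the input-only alias mechanism described above. The helper names follow the commit message, but their bodies and return types here are assumptions for illustration.

```python
from __future__ import annotations

from typing import ClassVar


class ZDType:
    _zarr_v3_name: ClassVar[str]
    # Alternative names accepted on input; empty by default.
    _aliases: ClassVar[tuple[str, ...]] = ()

    @classmethod
    def _zarr_v3_names(cls) -> tuple[str, ...]:
        # Canonical name first, then any declared aliases.
        return (cls._zarr_v3_name, *cls._aliases)

    @classmethod
    def _check_zarr_v3_name(cls, name: object) -> bool:
        return name in cls._zarr_v3_names()

    @classmethod
    def _from_json_v3(cls, data: object) -> ZDType:
        if not cls._check_zarr_v3_name(data):
            raise ValueError(f"{data!r} is not a name for {cls.__name__}")
        return cls()

    def _to_json_v3(self) -> str:
        # Aliases are never written out; only the canonical name is.
        return self._zarr_v3_name


class VariableLengthBytes(ZDType):
    _zarr_v3_name = "variable_length_bytes"
    _aliases = ("bytes",)
```

Reading `"bytes"` and writing the result back produces `"variable_length_bytes"`, which is the round-trip behavior the new alias tests are described as checking.
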
The Zarr V3 core spec names raw data types `r<N>`, where N is a bit count
that must be a positive multiple of 8 (e.g. "r8", "r16"). zarr-python's own
canonical V3 form for raw bytes is the more explicit
`{"name": "raw_bytes", "configuration": {"length_bytes": ...}}`.

Accept the spec `r<N>` form on input (parsing N bits into a byte length) so
that raw arrays written by other Zarr V3 implementations can be read. This is
an input-only alias: serialization still emits the canonical `raw_bytes`
configuration form, consistent with how all other data type aliases behave.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
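
The `r<N>` handling described above amounts to parsing a bit count and converting it to a byte length. The following is a sketch under assumptions: the function names and the regular expression are hypothetical, and only the `raw_bytes` output shape is taken from the commit message.

```python
from __future__ import annotations

import re

# Hypothetical pattern for the core-spec raw name: "r" followed by a bit count.
_RAW_NAME = re.compile(r"r([0-9]+)")


def parse_raw_name(name: str) -> int:
    """Return the length in bytes encoded by a spec-style raw name such as "r16"."""
    match = _RAW_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"{name!r} is not of the form r<N>")
    bits = int(match.group(1))
    # N must be a positive multiple of 8.
    if bits == 0 or bits % 8 != 0:
        raise ValueError(f"bit count must be a positive multiple of 8, got {bits}")
    return bits // 8


def raw_bytes_to_json_v3(length_bytes: int) -> dict[str, object]:
    """Serialization always emits the canonical configuration form."""
    return {"name": "raw_bytes", "configuration": {"length_bytes": length_bytes}}
```

For example, `parse_raw_name("r16")` returns `2`, and serializing that length gives `{"name": "raw_bytes", "configuration": {"length_bytes": 2}}` rather than `"r16"`.
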
…mpatible modes

Replace the "try every registered data type until one matches" resolution with
normalize-then-look-up:

- Native NumPy dtypes resolve through an index keyed on the dtype class (built
  per call from the registry, which stays a plain Mapping). Each class maps to a
  single data type, except NumPy's VoidDType, which is shared by the raw-bytes
  and structured types and disambiguated by `.fields`. The NumPy "Object" dtype
  remains a deliberate ambiguity and is refused.
- Zarr V3 JSON resolves by name: canonical name, alias, or a parametric name
  (e.g. raw `r<N>`, via a new `_zarr_v3_name_pattern` class var).
- Zarr V2 JSON resolves object-codec-backed types by `object_codec_id`, custom
  types by their registered name, and everything else through the native NumPy
  dtype.

Add a `data_type_resolution` config option with two modes: "compatible"
(default) makes a best-effort attempt to read wrong-but-parsable Zarr V2 type
strings (e.g. ">u1", which NumPy normalizes to "|u1"), while "strict" accepts
only spec-compliant, canonical data type metadata.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
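
The name-based Zarr V3 lookup described above could be sketched as follows. This covers only the V3 path, and everything beyond the class var names is an assumption: the `resolve_v3_name` function, the per-call index construction, and the example classes and patterns are illustrative, not the PR's implementation.

```python
from __future__ import annotations

import re
from collections.abc import Mapping
from typing import ClassVar


class ZDType:
    _zarr_v3_name: ClassVar[str] = ""
    _aliases: ClassVar[tuple[str, ...]] = ()
    # Optional pattern for parametric names such as r<N>.
    _zarr_v3_name_pattern: ClassVar[re.Pattern[str] | None] = None


class Bool(ZDType):
    _zarr_v3_name = "bool"


class VariableLengthUTF8(ZDType):
    _zarr_v3_name = "string"
    _aliases = ("str",)


class RawBytes(ZDType):
    _zarr_v3_name = "raw_bytes"
    _zarr_v3_name_pattern = re.compile(r"r[0-9]+")


def resolve_v3_name(name: str, registry: Mapping[str, type[ZDType]]) -> type[ZDType]:
    # Build a name index per call so the registry itself stays a plain mapping.
    index: dict[str, type[ZDType]] = {}
    for cls in registry.values():
        for known in (cls._zarr_v3_name, *cls._aliases):
            index[known] = cls
    if name in index:
        return index[name]
    # Fall back to parametric names only when no exact name matched.
    for cls in registry.values():
        pattern = cls._zarr_v3_name_pattern
        if pattern is not None and pattern.fullmatch(name):
            return cls
    raise KeyError(name)


registry = {cls._zarr_v3_name: cls for cls in (Bool, VariableLengthUTF8, RawBytes)}
```

Compared with asking every registered type whether it matches, an exact-name lookup makes ambiguity visible: two types claiming the same name collide in the index instead of depending on iteration order.
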