Simplify the data type registry and share data-type JSON handling by d-v-b-agent · Pull Request #3 · d-v-b-agent/zarr-python

d-v-b-agent · 2026-06-27T20:54:24Z

Summary

Simplifies zarr-python's data type machinery so the registry is a plain Mapping[str, type] and the per-data-type JSON (de)serialization is shared on the ZDType base, removing a large amount of duplicated boilerplate across the built-in data types. Behavior is preserved: all on-disk metadata is byte-identical.

Net effect: +~390 / −~2100 lines across the dtype module, with no behavior change.

What changed (commit by commit)

1. Registry → plain Mapping[str, type]. Replaces the DataTypeRegistry frozen dataclass (which wrapped a dict plus a lazy-load list and six methods) with a plain dict[str, type[ZDType]]. Resolution (match_dtype/match_json) and lifecycle (register_data_type, unregister_data_type, load_data_type_entrypoints) become free functions over the mapping. Also fixes a latent bug: the old _lazy_load() was never called, so data types advertised via the zarr.data_type entry point group were silently never loaded.
2. Generic dtype-JSON defaults on the ZDType base. Makes to_json/_from_json_v2/_from_json_v3 concrete (not abstract). Parameter-free data types now inherit their V2 and V3 representations and need not implement any data-type JSON methods.
3–4. Migrate the built-in data types onto shared handling. A NumpyNativeDTypeV2 mixin (V2 name = the NumPy type string) and an ObjectCodecDTypeV2 mixin (V2 = "|O" + object codec id) replace ~50 lines of near-identical _from_json_*/to_json/_check_json_* boilerplate per class for bool/int/uint/float/complex and the string/bytes/time/structured types. The base to_json dispatches to _to_json_v2/_to_json_v3 hooks.
5. Declarative, uniform aliasing. A single _aliases class var (default empty) declares alternative Zarr V3 names accepted on input; the canonical _zarr_v3_name is always written out. This replaces scattered hardcoded aliases (VariableLengthBytes→"bytes", Struct→"structured", VariableLengthUTF8→"str"), with the user-facing VLEN_UTF8_ALIAS derived from the type's declared names.
6. Accept the Zarr V3 spec raw name r<N> on input. RawBytes now reads the core-spec r<N> form (N bits, a positive multiple of 8) in addition to its raw_bytes configuration form, so raw arrays written by other Zarr V3 implementations can be read. Input-only; output is unchanged.

Testing

Full test suite passes (6226 passed; the one deselected test is a pre-existing checkout artifact — the clone has no git tags, so version derivation can't find a v* tag).
ruff and mypy clean on the dtype module.
Every migrated data type verified to produce byte-identical V2/V3 JSON; new tests cover the alias surface and the r<N> form.

A follow-up issue will track recommendations for the data-type extension API (ergonomics for third-party data type authors).

🤖 Generated with Claude Code

Replace the DataTypeRegistry frozen dataclass (which wrapped a dict in `contents` plus a `_lazy_load_list` and six methods) with a plain `dict[str, type[ZDType]]`. Resolution (`match_dtype`/`match_json`) and lifecycle (`register_data_type`, `unregister_data_type`, `load_data_type_entrypoints`) are now free functions over the mapping, each accepting an optional `registry=` so callers can operate on an isolated dict.

This also fixes a latent bug: the old `_lazy_load()` was never called in `src/`, so data types advertised via the `zarr.data_type` entry point group were silently never loaded. `load_data_type_entrypoints()` is now invoked when the dtype package is imported.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
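For readers skimming the commit, here is a minimal sketch of the "plain dict plus free functions" shape described above. It is not the code in this PR: `ZDType` is reduced to a stand-in, `DEFAULT_REGISTRY` is a hypothetical name, and the duplicate-name check is an assumption about reasonable behavior rather than something the commit message states.

```python
from __future__ import annotations


class ZDType:
    """Stand-in for the real base class; only the V3 name matters here."""

    _zarr_v3_name: str = ""


class Int16(ZDType):
    _zarr_v3_name = "int16"


# Hypothetical module-level default: a plain dict, no wrapper class.
DEFAULT_REGISTRY: dict[str, type[ZDType]] = {}


def register_data_type(
    cls: type[ZDType], *, registry: dict[str, type[ZDType]] | None = None
) -> None:
    """Add cls under its V3 name (assumed rule: a taken name is an error)."""
    # Compare against None explicitly so an empty isolated dict is honored.
    target = DEFAULT_REGISTRY if registry is None else registry
    existing = target.get(cls._zarr_v3_name)
    if existing is not None and existing is not cls:
        raise ValueError(f"{cls._zarr_v3_name!r} is already registered")
    target[cls._zarr_v3_name] = cls


def unregister_data_type(
    name: str, *, registry: dict[str, type[ZDType]] | None = None
) -> None:
    """Remove a data type by name from the chosen registry."""
    target = DEFAULT_REGISTRY if registry is None else registry
    del target[name]
```

Passing `registry=` lets a test or a plugin work on its own dict without touching the shared default, which is the main ergonomic gain of dropping the wrapper class.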

Make `to_json`, `_from_json_v2`, and `_from_json_v3` concrete on the ZDType base class instead of abstract. The defaults handle parameter-free data types: the Zarr V3 form is just the data type name, and the Zarr V2 form is that name plus an optional `object_codec_id` (a new base class var, default None, for object-backed types like variable-length strings). The V2 name defaults to the V3 name, which is what custom data types want.

Built-in data types keep their existing NumPy-typestring-based overrides, so their behavior is unchanged. The win is for custom data types: a parameter-free dtype now inherits all dtype-JSON handling and no longer needs to implement `_check_json_v2`, `_from_json_v2`, `_from_json_v3`, or `to_json`. The custom_dtype example shrinks by ~90 lines accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
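A rough illustration of what "concrete defaults for parameter-free types" could look like. `ZDTypeSketch` and `Int2` are hypothetical names, and the V2 dict shape (`name` plus `object_codec_id`) is taken from the wording of the commit messages in this PR, not copied from the implementation.

```python
from __future__ import annotations

from typing import Any, ClassVar


class ZDTypeSketch:
    """Hypothetical stand-in for the ZDType base; not the real class."""

    _zarr_v3_name: ClassVar[str]
    # None unless the type is object-backed (e.g. variable-length strings).
    object_codec_id: ClassVar[str | None] = None

    def to_json(self, zarr_format: int) -> Any:
        if zarr_format == 3:
            # V3 form of a parameter-free type is just its name.
            return self._zarr_v3_name
        if zarr_format == 2:
            # V2 name defaults to the V3 name.
            return {"name": self._zarr_v3_name, "object_codec_id": self.object_codec_id}
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    @classmethod
    def _from_json_v3(cls, data: Any) -> ZDTypeSketch:
        if data != cls._zarr_v3_name:
            raise ValueError(f"{data!r} is not {cls._zarr_v3_name!r}")
        return cls()

    @classmethod
    def _from_json_v2(cls, data: Any) -> ZDTypeSketch:
        expected = {"name": cls._zarr_v3_name, "object_codec_id": cls.object_codec_id}
        if data != expected:
            raise ValueError(f"{data!r} does not describe {cls.__name__}")
        return cls()


class Int2(ZDTypeSketch):
    """A custom parameter-free type only has to declare its name."""

    _zarr_v3_name = "int2"
```

Under these assumptions a third-party type like `Int2` carries no JSON methods of its own, which is the reduction the commit message attributes to the custom_dtype example.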

Split the ZDType base `to_json` into `_to_json_v2`/`_to_json_v3` hooks, and add a `NumpyNativeDTypeV2` mixin for data types whose Zarr V2 name is the NumPy type string of the wrapped dtype (which fully determines the dtype, including byte order).

The bool, integer, float, and complex data types now inherit V2 (de)serialization from the mixin and V3 from the base, instead of each concrete class repeating ~50 lines of `_from_json_v2`/`_from_json_v3`/`to_json`/`_check_json_*` boilerplate plus a `_zarr_v2_names` table.

Behavior is byte-identical: `to_native_dtype().str` reproduces each type's existing V2 name (e.g. "<i2", "|b1"), and reading goes through `np.dtype(name)` exactly as the per-class methods did. Net ~1075 fewer lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
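The hook split can be pictured roughly as follows. This is a dependency-free sketch: the real mixin is described as using `to_native_dtype().str`, whereas here a hypothetical `native_typestring()` method stands in for that call so the example runs without NumPy. All class names below are illustrative.

```python
from __future__ import annotations

from typing import Any


class BaseSketch:
    """Stand-in base: to_json only dispatches to per-format hooks."""

    _zarr_v3_name: str

    def to_json(self, zarr_format: int) -> Any:
        if zarr_format == 2:
            return self._to_json_v2()
        if zarr_format == 3:
            return self._to_json_v3()
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    def _to_json_v2(self) -> Any:
        return {"name": self._zarr_v3_name, "object_codec_id": None}

    def _to_json_v3(self) -> Any:
        return self._zarr_v3_name


class NativeTypestringV2Sketch:
    """Mixin: the V2 name is the native type string of the wrapped dtype."""

    def native_typestring(self) -> str:  # hypothetical stand-in for to_native_dtype().str
        raise NotImplementedError

    def _to_json_v2(self) -> Any:
        return {"name": self.native_typestring(), "object_codec_id": None}


class Int16Sketch(NativeTypestringV2Sketch, BaseSketch):
    _zarr_v3_name = "int16"

    def __init__(self, little_endian: bool = True) -> None:
        self.little_endian = little_endian

    def native_typestring(self) -> str:
        # Byte order is part of the type string, so V2 round-trips it.
        return ("<" if self.little_endian else ">") + "i2"
```

Because the mixin precedes the base in the MRO, the concrete class picks up V2 handling from the mixin and V3 handling from the base without defining either hook itself.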

Extend the shared dtype-JSON machinery to the object, fixed-length, raw, time, and structured data types:

- Add an `ObjectCodecDTypeV2` mixin for parameter-free object dtypes whose Zarr V2 form is `{"name": "|O", "object_codec_id": <id>}` (variable-length strings and bytes).
- Fixed-length / raw / time dtypes (FixedLengthUTF32, NullTerminatedBytes, RawBytes, DateTime64, TimeDelta64) now inherit V2 from `NumpyNativeDTypeV2` (their V2 name is the NumPy type string, which round-trips the length/unit/byte order via from_native_dtype) and implement only a small `_to_json_v3` hook for their V3 `configuration`. Their `_zarr_v2_names` tables are gone.
- Structured/Struct keep their bespoke field-list V2/V3 logic but drop the `to_json` dispatcher in favor of `_to_json_v2`/`_to_json_v3` hooks.

All output is byte-identical (verified per type, incl. the V3 unstable-spec warnings and the variable_length_bytes "bytes" alias). Net ~645 fewer lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
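To make the two patterns in this commit concrete, the sketch below shows an object-codec mixin and a parametrized type that supplies only a V3 `configuration` hook. The class names are hypothetical, and the `"string"` / `"vlen-utf8"` values are illustrative choices; only the `|O` V2 shape and the `raw_bytes` / `length_bytes` V3 shape come from the commit messages in this PR.

```python
from __future__ import annotations

from typing import Any


class BaseSketch:
    """Stand-in base with the per-format hooks."""

    _zarr_v3_name: str

    def to_json(self, zarr_format: int) -> Any:
        if zarr_format == 2:
            return self._to_json_v2()
        if zarr_format == 3:
            return self._to_json_v3()
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    def _to_json_v3(self) -> Any:
        return self._zarr_v3_name


class ObjectCodecV2Sketch:
    """Mixin: V2 form is the NumPy object type string plus a codec id."""

    object_codec_id: str

    def _to_json_v2(self) -> Any:
        return {"name": "|O", "object_codec_id": self.object_codec_id}


class VarLenStringSketch(ObjectCodecV2Sketch, BaseSketch):
    _zarr_v3_name = "string"  # illustrative value
    object_codec_id = "vlen-utf8"  # illustrative value


class RawBytesSketch(BaseSketch):
    """Parametrized type: overrides only what its parameters require."""

    _zarr_v3_name = "raw_bytes"

    def __init__(self, length: int) -> None:
        self.length = length

    def _to_json_v2(self) -> Any:
        # Written by hand here; the PR gets this from the native type string.
        return {"name": f"|V{self.length}", "object_codec_id": None}

    def _to_json_v3(self) -> Any:
        return {"name": self._zarr_v3_name, "configuration": {"length_bytes": self.length}}
```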

Introduce a single `_aliases` class var on the ZDType base (default empty) for alternative Zarr V3 names a data type accepts on input, plus `_zarr_v3_names()`/`_check_zarr_v3_name()` helpers used by the base `_from_json_v3` and every parametrized `_check_json_v3`. The canonical `_zarr_v3_name` is always what gets written out -- aliases are input-only.

This replaces the previously scattered, hardcoded aliases with declarations on the types that have them:

- VariableLengthBytes: "bytes" (deletes its custom `_from_json_v3`)
- Struct: "structured" (legacy name)
- VariableLengthUTF8: "str" (the user-facing `VLEN_UTF8_ALIAS` is now derived from the type's declared names, so there is a single source of truth)

Adds tests documenting the complete alias surface across all data types and verifying that aliases resolve on input while serialization emits the canonical name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
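The input-only alias rule can be sketched like this. The helper names mirror those in the commit message, but their bodies are guesses at the simplest behavior consistent with it, and `AliasBaseSketch` / `VariableLengthBytesSketch` are hypothetical stand-ins.

```python
from __future__ import annotations

from typing import Any, ClassVar


class AliasBaseSketch:
    """Stand-in base showing declarative, input-only aliases."""

    _zarr_v3_name: ClassVar[str]
    _aliases: ClassVar[tuple[str, ...]] = ()  # default: no aliases

    @classmethod
    def _zarr_v3_names(cls) -> tuple[str, ...]:
        # Canonical name first, then any declared aliases.
        return (cls._zarr_v3_name, *cls._aliases)

    @classmethod
    def _check_zarr_v3_name(cls, name: object) -> bool:
        return isinstance(name, str) and name in cls._zarr_v3_names()

    @classmethod
    def _from_json_v3(cls, data: Any) -> AliasBaseSketch:
        if not cls._check_zarr_v3_name(data):
            raise ValueError(f"{data!r} is not a name of {cls.__name__}")
        return cls()

    def _to_json_v3(self) -> str:
        # Serialization never emits an alias.
        return self._zarr_v3_name


class VariableLengthBytesSketch(AliasBaseSketch):
    _zarr_v3_name = "variable_length_bytes"
    _aliases = ("bytes",)
```

Reading `"bytes"` and writing back yields `"variable_length_bytes"`, so metadata produced with an alias is normalized to the canonical name on the next write.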

The Zarr V3 core spec names raw data types `r<N>`, where N is a bit count that must be a positive multiple of 8 (e.g. "r8", "r16"). zarr-python's own canonical V3 form for raw bytes is the more explicit `{"name": "raw_bytes", "configuration": {"length_bytes": ...}}`.

Accept the spec `r<N>` form on input (parsing N bits into a byte length) so that raw arrays written by other Zarr V3 implementations can be read. This is an input-only alias: serialization still emits the canonical `raw_bytes` configuration form, consistent with how all other data type aliases behave.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
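The bit-count rule is small enough to show directly. The function below is a hypothetical helper, not the parser in this PR; it only encodes the constraint stated above (N is a positive multiple of 8) and converts bits to a byte length.

```python
import re

# ASCII digits only, matched against the whole name.
_RAW_NAME = re.compile(r"r([0-9]+)")


def parse_raw_bits_name(name: str) -> int:
    """Return the byte length encoded by a spec-style raw name such as "r16"."""
    match = _RAW_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"{name!r} is not of the form r<N>")
    bits = int(match.group(1))
    if bits <= 0 or bits % 8 != 0:
        raise ValueError(f"bit count must be a positive multiple of 8, got {bits}")
    return bits // 8
```

With this rule, "r8" and "r16" map to byte lengths 1 and 2, while names such as "r0" and "r12" are rejected.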

…mpatible modes

Replace the "try every registered data type until one matches" resolution with normalize-then-look-up:

- Native NumPy dtypes resolve through an index keyed on the dtype class (built per call from the registry, which stays a plain Mapping). Each class maps to a single data type, except NumPy's VoidDType, which is shared by the raw-bytes and structured types and disambiguated by `.fields`. The NumPy "Object" dtype remains a deliberate ambiguity and is refused.
- Zarr V3 JSON resolves by name: canonical name, alias, or a parametric name (e.g. raw `r<N>`, via a new `_zarr_v3_name_pattern` class var).
- Zarr V2 JSON resolves object-codec-backed types by `object_codec_id`, custom types by their registered name, and everything else through the native NumPy dtype.

Add a `data_type_resolution` config option with two modes: "compatible" (default) makes a best-effort attempt to read wrong-but-parsable Zarr V2 type strings (e.g. ">u1", which NumPy normalizes to "|u1"), while "strict" accepts only spec-compliant, canonical data type metadata.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
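As an illustration of the V3 branch of normalize-then-look-up, the sketch below builds a name index from a plain mapping on each call and then checks parametric patterns. The function name `resolve_v3_name` and the stand-in classes are hypothetical; only the three lookup sources (canonical name, alias, `_zarr_v3_name_pattern`) come from the commit message.

```python
from __future__ import annotations

import re
from typing import ClassVar, Mapping


class NamedSketch:
    """Stand-in base carrying the three name-related class vars."""

    _zarr_v3_name: ClassVar[str]
    _aliases: ClassVar[tuple[str, ...]] = ()
    _zarr_v3_name_pattern: ClassVar[re.Pattern[str] | None] = None


class VarBytesSketch(NamedSketch):
    _zarr_v3_name = "variable_length_bytes"
    _aliases = ("bytes",)


class RawSketch(NamedSketch):
    _zarr_v3_name = "raw_bytes"
    _zarr_v3_name_pattern = re.compile(r"r[0-9]+")


def resolve_v3_name(
    name: str, registry: Mapping[str, type[NamedSketch]]
) -> type[NamedSketch]:
    """Look a V3 name up directly instead of trying every registered type."""
    exact: dict[str, type[NamedSketch]] = {}
    patterns: list[tuple[re.Pattern[str], type[NamedSketch]]] = []
    # Index is rebuilt per call, so the registry can stay a plain mapping.
    for cls in registry.values():
        exact[cls._zarr_v3_name] = cls
        for alias in cls._aliases:
            exact[alias] = cls
        if cls._zarr_v3_name_pattern is not None:
            patterns.append((cls._zarr_v3_name_pattern, cls))
    if name in exact:
        return exact[name]
    for pattern, cls in patterns:
        if pattern.fullmatch(name):
            return cls
    raise KeyError(name)
```

Exact names are checked before patterns, so a parametric pattern cannot shadow a canonical name or alias in this sketch.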

d-v-b-agent and others added 6 commits June 26, 2026 15:39

d-v-b-agent mentioned this pull request Jun 27, 2026

Data type extension API: friction and recommendations for third-party data type authors #4

Open

d-v-b-agent wants to merge 7 commits into main from dtype-registry-as-mapping
