Simplify the data type registry and share data-type JSON handling#3
Open
d-v-b-agent wants to merge 7 commits into
Replace the DataTypeRegistry frozen dataclass (which wrapped a dict in `contents` plus a `_lazy_load_list` and six methods) with a plain `dict[str, type[ZDType]]`. Resolution (`match_dtype`/`match_json`) and lifecycle (`register_data_type`, `unregister_data_type`, `load_data_type_entrypoints`) are now free functions over the mapping, each accepting an optional `registry=` so callers can operate on an isolated dict.
This also fixes a latent bug: the old `_lazy_load()` was never called in `src/`, so data types advertised via the `zarr.data_type` entry point group were silently never loaded. `load_data_type_entrypoints()` is now invoked when the dtype package is imported.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
Make `to_json`, `_from_json_v2`, and `_from_json_v3` concrete on the ZDType base class instead of abstract. The defaults handle parameter-free data types: the Zarr V3 form is just the data type name, and the Zarr V2 form is that name plus an optional `object_codec_id` (a new base class var, default None, for object-backed types like variable-length strings). The V2 name defaults to the V3 name, which is what custom data types want.
Built-in data types keep their existing NumPy-typestring-based overrides, so their behavior is unchanged. The win is for custom data types: a parameter-free dtype now inherits all dtype-JSON handling and no longer needs to implement `_check_json_v2`, `_from_json_v2`, `_from_json_v3`, or `to_json`. The custom_dtype example shrinks by ~90 lines accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
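To illustrate what "a parameter-free dtype inherits all dtype-JSON handling" could look like, here is a simplified sketch. The base class below is a hypothetical stand-in, the `Int2` type is invented for the example, and the V2 dict shape is assumed to follow the `{"name": ..., "object_codec_id": ...}` form quoted elsewhere in this PR.

```python
from __future__ import annotations

from typing import Any, ClassVar


class ZDType:
    """Hypothetical, simplified stand-in for the real base class."""

    _zarr_v3_name: ClassVar[str]
    # Set on object-backed types (e.g. variable-length strings); None otherwise.
    object_codec_id: ClassVar[str | None] = None

    def to_json(self, zarr_format: int) -> Any:
        if zarr_format == 3:
            # V3 form of a parameter-free type: just the name.
            return self._zarr_v3_name
        if zarr_format == 2:
            # V2 form: the name (defaulting to the V3 name) plus the codec id.
            return {"name": self._zarr_v3_name, "object_codec_id": self.object_codec_id}
        raise ValueError(f"unsupported zarr_format: {zarr_format}")

    @classmethod
    def _from_json_v3(cls, data: Any) -> ZDType:
        if data != cls._zarr_v3_name:
            raise ValueError(f"expected {cls._zarr_v3_name!r}, got {data!r}")
        return cls()


class Int2(ZDType):
    """An invented custom data type: only the name is declared."""

    _zarr_v3_name = "int2"
```

The subclass body contains no JSON methods at all, which is the reduction the commit message attributes to the custom_dtype example.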
Split the ZDType base `to_json` into `_to_json_v2`/`_to_json_v3` hooks, and add a `NumpyNativeDTypeV2` mixin for data types whose Zarr V2 name is the NumPy type string of the wrapped dtype (which fully determines the dtype, including byte order).
The bool, integer, float, and complex data types now inherit V2 (de)serialization from the mixin and V3 from the base, instead of each concrete class repeating ~50 lines of `_from_json_v2`/`_from_json_v3`/`to_json`/`_check_json_*` boilerplate plus a `_zarr_v2_names` table.
Behavior is byte-identical: `to_native_dtype().str` reproduces each type's existing V2 name (e.g. "<i2", "|b1"), and reading goes through `np.dtype(name)` exactly as the per-class methods did. Net ~1075 fewer lines.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
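The hook split can be pictured roughly as follows. This is a sketch under assumptions, not the PR's implementation: the classes are stand-ins, and `native_typestr()` is a hypothetical placeholder for `to_native_dtype().str` so that the example runs without NumPy.

```python
from __future__ import annotations

from typing import Any


class ZDType:
    """Hypothetical stand-in: the base dispatches to per-format hooks."""

    _zarr_v3_name: str = ""

    def to_json(self, zarr_format: int) -> Any:
        if zarr_format == 2:
            return self._to_json_v2()
        if zarr_format == 3:
            return self._to_json_v3()
        raise ValueError(f"unsupported zarr_format: {zarr_format}")

    def _to_json_v2(self) -> Any:
        return {"name": self._zarr_v3_name, "object_codec_id": None}

    def _to_json_v3(self) -> Any:
        return self._zarr_v3_name


class NumpyNativeDTypeV2:
    """Mixin sketch: the V2 name is the NumPy type string of the wrapped dtype."""

    def native_typestr(self) -> str:
        # Placeholder for `to_native_dtype().str`; subclasses supply the value.
        raise NotImplementedError

    def _to_json_v2(self) -> Any:
        return {"name": self.native_typestr(), "object_codec_id": None}


class Int16(NumpyNativeDTypeV2, ZDType):
    _zarr_v3_name = "int16"

    def native_typestr(self) -> str:
        return "<i2"  # little-endian 2-byte signed integer
```

With the mixin listed first in the bases, `Int16` picks up the V2 hook from the mixin and the V3 hook from the base, so the concrete class carries no serialization code of its own.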
Extend the shared dtype-JSON machinery to the object, fixed-length, raw,
time, and structured data types:
- Add an `ObjectCodecDTypeV2` mixin for parameter-free object dtypes whose
Zarr V2 form is `{"name": "|O", "object_codec_id": <id>}`
(variable-length strings and bytes).
- Fixed-length / raw / time dtypes (FixedLengthUTF32, NullTerminatedBytes,
RawBytes, DateTime64, TimeDelta64) now inherit V2 from `NumpyNativeDTypeV2`
(their V2 name is the NumPy type string, which round-trips the length/unit/
byte order via from_native_dtype) and implement only a small `_to_json_v3`
hook for their V3 `configuration`. Their `_zarr_v2_names` tables are gone.
- Structured/Struct keep their bespoke field-list V2/V3 logic but drop the
`to_json` dispatcher in favor of `_to_json_v2`/`_to_json_v3` hooks.
All output is byte-identical (verified per type, incl. the V3 unstable-spec
warnings and the variable_length_bytes "bytes" alias). Net ~645 fewer lines.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
Introduce a single `_aliases` class var on the ZDType base (default empty) for alternative Zarr V3 names a data type accepts on input, plus `_zarr_v3_names()`/`_check_zarr_v3_name()` helpers used by the base `_from_json_v3` and every parametrized `_check_json_v3`. The canonical `_zarr_v3_name` is always what gets written out -- aliases are input-only.
This replaces the previously scattered, hardcoded aliases with declarations on the types that have them:
- VariableLengthBytes: "bytes" (deletes its custom `_from_json_v3`)
- Struct: "structured" (legacy name)
- VariableLengthUTF8: "str" (the user-facing `VLEN_UTF8_ALIAS` is now derived from the type's declared names, so there is a single source of truth)
Adds tests documenting the complete alias surface across all data types and verifying that aliases resolve on input while serialization emits the canonical name.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
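The input-only alias rule can be sketched like this. The attribute and helper names follow the commit message, but their signatures and the surrounding class are assumptions made for illustration.

```python
from __future__ import annotations

from typing import ClassVar


class ZDType:
    """Hypothetical stand-in for the base class."""

    _zarr_v3_name: ClassVar[str]
    _aliases: ClassVar[tuple[str, ...]] = ()

    @classmethod
    def _zarr_v3_names(cls) -> tuple[str, ...]:
        # Canonical name first, then any input-only aliases.
        return (cls._zarr_v3_name, *cls._aliases)

    @classmethod
    def _check_zarr_v3_name(cls, name: object) -> bool:
        return name in cls._zarr_v3_names()

    @classmethod
    def _from_json_v3(cls, data: object) -> ZDType:
        if not cls._check_zarr_v3_name(data):
            raise ValueError(f"{data!r} is not a name for {cls.__name__}")
        return cls()

    def _to_json_v3(self) -> str:
        # Always write the canonical name, never an alias.
        return self._zarr_v3_name


class VariableLengthBytes(ZDType):
    _zarr_v3_name = "variable_length_bytes"
    _aliases = ("bytes",)
```

Reading `"bytes"` and then serializing yields `"variable_length_bytes"`, which is the round-trip behavior the new tests are described as verifying.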
The Zarr V3 core spec names raw data types `r<N>`, where N is a bit count
that must be a positive multiple of 8 (e.g. "r8", "r16"). zarr-python's own
canonical V3 form for raw bytes is the more explicit
`{"name": "raw_bytes", "configuration": {"length_bytes": ...}}`.
Accept the spec `r<N>` form on input (parsing N bits into a byte length) so
that raw arrays written by other Zarr V3 implementations can be read. This is
an input-only alias: serialization still emits the canonical `raw_bytes`
configuration form, consistent with how all other data type aliases behave.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
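For reference, the bits-to-bytes parsing described in this commit amounts to something like the sketch below. The helper names are hypothetical and the error messages are invented; only the `r<N>` rule (N a positive multiple of 8) and the canonical `raw_bytes` output form come from the commit message.

```python
import re

# Matches the spec form "r<N>", where N is a decimal bit count.
_RAW_NAME = re.compile(r"r([0-9]+)")


def parse_raw_bits_name(name: str) -> int:
    """Return the byte length encoded by an "r<N>" name (hypothetical helper).

    Raises ValueError if the name is malformed or N is not a positive
    multiple of 8.
    """
    match = _RAW_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a raw data type name: {name!r}")
    bits = int(match.group(1))
    if bits <= 0 or bits % 8 != 0:
        raise ValueError(f"bit count must be a positive multiple of 8, got {bits}")
    return bits // 8


def raw_bytes_json(name: str) -> dict:
    """Canonical V3 form for the raw type named by the spec-style `name`."""
    return {
        "name": "raw_bytes",
        "configuration": {"length_bytes": parse_raw_bits_name(name)},
    }
```

Because the alias is input-only, `raw_bytes_json("r16")` produces the `raw_bytes` configuration form rather than echoing `"r16"` back.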
…mpatible modes
Replace the "try every registered data type until one matches" resolution with normalize-then-look-up:
- Native NumPy dtypes resolve through an index keyed on the dtype class (built per call from the registry, which stays a plain Mapping). Each class maps to a single data type, except NumPy's VoidDType, which is shared by the raw-bytes and structured types and disambiguated by `.fields`. The NumPy "Object" dtype remains a deliberate ambiguity and is refused.
- Zarr V3 JSON resolves by name: canonical name, alias, or a parametric name (e.g. raw `r<N>`, via a new `_zarr_v3_name_pattern` class var).
- Zarr V2 JSON resolves object-codec-backed types by `object_codec_id`, custom types by their registered name, and everything else through the native NumPy dtype.
Add a `data_type_resolution` config option with two modes: "compatible" (default) makes a best-effort attempt to read wrong-but-parsable Zarr V2 type strings (e.g. ">u1", which NumPy normalizes to "|u1"), while "strict" accepts only spec-compliant, canonical data type metadata.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XZHYrRBh54e7tFearnZ72r
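A rough sketch of the Zarr V3 by-name branch of this resolution scheme, covering canonical names, aliases, and parametric names. The function `resolve_v3_name` and the class stand-ins are hypothetical; the class var names are taken from the commit messages, and the NumPy and Zarr V2 branches are left out to keep the example dependency-free.

```python
from __future__ import annotations

import re
from collections.abc import Mapping
from typing import ClassVar


class ZDType:
    """Hypothetical stand-in for the base class."""

    _zarr_v3_name: ClassVar[str]
    _aliases: ClassVar[tuple[str, ...]] = ()
    _zarr_v3_name_pattern: ClassVar[re.Pattern[str] | None] = None


class VariableLengthBytes(ZDType):
    _zarr_v3_name = "variable_length_bytes"
    _aliases = ("bytes",)


class RawBytes(ZDType):
    _zarr_v3_name = "raw_bytes"
    _zarr_v3_name_pattern = re.compile(r"r[0-9]+")


def resolve_v3_name(name: str, registry: Mapping[str, type[ZDType]]) -> type[ZDType]:
    """Look up a data type class by V3 name instead of trying every class."""
    # The index is rebuilt per call, so the registry can stay a plain mapping.
    index: dict[str, type[ZDType]] = {}
    for cls in registry.values():
        index[cls._zarr_v3_name] = cls
        for alias in cls._aliases:
            index[alias] = cls
    if name in index:
        return index[name]
    # Parametric names (e.g. "r16") cannot be enumerated, so match by pattern.
    for cls in registry.values():
        pattern = cls._zarr_v3_name_pattern
        if pattern is not None and pattern.fullmatch(name):
            return cls
    raise ValueError(f"unknown data type name: {name!r}")


registry = {cls._zarr_v3_name: cls for cls in (VariableLengthBytes, RawBytes)}
```

Unlike the try-every-type approach, an unknown name fails with a single lookup error rather than after every registered class has rejected it.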
Summary
Simplifies zarr-python's data type machinery so the registry is a plain `Mapping[str, type]` and the per-data-type JSON (de)serialization is shared on the `ZDType` base, removing a large amount of duplicated boilerplate across the built-in data types. Behavior is preserved: all on-disk metadata is byte-identical.
Net effect: +390 / −2100 lines across the dtype module, with no behavior change.
What changed (commit by commit)
1. Registry → plain `Mapping[str, type]`. Replaces the `DataTypeRegistry` frozen dataclass (which wrapped a dict plus a lazy-load list and six methods) with a plain `dict[str, type[ZDType]]`. Resolution (`match_dtype`/`match_json`) and lifecycle (`register_data_type`, `unregister_data_type`, `load_data_type_entrypoints`) become free functions over the mapping. Also fixes a latent bug: the old `_lazy_load()` was never called, so data types advertised via the `zarr.data_type` entry point group were silently never loaded.
2. Generic dtype-JSON defaults on the `ZDType` base. Makes `to_json`/`_from_json_v2`/`_from_json_v3` concrete (not abstract). Parameter-free data types now inherit their V2 and V3 representations and need not implement any data-type JSON methods.
3–4. Migrate the built-in data types onto shared handling. A `NumpyNativeDTypeV2` mixin (V2 name = the NumPy type string) and an `ObjectCodecDTypeV2` mixin (V2 = `"|O"` + object codec id) replace ~50 lines of near-identical `_from_json_*`/`to_json`/`_check_json_*` boilerplate per class for bool/int/uint/float/complex and the string/bytes/time/structured types. The base `to_json` dispatches to `_to_json_v2`/`_to_json_v3` hooks.
5. Declarative, uniform aliasing. A single `_aliases` class var (default empty) declares alternative Zarr V3 names accepted on input; the canonical `_zarr_v3_name` is always written out. This replaces scattered hardcoded aliases (`VariableLengthBytes` → `"bytes"`, `Struct` → `"structured"`, `VariableLengthUTF8` → `"str"`), with the user-facing `VLEN_UTF8_ALIAS` derived from the type's declared names.
6. Accept the Zarr V3 spec raw name `r<N>` on input. `RawBytes` now reads the core-spec `r<N>` form (N bits, a positive multiple of 8) in addition to its `raw_bytes` configuration form, so raw arrays written by other Zarr V3 implementations can be read. Input-only; output is unchanged.
Testing
- [...] `v*` tag).
- `ruff` and `mypy` clean on the dtype module.
- [...] `r<N>` form.
A follow-up issue will track recommendations for the data-type extension API (ergonomics for third-party data type authors).
🤖 Generated with Claude Code