From 32a4663a17c7c59dbebb5c8b066067b49e60813c Mon Sep 17 00:00:00 2001 From: BrianMichell Date: Mon, 16 Mar 2026 19:44:58 +0000 Subject: [PATCH 1/4] Add documentation for open_as_void --- docs/index.rst | 6 + docs/open_as_void.rst | 220 +++++++++++++++++++++++++++++ tensorstore/driver/zarr/index.rst | 7 + tensorstore/driver/zarr3/index.rst | 7 + 4 files changed, 240 insertions(+) create mode 100644 docs/open_as_void.rst diff --git a/docs/index.rst b/docs/index.rst index 404ec7582..21e388d43 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -36,6 +36,12 @@ TensorStore driver/index kvstore/index +.. toctree:: + :hidden: + :caption: Advanced Features + + open_as_void + TensorStore is a library for efficiently reading and writing large multi-dimensional arrays. Highlights diff --git a/docs/open_as_void.rst b/docs/open_as_void.rst new file mode 100644 index 000000000..e695fc1f4 --- /dev/null +++ b/docs/open_as_void.rst @@ -0,0 +1,220 @@ +.. _open-as-void: + +Raw Byte Access (``open_as_void``) +================================== + +The ``open_as_void`` option provides raw byte-level access to zarr arrays with +structured data types, bypassing the normal field interpretation. This feature +is available for both the :ref:`driver/zarr2` and :ref:`driver/zarr3` drivers. + +Supported Data Types +-------------------- + +The ``open_as_void`` option is only valid for structured data types: + +- **Zarr v2**: ``structured`` dtype (NumPy-style structured arrays) +- **Zarr v3**: ``struct`` and ``structured`` dtypes + +Attempting to use ``open_as_void`` with non-structured data types will result +in an error. + +Purpose +------- + +When opening an array with :json:`"open_as_void": true`, TensorStore exposes +the underlying byte representation of the array data rather than interpreting +it according to the stored field structure. + +Behavior +-------- + +When ``open_as_void`` is enabled: + +1. 
**Data type becomes byte**: The resulting TensorStore has dtype + :json:schema:`~dtype.byte` regardless of the original structured data type. + +2. **Additional dimension added**: A new innermost dimension is appended to + represent the byte layout of each element. The size of this dimension + equals the number of bytes per element in the original structured type. + +3. **Codecs are preserved**: All encoding/decoding (including compression) + is still applied. The raw bytes exposed are the *decoded* element bytes, + not the raw compressed chunk data. + +Dimension Transformation +~~~~~~~~~~~~~~~~~~~~~~~~ + +For an array with shape ``[D0, D1, ..., Dn]`` and a structured data type of +size ``B`` bytes per element, opening with ``open_as_void`` produces a +TensorStore with: + +- Shape: ``[D0, D1, ..., Dn, B]`` +- Rank: original rank + 1 +- Data type: ``byte`` + +.. admonition:: Example: Zarr v2 structured dtype + :class: example + + A zarr v2 array with structured dtype ``[("x", "|u1"), ("y", "` via the +:json:schema:`~driver/zarr2.open_as_void` option, which exposes array data as +raw bytes instead of interpreting it according to the data type. + Compressors ----------- diff --git a/tensorstore/driver/zarr3/index.rst b/tensorstore/driver/zarr3/index.rst index 0acfc92b1..d04b0092f 100644 --- a/tensorstore/driver/zarr3/index.rst +++ b/tensorstore/driver/zarr3/index.rst @@ -15,6 +15,13 @@ creating new arrays, and resizing arrays. .. json:schema:: driver/zarr3/Metadata +Raw Byte Access +--------------- + +The zarr3 driver supports :ref:`raw byte access <open-as-void>` via the +:json:schema:`~driver/zarr3.open_as_void` option, which exposes array data as +raw bytes instead of interpreting it according to the data type. 
+ Codecs ------ From 0f8f91a23146e6f32c77454a5c57f957cde08283 Mon Sep 17 00:00:00 2001 From: BrianMichell Date: Thu, 9 Apr 2026 17:59:20 +0000 Subject: [PATCH 2/4] Update implementation details for codec chain implementation --- docs/open_as_void.rst | 163 ++++++++++++++++------------ tensorstore/driver/zarr3/index.rst | 21 +++- tensorstore/driver/zarr3/schema.yml | 11 ++ 3 files changed, 124 insertions(+), 71 deletions(-) diff --git a/docs/open_as_void.rst b/docs/open_as_void.rst index e695fc1f4..2d4279d93 100644 --- a/docs/open_as_void.rst +++ b/docs/open_as_void.rst @@ -3,50 +3,88 @@ Raw Byte Access (``open_as_void``) ================================== -The ``open_as_void`` option provides raw byte-level access to zarr arrays with -structured data types, bypassing the normal field interpretation. This feature -is available for both the :ref:`driver/zarr2` and :ref:`driver/zarr3` drivers. +The ``open_as_void`` option provides raw byte-level access to zarr arrays, +bypassing the normal data type interpretation and exposing the underlying decoded +bytes. This feature is available for both the :ref:`driver/zarr2` and +:ref:`driver/zarr3` drivers. Supported Data Types -------------------- -The ``open_as_void`` option is only valid for structured data types: +The scope of supported data types depends on the zarr version: -- **Zarr v2**: ``structured`` dtype (NumPy-style structured arrays) -- **Zarr v3**: ``struct`` and ``structured`` dtypes - -Attempting to use ``open_as_void`` with non-structured data types will result -in an error. +- **Zarr v2**: The ``open_as_void`` option is only valid for ``structured`` + dtype (NumPy-style structured arrays). Attempting to use it with non-structured + data types will result in an error. +- **Zarr v3**: The ``open_as_void`` option works on **any** data type, + including structured types (``struct`` and legacy read-only ``structured``), + and non-structured types. 
Purpose ------- When opening an array with :json:`"open_as_void": true`, TensorStore exposes the underlying byte representation of the array data rather than interpreting -it according to the stored field structure. +it according to the stored data type or field structure. Behavior -------- -When ``open_as_void`` is enabled: +Zarr v3 Behavior +~~~~~~~~~~~~~~~~ -1. **Data type becomes byte**: The resulting TensorStore has dtype - :json:schema:`~dtype.byte` regardless of the original structured data type. +For zarr v3, ``open_as_void`` operates entirely during the resolution of the +codec pipeline. The implementation resolves the pipeline with a substituted +"raw" data type and then validates the result, rather than checking each codec +individually against an allowlist. +1. **Data type becomes byte**: The array's data type is replaced with the + ``byte`` data type (a 1-byte, endian-invariant type) during codec pipeline + resolution, regardless of the original data type. 2. **Additional dimension added**: A new innermost dimension is appended to - represent the byte layout of each element. The size of this dimension - equals the number of bytes per element in the original structured type. + represent the byte layout of each element. The size of this dimension equals + the number of bytes per element in the original data type. +3. **Codec pipeline resolved with raw type**: The codec pipeline is resolved + using the substituted ``byte`` data type. After resolution, the + implementation verifies that: + + a. The innermost array-to-bytes encoding (after unwinding any + ``sharding_indexed`` layers) is the ``bytes`` codec. + b. The ``byte`` data type is preserved through all array-to-array codecs + in the pipeline (i.e., no codec has changed the data type). 
+ + This approach means that array-to-array codecs that preserve the raw data + type (such as ``transpose``) are naturally supported, while codecs that + transform element data (such as ``scale_offset`` or ``cast_value``) will + fail validation because they alter the data type. +4. **Endianness is preserved natively**: The ``bytes`` codec, which normally + decodes to the stored data type and handles endian conversion, sees the + ``byte`` data type as endian-invariant and performs no byte swapping. It + simply passes the decoded bytes through. +5. **Downstream transparency**: Because this is resolved at the codec pipeline + level, downstream components (such as the chunk cache and grid specification) + see the resulting ``byte`` array and extended shape without needing any + special awareness of the ``open_as_void`` option. + +Zarr v2 Behavior +~~~~~~~~~~~~~~~~ -3. **Codecs are preserved**: All encoding/decoding (including compression) - is still applied. The raw bytes exposed are the *decoded* element bytes, - not the raw compressed chunk data. +For zarr v2, ``open_as_void`` is implemented via void field synthesis at the +interface level: + +1. **Data type becomes byte**: The resulting TensorStore has dtype + :json:schema:`~dtype.byte`. +2. **Additional dimension added**: A new innermost dimension is appended, with + size equal to the number of bytes per element in the original structured type. +3. **Codecs are preserved**: All encoding/decoding (including compression) is + still applied based on the original structured data type. The raw bytes + exposed are the *decoded* element bytes, not the raw compressed chunk data. 
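The byte-level view described in the behavior lists above can be sketched without TensorStore itself. The following is a minimal illustration using only Python's standard ``struct`` module. The two-field layout (a 1-byte ``x`` followed by a 2-byte little-endian signed ``y``), the ``ELEMENT`` format, and the ``void_view`` helper are assumptions made for illustration and are not part of the TensorStore API.

```python
import struct

# Hypothetical element layout: 1 unsigned byte (x) followed by a 2-byte
# little-endian signed integer (y). "<" selects little-endian with no padding.
ELEMENT = struct.Struct("<Bh")


def void_view(elements):
    """Turn a [D0, D1] grid of (x, y) tuples into a [D0, D1, B] grid of bytes.

    B is the packed size of one element (ELEMENT.size), so the result has
    one extra innermost axis holding the raw bytes of each element.
    """
    return [[list(ELEMENT.pack(x, y)) for (x, y) in row] for row in elements]


grid = [[(1, 258), (2, -1)], [(3, 0), (4, 513)]]
raw = void_view(grid)

# Shape of the byte view: original shape plus one trailing axis of size B.
shape = (len(raw), len(raw[0]), len(raw[0][0]))
```

Each ``(x, y)`` element of the ``[2, 2]`` grid becomes three bytes along a new innermost axis, which mirrors the ``[D0, D1, ..., Dn, B]`` shape that ``open_as_void`` produces. Index ``0`` on that axis holds ``x`` and indices ``1:3`` hold ``y`` in little-endian order.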
Dimension Transformation -~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------ -For an array with shape ``[D0, D1, ..., Dn]`` and a structured data type of -size ``B`` bytes per element, opening with ``open_as_void`` produces a -TensorStore with: +For an array with shape ``[D0, D1, ..., Dn]`` and a data type of size ``B`` +bytes per element, opening with ``open_as_void`` produces a TensorStore with: - Shape: ``[D0, D1, ..., Dn, B]`` - Rank: original rank + 1 @@ -65,6 +103,13 @@ TensorStore with: - Byte ``[i, j, 0]`` contains field ``x`` (1 byte) - Bytes ``[i, j, 1:3]`` contain field ``y`` (2 bytes, little-endian) +.. admonition:: Example: Zarr v3 float32 array + :class: example + + A zarr v3 array with ``float32`` dtype (4 bytes per element) and shape + ``[100, 200]`` becomes a ``byte`` array with shape ``[100, 200, 4]`` when + opened with ``open_as_void``. + .. admonition:: Example: Zarr v3 struct dtype :class: example @@ -123,6 +168,27 @@ Python Example Constraints and Limitations --------------------------- +Zarr v3 Codec Restrictions +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When ``open_as_void`` is enabled for zarr v3, the codec pipeline is resolved +with a substituted ``byte`` data type and validated (see +:ref:`Zarr v3 Behavior ` above). This means the effective codec +restrictions are a consequence of the validation rules: + +- **``array -> bytes`` codecs**: The innermost array-to-bytes codec (after + unwinding ``sharding_indexed`` layers) must be the ``bytes`` codec. Only the + ``bytes`` and ``sharding_indexed`` (possibly nested) codecs are supported. + Any other array-to-bytes codec will result in a validation error. +- **``array -> array`` codecs**: Any codec that preserves the ``byte`` data + type is permitted. In practice, this means codecs that shuffle elements + without transforming them (e.g., ``transpose``, and the proposed ``reshape``) + are supported. 
Codecs that transform element data, such as ``scale_offset``, + ``cast_value``, and ``bitround``, alter the data type and will fail + validation. +- **``bytes -> bytes`` codecs**: All bytes-to-bytes codecs (e.g., ``gzip``, + ``blosc``, ``zstd``, ``crc32c``) are allowed and operate unchanged. + Mutual Exclusivity with Field Selection ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -156,60 +222,19 @@ The ``open_as_void`` flag is preserved when converting an opened TensorStore back to a spec. This ensures that specs obtained from void-mode stores correctly reflect their access mode. -Internal Implementation ------------------------ - -The ``open_as_void`` feature is implemented through a synthesized "void field" -mechanism: - -1. **Void Field Synthesis**: When ``open_as_void`` is requested, the driver - creates a synthetic field descriptor that represents the entire structured - element as raw bytes. This field has: - - - Data type: ``byte`` (single unsigned byte) - - Field shape: ``[bytes_per_element]`` - - No field offset (covers all fields) - -2. **Grid Specification**: The chunk grid is modified to include the - additional bytes dimension in the component shape while preserving - the original chunked dimensions. - -3. **Encoding/Decoding**: The codec chain still operates on the original - structured data representation. The void field transformation happens at - the interface level, presenting decoded chunk data as raw bytes. - -4. **Cache Separation**: Void-mode and normal-mode access to the same - underlying array use separate cache entries to prevent data corruption - from mixing typed and byte-level views. - -Compatibility -------------- - -Compression and Codecs -~~~~~~~~~~~~~~~~~~~~~~ - -``open_as_void`` is fully compatible with all compression codecs (blosc, gzip, -zstd, etc.) and other codecs (sharding, transpose, etc.). The raw bytes -accessed are the *decoded* structured element bytes after all codec processing. 
- -Existing Arrays -~~~~~~~~~~~~~~~ - -``open_as_void`` can be used to open any existing zarr v2 or zarr v3 array -that has a structured data type. No special array creation flags are needed. - Interoperability -~~~~~~~~~~~~~~~~ +---------------- Data accessed through ``open_as_void`` reflects the exact byte representation as stored, including: -- Endianness (as specified by the field dtypes) -- Field alignment and padding -- Field ordering +- Endianness (as specified by the field dtypes for zarr v2, or natively + preserved by the ``bytes`` codec for zarr v3) +- Field alignment and padding (for structured types) +- Field ordering (for structured types) This makes it suitable for verifying compatibility with other zarr -implementations or diagnosing encoding differences in structured data. +implementations or diagnosing encoding differences. See Also -------- diff --git a/tensorstore/driver/zarr3/index.rst b/tensorstore/driver/zarr3/index.rst index d04b0092f..de9f27a18 100644 --- a/tensorstore/driver/zarr3/index.rst +++ b/tensorstore/driver/zarr3/index.rst @@ -19,8 +19,25 @@ Raw Byte Access --------------- The zarr3 driver supports :ref:`raw byte access <open-as-void>` via the -:json:schema:`~driver/zarr3.open_as_void` option, which exposes array data as -raw bytes instead of interpreting it according to the data type. +:json:schema:`~driver/zarr3.open_as_void` option. This option works on any +data type, including structured types (``struct`` and legacy read-only +``structured``), and non-structured types, and exposes the array data +as raw bytes instead of interpreting it according to the data type. + +When enabled, the codec pipeline is resolved by replacing the array's data +type with a raw ``byte`` type and adding an innermost dimension corresponding +to the original element size in bytes. + +Only certain codecs are supported in this mode: + +- **``array -> bytes``**: Only the ``bytes`` and ``sharding_indexed`` + (possibly nested) codecs are allowed. 
+- **``array -> array``**: Only codecs that shuffle elements without + transforming them (e.g., ``transpose``) are allowed. Codecs that + transform element data are disallowed. +- **``bytes -> bytes``**: All codecs are allowed. + +See :ref:`open-as-void` for full details. Codecs diff --git a/tensorstore/driver/zarr3/schema.yml b/tensorstore/driver/zarr3/schema.yml index 345bfea15..67508f121 100644 --- a/tensorstore/driver/zarr3/schema.yml +++ b/tensorstore/driver/zarr3/schema.yml @@ -6,6 +6,17 @@ allOf: properties: driver: const: "zarr3" + open_as_void: + type: boolean + title: Raw byte access mode. + description: | + When true, the codec pipeline is resolved with a ``byte`` data type + and an additional innermost dimension of size equal to the number of + bytes per element of the original data type. This exposes the + decoded array data as raw bytes. Only certain codecs are + supported when this option is enabled; see + :ref:`open-as-void` for details. + default: false metadata: title: Zarr v3 array metadata. description: | From 7c3e35b74dfb07e4dd4bfbbfd0b4e8563eb9b36f Mon Sep 17 00:00:00 2001 From: BrianMichell Date: Fri, 24 Apr 2026 16:31:35 +0000 Subject: [PATCH 3/4] Update field dimension names and codec behavior definitions --- docs/open_as_void.rst | 53 ++++++++++++++++++++++++++++++------------- 1 file changed, 37 insertions(+), 16 deletions(-) diff --git a/docs/open_as_void.rst b/docs/open_as_void.rst index 2d4279d93..282caea59 100644 --- a/docs/open_as_void.rst +++ b/docs/open_as_void.rst @@ -41,9 +41,15 @@ individually against an allowlist. 1. **Data type becomes byte**: The array's data type is replaced with the ``byte`` data type (a 1-byte, endian-invariant type) during codec pipeline resolution, regardless of the original data type. -2. **Additional dimension added**: A new innermost dimension is appended to - represent the byte layout of each element. The size of this dimension equals - the number of bytes per element in the original data type. +2. 
**Uninterpreted byte dimension added**: A new innermost **uninterpreted + byte dimension** (hereafter "byte dimension") is appended to represent + the byte layout of each element. The size of this dimension equals the + number of bytes per element in the original data type. The byte dimension + is pinned as the innermost axis of the resulting TensorStore: array-to-array + codecs (e.g., ``transpose``, and the proposed ``reshape``) operate only on + the original chunked dimensions and never permute, reshape, or otherwise + touch the byte dimension. This keeps the per-element byte layout stable + regardless of any codec-level shuffling of the outer dimensions. 3. **Codec pipeline resolved with raw type**: The codec pipeline is resolved using the substituted ``byte`` data type. After resolution, the implementation verifies that: @@ -74,8 +80,10 @@ interface level: 1. **Data type becomes byte**: The resulting TensorStore has dtype :json:schema:`~dtype.byte`. -2. **Additional dimension added**: A new innermost dimension is appended, with - size equal to the number of bytes per element in the original structured type. +2. **Uninterpreted byte dimension added**: A new innermost **uninterpreted + byte dimension** is appended, with size equal to the number of bytes per + element in the original structured type. As in the zarr v3 case, this + dimension is always the innermost axis of the resulting TensorStore. 3. **Codecs are preserved**: All encoding/decoding (including compression) is still applied based on the original structured data type. The raw bytes exposed are the *decoded* element bytes, not the raw compressed chunk data. @@ -86,19 +94,27 @@ Dimension Transformation For an array with shape ``[D0, D1, ..., Dn]`` and a data type of size ``B`` bytes per element, opening with ``open_as_void`` produces a TensorStore with: -- Shape: ``[D0, D1, ..., Dn, B]`` +- Shape: ``[D0, D1, ..., Dn, B]``, where the trailing axis of size ``B`` is + the **uninterpreted byte dimension**. 
- Rank: original rank + 1 - Data type: ``byte`` +The uninterpreted byte dimension is always the innermost axis of the +resulting TensorStore and represents the raw byte layout of a single element +of the original data type. It is invariant under array-to-array codecs: any +such codecs in the pipeline act solely on the original chunked dimensions +``[D0, D1, ..., Dn]``. + .. admonition:: Example: Zarr v2 structured dtype :class: example A zarr v2 array with structured dtype ``[("x", "|u1"), ("y", " array`` codecs**: Any codec that preserves the ``byte`` data - type is permitted. In practice, this means codecs that shuffle elements - without transforming them (e.g., ``transpose``, and the proposed ``reshape``) - are supported. Codecs that transform element data, such as ``scale_offset``, - ``cast_value``, and ``bitround``, alter the data type and will fail - validation. + type is permitted. Such codecs operate on the original chunked dimensions + only and leave the uninterpreted byte dimension untouched. In practice, + this means codecs that rearrange elements without transforming them (e.g., + ``transpose``, and the proposed ``reshape``) are supported. Codecs that + transform element data, such as ``scale_offset``, ``cast_value``, and + ``bitround``, alter the data type and will fail validation. - **``bytes -> bytes`` codecs**: All bytes-to-bytes codecs (e.g., ``gzip``, ``blosc``, ``zstd``, ``crc32c``) are allowed and operate unchanged. From d1c83bb745df31e119a1f24e1b1e05a789f3b76b Mon Sep 17 00:00:00 2001 From: BrianMichell Date: Thu, 30 Apr 2026 20:59:59 +0000 Subject: [PATCH 4/4] Move `open_as_void` into zarr3 --- {docs => tensorstore/driver/zarr3}/open_as_void.rst | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename {docs => tensorstore/driver/zarr3}/open_as_void.rst (100%) diff --git a/docs/open_as_void.rst b/tensorstore/driver/zarr3/open_as_void.rst similarity index 100% rename from docs/open_as_void.rst rename to tensorstore/driver/zarr3/open_as_void.rst
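The zarr v3 codec restrictions documented in this patch series follow a "resolve with ``byte``, then validate" pattern, which can be restated as a small toy check. This is only a sketch under stated assumptions: the codec names come from the documentation text, while the ``A2A_OUTPUT_DTYPE`` table, the tuple encoding of nested ``sharding_indexed`` pipelines, and the ``check_void_pipeline`` function are hypothetical and do not reflect TensorStore's actual implementation.

```python
# Toy model only. Codec names are taken from the documentation text; the
# table, the pipeline encoding, and the function are hypothetical.

def _same(dtype):
    # Codecs that only rearrange elements keep the data type unchanged.
    return dtype


def _changed(dtype):
    # Placeholder result for codecs that transform element values.
    return "not-byte"


A2A_OUTPUT_DTYPE = {
    "transpose": _same,
    "reshape": _same,        # described in the text as a proposed codec
    "scale_offset": _changed,
    "cast_value": _changed,
    "bitround": _changed,
}


def check_void_pipeline(array_to_array, array_to_bytes, bytes_to_bytes):
    """Return None if the pipeline passes the stated rules, else a reason.

    array_to_bytes is either a codec name, or a tuple
    ("sharding_indexed", (a2a, a2b, b2b)) describing a nested pipeline.
    bytes_to_bytes is accepted as-is: those codecs are never rejected.
    """
    dtype = "byte"  # the substituted raw data type
    for name in array_to_array:
        dtype = A2A_OUTPUT_DTYPE[name](dtype)
    if dtype != "byte":
        return "an array -> array codec changed the byte data type"
    if isinstance(array_to_bytes, tuple):
        kind, inner = array_to_bytes
        if kind != "sharding_indexed":
            return f"unsupported array -> bytes codec {kind!r}"
        return check_void_pipeline(*inner)  # unwind one sharding layer
    if array_to_bytes != "bytes":
        return f"innermost array -> bytes codec is {array_to_bytes!r}, not 'bytes'"
    return None


plain = check_void_pipeline(["transpose"], "bytes", ["gzip"])
sharded = check_void_pipeline(
    [], ("sharding_indexed", ([], "bytes", ["zstd"])), ["crc32c"])
rejected = check_void_pipeline(["scale_offset"], "bytes", [])
```

In this toy model ``plain`` and ``sharded`` pass, while ``rejected`` yields a reason string because ``scale_offset`` is modeled as changing the data type. The check validates the resolved result rather than consulting an allowlist of codec names, which is the distinction the documentation draws.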