Add 'From Zero to Zarr' beginner guide to the Zarr data model#4077

Open
chuckwondo wants to merge 2 commits into zarr-developers:main from chuckwondo:docs/from-zero-to-zarr

@chuckwondo

Contributor

Adds a new user-guide page (docs/user-guide/data_model.md, nav label "Understanding Zarr") that explains the Zarr data model for newcomers: why Zarr exists (its parallel-computing origin in genomics), then arrays, chunking and the chunk grid, stores as key->bytes maps, metadata (zarr.json), the specification, codecs, sharding, groups, and N-D arrays, ending with a runnable round-trip example and a cross-language note. Prose + diagrams throughout, with executable, build-verified code in the final section, and every spec detail linked to its section of the Zarr v3 spec.

Enables Mermaid diagrams via a pymdownx.superfences custom fence, and adds the page to the User Guide nav.

Closes #4056

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@chuckwondo chuckwondo requested review from d-v-b and maxrjones June 17, 2026 23:14
Comment thread docs/user-guide/data_model.md Outdated

Chunking is the key move. Each chunk can be stored, loaded, and compressed on its
own, so a program can read just the chunks it needs — that one corner your
colleague wanted — without touching the rest. (Starting with a chunk shape that

Contributor

could we use an inline admonition for the partial-chunk callout? something like

Note

If each chunk has a fixed size, how can we use chunks to represent an array that isn't evenly divided by the chunk size? See #section for the answer to that question!

not sure if note is the right admonition here
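For readers following the quoted passage, here is a minimal sketch of what "read just the chunks it needs" means in practice. It assumes a 4×6 array chunked at `(2, 3)`, and `chunks_touched` is a hypothetical helper written for illustration, not part of zarr-python.

```python
from itertools import product


def chunks_touched(selection, chunk_shape):
    """Return the chunk-grid indices overlapped by a rectangular selection.

    `selection` holds one (start, stop) pair per axis, with stop exclusive.
    """
    per_axis = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return list(product(*per_axis))


# A 4x6 array chunked at (2, 3) has a 2x2 chunk grid.
# Reading the bottom-right corner (rows 2..3, columns 3..5) needs one chunk.
corner = chunks_touched([(2, 4), (3, 6)], chunk_shape=(2, 3))
print(corner)  # [(1, 1)]

# A tall, thin selection (rows 0..2, column 0) crosses two chunks.
strip = chunks_touched([(0, 3), (0, 1)], chunk_shape=(2, 3))
print(strip)  # [(0, 0), (1, 0)]
```

Only the listed chunks would have to be fetched and decoded; the rest of the array is never read.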

@d-v-b

d-v-b commented Jun 24, 2026

Contributor

@mkitti if you have time it would be good to get your thoughts on this

Comment thread docs/user-guide/data_model.md Outdated
G11 --> K11
```

Where does a key like `c/0/1` come from? It's built by a simple, fixed rule (the

Contributor

"fixed rule" locally implies that arrays have 1 chunk key encoding. maybe we can rephrase to make it clear that there's a rule defined by a particular field in array metadata.

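To illustrate the point about the rule living in array metadata, the sketch below builds a chunk key from a `chunk_key_encoding` entry. It reflects my reading of the Zarr v3 core spec (a `default` encoding with separator `/` and a `v2` encoding with separator `.`); the function `chunk_key` is hypothetical and is not zarr-python's implementation.

```python
def chunk_key(chunk_index, chunk_key_encoding):
    """Build a store key for a chunk from the array's chunk_key_encoding metadata."""
    name = chunk_key_encoding["name"]
    config = chunk_key_encoding.get("configuration", {})
    parts = [str(i) for i in chunk_index]
    if name == "default":
        # Keys start with "c", followed by the chunk indices.
        separator = config.get("separator", "/")
        return separator.join(["c", *parts])
    if name == "v2":
        # Keys are the chunk indices alone; a 0-dimensional array uses "0".
        separator = config.get("separator", ".")
        return separator.join(parts) if parts else "0"
    raise ValueError(f"unsupported chunk key encoding: {name!r}")


default_key = chunk_key((0, 1), {"name": "default", "configuration": {"separator": "/"}})
v2_key = chunk_key((0, 1), {"name": "v2", "configuration": {"separator": "."}})
print(default_key)  # c/0/1
print(v2_key)       # 0.1
```

The same chunk index yields different keys depending on the metadata, which is why "a rule chosen by a metadata field" may read more accurately than "a fixed rule".
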
@codecov

codecov Bot commented Jun 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.50%. Comparing base (e29ddd2) to head (83e0444).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4077   +/-   ##
=======================================
  Coverage   93.50%   93.50%           
=======================================
  Files          90       90           
  Lines       11981    11981           
=======================================
  Hits        11203    11203           
  Misses        778      778           

@maxrjones maxrjones left a comment

Member

This is awesome @chuckwondo! I just have some small nits

Comment on lines +7 to +9
the *how* one idea at a time, until you understand **how Zarr stores an array**,
**why** that layout is defined by a written specification, and **how a library
turns those stored bytes back into an array you can use**.

Member

Suggested change
the *how* one idea at a time, until you understand **how Zarr stores an array**,
**why** that layout is defined by a written specification, and **how a library
turns those stored bytes back into an array you can use**.
the *how* one idea at a time, until you understand **how Zarr stores an array**,
**why that layout is defined by a written specification**, and **how a library
turns those stored bytes back into an array you can use**.

nit about consistent use of bold text


---

## Why we need Zarr

Member

If possible, it would be nice to have a tl;dr (maybe a note admonition) at the top of this section

Comment thread docs/user-guide/data_model.md Outdated
extraordinary firehoses of numbers. A satellite streams images of the Earth; a
microscope captures gigapixel scans; a gene sequencer reads thousands of genomes;
a climate model writes out temperature and wind for every point on the globe, hour
after hour. In each case the result has the same shape: a vast grid of numbers —

Member

many people have a negative reaction to em-dashes since their proliferation by AI. It would likely be worth reducing their use in this guide via more, shorter sentences.

why, it helps to understand two things the array formats of the day were already
doing.

First, **chunking**. To store an array bigger than memory, formats like HDF5 and

Member

Contributor Author

Adding glossary hover-tooltips would need a config change here (enabling the abbr extension plus Material's content.tooltips feature and a glossary include), which is broader than this doc. I'd rather punt it to a separate PR than bundle the tooling change into this one. The page links to the Glossary in the meantime.

Comment thread docs/user-guide/data_model.md Outdated
(the [*Anopheles gambiae* 1000 Genomes Project](https://www.malariagen.net/)) —
arrays far too big to fit in memory. His real frustration was *speed*, and to see
why, it helps to understand two things the array formats of the day were already
doing.

Member

Suggested change
doing.
doing: chunking and compression.

If I'm reading this right, it's not totally obvious what "Second" is

Comment thread docs/user-guide/data_model.md Outdated

So a 5×6 array chunked at `(2, 3)` quietly stores a row of "phantom" cells holding
the fill value. It's harmless, but it's a small waste — and a good reason to pick a
chunk shape that fits your array's real shape reasonably well. (For practical

Member

Suggested change
chunk shape that fits your array's real shape reasonably well. (For practical
chunk shape that fits your array's real shape reasonably well and lean on the
[rectilinear chunk grid extension](https://github.com/zarr-developers/zarr-extensions/tree/main/chunk-grids/rectilinear) when needed. (For practical
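
The arithmetic behind the 5×6 example can be checked with a short sketch. It assumes a regular chunk grid in which every chunk has the full chunk shape; the helper names are hypothetical.

```python
from math import prod


def chunk_grid_shape(array_shape, chunk_shape):
    """Number of chunks along each axis, rounding up to cover the whole array."""
    return tuple(-(-n // c) for n, c in zip(array_shape, chunk_shape))


def padded_cells(array_shape, chunk_shape):
    """Cells covered by the chunk grid that lie outside the array."""
    grid = chunk_grid_shape(array_shape, chunk_shape)
    covered = prod(g * c for g, c in zip(grid, chunk_shape))
    return covered - prod(array_shape)


grid = chunk_grid_shape((5, 6), (2, 3))
phantom = padded_cells((5, 6), (2, 3))
print(grid)     # (3, 2)
print(phantom)  # 6
```

The grid covers 6×6 cells for a 5×6 array, so the six extra cells form exactly one row of fill values. A chunk shape that divides the array shape evenly, such as `(5, 3)`, gives zero padding.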

Comment thread docs/user-guide/data_model.md Outdated
[specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#codecs)
defines three kinds of codec, applied in this order:

1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a

Member

Suggested change
1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a
1. **array → array** codecs (optional, any number) — rearrange or change the values; e.g. a

I believe this change is more accurate, but would appreciate if @d-v-b confirms

Contributor Author

Updated to "rearrange or transform the values" (using "transform" rather than "change"). @d-v-b, does that read accurately to you for the array→array category?
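
As a point of reference for this wording question, here is a minimal sketch of the three codec kinds applied in order: array → array, then array → bytes, then bytes → bytes. The functions are plain-Python stand-ins that I chose for illustration (a transpose, a little-endian int32 serialisation, and zlib compression); they are not zarr-python's codec classes.

```python
import struct
import zlib


def transpose(rows):
    """array -> array: rearrange the values without changing them."""
    return [list(column) for column in zip(*rows)]


def to_bytes(rows):
    """array -> bytes: serialise row-major as little-endian int32."""
    flat = [value for row in rows for value in row]
    return struct.pack(f"<{len(flat)}i", *flat)


def from_bytes(raw, shape):
    """Inverse of to_bytes for a 2-D shape."""
    n_rows, n_cols = shape
    flat = struct.unpack(f"<{n_rows * n_cols}i", raw)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


def encode(chunk):
    """Apply the pipeline in order; the last step is bytes -> bytes."""
    return zlib.compress(to_bytes(transpose(chunk)))


def decode(stored, chunk_shape):
    """Undo the pipeline in reverse order."""
    n_rows, n_cols = chunk_shape
    transposed = from_bytes(zlib.decompress(stored), (n_cols, n_rows))
    return transpose(transposed)


chunk = [[1, 2, 3], [4, 5, 6]]
stored = encode(chunk)
restored = decode(stored, (2, 3))
print(restored == chunk)  # True
```

The transpose here only rearranges values. An array → array step that rescales or casts the values would be a case of "transform", which is the distinction the suggested wording is trying to capture.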

Comment thread docs/user-guide/data_model.md Outdated
simple, but it has a limit: small chunks in a very large array produce a *huge*
number of chunks, and therefore a huge number of files or objects. The spec notes
this is exactly where file systems (block sizes, inode limits) and object stores
(which dislike millions of tiny objects) start to struggle.

Member

I think the more prevalent limitation on object stores is the cost model, where the cost of operations often scales with the number of objects
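
Whichever limitation is emphasised, the object count is easy to estimate. The sketch below uses made-up shapes and assumes that, with sharding, each shard is stored as a single object holding many chunks.

```python
from math import prod


def grid_size(array_shape, part_shape):
    """Total number of parts needed to tile the array, rounding up per axis."""
    return prod(-(-n // p) for n, p in zip(array_shape, part_shape))


# Hypothetical shapes, chosen only to make the arithmetic easy to follow.
array_shape = (100_000, 100_000)
chunk_shape = (100, 100)
shard_shape = (10_000, 10_000)

objects_without_sharding = grid_size(array_shape, chunk_shape)
objects_with_sharding = grid_size(array_shape, shard_shape)
chunks_per_shard = grid_size(shard_shape, chunk_shape)

print(objects_without_sharding)  # 1000000
print(objects_with_sharding)     # 100
print(chunks_per_shard)          # 10000
```

Under a per-operation cost model, writing the unsharded layout means on the order of a million write requests, against a hundred for the sharded one, while the chunk size used for reads stays the same.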

Comment thread docs/user-guide/data_model.md Outdated
or more axes.

To see the generalisation concretely, picture a 3-D array as a **stack of 2-D
arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array —

Member

Suggested change
arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array —
arrays**. Here are two versions of our 4×6 grid stacked into a `(2, 4, 6)` array —
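
A small sketch of the stacking idea, using nested lists rather than any array library. The cell values are invented for illustration, and the second grid differs from the first to match the "two versions" wording.

```python
def shape_of(nested):
    """Shape of a regular nested list, outermost axis first."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


# Two 4x6 grids: the second is the first with 100 added to every cell.
grid_a = [[row * 6 + col for col in range(6)] for row in range(4)]
grid_b = [[value + 100 for value in row] for row in grid_a]

stack = [grid_a, grid_b]
stack_shape = shape_of(stack)
cell = stack[1][2][3]

print(stack_shape)  # (2, 4, 6)
print(cell)         # 115
```

The new leading axis selects which 2-D grid you are in, and the remaining two indices work exactly as they did for a single grid.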

Comment thread docs/user-guide/data_model.md Outdated
- write it to the corresponding slice of the array,
- discard it, and move on to the next block.

Because only one block is ever in memory, the array on disk can be far larger than

Member

Suggested change
Because only one block is ever in memory, the array on disk can be far larger than
Because the minimum amount of data ever needed in memory to be useful is a single block, the array on disk can be far larger than

Contributor Author

Reworded to: "Because you only need to hold one block in memory at a time, the array on disk can be far larger than your RAM." That keeps your accuracy point (you never have to hold the whole array, only a block at a time) while staying a little more concise than "the minimum amount of data ever needed in memory to be useful is a single block."
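
The block-at-a-time loop under discussion can be sketched without any Zarr dependency. This version writes to a flat binary file standing in for a Zarr array, and the data source is invented; it only shows that the file grows to the full array size while at most one block is held in memory.

```python
import os
import struct
import tempfile

ITEM_SIZE = 4  # bytes per little-endian int32 value


def write_in_blocks(path, total_rows, n_cols, block_rows):
    """Fill a row-major file one block of rows at a time; return peak cells held."""
    peak_cells = 0
    with open(path, "wb") as f:
        for first_row in range(0, total_rows, block_rows):
            n_rows = min(block_rows, total_rows - first_row)
            # Produce one block (hypothetical data: each cell holds its flat index).
            block = [
                r * n_cols + c
                for r in range(first_row, first_row + n_rows)
                for c in range(n_cols)
            ]
            peak_cells = max(peak_cells, len(block))
            # Write it to the corresponding slice of the array.
            f.seek(first_row * n_cols * ITEM_SIZE)
            f.write(struct.pack(f"<{len(block)}i", *block))
            # Discard it and move on to the next block.
            del block
    return peak_cells


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "array.bin")
    peak = write_in_blocks(path, total_rows=10, n_cols=6, block_rows=4)
    size_on_disk = os.path.getsize(path)

print(peak, size_on_disk)  # 24 240
```

The file ends up holding all 60 cells (240 bytes), but the loop never holds more than 24 cells at once, which is the property both wordings describe.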

@mkitti

mkitti commented Jun 29, 2026

Contributor

I'm watching, but I have not had time to review the entire content.

@chuckwondo chuckwondo requested review from d-v-b and maxrjones June 30, 2026 00:02
chuckwondo and others added 2 commits June 29, 2026 20:10
Adds a new user-guide page (docs/user-guide/data_model.md, nav label
"Understanding Zarr") that explains the Zarr data model for newcomers:
why Zarr exists (its parallel-computing origin in genomics), then arrays,
chunking and the chunk grid, stores as key->bytes maps, metadata
(zarr.json), the specification, codecs, sharding, groups, and N-D arrays,
ending with a runnable round-trip example and a cross-language note. Prose
+ diagrams throughout, with executable, build-verified code in the final
section, and every spec detail linked to its section of the Zarr v3 spec.

Enables Mermaid diagrams via a pymdownx.superfences custom fence, and adds
the page to the User Guide nav.

Closes zarr-developers#4056

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chuckwondo chuckwondo force-pushed the docs/from-zero-to-zarr branch from 2b69319 to 83e0444 Compare June 30, 2026 00:10
@chuckwondo chuckwondo marked this pull request as ready for review June 30, 2026 00:13
@chuckwondo

Contributor Author

There were a lot of comments, but I believe I have addressed them all one way or another, with only a few follow-up questions/suggestions. Please see the updated guide here: https://zarr--4077.org.readthedocs.build/en/4077/user-guide/data_model/


Development

Successfully merging this pull request may close these issues.

[docs] very basic zarr tutorial -- "zarr for absolute beginners"

4 participants