implemented multi-threading for h5ad file reads, substantial speed up (rivals zarr + zarrs) by zboldyga · Pull Request #2471 · scverse/anndata

zboldyga · 2026-05-28T21:07:16Z

Closes #
[ x] Tests added
Release note not necessary because:

Hey there!

Here's a patch that brings AnnData HDF5 reads to speeds that rival zarr + zarrs by multithreading. It works around the single-threaded design of HDF5 / h5py. The approach is straightforward: see the comment at the top of h5_parallel.py for how it works.

By nature of the approach, this PR also adds support for HDF5 reads of zstd- and blosc-compressed data without requiring an external plugin. I chose these two because they're sensible for single-cell data. The existing numcodecs dependency is used, so adding a new codec for reading HDF5 is now a few lines of code (if numcodecs supports it), and every such codec is multithreaded by default.
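To illustrate the "few lines per codec" idea, here is a minimal sketch of a decoder registry keyed by HDF5 filter ID. It is not the code in h5_parallel.py: `DECODERS`, `register_decoder` and `decode_chunk` are hypothetical names, and only the standard-library zlib decoder is wired in so the sketch runs without extra packages.

```python
import zlib
from typing import Callable, Dict

# HDF5 filter IDs: 1 is the built-in deflate (gzip) filter; 32001 (Blosc)
# and 32015 (Zstandard) are the IDs registered with The HDF Group.
H5_FILTER_DEFLATE = 1
H5_FILTER_BLOSC = 32001
H5_FILTER_ZSTD = 32015

# Hypothetical registry (an assumption, not the PR's data structure):
# filter ID -> function that turns one compressed chunk into raw bytes.
DECODERS: Dict[int, Callable[[bytes], bytes]] = {
    H5_FILTER_DEFLATE: zlib.decompress,
}


def register_decoder(filter_id: int, decode: Callable[[bytes], bytes]) -> None:
    """Make a new codec available to the chunk reader."""
    DECODERS[filter_id] = decode


def decode_chunk(filter_id: int, payload: bytes) -> bytes:
    """Decode one chunk with whichever decoder is registered for its filter."""
    try:
        decode = DECODERS[filter_id]
    except KeyError:
        raise NotImplementedError(
            f"no decoder registered for HDF5 filter {filter_id}"
        ) from None
    return decode(payload)
```

With numcodecs installed, registering zstd could plausibly be a single call such as `register_decoder(H5_FILTER_ZSTD, numcodecs.Zstd().decode)`; whether the PR wires codecs up this way is an assumption.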

Benchmark

This measures reads of the whole X object using AnnData. Cold = read from disk. Warm = served from the OS page cache.
All times are in ms and are medians over multiple trials.
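For readers who want to reproduce this kind of measurement, a minimal timing harness might look like the sketch below. It is an assumption about the method, not the script used for these numbers: `median_ms` and its `before_each` hook are hypothetical names, and the hook is where a page-cache purge would go for cold runs.

```python
import statistics
import time
from typing import Callable, Optional


def median_ms(
    read: Callable[[], object],
    trials: int,
    before_each: Optional[Callable[[], None]] = None,
) -> float:
    """Run `read` `trials` times and return the median wall time in ms."""
    samples = []
    for _ in range(trials):
        if before_each is not None:
            before_each()  # e.g. purge the OS page cache for a cold read
        start = time.perf_counter()
        read()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

A warm measurement would call the reader once beforehand to populate the page cache and then use something like `median_ms(lambda: anndata.read_h5ad(path), trials=5)`, where `path` is a placeholder for the benchmark file.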

## macOS M4 Max (16 core), wessels23 perturb-seq

Before each test, the file is written with the corresponding codec at a 1 MB chunk size.
Cold tests include disk read time; the OS page cache is purged to achieve this.
Cold runs are repeated 3 times and warm runs 5 times; the median is reported.
The same compression settings are used across h5ad stock, h5ad patched, and zarr.

codec	cache	h5ad stock	h5ad patched	speedup	zarr+zarrs	patched / zarr
gzip	cold	1569.4	219.0	7.17×	251.9	0.87×
gzip	warm	1424.9	151.9	9.38×	189.4	0.80×
zstd	cold	807.6	154.6	5.22×	160.2	0.97×
zstd	warm	663.9	79.8	8.32×	101.3	0.79×
blosc-lz4	cold	606.2	326.9	1.85×	164.9	1.98×
blosc-lz4	warm	445.0	98.9	4.50×	82.9	1.19×
blosc-zstd	cold	936.4	184.1	5.09×	173.7	1.06×
blosc-zstd	warm	816.4	100.4	8.13×	112.8	0.89×
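The two derived columns follow directly from the timings: speedup is stock divided by patched, and the last column is patched divided by zarr+zarrs, so a value below 1.00× means the patched h5ad read was faster than zarr. A short check against the gzip cold row (the function name is only illustrative):

```python
def derived_columns(stock_ms: float, patched_ms: float, zarr_ms: float):
    """Return (speedup, patched / zarr), rounded as in the table."""
    speedup = stock_ms / patched_ms
    patched_over_zarr = patched_ms / zarr_ms
    return round(speedup, 2), round(patched_over_zarr, 2)


# gzip, cold row of the macOS table
print(derived_columns(1569.4, 219.0, 251.9))  # (7.17, 0.87)
```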

## Linux via Docker (same MacBook), wessels23 perturb-seq

Same as the macOS scenarios, except that purging the OS page cache involves both macOS and Linux purges to get true cold reads.

codec	cache	h5ad stock	h5ad patched	speedup	zarr+zarrs	patched / zarr
gzip	cold	1513.0	182.5	8.29×	145.5	1.25×
gzip	warm	1446.2	139.4	10.37×	125.4	1.11×
zstd	cold	556.3	93.1	5.98×	78.2	1.19×
zstd	warm	496.6	56.6	8.77×	69.4	0.82×
blosc-lz4	cold	512.1	112.8	4.54×	63.4	1.78×
blosc-lz4	warm	444.6	52.1	8.53×	48.3	1.08×
blosc-zstd	cold	750.2	125.0	6.00×	89.8	1.39×
blosc-zstd	warm	703.8	69.7	10.10×	82.2	0.85×

To keep this PR simple, I've initially implemented only whole-file reads; Dask and sliced reads are not yet supported. Adding support for these is straightforward -- I already have a local draft. I will wait until this PR is closed before proceeding.

I also started a draft of implementing multithreaded writing of h5ad. I think it will also rival zarr+zarrs speed but I'm not positive yet -- the main difference there is that we must use libhdf5's serialized code for managing the HDF5 index, which adds a little wall-time friction.
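The shape of that trade-off can be sketched with the standard library alone: compression fans out across threads, while the step that places chunks and records their locations stays serial. This is only an illustration of the idea, not the draft mentioned above; `write_chunks` is a hypothetical helper, and a plain file-like `sink` plus an offset list stand in for libhdf5 and its chunk index.

```python
import io
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import BinaryIO, List, Sequence, Tuple


def write_chunks(
    raw_chunks: Sequence[bytes], sink: BinaryIO, max_workers: int = 4
) -> List[Tuple[int, int]]:
    """Compress chunks in parallel, then append them to `sink` one by one.

    Returns (offset, size) pairs, a stand-in for a chunk index.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Parallel part: map() keeps the results in input order.
        compressed = list(pool.map(zlib.compress, raw_chunks))

    index = []
    for blob in compressed:  # serialized part: one writer updates the index
        offset = sink.tell()
        sink.write(blob)
        index.append((offset, len(blob)))
    return index
```

The serial loop is where the wall-time friction would show up: however many threads compress, chunks are committed one at a time.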

Overall, I'm happy to help as needed to bring these types of improvements to AnnData!


codecov · 2026-05-28T21:09:11Z

Codecov Report

❌ Patch coverage is 97.16981% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.91%. Comparing base (581b93c) to head (bef6ac4).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/anndata/_io/h5_parallel.py	96.34%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2471      +/-   ##
==========================================
+ Coverage   85.60%   85.91%   +0.30%     
==========================================
  Files          49       51       +2     
  Lines        7671     7881     +210     
==========================================
+ Hits         6567     6771     +204     
- Misses       1104     1110       +6

Files with missing lines	Coverage Δ
src/anndata/_core/sparse_dataset.py	`91.07% <100.00%> (+0.19%)`	⬆️
src/anndata/_io/_pool.py	`100.00% <100.00%> (ø)`
src/anndata/_io/specs/methods.py	`91.44% <100.00%> (+0.07%)`	⬆️
src/anndata/_settings.py	`100.00% <100.00%> (ø)`
src/anndata/_io/h5_parallel.py	`96.34% <96.34%> (ø)`

zboldyga · 2026-05-28T21:23:33Z

One thing that may be insightful in understanding this work is to review flame graphs of the before and after read_h5ad. You can drop either of the attached files into https://www.speedscope.app/ to explore.

For each, there are three read_h5ad calls against the wessels23 dataset. Within each of those, you'll see two to_memory calls, one for each of the major arrays of the sparse storage format (e.g. CSR's data and indices). At the bottom of the flames you'll see the inflate calls (the decompression side of deflate) -- these are actually called tens of thousands of times (once per chunk) in this example, but the sampling rate of the flame graph tool is too low to capture that detail, so they appear as longer calls. In any case, almost all of the time in these read operations is spent on decompression.

In the stock flamegraph, you'll see it's all happening in one thread. In the patched flamegraph, there's a thread dropdown at the top of the speedscope app -- you'll see there are various threads. So we're just parallelizing the chunk decompression.
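The core idea, decompressing independent chunks on several threads, can be shown with the standard library. This is a minimal sketch and not the implementation in h5_parallel.py: `decompress_chunks` is a hypothetical helper, and the compressed chunks are assumed to be already in memory as zlib (deflate) streams. Threads help here because CPython's zlib module releases the GIL while it decompresses.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Sequence


def decompress_chunks(
    compressed_chunks: Sequence[bytes], max_workers: Optional[int] = None
) -> bytes:
    """Decompress independent zlib chunks in a thread pool and join them."""
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so chunks come back in file order.
        parts = list(pool.map(zlib.decompress, compressed_chunks))
    return b"".join(parts)
```

In an HDF5 setting the compressed bytes would have to be fetched without running HDF5's own filter pipeline; h5py's low-level `read_direct_chunk` offers that, though whether the PR uses it is an assumption here.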

speedscope_patched_cold.json
speedscope_stock_cold.json

zboldyga · 2026-05-28T21:30:03Z

Another thing this work made me aware of, which can be addressed in a separate PR, is that chunk size is THE driver of read performance. I think I'll defer addressing this until multithreaded writes are also implemented, so that the optimal chunk size can take both reads and writes into account.

Here's some data on the chunk size performance (I'll probably refer back to this on another PR down the line, just sharing now since I have this handy):

## macOS:

Cold cache (ms median)

chunk_KB	gzip stock	gzip patched	gzip zarr	zstd stock	zstd patched	zstd zarr
64	1665.2	371.8	543.2	844.3	433.8	495.3
256	1577.4	254.5	305.1	865.2	184.7	243.3
1024	1569.4	219.0	251.9	807.6	154.6	160.2
4096	1701.3	249.9	—	817.1	164.6	169.6
16384	1737.4	250.3	390.4	831.0	180.4	214.4

Warm cache (ms median)

chunk_KB	gzip stock	gzip patched	gzip zarr	zstd stock	zstd patched	zstd zarr
64	1492.1	310.9	271.1	693.0	364.0	123.4
256	1424.9	193.8	225.1	729.5	105.0	113.7
1024	1424.9	151.9	189.4	663.9	79.8	101.3
4096	1595.1	179.7	—	693.2	87.4	105.9
16384	1581.1	156.3	301.4	718.5	97.8	134.6

## Linux/Docker:

Cold cache (ms median)

chunk_KB	gzip stock	gzip patched	gzip zarr	zstd stock	zstd patched	zstd zarr
64	1721.5	598.1	150.3	601.2	388.8	81.0
256	1561.5	296.1	140.5	573.8	147.0	81.1
1024	1513.0	182.5	145.5	556.3	93.1	78.2
4096	1498.1	188.1	159.2	554.4	88.3	86.9
16384	1507.8	207.9	190.3	555.7	91.3	115.5

Warm cache (ms median)

chunk_KB	gzip stock	gzip patched	gzip zarr	zstd stock	zstd patched	zstd zarr
64	1558.8	537.3	134.0	541.5	329.4	88.0
256	1486.4	230.8	131.7	525.4	110.5	72.7
1024	1446.2	139.4	125.4	496.6	56.6	69.4
4096	1436.5	149.4	131.7	507.4	50.8	81.4
16384	1442.9	168.1	166.2	504.4	58.6	94.7

This was a limited example using wessels23 perturb-seq. For separate work on optimal chunk sizing, I would want to do a comprehensive analysis of realistic AnnData datasets.

However, one thing I noticed across ~50 perturb-seq files I have access to is that chunk size skews quite small (< 128KB median). At these sizes, both performance and compression ratio are likely taking a big hit, because a compressor has too little data per chunk to work with.
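The compression-ratio effect is easy to demonstrate on synthetic data. The sketch below compresses the same byte stream in small and in large chunks with zlib and compares the totals; the data is made up (a skewed stream of small values), not drawn from any AnnData file, and the chunk sizes are arbitrary.

```python
import random
import zlib


def compressed_size(data: bytes, chunk_bytes: int) -> int:
    """Total size when `data` is compressed chunk by chunk."""
    return sum(
        len(zlib.compress(data[i : i + chunk_bytes]))
        for i in range(0, len(data), chunk_bytes)
    )


rng = random.Random(0)
# Synthetic, skewed values: half of all bytes are zero.
data = bytes(rng.choice((0, 0, 0, 0, 1, 1, 2, 3)) for _ in range(1 << 18))

small = compressed_size(data, 1 << 10)  # 1 KiB chunks
large = compressed_size(data, 1 << 17)  # 128 KiB chunks
print(small > large)  # True
```

Every chunk pays for its own stream header and starts with an empty history window, so the per-chunk overhead adds up when chunks are small.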

AnnData currently uses h5py's default chunk sizing, which matches the sizing I observed in the perturb-seq files. Given the nature of single-cell data, we can likely find a different (but simple) chunk-size algorithm that gives better performance for AnnData's users.
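As a starting point for that discussion, a fixed byte target per chunk is about the simplest possible rule. The helper below is purely hypothetical (not part of AnnData or this PR) and assumes the target refers to the uncompressed size of one chunk of a 1-D array such as CSR's data or indices.

```python
def chunk_len(target_kb: int, itemsize: int, n_elements: int) -> int:
    """Elements per chunk so one uncompressed chunk is about target_kb KiB."""
    if target_kb <= 0 or itemsize <= 0:
        raise ValueError("target_kb and itemsize must be positive")
    wanted = max(1, (target_kb * 1024) // itemsize)
    # Never ask for a chunk longer than the array itself.
    return max(1, min(wanted, n_elements))


# 1 MiB chunks of 4-byte values in a 50-million-element array
print(chunk_len(1024, 4, 50_000_000))  # 262144
```

The result could be passed to h5py as `chunks=(n,)` when creating the dataset; picking the target itself is exactly the analysis deferred to a later PR.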

(will defer that work to a later PR)

zboldyga · 2026-05-28T21:50:31Z

I plan to refine tests a bit more -- both in terms of line coverage, and making sure they're sensible.

Also, I need to make some minor adjustments for correct Windows support -- this isn't complicated, I just haven't addressed it yet.

Zethson · 2026-05-29T08:15:24Z

Awesome stuff! In case you don't know already: our developer team is on vacation for the next 2-3 weeks, so please don't be surprised if it takes some time for them to get back to you.

zboldyga and others added 2 commits May 28, 2026 13:05

implemented multi-threading for h5ad file reads.

0fc6824

[pre-commit.ci] auto fixes from pre-commit.com hooks

bef6ac4

for more information, see https://pre-commit.ci

zboldyga wants to merge 2 commits into scverse:main from zboldyga:feat/parallel-h5-read
