implemented multi-threading for h5ad file reads, substantial speed up (rivals zarr + zarrs) #2471

Open
zboldyga wants to merge 2 commits into scverse:main from zboldyga:feat/parallel-h5-read
@zboldyga zboldyga commented May 28, 2026

  • Closes #
  • [x] Tests added
  • Release note not necessary because:

Hey there!

Here's a patch that uses multithreading to bring AnnData HDF5 reads to speeds that rival zarr + zarrs. Basically, we get around the single-threaded design of HDF5 / h5py. It's straightforward: see the comment at the top of h5_parallel.py for how this works.

By nature of the approach, this PR also brings support for HDF5 reads of zstd- and blosc-compressed data, with no external plugin required. I chose to add these two because they're sensible for single-cell data. The existing numcodecs dependency is used, so adding a new codec for reading HDF5 is now a few lines of code per codec (if it's supported in numcodecs), and it will also be multithreaded by default.
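
To illustrate the "few lines per codec" point, here is a minimal sketch of a table that maps HDF5 filter IDs to decoders. It is not the code in h5_parallel.py: the names `DECODERS` and `decode_chunk` are hypothetical, and it assumes that a raw chunk written by the zstd or blosc HDF5 filter can be decoded directly by the matching numcodecs codec. The filter IDs are the registered HDF5 ones (1 for deflate, 32001 for Blosc, 32015 for Zstandard).

```python
import zlib
from typing import Callable, Dict

# Registered HDF5 filter IDs.
H5Z_FILTER_DEFLATE = 1
H5Z_FILTER_BLOSC = 32001
H5Z_FILTER_ZSTD = 32015

# Hypothetical registry: filter ID -> function turning raw chunk bytes into decoded bytes.
DECODERS: Dict[int, Callable[[bytes], bytes]] = {
    H5Z_FILTER_DEFLATE: zlib.decompress,  # HDF5 deflate chunks are zlib streams
}

try:
    # Assumption: numcodecs codecs can decode chunks written by the HDF5 plugins.
    import numcodecs

    _zstd = numcodecs.Zstd()
    _blosc = numcodecs.Blosc()
    DECODERS[H5Z_FILTER_ZSTD] = lambda buf: bytes(_zstd.decode(buf))
    DECODERS[H5Z_FILTER_BLOSC] = lambda buf: bytes(_blosc.decode(buf))
except ImportError:
    # Without numcodecs, only deflate is available in this sketch.
    pass


def decode_chunk(filter_id: int, raw: bytes) -> bytes:
    """Decode one raw chunk, or fail loudly for an unregistered filter."""
    try:
        decoder = DECODERS[filter_id]
    except KeyError:
        raise NotImplementedError(f"no decoder for HDF5 filter {filter_id}") from None
    return decoder(raw)
```

Under this layout, supporting another codec would amount to one more entry in the registry, and any thread pool that calls `decode_chunk` would pick it up without further changes.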

Benchmark

  • this measures whole X object reads using AnnData. Cold = from disk. Warm = in OS page cache.
  • all times are in ms, and are median times from multiple trials

macOS M4 Max (16 core), wessels23 perturb-seq

  • before each test, the file is written w/ corresponding codec @ 1MB chunk size
  • cold tests include disk read time. OS page cache is purged to achieve this.
  • cold repeated 3 times, warm repeated 5 times. median reported
  • same compression settings used across h5ad stock/h5ad patched/zarr

| codec | cache | h5ad stock (ms) | h5ad patched (ms) | speedup | zarr+zarrs (ms) | patched / zarr |
|---|---|---|---|---|---|---|
| gzip | cold | 1569.4 | 219.0 | 7.17× | 251.9 | 0.87× |
| gzip | warm | 1424.9 | 151.9 | 9.38× | 189.4 | 0.80× |
| zstd | cold | 807.6 | 154.6 | 5.22× | 160.2 | 0.97× |
| zstd | warm | 663.9 | 79.8 | 8.32× | 101.3 | 0.79× |
| blosc-lz4 | cold | 606.2 | 326.9 | 1.85× | 164.9 | 1.98× |
| blosc-lz4 | warm | 445.0 | 98.9 | 4.50× | 82.9 | 1.19× |
| blosc-zstd | cold | 936.4 | 184.1 | 5.09× | 173.7 | 1.06× |
| blosc-zstd | warm | 816.4 | 100.4 | 8.13× | 112.8 | 0.89× |

Linux via Docker (same macbook), wessels23 perturb-seq

  • same as macos scenarios except purging the OS page cache involves macos + linux purges to get true cold reads.

| codec | cache | h5ad stock (ms) | h5ad patched (ms) | speedup | zarr+zarrs (ms) | patched / zarr |
|---|---|---|---|---|---|---|
| gzip | cold | 1513.0 | 182.5 | 8.29× | 145.5 | 1.25× |
| gzip | warm | 1446.2 | 139.4 | 10.37× | 125.4 | 1.11× |
| zstd | cold | 556.3 | 93.1 | 5.98× | 78.2 | 1.19× |
| zstd | warm | 496.6 | 56.6 | 8.77× | 69.4 | 0.82× |
| blosc-lz4 | cold | 512.1 | 112.8 | 4.54× | 63.4 | 1.78× |
| blosc-lz4 | warm | 444.6 | 52.1 | 8.53× | 48.3 | 1.08× |
| blosc-zstd | cold | 750.2 | 125.0 | 6.00× | 89.8 | 1.39× |
| blosc-zstd | warm | 703.8 | 69.7 | 10.10× | 82.2 | 0.85× |
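
For readers who want to reproduce the derived columns, here is a minimal sketch of a median-of-trials timer and the two ratios. The helper names `median_ms` and `ratio` are hypothetical and this is not the benchmark script used for the tables; purging the OS page cache before each cold trial is left to the caller.

```python
import statistics
import time
from typing import Callable


def median_ms(fn: Callable[[], object], trials: int) -> float:
    """Run fn `trials` times and return the median wall time in milliseconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def ratio(numerator_ms: float, denominator_ms: float) -> float:
    """speedup = stock / patched; 'patched / zarr' = patched / zarr+zarrs."""
    return numerator_ms / denominator_ms


# Values taken from the macOS gzip cold row.
print(f"speedup:        {ratio(1569.4, 219.0):.2f}x")  # 7.17x
print(f"patched / zarr: {ratio(219.0, 251.9):.2f}x")   # 0.87x
```

A "patched / zarr" value below 1 means the patched h5ad read was faster than zarr+zarrs for that row.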

To keep this PR simple, I've initially only implemented a whole-file read; Dask and sliced reads are not yet supported. It's straightforward to add support for these -- I already made a draft locally. I will wait for this PR to close before proceeding.

I also started a draft of implementing multithreaded writing of h5ad. I think it will also rival zarr+zarrs speed but I'm not positive yet -- the main difference there is that we must use libhdf5's serialized code for managing the HDF5 index, which adds a little wall-time friction.

Overall, I'm interested in helping as needed to bring these types of improvements to AnnData!

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 97.16981% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.91%. Comparing base (581b93c) to head (bef6ac4).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/anndata/_io/h5_parallel.py | 96.34% | 6 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2471      +/-   ##
==========================================
+ Coverage   85.60%   85.91%   +0.30%     
==========================================
  Files          49       51       +2     
  Lines        7671     7881     +210     
==========================================
+ Hits         6567     6771     +204     
- Misses       1104     1110       +6     

| Files with missing lines | Coverage Δ |
|---|---|
| src/anndata/_core/sparse_dataset.py | 91.07% <100.00%> (+0.19%) ⬆️ |
| src/anndata/_io/_pool.py | 100.00% <100.00%> (ø) |
| src/anndata/_io/specs/methods.py | 91.44% <100.00%> (+0.07%) ⬆️ |
| src/anndata/_settings.py | 100.00% <100.00%> (ø) |
| src/anndata/_io/h5_parallel.py | 96.34% <96.34%> (ø) |

@zboldyga
Author

One thing that may help in understanding this work is reviewing flame graphs of read_h5ad before and after the patch. You can drop either of the attached files into https://www.speedscope.app/ to explore.

For each, there are three read_h5ad calls against the wessels23 dataset. Within each of those, you'll see two to_memory calls, one for each of the sparse storage method's major arrays (e.g. CSR's data and indices). At the bottom of the flames you'll see the inflate calls (the decompression side of deflate). These are actually made tens of thousands of times (once per chunk) in this example, but the sampling rate of the flame graph tool is too low to capture that detail, so they appear as longer calls. In any case, almost all of the time in these read operations is spent on decompression.

In the stock flamegraph, you'll see it's all happening in one thread. In the patched flamegraph, there's a thread dropdown at the top of the speedscope app -- you'll see there are various threads. So we're just parallelizing the chunk decompression.
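
The idea of parallelizing chunk decompression can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: the compressed chunks are built in memory as a stand-in for raw chunk bytes read from an HDF5 file, the helper names are hypothetical, and it relies on CPython's zlib releasing the GIL while inflating so that threads can actually overlap.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_and_compress(payload: bytes, chunk_size: int) -> List[bytes]:
    """Stand-in for the independently compressed chunks stored on disk."""
    return [
        zlib.compress(payload[start:start + chunk_size])
        for start in range(0, len(payload), chunk_size)
    ]


def decompress_parallel(chunks: List[bytes], max_workers: int = 8) -> bytes:
    """Inflate every chunk on a thread pool and reassemble them in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so a plain join is enough.
        return b"".join(pool.map(zlib.decompress, chunks))


payload = bytes(range(256)) * 4096                # 1 MiB of sample data
chunks = split_and_compress(payload, 64 * 1024)   # 16 chunks of 64 KiB
restored = decompress_parallel(chunks)
assert restored == payload
```

Because each chunk is compressed independently, no chunk depends on another during decoding, which is what makes this kind of fan-out possible in the first place.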

speedscope_patched_cold.json
speedscope_stock_cold.json

Author

zboldyga commented May 28, 2026

Another thing this made me aware of, which can be addressed in a separate PR, is that chunk size is THE driver of performance for reads. I think I'll defer addressing this until multithreaded writes are also implemented, so that we can settle on optimal chunk sizing with both taken into account.

Here's some data on the chunk size performance (I'll probably refer back to this on another PR down the line, just sharing now since I have this handy):

macOS:

Cold cache (ms median)

| chunk size (KB) | gzip stock | gzip patched | gzip zarr | zstd stock | zstd patched | zstd zarr |
|---|---|---|---|---|---|---|
| 64 | 1665.2 | 371.8 | 543.2 | 844.3 | 433.8 | 495.3 |
| 256 | 1577.4 | 254.5 | 305.1 | 865.2 | 184.7 | 243.3 |
| 1024 | 1569.4 | 219.0 | 251.9 | 807.6 | 154.6 | 160.2 |
| 4096 | 1701.3 | 249.9 |  | 817.1 | 164.6 | 169.6 |
| 16384 | 1737.4 | 250.3 | 390.4 | 831.0 | 180.4 | 214.4 |

Warm cache (ms median)

| chunk size (KB) | gzip stock | gzip patched | gzip zarr | zstd stock | zstd patched | zstd zarr |
|---|---|---|---|---|---|---|
| 64 | 1492.1 | 310.9 | 271.1 | 693.0 | 364.0 | 123.4 |
| 256 | 1424.9 | 193.8 | 225.1 | 729.5 | 105.0 | 113.7 |
| 1024 | 1424.9 | 151.9 | 189.4 | 663.9 | 79.8 | 101.3 |
| 4096 | 1595.1 | 179.7 |  | 693.2 | 87.4 | 105.9 |
| 16384 | 1581.1 | 156.3 | 301.4 | 718.5 | 97.8 | 134.6 |

Linux/Docker:

Cold cache (ms median)

| chunk size (KB) | gzip stock | gzip patched | gzip zarr | zstd stock | zstd patched | zstd zarr |
|---|---|---|---|---|---|---|
| 64 | 1721.5 | 598.1 | 150.3 | 601.2 | 388.8 | 81.0 |
| 256 | 1561.5 | 296.1 | 140.5 | 573.8 | 147.0 | 81.1 |
| 1024 | 1513.0 | 182.5 | 145.5 | 556.3 | 93.1 | 78.2 |
| 4096 | 1498.1 | 188.1 | 159.2 | 554.4 | 88.3 | 86.9 |
| 16384 | 1507.8 | 207.9 | 190.3 | 555.7 | 91.3 | 115.5 |

Warm cache (ms median)

| chunk size (KB) | gzip stock | gzip patched | gzip zarr | zstd stock | zstd patched | zstd zarr |
|---|---|---|---|---|---|---|
| 64 | 1558.8 | 537.3 | 134.0 | 541.5 | 329.4 | 88.0 |
| 256 | 1486.4 | 230.8 | 131.7 | 525.4 | 110.5 | 72.7 |
| 1024 | 1446.2 | 139.4 | 125.4 | 496.6 | 56.6 | 69.4 |
| 4096 | 1436.5 | 149.4 | 131.7 | 507.4 | 50.8 | 81.4 |
| 16384 | 1442.9 | 168.1 | 166.2 | 504.4 | 58.6 | 94.7 |

This was a limited example using the wessels23 perturb-seq dataset. For separate work on optimal chunk sizing, I would want to do a comprehensive analysis of realistic AnnData datasets.

However, one thing I noticed across ~50 perturb-seq files I have access to is that chunk size skews quite small (< 128KB median). It's likely that both performance and compression ratio take a big hit at these sizes, because compression just doesn't work as well without much data per chunk.

AnnData currently uses h5py's default chunk sizing, which matches the sizing I observed in the perturb-seq files. Due to the nature of single-cell data, we can likely find a different (but simple) chunk size algorithm that gives better performance for AnnData's users.
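
As one example of what a "simple" rule could look like, here is a sketch that sizes 1-D chunks to a byte target. It is only an assumption for illustration, not a proposal from this PR: the function name `chunk_len_for_target` is hypothetical, and the 1 MiB default is picked because the 1024 KB rows are at or near the fastest in the tables above for this one dataset.

```python
from typing import Optional


def chunk_len_for_target(
    itemsize: int,
    target_bytes: int = 1024 * 1024,
    n_elements: Optional[int] = None,
) -> int:
    """Number of elements per 1-D chunk so that a chunk is about target_bytes."""
    if itemsize <= 0 or target_bytes <= 0:
        raise ValueError("itemsize and target_bytes must be positive")
    length = max(1, target_bytes // itemsize)
    if n_elements is not None:
        # Never ask for a chunk longer than the array itself.
        length = min(length, max(1, n_elements))
    return length


print(chunk_len_for_target(4))                   # float32 or int32: 262144
print(chunk_len_for_target(8, n_elements=1000))  # small array: 1000
```

The resulting length could be passed as `chunks=(n,)` to h5py's `create_dataset` when writing a 1-D array such as a CSR matrix's data or indices.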

(will defer that work to a later PR)

@zboldyga
Author

I plan to refine tests a bit more -- both in terms of line coverage, and making sure they're sensible.

Also, I need to make some minor adjustments for correct Windows support -- this isn't complicated; I just haven't addressed it yet.

Member

Zethson commented May 29, 2026

Awesome stuff! In case you don't know already: our developer team is on vacation for the next 2-3 weeks. So please don't be surprised if it takes some time for them to get back to you.
