implemented multi-threading for h5ad file reads, substantial speed up (rivals zarr + zarrs) #2471
zboldyga wants to merge 2 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2471 +/- ##
==========================================
+ Coverage 85.60% 85.91% +0.30%
==========================================
Files 49 51 +2
Lines 7671 7881 +210
==========================================
+ Hits 6567 6771 +204
- Misses 1104 1110 +6
One thing that may help in understanding this work is to review flame graphs of read_h5ad before and after the patch. You can drop either of the attached files into https://www.speedscope.app/ to explore. Each file contains three read_h5ad calls against the wessels23 dataset. Within each call you'll see two to_memory calls, one for each of the sparse format's major arrays (e.g. CSR's data and indices). At the bottom of the flames you'll see the inflate calls (inflate is the decompression side of deflate). These are actually called tens of thousands of times in this example (once per chunk), but the sampling rate of the flame graph tool is too low to capture that detail, so they appear as fewer, longer calls. In any case, almost all of the time in these read operations is spent on decompression. In the stock flame graph, it all happens in one thread. In the patched flame graph, the thread dropdown at the top of the speedscope app shows multiple threads. So we're just parallelizing the chunk decompression.
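To make the idea concrete, here is a minimal sketch of per-chunk parallel decompression using only the standard library. It is not the code from this PR (that lives in h5_parallel.py, which is not shown here): the chunks are simulated in memory, and the helper names `compress_chunks`, `decompress_serial`, and `decompress_parallel` are hypothetical. It relies on the fact that CPython's `zlib.decompress` releases the GIL while inflating, so plain threads can overlap the work.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def compress_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Stand-in for on-disk chunks: deflate each fixed-size slice independently."""
    return [
        zlib.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def decompress_serial(chunks: List[bytes]) -> bytes:
    """Baseline: inflate every chunk on the calling thread."""
    return b"".join(zlib.decompress(c) for c in chunks)


def decompress_parallel(chunks: List[bytes], max_workers: int = 4) -> bytes:
    """Inflate chunks on a thread pool; pool.map keeps the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))


# Synthetic 1 MiB payload split into 64 KiB chunks (illustrative sizes only).
payload = bytes(range(256)) * 4096
chunks = compress_chunks(payload, 64 * 1024)
assert decompress_parallel(chunks) == decompress_serial(chunks) == payload
```

Because each chunk is compressed independently, the only coordination needed is reassembling the results in order, which `pool.map` already guarantees.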
Another thing this work made me aware of, which can be addressed in a separate PR, is that chunk size is THE driver of read performance. I think I'll defer addressing this until multithreaded writes are also implemented, so that optimal chunk sizing can take both into account. Here's some data on chunk size performance (I'll probably refer back to this in another PR down the line; just sharing now since I have it handy):
## macOS
Cold cache (ms median)
Warm cache (ms median)
## Linux/Docker
Cold cache (ms median)
Warm cache (ms median)
This was a limited example using the wessels23 perturb-seq dataset. For separate work on optimal chunk sizing, I would want to do a comprehensive analysis of realistic AnnData datasets. However, one thing I noticed across ~50 perturb-seq files I have access to is that chunk size skews quite small (< 128KB median). At these sizes, both performance and compression ratio likely take a big hit, because a compressor has little redundancy to exploit within such a small chunk. AnnData currently uses h5py's default chunk sizing, which matches the sizing I observed in the perturb-seq files. Given the nature of single-cell data, we can likely find a different (but simple) chunk sizing algorithm that gives better performance for AnnData's users. (I will defer that work to a later PR.)
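The compression-ratio effect of small chunks can be illustrated with a small standard-library experiment. This is a sketch on synthetic, highly repetitive data, so it exaggerates the effect relative to real single-cell matrices; the helper `compressed_size` and the chosen chunk sizes are hypothetical and are not taken from the measurements above.

```python
import random
import zlib


def compressed_size(data: bytes, chunk_size: int) -> int:
    """Total bytes after deflating each chunk independently, as a chunked store would."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


# Synthetic input: a fixed 128-byte pseudo-random pattern repeated to 1 MiB.
rng = random.Random(0)
pattern = bytes(rng.getrandbits(8) for _ in range(128))
data = pattern * 8192

for chunk_size in (256, 4096, 65536, len(data)):
    total = compressed_size(data, chunk_size)
    print(f"{chunk_size:>8} B chunks -> compression ratio {len(data) / total:.1f}x")
```

Each chunk starts with an empty dictionary and pays its own header and checksum, so the pattern has to be stored again in every chunk; larger chunks amortize that cost over more data.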
I plan to refine the tests a bit more, both in terms of line coverage and making sure they're sensible. I also need to make some minor adjustments for correct Windows support. This isn't complicated; I just haven't addressed it yet.
Awesome stuff! In case you don't know already, our developer team is on vacation for the next 2-3 weeks, so please don't be surprised if it takes some time for them to get back to you.
Hey there!
Here's a patch that uses multithreading to bring AnnData HDF5 reads to speeds that rival zarr + zarrs. Basically, we work around the single-threaded design of HDF5 / h5py. It's straightforward: see the comment at the top of h5_parallel.py for how this works.
By the nature of the approach, this PR also adds support for HDF5 reads of zstd- and blosc-compressed data without requiring an external plugin. I chose to add these two because they're sensible for single-cell data. The existing numcodecs dependency is used, so adding a new codec for reading HDF5 now takes a few lines of code per codec (if numcodecs supports it), and new codecs will also be multithreaded by default.
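As a rough illustration of what "a few lines of code per codec" could look like, here is a sketch of a registry that maps HDF5 filter IDs to decoder callables. It is an assumption about the general shape, not the PR's actual code: `DECODERS` and `decode_chunk` are hypothetical names, filter pipelines with more than one stage (for example shuffle followed by deflate) are ignored, and the numcodecs entries are only registered if the package is importable. The filter IDs are the standard ones: 1 for built-in deflate, 32001 for Blosc, and 32015 for Zstandard.

```python
import zlib
from typing import Callable, Dict

# HDF5 filter identifiers: 1 is the built-in deflate filter; 32001 (Blosc)
# and 32015 (Zstandard) are registered third-party filter IDs.
H5Z_FILTER_DEFLATE = 1
H5Z_FILTER_BLOSC = 32001
H5Z_FILTER_ZSTD = 32015

# Hypothetical registry: filter ID -> function turning a raw chunk into bytes.
DECODERS: Dict[int, Callable[[bytes], bytes]] = {
    H5Z_FILTER_DEFLATE: zlib.decompress,
}

try:
    from numcodecs import Blosc, Zstd
except ImportError:
    pass  # without numcodecs only deflate is available in this sketch
else:
    _zstd, _blosc = Zstd(), Blosc()
    DECODERS[H5Z_FILTER_ZSTD] = lambda buf: bytes(_zstd.decode(buf))
    DECODERS[H5Z_FILTER_BLOSC] = lambda buf: bytes(_blosc.decode(buf))


def decode_chunk(filter_id: int, raw: bytes) -> bytes:
    """Decode one raw chunk with the decoder registered for its filter ID."""
    try:
        decoder = DECODERS[filter_id]
    except KeyError:
        raise ValueError(f"no decoder registered for HDF5 filter id {filter_id}") from None
    return decoder(raw)


original = b"single cell " * 1000
assert decode_chunk(H5Z_FILTER_DEFLATE, zlib.compress(original)) == original
```

Since every decoder is a plain function of one chunk, any codec added to such a registry can be dispatched to the same thread pool as deflate.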
Benchmark
## macOS M4 Max (16 core), wessels23 perturb-seq
## Linux via Docker (same MacBook), wessels23 perturb-seq
To keep this PR simple, I've initially implemented only whole-file reads; Dask and sliced reads are not yet supported. It's straightforward to add support for these, and I already have a local draft. I will wait for this PR to close before proceeding.
I also started a draft implementation of multithreaded h5ad writing. I think it will also rival zarr + zarrs speed, but I'm not positive yet. The main difference there is that we must use libhdf5's serialized code for managing the HDF5 index, which adds a little wall-time friction.
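The write-side trade-off described here can be sketched in general terms: compression runs in parallel, while appending to the file and updating the chunk index happen one at a time. The following is a standard-library toy, not the draft mentioned above; `SerializedChunkWriter` and `write_parallel` are hypothetical names, and an in-memory buffer guarded by a lock stands in for libhdf5's serialized index management.

```python
import io
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


class SerializedChunkWriter:
    """Toy chunk store: appends and index updates are guarded by a single lock."""

    def __init__(self) -> None:
        self._buf = io.BytesIO()
        self._lock = threading.Lock()
        self.index: Dict[int, Tuple[int, int]] = {}  # chunk id -> (offset, size)

    def write(self, chunk_id: int, payload: bytes) -> None:
        with self._lock:  # the serialized section
            offset = self._buf.tell()
            self._buf.write(payload)
            self.index[chunk_id] = (offset, len(payload))

    def read(self, chunk_id: int) -> bytes:
        offset, size = self.index[chunk_id]
        return self._buf.getvalue()[offset:offset + size]


def write_parallel(chunks: List[bytes], writer: SerializedChunkWriter,
                   max_workers: int = 4) -> None:
    """Compress chunks on worker threads; only the final append takes the lock."""

    def work(item: Tuple[int, bytes]) -> None:
        chunk_id, raw = item
        compressed = zlib.compress(raw)  # runs in parallel, outside the lock
        writer.write(chunk_id, compressed)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(work, enumerate(chunks)))  # drain to surface any exceptions


raw_chunks = [bytes([i]) * 10_000 for i in range(32)]
writer = SerializedChunkWriter()
write_parallel(raw_chunks, writer)
assert all(zlib.decompress(writer.read(i)) == c for i, c in enumerate(raw_chunks))
```

Chunks may land in the buffer in any order, which is why the index records an explicit offset and size per chunk; the time spent inside the lock is the "wall-time friction" that a purely parallel read path does not have.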
Overall, I'm happy to help as needed to bring these types of improvements to AnnData!