Feature/lazy loading by HolEv · Pull Request #114 · theislab/cellink

HolEv · 2026-04-09T12:00:06Z

Add lazy loading for G and C

implements lazy loaing for G an C (while still allowing to cache var and obs into memory)
adapts donordata and readwrite for this purpose.

TODO (maybe at later stage): allow that some objects in DonorData storage are not zarr (i.e., allow them to be e.g. parquet files which would be much faster for loading)

… when loading lazy

LArnoldt · 2026-04-29T13:49:15Z

+        else:
+            # Lazy AnnData (e.g. from read_lazy) — obs is xarray-backed.
+            # donor_id must already be the obs index (obs_names).
+            logger.debug("G.obs is not a pandas DataFrame; assuming donor_id is the obs index.")


I think I have never seen an example where G.obs is not a pd.DataFrame? Or is it only a pd.Index then? If not, what else? Should we cast it to a pd.Index then?

If you have an entire zarr based ann data and you don't load everything into mem (i.e., lazy loading and eager_obs_var = False) then it won't be a pandas data frame but just an xarray data set

LArnoldt · 2026-04-29T13:56:20Z

+        """Whether G or C has a dask-backed X matrix (lazy loading)."""
+        return self._lazy_G or self._lazy_C
+
+    def to_memory(self) -> DonorData:


Would it make sense to have an inplace flag here and enable inplace materialization?

LArnoldt · 2026-04-29T14:00:28Z

+        """Whether G or C has a dask-backed X matrix (lazy loading)."""
+        return self._lazy_G or self._lazy_C
+
+    def to_memory(self) -> DonorData:


Maybe good to have an alias function compute() since dask also uses compute()?

def compute(self) -> DonorData: return self.to_memory()

I like that! Will add it

LArnoldt · 2026-04-29T14:01:55Z

-        _G = self.G[G_obs]
+        _G = self._G[G_obs]
        _G = _G[:, G_var]
        _C = self.C[C_obs]


Why C and not _C here?

LArnoldt · 2026-04-29T14:02:25Z

+        # embedded into a new lazy dask chain — that would recreate the bottleneck.
+        _G = self._G[key[0]] if key[0] is not slice(None) else self._G
        _G = _G[:, key[1]] if key[1] is not slice(None) else _G
        _C = self.C[key[2]] if key[2] is not slice(None) else self.C


Same question as above. Why C and not _C?

LArnoldt · 2026-04-29T14:10:38Z

+        # For lazy sides, don't materialise — leave them alone.
+        # For eager sides, copy if it's currently a view.
+        if not self._lazy_G and self._G.is_view:
            self._G = self._G.copy()


But now it mutates self._G, so it's not a clean copy? The original dd object is changed, which shouldn't happen in a function called "copy".

thanks for noticing! should be fixed now!

LArnoldt · 2026-04-29T14:14:36Z

            verbose:
                Whether to print verbose output. Defaults to False.
        """
+        if self._lazy_C:


Why only in case of lazy C and not lazy G?

because this function only uses C. adata used below is self.C. So G is irrelevant here

LArnoldt · 2026-04-29T14:17:40Z

        self._match_donors(self._G, self._C)

    @property
    def G(self) -> AnnData:


This should give a warning in case of a lazy dask frame, if self._lazy_G_obs_filter is not None, otherwise the user might forget calling to_memory first?

@property def G(self) -> AnnData: if self._lazy_G_obs_filter is not None: import warnings warnings.warn( "Accessing dd.G on a lazy DonorData with a pending obs filter. " "The returned object contains unfiltered donors. " "Call dd.to_memory() first to get the correctly filtered G.", stacklevel=2, ) return self._G

LArnoldt · 2026-04-29T14:18:10Z

            raise ValueError("Unknown format: use 'h5' or 'zarr'.")

    @property
    def C(self) -> AnnData:


This should give a warning in case of a lazy dask frame, if self._lazy_C_obs_filter is not None, otherwise the user might forget calling to_memory first?

@property def C(self) -> AnnData: if self._lazy_C_obs_filter is not None: warnings.warn( "Accessing dd.C on a lazy DonorData with a pending obs filter. " "The returned object contains unfiltered cells. " "Call dd.to_memory() first to get the correctly filtered C.", LazyDonorDataWarning, stacklevel=2, ) return self._C

LArnoldt · 2026-04-29T14:24:08Z

+                if C.is_view:
+                    C = C.copy()
+
+        return DonorData(


Not convinced of this. This calls __init__ again, which calls _match_donors again. But at this point G and C are already correctly filtered, so _match_donors is redundant. Also, this could be dangerous. It recomputes keep_donors from scratch on the already-filtered data, which should give the same result, but if anything about the obs index changed during materialisation it could silently re-filter in an unexpected way.
How about we skip __init__ and build the result object directly:

@classmethod def _from_parts(cls, G, C, donor_id, var_dims_to_sync, uns) -> DonorData: """Construct a DonorData directly from already-matched parts, bypassing _match_donors.""" result = object.__new__(cls) result._G = G result._C = C result.donor_id = donor_id result._var_dims_to_sync = var_dims_to_sync result.uns = uns result._lazy_G_obs_filter = None result._lazy_C_obs_filter = None return result

and in in_memory():

return DonorData._from_parts( G=G, C=C, donor_id=self.donor_id, var_dims_to_sync=self._var_dims_to_sync, uns=self.uns, )

good catch. Implemented this

jan-engelmann

Thank you Eva! I like where this is going. Good first version.
This is a challenging and important PR and I think we'll need a few iterations to get this right.
Let me know once you have a new version then I'll have another look!

jan-engelmann · 2026-05-01T05:55:38Z

+        import dask.array as da
+    except ImportError:
+        return False
+    return isinstance(getattr(adata, "X", None), da.Array)


I think this is brittle. What if it is eg:
https://anndata.readthedocs.io/en/latest/generated/anndata.abc.CSRDataset.html#anndata.abc.CSRDataset

I changed to also check for CSR and CSC data sets

jan-engelmann · 2026-05-01T05:59:45Z

+    def _materialise(x):
+        if isinstance(x, (pd.DataFrame, pd.Series, np.ndarray)):
+            return x
+        if hasattr(x, "to_pandas"):


We have many hasattrs and isinstance calls. We should avoid this.

I think we should write our own reading function using
https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_dispatched.html

jan-engelmann · 2026-05-01T06:00:48Z

+def _h5_to_lazy_anndata(group):
+    """Build a lazy AnnData from an h5py Group when read_lazy can't ingest h5.
+
+    Used as a fallback for older anndata versions.  Keeps obs/var eager


hmmm should we simply require later anndata versions? I'd be in favor of that.

(and then remove this function)

jan-engelmann · 2026-05-01T06:02:16Z

+        try:
+            from anndata.experimental import read_lazy
+
+            a = read_lazy(path, load_annotation_index=load_annotation_index)


have you talked with @ilan-gold what the plans are for this function? I think we would need more control. as stated above I'd consider writing our own lazy reading function with:
https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_dispatched.html

I can dig up code if you need it

@ilan-gold what is your opinion on this? Is the read_dispatched suitable for the lazy reading?
If you could have a look at the changes suggested in this PR to _readwrite.py that would be amazing!

jan-engelmann · 2026-05-01T06:04:06Z

+            try:
+                uns[key] = uns_group[key][()]
+            except Exception:
+                logger.debug(f"Skipped loading uns['{key}'] (may be large or non-array)")


user will not see this as they likley have a higher logging level.
Also this swallows the exception. It's bad practice to have general Exception type.
It would be better to think about what can go wrong here and handle exactly these cases

jan-engelmann · 2026-05-01T06:08:48Z

TODO (maybe at later stage): allow that some objects in DonorData storage are not zarr (i.e., allow them to be e.g. parquet files which would be much faster for loading)

I know @ilan-gold is working on speeding up the dataframe loading. Ideally we'd stay within the anndata framework. But I do agree parquet is great since it's fast and allows column-wise loading

ilan-gold · 2026-05-04T09:22:27Z

It would help to know what exactly is slow. Eva messaged me but I'm still not 100% clear - are you ok loading your obs/var data into memory? Is it actually loading individual columns lazily that is slow with the lazy xarray class in obs? Or is it loading obs and var (not lazily) into memory that is slow?

…ly pre-commit to pyproject.toml

ilan-gold · 2026-05-06T13:43:14Z

Ok so I've done a little tinkering. Here is what I can see, roughly:

parquet files dont seem to be much faster than zarr files for reading full dataframes at least from a large var dataframe I tried that sgkit output/was converted to anndata (discarding the insane column 7 named column with very long strings). This is more confirmation of my long-running hypothesis that compressed chunks of bytes on disk are nothing more than that. And your reader software simply needs to not completely stink, which zarr does not. I used pandas.read_parquet:
Backport PR #2419: fix: anndata.experimental.read_lazy index fixes (checking uniqueness + unnecessary dask loading) scverse/anndata#2420 should speed up considerably the process of lazy loading if you set anndata.settings.check_uniqueness. Using this setting, opening (not reading fully) the var dataframe with 92 million features took about 25 seconds, over 2/3 of which is just reading the index column into memory i.e., var_names. With a better filesystem (not Helmholtz's cluster), this is probably faster.
Re: the above index, there is always load_annotation_index=False if you want even faster load times with read_lazy at the expense of your indexes losing some interpretability right out of the gate. The index column will still be there but obs_names and var_names will be a little less interpretable

For point 1 see:

In [10]: %timeit pd.read_parquet("/localscratch/ilan.gold/test.pq")
1min 11s ± 1.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit pd.DataFrame(index=ad.io.read_elem(g["_index"]), data={k: ad.io.read_elem(g[k]) for k in g if k not
       ⋮  in {"7", "_index"}})
1min 16s ± 877 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and that is with read_parquet against the localscratch, which should be much faster!

In short, I don't think parquet files or polars will be any massive upgrade. I can look into shaving the non i/o time off of the anndata.experimental.read_lazy since it gets worse with more columns. But I think you all should download the latest anndata and give things a spin. I'm releasing the above backport now.

LArnoldt · 2026-05-21T12:50:50Z

Are these changes compatible with the .dd.h5 and .dd.zarr writing and loading functions? Was this tested? @HolEv

HolEv added 5 commits April 8, 2026 09:34

implement lazy loading functionalities

ca169e7

speed up lazy loading subsetting

fc3aa1e

add scatter plot function

c8559f9

allow lazy loading for G&C and h5ad and allow caching of var/obs even…

e7a0972

… when loading lazy

updated formatting

58a602e

HolEv requested a review from LArnoldt April 28, 2026 13:57

HolEv changed the title ~~Feature/lazy g data~~ Feature/lazy loading Apr 28, 2026

LArnoldt requested changes Apr 29, 2026

View reviewed changes

LArnoldt requested a review from jan-engelmann April 29, 2026 14:24

jan-engelmann requested changes May 1, 2026

View reviewed changes

HolEv added 4 commits May 5, 2026 14:22

add check for CSRDataset, CSCDataset for has lazy check

aa4992a

address PR comments

dd1a26b

remove functions not relevant for this PR

66ed4cc

bump anndata version to match loading readwrite expectations (and app…

eceaf0a

…ly pre-commit to pyproject.toml

HolEv requested a review from ilan-gold May 5, 2026 13:39

Conversation

HolEv commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add lazy loading for G and C

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jan-engelmann left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jan-engelmann commented May 1, 2026

Uh oh!

ilan-gold commented May 4, 2026

Uh oh!

ilan-gold commented May 6, 2026

Uh oh!

LArnoldt commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

HolEv commented Apr 9, 2026 •

edited

Loading