Skip to content

Feature/lazy loading#114

Open
HolEv wants to merge 9 commits into
mainfrom
feature/lazy_g_data
Open

Feature/lazy loading#114
HolEv wants to merge 9 commits into
mainfrom
feature/lazy_g_data

Conversation

@HolEv

@HolEv HolEv commented Apr 9, 2026

Copy link
Copy Markdown
Collaborator

Add lazy loading for G and C

  • implements lazy loaing for G an C (while still allowing to cache var and obs into memory)
  • adapts donordata and readwrite for this purpose.

TODO (maybe at later stage): allow that some objects in DonorData storage are not zarr (i.e., allow them to be e.g. parquet files which would be much faster for loading)

@HolEv HolEv requested a review from LArnoldt April 28, 2026 13:57
@HolEv HolEv changed the title Feature/lazy g data Feature/lazy loading Apr 28, 2026
else:
# Lazy AnnData (e.g. from read_lazy) — obs is xarray-backed.
# donor_id must already be the obs index (obs_names).
logger.debug("G.obs is not a pandas DataFrame; assuming donor_id is the obs index.")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have never seen an example where G.obs is not a pd.DataFrame? Or is it only a pd.Index then? If not, what else? Should we cast it to a pd.Index then?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have an entire zarr based ann data and you don't load everything into mem (i.e., lazy loading and eager_obs_var = False) then it won't be a pandas data frame but just an xarray data set

Comment thread src/cellink/_core/donordata.py Outdated
"""Whether G or C has a dask-backed X matrix (lazy loading)."""
return self._lazy_G or self._lazy_C

def to_memory(self) -> DonorData:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have an inplace flag here and enable inplace materialization?

Comment thread src/cellink/_core/donordata.py Outdated
"""Whether G or C has a dask-backed X matrix (lazy loading)."""
return self._lazy_G or self._lazy_C

def to_memory(self) -> DonorData:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe good to have an alias function compute() since dask also uses compute()?

def compute(self) -> DonorData:
    return self.to_memory()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that! Will add it

Comment thread src/cellink/_core/donordata.py Outdated
_G = self.G[G_obs]
_G = self._G[G_obs]
_G = _G[:, G_var]
_C = self.C[C_obs]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why C and not _C here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread src/cellink/_core/donordata.py Outdated
# embedded into a new lazy dask chain — that would recreate the bottleneck.
_G = self._G[key[0]] if key[0] is not slice(None) else self._G
_G = _G[:, key[1]] if key[1] is not slice(None) else _G
_C = self.C[key[2]] if key[2] is not slice(None) else self.C

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above. Why C and not _C?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread src/cellink/_core/donordata.py Outdated
# For lazy sides, don't materialise — leave them alone.
# For eager sides, copy if it's currently a view.
if not self._lazy_G and self._G.is_view:
self._G = self._G.copy()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now it mutates self._G, so it's not a clean copy? The original dd object is changed, which shouldn't happen in a function called "copy".

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for noticing! should be fixed now!

verbose:
Whether to print verbose output. Defaults to False.
"""
if self._lazy_C:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only in case of lazy C and not lazy G?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because this function only uses C. adata used below is self.C. So G is irrelevant here

self._match_donors(self._G, self._C)

@property
def G(self) -> AnnData:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should give a warning in case of a lazy dask frame, if self._lazy_G_obs_filter is not None, otherwise the user might forget calling to_memory first?

@property
def G(self) -> AnnData:
    if self._lazy_G_obs_filter is not None:
        import warnings
        warnings.warn(
            "Accessing dd.G on a lazy DonorData with a pending obs filter. "
            "The returned object contains unfiltered donors. "
            "Call dd.to_memory() first to get the correctly filtered G.",
            stacklevel=2,
        )
    return self._G

raise ValueError("Unknown format: use 'h5' or 'zarr'.")

@property
def C(self) -> AnnData:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should give a warning in case of a lazy dask frame, if self._lazy_C_obs_filter is not None, otherwise the user might forget calling to_memory first?

@property
def C(self) -> AnnData:
    if self._lazy_C_obs_filter is not None:
        warnings.warn(
            "Accessing dd.C on a lazy DonorData with a pending obs filter. "
            "The returned object contains unfiltered cells. "
            "Call dd.to_memory() first to get the correctly filtered C.",
            LazyDonorDataWarning,
            stacklevel=2,
        )
    return self._C

Comment thread src/cellink/_core/donordata.py Outdated
if C.is_view:
C = C.copy()

return DonorData(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not convinced of this. This calls __init__ again, which calls _match_donors again. But at this point G and C are already correctly filtered, so _match_donors is redundant. Also, this could be dangerous. It recomputes keep_donors from scratch on the already-filtered data, which should give the same result, but if anything about the obs index changed during materialisation it could silently re-filter in an unexpected way.
How about we skip __init__ and build the result object directly:

@classmethod
def _from_parts(cls, G, C, donor_id, var_dims_to_sync, uns) -> DonorData:
    """Construct a DonorData directly from already-matched parts, bypassing _match_donors."""
    result = object.__new__(cls)
    result._G = G
    result._C = C
    result.donor_id = donor_id
    result._var_dims_to_sync = var_dims_to_sync
    result.uns = uns
    result._lazy_G_obs_filter = None
    result._lazy_C_obs_filter = None
    return result

and in in_memory():

return DonorData._from_parts(
    G=G,
    C=C,
    donor_id=self.donor_id,
    var_dims_to_sync=self._var_dims_to_sync,
    uns=self.uns,
)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. Implemented this

@LArnoldt LArnoldt requested a review from jan-engelmann April 29, 2026 14:24

@jan-engelmann jan-engelmann left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Eva! I like where this is going. Good first version.
This is a challenging and important PR and I think we'll need a few iterations to get this right.
Let me know once you have a new version then I'll have another look!

Comment thread src/cellink/_core/donordata.py Outdated
import dask.array as da
except ImportError:
return False
return isinstance(getattr(adata, "X", None), da.Array)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to also check for CSR and CSC data sets

def _materialise(x):
if isinstance(x, (pd.DataFrame, pd.Series, np.ndarray)):
return x
if hasattr(x, "to_pandas"):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have many hasattrs and isinstance calls. We should avoid this.

I think we should write our own reading function using
https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_dispatched.html

Comment thread src/cellink/io/_readwrite.py Outdated
def _h5_to_lazy_anndata(group):
"""Build a lazy AnnData from an h5py Group when read_lazy can't ingest h5.

Used as a fallback for older anndata versions. Keeps obs/var eager

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm should we simply require later anndata versions? I'd be in favor of that.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and then remove this function)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

try:
from anndata.experimental import read_lazy

a = read_lazy(path, load_annotation_index=load_annotation_index)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you talked with @ilan-gold what the plans are for this function? I think we would need more control. as stated above I'd consider writing our own lazy reading function with:
https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_dispatched.html

I can dig up code if you need it

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilan-gold what is your opinion on this? Is the read_dispatched suitable for the lazy reading?
If you could have a look at the changes suggested in this PR to _readwrite.py that would be amazing!

Comment thread src/cellink/io/_readwrite.py Outdated
try:
uns[key] = uns_group[key][()]
except Exception:
logger.debug(f"Skipped loading uns['{key}'] (may be large or non-array)")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user will not see this as they likley have a higher logging level.
Also this swallows the exception. It's bad practice to have general Exception type.
It would be better to think about what can go wrong here and handle exactly these cases

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

Comment thread src/cellink/pl/_expression.py
@jan-engelmann

Copy link
Copy Markdown
Collaborator

TODO (maybe at later stage): allow that some objects in DonorData storage are not zarr (i.e., allow them to be e.g. parquet files which would be much faster for loading)

I know @ilan-gold is working on speeding up the dataframe loading. Ideally we'd stay within the anndata framework. But I do agree parquet is great since it's fast and allows column-wise loading

@ilan-gold

Copy link
Copy Markdown

It would help to know what exactly is slow. Eva messaged me but I'm still not 100% clear - are you ok loading your obs/var data into memory? Is it actually loading individual columns lazily that is slow with the lazy xarray class in obs? Or is it loading obs and var (not lazily) into memory that is slow?

@HolEv HolEv requested a review from ilan-gold May 5, 2026 13:39
@ilan-gold

Copy link
Copy Markdown

Ok so I've done a little tinkering. Here is what I can see, roughly:

  1. parquet files dont seem to be much faster than zarr files for reading full dataframes at least from a large var dataframe I tried that sgkit output/was converted to anndata (discarding the insane column 7 named column with very long strings). This is more confirmation of my long-running hypothesis that compressed chunks of bytes on disk are nothing more than that. And your reader software simply needs to not completely stink, which zarr does not. I used pandas.read_parquet:
  2. Backport PR #2419: fix: anndata.experimental.read_lazy index fixes (checking uniqueness + unnecessary dask loading) scverse/anndata#2420 should speed up considerably the process of lazy loading if you set anndata.settings.check_uniqueness. Using this setting, opening (not reading fully) the var dataframe with 92 million features took about 25 seconds, over 2/3 of which is just reading the index column into memory i.e., var_names. With a better filesystem (not Helmholtz's cluster), this is probably faster.
  3. Re: the above index, there is always load_annotation_index=False if you want even faster load times with read_lazy at the expense of your indexes losing some interpretability right out of the gate. The index column will still be there but obs_names and var_names will be a little less interpretable

For point 1 see:

In [10]: %timeit pd.read_parquet("/localscratch/ilan.gold/test.pq")
1min 11s ± 1.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit pd.DataFrame(index=ad.io.read_elem(g["_index"]), data={k: ad.io.read_elem(g[k]) for k in g if k not
in {"7", "_index"}})
1min 16s ± 877 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and that is with read_parquet against the localscratch, which should be much faster!

In short, I don't think parquet files or polars will be any massive upgrade. I can look into shaving the non i/o time off of the anndata.experimental.read_lazy since it gets worse with more columns. But I think you all should download the latest anndata and give things a spin. I'm releasing the above backport now.

@LArnoldt

Copy link
Copy Markdown
Collaborator

Are these changes compatible with the .dd.h5 and .dd.zarr writing and loading functions? Was this tested? @HolEv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants