Feature/lazy loading#114
Conversation
| else: | ||
| # Lazy AnnData (e.g. from read_lazy) — obs is xarray-backed. | ||
| # donor_id must already be the obs index (obs_names). | ||
| logger.debug("G.obs is not a pandas DataFrame; assuming donor_id is the obs index.") |
There was a problem hiding this comment.
I think I have never seen an example where G.obs is not a pd.DataFrame? Or is it only a pd.Index then? If not, what else? Should we cast it to a pd.Index then?
There was a problem hiding this comment.
If you have an entire zarr based ann data and you don't load everything into mem (i.e., lazy loading and eager_obs_var = False) then it won't be a pandas data frame but just an xarray data set
| """Whether G or C has a dask-backed X matrix (lazy loading).""" | ||
| return self._lazy_G or self._lazy_C | ||
|
|
||
| def to_memory(self) -> DonorData: |
There was a problem hiding this comment.
Would it make sense to have an inplace flag here and enable inplace materialization?
| """Whether G or C has a dask-backed X matrix (lazy loading).""" | ||
| return self._lazy_G or self._lazy_C | ||
|
|
||
| def to_memory(self) -> DonorData: |
There was a problem hiding this comment.
Maybe good to have an alias function compute() since dask also uses compute()?
def compute(self) -> DonorData:
return self.to_memory()
There was a problem hiding this comment.
I like that! Will add it
| _G = self.G[G_obs] | ||
| _G = self._G[G_obs] | ||
| _G = _G[:, G_var] | ||
| _C = self.C[C_obs] |
| # embedded into a new lazy dask chain — that would recreate the bottleneck. | ||
| _G = self._G[key[0]] if key[0] is not slice(None) else self._G | ||
| _G = _G[:, key[1]] if key[1] is not slice(None) else _G | ||
| _C = self.C[key[2]] if key[2] is not slice(None) else self.C |
There was a problem hiding this comment.
Same question as above. Why C and not _C?
| # For lazy sides, don't materialise — leave them alone. | ||
| # For eager sides, copy if it's currently a view. | ||
| if not self._lazy_G and self._G.is_view: | ||
| self._G = self._G.copy() |
There was a problem hiding this comment.
But now it mutates self._G, so it's not a clean copy? The original dd object is changed, which shouldn't happen in a function called "copy".
There was a problem hiding this comment.
thanks for noticing! should be fixed now!
| verbose: | ||
| Whether to print verbose output. Defaults to False. | ||
| """ | ||
| if self._lazy_C: |
There was a problem hiding this comment.
Why only in case of lazy C and not lazy G?
There was a problem hiding this comment.
because this function only uses C. adata used below is self.C. So G is irrelevant here
| self._match_donors(self._G, self._C) | ||
|
|
||
| @property | ||
| def G(self) -> AnnData: |
There was a problem hiding this comment.
This should give a warning in case of a lazy dask frame, if self._lazy_G_obs_filter is not None, otherwise the user might forget calling to_memory first?
@property
def G(self) -> AnnData:
if self._lazy_G_obs_filter is not None:
import warnings
warnings.warn(
"Accessing dd.G on a lazy DonorData with a pending obs filter. "
"The returned object contains unfiltered donors. "
"Call dd.to_memory() first to get the correctly filtered G.",
stacklevel=2,
)
return self._G
| raise ValueError("Unknown format: use 'h5' or 'zarr'.") | ||
|
|
||
| @property | ||
| def C(self) -> AnnData: |
There was a problem hiding this comment.
This should give a warning in case of a lazy dask frame, if self._lazy_C_obs_filter is not None, otherwise the user might forget calling to_memory first?
@property
def C(self) -> AnnData:
if self._lazy_C_obs_filter is not None:
warnings.warn(
"Accessing dd.C on a lazy DonorData with a pending obs filter. "
"The returned object contains unfiltered cells. "
"Call dd.to_memory() first to get the correctly filtered C.",
LazyDonorDataWarning,
stacklevel=2,
)
return self._C
| if C.is_view: | ||
| C = C.copy() | ||
|
|
||
| return DonorData( |
There was a problem hiding this comment.
Not convinced of this. This calls __init__ again, which calls _match_donors again. But at this point G and C are already correctly filtered, so _match_donors is redundant. Also, this could be dangerous. It recomputes keep_donors from scratch on the already-filtered data, which should give the same result, but if anything about the obs index changed during materialisation it could silently re-filter in an unexpected way.
How about we skip __init__ and build the result object directly:
@classmethod
def _from_parts(cls, G, C, donor_id, var_dims_to_sync, uns) -> DonorData:
"""Construct a DonorData directly from already-matched parts, bypassing _match_donors."""
result = object.__new__(cls)
result._G = G
result._C = C
result.donor_id = donor_id
result._var_dims_to_sync = var_dims_to_sync
result.uns = uns
result._lazy_G_obs_filter = None
result._lazy_C_obs_filter = None
return result
and in in_memory():
return DonorData._from_parts(
G=G,
C=C,
donor_id=self.donor_id,
var_dims_to_sync=self._var_dims_to_sync,
uns=self.uns,
)
There was a problem hiding this comment.
good catch. Implemented this
jan-engelmann
left a comment
There was a problem hiding this comment.
Thank you Eva! I like where this is going. Good first version.
This is a challenging and important PR and I think we'll need a few iterations to get this right.
Let me know once you have a new version then I'll have another look!
| import dask.array as da | ||
| except ImportError: | ||
| return False | ||
| return isinstance(getattr(adata, "X", None), da.Array) |
There was a problem hiding this comment.
I think this is brittle. What if it is eg:
https://anndata.readthedocs.io/en/latest/generated/anndata.abc.CSRDataset.html#anndata.abc.CSRDataset
There was a problem hiding this comment.
I changed to also check for CSR and CSC data sets
| def _materialise(x): | ||
| if isinstance(x, (pd.DataFrame, pd.Series, np.ndarray)): | ||
| return x | ||
| if hasattr(x, "to_pandas"): |
There was a problem hiding this comment.
We have many hasattrs and isinstance calls. We should avoid this.
I think we should write our own reading function using
https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_dispatched.html
| def _h5_to_lazy_anndata(group): | ||
| """Build a lazy AnnData from an h5py Group when read_lazy can't ingest h5. | ||
|
|
||
| Used as a fallback for older anndata versions. Keeps obs/var eager |
There was a problem hiding this comment.
hmmm should we simply require later anndata versions? I'd be in favor of that.
There was a problem hiding this comment.
(and then remove this function)
| try: | ||
| from anndata.experimental import read_lazy | ||
|
|
||
| a = read_lazy(path, load_annotation_index=load_annotation_index) |
There was a problem hiding this comment.
have you talked with @ilan-gold what the plans are for this function? I think we would need more control. as stated above I'd consider writing our own lazy reading function with:
https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_dispatched.html
I can dig up code if you need it
There was a problem hiding this comment.
@ilan-gold what is your opinion on this? Is the read_dispatched suitable for the lazy reading?
If you could have a look at the changes suggested in this PR to _readwrite.py that would be amazing!
| try: | ||
| uns[key] = uns_group[key][()] | ||
| except Exception: | ||
| logger.debug(f"Skipped loading uns['{key}'] (may be large or non-array)") |
There was a problem hiding this comment.
user will not see this as they likley have a higher logging level.
Also this swallows the exception. It's bad practice to have general Exception type.
It would be better to think about what can go wrong here and handle exactly these cases
I know @ilan-gold is working on speeding up the dataframe loading. Ideally we'd stay within the anndata framework. But I do agree parquet is great since it's fast and allows column-wise loading |
|
It would help to know what exactly is slow. Eva messaged me but I'm still not 100% clear - are you ok loading your |
|
Ok so I've done a little tinkering. Here is what I can see, roughly:
For point 1 see: In [10]: %timeit pd.read_parquet("/localscratch/ilan.gold/test.pq")
1min 11s ± 1.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit pd.DataFrame(index=ad.io.read_elem(g["_index"]), data={k: ad.io.read_elem(g[k]) for k in g if k not
⋮ in {"7", "_index"}})
1min 16s ± 877 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)and that is with In short, I don't think parquet files or polars will be any massive upgrade. I can look into shaving the non i/o time off of the |
|
Are these changes compatible with the .dd.h5 and .dd.zarr writing and loading functions? Was this tested? @HolEv |
Add lazy loading for G and C
donordataandreadwritefor this purpose.TODO (maybe at later stage): allow that some objects in DonorData storage are not zarr (i.e., allow them to be e.g. parquet files which would be much faster for loading)