Release per-iteration mdata copies in compute_fake_perturbation_tests #5
Open
adamklie wants to merge 2 commits into
Conversation
args.reference_targets is never defined on the argparse namespace; mirror
the real-test path's fallback (line 49) and use args.guide_annotation_key
instead. This is the correct reference-target list since the fake-test code
relabels NT guides to {'non-targeting', 'targeting'} before this call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fake-test inner loop calls mdata.copy() per iteration, which on real datasets (~10 GB sparse rna matrix) stacks up ~15 GB residual per iteration and OOMs around iteration 16/50 in a 256 GB allocation. Add explicit del + gc.collect() at the end of each iteration so deep copies are reaped before the next iteration starts. Also release the K-loop mdata between K values to avoid accumulating across the K loop. This is a minimal bandaid — the structural fix would avoid the full mdata.copy() in the first place, since the fake-test only mutates obsm and uns on the prog_key modality, not the rna modality (which is what makes the deep copy expensive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Closes #3.

The fake-test inner loop calls `mdata.copy()` per iteration. On real datasets (~10 GB sparse rna matrix), each deep copy retains ~15 GB residual that Python's reference-counted GC can't reap fast enough between iterations. This blows past 256 GB SLURM allocations around iteration 16/50.

Change
Adds `import gc`, then explicit `del` + `gc.collect()` at the end of each fake-test iteration. Plus releases the K-loop `mdata` between K values to keep memory bounded across K too.

Validation
End-to-end on Huangfu HUES8 ESC (~190k cells × 36k genes, K = 30,50,60,80,100,200,250,300, sel_thresh=2.0, --number_run 50):
DE (~270k cells) also completes successfully with this patch.
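The per-iteration cleanup described in the Change section can be sketched as follows. This is an illustrative stand-in, not the repository's actual code: `FakeMData`, `run_test`, and `compute_fake_perturbation_tests` are hypothetical names, and the stand-in `.copy()` mimics the expensive MuData deep copy.

```python
import gc

class FakeMData:
    """Stand-in for a MuData object; .copy() mimics an expensive deep copy."""
    def __init__(self, payload):
        self.payload = payload

    def copy(self):
        return FakeMData(dict(self.payload))

def run_test(mdata, seed):
    # Placeholder for the real per-iteration fake-perturbation test.
    return seed + len(mdata.payload)

def compute_fake_perturbation_tests(mdata, number_run=50):
    results = []
    for i in range(number_run):
        _mdata = mdata.copy()              # deep copy; expensive on real data
        results.append(run_test(_mdata, seed=i))
        # Drop the copy explicitly and force a collection pass so the
        # residual is reaped before the next iteration allocates its copy.
        del _mdata
        gc.collect()
    return results
```

The `del` alone drops the last strong reference; the `gc.collect()` call additionally sweeps any reference cycles the copy participated in, so peak memory stays at roughly one copy instead of growing across iterations.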
This is a bandaid, not the structural fix
The underlying redundancy is that the fake-test code only mutates `_mdata[prog_key].obsm[guide_assignment_key]` and two `uns` arrays — it never touches the rna modality, which is what makes `mdata.copy()` expensive. A more efficient refactor would mutate `mdata[prog_key]` in place each iteration (or operate on a small dict of overrides) and skip the deep copy entirely. That's a bigger change worth doing separately. This PR keeps the existing logic intact and just keeps memory bounded so the function actually runs to completion on realistic datasets.

🤖 Generated with Claude Code
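For reference, the in-place refactor alluded to above could look roughly like this sketch. All names are hypothetical, and a plain dict of lists stands in for the `obsm` mapping; the real code would operate on the `prog_key` modality of a MuData object.

```python
import random

def run_fake_tests_in_place(prog_obsm, guide_key, number_run, seed=0):
    """Sketch: shuffle only the small guide-assignment array in place and
    restore it afterwards, instead of deep-copying the whole mdata."""
    rng = random.Random(seed)
    original = list(prog_obsm[guide_key])      # small array, cheap snapshot
    results = []
    try:
        for _ in range(number_run):
            # Overwrite only the mutated key; the rna modality is untouched.
            prog_obsm[guide_key] = rng.sample(original, len(original))
            results.append(sum(prog_obsm[guide_key]))  # placeholder test stat
    finally:
        prog_obsm[guide_key] = original        # always restore real labels
    return results
```

Because only the small guide-assignment array is snapshotted and restored, memory stays O(one small array) regardless of `number_run`, with no `gc.collect()` needed.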