Reduced memory and debugged reverse AD by cgiovanetti · Pull Request #22 · TonyZhou729/ABCMB

cgiovanetti · 2026-05-03T18:09:28Z

Three major fixes:

Reduced memory usage dramatically. diffrax dense output turns out to be a memory hog, so it is dropped wherever we weren't actually taking advantage of it. The spectral integrals are also refactored: rather than building an (Nlna, Nk) tensor for each ell, we now lax.scan over 1D (Nk,) accumulators. With both changes, peak memory drops from ~5 GB to a few hundred MB, with no loss of speed or accuracy.
Added a reverse AD option (the adjoint option at initialization). Note that, unfortunately, it cannot go in the specs dict and has to be passed as its own argument to Model. This also adds checkpointing, since even with the memory reductions above reverse AD was still pulling >20 GB per gradient.
Fixed a bug preventing reverse AD. There's a known JAX gotcha with nested jnp.where and reverse AD: if one branch of a jnp.where is infinite/NaN, the gradient comes out NaN even when that branch is not the one selected. Fixed here.
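For reviewers who haven't hit the jnp.where gotcha before, here is a minimal sketch of it and of the usual "double where" workaround. The functions `naive` and `safe` are hypothetical illustrations, not code from ABCMB, and the actual fix in this PR may look different.

```python
import jax
import jax.numpy as jnp

def naive(x):
    # Both branches are evaluated and differentiated. At x = 0 the unused
    # sqrt branch has an infinite derivative, and 0 * inf = NaN leaks into
    # the gradient even though the forward value is fine.
    return jnp.where(x > 0.0, jnp.sqrt(x), 0.0)

def safe(x):
    # Inner where: give sqrt a harmless input wherever its branch is unused.
    x_safe = jnp.where(x > 0.0, x, 1.0)
    # Outer where: select the branch as before.
    return jnp.where(x > 0.0, jnp.sqrt(x_safe), 0.0)

g_naive = jax.grad(naive)(0.0)  # NaN
g_safe = jax.grad(safe)(0.0)    # finite
```

The forward values of `naive` and `safe` are identical everywhere; only the gradient differs, which is why this kind of bug tends to show up only once reverse AD is switched on.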

Replace the four (Nlna, Nk) integrand materialisations in Cl_one_ell with a single lax.scan over lna_axis carrying four (Nk,) running sums. Under the outer vmap over lensing_ells_indices (99 entries on the default ell grid), the (Nell, Nlna, Nk) 3D tensors that XLA was materialising (rather than fusing) across the vmap are gone.

Full-pipeline GPU peak on fiducial LCDM drops from 5.813 GiB to 0.454 GiB (-12.8x); the SS.get_Cl contribution collapses from 5.39 GiB to 35 MiB (~150x inside get_Cl). Wall-clock is unchanged (full warm +1.1%, inside measurement noise). ClTT/TE/EE at probe ells {2, 30, 200, 1000, 2000} agree with baseline to max rel 1.46e-13, i.e. ULP-level drift from reordered summation in float64 at Nlna=499, far below the CLASS accuracy-test tolerance.

Bessel-table invariants (xphi{0,1,2}_tab columns, bessel_l_tab[idx], column min/max bounds) are hoisted out of the scan body and captured by phi{0,1,2}_local closures. chi = jnp.outer(tau0-tau, k_axis) is no longer materialised; chi_l = (tau0 - tau[i]) * k_axis is a (Nk,) vector per scan iter. Accumulators use dtype=sourceT0.dtype rather than zeros_like(k_axis) to avoid silent downcast when k_axis is float32 from geomspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
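The pattern described in this commit message can be sketched roughly as follows, assuming a toy integrand. The `jnp.cos` kernel stands in for the Bessel-table lookups, and `integrate_materialised`, `integrate_scan`, `src`, and `w` are hypothetical names rather than the ones used in Cl_one_ell; only a single accumulator is carried here instead of four.

```python
import jax.numpy as jnp
from jax import lax

def integrate_materialised(tau, src, w, k_axis, tau0):
    # Builds the full (Nlna, Nk) integrand before reducing over lna.
    chi = jnp.outer(tau0 - tau, k_axis)            # (Nlna, Nk)
    return jnp.sum(w[:, None] * src * jnp.cos(chi), axis=0)

def integrate_scan(tau, src, w, k_axis, tau0):
    # Carries one (Nk,) running sum; only one lna row is live per step.
    def body(acc, xs):
        tau_i, src_i, w_i = xs
        chi_i = (tau0 - tau_i) * k_axis            # (Nk,) per iteration
        return acc + w_i * src_i * jnp.cos(chi_i), None

    # Take the accumulator dtype from the source, not from k_axis.
    acc0 = jnp.zeros(k_axis.shape, dtype=src.dtype)
    acc, _ = lax.scan(body, acc0, (tau, src, w))
    return acc

# Toy grids (illustrative sizes only).
n_lna, n_k = 64, 16
tau = jnp.linspace(0.0, 1.0, n_lna)
k_axis = jnp.geomspace(0.1, 10.0, n_k)
src = jnp.exp(-tau)[:, None] * jnp.ones((1, n_k))  # (Nlna, Nk)
w = jnp.full((n_lna,), 1.0 / (n_lna - 1))          # flat quadrature weights

full = integrate_materialised(tau, src, w, k_axis, 1.0)
scanned = integrate_scan(tau, src, w, k_axis, 1.0)
```

The two results agree up to floating-point reordering of the sum, which is the same kind of ULP-level drift reported above. The memory saving comes from never holding the (Nlna, Nk) intermediate, which matters most once an outer vmap over ell adds a third axis.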

Cara Giovanetti and others added 8 commits April 21, 2026 12:01

no more dense output significantly reduces memory

93e30d5

added reverse AD kwarg

fe017a0

remove lax.scan option from spectrum, vmap now faster

5882634

functioning reverse AD!

a2ded28

better comments

23c0b0d

better comments

5fc7007

increment version

8ab5500

cgiovanetti requested a review from TonyZhou729 May 3, 2026 18:09

TonyZhou729 merged commit 4b0a9f0 into main May 4, 2026
1 check passed

cgiovanetti deleted the mem_quick branch May 4, 2026 19:59

cgiovanetti mentioned this pull request May 4, 2026

High memory usage for stiff fluids #16

Closed

TonyZhou729 merged 8 commits into main from mem_quick
