Switch torch dependency from ~=2.9.1 to ~=2.10.0 (silent bfloat16 memory regression) by hanaol · Pull Request #118 · Quantum-Accelerators/electrai

hanaol · 2026-04-09T15:30:37Z

Summary

This PR updates the torch dependency from ~=2.9.1 to ~=2.10.0 to fix a silent bfloat16 memory regression introduced in torch 2.9.0.

The problem

torch 2.9.0 and 2.9.1 contain a cuDNN regression that inflates the nn.Conv3d bfloat16 forward-pass workspace by 26x -- from ~77 MB to ~2,053 MB -- relative to both the preceding (2.8.0) and following (2.10.0) releases. These numbers were measured on a fixed tensor of shape [1, 32, 64, 64, 64] with a Conv3d(in=32, out=32, k=5, padding=2) layer. float32 memory is completely unaffected (stable at ~123 MB across all versions), confirming the bug is specific to the bfloat16 cuDNN kernel selection path.

This matters because we use (or plan to use) bf16-mixed precision training. This regression would silently consume an extra ~2 GB per Conv3d layer, directly undermining the memory savings that bf16 is supposed to provide -- without any crash or warning.

This issue has been raised in the PyTorch community:

Benchmark results (A100-SXM4-80GB, CUDA 12.8, input shape [1, 32, 64, 64, 64])

Peak GPU memory — `float32`

torch	cuDNN	Fwd peak (MB)	Bwd peak (MB)
2.8.0	9.1.0.2	123	212
2.9.0	9.1.0.2	123	212
2.9.1	9.1.0.2	123	212
2.10.0	9.1.0.2	123	212
2.11.0	9.1.9.0	123	209

Peak GPU memory — `bfloat16`

torch	cuDNN	Fwd peak (MB)	Bwd peak (MB)
2.8.0	9.1.0.2	77	111
2.9.0	9.1.0.2	2053	2081
2.9.1	9.1.0.2	2053	2081
2.10.0	9.1.0.2	77	111
2.11.0	9.1.9.0	77	111

The benchmark script is included at scripts/benchmark_conv3d_memory.py and can be run standalone on any CUDA node.

Decision: 2.10.0 vs 2.11.0

Both 2.10.0 and 2.11.0 are clean. This PR pins to 2.10.0 for now. Upgrading to 2.11.0 is possible but introduces a CUDA 13.0 dependency (vs 12.8 for all prior versions), which pulls in a new set of nvidia-*-cu13 libraries and we have not tested it against our full stack (lightning, etc.). Once our ecosystem catches up to CUDA 13.0, bumping to ~=2.11.0 is an option worth revisiting.

Files changed

pyproject.toml -- torch~=2.9.1 to torch~=2.10.0
uv.lock -- regenerated
scripts/benchmark_conv3d_memory.py -- standalone benchmark used to produce the results above

hanaol · 2026-04-09T15:34:58Z

@claude review this PR 👀 👀 👀

claude · 2026-04-09T15:35:33Z

Claude finished @hanaol's task in 2m 33s —— View job

PR Review: Switch torch ~=2.9.1 → ~=2.10.0

Summary: The fix is correct and well-motivated. The bfloat16 Conv3d regression in torch 2.9.x is real (pytorch/pytorch#166643), and 2.10.0 cleanly resolves it. The benchmark methodology is solid. A few things worth flagging below.

pyproject.toml

The one-line change is correct:

-"torch~=2.9.1",
+"torch~=2.10.0",

~=2.10.0 (compatible release) pins to >=2.10.0, <2.11.0, which is exactly the right constraint — narrow enough to avoid 2.11's CUDA 13.0 dependency, wide enough to pick up any 2.10.x patch. No issues here.

benchmark script (`scripts/benchmark_conv3d_memory.py`)

Overall well-written. A few observations:

Good:

Clear module docstring and inline comments
Warmup iterations before measurement — correct practice
Averages over BENCH_ITERS=5 for stable timing
cuDNN version decoding handles both the pre-9 and post-9 integer formats correctly
Uses Path.open() and proper JSON output
Structured results with full GPU metadata (SM count, VRAM)

Minor issues:

--output is required (line 106) — the script exits with code 1 when CUDA is unavailable (correct), but fails with an argparse error if --output is omitted. Consider a sensible default like results.json so it can be run with just python benchmark_conv3d_memory.py. Low severity.
Backward memory semantics (lines 79–88): reset_peak_memory_stats() resets the peak counter to zero (not to the current allocation level). This means bwd_peak_mb captures the total peak memory during the backward pass — including the activations already allocated by the fresh out = conv(x) forward. So it's not purely the backward-specific allocation. The table in the PR description labels the columns as "bf16 bwd (MB)" which is slightly ambiguous, but since the methodology is consistent across all torch versions, the relative comparison is valid. Worth adding a comment in the script clarifying this.
out from forward loop leaks into backward loop (line 81): The last out tensor from the forward benchmark loop is still live when the backward loop starts — but it gets immediately overwritten by out = conv(x), so there's no actual memory leak. Just a readability nit.

uv.lock

The changes look exactly right: torch 2.9.1 wheel hashes replaced with torch 2.10.0 equivalents across all platform/Python combinations. The build suffix changed from plain torch-2.9.1-... to torch-2.10.0-2-... and torch-2.10.0-3-... (PyPI build numbers), which is expected.

Commit history

The branch has three commits but two of them are noisy:

45f70f6 — "modules to check later": adds cudnn.benchmark = False + cudnn.deterministic = True to main.py, plus the actual pyproject.toml/uv.lock change
b11cc87 — "modified main.py": immediately reverts those two main.py lines

The net effect on main.py is zero, but the history carries exploratory cruft with unhelpful messages. This is fine for a feature branch, but squashing to a single clean commit before merge would make the history easier to read. Not a blocker.

Verdict

Approve with minor suggestions. The core change is correct, minimal, and backed by reproducible evidence. The benchmark script is a valuable addition. The commit history and --output default are polish items, not blockers.

…semantics, drop stale ref - Make `--output` default to `results.json` so the script runs standalone without args. - Add comment explaining that `bwd_peak_mb` reflects peak *during* backward (includes fresh-forward activations held by the autograd graph); relative comparisons across torch versions remain valid. - `del out` after the forward loop so the stale reference doesn't linger into the backward loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

forklady42

I addressed the minor comments about the benchmark script and reviewed myself

Claude's point about commit history is reasonable, but I'm not concerned as they'll be squashed when merging to main. Could be helpful to keep in mind in the future.

Good to merge ✅

Hananeh Oliaei added 3 commits April 9, 2026 09:45

modules to check later

45f70f6

compare memory usage across torch versions

46d00af

modified main.py

b11cc87

forklady42 and others added 2 commits April 17, 2026 11:33

Merge branch 'main' into hanaol/torch-version-upgrade

36ca6a4

forklady42 approved these changes Apr 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch torch dependency from ~=2.9.1 to ~=2.10.0 (silent bfloat16 memory regression)#118

Switch torch dependency from ~=2.9.1 to ~=2.10.0 (silent bfloat16 memory regression)#118
hanaol wants to merge 5 commits intomainfrom
hanaol/torch-version-upgrade

hanaol commented Apr 9, 2026 •

edited

Loading

Uh oh!

hanaol commented Apr 9, 2026

Uh oh!

claude bot commented Apr 9, 2026 •

edited

Loading

Uh oh!

forklady42 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hanaol commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The problem

Benchmark results (A100-SXM4-80GB, CUDA 12.8, input shape [1, 32, 64, 64, 64])

Peak GPU memory — float32

Peak GPU memory — bfloat16

Decision: 2.10.0 vs 2.11.0

Files changed

Uh oh!

hanaol commented Apr 9, 2026

Uh oh!

claude bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Switch torch ~=2.9.1 → ~=2.10.0

pyproject.toml

benchmark script (scripts/benchmark_conv3d_memory.py)

uv.lock

Commit history

Verdict

Uh oh!

forklady42 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hanaol commented Apr 9, 2026 •

edited

Loading

Peak GPU memory — `float32`

Peak GPU memory — `bfloat16`

claude bot commented Apr 9, 2026 •

edited

Loading

benchmark script (`scripts/benchmark_conv3d_memory.py`)