f32 vs bf16-mixed benchmark #119
Conversation

@hanaol: @claude review this PR.
Claude finished @hanaol's task in 1m 9s.

PR Review: f32 vs bf16-mixed benchmark
Overall this is a clean, well-structured benchmark script. The incremental JSON saving, CUDA event timing, and memory reset logic are all solid. A few issues worth fixing before merge:

Bugs

1. Wrong script name in the docstring. The usage example on line 11 references the wrong script name; it should be:

```python
# uv run python scripts/benchmark_precision.py \
```

2. Wrong return type annotation on `get_gpu_info`. The function is annotated as returning `str` but actually returns a dict:

```python
def get_gpu_info() -> str:  # ← wrong annotation
    props = torch.cuda.get_device_properties(DEVICE)
    return {  # ← actually returns a dict
        "name": props.name,
        ...
    }
```

The annotation should be `-> dict`.
}Memory cleanup after OOM3. After an OOM inside the batch loop, finally:
# Also free batch-scope tensors if they exist
for var in ("pred", "loss", "x", "y"):
if var in locals():
del locals()[var] # won't work — use explicit try/except or reassign to None
del model, optimizer, loss_fn, dataset, loader
gc.collect()
torch.cuda.empty_cache()A clean pattern is to assign the variables to model = optimizer = loss_fn = dataset = loader = None
pred = loss = x = y = None
...
finally:
del model, optimizer, loss_fn, dataset, loader, pred, loss, x, y
gc.collect()
torch.cuda.empty_cache()Minor4. Silent exception swallowing in except Exception as _:
grid_shapes[tid] = NoneThe 5. Hardcoded 6. The benchmark results table in the PR description is very clear and the methodology (CUDA event timing, per-experiment memory reset, incremental JSON saves) is sound. Fixing issues 1–3 above is recommended before merge. |
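The `locals()` pitfall in issue 3 is easy to demonstrate in plain Python: inside a function, CPython's `locals()` returns a snapshot dict, so deleting a key from it never unbinds the actual local variable. A minimal sketch (the function names here are illustrative, not from the PR):

```python
def broken_cleanup():
    pred = [0] * 1000
    for var in ("pred",):
        if var in locals():
            del locals()[var]  # deletes from a snapshot dict; the real local survives
    return "pred" in locals()  # still True: nothing was freed


def fixed_cleanup():
    pred = [0] * 1000
    pred = None  # rebinding drops the only reference, so the list becomes collectable
    return pred is None
```

Rebinding to `None` (as in the suggested fix) works because it is the reference count, not the name, that keeps GPU/CPU memory alive.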
Summary
This PR adds a benchmark script that trains one sample (task ID) at a time for 3 epochs under both f32 and bf16-mixed precision on a single A100 GPU, recording peak GPU memory, forward/backward times, and OOM status for 10 large-grid samples.
The problem
The 10 task IDs selected for this benchmark are Materials Project entries with relatively large charge-density grids, spanning 3.4M–46.7M voxels across a variety of shapes and aspect ratios. It was not previously known which samples would fit on an A100 (79.3 GB) under f32 vs bf16-mixed, or how training time scales with grid volume. This script establishes those baselines.
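As a back-of-envelope sanity check on why precision matters at these sizes (my arithmetic, not figures from the PR): a dense tensor over the largest grid costs twice as much in f32 as in bf16, and each 32-channel activation map multiplies that by the channel count.

```python
voxels = 46.7e6  # largest grid in the benchmark set

f32_gb = voxels * 4 / 1e9   # 4 bytes per f32 voxel  -> ~0.187 GB
bf16_gb = voxels * 2 / 1e9  # 2 bytes per bf16 voxel -> ~0.093 GB

n_channels = 32  # matches the benchmark config
f32_activation_gb = f32_gb * n_channels  # one 32-channel f32 activation map, ~5.98 GB
```

Actual peak usage also depends on network depth, gradients, and optimizer state, so this only bounds a single tensor, not the whole training footprint.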
Benchmark results (A100-SXM4-80GB, CUDA 12.8)
Config: `n_channels=32, n_residual_blocks=1, kernel_size=5, depth=2, batch_size=1`, single GPU, 3 epochs per experiment.

Summary
Key findings
Files
`scripts/benchmark_precision.py`: runs 3-epoch single-GPU training for each (task_id, precision) pair, recording CUDA-event forward/backward times and peak GPU memory.
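Since the script saves results incrementally after each experiment, it is worth making each save atomic so an OOM or kill mid-write cannot corrupt earlier results. A minimal sketch of such a helper (hypothetical, not the script's actual code): write to a temp file, then `os.replace`, which is an atomic rename on POSIX.

```python
import json
import os


def save_results(results: list, path: str) -> None:
    """Atomically persist benchmark results after each experiment."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(results, f, indent=2)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file
```

With this pattern, the results file on disk is always a complete, valid JSON document from the last finished experiment.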