
f32 vs bf16-mixed benchmark#119

Open
hanaol wants to merge 2 commits into hanaol/torch-version-upgrade from hanaol/mixed-precision

Conversation

@hanaol
Collaborator

@hanaol hanaol commented Apr 13, 2026

Summary

This PR adds a benchmark script that trains one sample (task ID) at a time for 3 epochs under both f32 and bf16-mixed precision on a single A100 GPU, recording peak GPU memory, forward/backward times, and OOM status for 10 large-grid samples.

The problem

The 10 task IDs selected for this benchmark are Materials Project entries with relatively large charge-density grids, spanning 3.4 M – 46.7 M voxels across a variety of shapes and aspect ratios. It was not previously known which samples would fit on an A100 (79.3 GB) under f32 vs bf16-mixed, or how training time scales with grid volume. This script establishes those baselines.

Benchmark results (A100-SXM4-80GB, CUDA 12.8)

Config: n_channels=32, n_residual_blocks=1, kernel_size=5, depth=2, batch_size=1, single GPU, 3 epochs per experiment.

| Task ID | Grid shape | Voxels | f32 status | f32 peak (GB) | f32 epoch (s) | bf16 status | bf16 peak (GB) | bf16 epoch (s) | Mem ratio |
|---|---|---|---|---|---|---|---|---|---|
| mp-1890579 | 56 × 56 × 1080 | 3.4 M | ✅ | 10.5 | 1.65 | ✅ | 5.5 | 1.07 | 1.91× |
| mp-1849767 | 60 × 60 × 1120 | 4.0 M | ✅ | 12.4 | 1.95 | ✅ | 6.5 | 1.27 | 1.91× |
| mp-1851604 | 60 × 60 × 1120 | 4.0 M | ✅ | 12.4 | 1.93 | ✅ | 6.5 | 1.25 | 1.91× |
| mp-1862536 | 80 × 80 × 1024 | 6.6 M | ✅ | 19.9 | 3.17 | ✅ | 10.4 | 1.94 | 1.91× |
| mp-1847208 | 1120 × 84 × 84 | 7.9 M | ✅ | 23.9 | 3.87 | ✅ | 12.5 | 2.46 | 1.91× |
| mp-1936557 | 80 × 756 × 216 | 13.1 M | ✅ | 39.1 | 6.47 | ✅ | 20.4 | 3.73 | 1.91× |
| mp-1850168 | 972 × 240 × 128 | 29.9 M | ❌ OOM | 70.2 | — | ✅ | 46.3 | 8.77 | — |
| mp-1887804 | 320 × 320 × 320 | 32.8 M | ❌ OOM | 68.8 | — | ✅ | 50.6 | 169.91 | — |
| mp-1889246 | 540 × 144 × 432 | 33.6 M | ❌ OOM | 70.6 | — | ✅ | 51.9 | 208.94 | — |
| mp-1871122 | 360 × 360 × 360 | 46.7 M | ❌ OOM | 71.4 | — | ❌ OOM | 71.9 | — | — |

Completion summary

| Precision | ✅ Completed | ❌ OOM |
|---|---|---|
| f32 | 6 / 10 | 4 / 10 |
| bf16-mixed | 9 / 10 | 1 / 10 (mp-1871122, 360³) |

Key findings

  • Memory: bf16-mixed yields a consistent 1.91× reduction in peak GPU memory for all grids that fit under both precisions.
  • Speed: bf16-mixed is 1.5–1.7× faster per epoch on grids ≤ 13 M voxels; the backward pass benefits the most (~2.2×).
  • OOM threshold on A100: ~13 M voxels under f32; ~33 M voxels under bf16-mixed.
  • Largest grid (360³, 46.7 M voxels): exceeds A100 capacity even with bf16-mixed.
  • Very large grids (320³, 540×144×432): fit under bf16-mixed but epoch times are 170–209 s, dominated by the backward pass (~144–176 s through large skip connections).
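A back-of-envelope check makes the OOM thresholds above concrete. Assuming peak memory scales roughly linearly with voxel count, the per-million-voxel cost fitted from the 13.1 M-voxel row (≈ 2.98 GB under f32, ≈ 1.56 GB under bf16-mixed) reproduces the observed boundaries; this is a rough extrapolation from the table, not a model of the CUDA allocator:

```python
A100_GB = 79.3  # usable A100 capacity reported above

# Peak GB per million voxels, fitted from the 13.1 M-voxel row of the table
GB_PER_MVOX = {"f32": 39.1 / 13.1, "bf16-mixed": 20.4 / 13.1}  # ≈ 2.98 / 1.56

def predicted_peak_gb(mvoxels, precision):
    """Linear extrapolation of peak GPU memory from the measured rows."""
    return mvoxels * GB_PER_MVOX[precision]

# f32 at 29.9 M voxels predicts ~89 GB, well past capacity — matches the OOM
assert predicted_peak_gb(29.9, "f32") > A100_GB
# bf16-mixed at 33.6 M predicts ~52 GB (measured 51.9) — comfortably inside
assert predicted_peak_gb(33.6, "bf16-mixed") < A100_GB
# 46.7 M predicts ~73 GB, close enough to the 79.3 GB limit that allocator
# overhead and fragmentation push it over, consistent with the observed OOM
assert 70 < predicted_peak_gb(46.7, "bf16-mixed") < A100_GB
```

The 46.7 M-voxel case shows the limit of the linear fit: the nominal prediction lands just under capacity, but the run still OOMs, so a few GB of headroom should be budgeted near the boundary.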

Files

  • scripts/benchmark_precision.py -- Runs 3-epoch single-GPU training for each (task_id, precision) pair, recording CUDA-event forward/backward times and peak GPU memory
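The CUDA-event timing pattern the script is described as using can be sketched as follows. This is a minimal illustration, not the script's actual code: `timed_step` and its signature are invented for the example, and a `perf_counter` fallback is added so the sketch also runs on CPU.

```python
import time
import torch

def timed_step(model, batch, loss_fn, optimizer):
    """One training step; returns (forward_s, backward_s).

    Uses CUDA events on GPU (correct for async kernels); falls back to
    time.perf_counter on CPU.
    """
    x, y = batch
    if x.is_cuda:
        start, mid, end = (torch.cuda.Event(enable_timing=True) for _ in range(3))
        start.record()
        loss = loss_fn(model(x), y)
        mid.record()
        loss.backward()
        end.record()
        torch.cuda.synchronize()  # events are async; sync before reading times
        fwd = start.elapsed_time(mid) / 1000.0   # elapsed_time is in ms
        bwd = mid.elapsed_time(end) / 1000.0
    else:
        t0 = time.perf_counter()
        loss = loss_fn(model(x), y)
        t1 = time.perf_counter()
        loss.backward()
        t2 = time.perf_counter()
        fwd, bwd = t1 - t0, t2 - t1
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return fwd, bwd
```

Peak memory would be read separately via `torch.cuda.max_memory_allocated()` after calling `torch.cuda.reset_peak_memory_stats()` at the start of each experiment, matching the per-experiment reset the review praises.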

@hanaol
Copy link
Copy Markdown
Collaborator Author

hanaol commented Apr 13, 2026

@claude review this PR.

@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Claude finished @hanaol's task in 1m 9s


PR Review: f32 vs bf16-mixed benchmark

  • Gather context
  • Read benchmark script
  • Provide review feedback

Overall this is a clean, well-structured benchmark script. The incremental JSON saving, CUDA event timing, and memory reset logic are all solid. A few issues worth fixing before merge:

Bugs

1. Wrong script name in docstring (scripts/benchmark_precision.py:11)

The usage example references benchmark_per_sample_precision.py but the file is benchmark_precision.py.

```python
# Line 11 — should be:
#     uv run python scripts/benchmark_precision.py \
```


2. get_gpu_info() return type annotation says str, returns dict (scripts/benchmark_precision.py:206)

```python
def get_gpu_info() -> str:   # ← wrong annotation
    props = torch.cuda.get_device_properties(DEVICE)
    return {                  # ← actually returns a dict
        "name": props.name,
        ...
    }
```


Memory cleanup after OOM

3. dataset, loader, and batch-scope tensors not cleaned in finally (scripts/benchmark_precision.py:185-188)

After an OOM inside the batch loop, dataset, loader, x, y, pred (if forward completed before OOM), and potentially loss all still hold GPU memory. The finally block only deletes model, optimizer, loss_fn, so subsequent experiments start with residual allocations that can trigger spurious OOMs or inflate peak memory readings.

```python
finally:
    # Tempting but broken: deleting keys from the dict returned by locals()
    # does not unbind the local variables, so the tensors stay referenced.
    for var in ("pred", "loss", "x", "y"):
        if var in locals():
            del locals()[var]
    del model, optimizer, loss_fn, dataset, loader
    gc.collect()
    torch.cuda.empty_cache()
```

A clean pattern is to assign the variables to None at the top of run_experiment so they're always in scope for the finally:

```python
model = optimizer = loss_fn = dataset = loader = None
pred = loss = x = y = None
...
finally:
    del model, optimizer, loss_fn, dataset, loader, pred, loss, x, y
    gc.collect()
    torch.cuda.empty_cache()
```


Minor

4. Silent exception swallowing in main() (scripts/benchmark_precision.py:230)

```python
except Exception as _:
    grid_shapes[tid] = None
```

The _ convention signals "intentionally ignored" but any exception (bad zarr path, wrong key name, permission error) silently becomes None. Consider at minimum a print/warnings.warn so failures are visible in the benchmark log.
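A minimal version of the suggested change might look like this. The helper name `load_grid_shape` and the `lookup` callable are invented for illustration; the real call site is inside `main()`:

```python
import warnings

def load_grid_shape(tid, lookup):
    """Return the grid shape for a task ID, warning instead of failing silently."""
    try:
        return lookup(tid)
    except Exception as exc:  # still broad, but no longer silent
        warnings.warn(f"could not read grid shape for {tid}: {exc!r}")
        return None
```

This keeps the benchmark running past bad entries while making each failure visible in the log, which is usually the right trade-off for a long-running sweep.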

5. Hardcoded DEVICE — No CLI option to override the device (e.g. --device cuda:1). Fine for a single-GPU script but worth noting if multi-GPU nodes are used.

6. GradScaler not used with bf16 — This is correct. Unlike fp16, bf16 has sufficient dynamic range that GradScaler is not needed. Worth a one-line comment for future readers though.
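The dynamic-range point can be shown in plain numbers. The exponent widths below come from the IEEE 754 half-precision and bfloat16 formats, not from the script:

```python
# fp16 has a 5-bit exponent; its smallest normal value is 2**-14 ≈ 6.1e-5,
# so small gradients underflow to zero unless GradScaler rescales the loss.
fp16_min_normal = 2.0 ** -14

# bf16 keeps fp32's 8-bit exponent; its smallest normal is 2**-126 ≈ 1.2e-38,
# so any gradient representable in fp32 keeps its magnitude in bf16.
bf16_min_normal = 2.0 ** -126

tiny_gradient = 1e-6
assert tiny_gradient < fp16_min_normal   # would underflow to zero in fp16
assert tiny_gradient > bf16_min_normal   # still representable in bf16
```

A one-line comment to this effect next to the precision setup would save future readers from "adding back" a GradScaler that was deliberately omitted.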


The benchmark results table in the PR description is very clear and the methodology (CUDA event timing, per-experiment memory reset, incremental JSON saves) is sound. Fixing issues 1–3 above is recommended before merge.
