f32 vs bf16-mixed benchmark #119
Conversation

@hanaol: @claude review this PR.
Claude finished @hanaol's task in 1m 9s.

PR Review: f32 vs bf16-mixed benchmark
Overall this is a clean, well-structured benchmark script. The incremental JSON saving, CUDA event timing, and memory reset logic are all solid. A few issues worth fixing before merge:

Bugs

1. Wrong script name in the docstring. The usage example on line 11 references the wrong script name; it should be:

```python
# uv run python scripts/benchmark_precision.py \
```

2. Wrong return type annotation on `get_gpu_info`. The function is annotated as returning `str` but actually returns a dict:

```python
def get_gpu_info() -> str:  # ← wrong annotation
    props = torch.cuda.get_device_properties(DEVICE)
    return {  # ← actually returns a dict
        "name": props.name,
        ...
    }
```

The annotation should be `-> dict`.
}Memory cleanup after OOM3. After an OOM inside the batch loop, finally:
# Also free batch-scope tensors if they exist
for var in ("pred", "loss", "x", "y"):
if var in locals():
del locals()[var] # won't work — use explicit try/except or reassign to None
del model, optimizer, loss_fn, dataset, loader
gc.collect()
torch.cuda.empty_cache()A clean pattern is to assign the variables to model = optimizer = loss_fn = dataset = loader = None
pred = loss = x = y = None
...
finally:
del model, optimizer, loss_fn, dataset, loader, pred, loss, x, y
gc.collect()
torch.cuda.empty_cache()Minor4. Silent exception swallowing in except Exception as _:
grid_shapes[tid] = NoneThe 5. Hardcoded 6. The benchmark results table in the PR description is very clear and the methodology (CUDA event timing, per-experiment memory reset, incremental JSON saves) is sound. Fixing issues 1–3 above is recommended before merge. |
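The `locals()` pitfall in issue 3 is easy to demonstrate in plain Python: inside a function, CPython's `locals()` returns a snapshot dict, so deleting a key from it never unbinds the actual local variable. A minimal sketch (the function names here are illustrative, not from the PR):

```python
def broken_cleanup():
    pred = [0] * 1000
    for var in ("pred",):
        if var in locals():
            del locals()[var]  # deletes from a snapshot dict; the real local survives
    return "pred" in locals()  # still True: nothing was freed


def fixed_cleanup():
    pred = [0] * 1000
    pred = None  # rebinding drops the only reference, so the list becomes collectable
    return pred is None
```

Rebinding to `None` (as in the suggested fix) works because it is the reference count, not the name, that keeps GPU/CPU memory alive.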
Summary
This PR adds a benchmark script that trains one sample (task ID) at a time for 3 epochs under both f32 and bf16-mixed precision on a single A100 GPU, recording peak GPU memory, forward/backward times, and OOM status for 10 large-grid samples.
The problem
The 10 task IDs selected for this benchmark are Materials Project entries with relatively large charge-density grids, spanning 3.4M–46.7M voxels across a variety of shapes and aspect ratios. It was not previously known which samples would fit on an A100 (79.3 GB) under f32 vs bf16-mixed, or how training time scales with grid volume. This script establishes those baselines.
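As a back-of-envelope sanity check on why precision matters at these sizes (my arithmetic, not figures from the PR): a dense tensor over the largest grid costs twice as much in f32 as in bf16, and each 32-channel activation map multiplies that by the channel count.

```python
voxels = 46.7e6  # largest grid in the benchmark set

f32_gb = voxels * 4 / 1e9   # 4 bytes per f32 voxel  -> ~0.187 GB
bf16_gb = voxels * 2 / 1e9  # 2 bytes per bf16 voxel -> ~0.093 GB

n_channels = 32  # matches the benchmark config
f32_activation_gb = f32_gb * n_channels  # one 32-channel f32 activation map, ~5.98 GB
```

Actual peak usage also depends on network depth, gradients, and optimizer state, so this only bounds a single tensor, not the whole training footprint.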
Benchmark results (A100-SXM4-80GB, CUDA 12.8)
Config: `n_channels=32, n_residual_blocks=1, kernel_size=5, depth=2, batch_size=1`, single GPU, 3 epochs per experiment.

Summary
Key findings
Files
`scripts/benchmark_precision.py`: runs 3-epoch single-GPU training for each (task_id, precision) pair, recording CUDA-event forward/backward times and peak GPU memory.
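Since the script saves results incrementally after each experiment, it is worth making each save atomic so an OOM or kill mid-write cannot corrupt earlier results. A minimal sketch of such a helper (hypothetical, not the script's actual code): write to a temp file, then `os.replace`, which is an atomic rename on POSIX.

```python
import json
import os


def save_results(results: list, path: str) -> None:
    """Atomically persist benchmark results after each experiment."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(results, f, indent=2)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file
```

With this pattern, the results file on disk is always a complete, valid JSON document from the last finished experiment.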