
Benchmark A100 vs H200 GPUs#120

Open

hanaol wants to merge 2 commits into hanaol/mixed-precision from hanaol/benchmark-a100-vs-h200

Conversation

hanaol (Collaborator) commented Apr 13, 2026

Summary

This adds a GPU comparison benchmark script that runs the same per-sample training experiment on both an A100 and an H200, recording peak GPU memory, forward/backward times, and OOM status for 10 large-grid Materials Project samples under f32 and bf16-mixed precision.

The 10 task IDs are Materials Project entries with relatively large charge-density grids, spanning 3.4 M – 46.7 M voxels across a variety of shapes and aspect ratios.

Benchmark results

Model: resunet.ResUNet3D, n_channels=32, n_residual_blocks=1, kernel_size=5, depth=2, batch_size=1, single GPU, 3 epochs per experiment.

f32

| Task ID | Grid shape | Voxels | A100 status | A100 peak (GB) | A100 epoch (s) | H200 status | H200 peak (GB) | H200 epoch (s) | Speedup (H200/A100) |
|---|---|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| mp-1890579 | 56 × 56 × 1080 | 3.4 M | ✅ | 10.50 | 1.65 | ✅ | 10.80 | 1.13 | 1.46× |
| mp-1849767 | 60 × 60 × 1120 | 4.0 M | ✅ | 12.42 | 1.95 | ✅ | 12.80 | 1.28 | 1.52× |
| mp-1851604 | 60 × 60 × 1120 | 4.0 M | ✅ | 12.42 | 1.93 | ✅ | 12.80 | 1.29 | 1.50× |
| mp-1862536 | 80 × 80 × 1024 | 6.6 M | ✅ | 19.89 | 3.17 | ✅ | 20.56 | 2.19 | 1.45× |
| mp-1847208 | 1120 × 84 × 84 | 7.9 M | ✅ | 23.89 | 3.87 | ✅ | 24.73 | 2.46 | 1.57× |
| mp-1936557 | 80 × 756 × 216 | 13.1 M | ✅ | 39.09 | 6.47 | ✅ | 40.53 | 3.94 | 1.64× |
| mp-1850168 | 972 × 240 × 128 | 29.9 M | ❌ OOM | — | — | ✅ | 91.97 | 8.67 | — |
| mp-1887804 | 320 × 320 × 320 | 32.8 M | ❌ OOM | — | — | ✅ | 100.55 | 111.10 | — |
| mp-1889246 | 540 × 144 × 432 | 33.6 M | ❌ OOM | — | — | ✅ | 103.25 | 136.31 | — |
| mp-1871122 | 360 × 360 × 360 | 46.7 M | ❌ OOM | — | — | ✅ | 131.65 | 303.15 | — |

bf16-mixed

| Task ID | Grid shape | Voxels | A100 status | A100 peak (GB) | A100 epoch (s) | H200 status | H200 peak (GB) | H200 epoch (s) | Speedup (H200/A100) |
|---|---|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| mp-1890579 | 56 × 56 × 1080 | 3.4 M | ✅ | 5.50 | 1.07 | ✅ | 5.50 | 0.70 | 1.53× |
| mp-1849767 | 60 × 60 × 1120 | 4.0 M | ✅ | 6.50 | 1.27 | ✅ | 6.50 | 0.78 | 1.63× |
| mp-1851604 | 60 × 60 × 1120 | 4.0 M | ✅ | 6.50 | 1.25 | ✅ | 6.50 | 0.78 | 1.60× |
| mp-1862536 | 80 × 80 × 1024 | 6.6 M | ✅ | 10.40 | 1.94 | ✅ | 10.40 | 1.36 | 1.43× |
| mp-1847208 | 1120 × 84 × 84 | 7.9 M | ✅ | 12.49 | 2.46 | ✅ | 12.49 | 1.51 | 1.63× |
| mp-1936557 | 80 × 756 × 216 | 13.1 M | ✅ | 20.44 | 3.73 | ✅ | 20.44 | 2.64 | 1.41× |
| mp-1850168 | 972 × 240 × 128 | 29.9 M | ✅ | 46.28 | 8.77 | ✅ | 46.28 | 5.12 | 1.71× |
| mp-1887804 | 320 × 320 × 320 | 32.8 M | ✅ | 50.59 | 169.91 | ✅ | 50.59 | 109.60 | 1.55× |
| mp-1889246 | 540 × 144 × 432 | 33.6 M | ✅ | 51.95 | 208.94 | ✅ | 51.95 | 135.30 | 1.54× |
| mp-1871122 | 360 × 360 × 360 | 46.7 M | ❌ OOM | — | — | ✅ | 71.90 | 188.83 | — |

Summary

| Precision | A100 ✅ | A100 ❌ OOM | H200 ✅ | H200 ❌ OOM |
|---|:---:|:---:|:---:|:---:|
| f32 | 6 / 10 | 4 / 10 | 10 / 10 | 0 / 10 |
| bf16-mixed | 9 / 10 | 1 / 10 | 10 / 10 | 0 / 10 |

Key findings

  • H200 handles all 10 task IDs under both precisions — all A100 OOMs are resolved by the larger VRAM (139.8 GB).
  • H200 is consistently 1.4–1.7× faster per epoch across all grid sizes and precisions.

Files

  • scripts/benchmark_gpus.py -- Reads two JSON result files (one per GPU) produced by
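A minimal sketch of how such a script might index the two JSON result files by (task ID, precision) for pairwise lookup. The field names (`task_id`, `precision`) are assumptions about the result schema, not the script's actual keys:

```python
import json
from pathlib import Path


def load_index(path: Path) -> dict:
    """Index one GPU's results by (task_id, precision) for pairwise lookup.

    Assumes the file holds a JSON list of per-experiment result dicts.
    """
    results = json.loads(path.read_text())
    return {(r["task_id"], r["precision"]): r for r in results}
```

With two such indices, iterating over the union of keys makes it easy to render a row even when one GPU has no result (e.g. it OOMed before logging).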

hanaol (Collaborator, Author) commented Apr 13, 2026

@claude review this PR

claude bot commented Apr 13, 2026

Claude finished @hanaol's task in 1m 26s


Code Review: scripts/benchmark_gpus.py

  • Gather context
  • Read changed files
  • Review code and provide feedback

Overall this is a clean, well-structured comparison script. A few issues to flag:


Bug: when --out is omitted, the output is silently discarded

scripts/benchmark_gpus.py:135-140

```python
output = "\n".join(lines)

if args.out:
    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(output)
# no else: print(output)
```

The module docstring says the default is stdout, but when --out is not passed the output is computed and thrown away — nothing is printed.


Minor: Helper functions redefined on every loop iteration

scripts/benchmark_gpus.py:104-117

```python
for tid in all_task_ids:
    key = (tid, prec)
    r1 = idx1.get(key)
    r2 = idx2.get(key)

    def status(r): ...
    def peak_gb(r): ...
    def epoch_s(r): ...
```

status, peak_gb, and epoch_s don't close over any loop variable — they're pure helpers that get re-created on every iteration of the outer for prec and inner for tid loops. Move them outside the loops (e.g. as module-level helpers or before the lines = [] block).
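A sketch of the helpers hoisted to module level. The result-dict keys ("oom", "peak_mem_gb", "epoch_time_s") are assumptions about the schema, not the script's actual field names:

```python
# Module-level helpers: defined once, used by every row of every table.
def status(r):
    """Render a result dict as a status cell; None means no result recorded."""
    if r is None:
        return "—"
    return "❌ OOM" if r.get("oom") else "✅"


def peak_gb(r):
    """Peak GPU memory cell, blank when the run OOMed or is missing."""
    return f"{r['peak_mem_gb']:.2f}" if r and not r.get("oom") else "—"


def epoch_s(r):
    """Per-epoch time cell, blank when the run OOMed or is missing."""
    return f"{r['epoch_time_s']:.2f}" if r and not r.get("oom") else "—"
```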


Minor: Metadata table uses generic GPU labels

scripts/benchmark_gpus.py:78-81

```python
a("| GPU | Model | VRAM |")
a("|-----|-------|:----:|")
a(f"| GPU 1 (reference) | {gpu1_name} | {gpu1_info['total_mem_gb']} GB |")
a(f"| GPU 2 | {gpu2_name} | {gpu2_info['total_mem_gb']} GB |")
```

The label column uses the hardcoded strings "GPU 1 (reference)" / "GPU 2" while the per-precision tables use gpu1_name / gpu2_name. Consider using the actual GPU names here as well for consistency (e.g. f"| {gpu1_name} (reference) | ... |").


Nit: Ratio uses lowercase x instead of ×

scripts/benchmark_gpus.py:34

```python
return f"{a / b:.2f}x"
```

The PR description uses the proper multiplication sign ×. Using × here would make the markdown output match. Minor, but worth keeping consistent.
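A sketch of the ratio helper with the suggested sign, plus guards for the cases where a ratio can't be computed (the `ratio` name and the None/zero handling are illustrative assumptions, not the script's exact code):

```python
def ratio(a, b):
    """Format a/b as a speedup cell, e.g. '1.46×'; blank when undefined."""
    if a is None or b is None or b == 0:
        return "—"
    return f"{a / b:.2f}×"
```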


Suggestion: No summary table in output

The PR description includes a useful summary table (OOM counts per GPU/precision). The script doesn't emit one — consider adding it after the per-precision tables so the generated .md file is self-contained.
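One way to sketch such a summary table; the input shape (a dict mapping precision to the two GPUs' result lists) and the "oom" key are assumptions for illustration, not the script's actual data layout:

```python
def summary_table(results_by_precision):
    """Emit a markdown table of ✅ / ❌ OOM counts per GPU and precision.

    `results_by_precision` maps precision -> (gpu1_results, gpu2_results),
    where each side is a list of result dicts with a boolean "oom" field.
    """
    lines = [
        "| Precision | GPU 1 ✅ | GPU 1 ❌ OOM | GPU 2 ✅ | GPU 2 ❌ OOM |",
        "|---|:---:|:---:|:---:|:---:|",
    ]
    for prec, (rs1, rs2) in results_by_precision.items():
        n = len(rs1)
        oom1 = sum(1 for r in rs1 if r.get("oom"))
        oom2 = sum(1 for r in rs2 if r.get("oom"))
        lines.append(
            f"| {prec} | {n - oom1} / {n} | {oom1} / {n} | "
            f"{n - oom2} / {n} | {oom2} / {n} |"
        )
    return "\n".join(lines)
```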


Summary: One real bug (stdout never written), one style issue (nested helper functions), and two minor consistency points. The core logic — key-based lookup, ratio computation, grid shape display — looks correct.
