Add GPU FFT engine to xtraj (engine=gpu) #23

Open
jmholton wants to merge 8 commits into lanl:master from jmholton:gpu-engine
jmholton commented May 3, 2026

Summary

  • Adds engine=gpu to xtraj.py: GPU-accelerated structure factor calculation via cuFFT spreading instead of cctbx direct-sum
  • Ships precompiled GPU library (md2mtz/sfcalc_gpu.so) targeting NVIDIA Volta–Hopper (sm_70–sm_90), with bundled libcufft.so.11 — no CUDA toolkit needed at runtime
  • Three per-frame Python optimizations reduce GPU wall time by ~1.9× vs the initial implementation:
    • Pre-allocate FFT accumulators outside the frame loop
    • Pre-compute symmetry collapse indices and phase arrays once at setup
    • Apply auto-blur correction at 12.6M ASU points instead of the full 400M-point FFT grid

Performance (6c2r trajectory, 468k atoms, d_min=0.9, 1000 frames)

Engine         Wall time    Per frame
cctbx (CPU)    12.6 h       45.4 s
GPU (GV100)    2.4 h        8.6 s
Speedup        5.2×

Accuracy

Pearson CC between CPU and GPU diffuse intensities over 12,581,024 ASU reflections at d_min=0.9: 0.9999
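
For reference, the comparison metric can be reproduced along these lines. This is a minimal sketch, not the validation script behind the number above: `pearson_cc` is a hypothetical helper, and the two arrays are synthetic stand-ins for the CPU and GPU diffuse intensities.

```python
import numpy as np

def pearson_cc(a, b):
    """Pearson correlation coefficient between two equal-length 1-D arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    da = a - a.mean()
    db = b - b.mean()
    return float(np.dot(da, db) / np.sqrt(np.dot(da, da) * np.dot(db, db)))

# Synthetic stand-ins (hypothetical data, not the 6c2r results).
rng = np.random.default_rng(0)
i_cpu = rng.gamma(shape=2.0, scale=100.0, size=10_000)
i_gpu = i_cpu * (1.0 + 1e-3 * rng.standard_normal(i_cpu.size))

cc = pearson_cc(i_cpu, i_gpu)
print(f"CC = {cc:.4f}")
```

In a real comparison the two inputs would be the diffuse intensity columns from the CPU and GPU output, matched on the same ASU reflections.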

New files

  • md2mtz/sfcalc_gpu.so — compiled GPU kernel (ctypes interface)
  • md2mtz/libcufft.so.11 — bundled cuFFT runtime
  • md2mtz/sfcalc_gpu.cu — CUDA source
  • md2mtz/sfcalc_gpu_collapse.cpp — C++ standalone executable
  • md2mtz/compile_gpu.csh — build script (requires CUDA toolkit on voltron)
  • md2mtz/run_gpu_test.sh, run_compare_test.sh — local test scripts
  • md2mtz/slurm_compare_dmin09.sh, slurm_compare_6c2r.sh, slurm_compare_6c2r_1000.sh — SLURM validation scripts
  • md2mtz/README.md — GPU library documentation
  • CLAUDE.md — developer guide
  • README.md — updated user-facing docs

Also includes

Upstream fix (from b599641): corrects extra_chunks to extra_frames in the worker print statement.

🤖 Generated with Claude Code

jmholton and others added 8 commits April 22, 2026 19:45
Incorporates https://github.com/jmholton/md2mtz (squashed) under md2mtz/.
Provides GPU-accelerated structure factor calculation from MD supercell
trajectories for diffuse scatter analysis (sfcalc_gpu_collapse).

To sync future upstream changes:
  git fetch md2mtz_remote main
  git read-tree --prefix=md2mtz/ -u md2mtz_remote/main
  git commit -m "Sync md2mtz from upstream"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces the per-frame cctbx direct-sum xrs_sel.structure_factors() call
with the GPU FFT spreading+collapse from md2mtz/sfcalc_gpu_collapse.py.
Expected speedup: 2-30x depending on atom count and trajectory length.

New parameters (all optional):
  engine=gpu          use GPU FFT instead of cctbx direct-sum (default: cctbx)
  rate=2.5            FFT oversampling rate (default 2.5)
  noise=0.01          multi-grid coarsening threshold (default 1%)
  bmax=0              skip atoms with B > bmax Ų (0 = keep all)
  lib=<path>          path to sfcalc_gpu.so (default: md2mtz/sfcalc_gpu.so)
  super_mult=1,1,1    supercell multipliers na,nb,nc (default 1,1,1)

The GPU path accumulates ΣF (complex) and Σ|F|² per ASU reflection, then
converts to cctbx miller arrays before the existing MPI reduction and MTZ
output, so all downstream code (DWF removal, diffuse = <|F|²>-|<F>|²,
correlation, partial_sum, diff_mode) works unchanged.
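
The accumulation described above can be sketched as follows. This is an illustrative NumPy version only: `accumulate` and `diffuse_intensity` are hypothetical names, and the real xtraj.py path hands the sums to cctbx miller arrays rather than computing the result this way.

```python
import numpy as np

def accumulate(frames):
    """Accumulate sum(F) and sum(|F|^2) per reflection over complex frames."""
    sum_f = None
    sum_f2 = None
    n = 0
    for f in frames:
        f = np.asarray(f, dtype=np.complex128)
        if sum_f is None:
            # Allocate the accumulators on the first frame.
            sum_f = np.zeros_like(f)
            sum_f2 = np.zeros(f.shape, dtype=np.float64)
        sum_f += f
        sum_f2 += f.real ** 2 + f.imag ** 2
        n += 1
    return sum_f, sum_f2, n

def diffuse_intensity(sum_f, sum_f2, n):
    """diffuse = <|F|^2> - |<F>|^2, evaluated per reflection."""
    mean_f = sum_f / n
    return sum_f2 / n - (mean_f.real ** 2 + mean_f.imag ** 2)

# Two toy frames with two reflections each (made-up values).
frames = [np.array([1 + 0j, 1j]), np.array([-1 + 0j, 1j])]
sum_f, sum_f2, n = accumulate(frames)
print(diffuse_intensity(sum_f, sum_f2, n))
```

Because both accumulators are plain sums, partial results from different workers can be added together before the final division by the frame count.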

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- xtraj.py engine=gpu: GPU FFT spreading via ctypes, multi-level B-factor
  binning, auto-blur correction, cctbx-based ASU enumeration and symmetry
  collapse; no gemmi or CCP4 dependency
- Validated CC=0.9985 vs cctbx at d_min=0.9 over 12.6M reflections (voltron GV100)
- CLAUDE.md: project guide covering environment, xtraj parameters, GPU design,
  validation results, and SLURM usage
- README.md: user-facing documentation with quick start and engine comparison
- md2mtz/slurm_compare_dmin09.sh: SLURM job for GPU vs CPU comparison on voltron
- md2mtz/run_compare_test.sh, run_compare_dmin09_test.sh, run_gpu_test.sh: local test scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Avoids allocating two ~400 MB float64 arrays per frame at d_min=0.9.
In-place zeroing with [:]=0.0 saves ~21% GPU frame time (51s→40s for 3
frames on GV100, 6c2r 468k-atom system). CC vs cctbx unchanged: 0.9990.
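
A minimal sketch of the pattern, assuming NumPy accumulators. The names `GRID_SHAPE`, `acc_r`, `acc_i` and `process_frame` are illustrative, the grid is tiny, and the per-frame work is a placeholder for the GPU result; this is not the xtraj.py code itself.

```python
import numpy as np

GRID_SHAPE = (8, 8, 8)  # tiny hypothetical grid; the real one is far larger

# Allocate once, outside the frame loop.
acc_r = np.zeros(GRID_SHAPE, dtype=np.float64)
acc_i = np.zeros(GRID_SHAPE, dtype=np.float64)

def process_frame(acc_r, acc_i, value):
    """Reuse the existing buffers: zero in place, then accumulate."""
    acc_r[:] = 0.0          # in-place zeroing, no new allocation
    acc_i[:] = 0.0
    acc_r += value          # placeholder for the real per-frame accumulation
    acc_i -= value
    return float(acc_r.sum()), float(acc_i.sum())

for frame_value in (1.0, 2.0, 3.0):
    totals = process_frame(acc_r, acc_i, frame_value)
```

The alternative, calling `np.zeros(GRID_SHAPE)` inside the loop, allocates fresh arrays for every frame, which is the cost this change avoids.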

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

_precompute_collapse() builds per-operator (iz,iy,ix,valid,friedel,cos,sin)
arrays once at startup. Per-frame _collapse_fast() does only gather + phase
rotate on 12.6M reflections. Collapse time: 6.63s -> 1.74s/frame (74% faster).
Total GPU frame time: 15.2s -> 12.0s. CC vs cctbx unchanged: 0.9990.
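
The split between setup and per-frame work can be sketched as below. This is a simplified illustration under stated assumptions, not the code in this commit: the function names drop the leading underscore, the `valid` and `friedel` arrays are omitted, the grid is assumed to be indexed `[l, k, h]` with periodic wrap-around, and the phase sign convention is chosen arbitrarily.

```python
import numpy as np

def precompute_collapse(hkl, rotations, translations, grid_shape):
    """Build, once, per-operator gather indices and phase factors.

    hkl: (N, 3) integer Miller indices of the ASU reflections.
    rotations / translations: one integer 3x3 matrix and one fractional
    3-vector per symmetry operator.
    """
    nz, ny, nx = grid_shape
    table = []
    for rot, trans in zip(rotations, translations):
        hr = hkl @ np.asarray(rot, dtype=np.int64)          # rotated indices
        phase = 2.0 * np.pi * (hkl @ np.asarray(trans, dtype=np.float64))
        ix = hr[:, 0] % nx                                  # wrap negatives
        iy = hr[:, 1] % ny
        iz = hr[:, 2] % nz
        table.append((iz, iy, ix, np.cos(phase), np.sin(phase)))
    return table

def collapse_fast(grid_r, grid_i, table):
    """Per-frame work: gather from the grid and rotate by the stored phase."""
    n = table[0][0].size
    out_r = np.zeros(n)
    out_i = np.zeros(n)
    for iz, iy, ix, c, s in table:
        gr = grid_r[iz, iy, ix]
        gi = grid_i[iz, iy, ix]
        out_r += gr * c - gi * s    # real part of (gr + i*gi) * (c + i*s)
        out_i += gr * s + gi * c    # imaginary part
    return out_r, out_i
```

Everything that depends only on the reflection list and the space group lives in `precompute_collapse`, so the per-frame cost reduces to indexed reads and a few multiply-adds per reflection.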

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces per-frame _acc_r *= blur_corr (multiplying two ~400 MB float64
grids) with a pre-computed _blur_asu vector (12.6M elements) applied after
collapse. Blur step: 1.45s -> 0.04s/frame. Total GPU frame time: 12.0s -> 8.9s.
3-frame wall: 34s -> 24s. CC vs cctbx unchanged: 0.9990.
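
The reason the result is unchanged can be shown on a toy grid. In this sketch the arrays and gather indices are random stand-ins and all names are illustrative; it only demonstrates that scaling the grid and then gathering gives the same values as gathering and then scaling by a correction gathered once at setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
grid = rng.standard_normal((n, n, n))                   # stand-in accumulator
blur_corr = np.exp(rng.uniform(0.0, 1.0, grid.shape))   # stand-in correction

# Hypothetical gather indices for 50 ASU reflections.
iz, iy, ix = (rng.integers(0, n, 50) for _ in range(3))

# Before: scale the whole grid every frame, then gather.
asu_old = (grid * blur_corr)[iz, iy, ix]

# After: gather the correction once at setup ...
blur_asu = blur_corr[iz, iy, ix]
# ... and apply it per frame to the much shorter ASU vector.
asu_new = grid[iz, iy, ix] * blur_asu

print(np.allclose(asu_old, asu_new))
```

The same argument carries over when several symmetry mates are summed into one reflection, as long as the correction depends only on resolution, which symmetry mates share.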

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Incorporates upstream fix from b599641 (mewall00).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three enhancements ported from xtraj.py:

1. sfcalc_gpu_collapse.py: add precompute_collapse() + collapse_fast()
   Pre-compute symmetry operator lookup indices and phase arrays once;
   collapse_fast() does only the gather+phase-sum per call.

2. sfcalc_gpu_collapse.py + sfcalc_gpu.py: apply blur correction at ASU
   level only, not to the full FFT grid.  For d_min=0.9 the fine grid is
   960x960x864 (~400M points); the ASU has ~12.6M reflections — 32x fewer
   multiply-exponential operations.

3. sfcalc_gpu.py: add missing auto-blur correction (b_add = (dmin*rate)^2/pi^2
   added to B-factors before spreading; exp(+b_add*stol^2) applied after).
   This was already present in sfcalc_gpu_collapse.py but absent in
   sfcalc_gpu.py, causing sub-pixel Gaussian aliasing at high resolution.
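
The auto-blur formula in item 3 can be written out as a short sketch. `auto_blur_b` and `blur_correction` are hypothetical names, and the code assumes the usual definition stol^2 = 1/(4 d^2); it illustrates the arithmetic rather than reproducing sfcalc_gpu.py.

```python
import numpy as np

def auto_blur_b(dmin, rate):
    """b_add = (dmin*rate)^2 / pi^2, the extra B added before spreading."""
    return (dmin * rate) ** 2 / np.pi ** 2

def blur_correction(d_spacing, b_add):
    """exp(+b_add * stol^2), assuming stol^2 = 1 / (4 d^2)."""
    stol2 = 1.0 / (4.0 * np.asarray(d_spacing, dtype=np.float64) ** 2)
    return np.exp(b_add * stol2)

b_add = auto_blur_b(dmin=0.9, rate=2.5)

# A few example d-spacings in Angstrom, from low to high resolution.
d = np.array([10.0, 2.0, 0.9])
corr = blur_correction(d, b_add)

# The attenuation introduced by adding b_add to every B-factor ...
attenuation = np.exp(-b_add / (4.0 * d ** 2))
# ... is undone exactly by the correction.
print(corr * attenuation)
```

Since the correction depends only on resolution, it can be evaluated once per ASU reflection, which ties in with item 2.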

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>