Add GPU FFT engine to xtraj (engine=gpu) by jmholton · Pull Request #23 · lanl/lunus

jmholton · 2026-05-03T20:26:10Z

Summary

- Adds engine=gpu to xtraj.py: GPU-accelerated structure factor calculation via cuFFT spreading instead of the cctbx direct sum.
- Ships a precompiled GPU library (md2mtz/sfcalc_gpu.so) targeting NVIDIA Volta–Hopper (sm_70–sm_90), with bundled libcufft.so.11, so no CUDA toolkit is needed at runtime.
- Three per-frame Python optimizations reduce GPU wall time by ~1.9× vs the initial implementation:
  - Pre-allocate FFT accumulators outside the frame loop
  - Pre-compute symmetry collapse indices and phase arrays once at setup
  - Apply auto-blur correction at 12.6M ASU points instead of the full 400M-point FFT grid

Performance (6c2r trajectory, 468k atoms, d_min=0.9, 1000 frames)

| Engine | Wall time | Per frame |
|---|---|---|
| cctbx (CPU) | 12.6 h | 45.4 s |
| GPU (GV100) | 2.4 h | 8.6 s |
| Speedup | 5.2× | |

Accuracy

Pearson CC between CPU and GPU diffuse intensities over 12,581,024 ASU reflections at d_min=0.9: 0.9999
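
As an illustration of the accuracy metric quoted above, here is a minimal sketch of a Pearson CC between two intensity arrays. The helper name `pearson_cc` and the synthetic arrays are assumptions for illustration only; they are not taken from xtraj.py and do not reproduce the reported numbers.

```python
import numpy as np

def pearson_cc(a, b):
    """Pearson correlation coefficient between two equal-length 1-D arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    da = a - a.mean()
    db = b - b.mean()
    return float(np.dot(da, db) / np.sqrt(np.dot(da, da) * np.dot(db, db)))

# Synthetic stand-ins for the CPU and GPU diffuse intensities (not real data).
rng = np.random.default_rng(0)
i_cpu = rng.gamma(shape=2.0, scale=100.0, size=10_000)
i_gpu = i_cpu * (1.0 + 1e-3 * rng.standard_normal(i_cpu.size))

cc = pearson_cc(i_cpu, i_gpu)
print(f"CC = {cc:.6f}")
```

In a real comparison the two arrays would hold the diffuse intensities for the same ASU reflections, in the same order, from the two engines.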

New files

md2mtz/sfcalc_gpu.so — compiled GPU kernel (ctypes interface)
md2mtz/libcufft.so.11 — bundled cuFFT runtime
md2mtz/sfcalc_gpu.cu — CUDA source
md2mtz/sfcalc_gpu_collapse.cpp — C++ standalone executable
md2mtz/compile_gpu.csh — build script (requires CUDA toolkit on voltron)
md2mtz/run_gpu_test.sh, run_compare_test.sh — local test scripts
md2mtz/slurm_compare_dmin09.sh, slurm_compare_6c2r.sh, slurm_compare_6c2r_1000.sh — SLURM validation scripts
md2mtz/README.md — GPU library documentation
CLAUDE.md — developer guide
README.md — updated user-facing docs

Also includes

Upstream fix (from b599641): corrects extra_chunks → extra_frames in worker print statement.

🤖 Generated with Claude Code

Incorporates https://github.com/jmholton/md2mtz (squashed) under md2mtz/. Provides GPU-accelerated structure factor calculation from MD supercell trajectories for diffuse scatter analysis (sfcalc_gpu_collapse).

To sync future upstream changes:

    git fetch md2mtz_remote main
    git read-tree --prefix=md2mtz/ -u md2mtz_remote/main
    git commit -m "Sync md2mtz from upstream"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces the per-frame cctbx direct-sum xrs_sel.structure_factors() call with the GPU FFT spreading+collapse from md2mtz/sfcalc_gpu_collapse.py. Expected speedup: 2-30x depending on atom count and trajectory length.

New parameters (all optional):
- engine=gpu: use GPU FFT instead of cctbx direct-sum (default: cctbx)
- rate=2.5: FFT oversampling rate (default 2.5)
- noise=0.01: multi-grid coarsening threshold (default 1%)
- bmax=0: skip atoms with B > bmax Å² (0 = keep all)
- lib=<path>: path to sfcalc_gpu.so (default: md2mtz/sfcalc_gpu.so)
- super_mult=1,1,1: supercell multipliers na,nb,nc (default 1,1,1)

The GPU path accumulates ΣF (complex) and Σ|F|² per ASU reflection, then converts to cctbx miller arrays before the existing MPI reduction and MTZ output, so all downstream code (DWF removal, diffuse = <|F|²>-|<F>|², correlation, partial_sum, diff_mode) works unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
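
The accumulation described in this commit message (running ΣF and Σ|F|², then diffuse = <|F|²>-|<F>|²) can be sketched in plain NumPy. The names `sum_f`, `sum_f2` and `diffuse_from_sums` are hypothetical and the per-frame structure factors are random placeholders; the conversion to cctbx miller arrays and the MPI reduction are omitted.

```python
import numpy as np

def diffuse_from_sums(sum_f, sum_f2, n_frames):
    """Return <|F|^2> - |<F>|^2 per reflection from running sums."""
    mean_f = sum_f / n_frames      # <F>, complex
    mean_f2 = sum_f2 / n_frames    # <|F|^2>, real
    return mean_f2 - np.abs(mean_f) ** 2

# Placeholder per-frame structure factors: (n_frames, n_reflections), complex.
rng = np.random.default_rng(1)
n_frames, n_refl = 50, 8
frames = (rng.standard_normal((n_frames, n_refl))
          + 1j * rng.standard_normal((n_frames, n_refl)))

# Running sums, updated once per frame as in a trajectory loop.
sum_f = np.zeros(n_refl, dtype=np.complex128)
sum_f2 = np.zeros(n_refl, dtype=np.float64)
for f in frames:
    sum_f += f
    sum_f2 += np.abs(f) ** 2

diffuse = diffuse_from_sums(sum_f, sum_f2, n_frames)
```

Because only the two sums are kept, partial sums from different workers can simply be added before the final division, which is consistent with the MPI reduction mentioned above.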

- xtraj.py engine=gpu: GPU FFT spreading via ctypes, multi-level B-factor binning, auto-blur correction, cctbx-based ASU enumeration and symmetry collapse; no gemmi or CCP4 dependency
- Validated CC=0.9985 vs cctbx at d_min=0.9 over 12.6M reflections (voltron GV100)
- CLAUDE.md: project guide covering environment, xtraj parameters, GPU design, validation results, and SLURM usage
- README.md: user-facing documentation with quick start and engine comparison
- md2mtz/slurm_compare_dmin09.sh: SLURM job for GPU vs CPU comparison on voltron
- md2mtz/run_compare_test.sh, run_compare_dmin09_test.sh, run_gpu_test.sh: local test scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Avoids allocating two ~400 MB float64 arrays per frame at d_min=0.9. In-place zeroing with [:]=0.0 saves ~21% GPU frame time (51s→40s for 3 frames on GV100, 6c2r 468k-atom system). CC vs cctbx unchanged: 0.9990.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
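
The pre-allocation pattern from this commit can be sketched as follows. The function `run_frames`, the callback `toy_fill` and the tiny grid shape are illustrative assumptions; only the idea of allocating once and zeroing with `[:] = 0.0` comes from the commit message.

```python
import numpy as np

def run_frames(n_frames, shape, fill_frame):
    """Accumulate each frame into buffers that are allocated only once."""
    acc_r = np.empty(shape, dtype=np.float64)   # real part, allocated once
    acc_i = np.empty(shape, dtype=np.float64)   # imaginary part, allocated once
    totals = []
    for k in range(n_frames):
        # In-place zeroing reuses the existing memory; calling np.zeros(shape)
        # here instead would allocate two fresh arrays on every frame.
        acc_r[:] = 0.0
        acc_i[:] = 0.0
        fill_frame(k, acc_r, acc_i)
        totals.append((acc_r.sum(), acc_i.sum()))
    return totals

def toy_fill(k, acc_r, acc_i):
    """Placeholder for the per-frame GPU call that fills the accumulators."""
    acc_r += k + 1
    acc_i -= k + 1

totals = run_frames(3, (4, 4, 4), toy_fill)
```

If the buffers were not zeroed at the top of each iteration, contributions from earlier frames would leak into later ones, so the `[:] = 0.0` step is required for correctness as well as speed.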

_precompute_collapse() builds per-operator (iz,iy,ix,valid,friedel,cos,sin) arrays once at startup. Per-frame _collapse_fast() does only gather + phase rotate on 12.6M reflections. Collapse time: 6.63s -> 1.74s/frame (74% faster). Total GPU frame time: 15.2s -> 12.0s. CC vs cctbx unchanged: 0.9990.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
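
A simplified sketch of the precompute-then-gather idea follows. It is not the code from xtraj.py: the signatures, the index and phase sign conventions, and the use of a full complex grid (so that the `valid` and `friedel` arrays are not needed) are all assumptions made to keep the example short. A naive per-reflection loop is included only as a reference for checking the fast path.

```python
import numpy as np

def precompute_collapse(hkl, ops, grid_shape):
    """Build per-operator gather indices and phase factors once at setup.

    hkl: (N, 3) integer Miller indices. ops: list of (R, t), with R a 3x3
    integer matrix and t a fractional translation (conventions assumed).
    """
    nz, ny, nx = grid_shape
    tables = []
    for rot, trans in ops:
        hr = hkl @ np.asarray(rot)          # rotated indices, shape (N, 3)
        ix = np.mod(hr[:, 0], nx)           # wrap negatives onto the periodic grid
        iy = np.mod(hr[:, 1], ny)
        iz = np.mod(hr[:, 2], nz)
        angle = 2.0 * np.pi * (hkl @ np.asarray(trans, dtype=np.float64))
        tables.append((iz, iy, ix, np.cos(angle), np.sin(angle)))
    return tables

def collapse_fast(grid, tables):
    """Per-frame work: gather grid values and apply the stored phases."""
    out = np.zeros(len(tables[0][0]), dtype=np.complex128)
    for iz, iy, ix, c, s in tables:
        out += grid[iz, iy, ix] * (c + 1j * s)
    return out

def collapse_naive(grid, hkl, ops):
    """Reference version that recomputes indices and phases every call."""
    nz, ny, nx = grid.shape
    out = np.zeros(len(hkl), dtype=np.complex128)
    for n, h in enumerate(hkl):
        for rot, trans in ops:
            hr = h @ np.asarray(rot)
            phase = np.exp(2j * np.pi * np.dot(h, trans))
            out[n] += grid[hr[2] % nz, hr[1] % ny, hr[0] % nx] * phase
    return out

rng = np.random.default_rng(2)
grid = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))
hkl = rng.integers(-3, 4, size=(30, 3))
# Identity plus a twofold screw along z, chosen purely as an example.
ops = [(np.eye(3, dtype=int), (0.0, 0.0, 0.0)),
       (np.diag([-1, -1, 1]), (0.0, 0.0, 0.5))]

tables = precompute_collapse(hkl, ops, grid.shape)   # once at startup
f_fast = collapse_fast(grid, tables)                 # once per frame
f_ref = collapse_naive(grid, hkl, ops)
```

The index arithmetic and trigonometric evaluations depend only on the reflection list and the symmetry operators, not on the frame, which is why they can be moved out of the frame loop.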

Replaces per-frame _acc_r *= blur_corr (multiplying two ~400 MB float64 grids) with a pre-computed _blur_asu vector (12.6M elements) applied after collapse. Blur step: 1.45s -> 0.04s/frame. Total GPU frame time: 12.0s -> 8.9s. 3-frame wall: 34s -> 24s. CC vs cctbx unchanged: 0.9990.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
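
The equivalence this commit relies on can be checked on a toy grid: scaling the whole grid and then gathering gives the same values as gathering first and scaling only the gathered points. The array names echo the commit message, but the shapes, the random data and the single set of gather indices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
grid_shape = (6, 6, 6)
acc_r = rng.standard_normal(grid_shape)                  # stand-in accumulator
blur_corr = np.exp(rng.uniform(0.0, 1.0, grid_shape))    # stand-in correction

# Hypothetical grid positions of 20 ASU reflections.
n_asu = 20
iz, iy, ix = (rng.integers(0, n, n_asu) for n in grid_shape)

# Old path: multiply every grid point on every frame, then gather.
slow = (acc_r * blur_corr)[iz, iy, ix]

# New path: gather the correction once at setup, then scale only the
# gathered values on each frame.
blur_asu = blur_corr[iz, iy, ix]
fast = acc_r[iz, iy, ix] * blur_asu
```

With several symmetry operators, a single per-reflection vector can still be applied after the collapse, because the correction depends only on resolution and symmetry-equivalent reflections share the same resolution.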

Incorporates upstream fix from b599641 (mewall00).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three enhancements ported from xtraj.py:

1. sfcalc_gpu_collapse.py: add precompute_collapse() + collapse_fast(). Pre-compute symmetry operator lookup indices and phase arrays once; collapse_fast() does only the gather+phase-sum per call.
2. sfcalc_gpu_collapse.py + sfcalc_gpu.py: apply blur correction at ASU level only, not to the full FFT grid. For d_min=0.9 the fine grid is 960x960x864 (~400M points); the ASU has ~12.6M reflections, i.e. 32x fewer multiply-exponential operations.
3. sfcalc_gpu.py: add missing auto-blur correction (b_add = (dmin*rate)^2/pi^2 added to B-factors before spreading; exp(+b_add*stol^2) applied after). This was already present in sfcalc_gpu_collapse.py but absent in sfcalc_gpu.py, causing sub-pixel Gaussian aliasing at high resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
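
The auto-blur formula in item 3 can be worked through numerically. This sketch applies the B-factor as a reciprocal-space factor exp(-B*stol^2) with stol = 1/(2d), which is the standard Debye-Waller form; the actual code spreads atoms on a real-space grid, so this only illustrates why the exp(+b_add*stol^2) correction undoes the added blur. The helper names and the example B-factor are hypothetical.

```python
import numpy as np

def auto_blur_b_add(d_min, rate):
    """b_add = (d_min * rate)^2 / pi^2, as stated in the commit message."""
    return (d_min * rate) ** 2 / np.pi ** 2

def blur_correction(b_add, stol_sq):
    """Correction exp(+b_add * stol^2) applied after the FFT."""
    return np.exp(b_add * stol_sq)

d_min, rate = 0.9, 2.5                      # values quoted in this PR
b_add = auto_blur_b_add(d_min, rate)        # about 0.513 A^2

d = np.array([10.0, 3.0, 1.5, 0.9])         # example resolutions in A
stol_sq = 1.0 / (4.0 * d ** 2)              # (sin(theta)/lambda)^2

b_iso = 20.0                                # hypothetical atomic B-factor
blurred = np.exp(-(b_iso + b_add) * stol_sq)    # factor with the extra blur
restored = blurred * blur_correction(b_add, stol_sq)
```

After the correction, `restored` matches the factor computed with the original B-factor alone, so the extra blur only affects how the atoms are sampled on the grid and not the final structure factors.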

jmholton and others added 8 commits April 22, 2026 19:45

Fix print variable name: extra_chunks -> extra_frames

901ac39


jmholton wants to merge 8 commits into lanl:master from jmholton:gpu-engine
