Add GPU FFT engine to xtraj (engine=gpu) #23
Open
jmholton wants to merge 8 commits into
Incorporates https://github.com/jmholton/md2mtz (squashed) under md2mtz/. Provides GPU-accelerated structure factor calculation from MD supercell trajectories for diffuse scatter analysis (sfcalc_gpu_collapse).

To sync future upstream changes:

    git fetch md2mtz_remote main
    git read-tree --prefix=md2mtz/ -u md2mtz_remote/main
    git commit -m "Sync md2mtz from upstream"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the per-frame cctbx direct-sum xrs_sel.structure_factors() call with the GPU FFT spreading+collapse from md2mtz/sfcalc_gpu_collapse.py. Expected speedup: 2-30x depending on atom count and trajectory length.

New parameters (all optional):

    engine=gpu         use GPU FFT instead of cctbx direct-sum (default: cctbx)
    rate=2.5           FFT oversampling rate (default 2.5)
    noise=0.01         multi-grid coarsening threshold (default 1%)
    bmax=0             skip atoms with B > bmax Å² (0 = keep all)
    lib=<path>         path to sfcalc_gpu.so (default: md2mtz/sfcalc_gpu.so)
    super_mult=1,1,1   supercell multipliers na,nb,nc (default 1,1,1)

The GPU path accumulates ΣF (complex) and Σ|F|² per ASU reflection, then converts to cctbx miller arrays before the existing MPI reduction and MTZ output, so all downstream code (DWF removal, diffuse = <|F|²>-|<F>|², correlation, partial_sum, diff_mode) works unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
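For readers unfamiliar with the accumulation scheme, here is a minimal NumPy sketch of how running sums of F and |F|² over frames yield the diffuse intensity <|F|²>-|<F>|². It is not the xtraj.py code: the function names `accumulate_frames` and `diffuse_intensity` and the random test data are assumptions made for illustration, and the MPI reduction and cctbx miller-array conversion are left out.

```python
import numpy as np

def accumulate_frames(frames):
    """Accumulate sum(F) and sum(|F|^2) per reflection over trajectory frames."""
    sum_f = None
    sum_f2 = None
    n_frames = 0
    for f in frames:
        f = np.asarray(f, dtype=np.complex128)
        if sum_f is None:
            sum_f = np.zeros_like(f)
            sum_f2 = np.zeros(f.shape, dtype=np.float64)
        sum_f += f                           # running complex sum of F
        sum_f2 += f.real ** 2 + f.imag ** 2  # running sum of |F|^2
        n_frames += 1
    return sum_f, sum_f2, n_frames

def diffuse_intensity(sum_f, sum_f2, n_frames):
    """Return <|F|^2> - |<F>|^2 per reflection."""
    mean_f = sum_f / n_frames
    return sum_f2 / n_frames - np.abs(mean_f) ** 2

# Hypothetical data: 5 frames, 8 reflections of random complex structure factors.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8)) + 1j * rng.normal(size=(5, 8))
sum_f, sum_f2, n_frames = accumulate_frames(frames)
diffuse = diffuse_intensity(sum_f, sum_f2, n_frames)
```

Because only the two sums and a frame count are kept, partial results from different workers can be added together before the final division, which is why this form fits an existing reduction step.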
- xtraj.py engine=gpu: GPU FFT spreading via ctypes, multi-level B-factor binning, auto-blur correction, cctbx-based ASU enumeration and symmetry collapse; no gemmi or CCP4 dependency
- Validated CC=0.9985 vs cctbx at d_min=0.9 over 12.6M reflections (voltron GV100)
- CLAUDE.md: project guide covering environment, xtraj parameters, GPU design, validation results, and SLURM usage
- README.md: user-facing documentation with quick start and engine comparison
- md2mtz/slurm_compare_dmin09.sh: SLURM job for GPU vs CPU comparison on voltron
- md2mtz/run_compare_test.sh, run_compare_dmin09_test.sh, run_gpu_test.sh: local test scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Avoids allocating two ~400 MB float64 arrays per frame at d_min=0.9. In-place zeroing with [:]=0.0 saves ~21% GPU frame time (51s→40s for 3 frames on GV100, 6c2r 468k-atom system). CC vs cctbx unchanged: 0.9990.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_precompute_collapse() builds per-operator (iz,iy,ix,valid,friedel,cos,sin) arrays once at startup. Per-frame _collapse_fast() does only gather + phase rotate on 12.6M reflections. Collapse time: 6.63s -> 1.74s/frame (74% less time). Total GPU frame time: 15.2s -> 12.0s. CC vs cctbx unchanged: 0.9990.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
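The precompute-then-gather split might look roughly like the sketch below. It is a simplified illustration, not the code in xtraj.py: it assumes a full complex grid indexed as grid[l, k, h] with wrap-around indices, an assumed phase convention of exp(+2πi h·t), and it omits the Friedel flag that the real arrays carry. The function names and the example operators are chosen for illustration.

```python
import numpy as np

def precompute_collapse(hkl, ops, grid_shape):
    """Build per-operator lookup tables once (hkl: integer array of shape (N, 3))."""
    nz, ny, nx = grid_shape
    tables = []
    for rot, trans in ops:
        h_rot = hkl @ np.asarray(rot)  # rotated Miller indices
        phase = 2.0 * np.pi * (hkl @ np.asarray(trans, dtype=np.float64))
        valid = (
            (np.abs(h_rot[:, 0]) < nx // 2)
            & (np.abs(h_rot[:, 1]) < ny // 2)
            & (np.abs(h_rot[:, 2]) < nz // 2)
        )
        tables.append((
            h_rot[:, 2] % nz, h_rot[:, 1] % ny, h_rot[:, 0] % nx,  # iz, iy, ix
            valid, np.cos(phase), np.sin(phase),
        ))
    return tables

def collapse_fast(grid, tables):
    """Per-frame work: gather grid values and apply the stored phase factors."""
    n_refl = tables[0][0].shape[0]
    out = np.zeros(n_refl, dtype=np.complex128)
    for iz, iy, ix, valid, cos_p, sin_p in tables:
        gathered = grid[iz, iy, ix]
        out += np.where(valid, gathered * (cos_p + 1j * sin_p), 0.0)
    return out

# Hypothetical example: identity plus a 2_1 screw axis along b.
ops = [
    (np.eye(3, dtype=int), (0.0, 0.0, 0.0)),
    (np.diag([-1, 1, -1]), (0.0, 0.5, 0.0)),
]
hkl = np.array([[1, 0, 0], [0, 1, 2], [-1, 2, 1]])
rng = np.random.default_rng(1)
grid = rng.normal(size=(8, 8, 8)) + 1j * rng.normal(size=(8, 8, 8))

tables = precompute_collapse(hkl, ops, grid.shape)  # once at startup
f_asu = collapse_fast(grid, tables)                 # once per frame
```

The index arithmetic and trigonometric calls depend only on the reflection list and the symmetry operators, so they move out of the frame loop; what remains per frame is fancy indexing and a complex multiply.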
Replaces per-frame _acc_r *= blur_corr (multiplying two ~400 MB float64 grids) with a pre-computed _blur_asu vector (12.6M elements) applied after collapse. Blur step: 1.45s -> 0.04s/frame. Total GPU frame time: 12.0s -> 8.9s. 3-frame wall: 34s -> 24s. CC vs cctbx unchanged: 0.9990.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
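The equivalence behind this change can be checked on a toy grid: scaling the whole grid and then gathering the ASU reflections gives the same values as gathering first and scaling only the gathered values. The sketch below assumes an orthogonal cell, a correction of the form exp(b_add * stol²), and made-up cell lengths and reflections; it is not the xtraj.py implementation.

```python
import numpy as np

def stol2(h, k, l, cell):
    """(sin(theta)/lambda)^2 = 1/(4 d^2) for an orthogonal cell (assumption)."""
    a, b, c = cell
    return 0.25 * ((h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2)

n = 8
cell = (20.0, 25.0, 30.0)  # hypothetical cell lengths in Angstrom
b_add = 0.5                # hypothetical blur B-factor

rng = np.random.default_rng(2)
grid = rng.normal(size=(n, n, n))  # stand-in for one accumulator grid

# Full-grid correction: one factor per grid point, indexed as [l, k, h].
idx = np.fft.fftfreq(n, d=1.0 / n)  # signed indices 0, 1, ..., -2, -1
L, K, H = np.meshgrid(idx, idx, idx, indexing="ij")
blur_grid = np.exp(b_add * stol2(H, K, L, cell))

# ASU-level correction: one factor per reflection, computed once.
hkl = np.array([[1, 0, 0], [0, 1, 2], [-1, 2, 1]])
h, k, l = hkl[:, 0], hkl[:, 1], hkl[:, 2]
blur_asu = np.exp(b_add * stol2(h, k, l, cell))

iz, iy, ix = l % n, k % n, h % n
grid_way = (grid * blur_grid)[iz, iy, ix]  # scale everything, then gather
asu_way = grid[iz, iy, ix] * blur_asu      # gather, then scale the few
```

Since the factor depends only on resolution, which symmetry-equivalent reflections share, the same vector can be applied after a symmetry collapse as well.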
Incorporates upstream fix from b599641 (mewall00).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three enhancements ported from xtraj.py:

1. sfcalc_gpu_collapse.py: add precompute_collapse() + collapse_fast(). Pre-compute symmetry operator lookup indices and phase arrays once; collapse_fast() does only the gather+phase-sum per call.
2. sfcalc_gpu_collapse.py + sfcalc_gpu.py: apply blur correction at ASU level only, not to the full FFT grid. For d_min=0.9 the fine grid is 960x960x864 (~400M points); the ASU has ~12.6M reflections, so this is 32x fewer multiply-exponential operations.
3. sfcalc_gpu.py: add missing auto-blur correction (b_add = (dmin*rate)^2/pi^2 added to B-factors before spreading; exp(+b_add*stol^2) applied after). This was already present in sfcalc_gpu_collapse.py but absent in sfcalc_gpu.py, causing sub-pixel Gaussian aliasing at high resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
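The auto-blur round trip in item 3 can be sketched in a few lines. The formula for b_add is taken from the commit message; the function names, the use of stol² = 1/(4d²), and the example B-factor and resolution are assumptions for illustration.

```python
import math

def auto_blur_b_add(dmin, rate):
    """Extra B-factor added before spreading, as quoted in the commit message."""
    return (dmin * rate) ** 2 / math.pi ** 2

def stol2(d):
    """(sin(theta)/lambda)^2 at resolution d, using stol = 1/(2d)."""
    return 1.0 / (4.0 * d * d)

def blurred_dwf(b_iso, b_add, d):
    """Debye-Waller factor of an atom whose B was inflated by b_add."""
    return math.exp(-(b_iso + b_add) * stol2(d))

def unblur(value, b_add, d):
    """Undo the extra blur in reciprocal space with exp(+b_add * stol^2)."""
    return value * math.exp(b_add * stol2(d))

b_add = auto_blur_b_add(dmin=0.9, rate=2.5)
d = 1.5       # hypothetical reflection resolution in Angstrom
b_iso = 10.0  # hypothetical atomic B-factor in Angstrom^2
recovered = unblur(blurred_dwf(b_iso, b_add, d), b_add, d)
reference = math.exp(-b_iso * stol2(d))
```

Inflating B widens each atom's Gaussian so that it is sampled adequately on the grid, and the reciprocal-space factor then restores the original falloff, so `recovered` matches `reference` to rounding error.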
Summary
- Adds engine=gpu to xtraj.py: GPU-accelerated structure factor calculation via cuFFT spreading instead of cctbx direct-sum
- Compiled GPU kernel (md2mtz/sfcalc_gpu.so) targeting NVIDIA Volta–Hopper (sm_70–sm_90), with bundled libcufft.so.11; no CUDA toolkit needed at runtime

Performance (6c2r trajectory, 468k atoms, d_min=0.9, 1000 frames)
Accuracy
Pearson CC between CPU and GPU diffuse intensities over 12,581,024 ASU reflections at d_min=0.9: 0.9999
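A comparison of this kind could be computed with a short helper like the one below. It is a generic Pearson correlation sketch, not the validation script used for this PR; the function name `pearson_cc` and the synthetic arrays standing in for CPU and GPU diffuse intensities are assumptions.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equally sized arrays."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.dot(dx, dy) / np.sqrt(np.dot(dx, dx) * np.dot(dy, dy)))

# Synthetic stand-ins for CPU and GPU diffuse intensities on matched reflections.
rng = np.random.default_rng(3)
cpu_diffuse = rng.gamma(shape=2.0, scale=100.0, size=1000)
gpu_diffuse = cpu_diffuse * (1.0 + 0.01 * rng.normal(size=1000))
cc = pearson_cc(cpu_diffuse, gpu_diffuse)
```

The two arrays must list the same reflections in the same order; in practice that means matching Miller indices between the two MTZ outputs before correlating.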
New files
- md2mtz/sfcalc_gpu.so: compiled GPU kernel (ctypes interface)
- md2mtz/libcufft.so.11: bundled cuFFT runtime
- md2mtz/sfcalc_gpu.cu: CUDA source
- md2mtz/sfcalc_gpu_collapse.cpp: C++ standalone executable
- md2mtz/compile_gpu.csh: build script (requires CUDA toolkit on voltron)
- md2mtz/run_gpu_test.sh, run_compare_test.sh: local test scripts
- md2mtz/slurm_compare_dmin09.sh, slurm_compare_6c2r.sh, slurm_compare_6c2r_1000.sh: SLURM validation scripts
- md2mtz/README.md: GPU library documentation
- CLAUDE.md: developer guide
- README.md: updated user-facing docs

Also includes
Upstream fix (from b599641): corrects extra_chunks → extra_frames in worker print statement.

🤖 Generated with Claude Code