Fix random MPI crashes caused by array out-of-bounds access #17
Open
ChristopherMayes wants to merge 1 commit into impact-lbl:master from
Fix random MPI crashes caused by array out-of-bounds access
Summary
Fixes intermittent segmentation faults when running with MPI (e.g. `mpirun -n 8 ImpactZexe-mpi`). The crashes are non-deterministic because they depend on how particles are distributed across ranks at runtime.

A test case that reliably reproduces the issue is attached as `error-test.zip`.

Root Cause
Five array out-of-bounds bugs, all exposed by compiling with `-fcheck=all`:

1. `src/Contrl/Input.f90`: `obtype(0)` when parsing comments

   The lattice input parser reads lines in a loop, incrementing index `i` only for data (non-comment) lines. However, the `-99` end-of-lattice check on `obtype(i)` was outside the `if(comst.ne."!")` block, so when the first line is a comment, `i` is still 0 and `obtype(0)` is accessed.

   Fix: Move the `obtype(i).eq.-99` check inside the data-reading branch.

2. `src/Contrl/Output.f90`: `glbin(0)` in 12 percentile search loops

   The 90th/95th/99th percentile emittance calculations use cumulative histograms. Twelve `do i = 1, nbin` loops access `glbin(i-1)`, which gives `glbin(0)` when `i=1`. The cumulative sum is built starting from `i=2`, so `glbin(0)` is never initialized and is out of bounds.

   Fix: Start all 12 search loops at `i = 2` instead of `i = 1`. This is safe because the interpolation formula uses `glbin(i-1)` and `glbin(i)`, and `glbin(1)` already holds the raw count for bin 1.

3. `src/Func/Ptclmger.f90`: zero-sized `temp1` allocations in particle exchange

   When a rank has zero particles to send/receive in a given direction, `jleft`, `jright`, `jdown`, or `jup` can be 0. `allocate(temp1(9, 0))` creates a zero-sized array, but the subsequent `MPI_RECV(temp1(1,1), ...)` accesses element `(1,1)`, which is out of bounds.

   Fix: Use `allocate(temp1(9, max(..., 1)))` for all four directions.

4. `src/Func/Ptclmger.f90`: zero-sized `left`/`right`/`up`/`down` allocations

   Similar to the above: when `Nptlocal` or `numbuf` is 0, `nsmall` becomes 0, leading to zero-sized allocations for the directional send buffers. The `MPI_SEND(left(1,1), ...)` call then accesses out-of-bounds memory.

   Fix: Use `max(Nptlocal, 1)` / `max(numbuf, 1)` when computing `nsmall`.

5. `src/Appl/BeamBunch.f90`: `rho` indexed out of bounds in `deposit_BeamBunch`

   In the charge deposition routine, particle grid indices (`ix`, `jx`, `kx`) can fall outside the local subdomain bounds. With 8 MPI ranks, indices like 33 were computed for an array with upper bound 3.

   Fix: Add a bounds check that skips particles whose indices fall outside the local `rho` grid.

Testing
`error/` directory (attached as `error-test.zip`): 10,000 particles through a 4-dipole chicane with extended diagnostics, `mpirun -n 8`.
`-fcheck=all`.
`-fcheck=all`).
error-test.zip
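The zero-size guard described in items 3 and 4 under Root Cause can also be checked in isolation, without MPI. The following is a minimal sketch, not code from this PR: the program name `zero_size_guard` and the hard-coded `jleft = 0` are assumptions made for illustration, and only the `allocate(temp1(9, max(..., 1)))` pattern is taken from the description. Built with `gfortran -fcheck=all`, it should show that element `(1,1)` exists after the guarded allocation even when the particle count is zero.

```fortran
program zero_size_guard
  implicit none
  double precision, allocatable :: temp1(:,:)
  integer :: jleft   ! hypothetical count of particles arriving from one direction

  ! Assumed scenario: this rank receives nothing from its neighbour.
  jleft = 0

  ! Unguarded form: allocate(temp1(9, jleft)) yields a 9 x 0 array, so any
  ! reference to temp1(1,1) is out of bounds and is flagged by -fcheck=all.
  ! Guarded form: always allocate at least one column.
  allocate(temp1(9, max(jleft, 1)))

  print *, 'shape of temp1:', shape(temp1)

  ! Element (1,1) is now a valid reference, so it can be used as the
  ! starting address of a receive buffer.
  temp1(1,1) = 0.0d0
  print *, 'temp1(1,1) =', temp1(1,1)

  deallocate(temp1)
end program zero_size_guard
```

Because the message count is a separate argument to `MPI_RECV` and `MPI_SEND`, padding the buffer to one column only makes the `temp1(1,1)` reference valid; it does not by itself change how many values are transferred.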
Files Changed
`src/Contrl/Input.f90`: `-99` check inside data branch
`src/Contrl/Output.f90`: `do i = 1` → `do i = 2`
`src/Func/Ptclmger.f90`
`src/Appl/BeamBunch.f90`

Acknowledgement
Bug diagnosis and fixes developed with AI assistance (GitHub Copilot, Claude Opus 4.6). The original problem shows up in https://christophermayes.github.io/lume-impact/examples/z/elements/csr-zeuthen/.