
Fix random MPI crashes caused by array out-of-bounds access#17

Open
ChristopherMayes wants to merge 1 commit into impact-lbl:master from ChristopherMayes:fix-out-of-bound-access

Conversation

@ChristopherMayes


Summary

Fixes intermittent segmentation faults when running with MPI (e.g. mpirun -n 8 ImpactZexe-mpi). The crashes are non-deterministic because they depend on how particles are distributed across ranks at runtime.

A test case that reliably reproduces the issue is attached as error-test.zip.

Root Cause

Five array out-of-bounds bugs, all exposed by compiling with -fcheck=all:

1. src/Contrl/Input.f90 — obtype(0) accessed when parsing comments

The lattice input parser reads lines in a loop, incrementing index i only for data lines (non-comment). However, the -99 end-of-lattice check on obtype(i) was outside the if(comst.ne."!") block, so when the first line is a comment, i is still 0 and obtype(0) is accessed.

Fix: Move the obtype(i).eq.-99 check inside the data-reading branch.
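The corrected control flow can be sketched in C (the loop shape and the `obtype` name follow the description above; the parsing details are assumptions, not the actual Fortran source):

```c
#include <stdio.h>

#define MAXLINES 100

/* Illustrative sketch of the fixed parser loop: the data-line index i
   advances only for non-comment lines, and the -99 end-of-lattice check
   runs only after a data line has been read, so obtype[0] is never
   inspected even when the file starts with comments. */
int parse_lattice(const char **lines, int nlines, int *obtype) {
    int i = 0;                        /* count of data lines read so far */
    for (int n = 0; n < nlines; n++) {
        if (lines[n][0] == '!')       /* comment line: skip, leave i alone */
            continue;
        i++;
        sscanf(lines[n], "%d", &obtype[i]);  /* 1-based, mirroring Fortran */
        /* Fix: -99 check lives inside the data-reading branch. */
        if (obtype[i] == -99)
            break;
    }
    return i;                         /* number of data lines consumed */
}
```

In the buggy version the `-99` test sat after the comment check but outside the data branch, so a leading comment line reached it with `i == 0`.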

2. src/Contrl/Output.f90 — glbin(0) accessed in 12 percentile search loops

The 90th/95th/99th percentile emittance calculations use cumulative histograms. Twelve do i = 1, nbin loops access glbin(i-1), which gives glbin(0) when i=1. The cumulative sum is built starting from i=2, so glbin(0) is never initialized and is out of bounds.

Fix: Start all 12 search loops at i = 2 instead of i = 1. This is safe because the interpolation formula uses glbin(i-1) and glbin(i), and glbin(1) already holds the raw count for bin 1.
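A minimal C sketch of one such search loop, assuming the cumulative-histogram layout described above (1-based `glbin`, running sum built from bin 2; the interpolation formula is an illustrative stand-in, not the exact Fortran expression):

```c
/* Find the position within the cumulative histogram glbin[1..nbin]
   where the running count first reaches `target`, with linear
   interpolation between bin edges.  Starting at i = 2 (the fix) means
   glbin[i-1] never reads glbin[0], which the cumulative sum leaves
   uninitialized. */
double percentile_position(const double *glbin, int nbin, double target) {
    for (int i = 2; i <= nbin; i++) {   /* fix: i = 2, not i = 1 */
        if (glbin[i] >= target) {
            /* interpolate between cumulative counts at i-1 and i */
            return (i - 1) + (target - glbin[i - 1])
                           / (glbin[i] - glbin[i - 1]);
        }
    }
    return (double)nbin;
}
```

Starting at `i = 2` is safe because the interpolation only ever reads `glbin(i-1)` and `glbin(i)`, and `glbin(1)` already holds the raw count for bin 1.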

3. src/Func/Ptclmger.f90 — zero-sized temp1 allocations in particle exchange

When a rank has zero particles to send/receive in a given direction, jleft, jright, jdown, or jup can be 0. allocate(temp1(9, 0)) creates a zero-sized array, but the subsequent MPI_RECV(temp1(1,1), ...) accesses element (1,1) which is out of bounds.

Fix: Use allocate(temp1(9, max(..., 1))) for all four directions.
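The same guard can be sketched in C (buffer layout and the helper name are illustrative; the `9` comes from the `temp1(9, ...)` shape quoted above):

```c
#include <stdlib.h>

#define PDIM 9   /* 9 phase-space components per particle, per the PR */

static int imax(int a, int b) { return a > b ? a : b; }

/* Allocate a receive buffer with at least one particle column, even
   when nrecv == 0.  This mirrors allocate(temp1(9, max(nrecv, 1))):
   MPI_RECV is handed temp1(1,1) as the buffer start regardless of the
   message count, so element (1,1) must exist. */
double *alloc_recv_buffer(int nrecv) {
    return malloc(sizeof(double) * PDIM * (size_t)imax(nrecv, 1));
}
```

With a zero count the extra column is never written by MPI, so the over-allocation is harmless; it only guarantees the buffer-start address is valid.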

4. src/Func/Ptclmger.f90 — zero-sized left/right/up/down allocations

Similar to above: when Nptlocal or numbuf is 0, nsmall becomes 0, leading to zero-sized allocations for the directional send buffers. The MPI_SEND(left(1,1), ...) call then accesses out-of-bounds memory.

Fix: Use max(Nptlocal, 1) / max(numbuf, 1) when computing nsmall.

5. src/Appl/BeamBunch.f90 — rho indexed out of bounds in deposit_BeamBunch

In the charge deposition routine, particle grid indices (ix, jx, kx) can fall outside the local subdomain bounds. With 8 MPI ranks, indices like 33 were computed for an array with upper bound 3.

Fix: Add a bounds check that skips particles whose indices fall outside the local rho grid.
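The skip logic can be sketched in C (the grid-bounds struct and function names are hypothetical; only the skip-if-out-of-range idea comes from the fix itself):

```c
/* Local subdomain extents of the rho grid on this rank (illustrative). */
typedef struct {
    int ixlo, ixhi;
    int jxlo, jxhi;
    int kxlo, kxhi;
} GridBounds;

/* Return 1 if the particle's cell indices (ix, jx, kx) lie inside the
   local rho grid, 0 if the particle must be skipped during deposition.
   Without this check, an index like 33 against an upper bound of 3
   writes far outside the array. */
int in_local_grid(int ix, int jx, int kx, const GridBounds *g) {
    return ix >= g->ixlo && ix <= g->ixhi &&
           jx >= g->jxlo && jx <= g->jxhi &&
           kx >= g->kxlo && kx <= g->kxhi;
}
```

In the deposition loop, the guard simply `cycle`s past any particle for which this test fails, leaving the local charge grid untouched by stray indices.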

Testing

  • Reproducer: error/ directory (attached as error-test.zip) — 10,000 particles through a 4-dipole chicane with extended diagnostics, mpirun -n 8.
  • Before fix: ~1 in 5 runs crash with SIGSEGV (signal 11). 10/10 runs fail with -fcheck=all.
  • After fix: 20/20 runs pass (release build), 10/10 runs pass (debug build with -fcheck=all).
  • Regression: All three bundled examples (Example1, Example2, Example3) still run correctly.

error-test.zip

Files Changed

| File | Change |
|------|--------|
| src/Contrl/Input.f90 | Move -99 check inside the data-reading branch |
| src/Contrl/Output.f90 | 12 loops: `do i = 1` → `do i = 2` |
| src/Func/Ptclmger.f90 | 6 allocations: guard against zero size |
| src/Appl/BeamBunch.f90 | Bounds-check particle indices before deposition |

Acknowledgement

Bug diagnosis and fixes were developed with AI assistance (GitHub Copilot, Claude Opus 4.6). The problem was originally observed in https://christophermayes.github.io/lume-impact/examples/z/elements/csr-zeuthen/.

