fix(mlmg): stop GPU segfault in EB mask (default solver broken on CUDA wheel) #299
Merged
The previous subprocess-isolation attempt backfired: spawning a CUDA-initialising child per combo while the parent kernel still holds the GPU caused even pcg to abort (MPI_Abort, exit 6) from device-memory contention, and did not reliably keep the kernel alive.

Revert to the proven in-process loop and instead drop only the two solvers that actually hard-abort on a GPU build: standalone smg and pfmg (no Krylov wrapper). In the original report every other combo up to bicgstab ran fine in-process; standalone SMG was the sole kernel-killer. As preconditioners (pcg+smg, flexgmres+pfmg) these multigrids are capped at one V-cycle and stay. On CPU builds nothing is skipped. Soft non-convergence is still caught and shown as a FAILED row.
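The skip rule described in this commit message can be sketched as a small filter. This is only an illustration in C++: the bake-off itself lives in a notebook, and `SolverCombo`, `is_standalone_multigrid`, and `combos_to_run` are hypothetical names that do not come from the repository.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical description of one bake-off entry (illustrative names only).
struct SolverCombo {
    std::string solver;          // e.g. "pcg", "smg", "flexgmres"
    std::string preconditioner;  // empty when the solver runs standalone
};

// True for the combos the commit message says hard-abort on GPU builds:
// SMG or PFMG used directly, with no Krylov wrapper around them.
bool is_standalone_multigrid(const SolverCombo& c) {
    return c.preconditioner.empty() &&
           (c.solver == "smg" || c.solver == "pfmg");
}

// Returns the combos to run in-process. On CPU builds nothing is skipped;
// on GPU builds only standalone SMG/PFMG are dropped, while the same
// multigrids used as preconditioners (pcg+smg, flexgmres+pfmg) are kept.
std::vector<SolverCombo> combos_to_run(const std::vector<SolverCombo>& all,
                                       bool gpu_build) {
    std::vector<SolverCombo> kept;
    std::copy_if(all.begin(), all.end(), std::back_inserter(kept),
                 [gpu_build](const SolverCombo& c) {
                     return !(gpu_build && is_standalone_multigrid(c));
                 });
    return kept;
}
```

Keeping the rule as a predicate on the combo, rather than a hard-coded list of names to delete, leaves the preconditioned variants untouched by construction.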
TortuosityMLMG::solve() passed a GPU DeviceVector pointer to ActiveMaskIF, the implicit function consumed by EB2::Build. AMReX evaluates that implicit function on the host while generating the EB geometry, so on a GPU build the host code dereferenced a device pointer and segfaulted (SIGSEGV/MPI_ABORT errorcode 11) at the very first EB step: before the solve began, for every geometry, with the GPU otherwise idle.

Because MLMG is the default solver (solver="auto"), this broke the entire openimpala-cuda wheel for all geometries. CPU-only CI never caught it: on a CPU build the mask pointer is host memory, so the host-side IF evaluation is valid and the tTortuosityMLMG tests pass.

Allocate the mask as amrex::Gpu::ManagedVector so the single pointer is valid on both host and device, regardless of where AMReX evaluates the IF. On CPU builds ManagedVector degrades to an ordinary host allocation, so behaviour is unchanged there.

Needs validation on real GPU hardware: rebuild the CUDA wheel and re-run solver="mlmg" on a T4 (and the profiling notebook bake-off).
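To make the pointer contract concrete, here is a minimal sketch of a mask-backed implicit function in plain C++. It is not the repository's `ActiveMaskIF`: the struct name, the fields, the index layout, and the sign convention are all assumptions, and `std::vector` stands in for `amrex::Gpu::ManagedVector<int>` so that the example compiles without AMReX or CUDA. The point it illustrates is that the functor stores only a raw pointer, so whichever side calls `operator()` must be able to dereference that pointer.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a mask-backed implicit function. It is a small
// copyable functor that keeps only a raw pointer to the mask plus the grid
// shape, so the caller of operator() must be able to read `mask`.
struct MaskImplicitFunction {
    const int* mask;           // must be host-readable if evaluated on the host
    std::array<int, 3> n;      // number of cells per direction
    std::array<double, 3> dx;  // cell size per direction (domain starts at 0)

    // Assumed sign convention for this sketch: negative where the mask marks
    // a cell active, positive elsewhere.
    double operator()(const std::array<double, 3>& p) const {
        std::array<int, 3> ijk{};
        for (int d = 0; d < 3; ++d) {
            const int i = static_cast<int>(p[d] / dx[d]);
            ijk[d] = std::min(std::max(i, 0), n[d] - 1);  // clamp to the grid
        }
        // Assumed layout: x fastest, then y, then z.
        const std::size_t idx =
            static_cast<std::size_t>(ijk[0]) +
            static_cast<std::size_t>(n[0]) *
                (static_cast<std::size_t>(ijk[1]) +
                 static_cast<std::size_t>(n[1]) * static_cast<std::size_t>(ijk[2]));
        return mask[idx] != 0 ? -1.0 : 1.0;  // this read is what crashed
    }
};

// Host-accessible storage for the sketch. Cells with i < nx/2 are active.
std::vector<int> make_half_active_mask(int nx, int ny, int nz) {
    std::vector<int> m(static_cast<std::size_t>(nx) * ny * nz, 0);
    for (int k = 0; k < nz; ++k)
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx / 2; ++i)
                m[static_cast<std::size_t>(i) +
                  static_cast<std::size_t>(nx) * (j + static_cast<std::size_t>(ny) * k)] = 1;
    return m;
}
```

If `mask` pointed at device-only memory, the final read in `operator()` would be an invalid host access, which matches the failure mode described above. Storage that is addressable from both host and device removes that dependency on where the function is evaluated.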
Performance Benchmark Results
Fastest solver: bicgstab at 64³ (0.3737 s)
Benchmark: uniform block (analytical τ = (N-1)/N)
Headline: the default solver (`solver="auto"` = MLMG) segfaults on every GPU build

`TortuosityMLMG::solve()` passed a GPU `DeviceVector` pointer to `ActiveMaskIF`, the implicit function consumed by `amrex::EB2::Build`. AMReX evaluates that implicit function on the host while generating the EB geometry, so on a GPU build the CPU code dereferenced a device pointer and segfaulted (`SIGSEGV` → `MPI_ABORT` errorcode 11) at the very first EB step: before the solve started, for every geometry, with the GPU otherwise idle.

Because MLMG is the library default, this broke the entire `openimpala-cuda` wheel. CPU-only CI never caught it: on a CPU build the mask pointer is host memory, so the host-side IF evaluation is valid and `tTortuosityMLMG` passes.

How it was diagnosed

Captured, isolated runs on a Colab T4 (CUDA wheel) ruled out the obvious culprits in turn:
`solve()`'s first EB step: output always dies right after the constructor's `Initialized with eps=...` line, inside `EB2::Build`.

The code path (`TortuosityMLMG.cpp`): device pointer at the `#ifdef AMREX_USE_GPU` block → `ActiveMaskIF{… mask_data_ptr}` → `EB2::Build(...)` → `ActiveMaskIF::operator()` dereferences `mask[idx]`.

The fix

Allocate the mask as `amrex::Gpu::ManagedVector<int>` so the single pointer is valid on both host and device, regardless of where AMReX evaluates the IF. On CPU builds `ManagedVector` degrades to an ordinary host allocation, so behaviour there is unchanged.

The diagnosis is code-confirmed, but the fix has not been run on GPU hardware: I had no GPU/CUDA build available, and CPU CI is structurally blind to this class of bug. Please rebuild the CUDA wheel and re-run `solver="mlmg"` on a T4 (and the profiling-notebook bake-off) before merging the fix with confidence. A regression test that exercises the EB/MLMG path with a device mask pointer would prevent recurrence.

Also bundled in this branch (per request)

These are independent and can be reviewed separately:
- docs: refresh `CLAUDE.md`. The file described Fortran kernels (`*.F90`/`*_F.H`) that no longer exist (migrated to native C++ in `TortuosityKernels.H`); the refresh also adds the MLMG solver, `TortuositySolverBase`, the homogenisation/microstructure modules, and the Python/pybind11 layer to the reference tables.
- notebooks: harden the §3 solver bake-off (`notebooks/profiling_and_tuning.ipynb`). Standalone SMG/PFMG produce a NaN residual on GPU (numerical breakdown on masked rows) and could hard-kill the kernel; the bake-off now skips them on GPU builds (they're used as preconditioners regardless) and notes why.
- chore: add SessionStart hook. Sets the git commit identity in the ephemeral web containers so commits are attributed correctly.

Follow-ups worth a separate issue

- `gmres` reported "converged but produced an invalid result" (flux-conservation check). Smells like the convergence tolerance not pinning down τ; needs its own investigation.

https://claude.ai/code/session_01VYc1je5VpiW46QvRBw8QFv
Generated by Claude Code