fix(mlmg): stop GPU segfault in EB mask (default solver broken on CUDA wheel) #299

Merged
jameslehoux merged 2 commits into master from claude/elegant-ritchie-qjw4r0
Jun 13, 2026

@jameslehoux

Headline: the default solver (solver="auto" = MLMG) segfaults on every GPU build

TortuosityMLMG::solve() passed a GPU DeviceVector pointer to ActiveMaskIF, the implicit function consumed by amrex::EB2::Build. AMReX evaluates that implicit function on the host while generating the EB geometry, so on a GPU build the CPU code dereferenced a device pointer and segfaulted (SIGSEGV / MPI_ABORT errorcode 11) at the very first EB step: before the solve started, for every geometry, with the GPU otherwise idle.

Because MLMG is the library default, this broke the entire openimpala-cuda wheel. CPU-only CI never caught it: on a CPU build the mask pointer is host memory, so the host-side IF evaluation is valid and tTortuosityMLMG passes.

How it was diagnosed

Captured, isolated runs on a Colab T4 (CUDA wheel) ruled out the obvious culprits in turn:

  • Not OOM — the failing solve used 13 MiB with 14.6 GB free.
  • Not cut-cell handling — MLMG segfaults on a fully-active uniform block and on porespy blobs, identically.
  • Localised to solve()'s first EB step — output always dies right after the constructor's Initialized with eps=... line, inside EB2::Build.

The code path (TortuosityMLMG.cpp): device pointer at the #ifdef AMREX_USE_GPU block → ActiveMaskIF{… mask_data_ptr} → EB2::Build(...) → ActiveMaskIF::operator() dereferences mask[idx].

The fix

Allocate the mask as amrex::Gpu::ManagedVector<int> so the single pointer is valid on both host and device, regardless of where AMReX evaluates the IF. On CPU builds ManagedVector degrades to an ordinary host allocation, so behaviour there is unchanged.

⚠️ Validation status

The diagnosis is code-confirmed, but this has not been run on GPU hardware — I had no GPU/CUDA build available, and CPU CI is structurally blind to this class of bug. Please rebuild the CUDA wheel and re-run solver="mlmg" on a T4 (and the profiling-notebook bake-off) before merging the fix with confidence. A regression test that exercises the EB/MLMG path with a device mask pointer would prevent recurrence.


Also bundled in this branch (per request)

These are independent and can be reviewed separately:

  • docs: refresh CLAUDE.md — the file described Fortran kernels (*.F90/*_F.H) that no longer exist (migrated to native C++ in TortuosityKernels.H); also adds the MLMG solver, TortuositySolverBase, the homogenisation/microstructure modules, and the Python/pybind11 layer to the reference tables.
  • notebooks: harden the §3 solver bake-off (notebooks/profiling_and_tuning.ipynb). Standalone SMG/PFMG produce a NaN residual on GPU (numerical breakdown on masked rows) and could hard-kill the kernel; the bake-off now skips them on GPU builds (they're used as preconditioners regardless) and notes why.
  • chore: add SessionStart hook — sets the git commit identity in the ephemeral web containers so commits are attributed correctly.
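
The bake-off skip rule in the list above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `select_combos`, `run_bakeoff`, the `(solver, preconditioner)` tuple layout, the `solve` callable and the `RuntimeError` used for soft non-convergence are all hypothetical names and assumptions introduced here.

```python
# Hypothetical sketch: none of these names come from the notebook itself.
STANDALONE_MULTIGRID = {"smg", "pfmg"}  # solvers reported to hard-abort on GPU builds


def select_combos(combos, gpu_build):
    """Split (solver, preconditioner) pairs into (kept, skipped).

    A combo is skipped only when it is a standalone multigrid solver
    (preconditioner is None, i.e. no Krylov wrapper) on a GPU build.
    """
    kept, skipped = [], []
    for solver, precond in combos:
        standalone_mg = solver in STANDALONE_MULTIGRID and precond is None
        target = skipped if (gpu_build and standalone_mg) else kept
        target.append((solver, precond))
    return kept, skipped


def run_bakeoff(combos, solve, gpu_build):
    """Run the kept combos in-process; soft failures become FAILED rows."""
    kept, skipped = select_combos(combos, gpu_build)
    rows = [(s, p, "SKIPPED (GPU build)") for s, p in skipped]
    for solver, precond in kept:
        try:
            tau = solve(solver, precond)
            rows.append((solver, precond, f"tau={tau:.4f}"))
        except RuntimeError as exc:  # assumed exception type for non-convergence
            rows.append((solver, precond, f"FAILED: {exc}"))
    return rows
```

With `gpu_build=False` nothing is skipped, and combos such as `("pcg", "smg")` are kept on either build because the multigrid appears only as a preconditioner.
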

Follow-ups worth a separate issue

  • Several Krylov configs "converge" to disagreeing τ on the same 64³ system (2.38 / 2.46 / 2.63 / 2.51), and gmres reported "converged but produced an invalid result" (flux-conservation check). Smells like the convergence tolerance not pinning down τ — needs its own investigation.
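
To put a number on the disagreement noted above, here is a rough sketch that computes the relative spread of the four quoted τ values. The metric (max minus min, divided by the mean) is an illustrative choice made here, not something the PR defines.

```python
from statistics import mean

# Tortuosity values quoted in the follow-up note for the same 64^3 system.
taus = [2.38, 2.46, 2.63, 2.51]


def relative_spread(values):
    """(max - min) / mean: a crude measure of disagreement between runs."""
    return (max(values) - min(values)) / mean(values)


spread = relative_spread(taus)
print(f"relative spread = {spread:.1%}")
```

The spread comes out at roughly 10% of the mean, which is far larger than one would expect from solvers that have all genuinely converged on the same linear system.
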

https://claude.ai/code/session_01VYc1je5VpiW46QvRBw8QFv


Generated by Claude Code

The previous subprocess-isolation attempt backfired: spawning a CUDA-
initialising child per combo while the parent kernel still holds the GPU
caused even pcg to abort (MPI_Abort, exit 6) from device-memory contention,
and did not reliably keep the kernel alive.

Revert to the proven in-process loop and instead drop only the two solvers
that actually hard-abort on a GPU build — standalone smg and pfmg (no Krylov
wrapper). In the original report every other combo up to bicgstab ran fine
in-process; standalone SMG was the sole kernel-killer. As preconditioners
(pcg+smg, flexgmres+pfmg) these multigrids are capped at one V-cycle and stay.
On CPU builds nothing is skipped. Soft non-convergence is still caught and
shown as a FAILED row.
TortuosityMLMG::solve() passed a GPU DeviceVector pointer to ActiveMaskIF,
the implicit function consumed by EB2::Build. AMReX evaluates that implicit
function on the host while generating the EB geometry, so on a GPU build the
host code dereferenced a device pointer and segfaulted (SIGSEGV/MPI_ABORT
errorcode 11) at the very first EB step — before the solve began, for every
geometry, with the GPU otherwise idle.

Because MLMG is the default solver (solver="auto"), this broke the entire
openimpala-cuda wheel for all geometries. CPU-only CI never caught it: on a
CPU build the mask pointer is host memory, so the host-side IF evaluation is
valid and the tTortuosityMLMG tests pass.

Allocate the mask as amrex::Gpu::ManagedVector so the single pointer is valid
on both host and device, regardless of where AMReX evaluates the IF. On CPU
builds ManagedVector degrades to an ordinary host allocation, so behaviour is
unchanged there.

Needs validation on real GPU hardware: rebuild the CUDA wheel and re-run
solver="mlmg" on a T4 (and the profiling notebook bake-off).
@jameslehoux jameslehoux merged commit b0ae42d into master Jun 13, 2026
6 checks passed
@github-actions

Performance Benchmark Results

| Size | Solver | Wall Time (s) | Tortuosity | Expected | Rel. Error | Iters | Status |
|------|--------|---------------|------------|----------|------------|-------|--------|
| 64³ | pcg | 0.6247 | 0.984375 | 0.984375 | 0.00e+00 | 1 | PASS |
| 64³ | flexgmres | 0.3816 | 0.984375 | 0.984375 | 0.00e+00 | N/A | PASS |
| 64³ | bicgstab | 0.3737 | 0.984375 | 0.984375 | 0.00e+00 | N/A | PASS |
| 64³ | gmres | 0.3757 | 0.984375 | 0.984375 | 0.00e+00 | N/A | PASS |
| 128³ | pcg | 7.3815 | 0.992188 | 0.992188 | 0.00e+00 | 1 | PASS |
| 128³ | flexgmres | 5.2642 | 0.992188 | 0.992188 | 0.00e+00 | N/A | PASS |
| 128³ | bicgstab | 5.1410 | 0.992188 | 0.992188 | 0.00e+00 | N/A | PASS |
| 128³ | gmres | 5.1432 | 0.992188 | 0.992188 | 0.00e+00 | N/A | PASS |

Fastest solver: bicgstab at 64³ (0.3737s)

Benchmark: uniform block (analytical τ = (N-1)/N)
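
As a quick sanity check on the Expected column, the analytical value τ = (N-1)/N can be evaluated directly. The sketch below is illustrative only; `expected_tau` and `rel_error` are hypothetical helper names, not part of the benchmark script.

```python
def expected_tau(n):
    """Analytical tortuosity quoted for the uniform-block benchmark: (N - 1) / N."""
    return (n - 1) / n


def rel_error(measured, expected):
    """Relative error of a measured tortuosity against the analytical value."""
    return abs(measured - expected) / expected


# Compare against the values printed in the benchmark results (six decimals).
for n, reported in [(64, 0.984375), (128, 0.992188)]:
    exact = expected_tau(n)
    print(f"N={n}: analytical={exact}, reported={reported}, "
          f"rel. error vs reported={rel_error(reported, exact):.2e}")
```

For N = 64 the analytical value is exactly 0.984375. For N = 128 it is 0.9921875, which the results print to six decimals as 0.992188, so a small non-zero difference appears here purely from that rounding.
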

@github-actions

Code Coverage Report

------------------------------------------------------------------------------
                           GCC Code Coverage Report
Directory: .
------------------------------------------------------------------------------
File                                       Lines     Exec  Cover   Missing
------------------------------------------------------------------------------
src/io/CathodeWrite.cpp                       95       83    87%   40-41,97-100,115-116,182-185
src/io/CathodeWrite.H                          1        1   100%
src/io/DatReader.cpp                         136      106    77%   28-29,32,37,94-95,101-102,109-111,137-139,143,146-150,154-157,164,166,210-211,256,259
src/io/DatReader.H                             1        1   100%
src/io/HDF5Reader.cpp                        344       84    24%   40-41,43-44,46-49,52,54-56,58-59,62,64-66,68-74,92-93,126-128,144-145,154-157,174-180,182-187,204,213-215,217,219-228,230-233,236-238,240-251,253-258,266,266,266,266,266,266,266,270,270,270,270,270,270,270,274,276,278,280,282,288,290,297,297,297,297,297,297,297,301,301,301,301,301,301,301,305,305,305,305,305,305,305-306,306,306,306,306,306,306,309,309,309,309,309,309,309-310,310,310,310,310,310,310-311,311,311,311,311,311,311,313,313,313,313,313,313,313-314,314,314,314,314,314,314-315,315,315,315,315,315,315,319,319,319,319,319,319,319,324,324,324,324,324,324,324-325,325,325,325,325,325,325-326,326,326,326,326,326,326-327,327,327,327,327,327,327,332,332,332,332,332,332,332,337,337,337,337,337,337,337-338,338,338,338,338,338,338,343,343,343,343,343,343,343,350,350,350,350,350,350,350,357-358,432-435,437-440
src/io/HDF5Reader.H                            3        3   100%
src/io/ImageLoader.cpp                        61       42    68%   25,38,48,60-62,64-70,72,77,89-90,92,94
src/io/RawReader.cpp                         267      136    50%   51-52,91-92,113-114,117-119,122-123,142-144,157-159,168-170,176-179,187-188,194-198,202-206,211-214,221-226,233-239,273,275-276,278,285-286,303,314,316,320,327,329,333-336,340,348-349,355-357,363-365,367-368,371,374,376,379-382,384-386,388,390-391,393,395-396,398,400-401,403,405-406,408,412-413,415,419-420,422,427,467,473-474,535-538,552,554-556,558,560-562,572,576-578,580,602
src/io/RawReader.H                             1        1   100%
src/io/TiffReader.cpp                        385      131    34%   60-66,68-70,72-74,76-78,80-81,83-85,87-89,91-93,95-97,99-100,102-104,107-109,112-113,115-118,120,123,125-128,144-145,149-151,153-159,161,187,211,218,227,229-232,241,243-246,249,256,289-294,307,310-318,320-321,324-328,332-336,339-343,345-349,352-358,360-364,368,370,376-378,380-394,397,399-403,405-410,414-419,421-426,429-430,433-435,569-589,591-592,595-602,604,607-623,626-628,684,687-688,691-697,699,703-714,716-717
src/io/TiffReader.H                            5        5   100%
src/props/BoundaryCondition.H                131       74    56%   63,68,70,216,224-229,233-236,238-244,247-249,252-253,255,258-261,264-265,271-272,274-279,285-287,290-296,299,303,365-366,371,373
src/props/ConnectedComponents.cpp             71       69    97%   115-116
src/props/ConnectedComponents.H                4        4   100%
src/props/DeffTensor.cpp                      62       59    95%   122,128-129
src/props/Diffusion.cpp                      510      378    74%   93-94,97-98,103-104,106-116,118,123-132,134-141,144-150,153-157,159-163,165,168-173,175-177,179,182-184,186-187,190-191,193,195-198,200,202-203,288-289,297-298,300,349,359-360,368-371,373-375,404-413,415,453,461,465-467,526-527,533,535,539,547,581,610,638,646,735-736,739-740,757-760,771-772,774,824
src/props/EffDiffFillMtx.H                   120      106    88%   58,216-217,221-225,229,231-235
src/props/EffectiveDiffusivityHypre.cpp      413      372    90%   189-191,193-197,352-355,458,610-613,615-617,619-622,631-634,641,670,682-685,687-689,691,706,724,726
src/props/EffectiveDiffusivityHypre.H          7        7   100%
src/props/FloodFill.cpp                       90       87    96%   109-110,250
src/props/HypreStructSolver.cpp              343      210    61%   87-88,121,133-134,145,303,313,315,318,350,360,362,365,371-374,376-380,382-383,385-389,392-393,395-396,398,401-402,405-406,408-411,413-417,419-420,422-426,429-430,432-433,435,438-439,442-443,445-447,449-455,457-461,464-465,467-468,470,473-474,477,479-481,483-489,491-495,498-499,501-502,504,507-508,511,513-515,517-520,522-526,529-530,532-533,535,538-539,542,545-546,559
src/props/HypreStructSolver.H                  6        6   100%
src/props/MacroGeometry.H                     17       17   100%
src/props/ParticleSizeDistribution.cpp        11       11   100%
src/props/ParticleSizeDistribution.H           6        6   100%
src/props/PercolationCheck.cpp                53       46    86%   32-33,49-51,68,73
src/props/PercolationCheck.H                   4        4   100%
src/props/PhysicsConfig.H                     90       89    98%   150
src/props/ResultsJSON.H                      225      222    98%   242,395,416
src/props/REVStudy.cpp                       151      128    84%   72,83-91,159,170-173,175,183-186,188-190
src/props/SolverConfig.H                      32       20    62%   30,32,37-44,75-76
src/props/SpecificSurfaceArea.cpp             56       55    98%   59
src/props/SpecificSurfaceArea.H                6        6   100%
src/props/ThroughThicknessProfile.cpp         38       38   100%
src/props/ThroughThicknessProfile.H            5        5   100%
src/props/Tortuosity.H                         2        2   100%
src/props/TortuosityDirect.cpp               219      191    87%   81-83,86,100-106,113-114,125,134,140,202-209,226,394,424,433
src/props/TortuosityDirect.H                   5        5   100%
src/props/TortuosityHypre.cpp                793      567    71%   149-150,155-156,240-243,246-248,311,335-337,340-341,343,371-373,376-378,408-411,620,644,648,669,686-687,689-691,694-701,708-709,711,713,716-726,730-736,738-742,746-748,750-752,755-762,769-770,772,774-784,788-796,798-801,803,813,819-822,824-826,835-838,840-842,878,881-882,902-904,907,918-921,923,960,965-968,971-973,977-980,982,984-987,989,994-996,998,1047,1056,1061,1064-1069,1085-1088,1102-1106,1111-1116,1126-1130,1135-1140,1145-1149,1152-1155,1162-1165,1176,1185,1187,1191,1193,1218,1259-1260,1346-1348,1474-1477
src/props/TortuosityHypre.H                   15       15   100%
src/props/TortuosityHypreFill.H              127       98    77%   85,203,205-212,237-239,241-245,247-248,250,252,255-256,258-262
src/props/TortuosityKernels.H                 97       53    54%   52,56-60,62-65,69-74,76-80,84-85,90,129,143,157,243,245-248,250-253,257-260,262-265
src/props/TortuosityMLMG.cpp                 152      145    95%   285-287,289-290,295,316
src/props/TortuosityMLMG.H                     1        1   100%
src/props/TortuositySolverBase.cpp           311      247    79%   70-72,74-75,94-100,118,122,124,160-163,218,221,223,409,412-414,416,424-427,429-435,440,445-447,453-454,456-458,494,498-500,503,508-511,513,544,548-550,552,554,558
src/props/TortuositySolverBase.H              13       13   100%
src/props/VolumeFraction.cpp                  25       25   100%
src/props/VolumeFraction.H                     4        4   100%
------------------------------------------------------------------------------
TOTAL                                       5514     3978    72%
------------------------------------------------------------------------------


Generated by CI — coverage data from gcovr

@codecov

codecov Bot commented Jun 13, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/props/TortuosityMLMG.cpp | 33.33% | 0 Missing and 2 partials ⚠️ |
