fix: persistent MPI workers to eliminate Lustre IOPS storms#22
Open
Conversation
Python MPI workers are now launched once and kept alive across all calibration batches, communicating via length-prefixed pickle over stdin/stdout pipes instead of the shared filesystem. This eliminates the repeated Python-import metadata storms (up to 30,000 IOPS per batch) that were overloading the Lustre MDS on ARC. Fallback chain: PersistentMPI → spawn-per-batch MPI → ProcessPool.

Closes #21

Assisted-by: Claude (Anthropic)
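The length-prefixed pickle framing described above can be sketched as follows. This is a minimal illustration of the wire format (a 4-byte big-endian length header followed by the pickled payload); the helper names `send_msg`/`recv_msg` are hypothetical, not the PR's actual functions.

```python
import pickle
import struct

def send_msg(stream, obj):
    """Write one length-prefixed pickle frame: 4-byte big-endian
    length header, then the pickled payload."""
    payload = pickle.dumps(obj)
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)
    stream.flush()

def recv_msg(stream):
    """Read exactly one frame back; returns the unpickled object."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream closed before a full header arrived")
    (length,) = struct.unpack(">I", header)
    return pickle.loads(stream.read(length))
```

The explicit length header lets the reader know exactly how many bytes to consume per message, so multiple task payloads can share one long-lived pipe without delimiters or filesystem round-trips.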
added 2 commits
March 25, 2026 11:07
Regenerated after execute_batch() refactor shifted line numbers. Assisted-by: Claude (Anthropic)
- Sanitize VERSION in create_release_tarball.sh to replace slashes with hyphens (PR refs like "22/merge" broke the tarball path)
- Add chunks=None to open_mfdataset in merge_netcdf_chunks() to disable dask and prevent scheduler deadlocks under pytest-xdist forked workers on Ubuntu CI

Assisted-by: Claude (Anthropic)
Hi Darri! I have been testing this. I can see the new messages in the log file, but I get this error:
Interestingly, the job on ARC doesn't fail, but after the message, nothing is printed in the log file. I checked whether the cores were running SUMMA or mizuRoute, and they were inactive. So basically, the job is hanging without doing any work after the error message.
added 5 commits
March 25, 2026 21:51
The generated worker script used logging.basicConfig(defaults=...), which requires Python 3.12+. ARC runs an older Python, causing the workers to crash immediately with TypeError on startup. Replaced with a simple _log() helper that prefixes rank info. Also promoted worker stderr output to INFO level for visibility. Assisted-by: Claude (Anthropic)
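A version-portable replacement along the lines described might look like this. The `make_logger` factory is hypothetical (the PR's `_log()` helper may be structured differently); the point is that plain `sys.stderr` writes with a rank prefix work on any Python 3.x, unlike `logging.basicConfig(defaults=...)`, which exists only on 3.12+.

```python
import sys
import time

def make_logger(rank):
    """Return a _log() helper that prefixes each line with a timestamp
    and the MPI rank, using only features available on older Pythons."""
    def _log(msg):
        # Unbuffered stderr writes are visible immediately in worker logs
        sys.stderr.write(
            f"[{time.strftime('%H:%M:%S')}] [rank {rank}] {msg}\n"
        )
        sys.stderr.flush()
    return _log
```

Because the helper avoids the `logging` module's configuration machinery entirely, a generated worker script cannot crash on import with a `TypeError` on older interpreters.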
The stdin/stdout pipe relay through mpirun was causing deadlocks: the coordinator's blocking write could stall when the payload exceeded the OS pipe buffer (~64KB), while mpirun's stdin relay to rank 0 added another layer of unreliability. Switched to file-based signaling through $SLURM_TMPDIR (node-local /tmp): coordinator writes tasks.pkl + tasks.ready signal, broker polls for the signal, distributes via MPI, writes results.pkl + results.ready signal. This avoids both Lustre I/O and pipe issues. stdin/stdout are no longer used — Popen gets DEVNULL for both. Assisted-by: Claude (Anthropic)
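The write-then-signal handshake described above can be sketched as follows. The function names are illustrative, not the PR's actual helpers; the essential ordering is that the payload file is fully written and closed *before* the `.ready` sentinel is created, so a polling reader never observes a partial pickle.

```python
import os
import pickle
import time

def write_with_signal(directory, name, obj):
    """Write <name>.pkl, then touch <name>.ready so a polling reader
    never loads a partially written payload."""
    with open(os.path.join(directory, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)
    # Creating the sentinel last is the whole synchronization protocol
    open(os.path.join(directory, name + ".ready"), "w").close()

def poll_for(directory, name, interval=0.05, timeout=30.0):
    """Poll until <name>.ready appears, then load and return the payload."""
    ready = os.path.join(directory, name + ".ready")
    deadline = time.monotonic() + timeout
    while not os.path.exists(ready):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no {name}.ready within {timeout}s")
        time.sleep(interval)
    with open(os.path.join(directory, name + ".pkl"), "rb") as f:
        return pickle.load(f)
```

With the directory placed under `$SLURM_TMPDIR`, both the payload and the polling stay on node-local disk, avoiding Lustre entirely and sidestepping the ~64 KB pipe-buffer stall.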
Rank 0 was running tasks AND gathering results. Workers that finished before rank 0's own task would block on comm.send waiting for the matching comm.recv that rank 0 couldn't post until its own task finished. With 49 workers blocking, the system stalled. Now rank 0 is a dedicated broker (no task execution). We launch N+1 MPI ranks — rank 0 distributes tasks and gathers results while ranks 1..N do all the work. Also added per-task diagnostic logging so we can see paths and progress in the worker logs. Assisted-by: Claude (Anthropic)
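The dedicated-broker split can be illustrated without an MPI launcher by letting `queue.Queue` stand in for `comm.send`/`comm.recv`. This is a schematic of the control flow only, not the PR's mpi4py code: the broker never executes a task, so it is always ready to receive results and the finished-worker deadlock cannot occur.

```python
import queue
import threading

def broker(tasks, n_workers, task_q, result_q):
    """Rank-0 role: distribute tasks and gather results, never execute.
    Because it runs no task itself, it can always service a receive."""
    for t in tasks:
        task_q.put(t)
    for _ in range(n_workers):
        task_q.put(None)  # one shutdown sentinel per worker
    return [result_q.get() for _ in range(len(tasks))]

def worker(run_task, task_q, result_q):
    """Ranks 1..N-1 role: loop on tasks until the sentinel arrives."""
    while (task := task_q.get()) is not None:
        result_q.put(run_task(task))

def run_pool(tasks, n_workers, run_task):
    task_q, result_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(run_task, task_q, result_q))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    results = broker(tasks, n_workers, task_q, result_q)
    for t in threads:
        t.join()
    return results
```

In the real MPI version the queues become point-to-point sends and receives between rank 0 and ranks 1..N-1, but the invariant is the same: the gatherer has no work of its own blocking its receives.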
The +1 broker rank exceeded the available slots on ARC (50 requested, 51 launched). Reverted to N ranks where rank 0 is a dedicated broker and ranks 1..N-1 are workers (49 task workers with 50 ranks). Assisted-by: Claude (Anthropic)
…ocal /tmp When USE_LOCAL_SCRATCH is true and SLURM_TMPDIR is available, parallel processing directories (where SUMMA/mizuRoute read and write) are created under $SLURM_TMPDIR instead of the shared Lustre filesystem. Results are staged back to permanent storage during cleanup(). This is the missing piece that caused the I/O contention Nico observed: 50 concurrent SUMMA+mizuRoute processes hitting Lustre simultaneously. With scratch enabled, all model I/O stays on node-local /tmp. Assisted-by: Claude (Anthropic)
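The scratch-then-stage-back flow can be sketched as below. The helper names and the `use_local_scratch` flag are illustrative (the PR's `USE_LOCAL_SCRATCH` setting and `cleanup()` hook may differ in detail); the pattern is: run under `$SLURM_TMPDIR` when available, fall back to permanent storage otherwise, and copy results back at cleanup time.

```python
import os
import shutil

def make_run_dir(run_name, permanent_root, use_local_scratch=True):
    """Place the parallel-processing directory on node-local scratch
    when SLURM_TMPDIR is set; otherwise use permanent (Lustre) storage."""
    scratch = os.environ.get("SLURM_TMPDIR")
    root = scratch if (use_local_scratch and scratch) else permanent_root
    run_dir = os.path.join(root, run_name)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir

def stage_back(run_dir, permanent_root):
    """Copy results from scratch to permanent storage during cleanup;
    a no-op when the run already lives under permanent storage."""
    dest = os.path.join(permanent_root, os.path.basename(run_dir))
    if os.path.abspath(run_dir) != os.path.abspath(dest):
        shutil.copytree(run_dir, dest, dirs_exist_ok=True)
    return dest
```

All SUMMA/mizuRoute reads and writes then hit local disk during the run, and Lustre sees only one bulk copy per job instead of 50 processes of concurrent small I/O. (`dirs_exist_ok` requires Python 3.8+.)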
Summary
`PersistentMPIExecutionStrategy` that launches MPI worker processes once and keeps them alive across all calibration batches, communicating via stdin/stdout pipes instead of the shared filesystem.

Context
Dmitri (RCS, UCalgary) profiled SYMFLUENCE calibration jobs on ARC and identified that each `mpirun` invocation spawned 50 fresh Python processes, each stat-ing thousands of files to import modules. With 100+ batches per calibration, this produced devastating IOPS bursts on the Lustre metadata server. SUMMA and mizuRoute IOPS were already resolved; this addresses the Python orchestration layer.

Changes
- `execution_strategies/base.py`: `startup()`, `shutdown()`, `is_persistent` lifecycle hooks
- `execution_strategies/mpi_persistent.py`: new `PersistentMPIExecutionStrategy`
- `execution_strategies/__init__.py`, `parallel/__init__.py`: export the new strategy
- `parallel_execution.py`: `execute_batch()` uses persistent workers first; added lifecycle helpers
- `base_model_optimizer.py`: `cleanup()` shuts down persistent workers
- `.gitignore`: un-ignore `mpi_persistent.py` (caught by the existing `mpi_*` rule)

How it works

Workers are launched once as `mpirun -n N python persistent_worker.py` using `Popen` with stdin/stdout pipes. Python imports happen once.

Test plan
Closes #21
🤖 Generated with Claude Code