
fix: persistent MPI workers to eliminate Lustre IOPS storms#22

Open
DarriEy wants to merge 8 commits into `develop` from `fix/persistent-mpi-workers`

Conversation


@DarriEy DarriEy commented Mar 25, 2026

Summary

  • Adds PersistentMPIExecutionStrategy that launches MPI worker processes once and keeps them alive across all calibration batches, communicating via stdin/stdout pipes instead of the shared filesystem
  • Eliminates the repeated Python-import metadata storms (up to 30,000 IOPS per batch) that were overloading the Lustre MDS on ARC (UCalgary) during ASYNC-DDS calibration runs
  • Maintains full backward compatibility with automatic fallback chain: PersistentMPI → spawn-per-batch MPI → ProcessPool

Context

Dmitri (RCS, UCalgary) profiled SYMFLUENCE calibration jobs on ARC and identified that each mpirun invocation spawned 50 fresh Python processes, each stat-ing thousands of files to import modules. With 100+ batches per calibration, this produced devastating IOPS bursts on the Lustre metadata server. SUMMA and mizuRoute IOPS were already resolved — this addresses the Python orchestration layer.

Changes

| File | Change |
| --- | --- |
| `execution_strategies/base.py` | Added `startup()`, `shutdown()`, `is_persistent` lifecycle hooks |
| `execution_strategies/mpi_persistent.py` | New — persistent MPI worker pool with pipe-based communication |
| `execution_strategies/__init__.py` | Registered new strategy |
| `parallel/__init__.py` | Re-exported new strategy |
| `parallel_execution.py` | `execute_batch()` uses persistent workers first; added lifecycle helpers |
| `base_model_optimizer.py` | `cleanup()` shuts down persistent workers |
| `.gitignore` | Exception for `mpi_persistent.py` (existing `mpi_*` rule) |

How it works

  1. First batch: Workers launched via mpirun -n N python persistent_worker.py using Popen with stdin/stdout pipes. Python imports happen once.
  2. Every batch: Tasks sent as length-prefixed pickle frames through stdin. Rank 0 brokers distribution via MPI. Results returned through stdout. Zero Lustre metadata operations.
  3. Cleanup: Shutdown sentinel through pipe → workers exit gracefully.
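The length-prefixed framing in step 2 can be sketched as below. The helper names `send_frame`/`recv_frame` are illustrative, not the PR's actual API; the 4-byte big-endian length prefix is a common convention assumed here.

```python
# Sketch of length-prefixed pickle framing over a byte stream.
import io
import pickle
import struct

def send_frame(stream, obj):
    """Pickle obj and write a 4-byte big-endian length prefix + payload."""
    payload = pickle.dumps(obj)
    stream.write(struct.pack("!I", len(payload)))
    stream.write(payload)
    stream.flush()

def recv_frame(stream):
    """Read one length-prefixed frame and unpickle it."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("pipe closed while reading frame header")
    (length,) = struct.unpack("!I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("pipe closed mid-frame")
    return pickle.loads(payload)

# Round-trip through an in-memory buffer standing in for the worker's stdin.
buf = io.BytesIO()
send_frame(buf, {"task_id": 7, "params": [0.1, 0.2]})
buf.seek(0)
print(recv_frame(buf))  # {'task_id': 7, 'params': [0.1, 0.2]}
```

The explicit length prefix lets the reader block for exactly one complete frame, which is what makes pickles safe to stream over a pipe.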

Test plan

  • All 275 existing unit tests pass
  • All pre-commit hooks pass (ruff, mypy, bandit)
  • Integration test on ARC with ASYNC-DDS calibration (Nico)
  • Dmitri to verify IOPS reduction via profiling

Closes #21

🤖 Generated with Claude Code

Python MPI workers are now launched once and kept alive across all
calibration batches, communicating via length-prefixed pickle over
stdin/stdout pipes instead of the shared filesystem.  This eliminates
the repeated Python-import metadata storms (up to 30,000 IOPS per
batch) that were overloading the Lustre MDS on ARC.

Fallback chain: PersistentMPI → spawn-per-batch MPI → ProcessPool.

Closes #21

Assisted-by: Claude (Anthropic)
DarriEy added 2 commits March 25, 2026 11:07
Regenerated after execute_batch() refactor shifted line numbers.

Assisted-by: Claude (Anthropic)
- Sanitize VERSION in create_release_tarball.sh to replace slashes
  with hyphens (PR refs like "22/merge" broke the tarball path)
- Add chunks=None to open_mfdataset in merge_netcdf_chunks() to
  disable dask and prevent scheduler deadlocks under pytest-xdist
  forked workers on Ubuntu CI

Assisted-by: Claude (Anthropic)
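The VERSION sanitization amounts to a one-line bash pattern substitution. A minimal sketch, assuming `VERSION` holds a ref like `22/merge` (the surrounding script lines are hypothetical, not the actual `create_release_tarball.sh`):

```shell
# Hypothetical excerpt mirroring the fix described above.
VERSION="22/merge"          # e.g. a PR merge ref containing a slash
VERSION="${VERSION//\//-}"  # bash substitution: every "/" becomes "-"
echo "release-${VERSION}.tar.gz"   # prints: release-22-merge.tar.gz
```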

nvasquez-plac commented Mar 25, 2026

Hi Darri! I have been testing this. I can see the new messages in the log file, but I get this error:

2026-03-25 13:35:21 ● Starting persistent MPI workers (50 ranks) via mpirun
2026-03-25 13:35:23 ● Persistent MPI workers started successfully
2026-03-25 13:35:24 ● Persistent MPI process died (returncode=1)
2026-03-25 13:35:24 ● Persistent MPI execution failed: Failed to send tasks to MPI workers: [Errno 32] Broken pipe
2026-03-25 13:35:24 ● Falling back to spawn-per-batch MPI...

Interestingly, the job on ARC doesn't fail, but after that message nothing more is printed to the log file. I checked whether the cores were running SUMMA or mizuRoute, and they were idle. So the job hangs without doing any work after the error message.

DarriEy added 5 commits March 25, 2026 21:51
The generated worker script used logging.basicConfig(defaults=...),
which requires Python 3.12+. ARC runs an older Python, causing the
workers to crash immediately with TypeError on startup.

Replaced with a simple _log() helper that prefixes rank info.
Also promoted worker stderr output to INFO level for visibility.

Assisted-by: Claude (Anthropic)
The stdin/stdout pipe relay through mpirun was causing deadlocks:
the coordinator's blocking write could stall when the payload
exceeded the OS pipe buffer (~64KB), while mpirun's stdin relay
to rank 0 added another layer of unreliability.

Switched to file-based signaling through $SLURM_TMPDIR (node-local
/tmp): coordinator writes tasks.pkl + tasks.ready signal, broker
polls for the signal, distributes via MPI, writes results.pkl +
results.ready signal. This avoids both Lustre I/O and pipe issues.

stdin/stdout are no longer used — Popen gets DEVNULL for both.

Assisted-by: Claude (Anthropic)
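The tasks.pkl + tasks.ready handshake can be sketched as follows. The file names come from the commit message; the helper names and polling parameters are illustrative, with a temp dir standing in for `$SLURM_TMPDIR`:

```python
import os
import pickle
import tempfile
import time

def post(workdir, name, obj):
    """Write <name>.pkl, then create the empty <name>.ready marker.
    The payload is written first, so any reader that sees the marker
    can safely load a complete pickle (ordering is the guarantee)."""
    with open(os.path.join(workdir, f"{name}.pkl"), "wb") as fh:
        pickle.dump(obj, fh)
    open(os.path.join(workdir, f"{name}.ready"), "w").close()

def wait_for(workdir, name, poll=0.01, timeout=5.0):
    """Poll for <name>.ready, then load and return <name>.pkl."""
    marker = os.path.join(workdir, f"{name}.ready")
    deadline = time.monotonic() + timeout
    while not os.path.exists(marker):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no {name}.ready within {timeout}s")
        time.sleep(poll)
    with open(os.path.join(workdir, f"{name}.pkl"), "rb") as fh:
        return pickle.load(fh)

# Demo: coordinator posts tasks, broker picks them up.
with tempfile.TemporaryDirectory() as tmp:
    post(tmp, "tasks", [{"task_id": 1}])
    print(wait_for(tmp, "tasks"))  # [{'task_id': 1}]
```

Because `$SLURM_TMPDIR` is node-local, the polling never touches the Lustre MDS, and there is no pipe buffer to overflow regardless of payload size.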
Rank 0 was running tasks AND gathering results. Workers that finished
before rank 0's own task would block on comm.send waiting for the
matching comm.recv that rank 0 couldn't post until its own task
finished. With 49 workers blocking, the system stalled.

Now rank 0 is a dedicated broker (no task execution). We launch N+1
MPI ranks — rank 0 distributes tasks and gathers results while ranks
1..N do all the work. Also added per-task diagnostic logging so we
can see paths and progress in the worker logs.

Assisted-by: Claude (Anthropic)
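The deadlock and its fix are easiest to see in miniature. In this sketch, threads and queues stand in for MPI ranks and `comm.send`/`comm.recv` (illustration only; the PR's worker script uses real MPI). Because the broker runs no task of its own, it is always ready to receive, so no worker can block on send:

```python
import queue
import threading

def worker(rank, task_q, result_q):
    # Ranks 1..N-1: pull tasks, run them, send results to the broker.
    while True:
        task = task_q.get()
        if task is None:                 # shutdown sentinel
            break
        result_q.put((rank, task * 2))   # stand-in for model execution

def broker(tasks, n_workers):
    # Rank 0: distributes tasks and gathers results, runs none itself.
    task_q, result_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(r, task_q, result_q))
        for r in range(1, n_workers + 1)
    ]
    for t in threads:
        t.start()
    for task in tasks:
        task_q.put(task)
    results = [result_q.get() for _ in tasks]
    for _ in threads:
        task_q.put(None)                 # one sentinel per worker
    for t in threads:
        t.join()
    return results

results = broker(tasks=range(10), n_workers=4)
print(sorted(r for _, r in results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```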
The +1 broker rank exceeded the available slots on ARC (50 requested,
51 launched). Reverted to N ranks where rank 0 is a dedicated broker
and ranks 1..N-1 are workers (49 task workers with 50 ranks).

Assisted-by: Claude (Anthropic)
…ocal /tmp

When USE_LOCAL_SCRATCH is true and SLURM_TMPDIR is available, parallel
processing directories (where SUMMA/mizuRoute read and write) are
created under $SLURM_TMPDIR instead of the shared Lustre filesystem.
Results are staged back to permanent storage during cleanup().

This is the missing piece that caused the I/O contention Nico observed:
50 concurrent SUMMA+mizuRoute processes hitting Lustre simultaneously.
With scratch enabled, all model I/O stays on node-local /tmp.

Assisted-by: Claude (Anthropic)
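The scratch-directory selection and stage-back described above can be sketched as follows (function names and the fallback-to-cwd behavior are assumptions, not SYMFLUENCE's actual code):

```python
import os
import shutil

def make_parallel_dir(run_id, use_local_scratch=True):
    """Choose where a parallel-processing directory lives, mirroring
    the USE_LOCAL_SCRATCH / $SLURM_TMPDIR logic the commit describes."""
    scratch = os.environ.get("SLURM_TMPDIR")
    if use_local_scratch and scratch:
        base = scratch        # node-local /tmp: fast, no Lustre load
    else:
        base = os.getcwd()    # fall back to the shared filesystem
    path = os.path.join(base, f"parallel_{run_id}")
    os.makedirs(path, exist_ok=True)
    return path

def stage_back(work_dir, permanent_dir):
    """Copy results from node-local scratch to permanent storage
    during cleanup(), then drop the scratch copy."""
    shutil.copytree(work_dir, permanent_dir, dirs_exist_ok=True)
    shutil.rmtree(work_dir)
```

The key property is that all intermediate SUMMA/mizuRoute I/O stays on the node; Lustre is touched exactly once per run, during `stage_back`.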