
fix: persistent MPI workers to eliminate Lustre IOPS storms#22

Open
DarriEy wants to merge 8 commits into `develop` from `fix/persistent-mpi-workers`

Conversation


@DarriEy DarriEy commented Mar 25, 2026

Summary

  • Adds PersistentMPIExecutionStrategy that launches MPI worker processes once and keeps them alive across all calibration batches, communicating via stdin/stdout pipes instead of the shared filesystem
  • Eliminates the repeated Python-import metadata storms (up to 30,000 IOPS per batch) that were overloading the Lustre MDS on ARC (UCalgary) during ASYNC-DDS calibration runs
  • Maintains full backward compatibility with automatic fallback chain: PersistentMPI → spawn-per-batch MPI → ProcessPool

Context

Dmitri (RCS, UCalgary) profiled SYMFLUENCE calibration jobs on ARC and identified that each mpirun invocation spawned 50 fresh Python processes, each stat-ing thousands of files to import modules. With 100+ batches per calibration, this produced devastating IOPS bursts on the Lustre metadata server. SUMMA and mizuRoute IOPS were already resolved — this addresses the Python orchestration layer.

Changes

| File | Change |
| --- | --- |
| `execution_strategies/base.py` | Added `startup()`, `shutdown()`, `is_persistent` lifecycle hooks |
| `execution_strategies/mpi_persistent.py` | New — persistent MPI worker pool with pipe-based communication |
| `execution_strategies/__init__.py` | Registered new strategy |
| `parallel/__init__.py` | Re-exported new strategy |
| `parallel_execution.py` | `execute_batch()` uses persistent workers first; added lifecycle helpers |
| `base_model_optimizer.py` | `cleanup()` shuts down persistent workers |
| `.gitignore` | Exception for `mpi_persistent.py` (existing `mpi_*` rule) |

How it works

  1. First batch: Workers launched via mpirun -n N python persistent_worker.py using Popen with stdin/stdout pipes. Python imports happen once.
  2. Every batch: Tasks sent as length-prefixed pickle frames through stdin. Rank 0 brokers distribution via MPI. Results returned through stdout. Zero Lustre metadata operations.
  3. Cleanup: Shutdown sentinel through pipe → workers exit gracefully.
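The length-prefixed framing in step 2 can be sketched as below. The helper names `send_frame`/`recv_frame` are illustrative, not the PR's actual API; the 4-byte big-endian length prefix is a common convention assumed here.

```python
# Sketch of length-prefixed pickle framing over a byte stream.
import io
import pickle
import struct

def send_frame(stream, obj):
    """Pickle obj and write a 4-byte big-endian length prefix + payload."""
    payload = pickle.dumps(obj)
    stream.write(struct.pack("!I", len(payload)))
    stream.write(payload)
    stream.flush()

def recv_frame(stream):
    """Read one length-prefixed frame and unpickle it."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("pipe closed while reading frame header")
    (length,) = struct.unpack("!I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("pipe closed mid-frame")
    return pickle.loads(payload)

# Round-trip through an in-memory buffer standing in for the worker's stdin.
buf = io.BytesIO()
send_frame(buf, {"task_id": 7, "params": [0.1, 0.2]})
buf.seek(0)
print(recv_frame(buf))  # {'task_id': 7, 'params': [0.1, 0.2]}
```

The explicit length prefix lets the reader block for exactly one complete frame, which is what makes pickles safe to stream over a pipe.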

Test plan

  • All 275 existing unit tests pass
  • All pre-commit hooks pass (ruff, mypy, bandit)
  • Integration test on ARC with ASYNC-DDS calibration (Nico)
  • Dmitri to verify IOPS reduction via profiling

Closes #21

🤖 Generated with Claude Code

Python MPI workers are now launched once and kept alive across all
calibration batches, communicating via length-prefixed pickle over
stdin/stdout pipes instead of the shared filesystem.  This eliminates
the repeated Python-import metadata storms (up to 30,000 IOPS per
batch) that were overloading the Lustre MDS on ARC.

Fallback chain: PersistentMPI → spawn-per-batch MPI → ProcessPool.

Closes #21

Assisted-by: Claude (Anthropic)
DarriEy added 2 commits March 25, 2026 11:07
Regenerated after execute_batch() refactor shifted line numbers.

Assisted-by: Claude (Anthropic)
- Sanitize VERSION in create_release_tarball.sh to replace slashes
  with hyphens (PR refs like "22/merge" broke the tarball path)
- Add chunks=None to open_mfdataset in merge_netcdf_chunks() to
  disable dask and prevent scheduler deadlocks under pytest-xdist
  forked workers on Ubuntu CI

Assisted-by: Claude (Anthropic)
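The VERSION sanitization amounts to a one-line bash pattern substitution. A minimal sketch, assuming `VERSION` holds a ref like `22/merge` (the surrounding script lines are hypothetical, not the actual `create_release_tarball.sh`):

```shell
# Hypothetical excerpt mirroring the fix described above.
VERSION="22/merge"          # e.g. a PR merge ref containing a slash
VERSION="${VERSION//\//-}"  # bash substitution: every "/" becomes "-"
echo "release-${VERSION}.tar.gz"   # prints: release-22-merge.tar.gz
```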

nvasquez-plac commented Mar 25, 2026

Hi Darri! I have been testing this. I can see the new messages in the log file, but I get this error:

2026-03-25 13:35:21 ● Starting persistent MPI workers (50 ranks) via mpirun
2026-03-25 13:35:23 ● Persistent MPI workers started successfully
2026-03-25 13:35:24 ● Persistent MPI process died (returncode=1)
2026-03-25 13:35:24 ● Persistent MPI execution failed: Failed to send tasks to MPI workers: [Errno 32] Broken pipe
2026-03-25 13:35:24 ● Falling back to spawn-per-batch MPI...

Interestingly, the job on ARC doesn't fail, but after that message nothing more is printed to the log file. I checked whether the cores were running SUMMA or mizuRoute, and they were idle. So the job hangs without doing any work after the error message.

DarriEy added 5 commits March 25, 2026 21:51
The generated worker script used logging.basicConfig(defaults=...),
which requires Python 3.12+. ARC runs an older Python, causing the
workers to crash immediately with TypeError on startup.

Replaced with a simple _log() helper that prefixes rank info.
Also promoted worker stderr output to INFO level for visibility.

Assisted-by: Claude (Anthropic)
The stdin/stdout pipe relay through mpirun was causing deadlocks:
the coordinator's blocking write could stall when the payload
exceeded the OS pipe buffer (~64KB), while mpirun's stdin relay
to rank 0 added another layer of unreliability.

Switched to file-based signaling through $SLURM_TMPDIR (node-local
/tmp): coordinator writes tasks.pkl + tasks.ready signal, broker
polls for the signal, distributes via MPI, writes results.pkl +
results.ready signal. This avoids both Lustre I/O and pipe issues.

stdin/stdout are no longer used — Popen gets DEVNULL for both.

Assisted-by: Claude (Anthropic)
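The tasks.pkl + tasks.ready handshake can be sketched as follows. The file names come from the commit message; the helper names and polling parameters are illustrative, with a temp dir standing in for `$SLURM_TMPDIR`:

```python
import os
import pickle
import tempfile
import time

def post(workdir, name, obj):
    """Write <name>.pkl, then create the empty <name>.ready marker.
    The payload is written first, so any reader that sees the marker
    can safely load a complete pickle (ordering is the guarantee)."""
    with open(os.path.join(workdir, f"{name}.pkl"), "wb") as fh:
        pickle.dump(obj, fh)
    open(os.path.join(workdir, f"{name}.ready"), "w").close()

def wait_for(workdir, name, poll=0.01, timeout=5.0):
    """Poll for <name>.ready, then load and return <name>.pkl."""
    marker = os.path.join(workdir, f"{name}.ready")
    deadline = time.monotonic() + timeout
    while not os.path.exists(marker):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no {name}.ready within {timeout}s")
        time.sleep(poll)
    with open(os.path.join(workdir, f"{name}.pkl"), "rb") as fh:
        return pickle.load(fh)

# Demo: coordinator posts tasks, broker picks them up.
with tempfile.TemporaryDirectory() as tmp:
    post(tmp, "tasks", [{"task_id": 1}])
    print(wait_for(tmp, "tasks"))  # [{'task_id': 1}]
```

Because `$SLURM_TMPDIR` is node-local, the polling never touches the Lustre MDS, and there is no pipe buffer to overflow regardless of payload size.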
Rank 0 was running tasks AND gathering results. Workers that finished
before rank 0's own task would block on comm.send waiting for the
matching comm.recv that rank 0 couldn't post until its own task
finished. With 49 workers blocking, the system stalled.

Now rank 0 is a dedicated broker (no task execution). We launch N+1
MPI ranks — rank 0 distributes tasks and gathers results while ranks
1..N do all the work. Also added per-task diagnostic logging so we
can see paths and progress in the worker logs.

Assisted-by: Claude (Anthropic)
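The deadlock and its fix are easiest to see in miniature. In this sketch, threads and queues stand in for MPI ranks and `comm.send`/`comm.recv` (illustration only; the PR's worker script uses real MPI). Because the broker runs no task of its own, it is always ready to receive, so no worker can block on send:

```python
import queue
import threading

def worker(rank, task_q, result_q):
    # Ranks 1..N-1: pull tasks, run them, send results to the broker.
    while True:
        task = task_q.get()
        if task is None:                 # shutdown sentinel
            break
        result_q.put((rank, task * 2))   # stand-in for model execution

def broker(tasks, n_workers):
    # Rank 0: distributes tasks and gathers results, runs none itself.
    task_q, result_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(r, task_q, result_q))
        for r in range(1, n_workers + 1)
    ]
    for t in threads:
        t.start()
    for task in tasks:
        task_q.put(task)
    results = [result_q.get() for _ in tasks]
    for _ in threads:
        task_q.put(None)                 # one sentinel per worker
    for t in threads:
        t.join()
    return results

results = broker(tasks=range(10), n_workers=4)
print(sorted(r for _, r in results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```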
The +1 broker rank exceeded the available slots on ARC (50 requested,
51 launched). Reverted to N ranks where rank 0 is a dedicated broker
and ranks 1..N-1 are workers (49 task workers with 50 ranks).

Assisted-by: Claude (Anthropic)
…ocal /tmp

When USE_LOCAL_SCRATCH is true and SLURM_TMPDIR is available, parallel
processing directories (where SUMMA/mizuRoute read and write) are
created under $SLURM_TMPDIR instead of the shared Lustre filesystem.
Results are staged back to permanent storage during cleanup().

This is the missing piece that caused the I/O contention Nico observed:
50 concurrent SUMMA+mizuRoute processes hitting Lustre simultaneously.
With scratch enabled, all model I/O stays on node-local /tmp.

Assisted-by: Claude (Anthropic)
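The scratch-directory selection and stage-back described above can be sketched as follows (function names and the fallback-to-cwd behavior are assumptions, not SYMFLUENCE's actual code):

```python
import os
import shutil

def make_parallel_dir(run_id, use_local_scratch=True):
    """Choose where a parallel-processing directory lives, mirroring
    the USE_LOCAL_SCRATCH / $SLURM_TMPDIR logic the commit describes."""
    scratch = os.environ.get("SLURM_TMPDIR")
    if use_local_scratch and scratch:
        base = scratch        # node-local /tmp: fast, no Lustre load
    else:
        base = os.getcwd()    # fall back to the shared filesystem
    path = os.path.join(base, f"parallel_{run_id}")
    os.makedirs(path, exist_ok=True)
    return path

def stage_back(work_dir, permanent_dir):
    """Copy results from node-local scratch to permanent storage
    during cleanup(), then drop the scratch copy."""
    shutil.copytree(work_dir, permanent_dir, dirs_exist_ok=True)
    shutil.rmtree(work_dir)
```

The key property is that all intermediate SUMMA/mizuRoute I/O stays on the node; Lustre is touched exactly once per run, during `stage_back`.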