Summary
When the Hyperion executable calls itself directly (via callProcess) to spawn a child process, its rusage data (in particular, MaxRSS) is apparently inherited by the child.
Description
Observed on Expanse HPC while running stress-tensors-3d tests.
For an nmax=6 test, all individual tasks in the schedule are small, up to ~1 GB.
However, the master process consumed 10+ GB.
All the tasks running on the master node showed monotonically increasing MaxRSS, from 5 to 10 GB.
The tasks running on a remote node reported correct MaxRSS.
Example - shutdown messages for several consecutive tasks on a master node:
/expanse/lustre/scratch/vdommes/temp_project/logs/2025-07/jmySU/0/exp-5-46.0.log
[Thu 07/17/25 19:20:06] Shutting down.
[Thu 07/17/25 19:20:06] Max resident set size: self: 5.629 GB, children: 0.335 GB
<...>
[Thu 07/17/25 19:20:06] Start ReusableWorker
<...>
[Thu 07/17/25 19:20:30] Shutting down.
[Thu 07/17/25 19:20:30] Max resident set size: self: 6.122 GB, children: 0.506 GB
<...>
[Thu 07/17/25 20:02:15] Shutting down.
[Thu 07/17/25 20:02:15] Max resident set size: self: 10.919 GB, children: 0.000 GB
[Thu 07/17/25 20:02:54] Shutting down.
[Thu 07/17/25 20:02:54] Max resident set size: self: 6.122 GB, children: 0.000 GB
Note that the last line corresponds to the ReusableWorker that started earlier, at 19:20:06.
If each worker call is wrapped in /usr/bin/time -v, MaxRSS is reported correctly.
Possible explanation and fix
The Hyperion executable spawns copies of itself (with different arguments) via System.Process.callProcess. This should lead to fork (copy the current process) followed by exec* (replace it with a new program) system calls, which is the standard way of creating a new OS process on Linux.
exec* should reset all rusage data. But since the new binary is the same as the old one, this apparently does not happen (due to some optimization?).
This could be fixed by wrapping worker calls in time, sh, or any other executable instead of calling the worker directly.
Remote worker calls are already wrapped in ssh or srun, and thus work correctly.
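The wrapping suggested above could be sketched roughly as follows. This is only an illustration, not Hyperion's actual code: the names wrapWith, timeWrapper and runCmdLocalWrapped are hypothetical, and /usr/bin/time -v is used as the wrapper because that is the one variant reported above to give correct MaxRSS. It assumes the wrapper starts the worker as a fresh child of its own, as time does; a wrapper that merely execs the command in place might not help.

```haskell
module Main where

import System.Process (callProcess)

-- | Hypothetical helper: prefix a command with a wrapper executable, so that
-- the worker binary is started by the wrapper rather than exec'ed directly
-- from the forked Hyperion process.
wrapWith :: (String, [String]) -> (String, [String]) -> (String, [String])
wrapWith (wrapper, wrapperArgs) (exe, args) = (wrapper, wrapperArgs ++ exe : args)

-- | Assumed wrapper: GNU time in verbose mode (writes its report to stderr).
timeWrapper :: (String, [String])
timeWrapper = ("/usr/bin/time", ["-v"])

-- | Hypothetical local runner: same shape as 'uncurry callProcess', but the
-- command is wrapped first.
runCmdLocalWrapped :: (String, [String]) -> IO ()
runCmdLocalWrapped = uncurry callProcess . wrapWith timeWrapper

main :: IO ()
main =
  -- Only show the wrapped command line; nothing is spawned here.
  print (wrapWith timeWrapper ("hyperion", ["worker"]))
```

In this sketch, runCmdLocalWrapped would replace the uncurry callProcess part of the local runner, leaving the Async handling unchanged.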
Related code:
Line 319 in 092586d:
    withNodeLauncher cfg addr' go = case addr' of
Line 230 in 092586d:
    runCmdLocalAsync c = Async.async (uncurry callProcess c) >>= Async.link
hyperion/src/Hyperion/WorkerCpuPool.hs, line 184 in 092586d:
    remoteRunCmd :: String -> CommandTransport -> (String, [String]) -> IO ()
See also:
https://stackoverflow.com/questions/13880724/python-getrusage-with-rusage-children-behaves-stangely