Incorrect MaxRSS for workers started on master node (rusage data copied from parent?) #3

@vasdommes

Description

Summary

When the Hyperion executable calls itself directly (via callProcess) to spawn a child process, its rusage data (in particular, MaxRSS) is apparently copied to the child.

Description

Observed on Expanse HPC while running stress-tensors-3d tests.

For an nmax=6 test, all individual tasks in the schedule are small, up to ~1 GB each.
However, the master process consumed 10+ GB.
All tasks running on the master node showed monotonically increasing MaxRSS, growing from 5 to 10 GB.
Tasks running on a remote node reported correct MaxRSS values.

Example: shutdown messages for several consecutive tasks on the master node:

/expanse/lustre/scratch/vdommes/temp_project/logs/2025-07/jmySU/0/exp-5-46.0.log

[Thu 07/17/25 19:20:06] Shutting down.
[Thu 07/17/25 19:20:06] Max resident set size: self: 5.629 GB, children: 0.335 GB
<...>
[Thu 07/17/25 19:20:06] Start ReusableWorker
<...>
[Thu 07/17/25 19:20:30] Shutting down.
[Thu 07/17/25 19:20:30] Max resident set size: self: 6.122 GB, children: 0.506 GB
<...>
[Thu 07/17/25 20:02:15] Shutting down.
[Thu 07/17/25 20:02:15] Max resident set size: self: 10.919 GB, children: 0.000 GB
[Thu 07/17/25 20:02:54] Shutting down.
[Thu 07/17/25 20:02:54] Max resident set size: self: 6.122 GB, children: 0.000 GB

Note that the last line corresponds to the ReusableWorker that started earlier, at 19:20:06.

If each worker call is wrapped in /usr/bin/time -v, then MaxRSS is reported correctly.

Possible explanation and fix

The Hyperion executable spawns copies of itself (with different arguments) via System.Process.callProcess. This should lead to fork (copy the current process) + exec* (replace it with a new image) system calls, which is the standard way of creating a new OS process on Linux.
One would expect exec* to reset all rusage data, but since the new binary is the same as the old one, this apparently does not happen (due to some optimization?).
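The hypothesis can be probed with getrusage(2) directly. Below is a minimal sketch (not from the Hyperion code base) that reads the caller's own MaxRSS via the FFI; the struct offsets assume the Linux/x86-64 layout of struct rusage (two 16-byte struct timeval fields followed by ru_maxrss):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign (Ptr, allocaBytes, peekByteOff)
import Foreign.C.Types (CInt (..), CLong)

-- Bind getrusage(2) directly; the unix package does not expose it.
foreign import ccall unsafe "sys/resource.h getrusage"
  c_getrusage :: CInt -> Ptr () -> IO CInt

-- RUSAGE_SELF = 0; on Linux/x86-64, ru_maxrss (in KiB) sits at byte
-- offset 32 of struct rusage, whose total size is 144 bytes.
selfMaxRssKiB :: IO CLong
selfMaxRssKiB =
  allocaBytes 144 $ \buf -> do
    _ <- c_getrusage 0 buf
    peekByteOff buf 32

main :: IO ()
main = do
  kib <- selfMaxRssKiB
  putStrLn ("self MaxRSS: " ++ show kib ++ " KiB")
```

Printing this value right after a worker starts would show whether the counter carries over the parent's peak or begins from the fresh process's own footprint.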

This could be fixed by wrapping worker calls with time, sh, or any other executable, instead of calling the worker binary directly.
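As a sketch of that workaround (the function name is hypothetical, not from the Hyperion code base), the local launch could go through /usr/bin/env so that the exec'd image differs from the parent binary:

```haskell
import System.Process (callProcess)

-- Hypothetical sketch of the proposed fix: instead of exec'ing a copy
-- of the current binary directly, launch the worker through a shim such
-- as /usr/bin/env (or time, or sh), so the kernel loads a different
-- image and the worker's MaxRSS reflects only its own footprint.
runCmdLocalWrapped :: (FilePath, [String]) -> IO ()
runCmdLocalWrapped (exe, args) = callProcess "/usr/bin/env" (exe : args)

main :: IO ()
main = runCmdLocalWrapped ("true", [])
```

env simply exec's its argument, so this adds no measurable overhead and keeps argument passing intact (unlike sh -c, which would require quoting).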

Remote worker calls are already wrapped in ssh or srun, and thus report MaxRSS correctly.

Related code:

withNodeLauncher cfg addr' go = case addr' of

runCmdLocalAsync c = Async.async (uncurry callProcess c) >>= Async.link

remoteRunCmd :: String -> CommandTransport -> (String, [String]) -> IO ()

See also:

https://stackoverflow.com/questions/13880724/python-getrusage-with-rusage-children-behaves-stangely
