Commit ae34e4c
authored
* tests: layered defense against IPC child-process hangs (#2004)
Replace the per-test kill-on-timeout approach with three layered fixes that
defend the IPC test suite against compute-sanitizer + CUDA driver / toolkit
combinations that wedge child processes during IPC teardown.
CI layer: ci/tools/setup-sanitizer now passes
--target-processes=application-only so spawned multiprocessing.Process
children run without the sanitizer attached. This eliminates the root cause
of the hang reported in issue #2004 (compute-sanitizer's IPC teardown
analysis getting stuck under Python 3.12 + CUDA 12.9.1). The parent pytest
process is still fully sanitized.
Fixture layer: ipc_device and ipc_mempool_device_x2 fixtures now wrap their
yield in track_child_processes(), a context manager that patches
multiprocessing.process.BaseProcess.__init__ to record every Process
instance constructed during the test and kills any survivor at teardown.
This protects ipc_memory_resource.mr.close() from blocking on IPC handles
held by a stuck child, regardless of the original failure mode.
Outer-guard layer: pytest-timeout is added to the test dep group, and
tests/memory_ipc/conftest.py applies pytest.mark.timeout(300) to every test
in the directory. This is the final fallback so no IPC test can wedge the
GHA runner for hours if a new failure mode defeats the earlier layers.
The hardcoded CHILD_TIMEOUT_SEC = 30 in every IPC test module is replaced
with a call to helpers.child_processes.child_timeout_sec(), which returns
30 by default and 120 under compute-sanitizer (detected via the existing
under_compute_sanitizer() helper). The unused CHILD_TIMEOUT_SEC declaration
in test_workerpool.py is removed.
* tests: reframe IPC hang docs as test-side bug rather than CS bug
Reword the inline comment in ci/tools/setup-sanitizer and the docstrings in
helpers/child_processes.py and memory_ipc/conftest.py to make clear that the
underlying problem is insufficient guards in the IPC tests when child
processes spawn slowly (>30s under compute-sanitizer). The sanitizer change
and the new helpers are mitigations / safety nets; the durable fix is making
the tests handle slow children correctly.
* tests: add kill_subprocesses helper and apply to IPC tests
Make every IPC test responsible for its own child-process cleanup, rather
than leaning on the fixture-level tracker as the primary mechanism.
helpers.child_processes.kill_subprocesses(*processes) returns the list of
processes that were still alive when called and kills them. Tests pair this
with an "assert not survivors" check, so a timeout produces a clean failure
message ("timed out waiting on: ['Process-3']") instead of "assert None ==
0", and the held IPC handles are released before any further test code
runs.
Every join+exitcode pattern in tests/memory_ipc/ is converted to the new
shape:
process.join(timeout=CHILD_TIMEOUT_SEC)
survivors = kill_subprocesses(process)
assert not survivors, "child did not exit within timeout"
assert process.exitcode == 0
For test_send_buffers.py:TestIpcReexport (the test from #2121), the
event.wait return value is also captured and asserted on so a timeout there
is reported explicitly rather than dropped on the floor.
The autouse track_child_processes() context manager in the ipc_device
fixture remains as defense in depth for any future test that forgets the
pattern.
memory_ipc/test_errors.py adds a meta-test test_outer_timeout_marker_is_applied
that verifies tests/memory_ipc/conftest.py is loaded and applies the
pytest.mark.timeout(300) marker. This catches the "nested conftest silently
not picked up" failure mode at test time with a clear error message.
* tests: add memory_ipc/__init__.py to avoid conftest module collision
Without __init__.py in tests/memory_ipc/, pytest imports
tests/memory_ipc/conftest.py under module name `conftest`, which collides
with the top-level tests/conftest.py. Tests under tests/ that use
`from conftest import X` then resolve to the wrong file at import time,
yielding ImportErrors like:
ImportError: cannot import name 'skipif_need_cuda_headers' from
'conftest' (.../tests/memory_ipc/conftest.py)
The fix mirrors tests/system/ (which already has both __init__.py and
conftest.py): adding tests/memory_ipc/__init__.py makes memory_ipc a
proper package, so its conftest is imported as memory_ipc.conftest and
the top-level conftest stays under name `conftest`.
* tests: scale memory_ipc outer timeout with child_timeout_sec
Replace the fixed 300 s pytest.mark.timeout with 3 * child_timeout_sec().
This is the minimum that lets the worst-case test in the suite (currently
TestIpcReexport, with three sequential CHILD_TIMEOUT_SEC waits in the
failure path) reach its own asserts before the outer guard fires.
Resulting budgets:
- Without compute-sanitizer: 90 s (down from 300 s).
- Under compute-sanitizer: 360 s (up from 300 s -- the previous value
would have killed TestIpcReexport mid-second-join before its
"process C did not signal completion within timeout" assertion could
produce a clean failure message).
The meta-test test_outer_timeout_marker_is_applied now computes the
expected timeout from the same formula so the two stay in sync.
* tests: tighten memory_ipc outer timeout to CHILD_TIMEOUT_SEC + 30
The previous 3 * CHILD_TIMEOUT_SEC scaling assumed worst-case wall-clock
where every sequential join/wait hits its full timeout. In practice the
children run concurrently, so expected wall-clock is ~CHILD_TIMEOUT_SEC
regardless of how many joins the test chains -- once a child is done its
join returns immediately. Exceeding CHILD_TIMEOUT_SEC + slack already
means something is genuinely stuck, in which case the outer guard firing
is the right outcome; the autouse track_child_processes() context
manager still cleans up survivors, and the per-test diagnostic message
would not be more informative than "test exceeded its budget".
New budgets:
- Without compute-sanitizer: 60 s (was 90 s).
- Under compute-sanitizer: 150 s (was 360 s).
The meta-test computes its expected value from the same formula.
1 parent caf05e6 commit ae34e4c
15 files changed
Lines changed: 254 additions & 19 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
3 | | - | |
| 3 | + | |
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
| |||
12 | 12 | | |
13 | 13 | | |
14 | 14 | | |
15 | | - | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
16 | 24 | | |
17 | 25 | | |
18 | 26 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
58 | 58 | | |
59 | 59 | | |
60 | 60 | | |
61 | | - | |
| 61 | + | |
62 | 62 | | |
63 | 63 | | |
64 | 64 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
213 | 213 | | |
214 | 214 | | |
215 | 215 | | |
216 | | - | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
217 | 225 | | |
218 | 226 | | |
219 | 227 | | |
220 | 228 | | |
221 | 229 | | |
222 | 230 | | |
223 | | - | |
| 231 | + | |
| 232 | + | |
| 233 | + | |
224 | 234 | | |
225 | 235 | | |
226 | 236 | | |
| |||
291 | 301 | | |
292 | 302 | | |
293 | 303 | | |
294 | | - | |
295 | | - | |
| 304 | + | |
| 305 | + | |
| 306 | + | |
| 307 | + | |
| 308 | + | |
| 309 | + | |
| 310 | + | |
| 311 | + | |
| 312 | + | |
| 313 | + | |
296 | 314 | | |
297 | 315 | | |
298 | 316 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
6 | 6 | | |
7 | 7 | | |
8 | 8 | | |
| 9 | + | |
9 | 10 | | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
13 | | - | |
| 14 | + | |
14 | 15 | | |
15 | 16 | | |
16 | 17 | | |
17 | 18 | | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
18 | 34 | | |
19 | 35 | | |
20 | 36 | | |
| |||
43 | 59 | | |
44 | 60 | | |
45 | 61 | | |
| 62 | + | |
| 63 | + | |
46 | 64 | | |
47 | 65 | | |
48 | 66 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
5 | 5 | | |
6 | 6 | | |
7 | 7 | | |
| 8 | + | |
8 | 9 | | |
9 | 10 | | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
13 | 14 | | |
14 | | - | |
| 15 | + | |
15 | 16 | | |
16 | 17 | | |
17 | 18 | | |
| |||
67 | 68 | | |
68 | 69 | | |
69 | 70 | | |
| 71 | + | |
| 72 | + | |
70 | 73 | | |
71 | 74 | | |
72 | 75 | | |
| |||
162 | 165 | | |
163 | 166 | | |
164 | 167 | | |
| 168 | + | |
| 169 | + | |
165 | 170 | | |
166 | 171 | | |
167 | 172 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
13 | 13 | | |
14 | 14 | | |
15 | 15 | | |
| 16 | + | |
16 | 17 | | |
17 | 18 | | |
18 | 19 | | |
19 | 20 | | |
20 | | - | |
| 21 | + | |
21 | 22 | | |
22 | 23 | | |
23 | 24 | | |
| |||
84 | 85 | | |
85 | 86 | | |
86 | 87 | | |
| 88 | + | |
87 | 89 | | |
| 90 | + | |
88 | 91 | | |
89 | 92 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
13 | 13 | | |
14 | 14 | | |
15 | 15 | | |
| 16 | + | |
16 | 17 | | |
17 | | - | |
| 18 | + | |
18 | 19 | | |
19 | 20 | | |
20 | 21 | | |
| |||
38 | 39 | | |
39 | 40 | | |
40 | 41 | | |
| 42 | + | |
| 43 | + | |
41 | 44 | | |
42 | 45 | | |
43 | 46 | | |
| |||
54 | 57 | | |
55 | 58 | | |
56 | 59 | | |
| 60 | + | |
| 61 | + | |
57 | 62 | | |
58 | 63 | | |
59 | 64 | | |
| |||
137 | 142 | | |
138 | 143 | | |
139 | 144 | | |
| 145 | + | |
| 146 | + | |
140 | 147 | | |
141 | 148 | | |
0 commit comments