[Perf] Streams 1-4 #410
Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Routes kernel launcher memory ops through the active stream.
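The handle semantics described above can be modeled in a few lines of stdlib Python. This is an illustrative sketch, not the real implementation: the counter stands in for a CUDA driver stream handle, and any non-CUDA backend hands back the documented no-op handle 0.

```python
# Hypothetical model of the backend-dependent handle behavior:
# CUDA returns a real (non-zero) handle, everything else returns 0.
_next_handle = 1

def create_stream(backend: str) -> int:
    """Return a stream handle; non-CUDA backends get the no-op handle 0."""
    global _next_handle
    if backend != "cuda":
        return 0
    handle = _next_handle
    _next_handle += 1
    return handle
```

Downstream code can then treat handle 0 as "use the default stream", which is why the later `stream_synchronize` guard on `handle==0` matters.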
Mirrors the CUDA stream implementation for HIP: adds a stream_ member to AMDGPUContext, adds stream_destroy/stream_wait_event/malloc_async/mem_free_async to the HIP driver functions, and adds AMDGPU branches in all Program stream/event methods. Converts the AMDGPU kernel launcher to use async memory operations through the active stream. The CPU backend returns 0 handles (no-op).
Introduces stream_parallel() for running top-level for-loop blocks on separate GPU streams. The AST transformer maps 'with qd.stream_parallel()' blocks to stream-parallel group IDs, which propagate through IR lowering and offloading to the CUDA/AMDGPU kernel launchers. Each unique group ID gets its own stream at launch time. Includes validation that all top-level kernel statements must be stream_parallel blocks (no mixing), and offline cache key support.
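The "no mixing" validation rule can be modeled with Python's own `ast` module. This is a hedged sketch of the idea, not the actual transformer code (function name and structure are assumptions): every top-level statement in the kernel body must be a `with qd.stream_parallel():` block or none may be, and each block gets a unique group ID.

```python
import ast

def validate_stream_parallel(src: str) -> list:
    """Model of the top-level rule: either every statement is a
    `with qd.stream_parallel():` block, or none is. Returns one
    group ID (1, 2, ...) per block."""
    body = ast.parse(src).body  # treat src as the kernel body

    def is_sp(stmt):
        if not isinstance(stmt, ast.With) or len(stmt.items) != 1:
            return False
        ctx = stmt.items[0].context_expr
        return (isinstance(ctx, ast.Call)
                and isinstance(ctx.func, ast.Attribute)
                and ctx.func.attr == "stream_parallel")

    flags = [is_sp(s) for s in body]
    if any(flags) and not all(flags):
        raise RuntimeError("top-level statements cannot mix "
                           "stream_parallel blocks with other statements")
    return list(range(1, sum(flags) + 1))
```

Each returned group ID would then propagate through lowering and offloading so that the launcher can map group IDs to distinct streams.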
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
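The weakref, context-manager, and handle-guard fixes fit a common pattern, sketched below with assumed names (`Stream`, `runtime.stream_destroy`): `__del__` resolves a weak reference to the owning runtime, so at interpreter shutdown, when the runtime may already be gone, destruction degrades to a no-op instead of touching a dead object.

```python
import weakref

class Stream:
    """Sketch of the weakref-in-__del__ pattern described above."""
    def __init__(self, runtime, handle):
        self._runtime = weakref.ref(runtime)  # weak: no shutdown-order hazard
        self.handle = handle

    def destroy(self):
        runtime = self._runtime()        # None once the runtime is torn down
        if runtime is not None and self.handle != 0:  # handle==0 guard
            runtime.stream_destroy(self.handle)
        self.handle = 0                  # make repeated destroy a no-op

    def __del__(self):
        self.destroy()

    # context-manager support, mirroring the change above
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.destroy()
        return False
```

Because `destroy()` zeroes the handle, the later `__del__` call after a `with` block exits is harmless.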
…quadrantsic-2-amdgpu-cpu
Batch the device_result_buffer free into the stream pipeline before the sync barrier, matching the CUDA kernel launcher's ordering for consistency and marginal performance improvement.
Use memcpy_host_to_device_async for external array transfers so they are properly ordered on the active stream, matching the CUDA launcher.
Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in CI under contention, and print actual timings for diagnostics.
…ead_local Mirror the CUDA fixes: guard stream_synchronize against handle==0 to avoid unintentional default stream sync, and make AMDGPUContext::stream_ thread_local for thread-safety.
…adrantsic-3-stream-parallel # Conflicts: # python/quadrants/lang/stream.py
Prevents stale group IDs from leaking if insert_for is called after a path that set a non-zero stream_parallel_group_id, matching the reset pattern of all other ForLoopConfig fields.
Add an error check in begin_stream_parallel() to prevent nesting, which would produce undefined group ID semantics.
…context safety Add comments explaining that streams are created/destroyed per launch (stream pooling as future optimization), and that RuntimeContext sharing across concurrent streams is safe because kernels only read from it.
This reverts commit 60d015b.
…adrantsic-3-stream-parallel
…adrantsic-3-stream-parallel
Replace per-launch stream_create/stream_destroy with acquire_stream/ release_stream on CUDAContext and AMDGPUContext. Streams are cached in a pool and reused across invocations, avoiding the driver-level overhead of stream creation (~5-50us) on every kernel launch in hot loops.
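The acquire/release scheme can be sketched as a generic handle pool. This stdlib Python model stands in for the C++ pool on CUDAContext/AMDGPUContext, with `create`/`destroy` standing in for the driver calls (all names are assumptions); `shutdown` shows where pooled streams would be destroyed at context teardown.

```python
import threading

class StreamPool:
    """Model of the acquire/release scheme: released handles are parked
    in a free list and reused, so steady-state launches skip stream
    creation entirely."""
    def __init__(self, create, destroy):
        self._create, self._destroy = create, destroy
        self._free = []
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self._free:
                return self._free.pop()   # reuse a pooled stream
        return self._create()             # pool empty: ask the driver

    def release(self, handle):
        with self._lock:
            self._free.append(handle)     # keep it for the next launch

    def shutdown(self):
        # Destroy still-pooled streams when the context shuts down.
        with self._lock:
            for h in self._free:
                self._destroy(h)
            self._free.clear()
```

The lock keeps acquire/release safe if multiple host threads launch kernels; a real implementation would also destroy handles in `shutdown` before the driver context is released.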
…ms-quadrantsic-4-stream-pool
Calls a stream_parallel kernel 5 times in a loop to verify that pooled streams are correctly reused with correct results each iteration.
Review from Opus 4.6: What it does: Introduces a stream pool in both CUDAContext and AMDGPUContext so that GPU streams are reused across kernel launches rather than created and destroyed every time.
Concerns / suggestions:
Overall, it's a solid, focused change. The main actionable item is making sure pooled streams get destroyed when the context shuts down.
Response from the original Opus that wrote this PR:
I think I might get some opinions from a fresh Opus.
The reviewer Opus said the following when I asked "what are your thoughts on concern 1?": Honestly, on reflection it's probably overstated. The number of streams acquired per launch is determined by the number of distinct stream_parallel groups in a kernel — that's …
So both Opuses concur that the only remaining concern is the destructor, concern 2.
PR change report

| File | LoC | Added | Removed |
|---|---|---|---|
| tests/python/test_streams.py | 413 | +413 | |
| quadrants/program/program_stream.cpp | 151 | +151 | |
| python/quadrants/lang/stream.py | 111 | +111 | |
| quadrants/runtime/cuda/kernel_launcher.cpp | 322 | +83 | -27 |
| quadrants/runtime/amdgpu/kernel_launcher.cpp | 306 | +70 | -18 |
| python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 435 | +68 | -2 |
| python/quadrants/lang/ast/ast_transformer.py | 1272 | +28 | -3 |
| python/quadrants/lang/kernel.py | 562 | +24 | -3 |
| quadrants/rhi/amdgpu/amdgpu_context.h | 111 | +24 | |
| python/quadrants/lang/ast/symbol_resolver.py | 48 | +23 | |
| quadrants/program/program_stream.h | 21 | +21 | |
| quadrants/ir/statements.h | 1307 | +20 | -3 |
| quadrants/rhi/cuda/cuda_context.h | 115 | +18 | -1 |
| quadrants/python/export_stream.cpp | 17 | +17 | |
| quadrants/ir/frontend_ir.h | 806 | +12 | |
| quadrants/codegen/llvm/llvm_compiled_data.h | 96 | +9 | -3 |
| quadrants/rhi/amdgpu/amdgpu_context.cpp | 166 | +7 | -2 |
| quadrants/rhi/amdgpu/amdgpu_profiler.cpp | 181 | +6 | -4 |
| quadrants/python/export_lang.cpp | 1048 | +6 | -3 |
| quadrants/rhi/cuda/cuda_context.cpp | 128 | +6 | -1 |
| quadrants/ir/frontend_ir.cpp | 1405 | +6 | |
| quadrants/rhi/amdgpu/amdgpu_device.cpp | 138 | +5 | -3 |
| quadrants/program/program.h | 222 | +5 | |
| tests/python/test_api.py | 460 | +5 | |
| tests/python/test_cache.py | 210 | +4 | -4 |
| quadrants/runtime/llvm/llvm_runtime_executor.cpp | 600 | +4 | -2 |
| quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h | 58 | +4 | -1 |
| quadrants/python/export.h | 24 | +4 | |
| quadrants/ir/statements.cpp | 392 | +3 | |
| quadrants/rhi/cuda/cuda_driver_functions.inc.h | 69 | +3 | |
| quadrants/transforms/lower_ast.cpp | 423 | +3 | |
| python/quadrants/lang/__init__.py | 51 | +2 | |
| quadrants/transforms/offload.cpp | 602 | +2 | |
| tests/python/test_perf_dispatch.py | 418 | +1 | -1 |
| quadrants/analysis/gen_offline_cache_key.cpp | 562 | +1 | |
| quadrants/codegen/amdgpu/codegen_amdgpu.cpp | 429 | +1 | |
| quadrants/codegen/cuda/codegen_cuda.cpp | 628 | +1 | |
| quadrants/program/program.cpp | 403 | +1 | |
| python/quadrants/lang/runtime_ops.py | 4 | | |
Total: 39 file(s) changed, +1172 -81 code lines.
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 python/quadrants/lang/__init__.py | 100% | |
| 🔴 python/quadrants/lang/ast/ast_transformer.py | 71% | 1534,1537,1539,1541,1543 |
| 🔴 python/quadrants/lang/ast/ast_transformers/function_def_transformer.py | 72% | 471,474,478-480,483-487,510,516-518 |
| 🔴 python/quadrants/lang/ast/symbol_resolver.py | 9% | 66-67,69-70,72-79,81-84,86-89 |
| 🔴 python/quadrants/lang/kernel.py | 67% | 573,581,586,599,668,673 |
| 🔴 python/quadrants/lang/stream.py | 63% | 30,36,45-51,59-62,66-72,100,105,128-134,142-145,149-155,185 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 88% | 14-18,23-27,253,309-310,312-314,316-320,325-326,328-330,332-335,340-341,343-346,348-351,465,489-493 |
Diff coverage: 79% · Overall: 79% · 624 lines, 132 missing
…c-4-stream-pool # Conflicts: # quadrants/codegen/llvm/llvm_compiled_data.h # quadrants/program/program.h
Rename fill_a/fill_b to some_func1/some_func2 in explicit stream examples. Remove redundant synchronize() from context manager example since destroy() already waits for in-flight work.
context_synchronize/device_synchronize wait on all streams, which contradicts the documented behavior that qd.sync() only waits on the default stream. stream_parallel already synchronizes its pooled streams before returning, so a global barrier is unnecessary.
Address review comment: make explicit that qd_stream is a special keyword argument handled by the @qd.kernel decorator, not something the user declares in the kernel signature.
ok to merge
Cover all 5 error paths in ASTTransformer.build_With: multiple context managers, with-as syntax, non-call expression, non-stream_parallel context manager, and stream_parallel inside @qd.func.

