[Perf] Streams 1-4 by hughperkins · Pull Request #410 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-03-12T01:35:25Z

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Routes kernel launcher memory ops through the active stream.

Mirrors the CUDA stream implementation for HIP: adds stream_ member to AMDGPUContext, stream_destroy/stream_wait_event/malloc_async/ mem_free_async to HIP driver functions, and AMDGPU branches in all Program stream/event methods. Converts AMDGPU kernel launcher to use async memory operations through the active stream. CPU backend returns 0 handles (no-op).

Introduces stream_parallel() for running top-level for-loop blocks on separate GPU streams. The AST transformer maps 'with qd.stream_parallel()' blocks to stream-parallel group IDs, which propagate through IR lowering and offloading to the CUDA/AMDGPU kernel launchers. Each unique group ID gets its own stream at launch time. Includes validation that all top-level kernel statements must be stream_parallel blocks (no mixing), and offline cache key support.

- Make CUDAContext::stream_ thread_local for thread-safety - Convert sync memcpy_host_to_device to async on active_stream - Use weakref in Stream/Event __del__ to safely handle interpreter shutdown - Add __enter__/__exit__ context manager support for Stream and Event - Use consistent qd_stream parameter naming in Event.record and Event.wait - Add handle==0 guard to stream_synchronize

…quadrantsic-2-amdgpu-cpu

Batch the device_result_buffer free into the stream pipeline before the sync barrier, matching the CUDA kernel launcher's ordering for consistency and marginal performance improvement.

Use memcpy_host_to_device_async for external array transfers so they are properly ordered on the active stream, matching the CUDA launcher.

Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in CI under contention, and print actual timings for diagnostics.

…ead_local Mirror the CUDA fixes: guard stream_synchronize against handle==0 to avoid unintentional default stream sync, and make AMDGPUContext::stream_ thread_local for thread-safety.

…adrantsic-3-stream-parallel # Conflicts: # python/quadrants/lang/stream.py

Prevents stale group IDs from leaking if insert_for is called after a path that set a non-zero stream_parallel_group_id, matching the reset pattern of all other ForLoopConfig fields.

Add an error check in begin_stream_parallel() to prevent nesting, which would produce undefined group ID semantics.

…context safety Add comments explaining that streams are created/destroyed per launch (stream pooling as future optimization), and that RuntimeContext sharing across concurrent streams is safe because kernels only read from it.

This reverts commit 60d015b.

…adrantsic-3-stream-parallel

Replace per-launch stream_create/stream_destroy with acquire_stream/ release_stream on CUDAContext and AMDGPUContext. Streams are cached in a pool and reused across invocations, avoiding the driver-level overhead of stream creation (~5-50us) on every kernel launch in hot loops.

…ms-quadrantsic-4-stream-pool

Calls a stream_parallel kernel 5 times in a loop to verify that pooled streams are correctly reused with correct results each iteration.

hughperkins · 2026-03-12T01:41:08Z

Review from Opus 4.6:

What it does: Introduces a stream pool in both CUDAContext and AMDGPUContext so that GPU streams are reused across kernel launches rather than created and destroyed every time.
The kernel launchers now call acquire_stream() / release_stream() instead of stream_create() / stream_destroy().
What's good:

Clear motivation. The previous branch even had a comment saying "a stream pool could reduce overhead for hot loops" — this PR implements exactly that and removes the TODO
comment. Nice follow-through.
Symmetry. The CUDA and AMDGPU implementations are perfectly mirrored, which is easy to review and maintain.
Thread safety. Both acquire_stream and release_stream are guarded by lock_, which is correct since the context is a singleton shared across threads.
Test coverage. The test runs 5 iterations of a kernel with two stream_parallel blocks, which exercises the pool reuse path.

Concerns / suggestions:

Unbounded pool growth. Streams are never destroyed — release_stream just pushes back into the vector forever. If a spike causes many concurrent streams to be allocated,
they'll remain in the pool indefinitely. Consider adding a high-water mark or trimming idle streams, or at minimum destroying them in the context destructor.
No cleanup in destructor. The stream_pool_ vector will leak GPU streams when the context is destroyed. You should iterate and destroy all pooled streams in the destructor.
Lock contention. Every acquire_stream / release_stream takes the global lock_ mutex. In the hot loop case this PR is optimizing for, the acquire and release happen close
together per launch. This is probably fine for now, but if you ever acquire/release streams from many threads concurrently, a lock-free pool or thread-local free-lists would
be better. Not needed now, just worth noting.
std::vector as a stack is fine but pop_back() + push_back() means LIFO reuse, which is actually good — recently used streams are more likely to be warm in driver caches.
Test could be stronger. The test validates correctness but doesn't actually verify that streams are being reused (e.g., that no new streams are created on the 2nd+
iteration). A unit test that directly calls acquire_stream / release_stream and checks pool size would give more confidence the pooling logic itself works. Though I
understand that may be harder to do from Python.

Overall, it's a solid, focused change. The main actionable item is making sure pooled streams get destroyed when the context shuts down.

hughperkins · 2026-03-12T01:47:53Z

Response from the original Opus that wrote this PR:

Unbounded pool growth -- In practice this is bounded by the number of unique stream_parallel group IDs per kernel, which is typically 2-4. The pool won't grow beyond that
peak. Not a real concern.
No cleanup in destructor -- Technically valid, but both CUDAContext and AMDGPUContext are singletons allocated with new and intentionally never deleted (the CUDA destructor
is even commented out). The OS reclaims GPU resources on process exit. That said, adding cleanup is trivially cheap and good hygiene -- I could add it.
Lock contention -- Reviewer already acknowledges it's fine for now. Agreed.
LIFO reuse -- Just an observation, and a positive one.
Stronger test -- Fair point but acquire_stream/release_stream are C++ methods not exposed via pybind, so testing pool internals from Python isn't practical. The correctness
test (5 iterations with correct results) is the right level of testing from this side.

I think I might get some opinions from a fresh Opus.

hughperkins · 2026-03-12T01:49:35Z

The reviewer Opus said the following when I asked "what are your thougths on concern 1?":

Honestly, on reflection it's probably overstated. The number of streams acquired per launch is determined by the number of distinct stream_parallel groups in a kernel — that's
a compile-time property, typically small (single digits). The pool will quickly stabilize at the high-water mark and stay there. There's no mechanism that would cause a
"spike" of stream allocation that then subsides, since the acquire/release pattern is synchronous within each launch (acquire all, launch, sync, release all).
GPU streams are also lightweight when idle — they're essentially just driver-side command queues, not large resource allocations.
So the "unbounded growth" framing is misleading. The real actionable concern is just concern 2 — destructor cleanup. You should still destroy the pooled streams when the
context shuts down to avoid leaking driver handles. But worrying about pool size management or high-water trimming is over-engineering for this use case.

hughperkins · 2026-03-12T01:50:11Z

So both opuses concur that the only concern is the destructor, concern 2.

github-actions · 2026-05-05T10:00:39Z

PR change report (`8a7cdd795`)

Code lines (excluding blank lines, comment-only lines, and Python multi-line strings).

File	LoC	Added	Removed
`tests/python/test_streams.py`	413	+413
`quadrants/program/program_stream.cpp`	151	+151
`python/quadrants/lang/stream.py`	111	+111
`quadrants/runtime/cuda/kernel_launcher.cpp`	322	+83	-27
`quadrants/runtime/amdgpu/kernel_launcher.cpp`	306	+70	-18
`python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	435	+68	-2
`python/quadrants/lang/ast/ast_transformer.py`	1272	+28	-3
`python/quadrants/lang/kernel.py`	562	+24	-3
`quadrants/rhi/amdgpu/amdgpu_context.h`	111	+24
`python/quadrants/lang/ast/symbol_resolver.py`	48	+23
`quadrants/program/program_stream.h`	21	+21
`quadrants/ir/statements.h`	1307	+20	-3
`quadrants/rhi/cuda/cuda_context.h`	115	+18	-1
`quadrants/python/export_stream.cpp`	17	+17
`quadrants/ir/frontend_ir.h`	806	+12
`quadrants/codegen/llvm/llvm_compiled_data.h`	96	+9	-3
`quadrants/rhi/amdgpu/amdgpu_context.cpp`	166	+7	-2
`quadrants/rhi/amdgpu/amdgpu_profiler.cpp`	181	+6	-4
`quadrants/python/export_lang.cpp`	1048	+6	-3
`quadrants/rhi/cuda/cuda_context.cpp`	128	+6	-1
`quadrants/ir/frontend_ir.cpp`	1405	+6
`quadrants/rhi/amdgpu/amdgpu_device.cpp`	138	+5	-3
`quadrants/program/program.h`	222	+5
`tests/python/test_api.py`	460	+5
`tests/python/test_cache.py`	210	+4	-4
`quadrants/runtime/llvm/llvm_runtime_executor.cpp`	600	+4	-2
`quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h`	58	+4	-1
`quadrants/python/export.h`	24	+4
`quadrants/ir/statements.cpp`	392	+3
`quadrants/rhi/cuda/cuda_driver_functions.inc.h`	69	+3
`quadrants/transforms/lower_ast.cpp`	423	+3
`python/quadrants/lang/__init__.py`	51	+2
`quadrants/transforms/offload.cpp`	602	+2
`tests/python/test_perf_dispatch.py`	418	+1	-1
`quadrants/analysis/gen_offline_cache_key.cpp`	562	+1
`quadrants/codegen/amdgpu/codegen_amdgpu.cpp`	429	+1
`quadrants/codegen/cuda/codegen_cuda.cpp`	628	+1
`quadrants/program/program.cpp`	403	+1
`python/quadrants/lang/runtime_ops.py`	4

Total: 39 file(s) changed, +1172 -81 code lines.

Full per-function report

github-actions · 2026-05-05T10:27:11Z

Coverage Report (`8a7cdd795`)

File	Coverage	Missing
🟢 `python/quadrants/lang/__init__.py`	100%
🔴 `python/quadrants/lang/ast/ast_transformer.py`	71%	1534,1537,1539,1541,1543
🔴 `python/quadrants/lang/ast/ast_transformers/function_def_transformer.py`	72%	471,474,478-480,483-487,510,516-518
🔴 `python/quadrants/lang/ast/symbol_resolver.py`	9%	66-67,69-70,72-79,81-84,86-89
🔴 `python/quadrants/lang/kernel.py`	67%	573,581,586,599,668,673
🔴 `python/quadrants/lang/stream.py`	63%	30,36,45-51,59-62,66-72,100,105,128-134,142-145,149-155,185
🟢 `tests/python/test_cache.py`	100%
🟢 `tests/python/test_perf_dispatch.py`	100%
🟢 `tests/python/test_streams.py`	88%	14-18,23-27,253,309-310,312-314,316-320,325-326,328-330,332-335,340-341,343-346,348-351,465,489-493

Diff coverage: 79% · Overall: 79% · 624 lines, 132 missing

Full annotated report

hughperkins · 2026-05-07T12:54:02Z

Genesis unit test results:

…c-4-stream-pool # Conflicts: # quadrants/codegen/llvm/llvm_compiled_data.h # quadrants/program/program.h

hughperkins · 2026-05-07T13:01:23Z

Genesis benchmark results:

github-actions · 2026-05-07T13:30:10Z

Total: 39 file(s) changed, +1180 -81 code lines.

Rename fill_a/fill_b to some_func1/some_func2 in explicit stream examples. Remove redundant synchronize() from context manager example since destroy() already waits for in-flight work.

github-actions · 2026-05-07T14:30:34Z

Diff coverage: 79% · 624 lines, 132 missing

context_synchronize/device_synchronize waits on all streams, which contradicts the documented behavior that qd.sync() only waits on the default stream. stream_parallel already synchronizes its pooled streams before returning, so a global barrier is unnecessary.

Address review comment: make explicit that qd_stream is a special keyword argument handled by the @qd.kernel decorator, not something the user declares in the kernel signature.

duburcqa · 2026-05-07T15:06:31Z

ok to merge

github-actions · 2026-05-07T15:44:39Z

Total: 38 file(s) changed, +1176 -79 code lines.

Cover all 5 error paths in ASTTransformer.build_With: multiple context managers, with-as syntax, non-call expression, non-stream_parallel context manager, and stream_parallel inside @qd.func.

github-actions · 2026-05-07T16:15:55Z

Diff coverage: 79% · 624 lines, 132 missing

github-actions · 2026-05-07T16:48:47Z

Total: 38 file(s) changed, +1247 -79 code lines.

github-actions · 2026-05-07T17:33:33Z

Diff coverage: 80% · 695 lines, 137 missing

hughperkins added 18 commits March 11, 2026 16:40

Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…

b133bd7

…quadrantsic-2-amdgpu-cpu

Move AMDGPU mem_free_async before transfers sync to match CUDA ordering

7555ec5

Batch the device_result_buffer free into the stream pipeline before the sync barrier, matching the CUDA kernel launcher's ordering for consistency and marginal performance improvement.

Convert AMDGPU sync memcpy_host_to_device to async on active_stream

c12d23e

Use memcpy_host_to_device_async for external array transfers so they are properly ordered on the active stream, matching the CUDA launcher.

Document ROCm >= 5.4 requirement for hipMallocAsync/hipFreeAsync

1673a38

Relax concurrency test threshold and log timings

60d015b

Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in CI under contention, and print actual timings for diagnostics.

Add handle==0 guard to AMDGPU stream_synchronize and make stream_ thr…

c4be4ff

…ead_local Mirror the CUDA fixes: guard stream_synchronize against handle==0 to avoid unintentional default stream sync, and make AMDGPUContext::stream_ thread_local for thread-safety.

Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…

aa2fa2a

…adrantsic-3-stream-parallel # Conflicts: # python/quadrants/lang/stream.py

Clear stream_parallel_group_id in ForLoopDecoratorRecorder::reset()

be7ad92

Prevents stale group IDs from leaking if insert_for is called after a path that set a non-zero stream_parallel_group_id, matching the reset pattern of all other ForLoopConfig fields.

Reject nested stream_parallel blocks

ce83281

Add an error check in begin_stream_parallel() to prevent nesting, which would produce undefined group ID semantics.

Revert "Relax concurrency test threshold and log timings"

b28e7c6

This reverts commit 60d015b.

Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…

065a3b7

…adrantsic-3-stream-parallel

Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…

0ba8dac

…adrantsic-3-stream-parallel

hughperkins changed the base branch from main to hp/streams-quadrantsic-3-stream-parallel March 12, 2026 01:35

hughperkins added 2 commits March 11, 2026 21:35

Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…

47fa207

…ms-quadrantsic-4-stream-pool

Add test for stream pool reuse across repeated kernel launches

65a7967

Calls a stream_parallel kernel 5 times in a loop to verify that pooled streams are correctly reused with correct results each iteration.

hughperkins mentioned this pull request Mar 12, 2026

[Perf] Streams 3: Add qd.stream_parallel() context manager #409

Closed

Destroy pooled streams in CUDAContext and AMDGPUContext destructors

5393d04

hughperkins marked this pull request as ready for review March 12, 2026 01:51

hughperkins assigned duburcqa and erizmr Mar 12, 2026

Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…

0961a00

…c-4-stream-pool # Conflicts: # quadrants/codegen/llvm/llvm_compiled_data.h # quadrants/program/program.h