[Perf] Streams 1-4 #410
Merged
208 commits
ab15b1b
Add CUDA stream and event API for concurrent kernel execution
hughperkins 7bd18ca
Add AMDGPU/HIP stream support and async memory operations
hughperkins a40ed4c
Add qd.stream_parallel() context manager for implicit stream parallelism
hughperkins b856b33
Address review feedback for CUDA streams PR
hughperkins b133bd7
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins 7555ec5
Move AMDGPU mem_free_async before transfers sync to match CUDA ordering
hughperkins c12d23e
Convert AMDGPU sync memcpy_host_to_device to async on active_stream
hughperkins 1673a38
Document ROCm >= 5.4 requirement for hipMallocAsync/hipFreeAsync
hughperkins 60d015b
Relax concurrency test threshold and log timings
hughperkins c4be4ff
Add handle==0 guard to AMDGPU stream_synchronize and make stream_ thr…
hughperkins aa2fa2a
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins be7ad92
Clear stream_parallel_group_id in ForLoopDecoratorRecorder::reset()
hughperkins ce83281
Reject nested stream_parallel blocks
hughperkins 880abc7
Document stream_parallel launcher design: per-launch streams, shared …
hughperkins b28e7c6
Revert "Relax concurrency test threshold and log timings"
hughperkins 065a3b7
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins 0ba8dac
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins e9f98c6
Add stream pool to reuse GPU streams across kernel launches
hughperkins 47fa207
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins 65a7967
Add test for stream pool reuse across repeated kernel launches
hughperkins 5393d04
Destroy pooled streams in CUDAContext and AMDGPUContext destructors
hughperkins a3c682b
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins 9be110d
Apply clang-format
hughperkins 3970abc
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 31fffbf
Apply clang-format
hughperkins cfc6f39
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins e9ce144
Apply clang-format
hughperkins c925446
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 14c3c22
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins 1056bb4
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins d3cae3c
[Test] Exclude flaky test_perf_dispatch_python from Vulkan
hughperkins 007b050
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins 798f87a
Exclude flaky test_perf_dispatch_python from Metal and Vulkan
hughperkins 22c5524
Merge origin/hp/streams-quadrantsic-1-cuda-streams, resolve conflict …
hughperkins cd5b486
[Doc] Add user guide for streams API
hughperkins f42d4eb
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins 2238969
[Doc] Update streams doc with AMDGPU support
hughperkins 91ca883
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins 8cd793c
[Doc] Add stream_parallel() section to streams user guide
hughperkins 63f9616
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins 08b85d5
[Doc] Note stream pooling in streams user guide
hughperkins f036b46
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins 228150a
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins e880d07
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins 10f38d5
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins 59c2627
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins f2a2596
Reflow stream.py docstrings to 120c line width
hughperkins e368b4d
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins ad720bb
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins a571918
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins de99f3e
Unwrap prose lines in streams.md to match repo doc style
hughperkins 958c247
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins 6351215
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins b1f2673
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins d6876da
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins 401d6f8
Use CU_STREAM_NON_BLOCKING for user-created streams
hughperkins a3c98f8
Use async DtoH memcpy on active_stream for external array readback
hughperkins ca14f67
Guard destroy()/__exit__ against destroying externally-owned handles
hughperkins aff950d
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins b46de06
Fix clang-format indentation for memcpy_device_to_host_async
hughperkins 84715de
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 8efd51f
Address review comments: fix AMDGPU stream issues
hughperkins b9eef6e
Use async DtoH on active_stream for do-while loop counter readback
hughperkins f0dd7d6
Use active_stream for sizer device context staging
hughperkins 8b3d4ed
Add make_current() to stream/event Program methods
hughperkins 34e9fa6
Use HIP_STREAM_NON_BLOCKING for AMDGPU stream_create to mirror CUDA path
hughperkins 675542a
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 470912f
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins 3b0ba29
Restore deleted comments, fix docstring wrapping, fix per-task adstac…
hughperkins 1c62eae
Fix clang-format line break in AMDGPU kernel launcher
hughperkins fe779f6
Merge base branch and add dead-code comment to singleton destructors
hughperkins 162239e
Use active stream for AMDGPU adstack metadata copies in publish_adsta…
hughperkins e55c84f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins 216f7d5
Address Claude review: reject stream_parallel in @qd.func, use non-bl…
hughperkins 9334efd
Add make_current() to all AMDGPU stream/event Program methods
hughperkins 55318e8
Merge base branch: adopt non-blocking flag in pooled stream creation
hughperkins aa4a70f
Use async DtoH on active_stream for resolve_num_threads readback
hughperkins 49dc5af
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins c7eed44
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 1fba4f5
Use async DtoH on active_stream for AMDGPU resolve_num_threads readback
hughperkins d7836e3
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins 74604f2
Allow docstrings in stream_parallel kernels, merge base branch updates
hughperkins 5901a7f
Sync active_stream at end of launch_llvm_kernel unconditionally
hughperkins 0af8e19
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins f89bde0
Sync active_stream unconditionally at end of AMDGPU launch_llvm_kernel
hughperkins b83b65d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins ef3b95b
Use async DtoH on active_stream for sizer stride readback
hughperkins 0c552cd
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins fc5b710
Add missing #include <vector> to amdgpu_context.h for IWYU consistency
hughperkins 8550aa0
Fix end-of-launcher sync: conditional + dealloc race
hughperkins 6374cf3
Reject qd_stream inside autograd Tape context
hughperkins 64a389d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 7f0f299
Fix end-of-launcher sync: conditional + dealloc race on AMDGPU
hughperkins 212aeb9
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins ca8ace3
Fix linter formatting; guard graph+stream; sync has_print on stream
hughperkins 85b11d8
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 5e8d198
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 226c7c5
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins 1f471b3
Fix AMDGPU stream flag comment: HIP_STREAM_NON_BLOCKING not CU_STREAM…
hughperkins 4fc4d72
Merge base branch: pick up AMDGPU stream flag comment fix and linter …
hughperkins 84806cf
Fix NULL-stream DtoH races in synchronize() and allocate_llvm_runtime…
hughperkins 6919fee
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins ae9c913
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins b1c6eea
Sync active_stream before adstack sizer stride readback
hughperkins 05dcb4d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 7b4e2a4
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins 8229a29
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 88f1bf7
Add stream_parallel_group_id to QD_STMT_DEF_FIELDS for cache key corr…
hughperkins 8b94b3d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins ca560b6
Fix clang-format: multi-line QD_STMT_DEF_FIELDS for RangeForStmt and …
hughperkins 397f298
Fix clang-format: break long QD_STMT_DEF_FIELDS lines in statements.h
hughperkins d4ce00c
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins ae1c932
Reflow comments and docstring to 120-char line width
hughperkins 3ef0340
Use context/device synchronize in synchronize() to drain all streams
hughperkins 3a81a46
Use synchronous mem_free in dealloc_memory pool branch
hughperkins 3c6b24e
Add tests for stream/event context managers, event.synchronize, error…
hughperkins 02ac865
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 3499bbc
Thread active_stream through AMDGPU profiler event_record and sync
hughperkins 158c8fb
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins 9bb4467
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins c549e07
Fix graph+stream error guard and test
hughperkins 5d284ac
Update qd.sync() docstring and streams doc to reflect default-stream-…
hughperkins ce2fc6b
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 388a797
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins df0b03a
Fix stream_parallel identity check failing on dual-import-path builds
hughperkins ff8056d
Reflow sync() docstring to 120-char line width
hughperkins acff351
Remove unused ASTResolver import from ast_transformer.py
hughperkins 70eb471
Fix import sorting in ast_transformer.py
hughperkins 117a71f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins caa2515
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins ebd5e11
Add AST-level fallback for stream_parallel detection
hughperkins a6c3852
Add diagnostic info to stream_parallel exclusivity error message
hughperkins d1e6f09
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 03d2b29
Fix black formatting in function_def_transformer.py
hughperkins 04e18ba
Merge hp/streams-quadrantsic-2-amdgpu-cpu: resolve streams.md conflict
hughperkins fdcf9bd
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 3af5bc8
Apply black formatting to function_def_transformer.py
hughperkins bec8503
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 2844060
Fix black formatting in function_def_transformer.py (post-merge)
hughperkins 5903e49
Run black -l 120 on function_def_transformer.py (post-merge formatting)
hughperkins c9c75bd
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins 360adc8
Reject qd_stream on autodiff kernels
hughperkins e20fe99
Revert adstack sizer stream_synchronize
hughperkins e3c5f6f
Reset llvm_runtime_executor.cpp to upstream
hughperkins 8f71c91
Merge base branch: drop autodiff stream changes per new policy
hughperkins f6fee4f
Add test for qd_stream + autodiff kernel error guard
hughperkins b030e4c
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 6e49c52
Restore context_pointer free comment in AMDGPU kernel launcher
hughperkins de4d99d
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins 9fd8b7b
Extract stream/event methods from program.cpp into program_stream.cpp
hughperkins 176e7d3
Merge base branch: add AMDGPU support to extracted program_stream.cpp
hughperkins 9e6f865
Introduce StreamManager delegate class for stream/event ops
hughperkins c1562f2
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 1c81322
Fix clang-format in program_stream.h
hughperkins 84ba5b0
Fix clang-format in program_stream.h
hughperkins b1b4ee6
Remove Program wrapper methods, bind StreamManager directly via pybind
hughperkins 91fae3f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins d3317f5
Fix AMDGPU branches in StreamManager: use arch_ member instead of com…
hughperkins 7e10267
Reflow comment in program_stream.h to 120-char width
hughperkins 614c742
Use captured prog_ref for all Stream/Event operations
hughperkins 55b71fb
Merge hp/streams-quadrantsic-2-amdgpu-cpu: integrate adstack bound_ex…
hughperkins 33f2a04
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins dbb055c
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins 39657ca
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins 9053f44
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 6731407
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-4-stream-…
hughperkins b7eb63a
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 52a3be1
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' of github.com:Gene…
hughperkins 3dad35a
Fix stale handle safety in Stream/Event after qd.reset()
hughperkins 4cef21b
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins bebc904
Extract stream/event pybind bindings into export_stream.cpp
hughperkins 4711160
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins b4450f7
Fix clang-format in export_stream.cpp
hughperkins b6cd986
Fix clang-format line break in CUDA kernel launcher
hughperkins 3b09331
Fix clang-format in export_stream.cpp
hughperkins af4a306
Skip coverage probes in stream_parallel exclusivity check; restore de…
hughperkins c50d034
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins 93cd166
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins b1e7be6
Merge base branch: add coverage probe skip in stream_parallel validation
hughperkins 824cabf
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins fa5cbff
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins e8d9cf0
Allow synchronizing the default AMDGPU stream (handle 0)
hughperkins 48c3922
Fall back to current runtime for Stream/Event destroy after reset
hughperkins 8696fad
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins 736545f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-4-stream-…
hughperkins 3f5a868
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins 44ee707
Reflow _destroy_prog docstrings to 120-char width
hughperkins 392b19a
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins c6278ff
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins f67e7fd
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins 24bc67d
Merge hp/streams-quadrantsic-2-amdgpu-cpu: integrate adstack post-red…
hughperkins 8fee086
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins ac4b825
Guard stream-parallel cleanup with exception safety
hughperkins 65d5cb9
Restore explanatory comments removed during stream-parallel refactor
hughperkins 6f84bcd
Merge branch 'main' into hp/streams-quadrantsic-4-stream-pool
hughperkins fc44406
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins 8e329a5
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-4-stream-…
hughperkins b5554ca
Fix clang-format line length in kernel launchers
hughperkins 8a7cdd7
Use default stream for persistent buffer alloc/free
hughperkins 0961a00
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins 594bb8a
Update streams doc: rename fill_a/fill_b, remove redundant synchronize
hughperkins bfa9ff9
Remove incorrect claim about data corruption without stream management
hughperkins c38f53e
Remove PyTorch interop section from streams doc
hughperkins c8d7792
Move sync behavior notes out of Limitations into own section
hughperkins 8ef3a0b
Revert qd.sync() to default-stream-only synchronization
hughperkins cf09b26
Clarify that qd_stream is implicit in any @qd.kernel call
hughperkins 5f36533
Note that graph/autodiff + qd_stream raises RuntimeError
hughperkins b298d92
Add tests for build_With error branches
hughperkins

    @@ -58,6 +58,7 @@ tile16
    fastcache
    graph
    streams
    perf_dispatch
    init_options
    ```

@@ -0,0 +1,137 @@

# Streams

Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. With streams, you can run multiple top-level for loops in parallel.

## Supported platforms

| Backend | Supported |
|---------|-----------|
| CUDA | Yes |
| AMDGPU | Yes |
| CPU | No-op |
| Metal | No-op |
| Vulkan | No-op |

On backends without native stream support, stream operations are no-ops and for loops run serially. Code using streams is portable across all backends: it will run without modifications, but serially.

## Stream parallelism

Inside a `@qd.kernel`, each `with qd.stream_parallel():` block runs on its own GPU stream.

```python
import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))
c = qd.field(qd.f32, shape=(N,))

@qd.kernel
def compute_ab():
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def combine():
    for i in range(N):
        c[i] = a[i] + b[i]

compute_ab()  # the two stream_parallel blocks run concurrently
combine()     # runs after compute_ab() returns; a[] and b[] are ready
```

Consecutive `with qd.stream_parallel():` blocks run concurrently. Multiple for loops within a single block share a stream and run serially on it. All streams are synchronized before the kernel returns.
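
The scheduling rule described above can be modelled on the CPU with ordinary threads. The following is only an analogy, not how Quadrants implements it: each worker thread stands in for one GPU stream, and the names `block_a`, `block_b`, and the doubling loop are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

N = 8
a = [0.0] * N
b = [0.0] * N
c = [0.0] * N


def block_a():
    # Two loops in the same block share one "stream", so they run back to back.
    for i in range(N):
        a[i] = float(i)
    for i in range(N):
        a[i] *= 2.0


def block_b():
    for j in range(N):
        b[j] = float(j) + 1.0


def compute_ab():
    # Each block goes to its own worker, standing in for its own stream.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(block_a), pool.submit(block_b)]
        # Waiting on every future models "all streams are synchronized
        # before the kernel returns".
        for f in futures:
            f.result()


def combine():
    for i in range(N):
        c[i] = a[i] + b[i]


compute_ab()
combine()  # safe: both blocks have finished by the time compute_ab() returned
```

Because `compute_ab()` joins both workers before returning, `combine()` needs no extra synchronization, which mirrors the guarantee stated for `stream_parallel` kernels.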

### Restrictions

- All top-level statements in a kernel must be either all `stream_parallel` blocks or all regular statements. Mixing the two at the top level is a compile-time error.
- Nesting `stream_parallel` blocks is not supported.
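
To make the two restrictions concrete, here is a minimal sketch of a checker built on Python's standard `ast` module. It is not the Quadrants implementation: `is_stream_parallel_with` and `check_kernel_source` are hypothetical names, the match is purely name-based, and skipping a leading docstring is an assumption based on the commit "Allow docstrings in stream_parallel kernels".

```python
import ast


def is_stream_parallel_with(node):
    """True for `with x.stream_parallel():` or `with stream_parallel():` (name-based)."""
    if not isinstance(node, ast.With) or len(node.items) != 1:
        return False
    call = node.items[0].context_expr
    if not isinstance(call, ast.Call):
        return False
    func = call.func
    if isinstance(func, ast.Attribute):
        return func.attr == "stream_parallel"
    return isinstance(func, ast.Name) and func.id == "stream_parallel"


def check_kernel_source(source):
    """Raise SyntaxError if the first function in `source` breaks either restriction."""
    body = ast.parse(source).body[0].body
    # Skip a leading docstring so it does not count as a "regular statement".
    first = body[0] if body else None
    if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)):
        body = body[1:]
    flags = [is_stream_parallel_with(stmt) for stmt in body]
    if any(flags) and not all(flags):
        raise SyntaxError("cannot mix stream_parallel blocks with other top-level statements")
    for stmt in body:
        if not is_stream_parallel_with(stmt):
            continue
        for inner in ast.walk(stmt):
            if inner is not stmt and is_stream_parallel_with(inner):
                raise SyntaxError("nested stream_parallel blocks are not supported")


GOOD = '''
def k():
    """Docstring is allowed."""
    with qd.stream_parallel():
        for i in range(4):
            pass
    with qd.stream_parallel():
        for j in range(4):
            pass
'''
check_kernel_source(GOOD)  # passes silently
```

The source is only parsed, never executed, so `qd` does not need to be importable for the check to run.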

## Explicit streams

For cases that require manual control, such as launching separate kernels on different streams or interoperating with PyTorch, you can create and manage streams directly.

### Creating and using streams

Any `@qd.kernel` function accepts a special `qd_stream` keyword argument; you do not need to declare it in the kernel signature. The `@qd.kernel` decorator handles it automatically.

```python
@qd.kernel
def my_kernel():
    for i in range(N):
        a[i] = i

s1 = qd.create_stream()
s2 = qd.create_stream()

my_kernel(qd_stream=s1)
my_kernel(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()
```

Kernels on different streams may execute concurrently. Call `synchronize()` to block until all work on a stream completes.

### Events

Events let you express dependencies between streams without full synchronization.

```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1)          # record when s1 finishes produce()
e.wait(qd_stream=s2)  # s2 waits for that event before proceeding

consume(qd_stream=s2)  # safe to read a[]: produce() is guaranteed complete
s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()
```

`e.record(stream)` captures the point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.
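
The record/wait dependency can be illustrated with a CPU-side analogy using `threading.Event`. This is a sketch of the ordering guarantee only, not of GPU behavior: the two worker threads stand in for `s1` and `s2`, and `produced` stands in for the event `e`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N = 8
a = [0.0] * N
b = [0.0] * N

produced = threading.Event()  # stands in for the Quadrants event `e`


def produce():
    for i in range(N):
        a[i] = 10.0
    produced.set()  # the "recorded point" on the producer side has been reached


def consume():
    produced.wait()  # analogous to e.wait(qd_stream=s2)
    for i in range(N):
        b[i] = a[i]


with ThreadPoolExecutor(max_workers=2) as pool:
    # The consumer is submitted first on purpose: ordering comes from the
    # event, not from submission order.
    done_consume = pool.submit(consume)
    done_produce = pool.submit(produce)
    done_consume.result()
    done_produce.result()
```

Only the consumer blocks on the event; the producer never waits, which is the difference between an event dependency and a full synchronization of both workers.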

### Context managers

Streams and events support `with` blocks for automatic cleanup:

```python
with qd.create_stream() as s:
    some_func1(qd_stream=s)
# s.destroy() called automatically; waits for in-flight work
```
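
As a rough sketch of the cleanup behavior described here, the class below mimics a stream object that synchronizes and then destroys itself when the `with` block exits. `FakeStream` and its `owned` flag are hypothetical; the ownership guard is an assumption modelled on the commit "Guard destroy()/__exit__ against destroying externally-owned handles", and the real class may differ.

```python
class FakeStream:
    """Hypothetical stand-in for a stream object; records the calls made on it."""

    def __init__(self, owned=True):
        self.owned = owned
        self.destroyed = False
        self.calls = []

    def synchronize(self):
        self.calls.append("synchronize")

    def destroy(self):
        # Never destroy twice, and never destroy a handle we do not own.
        if self.destroyed or not self.owned:
            return
        self.synchronize()  # wait for in-flight work before releasing the handle
        self.calls.append("destroy")
        self.destroyed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False  # do not swallow exceptions raised inside the block


with FakeStream() as s:
    s.calls.append("launch")  # stands in for some_func1(qd_stream=s)
```

Returning `False` from `__exit__` means an exception raised inside the block still propagates, while the cleanup runs either way.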

## Synchronization notes

- **`qd.sync()` only waits on the default stream.** It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- **No automatic synchronization with explicit streams.** When using explicit streams, you are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input. `stream_parallel` handles this automatically.
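
Since `qd.sync()` does not cover explicit streams, a small helper that drains a list of them can be convenient. This is a sketch that assumes only the `synchronize()` method shown in the examples above; `synchronize_all` and `StubStream` are hypothetical names, and the stub exists purely so the snippet runs without a GPU.

```python
def synchronize_all(streams):
    """Block until every stream in `streams` has finished its queued work."""
    for stream in streams:
        stream.synchronize()


class StubStream:
    """Hypothetical stand-in that records whether synchronize() was called."""

    def __init__(self):
        self.synced = False

    def synchronize(self):
        self.synced = True


streams = [StubStream(), StubStream()]
synchronize_all(streams)
```

With real streams, the same call would take the objects returned by `qd.create_stream()`.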

## Limitations

- **Not compatible with graphs.** Do not pass `qd_stream` to a kernel decorated with `graph=True` (if you do, a `RuntimeError` will be raised).
- **Not compatible with autodiff.** Do not pass `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or inside a `qd.ad.Tape` context (if you do, a `RuntimeError` will be raised).
The name-based fallback here accepts any `with ...stream_parallel():` call even when it resolves to a non-Quadrants symbol, because it returns `True` purely on `attr == "stream_parallel"` / `id == "stream_parallel"`. This can silently reinterpret user-defined context managers as Quadrants stream-parallel blocks instead of raising a syntax error, changing kernel behavior and skipping user context-manager logic; the check should only pass when resolution proves the symbol is `quadrants.stream_parallel`.
Opus says:
Codex flags that lines 483-486 (the name-based fallback) would match any stream_parallel call, even from non-Quadrants code. Let me think about this.
The detection has four layers:
The fallbacks exist because AST resolution can fail in some import configurations (e.g., `import quadrants as qd` then `qd.stream_parallel()`, where the resolver might not trace through the alias). They were added to fix dual-import-path builds.
The risk is real but very low in practice: who would have their own `stream_parallel()` context manager inside a `@qd.kernel`? And the consequence of a false positive is a compile-time error (the body validation would reject non-conforming statements), not silent misbehavior.
I'd say this is a valid observation but low priority — the fallbacks are there for a reason and the false-positive risk is negligible. Not worth addressing in this PR.
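
To illustrate the trade-off discussed in this thread, the sketch below compares a name-only match with a stricter match that tracks import aliases statically. It is not the code under review: all function names are hypothetical, and the strict variant only follows `import quadrants [as x]` and `from quadrants import stream_parallel [as y]`, which is far less than a real resolver handles.

```python
import ast


def quadrants_bindings(tree):
    """Collect aliases of the `quadrants` module and local names bound to its stream_parallel."""
    modules, names = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == "quadrants":
                    modules.add(alias.asname or alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module == "quadrants":
            for alias in node.names:
                if alias.name == "stream_parallel":
                    names.add(alias.asname or alias.name)
    return modules, names


def name_based_match(func):
    """Accept anything spelled `stream_parallel`, regardless of where it comes from."""
    if isinstance(func, ast.Attribute):
        return func.attr == "stream_parallel"
    return isinstance(func, ast.Name) and func.id == "stream_parallel"


def strict_match(func, modules, names):
    """Accept only calls that trace back to an import of `quadrants`."""
    if isinstance(func, ast.Attribute):
        return (func.attr == "stream_parallel"
                and isinstance(func.value, ast.Name)
                and func.value.id in modules)
    return isinstance(func, ast.Name) and func.id in names


def classify(source):
    """Return (name_based, strict) verdicts for every `with <call>():` in `source`."""
    tree = ast.parse(source)
    modules, names = quadrants_bindings(tree)
    verdicts = []
    for node in ast.walk(tree):
        if isinstance(node, ast.With):
            expr = node.items[0].context_expr
            if isinstance(expr, ast.Call):
                verdicts.append((name_based_match(expr.func),
                                 strict_match(expr.func, modules, names)))
    return verdicts


USER_DEFINED = "import mylib\ndef k():\n    with mylib.stream_parallel():\n        pass\n"
print(classify(USER_DEFINED))  # the name-based check accepts it, the strict one does not
```

The two checks disagree exactly on the case the review comment raises (a non-Quadrants `stream_parallel`), and also on renamed imports, where the name-only check misses a genuine Quadrants symbol.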