Merged

208 commits
ab15b1b
Add CUDA stream and event API for concurrent kernel execution
hughperkins Mar 11, 2026
7bd18ca
Add AMDGPU/HIP stream support and async memory operations
hughperkins Mar 11, 2026
a40ed4c
Add qd.stream_parallel() context manager for implicit stream parallelism
hughperkins Mar 11, 2026
b856b33
Address review feedback for CUDA streams PR
hughperkins Mar 12, 2026
b133bd7
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins Mar 12, 2026
7555ec5
Move AMDGPU mem_free_async before transfers sync to match CUDA ordering
hughperkins Mar 12, 2026
c12d23e
Convert AMDGPU sync memcpy_host_to_device to async on active_stream
hughperkins Mar 12, 2026
1673a38
Document ROCm >= 5.4 requirement for hipMallocAsync/hipFreeAsync
hughperkins Mar 12, 2026
60d015b
Relax concurrency test threshold and log timings
hughperkins Mar 12, 2026
c4be4ff
Add handle==0 guard to AMDGPU stream_synchronize and make stream_ thr…
hughperkins Mar 12, 2026
aa2fa2a
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Mar 12, 2026
be7ad92
Clear stream_parallel_group_id in ForLoopDecoratorRecorder::reset()
hughperkins Mar 12, 2026
ce83281
Reject nested stream_parallel blocks
hughperkins Mar 12, 2026
880abc7
Document stream_parallel launcher design: per-launch streams, shared …
hughperkins Mar 12, 2026
b28e7c6
Revert "Relax concurrency test threshold and log timings"
hughperkins Mar 12, 2026
065a3b7
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Mar 12, 2026
0ba8dac
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Mar 12, 2026
e9f98c6
Add stream pool to reuse GPU streams across kernel launches
hughperkins Mar 12, 2026
47fa207
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins Mar 12, 2026
65a7967
Add test for stream pool reuse across repeated kernel launches
hughperkins Mar 12, 2026
5393d04
Destroy pooled streams in CUDAContext and AMDGPUContext destructors
hughperkins Mar 12, 2026
a3c682b
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 19, 2026
9be110d
Apply clang-format
hughperkins Apr 20, 2026
3970abc
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins Apr 20, 2026
31fffbf
Apply clang-format
hughperkins Apr 20, 2026
cfc6f39
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins Apr 20, 2026
e9ce144
Apply clang-format
hughperkins Apr 20, 2026
c925446
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins Apr 20, 2026
14c3c22
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 24, 2026
1056bb4
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins Apr 24, 2026
d3cae3c
[Test] Exclude flaky test_perf_dispatch_python from Vulkan
hughperkins Apr 24, 2026
007b050
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins Apr 24, 2026
798f87a
Exclude flaky test_perf_dispatch_python from Metal and Vulkan
hughperkins Apr 24, 2026
22c5524
Merge origin/hp/streams-quadrantsic-1-cuda-streams, resolve conflict …
hughperkins Apr 24, 2026
cd5b486
[Doc] Add user guide for streams API
hughperkins Apr 28, 2026
f42d4eb
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins Apr 28, 2026
2238969
[Doc] Update streams doc with AMDGPU support
hughperkins Apr 28, 2026
91ca883
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
8cd793c
[Doc] Add stream_parallel() section to streams user guide
hughperkins Apr 28, 2026
63f9616
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins Apr 28, 2026
08b85d5
[Doc] Note stream pooling in streams user guide
hughperkins Apr 28, 2026
f036b46
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 28, 2026
228150a
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins Apr 28, 2026
e880d07
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
10f38d5
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins Apr 28, 2026
59c2627
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 28, 2026
f2a2596
Reflow stream.py docstrings to 120c line width
hughperkins Apr 28, 2026
e368b4d
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins Apr 28, 2026
ad720bb
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
a571918
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins Apr 28, 2026
de99f3e
Unwrap prose lines in streams.md to match repo doc style
hughperkins Apr 28, 2026
958c247
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins Apr 28, 2026
6351215
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins Apr 28, 2026
b1f2673
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins Apr 28, 2026
d6876da
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins May 1, 2026
401d6f8
Use CU_STREAM_NON_BLOCKING for user-created streams
hughperkins May 1, 2026
a3c98f8
Use async DtoH memcpy on active_stream for external array readback
hughperkins May 1, 2026
ca14f67
Guard destroy()/__exit__ against destroying externally-owned handles
hughperkins May 1, 2026
aff950d
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins May 1, 2026
b46de06
Fix clang-format indentation for memcpy_device_to_host_async
hughperkins May 1, 2026
84715de
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
8efd51f
Address review comments: fix AMDGPU stream issues
hughperkins May 1, 2026
b9eef6e
Use async DtoH on active_stream for do-while loop counter readback
hughperkins May 1, 2026
f0dd7d6
Use active_stream for sizer device context staging
hughperkins May 1, 2026
8b3d4ed
Add make_current() to stream/event Program methods
hughperkins May 1, 2026
34e9fa6
Use HIP_STREAM_NON_BLOCKING for AMDGPU stream_create to mirror CUDA path
hughperkins May 1, 2026
675542a
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
470912f
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins May 1, 2026
3b0ba29
Restore deleted comments, fix docstring wrapping, fix per-task adstac…
hughperkins May 1, 2026
1c62eae
Fix clang-format line break in AMDGPU kernel launcher
hughperkins May 1, 2026
fe779f6
Merge base branch and add dead-code comment to singleton destructors
hughperkins May 1, 2026
162239e
Use active stream for AMDGPU adstack metadata copies in publish_adsta…
hughperkins May 1, 2026
e55c84f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
216f7d5
Address Claude review: reject stream_parallel in @qd.func, use non-bl…
hughperkins May 1, 2026
9334efd
Add make_current() to all AMDGPU stream/event Program methods
hughperkins May 1, 2026
55318e8
Merge base branch: adopt non-blocking flag in pooled stream creation
hughperkins May 1, 2026
aa4a70f
Use async DtoH on active_stream for resolve_num_threads readback
hughperkins May 1, 2026
49dc5af
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
c7eed44
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
1fba4f5
Use async DtoH on active_stream for AMDGPU resolve_num_threads readback
hughperkins May 1, 2026
d7836e3
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
74604f2
Allow docstrings in stream_parallel kernels, merge base branch updates
hughperkins May 1, 2026
5901a7f
Sync active_stream at end of launch_llvm_kernel unconditionally
hughperkins May 1, 2026
0af8e19
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
f89bde0
Sync active_stream unconditionally at end of AMDGPU launch_llvm_kernel
hughperkins May 1, 2026
b83b65d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
ef3b95b
Use async DtoH on active_stream for sizer stride readback
hughperkins May 1, 2026
0c552cd
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
fc5b710
Add missing #include <vector> to amdgpu_context.h for IWYU consistency
hughperkins May 1, 2026
8550aa0
Fix end-of-launcher sync: conditional + dealloc race
hughperkins May 1, 2026
6374cf3
Reject qd_stream inside autograd Tape context
hughperkins May 1, 2026
64a389d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
7f0f299
Fix end-of-launcher sync: conditional + dealloc race on AMDGPU
hughperkins May 1, 2026
212aeb9
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins May 1, 2026
ca8ace3
Fix linter formatting; guard graph+stream; sync has_print on stream
hughperkins May 1, 2026
85b11d8
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
5e8d198
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
226c7c5
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
1f471b3
Fix AMDGPU stream flag comment: HIP_STREAM_NON_BLOCKING not CU_STREAM…
hughperkins May 1, 2026
4fc4d72
Merge base branch: pick up AMDGPU stream flag comment fix and linter …
hughperkins May 1, 2026
84806cf
Fix NULL-stream DtoH races in synchronize() and allocate_llvm_runtime…
hughperkins May 1, 2026
6919fee
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
ae9c913
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
b1c6eea
Sync active_stream before adstack sizer stride readback
hughperkins May 1, 2026
05dcb4d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
7b4e2a4
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
8229a29
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
88f1bf7
Add stream_parallel_group_id to QD_STMT_DEF_FIELDS for cache key corr…
hughperkins May 1, 2026
8b94b3d
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
ca560b6
Fix clang-format: multi-line QD_STMT_DEF_FIELDS for RangeForStmt and …
hughperkins May 1, 2026
397f298
Fix clang-format: break long QD_STMT_DEF_FIELDS lines in statements.h
hughperkins May 1, 2026
d4ce00c
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
ae1c932
Reflow comments and docstring to 120-char line width
hughperkins May 1, 2026
3ef0340
Use context/device synchronize in synchronize() to drain all streams
hughperkins May 1, 2026
3a81a46
Use synchronous mem_free in dealloc_memory pool branch
hughperkins May 1, 2026
3c6b24e
Add tests for stream/event context managers, event.synchronize, error…
hughperkins May 1, 2026
02ac865
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
3499bbc
Thread active_stream through AMDGPU profiler event_record and sync
hughperkins May 1, 2026
158c8fb
Merge hp/streams-quadrantsic-2-amdgpu-cpu into hp/streams-quadrantsic…
hughperkins May 1, 2026
9bb4467
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
c549e07
Fix graph+stream error guard and test
hughperkins May 1, 2026
5d284ac
Update qd.sync() docstring and streams doc to reflect default-stream-…
hughperkins May 1, 2026
ce2fc6b
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
388a797
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 1, 2026
df0b03a
Fix stream_parallel identity check failing on dual-import-path builds
hughperkins May 1, 2026
ff8056d
Reflow sync() docstring to 120-char line width
hughperkins May 1, 2026
acff351
Remove unused ASTResolver import from ast_transformer.py
hughperkins May 1, 2026
70eb471
Fix import sorting in ast_transformer.py
hughperkins May 1, 2026
117a71f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 1, 2026
caa2515
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
ebd5e11
Add AST-level fallback for stream_parallel detection
hughperkins May 1, 2026
a6c3852
Add diagnostic info to stream_parallel exclusivity error message
hughperkins May 1, 2026
d1e6f09
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
03d2b29
Fix black formatting in function_def_transformer.py
hughperkins May 1, 2026
04e18ba
Merge hp/streams-quadrantsic-2-amdgpu-cpu: resolve streams.md conflict
hughperkins May 1, 2026
fdcf9bd
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
3af5bc8
Apply black formatting to function_def_transformer.py
hughperkins May 1, 2026
bec8503
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 1, 2026
2844060
Fix black formatting in function_def_transformer.py (post-merge)
hughperkins May 1, 2026
5903e49
Run black -l 120 on function_def_transformer.py (post-merge formatting)
hughperkins May 1, 2026
c9c75bd
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins May 2, 2026
360adc8
Reject qd_stream on autodiff kernels
hughperkins May 2, 2026
e20fe99
Revert adstack sizer stream_synchronize
hughperkins May 2, 2026
e3c5f6f
Reset llvm_runtime_executor.cpp to upstream
hughperkins May 2, 2026
8f71c91
Merge base branch: drop autodiff stream changes per new policy
hughperkins May 2, 2026
f6fee4f
Add test for qd_stream + autodiff kernel error guard
hughperkins May 2, 2026
b030e4c
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
6e49c52
Restore context_pointer free comment in AMDGPU kernel launcher
hughperkins May 2, 2026
de4d99d
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins May 2, 2026
9fd8b7b
Extract stream/event methods from program.cpp into program_stream.cpp
hughperkins May 2, 2026
176e7d3
Merge base branch: add AMDGPU support to extracted program_stream.cpp
hughperkins May 2, 2026
9e6f865
Introduce StreamManager delegate class for stream/event ops
hughperkins May 2, 2026
c1562f2
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
1c81322
Fix clang-format in program_stream.h
hughperkins May 2, 2026
84ba5b0
Fix clang-format in program_stream.h
hughperkins May 2, 2026
b1b4ee6
Remove Program wrapper methods, bind StreamManager directly via pybind
hughperkins May 2, 2026
91fae3f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
d3317f5
Fix AMDGPU branches in StreamManager: use arch_ member instead of com…
hughperkins May 2, 2026
7e10267
Reflow comment in program_stream.h to 120-char width
hughperkins May 2, 2026
614c742
Use captured prog_ref for all Stream/Event operations
hughperkins May 2, 2026
55b71fb
Merge hp/streams-quadrantsic-2-amdgpu-cpu: integrate adstack bound_ex…
hughperkins May 2, 2026
33f2a04
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins May 2, 2026
dbb055c
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins May 2, 2026
39657ca
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins May 2, 2026
9053f44
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 2, 2026
6731407
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-4-stream-…
hughperkins May 2, 2026
b7eb63a
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
52a3be1
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' of github.com:Gene…
hughperkins May 2, 2026
3dad35a
Fix stale handle safety in Stream/Event after qd.reset()
hughperkins May 2, 2026
4cef21b
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
bebc904
Extract stream/event pybind bindings into export_stream.cpp
hughperkins May 2, 2026
4711160
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
b4450f7
Fix clang-format in export_stream.cpp
hughperkins May 2, 2026
b6cd986
Fix clang-format line break in CUDA kernel launcher
hughperkins May 2, 2026
3b09331
Fix clang-format in export_stream.cpp
hughperkins May 2, 2026
af4a306
Skip coverage probes in stream_parallel exclusivity check; restore de…
hughperkins May 2, 2026
c50d034
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-2-amdgpu-…
hughperkins May 2, 2026
93cd166
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
b1e7be6
Merge base branch: add coverage probe skip in stream_parallel validation
hughperkins May 2, 2026
824cabf
Merge branch 'hp/streams-quadrantsic-2-amdgpu-cpu' into hp/streams-qu…
hughperkins May 2, 2026
fa5cbff
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins May 2, 2026
e8d9cf0
Allow synchronizing the default AMDGPU stream (handle 0)
hughperkins May 2, 2026
48c3922
Fall back to current runtime for Stream/Event destroy after reset
hughperkins May 2, 2026
8696fad
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-3-stream-…
hughperkins May 2, 2026
736545f
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-4-stream-…
hughperkins May 2, 2026
3f5a868
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
44ee707
Reflow _destroy_prog docstrings to 120-char width
hughperkins May 2, 2026
392b19a
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-1-cuda-st…
hughperkins May 2, 2026
c6278ff
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins May 3, 2026
f67e7fd
Merge branch 'hp/streams-quadrantsic-1-cuda-streams' into hp/streams-…
hughperkins May 3, 2026
24bc67d
Merge hp/streams-quadrantsic-2-amdgpu-cpu: integrate adstack post-red…
hughperkins May 3, 2026
8fee086
Merge branch 'hp/streams-quadrantsic-3-stream-parallel' into hp/strea…
hughperkins May 3, 2026
ac4b825
Guard stream-parallel cleanup with exception safety
hughperkins May 3, 2026
65d5cb9
Restore explanatory comments removed during stream-parallel refactor
hughperkins May 3, 2026
6f84bcd
Merge branch 'main' into hp/streams-quadrantsic-4-stream-pool
hughperkins May 4, 2026
fc44406
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins May 5, 2026
8e329a5
Merge remote-tracking branch 'origin/hp/streams-quadrantsic-4-stream-…
hughperkins May 5, 2026
b5554ca
Fix clang-format line length in kernel launchers
hughperkins May 5, 2026
8a7cdd7
Use default stream for persistent buffer alloc/free
hughperkins May 5, 2026
0961a00
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins May 7, 2026
594bb8a
Update streams doc: rename fill_a/fill_b, remove redundant synchronize
hughperkins May 7, 2026
bfa9ff9
Remove incorrect claim about data corruption without stream management
hughperkins May 7, 2026
c38f53e
Remove PyTorch interop section from streams doc
hughperkins May 7, 2026
c8d7792
Move sync behavior notes out of Limitations into own section
hughperkins May 7, 2026
8ef3a0b
Revert qd.sync() to default-stream-only synchronization
hughperkins May 7, 2026
cf09b26
Clarify that qd_stream is implicit in any @qd.kernel call
hughperkins May 7, 2026
5f36533
Note that graph/autodiff + qd_stream raises RuntimeError
hughperkins May 7, 2026
b298d92
Add tests for build_With error branches
hughperkins May 7, 2026
1 change: 1 addition & 0 deletions docs/source/user_guide/index.md
@@ -58,6 +58,7 @@ tile16

fastcache
graph
streams
perf_dispatch
init_options
```
137 changes: 137 additions & 0 deletions docs/source/user_guide/streams.md
@@ -0,0 +1,137 @@
# Streams

Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. With streams, you can run multiple top-level for loops in parallel.

## Supported platforms

| Backend | Supported |
|---------|-----------|
| CUDA | Yes |
| AMDGPU | Yes |
| CPU | No-op |
| Metal | No-op |
| Vulkan | No-op |

On backends without native stream support, stream operations are no-ops and for loops run serially. Code that uses streams is therefore portable: it runs unmodified on every backend, just without concurrency.

## Stream parallelism

Inside a `@qd.kernel`, each `with qd.stream_parallel():` block runs on its own GPU stream.

```python
import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))
c = qd.field(qd.f32, shape=(N,))

@qd.kernel
def compute_ab():
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def combine():
    for i in range(N):
        c[i] = a[i] + b[i]

compute_ab() # the two stream_parallel blocks run concurrently
combine() # runs after compute_ab() returns — a[] and b[] are ready
```

Consecutive `with qd.stream_parallel():` blocks run concurrently. Multiple for loops within a single block share a stream and run serially on it. All streams are synchronized before the kernel returns.

### Restrictions

- All top-level statements in a kernel must be either all `stream_parallel` blocks or all regular statements. Mixing the two at the top level is a compile-time error.
- Nesting `stream_parallel` blocks is not supported.

## Explicit streams

For cases that require manual control — such as launching separate kernels on different streams or interoperating with PyTorch — you can create and manage streams directly.

### Creating and using streams

Any `@qd.kernel` function accepts a special `qd_stream` keyword argument — you do not need to declare it in the kernel signature. The `@qd.kernel` decorator handles it automatically.

```python
@qd.kernel
def my_kernel():
    for i in range(N):
        a[i] = i

s1 = qd.create_stream()
s2 = qd.create_stream()

my_kernel(qd_stream=s1)
my_kernel(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()
```

Kernels on different streams may execute concurrently. Call `synchronize()` to block until all work on a stream completes.

### Events

Events let you express dependencies between streams without full synchronization.

```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1) # record when s1 finishes produce()
e.wait(qd_stream=s2) # s2 waits for that event before proceeding

consume(qd_stream=s2) # safe to read a[] — produce() is guaranteed complete
s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()
```

`e.record(stream)` captures the point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.

### Context managers

Streams and events support `with` blocks for automatic cleanup:

```python
with qd.create_stream() as s:
    some_func1(qd_stream=s)
# s.destroy() called automatically — waits for in-flight work
```

## Synchronization notes

- **`qd.sync()` only waits on the default stream.** It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- **No automatic synchronization with explicit streams.** When using explicit streams, you are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input. `stream_parallel` handles this automatically.
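
The two rules above combine into the following sketch. It is built only from the API documented on this page and the earlier `my_kernel` example, and is illustrative: it needs a CUDA or AMDGPU device to actually run.

```python
s = qd.create_stream()
my_kernel(qd_stream=s)  # enqueued on s

qd.sync()        # drains the default stream only; work on s may still be in flight
s.synchronize()  # now the work queued on s is complete
s.destroy()
```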

## Limitations

- **Not compatible with graphs.** Passing `qd_stream` to a kernel decorated with `graph=True` raises a `RuntimeError`.
- **Not compatible with autodiff.** Passing `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or calling one inside a `qd.ad.Tape` context, raises a `RuntimeError`.
2 changes: 2 additions & 0 deletions python/quadrants/lang/__init__.py
@@ -16,6 +16,7 @@
from quadrants.lang.runtime_ops import *
from quadrants.lang.snode import *
from quadrants.lang.source_builder import *
from quadrants.lang.stream import *
from quadrants.lang.struct import *
from quadrants.types.enums import DeviceCapability, Format, Layout # noqa: F401

@@ -47,6 +48,7 @@
"shell",
"snode",
"source_builder",
"stream",
"struct",
"util",
]
32 changes: 29 additions & 3 deletions python/quadrants/lang/ast/ast_transformer.py
@@ -119,7 +119,11 @@ def build_AnnAssign(ctx: ASTTransformerFuncContext, node: ast.AnnAssign):

@staticmethod
def build_assign_annotated(
ctx: ASTTransformerFuncContext, target: ast.Name, value, is_static_assign: bool, annotation: Type
ctx: ASTTransformerFuncContext,
target: ast.Name,
value,
is_static_assign: bool,
annotation: Type,
):
"""Build an annotated assignment like this: target: annotation = value.

@@ -165,7 +169,10 @@ def build_Assign(ctx: ASTTransformerFuncContext, node: ast.Assign) -> None:

@staticmethod
def build_assign_unpack(
ctx: ASTTransformerFuncContext, node_target: list | ast.Tuple, values, is_static_assign: bool
ctx: ASTTransformerFuncContext,
node_target: list | ast.Tuple,
values,
is_static_assign: bool,
):
"""Build the unpack assignments like this: (target1, target2) = (value1, value2).
The function should be called only if the node target is a tuple.
@@ -591,7 +598,8 @@ def build_Return(ctx: ASTTransformerFuncContext, node: ast.Return) -> None:
else:
raise QuadrantsSyntaxError("The return type is not supported now!")
ctx.ast_builder.create_kernel_exprgroup_return(
expr.make_expr_group(return_exprs), _qd_core.DebugInfo(ctx.get_pos_info(node))
expr.make_expr_group(return_exprs),
_qd_core.DebugInfo(ctx.get_pos_info(node)),
)
else:
ctx.return_data = node.value.ptr
@@ -1520,6 +1528,24 @@ def build_Continue(ctx: ASTTransformerFuncContext, node: ast.Continue) -> None:
ctx.ast_builder.insert_continue_stmt(_qd_core.DebugInfo(ctx.get_pos_info(node)))
return None

    @staticmethod
    def build_With(ctx: ASTTransformerFuncContext, node: ast.With) -> None:
        if len(node.items) != 1:
            raise QuadrantsSyntaxError("'with' in Quadrants kernels only supports a single context manager")
        item = node.items[0]
        if item.optional_vars is not None:
            raise QuadrantsSyntaxError("'with ... as ...' is not supported in Quadrants kernels")
        if not isinstance(item.context_expr, ast.Call):
            raise QuadrantsSyntaxError("'with' in Quadrants kernels requires a call expression")
        if not FunctionDefTransformer._is_stream_parallel_with(node, ctx.global_vars):
            raise QuadrantsSyntaxError("'with' in Quadrants kernels only supports qd.stream_parallel()")
        if not ctx.is_kernel:
            raise QuadrantsSyntaxError("qd.stream_parallel() can only be used inside @qd.kernel, not @qd.func")
        ctx.ast_builder.begin_stream_parallel()
        build_stmts(ctx, node.body)
        ctx.ast_builder.end_stream_parallel()
        return None

    @staticmethod
    def build_Pass(ctx: ASTTransformerFuncContext, node: ast.Pass) -> None:
        return None
@@ -26,11 +26,13 @@
from quadrants.lang.ast.ast_transformer_utils import (
    ASTTransformerFuncContext,
)
from quadrants.lang.ast.symbol_resolver import ASTResolver
from quadrants.lang.buffer_view import BufferView
from quadrants.lang.exception import (
    QuadrantsSyntaxError,
)
from quadrants.lang.matrix import MatrixType
from quadrants.lang.stream import stream_parallel
from quadrants.lang.struct import StructType
from quadrants.lang.util import to_quadrants_type
from quadrants.types import annotations, buffer_view_type, ndarray_type, primitive_types
@@ -317,7 +319,11 @@ def _transform_func_arg(
# polymorphic).
if field.type is not _TensorClass and hasattr(field.type, "check_matched"):
field.type.check_matched(data_child.get_type(), field.name)
_cache = getattr(getattr(ctx, "global_context", None), "ndarray_to_any_array", None)
_cache = getattr(
getattr(ctx, "global_context", None),
"ndarray_to_any_array",
None,
)
promoted = _cache.get(id(data_child)) if _cache else None
ctx.create_variable(flat_name, promoted if promoted is not None else data_child)
elif dataclasses.is_dataclass(data_child):
@@ -336,7 +342,13 @@
# Ndarray arguments are passed by reference.
if isinstance(argument_type, (ndarray_type.NdarrayType)):
if not isinstance(
data, (_ndarray.ScalarNdarray, matrix.VectorNdarray, matrix.MatrixNdarray, any_array.AnyArray)
data,
(
_ndarray.ScalarNdarray,
matrix.VectorNdarray,
matrix.MatrixNdarray,
any_array.AnyArray,
),
):
raise QuadrantsSyntaxError(f"Argument {argument_name} of type {argument_type} is not recognized.")
argument_type.check_matched(data.get_type(), argument_name)
@@ -443,7 +455,70 @@ def build_FunctionDef(
        else:
            FunctionDefTransformer._transform_as_func(ctx, node, args)

        if ctx.is_kernel:
            FunctionDefTransformer._validate_stream_parallel_exclusivity(node.body, ctx.global_vars)

        with ctx.variable_scope_guard():
            build_stmts(ctx, node.body)

        return None

    @staticmethod
    def _is_stream_parallel_with(stmt: ast.stmt, global_vars: dict[str, Any]) -> bool:
        if not isinstance(stmt, ast.With):
            return False
        if len(stmt.items) != 1:
            return False
        item = stmt.items[0]
        if not isinstance(item.context_expr, ast.Call):
            return False
        func_node = item.context_expr.func
        if ASTResolver.resolve_to(func_node, stream_parallel, global_vars):
            return True
        resolved = ASTResolver.resolve_value(func_node, global_vars)
        if resolved is not None:
            return getattr(resolved, "__name__", None) == "stream_parallel" and getattr(
                resolved, "__module__", ""
            ).startswith("quadrants")
        if isinstance(func_node, ast.Attribute) and func_node.attr == "stream_parallel":
            return True
        if isinstance(func_node, ast.Name) and func_node.id == "stream_parallel":
            return True
Comment on lines +483 to +486
**P1:** Restrict stream_parallel detection to Quadrants symbols

The name-based fallback here accepts any with ...stream_parallel(): call even when it resolves to a non-Quadrants symbol, because it returns True purely on attr == "stream_parallel" / id == "stream_parallel". This can silently reinterpret user-defined context managers as Quadrants stream-parallel blocks instead of raising a syntax error, changing kernel behavior and skipping user context-manager logic; the check should only pass when resolution proves the symbol is quadrants.stream_parallel.

Collaborator Author


Opus says:

Codex flags that lines 483-486 (the name-based fallback) would match any stream_parallel call, even from non-Quadrants code. Let me think about this.

The detection has four layers:

  1. Line 476: resolve_to(func_node, stream_parallel, global_vars) — exact identity check against the real stream_parallel object. Best case.
  2. Lines 478-482: Resolve the value and check name and module — correct if resolution succeeds.
  3. Lines 483-484: Attribute fallback — matches anything.stream_parallel(). This is loose.
  4. Lines 485-486: Bare name fallback — matches stream_parallel(). Also loose.

The fallbacks exist because AST resolution can fail in some import configurations (e.g., import quadrants as qd then qd.stream_parallel() — the resolver might not trace
through the alias). They were added to fix dual-import-path builds.

The risk is real but very low in practice — who would have their own stream_parallel() context manager inside a @qd.kernel? And the consequence of a false positive is a
compile-time error (the body validation would reject non-conforming statements), not silent misbehavior.

I'd say this is a valid observation but low priority — the fallbacks are there for a reason and the false-positive risk is negligible. Not worth addressing in this PR.

        return False

    @staticmethod
    def _is_docstring(stmt: ast.stmt, index: int) -> bool:
        return index == 0 and isinstance(stmt, ast.Expr) and isinstance(stmt.value, (ast.Constant, ast.Str))

    @staticmethod
    def _is_coverage_probe(stmt: ast.stmt) -> bool:
        if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
            return False
        target = stmt.targets[0]
        return (
            isinstance(target, ast.Subscript)
            and isinstance(target.value, ast.Name)
            and target.value.id.startswith("_qd_cov")
        )

    @staticmethod
    def _validate_stream_parallel_exclusivity(body: list[ast.stmt], global_vars: dict[str, Any]) -> None:
        if not any(FunctionDefTransformer._is_stream_parallel_with(s, global_vars) for s in body):
            return
        for i, stmt in enumerate(body):
            if FunctionDefTransformer._is_docstring(stmt, i):
                continue
            if FunctionDefTransformer._is_coverage_probe(stmt):
                continue
            if not FunctionDefTransformer._is_stream_parallel_with(stmt, global_vars):
                stmt_desc = f"{type(stmt).__name__}"
                if isinstance(stmt, ast.With) and stmt.items:
                    ctx_expr = stmt.items[0].context_expr
                    if isinstance(ctx_expr, ast.Call) and isinstance(ctx_expr.func, ast.Attribute):
                        stmt_desc += f"(with {ast.dump(ctx_expr.func)})"
                raise QuadrantsSyntaxError(
                    "When using qd.stream_parallel(), all top-level statements "
                    "in the kernel must be 'with qd.stream_parallel():' blocks. "
                    f"Move non-parallel code to a separate kernel. "
                    f"[stmt {i}: {stmt_desc}, body_len={len(body)}]"
                )
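The exclusivity check above depends only on the shape of the AST, so its core can be sketched with nothing but the stdlib `ast` module. The names below are illustrative simplifications of the real helpers: this version keeps only the loose name-based detection and the docstring exemption, dropping the symbol-resolution and coverage-probe layers.

```python
import ast

def is_stream_parallel_with(stmt) -> bool:
    # Name-based detection only; the real code first tries symbol resolution.
    if not isinstance(stmt, ast.With) or len(stmt.items) != 1:
        return False
    expr = stmt.items[0].context_expr
    if not isinstance(expr, ast.Call):
        return False
    func = expr.func
    if isinstance(func, ast.Attribute):
        return func.attr == "stream_parallel"
    return isinstance(func, ast.Name) and func.id == "stream_parallel"

def validate_exclusivity(body) -> None:
    # If any top-level statement is a stream_parallel block, all must be
    # (a leading docstring is exempt, mirroring _is_docstring).
    if not any(is_stream_parallel_with(s) for s in body):
        return
    for i, stmt in enumerate(body):
        if i == 0 and isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue
        if not is_stream_parallel_with(stmt):
            raise SyntaxError(f"top-level statement {i} is not a stream_parallel block")

good = ast.parse("with qd.stream_parallel():\n    pass\nwith qd.stream_parallel():\n    pass\n").body
bad = ast.parse("with qd.stream_parallel():\n    pass\nx = 1\n").body

validate_exclusivity(good)  # passes silently
try:
    validate_exclusivity(bad)
    mixed_rejected = False
except SyntaxError:
    mixed_rejected = True
```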
32 changes: 32 additions & 0 deletions python/quadrants/lang/ast/symbol_resolver.py
@@ -55,3 +55,35 @@ def resolve_to(node, wanted, scope):
return False
# The name ``scope`` here could be a bit confusing
return scope is wanted

    @staticmethod
    def resolve_value(node, scope):
        """Resolve an AST Name/Attribute node to a Python object.

        Same traversal as resolve_to but returns the resolved object (or None) instead of comparing against a wanted
        value.
        """
        if isinstance(node, ast.Name):
            return scope.get(node.id) if isinstance(scope, dict) else None

        if not isinstance(node, ast.Attribute):
            return None

        v = node.value
        chain = [node.attr]
        while isinstance(v, ast.Attribute):
            chain.append(v.attr)
            v = v.value
        if not isinstance(v, ast.Name):
            return None
        chain.append(v.id)

        for attr in reversed(chain):
            try:
                if isinstance(scope, dict):
                    scope = scope[attr]
                else:
                    scope = getattr(scope, attr)
            except (KeyError, AttributeError):
                return None
        return scope
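The dotted-chain walk in `resolve_value` can be exercised standalone. This simplified version (dict lookup at the root name only, `getattr` thereafter; the names and fake module hierarchy are illustrative) shows how `qd.lang.stream_parallel` resolves through a scope dict:

```python
import ast
import types

def resolve_value(node, scope):
    # Simplified resolver: dict lookup for the root name, getattr for the rest.
    if isinstance(node, ast.Name):
        return scope.get(node.id)
    if not isinstance(node, ast.Attribute):
        return None
    chain = [node.attr]
    v = node.value
    while isinstance(v, ast.Attribute):
        chain.append(v.attr)
        v = v.value
    if not isinstance(v, ast.Name):
        return None
    obj = scope.get(v.id)
    for attr in reversed(chain):
        try:
            obj = getattr(obj, attr)
        except AttributeError:
            return None
    return obj

# Fake module hierarchy standing in for the quadrants package.
sentinel = object()
scope = {"qd": types.SimpleNamespace(lang=types.SimpleNamespace(stream_parallel=sentinel))}
node = ast.parse("qd.lang.stream_parallel", mode="eval").body
resolved = resolve_value(node, scope)
```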