[Perf] Re-land Streams 1-4 with bug fixes #653
Conversation
The per-handle persistent arg_buffer/runtime_context in the CUDA and AMDGPU kernel launchers is reused across launches. When the same kernel is dispatched on two different explicit streams (qd_stream), the second call's memcpy can overwrite the buffer while the first kernel is still reading it, causing data corruption. Fix: when active_stream != nullptr (explicit stream), allocate a per-call ephemeral buffer instead of reusing the persistent one. The stream-ordered mem_free_async ensures the memory stays live until the kernel finishes reading. Default-stream launches keep the existing persistent-buffer fast path.
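A minimal sketch of that launch-path branch, using the stock CUDA stream-ordered allocator (cudaMallocAsync / cudaFreeAsync, CUDA 11.2+) in place of the project's own mem_free_async wrapper; launch_kernel and persistent_buf_ are hypothetical stand-ins, not the PR's actual code:

```cpp
#include <cuda_runtime.h>

// Hypothetical stand-ins for the launcher's internals.
void launch_kernel(void *arg_buf, cudaStream_t stream);
void *persistent_buf_;  // per-handle device buffer, reused across default-stream launches

void launch(const void *args, size_t size, cudaStream_t active_stream) {
  if (active_stream != nullptr) {
    // Explicit stream: per-call ephemeral buffer, so a second dispatch on
    // another stream cannot overwrite args the first kernel is still reading.
    void *buf = nullptr;
    cudaMallocAsync(&buf, size, active_stream);
    cudaMemcpyAsync(buf, args, size, cudaMemcpyHostToDevice, active_stream);
    launch_kernel(buf, active_stream);
    // Stream-ordered free: the memory stays live until the kernel queued
    // ahead of it on this stream has finished reading the arguments.
    cudaFreeAsync(buf, active_stream);
  } else {
    // Default stream keeps the existing persistent-buffer fast path.
    cudaMemcpy(persistent_buf_, args, size, cudaMemcpyHostToDevice);
    launch_kernel(persistent_buf_, /*stream=*/nullptr);
  }
}
```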
The per-launch check_adstack_overflow poll (called without prior sync) could observe the overflow flag before the companion task_id write had been flushed from the device to pinned host memory. The old code consumed both slots via exchange(0), so by the time qd.sync() ran the flag was already clear and the identity was lost — producing a diagnostic with no kernel name. Fix: peek the flag with a relaxed load first; if set, synchronize() to ensure the task_id is visible, then consume both slots. The sync only fires on the rare overflow path, so the fast path stays zero-cost.
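A sketch of the peek-then-consume protocol, assuming the flag and task_id live in adjacent pinned-host slots that the device writes; device_synchronize and report_overflow are hypothetical stand-ins for the runtime's sync and diagnostic calls:

```cpp
#include <atomic>
#include <cstdint>

void device_synchronize();               // hypothetical: what qd.sync() reaches
void report_overflow(uint64_t task_id);  // hypothetical diagnostic sink

struct AdStackSlots {
  std::atomic<uint64_t> overflow_flag;  // device sets this on overflow...
  std::atomic<uint64_t> task_id;        // ...alongside the offending task's id
};

void check_adstack_overflow(AdStackSlots *slots) {
  // Fast path: a relaxed peek, no sync and no write. Zero-cost when clear.
  if (slots->overflow_flag.load(std::memory_order_relaxed) == 0)
    return;
  // Rare overflow path: synchronize first, so the companion task_id write
  // has definitely been flushed from the device before either slot is consumed.
  device_synchronize();
  slots->overflow_flag.exchange(0);
  uint64_t task = slots->task_id.exchange(0);
  report_overflow(task);  // identity preserved: the diagnostic has a kernel name
}
```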
Re-add the "no sync drain" and "where the kernel that wrote it ran" comments, along with the reinterpret_cast / Itanium ABI / MSVC ABI portability note, all of which were inadvertently dropped when the comment block was rewritten.
publish_adstack_metadata's host-eval branch recursively registers snode-reader kernels via register_llvm_kernel, which calls contexts_.resize(). When contexts_ is a std::vector, resize can reallocate, invalidating the launcher_ctx reference held by the parent launch_llvm_kernel frame. std::deque guarantees reference stability on push_back / resize, eliminating the bug without per-launch copy overhead.
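The hazard and the fix in miniature; the names mirror the description above, but the bodies are illustrative:

```cpp
#include <deque>

struct LauncherContext { /* per-kernel launch state */ };

// Was std::vector<LauncherContext>: resize() could reallocate and move
// every element, dangling any outstanding reference.
std::deque<LauncherContext> contexts_;

LauncherContext &register_llvm_kernel() {
  contexts_.resize(contexts_.size() + 1);  // deque growth at the end never
  return contexts_.back();                 // invalidates existing references
}

void launch_llvm_kernel() {
  LauncherContext &launcher_ctx = register_llvm_kernel();
  // publish_adstack_metadata's host-eval branch can recursively call
  // register_llvm_kernel() here; with std::vector that inner resize could
  // reallocate and leave launcher_ctx dangling. With std::deque it stays valid.
  (void)launcher_ctx;
}
```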
💡 Codex Review
Reviewed commit: 5f5f6d0dd9
```cpp
for (size_t j = group_start; j < i; j++) {
  int sid = offloaded_tasks[j].stream_parallel_group_id;
  if (stream_by_id.find(sid) == stream_by_id.end()) {
    stream_by_id[sid] = CUDAContext::get_instance().acquire_stream();
```
Chain stream-parallel worker streams to the launch stream
When a kernel containing qd.stream_parallel() is launched with qd_stream=s after earlier work has been queued on s, these newly acquired non-blocking worker streams start launching immediately and never wait for active_stream. That breaks the normal stream-ordering expectation for produce(qd_stream=s); consume_parallel(qd_stream=s): the stream-parallel loops can read inputs before produce has completed. The worker streams need an event/wait dependency on active_stream before their launches (and the AMDGPU path has the same pattern).
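One plausible shape for that dependency, using stock CUDA event APIs; the function and the way worker streams are collected are assumptions, not the PR's code:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Record a barrier event on the launch stream and make every worker stream
// wait on it before its first stream-parallel launch.
void chain_workers_to_launch_stream(cudaStream_t active_stream,
                                    const std::vector<cudaStream_t> &workers) {
  if (active_stream == nullptr)
    return;  // default-stream launches already order after prior work
  cudaEvent_t barrier;
  cudaEventCreateWithFlags(&barrier, cudaEventDisableTiming);
  cudaEventRecord(barrier, active_stream);  // fires after the queued produce() work
  for (cudaStream_t w : workers)
    cudaStreamWaitEvent(w, barrier, 0);     // worker launches now wait for it
  cudaEventDestroy(barrier);  // safe: destruction is deferred past enqueued waits
}
```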
Resolves conflicts from the adstack file split (llvm_adstack_lazy_claim.cpp
-> adstack_lazy_claim/{heap_grow,bound_eval,metadata_publish}.cpp) and
max-reducer dispatch additions, preserving both the stream-parallel paths
and the new max-reducer infrastructure.
💡 Codex Review
Reviewed commit: cbef4561b7
```python
if self._handle != 0 and self._prog_ref is not None:
    prog = self._destroy_prog()
    if prog is not None:
        prog.stream_destroy(self._handle)
```
Synchronize before destroying managed streams
When a managed stream is used only through the documented context-manager pattern, e.g. with qd.create_stream() as s: fill(qd_stream=s), __exit__ reaches this path and immediately drops the handle without waiting for the queued work. CUDA/HIP stream destruction is not a host-side synchronization point, and qd.sync()/field reads later only wait on the default stream, so the caller can observe stale data or race with the still-running explicit-stream kernel despite the docs saying the context manager waits for in-flight work. Please synchronize the stream before destroying managed handles, at least on explicit destroy()/__exit__.
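A sketch of what that could look like at the backend boundary, using stock CUDA runtime calls; stream_destroy_managed is a hypothetical name for whatever prog.stream_destroy ultimately reaches:

```cpp
#include <cuda_runtime.h>

// Make managed-stream teardown a synchronization point, so that
// `with qd.create_stream() as s:` really does wait for in-flight work
// before the handle is dropped.
void stream_destroy_managed(cudaStream_t s) {
  cudaStreamSynchronize(s);  // block until queued kernels and copies finish
  cudaStreamDestroy(s);      // only then release the handle
}
```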


Summary
contexts_ changed from std::vector to std::deque so that recursive register_llvm_kernel calls from publish_adstack_metadata's host-eval path cannot invalidate references held by a parent launch_llvm_kernel frame.

Test plan
test_adstack_triangular_ndrange_self_referential_push_idempotency passes 10/10 with QD_OFFLINE_CACHE=0.

Made with Cursor